Discussion: nameprep2 and the slash homograph issue
Erik van der Poel
2005-02-22 19:14:12 UTC
All,

In a way, it was pretty sensible for the IETF to decide to avoid
subsetting Unicode too much so that national registries could make those
decisions on their own. After all, those countries know more about their
own characters than the IETF does, and it seems more fair to let them
make such a decision.

One could see this as an instance of "Push the problem downstream, so
that we cannot be blamed for being overly restrictive up here".

Now I'm wondering if we could make a similar argument in the slash
homograph case. If nameprep2 bans the slash homograph, then there is no
way for any community to use it in a domain name, even if that domain
name appears in a context where slash means nothing. Consider the email
case. There are no slashes in the vicinity of a domain name in an email
app. The URI case is, of course, different. Here you often see slashes,
so a slash homograph could easily spoof someone.

So, could nameprep2's position be, "Push the slash homograph problem
downstream, to the app, so that we cannot be blamed for being overly
restrictive up here"?

Or is the slash fundamentally different from national characters? And if
so, who are we to make that statement? Shouldn't the countries be
deciding that? (Not that TLD registries can restrict names at the 3rd
level and up; only the apps can address those.)

Another argument against banning the slash homograph is that any new
banning would require a new ACE prefix, which is a lot of work, and, as
John said, there should be a high threshold for any demonstration that
tries to show that a new prefix is necessary.

Instead of banning the slash homograph, nameprep2 could simply warn
implementors of the spoofing problem, giving some vague advice (without
overly restricting the apps).

Erik
Erik van der Poel
2005-02-22 20:07:04 UTC
> Or is the slash fundamentally different from national characters?

I guess you could make the case that U+2215 (DIVISION SLASH) is
different from national characters, because it is a mathematical
operator. Similarly, U+2044 (FRACTION SLASH) is a general punctuation
character. Is either of these already banned by nameprep?

There is another character, U+30CE (KATAKANA LETTER NO), that looks
like a slash. Perhaps we should warn about characters like this in
nameprep2?
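
For illustration, this can be checked mechanically. Here is a small
sketch using Python's standard stringprep module (the RFC 3454 tables
that nameprep's prohibition list draws on); as it turns out, none of
the three characters above is prohibited:

    import stringprep

    def prohibited_by_nameprep(ch):
        # Nameprep (RFC 3491) prohibits output in tables C.1.2, C.2.2
        # and C.3 through C.9 of RFC 3454.
        return (stringprep.in_table_c12(ch) or stringprep.in_table_c22(ch)
                or stringprep.in_table_c3(ch) or stringprep.in_table_c4(ch)
                or stringprep.in_table_c5(ch) or stringprep.in_table_c6(ch)
                or stringprep.in_table_c7(ch) or stringprep.in_table_c8(ch)
                or stringprep.in_table_c9(ch))

    for ch in ("\u2215", "\u2044", "\u30ce"):
        print(hex(ord(ch)), prohibited_by_nameprep(ch))  # False for all three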

I still wonder whether U+2215 and U+2044 are sufficiently problematic to
require a new ACE prefix.

Erik
Adam M. Costello
2005-02-23 07:28:37 UTC
Erik van der Poel <***@vanderpoel.org> wrote:

> Another argument against banning the slash homograph is that any new
> banning would require a new ACE prefix, which is a lot of work, and,
> as John said, there should be a high threshold for any demonstration
> that tries to show that a new prefix is necessary.

An alternative, rather than banning the character, is to recommend
that it not be shown; the ACE form could be shown instead. This would
effectively make the character useless in domain names (for both
phishers and honest folks) without requiring a new ACE prefix.

We could push ToUnicode down inside a wrapper function, ToDisplay.
Applications would never call ToUnicode directly anymore. Whenever
they wanted to display a domain name, they'd call ToDisplay, which
would call ToUnicode, check the result, and if it didn't like it, call
ToASCII. (Of course, since ToUnicode typically calls ToASCII, there are
opportunities to optimize that logic.)
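
A minimal sketch of what such a ToDisplay wrapper might look like, in
Python, assuming a hypothetical is_safe() policy supplied by the
application and using the standard library's "idna" codec for the
ToASCII/ToUnicode steps (the codec does ToASCII on encode and ToUnicode
on decode, so the optimization mentioned above falls out naturally):

    SUSPICIOUS = {"\u2215", "\u2044"}  # e.g. the slash homographs

    def is_safe(name):
        # Hypothetical policy; a real one would be considerably richer.
        return not any(ch in SUSPICIOUS for ch in name)

    def to_display(domain):
        ace = domain.encode("idna")         # ToASCII
        unicode_form = ace.decode("idna")   # ToUnicode
        if is_safe(unicode_form):
            return unicode_form
        return ace.decode("ascii")          # show the ACE form instead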

AMC
John C Klensin
2005-02-23 08:39:58 UTC
--On Wednesday, 23 February, 2005 07:28 +0000 "Adam M. Costello"
<idn.amc+***@nicemice.net.RemoveThisWord> wrote:

> Erik van der Poel <***@vanderpoel.org> wrote:
>
>> Another argument against banning the slash homograph is that
>> any new banning would require a new ACE prefix, which is a
>> lot of work, and, as John said, there should be a high
>> threshold for any demonstration that tries to show that a new
>> prefix is necessary.
>
> An alternative, rather than banning the character, is to
> recommend that it not be shown; the ACE form could be shown
> instead. This would effectively make the character useless in
> domain names (for both phishers and honest folks) without
> requiring a new ACE prefix.
>
> We could push ToUnicode down inside a wrapper function,
> ToDisplay. Applications would never call ToUnicode directly
> anymore. Whenever they wanted to display a domain name,
> they'd call ToDisplay, which would call ToUnicode, check the
> result, and if it didn't like it, call ToASCII. (Of course,
> since ToUnicode typically calls ToASCII, there are
> opportunities to optimize that logic.)

Adam, there are two problems with this. First, it effectively
dictates UI behavior, which is generally a bad idea. It is a
particularly bad idea in this case because you are proposing the
sort of UI behavior that generates a lot of very confused
questions and trouble reports, which is something no sane
implementer wants. And "won't display" is not the right answer
on the registration side of the process, even if it were right
on the lookup side. The second problem is that it is a kludge,
and that inserting kludges into critical protocols or procedures
--and IDNs are certainly critical-- almost always turns out to
be a seriously bad idea sooner or later.

If we find a need to start banning characters that we could not
agree on banning the first time around, there is another
approach, also unpleasant but IMO less problematic, that could
be considered. Just as RFC 2822 moved past a lot of legacy
nonsense by having two separate "create" and "accept" syntaxes,
we could define an additional profile, say "NameRegisterPrep".
It would look a lot like Nameprep but would ban the characters
you are now suggesting banning, plus, based on what I think is
growing experience in the registries, ban any character that
mapped to anything else. The effect would be to permit only
those code points as input to the registration process that
could be output into punycode and the DNS. Several registries
have adopted the latter part of that model already: basically
what you register is ToASCII(string) and/or
ToUnicode(ToASCII(string)), but never "string". The lookup
process would remain the same, with no changes to Nameprep being
made at all. And, by eliminating all of the mapping tables and
replacing them with prohibitions, it would make the question
"can this character appear in an IDN" a great deal less
complicated, which would certainly be an advantage.
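
For concreteness, a rough sketch of that registration-side test,
assuming Python's built-in "idna" codec for ToASCII/ToUnicode: a label
is accepted only if it survives the ToUnicode(ToASCII(label)) round
trip unchanged, i.e. if nothing in it was mapped to anything else:

    def acceptable_for_registration(label):
        try:
            ace = label.encode("idna")       # ToASCII; fails on prohibited input
        except UnicodeError:
            return False
        return ace.decode("idna") == label   # reject anything that got mapped

    acceptable_for_registration("strasse")   # True
    acceptable_for_registration("STRASSE")   # False: case-folded on the way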

This type of registration restriction is rather different from
our asking/expecting ICANN and the registries to adopt rules
about, e.g., mixed-script registrations that would help people
stay out of trouble. For better or worse, ICANN has a great
deal less trouble asking (or demanding) that people conform to
the protocols than it does with making up a somewhat fuzzy
guideline and enforcing it. In the most extreme of cases,
violating a protocol in a significant way is one of those "not
acting in the best interests of the local users and the Internet
community" that RFC 1591 warns against and indicates could be
grounds for redelegating a registry.

It would leave the registries and ICANN stuck with the problem
of what to do about anything that was now registered which
violated the new rules, but that problem would exist for _any_
substantive change we made.

john
Adam M. Costello
2005-02-23 10:52:44 UTC
John C Klensin <***@jck.com> wrote:

> If we find a need to start banning characters that we could not agree
> on banning the first time around, there is another approach, also
> unpleasant but IMO less problematic, that could be considered. Just
> as RFC 2822 moved past a lot of legacy nonsense by having two separate
> "create" and "accept" syntaxes, we could define an additional profile,
> say "NameRegisterPrep". It would look a lot like Nameprep but would
> ban the characters you are now suggesting banning, plus, based on what
> I think is growing experience in the registries, ban any character
> that mapped to anything else.
>
> The lookup process would remain the same, with no changes to Nameprep
> being made at all.

But browser implementers want to protect their users today against
malicious names that may exist in the DNS today. I don't see how
this proposal would help them do that. Browser implementers are
contemplating banning characters in IDNs in the browser (that is,
refusing to look up names containing blacklisted characters), and I was
trying to think of a less drastic, less blatantly nonconformant, but
equally protective measure that could be taken in the browser.

AMC
Erik van der Poel
2005-02-23 15:28:20 UTC
Adam,

The IETF generally only specifies the "wire" protocol, not UI behavior.
The IETF does not specify how apps interface with users; it only
specifies how apps interface with other apps, over the wire. Note that
this does not even include APIs in many cases.

However, it *would* be very wise to *warn* implementors about any
dangerous homographs in the new RFC (if we decide not to ban them outright).

Erik

Erik van der Poel
2005-02-23 17:06:26 UTC
> However, it *would* be very wise to *warn* implementors about any
> dangerous homographs in the new RFC (if we decide not to ban them
> outright).

Yesterday, I started a new Web site to collect nameprep issues, and now
I'm thinking that it might also be a good place to collect info about
dangerous homographs. Implementors could look here for guidance.

I stress that this site is unofficial. I am merely offering a place for
us to collect info and present it to others. The mailing list archive is
a daunting and unorganized place to look for such info.

Please send any additions, comments, etc to me or to the mailing list. I
promise to be open about it, but if that turns out to be a problem,
maybe I or someone can start a Wiki instead.

http://nameprep.org/

Erik
JFC (Jefsey) Morfin
2005-02-23 17:06:33 UTC
At 16:28 23/02/2005, Erik van der Poel wrote:
>Adam,
>The IETF generally only specifies the "wire" protocol, not UI behavior.
>The IETF does not specify how apps interface with users; it only specifies
>how apps interface with other apps, over the wire. Note that this does not
>even include APIs in many cases.
>
>However, it *would* be very wise to *warn* implementors about any
>dangerous homographs in the new RFC (if we decide not to ban them outright).

Yes. But warning does not prevent us from also proposing solutions to
the system's conceptual problems, such as the tables. The same goes for
TLD Managers.
jfc
Erik van der Poel
2005-02-23 17:52:36 UTC
JFC (Jefsey) Morfin wrote:
> At 16:28 23/02/2005, Erik van der Poel wrote:
>> However, it *would* be very wise to *warn* implementors about any
>> dangerous homographs in the new RFC (if we decide not to ban them
>> outright).
>
> Yes. But warning does not prevent us from also proposing solutions to
> the system's conceptual problems, such as the tables. The same goes
> for TLD Managers.

I agree. Specs often have normative and informative sections,
appendices, etc. We could easily insert some *advice* in an informative
part.

Erik
Adam M. Costello
2005-02-24 08:17:21 UTC
Erik van der Poel <***@vanderpoel.org> wrote:

> The IETF generally only specifies the "wire" protocol, not UI
> behavior. The IETF does not specify how apps interface with users;

Generally, that's true, but IDNA is an exception. It states four
requirements (RFC 3490, section 3.1), and one of those four has rather
little to do with wire protocols and quite a lot to do with UI
behavior:

3) ACE labels obtained from domain name slots SHOULD be hidden from
users when it is known that the environment can handle the non-ACE
form, except when the ACE form is explicitly requested. When it
is not known whether or not the environment can handle the non-ACE
form, the application MAY use the non-ACE form (which might fail,
such as by not being displayed properly), or it MAY use the ACE
form (which will look unintelligible to the user).

I think this discussion is headed toward an update to IDNA that would
add a second exception to that requirement, for protecting the user
against phishing. What we need to figure out is how to describe that
exception, and how specific or deliberately vague that description
should be.

AMC
Erik van der Poel
2005-02-24 15:08:47 UTC
Adam M. Costello wrote:
> Erik van der Poel <***@vanderpoel.org> wrote:
>>The IETF generally only specifies the "wire" protocol, not UI
>>behavior. The IETF does not specify how apps interface with users;
>
> Generally, that's true, but IDNA is an exception. It states four
> requirements (RFC 3490, section 3.1), and one of those four has rather
> little to do with wire protocols and quite a lot to do with UI
> behavior:
>
> 3) ACE labels obtained from domain name slots SHOULD be hidden from
> users when it is known that the environment can handle the non-ACE
> form, except when the ACE form is explicitly requested. When it
> is not known whether or not the environment can handle the non-ACE
> form, the application MAY use the non-ACE form (which might fail,
> such as by not being displayed properly), or it MAY use the ACE
> form (which will look unintelligible to the user).

I don't think IDNA is an exception. Note that the part you quote above
uses words like SHOULD and MAY. I would say that those words were chosen
for *exactly* the reasons I mentioned (i.e. IETF *specifies* wire
protocols, not UI behavior). See section 6 of:

http://ietf.org/rfc/rfc2119.txt

This RFC focusses on "interoperability" (but also mentions "harm") so I
would say that the wire protocol is the main concern.

> I think this discussion is headed toward an update to IDNA that would
> add a second exception to that requirement, for protecting the user
> against phishing. What we need to figure out is how to describe that
> exception, and how specific or deliberately vague that description
> should be.

Here I agree with you. I'm not going to try to come up with the wording
for that, but this morning I started to think that the right-to-left DNS
and IDN spoofing problems *could* be addressed at the UI level by
providing a *tool* that security-conscious users could *choose* to use.

I'm thinking of a tool that might be implemented as an extension for
Mozilla, for example. It would offer to display domain names in the safe
order, i.e. left-to-right for users whose main language is
left-to-right. I have not heard of any UIs that offer top-to-bottom in
their menus, dialogs, etc, so I would guess that this would be omitted
in the extension too, though right-to-left might be offered for
right-to-left users (many of whom are in the Middle East -- Hebrew and
Arabic).

In addition, such a tool would offer to display domain names in a clear
font, unlike the sans-serif that is commonly used today. This would make
the distinction between lowercase l and digit 1 clearer. And it would
separate the domain name from its context, e.g. using color.

Finally, this tool would offer to display characters outside the user's
language(s) in a special way, to make them stand out and catch the
user's attention. I believe we need to focus on the user here, because
we are talking about how things *look* to the user.
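
As a toy sketch of that last idea, assuming the user's repertoire has
already been derived from the browser localization or Accept-Language
setting (here it is just printable ASCII, purely for illustration):

    import unicodedata

    USER_REPERTOIRE = set(map(chr, range(0x20, 0x7f)))  # illustrative only

    def annotate(domain):
        # Make characters outside the user's repertoire stand out.
        out = []
        for ch in domain:
            if ch in USER_REPERTOIRE:
                out.append(ch)
            else:
                out.append("[U+%04X %s]" % (ord(ch), unicodedata.name(ch, "?")))
        return "".join(out)

    annotate("example\u2215attack.com")
    # -> 'example[U+2215 DIVISION SLASH]attack.com'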

For example, people in the Far East are used to spotting small
differences between complicated characters because they were taught as
children to read and write the thousands of complex "Han" characters
that differ in small ways.

Americans, on the other hand, are only used to seeing a small number of
characters, and would not even be able to *read* Han characters, let
alone spot differences between them. This is why I believe that a tool
that focusses on the user might be a good idea.

You may claim that nobody would ever want to read domain names
left-to-right, to which I would counter that some people are willing to
try Dvorak keyboards, which are totally different from QWERTY. I.e. it's
the user's choice. Internet security education may eventually lead
*some* users to make this choice.

Erik
JFC (Jefsey) Morfin
2005-02-24 22:13:06 UTC
Very, very good idea.

This could be a _standard_ way to print the domain name in the browser
bar. A very simple way would be just to print every URL in the bar with
a different color for each level. This changes nothing in any procedure
and can be very easily taught and understood. If one standardizes the
colors, it may help customer support, advertising, discussion, etc.,
permitting people to speak of the "blue", "red", or "green" part of a
discussed URI. This would be far easier for lay people than talking
about the first, second, etc. level - whatever the language.
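
A toy illustration of the per-level coloring, in Python with ANSI
escape codes; the particular colors are arbitrary, not a proposed
standard:

    COLORS = ["\033[34m", "\033[31m", "\033[32m", "\033[35m"]
    RESET = "\033[0m"

    def colorize_levels(domain):
        # Color from the TLD inward, so a given level always gets the
        # same color: "com" blue, "example" red, "www" green, etc.
        labels = domain.split(".")
        colored = [COLORS[i % len(COLORS)] + label + RESET
                   for i, label in enumerate(reversed(labels))]
        return ".".join(reversed(colored))

    print(colorize_levels("www.example.com"))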

I suggest you write a private Draft of this asap. A single page would
probably be enough. Mozilla could make it an immediate update if it
were an RFC on the standards track. This would stop the developing
campaign of concern with something that would be seen as positive.

Or, would this be for the W3C?
jfc

Erik van der Poel
2005-02-25 01:18:16 UTC
Hi Jefsey,

Thanks for the support. Your idea about giving each level of the domain
name a standard color is very interesting. I am reworking nameprep.org
right now, and I'll probably include a section near the end about this.
Note that just because this section is in the doc at nameprep.org does
not mean that it would necessarily find its way into any Internet Draft,
let alone a new RFC. If this section tries to mandate colors, it
probably would not be in a normative (or MUST) part of an IETF RFC. It
would end up in an informative appendix (or a MAY).

No, I don't think W3C is the right organization for this. IDNA has
already been produced by the IETF, and domain name issues don't really
belong at W3C, even though URIs are covered by some of their docs.

Mozilla has very different plans in this space. I have been unable to
convince them of the need for tools/extensions. So far. :-)

Erik

Gervase Markham
2005-03-02 11:50:19 UTC
Erik van der Poel wrote:
> Here I agree with you. I'm not going to try to come up with the wording
> for that, but this morning I started to think that the right-to-left DNS
> and IDN spoofing problems *could* be addressed at the UI level by
> providing a *tool* that security-conscious users could *choose* to use.

While security-conscious users are always less at risk than ordinary
users, thinking in terms of a tool is IMO wrong.

> I'm thinking of a tool that might be implemented as an extension for
> Mozilla, for example. It would offer to display domain names in the safe
> order, i.e. left-to-right for users whose main language is
> left-to-right. I have not heard of any UIs that offer top-to-bottom in
> their menus, dialogs, etc, so I would guess that this would be omitted
> in the extension too, though right-to-left might be offered for
right-to-left users (many of whom are in the Middle East -- Hebrew and
> Arabic).

The problem this is supposed to mitigate is mitigated in Firefox by the
domain-only indicator in the status bar.

> In addition, such a tool would offer to display domain names in a clear
> font, unlike the sans-serif that is commonly used today. This would make
> the distinction between lowercase l and digit 1 clearer. And it would
> separate the domain name from its context, e.g. using color.

Assuming we could determine such a font, why would we not always use it?
Why wait for a tool to be deployed?

Gerv
Erik van der Poel
2005-03-02 01:05:29 UTC
Gervase Markham wrote:
> While security-conscious users are always less at risk than ordinary
> users, thinking in terms of a tool is IMO wrong.

Perhaps I was wrong to use the word "tool". There is a fundamental
tension between security and user-friendliness. Some applications and
vendors have a history of making their user interfaces *too* friendly,
thereby neglecting to warn users of potential security risks. Other
vendors have tried hard to strike a balance between security and
seamlessness. I believe Netscape and Mozilla have been in this camp
since Day One.

I hope that mozilla.org will deploy a better solution than the TLD and
domain black/whitelists that have been discussed.

>> It would offer to display domain names in the
>> safe order, i.e. left-to-right for users whose main language is
>> left-to-right.
>
> The problem this is supposed to mitigate is mitigated in Firefox by the
> domain-only indicator in the status bar.

I just double-checked Firefox 1.0.1, and it just says "Done" at the
lower left. Then I tried a secure (https) site, and, lo and behold, I
saw the "domain-only" indicator at the lower right, next to the padlock
icon. This is very good news (to me). And thank you for educating this
particular user (me) about this security issue. As I have often said,
education is key.

A couple of questions/comments: It might be nice to have this
domain-only display even for non-secure sites (http). Also, do you know
what happens if the domain name is very long? Finally, do you have any
thoughts about the slash homograph problem? Thanks.

>> In addition, such a tool would offer to display domain names in a
>> clear font, unlike the sans-serif that is commonly used today. This
>> would make the distinction between lowercase l and digit 1 clearer.
>
> Assuming we could determine such a font, why would we not always use it?
> Why wait for a tool to be deployed?

Indeed, why wait? I filed a bug a while ago:

https://bugzilla.mozilla.org/show_bug.cgi?id=282079

My feeling is that a sans-serif font (such as Arial) places the
characters too close to each other and does not have the serifs that
often serve to distinguish the characters better. How about a fixed
width font with serifs, such as Courier New?

Erik
Gervase Markham
2005-03-02 08:56:05 UTC
Erik van der Poel wrote:
> Perhaps I was wrong to use the word "tool". There is a fundamental
> tension between security and user-friendliness.

Well, maybe. I'm not convinced the tension is absolute, but I agree you
need to work very hard indeed to get both.

> A couple of questions/comments: It might be nice to have this
> domain-only display even for non-secure sites (http).

We are probably going to change this for 1.1. It takes some careful
thought so as not to confuse people.

> Also, do you know
> what happens if the domain name is very long?

It just gets very long, currently.

> Finally, do you have any
> thoughts about the slash homograph problem? Thanks.

Well, the current domain indicator will show the domain, slash
homographs and all. We're still developing our response, but it's likely
that we'll have to blacklist this character. Opera's new beta already
has a small set of characters it doesn't allow.

Ideally, we wouldn't be acting unilaterally on this one, and would be
doing the restrictions based on consensus. But before we can go there,
we need to figure out what we think is needed first. That process is
still going on.

> Indeed, why wait? I filed a bug a while ago:
>
> https://bugzilla.mozilla.org/show_bug.cgi?id=282079

Thanks :-)

> My feeling is that a sans-serif font (such as Arial) places the
> characters too close to each other and does not have the serifs that
> often serve to distinguish the characters better. How about a fixed
> width font with serifs, such as Courier New?

The issue, of course, is that the font designation we use has to produce
a good font on all platforms. This isn't fundamentally impossible, it
just requires work and testing.

Gerv
Erik van der Poel
2005-03-02 17:18:49 UTC
Gervase Markham wrote:
> Erik van der Poel wrote:
> >
> > Also, do you know what happens if the domain name is very long?
>
> It just gets very long, currently.

I hope you will think about what to do when the name is really long,
keeping in mind that the most important part of the name is the end (the
right side).

> Well, the current domain indicator will show the domain, slash
> homographs and all. We're still developing our response, but it's likely
> that we'll have to blacklist this character. Opera's new beta already
> has a small set of characters it doesn't allow.
>
> Ideally, we wouldn't be acting unilaterally on this one, and would be
> doing the restrictions based on consensus. But before we can go there,
> we need to figure out what we think is needed first. That process is
> still going on.

It's really great to hear this! I fully agree with you that consensus
would be ideal. I'm taking the liberty of Cc'ing one of the Opera
developers, in the hope that they might be willing to discuss their
particular choices for the character blacklist. If various people can
start to agree on a blacklist, this info can be fed into any RFC
revision process (which I hope will begin).

> The issue, of course, is that the font designation we use has to produce
> a good font on all platforms.

Yes, a good font on *all* platforms, and also for *all* parts of the
world. I am quite aware of those issues. See the bottom of:

http://www.mozilla.org/projects/intl/fonts.html

One of the great things about open source is that lots of people from
all over the world come to help you, with localizations, default font
choices, etc.

Erik
JFC (Jefsey) Morfin
2005-02-23 17:02:26 UTC
John,
your proposition makes more sense than Adam's, as it avoids a kludge;
when you start censoring you never know where you will have to stop,
and I do not know how you could enforce it.

But your proposition is not a panacea.

1) It is meant only for registration, so it concerns TLD Registry
Managers or good-faith registrants (for 3LDs and deeper). They may be
ignorant, lazy, or merely accepting, but they are not the adversary.
The real problem is not with them but with the phishers.
2) Your proposition is an online correction of the problem. Some
registries may like it, some not. I would have strong objections
because it complicates user support. But we can try it. This is why I
asked the question I asked, and asked whether a Perl program existed to
test such a correction at registration.

Actually, I repeat that all the propositions to change what the user
can see hurt the user. A click needs to send the request the user
wants, not the one the phisher wants. IMHO one does not increase
security by hiding the existence of the danger; one increases the
risks.
jfc


Erik van der Poel
2005-02-23 18:20:51 UTC
JFC (Jefsey) Morfin wrote:
> Actually, I repeat that all the propositions to change what the user
> can see hurt the user. A click needs to send the request the user
> wants, not the one the phisher wants. IMHO one does not increase
> security by hiding the existence of the danger; one increases the
> risks.

Jefsey, it must be difficult to participate in this kind of group when
English is not your main language, but I, for one, do appreciate your
wise contributions, so I take them seriously.

However, I must disagree with this particular suggestion (if I
understand you correctly). If a phisher spams users, it is not the email
app's responsibility to direct the user to whatever site the app might
guess is the "correct" one. No, I think it's better for the app to warn
the user in some way that this is a phishy email, and might be evil.

This is similar to the advice that you should not give your Social
Security Number (SSN) or credit card number to someone over the phone,
unless *you* are the one dialing the phone number (using a well-known,
published phone number).

Erik
Erik van der Poel
2005-02-24 07:36:49 UTC
Adam M. Costello wrote:
> I imagine you'd want all the characters that could immediately follow
> the host name in a URI, so add "?" and "#" to that list.
>
> But how well do average users know URI syntax anyway? What would they
> think of:
>
> http://foo.com&bar.baz.xx
> http://foo.com~bar.baz.xx
> http://foo.com|bar.baz.xx
>
> Maybe we either need to ban all punctuation (as in my proposal about
> internationalized host names), or always make the boundaries of the
> domain name apparent to the user (using color or highlighting or
> underlining or something).

I started to write down all the delimiters that could appear in DNS,
URIs and email, and then I realized that this problem is not just about
the homographs of the *legal* delimiters used in these contexts. No, it
is about whatever *looks like* a legal delimiter to the average user,
because the phishers don't have to stick to the (homographs of the)
legal delimiters. Then I went back in the archives and, of course,
found that Adam had already pointed this out.
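
A sketch of why this direction is hopeless: the characters that can
legally follow the host in a URI are few, but the set of characters
that might *look* like one of them to some user is open-ended. The
lookalike table below is hand-picked and necessarily incomplete; that
is the point:

    POST_HOST_DELIMS = {":", "/", "?", "#"}  # RFC 3986: may follow a host

    LOOKALIKES = {
        "/": ["\u2215", "\u2044", "\u30ce"],  # division/fraction slash, katakana no
        "?": ["\u0294"],                      # glottal stop, for illustration
        ".": ["\u3002"],                      # ideographic full stop
    }
    # No matter how many rows are added, some user somewhere will read
    # some further character as a delimiter.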

The implications of this are actually quite profound. Since there are so
many characters in Unicode, and since many of those are unfamiliar to
the average user, a lot of those might look like punctuation.

As Adam also points out in another email, it's too bad that domain names
are usually displayed in "little-endian" order. If they were displayed
in the opposite (big-endian) order, the 3rd example above would become:

http://xx.baz.com|bar.foo

Notice how the "com" and "foo" are now separated. The "real" (unspoofed)
URI would look like this:

http://com.foo

If users were actually used to seeing it this way, they might notice the
spoof above more easily. But they aren't used to seeing it this way, and
it would be pretty difficult to change this convention now. It's too late.

Back to punctuation: Banning all punctuation would not be enough. We
would have to ban anything that might look like punctuation to the user.
That would mean banning a huge swath of Unicode, which is probably not
in the best interests of various communities around the world. Besides,
different people will have different ideas about what looks like
punctuation. So it might be hard to decide which huge swath of Unicode
to ban.

So maybe it's better to consider Adam's alternative idea: make the
boundaries of the domain name apparent (using color or whatever). Over
time, the users will get used to seeing domain names this way, and then
they will be able to spot domain name spoofs more easily too.

But even if we were to color the whole domain name:

foo.com|bar.baz.xx

The user might still think that this site is somehow related to foo.com
and therefore safe (as was also pointed out). So you'd have to display
the "unusual" characters like '|' differently. Or something. Sigh. Seems
hopeless.

Are the phishers going to have a field day with IDN, or what?

But is this problem really limited to IDN? What about the following
legal ASCII DNS name:

foo.com--secure-user-services-and-products.tech-mecca.biz

Does this mean that we should try to switch left-to-right readers (most
of the world) over to big-endian domain names? Please tell me I'm
overreacting!

Erik
Jaap Akkerhuis
2005-02-24 10:02:21 UTC
> But is this problem really limited to IDN? What about the following
> legal ASCII DNS name:
>
> foo.com--secure-user-services-and-products.tech-mecca.biz

I don't know what the problem is with this one. It is a perfectly
normal domain name. (*)

> Does this mean that we should try to switch left-to-right readers (most
> of the world) over to big-endian domain names? Please tell me I'm
> overreacting!

If I remember correctly, that is what the plan9 people did with
``text'' files (although there was not really a concept of file
types). The information in the file was considered to be in ``time
order'', and the time went from left to right. Time order as in:
if you read it aloud, the first phrase you utter is the first
in time. Whether it should be displayed right to left or in time
order is considered a display (application) problem. So ``cat''
would not bother with the writing order and just dump the bytes, but
your display might parse the unicode (runes) and certainly ``troff''
should (**).

jaap

(*) Although YMMV. In .nl, two consecutive hyphens have never been
allowed for registration by the registry.

(**) There was a version of troff around (done in Israel) which
actually did all four directions (left-right, right-left, printing
downwards and, I believe, for good measure, also upwards).
tedd
2005-02-24 15:47:56 UTC
Erik et al:

>But even if we were to color the whole domain name:
>
>foo.com|bar.baz.xx
>
>The user might still think that this site is somehow related to
>foo.com and therefore safe (as was also pointed out). So you'd have
>to display the "unusual" characters like '|' differently. Or
>something. Sigh. Seems hopeless.

Yes, it may seem hopeless. I believe that the "fruit-loop" solution
would fall short of expectations. However, browser makers may find
opportunity in providing a more in-your-face homographic solution by
analyzing URLs and alerting users to potential problems (i.e.,
beating them about the head). But this possibility/solution is beyond
the scope of this group.

>Are the phishers going to have a field day with IDN, or what?

Yes, they probably are going to have a field day, but I don't think
there is much that can be done about that. Much of this problem will
be dealt with in the courts -- where it should be.

As for end-users, remember that less than ten years ago the average
user didn't care squat about spam, but now they think differently. This
homographic phenomenon will run its course as well, and solutions will
be found.

>But is this problem really limited to IDN? What about the following
>legal ASCII DNS name:
>
>foo.com--secure-user-services-and-products.tech-mecca.biz
>
>Does this mean that we should try to switch left-to-right readers
>(most of the world) over to big-endian domain names? Please tell me
>I'm overreacting!

Possibly... but perhaps everyone is overreacting. IMO no safeguards
will stop illegal use of anything. Stop signs don't stop everyone
regardless of size, color, placement, fines, and laws regarding stop
signs. Likewise, and no offense, the efforts of this group will be no
different. There will be abuse regardless.

The most I think anyone can do is to focus on approaches like the
"Delimiter solution" such as those noted at: http://nameprep.org/
Therein, I think there is solid logic in this approach.

You might even go after punctuation or symbols, but then there are
honest reasons for people having punctuation and symbols in domain
names -- do you want to prohibit them because of the possibility of
abuse? Abuse, I might add, that could/should be dealt with via ICANN
and/or the courts -- where both sides can present their arguments.
Not everyone who uses a symbol in a domain name is wrong or is
attempting to commit fraud.

For example, I have the domain "not-equal sign" dot com. Why? It
seemed kind of neat at the time, and being disabled, I was thinking
of using it as a discrimination related web site. But, I had a
business approach me yesterday saying that they wanted to purchase
the name because the design (the not-equal sign) resembles their
product, which is a cat toy -- imagine that.

So, what purpose/use can a symbol domain name have? It depends upon
the market and regardless if you believe in, or approve of, market
forces, there are honest reasons for such domain names. So, let's not
throw the baby out with the bath water.

There are going to be many avenues for abuse, and I suspect many more
than this group can imagine. I know that after reading:
http://www.unicode.org/reports/tr36/tr36-2.html I was alerted to more
than what I wanted to know. However, my advice (being one of the
lesser thinkers in this group) is to concentrate on solid logic, like
the delimiter argument, and not on what "may" happen.

I'm not saying "give-up" -- I'm simply saying "don't overreact".

tedd
--
--------------------------------------------------------------------------------
http://sperling.com/
Erik van der Poel
2005-02-24 17:21:16 UTC
tedd wrote:
> I believe that the "fruit-loop" solution
> would fall short of expectations.

I wasn't talking about many colors. A character is either in the user's
set or not. So we only need 2 colors (if colors are used at all). The
user's set is typically derived from the browser localization or HTTP
Accept-Language preference setting.

Erik
tedd
2005-02-24 18:22:57 UTC
>tedd wrote:
>>I believe that the "fruit-loop" solution would fall short of expectations.
>
>I wasn't talking about many colors. A character is either in the
>user's set or not. So we only need 2 colors (if colors are used at
>all). The user's set is typically derived from the browser
>localization or HTTP Accept-Language preference setting.
>
>Erik

Erik:

I understand -- however:

First, I was addressing the idea of a colored url via a "tool-tip
plug-in" type solution -- a multi-colored or two toned fruit loop --
it doesn't make any difference, it's the same idea.

Second, while the user's set is typically derived as you say, I
imagine there will be users who will transcend those boundaries
(i.e., multilingual). Considering such, whatever the solution, it
will most likely "look" different for the same url and thus lose some
of it's potency as an alert.

Third, I imagine there may even be concerns (i.e., companies,
persons, organizations, sport groups, and even countries) who may
want a fruit-loop domain with their specific colors. It might get
that ridiculous or maybe this will be a new way to market domains. :-)

It's not a bad idea, just one that falls short of solving the
problem. Sometimes adding partial solutions to a problem become part
of the problem.

tedd
--
--------------------------------------------------------------------------------
http://sperling.com/
John C Klensin
2005-02-24 18:57:44 UTC
Tedd,

It seems to me that, if an application wanted to provide its
users a way to specify a list of expected characters that was
relatively short compared to the size of Unicode, and then to
warn the user in some way when an unexpected character appeared,
that would be reasonable and, for some users, helpful. I think
the idea gets into trouble only when the application starts
making guesses as to what should (or should not) be on that
list.

No, this wouldn't "solve" the problem. I agree with Erik that,
if people had high expectations for it, they would be
badly disappointed.

But I'm convinced that we aren't going to find a magic bullet
here. Instead, we can do some large fraction of the "useful,
but not a solution" tools that have been suggested. We can try
to restrict characters that are clearly dangerous, adopting, if
necessary, a view that the fact someone wants to register or use
a particular string doesn't mean that they are entitled to do
so. We can adopt a variety of warning technologies --whether
they involve colors, displaying punycode, pop-up warnings, or
something else-- and let applications compete on which ones can
do a better job of that. We can try some user education. We
can use the UDRP and/or the legal system in various countries to
push back on those who register deceptive names and on the
registrars and registries that encourage the registration of
such names. And other ideas may come along that should be
implemented.

Then we can hope that those things, in combination, reduce the
problem to some tolerable level, understanding that it will
never completely go away.

john


Erik van der Poel
2005-02-24 19:54:26 UTC
John C Klensin wrote:
> We can try
> to restrict characters that are clearly dangerous, adopting, if
> necessary, a view that the fact someone wants to register or use
> a particular string doesn't mean that they are entitled to do
> so.

You can write RFCs, move them to STD status, and jump up and down all
you want, but you can't stop domain name owners from creating "deep"
sub-domains with deceptive names that make the important part of the
name go off the end of the display area.

> We
> can use the UDRP and/or the legal system in various countries to
> push back on those who register deceptive names and on the
> registrars and registries that encourage the registration of
> such names.

The registrars and registries are not the problem. The domain name
owners are. If a poor individual has created a deceptive name that hurts
a huge company, that company may go after Microsoft (since it has deep
pockets) instead of the poor person.

So, the apps' current way of displaying the domain name (right-to-left)
in left-to-right cultures is the problem. I tried to make the case that
this is even a problem in the ASCII DNS (regardless of IDN), since
hyphens are allowed in most DNS implementations. I wonder if a phisher
would only have to change their own DNS server to get other characters
(like ASCII slash '/') into the names? Or would many of the DNS clients
refuse to look up names containing such characters? (I tried to create a
name containing ASCII slash yesterday, but my DNS server wouldn't accept
it.)

Hasn't this stuff been covered in any RFC yet?

Erik
John C Klensin
2005-02-24 22:22:35 UTC
--On Thursday, 24 February, 2005 11:54 -0800 Erik van der Poel
<***@vanderpoel.org> wrote:

> John C Klensin wrote:
>> We can try
>> to restrict characters that are clearly dangerous, adopting,
>> if necessary, a view that the fact someone wants to register
>> or use a particular string doesn't mean that they are
>> entitled to do so.
>
> You can write RFCs, move them to STD status, and jump up and
> down all you want, but you can't stop domain name owners from
> creating "deep" sub-domains with deceptive names that make the
> important part of the name go off the end of the display area.

Of course not. But there are several separate problems here.
For example:

(i) No one _makes_ an application author write things so
that "go off the end of the display area" is an option.
It may or may not be worth it, but there are all sorts
of ways to design a UI so that things wrap, scroll,
pop-up, warn, or are otherwise accessible from end to
end. It seems to me that convincing your favorite
applications author to not let long FQDNs disappear
off-screen is likely to be a lot easier than turning
domain names around (see below).

(ii) The Internet has never had a presentation layer,
and the IETF and its predecessors have never tried to
standardize one or what happens in it. To some extent,
many of the issues with URIs/IRIs, IDNs, etc., suggest
that may have been a mistake. But we just haven't gone
in that direction and starting to do so now would be a
pretty big deal. A presentation layer might solve this
problem because a user could then specify how various
things are to be displayed. Without it, and without
common operating system or utility library interfaces
for these things that everyone uses, one risks having
one application use one display order and another
application, on the same host and for the same user, use
a different one. That would create a mess and its own
set of risks; see below.

>> We
>> can use the UDRP and/or the legal system in various countries
>> to push back on those who register deceptive names and on the
>> registrars and registries that encourage the registration of
>> such names.
>
> The registrars and registries are not the problem. The domain
> name owners are. If a poor individual has created a deceptive
> name that hurts a huge company, that company may go after
> Microsoft (since it has deep pockets) instead of the poor
> person.

As I have said before, there is no magic bullet solution to this
group of problems. And, for the zones that are under their
control, the registries are the problem (for some part of the
broader problem) because they are in a position to prohibit
unacceptable registrations -- as they have done for years in
prohibiting names that aren't "hostname" (LDH)-conformant. That
conformance is not now, and has never been, a DNS protocol
requirement.

> So, the apps' current way of displaying the domain name
> (right-to-left) in left-to-right cultures is the problem. I
> tried to make the case that this is even a problem in the
> ASCII DNS (regardless of IDN), since hyphens are allowed in
> most DNS implementations. I wonder if a phisher would only
> have to change their own DNS server to get other characters
> (like ASCII slash '/') into the names? Or would many of the
> DNS clients refuse to look up names containing such characters?
> (I tried to create a name containing ASCII slash yesterday,
> but my DNS server wouldn't accept it.)

There are people who would claim that your DNS server is broken
-- see RFC 2181.

However...

The reality is that whether DNS names were to be treated as
big-endian or little-endian was hotly debated when the DNS was
first being designed and, if I recall what I was told, actually
changed once or twice. Plus or minus a bit, for every argument
that it should be one way, there was an argument that it should
be the other. For example, while you would like to see
com.mumblefraz.foo so as to detect issues with the TLD chosen, it is
equally the case that, for many of us in daily use, the
distinction between foo.mumblefraz.com and bar.mumblefraz.com is
more important.

Regardless of how one might toss that particular coin, it has
been tossed. We have a huge deployed base of the current order
and, even worse than software issues, the current order has been
imprinted on the consciousness of a lot of folks who don't
really know what a domain name is. Due to the intersection of
old JANET Coloured Book names with DNS names, we also have
considerable experience trying to operate an Internet in which
some names run from left to right and others run from right to
left. It wasn't a lot of fun and sometimes people (and
software) made mistakes: a domain name like uk.ac.ucl.bar.com
was a public nuisance or worse.

So, like it or not, I think you had best let this one go and get
used to it. FWIW, my ordering preference is the same as yours.
But, if I were to make a list of the Internet design decisions
that, in retrospect, I would have been happier if they had been
made some other way, this one wouldn't make the top ten. And I
suspect that opinion would be consistent with a poll of either
users or protocol designers, had such a poll been held. (Also,
FWIW, "not having domain names in URLs at all" would be close to
the top of my list.)

john
Erik van der Poel
2005-02-24 23:04:00 UTC
Permalink
John C Klensin wrote:
>
> So, like it or not, I think you had best let this one go and get
> used to it. FWIW, my ordering preference is the same as yours.

OK, I'll stop talking about domain name display order. (But you've all
been warned.)

Back to the business at hand: For Nameprep, I've been wondering whether
it would make sense to explore what exactly would have to happen if we
were to change the ACE prefix from xn-- to something else. I.e. clients
would have to be updated, servers would have to accept both the old and
the new during the transition period, etc. I'm not even sure this is
correct.
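
If that picture is roughly right, here is a minimal Python sketch of
what "accept both prefixes" might look like on the lookup side during
a transition. The new prefix "yz--" is a purely hypothetical
placeholder, not a proposal:

    import codecs

    OLD_ACE_PREFIX = "xn--"
    NEW_ACE_PREFIX = "yz--"   # hypothetical placeholder only

    def decode_ace_label(label):
        # During a transition, accept labels under either prefix; the
        # part after the prefix is Punycode (RFC 3492).
        lowered = label.lower()
        for prefix in (NEW_ACE_PREFIX, OLD_ACE_PREFIX):
            if lowered.startswith(prefix):
                return codecs.decode(label[len(prefix):], "punycode")
        return label   # ordinary ASCII label, not ACE

    # decode_ace_label("xn--bcher-kva") == "bücher"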

Anyway, I think such an exploration, including many details, might tell
us whether or not we should even contemplate making an incompatible
change to Nameprep.

Or do people think that we should approach this from the other side?
I.e. if we can come up with seriously dangerous homographs and other
characters that must be banned or mapped in a new nameprep, then this
could in itself be viewed as justification for a new ACE prefix,
regardless of how hard it might be to add it or move to it.

Thoughts?

Erik
Erik van der Poel
2005-02-25 03:51:37 UTC
Permalink
All,

1. Is this the right time to start working on Internet Drafts leading up
to new version(s) of the IDNA RFC(s)? If not, when?

2. Am I stepping on someone's toes by creating nameprep.org? Feel free
to respond publicly or privately.

3. If this is the right time to start work on drafts, who would like to
write some prose?

4. Do we need to revive the IDN WG?

5. Any other process questions?

Thanks,

Erik
Doug Ewell
2005-02-25 04:35:43 UTC
Permalink
Erik van der Poel <erik at vanderpoel dot org> wrote:

> 1. Is this the right time to start working on Internet Drafts leading
> up to new version(s) of the IDNA RFC(s)? If not, when?
> ...

I don't know about anyone else, but something seems badly wrong here.

Is it really possible that we spent a year and a half, two years on
putting together an IDN architecture, and during all that time nobody
ever gave the slightest thought to the possibility of someone using IDNs
for spoofing purposes, and now that one or two well-publicized spoofing
examples have appeared, we are ready to start all over again with a new
and probably incompatible version of the architecture?

Is this sending the kind of "stability" message that was considered so
important two or three years ago?

Is there even enough solid information to begin writing anything, or
just a general feeling that Something Needs To Be Done?

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
Stephane Bortzmeyer
2005-02-25 11:37:25 UTC
Permalink
On Thu, Feb 24, 2005 at 08:35:43PM -0800,
Doug Ewell <***@adelphia.net> wrote
a message of 26 lines which said:

> Is it really possible that we spent a year and a half, two years on
> putting together an IDN architecture, and during all that time
> nobody ever gave the slightest thought to the possibility of someone
> using IDNs for spoofing purposes,

This is absolutely wrong. The issue has been discussed at length. See
the "Security Considerations" of RFC 3490.

> and now that one or two well-publicized spoofing examples have
> appeared,

Pure marketing, BTW. Nothing new in the recent announcements, just
sensationalist articles.

> Is there even enough solid information to begin writing anything, or
> just a general feeling that Something Needs To Be Done?

The Powers Above require that Something should be done (they will
forget about it in a few weeks) but, technically speaking, there is
indeed nothing new now.
Erik van der Poel
2005-02-25 15:45:47 UTC
Permalink
Stephane Bortzmeyer wrote:
> The issue has been discussed at length. See
> the "Security Considerations" of RFC 3490.

It is true that some of the issues are pointed out by that section, so
the registries and application developers have to pay attention. But one
might argue that we have recently been discussing a new *class* of
homographs. The RFC mentions "multiple scripts" and one and l. These two
refer to letters such as Cyrillic small 'a' and digits (the "one"). But
the slash homograph recently raised on this list might be considered to
be a new class of homograph (punctuation), not specifically indicated in
the RFC. Not only is this type of character different from letters and
digits, it is arguably even more dangerous than the script-based
(Cyrillic) attack, since it can be done in a domain label that is not
under the control of the registries. So that first line of defense is
not there, and we must rely totally on the apps, and there are many.
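
To make the hazard concrete, here is a small Python illustration (my
own sketch; the names are invented). Python's built-in "idna" codec
implements the RFC 3490 ToASCII operation including Nameprep, and
U+2215 is not in Nameprep's prohibited tables, so the encoding
succeeds:

    # The "slash" below is U+2215 DIVISION SLASH, a single character
    # inside a label -- not a path separator. In a URI this hostname
    # reads as if the host were bank.example with a /login... path.
    spoof = "bank.example\u2215login.attacker.example"

    # The registry only controls the registration of attacker.example;
    # the labels above it are chosen by the registrant. This encodes
    # without error, yielding an ACE (xn--) form for the U+2215 label.
    print(spoof.encode("idna"))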

One could argue that a new document should be published and widely
circulated to warn about this new kind of attack. One of my questions is
whether this warning should appear in a new version of the RFC, or in a
separate document. Alternatively, it may be decided that this type of
homograph is so different and so dangerous that a new version of the
protocol that prohibits these characters, with a new ACE prefix, should
be created. I don't know.

Also, the "multiple scripts" wording does not specifically cover the
all-Cyrillic case. So that part could be tightened up too.

By the way, the RFC's Security section includes the following:

No security issues such as string length increases or new
allowed values are introduced by the encoding process or the use of
these encoded values, apart from those introduced by the ACE encoding
itself.

What does this mean, exactly? Are any new allowed values introduced by
the ACE encoding? This part could be clearer.

Also, O and 0 are mentioned, but is this technically correct? I mean,
aren't uppercase ASCIIs supposed to be lowercased? I'm sorry if I'm
wrong about this part.

> Nothing new in the recent announcements, just
> sensationalist articles.

Again, I think the slash homograph might be new. Do you have evidence to
suggest that it *was* considered by the WG or anybody else?

> The Powers Above require that Something should be done

Have you seen any indication of this?

Thanks,

Erik
James Seng
2005-02-25 16:50:44 UTC
Permalink
FYI, punctuation was also debated extensively back then, so it isn't
really a case of "Oh, we just discovered this!".

It's just interesting to see people putting it into action. Do apps
need to be fixed? Yes, absolutely. App developers need to go read the
security considerations of RFC 3490 and think about how to implement
some solution to reduce the spoofing risk.

But does it warrant changing the existing RFC? I don't know. I haven't
seen anything that suggests we missed something back then.

-James Seng

Erik van der Poel
2005-02-25 17:14:34 UTC
Permalink
James Seng wrote:
> FYI, punctuation was also debated extensively back then, so it isn't
> really a case of "Oh, we just discovered this!".
>
> It's just interesting to see people putting it into action. Do apps
> need to be fixed? Yes, absolutely. App developers need to go read the
> security considerations of RFC 3490 and think about how to implement
> some solution to reduce the spoofing risk.
>
> But does it warrant changing the existing RFC? I don't know. I haven't
> seen anything that suggests we missed something back then.

That's a pretty cavalier attitude for someone who used to be co-chair of
an IETF Working Group.

http://ietf.org/html.charters/OLD/idn-charter.html

Maybe I should write an Informational RFC titled "IDN Considered
Harmful". Would that catch your attention?

Erik

PS Re: RFC 3490 Security Considerations, please tell me exactly where
punctuation is specifically mentioned.
James Seng
2005-02-25 17:33:30 UTC
Permalink
Erik,

> That's a pretty cavalier attitude for someone who used to be co-chair
> of an IETF Working Group.

Please focus on the logic of my explanation instead of paying attention
to who I am or who I was. What's wrong with my explanation?

> Maybe I should write an Informational RFC titled "IDN Considered
> Harmful". Would that catch your attention?

You should write what you think, and stop writing what will catch my
(or others') attention. Perhaps that would be most constructive for all.

> PS Re: RFC 3490 Security Considerations, please tell me exactly where
> punctuation is specifically mentioned.

I said it was discussed in the WG before. Much of the discussion wasn't
recorded in the RFC. Please go through the archive and you will find at
least 2 threads discussing it. It's 2:32am for me now, so I am heading
to bed; sorry I can't post any URLs for you right now.

-James Seng
Jaap Akkerhuis
2005-02-25 17:07:52 UTC
Permalink
Also, the "multiple scripts" wording does not specifically cover
the all-Cyrillic case. So that part could be tightened up too.

The all-Cyrillic case is quite well known. I seem to remember a court
case about it dating back to even before IDNs were discussed. Security
sections do not necessarily describe all corner cases.

> Also, O and 0 are mentioned, but is this technically correct?
> I mean, aren't uppercase ASCIIs supposed to be lowercased?

When the input to the encoding consists of ASCII in the LDH set, it
will be left untouched. For DNS it doesn't matter (case-insensitive),
but the case should be preserved when possible. Only when the label
contains LDH is the ASCII lowercased.

jaap
Jaap Akkerhuis
2005-02-25 17:28:10 UTC
Permalink
Oops,

> When the input to the encoding consists of ASCII in the LDH set, it
> will be left untouched. For DNS it doesn't matter (case-insensitive),
> but the case should be preserved when possible. Only when the label
> contains LDH is the ASCII lowercased.

s/contains LDH/contains other characters than just LDH/

jaap
Adam M. Costello
2005-02-26 07:06:39 UTC
Permalink
Erik van der Poel <***@vanderpoel.org> wrote:

> Stephane Bortzmeyer wrote:
>
> >The issue has been discussed at length. See the "Security
> >Considerations" of RFC 3490.
>
> It is true that some of the issues are pointed out by that section, so
> the registries and application developers have to pay attention. But
> one might argue that we have recently been discussing a new *class*
> of homographs. The RFC mentions "multiple scripts" and one and l.
> These two refer to letters such as Cyrillic small 'a' and digits
> (the "one"). But the slash homograph recently raised on this list
> might be considered to be a new class of homograph (punctuation), not
> specifically indicated in the RFC. Not only is this type of character
> different from letters and digits, it is arguably even more dangerous
> than the script-based (Cyrillic) attack, since it can be done in a
> domain label that is not under the control of the registries.

We knew that punctuation could be hazardous, and we expected that it
would be severely restricted by the registries. I don't think we
understood that punctuation could be used to spoof top-level domains
even if every top-level registry prohibited punctuation.

As for application implementors, we made no attempt to mention
every kind of hazard we had thought of; we just wanted to give a
motivating example to start them thinking about what safeguards would be
appropriate for their applications.

Maybe the emerging UTR#36 will become the canonical reference for
spoofing hazards, in which case any revision of the IDNA spec should
certainly cite it.

> No security issues such as string length increases or new allowed
> values are introduced by the encoding process or the use of these
> encoded values, apart from those introduced by the ACE encoding
> itself.
>
> What does this mean, exactly? Are any new allowed values introduced
> by the ACE encoding? This part could be clearer.

It might mean that IDNA does not introduce any new ASCII domain names;
it only introduces new non-ASCII domain names. In any case, that's
true.

> Also, O and 0 are mentioned, but is this technically correct? I mean,
> aren't uppercase ASCIIs supposed to be lowercased?

Nameprep (which includes case-folding) is used for encoding and
comparing domain names, not for displaying them. At least, IDNA makes
no suggestion that Nameprep be used for displaying domain names. If an
IRI contains a mixed-case non-ASCII domain name, IDNA suggests applying
ToUnicode to each domain label, which will internally use Nameprep
before looking for the ACE prefix, but then, not finding the ACE prefix,
it will return the original un-Nameprep'd input for display.

The draft UTR#36, unlike the IDNA spec, recommends using Nameprep for
displayed domain names, to simplify detection of confusable names.
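
As a rough sketch of that display path (simplified: the real ToUnicode
also re-verifies its output via ToASCII; here Python's bundled
nameprep and punycode codec stand in):

    import codecs
    import encodings.idna as idna

    ACE_PREFIX = "xn--"

    def to_unicode_for_display(label):
        # RFC 3490 ToUnicode: if any step fails, or the label carries
        # no ACE prefix after Nameprep, the *original* input is
        # returned -- so a mixed-case non-ASCII label comes back for
        # display exactly as the author wrote it.
        try:
            prepped = label if label.isascii() else idna.nameprep(label)
            if not prepped.lower().startswith(ACE_PREFIX):
                return label
            return codecs.decode(prepped[len(ACE_PREFIX):], "punycode")
        except UnicodeError:
            return label

    print(to_unicode_for_display("xn--bcher-kva"))  # bücher
    print(to_unicode_for_display("Bücher"))         # Bücher, unchanged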

AMC
John C Klensin
2005-02-25 17:29:58 UTC
Permalink
--On Thursday, 24 February, 2005 20:35 -0800 Doug Ewell
<***@adelphia.net> wrote:

> Erik van der Poel <erik at vanderpoel dot org> wrote:
>
>> 1. Is this the right time to start working on Internet Drafts
>> leading up to new version(s) of the IDNA RFC(s)? If not, when?
>> ...
>
> I don't know about anyone else, but something seems badly
> wrong here.
>
> Is it really possible that we spent a year and a half, two
> years on putting together an IDN architecture, and during all
> that time nobody ever gave the slightest thought to the
> possibility of someone using IDNs for spoofing purposes, and
> now that one or two well-publicized spoofing examples have
> appeared, we are ready to start all over again with a new and
> probably incompatible version of the architecture?

I certainly hope not. And certainly we knew about these issues.
Even the potential problem with symbols and box-drawing
characters was identified, although not in the lurid detail of
some of the recent examples. There was discussion around the WG
about what to do about those issues, and how completely to
describe them. I think the consensus at that time was to not
write a lot of these issues up in detail for fear of
discouraging IDNA implementations. That consensus was, IMO,
reached in a WG in which many of the participants were, for
various reasons, just anxious to get finished and not paying
much attention to the finer details.

While I advocated at least one radically different architecture
while the IDN WG work was going on, I think, personally, that
looking at a new and incompatible architecture would be pretty
close to insane. As I tried to explain in my remarks on Erik's
proposal that we reverse the presentation order of domain names,
I just don't think it is possible to go there. And, even if we
wanted to, there is no reason to believe that any other
architecture would work better: these homograph problems are the
inevitable consequence of the relationships among the scripts
themselves: it is unlikely that even dumping Unicode and
switching to something else would help very much. And there
isn't any "something else".

However, two philosophical principles, and one strong
assumption, went into the design of IDNA and its
supporting tables.
completely valid and we may need to look harder at the
implications of its failure. I suggest that either or both of
the philosophical principles could be reviewed and, if
necessary, changed in the light of experience and that neither
change would be fatal to IDNs or IDNA, or even especially
disruptive if made fairly soon.

This is _not_ a suggestion that those changes should be made,
only that it would be plausible for us to review the decisions
and reach some conclusions about whether they are still
appropriate in the light of experience.

I hope that those who wrote the IDNA specs will agree with the
statement of those principles I'm about to make, or at least
that they are close... they may not.

(1) To the extent possible, we should accommodate all Unicode
characters, excluding as little as possible. This position was
reinforced by the view that, at the time, the Unicode
classifications of characters were considered a little soft and
a general conviction that the IETF should not be making
character-by-character decisions. A counter-principle, now if
not then, is that we should permit a relatively narrow extension
of the "letter-digit-hyphen" rule, i.e., permitting, only
letters (in any alphabet or script), perhaps local digits, and
the hyphen, but no other punctuation, symbols, drawing
characters, or other non-letter characters. Adam has argued for
that revised principle recently; several people argued for it
when IDNA was being produced. We could probably still impose
it, and, in any event, it would not require a change in the
basic architecture (see below).

(2) When code points had been identified by UTC as the same as,
or equivalent to, others, we tended to map them together, rather
than picking one and prohibiting the others. This has caused
more problems than most of us expected, with people being
surprised when they register or query using one character and
the result that comes back uses another. It also creates a
near-homograph problem that we haven't "discovered" in the last
couple of weeks: If we have character X mapping to character Y,
but X looks vaguely like Z, then there may be no Y-Z homograph,
but there may be an X-Z one. That could make display decisions,
etc., quite critical and, unless applications got it entirely
right, we might end up with a new family of attacks. Again,
that decision could be reviewed. Perhaps there are groups of
characters that should be prohibited from being included in a
lookup or registration operation, not just mapped to something
more reasonable. And, again, this would be a tuning of tables,
not a change in the basic architecture.

The assumption I referred to above was that ICANN would take a
strong role in determining which characters were really
appropriate for registration and under what circumstances, that
they would institute and enforce appropriate rules, and that
everyone relevant would pay attention to whatever they said.
Every element of that assumption has turned out to be false:
they haven't taken that role; their guidelines are weak,
ambiguous, and at least partially wrong; and some registries
have just ignored the rules that do exist without any penalty.
If there is a problem, either we are going to need to solve it,
or we are going to risk different solutions in different
applications that, taken together, compromise interoperability.

> Is this sending the kind of "stability" message that was
> considered so important two or three years ago?

It is sending the "get it right and get it interoperable"
message that is supposed to dominate IETF decision-making,
especially with Proposed Standards.

> Is there even enough solid information to begin writing
> anything, or just a general feeling that Something Needs To Be
> Done?

I think it is time for us to ask the questions that are
suggested above, and to ask them explicitly. If doing so
produces the answer that it is time to make changes --table
changes, not architectural changes-- I think we should do so.
Perhaps we could combine that table review process with an
upgrade to Unicode 4.x, which would accommodate several scripts
we can't handle today.

Could this be done compatibly? Not quite. For starters, we
would have to address more squarely the question that the first
principle identified above bypassed: does someone have the
_right_ to register a particular sequence of Unicode characters?
If the answer is that, because I can draw out a symbol that
represents my business, or my religion, or my location, I
have the "right" to register it, then we are in trouble: someone
out there will organize the Church of the Holy Right-Slash and
prohibiting it will discriminate against that religion,
especially if left-slashes and vertical bars are permitted. If
we can get past "right to register", we need to look at the
experience of the browser implementers who have already
concluded that, registered or not, they really don't want to
recognize or process domain names containing such characters.
And then we need to present the transition problem of
eliminating any such domains that may exist to ICANN and say
"you were unable or ineffective at preventing these problems
from occurring, so, as a prize, you get to figure out how to
retire those names and are now prohibited by the updated
standard".

Curiously, if we followed existing precedents, we could even
move IDNA from Proposed to Draft and change the tables to
eliminate many mappings and characters: no change to the
algorithm, just elimination of some features that didn't work in
practice. That is not a proposal, just an observation :-)

john
Erik van der Poel
2005-02-25 22:23:23 UTC
Permalink
John,

Thank you for taking the time to write such a well-thought-out response.
I agree with some of the points you make, but I'm going to present
arguments against the others. I'm currently leaning towards *not*
changing IDNA (other than to fix mistakes and clarify some sections).

John C Klensin wrote:
>
> (1) To the extent possible, we should accommodate all Unicode
> characters, excluding as little as possible. This position was
> reinforced by the view that, at the time, the Unicode
> classifications of characters were considered a little soft and
> a general conviction that the IETF should not be making
> character-by-character decisions. A counter-principle, now if
> not then, is that we should permit a relatively narrow extension
> of the "letter-digit-hyphen" rule, i.e., permitting, only
> letters (in any alphabet or script), perhaps local digits, and
> the hyphen, but no other punctuation, symbols, drawing
> characters, or other non-letter characters. Adam has argued for
> that revised principle recently; several people argued for it
> when IDNA was being produced. We could probably still impose
> it, and, in any event, it would not require a change in the
> basic architecture (see below).

I believe it would be difficult to reach consensus on a relatively
narrow extension of the LDH rule. Just for starters, the hyphen used to
separate names and other strings in the Western world is not used in
Japan for Katakana, because Katakana uses a middle dot (U+30FB) to
separate 2 Katakana strings. In fact, this character is allowed in .jp.

If we do *not* allow these special local characters that function in the
same way as the hyphen in the West, then people in other parts of the
world would not only claim that our spec is unfair, they might even
ignore it. If we *do* allow this Japanese example, then we have started
sliding down a slippery slope that ends with a rather large extension of
the LDH rule (for the rest of the world), and then the phishing problem
would not be alleviated as much as we might have hoped when we started
with just LDH. This would be a lot of work for little gain.

So it's a lose-lose situation. Instead, we should probably stick to
IDNA's original principle of allowing a lot of Unicode, and have the
local registries, zone administrators and apps address the phishing problem.

> (2) When code points had been identified by UTC as the same as,
> or equivalent to, others, we tended to map them together, rather
> than picking one and prohibiting the others. This has caused
> more problems than most of us expected, with people being
> surprised when they register or query using one character and
> the result that comes back uses another. It also creates a
> near-homograph problem that we haven't "discovered" in the last
> couple of weeks: If we have character X mapping to character Y,
> but X looks vaguely like Z, then there may be no Y-Z homograph,
> but there may be an X-Z one. That could make display decisions,
> etc., quite critical and, unless applications got it entirely
> right, we might end up with a new family of attacks. Again,
> that decision could be reviewed. Perhaps there are groups of
> characters that should be prohibited from being included in a
> lookup or registration operation, not just mapped to something
> more reasonable. And, again, this would be a tuning of tables,
> not a change in the basic architecture.

It may be possible to "tune" the tables, but nowhere in your email do I
find any reference to the ACE prefix. I think that we should also figure
out exactly which types of changes would absolutely require a new ACE
prefix, and then explore in detail what all the affected parties would
have to do to add a new prefix to the mix or to transition to it. The
parties I'm thinking of are app developers and registries, mostly, but
content developers might also be affected.

> The assumption I referred to above was that ICANN would take a
> strong role in determining which characters were really
> appropriate for registration and under what circumstances, that
> they would institute and enforce appropriate rules, and that
> everyone relevant would pay attention to whatever they said.
> Every element of that assumption has turned out to be false:
> they haven't taken that role; their guidelines are weak,
> ambiguous, and at least partially wrong; and some registries
> have just ignored the rules that do exist without any penalty.
> If there is a problem, either we are going to need to solve it,
> or we are going to risk different solutions in different
> applications that, taken together, compromise interoperability.

I'm currently thinking that we (IETF) can't really solve these problems,
and that the registries and apps are going to have to address them. But
I strongly sympathize with your stated concern about differing solutions
leading to interoperability problems, and so I think "we" (not IETF)
must come up with much better registry guidelines and even
recommendations and proposals for the apps. Such documents would not
necessarily be IETF documents, though they could be if they are merely
informational (not standards track). Other organizations like ICANN
could then take some of that and fold it into their own doc, but they
would probably make some of it normative (or MUST). There isn't really a
single organization for the apps (W3C doesn't cover all), so an IETF
informational RFC might be good for them.

> If
> we can get past "right to register", we need to look at the
> experience of the browser implementers who have already
> concluded that, registered or not, they really don't want to
> recognize or process domain names containing such characters.

Some of these implementors might decide to disable IDNA labels under
some circumstances, but the existence of a number of IDN plug-ins for
MSIE and the extensibility of Mozilla and the need for IDNs around the
world suggest that their decisions may be circumvented. Eventually,
these implementors may decide to improve their own IDN support. I
realize that the short-term decisions may be bad for IDN, but I am
hopeful for the future.

Erik
Erik van der Poel
2005-02-26 01:07:35 UTC
Permalink
> If we do *not* allow these special local characters that function in the
> same way as the hyphen in the West, then people in other parts of the
> world would not only claim that our spec is unfair, they might even
> ignore it. If we *do* allow this Japanese example, then we have started
> sliding down a slippery slope that ends with a rather large extension of
> the LDH rule (for the rest of the world), and then the phishing problem
> would not be alleviated as much as we might have hoped when we started
> with just LDH. This would be a lot of work for little gain.
>
> So it's a lose-lose situation.

Sorry, I said that wrong. What I meant was, "Damned if you do, damned if
you don't."

However, one avenue that might be worth exploring some more is to check
each registry's character table (for those that have one) and see what
the Unicode category is for each character. The Japanese Katakana middle
dot U+30FB has the category "Pc" which means "punctuation, connector"
and LDH's hyphen U+002D has the category "Pd" which means "punctuation,
dash".

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
http://vanderpoel.org/networking/i/idn.html (see bottom)

If it turns out that all or most of the registries that have tables are
using characters with only a small number of Unicode categories, then we
may wish to consider moving IDNA to that set of categories (disallowing
all others). This would keep the registries happy while keeping *some*
of the phishy characters out of DNS.
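
Running that check takes only a few lines of Python once a table is in
hand. The fragment below is a made-up illustration, and note that
General_Category values can shift between Unicode versions (U+30FB is
Pc in Unicode 3.2 but Po in later versions), so unicodedata's answer
depends on the Python build:

    import unicodedata
    from collections import Counter

    def category_histogram(chars):
        # Tally the Unicode General_Category of each permitted character.
        return Counter(unicodedata.category(ch) for ch in chars)

    # Hypothetical fragment of a registry table: hyphen, katakana
    # middle dot, and two katakana letters.
    jp_fragment = "-\u30fb\u30a2\u30a4"
    print(category_histogram(jp_fragment))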

Erik
Erik van der Poel
2005-02-28 02:15:27 UTC
Permalink
> However, one avenue that might be worth exploring some more is to check
> each registry's character table (for those that have one) and see what
> the Unicode category is for each character. The Japanese Katakana middle
> dot U+30FB has the category "Pc" which means "punctuation, connector"
> and LDH's hyphen U+002D has the category "Pd" which means "punctuation,
> dash".
>
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
>
> If it turns out that all or most of the registries that have tables are
> using characters with only a small number of Unicode categories, then we
> may wish to consider moving IDNA to that set of categories (disallowing
> all others). This would keep the registries happy while keeping *some*
> of the phishy characters out of DNS.

Even if we do not end up prohibiting a larger number of characters in
nameprep-bis, it might still be a good idea to have the results of the
investigation proposed above, since these Unicode character categories
could then be entered into the guidelines for the registries.

So, these two sub-projects (nameprep-bis and registry table
investigation) could proceed in parallel. I think it would be good to
divide and conquer, since one person cannot do all of this. Perhaps we
could invite volunteers to work on sub-projects?

As I indicate at nameprep.org, I found some character tables at the IANA
site, but I found even more at the GNU libidn site. One of the first
things to do is to agree on a single machine-readable format. The tables
do not all use the same format yet, it seems. Then we would also need to
have the latest and most official tables from the registries themselves
(instead of possibly out of date IANA tables and possibly embellished
unofficial GNU libidn tables).

http://nameprep.org/#related-work

Erik
John C Klensin
2005-02-28 02:58:52 UTC
Permalink
--On Sunday, 27 February, 2005 18:15 -0800 Erik van der Poel
<***@vanderpoel.org> wrote:

>...
> As I indicate at nameprep.org, I found some character tables
> at the IANA site, but I found even more at the GNU libidn
> site. One of the first things to do is to agree on a single
> machine-readable format. The tables do not all use the same
> format yet, it seems. Then we would also need to have the
> latest and most official tables from the registries themselves
> (instead of possibly out of date IANA tables and possibly
> embellished unofficial GNU libidn tables).

Erik,

I've been mildly resisting standard-format,
machine-interpretable tables at IANA for a few reasons. The two
most important ones are:

(i) ICANN is still assuming that this is a registry
issue. As such, if someone else starts guessing at what
a registry is doing, we may get into trouble, especially
since the tables may not show all of the relevant
registry rules and restrictions.

(ii) We've got at least two models for processing a
proposed IDN. One compares the proposed label against a
list of characters for the selected language as
maintained by that registry and, if it passes, registers
it if it isn't already taken. The other involves the
JET "variant" model, or some relative of it, to
determine what labels, or sets of labels, are permitted.
The first plan requires a simple list of characters; the
second requires a three (or two, or four) column table.
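
To illustrate the difference in mechanism only, not any actual
registry's policy: the first model is a one-line membership test over
a per-language list, while the variant model needs a multi-column
table mapping each character to its preferred and permitted variants.
A sketch of the first, with an invented list:

    # Hypothetical per-language character list; real registry tables
    # are larger and carry additional rules not captured here.
    PERMITTED = set("abcdefghijklmnopqrstuvwxyz0123456789-")

    def registrable(label, permitted=PERMITTED):
        return all(ch in permitted for ch in label)

    print(registrable("bucher"))   # True under this fragment
    print(registrable("bücher"))   # False under this fragment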

Also please note that the IANA tables make no attempt to be
authoritative for any given language. They are just
documentation of what characters a given registry permits to be
associated with a given "language" for their registry. To have
three different, and incompatible, tables --associated with
three different registries-- for "the same language" is not only
possible, but likely.

john
JFC (Jefsey) Morfin
2005-02-28 03:50:04 UTC
Permalink
At 03:58 28/02/2005, John C Klensin wrote:
>To have three different, and incompatible, tables --associated with three
>different registries-- for "the same language" is not only possible, but
>likely.

There are 260 TLDs. There are 7,260 languages, some of them having 2 or
even 3 scripts. There are around 13,000 dialects of some importance
(one considers that a language needs 100,000 speakers to survive).
E-colonization (dominance of an e-culture) should probably lead to the
initial deprecation of some languages, but recent history shows
cultural resistance and resurgence after such a shock. So one can
consider that the Internet will most probably help languages to survive
and develop: a 50,000-speaker minimum might be a good rule of thumb
(think of trade and community idioms).

So, roughly, one can consider that 50,000 languages, each with up to
260 variants (at the TLD level), are to be considered when planning a
solution able to scale. Obviously most of them will try to use the same
script as much as they can for the TLDs, but this cannot be assumed to
hold systematically throughout a language. So one has to consider 10
million possibilities, most of them synonyms or not implemented. And I
am only talking about the legacy: PADs may introduce 10 times this.

This looks like an impossible task under the IDNA internationalization
concepts. I do not think it is a big deal under the DNS concepts, if
they and real-life constraints are respected.

jfc

NB: The problem discussed was the phishing permitted by IDNA, not how
to (re)build consistent global namespace support in the DNS. The
current grassroots process, of which NETPIA is only one of the
participants, will necessarily address it, if only because 86% of the
people need it. IMHO, if we want the IETF to do something positive in
that area, we should start by making a review of all the existing
solutions, projects, trends, and probable possibilities, and try to
give them some common guidance towards a common consistency.
Erik van der Poel
2005-02-28 04:19:08 UTC
Permalink
John C Klensin wrote:
>
> (i) ICANN is still assuming that this is a registry
> issue. As such, if someone else starts guessing at what
> a registry is doing, we may get into trouble, especially
> since the tables may not show all of the relevant
> registry rules and restrictions.

Hmmm... GNU libidn already seems to be trying to use machine-readable
tables. I had a look at the GNU libidn page:

http://www.gnu.org/software/libidn/

It has a copy of an expired Internet Draft by Paul Hoffman:

http://josefsson.org/cgi-bin/rfcmarkup?url=http://josefsson.org/cgi-bin/viewcvs.cgi/*checkout*/libidn/doc/specifications/draft-hoffman-idn-reg-02.txt

This draft seems to be talking about bundling and blocking, which your
draft talks about too. What happened here? Did Paul decide to let his
expire?

Anyway, my only reason for trying to get machine-readable tables was to
figure out which Unicode character categories were being used. Another
way to get this info is to simply ask the registries. Or, we can suggest
a list of categories and see if they would be happy with a nameprep-bis
that limits the characters to those categories.

Erik
Paul Hoffman
2005-02-28 04:33:54 UTC
Permalink
At 8:19 PM -0800 2/27/05, Erik van der Poel wrote:
>What happened here? Did Paul decide to let his expire?

Yes. It turned out to be a bad idea that got more complicated and
less justifiable (that is, worse) with each rev, so I let it die.
Others (notably JET) tried different things.

--Paul Hoffman, Director
--Internet Mail Consortium
Gervase Markham
2005-03-02 12:03:55 UTC
Permalink
Paul Hoffman wrote:
> Yes. It turned out to be a bad idea that got more complicated and less
> justifiable (that is, worse) with each rev, so I let it die. Others
> (notably JET) tried different things.

Paul,

Could you tell us more about the problems you found with the ideas of
bundling and blocking?

Gerv
Paul Hoffman
2005-03-02 00:16:07 UTC
Permalink
At 12:03 PM +0000 3/2/05, Gervase Markham wrote:
>Could you tell us more about the problems you found with the ideas
>of bundling and blocking?

It was impossible to come up with a bundling scheme that kept
everyone happy. The needs of the Chinese language communities for
bundling were different than the needs of the Scandinavian language
communities, which in turn were different than the needs of the Indic
language communities, which were different than the needs of the
Arabic language communities, and so on. Then toss in the communities
that truly want multiple scripts but want to avoid homograph attacks
(yes, we really did think about that years ago...), and your brain
starts dripping from your ears.

Other folks with more brains or who are less prone to dripping are
welcome to try to fix this for the world, or at least for one
community as the JET folks did.

--Paul Hoffman, Director
--Internet Mail Consortium
Erik van der Poel
2005-03-02 01:48:48 UTC
Permalink
Paul Hoffman wrote:
> At 12:03 PM +0000 3/2/05, Gervase Markham wrote:
>
>> Could you tell us more about the problems you found with the ideas of
>> bundling and blocking?
>
> It was impossible to come up with a bundling scheme that kept everyone
> happy. The needs of the Chinese language communities for bundling were
> different than the needs of the Scandinavian language communities, which
> in turn were different than the needs of the Indic language communities,
> which were different than the needs of the Arabic language communities,
> and so on. Then toss in the communities that truly want multiple scripts
> but want to avoid homograph attacks (yes, we really did think about that
> years ago...), and your brain starts dripping from your ears.

Yes, as a long-time internationalization engineer, I can imagine that it
was difficult to come up with a single set of guidelines for all of the
world's registries. (In addition to language differences, some comments
on this list have led me to believe that there are also protocol
differences between the registries, i.e. VeriSign's multiple versions of
RRP vs the EPP that Edmon Chung seems to have been working on vs fax and
sneaker net vs any others?)

However, I note that this particular conversation is between a browser
developer (Gervase) and one of the IDNA authors (Paul), neither of whom
is a registry representative, so why exactly are you 2 having this
conversation? :-)

Sorry, I'm half joking. Half, because you two have every right to
discuss whatever you wish. The other half because I believe browser
developers can afford to focus more on their end of things. Allow me to
insert an excerpt from a previous email I wrote up:

-----------------

It is pretty clear that none of the organizations can completely solve
the problem on its own. Unicode can warn about these issues, but that is
all they can do. They cannot remove characters. The IETF is currently
discussing the prohibition of certain characters or character types.
Even if the IETF publishes updated versions of the specs, there will
still be the problem of certain characters being unfamiliar to many
users (simply because they do not know all the legitimate characters in
the world), thereby leaving them exposed to the phishers. The registries
can enforce rules at their level, but nobody has yet shown that they can
truly enforce any rules at other levels. So, the browser developers must
address that problem.

There are several issues here. One is that domain names are typically
displayed inside something else, e.g. a URI. This, in itself, gives the
phishers something to work with. So the browser developers must think
about other ways to display domain names. This is not very easy. People
exchange URIs via email and other means all the time. Apps turn those
URIs into clickable links, as a service to users. If not, they can copy
and paste the URI into the URI field. Both of these methods could be
improved to highlight the domain name in the interests of security.

Another problem is that humans are only familiar with a small set of
characters. Some humans know *many* characters (i.e. the East Asians),
but most know a lot less than that. Now, within the set of characters
that each user is familiar with, there are no homograph problems (or
just a few). However, as soon as you stray outside any single user's
familiar set, there are many homographs, near-homographs and unfamiliar
symbols. When a typical computer user is faced with something
unfamiliar, they are quite likely to shrug it off and assume it's just
one of those "computer" things that they cannot understand. This is
something that IDN phishers could take advantage of, if the browsers do
not take steps to highlight the unfamiliar characters (via HTTP
Accept-Language and browser localization as I suggested). Of course,
highlighting is not sufficient. Education is also very important.
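
To sketch what highlighting unfamiliar characters might mean in code
(my illustration only; Python's unicodedata exposes no script
property, so Unicode character-name prefixes stand in as a rough proxy
for a Latin-script user's repertoire):

    import unicodedata

    def unfamiliar_chars(hostname, familiar=("LATIN ", "DIGIT ")):
        # Flag characters outside a (hypothetical) user's familiar
        # repertoire, approximated here by Unicode name prefixes.
        flagged = []
        for ch in hostname:
            if ch in "-." or "a" <= ch <= "z" or "0" <= ch <= "9":
                continue  # plain LDH is always familiar
            if not unicodedata.name(ch, "").startswith(familiar):
                flagged.append(ch)
        return flagged

    # Cyrillic small 'a' (U+0430) in an otherwise Latin name is flagged:
    print(unfamiliar_chars("p\u0430ypal.example"))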

So, instead of wasting time talking about a non-solution (white/black
lists), it would be nice to see these parties spending their valuable
time on real solutions. The registries could be working on the
guidelines, to address the concerns about language tagging, variants and
so on. They could also get in touch with the IETF, to let them know
which Unicode characters and character types they wish to use, so that
the IETF can consider how to publish new specs that might prohibit other
characters. Browser developers could start working on ways to display
domain names in ways that give the phishers less to work with.

---------------------

In other words, I do not think browser developers need to be overly
concerned with the particular bundling/blocking schemes that the
registries might be using. Instead, I wish the browser developers would
focus more on the *user*, who may be "surfing" from one site to the
next, spanning the globe, and crossing language boundaries. In order to
protect such a user, the browser should focus on the core set of
characters that s/he is familiar with, and provide some sort of
indication when unfamiliar characters appear, so that the
security-conscious, educated user may know when to be careful. I.e. the
language of the *user* is important, not the language of the domain name.

I am *not* saying that this would be easy to implement. I am not at all
surprised that Mozilla and Opera have chosen an easy stopgap, hopefully
only for the interim. It's great to see Mozilla and Opera lead the way
as they have been!

Erik
Erik van der Poel
2005-03-02 04:47:24 UTC
Permalink
> However, I note that this particular conversation is between a browser
>> developer (Gervase) and one of the IDNA authors (Paul), neither of whom
> is a registry representative, so why exactly are you 2 having this
> conversation? :-)
>
> Sorry, I'm half joking. Half, because you two have every right to
> discuss whatever you wish. The other half because I believe browser
> developers can afford to focus more on their end of things.

Sorry, I've been told that this half-joking thing was confusing, and I
now believe I shouldn't have tried to be so cute.

All I'm trying to say to *Gervase* is that it doesn't really matter
*what* characters are allowed to be registered in a registry, as long as
the browser takes steps to warn the user when something phishy might be
going on, e.g. a slash homograph, or a Cyrillic small 'a' when the user
was probably expecting a Latin small 'a'. As I have pointed out, the
registry does *not* have control over higher-numbered level domains.
E.g. .de controls the 2nd level domain (2LD), but not the 3LD, 4LD and
so on. That is where the slash homograph problem *really* matters.

> Instead, I wish the browser developers would
> focus more on the *user*, who may be "surfing" from one site to the
> next, spanning the globe, and crossing language boundaries.

Sorry, this may not have been the best logic to use in my argument. It
would have been better to talk about phishers, who often spam users with
email containing URIs that *could* contain IDN labels with dangerous
homographs at any level of the name, 2LD, 3LD, or whatever.

(Most users *don't* surf around the world, since many are monolingual or
maybe bilingual.)

Anyway, help me out, guys and gals. Pull my logic through the wringer,
and comb it with the finest comb you have at your disposal. This way, we
can collectively improve our understanding of the IDN phishing problem
and ways to address it.

Erik
John C Klensin
2005-03-02 15:43:27 UTC
Permalink
Erik,

A few observations...

(1) First, a registry does have the right to require
that registrants observe particular rules and conditions
in subdomains they delegate and to pass those rules down
the tree. Whether that is wise or sensible is another
issue, and enforceability is yet another question.
But, unless national law prevents it, RFC 1591, to which
all TLD registries more or less agreed, rather
explicitly provides for passing the responsibilities to
the community down the tree. Even ignoring troublesome
concepts like "require" and "enforce", certainly nothing
prevents registries from educating and persuading
registrants about how they should behave.

(2) In my regular role as a luser, I really like fast,
easily-used, small-footprint browsers. I'm more
security-conscious and suspicious than the user average,
and therefore also like handy tools to help me dissect
and verify things that might look suspicious. Tying up
a browser with heuristics, such as mixed-script
detectors, that may not work well and have a large
footprint, doesn't impress me as a good tradeoff. For
better or worse, the assumption of a decade ago that
most criminals, especially most electronic criminals,
were stupid is no longer applicable, if ever it was.
That implies, I think, that if we design a simple test
that blocks some look-alike cases but permits other,
more subtle, ones, we will simply drive the phishers to
better understand and use the subtle stuff: not a good
tradeoff.

(3) As far as surfing around the world is concerned,
we've got a situation today in which the domain name
associated with a particular URL does not really predict
the content to be found on that page. That will
undoubtedly get worse, as more folks discover that the
intersection of domain and host administration with web
site organization often makes it much easier to maintain
versions of pages in multiple languages in the same,
rather than different, DNS trees. So, since I don't
read Chinese, I'm unlikely to frequently seek out pages
whose content is in Chinese. But I frequently find
pages I can read via URLs that contain elements written
in pinyin. I fully expect those elements, and some of
the subdomain names, will shift to Chinese characters as
IDNs and IRIs are more widely available. I also expect
that transition will make things more comfortable for
someone who reads Chinese and would prefer to not deal
with Latin characters and harder for me, but that is a
reasonable tradeoff over which none of us will have much
influence.

(4) We need to get unstuck from thinking about this
purely as a browser problem. The usual phishing attack
involves an email message containing a link. For those
email clients that don't immediately invoke a full
browser as soon as a link appears --and many of those
links occur in plain-text, not HTML, email-- the browser
is invoked when the link is clicked on. The
situation in the browser is then different, since none
of the "hover over link", "look at status bar", etc.,
tools are going to apply, or, at least, are not going to
work in the ways that some of these discussions suggest
for links that appear on web pages that are already open
in the browser. Now, we have given MUA writers no
advice about what they should pass to the browser if
they see an IRI or otherwise-encoded string that
contains an IDN. If they pass the IRI/native-script-form
IDN, they risk passing it to a browser version that
doesn't have a clue. So maybe they force the thing into
URI/punycode form and pass that.
Now, do you really want the browser to look at the
thing, perform ToUnicode on the name (which, of course,
may yield something other than what the user saw),
perform some tests, and then pop up a "you just passed
me an IDN that looks suspicious, do you really want to
open that page?" box. I think probably not. Moreover,
I think that, if you do, there would quickly be a
sufficient number of false positives (positive for bad
stuff) to get users really used to clicking "yes"
without thinking... and cursing the browser implementer
for bothering them with a pointless warning.
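
For what it's worth, the mechanical part of that handoff is trivial;
the judgment calls are everything else. A sketch using Python's
bundled RFC 3490 codec, with an invented hostname:

    def hostname_for_browser(iri_hostname):
        # Force a native-script IDN into its ACE (xn--) form so that
        # even a browser with no IDN support can resolve it. Whether
        # and how the browser should then warn is the hard part.
        return iri_hostname.encode("idna").decode("ascii")

    print(hostname_for_browser("bücher.example"))  # xn--bcher-kva.example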

So my conclusion is that we need a mixed
protocol-registry-browser strategy. That strategy, IMO, should
shift the processing burdens as much as possible to the first
two. And I think that notions that the problem can or should be
solved in any of those three places alone are probably misguided.

john
JFC (Jefsey) Morfin
2005-03-02 16:37:51 UTC
Permalink
On 16:43 02/03/2005, John C Klensin said:
>I fully expect those elements, and some of the subdomain names, will shift
>to Chinese characters as IDNs and IRIs are more widely available.

The problem is the two keyboards that would then be necessary. Please
stop considering only the solution to your own problems. Or let us turn
the Internet entirely to Chinese by default, and help the IETF document
the best way to access pages written in English.
jfc
Erik van der Poel
2005-03-02 19:02:55 UTC
Permalink
John,

John C Klensin wrote:
> So my conclusion is that we need a mixed
> protocol-registry-browser strategy. That strategy, IMO, should
> shifted the processing burdens as much as possible to the first
> two. And I think that notions that the problem can or should be
> solved in any of those three places alone are probably misguided.

I strongly agree with everything you said. I am sorry if I gave the
impression that the browser implementors are the *only* people that can
and should address this IDN phishing problem. I don't think I said that,
but maybe I'm just not very good at email and/or expressing myself. I
also agree that the burden should be shifted as much as possible to the
first two. It would be bad if the many implementations all did it
differently.

I would only add one component to your strategy. It should be a
Unicode-protocol-registry-browser strategy. Unicode has already started
working on their part:

http://www.unicode.org/reports/tr36/

Erik
Gervase Markham
2005-03-02 23:46:58 UTC
Permalink
John C Klensin wrote:
> So my conclusion is that we need a mixed
> protocol-registry-browser strategy. That strategy, IMO, should
> shift the processing burdens as much as possible to the first
> two. And I think that notions that the problem can or should be
> solved in any of those three places alone are probably misguided.

Absolutely. A strategy which involves changes all the way down the line
is going to have the best chance of quashing the problem.

Gerv
Cary Karp
2005-03-02 16:06:44 UTC
Permalink
Quoting Erik van der Poel:

> I note that this particular conversation is between a browser
> developer (Gervase) and one of the IDNA authors (Paul), neither of
> whom is a registry representative, so why exactly are you 2
> having this conversation? :-)
>
> Sorry, I'm half joking. Half, because you two have every right to
> discuss whatever you wish. The other half because I believe
> browser developers can afford to focus more on their end of
> things.

Under the assumption that the discussion might be furthered by a
registry rep describing that end of things --

I'm responsible for the development and maintenance of the policies
for .museum. This is a sponsored gTLD (sTLD), which means that there
are eligibility requirements for name holders, and restrictions on
the way names may be structured and used. The policies specific to
IDN are stated in detail at http://about.museum/idn/idnpolicy.html,
which includes links to both the general policy statement and the
listing of permitted characters and code points. Although
prospective name holders are unlikely to read the fine print, the
detailed statement provides an unequivocal reference in the
case-to-case discussions about policy instantiation that we
frequently conduct with individual museums. That dialogue provides
an effective barrier to deliberate abuse of the .museum namespace
(which we are contractually obligated to prevent) and is also a
safeguard against inadvertent homographic confusion.

There are significant differences in the operation of an
unrestricted gTLD (uTLD), where there may be a far greater volume of
registration traffic, and direct contact between the registry and
registrant is not a part of the registration process. In this
situation, such things as restrictions on IDN registration need to be
fully automated, with sentient review being a remedial device when
it is noted that something the algorithmic filter is intended to
stop has nonetheless passed through.

The gTLD registries are as concerned by and with the present
homograph alert as is any other group attempting to curb the risk
for such damage. In addition to the range of operating conditions
implied above, there are differences in the way the registries have
interpreted and implemented ICANN's IDN Guidelines. That document
states, "As the deployment of IDNs proceeds, ICANN and the IDN
registries will review these Guidelines at regular intervals, and
revise them as necessary based on experience." We are now clearly at
such a juncture, and the revision process is already being
initiated. Reducing the latitude for interpretation of the
Guidelines will bring the registries one important step closer to
being able to establish a best practice that can bring an end to the
concern about point-of-registration control that is currently being
expressed.

The development of these best practices would be further abetted by
the candidate IDN repertoire being purged of the graphic symbols and
other signs that are not needed in a naming system (using that term
in a sense that a linguist might regard as reasonable) but clearly
exacerbate the risk for malicious exploitation. There is similar
utility in freeing IDNA from its current lock to Unicode 3.2. Since
both of these concerns can only be addressed through revision of
nameprep and stringprep, that action is of priority concern to the
registries.

There is more to be said about gTLD perspectives on the role IDN
plays in the communities that we serve, but I'll save that for
separate communication.

/Cary
Erik van der Poel
2005-03-02 19:57:54 UTC
Permalink
Cary Karp wrote:
> In addition to the range of operating conditions implied above, there
> are differences in the way the registries have interpreted and
> implemented ICANN's IDN Guidelines. That document states, "As the
> deployment of IDNs proceeds, ICANN and the IDN registries will review
> these Guidelines at regular intervals, and revise them as necessary
> based on experience." We are now clearly at such a juncture, and the
> revision process is already being initiated.

Hi Cary,

It is really nice to hear from you. I have an idea for the Guidelines.
As Paul has indicated, the various communities around the world have
different needs, and some have already started writing down the rules
that they are following in their registries. The JET community comes to
mind:

http://www.ietf.org/rfc/rfc3743.txt

Other communities have other needs. I've been told that some communities
use a set of letters that are currently encoded in two different ranges
of the Unicode space (e.g. Latin and Cyrillic). Today, my idea is that
these communities can "occupy" their "own" part of the DNS space, for
example a .tld or a .2ld.tld. They can publish the rules that they
enforce in their registries, and then the browsers can either allow any
character sequence in those labels or check them to see if the rules
were indeed followed.

Of course, it is much harder to come up with and enforce rules in a
"global" TLD like .com. As a result, the browsers may simply blacklist
.com in its entirety. Or maybe .com will eventually figure out some
rules and actually enforce them in the 2LDs, so that the browsers don't
have to check the 2LDs. Indeed, in a perfect world, .com would even
enforce rules in 3LDs, 4LDs, etc, so that browsers would not have to
check those either. We shall see what .com does.
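
To make this concrete, here is a minimal sketch of the kind of check a
browser might do (the zone names and the whole policy table are purely
hypothetical, not a proposal for any actual list):

    # Hypothetical per-zone display policy: show the Unicode form of a
    # label only when its TLD publishes and enforces rules the client
    # trusts; otherwise fall back to the ACE (punycode) form.
    TRUSTED_ZONES = {'jp', 'kr', 'museum'}   # purely illustrative

    def display_form(domain):
        tld = domain.rsplit('.', 1)[-1].lower()
        return 'unicode' if tld in TRUSTED_ZONES else 'punycode'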

But in the meantime, how about my idea for the Guidelines?

Erik
Adam M. Costello
2005-03-03 06:32:58 UTC
Permalink
Erik van der Poel <***@vanderpoel.org> wrote:

> Other communities have other needs. I've been told that some
> communities use a set of letters that are currently encoded in two
> different ranges of the Unicode space (e.g. Latin and Cyrillic).
> Today, my idea is that these communities can "occupy" their "own" part
> of the DNS space, for example a .tld or a .2ld.tld. They can publish
> the rules that they enforce in their registries, and then the browsers
> can either allow any character sequence in those labels or check them
> to see if the rules were indeed followed.

I've also thought along these lines, but I rejected this approach. The
domain hierarchy is ultimately based on delegation of naming authority,
and trying to use it for any other purpose will run into conflicting
constraints. Suppose country X wants to support language Y, which is
used in many countries around the world. Who would be the registry for
the Y domain, and how would you get worldwide agreement on that? Would
country X be delegated a subdomain of Y? Would registrants accept X.Y
as a legitimate country X domain, or would they demand to be in an X
top-level domain? Would users of language Y not get annoyed at seeing Y
at the end of almost every domain name they use? It's bad enough that
so many domains end in .com, imagine if they all ended in .com.lat (for
"Latin").

I still like the idea of allowing every TLD to have one synonym-TLD per
script, although we might need to recognize some scripts in addition to
the Unicode scripts, for example, the subset-of-(Latin-plus-Cyrillic)
script that you allude to.

AMC
William Tan
2005-03-03 06:32:31 UTC
Permalink
> Other communities have other needs. I've been told that some
> communities use a set of letters that are currently encoded in two
> different ranges of the Unicode space (e.g. Latin and Cyrillic).
> Today, my idea is that these communities can "occupy" their "own" part
> of the DNS space, for example a .tld or a .2ld.tld.

If by "community" you mean users of a certain language / culture group
within a geographic region, yes. In fact, they already do. The Japanese
already "occupy" .jp, Korean .kr, Chinese in PRC .cn, Chinese in Taiwan
.tw, Chinese in Singapore .sg, etc.

I'm not sure what you are proposing here - are you saying to allocate
new TLDs for each "community"?

> They can publish the rules that they enforce in their registries, ...

They already do. The rules are just not in a machine readable format,
and John has already made the case against standardizing language
tables, let alone other rules that may not be character-based (imagine
the .th registry saying, allow both Thai and Latin digits, but not both
in the same label/domain).

> and then the browsers can either allow any character sequence in those
> labels or check them to see if the rules were indeed followed.

I'd vote against browsers trying to enforce rules set by various
registries. This is the sort of thing you'd build into a specialized
tool (as you mentioned) but not in a general application.

>
> Of course, it is much harder to come up with and enforce rules in a
> "global" TLD like .com.

Don't forget countries that choose to honour multiple cultures within
their society - .PL allows many different tables, including Cyrillic
(but does not allow mixing of Cyrillic and Latin scripts; see Andrzej
Bartosiewicz's draft).

> As a result, the browsers may simply blacklist .com in its entirety.

It looks like a reasonable interim solution, but I'm worried about
whether .com can actually get off the list. Unlike DNSBL, if the list is
hardcoded or statically included in the installation package, it's going
to be difficult to get off that list.

Come to think of it, as an off-IETF solution, maintaining an IDNBL of
sorts might be a good idea. I know it didn't really work for mail
abuse, but the landscape is quite different for the problem at hand.
The IDNBL can ban an entire zone based on the reasoning that the zone
administrator has shown itself to be negligent (".com"), or ban individual
domains of known phishers, and can even be used to implement character
blacklists instead of having them hard wired in the browser.

> Or maybe .com will eventually figure out some rules and actually
> enforce them in the 2LDs, so that the browsers don't have to check the
> 2LDs.

I'm hopeful that this will happen.

> Indeed, in a perfect world, .com would even enforce rules in 3LDs,
> 4LDs, etc, so that browsers would not have to check those either.

It's not enforceable at the 3LD and beyond, period.


wil.
YAO Jiankang
2005-03-03 07:13:00 UTC
Permalink
----- Original Message -----
From: "Adam M. Costello" <idn.amc+***@nicemice.net.RemoveThisWord>
To: <***@ops.ietf.org>
> I still like the idea of allowing every TLD to have one synonym-TLD per
> script, although we might need to recognize some scripts in addition to
> the Unicode scripts, for example, the subset-of-(Latin-plus-Cyrillic)
> script that you allude to.

I appreciate this idea of allowing every TLD to have one synonym-TLD per
script in general. However, in China, Simplified Chinese characters and
Traditional Chinese characters have equal status, so the TLDs that have both
Simplified and Traditional Chinese forms should be authorized to the same
operator.
Cary Karp
2005-03-03 09:03:34 UTC
Permalink
Quoting Erik van der Poel:

> I've been told that some communities use a set of letters that are
> currently encoded in two different ranges of the Unicode space
> (e.g. Latin and Cyrillic). Today, my idea is that these
> communities can "occupy" their "own" part of the DNS space, for
> example a .tld or a .2ld.tld.

The community occupying 2ld.tld doesn't write the rules that
determine the character repertoire available for use in .tld, and
can therefore not necessarily represent even its own name as it
might ideally prefer. The 2ld.tld folks do get to make the
corresponding decision for 3ld.2ld.tld (if permitted in .tld
policy). In the reasonably commonplace situation where all
subdomains under 2ld.tld are operated by a single entity, coherent
rules can be applied throughout. This situation is, however, by no
means the only one that pertains, and it certainly does not apply to
the point of delegation under a TLD.

Most people would probably use the term 'language' to designate the
attribute of community identity that is expressed by its use of a
certain set of letters. A community wishing to project that identity
into the domain namespace will therefore need either to locate a
parent domain that accepts the registration of names including the
needed characters, or convince what would otherwise be the most
desirable parent to implement that support. Languages are, however,
frequently shared by numerous communities without any other aspect
of shared identity, and identical sets of letters often appear in
more than one language. (This is one of the reasons why gTLDs so
prominently appear in the present discussion.)

> I have an idea for the Guidelines. As Paul has indicated, the
> various communities around the world have different needs, and
> some have already started writing down the rules that they are
> following in their registries. The JET community comes to mind:

Both the JET and the ICANN guidelines are intended to assist TLD
operators in establishing safe and responsible IDN policies that
will prove useful to the broadest number of nameholder communities.
The JET action addresses the needs of three languages but I doubt
that the people who use those languages perceive the slightest
additional sense of 'JET community'. And, as an uncomfortable matter
of fact, the extent to which the genuinely excellent JET Guidelines
will be accepted by the target language groups remains to be
determined.

> Of course, it is much harder to come up with and enforce rules in
> a "global" TLD like .com

One might think the situation to be straightforward in a ccTLD, but
there are clear current trends in ccTLD policy development toward
removing the entry-level requirement for national nexus, and
permitting the use of more extensive character repertoires than
would normally be associated with the nominal TLD designation. It is
by no means uncommon for a country to have more than one official
language and an even larger number of officially recognized minority
languages. It is also common for countries to belong to
multinational alliances, with member states recognizing all of the
languages used within that union. All this needs to be reflected in
a ccTLD's IDN policies, which will often require every bit as artful
a juggling of scripts and languages as would be needed in a gTLD,
along with a heavy measure of political intricacy that a gTLD might
be able to avoid.

It is true, nonetheless, that a ccTLD operator will generally be in
a better position to produce an authoritative statement of the
character repertoire necessary for the IDN representation of one of
'its' languages, than would the operator of a gTLD serving the same
language community. For this reason, many gTLDs introduce IDN
support for a given language only after a ccTLD clearly associated
with that language has described its requirements, or
when a similar statement has been produced by some other obviously
authoritative group.

Please also keep in mind that the geographic perimeters within
which many languages are used do not coincide with national
boundaries, and that many communities do not associate their
language identities with the national identities of the countries in
which they reside. IDN provides unprecedented means for such things
as allowing a diaspora to maintain its sense of cultural cohesion, or
furthering the cause of a group struggling to have its language
officially recognized or attempting to reverse threats to its
survival. In such contexts the national implication of a cc label
may be undesired, which is another of the reasons why gTLDs so
prominently appear in the present discussion.

> They can publish the rules that they enforce in their registries,
> and then the browsers can either allow any character sequence in
> those labels or check them to see if the rules were indeed
> followed.

I am grateful that my only headache in this regard is anticipating
the policy and technical requirements for supporting the thousand or
so languages that some segment of the museum community may sooner or
later express interest in representing via IDN in .museum. The same
repertoire may also appear elsewhere in the TLD space and I
certainly don't envy the people who intend to devise and implement
the algorithmic underpinnings for the automation of that process or
the validation of its results :-)

/Cary
Erik van der Poel
2005-03-11 23:42:26 UTC
Permalink
John C Klensin wrote:
> the view that, at the time, the Unicode
> classifications of characters were considered a little soft

FYI, I asked about Unicode category stability on the Unicode list, and
received the following info:

From: "Andrew C. West" <***@alumni.princeton.edu>

According to my calculations, the number of characters which changed
their General Category from one version of Unicode to the next is :

1.1.5 -> 2.0.14 = 474 (1.384%)
2.0.14 -> 2.1.2 = 1 (0.0025%)
2.1.2 -> 2.1.5 = 16 (0.0410%)
2.1.5 -> 2.1.8 = 18 (0.0462%)
2.1.8 -> 2.1.9 = 3 (0.0077%)
2.1.9 -> 3.0.0 = 85 (0.2182%)
3.0.0 -> 3.0.1 = 0 (0%)
3.0.1 -> 3.1.0 = 3 (0.0061%)
3.1.0 -> 3.2.0 = 7 (0.0074%)
3.2.0 -> 4.0.0 = 16 (0.0168%)
4.0.0 -> 4.0.1 = 1 (0.0010%)
4.0.1 -> 4.1.0 = 12 (0.0124%)

I don't know what this tells you about the stability of the UCD data though.
Erik van der Poel
2005-03-12 09:14:37 UTC
Permalink
All,

Please do not draw any conclusions from the raw Unicode category
stability data that I sent earlier. Ken Whistler, a Technical Director
at the Unicode Consortium, was so kind to provide further information to
put the data into their proper perspective. See below.

Sorry about that,

Erik

-------------------------------------------

Date: Fri, 11 Mar 2005 18:23:51 -0800 (PST)
From: Kenneth Whistler <***@sybase.com>
Subject: Re: UCD stability
To: ***@vanderpoel.org
Cc: ***@unicode.org, ***@sybase.com

Erik,

If you are going to do things like pass these raw calculations
along to the IDN list, ostensibly as some measure of stability
of the UCD data, then you should take into consideration another
metric.

The raw number of characters changing is less reflective of
stability than considering how many *decisions* to change
a property (of one or more characters) were taken.

I intersperse some notes to Andrew West's calculated numbers
below, to help put this in context.

> Andrew C. West wrote:
> > According to my calculations, the number of characters which
changed their
> > General Category from one version of Unicode to the next is :
> >
> > 1.1.5 -> 2.0.14 = 474 (1.384%)

Many, many changes, since 1.1.5 was developed in house,
without general public review, and since 2.0.14 (the
data version corresponding to Unicode 2.0) was the first
public release of the data files.

> > 2.0.14 -> 2.1.2 = 1 (0.0025%)

1 decision

> > 2.1.2 -> 2.1.5 = 16 (0.0410%)

2 decisions: addition of Pi/Pf subcategories, and 1 fix for 8 Tibetan
characters

> > 2.1.5 -> 2.1.8 = 18 (0.0462%)

1 decision: changes to converge identifier definitions

> > 2.1.8 -> 2.1.9 = 3 (0.0077%)

2 decisions: fix for Greek numeral signs; fix for halfwidth forms light
vertical

> > 2.1.9 -> 3.0.0 = 85 (0.2182%)

I'd have to dig further for this, but these were likely mostly
changes involved in nailing down normalization for Unicode 3.0.

> > 3.0.0 -> 3.0.1 = 0 (0%)
> > 3.0.1 -> 3.1.0 = 3 (0.0061%)

1 decision: 3 Runic golden numbers

> > 3.1.0 -> 3.2.0 = 7 (0.0074%)

5 decisions: 2 fixes for Khmer signs, 1 for Tamil aytham, 1 for
Arabic end of ayah (architectural), 1 for the 3 Mongolian free
variation selectors

> > 3.2.0 -> 4.0.0 = 16 (0.0168%)

2 decisions: 1 fix for 12 modifier letters, 1 fix for decimal digit
alignment

> > 4.0.0 -> 4.0.1 = 1 (0.0010%)

1 decision: fix for ZWSP

> > 4.0.1 -> 4.1.0 = 12 (0.0124%)

3 decisions: 1 fix for Ethiopic digits, 1 for 2 Katakana middle dots,
1 for Yi syllable wu

> >
> > I don't know what this tells you about the stability of the UCD
data though.

The significant point of instability in General Category
assignments was in establishing Unicode 2.0 data files
(now more than 8 years in the past).

There was a significant hiccup for Unicode 3.0, at the point
when it became clear that normalization stability was going
to be a major issue, and when the data was culled for
consistency under canonical and compatibility equivalence.

Since that time, the UTC has been very conservative, indeed,
in approving any General Category change for an existing
character. The types of changes have been limited to:

A. Clarification regarding obscure characters for which
insufficient information was available earlier.

B. Establishment of further data consistency constraints
(this impacted some numeric categories, and also
explains the change for the Katakana middle dot)

C. Implementation issues with a few format characters
(ZWSP, Arabic end of ayah, Mongolian free variation selectors)

Since the publication of Unicode 3.0 in 2000, the only
characters in significantly common use that had any General
Category change were:

U+0B83 TAMIL SIGN VISARGA (=aytham, Tamil data)
U+200B ZERO WIDTH SPACE (mostly relevant to Thai data)
U+30FB KATAKANA MIDDLE DOT (Japanese)

Of those 3, only U+30FB would exist in any commonly
interchanged character set other than Unicode, and
*that* change was merely to
change a punctuation subclass (gc=Pc --> gc=Po) -- and
was additionally a *reversion* to the General Category
assignment that U+30FB had in 2.1.5 and earlier.

--Ken
Erik van der Poel
2005-03-12 09:45:58 UTC
Permalink
All,

This is probably well known to most of you, but the General Category
Value in the Unicode Character Database and the stability of that value
are not very relevant to IDNA, which does not depend on the Unicode
Categories.

IDNA depends on the Unicode Normalization Form KC table, and there have
been very few changes indeed in this table:

http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt
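
For illustration, a minimal Python sketch of the NFKC folding involved
(unicodedata tracks whichever Unicode version ships with the interpreter,
not the 3.2 tables that IDNA pins down, so this is only an approximation):

    import unicodedata

    # Compatibility characters fold to plain equivalents under NFKC.
    print(unicodedata.normalize('NFKC', '\uFF21'))  # FULLWIDTH 'A' -> 'A'
    print(unicodedata.normalize('NFKC', '\u00BD'))  # '½' -> '1' U+2044 '2'

    # A normalized label must be stable: normalizing again changes nothing.
    once = unicodedata.normalize('NFKC', '\u00BD')
    assert unicodedata.normalize('NFKC', once) == once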

Also, IDNA apps depend on tables for converting from various non-Unicode
encodings to Unicode. This is another place where instability could
affect lookups, potentially even in dangerous ways. Stringprep and IDNA
already mention this issue in their Security Considerations sections.

Erik
Martin v. Löwis
2005-03-12 12:09:18 UTC
Permalink
Erik van der Poel wrote:
> Also, IDNA apps depend on tables for converting from various non-Unicode
> encodings to Unicode. This is another place where instability could
> affect lookups, potentially even in dangerous ways. Stringprep and IDNA
> already mention this issue in their Security Considerations sections.

I think an evaluation of whether something is potentially dangerous
should take linguistic research into account.

As a layman, I recognize that these are all compatibility characters
(and all for CJK encodings). For example, F951 is, according to my
Linux /usr/share/i18n/charmaps, part of CP 949, EUC-KR, and JOHAB.
Looking at the character tables provided by the Unicode Consortium,
it seems obvious that it is correctly mapped to 964B and not 96FB
(but it could be that the font tables were explicitly designed to
show the same glyph).

According to Unicode corrigendum 3,

http://www.unicode.org/versions/corrigendum3.html

this character is only used for Korean encodings, to support
a different pronunciation (so the character was duplicated
in the Korean standard also). The corrigendum also asserts
that the character will not readily be recognized by most Korean
speakers (Chinese speakers recognize the character, but they
won't use the compatibility character).
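
As a quick check (assuming a Python whose bundled Unicode data already
includes corrigendum 3):

    import unicodedata

    # U+F951's canonical decomposition was corrected from U+96FB to U+964B.
    print(unicodedata.decomposition('\uF951'))  # expect: '964B'
    # The singleton decomposition means NFC replaces the character outright.
    print(hex(ord(unicodedata.normalize('NFC', '\uF951'))))  # expect: 0x964b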

So in practice, it appears that this normalization change is
unlikely to be encountered, so it is hard to see how it could cause
danger. I believe the same is true for all the other cases where
the normalization has changed.

The real danger is to judge other people's work inadequately,
and to believe one can do better than them. The people who
created these tables are experts in their field (linguistics
and scripts), if they make mistakes, then likely because the
subject matter is so obscure that "normal people" never
encounter the issue at hand.

I'm not concerned about stability of IDN.

Regards,
Martin
Simon Josefsson
2005-03-12 11:04:36 UTC
Permalink
Erik van der Poel <***@vanderpoel.org> writes:

> All,
>
> This is probably well known to most of you, but the General Category
> Value in the Unicode Character Database and the stability of that value
> are not very relevant to IDNA, which does not depend on the Unicode
> Categories.
>
> IDNA depends on the Unicode Normalization Form KC table, and there have
> been very few changes indeed in this table:
>
> http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt

Don't forget the normalization flaw in Unicode 3.2 NFKC discussed in:

http://www.unicode.org/review/pr-29.html

Apparently the recommendation will be applied to future Unicode
versions.

PR-29 doesn't merely affect a small set of code points, but rather a
class of strings. The special strings are all unstable under NFKC 3.2.

I think PR-29 is a useful example to consider when deciding how much
trust you should place in the UTC's stability guarantees. The UTC's
track record in this area suggests to me that the guarantee is
worthless in practice. I haven't seen an evaluation of alternative
solutions to the PR-29 problem. Not even signs that alternative
approaches were considered. I would have expected both.

> Also, IDNA apps depend on tables for converting from various non-Unicode
> encodings to Unicode. This is another place where instability could
> affect lookups, potentially even in dangerous ways. Stringprep and IDNA
> already mention this issue in their Security Considerations sections.

Right.

Thanks,
Simon
JFC (Jefsey) Morfin
2005-03-14 02:31:44 UTC
Permalink
On 12:04 12/03/2005, Simon Josefsson said:
>I think PR-29 is a useful example to consider when deciding how much
>trust you should place in the UTC's stability guarantees.

> > Also, IDNA apps depend on tables for converting from various non-Unicode
> > encodings to Unicode. This is another place where instability could
> > affect lookups, potentially even in dangerous ways. Stringprep and IDNA
> > already mention this issue in their Security Considerations sections.

I try to understand the best way for ccTLDs to approach the variations in
Unicode tables and a possible punycode update. I also try to take into
account the impact of the revision of BCP 47 (RFC 3066) undertaken by the WG-ltru
(which seems mainly motivated by W3C concerns).

1. the IDN Tables are in fact the lists of Unicode code points accepted by a
Registry manager to form domain names in the "virtual zone" related to that
table. Such a table may or may not be related to a language in the way RFC
3066 means it. Take, for example, Latin-1 (the 256-character extension of
ASCII): it does not include every French character, but it includes
characters used in many European countries. I don't think any European ccTLD
Registry Manager would object to supporting all the Latin-1 characters and
adding the missing ones that would not conflict - but each would then face
the problem of their ASCII equivalents (French "oe" as two characters
instead of one, for example). The reason is that more and more words from
European languages are being adopted into other languages, especially in
the commercial, cultural, and societal areas that are a good source of
domain names. Just think of sports champions, singers, etc.

This means that the list of accepted characters would not relate to one
particular language, even though the IANA registration procedure calls for
documenting it with the language/script attributes that are considered by
the WG-ltru draft.

We therefore need another tag system for the IDN tables. This tag is
attached to many management aspects: registrant support, homograph control,
etc. It must also be clearly documented to the SLD zone managers if one
wants to push them to respect the same rules for 3+LD labels.

This could lead to a virtual zone management BCP (preferable?), or to an
IDN Table support part in the WG-ltru draft (my first idea, but I am less
inclined to that now, unless we propose that an IDN Table is formed as the
sum of the language tables accepted by the TLD Registry Manager; but
obviously the IESG has not followed my suggestion of a more complete tag
review permitting tag operations to be documented, like tag(eu) = the sum
of the European language tags).

I think this does not create a problem for the European scripts. But I have
no idea about the non-Latin scripts.


2. the IDN Tables adopted by a ccTLD Manager (either code point by code
point, or as the sum of IDN language tables) are contractual annexes to the
IDN registration contracts. The ccTLD Manager should spell out in its terms
and conditions what happens if the code point supporting a character is
changed, or if punycode is updated with the same resulting change. My
understanding is that such a change will cause the registered IDN to
transcode to a different ACE label.

- if the new label does not already exist, the two versions should be
registered for free.
- if the new label conflicts with an existing label, we have a problem -
but I understand this is unlikely?
- I understand we should not see a stringprep failure where there was
none before? Or should that then result in an update?

Then the ccTLD will have to document its new IDN Table. This means it will
add the new code point to the table. It will not be able to remove the
former code point, since that is still in use, but it should be able to
note that the old code point cannot be used in requests for new names -
only in free CNAMEs.

A program should also help the ccTLD Registry parse its table to make
sure all the listed code points are correct (date of last change).

Am I correct with this?

jfc


=====================================================
Jon Postel (RFC 1591): "The IANA is not in the business of deciding
what is and what is not a country. The selection of the ISO 3166 list
as a basis for country code top-level domain names was made with
the knowledge that ISO has a procedure for determining which
entities should be and should not be on that list."
=====================================================
Brian Carpenter (RFC 1958/3.2): "If there are several ways of doing the
same thing, choose one. If a previous design, in the Internet context
or elsewhere, has successfully solved the same problem, choose the
same solution unless there is a good technical reason not to.
Duplication of the same protocol functionality should be avoided as far as
possible, without of course using this argument to reject improvements."
=====================================================
It seems that what works for countries and ISO 3166 since 1978 should
apply to languages and to ISO 639.
=====================================================
Mark Davis
2005-03-14 15:15:04 UTC
Permalink
You keep harping on that, but we really had no choice in that matter. The
definition of normalization in UAX #15 was internally inconsistent. Certain
implementations of the UAX algorithm would exhibit unacceptably aberrant
behavior, although only in a small number of degenerate cases, none of which
occur in ordinary text. The problems are:

1. Broken Idempotency. A non-idempotent implementation by its very nature
cannot be stable, because repeated application of a non-idempotent
normalization could produce different results. The application of the
inconsistent interpretation therefore causes fundamental problems for
implementations, as further outlined in PRI #29; briefly, these are
comparable to using a comparison function that isn't transitive when
sorting.

2. Broken Canonical Equivalence. The inconsistent interpretation of the
old UAX version could "normalize" some text to something that is not
canonically equivalent to the input -- it changes some text to some
completely different text.

3. Broken Canonical Order. Application of NFC[old UAX] or NFKC[old UAX]
produces output that is not only different text (not canonically
equivalent) but also not in canonical order. As a result, something
returned from a normalization function may not even pass the normalization
quick check: NFC_quick_check(NFC(string))=NO.
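
These invariants are easy to test mechanically against any normalizer. A
sketch using Python's unicodedata as the implementation under test (a
post-corrigendum library passes; an implementation following the old
reading of the UAX would fail the first assertion on the degenerate
inputs):

    import unicodedata

    def check_nfc_invariants(s):
        once = unicodedata.normalize('NFC', s)
        twice = unicodedata.normalize('NFC', once)
        assert once == twice, 'idempotency broken'
        # Python 3.8+: the output must also pass the quick check.
        assert unicodedata.is_normalized('NFC', once), 'quick check fails'

    check_nfc_invariants('\u1100\u0300\u1161')  # a PRI #29 trigger sequence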

After carefully evaluating the nature and effects of this inconsistency,
the UTC reached a decision to address these problems as follows:

The current version of UAX #15 in Unicode 4.1.0 addresses the internal
inconsistency. The changes do not affect any versions of UAX #15 prior to
Unicode 4.1.0 and therefore do not affect stringprep or IDN. No
backwards-compatibility problems will be introduced as a result of the
changes.

Stringprep and IDN rely on the Unicode 3.2 version of UAX #15, which is:

http://www.unicode.org/unicode/reports/tr15/tr15-22.html

Implementations that claim conformance to Unicode 3.2 normalization may
not produce identical results in all cases, and may not produce *correct*
normalizations, because versions of UAX #15 prior to 4.1.0 have been
internally inconsistent. While normalization problems only happen in
degenerate cases, the inconsistency in the definition is significant enough
that UTC felt compelled to make the change. During deliberations, UTC did
discuss stability policies in the standard, and concluded that this
inconsistency itself is unstable; it led to demonstrably divergent
implementations, and could not stand without correction.

In addition to the new 4.1.0 version of UAX #15, the UTC decided to issue
a corrigendum which can be applied to other versions of Unicode. None of
the prior versions of the Unicode Standard or its annexes will be changed
in any way. Any implementation that claims conformance to Unicode 3.2 can
stay precisely the same. Only if an implementation claims conformance to
3.2 plus the new corrigendum, or to version 4.1.0 or later of Unicode,
would it change. So the current stringprep and IDN are not affected.

When it comes time to update stringprep to a new version of Unicode, such
as 4.1.0, there are two paths that IETF can take:

(a) simply update to the newer version, or
(b) specify a method which takes the previous algorithm and applies it to
the new Unicode data.

Option (a) sacrifices some compatibility, although (1) strings that have
already been stringprepped *once* with the old version will have the same
results under either version, and (2) the UTC does not expect any real data
to contain the degenerate cases that trigger the problem.

The UTC strongly recommends against Option (b). While it maintains
backwards compatibility, it does not fix the underlying problems: two
successive applications of stringprep can still result in different
strings.

And if you look carefully at the stability requirements, you see "If a
string contains only characters from a given version of the Unicode Standard
(e.g., Unicode 3.1.1), and it is put into a normalized form in accordance
with that version of Unicode, then it will be in normalized form according
to any past or future versions of Unicode." That statement remains true,
even after applying PRI #29.

It would also be interesting to me to see the level of stability that is
guaranteed by the other organizations. I know that there are W3C
Recommendations that do not maintain perfect stability. How about the IETF?
Is there a policy that any RFC that obsoletes another RFC is required to be
absolutely -- bug-for-bug -- backwards compatible?

Mark

Simon Josefsson
2005-03-14 15:37:18 UTC
Permalink
"Mark Davis" <***@jtcsv.com> writes:

> Implementations that claim conformance to Unicode 3.2 normalization may
> not produce identical results in all cases, and may not produce *correct*
> normalizations, because versions of UAX #15 prior to 4.1.0 have been
> internally inconsistent.

We seem to disagree on this. I believe Unicode 3.2 was consistent.
Only the non-normative sections were in conflict with the normative
text. I admit an implementation would not meet some normalization
invariants discussed in the document. But I don't believe the
invariants were discussed as requirements on the implementation.

> It would also be interesting to me to see the level of stability that is
> guaranteed by the other organizations. I know that there are W3C
> Recommendations that do not maintain perfect stability. How about the IETF?
> Is there a policy that any RFC that obsoletes another RFC is required to be
> absolutely -- bug-for-bug -- backwards compatible?

For the IETF, my understanding is that the policy is to make whatever
changes work best for people, including breaking backwards
compatibility when appropriate. The stability guarantee places the UTC
in a different seat, though.

Thanks,
Simon
Erik van der Poel
2005-03-15 06:06:27 UTC
Permalink
Simon Josefsson wrote:
> "Mark Davis" <***@jtcsv.com> writes:
>>Implementations that claim conformance to Unicode 3.2 normalization may
>>not produce identical results in all cases, and may not produce *correct*
>>normalizations, because versions of UAX #15 prior to 4.1.0 have been
>>internally inconsistent.
>
> We seem to disagree on this. I believe Unicode 3.2 was consistent.
> Only the non-normative sections were in conflict with the normative
> text. I admit an implementation would not meet some normalization
> invariants discussed in the document. But I don't believe the
> invariants were discussed as requirements on the implementation.

I read UAX #15 and PRI #29. It's quite unfortunate that such a mistake
was made in the spec, and that several implementations have implemented
that mistake so faithfully. Although I would normally feel that the IETF
should just stick with the original normalization table and rules (to
avoid DNS lookup failures or, heaven forbid, security breaches), in this
case, it may be wiser to adopt the new UAX #15 rules, since the
invariants are important to IDNA also. The idempotence invariant seems
especially important.

I feel that we are still at the very beginning of the adoption of the
particular Unicodes affected by this mistake. Most of them are for South
Asian languages. Hangul is much further along, but not the particular
Unicodes that are affected here (i.e. the Jamo). More importantly, this
mistake only affects highly unusual, malformed data. I think that if
IDNA decides not to follow Unicode's recommendation now or in the next
couple of years, 10 or 20 years from now we would look back in time and
regret this decision. If there is a time to break compatibility for
something, it is now, for this.

The Korean IDN table at IANA does not contain the Jamo that are affected
by this mistake. (They use the precomposed syllables, rather than the
individual pieces.) I don't know anything about IDN in South Asia, but I
doubt that any labels have been registered with this particular type of
malformed data.

It is interesting that, in this case, Unicode seems to have implemented
first and written the spec later, which is the way the IETF is supposed
to do things too. It's just unfortunate that the Unicode spec was
transcribed incorrectly from the implementation(s). On the other hand,
IDNA seems to have done it in the opposite order. First, the spec was
written, and now that we have deployed some implementations, we are
finding serious problems with punctuation marks and symbols.

Erik
Erik van der Poel
2005-03-15 19:04:13 UTC
Permalink
It's very quiet on this mailing list...

> The idempotence invariant seems
> especially important.

Keep in mind that IDNA requires Nameprep to be applied a 2nd time in
order to determine whether a Punycode label may be displayed in its
Unicode form.
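
A rough sketch of that display decision, using CPython's RFC 3490 helpers
in encodings.idna (shown only to illustrate the round trip, not as a
recommended API):

    from encodings import idna

    def display_label(ace_label: bytes) -> str:
        # ToUnicode decodes the ACE form; ToASCII re-applies Nameprep.
        # Show the Unicode form only if the round trip is exact.
        uni = idna.ToUnicode(ace_label)
        if idna.ToASCII(uni) == ace_label:
            return uni
        return ace_label.decode('ascii')

    print(display_label(b'xn--mnchen-3ya'))  # -> 'münchen'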

> If there is a time to break compatibility for
> something, it is now, for this.

Note that we wouldn't actually be "breaking compatibility" since it is
highly unlikely that anyone would have created a domain label with such
strange combinations of characters. Also, even though there are
implementations out there that implement UAX #15 the "wrong" way, keep
in mind that there are also implementations that implement it the other
way. This, in itself, is sufficient grounds for IETF to not only show
concern, but also take action.

> On the other hand,
> IDNA seems to have done it in the opposite order. First, the spec was
> written, and now that we have deployed some implementations, we are
> finding serious problems with punctuation marks and symbols.

Nobody has bothered to point out that this remark was unfair, so I'll
just say it myself. IDNA is at the Proposed Standard Maturity Level, so
we still have the opportunity to improve the specs for the Draft
Standard and Internet Standard Maturity Levels. This is not only
possible, it is expected.

Erik
Martin v. Löwis
2005-03-15 20:26:50 UTC
Permalink
Erik van der Poel wrote:
> I read UAX #15 and PRI #29. It's quite unfortunate that such a mistake
> was made in the spec, and that several implementations have implemented
> that mistake so faithfully.

It's also quite understandable. It is not at all obvious that the
correction is necessary; even now that I have read it, and even though
I have implemented the algorithm myself (for Python), I found it very
difficult to understand the issue. Here is the problem:

In NFD, combining characters are sorted according to their combining
class, in increasing order. So you always have

starter small_combiner_A large_combiner_B next_starter ...
(with A <= B)

The old text says that a combiner is blocked if it has the same
combining class, so

starter combiner_A other_combiner_B (with A==B; if starter
cannot be combined with combiner_A, then combiner_A blocks
combiner_B)

Now, the correction says that you should consider also the case

starter combiner_A combiner_B; with A > B ?!

How can that be? NFD should have sorted them so that combiner_B
comes *before* combiner_A, so it would not be blocked. Think about it.

The answer is this: This is *only* possible if combiner_B is a
starter, i.e. B==0. But if so, why could you possibly combine it
with the starter? Can you ever combine two starters? Think about it.

The answer is yes: for Hangul Jamo. They all have combining class 0,
yet they can be combined. There are also a few other characters which
have combining class 0 and still can be combined. However, it is not
at all obvious.

For the specific case of Python, it turns out that I special-cased
Hangul composition, so it won't apply the standard algorithm (of
looking for blockers); this means that all the examples in PR#29
apparently work "correctly" with Python. However, for the non-Hangul
cases, it is possible to produce the "bad" behaviour with Python 2.4.
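
The trigger sequences are easy to write down. A small demonstration (on a
post-corrigendum interpreter both sequences come back unchanged; as noted
above, older implementations could show the "bad" behaviour, Python 2.4 in
the non-Hangul case):

    import unicodedata

    # PRI #29 pattern: starter, combining mark, second starter that could
    # compose with the first (U+0300 has combining class 230).
    cases = ['\u1100\u0300\u1161',  # Hangul choseong, grave, jungseong
             '\u0B47\u0300\u0B3E']  # Oriya vowel E, grave, Oriya vowel AA

    for s in cases:
        out = unicodedata.normalize('NFC', s)
        # Corrected rule: the mark blocks the composition, so out == s.
        # The old reading composed the starters across the mark anyway.
        print(out == s, [hex(ord(c)) for c in out])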


> I feel that we are still at the very beginning of the adoption of the
> particular Unicodes affected by this mistake. Most of them are for South
> Asian languages. Hangul is much further along, but not the particular
> Unicodes that are affected here (i.e. the Jamo).

It's not that easy. When you use the old algorithm, you get normal
Hangul syllables, which would be allowed in IDNA. It's only that the
sequence *before* the normalization should not be allowed.

> More importantly, this
> mistake only affects highly unusual, malformed data. I think that if
> IDNA decides not to follow Unicode's recommendation now or in the next
> couple of years, 10 or 20 years from now we would look back in time and
> regret this decision.

I don't think so. "We" could still change the decision in 20 years, and
not a single registration would be affected. The sequences causing the
behaviour change are *really* unusual - I don't know if software can
visually render them in a meaningful way, and I guess a native speaker
would just consider them moji-bake. So it is unlikely that anybody would
try to use them as input to IDNA in the next 20 years in a reasonable
application.

> It is interesting that, in this case, Unicode seems to have implemented
> first and written the spec later, which is the way the IETF is supposed
> to do things too. It's just unfortunate that the Unicode spec was
> transcribed incorrectly from the implementation(s). On the other hand,
> IDNA seems to have done it in the opposite order. First, the spec was
> written, and now that we have deployed some implementations, we are
> finding serious problems with punctuation marks and symbols.

That's why IDNA is still a Proposed Standard Protocol (not even
a Draft Standard Protocol); see STD 1. It will advance to Draft
Standard if two independent and interoperable implementations
from different code bases have been developed, and sufficient
successful operational experience has been gained; see BCP 9.

It is also *not* the case that it was specified first and implemented later.
All along the process, people have been implementing bits and pieces of
it, test beds have been run, and so on. You might not have been around,
but some people still remember.

Regards,
Martin
Erik van der Poel
2005-03-15 21:34:43 UTC
Permalink
Martin v. Löwis wrote:
> Erik van der Poel wrote:
>> I feel that we are still at the very beginning of the adoption of the
>> particular Unicodes affected by this mistake. Most of them are for
>> South Asian languages. Hangul is much further along, but not the
>> particular Unicodes that are affected here (i.e. the Jamo).
>
> It's not that easy. When you use the old algorithm, you get normal
> Hangul syllables, which would be allowed in IDNA. It's only that the
> sequence *before* the normalization should not be allowed.

No, these strange sequences should not be disallowed. The specs should
be corrected so that the implementations can all treat these strange
sequences the same way.

>> More importantly, this mistake only affects highly unusual, malformed
>> data. I think that if IDNA decides not to follow Unicode's
>> recommendation now or in the next couple of years, 10 or 20 years from
>> now we would look back in time and regret this decision.
>
> I don't think so. "We" could still change the decision in 20 years, and
> not a single registration would be affected. The sequences causing the
> behaviour change are *really* unusual - I don't know if software can
> visually render them in a meaningful way, and I guess a native speaker
> would just consider them moji-bake. So it is unlikely that anybody would
> try to use them as input to IDNA in the next 20 years in a reasonable
> application.

If we do not correct the specs, more and more implementations will be
created and deployed, some implementing it one way, the others the other
way. It is hard to change something when a lot of implementations have
been deployed. This is why we have to act now (or soon). We have to nip
it in the bud.

Erik
John C Klensin
2005-03-12 16:40:26 UTC
Permalink
--On Saturday, 12 March, 2005 01:14 -0800 Erik van der Poel
<***@vanderpoel.org> wrote:

> All,
>
> Please do not draw any conclusions from the raw Unicode
> category stability data that I sent earlier. Ken Whistler, a
> Technical Director at the Unicode Consortium, was so kind to
> provide further information to put the data into their proper
> perspective. See below.
>...

Erik, Ken, and others,

The difficulty here is not, IMO, the specific numbers or
percentages. It is an important difference in perspective.
From the standpoint of UTC, these changes are few, minor, and
corrections to obscure errors. That is a perfectly sensible
position.

From the standpoint of the IETF, or anyone else worried about a
piece of protocol that must support many applications, the
problem is a little different. Some of the recent developments
in automatic updating tools notwithstanding, IDNA (and its
supporting tables) are designed to be embedded in and used from
clients. Many of those clients, and the associated operating
systems, have been historically updated only when the machine in
which they run is replaced. That argues for an extremely
conservative view of protocol design and compatibility, with
very high thresholds for justifying incompatible changes of any
sort. From that viewpoint, the difference between 0.01%
changes and 5% changes is like a measure of being partially
pregnant: perhaps helpful in some types of risk assessment, but
less so in making the next design decision.

john
Martin v. Löwis
2005-03-12 18:14:02 UTC
Permalink
John C Klensin wrote:
> From the standpoint of the IETF, or anyone else worried about a
> piece of protocol that must support many applications, the
> problem is a little different. Some of the recent developments
> in automatic updating tools notwithstanding, IDNA (and its
> supporting tables) are designed to be embedded in and used from
> clients. Many of those clients, and the associated operating
> systems, have been historically updated only when the machine in
> which they run is replaced. That argues for an extremely
> conservative view of protocol design and compatibility, with
> very high thresholds for justifying incompatible changes of any
> sort. From that viewpoint, the difference between 0.01%
> changes and 5% changes is like a measure of being partially
> pregnant: perhaps helpful in some types of risk assessment, but
> less so in making the next design decision.

While the facts are all true (clients are updated only rarely, and
protocol stability is important), I somewhat disagree with the
conclusion. Design decisions need to take all these things into
account, but on a factual basis, not a moral one. E.g. if a protocol
change is known to potentially break existing clients, but the
number of actual clients being broken is also known to be small,
the protocol change might still be acceptable. Likewise if the
impact of "breakage" would be "small".

In the context of IDNA, I would conclude that upgrading to a
newer Unicode version in IDNA would do no significant harm, even
if it is, strictly speaking, an incompatible protocol change.
Existing clients must be considered, but in doing so, the effects
of a proposed change to such clients must also be considered -
not in a theoretical way, but by trying the changes on a real
existing client. For IDNA, it seems likely that the existing
clients would not change in any user-visible way, and that
the behaviour change in artificial cases should be considered
acceptable.

That said, I also believe that upgrading to a newer Unicode
version would do little good. Additional characters are likely
of little use to the broad masses, as font support etc. is still
lacking. If certain new characters are of high interest to IDNA
users, I expect registrars to weaken the "AllowUnassigned"
setting of "false" to, say, "AllowUnassignedAsPerLatestUnicodeSpec" -
independent of what any RFC says.

Regards,
Martin
Mark Davis
2005-03-12 18:15:18 UTC
Permalink
We do not make any absolute guarantees of stability for the general category
and many other properties, because a miscategorized character would cause
incorrect behavior in computers for the languages that use it. And as new,
increasingly obscure characters are added to the standard, it may take some
time to get really accurate information.

However, I do want to call attention to certain properties of the Unicode
Standard that may be relevant -- characterizing identifier and pattern syntax
characters -- which do have strict requirements on stability. There is a
draft of the newest specification on
http://www.unicode.org/reports/tr31/tr31-5.html. Programming identifiers
have a bit different requirements from IDN, but they are related enough that
the information may be useful.

The data for Pattern_Syntax and Pattern_White_Space is in
http://www.unicode.org/Public/4.1.0/ucd/PropList-4.1.0d12.txt. For XID_Start
and XID_Continue, it is in
http://www.unicode.org/Public/4.1.0/ucd/DerivedCoreProperties-4.1.0d12.txt

All of these will be finalized and released as part of Unicode 4.1.0.
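
For anyone wanting to consume those files, the format is the usual UCD one
("start..end ; Property # comment"); a hedged sketch of a parser, with the
format assumption being mine rather than anything the files guarantee:

    import re
    from collections import defaultdict

    def parse_ucd_properties(path):
        # Parses lines like "0041..005A ; XID_Start # ..." or
        # "00AA ; XID_Start # ..." into {property: [(start, end), ...]}.
        pattern = re.compile(
            r'^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)')
        props = defaultdict(list)
        with open(path) as f:
            for line in f:
                m = pattern.match(line)
                if m:
                    start = int(m.group(1), 16)
                    end = int(m.group(2) or m.group(1), 16)
                    props[m.group(3)].append((start, end))
        return props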

Mark

Adam M. Costello
2005-02-26 08:19:13 UTC
Permalink
Doug Ewell <***@adelphia.net> wrote:

> Is it really possible that we spent a year and a half, two years on
> putting together an IDN architecture, and during all that time nobody
> ever gave the slightest thought to the possibility of someone using
> IDNs for spoofing purposes,

No, it was thought about, and it was decided that the IDNA protocol was
not the place to address those issues; that they should be addressed in
registries and user interfaces.

IDNA could have addressed the easier portion of the problem (prohibiting
punctuation and symbols) (and for a while I was arguing for that), but
it still would have left the harder part of the problem (dealing with
script mixtures and homographs among letters) for the registries and
user interfaces to deal with, so why not let them deal with the easier
part too?

(Of course, one could then ask why that argument doesn't apply to all
the invisible characters that IDNA does prohibit. I have no good answer
at the moment. Maybe invisibility was the only disqualifying attribute
that everyone could agree on.)

John C Klensin <***@jck.com> wrote:

> I hope that those who wrote the IDNA specs will agree with the
> statement of those principles I'm about to make, or at least that they
> are close... they may not.
>
> (1) To the extent possible, we should accommodate all Unicode
> characters, excluding as little as possible.

That (or something very similar) was a principle that went into the
IDNA spec. I personally was inclined to define both internationalized
domain names and internationalized host names, where the former would
be completely general (allowing *all* Unicode characters, even the
invisible ones), and the latter would be much narrower (excluding most
punctuation and symbols). This would be an analogy to traditional
domain names (which allow all ASCII characters, even control characters)
and traditional host names (which allow only the ASCII letters, digits,
and one punctuation mark, the hyphen-minus).

On the other hand, there was an argument that the traditional
distinction between domain names and host names was the source of
endless confusion and debate, and was a mistake that should not be
repeated with IDNs. I have some sympathy for that argument.

In any case, we ended up with just one set of non-ASCII characters for
IDNs, between the two extremes: only invisible characters are excluded.
(I think there's one exception--a visible space character that is also
excluded).

> (2) When code points had been identified by UTC as the same as, or
> equivalent to, others, we tended to map them together, rather than
> picking one and prohibiting the others.

This was more than a tendency; it was strictly followed.

> This has caused more problems than most of us expected, with people
> being surprised when they register or query using one character and
> the result that comes back uses another.

I think this happens only for the case-folding mappings. The
normalization mappings should not surprise anyone.

> It also creates a near-homograph problem that we haven't "discovered"
> in the last couple of weeks: If we have character X mapping to
> character Y, but X looks vaguely like Z, then there may be no Y-Z
> homograph, but there may be an X-Z one.

True. And again, I think it's just the case-folding mappings that do
this, not the normalization mappings.
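For concreteness, here is what the two kinds of mappings look like, using
Python's generic Unicode machinery rather than the actual nameprep tables
(these two particular mappings happen to agree with nameprep's):

    import unicodedata

    # Case folding: U+00DF (sharp s) folds to "ss", so a label typed
    # with one spelling comes back spelled another way.
    print("stra\u00dfe".casefold())                  # strasse

    # KC normalization also folds some lookalikes together:
    # U+212A KELVIN SIGN becomes a plain "k".
    print(unicodedata.normalize("NFKC", "\u212a"))   # k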

> Curiously, if we followed existing precedents, we could even move
> IDNA from Proposed to Draft and change the tables to eliminate many
> mappings and characters: no change to the algorithm, just elimination
> of some features that didn't work in practice.

If we want to place further restrictions on the set of characters used
in IDNs, I think it would be pretty rude of us to simply add them to
the set of prohibited characters in Nameprep. What about the guy who
registered <not_equal>.com? What if people had already bookmarked that
site, and created links to it? Are we just going to break those links?

A less rude approach would be to recommend that domain labels containing
certain characters not be displayed. Their ACE forms could still be
displayed, and they could still be looked up. The domain holder in this
example could register a new displayable domain name, and could put an
HTTP redirector at the old site, and existing bookmarks and links would
continue to work.
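A minimal sketch of that recommendation, assuming a display-side blocklist
(the suppressed set and the function name are made up for illustration;
only the ToUnicode call is real, from Python's RFC 3490 helpers):

    import encodings.idna as idna   # Python's IDNA (RFC 3490) helpers

    SUPPRESSED = {"\u2260"}   # NOT EQUAL TO; illustrative, not a proposal

    def to_display(ace_label: bytes) -> str:
        try:
            uni = idna.ToUnicode(ace_label)
        except UnicodeError:
            # Bogo-ACE: carries the prefix but fails validation,
            # so it is shown as a literal ASCII string anyway.
            return ace_label.decode("ascii", "replace")
        if any(ch in SUPPRESSED for ch in uni):
            return ace_label.decode("ascii")   # show the ACE form instead
        return uni

    print(to_display(b"xn--bcher-kva"))   # bücher (nothing suppressed)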

Erik van der Poel <***@vanderpoel.org> wrote:

> I believe it would be difficult to reach consensus on a relatively
> narrow extension of the LDH rule. Just for starters, the hyphen used
> to separate names and other strings in the Western world is not used
> in Japan for Katakana, because Katakana uses a middle dot (U+30FB) to
> separate 2 Katakana strings. In fact, this character is allowed in
> .jp.

But notice how seldom the hyphen-minus is actually used in domain
names. People prefer to just run words together, even in languages that
customarily use word breaks. Maybe the analogous characters in other
scripts (like the katakana middle dot) would likewise be very seldom
used in practice (especially in Japan where the lack of word breaks is
the norm), and would not be missed if they were deprecated.

> It may be possible to "tune" the tables, but nowhere in your email do
> I find any reference to the ACE prefix. I think that we should also
> figure out exactly which types of changes would absolutely require a
> new ACE prefix,

Coming up with the necessary and sufficient conditions will be tricky,
but now that you've got me thinking about it, I think I can supply
one sufficient condition: If the only changes you make are to add
characters to the prohibited table, I don't think you need to change the
ACE prefix. This would cause some valid IDN labels under the old spec
to become invalid under the new spec, and would cause some valid ACE
labels under the old spec to become bogo-ACE labels under the new spec.
(The bogo-ACE phenomenon already exists: there are labels that begin
with the ACE prefix but don't validate during ToUnicode and therefore
display as literal ASCII strings.) It would not cause anything to
encode or decode to something different than it used to.

But I don't advocate making such a change (see my argument above about
rudeness).

AMC
Erik van der Poel
2005-02-26 19:45:08 UTC
Permalink
Oh, this one's just priceless. I have to share it with you all:

http://e.netpia.com/

Move your mouse over the "Why NLIA" near the top, and then read the
words that appear.

With apologies to Netpia,

Erik
Adam M. Costello
2005-02-28 01:55:45 UTC
Permalink
Erik van der Poel <***@vanderpoel.org> wrote:

> Oh, this one's just priceless. I have to share it with you all:
>
> http://e.netpia.com/
>
> Move your mouse over the "Why NLIA" near the top, and then read the
> words that appear.

That was extremely funny.

And then, after I read more about this company, somewhat ominous. In a
nutshell, they're setting up an alternate root for DNS, so that people
unfamiliar with Latin characters won't have to deal with ASCII TLDs.
It's a nice goal, but the means are worrisome to say the least.

Maybe this is a sign that it's time to figure out a standard way to
support non-ASCII TLDs.

> Finally, regarding displaying ".com" in Chinese, there is currently
> no reason to display ".com" in ASCII. This could easily be displayed
> in Chinese if the application developers were only willing to modify
> their programs to be more user-friendly.

I don't think it's that simple. The purpose of domain names is to
serve as global identifiers. If non-ASCII synonyms for TLDs were
left as a UI issue for each application to solve independently, two
different applications could choose different Thai spellings for .uk
(for example), and their users wouldn't be able to refer each other
to sites; the domain names wouldn't be fulfilling their purpose as
global identifiers. Therefore, the spellings of all the TLDs in all
the scripts need to be standardized. That's about 300 TLDs times about
50 scripts, potentially around 15,000 localized TLDs. Such a table
probably shouldn't be hard-coded into every application. It should be
kept in an online database, like... the DNS!

According to the Unicode standard, there are 52 scripts. Currently,
all TLDs use the Latin script. I suggest that every country be allowed
to register up to 51 additional TLDs, one per non-Latin script, in the
root zone. Countries would choose abbreviations for themselves, which
would need to be ratified by some review process to make sure they were
reasonable, and not homographs of other TLDs.

A similar policy could exist for gTLDs, except that .com and .$B>&(B
[that's my guess at the analogue of .com in the Han script] would not
necessarily be operated by the same registry; any accredited registry
could apply to operate a synonym for an existing ASCII gTLD in a script
that was not already in service, and the proposed new gTLD would be
checked for being a reasonable synonym, but would not have to redo the
arduous approval process that the original ASCII gTLD did.

To be fair to early registrants of IDNs, perhaps every new non-ASCII
gTLD should be required to initialize its zone with any names from
the corresponding ASCII gTLD zone that satisfy the (possibly more
restrictive) syntax rules of the new gTLD. The names should be added
in order of seniority, in case the new gTLD has more restrictive
name-blocking rules that prevent two admissible names from coexisting.
The names in the new zone would inherit their owners and expiration
dates from the old zone. After the initialization, the new zone would
be independent of the old zone, and people could opt to register/renew
names in one and not the other.

Non-ASCII TLDs would be represented using IDNA, no different from labels
at any other level.

AMC
Doug Ewell
2005-02-28 16:23:31 UTC
Permalink
Adam M. Costello <idn dot amc plus 0 at nicemice dot net dot
RemoveThisWord> wrote:

> According to the Unicode standard, there are 52 scripts.

There may be 52 scripts currently encoded in Unicode, but I am sure
Unicode does not claim that is the total number of scripts in the world.
Others can and will be encoded.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
John C Klensin
2005-02-28 18:49:57 UTC
Permalink
--On Monday, 28 February, 2005 08:23 -0800 Doug Ewell
<***@adelphia.net> wrote:

> Adam M. Costello <idn dot amc plus 0 at nicemice dot net dot
> RemoveThisWord> wrote:
>
>> According to the Unicode standard, there are 52 scripts.
>
> There may be 52 scripts currently encoded in Unicode, but I am
> sure Unicode does not claim that is the total number of
> scripts in the world. Others can and will be encoded.

At least as important, those are "scripts" with regard to
typographic relationships, more or less. The issue that started
ICANN (and others) down the "language" path is that, if one
wants to avoid homographs _as seen by a user who is not familiar
with a different script or part of a script_, one often needs to
make more constrained lists of characters than those that appear
to be "scripts" from the Unicode standpoint. And, as Michel
points out, the Unicode list is growing and will probably
continue to grow.

john
Adam M. Costello
2005-03-01 07:02:48 UTC
Permalink
Doug Ewell <***@adelphia.net> wrote:

> There may be 52 scripts currently encoded in Unicode, but I am sure
> Unicode does not claim that is the total number of scripts in the
> world. Others can and will be encoded.

Michel Suignard <***@windows.microsoft.com> wrote:

> The 2 current amendments of ISO/IEC 10646 (The ISO sibling of Unicode)
> being processed are adding about 10 new scripts. And a new amendment
> will be initiated in September with a few more scripts. Any scheme
> based on a finite number of scripts is doomed.

Of course the number 52 would not be hard-coded into the policy. It
would be expressed as "the number of Unicode scripts supported by the
latest version of IDNA".

Unicode tends to grow linearly, not exponentially, so that shouldn't
present a scaling problem.

AMC
JFC (Jefsey) Morfin
2005-02-27 04:13:54 UTC
Permalink
http://www.earthtimes.org/articles/show/1754.html
Won't people now associate IDN with phishing?
In a way, it seems worse than saying we block IDNs.
jfc
Erik van der Poel
2005-02-27 19:08:39 UTC
Permalink
Adam M. Costello wrote:
> That (or something very similar) was a principle that went into the
> IDNA spec. I personally was inclined to define both internationalized
> domain names and internationalized host names, where the former would
> be completely general (allowing *all* Unicode characters, even the
> invisible ones), and the latter would be much narrower (excluding most
> punctuation and symbols). This would be an analogy to traditional
> domain names (which allow all ASCII characters, even control characters)
> and traditional host names (which allow only the ASCII letters, digits,
> and one punctuation mark, the hyphen-minus).
>
> On the other hand, there was an argument that the traditional
> distinction between domain names and host names was the source of
> endless confusion and debate, and was a mistake that should not be
> repeated with IDNs. I have some sympathy for that argument.
>
> In any case, we ended up with just one set of non-ASCII characters for
> IDNs, between the two extremes: only invisible characters are excluded.
> (I think there's one exception--a visible space character that is also
> excluded).

Another bifurcation that could be considered somewhat analogous is that
of http vs https. We might even want to consider bringing the topic of
security into the ACE prefix discussion. One could imagine a world where
two different ACE prefixes co-exist, one new prefix for "secure" domain
labels, the other (old) prefix for less secure labels. The secure prefix
would have similar encoding and decoding rules, but would not have the
sometimes-confusing mappings currently found in nameprep, and would
prohibit a rather large number of Unicode characters and/or character
types (for future expansion).

We might then choose "xn--s" as the prefix, so that the raw Punycode
form would also be more secure since there would be an 's' next to
whatever follows, rather than a hyphen, which looks more like a
delimiter. E.g. xn--spypal-4ve instead of xn--pypal-4ve. Note that
"spypal" looks quite different from "pypal". Of course, this example isn't
very good since the beginning of pypal doesn't resemble the beginning of
paypal. A better example would be one where the 2nd 'a' of paypal was a
homograph.
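To make the prefix mechanics concrete, here is a rough sketch; it skips
nameprep entirely, "xs--" is only the hypothetical prefix under
discussion, and to_ace is a made-up name:

    def to_ace(label, prefix="xn--"):
        # Pure-ASCII labels need no encoding, as in IDNA's ToASCII.
        if all(ord(c) < 128 for c in label):
            return label
        # Otherwise Punycode-encode; a real implementation would run
        # nameprep first and enforce the 63-octet label limit.
        return prefix + label.encode("punycode").decode("ascii")

    print(to_ace("b\u00fccher"))          # xn--bcher-kva
    print(to_ace("b\u00fccher", "xs--"))  # xs--bcher-kva (hypothetical)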

However, a 2nd ACE prefix might be fraught with difficulties. Just for
starters, we might end up with FQDNs with 3 different encodings (if
there are 3 or more labels), i.e. both ACE prefixes and the pure ASCII
TLD name. And then there would also be the question of *which* ACE
prefix to choose while encoding. We might just have to specify that
*all* the labels use the same ACE prefix (or pure ASCII, e.g. for the
TLD). This would be consistent with RFC 1591 and current conventions
(except for TLDs that allow just about anything underneath them). E.g.
the .jp registry might have a rule that says that *all* domain labels
either use one prefix or the other, together with pure ASCII for the
final ".jp" part (or any part).

Co-existence is quite different from transition. Although migration
typically requires the co-existence of the old and the new during the
transition period, people normally intend to complete the transition by
getting rid of the old (entirely or almost entirely). However, there are
probably many examples of migrations that started with good intentions
but ended up with rather long periods of co-existence. One that comes to
mind is HTML vs XHTML. I don't know whether we will ever be able to
exterminate HTML, regardless of our "good" intentions.

Erik
Erik van der Poel
2005-02-27 19:15:00 UTC
Permalink
Erik van der Poel wrote:
> Another bifurcation that could be considered somewhat analogous is that
> of http vs https. We might even want to consider bringing the topic of
> security into the ACE prefix discussion. One could imagine a world where
> two different ACE prefixes co-exist, one new prefix for "secure" domain
> labels, the other (old) prefix for less secure labels.

Sorry, I forgot to say that a Web site would choose the new secure ACE
prefix when they use https. In fact, they would make that choice for
similar reasons, i.e. to allow the user agent to distinguish this site
from a less secure one, similar to Mozilla's current choice of using the
padlock icon and a different color near the URI at the top for https.

Erik
JFC (Jefsey) Morfin
2005-02-27 22:55:47 UTC
Permalink
(I will correct Erik's proposed "xn--s" notation to "xs--", so as not to
create havoc in Punycode.)

One of the reasons why I disagreed with IDNA is that it creates a possibly
conflicting left-to-right hierarchy while the DNS hierarchy is
right-to-left. Erik's proposition makes a lot of sense. But it means that a
label at the same time:
- belongs to a DNS zone
- belongs to a zone of encoding (ASCII, Punycode, his punysecure, new
versions, tables, other transcodings, etc.)
- may belong to a zone of encoding different from the zone of encoding of
other labels (ex.: "xn--abc.xs--def.tld").

This does not simplify understanding, management, or security. Why not just
use DNS zones? I have not yet understood why that was opposed. IMHO the
future of ML.ML names is in the form "name2.name.xx--chicom.com", where
"xx--chicom.com" will print as ".com" in Chinese, and name, name2, etc.
will all have to use codes from the Chinese Table of ".com".

jfc

At 20:15 27/02/2005, Erik van der Poel wrote:
>Erik van der Poel wrote:
>>Another bifurcation that could be considered somewhat analogous is that
>>of http vs https. We might even want to consider bringing the topic of
>>security into the ACE prefix discussion. One could imagine a world where
>>two different ACE prefixes co-exist, one new prefix for "secure" domain
>>labels, the other (old) prefix for less secure labels.
>
>Sorry, I forgot to say that a Web site would choose the new secure ACE
>prefix when they use https. In fact, they would make that choice for
>similar reasons, i.e. to allow the user agent to distinguish this site
>from a less secure one, similar to Mozilla's current choice of using the
>padlock icon and a different color near the URI at the top for https.
>
>Erik
Erik van der Poel
2005-02-28 00:06:18 UTC
Permalink
JFC (Jefsey) Morfin wrote:
> (I will correct Erik's proposed "xn--s" notation to "xs--", so as not to
> create havoc in Punycode.)

You are right, of course. What was I thinking? Duh. Well, how about
"xs--n", to differentiate it from "xn--", and to still have the
advantage of a non-delimiter-like character before the rest of the string?

> One of the reasons why I disagreed with IDNA is that it creates a
> possibly conflicting left-to-right hierarchy while the DNS hierarchy is
> right-to-left.

I'm afraid I don't understand this.

> This does not simplify understanding, management, or security. Why not
> just use DNS zones? I have not yet understood why that was opposed. IMHO
> the future of ML.ML names is in the form "name2.name.xx--chicom.com",
> where "xx--chicom.com" will print as ".com" in Chinese, and name, name2,
> etc. will all have to use codes from the Chinese Table of ".com".

I think your proposal makes some sense. It is similar to my proposal in
a way -- recall my .jp example, with the rule that *all* labels would
have to use the same ACE prefix or pure ASCII.

There isn't really any way to force the TLDs or zone administrators to
follow any rules that we might come up with. The best we can do is write
down some guidelines that are well thought out. And write them clearly.

If registries and zone administrators fail to follow the guidelines, the
applications may have to display their domain names differently, to
indicate some level of risk.

Also, we can't just suddenly switch a TLD from one encoding to another
and then expect all the subdomains to follow suit the same night.
Instead, we might have a rule specifying that all labels under the first
new ACE prefix must use the same prefix. For example, suppose we have a
new domain with the new prefix, called "xs--nfoo-abc.jp". Since the 2LD
uses the new prefix, any 3LDs, 4LDs and so on would also have to use the
new prefix. Does this make sense?
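As a sketch, the simple no-mixing form of that rule is easy to check
("xs--" again stands in for the hypothetical new prefix, and the function
is illustrative only):

    def prefixes_consistent(fqdn, prefixes=("xn--", "xs--")):
        # Which ACE prefixes appear among the labels? Pure-ASCII labels
        # (no prefix) are always acceptable; mixing prefixes is not.
        seen = {p for label in fqdn.lower().split(".")
                  for p in prefixes if label.startswith(p)}
        return len(seen) <= 1

    print(prefixes_consistent("www.xs--nfoo-abc.jp"))          # True
    print(prefixes_consistent("xn--foo-4ve.xs--nfoo-abc.jp"))  # False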

Finally, regarding displaying ".com" in Chinese, there is currently no
reason to display ".com" in ASCII. This could easily be displayed in
Chinese if the application developers were only willing to modify their
programs to be more user-friendly. Of course, this brings up all sorts
of issues like what to do with copy and paste, educating users about the
new kind of display, being able to type ".com" in Chinese in one
application while being required to type it in ASCII in another, and so
on. There are some issues, but theoretically, you *can* already display
".com" in Chinese.

This is not so different from Punycode itself, which you wouldn't
normally display as is. You first decode it and then show the Unicode to
the user.

Erik
JFC (Jefsey) Morfin
2005-02-28 03:18:51 UTC
Permalink
At 01:06 28/02/2005, Erik van der Poel said:
>JFC (Jefsey) Morfin wrote:
>>(I will correct Erik's proposed "xn--s" notation to "xs--", so as not to
>>create havoc in Punycode.)
>
>You are right, of course. What was I thinking? Duh. Well, how about
>"xs--n", to differentiate it from "xn--", and to still have the advantage
>of a non-delimiter-like character before the rest of the string?

No need to have 4 or 5 characters; "xs--" would be enough.

>>One of the reasons why I disagreed with IDNA is that it creates a possibly
>>conflicting left-to-right hierarchy while the DNS hierarchy is right-to-left.
>
>I'm afraid I don't understand this.

The DNS hierarchy is 1st level, 2nd level, etc., from right to left.

The scripting hierarchy introduced by the ACE prefix (on the left of the
name) decides whether the label is ASCII or not. This is a local hierarchy
in the label. But using Tables, and permitting other versions as you
propose, creates a de facto lame hierarchy with various Tables applying or
not, in the same URI.

>>This does not simplify understanding, management, or security. Why not
>>just use DNS zones? I have not yet understood why that was opposed. IMHO
>>the future of ML.ML names is in the form "name2.name.xx--chicom.com",
>>where "xx--chicom.com" will print as ".com" in Chinese, and name, name2,
>>etc. will all have to use codes from the Chinese Table of ".com".
>
>I think your proposal makes some sense. It is similar to my proposal in a
>way -- recall my .jp example, with the rule that *all* labels would have
>to use the same ACE prefix or pure ASCII.

ACE prefix: I suppose you mean the same table.
Please understand there are three layers:
- internationalization: the scripts, the structure, etc.; what is discussed
by IDNA. This is the basis. But it does not make a service.
- multilingualization: the support of languages. IDNA is poor at that
because it has not analyzed the structure of naming enough: ASCII is just
another lingual support, using the ASCII Table, which IS defined (by
default), and IDNA forgot to structurally require the other tables to be
defined. (".com" is actually ".ascii.com" - which helps you to understand
that ".chinese.com" is NOT ".ascii.com" or ".com"; it is not the same zone,
while IDNA makes them the same.)
- vernacularization: what permits the users to use the system (ex. the
colors, etc.). They not only refused to consider it but did not work out the
tools permitting it to be built (like a compression method to distribute
Tables, etc.).

>There isn't really any way to force the TLDs or zone administrators to
>follow any rules that we might come up with. The best we can do is write
>down some guidelines that are well thought out. And write them clearly.

We have no _rule_ at all to write for anyone. If you edict a rule, you must
be able to enforce it. You can only describe the way things should work and
hope people will adhere. So you have to keep it as simple, unique, and
logical as possible. The strength of the DNS is that it is exactly that.
This results from the zones.

When you manage a zone, you are the master in your zone, but you _cannot_
affect the layer above. With IDNA you can, because if you enter an ACE
prefix in your zone, the whole FQDN becomes an FQIDN.

For 22 years there has been only one version of the DNS; the ACE prefix
does not change that, because it is unique. But if there are several
prefixes, the DN becomes complex.

>If registries and zone administrators fail to follow the guidelines, the
>applications may have to display their domain names differently, to
>indicate some level of risk.

The error (IMHO) of IDNA is to require "guidelines". The DNS has no
"guidelines". It has functions. You use them correctly and it works; you
don't and it does not.

>Also, we can't just suddenly switch a TLD from one encoding to another and
>then expect all the subdomains to follow suit the same night. Instead, we
>might have a rule specifying that all labels under the first new ACE
>prefix must use the same prefix. For example, suppose we have a new domain
>with the new prefix, called "xs--nfoo-abc.jp". Since the 2LD uses the new
>prefix, any 3LDs, 4LDs and so on would also have to use the new prefix.
>Does this make sense?

No. Because today we already have two rules:
1. a hierarchy of zones, from right to left. If I have the ".com" TLD, the
SLD will obey the ".com" rules, etc.
2. a local ACE prefix is _local_ to the label, even if, as I said above, it
has an impact on the whole FQDN, creating a lame bottom-up hierarchy.

Now, what you propose is that if you put an "xn--" label somewhere, it will
"pollute" the whole FQDN into an FQIDN, to be entirely read using
Punycode without the ACE prefix. This would be mad. Each label can be
read/used separately; the ACE prefix is part of the Punycode transcoding.

Important point: once the FQDN has been properly declared an FQIDN (through
the top-level information), you can have all the possible transcodings
(with as many xn--/xs--/zq--/etc. prefixes as you want) and stay
consistent: the right-to-left hierarchy has been respected.

>Finally, regarding displaying ".com" in Chinese, there is currently no
>reason to display ".com" in ASCII. This could easily be displayed in
>Chinese if the application developers were only willing to modify their
>programs to be more user-friendly.

No. Here is the confusion between the internationalization and the
multilingualization layers. This is what Adam called the host name (which
becomes too complex in reality due to the probable extensive use of
aliases - another topic). ".com" in Chinese does not make sense: who is
going to say that it is to be printed in Chinese? As a Chinese name chosen
by whom? Or in ASCII? ".com" is a default for ".ascii.com". Once you have
understood that, there is no more problem of any kind.

The DNS is something very powerful because it is simple. It knows very
little: labels and dots. IDNA says that applications can transcode labels
at the application level. But it did not address the top level. This
was partly corrected with the Tables, but there is no mechanical relation
between the TLD and a given Table, as there is in the ASCII Domain Name. In
the ASCII Domain Name, the Table is ASCII; you cannot use EBCDIC.

This information MUST be provided. The way the DNS (actually the global
naming) does it is through a zone. This zone can be a primary zone (a _new_
Chinese .com equivalent having a Chinese Table as a default) or a lingual
primary zone (a .chinese.com zone). In that zone, names will then have to
be IDN labels using the .com Chinese Table (or other accepted Chinese
codes). If there is no Table, no name can be registered.

Just remember this very simple point: ".com" actually is an abbreviation
for ".ascii.com" using the ASCII Table.

This means that the transcoding must be adapted:
- nameprep can know the table and check the IDN's consistency.
- the application can read/present the ".chinese.com" sequence in Chinese,
as the Chinese ".com" label.

> Of course, this brings up all sorts of issues like what to do with copy
> and paste, educating users about the new kind of display, being able to
> type ".com" in Chinese in one application while being required to type it
> in ASCII in another, and so on. There are some issues, but theoretically,
> you *can* already display ".com" in Chinese.
>This is not so different from Punycode itself, which you wouldn't normally
>display as is. You first decode it and then show the Unicode to the user.

I am not sure what you are discussing here. If you punycode a TLD, do you
create a new TLD?
jfc
Paul Hoffman
2005-02-25 16:56:09 UTC
Permalink
At 7:51 PM -0800 2/24/05, Erik van der Poel wrote:
>1. Is this the right time to start working on Internet Drafts
>leading up to new version(s) of the IDNA RFC(s)? If not, when?

That's your call. There is certainly no one stopping you (but, as far
as I can tell, very few people encouraging you either).

>2. Am I stepping on someone's toes by creating nameprep.org? Feel
>free to respond publicly or privately.

You are only stepping on toes if you try to pass it off as something
official, and so far you haven't at all. It certainly looks like a
useful service to me.

>4. Do we need to revive the IDN WG?

No. Individuals can submit proposals to update or replace standards
without a WG. The IESG will probably expect proof that there is
community support for a change to an existing, widely-deployed
standard. If you can show that and write a good revision to the
standard, no WG is needed. In the current case, trying to start up
the IDN WG is probably a bad idea, given the history of the previous
WG.

--Paul Hoffman, Director
--Internet Mail Consortium
JFC (Jefsey) Morfin
2005-02-26 01:20:06 UTC
Permalink
Erik,
I will try to respond to that with caution, based upon the real-world
situation, and trying not to hurt anyone.

At 04:51 25/02/2005, Erik van der Poel wrote:
>1. Is this the right time to start working on Internet Drafts leading up
>to new version(s) of the IDNA RFC(s)? If not, when?

The IDNA solution, as well described by John Klensin, has IMHO little
chance of taking off commercially. I suppose it will progressively be
replaced by different grassroots solutions, in non-Latin countries at
least, as has already started. Due to this progressive evolution, we may
suppose these solutions will still use Punycode, so the experience acquired
will remain of real interest. nameprep.org is OK.

I may be wrong, but the solution will most probably be based on simple
principles:
- respect for the DNS, either with lingual TLDs (an extension of the root
or a PAD [private alias directory]) or with .lingual_sld.tld and conversion.
- language homogeneity for the whole URL.
- a reduced number of authorized characters, as decided by the TLD/PAD
Manager.

This will probably be supported by local ISPs and by plug-ins (lingual
names will probably see a different pattern of usage, much as DNs were used
in the USA alone in 1984). The "plug-in functions" will probably be
extended and made part of the OS once stabilized. This will result from a
grassroots effort, documented later as informational RFCs. So there is no
need for a WG that no one wants to reopen, nor for ICANN, which has no
impact (2 mails on the ICANN IDNA mailing list in one year), etc., but
rather for relations with TLDs, governments, user representatives, and
cultural organizations.

>2. Am I stepping on someone's toes by creating nameprep.org? Feel free to
>respond publicly or privately.

Certainly not. You may accumulate experience which will be precious to
everyone. But understand that no one is really happy with IDNA. The
terms imposed by "Powers Above" were unworkable. There have been a lot of
efforts. Everyone tried his best. There is still the IRI to fully
understand. There is e-mail to support. There are the babel names. There
are the PADs to come.

>3. If this is the right time to start work on drafts, who would like to
>write some prose?

Frankly, I think this time it should be carried out the other way around:
understand the real world, then put a solution together, test it, and
then, once it starts working, document it for information.

But in the meanwhile, we should do everything to keep a good image for
IDNs. I do think that a multicolored-URI support Draft could help in
providing a way out of the current concern, restoring credibility and
giving credibility to a new team.

>4. Do we need to revive the IDN WG?

Certainly not!

Then what else? There are different points of view. Mine is that the real
need is to consider the whole matter of the multilingual Internet. Once the
framework has been clearly understood, ML.ML DNs will be far easier to
understand and discuss. But the IETF/IAB is obviously not interested
(yet?), as RFC 3869 shows.

I had started writing a Draft on the Multilingual Internet. My idea was to
document how the existing Internet standards process can document the
Multilingual Internet. The idea was to use RFC 3066 and the work of Paul
Hoffman as a basis, adding a multilingual-considerations part to every RFC
(like the security considerations, cf. RFC 3066), to extend the concepts by
extending and structuring Paul Hoffman's definition lists, and to build an
MLTF as there is a TFIPv6, to gather the concerned people and
organizations. Its purpose would be to take part in the standards-process
comment period, to help a culture to develop. The first thing was to
document questions for IAB guidance on the architectural aspects called
for by a MI.

The recent saga of RFC 3066bis showed there is still work ahead before
this can be considered. Your initiative could help prepare the ground.

>5. Any other process questions?

Why not work on practical functions and on testing real tools? You
seemingly have good developer skills; we could help.

jfc
Erik van der Poel
2005-02-26 07:19:44 UTC
Permalink
All,

Recently, we have been talking about creating a number of new documents.
In order to avoid overlap with other documents, I thought it might be
nice to have a list of related work, so I started a new section:

http://nameprep.org/#related-work

In particular, the Unicode Security document and John Klensin's
registration guidelines are very relevant to our discussions. If you
would like to add items to the list, please send them to me.

Thanks,

Erik
Gervase Markham
2005-03-02 11:55:12 UTC
Permalink
Erik van der Poel wrote:
> 1. Is this the right time to start working on Internet Drafts leading up
> to new version(s) of the IDNA RFC(s)? If not, when?

IMO, no. Nothing like consensus has yet emerged. However, I feel that
the way forward will become clear eventually - we aren't going round in
circles. It's just a big issue.

Gerv
tedd
2005-02-24 22:25:34 UTC
Permalink
John:

>I think the idea gets into trouble only when the application starts
>making guesses as to what should (or should not) be on that
>list.

Yes, that's one problem -- a big one.

>No, this wouldn't "solve" the problem. I agree with Erik that,
>if people had high expectations for it, they would be
>badly disappointed.

Agreed.

>But I'm convinced that we aren't going to find a magic bullet
>here. Instead, we can do some large fraction of the "useful,
>but not a solution" tools that have been suggested.

Provided that those tools do not instill a false sense of security or
overly warn the end user. I certainly would not want to be on the
receiving end of a lawsuit from a user claiming that my tool-tip didn't
warn them of a spoof, OR from an honest company suing me for defamation
of character because my tool-tip suggested its site was a spoof.

I know the law well enough to realize that sometimes doing anything
is an admission of responsibility. As such, regardless of the
intellectual intent and pursuit, doing something is not always better
than doing nothing. If you're going to do something, make sure you
have a defensible foundation for your actions.

>We can try
>to restrict characters that are clearly dangerous, adopting, if
>necessary, a view that the fact someone wants to register or use
>a particular string doesn't mean that they are entitled to do
>so.

If you are talking about characters that are "clearly" dangerous,
then of course you have sufficient proof for your actions and I agree
with you. However, the fact that someone "wants" to register a
particular string doesn't mean that they are NOT entitled to do so.
In fact, considering civil liberties and freedom of speech, they may
even have the right to do so. However, I am not offering a legal
opinion. I'm just saying that the entire process is not evolving in a
vacuum, but rather is driven by multifaceted interests and concerns --
ones that are larger than this list.

>We can adopt a variety of warning technologies --whether
>they involve colors, displaying punycode, pop-up warnings, or
>something else-- and let applications compete on which ones can
>do a better job of that. We can try some user education. We
>can use the UDRP and/or the legal system in various countries to
>push back on those who register deceptive names and on the
>registrars and registries that encourage the registration of
>such names. And other ideas may come along that should be
>implemented.

Again, I believe that you are assuming liability by doing so. How
much liability depends upon how deep your pockets are, or those of
your employer. That's not good, nor bad -- it's just life.

>Then we can hope that those things, in combination, reduce the
>problem to some tolerable level, understanding that it will
>never completely go away.

I wish you the best. But before you do, you might want to look into
the problems that the anti-spam industry has with the backlash of
lawsuits from spammers. Not a pretty picture, and it seems that
everyone has rights, or at the very least, money to spend to prove
their point.

tedd
--
--------------------------------------------------------------------------------
http://sperling.com/
Erik van der Poel
2005-02-24 18:00:59 UTC
Permalink
> If they were displayed
> in the opposite (big-endian) order, the 3rd example above would become:
>
> http://xx.baz.com|bar.foo
>
> Notice how the "com" and "foo" are now separated.

There was a gap in my logic here. A phisher could easily keep the "com"
and "foo" next to each other. My real point is that big-endian display
of domain names would put the important parts of the name near the
beginning for a left-to-right reader:

> The "real" (unspoofed)
> URI would look like this:
>
> http://com.foo
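
For concreteness, the reordering itself is trivial to express (this is
display-side only; nothing on the wire would change, and the function is
just an illustration):

    def big_endian(fqdn):
        # Reverse the label order so the registry-controlled
        # part leads for a left-to-right reader.
        return ".".join(reversed(fqdn.split(".")))

    print(big_endian("www.example.org"))   # org.example.www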

Erik
Michel Suignard
2005-02-24 18:23:47 UTC
Permalink
Many web sites use deep DNS sub-zones today (including the company I work for), and I really doubt our customers would appreciate such a radical change in user experience. Most people expect to see www.example.org, not org.example.www. These days the 'www' is really noise, but it is also frequent to see myzone.example.org, which works really well with autocompletion logic. Putting org first would sort of defeat that logic as well.

Don't get me wrong, your idea is good, it is just way too late.

Like somebody else already said, I feel some level of overreaction in these threads.

Michel

-----Original Message-----
From: owner-***@ops.ietf.org [mailto:owner-***@ops.ietf.org] On Behalf Of Erik van der Poel
Sent: Thursday, February 24, 2005 10:01 AM
To: IETF idn working group
Subject: Re: [idn] punctuation

> If they were displayed
> in the opposite (big-endian) order, the 3rd example above would become:
>
> http://xx.baz.com|bar.foo
>
> Notice how the "com" and "foo" are now separated.

There was a gap in my logic here. A phisher could easily keep the "com"
and "foo" next to each other. My real point is that big-endian display of domain names would put the important parts of the name near the beginning for a left-to-right reader:

> The "real" (unspoofed)
> URI would look like this:
>
> http://com.foo

Erik
Erik van der Poel
2005-02-24 19:03:45 UTC
Permalink
Michel Suignard wrote:
> Many web sites use deep DNS sub-zones today
> (including the company I work for) and I really
> doubt our customers would appreciate such a
> radical change in user experience.

As has already been pointed out on this list, a phisher could take
advantage of a very long domain name to push the important part past the
end of the display area. In the long run, the phishing problem may
outweigh the user experience problem that would result from the radical
change that I am talking about.

Besides, there is a much better way to tell the user where they are. See
the blurb about "breadcrumb trails" at:

http://useit.com/alertbox/20000109.html

You can see these breadcrumb trails near the top of:

http://msdn.microsoft.com/Longhorn/toolsamp/default.aspx

> it is also frequent to see
> myzone.example.org which works really well with
> autocompletion logic. Putting org first would
> sort of defeat that logic as well.

I have seen autocompleters that work for the middle or end of a string.
In fact, Microsoft Internet Explorer has one.
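The idea is just substring matching against history rather than prefix
matching; a toy version (the history entries here are made up):

    HISTORY = ["http://myzone.example.org/", "http://msdn.microsoft.com/"]

    def complete(fragment):
        # Match anywhere in the stored string, not just at its start.
        return [u for u in HISTORY if fragment in u]

    print(complete("example"))   # matches in the middle of the URL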

> Like somebody else already said, I feel some
> level of overreaction in these threads.

Ten years from now, let's revisit this topic together, shall we? I am
not a betting man, but let's just see how it goes.

Erik
Michel Suignard
2005-02-24 19:30:23 UTC
Permalink
Erik, I don't think it is useful to engage in a debate about web site design and the appropriateness of multi-node web sites. All your points are probably valid there, but they are mostly orthogonal to the discussion here.

I am still not convinced that reversing the display order of domain names is a good idea. And there are many more reasons than the ones I already gave, such as disruption of the current logical order. Especially if we can quickly convince ICANN and most registry orgs to effectively deprecate usage of all homographs of URI reserved characters. This seems to me a more realistic approach, and it doesn't prevent browsers from paying attention to those homographs when they are detected in IRIs.

Michel
John C Klensin
2005-02-24 19:55:09 UTC
Permalink
--On Thursday, 24 February, 2005 11:30 -0800 Michel Suignard
<***@windows.microsoft.com> wrote:

> Erik, I don't think it is useful to engage on a debate about
> web site design and the appropriateness of multi node web
> site. All your points are probably valid there but they are
> mostly orthogonal to the discussion here.
>
> I am still not convinced that reversing the display order of
> domain names is a good idea. And there are many more reasons
> that the ones I already gave, such as disruption of the
> current logical order. Especially if we can convince quickly
> ICANN and most registry orgs to effectively deprecate usage of
> all homographs of URI reserved characters. This seems to me a
> more realistic approach and it doesn't prevent browsers to pay
> attention to those homographs when they are detected in IRIs.

Michel,

If we can reach some reasonable consensus about what characters
(URI-reserved or otherwise) that ICANN should deprecate, I'm
more than happy to put my liaison hat on and forcefully carry
the message over there.

But we need to remember that ICANN's authority, and hence the
impact of their deprecating something, is very limited. In
particular:

* For the gTLDs, they can create additional guidelines
or modify the existing ones, but it is not clear how
forcefully they can, or will, apply them if some gTLDs
decide to ignore those guidelines. As has been pointed
out, the example that started these thread could almost
certainly not have been registered if the intent of the
existing guidelines had been observed by all relevant
domains.

* For the ccTLDs, ICANN can recommend and advise, but
their enforcement power is between "slight" and "none",
at least unless there is a clear violation of a standard
(see below).

* For any domain below the second level -- i.e., a
registration in a TLD -- ICANN is generally not only
completely lacking in authority but most of the relevant
domain administrators don't even have a way to hear
about the fact that ICANN has deprecated (or otherwise
recommended against the use of) some characters.

So, yes, let's do something. Note, fwiw, that they have opened
a public comment forum at idn-***@icann.org -- see the
"IDN Homograph Concerns" section on
http://www.icann.org/topics/idn.html.

But, if we think particular characters are bad news, we need to
follow up whatever actions browser-writers take spontaneously,
and whatever we ask ICANN to do, by deprecating them in
nameprep. If those nameprep changes are implemented, then the
characters get rejected at both registration and lookup time, in
a consistent way, and get rejected in any domain and at any
level of the tree. And violating the standard is something
that gives ICANN at least a little leverage, since most
registries have agreed -- as a consequence of accepting RFC 1591,
if not by explicit agreements with ICANN -- that ignoring
standards is a Bad Thing.

john
Michel Suignard
2005-02-28 18:13:38 UTC
Permalink
The 2 current amendments of ISO/IEC 10646 (the ISO sibling of Unicode) being processed are adding about 10 new scripts. And a new amendment will be initiated in September with a few more scripts. Any scheme based on a finite number of scripts is doomed.

Michel

-----Original Message-----
From: owner-***@ops.ietf.org [mailto:owner-***@ops.ietf.org] On Behalf Of Doug Ewell

Adam M. Costello <idn dot amc plus 0 at nicemice dot net dot
RemoveThisWord> wrote:

> According to the Unicode standard, there are 52 scripts.

There may be 52 scripts currently encoded in Unicode, but I am sure Unicode does not claim that is the total number of scripts in the world.
Others can and will be encoded.

-Doug Ewell
Fullerton, California
John C Klensin
2005-03-01 02:12:02 UTC
Permalink
--On Sunday, 27 February, 2005 20:19 -0800 Erik van der Poel
<***@vanderpoel.org> wrote:

> John C Klensin wrote:
>>
>> (i) ICANN is still assuming that this is a registry
>> issue. As such, if someone else starts guessing at what
>> a registry is doing, we may get into trouble, especially
>> since the tables may not show all of the relevant
>> registry rules and restrictions.
>
> Hmmm... GNU libidn already seems to be trying to use
> machine-readable tables. I had a look at the GNU libidn page:
>
> http://www.gnu.org/software/libidn/
>
> It has a copy of an expired Internet Draft by Paul Hoffman:
>
> http://josefsson.org/cgi-bin/rfcmarkup?url=http://josefsson.or
> g/cgi-bin/viewcvs.cgi/*checkout*/libidn/doc/specifications/dra
> ft-hoffman-idn-reg-02.txt
>
> This draft seems to be talking about bundling and blocking,
> which your draft talks about too. What happened here? Did Paul
> decide to let his expire?

Yes. Paul more or less gave up (he can explain that decision; I
won't try to do it for him), then generously consented to the
inclusion of some of his text and definitions, and even more of
his concepts, into my draft. A different way of looking at this
is that we found our drafts converging and I got the short straw
for producing a consolidated version and trying to walk it
through the works. Paul gets considerable credit, but bears no
responsibility or blame, for the result; I hope we are still
in at least broad agreement.

> Anyway, my only reason for trying to get machine-readable
> tables was to figure out which Unicode character categories
> were being used. Another way to get this info is to simply ask
> the registries. Or, we can suggest a list of categories and
> see if they would be happy with a nameprep-bis that limits the
> characters to those categories.

As has been pointed out in other contexts, this is probably a
fool's errand. If a registry declines to register something,
then it isn't present and there isn't much value in guessing
whether a lookup fails because the name was not registered or
because its registration was prohibited. Conversely, if a
registry either has no rules or declines to follow the rules it
does have, knowing what those rules were supposed to be is not
terribly useful.

If a browser wants to apply sanity checks before it attempts a
DNS lookup, that is really a separate set of issues and
constraints. And, again, knowing the rules a registry would
have applied if its authority reached down more than a level or
two probably isn't a big help.

john
Yao Jiankang
2005-03-03 02:09:07 UTC
Permalink
> ----- Original Message -----
> From: "JFC (Jefsey) Morfin" <***@jefsey.com>
> To: "John C Klensin" <***@jck.com>
> > Please stop considering only the solution to your own problems. Or let us turn the
> > Internet entirely to Chinese by default, and help the IETF to document
> > the best way to access pages written in English.
>

If you think so (that we should only help the IETF to document the best way to access pages written in English), it is not wise. In mainland China alone there are more than 90 million Internet users, almost 10% of the world's Internet population. If you neglect the Chinese Internet community, it's like having a 10% black hole in your system, as James Seng (SGNIC) said.
So we must consider the best way to access pages written not only in English.