Slash-like mark problem

Discussion:

James Seng

2005-02-20 11:26:55 UTC

try again.

Date: 20 Feb 2005 PM 06:08:39
Subject: Fwd: Slash-like mark problem
this was demonstrated to me just now. basically, the problem is U+2215,
a slash-like mathematical symbol.
this problem will not be solved even by registration policy because it
can be done on 3rd or 4th level.
james

Date: 20 Feb 2005 PM 05:51:47
Subject: Slash-like mark problem
Slash-like
http://bugbug.cocolog-nifty.com∕info.nekodama.com/icons/nekodama64b.gif
http://member.wide.ad.jp/~fujiwara/test.html
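
A rough Python sketch of why the first URL above misleads: the character after "cocolog-nifty.com" is U+2215, not an ASCII slash, so the host runs on until the first real U+002F. The helper name split_host is hypothetical and the parsing is deliberately naive (no ports, no userinfo, ASCII dots only).

```python
import unicodedata

# The demonstration URL, with the slash-like character written as an escape.
URL = "http://bugbug.cocolog-nifty.com\u2215info.nekodama.com/icons/nekodama64b.gif"

def split_host(url):
    """Return the host of a simple http URL (hypothetical, naive helper)."""
    rest = url.split("://", 1)[1]
    # Only the ASCII slash U+002F ends the host; U+2215 does not.
    return rest.split("/", 1)[0]

host = split_host(URL)
labels = host.split(".")

print(host)                          # the whole deceptive host name
print(".".join(labels[-2:]))         # the domain that was actually registered
print(unicodedata.name("\u2215"), unicodedata.category("\u2215"))

# Report at which level (counting from the TLD) the slash-like mark sits.
for level, label in enumerate(reversed(labels), start=1):
    if "\u2215" in label:
        print("U+2215 found in level", level, "label:", label)
```

The mark sits below the registered domain, which is why a registry's registration policy never gets to see it.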

William Tan

2005-02-20 11:53:20 UTC


this was demonstrated to me just now. basically, the problem is
U+2215, a slash-like mathematical symbol.

This makes a strong case for application-level filtering. AMC's
blacklist will fail here, as any TLD can be used to exploit this. Apps
will have to start detecting character properties such as symbol and
punctuation (which I guess would be allowed by some TLDs' IDN roll-outs,
including VGRS's).

wil.
[gone phishing with U+2215]

Adam M. Costello

2005-02-20 13:40:14 UTC


http://bugbug.cocolog-nifty.com∕info.nekodama.com/icons/nekodama64b.gif
http://member.wide.ad.jp/~fujiwara/test.html
This makes a strong case for application-level filtering. AMC's
blacklist will fail here, as any TLD can be used to exploit this.

Indeed.

If only domain names were big-endian like newsgroup names, IP address
literals, and pathnames, this wouldn't be a problem (and they'd sort
better). I've always wondered why domain names were different.

If only the IDN working group had chosen to distinguish
internationalized host names (which would allow only certain Unicode
categories) from internationalized domain names (which allow all visible
characters) analogous to the distinction between traditional host names
(which allow only ASCII letters, digits, and hyphen) and traditional
domain names (which allow all ASCII characters)... (I regret having
been unable to persuade people on this point. If only I had foreseen
this slash-homograph attack, maybe that would have been persuasive, but
back then people might still have believed that the registries could be
counted on to take care of it. I think maybe I even believed it.)

Apps will have to start detecting character properties such as symbol
and punctuation

Apparently so. Here is my straw man proposal from three years ago:

---quote---

The Unicode character database classifies each character as belonging to
exactly one of the following broad classes:

L: letter
M: mark
N: number
P: punctuation
S: symbol
Z: separator
C: other

We can start by examining which of these classes of ASCII characters are
allowed in ASCII host labels.

L: 52 exist, all are allowed
M: 0 exist
N: 10 exist, all are allowed
P: 23 exist, only hyphen-minus is allowed
S: 9 exist, none are allowed
Z: 1 exists, it is not allowed
C: 33 exist, none are allowed
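
The tally above can be reproduced with a few lines of Python. This is only a sketch: it assumes the general categories in Python's bundled unicodedata module, and the names LDH, exist and allowed are made up for illustration.

```python
import string
import unicodedata
from collections import Counter

# Letters, digits and hyphen-minus: what a traditional host label may contain.
LDH = set(string.ascii_letters + string.digits + "-")

exist = Counter()    # ASCII characters per broad class
allowed = Counter()  # how many of those are permitted in a host label

for cp in range(0x80):
    ch = chr(cp)
    cls = unicodedata.category(ch)[0]   # first letter is the broad class
    exist[cls] += 1
    if ch in LDH:
        allowed[cls] += 1

for cls in "LMNPSZC":
    print(f"{cls}: {exist[cls]} exist, {allowed[cls]} allowed")
```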

We can trivially extend these results to form a simple rule covering the
entire Unicode repertoire, except that we have no precedent for class
M. Since characters in class M tend to be things like diacritics, they
should be allowed. So the proposed rule is:

All characters in classes L (letter), M (mark), and N (number) are
allowed, and U+002D (hyphen-minus) is also allowed. Everything else is
forbidden.
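
As a sketch, the rule might look like this in Python. The function names are hypothetical, and the categories come from whatever Unicode version the interpreter ships with, which is not necessarily the version Nameprep is defined against.

```python
import unicodedata

def is_host_code_point(ch):
    """True for hyphen-minus and for anything in class L, M or N."""
    return ch == "-" or unicodedata.category(ch)[0] in ("L", "M", "N")

def is_host_label(label):
    """True if a non-empty label consists only of host code points."""
    return bool(label) and all(is_host_code_point(ch) for ch in label)

print(is_host_label("cocolog-nifty"))     # plain LDH label
print(is_host_label("b\u00fccher"))       # letter with diaeresis, class L
print(is_host_label("e\u0301cole"))       # combining acute accent, class M
print(is_host_label("com\u2215info"))     # U+2215 is class S
print(is_host_label("a_b"))               # underscore is class P
```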

Notice that there is no conflict with Nameprep, because Nameprep does
not prohibit any characters in classes L, M, or N.

If we were to adopt this definition of internationalized host name, it
would best be understood as an amendment of ToASCII step 3 (which checks
host name restrictions if applicable), tightening substep 3a from:

(a) Verify the absence of non-LDH ASCII code points; that is,
the absence of 0..2C, 2E..2F, 3A..40, 5B..60, and 7B..7F.

to:

(a) Verify that the sequence contains only host code points;
that is, U+002D (hyphen-minus) and code points classified
as L (letter), M (mark), or N (number). See appendix ? for
an enumeration of host code points.
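
To make the difference between the two versions of substep 3a concrete, here is a hedged Python sketch. Both function names are invented for illustration, and neither is a complete ToASCII step; they model only the check in (a).

```python
import unicodedata

# Non-LDH ASCII ranges, as listed in the current text of substep 3a.
NON_LDH_RANGES = [(0x00, 0x2C), (0x2E, 0x2F), (0x3A, 0x40),
                  (0x5B, 0x60), (0x7B, 0x7F)]

def passes_current_3a(label):
    """Current rule: reject only non-LDH ASCII; non-ASCII is not examined."""
    return not any(lo <= ord(ch) <= hi
                   for ch in label for lo, hi in NON_LDH_RANGES)

def passes_tightened_3a(label):
    """Proposed rule: accept only hyphen-minus and classes L, M, N."""
    return all(ch == "-" or unicodedata.category(ch)[0] in ("L", "M", "N")
               for ch in label)

for label in ["example", "com/info", "com\u2215info"]:
    print(repr(label), passes_current_3a(label), passes_tightened_3a(label))
```

On ASCII input the two checks agree character for character; they differ only on non-ASCII code points such as U+2215, which the current check lets through.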

Or maybe the enumeration would go in Nameprep, or in a separate document
that defines internationalized host names.

---unquote---

AMC

Adam M. Costello

2005-02-20 14:09:26 UTC


That proposal would have altered ToASCII back when it was still being
developed. It's too late for that now. If this Unicode category filter
is resurrected, it would have to be used differently than I originally
envisioned. It would have to be used to deprecate, but not disqualify,
some IDNs from use as host names.

For example, we could say that domain names containing characters
outside classes L, M, N, and hyphen-minus should not be used as host
names (which implies that they should not be used in URIs, IRIs, and
email addresses), but if they are used as host names, applications must
honor them but should display them in ASCII form or in some sort of
warning mode.
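
One way an application could act on this, sketched in Python under loud assumptions: the function names are hypothetical, labels are split on ASCII dots only, and the ASCII form is produced with Python's raw punycode codec plus the "xn--" prefix, skipping the Nameprep step that a real ToASCII would run first.

```python
import unicodedata

def is_host_code_point(ch):
    """Hyphen-minus or any code point in class L, M or N."""
    return ch == "-" or unicodedata.category(ch)[0] in ("L", "M", "N")

def display_label(label):
    """Show a label natively if it is host-safe, otherwise in ASCII form."""
    if label.isascii() or all(is_host_code_point(ch) for ch in label):
        return label
    # Simplification: no Nameprep here, only Punycode plus the ACE prefix.
    return "xn--" + label.encode("punycode").decode("ascii")

def display_domain(domain):
    """Apply display_label to each dot-separated label of a domain name."""
    return ".".join(display_label(label) for label in domain.split("."))

print(display_domain("b\u00fccher.example"))
print(display_domain("bugbug.cocolog-nifty.com\u2215info.nekodama.com"))
```

The name is still honored, but the label carrying the slash-like mark no longer renders as something that looks like a path separator.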

AMC