http://bugbug.cocolog-nifty.com∕info.nekodama.com/icons/nekodama64b.gif
http://member.wide.ad.jp/~fujiwara/test.html
This makes a strong case for application-level filtering. AMC's
blacklist will fail here, as any TLD can be used to exploit this.
Indeed.
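To make the problem concrete, here is a minimal Python sketch. It assumes the slash-like character in the example URL above is U+2215 DIVISION SLASH; the helper names describe() and host_part() are mine, purely for illustration. A naive parser that splits on the real slash treats the whole string up to /icons/ as the host, so the name actually resolved lies under nekodama.com.

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


def host_part(url):
    """Naive host extraction: the text between '://' and the first real '/'."""
    rest = url.split("://", 1)[1]
    return rest.split("/", 1)[0]


# Assumption: the look-alike character in the example is U+2215.
url = "http://bugbug.cocolog-nifty.com\u2215info.nekodama.com/icons/nekodama64b.gif"

if __name__ == "__main__":
    print(describe("/"))        # the real path separator
    print(describe("\u2215"))   # the look-alike, a symbol rather than punctuation
    print(host_part(url))       # host still contains the look-alike
    print(host_part(url).split(".")[-2:])  # the last two labels of the host
```

The look-alike is in class S (symbol), so a check on character properties would catch it no matter which TLD is involved.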
If only domain names were big-endian like newsgroup names, IP address
literals, and pathnames, this wouldn't be a problem (and they'd sort
better). I've always wondered why domain names were different.
If only the IDN working group had chosen to distinguish
internationalized host names (which would allow only certain Unicode
categories) from internationalized domain names (which allow all visible
characters) analogous to the distinction between traditional host names
(which allow only ASCII letters, digits, and hyphen) and traditional
domain names (which allow all ASCII characters)... (I regret having
been unable to persuade people on this point. If only I had foreseen
this slash-homograph attack, maybe that would have been persuasive, but
back then people might still have believed that the registries could be
counted on to take care of it. I think maybe I even believed it.)
Apps will have to start detecting character properties such as symbol
and punctuation.
Apparently so. Here is my straw man proposal from three years ago:
---quote---
The Unicode character database classifies each character as belonging to
exactly one of the following broad classes:
L: letter
M: mark
N: number
P: punctuation
S: symbol
Z: separator
C: other
We can start by examining which of these classes of ASCII characters are
allowed in ASCII host labels.
L: 52 exist, all are allowed
M: 0 exist
N: 10 exist, all are allowed
P: 23 exist, only hyphen-minus is allowed
S: 9 exist, none are allowed
Z: 1 exists, it is not allowed
C: 33 exist, none are allowed
We can trivially extend these results to form a simple rule covering the
entire Unicode repertoire, except that we have no precedent for class
M. Since characters in class M tend to be things like diacritics, they
should be allowed. So the proposed rule is:
All characters in classes L (letter), M (mark), and N (number) are
allowed, and U+002D (hyphen-minus) is also allowed. Everything else is
forbidden.
Notice that there is no conflict with Nameprep, because Nameprep does
not prohibit any characters in classes L, M, or N.
If we were to adopt this definition of internationalized host name, it
would best be understood as an amendment of ToASCII step 3 (which checks
host name restrictions if applicable), tightening substep 3a from:
(a) Verify the absence of non-LDH ASCII code points; that is,
the absence of 0..2C, 2E..2F, 3A..40, 5B..60, and 7B..7F.
to:
(a) Verify that the sequence contains only host code points;
that is, U+002D (hyphen-minus) and code points classified
as L (letter), M (mark), or N (number). See appendix ? for
an enumeration of host code points.
Or maybe the enumeration would go in Nameprep, or in a separate document
that defines internationalized host names.
---unquote---
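For anyone who wants to experiment, the rule above can be sketched in a few lines of Python. This is only an illustration, not part of the proposal: the function names are mine, the check stands in for substep 3a alone (it assumes Nameprep has already been applied and skips the rest of ToASCII), and the result for any given character depends on the version of the Unicode character database bundled with the interpreter. The ascii_tally() helper recomputes the ASCII class counts listed in the quote.

```python
import unicodedata
from collections import Counter

HYPHEN_MINUS = "\u002d"


def major_class(ch):
    """Return the broad class (L, M, N, P, S, Z, or C) of one character."""
    return unicodedata.category(ch)[0]


def is_host_code_point(ch):
    """Proposed rule: classes L, M, and N, plus U+002D, are allowed."""
    return ch == HYPHEN_MINUS or major_class(ch) in ("L", "M", "N")


def is_host_label(label):
    """True if a non-empty label contains only host code points."""
    return len(label) > 0 and all(is_host_code_point(ch) for ch in label)


def ascii_tally():
    """Count ASCII characters per class, and how many the rule allows."""
    totals = Counter()
    allowed = Counter()
    for cp in range(0x80):
        ch = chr(cp)
        cls = major_class(ch)
        totals[cls] += 1
        if is_host_code_point(ch):
            allowed[cls] += 1
    return totals, allowed


if __name__ == "__main__":
    totals, allowed = ascii_tally()
    for cls in "LMNPSZC":
        print(cls, totals[cls], "exist,", allowed[cls], "allowed")
    # Labels are checked one at a time, after splitting the name on dots.
    for label in ("cocolog-nifty", "com\u2215info", "a_b"):
        print(repr(label), is_host_label(label))
```

For ASCII input this accepts exactly the letters, digits, and hyphen-minus, so it agrees with the existing substep 3a there. It does not check where in a label the hyphen appears.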
AMC