Discussion:
space-like unicode char
Soobok Lee
2005-02-20 05:27:49 UTC
You can paste this HTML/JavaScript snippet into an HTML file on your
web server and open it in your MSIE browser.
You will see "www.microsoft.com" isolated in the address bar from the
"mozilla.org" domain suffix.
Fortunately, you will see a blank space (no phishing page) if you have
a recent IE patch.
This won't work in Firefox 1.x, which strips those special characters
for unknown reasons before sending the URL to
the address bar.

<script>
window.open(unescape("http://www.microsoft.com%u1160%u1160%u1160%u1160%u1160%u1160.mozilla.org/"),"_blank");
</script>

U+1160 is a space-like character, and even stringprep/nameprep does not
filter it out because
the character is not a punctuation character.
U+1160 is just one example, and I guess there may be many other
characters that can be
used as substitutes for a blank.

U+1160 in the above example is placed in the third-level domain name label,
over which the .org registry cannot impose any regulations.
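
To make the point about the third-level label concrete, here is a minimal sketch that splits the example host name into labels and reports which label carries U+1160. The helper name findFillerLabels is hypothetical and only for illustration; it is not part of any browser or IDN library.

```javascript
// Hypothetical helper (an assumption for illustration): list the labels
// of a host name that contain U+1160 (HANGUL JUNGSEONG FILLER), together
// with each such label's code points in U+XXXX notation.
function findFillerLabels(host) {
  const FILLER = 0x1160;
  return host
    .split(".")
    .map((label, index) => ({
      index,
      codePoints: Array.from(label, (ch) => ch.codePointAt(0)),
    }))
    .filter((entry) => entry.codePoints.includes(FILLER))
    .map((entry) => ({
      index: entry.index,
      codePoints: entry.codePoints.map(
        (cp) => "U+" + cp.toString(16).toUpperCase().padStart(4, "0")
      ),
    }));
}

// Same host as in the window.open() example, written with \u escapes
// instead of unescape("%u1160").
const host = "www.microsoft.com" + "\u1160".repeat(6) + ".mozilla.org";
const hits = findFillerLabels(host);
console.log(JSON.stringify(hits));
```

The label index counts from the left, so the fillers sit in the label directly after "microsoft", well below the "mozilla.org" part that a registry controls.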

Soobok Lee
Soobok Lee
2005-02-20 06:08:08 UTC
For those who do not have a web server: please copy this URL into your MSIE
address bar.

javascript:void(window.open(unescape("http://www.microsoft.com%u2044%u1160%u1160.uni.cc/"),"_self"))

You will see an error page if you have a recent MSIE patch.

Soobok
Soobok Lee
2005-02-20 06:22:12 UTC
The real problem comes when "com.%1160%1160*" is Punycode-encoded into "xn--blah".
(You can increase the number of "%1160"s until the 63-character label limit is reached.)

"www.microsoft.xn--blah.uni.cc"
is decoded and displayed in its native form in MSIE/i-Nav or Firefox
1.x.
What would you see in the address bar and in the web page?

The legitimate ASCII URL http://www.microsoft.xn--blah.uni.cc would
resolve successfully and deliver the phishing page, while the end user sees
"www.microsoft.com" isolated at the beginning of the address bar.

The end user may not see the "uni.cc" part if the MSIE
window
is narrow enough to hide ".uni.cc".
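
As a rough illustration of how such a label becomes a plain ASCII "xn--" label, here is a minimal sketch of the Punycode encoding step from RFC 3492 (section 6.3). It assumes the label is "com" followed by six U+1160 characters, and it skips nameprep entirely, so it is not a full ToASCII implementation. The function names are my own.

```javascript
// Minimal Punycode encoder sketch following RFC 3492 section 6.3.
// Assumption: input is a single label; no nameprep, no overflow checks.
const BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700;
const INITIAL_BIAS = 72, INITIAL_N = 128;

// Map a digit value 0..35 to 'a'..'z' (0..25) or '0'..'9' (26..35).
function digit(d) {
  return String.fromCharCode(d < 26 ? 97 + d : 22 + d);
}

// Bias adaptation function from RFC 3492 section 6.1.
function adapt(delta, numPoints, firstTime) {
  delta = firstTime ? Math.floor(delta / DAMP) : Math.floor(delta / 2);
  delta += Math.floor(delta / numPoints);
  let k = 0;
  while (delta > Math.floor(((BASE - TMIN) * TMAX) / 2)) {
    delta = Math.floor(delta / (BASE - TMIN));
    k += BASE;
  }
  return k + Math.floor(((BASE - TMIN + 1) * delta) / (delta + SKEW));
}

function punycodeEncode(label) {
  const input = Array.from(label, (ch) => ch.codePointAt(0));
  const basic = input.filter((cp) => cp < 128);
  let output = String.fromCodePoint(...basic);
  const b = basic.length;
  let h = b;
  if (b > 0) output += "-"; // delimiter after the basic code points
  let n = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
  while (h < input.length) {
    // Smallest code point not yet handled.
    const m = Math.min(...input.filter((cp) => cp >= n));
    delta += (m - n) * (h + 1);
    n = m;
    for (const cp of input) {
      if (cp < n) delta++;
      if (cp === n) {
        // Emit delta as a generalized variable-length integer.
        let q = delta;
        for (let k = BASE; ; k += BASE) {
          const t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
          if (q < t) break;
          output += digit(t + ((q - t) % (BASE - t)));
          q = Math.floor((q - t) / (BASE - t));
        }
        output += digit(q);
        bias = adapt(delta, h + 1, h === b);
        delta = 0;
        h++;
      }
    }
    delta++;
    n++;
  }
  return output;
}

// Assumed spoofing label: "com" followed by six U+1160 fillers.
const aceLabel = "xn--" + punycodeEncode("com" + "\u1160".repeat(6));
console.log(aceLabel, aceLabel.length);
```

The resulting label contains only letters, digits and hyphens, which is why it passes as an ordinary host name label in DNS.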


Soobok
Erik van der Poel
2005-04-08 01:55:20 UTC
Post by Soobok Lee
U+1160 is a space-like char and even stringprep/nameprep does not
filter it out because the char is not for punctuational purpose.
U+1160 is HANGUL JUNGSEONG FILLER and it is used to transform
nonstandard syllables into standard ones (Unicode 3.0 section 3.11 (RFC
3454 refers to Unicode 3.2.0)). However, this transformation is one of
the additional transformations not considered part of Unicode
normalization (3.2.0's UAX #15 Annex 10). So this character is not
generated by Stringprep/Nameprep.

However, it is not prohibited either, so it may occur in the input to
(and output from) Stringprep/Nameprep. I read some of the sections on
Hangul in the Unicode book and Web site, but I did not see any rules
regarding repeated occurrences of U+1160 (as you had in your example,
not quoted above). I also did not see any rules about what to do when a
filler is not followed by a Hangul jamo. It would be nice to have these
rules in Unicode or in Stringprep.

I tried U+1160 followed by a Latin character in MSIE with i-Nav and in
Firefox with IDN turned on, and it was displayed as a wide space. It is
unfortunate that both implementations chose to display it as a space
instead of deleting it.

Erik
Soobok Lee
2005-04-08 07:05:06 UTC
Post by Erik van der Poel
U+1160 is a space-like char and even stringprep/nameprep does not
filter it out because the char is not for punctuational purpose.
U+1160 is HANGUL JUNGSEONG FILLER and it is used to transform
nonstandard syllables into standard ones (Unicode 3.0 section 3.11
(RFC 3454 refers to Unicode 3.2.0)). However, this transformation is
one of the additional transformations not considered part of Unicode
normalization (3.2.0's UAX #15 Annex 10).
Exactly. U+1160 is not "touched" by Unicode normalization (NFC).
Post by Erik van der Poel
So this character is not generated by Stringprep/Nameprep. However, it
is not prohibited either, so it may occur in the input to (and output
from) Stringprep/Nameprep.
Yes, it may occur.
Post by Erik van der Poel
I read some of the sections on Hangul in the Unicode book and Web
site, but I did not see any rules regarding repeated occurrences of
U+1160 (as you had in your example, not quoted above). I also did not
see any rules about what to do when a filler is not followed by a
Hangul jamo. It would be nice to have these rules in Unicode or in
Stringprep.
The U+1160 problem was raised 3.5 years ago (you can search this
huge idn-list archive for the keywords "1160" or "filler"),
along with some additional Hangul jamo problems. I submitted a draft
(you may find it at www.i-d-n.net)
to filter out these invalid character sequences, but the draft was
discarded. Someone argued that such filtering *complicates* the
stringprep algorithms with context-sensitive filtering/prohibiting, and that
the problem is up to the UTC/NFC, not the IETF. Of course, I couldn't accept that.

Anyway, we can't go back to December 2002 without giving up the backward
compatibility promise of stringprep.
Post by Erik van der Poel
I tried U+1160 followed by a Latin character in MSIE with i-Nav and in
Firefox with IDN turned on, and it was displayed as a wide space. It
is unfortunate that both implementations chose to display it as a
space instead of deleting it.
Yes. Plugins MUST filter out U+1160 from validated labels returned by
ToUnicode(), whether or not IDNA requires that.

Soobok
Soobok Lee
2005-04-08 07:21:28 UTC
Post by Soobok Lee
Post by Erik van der Poel
I tried U+1160 followed by a Latin character in MSIE with i-Nav and in
Firefox with IDN turned on, and it was displayed as a wide space. It
is unfortunate that both implementations chose to display it as a
space instead of deleting it.
Yes. Plugins M U S T filter out U+1160 from validated ToUnicode()ed
labels, whether or not IDNA requires that.
Soobok
I will add this: in the standard Hangul writing system,
U+1160 is meaningful only in some contexts (adjacent to at least one
jamo character).
But is a standalone U+1160 illegal? No, it is NOT illegal.

So blind filtering of U+1160 is a mistake. Plugins' filtering should be
context-sensitive.
That is why it would complicate stringprep if it were included in
stringprep. :-)
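
A context-sensitive check of the kind described above could look roughly like the following sketch. It rests on two assumptions of mine, not on any spec: "jamo" is taken to mean the Hangul Jamo block U+1100..U+11FF (excluding U+1160 itself), and a filler is treated as suspicious when neither neighbour is a jamo. The function names are hypothetical.

```javascript
const FILLER = 0x1160; // HANGUL JUNGSEONG FILLER

// Assumption: "jamo" = Hangul Jamo block U+1100..U+11FF, minus U+1160.
function isJamo(cp) {
  return cp !== undefined && cp >= 0x1100 && cp <= 0x11ff && cp !== FILLER;
}

// Returns true if the label contains a U+1160 that has no jamo
// directly before or after it (an assumed reading of "in context").
function hasStrayFiller(label) {
  const cps = Array.from(label, (ch) => ch.codePointAt(0));
  return cps.some(
    (cp, i) => cp === FILLER && !isJamo(cps[i - 1]) && !isJamo(cps[i + 1])
  );
}

console.log(hasStrayFiller("com\u1160\u1160")); // fillers after Latin letters
console.log(hasStrayFiller("\u1100\u1160"));    // filler after a leading consonant
```

A plugin could use such a check to decide whether to display the label in its ASCII form instead, rather than deleting every U+1160 blindly.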

We can find similar problems with the "combining diacritical marks" (U+03xx).
What if
a label consists of a single combining accent or dot-above character
with no preceding
letter? It will combine with the preceding dot delimiter, and that
will produce
a confusing appearance (it looks like a colon, which is a protocol delimiter).

AFAIK, a single standalone combining accent character is not prohibited by
stringprep.
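
The combining-mark case can be detected with a simple check on the first code point of each label. The sketch below uses the Unicode property escape \p{M} (any mark) as an approximation of "combining accent"; the function name and the example host are made up for illustration.

```javascript
// Hypothetical helper: return the labels of a host name whose first
// code point is a Unicode mark (general category M), which would
// visually attach to the preceding dot delimiter.
function labelsStartingWithMark(host) {
  return host.split(".").filter((label) => /^\p{M}/u.test(label));
}

// Made-up example: the second label starts with U+0307 COMBINING DOT ABOVE.
const suspicious = labelsStartingWithMark("www.\u0307example.com");
console.log(suspicious.length);
```

Since a mark at the start of a label has no base letter inside that label, a display layer could treat any hit as a reason to fall back to the ASCII form.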

Soobok
Erik van der Poel
2005-04-08 19:36:24 UTC
Post by Soobok Lee
The U+1160 problem was raised 3.5 years ago (you can search this
huge idn-list archive for the keywords "1160" or "filler"),
along with some additional Hangul jamo problems. I submitted a draft
(you may find it at www.i-d-n.net)
to filter out these invalid character sequences, but the draft was
discarded. Someone argued that such filtering *complicates* the
stringprep algorithms with context-sensitive filtering/prohibiting, and that
the problem is up to the UTC/NFC, not the IETF. Of course, I couldn't accept that.
The i-d-n.net name no longer takes you to a real site, but I believe I
found your draft here:

http://www.watersprings.org/pub/id/draft-ietf-idn-hangeulchar-00.txt

I agree that the U+1160 issues would complicate a spec, and I can see
why the IETF decided not to include them in the RFCs, but now that we
have seen that a number of implementations display this character in a
potentially dangerous way, we should reconsider the specs.

Unicode may not be able to address these issues in the normalization
spec since they have promised not to make any incompatible changes.
Unicode might be able to address the issues in other normative or
informative parts of their book or documents, and the IETF might just
want to refer to those parts of Unicode.

Alternatively, the IETF can write up its own specifications or
recommendations. It's not immediately clear to me whether U+1160 ought
to be addressed in Stringprep or Nameprep. As we have seen, Stringprep
is used in various protocols, including SASLprep, which is for user
names and passwords. Some perverse people might suggest that passwords
ought to allow strange character sequences like multiple consecutive
U+1160s in order to make it harder to guess the password. I'm new to
Stringprep, so I don't know how most IETFers feel about this type of thing.

In the meantime, I have added U+1160 and the combining mark issue to my
list and I have filed a bug report for Mozilla:

http://nameprep.org/#display
https://bugzilla.mozilla.org/show_bug.cgi?id=289588

Erik
