IDNA flaws with regard to U+2024 – Simon Josefsson's blog

In a bug report against libidn, Erik van der Poel gives an example of an internationalized domain name that is handled differently by different implementations. Another example of one such string is:

‘räksmörgås’ U+2024 ‘com’

If your browser supports Unicode, the string is: räksmörgås․com. Cut’n’paste the string into your browser and see what it tries to look up (please let me know what you notice!).

The problem with this string is that it is of the form “[non-ASCII][dot-like code point]com”. Here ‘räksmörgås’ stands in for the non-ASCII part, which can be any non-ASCII string. U+2024 (ONE DOT LEADER) is one character that looks like a dot; there are other code points that also look like, or decompose into, dots.
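To get a feel for how many such code points exist, the following sketch scans Unicode for characters whose NFKC form contains an ASCII dot. It relies on Python's `unicodedata` module, whose tables follow the interpreter's Unicode version rather than the Unicode 3.2 tables that Nameprep is pinned to, so the exact set is only an approximation. The helper name `dot_like_code_points` is made up for this illustration.

```python
import unicodedata


def dot_like_code_points():
    """Return code points (other than U+002E) whose NFKC form contains '.'."""
    found = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not characters, skip them
        ch = chr(cp)
        if ch != "." and "." in unicodedata.normalize("NFKC", ch):
            found.append(cp)
    return found


if __name__ == "__main__":
    for cp in dot_like_code_points():
        # Some code points may have no name in the local tables, hence the default.
        print(f"U+{cp:04X} {unicodedata.name(chr(cp), '?')}")
```

The list includes U+2024 along with relatives such as U+2025 (TWO DOT LEADER) and U+2026 (HORIZONTAL ELLIPSIS), which normalize to two and three dots respectively.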

The IDNA algorithm (RFC 3490, section 3.1) implies that applications should treat the string as one label: U+2024 is not one of the dot-like characters that need to be treated as label separators. Only later, inside ToASCII, does the NFKC step of Nameprep map U+2024 to U+002E. The ASCII string that is output after applying the IDNA algorithm is:

xn--rksmrgs.com-l8as9u
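The strict reading can be reproduced with the `ToASCII` function from Python's standard `encodings.idna` module, which implements RFC 3490 with UseSTD3ASCIIRules unset. This is a minimal sketch, not libidn's code; the names `RFC3490_DOTS` and `strict_to_ascii` are invented for the example.

```python
import re
from encodings.idna import ToASCII

# The four label separators listed in RFC 3490 section 3.1.
# U+2024 is deliberately absent.
RFC3490_DOTS = re.compile("[\u002E\u3002\uFF0E\uFF61]")

# 'räksmörgås' U+2024 'com', written with escapes to survive any encoding.
NAME = "r\u00E4ksm\u00F6rg\u00E5s\u2024com"


def strict_to_ascii(name):
    """Split on the RFC 3490 separators only, then run ToASCII on each label."""
    labels = RFC3490_DOTS.split(name)
    return [ToASCII(label).decode("ascii") for label in labels]


if __name__ == "__main__":
    print(strict_to_ascii(NAME))
```

Because the input contains none of the four separators, the function returns a list with a single element: one label that has an ASCII dot in the middle of it.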

Note that the string contains an ASCII dot ‘.’ (0x2E). If applications are not careful about how they resolve the name in the DNS, they will request information in a non-existent top-level domain ‘com-l8as9u’. This is because the DNS does not use ‘.’ to separate labels, but instead uses a length-value pair for each label. Thus the wrong string to look up would be:

(11)xn--rksmrgs(10)com-l8as9u

Whereas the right string to look up would be:

(22)xn--rksmrgs.com-l8as9u

Using DNS master file syntax, where a backslash escapes a dot inside a label, the name to look up is xn--rksmrgs\.com-l8as9u.
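The difference between the two lookups is easiest to see in the on-the-wire form, where each label is preceded by a one-octet length and the name ends with a zero octet for the root. The sketch below builds both encodings; `to_wire` is a hypothetical helper and ignores compression and the 255-octet limit on whole names.

```python
def to_wire(labels):
    """Encode a sequence of byte-string labels as a DNS wire-format name."""
    out = bytearray()
    for label in labels:
        if not 0 < len(label) <= 63:
            raise ValueError("each label must be 1..63 octets long")
        out.append(len(label))  # length octet
        out += label            # label octets, dots included verbatim
    out.append(0)               # zero-length root label ends the name
    return bytes(out)


ACE = b"xn--rksmrgs.com-l8as9u"

# Careless: split the ToASCII output on '.', giving two labels (11 and 10 octets).
wrong = to_wire(ACE.split(b"."))

# Careful: keep the ToASCII output as the single 22-octet label it is.
right = to_wire([ACE])

if __name__ == "__main__":
    print(wrong)
    print(right)
```

Passing the ACE string to a resolver API that takes dotted text, without escaping the dot, amounts to the first variant.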

What’s interesting here is that some implementations, such as Microsoft Internet Explorer and Firefox, do not implement IDNA according to the standard. Instead, they compute the following string:

xn--rksmrgs-5wao1o.com

Arguably, this is a better approach than what is specified by RFC 3490. MSIE/Firefox recognize that U+2024 is a “dot-like” character, by using NFKC. What is debatable is whether U+2024 will actually occur in practice: Unicode expert Kenneth Whistler says U+2024 will not be entered accidentally.
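One way to arrive at the MSIE/Firefox output is to apply NFKC to the whole name before splitting it into labels. The sketch below does exactly that with Python's standard library; it is a guess at the effect, not code from either browser, and the function name `nfkc_first_to_ascii` is invented here.

```python
import re
import unicodedata
from encodings.idna import ToASCII

# 'räksmörgås' U+2024 'com', written with escapes to survive any encoding.
NAME = "r\u00E4ksm\u00F6rg\u00E5s\u2024com"


def nfkc_first_to_ascii(name):
    """Normalize the whole name with NFKC, then split and ToASCII each label."""
    normalized = unicodedata.normalize("NFKC", name)
    # After NFKC, U+FF0E has become '.' and U+FF61 has become U+3002,
    # so only these two separators can remain.
    labels = re.split("[.\u3002]", normalized)
    return ".".join(ToASCII(label).decode("ascii") for label in labels)


if __name__ == "__main__":
    print(nfkc_first_to_ascii(NAME))
```

With this ordering U+2024 has already turned into U+002E by the time the name is split, so ‘räksmörgås’ and ‘com’ end up as separate labels.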

As the maintainer of GNU Libidn, I’m not yet sure what to do about the situation. The conservative approach is to do nothing until the RFCs are updated. I have come up with a patch to add a new IDNA flag that treats U+2024 as a dot-like character early on. This would at least make it possible to produce the same (RFC non-conforming) output that MSIE/Firefox compute.

3 Replies to “IDNA flaws with regard to U+2024”

Erik van der Poel on 2008-01-14 at 18:02 said:

RFC 3490 section 4 steps 4) and 5) would seem to indicate that the ToASCII’d labels are recombined, inserting U+002E between each pair of labels. Presumably, the recombined domain name is then looked up (resolved) using gethostbyname() or whatever, so the U+002E’s that were produced by NFKC in the Nameprep step then become label separators.

In other words, I don’t think RFC 3490 intended this weird NFKC/dot issue to become some backhanded way to get 0x2E into a DNS packet a la RFC 1035’s \. syntax (escaped dot).

http://josefsson.org/ on 2008-01-14 at 18:24 said:

I’ve tested Safari on Mac OS X, and it behaves like libidn here.

http://josefsson.org/ on 2008-01-14 at 18:48 said:

For what it’s worth, I tend to agree with Erik that RFC 3490 probably didn’t intend for ToASCII to be able to output U+002E’s in the middle of a label. Alas, the intent and the words don’t match.