IDNA flaws with regard to U+2024

In a bug report against libidn, Erik van der Poel gives an example of an internationalized domain name that is handled differently by different implementation. Another example of one such string is:

‘räksmörgÃ¥s’ U+2024 ‘com’

If your browser supports Unicode, the string is: räksmörgÃ¥s․com. Use cut’n’paste of the string into your browser and see what it tries to lookup (please let me know what you notice!).

The problem with this string is that it is on the form “[non-ASCII][DOT-Like code point]com”. Here ‘räksmörgÃ¥s’ represents the non-ASCII string, which can be any non-ASCII string. Further, the U+2024 represent one character which looks like a dot, there are others that also contain dot-like characters.

The IDNA algorithm (section 3.1) implies that applications should treat the string as one label. The U+2024 character is not one of the dot-like characters that needs to be treated as a label separator. The ASCII string which is output after applying the IDNA algorithm is:

xn--rksmrgs.com-l8as9u

Note that the string contains an ASCII dot ‘.’ (0x0E). If applications are not careful how they resolv the name in the DNS, they will request information in a non-existing top-level domain ‘com-l8as9u’. This is because the DNS do not use ‘.’ to separate labels, but instead uses a length-value pair for each label. Thus the wrong string to lookup would be:

(11)xn--rksmrgs(10)com-l8as9u

Whereas the right string to lookup would be:

(22)xn--rksmrgs.com-l8as9u

Using DNS master file syntax, the name to lookup is xn--rksmrgs.com-l8as9u.

What’s interesting here is that some implementations, such as Microsoft Internet Explorer and Firefox implements IDNA not according to the standard. Instead, they compute the following string:

xn--rksmrgs-5wao1o.com

Arguable, this is a better approach than what is specified by RFC 3490. MSIE/Firefox recognize that U+2024 is a “dot-like” character, by using NFKC. What is debatable is whether U+2024 will actually occur in practice, Unicode expert Kenneth Whistler says U+2024 will not be entered accidentally.

As the maintainer of GNU Libidn, I’m not yet sure about what to do about the situation. The conservative approach is to do nothing until the RFCs are updated. I have come up with a patch to add a new IDNA flag that treat U+2024 as a dot-like character early on. This would at least make it possible to produce the same (RFC non-conforming) output that MSIE/Firefox computes.