Posted: Tue Aug 17, 2010 12:20 pm
Post subject: UTF-8 Encoding of directory dump dir.xiph.org/yp.xml broken
The UTF-8 encoding of the directory dump (dir.xiph.org/yp.xml) is broken.
It seems that UTF-8 is applied multiple times to encode the output.
Simply open the directory dump dir.xiph.org/yp.xml in your browser to see the effects of the over-encoding.
For instance, the German word für becomes fÃÂ¼r.
However, für is displayed correctly on the website version of the directory, so something is going wrong when yp.xml is generated.
Example of over-encoded ü (all byte values in hex):
ü in UTF-8 = c3 bc
c3 bc encoded again as UTF-8 = c3 83 c2 bc
c3 83 c2 bc encoded again as UTF-8 = c3 83 c2 83 c3 82 c2 bc
ü found in yp.xml = c3 83 c3 82 c2 bc
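The byte sequences above can be reproduced with a short Python sketch. It assumes that each extra round treats the existing UTF-8 bytes as ISO-8859-1 (Latin-1) text and encodes them again, and that the filter drops the whole C1 control range (U+0080 to U+009F). Both are my guesses from the observed bytes, not something confirmed from the directory's code. The function names are only for illustration.

```python
def over_encode(text: str, rounds: int) -> bytes:
    """Encode text as UTF-8, then re-encode the result (rounds - 1) more times.

    Assumption: every extra round misreads the bytes as Latin-1 characters.
    """
    data = text.encode("utf-8")
    for _ in range(rounds - 1):
        data = data.decode("latin-1").encode("utf-8")
    return data


def strip_c1_controls(data: bytes) -> bytes:
    """Drop the characters U+0080..U+009F from UTF-8 data.

    Assumption: this mimics the filter that removed c2 83 (U+0083).
    """
    text = data.decode("utf-8")
    kept = "".join(ch for ch in text if not 0x80 <= ord(ch) <= 0x9F)
    return kept.encode("utf-8")


def hexdump(data: bytes) -> str:
    """Format bytes as space-separated lowercase hex pairs."""
    return " ".join(f"{b:02x}" for b in data)


u_umlaut = "\u00fc"  # the character ü

print(hexdump(over_encode(u_umlaut, 1)))  # c3 bc
print(hexdump(over_encode(u_umlaut, 2)))  # c3 83 c2 bc
print(hexdump(over_encode(u_umlaut, 3)))  # c3 83 c2 83 c3 82 c2 bc
print(hexdump(strip_c1_controls(over_encode(u_umlaut, 3))))  # c3 83 c3 82 c2 bc
```

Under these assumptions, three rounds of encoding followed by the control character filter give exactly the bytes found in yp.xml.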
It is not possible to undo the over-encoding by simply decoding UTF-8 multiple times, because some control characters (such as UTF-8 c2 83, see above) have been filtered out.
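A small Python sketch of why the repeated decoding fails, again assuming the extra rounds went through Latin-1 (my assumption; `undo_one_round` is a hypothetical helper, not part of any library):

```python
def undo_one_round(data: bytes) -> bytes:
    """Reverse one extra encoding round: read the bytes as UTF-8,
    then map each character back to a single Latin-1 byte."""
    return data.decode("utf-8").encode("latin-1")


# Bytes as found in yp.xml (c2 83 already filtered out).
found = bytes.fromhex("c383c382c2bc")

# Bytes as they would look without the filter (triple-encoded ü).
intact = bytes.fromhex("c383c283c382c2bc")

# Without the filter, two undo rounds give back plain UTF-8 for ü.
restored = undo_one_round(undo_one_round(intact)).decode("utf-8")

# With the filter, the first undo round yields c3 c2 bc, which is no
# longer valid UTF-8, so the second round raises UnicodeDecodeError.
once = undo_one_round(found)
try:
    undo_one_round(once)
    recovered = True
except UnicodeDecodeError:
    recovered = False

print(restored == "\u00fc", once.hex(), recovered)
```

After the first round the lead byte c3 is followed by c2 instead of a continuation byte, so the decoder has nothing valid to work with. Any repair would have to guess where the missing c2 83 belongs.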
I also opened a ticket at trac for this issue, see Ticket#1729.