You can’t escape encoding
July 6, 2009 #webdev
A short while later, I saw a bug where the subject line of delivery emails looked like:
I referred the developer to the three parameter version of the
encodeText function of Java’s MimeUtility. This has a “set of methods to encode and decode MIME headers as per RFC 2047”. In this case, by using the one parameter version, the code was subject to some OS specific setting which worked on the developer’s computer running Windows but not when deployed to Linux.
¶§ into the plain text field and look at the results. The “MIME + Quoted-printable” section is the same as Encoded Word from RFC 2047.
For me, to fully understand UTF-8 encoding, I had to walk myself through an example, which I include here:
When a Unicode (ISO 10646) character in the range of 0x80 - 0x799 is encoded using UTF-8, it takes up 2 bytes of space. Since printable ISO-8859-1 characters (aka Latin 1 or upper ASCII) are in the range of 0xA0 - 0xFF - they fall into this category.
BYTE 1 = 1100 0000 | (0001 1111 & b) BTYE 2 = 1000 0000 | (0011 1111 & b)
So for a character like
é (Latin Small Letter E with Acute)
é = 233 = 0xE9 = 1110 1011
when it’s UTF-8 encoded
1100 0011 1010 1001 = 0xC3 0xA9 = Ã©
Notice that when encoding ISO-8859-1 with UTF-8 and you happen to view the output as ISO-8859-1, the first character of the two is always a capital A with some variation. The following table can help describe why.
range byte 1 display char 0x80 - 0x9F (128-159) 0xC0 À 0xA0 - 0xBF (160-191) 0xC1 Á 0xC0 - 0xDF (192-223) 0xC2 Â 0xE0 - 0xFF (224-255) 0xC3 Ã
Another “odd” situation exists when the
Copyright Sign or
Pound Sign value from unicode…
© = 169 = 0xA9 = 1010 1001 £ = 163 = 0xA3 = 1010 0011
…is UTF-8 encoded, and become the
Â character followed by themselves.
1100 0010 1010 1001 = 0xC3 0xA9 = Â© 1100 0010 1010 0011 = 0XC3 0xA3 = Â£