Unicodified

February 5, 2006

If you have ever had even the vaguest query about Unicode, character sets, character encoding or why some symbols do or do not appear on a web page/e-mail message/RSS feed, you may find enlightenment in the following article by Joel Spolsky.

“The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”

In this article I’ll fill you in on exactly what every working programmer should know. All that stuff about “plain text = ascii = characters are 8 bits” is not only wrong, it’s hopelessly wrong, and if you’re still programming that way, you’re not much better than a medical doctor who doesn’t believe in germs. Please do not write another line of code until you finish reading this article.

Admittedly, Spolsky’s intended audience is people who write software, but having found it extremely useful in answering questions I’ve had for a while now, I’m putting it out there as something every web designer might find useful as well (and probably should know if committed to internationalisation as an accessibility issue).

On the technical side, the most helpful part of the discussion for me was Spolsky’s explanation of the difference between the Unicode character set (the total set of characters that can be represented in Unicode) and the various available character encodings (methods of implementing the Unicode character set, or a subset of it, in documents). He also provides some interesting historical notes to put the whole mess into context. Unicode comes out smelling quite sweet once Spolsky has broken it down.
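To make that distinction concrete for myself, here is a minimal Python sketch (mine, not Spolsky’s). It takes one character, the euro sign, whose position in the Unicode character set is the code point U+20AC, and shows that each encoding turns that same code point into a different sequence of bytes. The variable names are just illustrative choices.

```python
# One abstract character, identified by its Unicode code point.
ch = "\u20ac"          # EURO SIGN
code_point = ord(ch)   # the character's number in the Unicode character set

# The same code point, serialised by three different Unicode encodings.
encodings = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, raw in encodings.items():
    print(f"U+{code_point:04X} as {name}: {raw.hex()}")

# ASCII covers only the first 128 code points, so it cannot
# represent this character at all.
try:
    ch.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print("Representable in ASCII:", ascii_ok)
```

The character never changes; only the bytes used to store it do. That is why a document has to declare which encoding it was saved in.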

The upshot of all of this, at least for me, is the following two inferences (feel free to weigh in if you think them invalid):

  1. At the least, committed web designers should be using the UTF-8 character encoding (i.e. “charset=utf-8” in the Content-Type meta tag) on all of their web pages.
  2. Although UTF-8 is a superset of ASCII (the “plain text” character set), using it does not guarantee that fancy characters beyond the first 128 (the legacy, ASCII-compatible range) will display: they may still fail to appear if the user agent’s underlying system does not recognise them. (I think this explains why my fancy bullet characters display on Firefox/Win running under Windows XP, but not on Firefox/Win running under Windows 2000, which has a reduced default character set/font library.)
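Both inferences can be poked at with a rough Python sketch, assuming the fancy bullet in question is U+2022 (BULLET), which is my guess at a typical example rather than anything established above. It checks that pure ASCII text produces identical bytes under ASCII and UTF-8, that the bullet needs three bytes in UTF-8, and what happens when those bytes are read under the wrong charset (Windows-1252 here, chosen purely for illustration).

```python
import unicodedata

# Text limited to the first 128 characters is byte-for-byte identical
# in ASCII and UTF-8, which is what makes UTF-8 a safe default.
ascii_text = "plain text"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# A character beyond the first 128 needs more than one byte in UTF-8.
bullet = "\u2022"                       # assumed example character
bullet_name = unicodedata.name(bullet)  # official Unicode character name
bullet_bytes = bullet.encode("utf-8")

# If the page is labelled with the wrong charset, the same bytes are
# decoded as three unrelated characters.
misread = bullet_bytes.decode("cp1252")

print(same_bytes, bullet_name, bullet_bytes.hex(), misread)
```

What the sketch cannot show is the font side of things. Even when the bytes are decoded correctly, the character only appears if some installed font has a glyph for it, which would fit my Windows 2000 observation.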

My Unicode education is still in the steep part of the curve, however, so if any readers can enlighten me further I’d be happy to hear how. Meanwhile, I wish Spolsky would write for ALA. They could do with an article or two urging the faithful to unicodify.



Zero to One-Eighty contains writing on design, opinion, stories and technology.