David Baron's weblog: February 2006


Sunday 2006-02-12

Character encodings (addendum) (10:05 -0800)

Mike Hommey wrote me with a good point about my earlier entry: the statement that "If everybody used UTF-8, all these problems could go away" isn't really true, because of one problem: the unification of CJK characters means that a single codepoint in Unicode sometimes represents a character that is conceptually the same across languages, but is drawn in significantly different ways in Simplified Chinese, Traditional Chinese, and Japanese.

So we really need textual data to specify its human language in addition to its character encoding. Some of the old-style character encodings can be used to imply a certain language, but UTF-8 cannot. So switching to UTF-8 would increase the need for language identification (although the problems language identification fixes would only appear for an even smaller subset of languages than the current problems related to character encoding identification).
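A commonly cited example of this unification is the Han character 直 ("straight", U+76F4), whose expected glyph shape differs between Chinese and Japanese typography even though it is one and the same codepoint. A quick sketch in modern Python shows why the bytes alone can't settle which drawing is intended:

```python
# U+76F4 is a single Unicode codepoint shared by Chinese and Japanese text.
# Its UTF-8 bytes are identical in both languages, so the byte stream
# carries no hint of which language's glyph shape the author intended.
ch = "\u76f4"
assert len(ch) == 1                          # one codepoint
assert ch.encode("utf-8") == b"\xe7\x9b\xb4" # same three bytes in any language
```

This is why language identification (for example, a language tag on the document) is needed on top of the encoding: the encoding pins down the codepoints, not how they should look.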

But that doesn't detract from the point that you're always better off explicitly specifying your character encoding.

And in fact, I wanted to mention one other thing along those lines: yet another reason for specifying encoding is form input. If your Web page has forms in it, and you don't specify an encoding on the page or in the form, then you won't know the encoding of the data you get back from the form. And you'll have the same problem all over again, but in your database. (And for forms, it's particularly good to use an encoding like UTF-8, since any characters that the user types will work.)
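Here's a sketch in modern Python of what the form problem looks like on the server side (the names here are illustrative, not any particular server API). Suppose the user typed "Résumé" and the browser, lacking a declared encoding, submitted the form as windows-1252, while the server assumed UTF-8:

```python
# The raw bytes the server receives if the browser submitted windows-1252:
submitted = "Résumé".encode("windows-1252")   # b'R\xe9sum\xe9'

# A server that assumes UTF-8 can't decode them cleanly:
print(submitted.decode("utf-8", errors="replace"))   # R�sum�

# With UTF-8 declared on the page (and so used by the form), the round
# trip is lossless for any character the user can type:
assert "Résumé".encode("utf-8").decode("utf-8") == "Résumé"
```

That garbled string is what ends up in your database, and by then the original bytes' encoding is anyone's guess.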

Why Web authors must specify character encodings (00:43 -0800)

[ I've been looking for a clear explanation of why it's important to label Web pages with character encodings. I want it because I want something to point to when emailing Web authors who don't label theirs, when their pages cause problems for me (and my default of UTF-8, which I chose primarily so that I can write UTF-8 in Mozilla's Bugzilla, which is still broken). I couldn't find one, so here it is. I've chosen to be a little bit loose with terminology in order to get the point across more clearly. ]

In the earlier days of computers, different parts of the world had different ways to convert the ones and zeros stored in a computer file into characters that mean something to a human. Different ways allowed different sets of characters to be represented. (In the even earlier days of computers, they often differed between different manufacturers, but the market took care of that problem because users had trouble moving data from one system to another.)

Then the Web came about. The World Wide Web. And suddenly people were reading things written in another part of the world. But even if they knew the language, the characters didn't always come out right, because the computers were sending ones and zeros from end to end, and the computers at the two ends sometimes thought the same sequence of ones and zeros represented different characters. The same sequence could be interpreted as "jalapeños", "jalapeños", "jalape単os", "橡污灥쎱潳", or "慪慬数뇃獯".

A way of converting sequences of ones and zeros to characters, or the other way around, is called a character encoding. The files sent on the Web are still just sequences of ones and zeros, but the Web browser needs to know what characters those ones and zeros represent. So it needs to know which encoding to use.
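You can reproduce the first mojibake variant above in modern Python: one byte sequence, two encodings, two different strings.

```python
# The same sequence of ones and zeros (here, bytes), read under two
# different character encodings:
data = "jalapeños".encode("utf-8")    # b'jalape\xc3\xb1os'

print(data.decode("utf-8"))           # jalapeños  -- what the author meant
print(data.decode("windows-1252"))    # jalapeÃ±os -- what a mislabeled page shows
```

The bytes never changed; only the rule for turning them back into characters did.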

Bad Web content often doesn't tell the browser which encoding to use, so Web browsers often guess based on the user's language. So an American's Web browser would guess the normal encoding used for English, but a Japanese user's browser would guess the normal encoding used for Japanese. This means that even if a page works fine for you and all your colleagues, it might not work for somebody in Japan. Or for me, because I have my browser configured strangely. (The reason Web browsers do this is that the first Web browsers just displayed the ones and zeros the way they were interpreted on the computer where the browser was running. So authors got used to pages that they wrote working for themselves, and for other people in their country. So keeping these old pages working and making the Web truly World Wide became conflicting goals.) Some Web browsers even try to guess which encoding the page uses by looking at the pattern of ones and zeros.

Many encodings have the same rules for dealing with the characters that were in ASCII, a very old character encoding that includes only the unaccented letters, numbers, and a small number of symbols. This means problems with encodings often don't show up until you use characters outside of ASCII, such as the copyright sign ©, the accented characters in Résumé or jalapeño, or the Euro sign €. That's why the example above had three variants that were mostly the same. So just because some of your pages work both for you and people elsewhere in the world doesn't mean they all will.
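A short Python check makes the ASCII-compatibility point concrete: an all-ASCII string has identical bytes under several common encodings, so the mismatch is invisible until a non-ASCII character shows up.

```python
# ASCII-only text produces the same bytes under many encodings, so a
# missing or wrong encoding label goes unnoticed:
for enc in ("ascii", "utf-8", "windows-1252", "iso-8859-1"):
    assert "jalapenos".encode(enc) == b"jalapenos"

# Add one non-ASCII character and the encodings disagree:
assert "jalapeño".encode("utf-8") != "jalapeño".encode("windows-1252")
```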

Your operating system or your editor has an encoding that it uses by default when you type characters into a file. If you know what encoding that is, you can label your pages using it. Once you've done this, there's a much higher chance that if the characters in the page work for you, they'll work just as well for everybody else, at least as long as they have a font that can display the character. If you don't know what it is, you should guess, label your pages anyway, and test to see whether it worked. If it worked, then it's probably OK, but remember to test again if you use unusual characters for the first time. Finding out for sure is better than guessing, though.
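If you have Python handy, its standard library offers one way to peek at the default your system is likely using (this is a hint, not a guarantee: your editor may well be configured differently, so checking its own settings is still the sure way to find out):

```python
import locale
import sys

# The encoding your operating system prefers for text files -- often what
# your editor falls back to when it has no explicit setting:
print(locale.getpreferredencoding())  # e.g. "UTF-8" on many modern systems

# Python 3's own default for str/bytes conversions is always UTF-8:
print(sys.getdefaultencoding())       # utf-8
```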

One final note, about two technical terms that often come up in discussions of this topic: Unicode is the standard that contains the "complete" list of characters. UTF-8 is a new(er) encoding that can encode all the characters in Unicode. If everybody used UTF-8, all these problems could go away.
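A last Python sketch shows why UTF-8 is different in kind from the old encodings: it can carry characters from any script Unicode covers in a single byte stream, which no single legacy encoding could do.

```python
# One UTF-8 byte stream can mix scripts freely:
s = "English, Français, 日本語, Ελληνικά, €"
assert s.encode("utf-8").decode("utf-8") == s

# A legacy single-language encoding cannot represent the whole string:
try:
    s.encode("windows-1252")
except UnicodeEncodeError:
    print("windows-1252 can't encode the Japanese or Greek characters")
```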