Mike Hommey wrote me with
a good point about my earlier entry: the
statement that "If everybody used UTF-8, all these problems could go
away"
isn't really true, because of one problem: the unification of CJK
characters means that the same codepoint in Unicode is sometimes
used for a character that is conceptually the same but is drawn in
significantly different ways in Simplified Chinese, Traditional
Chinese, and Japanese.
So we really need textual data to specify its human language in
addition to its character encoding. Some of the old-style character
encodings can be used to imply a certain language, but UTF-8 cannot. So
switching to UTF-8 would increase the need for language identification
(although the problems that language identification fixes would show up
for an even smaller subset of languages than the current problems
related to character encoding identification do).
But that doesn't detract from the point that you're always better off
explicitly specifying your character encoding.
And in fact, I wanted to mention one other thing along those lines:
yet another reason for specifying encoding is form input. If your Web
page has forms in it, and you don't specify an encoding on the page or
in the form, then you won't know the encoding of the data you get back
from the form. And you'll have the same problem all over again, but in
your database. (And for forms, it's particularly good to use an
encoding like UTF-8, since any characters that the user types will
work.)
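To make that concrete, here's a little Python sketch of the server side; the specific encodings are just illustrative choices, not what any particular server does:

```python
# The page containing the form didn't declare an encoding, so the
# browser picked one on its own when submitting -- say, UTF-8.
submitted = "jalapeños".encode("utf-8")

# The server assumes the normal encoding for English (Latin-1)
# instead, and that's what goes into the database.
stored = submitted.decode("latin-1")
print(stored)  # jalapeÃ±os
```

(The accept-charset attribute on the form element is the "in the form" option I mentioned above.)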
[ I've been
looking for a clear explanation of why it's important to label Web pages
with character encodings. I want it because I want something to point
to when emailing Web authors who don't label theirs, when their pages
cause problems for me and for my default encoding of UTF-8 (which I
chose primarily so that I'd write
UTF-8 in Mozilla's Bugzilla,
which is still
broken). I couldn't find one, so here it is. I've chosen to be a
little bit loose with terminology in order to get the point across more
clearly. ]
In the earlier days of computers, different parts of the world had
different ways to convert the ones and zeros stored in a computer file
into characters that mean something to a human. Different ways allowed
different sets of characters to be represented. (In the even earlier
days of computers, they often differed between different manufacturers,
but the market took care of that problem because users had trouble
moving data from one system to another.)
Then the Web came about. The World Wide Web. And
suddenly people were reading things written in another part of the
world. But even if they knew the language, the characters didn't always
come out right, because the computers were sending ones and zeros from
end to end, and the computers at the two ends sometimes thought the same
sequence of ones and zeros represented different characters. The same
sequence could be interpreted as "jalapeños",
"jalapeÃ±os", "jalape単os", "橡污灥쎱潳", or "慪慬数뇃獯".
A way of converting sequences of ones and zeros to characters,
or the other way around, is called a character
encoding. The files sent on the Web are still just sequences
of ones and zeros, but the Web browser needs to know what characters
those ones and zeros represent. So it needs to know which encoding to
use.
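Those five renderings of "jalapeños" aren't made up; they're one and the same sequence of ones and zeros run through five different encodings. In Python (using Python's names for the encodings):

```python
# "jalapeños" encoded in UTF-8: one particular sequence of ten bytes.
data = "jalapeños".encode("utf-8")

# The same ten bytes, interpreted under five different encodings:
print(data.decode("utf-8"))      # jalapeños  (what was meant)
print(data.decode("latin-1"))    # jalapeÃ±os (a common Western default)
print(data.decode("euc_jp"))     # jalape単os (a common Japanese encoding)
print(data.decode("utf-16-be"))  # 橡污灥쎱潳
print(data.decode("utf-16-le"))  # 慪慬数뇃獯
```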
Bad Web content often doesn't tell the browser which
encoding to use, so Web browsers often guess based on the user's
language. So an American's Web browser would guess the normal encoding
used for English, but a Japanese user's browser would guess the normal
encoding used for Japanese. This means that even if a page works
fine for you and all your colleagues, it might not work for somebody in
Japan. Or for me, because I have my browser configured strangely.
(The reason Web browsers do this is that the first Web browsers just
displayed the ones and zeros the way they were interpreted on the
computer where the browser was running. So authors got used to pages
that they wrote working for themselves, and for other people in their
country. So keeping these old pages working and making the Web truly
World Wide became conflicting goals.) Some Web browsers even try to
guess which encoding the page uses by looking at the pattern of ones and
zeros.
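That kind of guessing can be sketched roughly like this; real browsers use far more elaborate statistics, so take this only as an illustration of the idea:

```python
def guess_encoding(data: bytes) -> str:
    # UTF-8 imposes a strict pattern on the bytes it produces, so
    # bytes that decode cleanly as UTF-8 probably really are UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Otherwise fall back to a guess based on the user's language
        # (Latin-1 here, standing in for "the user's usual encoding").
        return "latin-1"

print(guess_encoding("jalapeños".encode("utf-8")))   # utf-8
print(guess_encoding("jalapeños".encode("latin-1"))) # latin-1
```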
Many encodings have the same rules for dealing with the characters
that were in ASCII, a
very old character encoding that includes only the unaccented letters,
numbers, and a small number of symbols. This means problems with
encodings often don't show up until you use characters outside of ASCII,
such as the copyright sign ©, the accented characters in Résumé or
jalapeño, or the Euro sign €. That's why the example above had three variants that were mostly
the same. So just because some of your pages work both for you and
people elsewhere in the world doesn't mean they all will.
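You can watch this shared ASCII subset, and the divergence outside it, in a few lines of Python:

```python
# Every encoding below agrees about bytes that stay inside ASCII...
for enc in ("utf-8", "latin-1", "euc_jp"):
    assert b"Resume for $25".decode(enc) == "Resume for $25"

# ...but they disagree as soon as one character falls outside it.
for enc in ("utf-8", "latin-1", "euc_jp"):
    print(enc, "->", "jalapeño".encode("utf-8").decode(enc))
```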
Your operating system or your editor has an encoding that it uses by
default when you type characters into a file. If you know what encoding
that is, you can label your pages using
it. Once you've done this, there's a much higher chance that if the
characters in the page work for you, they'll work just as well for
everybody else, at least as long as they have a font that can display
the characters. If you don't know what it is, you should guess, label
your pages anyway, and test to see whether it worked. If it worked,
then it's probably OK, but remember to test again if you use unusual
characters for the first time. Finding out for sure is better than
guessing, though.
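As for what the label actually looks like: for HTML served over HTTP, it can go in the Content-Type header, in a meta tag in the page itself, or both. Here's a minimal CGI-style Python sketch; the page content is just an example:

```python
#!/usr/bin/env python3
# Label the page in the HTTP header...
print("Content-Type: text/html; charset=UTF-8")
print()
# ...and repeat the label in the page itself, so it survives being
# saved to disk and served some other way.
print("""<html>
  <head>
    <meta http-equiv="Content-Type"
          content="text/html; charset=UTF-8">
    <title>Résumé</title>
  </head>
  <body>jalapeños © €</body>
</html>""")
```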
One final note, about two technical terms that often come up in
discussions of this topic: Unicode is the standard that
contains the "complete" list of characters. UTF-8 is a new(er)
encoding that can encode all the characters in Unicode. If everybody
used UTF-8, all these problems could go away.
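A closing sketch of why that would be: one UTF-8 label covers characters that previously needed several incompatible encodings:

```python
# No common old-style encoding covers this whole string, but
# UTF-8 round-trips it without losing anything.
sample = "jalapeño © € 単 橡"
encoded = sample.encode("utf-8")
assert encoded.decode("utf-8") == sample
print(len(sample), "characters in", len(encoded), "bytes")
```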