David Baron's weblog: Beware of locale-specific behavior in the C library

In the English-speaking world, we write the decimal number for one and a half as "1.5", but in France, they write "1,5". When the designers of a programming environment define an interface, they can give that interface locale-independent behavior (so that a function writing a number to a string or reading a number from a string always writes or reads "1.5" only) or locale-specific behavior (so that such a function behaves differently depending on the system settings). This applies to much more than number formatting: for example, it also applies to dates, translation of messages, and alphabetic sorting.
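To make the difference concrete, here is a minimal sketch in C. The helper name format_in_locale is made up for this example, and "fr_FR.UTF-8" is only an assumption about which locales are installed; the function reports failure if the requested locale is missing.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: format `value` with one decimal place under the
   named locale. Returns 0 on success, -1 if the locale is not installed. */
int format_in_locale(const char *locale_name, double value,
                     char *buf, size_t size)
{
    /* setlocale returns NULL when the requested locale is unavailable. */
    if (setlocale(LC_NUMERIC, locale_name) == NULL)
        return -1;

    /* snprintf takes its decimal separator from the LC_NUMERIC locale. */
    snprintf(buf, size, "%.1f", value);

    /* Sketch only: go back to "C" rather than saving the previous locale. */
    setlocale(LC_NUMERIC, "C");
    return 0;
}
```

Called with "C" and 1.5, this writes "1.5"; called with a French locale, where one is installed, the same call writes "1,5".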

Programs often want locale-specific behavior when presenting data to the end user or receiving input from the end user. However, in file formats and network protocols, programs generally want locale-independent behavior so that files can be exchanged around the world and servers in different parts of the world can communicate.

The C programming language made what I consider a horrible design decision. Rather than having separate functions for locale-independent and locale-specific actions, C has a function called setlocale that changes the behavior of large sets of functions. A program starts out with locale-independent behavior (the "C" locale), but many applications such as Firefox choose to switch to locale-specific behavior so they can use it in some cases. (It's not clear to me what Firefox needs this for, though.) The setlocale function is global (and presumably not threadsafe); it changes the behavior of all threads. There are better platform-specific alternatives (uselocale or _configthreadlocale, strtod_l or _strtod_l), but these don't appear to be easily portable.
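As a rough sketch of the uselocale approach, the function below parses a number using "C" locale rules on the calling thread only, leaving the global locale alone. It assumes a POSIX.1-2008 system such as Linux with glibc (Windows would need _configthreadlocale or _strtod_l instead), and the name parse_double_c_locale is invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: parse a double with "C" locale rules, affecting
   only the calling thread. Assumes POSIX.1-2008 (newlocale, uselocale). */
double parse_double_c_locale(const char *text)
{
    /* Create a locale object for "C"; (locale_t)0 signals failure. */
    locale_t c_loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (c_loc == (locale_t)0)
        return strtod(text, NULL);   /* fall back to the current locale */

    /* Install it for this thread only and remember what was active. */
    locale_t old = uselocale(c_loc);
    double value = strtod(text, NULL);

    /* Restore the thread's previous locale before freeing ours. */
    uselocale(old);
    freelocale(c_loc);
    return value;
}
```

Creating and freeing a locale object on every call is wasteful; real code would probably create the "C" locale object once and reuse it.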

I think this global switch was a design mistake. Programs that spend a lot of time printing messages to users also deal with data formats or protocols that should be consistent across locales. And programs that would like to be portable and fast don't want to wrap every simple string operation in calls to a non-threadsafe function of unknown performance characteristics. But programs like Firefox that want some locale-specific behavior end up switching into locale-specific mode and just staying there (though we actually have a bunch of code to call setlocale at various times, which is even scarier).

This creates what I think Henri would call an attractive nuisance. We have a constant stream of bugs (such as the bug that prompted me to write this) where code dealing with file formats or network protocols accidentally uses locale-specific functions, works correctly for the US-based developer, and doesn't work as it's supposed to somewhere else in the world.
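The typical failure looks something like the sketch below. The helper parse_under_locale is hypothetical, and the French locale name in the note that follows is an assumption about what is installed.

```c
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: parse `text` with strtod under the named locale and
   report how many characters were consumed. Returns -1 if the locale is
   not installed, 0 otherwise. */
int parse_under_locale(const char *locale_name, const char *text,
                       double *value, size_t *consumed)
{
    if (setlocale(LC_NUMERIC, locale_name) == NULL)
        return -1;

    char *end;
    *value = strtod(text, &end);         /* stops at the first unexpected char */
    *consumed = (size_t)(end - text);

    setlocale(LC_NUMERIC, "C");          /* sketch: reset rather than restore */
    return 0;
}
```

Under "C", parsing "1.5" consumes all three characters and yields 1.5. Under a locale such as "fr_FR.UTF-8", where the decimal separator is a comma, strtod stops at the "." and quietly yields 1. Code that never checks the end pointer won't notice.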

In the Firefox codebase, our typical workaround for these bugs is to avoid the C library. We can use NSPR's functions (PR_smprintf, PR_strtod), which are locale-independent, instead of similar C library functions, or we can use other alternatives in our source tree, such as the double conversion library (in mfbt/double-conversion/) or one-off functions like nsCRT::atoll. But quick searches of our source tree show hundreds of potential bugs sitting there today that we haven't yet worked around. We need a better solution.
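For code that can't use those libraries, one alternative (not what NSPR or the double conversion library does) is to keep calling the C library and then undo the locale's decimal separator afterwards. This is only a sketch: the name format_double_portable is invented, and it assumes the separator reported by localeconv is the only locale-dependent part of "%g" output.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write `value` into `buf` with '.' as the decimal
   separator regardless of the current LC_NUMERIC setting.
   Returns 0 on success, -1 on truncation or formatting error. */
int format_double_portable(double value, char *buf, size_t size)
{
    int n = snprintf(buf, size, "%.17g", value);
    if (n < 0 || (size_t)n >= size)
        return -1;

    /* The separator may be more than one byte long in some locales. */
    const char *point = localeconv()->decimal_point;
    size_t point_len = strlen(point);
    char *found = (point_len > 0) ? strstr(buf, point) : NULL;
    if (found != NULL) {
        *found = '.';
        /* Close the gap left by a multi-byte separator (no-op for 1 byte). */
        memmove(found + 1, found + point_len, strlen(found + point_len) + 1);
    }
    return 0;
}
```

This still depends on whatever locale is active at the moment of the call, so it is a patch for individual call sites, not a fix for the global-state problem.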

First, I wonder if we can switch to running in "C" locale (i.e., in C's default locale-independent mode). It's not clear to me what we get from changing the C library locale to the user's locale; most of our important localization behavior is implemented at a different level from the C library. This might require figuring out how to use the platform-specific APIs to get the information we need without changing global behavior for all threads.

Second, I think we should be running our unit tests in locales other than US English. We don't have to run a full matrix of tests (which is what such proposals always seem to get blocked on), but we could, say, run 32-bit Linux tests in French locale, 64-bit Linux tests in Japanese locale, etc., perhaps even replacing the current runs in US English locale.

Third, unless we can switch to locale-independent behavior quickly, we need better awareness that many functions from the C library can have locale-specific behavior: anything to do with date formatting, float and integer reading and writing, and string sorting. (I'm having trouble finding evidence of this behavior for integers, although glibc certainly has some #ifdef-ed code in its strtol implementation to parse locale-specific group separators, e.g., "1,000,000" for one million; see my test program. I think the vast majority of the problems we've had have dealt with reading and writing floats, though.)

Monday, 2012-12-22, 14:23 -0500