Apache 2.0 On Windows NT A companion to the panel discussion
"Apache upon Win32 in the Round"
presented at ApacheCon, April 6th 2001
By William A. Rowe, Jr.
Internationalization Spoken Here Apache 2.0 represents a new page in the evolution of the Apache server. Apache 1.3, introduced in October of 1998, was the Apache Group's first implementation to use the native Windows 32-bit API (Win32). Rather than rely on Microsoft's standard C library, many parts of the code used the equivalent, and often more predictable, API calls. But it remained true to the Unix implementation, and all character data was passed around as single bytes, so the promise of Windows NT and Windows 2000 (WinNT) native Unicode support was still a long way off.
For those unfamiliar with Windows NT or Unicode, Windows NT stores its internal text as Unicode and uses Unicode to store filenames in its native file system (NTFS). Rather than a single byte, Unicode represents each character, or glyph, with one word (two bytes) of storage. So rather than using code pages (one per western language), each limited to 256 possible glyphs, Windows NT has 65,536 possible representations with Unicode. More recent enhancements to the Unicode standard (and Windows 2000) now provide two-word-long combined characters, resulting in just over a million possible glyphs! Windows NT users aren't constrained when naming files in multiple languages; they have the alphabet of the world at hand.
Windows 95, 98 and Millennium Edition (Win9x) users aren't so fortunate. Some parts of the system support Unicode, such as writing text to the screen. But the file system and kernel are based on byte-oriented code pages (as are FAT and FAT32 volumes when mounted on Windows NT, since they come from Win9x and MS-DOS before it). This article does not address those users; while many features described here will still work, a file saved to the system with a Greek-alphabet name will appear as drivel in Microsoft Explorer. Unix machines have these features, or problems, as well: the directory listing only makes sense when viewed with i18n tools. Only Windows NT and utf-8 enabled Unix shells shine with respect to the Apache 2.0 i18n design.
[Internationalization is 20 letters long; by chopping out the middle 18 letters and replacing them with the number 18, you are left with the commonly used abbreviation i18n. This term is rarely used by Microsoft itself, but is very common when discussing internationalization on the Internet.]
Internationalization of the Internet In fact, Apache needed no work to name files in the i18n convention. It simply works. Take the Japanese word 言語, or 'language'. There are two display glyphs in that word, which fit nicely into two Unicode words (numbered 8A00 and 8A9E). On Windows NT, that is exactly how they will appear, if you choose any font that includes the CJK Unified Ideograph glyphs (which are common among eastern languages). This encoding, called ucs-2, simply doesn't work in the Internet world, and the reasons are simple.
Many computer languages, including C (the language of Apache), treat the byte (character) value of zero as the end of the text. Look again at the example word. The lower byte of the glyph 言 (8A00) is 00. If we used Unicode, without massive changes to the code, that byte would end the entire string, and we would never get to the second glyph. Writing a program specifically for Unicode is quite simple today. But writing one program that compiles into both single-byte character sets and Unicode word character sets is a very dicey proposition, which risks introducing bugs into one version or the other.
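The zero-byte hazard is easy to demonstrate in a few lines of C (a hypothetical illustration, not Apache code). A UCS-2 buffer holding 言語 contains a zero byte inside the very first glyph, so byte-oriented string functions stop before they ever reach the second one:

```c
#include <string.h>

/* 言語 (U+8A00 U+8A9E) stored as UCS-2 byte pairs, followed by a
 * two-byte terminator. Little-endian order is assumed here for
 * illustration; the low byte of U+8A00 is 0x00, which byte-oriented
 * C code mistakes for the end of the string. */
static const char kotoba_ucs2[] =
    { 0x00, (char)0x8A, (char)0x9E, (char)0x8A, 0x00, 0x00 };

/* How many bytes byte-oriented code sees before "end of string". */
static size_t visible_bytes(const char *s)
{
    return strlen(s);
}
```

Here strlen() reports an empty string: the zero low byte of 言 terminates the text immediately, exactly the failure mode described above.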
For the many computer languages and programs that were written long ago in C, the Unicode standard simply is not the best solution. Instead of sometimes having zero byte values within glyphs, which these older applications can't tolerate, the Unicode words needed to be wrapped into 'safe' byte values. This solution, called utf-8, uses safe single byte values, defining the first 128 possible values to have exactly their American-English ASCII-7 values, while the remaining values stretch character codes across two or more bytes. This carries a substantial benefit for every language.
The Magic of utf-8 If you are using an American-English keyboard, you have nothing but ASCII characters on your keycaps. Since these codes are only one byte long, most html tags, the markup language of the web, take no longer to transmit than they did in the older single byte character sets. If the web page is transmitted in true Unicode characters, the page takes twice as long just to convey the markup (which is far longer than the actual text in most web content). Several commonly used control codes fall in this set as well, so a new line, tab, or space still fits in a single byte. Finally, the byte value zero retains its meaning to Apache, and to C, as the end-of-string marker.
To add the next 1,920 possible glyphs, we need to tack on another byte. We cannot simply use the other half of the 256 possible values in the first byte, because they are used to determine how many bytes we are actually using. While we shrink common web page symbols and control codes into a byte, there are sacrifices. Rather than fitting 63,488 different codes in two bytes, we can fit only 2,048 possible glyphs into those first two bytes of a utf-8 encoded character.
For the remaining 61,440 codes, we must use a third byte to fit the possible values. Our example word 言語 is encoded as E8A880 E8AA9E, which uses 50% more space. In most web content, this is simply not noticeably different from Unicode. If our example word is in bold, we need to wrap it in <b> and </b> tags, which add 7 characters, or 14 bytes in Unicode. This content of 9 characters, <b>言語</b>, needs 18 bytes in Unicode, but only 13 bytes in utf-8. The nightmare of Unicode's zero-value bytes is gone, and all text can be treated as byte streams, rather than word streams. Apache is happy being international.
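The byte values above fall out of a few shifts and masks. The following is a minimal sketch of a utf-8 encoder for single-word code points (1 to 3 bytes out, no error checking), written for illustration rather than taken from Apache or APR:

```c
#include <stddef.h>

/* Encode one Unicode code point (single-word range, below 0x10000)
 * into utf-8. Returns the number of bytes written, 1 to 3.
 * Illustrative sketch: surrogate handling and validation are omitted. */
static size_t utf8_encode(unsigned cp, unsigned char out[3])
{
    if (cp < 0x80) {                 /* 7 bits:  0xxxxxxx            */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 11 bits: 110xxxxx 10xxxxxx   */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

Feeding it our example glyphs reproduces the bytes quoted in the text: 8A00 becomes E8 A8 80, and 8A9E becomes E8 AA 9E, while a plain ASCII letter passes through as a single byte.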
Observations for the Pedantic If you are doing the math, we are up to 63,488, a bit shy of the 65,536 values that fit into a word. The remaining 2,048 values were sacrificed when Unicode was expanded to allow double word characters. Eliminating these values introduced 1,048,576 more codes that are represented with two words instead of one. In order to encode them into utf-8, we tack on a fourth byte for the next 2,031,616 values (almost twice the number required to represent these two word glyphs). Those four bytes are the same size as the two word sequence that the characters represent, so there is no tradeoff.
The Unicode of Windows NT is probably the end of the road for most languages of the world. But there are some who still cry "More!" The proper name for the Unicode encoding is ucs-2, since the two byte word is the basic character (two word character pairs notwithstanding). For the full potential of an entirely double word (four byte) encoding, there is also ucs-4. Since every character occupies four bytes, this is not a terribly efficient encoding. However, utf-8 doesn't let us down: we can tack on a fifth byte for another 65,011,712 values, and yet a sixth that gives us the final 2,080,374,784 possible values. The high bit of the four byte ucs-4 word isn't used, and the 2,048 values lost to double word sequences are gone forever, so we have our grand total of 2,147,481,600 glyphs.
Appearances are Deceiving Rather than the single or double word characters of Unicode, utf-8 gives us all the same symbols packed into 1 to 4 bytes, some glyphs a bit more compact, some a bit longer. These codes can be saved on Win9x and Unix systems exactly as is. If the system is not configured to display and edit utf-8, things will appear very odd. Come back for another moment to our word 言語 and look at it on Windows 98, or a typical Unix system. It appears in a filename as è¨€èªž, which is not what one expects, or even finds comprehensible.
This doesn't pose a problem for Apache. When Apache goes back to look for the file 言語, it knows just the codes the file was created with. The problem for a Win9x or Unix user is that the file name è¨€èªž makes no sense at all. The only way to manage the naming of such files is to save the file onto the server with a utf-8 aware application, such as a WebDAV client. Since mod_dav is built into Apache 2.0, it is very simple to read and write utf-8 filenames on the server. They just don't make any sense at the server's own console with the dir or ls commands, or from a Windows 9x Explorer view.
This is where WinNT really shines. The native support for each Apache 2.0 platform is hidden within the Apache Portable Runtime (APR). Under Windows NT, APR expects that every filename in Apache is given in utf-8, so it translates directly between utf-8 and Unicode. This conversion is actually faster than most codepage translations, because it isn't a lookup, it's simple math. The resulting Unicode filename can be far longer than the 260 character limit of Windows 9x, because APR then bypasses the Windows NT internal translations.
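The "simple math" referred to above can be sketched as follows. This is an illustrative decoder for one utf-8 sequence of 1 to 3 bytes into a ucs-2 code unit, using nothing but shifts and masks (no codepage lookup table); it is not APR's actual implementation, and validation is omitted:

```c
#include <stddef.h>

/* Decode one utf-8 sequence (1-3 bytes) into a 16-bit ucs-2 value.
 * *used receives the number of input bytes consumed.
 * Sketch only: malformed sequences are not rejected. */
static unsigned ucs2_from_utf8(const unsigned char *s, size_t *used)
{
    if (s[0] < 0x80) {                        /* 0xxxxxxx */
        *used = 1;
        return s[0];
    }
    if ((s[0] & 0xE0) == 0xC0) {              /* 110xxxxx 10xxxxxx */
        *used = 2;
        return ((unsigned)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    }
    *used = 3;                                /* 1110xxxx 10xxxxxx 10xxxxxx */
    return ((unsigned)(s[0] & 0x0F) << 12)
         | ((unsigned)(s[1] & 0x3F) << 6)
         |  (s[2] & 0x3F);
}
```

Decoding the bytes E8 A8 80 from our example recovers the Unicode word 8A00; since no table is consulted, a conversion like this is cheap enough to run on every filename.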
The only other near competition comes from an i18n enabled Unix shell, which can speak utf-8 rather well. With a utf-8 enabled vi editor, the ls directory listing command and other tools, the obscure è¨€èªž once again appears as 言語 to the user at the server console. Of course, many utilities won't manipulate that file name the same way as the i18n enabled shell. Some commands will display 言語 correctly, while others may shift the columns of their output. These older programs believe the name is 6 glyphs wide instead of 2, and mis-format the appearance.
Configuration Files [A word of warning before we begin. Apache 2.0 was still evolving as this article went to press. While all of the support that follows should be in the server before the conference, there is no assurance that the final implementation will exactly match what is described below.]
Apache 2.0 accepts utf-8 strings from the requested URL, and within the httpd.conf configuration file. This is simple with the vi editor of an i18n enabled Unix shell, but is a bit more problematic on Windows. Windows NT has no built-in utf-8 editor, so Apache 2.0 needed to grow some brains for those without the ability to edit utf-8 but with a Unicode editor. Microsoft Word and the built-in WordPad both speak Unicode, and the underrated Notepad editor speaks Unicode as well. Since Unicode files created on Windows carry an FEFF sequence as the first word, Apache 2.0 will recognize that it must read such a file as Unicode and convert it into utf-8. This is true of the httpd.conf file, any other included configuration files, and the .htaccess directory-control files.
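The FEFF detection described above amounts to inspecting the first bytes of the file. Here is a sketch of that check (illustrative, not Apache's code): a Unicode file saved on Windows begins with the byte-order mark FF FE (little-endian) or FE FF (big-endian), and the utf-8 signature EF BB BF is what Windows 2000 tools write when saving as utf-8.

```c
#include <stddef.h>

enum bom_kind { BOM_NONE, BOM_UTF16_LE, BOM_UTF16_BE, BOM_UTF8 };

/* Classify a file by its leading byte-order mark, if any.
 * Sketch: mirrors the detection logic described in the text. */
static enum bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;                /* utf-8 signature        */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;            /* U+FEFF, little-endian  */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;            /* U+FEFF, big-endian     */
    return BOM_NONE;                    /* assume byte-oriented   */
}
```

A configuration reader built this way would convert UTF-16 input to utf-8 when a mark is found, and pass marked or unmarked utf-8 straight through.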
To use Notepad to edit a .conf or .htaccess file in Unicode, you may first need to choose a different font, depending on the languages you need to type. You can copy characters from charmap, but that isn't the simplest approach. Loading the keyboard mappings or Input Method Editors (IMEs) for other languages makes typing in Notepad very simple, if you are familiar with that language. On Windows 2000, you can simply save the file as utf-8, which eliminates this conversion and makes the server start up just a bit more quickly.
More International Text On Windows NT, even the environment and command line arguments are primarily Unicode, and are only mapped to the local code page for the benefit of older applications. The Apache 2.0 Windows NT MPM (multi-processing module) will learn this, with some special configuration options, so these can be translated between Unicode and utf-8 as well. CGI applications will receive both Unicode and native code page support when they are invoked, so a utf-8 argument will reach Perl or Java as a true Unicode string. The administrator will be able to disable this feature for older CGI applications with no internal Unicode support, so they can receive the strings in the native code page rather than as utf-8 encoded text.
The answer for authors of international CGI applications and servlets is to write them for Unicode. With Apache 2.0 and their Unicode-enabled languages doing all the work, authoring i18n web applications couldn't be simpler.
The weakest link in the Apache 2.0 for Windows NT i18n scheme is that all log files are written in utf-8. Without a utf-8 reader, these become difficult to read. They are not illegible, but going back to our example, instead of "/言語.html" appearing in the log, it will look like "/è¨€èªž.html" instead. If the log will be passed through a utf-8 enabled traffic analysis application, it's best to leave this alone and let that reporting application display things properly. A filter program can be used to pipe the logs into Unicode as they are written by the server, or they can be converted later as part of a batch process.
Requesting the URN and HTML markup Internationalized uniform resource names (URNs) pose a very special problem, and there is no agreement yet on the answer. Older browsers and some servers simply presented requests in whatever code page they gave for the content-language tag, or as opaque byte strings whose meaning they had no clue about. Newer browsers and servers are beginning to agree that the URL or URI is always presented in utf-8 encoding. The problem lies in the fact that they do not yet agree on how.
RFC2616, the blueprint of http servers, is very specific about how names that do not fall in the usual format should be presented. Spaces must always be escaped as the + symbol, although some implementers have passed around %20 instead. The plus symbol must therefore be presented as %2B. But without any question, the specification states that utf-8 encoded characters will be escaped with %nn codes.
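The escaping rules described above can be sketched in a few lines of C. This is an illustrative routine, not Apache's own escaping code; it assumes the output buffer is large enough (three bytes per input byte, plus a terminator) and escapes only the cases discussed in the text:

```c
/* Percent-escape a utf-8 path segment for use in a URL, per the
 * rules discussed above: space becomes '+', '+' and '%' and all
 * non-ASCII (utf-8) bytes become %nn hex escapes.
 * Sketch only; other reserved characters are left alone. */
static void url_escape(const unsigned char *in, char *out)
{
    static const char hex[] = "0123456789ABCDEF";
    for (; *in; in++) {
        unsigned char c = *in;
        if (c == ' ') {
            *out++ = '+';
        } else if (c >= 0x80 || c == '+' || c == '%') {
            *out++ = '%';
            *out++ = hex[c >> 4];
            *out++ = hex[c & 0x0F];
        } else {
            *out++ = (char)c;
        }
    }
    *out = '\0';
}
```

Run on the utf-8 bytes of our example filename, /言語.html becomes /%E8%A8%80%E8%AA%9E.html on the wire, which any compliant server can decode back to the original bytes.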
There is no way to make the older browser do the right thing, or to force blatantly wrong browsers to act appropriately. However, Apache 2.0 is learning the magic to fix things up. While mod_charset_lite allows Apache to serve web pages in a character set the user can read on their machine (different from the original character set), there is a companion module in the works to understand the older browsers' URLs for what they meant, and present the client with a name they can read. This capability only exists because of Apache 2.0's filtering design, so every web service can profit from mod_charset_lite translation and input conversion.
About the Author William is a Senior Development Engineer at Covalent Technologies, Inc. working on the next generation of Web Servers, and contributed and maintains a good portion of the Win32 specific APR, a distinction he gladly shares with Bill Stoddard, Ryan Bloom and a handful of other contributors. He is an advocate of i18n code from his days coding COM technology in C++ with Microsoft's ATL, MFC and VB, when he was confronted by the BSTR. Somehow, as a native speaker of English, he has failed to grasp any other language but a small smattering of German, and a handful of expletives, nouns and numbers from a dozen languages. But he finds glyphs and typography aesthetically cool.