Chapter 7. Internationalization
If the Web is to reach a truly worldwide audience, it needs to be able
to support the display of all the languages of the world, with all
their unique alphabets and symbols, directionality, and specialized
punctuation. This poses a big challenge to HTML constructs as we know
them. However, according to the W3C, "energetic efforts"
are being made toward this complicated goal.
The W3C's efforts for internationalization (often referred to as
"i18n" -- an i, then 18 letters, then an n) address two primary issues. First is
the handling of alternative character sets that take into account all
the writing systems of the world; second is how to specify languages
and their unique presentation requirements within an HTML document.
Many solutions presented by internationalization experts in a
document called RFC 2070 were incorporated into the current HTML 4.0,
XML 1.0, and CSS2 specifications.
This chapter addresses key issues for internationalization, including
character sets and new language features in HTML 4 and CSS2. Be aware
that many of these features are not yet supported by browsers, even
the most current.
7.1. Character Sets
The first challenge in internationalization is dealing with the
staggering number of unique character shapes (called "glyphs")
that occur in the writing
systems of the world. This includes not only alphabets, but also all
ideographs (characters that indicate a whole word or concept) for
languages such as Chinese, Japanese, and Korean.
7.1.1. 8-Bit Encoded Character Sets
Character encodings (or character sets) are
organizations of characters -- units of a written language
system -- in which each character is assigned a specific number.
Each character may be associated with a number of different glyphs;
for instance, the "close quote" character may be
displayed using a " or » glyph,
depending on the language. In addition, a single glyph may correspond
to different characters, such as a comma serving both as the
punctuation symbol for a pause in a sentence and as a decimal
indicator in some languages.
The number of characters available in a character set is limited by
the bit-depth of its encoding. For example, 8 bits are capable of
describing 256 (2^8) unique characters, which is enough for most Western
languages.
HTML 2.0 and 3.2 are based on the 8-bit character set for western
languages called Latin-1 (or ISO 8859-1). There are a number of other
8-bit encodings, including:
| ISO 8859-5 | Cyrillic |
| ISO 8859-6 | Arabic |
| ISO 8859-7 | Greek |
| ISO 8859-8 | Hebrew |
| SHIFT_JIS | Japanese |
| EUC-JP | Japanese |
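As a rough illustration of why the choice of 8-bit character set matters, the following Python sketch decodes one and the same byte under several of the ISO 8859 sets listed above. The byte value 0xE0 and the helper name decode_under_each are arbitrary choices made for this example, and the codec names are the ones Python's standard library uses, not names from the HTML specification.

```python
import unicodedata

# One byte value, chosen arbitrarily for illustration.
RAW = bytes([0xE0])

# Python's standard-library codec names for some of the sets listed above.
CODECS = ["iso8859_1", "iso8859_5", "iso8859_6", "iso8859_7", "iso8859_8"]


def decode_under_each(raw, codecs=CODECS):
    """Return {codec name: decoded text} for the same raw bytes."""
    return {name: raw.decode(name) for name in codecs}


decoded = decode_under_each(RAW)
for name, text in decoded.items():
    # Print the character's number and official name rather than the glyph,
    # so the output does not depend on the console's own encoding.
    print(f"{name}: U+{ord(text):04X} {unicodedata.name(text)}")
```

The same number stands for a different character in each set, which is why a document must declare which encoding it uses.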
7.1.2. 16-Bit Encoded Character Sets
Sixteen bits of information are capable
of representing 65,536 (2^16) different characters -- enough to
contain a large number of alphabets and ideographs. In 1991, the
Unicode Consortium created a 16-bit encoded "super"
character set called Unicode (practically identical to another
standard called ISO 10646-1) which includes nearly every character
from the world's writing systems. The combination of Unicode
and ISO 10646 is called the Universal Character Set (UCS). Each
character is assigned a unique two-octet code (2 groups of 8 bits,
making 16 bits total). The first 256 slots are given to the ISO
8859-1 character set, so it is backwards compatible.
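The arithmetic and the backwards-compatibility claim can be checked with a short Python sketch. The function names here are hypothetical, and the "utf-16-be" codec is used only as a convenient way to expose the two octets of a character in the 16-bit range described above.

```python
def latin1_matches_unicode():
    """True if every Latin-1 byte value decodes to the character with the same number."""
    return all(ord(bytes([b]).decode("latin_1")) == b for b in range(256))


def two_octet_code(ch):
    """Return the two octets (high octet first) of a character in the 16-bit range."""
    return ch.encode("utf-16-be")


print(2 ** 16)                          # 65536
print(latin1_matches_unicode())         # True
print(two_octet_code("\u00e9").hex())   # 00e9 (e with acute accent, also in Latin-1)
print(two_octet_code("\u05d0").hex())   # 05d0 (Hebrew letter alef)
```

For the Latin-1 character the high octet is zero and the low octet is the familiar 8-bit value, which is what makes the first 256 slots backwards compatible.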
The HTML 4.01 specification officially adopts Unicode as its document
character set. So regardless of the character encoding used when a
document was created, it is converted to the document character set
by the browser, which interprets characters with special meaning in
HTML (such as < and >)
and converts character entities (such as
&copy; for ©). In cases where a character
entity points outside of the Latin-1 character set (e.g.,
&piv; for ϖ), HTML 4.0 browsers use the
Unicode character set to display the correct character.
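The entity conversion described above can be imitated outside a browser. This sketch uses Python's standard html.unescape function as a stand-in for a browser's entity handling; the helper fits_latin1 is a hypothetical name introduced here to test whether the resulting character lies inside Latin-1.

```python
from html import unescape


def fits_latin1(ch):
    """True if the character can be represented in the 8-bit Latin-1 set."""
    try:
        ch.encode("latin_1")
        return True
    except UnicodeEncodeError:
        return False


copy = unescape("&copy;")   # copyright sign, inside Latin-1
piv = unescape("&piv;")     # Greek pi symbol, outside Latin-1

for entity, ch in (("&copy;", copy), ("&piv;", piv)):
    print(f"{entity}: U+{ord(ch):04X}, fits in Latin-1: {fits_latin1(ch)}")
```

The second entity resolves to a character number above 255, so it can only be represented by falling back on the larger Unicode document character set.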
This is the first step toward making the Web truly multilingual. The
current refinements to character-set handling on the Web are
documented in a working draft, the Character Model for the
World Wide Web 1.0, published by the W3C (http://www.w3.org/TR/charmod/).
A Unicode Font
Bitstream has created a TrueType font called
"Cyberbit" that contains a large
percentage of the Unicode character set. It is available only via
licensing to developers and is unfortunately no longer offered as a
retail product. For more information about Cyberbit, contact
Bitstream's developer products department at
oemsales@bitstream.com.
7.1.3. Specifying Character Encoding
When a web client (a browser) and a server make a
transaction, meta-information about the requested and returned
document is communicated in the HTTP headers for the request and
response. One of the most important bits of information specified is
the content-type, which describes the type of data
the server is sending. The charset parameter
further specifies the character set used for a text document. A
typical HTTP header looks like this:
Content-type: text/html; charset=ISO-8859-8
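A client has to split that header value into the media type and the charset parameter before it can decode the document. The sketch below shows one way to do so with Python's standard email.message module; parse_content_type is a hypothetical helper name, and the single byte 0xE0 is an arbitrary stand-in for a document body.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type value into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the name; charset names are not case-sensitive.
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = parse_content_type("text/html; charset=ISO-8859-8")
print(media_type)   # text/html
print(charset)      # iso-8859-8

# The declared charset tells the client how to turn bytes into characters.
body = bytes([0xE0])
print(f"U+{ord(body.decode(charset)):04X}")   # U+05D0
```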
To set the character-encoding information from within the document
itself, use the <meta> tag in the document head with its
http-equiv attribute (which supplies a value to be treated like
the corresponding HTTP header). The meta tag that corresponds to the
above header message looks like this:
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-8">
Note that the browser must support your chosen character set in order
for the page to display properly.
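To see how a program might read the declaration back out of a document, here is a minimal sketch built on Python's standard html.parser module. The class MetaCharsetFinder and the function find_meta_charset are hypothetical names for this example, and the sketch looks only at the http-equiv form of the meta tag shown above.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the charset from the first <meta http-equiv="Content-Type"> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        http_equiv = (attributes.get("http-equiv") or "").lower()
        content = attributes.get("content")
        if http_equiv == "content-type" and content:
            msg = Message()
            msg["Content-Type"] = content
            self.charset = msg.get_content_charset()


def find_meta_charset(html_text):
    """Return the declared charset in lowercase, or None if none is declared."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset


sample = '<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-8">'
print(find_meta_charset(sample))   # iso-8859-8
```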
Browsers that are capable of sending an
Accept-Charset header can specify their preferred
character encodings when requesting a document. The server can then
serve the document with the appropriate encoding, if the preferred
version is available.
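The server side of that exchange might be sketched as follows. This is a simplified illustration, not a full implementation of HTTP content negotiation: the function names are hypothetical, the header value is an invented example, the "*" wildcard is not handled, and a malformed quality value would raise an error.

```python
def parse_accept_charset(header_value):
    """Return [(charset, quality)] sorted from most to least preferred."""
    prefs = []
    for item in header_value.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        quality = 1.0  # a charset listed without a q value is fully preferred
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key.lower() == "q" and value:
                quality = float(value)
        prefs.append((name.strip().lower(), quality))
    # sorted() is stable, so equally preferred charsets keep their header order.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def choose_charset(header_value, available):
    """Pick the most preferred charset the server has a version for, else None."""
    by_lowercase = {name.lower(): name for name in available}
    for name, quality in parse_accept_charset(header_value):
        if quality > 0 and name in by_lowercase:
            return by_lowercase[name]
    return None


header = "iso-8859-8, iso-8859-1;q=0.5"
print(choose_charset(header, ["ISO-8859-1", "ISO-8859-8"]))   # ISO-8859-8
print(choose_charset(header, ["ISO-8859-1"]))                 # ISO-8859-1
print(choose_charset(header, ["ISO-8859-5"]))                 # None
```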
The accept-charset attribute is already a part of
the HTML 4.0 specification for form elements. With the
accept-charset attribute, the document can specify
which character sets the server can receive from the user in text
input fields.
Copyright © 2002 O'Reilly & Associates. All rights reserved.