Character sets: 7-bit, 8-bit, 16-bit and more

Justin Bur <justin@cam.org> July 1992 / revised August 1995

Computers no longer speak only English. Computer users increasingly expect interfaces and messages to be offered in their own language, and to be able to process text in more than one language at a time. Humanities and social science researchers often need to work with texts written in dead languages or obsolete writing systems. As personal computers reach parts of the world where they had never before been available, more and more languages, some with unusual writing systems, are candidates for machine encoding.

There currently exist dozens of standard character sets (and many widely used sets that are not recognized standards, such as the IBM PC code pages), each dealing with a particular set of languages or applications. Researchers in fields that require additional characters have developed private encodings. It is therefore possible to represent just about anything anyone would want to – but there are too many incompatible standards. To ensure free circulation and easy exchange of information, it is essential that an internationally agreed standard character code be developed with the power to represent all human language.

This article will briefly present the most common current standard character sets before discussing some of the issues underlying the new universal character set standard, ISO 10646.

Current situation

ASCII, ISO 646, NRCS

The American Standard Code for Information Interchange (ASCII) is the only character set that just about everyone (except IBM mainframes, which use EBCDIC) agrees on, and the only character set that can be safely transmitted everywhere on the Internet – or almost. Since its 7 bits encode only 128 characters, of which 33 are reserved (mostly wasted) for device control, the coding of even small numbers of diacritically marked characters for European languages can be done only by replacing some of the less used punctuation characters of ASCII. ISO 646, the international standard for 7-bit character sets, defines an International Reference Version (IRV), which is ASCII, and twelve positions in which it is permitted to place alternate characters to create National Replacement Character Sets (NRCS). This system is inadequate and rarely used except in Scandinavia and Japan. Given a choice between having all the braces in a C program turn into accented letters, and writing French without its accents, most people have preferred the latter.
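
As a rough illustration, the sketch below (in Python, with a partial and merely illustrative mapping for the French national set) shows how the same 7-bit codes read as accented letters once the replacement positions are reassigned:

    # Illustrative subset of a French national replacement set (ISO 646):
    # a few 7-bit positions are reassigned from ASCII punctuation to
    # accented letters. The mapping here is partial, for demonstration only.
    FRENCH_NRCS = {
        0x40: 'à',   # '@' in the International Reference Version
        0x5C: 'ç',   # '\'
        0x7B: 'é',   # '{'
        0x7C: 'ù',   # '|'
        0x7D: 'è',   # '}'
    }

    def decode_nrcs(data, replacements):
        """Decode 7-bit text, substituting national characters where defined."""
        return ''.join(replacements.get(b, chr(b)) for b in data)

    c_source = b"if (x) { return 0; }"
    print(decode_nrcs(c_source, FRENCH_NRCS))   # the braces come out as accented letters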

ISO 8859

Since characters are transmitted and stored on most computers in 8-bit bytes (octets), the obvious solution to the need for more codepoints is to make use of the unused 8th bit. In the absence of an appropriate standard, though, several manufacturers independently assigned meanings to the 128 new codepoints. Thus we now have to live with IBM PC code (in several versions), Apple Macintosh code, Hewlett-Packard Roman8, Adobe Standard Encoding, DEC Multinational Character Set, and others. Unix systems and Internet services were very slow to adopt 8-bit characters, preferring to strip the 8th bit and work exclusively with 7-bit ASCII.
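
A small demonstration of the incompatibility, using a modern Python interpreter's names for the ISO, IBM PC, and Macintosh codecs:

    # The same byte above 127 means something different in each vendor's
    # 8-bit extension of ASCII.
    byte = b'\xe9'
    for codec in ('latin-1', 'cp437', 'mac_roman'):
        print(codec, byte.decode(codec))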

An international standard for an 8-bit code capable of representing virtually every European Latin-script language (ISO 6937) was developed quite early, but never received much favor among system implementors. This was because diacritical marks were encoded separately from the letters they modified; thus <e with acute accent> did not have a single 8-bit character code, but rather was encoded as the sequence <combining acute accent> <e>. The costs of such a system were usually considered to outweigh its benefits.
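
The same idea survives in Unicode's combining marks (with the mark following the base letter rather than preceding it, as in ISO 6937); a minimal sketch in Python:

    import unicodedata

    # One precomposed character versus a combining sequence: two encodings
    # of the same text, which only compare equal after normalization.
    precomposed = '\u00e9'      # e with acute accent as a single code
    sequence    = 'e\u0301'     # letter e followed by a combining acute accent
    print(precomposed == sequence)                                  # False
    print(unicodedata.normalize('NFC', sequence) == precomposed)    # True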

In order to handle all the European Latin-script languages with a single 8-bit code for each diacritic/base-letter combination, as well as the other major alphabetic scripts (Greek, Cyrillic, Hebrew, and Arabic), it was necessary to produce a series of several standard character sets: ISO 8859. The standard currently has ten parts, all of which contain ASCII in their first 128 positions. 8859-1 is the well-known ISO Latin 1, for Western Europe; 8859-2 (Latin 2) is for Eastern Europe. Latin 3 (Esperanto, Maltese, and Turkish) and Latin 4 (Nordic languages) were poorly conceived, have been little used, and may someday be withdrawn from the standard. 8859-5 covers the Cyrillic script; 8859-6, Arabic; 8859-7, Greek; 8859-8, Hebrew. 8859-9 (Latin 5) was introduced to handle Turkish in a less baroque manner than Latin 3; it is identical to Latin 1 with the substitution of Turkish for Icelandic letters. 8859-10 (Latin 6) covers the Nordic languages more adequately than Latin 4. There is also a set of supplementary characters designed for use in conjunction with Latin 1 or 5 and Latin 2: by switching among these sets, it is possible to obtain the full repertoire of ISO 6937 and therefore to encode almost any Latin-script language covered by ISO 8859.
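
The relation between Latin 1 and Latin 5 can be checked mechanically; the sketch below (Python, using its names for the two ISO codecs) lists the handful of positions where Icelandic letters are replaced by Turkish ones:

    # Compare ISO 8859-1 and ISO 8859-9 position by position; only the
    # Icelandic/Turkish substitutions differ.
    for b in range(0xA0, 0x100):
        c1 = bytes([b]).decode('iso8859-1')
        c5 = bytes([b]).decode('iso8859-9')
        if c1 != c5:
            print(hex(b), c1, '->', c5)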

Most parts of ISO 8859 have been well accepted and widely implemented. Yet 8859 suffers from the same problem as 646, on a different scale: it represents several sets of characters within the same limited encoding space. Although it is no longer necessary to choose between C programs and French accents, it is still not possible to exchange files between Eastern and Western Europe without character set problems. Or, more to the point for most people, it is not possible to exchange files between PCs, Macintoshes, and Latin 1 machines without transcoding – and these character sets still cause problems with Internet mail and the World-Wide Web, despite measures such as MIME and HTML entities.

Han (Chinese/Japanese/Korean) characters

Languages written with non-alphabetic scripts introduce entirely new problems. The most important are Chinese, Japanese, and Korean, each of which uses a native phonetic script of 40-50 signs to supplement a vocabulary of many thousands of ideograms (Han characters). Although all three languages' ideograms are of common origin, they have developed independently and diverged. In addition, the People's Republic of China has introduced "simplified" characters that differ from those still used in Taiwan, Hong Kong, and elsewhere. National standards exist for encoding each language using 2-octet (16-bit) or 3-octet characters.
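
As an illustration (using a modern Python interpreter's names for the GB 2312, Shift-JIS, and EUC-KR codecs), the same Han character receives unrelated byte values in each national standard:

    # One Han character, three national encodings, three different codes.
    han = '\u4e2d'    # the character "middle", shared by Chinese, Japanese, and Korean
    for codec in ('gb2312', 'shift_jis', 'euc_kr'):
        print(codec, han.encode(codec).hex())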

Character set switching

A standard method of announcing which character set is being used and of switching among character sets has been created (ISO 2022). It is therefore possible to mix different NRCS, parts of 8859, or any other of the many special-purpose character sets that have been registered (according to ISO 2375), without danger of misinterpretation. In practice, however, ISO 2022 is particularly confusing and awkward to implement and has had limited use, except in a few special applications (e.g., bibliographic information interchange, Japanese electronic mail; partial implementations of 2022 include DEC VT200 terminals and X Window System compound text). Moreover, it does not make any attempt to provide a unified and consistent repertoire of characters; it simply allows many varied character sets to coexist – as long as they have been registered. None of the manufacturers' private character sets are registered.
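
One of the special applications mentioned, Japanese electronic mail, gives a convenient demonstration; the sketch below uses Python's iso2022_jp codec to make the announced switches visible as escape sequences in the byte stream:

    # ISO 2022 switching: escape sequences announce each change of
    # character set within a single stream of text.
    text = 'abc \u65e5\u672c\u8a9e abc'       # Latin text around the word "Japanese"
    print(text.encode('iso2022_jp'))          # ESC $ B ... ESC ( B bracket the Japanese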

Towards a universal coded character set

The only way out of the character set morass is to define a new standard character set offering a unified and consistent repertoire capable of representing all the major languages of the world. The demand for native-language interfaces in more and more languages, and the need to exchange information on a worldwide scale, have made such a universal character set a commercial necessity. Dealing with all the world's languages and scripts at once is not an easy task, and requires addressing many issues that previous character sets have been able to avoid.

How many characters are there?

The first issue that must be resolved – and not the least controversial – is the approximate number of characters to be encoded and the number of bits required to offer this number of codepoints. Computer architecture places major constraints on this choice. We no longer have DECsystem-10s and 20s with 18- and 36-bit words; the only convenient units for most processors are 8, 16, and 32 bits. It has been estimated that 18 bits would be sufficient for just about everything anyone would ever want to encode as a character. Therefore, a choice must be made between rejecting some possible characters to make everything fit compactly in 16 bits, or wasting storage space with 32-bit characters.
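
The arithmetic behind the choice is simple:

    # Codepoints available at each width (before subtracting reserved positions).
    for bits in (7, 8, 16, 18, 32):
        print(bits, 'bits:', 2 ** bits, 'codepoints')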

The most important reduction of the total number of characters – inevitable for a universal 16-bit code – is obtained by encoding Han characters common to Chinese, Japanese, and Korean once only, despite differences in meaning or variations in form. This reduction is referred to as Han Unification.

Writing systems

The writing system of most Latin-script languages is very simple: characters are aligned horizontally, left-to-right, without overlapping or changing direction; the only non-linear elements that intervene are diacritical marks placed above some letters. Even then, the number of possible diacritic/base-letter combinations in any one language is usually small enough that it is easy (and often preferable) to use a separate code for each precomposed diacritically marked letter. In general, however, writing systems are not as simple.

Certain Latin-based writing systems are already more complex. Vietnamese often requires two diacritical marks on a single letter, one of which is a tone mark. The International Phonetic Alphabet (IPA) combines superscript and subscript diacritics, as well as marks that apply to more than one letter at once. It is not even practical to enumerate all the possible diacritic/base-letter combinations in IPA, since it is designed as an open-ended system in which new combinations may be invented as required.

Arabic and Hebrew are written from right to left, but numbers and insertions in Latin script are written by changing direction within a line of text. Both scripts denote only consonants with full letters; vowels are (optionally) written as points over or under the consonants. Arabic comes from a calligraphic, rather than typographic, tradition, in which letters have initial, medial, final, and standalone forms; Hebrew also has a few of these positional variants, and even Greek retains one or two.
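
Unicode, for instance, records a directionality property for every character, from which a layout engine decides where to change direction within a line; a small Python illustration:

    import unicodedata

    # Directionality properties: L (left-to-right), R (right-to-left),
    # AL (Arabic letter), EN (European number).
    for ch in ('A', '\u05d0', '\u0627', '1'):   # Latin A, Hebrew alef, Arabic alef, digit one
        print(hex(ord(ch)), unicodedata.bidirectional(ch))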

Other complex scripts include Devanagari (the script of Sanskrit and Hindi) and the related Indic scripts, which use an involved system of ligatures, and the Korean Hangul alphabet, in which alphabetic symbols are combined into syllabic blocks.
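
The syllabic composition of Hangul can be seen by decomposing an encoded syllable back into its letters (a Python sketch using Unicode's canonical decomposition):

    import unicodedata

    # A Hangul syllable block decomposes into the alphabetic jamo it is built from.
    syllable = '\ud55c'                        # the syllable "han"
    jamo = unicodedata.normalize('NFD', syllable)
    print([unicodedata.name(c) for c in jamo])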

What is a character?

The different operations performed on text in a computer – including input, rendering (display), searching, and sorting – have different preferences for the way in which the text is encoded. Rendering would be simpler if presentation forms such as the ligatures "fi" and "fl" were encoded explicitly; but this would complicate the input process and make correct searching and sorting difficult. This is a trivial example, but issues of this nature abound, especially in more complex writing systems. The method of encoding must be a tradeoff between the requirements of different types of processing. The characters of a character set are the elements required by the chosen encoding.
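
The ligature example can be made concrete: where a ligature does have its own code (as the fi ligature does in Unicode, for compatibility reasons), searching has to fold it back to its constituent letters, for example:

    import unicodedata

    # A word written with an explicit fi ligature does not match a naive
    # search for "find"; compatibility decomposition restores 'f' + 'i'.
    word = '\ufb01nd'                                        # "find" with an fi ligature
    print(word == 'find')                                    # False
    print(unicodedata.normalize('NFKD', word) == 'find')     # True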

In this context, the term "character" has a particular meaning which overlaps partially with conventional uses of the word. Characters are not simply abstract shapes (typographic characters), nor do they necessarily correspond exactly to the elements of any one writing system.

In Latin script, the question of what is and is not a character arises mainly with diacritical marks and ligatures. Diacritics must be encoded as independent characters for applications such as IPA and are best encoded that way for the occasional diaeresis or stress marks used in English. But it would be very inconvenient to insist on such an encoding for Turkish, in which diacritically marked letters have their own separate positions in the alphabet. Thus independent diacritics must be included in the character set, but should not be used in certain applications. Ligatures such as fi and fl probably should not be considered characters, because they are significant only for rendering and can be derived automatically. The ae ligature, on the other hand, is part of the alphabet in Norwegian and Danish. Use of the oe ligature in French is not absolutely required but is considered good typographic practice; it cannot be determined automatically and therefore must be encoded explicitly. But when sorting or searching, it should be treated as if it were the individual letters o and e.

Positional variants in scripts such as Arabic are quite clearly presentation forms that should not be encoded separately. In Hebrew and Greek, however, the handful of variant forms traditionally have separate codes. For each writing system there are situations where it cannot be clearly decided just what is or is not a character.

Existing character sets evidently have a large influence on the design of the universal character set, which must be able to represent all text that could be encoded previously. Some of the mistakes of the past must be retained for the sake of compatibility, but should be avoided in the future. For example, the inadequacies of typewriters and of ASCII have accustomed people to ignore the distinction between the hyphen, dash, and minus sign, or between opening and closing quotation marks. The hyphen-minus and neutral vertical quotation mark must continue to exist, but their use should be discouraged once the correct distinct characters exist. The universal character set can allow text to be encoded more precisely and more richly than before, and facilitate improved methods of processing text.
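
A short Python illustration of the distinctions ASCII collapses:

    import unicodedata

    # The ASCII hyphen-minus and neutral quotation mark each stand in for
    # several distinct characters.
    for ch in ('-', '\u2010', '\u2013', '\u2212', '"', '\u201c', '\u201d'):
        print('U+%04X' % ord(ch), unicodedata.name(ch))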

It is not expected that every device or piece of software that supports the universal character set should be capable of handling the requirements of all writing systems. However, it is essential that the character set itself contain all the elements required for every writing system (and only as many non-essential characters as are imposed by convenience or backward compatibility). Not all writing systems have been previously considered for processing by computer, while in other cases more than one competing encoding scheme exists. Careful and well informed choices must be made.

Character, glyph, keysym...

The distinction between the notions of "character" and "glyph" is fundamental in recent work with character sets. Informally, a character is a unit of information used to encode text, whereas a glyph is a shape (a homogeneous set of which constitutes a font) used to render text. The rendering process includes a mapping (not necessarily one-to-one) from characters to glyphs. A familiar example of such a mapping is the encoding vector in a PostScript font.

This distinction leads to two principles for the design of a character set: first, that variations in form (multiple glyphs) required for high-quality rendering of text should not be encoded with separate characters if their meaning is the same; and second, that even if two candidates for encoding are visually identical (such as capital D with stroke (Croat, Lapp) and capital eth (Icelandic)), and thus can be rendered with a single glyph, they must nevertheless be encoded separately if their meanings differ. Similarly, the script to which a character belongs is significant: Latin capital A and Greek capital alpha are distinct characters despite their shared form.
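
Both principles can be seen in the character names themselves (a Python sketch using the Unicode character database):

    import unicodedata

    # Shapes that coincide visually remain separate characters when their
    # meaning or script differs.
    pairs = [('\u0110', '\u00d0'),    # capital D with stroke vs. capital eth
             ('A', '\u0391')]         # Latin capital A vs. Greek capital alpha
    for a, b in pairs:
        print(unicodedata.name(a), '|', unicodedata.name(b))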

As usual, compromise, compatibility, and convenience blur the distinction and make the principles ambiguous. In the areas of mathematical symbols and diacritical marks, it is impractical to associate characters with distinct uses of each symbol, since the uses are varied and changing; instead, the shapes (glyphs) have to be used as the basis for characters. A borderline example is the case of the diaeresis or umlaut diacritical mark. Here the two meanings of the symbol are clearly defined and it can be useful for some applications to distinguish between them. In general, however, making such distinctions with diacritical marks is more trouble than it is worth.

It should also be kept in mind that the symbols used for keyboard input are not necessarily characters either. An input method is used to convert sequences of keystrokes into characters. At the simplest level, the input method consists merely of the interpretation of shift, control, and alternate keys held down in conjunction with another key. Other functions of an input method include dead-key handling, compose-character processing, and input of Han characters by typing a phonetic representation and disambiguating by choosing from a menu. The keysyms of the X Window System do not, therefore, constitute a character set and cannot in general be used directly as characters.
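
A toy sketch of dead-key handling (the key names and the mapping are hypothetical) shows how keystrokes and characters diverge even in simple cases:

    # A dead accent key produces no character by itself; it combines with
    # the following keystroke. (Illustrative only; real input methods are
    # table-driven and far more elaborate.)
    DEAD_ACUTE = 'dead_acute'
    COMPOSE = {('dead_acute', 'e'): '\u00e9',
               ('dead_acute', 'a'): '\u00e1'}

    def feed(keystrokes):
        """Convert a sequence of keysym-like names into characters."""
        out, pending = [], None
        for key in keystrokes:
            if key == DEAD_ACUTE:
                pending = key
            elif pending is not None:
                out.append(COMPOSE.get((pending, key), key))
                pending = None
            else:
                out.append(key)
        return ''.join(out)

    print(feed(['c', 'a', 'f', DEAD_ACUTE, 'e']))   # café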

Defining the new standard

ISO DIS 10646

An international standards committee began work on a draft international standard (DIS 10646) for a universal coded character set several years ago. The DIS was a four-octet (32-bit) code, in which each octet was limited to values that would represent printable characters in ISO 8859. This limitation would have made the code easier to transmit and process by obsolete means, but it eliminated so many codepoints that no two-octet subset of the code could offer even a minimal encoding of all the major languages (a Basic Multilingual Plane). No Han unification was considered. In an attempt to reduce the costs of storing and transmitting four octets for each character, various compaction forms were defined, of length 1, 2, or 3 octets or variable-length. The committee did not have the means to do adequate research in some areas, with the result that the draft submitted to international balloting in 1991 still had many serious problems. It was not adopted as a standard.

Unicode

Unicode is a 16-bit universal character set developed by a consortium of U.S. computer manufacturers and software houses, most of them in California. The code was compiled by a small team of engineers and linguists on the basis of existing international and national standards and corporate character sets, with much research and consultation with experts. Work was begun after DIS 10646 was started, with the conviction that the standards committee had little chance of producing a viable character set. The completed code (version 1.0) – published in two volumes (fall 1991, summer 1992) by Addison-Wesley – is technically very sound and makes few compromises to its principles, among which are Han unification and a relatively strict distinction between character and glyph.

Unicode-10646 merger and final standard

After the failure of DIS 10646, and given the stated intentions of many companies to produce products using Unicode despite its not being an officially sanctioned standard, negotiations between the ISO committee and the Unicode Consortium were begun, leading to a merger of the two character sets. A new DIS 10646 was defined. It is still a 32-bit character set (though no longer subject to the restriction that each octet be printable according to ISO 8859), with a 16-bit Basic Multilingual Plane (BMP) instead of compaction forms. The BMP is equivalent to 32-bit characters with the upper sixteen bits all zeroes; its content is basically Unicode. For transmission purposes, an annex to the standard proposes an algorithm for converting the code into a variable-length form using only octets that represent printable characters.
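
The relation between the 32-bit code and its 16-bit BMP can be illustrated with the fixed-width forms as modern codecs name them (Python's utf-32-be for the four-octet form and utf-16-be for the two-octet form; for a BMP character the two agree apart from the leading zero octets):

    # A BMP character occupies the low sixteen bits of its 32-bit code.
    ch = '\u00e9'
    print(ch.encode('utf-32-be').hex())    # 000000e9
    print(ch.encode('utf-16-be').hex())    # 00e9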

Compromises have been made so that the new DIS is acceptable to all countries participating in the standards process, but it remains compatible with Unicode 1.0 and has been adopted by the Unicode Consortium as Unicode 1.1. The scenario of two incompatible universal character sets has thus been avoided.

The new DIS was approved by international balloting in May 1992, and its text finalized at an ISO meeting in Seoul in June. It was published in 1993. It currently defines only the BMP; future additions to the standard will fill in other parts of the codespace. In particular, the national variants of Han characters may be added, since unified Han characters are not always considered adequate. Within the BMP itself, several less commonly used scripts remain to be encoded once further research has been completed.

Acknowledgements

The bulk of the content of this article is derived from the lively and informative discussion carried out by members of the Unicode@Unicode.ORG and ISO10646@listproc.HCF.JHU.EDU mailing lists between 1990 and 1992. Thanks to Glenn Adams, Alain LaBonté, Karen Smith-Yoshimura, and Erik van der Poel for comments on and corrections to this text.