Request For Comments: DRAFT                                Vadim Antonov
Category: Informational                                     Pluris, Inc.
                                                           18 March 1997

Rosetta Encoding For Multi-Lingual Texts

Status Of This Memo

This memo provides information for the Internet community. This memo does not specify an Internet standard of any kind. Distribution of this memo is unlimited.

Abstract

This document specifies a method for the efficient encoding of multi-lingual texts. The document includes a comparison with other methods, a detailed specification of the encoded text format and of the procedures for manipulation of such text, guidelines for development of specifications for particular languages, and sample source code.

1. Introduction: Requirements For A Practical Encoding

The Internet became a trans-national entity a long time ago, and its users speak and write in more than a hundred languages. However, there is still no single standard for encoding multilingual texts that facilitates transmission, storage, processing and rendering on displays or printed media.

Any practical multilingual encoding must have the following features:

(1) Compatibility with ASCII. Practically all modern data processing systems employ ASCII as a basic character encoding and there is no indication that the situation will change any time soon. Therefore, a practical encoding must include ASCII as a subset.

(2) Ability to include multiple languages in a single document. In the increasingly connected world, more and more people have to manipulate documents in different languages. One of the most frequent document manipulation techniques is composition from a number of different source texts, often written in different languages. Therefore, a practical standard must allow any combination of languages to be present in a text.

Antonov [Page 1]
RFC DRAFT Rosetta Encoding For Multi-Lingual Texts March 1997

(3) Partial legibility. It is unrealistic to assume that all computing systems will be able to display texts in all existing languages.
Therefore, it is important that any encoded document should be partially legible to any system which supports at least one of the languages used in the document. It also means that illegible parts of text should be easily recognizable as such by a simple algorithm. If such recognition is possible, a system can attempt to display or process a partially legible text by using some transliteration technique, or by simply skipping the illegible parts.

(4) Generic case conversion, sorting and case-insensitive matching. It is unreasonable to assume that all systems storing, indexing or otherwise processing documents will have support for all languages. Therefore it is important that a multilingual encoding should support some limited ability to perform trivial document manipulation, such as case conversion, sorting and case-insensitive pattern matching, by systems lacking facilities to support those particular languages. Although it is impossible to perform those functions ideally without knowledge of language-specific rules, some reasonable approximation can be achieved with simple operations on character codes. A text encoded with a method lacking this property cannot be effectively searched or indexed by third-party systems, such as Web indexing engines. Therefore this property is vital for Internet applications.

(5) Efficiency. A practical multilingual encoding should not be significantly less efficient than existing national character sets. Any attempt to introduce an inefficient encoding is practically doomed, particularly because most non-English speaking countries have very poor and overloaded telecommunication infrastructure.

(6) Context independence. It is very hard to modify existing software and network protocols to carry additional out-of-band information required to interpret documents. Therefore, decoding of documents must be possible without any such information.

(7) Random accessibility.
A document must be readable starting from an arbitrary point in the text with a minimal amount of analysis of the text surrounding that point. It also means that concatenation of correctly encoded documents must produce a correctly encoded document. This property allows embedding multi-lingual texts into source codes of programs processed by compilers allowing extended ASCII codes in string constants.

(8) Extensibility. Languages evolve, or are rediscovered (printing historical texts is a small-scale but quite real application), so a realistic encoding would allow adding support for new languages as needed.

(9) Unambiguousness. An encoding method should provide a single "natural" way to encode words in every language. In other words, although letters from one language's script can often be mimicked by letters of another script, use of foreign letters inside words should not be required, and should be generally discouraged.

Any encoding method not satisfying these criteria would cause the need for extensive redesign of network and database software in order to achieve true multilinguality, and therefore cannot be recommended as an Internet standard.

2. Critique Of Older Encoding Methods

Currently, most non-English texts are encoded with a plurality of language-specific code pages, some of which are straight extensions of ASCII, and some are non-ASCII character sequences encapsulated into ASCII texts by means of special control codes.

The extended ASCII code pages (such as ISO 8859) are used to represent groups of languages with similar letters. Generally, the extended ASCII code pages do not allow documents including words from different groups of languages. Even worse, there is no way to determine which language group is used in a particular document without some external information.
The need to "compress" several languages into a single group makes generic processing impossible because it results in non-lexicographical ordering of letter codes for most languages within the respective group. The 8-bit code pages are very limited in size, so at least some represented alphabets are compromised and allow only the simplest mode of writing (such as used for newsprint), and therefore cannot be used for more extensive literary writing. Consequently, the ISO 8859 character sets certainly cannot be considered a practical method for multilingual encoding.

Obviously, 8-bit code pages are not sufficient for non-alphabetical writing systems, such as logographic (for example, Chinese Han) or hieroglyphic writing systems, and are only marginally useful for syllabic or featural writing systems. As a result, there are a number of multi-octet encoding schemes for Asian languages. Some of them include ASCII characters (and characters of other alphabets, such as Cyrillic, Hebrew and Greek) as a subset. However, users in countries where those respective additional alphabets are dominant never use the multi-octet encodings due to their inherent inefficiency in representation of alphabetic languages. Practically all such multi-octet encodings do not allow random access, lacking local methods for determination of boundaries between multi-octet groups; the alternatives are very space-inefficient.

There are a number of official standards and proprietary schemes for multilingual text representation utilizing explicit register shifts between different code pages. Such schemes provide for efficient encoding, but do not allow random access and therefore require extensive support in software. The fundamental problem with explicit register shifts is that interpretation of text potentially requires looking back for an indefinite number of octets to find the latest effective register shift code.
Since most of those schemes are based on existing single-octet code pages, they also inherit all their disadvantages. Additionally, all existing register-shifting standards are not proper ASCII extensions because they employ control characters which are often used for other functions (such as control of terminals) by the existing computer systems.

The most advanced of existing methods, Unicode, is a logical extension of the multi-octet encoding approach. Unlike older multi-octet codes, Unicode employs a variable-length encoding (UTF), which is a superset of ASCII. Therefore, an 8-bit transparent computer system will have no trouble in storing or transmitting Unicode documents. An important feature of Unicode is that it supports random access. However, Unicode is efficient only for Latin-based languages; a document with text in Hebrew, Greek, or Russian (or in any other language using a non-Latin alphabet) will take nearly twice as many octets as the same text encoded with a respective ISO 8859 code page or some other single-octet code (such as one of the Microsoft Code Pages, or some other national code).

Unicode was supposed to be a glyph-based encoding; i.e. differently-shaped letters are represented with different codes, while similarly-shaped letters share the same code. Later this approach was modified to apply only within particular extended alphabets. Since national alphabets feature different ordering of letters, there cannot be a single coherent alphabetical order containing national alphabets. Therefore, there cannot be simple generic processing algorithms for Unicode texts.

Worse yet, there is a large number of identically-looking glyphs with different code pages, so an ambiguity of text representation is quite likely (for example, Russian letter YO can be represented by Cyrillic letter YO, by a combination of Cyrillic letter YE and combining diaeresis, or by a combination of Latin letter E and combining diaeresis - all producing indistinguishable glyphs).
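The size and ambiguity claims above can be checked with a few lines of Python. This is only an illustrative sketch: it relies on Python's standard `utf-8` and `iso8859_5` codecs and on the `unicodedata` module, none of which are part of this memo, and the sample word is an arbitrary choice.

```python
import unicodedata

# Arbitrary six-letter Russian sample word ("privet"), written with escapes.
word = "\u043f\u0440\u0438\u0432\u0435\u0442"

utf8_len = len(word.encode("utf-8"))      # two octets per Cyrillic letter
iso_len = len(word.encode("iso8859_5"))   # one octet per letter

# Three spellings of (lower-case) YO corresponding to the example in the text.
yo_single = "\u0451"           # CYRILLIC SMALL LETTER IO
yo_cyrillic = "\u0435\u0308"   # CYRILLIC SMALL LETTER IE + COMBINING DIAERESIS
yo_latin = "e\u0308"           # LATIN SMALL LETTER E + COMBINING DIAERESIS

# As raw code point sequences, all three are different strings.
distinct_spellings = len({yo_single, yo_cyrillic, yo_latin})

# Canonical composition (a later Unicode facility) merges the two Cyrillic
# spellings, but the Latin-based spelling still composes to a Latin letter.
cyrillic_forms_merge = unicodedata.normalize("NFC", yo_cyrillic) == yo_single
latin_form_composed = unicodedata.normalize("NFC", yo_latin)

print(utf8_len, iso_len, distinct_spellings, cyrillic_forms_merge)
```

An exact-match search for one spelling therefore misses the other two unless the searching program applies additional, Unicode-specific processing.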
Existence of separate code points for ligatures (such as Latin ff, fl, etc) makes it rather hard to perform even trivial exact-match searches, since the same word may have several valid text encodings. Additionally, the ligatures (like Latin small sharp S) cannot be case-converted.

The lack of generic processing capabilities makes Unicode practically useless for such applications as databases, search engines, or text indexing because those applications routinely perform case conversion and lexicographical sorting. Such functions on Unicode texts require external information about the exact identity of the text's language. Given that such external information is usually very limited (for example, a short string kept as an attribute of a file), processing of "pure" Unicode multi-lingual text is simply impossible (there is no way to locate boundaries between segments of text in different languages).

As a result, a number of authors proposed to include explicit language indication into Unicode texts, in the form of explicit register shifts (also known as "rich text"). Those proposals effectively eliminate such advantages of "pure" Unicode as random access and its being a proper ASCII superset.

Unicode also does not have a partial legibility property: there is no easy way to make sure that, for example, the result of case conversion of some text string will be displayable prior to the operation (an example would be an ASCII-only terminal, and text in German converted to lower case).

It is interesting to note that a Unicode text featuring language indicators (in the form of explicit register shifts, or derived from an external context) contains a large quantity of redundant information.
Obviously, in the case of non-Latin alphabetical languages, if the language of a particular text segment is known, there is no reason to use a multi-octet encoding for every letter, since at most one octet is needed to encode the letter in that particular language's alphabet!

In other words, in real life Unicode does not have any advantages over register-shifting or single code-page encoding schemes, but certainly is much less efficient.

3. Rosetta Encoding

The proposed universal multi-lingual encoding, Rosetta, satisfies all criteria for a practical standard encoding, as outlined above.

Rosetta encoding is based on the observation that the minimal unit of a language is a word, and that all letters within a word belong to the same language. Therefore, identifying language on a per-word basis eliminates most of the redundancy of a multi-octet encoding. On the other hand, words are relatively short and do not cross lines so it is easy to locate a current language indicator. This means that random access to Rosetta texts is simple and efficient. (For the purpose of Rosetta encoding, hyphenation creates two separate words: at the end of a line, and at the beginning of the next line, respectively).

The structure of Rosetta-encoded text is described by the following BNF:

::= * * ::= * ::= [] ( | ) ::= | | ::=

The maximal length of a single word is 127 octets, including the language selector. A text containing longer words is invalid. In logographic languages, a Rosetta word would correspond to a part of a sentence or an entire sentence composed from an unbroken sequence of characters.

Since practically all words in alphabetical languages are preceded with white space, an additional improvement in efficiency is achieved by providing a language selector which doubles as a white space.
As a result, Rosetta encodes texts with efficiency within 8% of the efficiency of single-octet (or double-octet, for Chinese and Japanese) code pages for languages spoken by at least 80% of the world population.

Unlike older code-page switching techniques, Rosetta uses switching between languages. For every language, letters are encoded in strictly alphabetical order, with simple mapping between codes for upper-case and lower-case letters (when applicable), thus facilitating generic lexicographical sorting and case-insensitive searching.

Rosetta-encoded texts have the partial legibility property (i.e. no mangled text will be produced if some particular language is not supported by a system), and are context-independent (so no external information is required to interpret texts).

As defined below, Rosetta supports up to 61505 languages; this number can be increased in the future if necessary.

In many languages, written text may contain additional hints to aid readers in understanding or pronouncing. Those hints are optional and appear mostly in children's books, works of poetry, and educational literature; hints are often omitted in newsprint. Examples of such hints are: stress accents in Slavic languages, or explicit spelling of vowels in Arabic or Hebrew. Rosetta allows separation of the core words (with simplified, newsprint spelling) and the optional hints by providing a generic format for inserting such hints into words. An additional usage for hints is to provide explicit instructions on hyphenation of a word to the rendering engine.

Hints are usually ignored during machine searching, indexing and sorting. When displaying or printing texts, hints may be taken into account or ignored, depending on the capacity of the rendering engine or the intended audience.

Finally, Rosetta is a proper superset of ASCII.

3.1. Code Table

Rosetta letters and control codes are defined in the following BNF (see also Table 1):

::= | ::= ::= | ::= ::= | | ::= |

There are two distinct text interpretation modes, the ASCII mode (all "letters" have codes in range from 100 to 176 octal), and the Rosetta mode (uppercase and lowercase letters occupy two separate regions, 100-176 octal and 300-376 octal, respectively), as shown in Table 1.

   +------------+--------------------+--------------------+
   |Binary Code |     ASCII Mode     |    Rosetta Mode    |
   +------------+--------------------+--------------------+
   | 00 0xx xxx |           Control Characters            |
   +------------+-----------------------------------------+
   | 00 1xx xxx |       Digits & Punctuation Marks        |
   +------------+--------------------+--------------------+
   | 01 xxx xxx |  Upper/Lower Case  | Upper-Case Letters |
   |            |  Letters & Signs   |                    |
   +------------+--------------------+--------------------+
   | 01 111 111 |            Delete Character             |
   +------------+--------------------+--------------------+
   | 10 xxx xxx |                    | Language Selectors |
   +------------+      Reserved      +--------------------+
   | 11 xxx xxx |                    | Lower-Case Letters |
   +------------+--------------------+--------------------+
   | 11 111 111 |                Reserved                 |
   +------------+-----------------------------------------+

                  Table 1. Rosetta Code Table

Digits (octets 60-71 octal) always represent decimal digits from 0 to 9 respectively, but can be rendered differently depending on the language selected.

Octets with values in range 200-277 octal are exclusively used in language selectors and a hint separator.

3.2. Format of Language Selectors

A language for every Rosetta word is selected with a language selector. Explicit language selectors are represented by special octet sequences. Additionally, any control or punctuation octet causes implicit switching to ASCII/English besides printing the corresponding ASCII character or performing the corresponding ASCII control function.
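The Rosetta-mode column of Table 1 can be read as a simple range test on the octet value. The following Python sketch is illustrative only: the function name and the returned labels are hypothetical, and the digits are taken to be octets 60-71 octal, as Section 3.3 treats them.

```python
def classify_octet(octet: int) -> str:
    """Classify one octet according to Table 1, Rosetta mode.

    The returned labels are illustrative names, not terms defined by the memo.
    """
    if not 0 <= octet <= 0o377:
        raise ValueError("not an octet")
    if octet == 0o177:
        return "delete"
    if octet == 0o377:
        return "reserved"
    if octet <= 0o037:                      # 00 0xx xxx
        return "control"
    if octet <= 0o077:                      # 00 1xx xxx
        return "digit" if 0o060 <= octet <= 0o071 else "punctuation"
    if octet <= 0o176:                      # 01 xxx xxx
        return "upper-case letter"
    if octet <= 0o277:                      # 10 xxx xxx (selector or hint separator)
        return "language selector"
    return "lower-case letter"              # 11 xxx xxx


for sample in (0o012, 0o065, 0o101, 0o200, 0o301):
    print(oct(sample), classify_octet(sample))
```

Because the class depends only on the two or three most significant bits (plus the two special values 177 and 377 octal), a system that knows nothing about a given language can still separate its letters from selectors, digits and punctuation.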
An explicit language selector is composed from one, two or three octets, as shown below:

            -- function code:
            |     0 = zero-width selector
      [10 0fx xxx]    1 = space & selector

      [10 1fx xxx] [11 xxx xxx]

       1st octet    2nd octet    3rd octet
      [10 1fx xxx] [01 xxx xxx] [11 xxx xxx]
            <--->     <----->      <----->
          bits 0-3   bits 4-9    bits 10-15

Language_Number ::= (|) | |

A language selector with the function bit set to one causes white space to be printed prior to the corresponding word (i.e. at the left side of the word). A language selector with a zero function bit does not print anything.

For some language numbers, corresponding selector sequences would include octets 177 or 377. Such language numbers are therefore unusable, and must not be allocated. (Note that a valid language selector cannot include octet 300.)

The spacing language selector for ASCII/English (language number 0) is redundant (the ASCII space character (40 octal) performs the implicit switching to ASCII/English); so the corresponding octet 220 is used as a hint separator.

3.3. Interpretation Of Rosetta Text

The interpretation of Rosetta-encoded text starts with the ASCII/English language by default. When a language selector sequence is encountered, the subsequent letters are interpreted as letters of a word in the selected language. Any control or punctuation octet resets the language back to ASCII/English.

Besides switching the language back to ASCII/English, all characters with codes in range 0-37, 177 or 377 octal are interpreted as corresponding ASCII control characters.

All characters with codes in range 40-57 and 72-77 octal print corresponding ASCII characters and switch the language to ASCII/English. The reason for this rule is to eliminate unnecessary explicit switching back to ASCII/English for most numbers and punctuation signs.
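The single-octet selector form and the implicit reset rule can be sketched as follows. The function names are hypothetical, and the sketch assumes that the four low-order bits of a `[10 0fx xxx]` octet hold the language number directly; this is consistent with octet 220 octal being the spacing selector for language 0, but the memo's own definition should be treated as authoritative. Two- and three-octet selectors are deliberately not handled.

```python
HINT_SEPARATOR = 0o220  # spacing selector for language 0, reused as hint separator


def decode_single_octet_selector(octet: int):
    """Return (language_number, prints_space) for a one-octet selector.

    Assumes the layout [10 0fx xxx]: f is the function bit and the four
    low-order bits are the language number (an assumption of this sketch).
    """
    if octet & 0o340 != 0o200:
        raise ValueError("not a single-octet language selector")
    if octet == HINT_SEPARATOR:
        raise ValueError("octet 220 is the hint separator, not a selector")
    return octet & 0o017, bool(octet & 0o020)


def resets_to_ascii(octet: int) -> bool:
    """True if the octet implicitly switches the language to ASCII/English."""
    is_control = octet <= 0o037 or octet in (0o177, 0o377)
    is_punctuation = 0o040 <= octet <= 0o057 or 0o072 <= octet <= 0o077
    return is_control or is_punctuation


print(decode_single_octet_selector(0o221))  # spacing selector, language 1
print(resets_to_ascii(0o040), resets_to_ascii(0o065))
```

Note that digits (60-71 octal) are neither controls nor punctuation here, so a number embedded in a word does not end the word's language.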
When the current language is ASCII/English, upper-case Latin letters occupy the range 101-132 octal, and lower-case Latin letters occupy codes in range 141-172 octal.

All languages with even assigned numbers have upper-case and lower-case letters with codes 100-176 octal and 300-376 octal respectively, up to 63 different letters.

Languages with odd assigned numbers do not have different cases, and so octets with values in ranges 100-176 octal and 300-376 octal can be used by different letters, up to 126 different letters.

If a language has more than 126 letters or symbols, a double-octet encoding is necessary:

        1st octet     2nd octet
      [x1 xxx xxx]  [x1 xxx xxx]
       |        |    |        |
     bit 15     |  bit 7    bit 0
              bit 8

The octet containing the most significant bits comes first, to allow generic lexicographical sorting of double-octet sequences. Special care must be taken to avoid using octets with values 177 and 377 octal. The dual-octet encoding allows as many as 15876 symbols to be specified; in the unlikely case that this is not enough, a triple-octet encoding can be used.

A word in a language requiring dual-octet encoding must contain an even number of letters following the word's language selector (note that digits embedded into words of dual-octet encoded languages occupy only one octet). Interpretation of malformed words is not defined; it is recommended that a program encountering a malformed word generate some kind of error message.

It is important to note that interpretation of multi-octet encoding is purely language-specific. I.e. for a system which does not support such a language, the sequence of multi-octet characters is indistinguishable from a sequence of single-octet characters.

In multiple-octet encodings, hints should be inserted before the first octet of corresponding characters' representations.
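The figure of 15876 double-octet symbols follows from the 126 octet values usable for letters (63 in each of the two letter ranges). A small Python sketch of that arithmetic, and of the most-significant-octet-first packing, is given below; the helper names are hypothetical.

```python
def is_letter_octet(octet: int) -> bool:
    """True for octets in the two letter ranges, 100-176 and 300-376 octal."""
    return 0o100 <= octet <= 0o176 or 0o300 <= octet <= 0o376


LETTER_OCTETS = [octet for octet in range(256) if is_letter_octet(octet)]


def double_octet_code(first: int, second: int) -> int:
    """Combine two letter octets into a 16-bit code, first octet = bits 15-8."""
    if not (is_letter_octet(first) and is_letter_octet(second)):
        raise ValueError("octets 177 and 377 and non-letter octets are not allowed")
    return (first << 8) | second


single_capacity = len(LETTER_OCTETS)     # letters available to one octet
double_capacity = single_capacity ** 2   # symbols available to two octets
print(single_capacity, double_capacity)
```

Because the first octet carries the high-order bits, comparing the raw octet pairs byte by byte gives the same order as comparing the combined 16-bit codes, which is what lets a language-unaware system sort such text.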
When texts are transmitted over the Internet, they have no internal end of file indication, and text lines are terminated with a Carriage Return - Line Feed sequence. There is no guarantee that ASCII control characters other than Back Space (BS, 10 octal), Horizontal Tabulation (HT, 11 octal), Line Feed (LF, 12 octal) and Carriage Return (CR, 15 octal) will be interpreted properly by a recipient system.

Using octets 0 and 377 in any text is not recommended, since they will likely be misinterpreted as "end of string" or "end of file" by many application programs written in C.

An internal system-specific representation of a Rosetta text may use different code sequences for end of line and end of file (for example, a Unix system uses a single Line Feed to end a line, and an MS/DOS-based system uses the SUB character (32 octal) as end of file). Special care must be taken to convert the internal representation into the network representation when transmitting such texts over the Internet.

3.4. Printing or Displaying Rosetta Texts

The most complicated aspect of rendering multilingual texts for human-readable media is combining phrases or words from languages with different directions of writing (from right to left, from left to right and from top to bottom).

The general direction of rendering of Rosetta texts is not specified in the texts themselves, and so should be selected elsewhere.

Obviously, top-bottom and horizontal writing cannot be usefully combined. However, this is not usually a problem, since top-down writing systems usually allow left-to-right writing without causing much inconvenience to the readers. When top-down rendering is required, the horizontally-rendered words must be rotated 90 degrees, corresponding to the direction of reading, to preserve relative positions of the letters.
In some cases (such as Mongolian), the horizontal rendering of text is impossible, so when words and phrases in such a language must be embedded into a horizontally-rendered text, the individual words should be rendered vertically, with horizontal spacing between the words. (A similar technique can be used to embed horizontally-rendered words and numbers into vertically-rendered texts).

The simplest solution for rendering texts composed from left-right and right-left writings would be to select a primary direction (which would be from left to right) and simply use the reverse order of encoding for right-left words and phrases. Unfortunately this has the disadvantage of making generic lexicographical sorting impossible.

An alternative approach would be to encode letters or other glyphs in a natural order for every respective language, and switch rendering directions when printing or displaying the texts. This method is rather complicated, particularly because for a line with multiple words there will be a number of alternative encoding orders, depending on whether reverse-rendering regions span several words, or whether those words are separate.

To mitigate the complexity, and allow reasonable generic processing, Rosetta encoding always assumes left-to-right ordering of words (words are delimited by ASCII control characters, punctuation, and by explicit language selectors, as described above). Within every word, the ordering of letters is natural. The order of words in encoded texts in a right-to-left writing system is opposite to the natural order. This approach allows very simple and unambiguous encoding, while still allowing for some meaningful lexicographical sorting (it can be argued that most sorting functions are performed on words, not on phrases).

Within numbers (i.e.
sequences of digits within words), the ordering of digits should be from the most significant to the least significant, no matter how those numbers are rendered; this rule allows unambiguous interpretation of numbers by a computer system ignorant of the rules of a particular language. This means that there can be several regions of alternating rendering directions within alphanumeric words if a language assumes that numbers are written starting from the least significant digit (this is the case with Arabic).

The alignment of lines (left margin, right margin, centered, etc) is not defined by the Rosetta encoding.

Some languages require that in specific cases particular combinations of letters be rendered differently, forming ligatures. Ligatures are purely a rendering issue, and should not be assigned separate letter codes. The same rule applies to modification of letter shapes between independent, initial, medial, and final forms (common in Arabic and other scripts), and to other similar cases. In other words, the encoding of a word must depend only on its letters, not on typographical devices. If there are no font-specific rigid rules for combining letters into ligatures, hints can be used to guide selection of alternative shapes.

An argument can be made that a similar rule may be applied to composition of glyphs in syllabic writing systems (such as Korean Hangul). When a simple algorithmic composition of syllable glyphs from letters is possible, the letter-based encoding may be preferable to syllable-based encoding.

A rendering process can choose to ignore hints completely (for example, when the chosen font or output device does not support rendering of hints; or when the resolution of the output device is insufficient to display embellished text), or interpret hints only partially.

A rendering engine may perform text hyphenation by using language-specific algorithms.
The first step is to find hyphenation points in a word; note that the word for the purposes of hyphenation is defined as a sequence of non-white space characters; so it can be composed from several Rosetta words. After hyphenation points are found according to the rules of the respective languages, the word is split at the most suitable point and, depending on which language is effective at that point, an appropriate hyphen symbol is added.

To aid hyphenation, hints can be used to explicitly indicate hyphenation points in a word. Usually, hints with codes having bit 200 octal set should be inserted at hyphenation points; any word with such hints can be hyphenated at those points only. A hyphenation hint inserted before the first letter in the word will therefore inhibit hyphenation of the entire word (if no other hyphenation hints are present).

3.5. Generic Sorting and Case-Insensitive Matching

A generic case-sensitive comparison of Rosetta-encoded text strings is defined as a lexicographical comparison of strings composed of pairs (language number, letter code), and produced from the Rosetta strings by processing and removal of language selector codes and hints.

Letter codes are merely values of corresponding octets (i.e. they are always in ranges 60-71, 100-176 or 300-376 octal, except for ASCII characters). For the purpose of sorting or comparison, conversion of multi-octet sequences into letter codes with more significant bits is not necessary (it does not change the result of the comparison operation).

Note that all source octets in ranges 0-57 and 72-77 octal, and octets 177 and 377 octal are always converted into pairs (0, octet_value).

When comparison of digits is performed, the associated language codes must be ignored to ensure correct sorting of numbers rendered in different scripts. A sorting method using numeric values of digit sequences instead of lexicographical ordering may be preferable in most applications.
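The pair-based comparison, together with the upper-to-lower case conversion rules stated in the remainder of this section, can be sketched in Python as follows. This is a simplified illustration, not a complete implementation: the function names are hypothetical, only one-octet language selectors (languages 0-15, with the language number assumed to be the four low-order bits) are understood, hints are rejected rather than removed, and a spacing selector is assumed to contribute a blank to the comparison.

```python
HINT_SEPARATOR = 0o220


def to_pairs(data: bytes):
    """Convert Rosetta octets into (language number, letter code) pairs."""
    pairs, language = [], 0
    for octet in data:
        if 0o200 <= octet <= 0o277:
            if octet == HINT_SEPARATOR or octet & 0o040:
                raise ValueError("hints and multi-octet selectors not handled")
            language = octet & 0o017
            if octet & 0o020:                # spacing selector (assumed to
                pairs.append((0, 0o040))     # compare as an ASCII blank)
        elif 0o060 <= octet <= 0o071:
            pairs.append((0, octet))         # digits ignore their language
        elif octet < 0o100 or octet in (0o177, 0o377):
            language = 0                     # control/punctuation resets to ASCII
            pairs.append((0, octet))
        else:
            if language == 0 and octet >= 0o300:
                raise ValueError("octets 300-376 are reserved in ASCII mode")
            pairs.append((language, octet))
    return pairs


def lower_pair(pair):
    """Apply the upper-to-lower conversion rules (a), (b), (c) to one pair."""
    language, code = pair
    if language == 0:
        return (0, code + 0o040) if 0o101 <= code <= 0o132 else pair
    if language % 2 == 0 and 0o100 <= code <= 0o176:
        return (language, code + 0o200)
    return pair                              # odd language: no case distinction


def fold_key(data: bytes):
    """Key for case-insensitive comparison of Rosetta strings."""
    return [lower_pair(pair) for pair in to_pairs(data)]


words = [b"beta", b"Alpha", bytes([0o202, 0o100, 0o301])]
print(sorted(words, key=fold_key))
```

Passing `to_pairs` as the sort key gives the case-sensitive ordering, and `fold_key` the case-insensitive one; neither needs any knowledge of the languages involved beyond their numbers.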
The conversion from upper case to lower case is performed on the pairs (language number, letter code), by changing the letter code according to the following rules:

(a) if language number is 0 (ASCII/English) and letter code is in range 101-132 octal, add 40 octal to letter code;

(b) if language number is even (but not 0) and letter code is in range 100-176 octal, add 200 octal to letter code;

(c) if language number is odd, letter code is not changed.

Similarly, the conversion from lower case to upper case is performed as:

(a) if language number is 0 (ASCII/English) and letter code is in range 141-172 octal, subtract 40 octal from letter code;

(b) if language number is even (but not 0) and letter code is in range 300-376 octal, subtract 200 octal from letter code;

(c) if language number is odd, letter code is not changed.

If there is a language requiring multi-octet encoding and having an even number (i.e. having upper case and lower case), the case conversion will affect more than one bit in its letter codes; so the code table must be planned accordingly. In other words, for such languages all octets in a sequence for a letter must belong to the same range of values (100-176 or 300-376 octal) according to the letter's case.

4. Further Rosetta Specifications

Further specification of Rosetta encoding as applied to particular languages will be presented in a series of separate documents, Language Encoding Specifications, and Assigned Language Numbers. The reason for such separation is to avoid updating existing documents.

Except for the language numbers assigned in the first revision of the Assigned Language Numbers document, the numbers will be allocated in order of requests from authors of the respective Language Encoding Specification documents. Please check the latest Assigned Language Numbers document for the registry contact information.
The lists of assigned numbers will be published periodically by the registry in documents superseding previous Assigned Language Numbers documents.

The allocation of Rosetta language numbers is performed by the Registry, on a first-come, first-served basis, in accordance with the guidelines below. The contact information for the Registry is provided in the latest published revision of the Assigned Language Numbers document.

5. Language Encoding Specification Guidelines

This section outlines general rules for writing language encoding specifications for particular natural languages.

5.1. Language Number Allocation

The criteria for deciding if some writing system should be treated as a separate entity are somewhat imprecise, so applying common sense is advised:

(a) Dialects and minor variations of a writing system should not be considered separate languages; instead the specification should cover all known dialects.

(b) It is not recommended to include obsolete orthography in the specification of modern languages. If a need arises to encode historical documents, a separate historical version of the language will merit allocation of a separate language number. Such historical versions of alphabets must include all letters found in historical documents and literary works for the entire period of the existence of the language. Since the use of such historical languages is unlikely to be frequent, such languages should be allocated numbers corresponding to three-octet language selectors.

(c) If a single language has several radically different writing systems, those systems may be allocated a single language number when there is sufficient space in a code page to include all alphabets, and if the rendering rules for those systems are compatible.
An example of a single language which needs to have two different language codes is Mongolian, with Outer Mongolian written horizontally in Cyrillic script, and Inner Mongolian written vertically in Mongolian script. (d) Existence of different typographical devices for printing or writing the same language must not be considered sufficient reason for allocating new language numbers. (e) Newer revisions of language encoding specifications must reuse language numbers allocated to older versions. An unfortunate effect of this rule is that it makes hard to introduce significant revisions because they can cause incompatibility with existing texts. Therefore, a special care must be taken to make language specifications correct from the beginning. (f) Algorithmic and other formal languages are not eligible for allocation of separate language numbers. Symbols used in such languages should be included in a generic list of symbols. Odd language numbers are allocated to the languages with no distinctive upper and lower case letter sets. Conversely, even language numbers are allocated to languages which use writing systems featuring upper case (capital) and lower case (small) letters. Antonov [Page 14] RFC DRAFT Rosetta Encoding For Multi-Lingual Texts March 1997 5.2. Alphabets Codes for letters in an alphabet must be allocated in the commonly used dictionary order, for that particular language. An authoritative dictionary can be selected as the reference document for determination of the order. Upper case (capital) letters should be assigned codes in range 100-176 octal; lower case (small) letters should be assigned codes in range 300-376. (The only exception is ASCII/English, for historical reasons). Corresponding upper and lower case letters must have codes differing by 200 octal. The alphabet should not contain various typographical forms (such as ligatures, independent, initial, medial or final letter forms, etc); i.e. every letter must be represented by a single code. 
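
   As an illustration only, the code ranges above, together with the
   case conversion rules given earlier in this memo, can be sketched
   in C for single-octet letter codes.  The function names
   rosetta_tolower and rosetta_toupper are hypothetical and are not
   defined by this memo; multi-octet letters are not handled.

```c
#include <assert.h>

/*
 * Hypothetical helper (name not from the memo): convert one
 * single-octet letter code from upper case to lower case.
 */
unsigned char rosetta_tolower(unsigned lang, unsigned char code)
{
    if (lang == 0) {                      /* ASCII/English */
        if (code >= 0101 && code <= 0132)
            return (unsigned char)(code + 040);
    } else if ((lang & 1u) == 0) {        /* even number: two cases */
        if (code >= 0100 && code <= 0176)
            return (unsigned char)(code + 0200);
    }
    return code;    /* odd language number, or not an upper case code */
}

/*
 * Hypothetical helper (name not from the memo): convert one
 * single-octet letter code from lower case to upper case.
 */
unsigned char rosetta_toupper(unsigned lang, unsigned char code)
{
    if (lang == 0) {                      /* ASCII/English */
        if (code >= 0141 && code <= 0172)
            return (unsigned char)(code - 040);
    } else if ((lang & 1u) == 0) {        /* even number: two cases */
        if (code >= 0300 && code <= 0376)
            return (unsigned char)(code - 0200);
    }
    return code;    /* odd language number, or not a lower case code */
}
```

   For example, rosetta_tolower(2, 0101) yields 0301, while
   rosetta_tolower(3, 0101) leaves the code unchanged because language
   number 3 is odd.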

   Generally, alphabets should not contain any punctuation signs
   (unless those signs appear as a part of a word, and are not mere
   typographical devices).  A collection of non-ASCII punctuation
   signs is allocated a separate language number; it has sufficient
   code space available to add more punctuation signs if necessary.

5.3. Required Specification Sections

   A Rosetta language specification must include sufficient
   information for an implementor not familiar with the language to be
   able to implement software modules performing input, displaying,
   and trivial processing (such as editing, sorting or indexing) of
   texts in that language.

   Therefore, a language specification must include:

   (1) The standard name of the language, and widely used aliases.

   (2) Statistical information on language usage, and the list of
       countries where the language is widely used.

   (3) The assigned language number, and the corresponding Rosetta
       language selector sequence.

   (4) Primary direction of text rendering, and acceptable
       alternatives.

   (5) Rendering rules for numerals.

   (6) The complete and comprehensive list of letters, and
       corresponding octet values (or multi-octet sequences, if
       applicable), in lexicographical order.  This list must contain
       canonical names of letters, and renditions of letter shapes
       (using a commonly used book or newspaper font).  Such
       renditions must be embedded in the document as graphical
       images, thus allowing the document to be reproduced on any
       graphical output device.  (An ASCII version of the document
       should provide a reference to a WWW site where the
       corresponding graphic images can be found.)  If letters have
       more than one shape, depending on their position (independent,
       initial, medial, final, etc), the graphic images of the
       alternative shapes must be provided.

   (7) The complete and comprehensive list of digits, if different
       from ASCII characters; and rules of rendering of numerals
       (particularly if a non-positional numerical system is used).
       The description must include the same information as the
       description of letters.

   (8) Detailed description of hints, including method of encoding,
       rules of rendering, and graphical images of the corresponding
       glyphs.

   (9) The default and alternative algorithms for conversion between
       upper and lower case, if applicable.

   (10) The default and alternative algorithms for sorting words.  If
       a language has upper and lower case letters, two default
       algorithms should be specified: case sensitive (placing upper
       case letters before all lower case letters), and dictionary
       (case-insensitive).  When multiple sorting methods are commonly
       used, the preferred method is the method employed by
       authoritative dictionaries.

   (11) Description of special rendition techniques (such as
       ligatures, or position or context-dependent modification of
       letter shapes) necessary for adequate reproduction of the text.
       Such description must include graphical renditions of the
       modified letter shapes.

   (12) Description of a commonly used input method, and alternatives.
       The description should include at least one example of a
       widely-used keyboard layout.

   (13) Description of conversion to and from commonly-used character
       sets.

   (14) References to available language resources.

   (15) The complete list of differences from the previous version of
       the document, if applicable.

5.4. Private Vendor Versions Of Alphabets Or Language Specifications

   Creation of such private versions is strongly discouraged.  If a
   deficiency is found in an existing set of language specifications,
   the regular IETF standards route must be taken to amend the
   deficiency.

6. Changes From The Previous Revision

   This is the initial version of the document.

7. Security Considerations

   The universal encoding is likely to improve communication between
   citizens of different countries, which can be viewed as a security
   risk by governments.

8. Information Resources

   A good source of statistical information on natural languages is
   the Ethnologue database, available at http://www.sil.org.

   A collection of Rosetta-related utilities, libraries, documents,
   and fonts (as well as some links to other internationalization
   resources) can be found at http://www.pluris.com/rosetta.

9. References

   [ASCII] American National Standards Institute, Coded character set:
      7-bit American National Standard Code for Information
      Interchange, ANSI X3.4-1986.

   [ISO-8859] 8-bit Single-Byte Coded Graphic Character Sets, 1988.

   [AN] Akira Nakanishi, Writing Systems Of The World,
      ISBN 0-8084-1654-9, 1990.

   [UNICODE] The Unicode Standard Version 1.0 Volume 1,
      ISBN 0-201-56788-1, 1991.

   [ISO 2DIS 10646] Information Technology - Universal Multiple-Octet
      Coded Character Set (UCS), ISO/IEC JTC1/SC2/WG2 N783,
      12/26/1991.

   [RFC 1345] K. Simonsen, Character Mnemonics & Character Sets,
      06/11/1992.

   [RFC 1456] C. Nguyen, H. Ngo, C. Bui, T. van Nguyen, Conventions
      for Encoding the Vietnamese Language VISCII: VIetnamese Standard
      Code for Information Interchange VIQR: VIetnamese
      Quoted-Readable Specification, 05/08/1993.

   [RFC 1468] J. Murai, M. Crispin, E. van der Poel, Japanese
      Character Encoding for Internet Messages, 06/04/1993.

   [RFC 1489] A. Chernov, Registration of a Cyrillic Character Set,
      07/23/1993.

   [RFC 1554] M. Ohta, K. Handa, ISO-2022-JP-2: Multilingual Extension
      of ISO-2022-JP, 12/23/1993.

   [RFC 1555] H. Nussbacher, Y. Bourvine, Hebrew Character Encoding
      for Internet Messages, 12/23/1993.

   [RFC 1556] H. Nussbacher, Handling of Bi-directional Texts in MIME,
      12/23/1993.

   [RFC 1557] K. Chon, H. Je Park, U. Choi, Korean Character Encoding
      for Internet Messages, 12/27/1993.

   [RFC 1815] M. Ohta, Character Sets ISO-10646 and ISO-10646-J-1,
      08/01/1995.

   [RFC 1922] H. Zhu, D. Hu, Z. Wang, T. Kao, W. Chang, M. Crispin,
      Chinese Character Encoding for Internet Messages, 03/26/1996.

   [RFC 1947] D. Spinellis, Greek Character Encoding for Electronic
      Mail Messages, 05/17/1996.

   [RFC 2044] F. Yergeau, UTF-8, a transformation format of Unicode
      and ISO 10646, 10/30/1996.

10. Author's Address

   Vadim Antonov
   Pluris, Inc.
   2307 Coronet Blvd.
   Belmont, CA 94002

   e-mail: avg@pluris.com
   fax: +1 (415) 654-9222

Appendix A. Sample Encoding And Decoding Routines

   /*
    * Definitions for Rosetta encoding & decoding routines
    *
    * Note that these routines do NOT split or recombine
    * multi-octet characters.  That is purely an input and
    * rendering issue.
    *
    * The code uses C99 compound literals and the GNU C
    * case-range extension (case A ... B).
    */
   #include <assert.h>
   #include <stdio.h>

   /*
    * GETC() and PUTC() are supplied by the application; the
    * stdio defaults below are only an assumption.
    */
   #ifndef GETC
   #define GETC()  getchar()
   #define PUTC(x) putchar(x)
   #endif

   struct _Rune {
       unsigned short lang;    /* language number */
       unsigned char  hint;    /* hint code (0 if no hint) */
       unsigned char  c;       /* character code */
   };
   typedef struct _Rune Rune;

   /* EOF Rune */
   #define REOF    ((Rune){ ~0, ~0, ~0 })

   /* NULL Rune */
   #define RNULL   ((Rune){ 0, 0, 0 })

   /* Returns one when Rune contains end of file */
   static inline int r_eof(Rune rune)
   {
       return rune.lang == (unsigned short)~0;
   }

   /* Compare two Runes (hints are ignored) */
   static inline int r_cmp(Rune r1, Rune r2)
   {
       if( r1.lang == r2.lang )
           return r1.c - r2.c;
       return r1.lang - r2.lang;
   }

   /* Check if two Runes are identical (including hints) */
   static inline int r_eq(Rune r1, Rune r2)
   {
       return r1.lang == r2.lang && r1.hint == r2.hint &&
              r1.c == r2.c;
   }

   extern Rune r_getc(Rune *state);
   extern void r_putc(Rune c, Rune *state);
   extern void r_flush(Rune *state);

   /*
    * Get the next character from Rosetta text file.
    * The argument is the pointer to a cell used to keep
    * the current state.  It must be zero initially.
    */
   Rune r_getc(Rune *state)
   {
       Rune c;
       int ic, cc;

       c.hint = 0;
   next:
       switch( (ic = GETC()) ) {
       case EOF:
           c = REOF;
           break;

       case 0220:              /* hint follows */
           switch( (ic = GETC()) ) {
           case 0100 ... 0176:
           case 0300 ... 0376:
               c.hint = ic;
               goto next;
           }
           /* Bad hint character, treat as ASCII code */
           /* FALLTHROUGH */
       default:                /* ASCII control and punctuation */
           state->lang = 0;
           c.lang = 0;
           c.c = ic;
           break;

       case 060 ... 071:       /* digits */
       case 0100 ... 0176:     /* upper case letters (all of */
                               /* ASCII when language is 0) */
       case 0300 ... 0376:     /* lower case letters */
           c.lang = state->lang;
           c.c = ic;
           break;

       case 0200 ... 0217:     /* short language selector */
           state->lang = ic - 0200;
           goto next;

       case 0221 ... 0237:     /* short language selector & space */
           state->lang = ic - 0220;
           c = (Rune){ 0, 0, ' ' };
           break;

       case 0240 ... 0277:     /* long language selector */
           cc = GETC();
           state->lang = (ic & 017) | ((cc & 077) << 4);
           if( cc >= 0100 && cc < 0177 ) {
               cc = GETC();
               state->lang |= (cc & 077) << 10;
           }
           if( cc < 0301 || cc >= 0377 ) {
               /* Bad language selector, treat as ASCII code */
               state->lang = 0;
               c = (Rune) { 0, 0, cc };
               break;
           }
           if( ic & 020 ) {
               c = (Rune){ 0, 0, ' ' };
               break;
           }
           goto next;
       }
       return c;
   }

   /*
    * Put a character into Rosetta text file.
    *
    * This routine delays output of space characters to
    * allow optimization of the following language selectors.
    * Normally, this is not a problem, since no text files
    * end with spaces.  However, output of space character
    * may be forced at any time with r_flush().
    *
    * The second argument is the pointer to a cell used to keep
    * the current state (including current language number, in
    * "lang", and pending space indicator, in "c").
    * The initial state must be RNULL.
    */
   void r_putc(Rune c, Rune *state)
   {
       /* Is c an ASCII character? */
       if( c.lang == 0 ) {
           assert(c.c < 0200);
           if( c.c == ' ' && c.hint == 0 ) {
               if( state->c )
                   PUTC(' ');
               *state = (Rune) {0, 0, 020};
               return;
           }
           if( state->c )
               PUTC(' ');
           else if( state->lang ) {
               switch( c.c ) {
               case ' ':
               case 060 ... 071:
               case 0100 ... 0176:
                   PUTC(0200);     /* Explicit switching */
               }
           }
           *state = (Rune) {0, 0, 0};
           goto outc;
       }

       /*
        * Did the language code change?
        */
       if( c.lang != state->lang ) {
           int selc = 0200 | state->c | (c.lang & 017);

           state->lang = c.lang;
           state->c = 0;
           c.lang >>= 4;
           if( c.lang ) {      /* Long selector needed */
               PUTC(selc | 040);
               if( c.lang & 0xfc0 ) {  /* x-long selector? */
                   assert((c.lang & 077) != 077);
                   PUTC(0100 | (c.lang & 077));
                   c.lang >>= 6;
               }
               assert((c.lang & 077) != 077);
               PUTC(0300 | (c.lang & 077));
           } else
               PUTC(selc);
       }
       /* Letters, or digits carried in the current language */
       assert((060 <= c.c && c.c <= 071) ||
              (0100 <= (c.c & ~0200) && (c.c & ~0200) < 0177));
   outc:
       if( c.hint ) {
           assert(0100 <= (c.hint & ~0200) &&
                  (c.hint & ~0200) < 0177);
           PUTC(0220);
           PUTC(c.hint);
       }
       PUTC(c.c);
   }

   /*
    * Force output of any pending spaces.
    * This routine uses the same state as r_putc().
    */
   void r_flush(Rune *state)
   {
       if( state->c ) {
           PUTC(' ');
           state->c = 0;
       }
   }
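
   The language selector arithmetic used by r_putc() and r_getc() can
   also be isolated into two small functions.  The following is a
   sketch derived from the sample routines in this appendix, not an
   additional part of the specification; the names sel_encode and
   sel_decode, and the convention of returning 0 on error, are
   hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: write the selector for language number `lang`
 * (1..65535) into out[] and return its length in octets.  `space`
 * is non-zero when the selector also stands for a preceding space.
 * Returns 0 when the number cannot be encoded (an octet would
 * become 0177 or 0377, which r_putc() rejects with assert()).
 */
size_t sel_encode(unsigned lang, int space, unsigned char out[3])
{
    unsigned lead = 0200u | (space ? 020u : 0u) | (lang & 017u);
    unsigned rest = lang >> 4;
    size_t n = 0;

    if (lang == 0 || lang > 0xffffu)
        return 0;
    if (rest == 0) {                    /* short selector, one octet */
        out[n++] = (unsigned char)lead;
        return n;
    }
    out[n++] = (unsigned char)(lead | 040u);    /* long selector */
    if (rest & 0xfc0u) {                /* three-octet selector */
        if ((rest & 077u) == 077u)
            return 0;
        out[n++] = (unsigned char)(0100u | (rest & 077u));
        rest >>= 6;
    }
    if ((rest & 077u) == 077u)
        return 0;
    out[n++] = (unsigned char)(0300u | (rest & 077u));
    return n;
}

/*
 * Hypothetical helper: decode a selector from in[0..len-1].
 * Returns the number of octets consumed, or 0 if the input is not
 * a well-formed selector.
 */
size_t sel_decode(const unsigned char *in, size_t len,
                  unsigned *lang, int *space)
{
    unsigned ic, cc, l;
    size_t n = 2;

    if (len < 1)
        return 0;
    ic = in[0];
    if (ic >= 0200 && ic <= 0217) {     /* short selector */
        *lang = ic - 0200; *space = 0;
        return 1;
    }
    if (ic >= 0221 && ic <= 0237) {     /* short selector & space */
        *lang = ic - 0220; *space = 1;
        return 1;
    }
    if (ic < 0240 || ic > 0277 || len < 2)
        return 0;
    cc = in[1];
    l = (ic & 017) | ((cc & 077) << 4);
    if (cc >= 0100 && cc <= 0176) {     /* middle octet present */
        if (len < 3)
            return 0;
        cc = in[2];
        l |= (cc & 077) << 10;
        n = 3;
    }
    if (cc < 0301 || cc > 0376)         /* bad final octet */
        return 0;
    *lang = l;
    *space = (ic & 020) != 0;
    return n;
}
```

   For example, language number 37 without a pending space encodes as
   the two octets 245 302 (octal), and language number 2000 with a
   pending space encodes as the three octets 260 175 301.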