Request For Comments: DRAFT                               Vadim Antonov
Category: Informational                                    Pluris, Inc.
                                                       10 February 1997


               Rosetta Language Specification, Russian

Status Of This Memo

   This memo provides information for the Internet community.  This
   memo does not specify an Internet standard of any kind.
   Distribution of this memo is unlimited.

1. Language Name

   Modern Russian.

2. Language Usage Data

   Russian is a primary language in the Russian Federation, and is
   used as an international language in the countries of the former
   Soviet Union.  The estimated number of native speakers is 160
   million; the total number of speakers is 294 million.

3. Assigned Language Number

   +-------------------------+---------+
   | Language Name           | Russian |
   +-------------------------+---------+
   |Assigned Language Number |    4    |
   +-------------------------+---------+
   | Selector Sequence (oct) |   204   |
   +-------------------------+---------+

4. Direction Of Writing

   The primary direction of writing is horizontal, from left to right.
   When the end of a line is reached, a new line is started underneath
   the previous line, and writing resumes from the left side of the
   new line.  In multi-column text, the first column is on the left
   side of the page.

   Vertical writing from top to bottom in a single-letter column (with
   letters kept in upright position) is generally recognizable,
   although it can be used only for short sentences.  An alternative
   approach to vertical writing is to place horizontally written words
   into a column going from top to bottom.

   Vertical writing should not be used unless absolutely necessary,
   because it slows reading significantly.

5. Rendering Of Numerals

   The numerals are usually rendered in a decimal positional system,
   using ASCII (language number 0) characters from DIGIT ZERO to DIGIT
   NINE, with the most significant position on the left.
   The integer and fractional parts are commonly separated with COMMA;
   thousands (groups of 3 digits) are sometimes separated with PERIOD.

6. Character Set

6.1. Letters

   There are two cases of letters: small (lower case) and capital
   (upper case).  The shape of a letter in a printed text does not
   depend on its position within a word.

   +-----------+----------------------------------+
   |Octal Code | Character Name                   |
   +-----------+----------------------------------+
   |    100    | RUSSIAN CAPITAL LETTER A         |
   |    101    | RUSSIAN CAPITAL LETTER BE        |
   |    102    | RUSSIAN CAPITAL LETTER VE        |
   |    103    | RUSSIAN CAPITAL LETTER GHE       |
   |    104    | RUSSIAN CAPITAL LETTER DE        |
   |    105    | RUSSIAN CAPITAL LETTER IE        |
   |    106    | RUSSIAN CAPITAL LETTER IO        |
   |    107    | RUSSIAN CAPITAL LETTER ZHE       |
   |    110    | RUSSIAN CAPITAL LETTER ZE        |
   |    111    | RUSSIAN CAPITAL LETTER I         |
   |    112    | RUSSIAN CAPITAL LETTER SHORT I   |
   |    113    | RUSSIAN CAPITAL LETTER KA        |
   |    114    | RUSSIAN CAPITAL LETTER EL        |
   |    115    | RUSSIAN CAPITAL LETTER EM        |
   |    116    | RUSSIAN CAPITAL LETTER EN        |
   |    117    | RUSSIAN CAPITAL LETTER O         |
   |    120    | RUSSIAN CAPITAL LETTER PE        |
   |    121    | RUSSIAN CAPITAL LETTER ER        |
   |    122    | RUSSIAN CAPITAL LETTER ES        |
   |    123    | RUSSIAN CAPITAL LETTER TE        |
   |    124    | RUSSIAN CAPITAL LETTER U         |
   |    125    | RUSSIAN CAPITAL LETTER EF        |
   |    126    | RUSSIAN CAPITAL LETTER HA        |
   |    127    | RUSSIAN CAPITAL LETTER TSE       |
   |    130    | RUSSIAN CAPITAL LETTER CHE       |
   |    131    | RUSSIAN CAPITAL LETTER SHA       |
   |    132    | RUSSIAN CAPITAL LETTER SHCHA     |
   |    133    | RUSSIAN CAPITAL LETTER HARD SIGN |
   |    134    | RUSSIAN CAPITAL LETTER YERU      |
   |    135    | RUSSIAN CAPITAL LETTER SOFT SIGN |
   |    136    | RUSSIAN CAPITAL LETTER E         |
   |    137    | RUSSIAN CAPITAL LETTER YU        |
   |    140    | RUSSIAN CAPITAL LETTER YA        |
   |    300    | RUSSIAN SMALL LETTER A           |
   |    301    | RUSSIAN SMALL LETTER BE          |
   |    302    | RUSSIAN SMALL LETTER VE          |
   |    303    | RUSSIAN SMALL LETTER GHE         |
   |    304    | RUSSIAN SMALL LETTER DE          |
   |    305    | RUSSIAN SMALL LETTER IE          |
   |    306    | RUSSIAN SMALL LETTER IO          |
   |    307    | RUSSIAN SMALL LETTER ZHE         |
   |    310    | RUSSIAN SMALL LETTER ZE          |
   |    311    | RUSSIAN SMALL LETTER I           |
   |    312    | RUSSIAN SMALL LETTER SHORT I     |
   |    313    | RUSSIAN SMALL LETTER KA          |
   |    314    | RUSSIAN SMALL LETTER EL          |
   |    315    | RUSSIAN SMALL LETTER EM          |
   |    316    | RUSSIAN SMALL LETTER EN          |
   |    317    | RUSSIAN SMALL LETTER O           |
   |    320    | RUSSIAN SMALL LETTER PE          |
   |    321    | RUSSIAN SMALL LETTER ER          |
   |    322    | RUSSIAN SMALL LETTER ES          |
   |    323    | RUSSIAN SMALL LETTER TE          |
   |    324    | RUSSIAN SMALL LETTER U           |
   |    325    | RUSSIAN SMALL LETTER EF          |
   |    326    | RUSSIAN SMALL LETTER HA          |
   |    327    | RUSSIAN SMALL LETTER TSE         |
   |    330    | RUSSIAN SMALL LETTER CHE         |
   |    331    | RUSSIAN SMALL LETTER SHA         |
   |    332    | RUSSIAN SMALL LETTER SHCHA       |
   |    333    | RUSSIAN SMALL LETTER HARD SIGN   |
   |    334    | RUSSIAN SMALL LETTER YERU        |
   |    335    | RUSSIAN SMALL LETTER SOFT SIGN   |
   |    336    | RUSSIAN SMALL LETTER E           |
   |    337    | RUSSIAN SMALL LETTER YU          |
   |    340    | RUSSIAN SMALL LETTER YA          |
   +-----------+----------------------------------+

6.2. Digits

   The ASCII (European "Arabic") digits are commonly used.  The shapes
   of digits appearing in words with the Russian language code should
   be identical to the shapes of the corresponding ASCII/English
   digits.

6.3. Punctuation And Other Signs

   The ASCII punctuation and other signs are commonly used.  There are
   no specifically Russian punctuation signs.

   In book texts, LEFT GUILLEMET and RIGHT GUILLEMET (double-angle
   quotation marks) are often used in place of QUOTATION MARK.  See
   Rosetta Language Specification, Symbols for the description.

6.4. Character Shapes

   A table of Russian character shapes can be found at

      http://www.pluris.com/rosetta/russian-tab1.gif
      http://www.pluris.com/rosetta/russian-tab2.gif

7. Hints

   The only printed hint commonly used in Russian texts is the stress
   accent, which is represented by an acute accent mark placed above
   the vowel of the stressed syllable.  See

      http://www.pluris.com/rosetta/russian-hint.gif

   for examples of letters with stress accents.

   The hint codes are bit masks, with one or more of the following
   bits set:

   +------------+-----------------------+
   |Bit (octal) | Hint Name             |
   +------------+-----------------------+
   |      1     | RUSSIAN STRESS ACCENT |
   +------------+-----------------------+
   |      2     |                       |
   |      4     |                       |
   |     10     | must be zero          |
   |     20     |                       |
   |     40     |                       |
   +------------+-----------------------+
   |    100     | must be one           |
   +------------+-----------------------+
   |    200     | HYPHENATION POINT     |
   +------------+-----------------------+

   When the RUSSIAN STRESS ACCENT hint is encountered, the acute
   accent should be shown over the following letter.

   The HYPHENATION POINT hint is not displayed (although it can be
   shown during input or in text editors).  It instructs rendering
   engines to hyphenate a word at the specified hyphenation points
   instead of automatically selected ones.  When a HYPHENATION POINT
   hint precedes the first letter of a word and no other hyphenation
   points are specified, the word cannot be hyphenated.

8. Word Comparison And Case Conversion

   The algorithms for comparison of Russian words and for conversion
   between upper and lower case are the same as the Rosetta default:
   lexicographical comparison by octet values, and addition or
   subtraction of 200 octal to octet codes, as described in Rosetta
   Encoding For Multi-Lingual Texts.

9. Special Rendition Techniques

   Russian characters are always output without any modifications.
   Application of a stress accent does not change the shape of the
   stressed letter.

10. Input From Keyboard

   The most common layout of keyboards used for entering Russian
   characters is called JTSUKEN.
   The picture showing this layout can be found at

      http://www.pluris.com/rosetta/russian-kbd.gif

   The function of the keys is the same as on the QWERTY keyboard used
   for entering ASCII characters, as described in Rosetta Language
   Specification, ASCII/English.

   Pressing the Ctrl and left Alt keys (or the Compose key, when
   available) together with the key marked RUSSIAN LETTER E (which
   occupies the position of the APOSTROPHE and QUOTATION MARK key on
   the ASCII/English keyboard) inserts the RUSSIAN STRESS ACCENT hint.
   (Note that the hint applies to the following letter.)  If the hint
   is not followed by a letter, it should be discarded (in text
   editors, entering a hint may not require subsequently entering a
   letter if the current position on the screen is already occupied by
   a Russian letter).

   Pressing the Ctrl and left Alt keys (or the Compose key, when
   available) together with the key marked HYPHEN inserts the
   HYPHENATION POINT hint.

   The recommended hint erasure sequence for use in text editors is
   left Alt-Ctrl-Space (or Compose-Space).  When BACKSPACE or DELETE
   erases a letter, it should also erase the preceding hint, if there
   is one.

11. Conversion To/From Other Character Sets

   The current de facto standard encoding for Russian texts on the
   Internet and in Unix-compatible operating systems is described in
   [RFC 1489] (that character set is also known as KOI8-R).
   Conversion from RFC 1489 into Rosetta is performed by
   transliteration using the following table, where all numbers are
   octal, A stands for ASCII/English (language number 0), and R stands
   for Russian (language number 4).  The row label plus the column
   label gives the source octet; "-" marks an octet with no Rosetta
   Russian or ASCII/English equivalent.

   +----+---------------------------------------------------------+
   |    | 0      1      2      3      4      5      6      7      |
   +----+---------------------------------------------------------+
   |  0 | A:0    A:1    A:2    A:3    A:4    A:5    A:6    A:7    |
   | 10 | A:10   A:11   A:12   A:13   A:14   A:15   A:16   A:17   |
   | 20 | A:20   A:21   A:22   A:23   A:24   A:25   A:26   A:27   |
   | 30 | A:30   A:31   A:32   A:33   A:34   A:35   A:36   A:37   |
   | 40 | A:40   A:41   A:42   A:43   A:44   A:45   A:46   A:47   |
   | 50 | A:50   A:51   A:52   A:53   A:54   A:55   A:56   A:57   |
   | 60 | A:60   A:61   A:62   A:63   A:64   A:65   A:66   A:67   |
   | 70 | A:70   A:71   A:72   A:73   A:74   A:75   A:76   A:77   |
   |100 | A:100  A:101  A:102  A:103  A:104  A:105  A:106  A:107  |
   |110 | A:110  A:111  A:112  A:113  A:114  A:115  A:116  A:117  |
   |120 | A:120  A:121  A:122  A:123  A:124  A:125  A:126  A:127  |
   |130 | A:130  A:131  A:132  A:133  A:134  A:135  A:136  A:137  |
   |140 | A:140  A:141  A:142  A:143  A:144  A:145  A:146  A:147  |
   |150 | A:150  A:151  A:152  A:153  A:154  A:155  A:156  A:157  |
   |160 | A:160  A:161  A:162  A:163  A:164  A:165  A:166  A:167  |
   |170 | A:170  A:171  A:172  A:173  A:174  A:175  A:176  A:177  |
   |200 | -      -      -      -      -      -      -      -      |
   |210 | -      -      -      -      -      -      -      -      |
   |220 | -      -      -      -      -      -      -      -      |
   |230 | -      -      -      -      -      -      -      -      |
   |240 | -      -      -      R:306  -      -      -      -      |
   |250 | -      -      -      -      -      -      -      -      |
   |260 | -      -      -      R:106  -      -      -      -      |
   |270 | -      -      -      -      -      -      -      -      |
   |300 | R:337  R:300  R:301  R:327  R:304  R:305  R:325  R:303  |
   |310 | R:326  R:311  R:312  R:313  R:314  R:315  R:316  R:317  |
   |320 | R:320  R:340  R:321  R:322  R:323  R:324  R:307  R:302  |
   |330 | R:335  R:334  R:310  R:331  R:336  R:332  R:330  R:333  |
   |340 | R:137  R:100  R:101  R:127  R:104  R:105  R:125  R:103  |
   |350 | R:126  R:111  R:112  R:113  R:114  R:115  R:116  R:117  |
   |360 | R:120  R:140  R:121  R:122  R:123  R:124  R:107  R:102  |
   |370 | R:135  R:134  R:110  R:131  R:136  R:132  R:130  R:133  |
   +----+---------------------------------------------------------+

   Conversion from Rosetta to RFC 1489 is performed by inverting the
   table above, and by skipping or transliterating characters from
   other languages that cannot be represented in RFC 1489.

   Another popular encoding is known as Microsoft Code Page 1251, and
   is used mostly on computers running MS Windows.  The following
   table can be used for conversion from CP1251 into Rosetta (all
   numbers are octal, A stands for ASCII/English (language number 0),
   R stands for Russian (language number 4)):

   +----+---------------------------------------------------------+
   |    | 0      1      2      3      4      5      6      7      |
   +----+---------------------------------------------------------+
   |  0 | A:0    A:1    A:2    A:3    A:4    A:5    A:6    A:7    |
   | 10 | A:10   A:11   A:12   A:13   A:14   A:15   A:16   A:17   |
   | 20 | A:20   A:21   A:22   A:23   A:24   A:25   A:26   A:27   |
   | 30 | A:30   A:31   A:32   A:33   A:34   A:35   A:36   A:37   |
   | 40 | A:40   A:41   A:42   A:43   A:44   A:45   A:46   A:47   |
   | 50 | A:50   A:51   A:52   A:53   A:54   A:55   A:56   A:57   |
   | 60 | A:60   A:61   A:62   A:63   A:64   A:65   A:66   A:67   |
   | 70 | A:70   A:71   A:72   A:73   A:74   A:75   A:76   A:77   |
   |100 | A:100  A:101  A:102  A:103  A:104  A:105  A:106  A:107  |
   |110 | A:110  A:111  A:112  A:113  A:114  A:115  A:116  A:117  |
   |120 | A:120  A:121  A:122  A:123  A:124  A:125  A:126  A:127  |
   |130 | A:130  A:131  A:132  A:133  A:134  A:135  A:136  A:137  |
   |140 | A:140  A:141  A:142  A:143  A:144  A:145  A:146  A:147  |
   |150 | A:150  A:151  A:152  A:153  A:154  A:155  A:156  A:157  |
   |160 | A:160  A:161  A:162  A:163  A:164  A:165  A:166  A:167  |
   |170 | A:170  A:171  A:172  A:173  A:174  A:175  A:176  A:177  |
   |200 | -      -      -      -      -      -      -      -      |
   |210 | -      -      -      -      -      -      -      -      |
   |220 | -      -      -      -      -      -      -      -      |
   |230 | -      -      -      -      -      -      -      -      |
   |240 | -      -      -      -      -      -      -      -      |
   |250 | R:106  -      -      -      -      -      -      -      |
   |260 | -      -      -      -      -      -      -      -      |
   |270 | R:306  -      -      -      -      -      -      -      |
   |300 | R:100  R:101  R:102  R:103  R:104  R:105  R:107  R:110  |
   |310 | R:111  R:112  R:113  R:114  R:115  R:116  R:117  R:120  |
   |320 | R:121  R:122  R:123  R:124  R:125  R:126  R:127  R:130  |
   |330 | R:131  R:132  R:133  R:134  R:135  R:136  R:137  R:140  |
   |340 | R:300  R:301  R:302  R:303  R:304  R:305  R:307  R:310  |
   |350 | R:311  R:312  R:313  R:314  R:315  R:316  R:317  R:320  |
   |360 | R:321  R:322  R:323  R:324  R:325  R:326  R:327  R:330  |
   |370 | R:331  R:332  R:333  R:334  R:335  R:336  R:337  R:340  |
   +----+---------------------------------------------------------+

   Other single-octet Cyrillic character sets include ISO 8859-5 (not
   used in Russia), Microsoft CP866 (sometimes used in MS-DOS
   systems), and a number of older codes such as KOI7, DKOI, and GOST,
   which are now obsolete.

   Multi-octet encoding methods that include a Cyrillic character set
   are not commonly used to encode Russian texts.

12. Information Resources

   An excellent guide to Russian language computing can be found at

      http://www.siber.com/sib/russify

13. Changes From The Previous Revision

   This is the initial version of the document.

14. Security Considerations

   Not addressed in this document.

15. References

   [ASCII]     American National Standards Institute, Coded character
               set: 7-bit American National Standard Code for
               Information Interchange, ANSI X3.4-1986.

   [ISO-8859]  8-bit Single-Byte Coded Graphic Character Sets, 1988.

   [RFC 1489]  A. Chernov, Registration of a Cyrillic Character Set,
               07/23/1993.

   [RFC XXX]   V. Antonov, Rosetta Encoding For Multi-Lingual Texts,
               1/97.

   [RFC XXX]   V. Antonov, Rosetta Language Specification,
               ASCII/English, 1/97.

16. Author's Address

   Vadim Antonov
   Pluris, Inc.
   2307 Coronet Blvd.
   Belmont, CA 94002

   e-mail: avg@pluris.com
   fax:    +1 (415) 654-9222
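
   The RFC 1489 conversion table of Section 11 and the case conversion
   rule of Section 8 can be illustrated with a short Python sketch.
   It is not part of the specification.  The names CAPITALS,
   ROSETTA_CODE, koi8r_to_rosetta, to_upper and to_lower are
   hypothetical, the use of Python's built-in "koi8_r" codec is an
   assumption made for brevity, and the range checks in the case
   helpers are an added safeguard that the text does not require.
   Selector sequences and hints are ignored; only bare letter octets
   are produced.

```python
# Illustrative sketch only: the function and table names are hypothetical.

# Russian capitals in the order of the letter table in Section 6.1
# (A, BE, VE, GHE, DE, IE, IO, ZHE, ...), i.e. octal codes 100..140.
CAPITALS = "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"

# Unicode letter -> Rosetta Russian octet (capitals 100.., small 300.. octal).
ROSETTA_CODE = {}
for index, capital in enumerate(CAPITALS):
    ROSETTA_CODE[capital] = 0o100 + index
    ROSETTA_CODE[capital.lower()] = 0o300 + index


def koi8r_to_rosetta(octet):
    """Map one KOI8-R octet to ("A" or "R", code), or None if unmapped."""
    if octet < 0o200:
        return ("A", octet)  # lower half is ASCII/English, unchanged
    letter = bytes([octet]).decode("koi8_r")  # assumes Python's KOI8-R codec
    code = ROSETTA_CODE.get(letter)
    return ("R", code) if code is not None else None


def to_upper(code):
    """Small letter -> capital: subtract 200 octal (Section 8)."""
    return code - 0o200 if 0o300 <= code <= 0o340 else code


def to_lower(code):
    """Capital letter -> small: add 200 octal (Section 8)."""
    return code + 0o200 if 0o100 <= code <= 0o140 else code


if __name__ == "__main__":
    # Print row 300 of the KOI8-R table in the "R:nnn" notation.
    row = [koi8r_to_rosetta(0o300 + i) for i in range(8)]
    print(" ".join(f"{lang}:{code:o}" for lang, code in row))
```

   Deriving the mapping from the alphabet order rather than typing the
   table by hand gives an independent check of the table entries; the
   reverse conversion would invert the same dictionary.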