Request For Comments: DRAFT                                Vadim Antonov
Category: Informational                                     Pluris, Inc.
                                                        10 February 1997


             Rosetta Language Specification, ASCII/English

Status Of This Memo

   This memo provides information for the Internet community.  This
   memo does not specify an Internet standard of any kind.
   Distribution of this memo is unlimited.

1. Language Name

   Modern English, and US ASCII character set.

2. Language Usage Data

   English is used as a primary language in the United Kingdom, the
   United States, Canada, Australia, and other countries, and
   worldwide as an international language.  The estimated number of
   native speakers is 450 million; the total number of speakers is
   800 million.

   US ASCII is the dominant computer character set.  Most other
   character sets include US ASCII as a subset.

3. Assigned Language Number

   +-------------------------+---------------+
   | Language Name           | ASCII/English |
   +-------------------------+---------------+
   |Assigned Language Number |       0       |
   +-------------------------+---------------+
   | Selector Sequence (oct) |      200      |
   +-------------------------+---------------+

4. Direction Of Writing

   The primary direction of writing is horizontal, from left to right.
   When the end of a line is reached, a new line is started underneath
   the previous line, and writing resumes from the left side of the
   new line.

   In a multi-column text, the first column is on the left side of a
   page.

   Vertical writing from top to bottom in a single-letter column
   (letters are kept in upright position) is generally recognizable,
   although it can be used only for short sentences.  An alternative
   approach to vertical writing is to place horizontally written words
   into a column going from top to bottom.

   Vertical writing should not be used unless absolutely necessary
   because it slows reading significantly.

5. Rendering Of Numerals

   Numerals are usually rendered in a decimal positional system, using
   characters from DIGIT ZERO to DIGIT NINE with the most significant
   position on the left.  The common sign separating the integer and
   fractional parts is PERIOD; thousands (groups of 3 digits) are
   sometimes separated with COMMA.

6. Character Set

6.1. Letters

   There are two cases of letters: small (lower case) and capital
   (upper case).  The shape of a letter in a printed text does not
   depend on its position within a word.

   +-----------+--------------------------+
   |Octal Code | Character Name           |
   +-----------+--------------------------+
   |    101    | ENGLISH CAPITAL LETTER A |
   +-----------+--------------------------+
   |    102    | ENGLISH CAPITAL LETTER B |
   +-----------+--------------------------+
   |    103    | ENGLISH CAPITAL LETTER C |
   +-----------+--------------------------+
   |    104    | ENGLISH CAPITAL LETTER D |
   +-----------+--------------------------+
   |    105    | ENGLISH CAPITAL LETTER E |
   +-----------+--------------------------+
   |    106    | ENGLISH CAPITAL LETTER F |
   +-----------+--------------------------+
   |    107    | ENGLISH CAPITAL LETTER G |
   +-----------+--------------------------+
   |    110    | ENGLISH CAPITAL LETTER H |
   +-----------+--------------------------+
   |    111    | ENGLISH CAPITAL LETTER I |
   +-----------+--------------------------+
   |    112    | ENGLISH CAPITAL LETTER J |
   +-----------+--------------------------+
   |    113    | ENGLISH CAPITAL LETTER K |
   +-----------+--------------------------+
   |    114    | ENGLISH CAPITAL LETTER L |
   +-----------+--------------------------+
   |    115    | ENGLISH CAPITAL LETTER M |
   +-----------+--------------------------+
   |    116    | ENGLISH CAPITAL LETTER N |
   +-----------+--------------------------+
   |    117    | ENGLISH CAPITAL LETTER O |
   +-----------+--------------------------+
   |    120    | ENGLISH CAPITAL LETTER P |
   +-----------+--------------------------+
   |    121    | ENGLISH CAPITAL LETTER Q |
   +-----------+--------------------------+
   |    122    | ENGLISH CAPITAL LETTER R |
   +-----------+--------------------------+
   |    123    | ENGLISH CAPITAL LETTER S |
   +-----------+--------------------------+
   |    124    | ENGLISH CAPITAL LETTER T |
   +-----------+--------------------------+
   |    125    | ENGLISH CAPITAL LETTER U |
   +-----------+--------------------------+
   |    126    | ENGLISH CAPITAL LETTER V |
   +-----------+--------------------------+
   |    127    | ENGLISH CAPITAL LETTER W |
   +-----------+--------------------------+
   |    130    | ENGLISH CAPITAL LETTER X |
   +-----------+--------------------------+
   |    131    | ENGLISH CAPITAL LETTER Y |
   +-----------+--------------------------+
   |    132    | ENGLISH CAPITAL LETTER Z |
   +-----------+--------------------------+
   |    141    | ENGLISH SMALL LETTER A   |
   +-----------+--------------------------+
   |    142    | ENGLISH SMALL LETTER B   |
   +-----------+--------------------------+
   |    143    | ENGLISH SMALL LETTER C   |
   +-----------+--------------------------+
   |    144    | ENGLISH SMALL LETTER D   |
   +-----------+--------------------------+
   |    145    | ENGLISH SMALL LETTER E   |
   +-----------+--------------------------+
   |    146    | ENGLISH SMALL LETTER F   |
   +-----------+--------------------------+
   |    147    | ENGLISH SMALL LETTER G   |
   +-----------+--------------------------+
   |    150    | ENGLISH SMALL LETTER H   |
   +-----------+--------------------------+
   |    151    | ENGLISH SMALL LETTER I   |
   +-----------+--------------------------+
   |    152    | ENGLISH SMALL LETTER J   |
   +-----------+--------------------------+
   |    153    | ENGLISH SMALL LETTER K   |
   +-----------+--------------------------+
   |    154    | ENGLISH SMALL LETTER L   |
   +-----------+--------------------------+
   |    155    | ENGLISH SMALL LETTER M   |
   +-----------+--------------------------+
   |    156    | ENGLISH SMALL LETTER N   |
   +-----------+--------------------------+
   |    157    | ENGLISH SMALL LETTER O   |
   +-----------+--------------------------+
   |    160    | ENGLISH SMALL LETTER P   |
   +-----------+--------------------------+
   |    161    | ENGLISH SMALL LETTER Q   |
   +-----------+--------------------------+
   |    162    | ENGLISH SMALL LETTER R   |
   +-----------+--------------------------+
   |    163    | ENGLISH SMALL LETTER S   |
   +-----------+--------------------------+
   |    164    | ENGLISH SMALL LETTER T   |
   +-----------+--------------------------+
   |    165    | ENGLISH SMALL LETTER U   |
   +-----------+--------------------------+
   |    166    | ENGLISH SMALL LETTER V   |
   +-----------+--------------------------+
   |    167    | ENGLISH SMALL LETTER W   |
   +-----------+--------------------------+
   |    170    | ENGLISH SMALL LETTER X   |
   +-----------+--------------------------+
   |    171    | ENGLISH SMALL LETTER Y   |
   +-----------+--------------------------+
   |    172    | ENGLISH SMALL LETTER Z   |
   +-----------+--------------------------+

6.2. Digits

   The ASCII (European "Arabic") digits are commonly used in texts in
   languages other than English.  Accordingly, the digit character
   names do not include the language name.

   The shape of a digit does not depend on its position within a
   number.  The width of all digit characters should be the same, even
   in a proportional font.

   +-----------+----------------+
   |Octal Code | Character Name |
   +-----------+----------------+
   |    60     | DIGIT ZERO     |
   +-----------+----------------+
   |    61     | DIGIT ONE      |
   +-----------+----------------+
   |    62     | DIGIT TWO      |
   +-----------+----------------+
   |    63     | DIGIT THREE    |
   +-----------+----------------+
   |    64     | DIGIT FOUR     |
   +-----------+----------------+
   |    65     | DIGIT FIVE     |
   +-----------+----------------+
   |    66     | DIGIT SIX      |
   +-----------+----------------+
   |    67     | DIGIT SEVEN    |
   +-----------+----------------+
   |    70     | DIGIT EIGHT    |
   +-----------+----------------+
   |    71     | DIGIT NINE     |
   +-----------+----------------+

6.3. Punctuation And Other Signs

   The ASCII punctuation and other signs are commonly used in
   languages other than English.  Accordingly, the sign character
   names do not include the language name.

   The shape of a punctuation or other sign does not depend on its
   position in the text.  The width of the white space character is
   the same as the width of ENGLISH SMALL LETTER N.  The HYPHEN sign
   is usually shorter than a space.

   All ASCII signs advance the current writing position to the right,
   and do not overlap previous or subsequent characters.

   +-----------+----------------------+
   |Octal Code | Character Name       |
   +-----------+----------------------+
   |    40     | SPACE                |
   +-----------+----------------------+
   |    41     | EXCLAMATION MARK     |
   +-----------+----------------------+
   |    42     | QUOTATION MARK       |
   +-----------+----------------------+
   |    43     | NUMBER SIGN          |
   +-----------+----------------------+
   |    44     | DOLLAR SIGN          |
   +-----------+----------------------+
   |    45     | PERCENT SIGN         |
   +-----------+----------------------+
   |    46     | AMPERSAND            |
   +-----------+----------------------+
   |    47     | APOSTROPHE           |
   +-----------+----------------------+
   |    50     | LEFT PARENTHESIS     |
   +-----------+----------------------+
   |    51     | RIGHT PARENTHESIS    |
   +-----------+----------------------+
   |    52     | ASTERISK             |
   +-----------+----------------------+
   |    53     | PLUS SIGN            |
   +-----------+----------------------+
   |    54     | COMMA                |
   +-----------+----------------------+
   |    55     | HYPHEN               |
   +-----------+----------------------+
   |    56     | PERIOD               |
   +-----------+----------------------+
   |    57     | SLASH                |
   +-----------+----------------------+
   |    72     | COLON                |
   +-----------+----------------------+
   |    73     | SEMICOLON            |
   +-----------+----------------------+
   |    74     | LESS-THAN SIGN       |
   +-----------+----------------------+
   |    75     | EQUALS SIGN          |
   +-----------+----------------------+
   |    76     | GREATER-THAN SIGN    |
   +-----------+----------------------+
   |    77     | QUESTION MARK        |
   +-----------+----------------------+
   |    100    | COMMERCIAL AT        |
   +-----------+----------------------+
   |    133    | LEFT SQUARE BRACKET  |
   +-----------+----------------------+
   |    134    | BACKSLASH            |
   +-----------+----------------------+
   |    135    | RIGHT SQUARE BRACKET |
   +-----------+----------------------+
   |    136    | CIRCUMFLEX ACCENT    |
   +-----------+----------------------+
   |    137    | UNDERSCORE           |
   +-----------+----------------------+
   |    140    | GRAVE ACCENT         |
   +-----------+----------------------+
   |    173    | LEFT CURLY BRACKET   |
   +-----------+----------------------+
   |    174    | VERTICAL BAR         |
   +-----------+----------------------+
   |    175    | RIGHT CURLY BRACKET  |
   +-----------+----------------------+
   |    176    | TILDE                |
   +-----------+----------------------+

6.4. ASCII Control Codes

   The ASCII control codes are provided for separating lines and for
   giving other instructions to output devices.  The control codes do
   not have any printable shapes.  These codes are common to texts in
   all languages.

   +-----------+---------------------------+------------------+
   |Octal Code | Control Code Name         | Abbreviated Name |
   +-----------+---------------------------+------------------+
   |     0     | NULL                      | NUL              |
   +-----------+---------------------------+------------------+
   |     1     | START OF HEADING          | SOH              |
   +-----------+---------------------------+------------------+
   |     2     | START OF TEXT             | STX              |
   +-----------+---------------------------+------------------+
   |     3     | END OF TEXT               | ETX              |
   +-----------+---------------------------+------------------+
   |     4     | END OF TRANSMISSION       | EOT              |
   +-----------+---------------------------+------------------+
   |     5     | ENQUIRY                   | ENQ              |
   +-----------+---------------------------+------------------+
   |     6     | ACKNOWLEDGE               | ACK              |
   +-----------+---------------------------+------------------+
   |     7     | BELL                      | BEL              |
   +-----------+---------------------------+------------------+
   |    10     | BACKSPACE                 | BS               |
   +-----------+---------------------------+------------------+
   |    11     | HORIZONTAL TABULATION     | HT               |
   +-----------+---------------------------+------------------+
   |    12     | LINE FEED                 | LF               |
   +-----------+---------------------------+------------------+
   |    13     | VERTICAL TABULATION       | VT               |
   +-----------+---------------------------+------------------+
   |    14     | FORM FEED                 | FF               |
   +-----------+---------------------------+------------------+
   |    15     | CARRIAGE RETURN           | CR               |
   +-----------+---------------------------+------------------+
   |    16     | SHIFT OUT                 | SO               |
   +-----------+---------------------------+------------------+
   |    17     | SHIFT IN                  | SI               |
   +-----------+---------------------------+------------------+
   |    20     | DATA LINK ESCAPE          | DLE              |
   +-----------+---------------------------+------------------+
   |    21     | DEVICE CONTROL ONE        | DC1              |
   +-----------+---------------------------+------------------+
   |    22     | DEVICE CONTROL TWO        | DC2              |
   +-----------+---------------------------+------------------+
   |    23     | DEVICE CONTROL THREE      | DC3              |
   +-----------+---------------------------+------------------+
   |    24     | DEVICE CONTROL FOUR       | DC4              |
   +-----------+---------------------------+------------------+
   |    25     | NEGATIVE ACKNOWLEDGE      | NAK              |
   +-----------+---------------------------+------------------+
   |    26     | SYNCHRONOUS IDLE          | SYN              |
   +-----------+---------------------------+------------------+
   |    27     | END OF TRANSMISSION BLOCK | ETB              |
   +-----------+---------------------------+------------------+
   |    30     | CANCEL                    | CAN              |
   +-----------+---------------------------+------------------+
   |    31     | END OF MEDIUM             | EM               |
   +-----------+---------------------------+------------------+
   |    32     | SUBSTITUTE                | SUB              |
   +-----------+---------------------------+------------------+
   |    33     | ESCAPE                    | ESC              |
   +-----------+---------------------------+------------------+
   |    34     | FILE SEPARATOR            | FS               |
   +-----------+---------------------------+------------------+
   |    35     | GROUP SEPARATOR           | GS               |
   +-----------+---------------------------+------------------+
   |    36     | RECORD SEPARATOR          | RS               |
   +-----------+---------------------------+------------------+
   |    37     | UNIT SEPARATOR            | US               |
   +-----------+---------------------------+------------------+
   |    177    | DELETE                    | DEL              |
   +-----------+---------------------------+------------------+
   |    377    | RESERVED DELETE           |                  |
   +-----------+---------------------------+------------------+

6.5. Character Shapes

   A table of ASCII/English character shapes can be found at

      http://www.pluris.com/rosetta/ascii-tab.gif

7. Hints

   The hint codes are bit masks with one or more of the following
   bits set:

   +------------+-------------------------+
   |Bit (octal) | Hint Name               |
   +------------+-------------------------+
   |     1      | ENGLISH LETTER MODIFIER |
   +------------+-------------------------+
   |     2      |                         |
   |     4      |                         |
   |    10      | must be zero            |
   |    20      |                         |
   |    40      |                         |
   +------------+-------------------------+
   |    100     | must be one             |
   +------------+-------------------------+
   |    200     | HYPHENATION POINT       |
   +------------+-------------------------+

   When a hint with ENGLISH LETTER MODIFIER is inserted before certain
   English letters or pairs of letters, an alternative letter must be
   substituted, per the following table:

   +-----------+-----------+----------------------------------------+
   |Octal Code | Modified  | Letter Name                            |
   |           | Letter(s) |                                        |
   +-----------+-----------+----------------------------------------+
   |  101 105  |    AE     | ENGLISH CAPITAL LETTER AE              |
   |  101 145  |    Ae     |                                        |
   +-----------+-----------+----------------------------------------+
   |    111    |     I     | ENGLISH CAPITAL LETTER I WITH DIARESIS |
   +-----------+-----------+----------------------------------------+
   |  117 105  |    OE     | ENGLISH CAPITAL LETTER OE              |
   |  117 145  |    Oe     |                                        |
   +-----------+-----------+----------------------------------------+
   |  141 105  |    aE     | ENGLISH SMALL LETTER AE                |
   |  141 145  |    ae     |                                        |
   +-----------+-----------+----------------------------------------+
   |    151    |     i     | ENGLISH SMALL LETTER I WITH DIARESIS   |
   +-----------+-----------+----------------------------------------+
   |  157 105  |    oE     | ENGLISH SMALL LETTER OE                |
   |  157 145  |    oe     |                                        |
   +-----------+-----------+----------------------------------------+

   The table with shapes of those additional letters can be found at

      http://www.rosetta.com/rosetta/ascii-xtra.gif

   Note that if a pair of letters is replaced with a single letter
   because of the ENGLISH LETTER MODIFIER hint, the case of the
   resulting letter is determined by the case of the first letter in
   the pair.  This rule allows names (such as Aeacus or Oedipus) to be
   entered in a way that lets them be rendered properly even if this
   hint is ignored.

   When an ENGLISH LETTER MODIFIER hint is inserted before a SPACE, a
   HALF SPACE should be output.  Similarly, a LINE FEED preceded by
   ENGLISH LETTER MODIFIER becomes HALF LINE FEED (i.e. advances paper
   by 1/2 of the vertical interval).

   The HYPHENATION POINT hint is not displayed (although it can be
   shown during the input process or in text editors), and is used as
   an instruction to rendering engines to hyphenate a word at the
   specified hyphenation points instead of automatically selected
   ones.  When a HYPHENATION POINT hint precedes the first letter in a
   word and no other hyphenation points are specified, the word cannot
   be hyphenated.

   As a special case, when a HYPHENATION POINT hint is inserted before
   a SPACE, that space character becomes non-breaking; i.e. it ceases
   to be a word separator for hyphenation purposes, and becomes a part
   of a longer word.  This usage of the HYPHENATION POINT hint does
   not create a hyphenation point, but it does inhibit automatic
   hyphenation of the extended word.

8. Word Comparison And Case Conversion

   The algorithms for comparison of ASCII/English words and for
   conversion between upper and lower case are the same as the Rosetta
   default: lexicographical comparison by octet values, and addition
   or subtraction of 40 octal to or from octet codes, as described in
   Rosetta Encoding For Multi-Lingual Texts.

9. Special Rendition Techniques

   Most commonly, the ASCII/English characters are output without any
   modifications.
   However, if a serif proportional font is used (such as Times
   Roman), pairs or triplets of the letters ENGLISH SMALL LETTER F,
   ENGLISH SMALL LETTER I, and ENGLISH SMALL LETTER L can be replaced
   with the corresponding ligatures:

   +---------+---------------------------+-------------+
   |Ligature | Letters                   | Octal Code  |
   +---------+---------------------------+-------------+
   |   ff    | SMALL F, SMALL F          |   146 146   |
   +---------+---------------------------+-------------+
   |   ffi   | SMALL F, SMALL F, SMALL I | 146 146 151 |
   +---------+---------------------------+-------------+
   |   ffl   | SMALL F, SMALL F, SMALL L | 146 146 154 |
   +---------+---------------------------+-------------+
   |   fi    | SMALL F, SMALL I          |   146 151   |
   +---------+---------------------------+-------------+
   |   fl    | SMALL F, SMALL L          |   146 154   |
   +---------+---------------------------+-------------+
   |   st    | SMALL S, SMALL T          |   163 164   |
   +---------+---------------------------+-------------+

   A table of ASCII/English ligature forms can be found at

      http://www.pluris.com/rosetta/ascii-lig.gif

   Usage of ligatures improves readability of texts set in serif fonts
   only slightly, so a simplified rendering process may omit
   generation of ligatures.

10. Input From Keyboard

   The most common layout of keyboards used for entering ASCII
   characters is called QWERTY.  A picture showing this layout can be
   found at

      http://www.pluris.com/rosetta/ascii-kbd.gif

   Pressing a letter, sign, or digit key causes input of a single
   character.  The functions of keys marked with words are described
   below:

   Shift
      These keys switch from the lower to the upper register.  By
      default, lower case letters and digits (and the lower set of
      signs) are entered by pressing letter, sign or digit keys.  When
      the same key is pressed together with a Shift key, the upper
      case letter or the upper sign is entered.  Right and left Shift
      keys are equivalent.

   Caps Lock
      This toggle-action key locks and unlocks upper case for all
      letter keys (i.e. upper case letters are entered even if Shift
      is not pressed).  Some implementations invert the meaning of the
      Shift keys for letters when Caps Lock is in effect.

   Ctrl
      When one of these keys is pressed simultaneously with a letter
      or sign key, the character produced has a code equal to the
      bitwise AND of the original key's code and 37 octal.  This
      allows any ASCII control character to be entered.  Right and
      left Ctrl keys are equivalent.

   Tab
      This key always generates HORIZONTAL TABULATION.

   Backspace
      This key always generates BACKSPACE.

   Return (also commonly marked as Enter)
      This key generates the sequence of control codes corresponding
      to the local operating system's end-of-line sequence.

   Alt
      The left Alt key is used like the Ctrl key, to generate
      additional application-specific control codes when pressed
      together with a letter, digit or sign key.  Pressing left Alt
      together with Ctrl and some other keys is reserved for entering
      additional characters in some languages.

      It is recommended that the right Alt key be used to toggle
      between two user-selectable "primary" languages.  Switching to a
      third language and reassignment of the primary languages are
      implementation-dependent.

   Some keyboards have an additional Compose key located near the left
   Ctrl and Alt keys.  The Compose key has the same effect as Ctrl and
   left Alt pressed simultaneously.

   To enter a HYPHENATION POINT hint (to be inserted before a
   subsequent character), the key marked HYPHEN should be pressed
   simultaneously with Ctrl and left Alt (see also the note on usage
   of Compose above).

   To enter an ENGLISH LETTER MODIFIER hint with a letter or pair of
   letters, left Alt and Ctrl should be held down when entering the
   letter or the pair of letters.
   Note that entering an uppercase letter may also require holding a
   Shift key at the same time.  When entering pairs of letters, Alt
   and Ctrl (or Compose) can be released after entering the first
   letter so that the second letter is entered normally; this gives a
   more convenient way of entering pairs of letters having different
   cases.

   The recommended hint erasure sequence for use in text editors is
   left Alt-Ctrl-Space (or Compose-Space).  When BACKSPACE or DELETE
   erases a letter, it should also erase the preceding hint, if there
   is one.

11. Conversion To/From Other Character Sets

   Rosetta is a proper superset of ASCII, so conversion from ASCII to
   Rosetta is not required.  A conversion from Rosetta to ASCII may
   require removal of non-ASCII characters, or transliteration of such
   words into ASCII characters.  ASCII characters (i.e. those with
   language code 0) should be preserved as is.

   The second most often used English-only code is IBM's EBCDIC; the
   conversion is performed by transliterating octets according to the
   following table (all numbers are octal).  The row label plus the
   column heading gives the EBCDIC octet, and the entry is the
   corresponding ASCII octet; a "-" marks an EBCDIC octet for which
   the table gives no ASCII equivalent.

   +----+-----------------------------------------+
   |    |    0    1    2    3    4    5    6    7 |
   +----+-----------------------------------------+
   |  0 |    0    1    2    3    -   11    -  177 |
   | 10 |    -    -    -   13   14   15   16   17 |
   | 20 |   20   21   22   23    -    -   10    - |
   | 30 |   30   31    -    -   34   35   36   37 |
   | 40 |    -    -    -    -    -   12   27   33 |
   | 50 |    -    -    -    -    -    5    6    7 |
   | 60 |    -    -   26    -    -    -    -    4 |
   | 70 |    -    -    -    -   24   25    -   32 |
   |100 |   40    -    -    -    -    -    -    - |
   |110 |    -    -    -   56   74   50   53  174 |
   |120 |   46    -    -    -    -    -    -    - |
   |130 |    -    -   41   44   52   51   73  176 |
   |140 |   55   57    -    -    -    -    -    - |
   |150 |    -    -    -   54   45  137   76   77 |
   |160 |    -    -    -    -    -    -    -    - |
   |170 |    -  140   72   43  100   47   75   42 |
   |200 |    -  141  142  143  144  145  146  147 |
   |210 |  150  151    -    -    -    -    -    - |
   |220 |    -  152  153  154  155  156  157  160 |
   |230 |  161  162  136    -    -    -    -    - |
   |240 |    -    -  163  164  165  166  167  170 |
   |250 |  171  172    -    -    -  133    -    - |
   |260 |    -    -    -    -    -    -    -    - |
   |270 |    -    -    -    -    -  135    -    - |
   |300 |  173  101  102  103  104  105  106  107 |
   |310 |  110  111    -    -    -    -    -    - |
   |320 |  175  112  113  114  115  116  117  120 |
   |330 |  121  122    -    -    -    -    -    - |
   |340 |  134    -  123  124  125  126  127  130 |
   |350 |  131  132    -    -    -    -    -    - |
   |360 |   60   61   62   63   64   65   66   67 |
   |370 |   70   71    -    -    -    -    -  377 |
   +----+-----------------------------------------+

   Conversion from Rosetta to EBCDIC is performed by conversion from
   Rosetta to ASCII, followed by transliteration of the resulting
   ASCII octets using the inverse of the EBCDIC-to-Rosetta table.

12. Changes From The Previous Revision

   This is the initial version of the document.

13. Security Considerations

   Not addressed in this document.

14. References

   [ASCII]    American National Standards Institute, Coded character
              set: 7-bit American National Standard Code for
              Information Interchange, ANSI X3.4-1986.

   [RFC XXX]  V. Antonov, Rosetta Encoding For Multi-Lingual Texts,
              1/97.

15. Author's Address

   Vadim Antonov
   Pluris, Inc.
   2307 Coronet Blvd.
   Belmont, CA 94002

   e-mail: avg@pluris.com
   fax: +1 (415) 654-9222
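
   The octet arithmetic described in Sections 8 and 10 can be sketched
   in a few lines of Python.  This is an illustration only and not
   part of the specification: the function names are hypothetical, and
   restricting case conversion to the letter ranges of Section 6.1 is
   an assumption (the memo states only that 40 octal is added or
   subtracted).

```python
# Illustrative sketch of the octet arithmetic in Sections 8 and 10.
# All names here are hypothetical; none of them come from the memo.

CASE_OFFSET = 0o40  # distance between small and capital letter codes
CTRL_MASK = 0o37    # mask applied to a key code by the Ctrl key


def to_upper(octet: int) -> int:
    """Subtract 40 octal from a small letter code (141-172 octal).

    Leaving other octets unchanged is an assumption of this sketch.
    """
    if 0o141 <= octet <= 0o172:
        return octet - CASE_OFFSET
    return octet


def to_lower(octet: int) -> int:
    """Add 40 octal to a capital letter code (101-132 octal)."""
    if 0o101 <= octet <= 0o132:
        return octet + CASE_OFFSET
    return octet


def ctrl(octet: int) -> int:
    """Code produced when Ctrl is held: bitwise AND with 37 octal."""
    return octet & CTRL_MASK


def compare_words(a: bytes, b: bytes) -> int:
    """Lexicographical comparison by octet values: -1, 0 or 1.

    Python compares bytes objects octet by octet, which matches the
    comparison rule quoted in Section 8.
    """
    return (a > b) - (a < b)
```

   For instance, ctrl(0o115), i.e. Ctrl held with ENGLISH CAPITAL
   LETTER M, yields 15 octal (CARRIAGE RETURN), and
   compare_words(b"Zebra", b"apple") is negative because capital
   letters have lower octet values than small letters.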