diff options
Diffstat (limited to 'man/6/utf')
| -rw-r--r-- | man/6/utf | 108 |
1 files changed, 108 insertions, 0 deletions
diff --git a/man/6/utf b/man/6/utf new file mode 100644 index 0000000..12832a1 --- /dev/null +++ b/man/6/utf @@ -0,0 +1,108 @@ +.TH UTF 6 +.SH NAME +UTF, Unicode, ASCII, rune \- character set and format +.SH DESCRIPTION +The Inferno character set and representation are +based on the Unicode Standard and on the ISO multibyte +.SM UTF-8 +encoding (Universal Character +Set Transformation Format, 8 bits wide). +The Unicode Standard represents its characters in 21 +bits; +.SM UTF-8 +represents such +values in an 8-bit byte stream. +Throughout this manual, +.SM UTF-8 +is shortened to +.SM UTF. +.PP +Internally, programs store individual Unicode characters as 32-bit integers, +of which only 21 bits are currently used. +Documentation often refers to them as `runes', following Plan 9. +However, any external manifestation of textual information, +in files or at the interface between programs, uses the +machine-independent, byte-stream encoding called +.SM UTF. +.PP +.SM UTF +is designed so the 7-bit +.SM ASCII +set (values hexadecimal 00 to 7F), +appear only as themselves +in the encoding. +Characters with values above 7F appear as sequences of two or more +bytes with values only from 80 to FF. +.PP +The +.SM UTF +encoding of the Unicode Standard is backward compatible with +.SM ASCII\c +: programs presented only with +.SM ASCII +work on Inferno +even if not written to deal with +.SM UTF\c +, +as do +programs that deal with uninterpreted byte streams. +However, programs that perform semantic processing on +characters must convert from +.SM UTF +to runes +in order to work properly with non-\c +.SM ASCII +input. +Normally, all necessary conversions are done by the Limbo compiler +and execution envirnoment, when converting between +.B "array of byte" +and +.B "string" , +but sometimes more is needed, such +as when a program receives +.SM UTF +input one byte at a time; +see +.IR sys-byte2char (2) +for routines to handle such processing. +.PP +Letting numbers be binary, +a rune x is converted to a multibyte +.SM UTF +sequence +as follows: +.PP +01. x in [000000.00000000.0bbbbbbb] → 0bbbbbbb +.br +10. x in [000000.00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb +.br +11. x in [000000.bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb +.br +100. x in [bbbbbb.bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb +.br +.PP +.PP +Conversion 01 provides a one-byte sequence that spans the +.SM ASCII +character set in a compatible way. +Conversions 10, 11 and 100 represent higher-valued characters +as sequences of two, three or four bytes with the high bit set. +Inferno does not support the 5 and 6 byte sequences proposed by X-Open. +When there are multiple ways to encode a value, for example rune 0, +the shortest encoding is used. +.PP +In the inverse mapping, +any sequence except those described above +is incorrect and is converted to the rune hexadecimal FFFD. +.SH FILES +.TF "/lib/unicode " +.TP +.B /lib/unicode +table of characters and descriptions, suitable for +.IR look (1). +.SH "SEE ALSO" +.IR ascii (1), +.IR tcs (1), +.IR sys-byte2char (2), +.IR keyboard (6), +.IR "The Unicode Standard" . |
