1 files changed, 108 insertions, 0 deletions
diff --git a/man/6/utf b/man/6/utf
new file mode 100644
index 0000000..12832a1
--- /dev/null
+++ b/man/6/utf
@@ -0,0 +1,108 @@
+.TH UTF 6
+.SH NAME
+UTF, Unicode, ASCII, rune \- character set and format
+.SH DESCRIPTION
+The Inferno character set and representation are
+based on the Unicode Standard and on the ISO multibyte
+.SM UTF-8
+encoding (Universal Character
+Set Transformation Format, 8 bits wide).
+The Unicode Standard represents its characters in 21
+bits;
+.SM UTF-8
+represents such
+values in an 8-bit byte stream.
+Throughout this manual,
+.SM UTF-8
+is shortened to
+.SM UTF.
+.PP
+Internally, programs store individual Unicode characters as 32-bit integers,
+of which only 21 bits are currently used.
+Documentation often refers to them as `runes', following Plan 9.
+However, any external manifestation of textual information,
+in files or at the interface between programs, uses the
+machine-independent, byte-stream encoding called
+.SM UTF.
+.PP
+.SM UTF
+is designed so the 7-bit
+.SM ASCII
+set (values hexadecimal 00 to 7F),
+appear only as themselves
+in the encoding.
+Characters with values above 7F appear as sequences of two or more
+bytes with values only from 80 to FF.
+.PP
+The
+.SM UTF
+encoding of the Unicode Standard is backward compatible with
+.SM ASCII\c
+: programs presented only with
+.SM ASCII
+work on Inferno
+even if not written to deal with
+.SM UTF\c
+,
+as do
+programs that deal with uninterpreted byte streams.
+However, programs that perform semantic processing on
+characters must convert from
+.SM UTF
+to runes
+in order to work properly with non-\c
+.SM ASCII
+input.
+Normally, all necessary conversions are done by the Limbo compiler
+and execution envirnoment, when converting between
+.B "array of byte"
+and
+.B "string" ,
+but sometimes more is needed, such
+as when a program receives
+.SM UTF
+input one byte at a time;
+see
+.IR sys-byte2char (2)
+for routines to handle such processing.
+.PP
+Letting numbers be binary,
+a rune x is converted to a multibyte
+.SM UTF
+sequence
+as follows:
+.PP
+01.   x in [000000.00000000.0bbbbbbb] → 0bbbbbbb
+.br
+10.   x in [000000.00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
+.br
+11.   x in [000000.bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
+.br
+100. x in [bbbbbb.bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
+.br
+.PP
+.PP
+Conversion 01 provides a one-byte sequence that spans the
+.SM ASCII
+character set in a compatible way.
+Conversions 10, 11 and 100 represent higher-valued characters
+as sequences of two, three or four bytes with the high bit set.
+Inferno does not support the 5 and 6 byte sequences proposed by X-Open.
+When there are multiple ways to encode a value, for example rune 0,
+the shortest encoding is used.
+.PP
+In the inverse mapping,
+any sequence except those described above
+is incorrect and is converted to the rune hexadecimal FFFD.
+.SH FILES
+.TF "/lib/unicode "
+.TP
+.B /lib/unicode
+table of characters and descriptions, suitable for
+.IR look (1).
+.SH "SEE ALSO"
+.IR ascii (1),
+.IR tcs (1),
+.IR sys-byte2char (2),
+.IR keyboard (6), 
+.IR "The Unicode Standard" .