Specify text and numeric literals

author: Marshall Lochbaum <mwlochbaum@gmail.com> 2020-06-25 21:39:38 -0400
committer: Marshall Lochbaum <mwlochbaum@gmail.com> 2020-06-25 21:58:32 -0400
commit: d20aae8dbb1ad2558f32fdc9fcf939658ab97e16 (patch)
tree: 77461378cf2a9b236c0467330faf0bd5ffea9657
parent: 3e565a092bba840fc556ea9b6850aed29313df56 (diff)
3 files changed, 31 insertions, 14 deletions
diff --git a/spec/README.md b/spec/README.md
index 8d000cb9..afe71cff 100644
--- a/spec/README.md
+++ b/spec/README.md
@@ -3,9 +3,9 @@
 This directory gives a (currently incomplete) specification for BQN. The specification differs from the documentation in `doc/` in that its purpose is only to describe the exact details of BQN's operation in the most quickly accessible way, rather than to explain the core ideas of BQN functionality and how it might be used. Since it is easier to specify than to document, the specification is currently more complete than the documentation; for example, it includes nearly all primitives.
 
 The following aspects define BQN and are or will be specified:
+- Types
 - [Token formation](token.md)
-- Numeric and character literals
+- [Literals](literal.md)
 - [Grammar](grammar.md)
-- Array model and notation
 - [Evaluation semantics](evaluate.md)
-- Built-in operations ([reference implementations](reference.bqn))
+- Primitives ([reference implementations](reference.bqn))
diff --git a/spec/literal.md b/spec/literal.md
new file mode 100644
index 00000000..0f22909f
--- /dev/null
+++ b/spec/literal.md
@@ -0,0 +1,17 @@
+A *literal* is a single [token](token.md) that indicates a fixed character, number, or array. While literals indicate data of a value type, [primitives](primitive.md) indicate data of a function type: function, modifier, or composition.
+
+Two types of literal deal with text. As the source code is considered to be a sequence of unicode code points ("characters"), and these code points are also used for BQN's character [data type](types.md), the representation of a text literal is very similar to its value. In a text literal, the newline character is always represented using the ASCII line feed character, code point 10. A *character literal* is enclosed with single quotes `'` and its value is identical to the single character between them. A *string literal* is enclosed in double quotes `"`, and any double quotes between them must come in pairs, as a lone double quote marks the end of the literal. The value of a string literal is a rank-1 array whose elements are the characters in between the enclosing quotes, after replacing each pair of double quotes with only one such quote.
+
+The format of a *numeric literal* is more complicated. From the [tokenization rules](token.md), a numeric literal consists of a numeric character (one of `¯∞π.0123456789`) followed by any number of numeric or alphabetic characters. Some numeric literals are *valid* and indicate a number, while others are invalid and cause an error. The grammar for valid numbers is given below in a [BNF](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) variant. Only four alphabetic characters are allowed: "i", which separates the real and imaginary components of a complex number, "e", which functions as in scientific notation, and the uppercase versions of these letters.
+
+    number    = component ( ( "i" | "I" ) component )?
+    component = mantissa ( ( "e" | "E" ) exponent )?
+    exponent  = "¯"? digit+
+    mantissa  = "¯"? ( "∞" | "π" | digit+ ( "." digit+ )? )
+    digit     = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
+
+The digits or arabic numerals correspond to the numbers from 0 to 9 in the conventional way (also, each corresponds to its code point value minus 48). A sequence of digits gives a natural number by evaluating it in base 10: the number is 0 for an empty sequence, and otherwise the last digit's numerical value plus ten times the number obtained from the remaining digits. The symbol `∞` indicates infinity and `π` indicates the ratio [pi](https://en.wikipedia.org/wiki/Pi_(mathematics)) of a circle's circumference to its diameter (or, for modern mathematicians, the smallest positive real number at which the function `{⋆0j1×𝕩}` attains a real part of 0). The [high minus](https://aplwiki.com/wiki/High_minus) symbol `¯` indicates that the number containing it is to be negated.
+
+When an exponent is provided (with `e` or `E`), the corresponding mantissa is multiplied by ten to that power, giving the value `mantissa×10⋆exponent`. If a second component is present (using `i` or `I`), that component's value is multiplied by the [imaginary unit](https://en.wikipedia.org/wiki/Imaginary_unit) *i* and added to the first component; otherwise the value is the first component's value without modification. If complex numbers are not supported, then `i` should not be allowed in numeric literals, even when followed by 0.
+
+The above specification describes exactly a complex number with extended real components. To obtain a BQN number, each component is rounded to its nearest representative by the rules of the number system used: for IEEE 754, smallest distance, with ties rounding to the option with even mantissa.
diff --git a/spec/token.md b/spec/token.md
index bc084239..3ded7b5d 100644
--- a/spec/token.md
+++ b/spec/token.md
@@ -1,23 +1,23 @@
 This page describes BQN's token formation rules (token formation is also called scanning). Most tokens in BQN are a single character long, but quoted characters and strings, identifiers, and numbers can consist of multiple characters, and comments, spaces, and tabs are discarded during token formation.
 
-BQN source code should be considered as a series of unicode code points. Here the separator between lines in a file is considered to be a single character, newline, even though some operating systems such as Windows typically represent it with a two-character CRLF sequence. Implementers should note that not all languages treat UTF code points as atomic, as exposing the UTF-8 or UTF-16 representation instead is common. For a language such as JavaScript that uses UTF-16, the double-struck characters `𝕨𝕎𝕩𝕏𝕗𝔽𝕘𝔾` are represented as two 16-bit surrogate characters, but BQN treats them as a single unit.
+BQN source code should be considered as a series of unicode code points, which we refer to as "characters". The separator between lines in a file is considered to be a single character, newline, even though some operating systems such as Windows typically represent it with a two-character CRLF sequence. Implementers should note that not all languages treat unicode code points as atomic, as exposing the UTF-8 or UTF-16 representation instead is common. For a language such as JavaScript that uses UTF-16, the double-struck characters `𝕨𝕎𝕩𝕏𝕗𝔽𝕘𝔾` are represented as two 16-bit surrogate characters, but BQN treats them as a single unit.
 
-A BQN character literal consists of a single character between single quotes, such as `'a'`, and a string literal consists of any number of characters between double quotes, such as `""` or `"abc"`. Character and string literals take precedence with comments over other tokenization rules, so that `⍝` between quotes does not start a comment and whitespace between quotes is not removed, but a quote within a comment does not start a character literal. Almost any character can be included directly in a character or string literal without escaping. The only exception is the double quote character `"`, which must be written twice to include it in a string, as otherwise it would end the string instead. Character literals require no escaping at all, as the length is fixed. In particular, literals for the double and single quote characters are written `'''` and `'"'`, while length-1 strings containing these characters are `"'"` and `""""`.
+A BQN *character literal* consists of a single character between single quotes, such as `'a'`, and a *string literal* consists of any number of characters between double quotes, such as `""` or `"abc"`. Character and string literals take precedence with comments over other tokenization rules, so that `⍝` between quotes does not start a comment and whitespace between quotes is not removed, but a quote within a comment does not start a character literal. Almost any character can be included directly in a character or string literal without escaping. The only exception is the double quote character `"`, which must be written twice to include it in a string, as otherwise it would end the string instead. Character literals require no escaping at all, as the length is fixed. In particular, literals for the double and single quote characters are written `'''` and `'"'`, while length-1 strings containing these characters are `"'"` and `""""`.
 
 A comment consists of the lamp character `⍝` and any following text until (not including) the next newline character. The initial `⍝` must not be part of a string literal started earlier. Comments are ignored entirely and do not form tokens.
 
-Identifiers and numeric literals share the same token formation rule. These tokens are formed from the *numeric characters* `¯∞π.0123456789` and *alphabetic characters* `_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ`. Any sequence of adjacent numeric and alphabetic characters forms a single token, which is a numeric literal if it begins with a numeric character and an identifier if it begins with an alphabetic character. Numeric literals are also subject to [numeric literal rules](literal.md), which specify which numeric literals are valid and which numbers they represent.
+Identifiers and numeric literals share the same token formation rule. These tokens are formed from the *numeric characters* `¯∞π.0123456789` and *alphabetic characters* `_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ`. Any sequence of adjacent numeric and alphabetic characters forms a single token, which is a *numeric literal* if it begins with a numeric character and an *identifier* if it begins with an alphabetic character. Numeric literals are also subject to [numeric literal rules](literal.md), which specify which numeric literals are valid and which numbers they represent.
 
 Following this step, the whitespace characters space and tab are ignored, and do not form tokens. Only these whitespace characters, and the newline character, which does form a token, are allowed.
 
 Otherwise, a single character forms a token. Only the specified set of characters can be used; others result in an error. The classes of characters are given below.
 
-| Class               | Characters
-|---------------------|------------
-| Function literal    | `+-×÷⋆√⌊⌈\|¬∧∨<>≠=≤≥≡≢⊣⊢⥊∾≍↑↓↕⌽⍉/⍋⍒⊏⊑⊐⊒∊⍷⊔`
-| Modifier literal    | `` ˜˘¨⌜⁼´` ``
-| Composition literal | `∘○⊸⟜⌾⎉⚇⍟`
-| Parameter           | `𝕨𝕩𝕗𝕘𝕎𝕏𝔽𝔾`
-| Punctuation         | `←↩→(){}⟨⟩‿⋄,` and newline
+| Class                 | Characters
+|-----------------------|------------
+| Primitive Function    | `+-×÷⋆√⌊⌈\|¬∧∨<>≠=≤≥≡≢⊣⊢⥊∾≍↑↓↕⌽⍉/⍋⍒⊏⊑⊐⊒∊⍷⊔`
+| Primitive Modifier    | `` ˜˘¨⌜⁼´` ``
+| Primitive Composition | `∘○⊸⟜⌾⎉⚇⍟`
+| Parameter             | `𝕨𝕩𝕗𝕘𝕎𝕏𝔽𝔾`
+| Punctuation           | `←↩→(){}⟨⟩‿⋄,` and newline
 
-In the BQN [grammar specification](grammar.md), the three literal classes are grouped into terminals `Fl`, `_ml`, and `_cl`, while the punctuation characters are identified separately as keywords such as "←". The parameters are handled specially. The uppercase versions `𝕎𝕏𝔽𝔾` and lowercase versions `𝕨𝕩𝕗𝕘` are two spellings of the four underlying parameters.
+In the BQN [grammar specification](grammar.md), the three primitive classes are grouped into terminals `Fl`, `_ml`, and `_cl`, while the punctuation characters are identified separately as keywords such as `"←"`. The parameters are handled specially. The uppercase versions `𝕎𝕏𝔽𝔾` and lowercase versions `𝕨𝕩𝕗𝕘` are two spellings of the four underlying parameters.
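The numeric literal grammar added in `spec/literal.md` can be sketched as a small validator and evaluator. This is an illustrative Python sketch, not part of the commit or of any BQN implementation: the names `parse_number` and `_component` are hypothetical, Python's `float` stands in for the IEEE 754 rounding step, and a `π` mantissa with an exponent is computed by multiplying two already-rounded doubles, so it may differ from the exactly rounded value in the last bit.

```python
import math
import re

# ASCII digits only; \d would also accept non-ASCII Unicode digits.
_DIGITS = r"[0-9]+"
# component = mantissa ( ( "e" | "E" ) exponent )?
_COMPONENT = rf"(¯?)(∞|π|{_DIGITS}(?:\.{_DIGITS})?)(?:[eE](¯?)({_DIGITS}))?"
# number = component ( ( "i" | "I" ) component )?
_NUMBER = re.compile(rf"{_COMPONENT}(?:[iI]{_COMPONENT})?")


def _component(neg, body, exp_neg, exp_digits):
    """Evaluate one component from its four captured pieces."""
    sign = -1.0 if neg else 1.0
    exp = ("-" if exp_neg else "") + (exp_digits or "0")
    if body == "∞":
        # Infinity times any positive power of ten is still infinity.
        return sign * math.inf
    if body == "π":
        # Approximation: both factors are rounded before multiplying.
        return sign * math.pi * float("1e" + exp)
    # float() rounds the decimal string to the nearest double.
    return sign * float(body + "e" + exp)


def parse_number(token):
    """Return the value of a valid numeric literal, else raise ValueError."""
    m = _NUMBER.fullmatch(token)
    if m is None:
        raise ValueError(f"invalid numeric literal: {token!r}")
    g = m.groups()
    real = _component(*g[0:4])
    if g[5] is None:  # no imaginary component was written
        return real
    return complex(real, _component(*g[4:8]))
```

Under these assumptions, `parse_number("¯1.5e3")` gives `-1500.0` and `parse_number("3i¯4")` gives `complex(3, -4)`, while tokens such as `.5`, `1.`, or `1e` are rejected because the grammar requires digits on both sides of `.` and after `e`.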
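The text literal rules from `spec/literal.md` and `spec/token.md` can be illustrated the same way. The following Python sketch assumes the token has already been isolated by the scanner and that newlines have already been normalized to line feed; the function names are hypothetical. Python strings are sequences of code points, which matches the spec's notion of a "character".

```python
def char_literal_value(token):
    """Value of a character literal: the single character between the quotes."""
    if len(token) == 3 and token[0] == "'" and token[2] == "'":
        return token[1]
    raise ValueError(f"invalid character literal: {token!r}")


def string_literal_value(token):
    """Value of a string literal as a list of characters (a rank-1 array)."""
    if len(token) < 2 or token[0] != '"' or token[-1] != '"':
        raise ValueError(f"invalid string literal: {token!r}")
    body = token[1:-1]
    chars = []
    i = 0
    while i < len(body):
        if body[i] == '"':
            # A double quote inside the literal must be doubled.
            if i + 1 >= len(body) or body[i + 1] != '"':
                raise ValueError(f"unpaired double quote in {token!r}")
            chars.append('"')
            i += 2
        else:
            chars.append(body[i])
            i += 1
    return chars
```

For example, the token `""""` yields a one-element list containing a double quote, and `'''` yields the single quote character, matching the cases listed in the `token.md` hunk.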