Update tokenization spec: add CR, and minor fixes

author: Marshall Lochbaum <mwlochbaum@gmail.com> 2022-07-07 18:24:49 -0400
committer: Marshall Lochbaum <mwlochbaum@gmail.com> 2022-07-07 18:24:49 -0400
commit: f14c4af888dc678eefe1de323b8fe41f7387e82b (patch)
tree: 2ebb7c175a307e49bd01656485c38a8f5e2ba54c /docs/spec
parent: 17fa3e9e0517662e7b49fcabdf68166a8914321c (diff)
2 files changed, 4 insertions, 4 deletions
diff --git a/docs/spec/grammar.html b/docs/spec/grammar.html
index 92b4dd64..8dde8b83 100644
--- a/docs/spec/grammar.html
+++ b/docs/spec/grammar.html
@@ -10,7 +10,7 @@
 <p>A program is a list of statements. Almost all statements are expressions. Namespace export statements, and valueless results stemming from <code><span class='Nothing'>·</span></code>, or <code><span class='Value'>𝕨</span></code> in a monadic block function, can be used as statements but not expressions.</p>
 <pre><span class='Function'>PROGRAM</span>  <span class='Function'>=</span> <span class='Separator'>⋄</span><span class='Head'>?</span> <span class='Paren'>(</span> <span class='Function'>STMT</span> <span class='Separator'>⋄</span> <span class='Paren'>)</span><span class='Value'>*</span> <span class='Function'>STMT</span> <span class='Separator'>⋄</span><span class='Head'>?</span>
 <span class='Function'>STMT</span>     <span class='Function'>=</span> <span class='Function'>EXPR</span> <span class='Function'>|</span> <span class='Value'>nothing</span> <span class='Function'>|</span> <span class='Function'>EXPORT</span>
-<span class='Separator'>⋄</span>        <span class='Function'>=</span> <span class='Paren'>(</span> <span class='String'>&quot;⋄&quot;</span> <span class='Function'>|</span> <span class='String'>&quot;,&quot;</span> <span class='Function'>|</span> <span class='Value'>\n</span> <span class='Paren'>)</span><span class='Function'>+</span>
+<span class='Separator'>⋄</span>        <span class='Function'>=</span> <span class='Paren'>(</span> <span class='String'>&quot;⋄&quot;</span> <span class='Function'>|</span> <span class='String'>&quot;,&quot;</span> <span class='Function'>|</span> <span class='Function'>LF</span> <span class='Function'>|</span> <span class='Function'>CR</span> <span class='Paren'>)</span><span class='Function'>+</span>
 <span class='Function'>EXPR</span>     <span class='Function'>=</span> <span class='Value'>subExpr</span> <span class='Function'>|</span> <span class='Function'>FuncExpr</span> <span class='Function'>|</span> <span class='Modifier'>_m1Expr</span> <span class='Function'>|</span> <span class='Modifier2'>_m2Expr_</span>
 <span class='Function'>EXPORT</span>   <span class='Function'>=</span> <span class='Function'>LHS_ELT</span><span class='Head'>?</span> <span class='String'>&quot;⇐&quot;</span>
 </pre>
diff --git a/docs/spec/token.html b/docs/spec/token.html
index c54fb384..55c1d8d5 100644
--- a/docs/spec/token.html
+++ b/docs/spec/token.html
@@ -6,12 +6,12 @@
 <div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../index.html">BQN</a> / <a href="index.html">spec</a></div>
 <h1 id="specification-bqn-token-formation"><a class="header" href="#specification-bqn-token-formation">Specification: BQN token formation</a></h1>
 <p>This page describes BQN's token formation rules (token formation is also called scanning). Most tokens in BQN are a single character long, but quoted characters and strings, identifiers, and numbers can consist of multiple characters, and comments, spaces, and tabs are discarded during token formation.</p>
-<p>BQN source code should be considered as a series of unicode code points, which we refer to as &quot;characters&quot;. The separator between lines in a file is considered to be a single character, newline, even though some operating systems such as Windows typically represent it with a two-character CRLF sequence. Implementers should note that not all languages treat unicode code points as atomic, as exposing the UTF-8 or UTF-16 representation instead is common. For a language such as JavaScript that uses UTF-16, the double-struck characters <code><span class='Value'>𝕨</span><span class='Function'>𝕎</span><span class='Value'>𝕩</span><span class='Function'>𝕏</span><span class='Value'>𝕗</span><span class='Function'>𝔽</span><span class='Value'>𝕘</span><span class='Function'>𝔾</span></code> are represented as two 16-bit surrogate characters, but BQN treats them as a single unit.</p>
+<p>BQN source code should be considered as a series of unicode code points, which we refer to as &quot;characters&quot;. Implementers should note that not all languages treat unicode code points as atomic, as exposing the UTF-8 or UTF-16 representation instead is common. For a language such as JavaScript that uses UTF-16, the double-struck characters <code><span class='Value'>𝕨𝕩𝕗𝕘𝕤</span><span class='Function'>𝕎𝕏𝔽𝔾𝕊</span><span class='Value'>𝕣</span></code> are represented as two 16-bit surrogate characters, but BQN treats them as a single unit. The line feed (LF) and carriage return (CR) characters are both considered newline characters.</p>
 <p>A BQN <em>character literal</em> consists of a single character between single quotes, such as <code><span class='String'>'a'</span></code>, and a <em>string literal</em> consists of any number of characters between double quotes, such as <code><span class='String'>&quot;&quot;</span></code> or <code><span class='String'>&quot;abc&quot;</span></code>. Character and string literals take precedence with comments over other tokenization rules, so that <code><span class='Comment'>#</span></code> between quotes does not start a comment and whitespace between quotes is not removed, but a quote within a comment does not start a character literal. Almost any character can be included directly in a character or string literal without escaping. The only exception is the double quote character <code><span class='String'>&quot;</span></code>, which must be written twice to include it in a string, as otherwise it would end the string instead. Character literals require no escaping at all, as the length is fixed. In particular, literals for the double and single quote characters are written <code><span class='String'>'''</span></code> and <code><span class='String'>'&quot;'</span></code>, while length-1 strings containing these characters are <code><span class='String'>&quot;'&quot;</span></code> and <code><span class='String'>&quot;&quot;&quot;&quot;</span></code>.</p>
 <p>A comment consists of the hash character <code><span class='Comment'>#</span></code> and any following text until (not including) the next newline character. The initial <code><span class='Comment'>#</span></code> must not be part of a string literal started earlier. Comments are ignored entirely and do not form tokens.</p>
 <p>Identifiers and numeric literals share the same token formation rule. These tokens are formed from the <em>numeric characters</em> <code><span class='Number'>¯∞π0123456789</span></code> and <em>alphabetic characters</em> <code><span class='Modifier'>_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ</span></code> and the oddball <code><span class='Value'>𝕣</span></code>. Additionally, <code><span class='Value'>.</span></code> is considered a numeric character if it is followed immediately by a digit (<code><span class='Number'>0123456789</span></code>); otherwise it forms its own token. Any sequence of these characters adjacent to each other forms a single token, which is a <em>numeric literal</em> if it begins with a numeric character and an <em>identifier</em> if it begins with an alphabetic character. If a token begins with an underscore then its first non-underscore character must be alphabetic: for example, <code><span class='Modifier'>_99</span></code> is not a valid token. Numeric literals are also subject to <a href="literal.html">numeric literal rules</a>, which specify which numeric literals are valid and which numbers they represent. If the token contains <code><span class='Value'>𝕣</span></code> it must be either <code><span class='Value'>𝕣</span></code>, <code><span class='Modifier'>_𝕣</span></code>, or <code><span class='Modifier2'>_𝕣_</span></code> and is considered a special name (see below). As the value taken by this identifier can only be a modifier, the uppercase character <code><span class='Value'>ℝ</span></code> is not allowed.</p>
 <p>The <em>system dot</em> <code><span class='Value'>•</span></code> always attaches to the token containing the next character, which must not be a whitespace character, <code><span class='Comment'>#</span></code>, or <code><span class='Value'>•</span></code>. This combined token is valid only if its name matches a defined <a href="system.html">system value</a>, ignoring underscores and letter case as with identifiers (but in the unlikely case that system values with numeric names are defined, they need not follow the numeric literal rules). Its role is the same as the role the remainder of the token would have if not preceded by <code><span class='Value'>•</span></code>, and it is considered a literal for grammar purposes.</p>
-<p>Following these steps, the whitespace characters space and tab are ignored, and do not form tokens. Only these whitespace characters, and the newline character, which does form a token, are allowed.</p>
+<p>Following these steps, the whitespace characters space and tab are ignored, and do not form tokens. These characters and the newline characters, which do form tokens, are the only whitespace characters allowed.</p>
 <p>Otherwise, a single character forms a token. Only the specified set of characters can be used; others result in an error. The classes of characters are given below.</p>
 <table>
 <thead>
@@ -43,7 +43,7 @@
 </tr>
 <tr>
 <td>Punctuation</td>
-<td><code><span class='Gets'>←⇐↩</span><span class='Paren'>()</span><span class='Brace'>{}</span><span class='Bracket'>⟨⟩[]</span><span class='Ligature'>‿</span><span class='Separator'>⋄,</span><span class='Value'>.</span></code> and newline</td>
+<td><code><span class='Gets'>←⇐↩</span><span class='Paren'>()</span><span class='Brace'>{}</span><span class='Bracket'>⟨⟩[]</span><span class='Ligature'>‿</span><span class='Nothing'>·</span><span class='Separator'>⋄,</span><span class='Value'>.</span><span class='Head'>;:?</span></code> and newlines</td>
 </tr>
 </tbody>
 </table>
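
Note on the change: the token.html hunks make LF and CR both newline characters, and the grammar.html hunk lets either one act as a statement separator. A minimal sketch of how a scanner might apply this is below. It is not BQN's actual implementation: the names scan, collapse_separators, NEWLINES, SEPARATORS and IGNORED are hypothetical, and identifiers and numbers are deliberately left as single-character "other" tokens to keep the example short.

```python
# Hypothetical sketch of the newline handling described in this commit.
# All names here are illustrative; none come from a real BQN implementation.
NEWLINES = {"\n", "\r"}              # LF and CR are both newline characters
SEPARATORS = {"⋄", ","} | NEWLINES   # characters matching the grammar's separator rule
IGNORED = {" ", "\t"}                # space and tab never form tokens


def scan(src):
    """Return a list of (kind, text) tokens for src, a str of code points."""
    tokens = []
    i, n = 0, len(src)
    while i < n:
        ch = src[i]
        if ch in IGNORED:
            i += 1
        elif ch == "#":
            # A comment runs until (not including) the next newline character.
            while i < n and src[i] not in NEWLINES:
                i += 1
        elif ch == "'":
            # Character literal: exactly one character between single quotes.
            if i + 2 >= n or src[i + 2] != "'":
                raise ValueError("malformed character literal")
            tokens.append(("char", src[i:i + 3]))
            i += 3
        elif ch == '"':
            # String literal: a doubled quote stands for one quote character.
            j = i + 1
            while True:
                if j >= n:
                    raise ValueError("unterminated string literal")
                if src[j] == '"':
                    if j + 1 < n and src[j + 1] == '"':
                        j += 2  # doubled quote, still inside the string
                        continue
                    break
                j += 1
            tokens.append(("string", src[i:j + 1]))
            i = j + 1
        elif ch in SEPARATORS:
            tokens.append(("sep", ch))
            i += 1
        else:
            # Simplification: real scanning would group identifiers and numbers.
            tokens.append(("other", ch))
            i += 1
    return tokens


def collapse_separators(tokens):
    """Merge adjacent separator tokens, mirroring the ( ... )+ in the grammar."""
    out = []
    for kind, text in tokens:
        if kind == "sep" and out and out[-1][0] == "sep":
            out[-1] = ("sep", out[-1][1] + text)
        else:
            out.append((kind, text))
    return out
```

Under this sketch a Windows-style CRLF needs no special case: CR and LF each form a separator token, and the pair collapses into one separator run at the grammar level. A lone CR also ends a comment.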
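
Note on code points: the revised token.html paragraph says the double-struck characters occupy two 16-bit surrogates in UTF-16 but count as a single character in BQN. The sketch below illustrates that distinction. The helper names utf16_units and decode_utf16_units are hypothetical, and Python is used only because its str type is already indexed by code point.

```python
# Hypothetical helpers illustrating code points versus UTF-16 code units.
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    b = s.encode("utf-16-le")
    return [int.from_bytes(b[i:i + 2], "little") for i in range(0, len(b), 2)]


def decode_utf16_units(units):
    """Combine surrogate pairs so that each result is one code point."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            low = units[i + 1]
            out.append(chr(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)))
            i += 2
        else:
            out.append(chr(u))
            i += 1
    return out


# The special names listed in the updated paragraph.
special = "𝕨𝕩𝕗𝕘𝕤𝕎𝕏𝔽𝔾𝕊𝕣"
units = utf16_units(special)
chars = decode_utf16_units(units)
```

Here len(special) counts code points while len(units) counts 16-bit units, so an implementation in a UTF-16 language would need a step like decode_utf16_units before applying the token rules.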