diff options
| author | Marshall Lochbaum <mwlochbaum@gmail.com> | 2020-06-26 14:30:03 -0400 |
|---|---|---|
| committer | Marshall Lochbaum <mwlochbaum@gmail.com> | 2020-06-26 14:30:03 -0400 |
| commit | bfa0be0bbc67047e2ec3f3d457bffdc372e0bcf9 (patch) | |
| tree | a9c91a3f75d58b15ed2bf28a04b131b84fe4d100 | |
| parent | f34e14170e1fb9b962d88316b7089192e07b2025 (diff) | |
Add header grammar to specification
| -rw-r--r-- | spec/grammar.md | 60 | ||||
| -rw-r--r-- | spec/token.md | 6 |
2 files changed, 43 insertions, 23 deletions
diff --git a/spec/grammar.md b/spec/grammar.md index 8ff8336a..acb6fb33 100644 --- a/spec/grammar.md +++ b/spec/grammar.md @@ -1,4 +1,4 @@ -BQN's grammar is given below. All terms except `braceVal`, `BraceFunc`, `_braceMod`, and `_braceComp_` are defined in a [BNF](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) variant; distinguishing these four terms is possible but difficult in BNF and they are explained near the end. +BQN's grammar is given below. Terms are defined in a [BNF](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) variant. However, handling special names properly is possible but difficult in BNF, so they are explained in text along with the braced block grammar. The symbols `v`, `F`, `_m`, and `_c_` are identifier tokens with value, function, modifier, and composition classes respectively. Similarly, `vl`, `Fl`, `_ml`, and `_cl_` refer to value literals (numeric and character literals, or primitives) of those classes. While names in the BNF here follow the identifier naming scheme, this is informative only: syntactic classes are no longer used after parsing and cannot be inspected in a running program. @@ -12,10 +12,10 @@ A program is a list of statements. Almost all statements are expressions. Howeve Here we define the "atomic" forms of functions and operators, which are either single tokens or enclosed in paired symbols. Stranded vectors with `‿`, which binds more tightly than any form of execution, are also included. ANY = atom | Func | _mod | _comp_ - _comp_ = _c_ | _cl_ | "(" _cmpExp_ ")" | _braceComp_ - _mod = _m | _ml | "(" _modExpr ")" | _braceMod - Func = F | Fl | "(" FuncExpr ")" | BraceFunc - atom = v | vl | "(" valExpr ")" | braceVal + _comp_ = _c_ | _cl_ | "(" _cmpExp_ ")" | _brComp_ + _mod = _m | _ml | "(" _modExpr ")" | _brMod + Func = F | Fl | "(" FuncExpr ")" | BrFunc + atom = v | vl | "(" valExpr ")" | brVal | "⟨" ⋄? ( ( EXPR ⋄ )* EXPR ⋄? )? "⟩" value = atom | ANY ( "‿" ANY )+ @@ -60,24 +60,44 @@ Value expressions are complicated by the possibility of list assignment. We also | lhs ASGN valExpr | lhs Derv "↩" valExpr ⍝ Modified assignment -In an explicit definition, the left hand side looks like application of a function, modifier, or combinator. As with assignment, it is restricted to a simple form with no extra parentheses. The full list syntax is allowed for arguments. - - DEF = VALDEF | FUNCDEF - VALDEF = valLHS "⇐" valExpr - FUNCDEF = FuncLHS "⇐" FuncExpr - FuncLHS = F _m - | F _c_ F - valLHS = lhs? ( F | FuncLHS ) lhs - -The terms `braceVal`, `BraceFunc`, `_braceMod`, and `_braceComp_` all obey the syntax for `BRACED` given below. Then the class is determined by the presence of `𝕨`, `𝕩`, `𝕗`, and `𝕘` (including alternate class spellings) at the top level, that is, not contained within further pairs of braces. If `𝕘` is present, it is a `_braceCmp_`; otherwise, if `𝕗` is present it is a `_braceMod`; otherwise is is a `BraceFunc` if `𝕨` or `𝕩` are present and a `braceVal` if no special names appear. - - BRACED = "{" ⋄? ( STMT ⋄ )* EXPR ⋄? "}" - -This rule can be expressed in BNF by making many copies of all the rules above. For each "level" (no special names; arguments; `𝕗`; `𝕘`), a new version of every rule should be made that allows that level but not higher ones, and another version should be made that requires exactly that level. The values themselves should be included in `v`, `F`, `_m`, and `_c_` for these copies. Then the "allowed" rules are made simply by replacing the terms they contain with the same "allowed" versions, and "required" rules are constructed using both "allowed" and "required" rules. For every part of a production rule, an alternative should be created that requires the relevant name in that part while allowing it in the others. For example, `( value | nothing )? Derv arg` would be transformed to +A header looks like a name for the thing being headed, or its application to inputs (possibly twice in the case of modifiers and compositions). As with assignment, it is restricted to a simple form with no extra parentheses. The full list syntax is allowed for arguments. As a special rule, a function header specifically can omit the function. + + headW = value | "𝕨" + headX = value | "𝕩" + HeadF = F | "𝕗" | "𝔽" + HeadG = G | "𝕘" | "𝔾" + ModH1 = HeadF ( _m | "_𝕣" | "_ℝ" ) + CmpH1 = HeadF ( _c_ | "_𝕣_" | "_ℝ_" ) HeadG + valHead = v + FuncHead = F | ( headW? ( F | "𝕊" ) )? headX + _modHead = _m | ModH1 | headW? ModH1 headX + _cmpHed_ = _c_ | CmpH1 | headW? CmpH1 headX + +A braced block contains bodies, which are lists of statements, separated by semicolons and possibly preceded by headers, which are separated from the body with a colon. Multiple bodies allow different handling for various cases, which are pattern-matched by headers. For a value block there are no inputs, so there can only be one possible case and one body. Functions and operators allow any number of bodies with headers followed by at most two bodies without headers. These bodies refer to the general cases: ambivalent if there is only one and split into monadic and dyadic if there are two. + + BODY = ⋄? ( STMT ⋄ )* EXPR ⋄? + FuncCase = ⋄? FuncHead ":" BODY + _modCase = ⋄? _modHead ":" BODY + _cmpCas_ = ⋄? _cmpHed_ ":" BODY + brVal = "{" ( ⋄? valHead ":" )? BODY "}" + BrFunc = "{" ( FuncCase ";" )* ( FuncCase | BODY | BODY ";" BODY ) "}" + _brMod = "{" ( _modCase ";" )* ( _modCase | BODY | BODY ";" BODY ) "}" + _brComp_ = "{" ( _cmpCas_ ";" )* ( _cmpCas_ | BODY | BODY ";" BODY ) "}" + +Two additional rules apply to blocks, based on the special name associations in the table below. First, each block allows the special names in its column to be used as the given token types within `BODY` terms (not headers). Except for the spaces labelled "None", each column is cumulative and a given entry also includes all the entries above it. Second, for `BrFunc`, `_brMod`, and `_brComp_` terms, if no header is given, then at least one `BODY` term in it *must* contain one of the names on, and not above, the corresponding row. Otherwise the syntax would be ambiguous, since for example a simple `"{" BODY "}"` sequence could have any type. + +| Term | `v` | `F` | `_m` | `_c_` | other +|--------------------|--------|--------|---------|----------|------- +| `brVal`, `PROGRAM` | None | None | None | None | +| `BrFunc` | `𝕨𝕩𝕤` | `𝕎𝕏𝕊` | | | `";"` +| `_brMod` | `𝕗𝕣` | `𝔽` | `_𝕣` | | +| `_brComp_` | `𝕘` | `𝔾` | None | `_𝕣_` | + +The rules for special name can be expressed in BNF by making many copies of all expression rules above. For each "level", or row in the table, a new version of every rule should be made that allows that level but not higher ones, and another version should be made that requires exactly that level. The values themselves should be included in `v`, `F`, `_m`, and `_c_` for these copies. Then the "allowed" rules are made simply by replacing the terms they contain (excluding `brVal` and so on) with the same "allowed" versions, and "required" rules are constructed using both "allowed" and "required" rules. For every part of a production rule, an alternative should be created that requires the relevant name in that part while allowing it in the others. For example, `( value | nothing )? Derv arg` would be transformed to arg_req1 = valExpr_req1 | ( value_req1 | nothing_req1 ) Derv_allow1 arg_allow1 | ( value_allow1 | nothing_allow1 )? Derv_req1 arg_allow1 | ( value_allow1 | nothing_allow1 )? Derv_allow1 arg_req1 -Quite tedious. The explosion of rules is partly due to the fact that the brace-typing rule falls into a weaker class of grammars than the other rules. The rest of BQN is [deterministic context-free](https://en.wikipedia.org/wiki/Deterministic_context-free_grammar) while brace-typing is not, only context-free. Fortunately brace typing does not introduce the parsing difficulties that can be present in a general context-free grammar, and it can easily be performed in linear time: after [scanning](token.md) but before parsing, move through the source code maintaining a stack of the current top-level set of braces. Whenever a special name is encountered, annotate that set of braces to indicate that the name is present. When a closing brace is encountered and the top brace is popped off the stack, the type can be found based on which names were present. One way to present this information to the parser is to replace the brace tokens with new tokens that indicate the type. +Quite tedious. The explosion of rules is partly due to the fact that the brace-typing rule falls into a weaker class of grammars than the other rules. Most of BQN is [deterministic context-free](https://en.wikipedia.org/wiki/Deterministic_context-free_grammar) but brace-typing is not, only context-free. Fortunately brace typing does not introduce the parsing difficulties that can be present in a general context-free grammar, and it can easily be performed in linear time: after [scanning](token.md) but before parsing, move through the source code maintaining a stack of the current top-level set of braces. Whenever a colon or special name is encountered, annotate that set of braces to indicate that it is present. When a closing brace is encountered and the top brace is popped off the stack, the type is needed if there was no colon, and can be found based on which names were present. One way to present this information to the parser is to replace the brace tokens with new tokens that indicate the type. diff --git a/spec/token.md b/spec/token.md index 8e695139..430869ac 100644 --- a/spec/token.md +++ b/spec/token.md @@ -6,7 +6,7 @@ A BQN *character literal* consists of a single character between single quotes, A comment consists of the lamp character `⍝` and any following text until (not including) the next newline character. The initial `⍝` must not be part of a string literal started earlier. Comments are ignored entirely and do not form tokens. -Identifiers and numeric literals share the same token formation rule. These tokens are formed from the *numeric characters* `¯∞π.0123456789` and *alphabetic characters* `_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ`. Any sequence of adjacent numeric and alphabetic characters forms a single token, which is a *numeric literal* if it begins with a numeric character and an *identifier* if it begins with an alphabetic character. Numeric literals are also subject to [numeric literal rules](literal.md), which specify which numeric literals are valid and which numbers they represent. +Identifiers and numeric literals share the same token formation rule. These tokens are formed from the *numeric characters* `¯∞π.0123456789` and *alphabetic characters* `_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ` and the oddball `𝕣`. Any sequence of these characters adjacent to each other forms a single token, which is a *numeric literal* if it begins with a numeric character and an *identifier* if it begins with an alphabetic character. Numeric literals are also subject to [numeric literal rules](literal.md), which specify which numeric literals are valid and which numbers they represent. If the token contains `𝕣` it must be either `𝕣`, `_𝕣`, or `_𝕣_` and is considered a special name (see below). As the value taken by this identifier can only be a modifier or composition, the uppercase character `ℝ` is not allowed. Following this step, the whitespace characters space and tab are ignored, and do not form tokens. Only these whitespace characters, and the newline character, which does form a token, are allowed. @@ -17,7 +17,7 @@ Otherwise, a single character forms a token. Only the specified set of character | Primitive Function | `+-×÷⋆√⌊⌈\|¬∧∨<>≠=≤≥≡≢⊣⊢⥊∾≍↑↓↕⌽⍉/⍋⍒⊏⊑⊐⊒∊⍷⊔` | Primitive Modifier | `` ˜˘¨⌜⁼´` `` | Primitive Composition | `∘○⊸⟜⌾⊘◶⎉⚇⍟` -| Parameter | `𝕨𝕩𝕗𝕘𝕎𝕏𝔽𝔾` +| Special name | `𝕨𝕩𝕗𝕘𝕤𝕎𝕏𝔽𝔾𝕊` | Punctuation | `←↩→(){}⟨⟩‿⋄,` and newline -In the BQN [grammar specification](grammar.md), the three primitive classes are grouped into terminals `Fl`, `_ml`, and `_cl`, while the punctuation characters are identified separately as keywords such as `"←"`. The parameters are handled specially. The uppercase versions `𝕎𝕏𝔽𝔾` and lowercase versions `𝕨𝕩𝕗𝕘` are two spellings of the four underlying parameters. +In the BQN [grammar specification](grammar.md), the three primitive classes are grouped into terminals `Fl`, `_ml`, and `_cl`, while the punctuation characters are identified separately as keywords such as `"←"`. The special names are handled specially. The uppercase versions `𝕎𝕏𝔽𝔾𝕊` and lowercase versions `𝕨𝕩𝕗𝕘𝕤` are two spellings of the five underlying inputs and function. |
