diff options
| author | Marshall Lochbaum <mwlochbaum@gmail.com> | 2021-08-11 17:21:31 -0400 |
|---|---|---|
| committer | Marshall Lochbaum <mwlochbaum@gmail.com> | 2021-08-11 17:25:04 -0400 |
| commit | 2afb23928e1984d475cc460e1672e8f6fa0e4dbe (patch) | |
| tree | ebd2cc514294d30b6fa4b36c2ee638326f06ef72 /docs/implementation | |
| parent | eac61ca02074c218667754d5f4ef562e780bae85 (diff) | |
Allow clicking on header to get fragment link
Diffstat (limited to 'docs/implementation')
| -rw-r--r-- | docs/implementation/codfns.html | 10 |
| -rw-r--r-- | docs/implementation/compile/dynamic.html | 22 |
| -rw-r--r-- | docs/implementation/compile/index.html | 2 |
| -rw-r--r-- | docs/implementation/index.html | 2 |
| -rw-r--r-- | docs/implementation/kclaims.html | 10 |
| -rw-r--r-- | docs/implementation/primitive/index.html | 2 |
| -rw-r--r-- | docs/implementation/primitive/random.html | 6 |
| -rw-r--r-- | docs/implementation/primitive/replicate.html | 18 |
| -rw-r--r-- | docs/implementation/primitive/sort.html | 24 |
| -rw-r--r-- | docs/implementation/vm.html | 28 |
10 files changed, 62 insertions, 62 deletions
diff --git a/docs/implementation/codfns.html b/docs/implementation/codfns.html index 57cfa3d5..dcbaae30 100644 --- a/docs/implementation/codfns.html +++ b/docs/implementation/codfns.html @@ -4,16 +4,16 @@ <title>Co-dfns versus BQN's implementation</title> </head> <div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../index.html">BQN</a> / <a href="index.html">implementation</a></div> -<h1 id="co-dfns-versus-bqns-implementation">Co-dfns versus BQN's implementation</h1> +<h1 id="co-dfns-versus-bqns-implementation"><a class="header" href="#co-dfns-versus-bqns-implementation">Co-dfns versus BQN's implementation</a></h1> <p>The BQN self-hosted compiler is directly inspired by the <a href="https://github.com/Co-dfns/Co-dfns">Co-dfns</a> project, a compiler for a subset of <a href="../doc/fromDyalog.html">Dyalog APL</a>. I'm very grateful to Aaron for showing that array-oriented compilation is even possible! In addition to the obvious difference of target language, BQN differs from Co-dfns both in goals and methods.</p> <p>The shared goals of BQN and Co-dfns are to implement a compiler for an array language with whole-array operations. This provides the theoretical benefit of a short <em>critical path</em>, which in practice means that both compilers can make good use of a GPU or a CPU's vector instructions simply by providing an appropriate runtime (however, only Co-dfns has such a runtime—an ArrayFire program on the GPU and Dyalog APL on the CPU). The two implementations also share a preference for working "close to the metal" by passing around arrays of numbers rather than creating abstract types to work with data. Objects are right out. These choices lead to a compact source code implementation, and may have some benefits in terms of how easy it is to write and understand the compiler.</p> -<h2 id="compilation-strategy">Compilation strategy</h2> +<h2 id="compilation-strategy"><a class="header" href="#compilation-strategy">Compilation strategy</a></h2> <p>Co-dfns development has primarily been focused on the core compiler, and not parsing, code generation, or the runtime. The associated Ph.D. thesis and famous 17 lines figure refer to this section, which transforms the abstract syntax tree (AST) of a program to a lower-level form, and resolves lexical scoping by linking variables to their definitions. While all of Co-dfns is written in APL, other sections aren't necessarily designed to be data-parallel and don't have the same performance guarantees. For example, the parser uses a parsing expression grammar (PEG), a sequential algorithm. In contrast, BQN is entirely written in a data-parallel style. It does not maintain the same clean separation between compiler sections: <a href="../spec/token.html">token formation</a> and literal evaluation is separated into its own function, but parsing, AST manipulation, and code generation overlap.</p> <p>The core Co-dfns compiler is based on manipulating the syntax tree, which is mostly stored as parent and sibling vectors—that is, lists of indices of other nodes in the tree. BQN is less methodical, but generally treats the source tokens as a simple list. This list is reordered in various ways, allowing operations that behave like tree traversals to be performed with scans under the right ordering. This strategy allows BQN to be much stricter in the kinds of operations it uses. 
Co-dfns regularly uses <code><span class='Value'>⍣</span><span class='Function'>≡</span></code> (repeat until convergence) for iteration and creates nested arrays with <code><span class='Value'>⌸</span></code> (Key, like <a href="../doc/group.html">Group</a>), but BQN has only a single instance of iteration to resolve quotes and comments, plus one complex but parallelizable scan for numeric literal processing, and only uses Group to extract identifiers and strings. This means that most primitives, if we count simple reductions and scans as single primitives, are executed a fixed number of times during execution, making complexity analysis even easier than in Co-dfns.</p> -<h2 id="backends-and-optimization">Backends and optimization</h2> +<h2 id="backends-and-optimization"><a class="header" href="#backends-and-optimization">Backends and optimization</a></h2> <p>Co-dfns was designed from the beginning to build GPU programs, and outputs code in ArrayFire (a C++ framework), which is then compiled. GPU programming is quite limiting, and as a result Co-dfns has strict limitations in functionality that are slowly being removed. It now has partial support for nested arrays and array ranks higher than 4. BQN is designed with performance in mind, but implementation effort focused on functionality first, so that arbitrary array structures as well as trains and lexical closures have been supported from the beginning. Rather than target a specific language, it outputs object code to be interpreted by a <a href="vm.html">virtual machine</a>. Another goal for BQN was to not only write the compiler in BQN but to use BQN for the runtime as much as possible. The BQN-based runtime uses a small number of basic array operations provided by the VM. The extra abstraction causes this runtime to be very slow, but this can be fixed by overwriting functions from the runtime with natively-implemented ones.</p> <p>Neither BQN nor Co-dfns significantly optimize their output at the time of writing (it could be said that Co-dfns relies on the ArrayFire backend to optimize). BQN does have one optimization, which is to compute variable lifetimes in functions so that the last access to a variable can clear it. Further optimizations often require finding properties such as reachability in a graph of expressions that probably can't be done efficiently in a strict array style. For this and other reasons it would probably be best to structure compiler optimization as a set of additional modules that can be provided during a given compilation.</p> -<h2 id="error-reporting">Error reporting</h2> +<h2 id="error-reporting"><a class="header" href="#error-reporting">Error reporting</a></h2> <p>Co-dfns doesn't check for compilation errors, while BQN has complete error checking and good error messages, and includes source positions in compiler errors as well as in the compiled code for use in runtime errors. Position tracking and error checking add up to a little more than 20% overhead for the compiler, both in runtime and lines of code. And improving the way errors are reported once found has no cost for working programs, because reporting code only needs to be run if there's a compiler error. 
This leaves room for potentially very sophisticated error analysis to attempt to track down the root cause of a compilation error, but I haven't yet done any work along these lines.</p> -<h2 id="comments">Comments</h2> +<h2 id="comments"><a class="header" href="#comments">Comments</a></h2> <p>Aaron advocates the almost complete separation of code from comments (thesis) in addition to his very terse style as a general programming methodology. I find that this practice makes it hard to connect the documentation to the code, and that it's much slower at giving the quick summary or reminder of functionality that a comment can provide at a glance. In my opinion, one comment on each line strikes a better balance between compactness and accessibility. However, I do plan to write long-form material providing the necessary context and explanations required to understand the compiler.</p> diff --git a/docs/implementation/compile/dynamic.html b/docs/implementation/compile/dynamic.html index 1edb73ef..0ffb6c46 100644 --- a/docs/implementation/compile/dynamic.html +++ b/docs/implementation/compile/dynamic.html @@ -4,12 +4,12 @@ <title>BQN: Dynamic compilation</title> </head> <div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a> / <a href="index.html">compile</a></div> -<h1 id="dynamic-compilation">Dynamic compilation</h1> +<h1 id="dynamic-compilation"><a class="header" href="#dynamic-compilation">Dynamic compilation</a></h1> <p>This page discusses at a high level (without covering any specific source transformations) how compilation and interpretation might be structured in order to execute BQN faster. Effective strategies will compile parts of the program with more specialization or higher optimization as it runs and more information about it becomes known.</p> <p>To avoid confusion, an <strong>interpreter</strong> evaluates code to perform actions or produce a result; a CPU is a machine code interpreter. A <strong>compiler</strong> transforms code in one language or format to another. To run a program that does anything, it must eventually be interpreted; before this the program or parts of it might be compiled any number of times.</p> <p>The starting point for algorithms described here is bytecode for the BQN <a href="../vm.html">virtual machine</a>. Producing this bytecode mostly amounts to parsing the program, which any implementation would have to do anyway, and it's fast enough that compiling an entire file before starting isn't a big concern. The bytecode can be interpreted directly, but compiler passes can reduce the overhead for each instruction or implement groups of instructions more effectively. Sophisticated compiler passes cost time, so the expected improvement must be balanced against the time spent, even though neither is known precisely in advance.</p> -<h2 id="considerations">Considerations</h2> -<h3 id="what-are-we-optimizing">What are we optimizing?</h3> +<h2 id="considerations"><a class="header" href="#considerations">Considerations</a></h2> +<h3 id="what-are-we-optimizing"><a class="header" href="#what-are-we-optimizing">What are we optimizing?</a></h3> <p>How a program can be optimized depends on why it's taking time to run. Here we'll assume it's the core BQN syntax and primitives that are responsible, as things like system interaction are out of scope, and that the source file is not unreasonably large. 
The total time taken is the sum of the time taken by each action performed, so there are two main possibilities:</p> <ul> <li>Primitives take a long time, because of large arrays.</li> @@ -18,7 +18,7 @@ <p>If many bytecode instructions are evaluated, it must be that blocks are repeated, because each instruction can only be run once by its containing block. A derived function might be run many times by a modifier, which typically involves a large array, but could also be because of Repeat (<code><span class='Modifier2'>⍟</span></code>).</p> <p>This is an array-based viewpoint, because in a low-level language the large array case would just be considered one of several kinds of repetition. Traditionally APL focused exclusively on speeding up the large array case; BQN's compilation makes better block performance a reachable goal.</p> <p>The two conditions are routinely mixed in various ways: a program might split its time between manipulating small and large arrays, or it might work with large arrays but sometimes apply a block function to each element or small groups of elements. Array size or number of iterations could even differ between program runs. An evaluation strategy needs to adapt to these changes.</p> -<h3 id="hardware">Hardware</h3> +<h3 id="hardware"><a class="header" href="#hardware">Hardware</a></h3> <p>It's easiest to get BQN running on a single-threaded scalar CPU, but all of the following improvements are commonly available, and can greatly increase speed for some programs:</p> <ul> <li>SIMD instructions</li> @@ -27,30 +27,30 @@ </ul> <p>Each of the three is a good fit for array programming: any size array with SIMD and large arrays for multi-threading and GPU use. Multi-threading can also be useful for blocks that don't have side effects.</p> <p>SIMD instructions can be used at the primitive level (<a href="../primitive/index.html">primitive implementation notes</a>), with some improvement by fusing small operations together. Multi-threading is also possible at the primitive level, but no coordination between primitives will lead to cache coherency issues, and wasted time as some cores wait for others. It's probably better to analyze groups of primitives to allocate data and work. GPU execution is the fastest method for suitable programs, but can only handle certain kinds of code well and must use compiled kernels, which need to do a significant amount of work to justify the overhead.</p> -<h3 id="cached-optimization">Cached optimization</h3> +<h3 id="cached-optimization"><a class="header" href="#cached-optimization">Cached optimization</a></h3> <p>Expensive ahead-of-time optimizations can be saved somewhere in the filesystem (presumably the XDG cache directory for Linux). This of course introduces the problem of cache invalidation: cache files need to be versioned so an old cache isn't used by a new BQN, and they need to be thrown out when the file is changed and avoid saving information about user input or external files that could change.</p> <p>I think the data about a BQN source file that can be saved is generally pretty cheap to compute, and it's also important to make sure completely new code runs quickly. 
I'm hoping we don't have a reason to resort to caching.</p> -<h2 id="strategies">Strategies</h2> -<h3 id="hot-paths">Hot paths</h3> +<h2 id="strategies"><a class="header" href="#strategies">Strategies</a></h2> +<h3 id="hot-paths"><a class="header" href="#hot-paths">Hot paths</a></h3> <p>The basic strategy for JIT compilers like Java and Javascript (hey, they do actually have something in common!) is to track the number of times a block is called and perform more optimization as this number increases. It's also possible to record information about the inputs to do more specialized compilation, usually with a test in the compiled code to abort when the specialization is invalid.</p> <p>CBQN already includes a count to control native compilation; because this compilation is fast the best value is very low, between 1 and 5. More powerful optimization would generally use higher counts.</p> <p>Iteration modifiers <code><span class='Modifier'>˘</span><span class='Modifier2'>⎉</span><span class='Modifier'>¨⌜´˝`</span><span class='Modifier2'>⍟</span></code> can quickly compute how many times they will call the operand: it's usually based on the result size or argument length. The slowest case is probably monadic Scan <code><span class='Modifier'>`</span></code>, where it's the size of <code><span class='Value'>𝕩</span></code> minus the cell size, both values that already need to be computed. This number can be added to the existing count to find what level of optimization would be used, or even compared without adding, if false negatives are acceptable. This reduces the number of times the program runs blocks at the wrong optimization level, but slightly increases the overhead of mapping modifiers even on calls where the block to be run is already optimized at a high level. It can be restricted to only modifiers with a block operand because the modifier needs to inspect its operand anyway—the cost of running <code><span class='Function'>=</span><span class='Modifier'>⌜</span></code> or <code><span class='Function'>+</span><span class='Modifier'>´</span></code> without optimization is too high to skip this.</p> <p>Iteration modifiers on typed arrays often allow the argument types to be known with certainty, so that the operand can be specialized on that type with no testing.</p>
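To make the counting scheme concrete, here is a minimal C sketch; the thresholds, tier names, and batch interface are illustrative assumptions, not CBQN's actual scheme:

```c
#include <stdio.h>

/* Hypothetical call-count tiering: a block is promoted to a higher
   optimization level once its call count crosses a threshold. The
   thresholds and tiers here are made up for illustration. */
typedef struct { long calls; int tier; } block;

static void maybe_promote(block *b) {
    if (b->tier < 2 && b->calls >= 1000) b->tier = 2;    /* heavy optimization */
    else if (b->tier < 1 && b->calls >= 2) b->tier = 1;  /* cheap native compile */
}

/* An iteration modifier that knows it will call the block n times adds n
   up front, so the whole batch runs at the right level. */
static void call_batch(block *b, long n) {
    b->calls += n;
    maybe_promote(b);
    printf("running %ld calls at tier %d\n", n, b->tier);
}

int main(void) {
    block b = {0, 0};
    call_batch(&b, 1);     /* one-off call: stays interpreted */
    call_batch(&b, 5000);  /* a mapping modifier promotes it before the batch */
    return 0;
}
```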
-<h3 id="metadata">Metadata</h3> +<h3 id="metadata"><a class="header" href="#metadata">Metadata</a></h3> <p>There's a lot of information about values that can be used to optimize, either in the general case if it's known for sure, or a specialization if it's suspected, or known under specific conditions. Type is the most important, then depth, rank, shape and element metadata for arrays and range or even value for numbers and characters. Other bulk properties like sortedness (to enable faster searches) or sum (for example, to find the shape of <code><span class='Function'>/</span><span class='Value'>𝕩</span></code>) can be useful for arrays.</p> <p>Proofs that constrain the metadata in all cases are the most valuable, since the properties don't have to be tested or recomputed if they're already known. One way to get this information is to do an initial run of a block that only propagates known information about metadata. 
For example, if <code><span class='Brace'>{</span><span class='Function'>+</span><span class='Modifier'>´</span><span class='Function'>↕</span><span class='Value'>𝕩</span><span class='Brace'>}</span></code> is run on an integer array, <code><span class='Function'>↕</span></code> can return a list of natural numbers with the same type as <code><span class='Value'>𝕩</span></code> or cause an error, and <code><span class='Function'>+</span><span class='Modifier'>´</span></code>, given a list of natural numbers, returns a natural number. This approach can only help if the block will be run multiple times: clearly for a single run computing metadata along with data is the fastest. It runs at interpreted rather than native speed (because it's only run once) and could in fact take many calls to break even. For large array code, where the interpretive overhead is irrelevant, it could also be a step before a compilation pass that fuses and rearranges primitives.</p> <p>The same procedure can be run on local rather than global constraints, which might produce more specialized code at the cost of running through the block once per specialization.</p> <p>Saving metadata from the first run is another possibility, with very low overhead. This most naturally provides a guess as to what the metadata usually is, but it may also be possible to keep track of when metadata is "known" with a flag system.</p> <p>The desire to do metadata computations, or pure data ones once metadata is known, suggests a system with a "wrapper" that computes type, shape, and so on, then selects and calls a "kernel" function for the computation. Specialized code could use a particular kernel, or a different wrapper that selects from a subset of the kernels.</p>
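A minimal C sketch of such a wrapper/kernel split, assuming hypothetical type tags and using addition as the example computation (this is not CBQN's actual interface):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: the wrapper does the metadata work (here just a type
   dispatch; checks and result allocation would also go here) and picks a
   kernel. Specialized compiled code could bypass it and call one kernel. */
enum type { T_I32, T_F64 };
typedef struct { enum type type; size_t n; void *data; } array;
typedef void (*kernel)(void *r, const void *w, const void *x, size_t n);

static void add_i32(void *r, const void *w, const void *x, size_t n) {
    for (size_t i = 0; i < n; i++)
        ((int32_t *)r)[i] = ((const int32_t *)w)[i] + ((const int32_t *)x)[i];
}
static void add_f64(void *r, const void *w, const void *x, size_t n) {
    for (size_t i = 0; i < n; i++)
        ((double *)r)[i] = ((const double *)w)[i] + ((const double *)x)[i];
}
static const kernel add_kernels[] = { add_i32, add_f64 };  /* indexed by type */

void add_wrapper(array *r, const array *w, const array *x) {
    add_kernels[x->type](r->data, w->data, x->data, x->n);
}
```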
-<h3 id="derived-functions">Derived functions</h3> +<h3 id="derived-functions"><a class="header" href="#derived-functions">Derived functions</a></h3> <p>Like blocks, it can be valuable to optimize derived functions if they are run many times. Derived functions are often known at the program start by constant folding, but might also be constructed dynamically, particularly to bind an argument to a function.</p> <p>Compound arithmetic functions like <code><span class='Function'>+</span><span class='Modifier'>´</span></code>, <code><span class='Function'>⌈</span><span class='Modifier'>`</span></code>, or <code><span class='Function'>=</span><span class='Modifier'>⌜</span></code> are essential to array programming, and have fast SIMD implementations, so they need to be recognized wherever they are found.</p> <p>In addition to these, there are patterns like <code><span class='Function'>≠</span><span class='Modifier'>¨</span><span class='Modifier2'>∘</span><span class='Function'>⊔</span></code> that can be implemented faster than their components, and bindings like <code><span class='Value'>l</span><span class='Modifier2'>⊸</span><span class='Function'>⊐</span></code> where a computation (here, a hash table) on the left argument can be saved. These can be handled by inspecting the function. However, it's more robust to convert it to a canonical form, so this possibility should also be considered.</p> <p>Tacit code can be converted to <a href="https://en.wikipedia.org/wiki/Static_single_assignment_form">SSA</a> form very easily. To translate it into stack-based bytecode it would need a way to reuse values from the stack in multiple places; instructions to duplicate or extract a value from higher in the stack are an obvious candidate. 
Either of these forms is a natural step on the way to native compilation, and a bytecode representation would make it easier to optimize mixed tacit and explicit code—but it's easier to do the optimizations on SSA-form rather than stack-based code, so perhaps the right path is to convert both bytecode and derived functions to SSA.</p> -<h3 id="compile-in-another-thread">Compile in another thread</h3> +<h3 id="compile-in-another-thread"><a class="header" href="#compile-in-another-thread">Compile in another thread</a></h3> <p>A simple and widely-used strategy to reduce slowdown due to dynamic compilation is to compile blocks in a separate thread from the one that runs them. The new code needs to be added in a thread-safe manner, which is not hard as the set of optimized implementations is a small lookup table of some sort with only one writer.</p> <p>If the implementation is able to make use of all available threads (possible when working with large arrays), then it's still important to minimize compilation time as that thread could be put to better use. If there are idle threads then the only costs of compilation overhead are minor: the optimized code can't be put to use as quickly, and there is more power draw and possible downclocking.</p> -<h3 id="anticipation">Anticipation</h3> +<h3 id="anticipation"><a class="header" href="#anticipation">Anticipation</a></h3> <p>The <a href="#hot-paths">hot path</a> strategy depends on targeting code for optimization based on history. Anticipation would identify in advance what code will take longer to run, and allocate a fraction of the time taken for optimizing that code. This is most useful for code that runs a small number of times on large arrays. An example where anticipation would be very important is for a programmer trying experimental one-off queries on a large dataset.</p> <p>The end result seems similar to that obtained by thunks as discussed at Dyalog '18 (<a href="https://dyalog.tv/Dyalog18/?v=-6no6N3i9Tg">video</a>, <a href="https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D15_The_Interpretive_Advantage.zip">slides</a>). A thunk runs as part of a primitive, detecting that computing the result will be slow and outputting a deferred computation instead. Anticipation is more powerful because it can scan ahead in the bytecode instead of deciding as primitives are called whether or not to expand the thunk.</p> <p>Anticipation attempts to improve program speed while bounding the added overhead. For example, it might be constrained to add no more than 5% to the time to first program output, relative to base-level interpretation. The idea is to exit normal interpretation as soon as a large enough lower bound is established on this time, for example if an operation would create a large array. 
At this point it begins analysis, which will involve at least some shape propagation and probably increase the lower bound and optimization budget.</p> diff --git a/docs/implementation/compile/index.html b/docs/implementation/compile/index.html index 4ec585f0..44424f90 100644 --- a/docs/implementation/compile/index.html +++ b/docs/implementation/compile/index.html @@ -4,7 +4,7 @@ <title>BQN: Optimizing compilation notes</title> </head> <div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a></div> -<h1 id="optimizing-compilation-notes">Optimizing compilation notes</h1> +<h1 id="optimizing-compilation-notes"><a class="header" href="#optimizing-compilation-notes">Optimizing compilation notes</a></h1> <p>Pages here discuss advanced compilation strategies for BQN, that is, steps that might take place after compiling to bytecode or a similar intermediate representation.</p> <p>Most content here is currently speculative: C, Java, and Javascript backends are capable of compiling to native (x86, JVM, or Javascript) code in order to lower evaluation overhead but don't perform much if any analysis to improve this code. CBQN is likely to start making such optimizations in the future.</p> <ul> diff --git a/docs/implementation/index.html b/docs/implementation/index.html index d0b73c59..8e46bc35 100644 --- a/docs/implementation/index.html +++ b/docs/implementation/index.html @@ -4,7 +4,7 @@ <title>BQN implementation notes</title> </head> <div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../index.html">BQN</a></div> -<h1 id="bqn-implementation-notes">BQN implementation notes</h1> +<h1 id="bqn-implementation-notes"><a class="header" href="#bqn-implementation-notes">BQN implementation notes</a></h1> <p>Notes about how BQN is or could be implemented.</p> <p>This repository's BQN implementation is written mainly in BQN: the bytecode <a href="https://github.com/mlochbaum/BQN/blob/master/implementation/../src/c.bqn">compiler</a> is completely self-hosted, and the majority of the runtime (<a href="https://github.com/mlochbaum/BQN/blob/master/implementation/../src/r0.bqn">r0</a>, <a href="https://github.com/mlochbaum/BQN/blob/master/implementation/../src/r1.bqn">r1</a>) is written in BQN except that it is allowed to define primitives; some preprocessing (<a href="https://github.com/mlochbaum/BQN/blob/master/implementation/../src/pr.bqn">pr</a>) turns the primitives into identifiers.</p> <p>The remaining part, a Virtual Machine (VM), can be implemented in any language to obtain a version of BQN running in that language. The VM used for the online REPL is the <a href="https://github.com/mlochbaum/BQN/blob/master/implementation/../docs/bqn.js">Javascript implementation</a>, while <a href="https://github.com/dzaima/CBQN">CBQN</a> is a more advanced VM in C. 
There are platform-specific and generic tests in the <a href="https://github.com/mlochbaum/BQN/blob/master/implementation/../test/">test</a> directory.</p> diff --git a/docs/implementation/kclaims.html b/docs/implementation/kclaims.html index 4d028e6c..fcc0ddac 100644 --- a/docs/implementation/kclaims.html +++ b/docs/implementation/kclaims.html @@ -4,30 +4,30 @@ <title>BQN: Wild claims about K performance</title> </head> <div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../index.html">BQN</a> / <a href="index.html">implementation</a></div> -<h1 id="wild-claims-about-k-performance">Wild claims about K performance</h1> +<h1 id="wild-claims-about-k-performance"><a class="header" href="#wild-claims-about-k-performance">Wild claims about K performance</a></h1> <p>Sometimes I see unsourced, unclear, vaguely mystical claims about K being the fastest array language. It happens often enough that I'd like to write a long-form rebuttal to these, and a demand that the people who make these do more to justify them.</p> <p>This isn't meant to put down the K language! K is in fact the only APL-family language other than BQN that I would recommend without reservations. And there's nothing wrong with the K community as a whole. Go to <a href="https://chat.stackexchange.com/rooms/90748/the-k-tree">the k tree</a> and meet them! What I want to fight is the <em>myth</em> of K, which is carried around as much by those who used K once upon a time, and no longer have any connection to it, as by active users.</p> <p>The points I argue here are narrow. To some extent I'm picking out the craziest things said about K to argue against. Please don't assume whoever you're talking to thinks these crazy things about K just because I wrote them here. Or, if they are wrong about these topics, that they're wrong about everything. Performance is a complicated and often counter-intuitive field and it's easy to be misled.</p> <p>On that note, it's possible <em>I've</em> made mistakes, such as incorrectly designing or interpreting benchmarks. If you present me with concrete evidence against something I wrote below, I promise I'll revise this page to include it, even if I just have to quote verbatim because I don't understand a word of it.</p> -<h2 id="the-fastest-array-language">The fastest array language</h2> +<h2 id="the-fastest-array-language"><a class="header" href="#the-fastest-array-language">The fastest array language</a></h2> <p>When you ask what the fastest array language is, chances are someone is there to answer one of k, kdb, or q. I can't offer benchmarks that contradict this, but I will argue that there's little reason to take these people at their word.</p> <p>The reason I have no measurements is that every contract for a commercial K includes an anti-benchmark clause. For example, Shakti's <a href="https://shakti.com/download/license">license</a> says users cannot "distribute or otherwise make available to any third party any report regarding the performance of the Software benchmarks or any information from such a report". As I would be unable to share the results, I have not taken benchmarks of any commercial K. Or downloaded one for that matter. Shakti could publish benchmarks; they choose to publish a handful of comparisons with database software and none with array languages or frameworks. 
I do run tests with <a href="https://codeberg.org/ngn/k">ngn/k</a>, which is developed with goals similar to Whitney's K; the author says it's slower than Shakti but not by too much.</p> <p>The primary reason I don't give any credence to claims that K is the best is that they are always devoid of specifics. Most importantly, the same assertion is made across decades even though performance in J, Dyalog, and NumPy has improved by leaps and bounds in the meantime—I participated in advances of <a href="https://www.dyalog.com/dyalog/dyalog-versions/170/performance.htm">26%</a> and <a href="https://www.dyalog.com/dyalog-versions/180/performance.htm">10%</a> in overall Dyalog benchmarks in the last two major versions. Has K4 (the engine behind kdb and Q) kept pace? Maybe it's fallen behind since Arthur left but Shakti K is better? Which other array languages has the poster used? Doesn't matter—<em>they</em> are all the same but <em>K</em> is better.</p> <p>A related theme I find is equivocating between different kinds of performance. I suspect that for interpreting scalar code K is faster than APL and J but slower than Javascript, and certainly any compiled language. For operations on arrays, maybe it beats Javascript and Java but loses to current Dyalog and tensor frameworks. Simple database queries, Shakti says it's faster than Spark and Postgres but is silent about newer in-memory databases. The most extreme K advocates sweep away all this complexity by comparing K to weaker contenders in each category. Just about any language can be "the best" with this approach.</p> <p>Before getting into array-based versus scalar code, here's a simpler case. It's well known that K works on one list at a time, that is, if you have a matrix—in K, a list of lists—then applying an operation (say sum) to each row works on each one independently. If the rows are short then there's function overhead for each one. In APL, J, and BQN, the matrix is stored as one unit with a stride. The sum can use one metadata computation for all rows, and there's usually special code for many row-wise functions. I measured that Dyalog is 30 times faster than ngn/k to sum rows of a ten-million by three float (double) matrix, for one fairly representative example. It's fine to say—as many K-ers do—that these cases don't matter or can be avoided in practice; it's dishonest (or ignorant) to claim they don't exist.</p> -<h2 id="scalar-versus-array-code">Scalar versus array code</h2> +<h2 id="scalar-versus-array-code"><a class="header" href="#scalar-versus-array-code">Scalar versus array code</a></h2> <p>I have a suspicion that users sometimes think K is faster than APL because they try out a Fibonacci function or other one-number-at-a-time code. Erm, your boat turns faster than a battleship, congratulations? <em>Python</em> beats these languages at interpreted performance. By like a factor of five. The only reason for anyone to think this is relevant is if they have a one-dimensional model where J is "better" than Python, so K is "better" than both.</p> <p>Popular APL and J implementations interpret source code directly, without even building an AST. This is very slow, and Dyalog has several other pathologies that get in the way as well. Like storing the execution stack in the workspace to prevent stack overflows, and the requirement that a user can save a workspace with paused code and resume it <em>in a later version</em>. 
But the overhead is per token executed, and a programmer can avoid the cost by working on large arrays where one token does a whole lot of work. If you want to show a language is faster than APL generally, this is the kind of code to look at.</p> <p>K is designed to be as fast as possible when interpreting scalar code, for example using a <a href="https://k.miraheze.org/wiki/Grammar">grammar</a> that's much simpler than <a href="../spec/grammar.html">BQN's</a> (speed isn't the only benefit of being simpler of course, but it's clearly a consideration). It succeeds at this, and K interpreters are very fast, even without bytecode compilation in advance.</p> <p>But K still isn't good at scalar code! It's an interpreter (if a good one) for a dynamically-typed language, and will be slower than compiled languages like C and Go, or JIT-compiled ones like Javascript and Java. A compiler generates code to do what you want, while an interpreter is code that reads data (the program) to do what you want. Once the code is compiled, the interpreter has an extra step and <em>has</em> to be slower. This is why BQN uses compiler-based strategies to speed up execution, first compiling to bytecode to make syntax overhead irrelevant and then usually post-processing that bytecode. Compilation is fast enough that it's perfectly fine to compile code every time it's run.</p> -<h2 id="instruction-cache">Instruction cache</h2> +<h2 id="instruction-cache"><a class="header" href="#instruction-cache">Instruction cache</a></h2> <p>A more specific claim about K is that the key to its speed is that the interpreter, or some part of it, fits in L1 cache. I know Arthur Whitney himself has said this; I can't find that now but <a href="https://kx.com/blog/what-makes-time-series-database-kdb-so-fast/">here</a>'s some material from KX about the "L1/2 cache". Maybe this was a relevant factor in the early days of K around 2000—I'm doubtful. In the 2020s it's ridiculous to say that instruction caching matters.</p> <p>Let's clarify terms first. The CPU cache is a set of storage areas that are smaller and faster than RAM; memory is copied there when it's used so it will be faster to access it again later. L1 is the smallest and fastest level. On a typical CPU these days it might consist of 64KB of <em>data</em> cache for memory to be read and written, and 64KB of <em>instruction</em> cache for memory to be executed by the CPU. When I've seen it, the L1 cache claim is specifically about the K interpreter (and not the data it works with) fitting in the cache, so it clearly refers to the instruction cache.</p> <p>(Unlike the instruction cache, the data cache is a major factor that makes array languages faster. It's what terms like "cache-friendly" typically refer to. I think the reason KX prefers to talk about the instruction cache is that it allows them to link this well-known consideration to the size of the kdb binary, which is easily measured and clearly different from other products. Anyone can claim to use cache-friendly algorithms.)</p> <p>A K interpreter will definitely benefit from the instruction cache. Unfortunately, that's where the truth of this claim runs out. Any other interpreter you use will get just about the same benefit, because the most used code will fit in the cache with plenty of room to spare. And the best case you get from a fast core interpreter loop is fast handling of scalar code—exactly the case that array languages typically ignore.</p> <p>So, 64KB of instruction cache. 
That would be small even for a K interpreter. Why is it enough? I claim specifically that while running a program might cause a cache miss once in a while, the total cost of these will only ever be a small fraction of execution time. This is because an interpreter is made of loops: a core loop to run the program as a whole and usually smaller loops for some specific instructions. These loops are small, with the core loop being on the larger side. In fact it can be pretty huge if the interpreter has a lot of exotic instructions, but memory is brought to the cache in lines of around 64 bytes, so that unused regions can be ignored. The active portions might take up a kilobyte or two. Furthermore, you've got the L2 and L3 caches as backup, which are many times larger than L1 and not much slower.</p> <p>So a single loop doesn't overflow the cache. And the meaning of a loop is that it's loaded once but run multiple times—for array operations, it could be a huge number. The body of an interpreter loop isn't likely to be fast either, typically performing some memory accesses or branches or both. An L1 instruction cache miss costs tens of cycles if it's caught by another cache layer and hundreds if it goes to memory. Twenty cycles would be astonishingly fast for a go around the core interpreter loop, and array operation loops are usually five cycles or more, plus a few tens in setup. It doesn't take many loops to overcome a cache miss, and interpreting any program that doesn't finish instantly will take millions of iterations or more, spread across various loops.</p>
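To make the loop structure concrete, here is a toy stack-machine dispatch loop in C; the opcodes are invented for illustration and don't correspond to any real K or BQN VM. Even a switch with hundreds of cases only brings the lines it actually executes into the cache:

```c
#include <stdint.h>

/* Toy interpreter core loop. The hot code is this one small loop plus the
   per-instruction loops: a real array instruction would spend its time in
   its own tight loop over the data. */
enum { OP_CONST, OP_ADD, OP_RET };

double run(const uint8_t *code, const double *consts) {
    double stack[64];
    int sp = 0;
    for (;;) {
        switch (*code++) {                             /* fetch and dispatch */
        case OP_CONST: stack[sp++] = consts[*code++]; break;
        case OP_ADD:   sp--; stack[sp-1] += stack[sp]; break;
        case OP_RET:   return stack[sp-1];
        }
    }
}
```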
-<h3 id="measuring-l1-with-perf">Measuring L1 with perf</h3> +<h3 id="measuring-l1-with-perf"><a class="header" href="#measuring-l1-with-perf">Measuring L1 with perf</a></h3> <p>Look, you can measure this stuff. Linux has a nice tool called <a href="https://en.wikipedia.org/wiki/Perf_(Linux)">perf</a> that can track all sorts of hardware events related to your program, including cache misses. You can pass in a list of events with <code><span class='Function'>-</span><span class='Value'>e</span></code> followed by the program to be run. It can even distinguish instruction from data cache misses! 
I'll be showing the following events:</p> <pre><span class='Value'>perf</span> <span class='Value'>stat</span> <span class='Function'>-</span><span class='Value'>e</span> <span class='Value'>cycles</span><span class='Separator'>,</span><span class='Value'>icache_16b.ifdata_stall</span><span class='Separator'>,</span><span class='Value'>cache</span><span class='Function'>-</span><span class='Value'>misses</span><span class='Separator'>,</span><span class='Function'>L1-</span><span class='Value'>dcache</span><span class='Function'>-</span><span class='Value'>load</span><span class='Function'>-</span><span class='Value'>misses</span><span class='Separator'>,</span><span class='Function'>L1-</span><span class='Value'>icache</span><span class='Function'>-</span><span class='Value'>load</span><span class='Function'>-</span><span class='Value'>misses</span> </pre> diff --git a/docs/implementation/primitive/index.html b/docs/implementation/primitive/index.html index b5256302..f139e5f4 100644 --- a/docs/implementation/primitive/index.html +++ b/docs/implementation/primitive/index.html @@ -4,7 +4,7 @@ <title>BQN: Primitive implementation notes</title> </head> <div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a></div> -<h1 id="primitive-implementation-notes">Primitive implementation notes</h1> +<h1 id="primitive-implementation-notes"><a class="header" href="#primitive-implementation-notes">Primitive implementation notes</a></h1> <p>Commentary on the best methods I know for implementing various primitives. Often there are many algorithms that are viable in different situations, and in these cases I try to discuss the tradeoffs.</p> <ul> <li><a href="replicate.html">Replicate</a></li> diff --git a/docs/implementation/primitive/random.html b/docs/implementation/primitive/random.html index 7f660792..c907ee1c 100644 --- a/docs/implementation/primitive/random.html +++ b/docs/implementation/primitive/random.html @@ -4,12 +4,12 @@ <title>BQN: Implementation of random stuff</title> </head> <div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a> / <a href="index.html">primitive</a></div> -<h1 id="implementation-of-random-stuff">Implementation of random stuff</h1> +<h1 id="implementation-of-random-stuff"><a class="header" href="#implementation-of-random-stuff">Implementation of random stuff</a></h1> <p>Not a primitive, but CBQN's <code><span class='Function'>•MakeRand</span></code> initializes a random number generator that has some built-in utilities. For clarity we'll call a result of this initialization <code><span class='Value'>rand</span></code> in the text below.</p> -<h2 id="random-number-generation">Random number generation</h2> +<h2 id="random-number-generation"><a class="header" href="#random-number-generation">Random number generation</a></h2> <p>CBQN is currently using wyrand, part of the <a href="https://github.com/wangyi-fudan/wyhash">wyhash</a> library. It's extremely fast, passes the expected test suites, and no one's raised any concerns about it yet (but it's very new). It uses only 64 bits of state and doesn't have extra features like jump ahead.</p> <p>Other choices are <a href="https://prng.di.unimi.it/">xoshiro++</a> and <a href="https://www.pcg-random.org/">PCG</a>. 
The authors of these algorithms (co-author for xoshiro) hate each other very much and have spent quite some time slinging mud at each other. As far as I can tell they both have the normal small bias in favor of their own algorithms but are wildly unfair towards the other side, choosing misleading examples and inflating minor issues. I think both generators are good but find the case for xoshiro a little more convincing, and I think it's done better in third-party benchmarks.</p> -<h2 id="simple-random-sample">Simple random sample</h2> +<h2 id="simple-random-sample"><a class="header" href="#simple-random-sample">Simple random sample</a></h2> <p>A <a href="https://en.wikipedia.org/wiki/Simple_random_sample">simple random sample</a> from a set is a subset with a specified size, chosen so that each subset of that size has equal probability. <code><span class='Value'>rand.</span><span class='Function'>Deal</span></code> gets a sample of size <code><span class='Value'>𝕨</span></code> from the set <code><span class='Function'>↕</span><span class='Value'>𝕩</span></code> with elements in a uniformly random order, and <code><span class='Value'>rand.</span><span class='Function'>Subset</span></code> does the same but sorts the elements.</p> <p><code><span class='Function'>Deal</span></code> uses a <a href="https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle">Knuth shuffle</a>, stopping after the first <code><span class='Value'>𝕨</span></code> elements have been shuffled, as the algorithm won't touch them again. Usually it creates <code><span class='Function'>↕</span><span class='Value'>𝕩</span></code> explicitly for this purpose, but if <code><span class='Value'>𝕨</span></code> is very small then initializing it is too slow. In this case we initialize <code><span class='Function'>↕</span><span class='Value'>𝕨</span></code>, but use a "hash" table with an identity hash—the numbers are already random—for <code><span class='Value'>𝕨</span><span class='Function'>↓↕</span><span class='Value'>𝕩</span></code>. The default is that every value in the table is equal to its key, so that only entries where a swap has happened need to be stored. The hash table is the same design as for self-search functions, with open addressing and linear probing.</p> <p><code><span class='Function'>Subset</span></code> uses <a href="https://math.stackexchange.com/questions/178690/whats-the-proof-of-correctness-for-robert-floyds-algorithm-for-selecting-a-sin">Floyd's method</a>, which is sort of a modification of shuffling where only the selected elements need to be stored, not what they were swapped with. This requires a lookup structure that can be updated efficiently and output all elements in sorted order. The choices are a bitset for large <code><span class='Value'>𝕨</span></code> and another not-really-hash table for small <code><span class='Value'>𝕨</span></code>. The table uses a right shift—that is, division by a power of two—as a hash so that hashing preserves the ordering, and inserts like an insertion sort: any larger entries are pushed forward. Really this is an online sorting algorithm that works because we know the input distribution is well-behaved (it degrades to quadratic performance only in very unlikely cases). When <code><span class='Value'>𝕨</span><span class='Function'>></span><span class='Value'>𝕩</span><span class='Function'>÷</span><span class='Number'>2</span></code>, we always use a bitset, but select <code><span class='Value'>𝕩</span><span class='Function'>-</span><span class='Value'>𝕨</span></code> elements and invert the selection.</p>
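For reference, here is a C sketch of Floyd's method. It substitutes a plain membership array for the bitset/ordered-table choices described above and uses rand() as a stand-in generator (slightly biased, but fine for illustration):

```c
#include <stdlib.h>

/* Choose w distinct values from ↕x; the final scan outputs them sorted.
   Simplified sketch: real code would pick the lookup structure by size
   and use an unbiased RNG instead of rand()%m. */
void subset(int w, int x, int *out) {
    unsigned char *in = calloc(x, 1);   /* membership flags */
    for (int j = x - w; j < x; j++) {   /* Floyd's loop */
        int t = rand() % (j + 1);       /* index in [0, j] */
        if (in[t]) t = j;               /* collision: j itself is always free */
        in[t] = 1;
    }
    for (int i = 0, c = 0; i < x; i++)  /* emit in sorted order */
        if (in[i]) out[c++] = i;
    free(in);
}
```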
diff --git a/docs/implementation/primitive/replicate.html b/docs/implementation/primitive/replicate.html index 9d539d40..9e07a754 100644 --- a/docs/implementation/primitive/replicate.html +++ b/docs/implementation/primitive/replicate.html @@ -4,23 +4,23 @@ <title>BQN: Implementation of Indices and Replicate</title> </head> <div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a> / <a href="index.html">primitive</a></div> -<h1 id="implementation-of-indices-and-replicate">Implementation of Indices and Replicate</h1> +<h1 id="implementation-of-indices-and-replicate"><a class="header" href="#implementation-of-indices-and-replicate">Implementation of Indices and Replicate</a></h1> <p>The replicate family of functions contains not just primitives but powerful tools for implementing other functionality. The most important is converting <a href="#booleans-to-indices">bits to indices</a>: AVX-512 extensions implement this natively for various index sizes, and even with no SIMD support at all there are surprisingly fast table-based algorithms for it.</p> <p><a href="#replicate">General replication</a> is more complex. Branching will slow many useful cases down considerably when using the obvious solution. However, branch-free techniques introduce overhead for larger replication amounts. Hybridizing these seems to be the only way, but it's finicky.</p> <p>Replicate by a <a href="#constant-replicate">constant amount</a> (so <code><span class='Value'>𝕨</span></code> is a single number) is not too common in itself, but it's notable because it can be the fastest way to implement outer products and scalar dyadics with prefix agreement.</p> -<h2 id="indices">Indices</h2> +<h2 id="indices"><a class="header" href="#indices">Indices</a></h2> <p>Branchless algorithms are fastest, but with unbounded values in <code><span class='Value'>𝕨</span></code> a fully branchless algorithm is impossible because you can't write an arbitrary amount of memory without branching. So the best algorithms depend on bounding <code><span class='Value'>𝕨</span></code>. Fortunately the most useful case is that <code><span class='Value'>𝕨</span></code> is boolean.</p> -<h3 id="booleans-to-indices">Booleans to indices</h3> +<h3 id="booleans-to-indices"><a class="header" href="#booleans-to-indices">Booleans to indices</a></h3> <p>Indices (<code><span class='Function'>/</span></code>) on a boolean <code><span class='Value'>𝕩</span></code> of 256 or fewer bits can be made very fast on generic 64-bit hardware using a lookup table on 8 bits at a time. This algorithm can write past the end by up to 8 bytes (7 if trailing 0s are excluded), but never writes more than 256 bytes total. This means it's suitable for writing to an overallocated result array or a 256-byte buffer.</p> <p>To generate indices, use a 256×8-byte lookup table that goes from bytes to 8-byte index lists, and either a popcount instruction or another lookup table to get the sum of each byte. 
For each byte in <code><span class='Value'>𝕨</span></code>, get the corresponding indices, add an increment, and write them to the current index in the output. Then increase the output index by the byte's sum. The next indices will overlap the 8 bytes written, with the actual indices kept and junk values at the end overwritten. The increment added is an 8-byte value where each byte contains the current input index (always a multiple of 8); it can be added or bitwise or-ed with the lookup value.</p>
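A C sketch of this kernel, including table construction; it assumes little-endian byte order, and the destination needs 8 bytes of slack for the oversized writes:

```c
#include <stdint.h>
#include <string.h>

static uint64_t tab[256];  /* tab[b]: positions of the set bits of byte b, */
static uint8_t  cnt[256];  /* one per packed byte (junk in unused bytes)   */

static void init_tables(void) {  /* run once; assumes little-endian */
    for (int b = 0; b < 256; b++) {
        int c = 0;
        for (int j = 0; j < 8; j++)
            if (b >> j & 1) ((uint8_t *)&tab[b])[c++] = (uint8_t)j;
        cnt[b] = (uint8_t)c;
    }
}

/* Write the indices of set bits of src (at most 256 bits, nbytes <= 32) as
   1-byte indices to dst; returns how many. Each step does one oversized
   8-byte write, so junk past the count is overwritten by the next step. */
size_t bits_to_indices(const uint8_t *src, size_t nbytes, uint8_t *dst) {
    uint64_t inc = 0;  /* current input index, repeated in each byte */
    size_t out = 0;
    for (size_t i = 0; i < nbytes; i++) {
        uint64_t ix = tab[src[i]] | inc;  /* OR works: table bytes are 0-7,  */
        memcpy(dst + out, &ix, 8);        /* increment bytes are multiples of 8 */
        out += cnt[src[i]];
        inc += 0x0808080808080808ULL;     /* add 8 to every packed index */
    }
    return out;
}
```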
<p>Some other methods are discussed by <a href="https://branchfree.org/2018/05/22/bits-to-indexes-in-bmi2-and-avx-512/">Langdale</a> and <a href="https://lemire.me/blog/2018/03/08/iterating-over-set-bits-quickly-simd-edition/">Lemire</a>. I think very large lookup tables are not good for an interpreter because they cause too much cache pressure if used occasionally on smaller arrays. This rules out many of these strategies.</p> -<h3 id="non-booleans-to-indices">Non-booleans to indices</h3> +<h3 id="non-booleans-to-indices"><a class="header" href="#non-booleans-to-indices">Non-booleans to indices</a></h3> <p>If the maximum value in <code><span class='Value'>𝕩</span></code> is, say, 8, then generating indices is fairly fast: for each element, write 8 indices and then move the output pointer forward by that much. This is much like the lookup table algorithm above, minus the lookup table. If the indices need to be larger than one byte, it's fine to expand them, and possibly add an offset, after generation (probably in chunks).</p> <p>There are two ways I know to fill in the gaps that this method would leave with elements that are too large. First is to stop after such an element and fill remaining space branchfully (maybe with <code><span class='Value'>memset</span></code>). This is maximally efficient if <code><span class='Value'>𝕩</span></code> is dominated by large elements—particularly for 2-byte indices when it skips index expansion—but not good if there are a lot of elements near the threshold. Second, initialize the buffer with 0 and perform <code><span class='Function'>⌈</span><span class='Modifier'>`</span></code> afterwards, or other variations. This eliminates all but a fixed amount of branching, but it's a lot of overhead and I think unless a more sophisticated strategy arises it's best to stick with the first method.</p> <p>Indices is half of a counting sort: for sparse values, it's the slower half. Making it fast makes counting sort viable for much larger range-to-length ratios.</p> -<h2 id="replicate">Replicate</h2> +<h2 id="replicate"><a class="header" href="#replicate">Replicate</a></h2> <p>For the most part, understanding Indices is the best way to implement Replicate quickly. But this is not the case if <code><span class='Value'>𝕩</span></code> is boolean because then its elements are smaller than any useful index, and faster methods are available.</p> -<h3 id="compress">Compress</h3> +<h3 id="compress"><a class="header" href="#compress">Compress</a></h3> <p>Most of the methods listed below can be performed in place.</p> <p>For booleans, use BMI2's PEXT (parallel bits extract) instruction, or an emulation of it. The result can be built recursively alongside the also-required popcount using masked shifts.</p> <p>The generally best method for small elements seems to be to generate 1-byte indices into a buffer 256 at a time and select with those. 
There's a branchless method on one bit at a time which is occasionally better, but I don't think the improvement is enough to justify using it.</p> @@ -28,11 +28,11 @@ <p>Odd-sized cells could be handled with an index buffer like small elements, using oversized writes and either overallocating or handling the last element specially.</p> <p>For medium-sized cells copying involves partial writes and so is somewhat inefficient. It's better to split <code><span class='Value'>𝕨</span></code> into groups of 1s in order to copy larger chunks from <code><span class='Value'>𝕩</span></code> at once. So the algorithm repeatedly searches <code><span class='Value'>𝕨</span></code> for the next 1, then the next 0, then copies the corresponding value from <code><span class='Value'>𝕩</span></code> to the result. This might be better for small odd-sized cells as well; I haven't implemented the algorithm with oversized writes to compare.</p> <p>The grouped algorithm, as well as a simpler sparse algorithm that just finds each 1 in <code><span class='Value'>𝕨</span></code>, can also be better for small elements. Whether to use these depends on the value of <code><span class='Function'>+</span><span class='Modifier'>´</span><span class='Value'>𝕨</span></code> (sparse) or <code><span class='Function'>+</span><span class='Modifier'>´</span><span class='Function'>»</span><span class='Modifier2'>⊸</span><span class='Function'><</span><span class='Value'>𝕨</span></code> (clumped). The checking is fast and these cases are common, but the general case is also fast enough that this is not a particularly high priority.</p> -<h3 id="replicate">Replicate</h3> +<h3 id="replicate"><a class="header" href="#replicate">Replicate</a></h3> <p>Like Compress, I think the best algorithm is often to generate small indices in a buffer and then select. But this is inefficient when <code><span class='Value'>𝕨</span></code> contains large values, so those need to be detected and handled. Very tricky.</p> -<h4 id="constant-replicate">Constant replicate</h4> +<h4 id="constant-replicate"><a class="header" href="#constant-replicate">Constant replicate</a></h4> <p>Useful for outer products and leading-axis extension. See <a href="https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrinking-time/">Expanding Bits in Shrinking Time</a> for the boolean case. C compilers will generate decent code for constant small numbers and variable large ones, but I think specialized code with shuffle would be better for small numbers.</p> -<h3 id="higher-ranks">Higher ranks</h3> +<h3 id="higher-ranks"><a class="header" href="#higher-ranks">Higher ranks</a></h3> <p>When replicating along the first axis only, additional axes only change the element size (these are the main reason why a large element method is given). Replicating along a later axis offers a few opportunities for improvement relative to replicating each cell individually.</p> <p>Particularly for boolean <code><span class='Value'>𝕨</span></code>, Select is usually faster than Replicate (a major exception is for a boolean <code><span class='Value'>𝕩</span></code>). Simply replacing <code><span class='Function'>/</span></code> with <code><span class='Function'>/</span><span class='Modifier'>¨</span><span class='Modifier2'>⊸</span><span class='Function'>⊏</span></code> (after checking conformability) could be an improvement. It's probably best to compute the result shape first to avoid doing any work if it's empty. 
Similarly, if early result axes are small then the overhead of separating out Indices might make it worse than just doing the small number of Replicates.</p> <p>A technique for when <code><span class='Value'>𝕨</span></code> is processed one or more bytes at a time and applies to many rows is to repeat it up to an even number of bytes and combine rows of <code><span class='Value'>𝕩</span></code> into longer virtual rows (the last one can be short). I think this only ends up being useful when <code><span class='Value'>𝕩</span></code> is boolean.</p> diff --git a/docs/implementation/primitive/sort.html b/docs/implementation/primitive/sort.html index 2a417770..5c6c3d70 100644 --- a/docs/implementation/primitive/sort.html +++ b/docs/implementation/primitive/sort.html @@ -4,35 +4,35 @@ <title>BQN: Implementation of ordering functions</title> </head> <div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a> / <a href="index.html">primitive</a></div> -<h1 id="implementation-of-ordering-functions">Implementation of ordering functions</h1> +<h1 id="implementation-of-ordering-functions"><a class="header" href="#implementation-of-ordering-functions">Implementation of ordering functions</a></h1> <p>The <a href="../../doc/order.html">ordering functions</a> are Sort (<code><span class='Function'>∧∨</span></code>), Grade (<code><span class='Function'>⍋⍒</span></code>), and Bins (<code><span class='Function'>⍋⍒</span></code>). Although these are well-studied—particularly sorting, and then binary search or "predecessor search"—there are many recent developments, as well as techniques that I have not found in the literature. The three functions are closely related but have important differences in what algorithms are viable. Sorting is a remarkably deep problem with different algorithms able to do a wide range of amazing things, and sophisticated ways to combine those. It is by no means solved. In comparison, Bins is pretty tame.</p> <p>There's a large divide between ordering compound data and simple data. For compound data comparisons are expensive, and the best algorithm will generally be the one that uses the fewest comparisons. For simple data they fall somewhere between cheap and extremely cheap, and fancy branchless and vectorized algorithms are the best.</p> -<h2 id="on-quicksort-versus-merge-sort">On quicksort versus merge sort</h2> +<h2 id="on-quicksort-versus-merge-sort"><a class="header" href="#on-quicksort-versus-merge-sort">On quicksort versus merge sort</a></h2> <p>Merge sort is better. It is deterministic, stable, and has optimal worst-case performance. Its pattern handling is better: while merge sort handles "horizontal" patterns and quicksort does "vertical" ones, merge sort gets useful work out of <em>any</em> sequence of runs but in-place quicksort will quickly mangle its analogue until it may as well be random.</p> <p>But that doesn't mean merge sort is always faster. Quicksort seems to work a little better branchlessly. For sorting, quicksort's partitioning can reduce the range of the data enough to use an extremely quick counting sort. Partitioning is also a natural fit for binary search, where it's mandatory for sensible cache behavior with large enough arguments. So it can be useful. 
But it doesn't merge, and can't easily be made to merge, and that's a shame.</p> <p>The same applies to the general categories of partitioning sorts (quicksort, radix sort, samplesort) and merging sorts (mergesort, timsort, multimerges). Radix sorts are definitely the best for some types and lengths, although the scattered accesses make their performance unpredictable and I think overall they're not worth it. A million uniformly random 4-byte integers is nearly the best possible case for radix sort, so the fact that this seems to be the go-to sorting benchmark means radix sorting looks better than it is.</p> -<h2 id="on-binary-search">On binary search</h2> +<h2 id="on-binary-search"><a class="header" href="#on-binary-search">On binary search</a></h2> <p>Binary searches are very easy to get wrong. Do not write <code><span class='Paren'>(</span><span class='Value'>hi</span><span class='Function'>+</span><span class='Value'>lo</span><span class='Paren'>)</span><span class='Function'>/</span><span class='Number'>2</span></code>: it's not safe from overflows. I always follow the pattern given in the first code block <a href="https://pvk.ca/Blog/2015/11/29/retrospective-on-binary-search-and-on-compression-slash-compilation/">here</a>. This code will never access the value <code><span class='Value'>*base</span></code>, so it should be considered a search on the <code><span class='Value'>n</span><span class='Function'>-</span><span class='Number'>1</span></code> values beginning at <code><span class='Value'>base</span><span class='Function'>+</span><span class='Number'>1</span></code> (the perfect case is when the number of values is one less than a power of two, which is in fact how it has to go). It's branchless and always takes the same number of iterations. To get a version that stops when the answer is known, subtract <code><span class='Value'>n%</span><span class='Number'>2</span></code> from <code><span class='Value'>n</span></code> in the case that <code><span class='Value'>*mid</span> <span class='Function'><</span> <span class='Value'>x</span></code>.</p>
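In C, that pattern looks roughly like the following (a sketch modeled on the linked post, not copied from it); it returns a pointer to the last element less than x, or base itself if there is none:

```c
#include <stddef.h>

/* Branchless binary search over the n-1 values at base+1; *base is never
   read. The iteration count depends only on n, and the ternary compiles
   to a conditional move rather than a branch. */
static const int *search(const int *base, size_t n, int x) {
    while (n > 1) {
        size_t half = n / 2;
        base = (base[half] < x) ? base + half : base;
        n -= half;
    }
    return base;  /* last element < x, or the never-read *base if none */
}
```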
But it doesn't merge, and can't easily be made to merge, and that's a shame.</p> <p>The same applies to the general categories of partitioning sorts (quicksort, radix sort, samplesort) and merging sorts (mergesort, timsort, multimerges). Radix sorts are definitely the best for some types and lengths, although the scattered accesses make their performance unpredictable and I think overall they're not worth it. A million uniformly random 4-byte integers is nearly the best possible case for radix sort, so the fact that this seems to be the go-to sorting benchmark means radix sorting looks better than it is.</p> -<h2 id="on-binary-search">On binary search</h2> +<h2 id="on-binary-search"><a class="header" href="#on-binary-search">On binary search</a></h2> <p>Binary searches are very easy to get wrong. Do not write <code><span class='Paren'>(</span><span class='Value'>hi</span><span class='Function'>+</span><span class='Value'>lo</span><span class='Paren'>)</span><span class='Function'>/</span><span class='Number'>2</span></code>: it's not safe from overflows. I always follow the pattern given in the first code block <a href="https://pvk.ca/Blog/2015/11/29/retrospective-on-binary-search-and-on-compression-slash-compilation/">here</a>. This code will never access the value <code><span class='Value'>*base</span></code>, so it should be considered a search on the <code><span class='Value'>n</span><span class='Function'>-</span><span class='Number'>1</span></code> values beginning at <code><span class='Value'>base</span><span class='Function'>+</span><span class='Number'>1</span></code> (the perfect case is when the number of values is one less than a power of two, which is in fact how it has to go). It's branchless and always takes the same number of iterations. To get a version that stops when the answer is known, subtract <code><span class='Value'>n%</span><span class='Number'>2</span></code> from <code><span class='Value'>n</span></code> in the case that <code><span class='Value'>*mid</span> <span class='Function'><</span> <span class='Value'>x</span></code>.</p> -<h2 id="compound-data">Compound data</h2> +<h2 id="compound-data"><a class="header" href="#compound-data">Compound data</a></h2> <p>Array comparisons are expensive. The goal here is almost entirely to minimize the number of comparisons. Which is a much less complex goal than to get the most out of modern hardware, so the algorithms here are simpler.</p> <p>For <strong>Sort</strong> and <strong>Grade</strong>, use Timsort. It's time-tested and shows no signs of weakness (but do be sure to pick up a fix for the bug discovered in 2015 in formal verification). Hardly different from optimal comparison numbers on random data, and outstanding pattern handling. Grade can be done either by selecting from the original array to order indices or by moving the data around in the same order as the indices. I think the second of these ends up being substantially better for small-ish elements.</p> <p>For <strong>Bins</strong>, use a branching binary search: see <a href="#on-binary-search">On binary search</a> above. But there are also interesting (although, I expect, rare) cases where only one argument is compound. Elements of this argument should be reduced to fit the type of the other argument, then compared to multiple elements. For the right argument, this just means reducing before doing whatever binary search is appropriate to the left argument. If the left argument is compound, its elements should be used as partitions. 
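<p>For reference, here is a C paraphrase of that pattern (mine, not copied from the linked post): the loop runs a fixed number of iterations for a given <code>n</code>, the comparison feeds a conditional move rather than a branch, and <code>*base</code> itself is never read.</p>
<pre>
#include <stddef.h>

// Branchless binary search over the n-1 sorted values at base+1 ... base+n-1.
// Returns how many of them are less than x.
static size_t bin_search(const int *base, size_t n, int x) {
    const int *b = base;
    while (n > 1) {
        size_t half = n / 2;
        b += (b[half] < x) ? half : 0;  // compiles to a conditional move
        n -= half;                      // n - n/2 rounds up, keeping all candidates
    }
    return (size_t)(b - base);
}
</pre>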
-<h2 id="compound-data">Compound data</h2>
+<h2 id="compound-data"><a class="header" href="#compound-data">Compound data</a></h2>
<p>Array comparisons are expensive. The goal here is almost entirely to minimize the number of comparisons. That's a much less complex goal than getting the most out of modern hardware, so the algorithms here are simpler.</p>
<p>For <strong>Sort</strong> and <strong>Grade</strong>, use Timsort. It's time-tested and shows no signs of weakness (but do be sure to pick up a fix for the bug found by formal verification in 2015). It's hardly different from optimal comparison counts on random data, and has outstanding pattern handling. Grade can be done either by selecting from the original array to order indices or by moving the data around in the same order as the indices. I think the second of these ends up being substantially better for small-ish elements.</p>
<p>For <strong>Bins</strong>, use a branching binary search: see <a href="#on-binary-search">On binary search</a> above. But there are also interesting (although, I expect, rare) cases where only one argument is compound. Elements of this argument should be reduced to fit the type of the other argument, then compared to multiple elements. For the right argument, this just means reducing before doing whatever binary search is appropriate to the left argument. If the left argument is compound, its elements should be used as partitions. Then switch back to binary search only when the partitions get very small—probably one element.</p>
-<h2 id="simple-data">Simple data</h2>
+<h2 id="simple-data"><a class="header" href="#simple-data">Simple data</a></h2>
<p>The name of the game here is "branchless".</p>
<p>For sorting, the fastest algorithms for generic data and generic hardware are branchless <a href="#quicksort">quicksorts</a>. Fluxsort is new and very exciting because it's a <em>stable</em> algorithm that's substantially faster than runner-up pdqsort on random arrays. However, it's immature and is missing a lot of the specialized strategies pdqsort has. I'm working on adapting these improvements to work for stable sorting and also on hybridizing with counting/bucket sort.</p>
<p>A branchless binary search is adequate for Bins but in many cases—very small or large <code><span class='Value'>𝕨</span></code>, and small range—there are better methods.</p>
-<h3 id="counting-and-bucket-sort">Counting and bucket sort</h3>
+<h3 id="counting-and-bucket-sort"><a class="header" href="#counting-and-bucket-sort">Counting and bucket sort</a></h3>
<p>Both counting and bucket sort are small-range algorithms that begin by counting the number of each possible value. Bucket sort, as used here, means that the counts are then used to place values in the appropriate position in the result in another pass. Counting sort does not read from the initial values again and instead reconstructs them from the counts. It might be written <code><span class='Paren'>(</span><span class='Function'>/≠</span><span class='Modifier'>¨</span><span class='Modifier2'>∘</span><span class='Function'>⊔</span><span class='Paren'>)</span><span class='Modifier2'>⌾</span><span class='Paren'>(</span><span class='Function'>-</span><span class='Modifier2'>⟜</span><span class='Value'>min</span><span class='Paren'>)</span></code> in BQN, with <code><span class='Function'>≠</span><span class='Modifier'>¨</span><span class='Modifier2'>∘</span><span class='Function'>⊔</span></code> as a single efficient operation.</p>
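<p>A rough C model of that counting sort, with the <code>-⟜min</code> offset folded in (scalar code; the point is only the structure, not a tuned implementation):</p>
<pre>
#include <stdlib.h>

// Sort n integers known to lie in [min, min+range): tally each value,
// then rewrite the array from the tallies alone. The input is never
// read again after the counting pass.
static void counting_sort(int *x, size_t n, int min, size_t range) {
    size_t *count = calloc(range, sizeof *count);
    for (size_t i = 0; i < n; i++) count[x[i] - min]++;
    size_t k = 0;
    for (size_t v = 0; v < range; v++)          // value v appears count[v] times
        for (size_t c = count[v]; c--; ) x[k++] = min + (int)v;
    free(count);
}
</pre>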
<p>Bucket sort can be used for Grade or sort-by (<code><span class='Function'>⍋</span><span class='Modifier2'>⊸</span><span class='Function'>⊏</span></code>), but counting sort only works for sorting itself. It's not-even-unstable: there's no connection between result values and the input values except that they are constructed to be equal. But with <a href="replicate.html#non-booleans-to-indices">fast Indices</a>, counting sort is vastly more powerful, and is effective with a range four to eight times the argument length. This is large enough that it might pose a memory usage problem, but the memory use can be made arbitrarily low by partitioning.</p>
-<h3 id="quicksort">Quicksort</h3>
+<h3 id="quicksort"><a class="header" href="#quicksort">Quicksort</a></h3>
<p><a href="https://github.com/scandum/fluxsort">Fluxsort</a> attains high performance with a branchless stable partition that places one half on top of existing data and the other half somewhere else. One half ends up in the appropriate place in the sorted array. The other is in swap memory, and will be shifted back by subsequent partitions and base-case sorting. Aside from the partitioning strategy, Fluxsort makes a number of other decisions differently from pdqsort, including a fairly complicated merge sort (<a href="https://github.com/scandum/quadsort">Quadsort</a>) as the base case. I haven't fully evaluated these.</p>
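<p>The stable branchless partition itself is simple enough to sketch in C (the idea as I understand it, not Fluxsort's actual code): every element is written to both destinations, and incrementing only one of the two cursors decides which write survives.</p>
<pre>
#include <stddef.h>
#include <string.h>

// Stable partition by pivot: elements < pivot pack to the front of x,
// the rest go to swap and are copied back behind them. No branch in
// the loop body; returns the index of the first element >= pivot.
static size_t partition_stable(int *x, int *swap, size_t n, int pivot) {
    size_t lo = 0, hi = 0;
    for (size_t i = 0; i < n; i++) {
        int v = x[i];
        int ge = v >= pivot;
        x[lo] = v; swap[hi] = v;   // both writes happen; one is kept
        lo += 1 - ge; hi += ge;    // lo <= i always, so x[lo] is safe to overwrite
    }
    memcpy(x + lo, swap, hi * sizeof *x);
    return lo;
}
</pre>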
<p><a href="https://arxiv.org/abs/2106.05123">This paper</a> gives a good description of <a href="https://github.com/orlp/pdqsort">pdqsort</a>. I'd start with the <a href="https://github.com/rust-lang/rust/blob/master/library/core/src/slice/sort.rs">Rust version</a>, which has some advantages but can still be improved further. The subsections below describe improved <a href="#partitioning">partitioning</a> and an <a href="#initial-pass">initial pass</a> with several benefits. I also found that the pivot randomization methods currently used are less effective because they swap elements that won't become pivots soon; the pivot candidates and randomization targets need to be chosen to overlap. The optimistic insertion sort can also be improved: when a pair of elements is swapped the smaller one should be inserted as usual, but the larger one can also be pushed forward at little cost, potentially saving many swaps and handling too-large elements as gracefully as too-small ones.</p>
<p>While the stable partitioning for Fluxsort seems to be an overall better choice, pdqsort's unstable partitioning is what I've worked with in the past. The following sections are written from the perspective of pdqsort and will be rewritten for Fluxsort as the methods are adapted.</p>
-<h4 id="partitioning">Partitioning</h4>
+<h4 id="partitioning"><a class="header" href="#partitioning">Partitioning</a></h4>
<p>In-place quicksort relies on a partitioning algorithm that exchanges elements in order to split them into two contiguous groups. The <a href="https://en.wikipedia.org/wiki/Quicksort#Hoare_partition_scheme">Hoare partition scheme</a> does this, and <a href="https://github.com/weissan/BlockQuicksort">BlockQuicksort</a> showed that it can be performed quickly with branchless index generation; this method was then adopted by pdqsort. But the <a href="replicate.html#booleans-to-indices">bit booleans to indices</a> method is faster and fits well with vectorized comparisons.</p>
<p>It's simplest to define an operation <code><span class='Function'>P</span></code> that partitions a list <code><span class='Value'>𝕩</span></code> according to a boolean list <code><span class='Value'>𝕨</span></code>. Partitioning permutes <code><span class='Value'>𝕩</span></code> so that all elements corresponding to 0 in <code><span class='Value'>𝕨</span></code> come before those corresponding to 1. The quicksort partition step, with pivot <code><span class='Value'>t</span></code>, is <code><span class='Paren'>(</span><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span><span class='Paren'>)</span><span class='Function'>P</span><span class='Value'>𝕩</span></code>, and the comparison can be vectorized. Interleaving comparison and partitioning in chunks would save memory (a fraction of the size of <code><span class='Value'>𝕩</span></code>, which should have 32- or 64-bit elements because plain counting sort is best for smaller ones) but hardly speeds things up: only a few percent, and only for huge lists with hundreds of millions of elements. The single-step <code><span class='Function'>P</span></code> is also good for Bins, where the boolean <code><span class='Value'>𝕨</span></code> will have to be saved.</p>
<p>For binary search <code><span class='Value'>𝕨</span><span class='Function'>⍋</span><span class='Value'>𝕩</span></code>, partitioning allows one pivot element <code><span class='Value'>t</span></code> from <code><span class='Value'>𝕨</span></code> to be compared to all of <code><span class='Value'>𝕩</span></code> at once, instead of the normal strategy of working with one element from <code><span class='Value'>𝕩</span></code> at a time. <code><span class='Value'>𝕩</span></code> is partitioned according to <code><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span></code>, then result values are found by searching the first half of <code><span class='Value'>𝕨</span></code> for the smaller elements and the second half for the larger ones, and then they are put back in the correct positions by reversing the partitioning. Because Hoare partitioning works by swapping independent pairs of elements, <code><span class='Function'>P</span></code> is a self-inverse, identical to <code><span class='Function'>P</span><span class='Modifier'>⁼</span></code>. So the last step is simple, provided the partitioning information <code><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span></code> is saved.</p>
-<h4 id="initial-pass">Initial pass</h4>
+<h4 id="initial-pass"><a class="header" href="#initial-pass">Initial pass</a></h4>
<p>An initial pass for pdqsort (or another in-place quicksort) provides a few advantages:</p>
<ul>
<li>Recognize sorted and reverse-sorted arrays as fast as possible</li>
@@ -44,15 +44,15 @@
<p>Finding an initial run is fast as well. Compare the first two elements to determine direction, then search for the first pair that have the opposite direction (this can be vectorized because overreading is fine). This run can be used as the first range block, because the maximum and minimum are the two elements at the ends of the run.</p>
<p>At the start of sorting, swap the smallest element to the beginning and the largest to the end, and shrink the size of the array by one in each direction. Now the element before the array is a lower bound and the one after is an upper bound. This property can also be maintained as the array is partitioned, by placing a pivot element between the two halves (swap it to one side of the array before partitioning and to the middle afterwards). As a result, it's always safe to use unguarded insertion sort, and an upper bound for the range of the array can always be found using the difference between the elements before and after it. Now finding the range is fast enough to check for counting sort at every recursion.</p>
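<p>As a small illustration of the unguarded case (my code, assuming the sentinel just described sits immediately before the array):</p>
<pre>
#include <stddef.h>

// Unguarded insertion sort: the element at a[-1] is known to be <= every
// element, so the inner loop can walk left on the comparison alone,
// with no index check.
static void insertion_sort_unguarded(int *a, size_t n) {
    for (size_t i = 1; i < n; i++) {
        int v = a[i];
        size_t j = i;
        while (a[j-1] > v) { a[j] = a[j-1]; j--; }  // stops at the sentinel
        a[j] = v;
    }
}
</pre>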
<p>This is a very simple initial pass; a more sophisticated one might be beneficial. If the array starts with a large run then there could be more of them. There may also be sampling-based tests to find when merge sort is better, even if the runs aren't perfect (but is this actually common?).</p>
-<h3 id="other-sorting-algorithms">Other sorting algorithms</h3>
+<h3 id="other-sorting-algorithms"><a class="header" href="#other-sorting-algorithms">Other sorting algorithms</a></h3>
<p><a href="https://github.com/ips4o/ips4o">IPS⁴o</a> is a horrifyingly complicated samplesort thing. Unstable, but there's also a stable not-in-place version PS⁴o. For very large arrays it probably has the best memory access patterns, so a few samplesort passes could be useful.</p>
<p><a href="https://github.com/Morwenn/vergesort">Vergesort</a> has another useful first-pass strategy, which spends an asymptotically small amount of time searching for runs before sorting. Since it only detects perfect runs it won't give the full adaptivity of a good merge sort.</p>
<p>Sorting networks compare and swap elements in a fixed pattern, and so can be implemented with branchless or even vectorized code. They're great for sorting many small arrays of the same size, but the size limit before insertion sort beats them will be pretty small without hardware specialization.</p>
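<p>A tiny C illustration: the optimal 4-element network built from a branchless compare-exchange (chosen for clarity; real SIMD networks sort whole registers at once).</p>
<pre>
// Compare-exchange: after the call, *a <= *b. The conditionals compile
// to min/max or conditional moves, so the instruction sequence is the
// same for every input.
static void cswap(int *a, int *b) {
    int x = *a, y = *b;
    *a = x < y ? x : y;
    *b = x < y ? y : x;
}

// Fixed 5-exchange network sorting 4 elements.
static void sort4(int v[4]) {
    cswap(&v[0], &v[1]); cswap(&v[2], &v[3]);
    cswap(&v[0], &v[2]); cswap(&v[1], &v[3]);
    cswap(&v[1], &v[2]);
}
</pre>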
-<h4 id="simd-sorting">SIMD sorting</h4>
+<h4 id="simd-sorting"><a class="header" href="#simd-sorting">SIMD sorting</a></h4>
<p>A few people have done some work on merge sorting with AVX2 or AVX-512: <a href="https://github.com/sid1607/avx2-merge-sort">two</a> <a href="https://github.com/PatwinchIR/ultra-sort">examples</a>. Pretty complicated, and still mostly in the proof of concept stage, but the benchmarks on uniform random arrays are good. Can these be made adaptive?</p>
<p><a href="https://github.com/nlw0/ChipSort.jl">ChipSort</a> seems further along than those. It uses sorting networks, comb sort, and merging, which all fit nicely with SIMD and should work well together.</p>
<p>Or AVX can <a href="https://github.com/WojciechMula/simd-sort">speed up</a> quicksort. I suspect this is more of a marginal improvement (over BlockQuicksort/pdqsort discussed above) relative to merge sort. If partitioning is fast enough it might make stable quicksort viable.</p>
-<h3 id="binary-search">Binary search</h3>
+<h3 id="binary-search"><a class="header" href="#binary-search">Binary search</a></h3>
<p>Reminder that we're talking about simple, not <a href="#compound-data">compound</a> data. The most important thing is just to have a good branchless binary search (see <a href="#on-binary-search">above</a>), but there are other possible optimizations.</p>
<p>If <code><span class='Value'>𝕨</span></code> is extremely small, use a vector binary search as described in "Sub-nanosecond Searches" (<a href="https://dyalog.tv/Dyalog18/?v=paxIkKBzqBU">video</a>, <a href="https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D08_Searches_Using_Vector_Instructions.zip">slides</a>). For 1-byte elements there's also a vectorized method that works whenever <code><span class='Value'>𝕨</span></code> has no duplicates: create two lookup tables that go from multiples of 8 (5-bit values, after shifting) to bytes. One is a bitmask of <code><span class='Value'>𝕨</span></code>, so that a lookup gives 8 bits indicating which possible choices of the remaining 3 bits are in <code><span class='Value'>𝕨</span></code>. The other gives the number of values in <code><span class='Value'>𝕨</span></code> less than the multiple of 8. To find the result of Bins, look up these two bytes. Mask off the bitmask to include only bits for values less than the target, and sum it (each of these steps can be done with another lookup, or other methods depending on instruction set). The result is the sum of these two counts.</p>
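<p>A scalar C model of that table scheme, as I read the description (the fast version replaces each step with a byte shuffle or another lookup):</p>
<pre>
#include <stdint.h>

// w is sorted with no duplicate bytes. bitmask[h] has bit r set iff the
// value 8*h+r appears in w; lesscount[h] counts elements of w below 8*h.
// Result: the number of elements of w less than x.
static unsigned bins_u8(const uint8_t bitmask[32], const uint8_t lesscount[32],
                        uint8_t x) {
    uint8_t h = x >> 3, r = x & 7;
    uint8_t below = bitmask[h] & (uint8_t)((1u << r) - 1);    // values 8*h..x-1 in w
    return lesscount[h] + (unsigned)__builtin_popcount(below); // gcc/clang builtin
}
</pre>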
<p>It's cheap and sometimes worthwhile to trim <code><span class='Value'>𝕨</span></code> down to the range of <code><span class='Value'>𝕩</span></code>. After finding the range of <code><span class='Value'>𝕩</span></code>, binary cut <code><span class='Value'>𝕨</span></code> to a smaller list that contains the range. Stop when the middle element fits inside the range, and search each half of <code><span class='Value'>𝕨</span></code> for the appropriate endpoint of the range.</p>
diff --git a/docs/implementation/vm.html b/docs/implementation/vm.html
index 1ecac46c..b7dd287f 100644
--- a/docs/implementation/vm.html
+++ b/docs/implementation/vm.html
@@ -4,14 +4,14 @@
<title>The BQN virtual machine and runtime</title>
</head>
<div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../index.html">BQN</a> / <a href="index.html">implementation</a></div>
-<h1 id="the-bqn-virtual-machine-and-runtime">The BQN virtual machine and runtime</h1>
+<h1 id="the-bqn-virtual-machine-and-runtime"><a class="header" href="#the-bqn-virtual-machine-and-runtime">The BQN virtual machine and runtime</a></h1>
<p>BQN's self-hosted compiler and runtime mean that only a small amount of native code is needed to run BQN on any given platform. The <a href="https://github.com/mlochbaum/BQN/blob/master/implementation/../docs/bqn.js">Javascript environment</a> requires about 500 lines of Javascript code including system functions and performance improvements; probably around 250 would be required just to run the core language. This makes it fairly easy to port BQN to new platforms, allowing BQN to be <a href="../doc/embed.html">embedded</a> within other programming languages and interact with arrays or functions in those languages.</p>
<p>The way data is represented is part of the VM implementation: it can use native arrays or a custom data structure, depending on what the language supports. An initial implementation will be very slow, but can be improved by replacing functions from the BQN-based runtime with native code. As the VM system can be hard to work with if you're not familiar with it, I advise you to contact me to discuss implementing a VM if you are interested.</p>
-<h2 id="bytecode">Bytecode</h2>
+<h2 id="bytecode"><a class="header" href="#bytecode">Bytecode</a></h2>
<p>The BQN implementation here and dzaima/BQN share a stack-based object code format used to represent compiled code. This format is a list of numbers of unspecified precision (small precision will limit the length of list literals and the number of locals per block, blocks, and constants). Previously it was encoded as bytes with the <a href="https://en.wikipedia.org/wiki/LEB128">LEB128</a> format; while it no longer has anything to do with bytes, it's called a "bytecode" because this is shorter than "object code".</p>
<p>The self-hosted compiler uses a simpler, and less capable, format for block and variable data than dzaima/BQN. Only this format is described here; <a href="https://github.com/mlochbaum/BQN/blob/master/implementation/../dc.bqn">dc.bqn</a> adapts it to be compatible with dzaima/BQN.</p>
<p>dzaima/BQN can interpret bytecode or convert it to <a href="https://en.wikipedia.org/wiki/Java_virtual_machine">JVM</a> bytecode, while the Javascript VM previously interpreted bytecode but now always compiles it. Since interpretation is a simpler strategy, it may be helpful to use the <a href="https://github.com/mlochbaum/BQN/blob/f74d9223ef880f2914030c2375f680dcc7e8c92b/bqn.js#L36">old Javascript bytecode interpreter</a> as a reference (for bytecode execution only) when implementing a BQN virtual machine.</p>
-<h3 id="components">Components</h3>
+<h3 id="components"><a class="header" href="#components">Components</a></h3>
<p>The complete bytecode for a program consists of the following:</p>
<ul>
<li>A bytecode sequence <code><span class='Value'>code</span></code></li>
@@ -21,7 +21,7 @@
<li>Optionally, source locations for each instruction</li>
<li>Optionally, tokenization information</li>
</ul>
-<h4 id="blocks">Blocks</h4>
+<h4 id="blocks"><a class="header" href="#blocks">Blocks</a></h4>
<p>Each entry in <code><span class='Value'>blocks</span></code> is a list of the following properties:</p>
<ul>
<li>Block type: (0) function/immediate, (1) 1-modifier, (2) 2-modifier</li>
@@ -31,7 +31,7 @@
<p>Compilation separates blocks so that they are not nested in bytecode. A block consists of bodies, so that all compiled code is contained in some body of a block. The self-hosted compiler compiles the entire program into an immediate block, and the program is run by evaluating this block. Bodies are terminated with a RETN or RETD instruction.</p>
<p>When the block is evaluated depends on its type and immediateness. An immediate block (0,1) is evaluated as soon as it is pushed; a function (0,0) is evaluated when called on arguments, an immediate modifier (1 or 2, 1) is evaluated when called on operands, and a deferred modifier (1 or 2, 0) creates a derived function when called on operands and is evaluated when this derived function is called on arguments.</p>
<p>The last property can be a single number, or, if it's a deferred block, might be a pair of lists. For a single number the block is always evaluated by evaluating the body with the given index. For a pair, the first element gives the monadic case and the second the dyadic one. A given valence should begin at the first body in the appropriate list, moving to the next one if a header test (SETH instruction) fails.</p>
-<h4 id="bodies">Bodies</h4>
+<h4 id="bodies"><a class="header" href="#bodies">Bodies</a></h4>
<p>Bodies in a block are separated by <code><span class='Value'>;</span></code>. Each entry in <code><span class='Value'>bodies</span></code> is a list containing:</p>
<ul>
<li>Starting index in <code><span class='Value'>code</span></code></li>
@@ -41,7 +41,7 @@
</ul>
<p>The starting index refers to the position in bytecode where execution starts in order to evaluate the block. Different bodies will always have the same set of special names, but the variables they define are unrelated, so of course they can have different counts. The given number of variables includes special names, but the list of names and export mask don't.</p>
<p>The program's symbol list is included in the tokenization information <code><span class='Value'>t</span></code>: it is <code><span class='Number'>0</span><span class='Function'>⊑</span><span class='Number'>2</span><span class='Function'>⊑</span><span class='Value'>t</span></code>. Since the entire program (the source code passed in one compiler call) uses this list, namespace field accesses can be performed with indices alone within a program. The symbol list is needed for cross-program access, for example if <code><span class='Function'>•BQN</span></code> returns a namespace.</p>
-<h3 id="instructions">Instructions</h3>
+<h3 id="instructions"><a class="header" href="#instructions">Instructions</a></h3>
<p>The following instructions are defined by dzaima/BQN. The ones emitted by the self-hosted BQN compiler are marked in the "used" column. Instructions marked <code><span class='Function'>NS</span></code> are used only in programs with namespaces, and so are not needed to support the compiler or self-hosted runtime.</p>
<table>
<thead>
@@ -447,23 +447,23 @@
</tbody>
</table>
<p>Many instructions just call functions or modifiers or otherwise have fairly obvious implementations. Instructions to handle variables and blocks are more complicated (although very typical of bytecode representations for lexically-scoped languages) and are described in more detail below.</p>
-<h3 id="local-variables-dfnd-loco-locu-locm-retn">Local variables: DFND LOCO LOCU LOCM RETN</h3>
+<h3 id="local-variables-dfnd-loco-locu-locm-retn"><a class="header" href="#local-variables-dfnd-loco-locu-locm-retn">Local variables: DFND LOCO LOCU LOCM RETN</a></h3>
<p>The bytecode representation is designed with the assumption that variables will be stored in frames, one for each time an instance of a block is run. dzaima/BQN has facilities to give frame slots names, in order to support dynamic execution, but self-hosted BQN doesn't. A new frame is created when the block is evaluated (see <a href="#blocks">#blocks</a>) and in general has to be cleaned up by garbage collection, because a lexical closure might need to refer to the frame even after the corresponding block finishes. Lexical closures can form loops, so simple reference counting can leak memory, but it could be used in addition to less frequent tracing garbage collection or another strategy.</p>
<p>A frame is a mutable list of <em>slots</em> for variable values. It has slots for any special names that are available during the block's execution, followed by the local variables it defines. Special names use the ordering <code><span class='Value'>𝕤𝕩𝕨𝕣𝕗𝕘</span></code>; the first three of these are available in non-immediate blocks while <code><span class='Value'>𝕣</span></code> and <code><span class='Value'>𝕗</span></code> are available in modifiers and <code><span class='Value'>𝕘</span></code> in 2-modifiers specifically.</p>
<p>When a block is pushed with <strong>DFND</strong>, an instance of the block is created, with its <em>parent frame</em> set to be the frame of the currently-executing block. Setting the parent frame when the block is first seen, instead of when it's evaluated, is what distinguishes lexical from dynamic scoping. If it's an immediate block, it's evaluated immediately, and otherwise it's pushed onto the stack. When the block is evaluated, its frame is initialized using any arguments passed to it, the next instruction's index is pushed onto the return stack, and execution moves to the first instruction in the block. When the RETN instruction is encountered, an index is popped from the return stack and execution returns to this location. As an alternative to maintaining an explicit return stack, a block can be implemented as a native function that creates a new execution stack and returns the value in it when the <strong>RETN</strong> instruction is reached. This approach uses the implementation language's call stack for the return stack.</p>
<p>Local variables are manipulated with the <strong>LOCO</strong> (or <strong>LOCU</strong>) and <strong>LOCM</strong> instructions, which load the value of a variable and a reference to it (see the next section) respectively. These instructions reference variables by <em>frame depth</em> and <em>slot index</em>. The frame depth indicates in which frame the variable is found: the current frame has depth 0, its block's parent frame has depth 1, and so on. The slot index is an index within that frame.</p>
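<p>A skeletal C layout of this scheme, to make it concrete (names and the <code>Value</code> type are placeholders, not taken from any real VM):</p>
<pre>
typedef struct Value Value;   // stand-in for the VM's value type

// One frame per block evaluation. parent is the lexical environment,
// captured when DFND ran, not the caller's frame.
typedef struct Frame {
    struct Frame *parent;
    Value **slots;            // special names first, then locals; NULL = undefined
} Frame;

// LOCO/LOCM resolution: walk depth parent links, then index the slots.
// LOCO loads *slot(...); LOCM keeps the pointer as a variable reference.
static Value **slot(Frame *f, unsigned depth, unsigned index) {
    while (depth--) f = f->parent;
    return &f->slots[index];
}
</pre>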
<p>Slots should be initialized with some indication they are not yet defined. The variable can be defined with SETN only if it hasn't been defined yet, and can be accessed with LOCO or LOCU or modified with SETU or SETM only if it <em>has</em> been defined.</p>
-<h3 id="variable-references-arrm-locm-setn-setu-setm">Variable references: ARRM LOCM SETN SETU SETM</h3>
+<h3 id="variable-references-arrm-locm-setn-setu-setm"><a class="header" href="#variable-references-arrm-locm-setn-setu-setm">Variable references: ARRM LOCM SETN SETU SETM</a></h3>
<p>A <em>variable reference</em> indicates a particular frame slot in a way that's independent of the execution context. For example, it could be a pointer to the slot, or a reference to the frame along with the index of the slot. <strong>LOCM</strong> pushes a variable reference to the stack.</p>
<p>A <em>reference list</em> is a list of variable references or reference lists. It's created with the <strong>ARRM</strong> instruction. In the Javascript VM there's no difference between a reference list and an ordinary BQN list other than the contents.</p>
<p>The <strong>SETN</strong>, <strong>SETU</strong>, and <strong>SETM</strong> instructions set a value for a reference. If the reference is to a variable, they simply set its value. For a reference list, the value needs to be destructured. It must be a list of the same length, and each reference in the reference list is set to the corresponding element of the value list.</p>
<p><strong>SETM</strong> additionally needs to get the current value of a reference. For a variable reference this is its current value (with an error if it's not defined yet); for a reference list it's a list of the values of each reference in the list.</p>
-<h3 id="namespaces-fldo-fldm-nspm-retd">Namespaces: FLDO FLDM NSPM RETD</h3>
+<h3 id="namespaces-fldo-fldm-nspm-retd"><a class="header" href="#namespaces-fldo-fldm-nspm-retd">Namespaces: FLDO FLDM NSPM RETD</a></h3>
<p>A <em>namespace</em> is the collection of variables in a particular scope, along with an association mapping some exported <em>symbols</em> (case- and underscore-normalized strings) to a subset of these. It can be represented using a frame for the variables, plus the variable name list and mask of exported variables from that block's properties in the bytecode. <strong>RETD</strong> finishes executing the block and returns the namespace for that execution.</p>
<p>The variable name list is split into a program-level list of names and a list of indices into these names: within-program field accesses can be done with the indices only while cross-program ones must use names. One way to check whether an access is cross-program is to compare the accessor's program-level name list with the one for the accessed namespace as references or pointers. Then a lookup should be performed as appropriate. A persistent lookup table is needed to make this efficient.</p>
<p><strong>FLDO</strong> reads a field from a namespace. The parameter <code><span class='Function'>I</span></code> is a program-level identifier index for this field. The VM must ensure that the field is exported, possibly by leaving unexported identifiers out of the namespace's lookup table. <strong>FLDM</strong> does the same but pushes a reference to the field, to be modified by assignment.</p>
<p><strong>NSPM</strong> is used for aliased assignment such as <code><span class='Bracket'>⟨</span><span class='Value'>a</span><span class='Ligature'>‿</span><span class='Value'>b</span><span class='Gets'>⇐</span><span class='Value'>c</span><span class='Bracket'>⟩</span><span class='Gets'>←</span><span class='Value'>ns</span></code>. It tags a reference with a namespace field, identified with a program-level index. A value assigned to the tagged reference must be a namespace. The relevant field is extracted, and then stored in the original reference.</p>
-<h2 id="runtime">Runtime</h2>
+<h2 id="runtime"><a class="header" href="#runtime">Runtime</a></h2>
<p>Primitive functions and modifiers used in a program are stored in its <code><span class='Value'>consts</span></code> array. The compiler needs to be passed a <em>runtime</em> with the value of every primitive so that these functions and modifiers are available.</p>
<p>While it's perfectly possible to implement the runtime from scratch, the pseudo-BQN files <a href="https://github.com/mlochbaum/BQN/blob/master/implementation/../src/r0.bqn">r0.bqn</a> and <a href="https://github.com/mlochbaum/BQN/blob/master/implementation/../src/r1.bqn">r1.bqn</a> implement the full runtime in terms of a <em>core runtime</em> consisting of a smaller number of much simpler functions. <a href="https://github.com/mlochbaum/BQN/blob/master/implementation/../src/pr.bqn">pr.bqn</a> converts these files so that they can be compiled. It changes values in the core runtime to primitives and primitives to generated identifiers, so that the first 22 values in the output's <code><span class='Value'>consts</span></code> array are exactly the core runtime, and no other primitives are required. The result is a list of two elements: first the list of all primitive values, and then a function that can be called to pass in two additional core functions used for inferred properties.</p>
<p>The contents of a core runtime are given below. The names given are those used in r1.bqn; the environment only provides a list of values and therefore doesn't need to use the same names. For named functions a description is given. For primitives, the implementation should match the BQN specification for that primitive on the specified domain, or the entire domain if left empty. They won't be called outside that domain except if there are bugs in the BQN sources.</p>
@@ -600,7 +600,7 @@
</table>
<p>To define the final two functions, call the second returned element as a function, with argument <code><span class='Function'>Decompose</span><span class='Ligature'>‿</span><span class='Function'>PrimInd</span></code>. The function <code><span class='Function'>PrimInd</span></code> gives the index of <code><span class='Value'>𝕩</span></code> in the list of all primitives (defined as <code><span class='Value'>glyphs</span></code> in pr.bqn), or the length of the runtime if <code><span class='Value'>𝕩</span></code> is not a primitive. The two functions are only needed for computing inferred properties, and are defined by default so that every value is assumed to be a primitive, and <code><span class='Function'>PrimInd</span></code> performs a linear search over the returned runtime. If the runtime is used directly, then this means that without setting <code><span class='Function'>Decompose</span><span class='Ligature'>‿</span><span class='Function'>PrimInd</span></code>, function inferred properties will work slowly and for primitives only; if values from the runtime are wrapped then function inferred properties will not work at all.</p>
<p>Remember that <code><span class='Function'>+</span></code> and <code><span class='Function'>-</span></code> can also work on characters in some circumstances, under the rules of affine characters. Other arithmetic functions should only accept numbers. <code><span class='Function'>=</span></code> must work on any atoms including functions and modifiers. <code><span class='Function'>≤</span></code> must work on numbers and characters.</p>
-<h3 id="grouplen-and-groupord">GroupLen and GroupOrd</h3>
+<h3 id="grouplen-and-groupord"><a class="header" href="#grouplen-and-groupord">GroupLen and GroupOrd</a></h3>
<p>GroupLen and GroupOrd, short for Group length and Group order, are used to implement <a href="../doc/group.html">Group</a> (<code><span class='Function'>⊔</span></code>) and also to grade small-range lists and invert permutations (the inverse of a permutation <code><span class='Value'>p</span></code> is <code><span class='Number'>1</span><span class='Modifier'>¨</span><span class='Modifier2'>⊸</span><span class='Function'>GroupOrd</span> <span class='Value'>p</span></code>). Each of these only needs to work on lists of integers. Shown below are efficient implementations using BQN extended with the notation <code><span class='Paren'>(</span><span class='Value'>i</span><span class='Function'>⊑</span><span class='Value'>l</span><span class='Paren'>)</span> <span class='Function'>Fn</span><span class='Gets'>↩</span> <span class='Value'>x</span></code> meaning <code><span class='Value'>l</span> <span class='Gets'>↩</span> <span class='Function'>Fn</span><span class='Modifier2'>⟜</span><span class='Value'>x</span><span class='Modifier2'>⌾</span><span class='Paren'>(</span><span class='Value'>i</span><span class='Modifier2'>⊸</span><span class='Function'>⊑</span><span class='Paren'>)</span> <span class='Value'>l</span></code>, where <code><span class='Function'>Fn</span></code> is <code><span class='Function'>⊢</span></code> if omitted. Since these special assignments only change one element of <code><span class='Value'>l</span></code>, each can be a fast constant-time operation.</p>
<pre><span class='Function'>GroupLen</span> <span class='Gets'>←</span> <span class='Brace'>{</span>
  <span class='Value'>l</span> <span class='Gets'>←</span> <span class='Number'>¯1</span> <span class='Function'>⌈</span><span class='Modifier'>´</span> <span class='Value'>𝕩</span>
@@ -617,10 +617,10 @@
  <span class='Value'>r</span>
<span class='Brace'>}</span>
</pre>
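<p>For comparison, a direct C model of the same two operations (my illustration; it takes the group count <code>m</code> as an argument and uses ¯1 for dropped elements, as <code>⊔</code> does; inverting a permutation is <code>group_ord</code> with all lengths 1):</p>
<pre>
#include <stdlib.h>

// group_len: count occurrences of each value 0..m-1 in x (¯1 is skipped).
static void group_len(const int *x, size_t n, size_t *len, int m) {
    for (int v = 0; v < m; v++) len[v] = 0;
    for (size_t i = 0; i < n; i++) if (x[i] >= 0) len[x[i]]++;
}

// group_ord: indices of x's elements in group order. A running total of
// len gives each group's start; elements then scatter in a single pass.
static void group_ord(const int *x, size_t n, const size_t *len, int m, size_t *ord) {
    size_t *pos = malloc((size_t)m * sizeof *pos), s = 0;
    for (int v = 0; v < m; v++) { pos[v] = s; s += len[v]; }
    for (size_t i = 0; i < n; i++) if (x[i] >= 0) ord[pos[x[i]]++] = i;
    free(pos);
}
</pre>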
-<h2 id="assembly">Assembly</h2>
+<h2 id="assembly"><a class="header" href="#assembly">Assembly</a></h2>
<p>The full BQN implementation is made up of the two components above—virtual machine and core runtime—and the compiled runtime, compiler, and formatter. Since the compiler is unlikely to work right away, I suggest initially testing the virtual machine on smaller pieces of code compiled by an existing, working BQN implementation.</p>
<p>BQN sources are compiled with <a href="https://github.com/mlochbaum/BQN/blob/master/implementation/../src/cjs.bqn">cjs.bqn</a>, which runs under <a href="https://github.com/dzaima/BQN/">dzaima/BQN</a> as a Unix-style script. It has two modes. If given a command-line argument <code><span class='Value'>r</span></code>, <code><span class='Value'>c</span></code>, or <code><span class='Value'>fmt</span></code>, it compiles one of the source files. With any other command-line arguments, it will compile each one, and format it as a single line of output. The output is in a format designed for Javascript, but it can be adjusted to work in other languages either by text replacement on the output or changes to the formatting functions in cjs.bqn.</p>
-<h3 id="structure">Structure</h3>
+<h3 id="structure"><a class="header" href="#structure">Structure</a></h3>
<p>The following steps give a working BQN system, assuming a working VM and core runtime:</p>
<ul>
<li>Evaluate the bytecode <code><span class='Value'>$</span> <span class='Value'>src</span><span class='Function'>/</span><span class='Value'>cjs.bqn</span> <span class='Value'>r</span></code>, passing the core runtime <code><span class='Value'>provide</span></code> in the constants array. The result is a BQN list of a full runtime, and a function <code><span class='Function'>SetPrims</span></code>.</li>
@@ -630,7 +630,7 @@
</ul>
<p>The compiler takes the runtime as <code><span class='Value'>𝕨</span></code> and source code as <code><span class='Value'>𝕩</span></code>. To evaluate BQN source code, convert it into a BQN string (rank-1 array of characters), pass this string and runtime to the compiler, and evaluate the result as bytecode. Results can be formatted with the formatter for use in a REPL, or used from the implementation language.</p>
<p>Two formatter arguments <code><span class='Function'>Glyph</span></code> and <code><span class='Function'>FmtNum</span></code> are not part of the runtime. <code><span class='Function'>Glyph</span></code> assumes <code><span class='Value'>𝕩</span></code> is a primitive and returns the character (not string) that represents it, and <code><span class='Function'>FmtNum</span></code> assumes <code><span class='Value'>𝕩</span></code> is a number and returns a string representing it.</p>
-<h3 id="testing">Testing</h3>
+<h3 id="testing"><a class="header" href="#testing">Testing</a></h3>
<p>I recommend roughly the following sequence of tests to get everything working smoothly. It can be very difficult to figure out where in a VM things went wrong, so it's important to work methodically and make sure each component is all right before moving to the next.</p>
<p>Because the compiler works almost entirely with lists of numbers, a correct fill implementation is not needed to run the compiler. Instead, you can define <code><span class='Function'>Fill</span></code> as <code><span class='Number'>0</span><span class='Modifier2'>⊘</span><span class='Function'>⊢</span></code> and <code><span class='Modifier2'>_fillBy_</span></code> as <code><span class='Brace'>{</span><span class='Function'>𝔽</span><span class='Brace'>}</span></code> to always use a fill element of 0.</p>
<ul>
