author    Marshall Lochbaum <mwlochbaum@gmail.com>  2022-01-25 17:57:37 -0500
committer Marshall Lochbaum <mwlochbaum@gmail.com>  2022-01-25 17:58:23 -0500
commit    c8441998a911e5613b594e730a76b7b84e23f48e (patch)
tree      8bf284a0861940c6e2b65893cdc5df121b5864f7 /docs
parent    bc0245f0be38e547ad76bcd719c6dedbb161f0f4 (diff)
Updates to Co-dfns page
Diffstat (limited to 'docs')
-rw-r--r--  docs/implementation/codfns.html  22
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/docs/implementation/codfns.html b/docs/implementation/codfns.html
index e9646f53..65ac1391 100644
--- a/docs/implementation/codfns.html
+++ b/docs/implementation/codfns.html
@@ -6,15 +6,15 @@
<div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../index.html">BQN</a> / <a href="index.html">implementation</a></div>
<h1 id="co-dfns-versus-bqns-implementation"><a class="header" href="#co-dfns-versus-bqns-implementation">Co-dfns versus BQN's implementation</a></h1>
<p>The BQN self-hosted compiler is directly inspired by the <a href="https://github.com/Co-dfns/Co-dfns">Co-dfns</a> project, a compiler for a subset of <a href="../doc/fromDyalog.html">Dyalog APL</a>. I'm very grateful to Aaron for showing that array-oriented compilation is even possible! In addition to the obvious difference of target language, BQN differs from Co-dfns both in goals and methods.</p>
-<p>The shared goals of BQN and Co-dfns are to implement a compiler for an array language with whole-array operations. This provides the theoretical benefit of a short <em>critical path</em>, which in practice means that both compilers can make good use of a GPU or a CPU's vector instructions simply by providing an appropriate runtime (Co-dfns has good runtimes—an ArrayFire program on the GPU and Dyalog APL on the CPU—while CBQN isn't at this level yet). The two implementations also share a preference for working &quot;close to the metal&quot; by passing around arrays of numbers rather than creating abstract types to work with data. Objects are right out. These choices lead to a compact source code implementation, and may have some benefits in terms of how easy it is to write and understand the compiler.</p>
+<p>The shared goals of BQN and Co-dfns are to implement a compiler for an array language with whole-array operations. This provides the theoretical benefit of a short <em>critical path</em>, which in practice means that both compilers can make good use of a GPU or a CPU's vector instructions simply by providing an appropriate runtime (Co-dfns has good runtimes—an ArrayFire program on the GPU and Dyalog APL on the CPU—while CBQN isn't at the same level yet). The two implementations also share a preference for working &quot;close to the metal&quot; by passing around arrays of numbers rather than creating abstract types to work with data. Objects are right out. These choices lead to a compact source code implementation, and may have some benefits in terms of how easy it is to write and understand the compiler.</p>
<h2 id="compilation-strategy"><a class="header" href="#compilation-strategy">Compilation strategy</a></h2>
<p>Co-dfns development has primarily been focused on the core compiler, and not parsing, code generation, or the runtime. The associated Ph.D. thesis and famous 17 lines figure refer to this section, which transforms the abstract syntax tree (AST) of a program to a lower-level form, and resolves lexical scoping by linking variables to their definitions. While all of Co-dfns is written in APL, other sections aren't necessarily designed to be data-parallel and don't have the same performance guarantees. For example, the parser uses a parsing expression grammar (PEG), a sequential algorithm. In contrast, BQN is entirely written in a data-parallel style. It does not maintain the same clean separation between compiler sections: <a href="../spec/token.html">token formation</a> and literal evaluation are separated into their own function, but parsing, AST manipulation, and code generation overlap.</p>
<p>The core Co-dfns compiler is based on manipulating the syntax tree, which is mostly stored as parent and sibling vectors—that is, lists of indices of other nodes in the tree. BQN is less methodical, but generally treats the source tokens as a simple list. This list is reordered in various ways, allowing operations that behave like tree traversals to be performed with scans under the right ordering. This strategy allows BQN to be much stricter in the kinds of operations it uses. Co-dfns regularly uses <code><span class='Value'>⍣</span><span class='Function'>≡</span></code> (repeat until convergence) for iteration and creates nested arrays with <code><span class='Value'>⌸</span></code> (Key, like <a href="../doc/group.html">Group</a>), but BQN has only a single instance of iteration to resolve quotes and comments, plus one complex but parallelizable scan for numeric literal processing, and only uses Group to extract identifiers and strings. This means that most primitives, if we count simple reductions and scans as single primitives, are executed a fixed number of times during execution, making complexity analysis even easier than in Co-dfns.</p>
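As a loose illustration of the "tree traversal as a scan" idea, here is a Python sketch (my own, not code from either compiler, and Python rather than BQN so it stays self-contained): the nesting depth of every token in a flat token list is just a prefix sum over bracket counts, an operation that parallelizes well.

```python
# Hypothetical sketch: depth of each token in a flat token list via a scan.
# The token list and bracket conventions here are invented for illustration.
from itertools import accumulate

tokens = list("a(b(c)d)e")
step = [1 if t == "(" else -1 if t == ")" else 0 for t in tokens]
depth_after = list(accumulate(step))      # depth *after* each token
depth = [d - s for d, s in zip(depth_after, step)]  # depth *at* each token
print(depth)  # → [0, 0, 1, 1, 2, 2, 1, 1, 0]
```

Under a reordering of the token list, many tree-shaped computations reduce to scans of this kind, which is what lets the compiler avoid general iteration.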
<h2 id="backends-and-optimization"><a class="header" href="#backends-and-optimization">Backends and optimization</a></h2>
<p>Co-dfns was designed from the beginning to build GPU programs, and outputs code in ArrayFire (a C++ framework), which is then compiled. GPU programming is quite limiting, and as a result Co-dfns has strict limitations in functionality that are slowly being removed. It now has partial support for nested arrays and array ranks higher than 4. BQN is designed with performance in mind, but implementation effort focused on functionality first, so that arbitrary array structures as well as trains and lexical closures have been supported from the beginning. Rather than target a specific language, it outputs object code to be interpreted by a <a href="vm.html">virtual machine</a>. Another goal for BQN was to not only write the compiler in BQN but to use BQN for the runtime as much as possible. The BQN-based runtime uses a small number of basic array operations provided by the VM. The extra abstraction causes this runtime to be very slow, but this can be fixed by overwriting functions from the runtime with natively-implemented ones.</p>
-<p>Neither BQN nor Co-dfns significantly optimize their output at the time of writing (it could be said that Co-dfns relies on the ArrayFire backend to optimize). BQN does have one optimization, which is to compute variable lifetimes in functions so that the last access to a variable can clear it. Further optimizations often require finding properties such as reachability in a graph of expressions that probably can't be done efficiently in a strict array style. For this and other reasons it would probably be best to structure compiler optimization as a set of additional modules that can be provided during a given compilation.</p>
+<p>Neither BQN nor Co-dfns significantly optimize their output at the time of writing (it could be said that Co-dfns relies on the ArrayFire backend to optimize). BQN does have one optimization, which is to compute variable lifetimes in functions so that the last access to a variable can clear it. Further optimizations often require finding properties such as reachability in a graph of expressions that likely can't be done efficiently in a strict array style. For this and other reasons it would probably be best to structure compiler optimization as a set of additional modules that can be provided during a given compilation.</p>
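The lifetime optimization mentioned above can be sketched as a last-use computation. This Python fragment is a hypothetical illustration (the instruction format and names are invented, not BQN's actual bytecode): the last access of each variable is the point after which a clear can be emitted.

```python
# Hypothetical last-use analysis; "load x" is an invented instruction
# format, not actual BQN bytecode.
ops = ["load a", "load b", "load a", "load c", "load b"]
last_use = {}
for pc, op in enumerate(ops):
    last_use[op.split()[1]] = pc  # later accesses overwrite earlier ones
# A clear for each variable can be emitted just after its last use.
print(last_use)  # → {'a': 2, 'b': 4, 'c': 3}
```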
<h2 id="error-reporting"><a class="header" href="#error-reporting">Error reporting</a></h2>
-<p>Co-dfns doesn't check for compilation errors, while BQN has complete error checking and good error messages, and includes source positions in compiler errors as well as in the compiled code for use in runtime errors. Position tracking and error checking add up to a little more than 20% overhead for the compiler, both in runtime and lines of code. And improving the way errors are reported once found has no cost for working programs, because reporting code only needs to be run if there's a compiler error. This leaves room for potentially very sophisticated error analysis to attempt to track down the root cause of a compilation error, but I haven't yet done any work along these lines.</p>
+<p>Co-dfns doesn't check for compilation errors, while BQN has complete error checking and good error messages, and includes source positions in compiler errors as well as in the compiled code for use in runtime errors. Position tracking and error checking add up to a little more than 20% overhead for the compiler, both in runtime and lines of code. And improving the way errors are reported once found has no cost for working programs, because reporting code only needs to be run if there's a compiler error. The only thing that really takes advantage of this now is the reporting for bracket matching, which goes over all brackets with a stack-based (not array-oriented or parallel) method.</p>
<h2 id="comments"><a class="header" href="#comments">Comments</a></h2>
<p>Aaron advocates the almost complete separation of code from comments (thesis) in addition to his very terse style as a general programming methodology. I find that this practice makes it hard to connect the documentation to the code, and slow to get the summary or reminder of functionality that a comment provides. One comment on each line strikes a better balance of compactness and accessibility, in my opinion. However, I do plan to write long-form material providing the necessary context and explanations required to understand the compiler.</p>
<h2 id="is-it-a-good-idea"><a class="header" href="#is-it-a-good-idea">Is it a good idea?</a></h2>
@@ -28,26 +28,26 @@
<p>The sort of static guarantee I want is not really a type system but an <em>axis</em> system. That is, if I take <code><span class='Value'>a</span><span class='Function'>∧</span><span class='Value'>b</span></code> I want to know that the arithmetic mapping makes sense because the two variables use the same axis. And I want to know that if <code><span class='Value'>a</span></code> and <code><span class='Value'>b</span></code> are compatible, then so are <code><span class='Value'>i</span><span class='Function'>⊏</span><span class='Value'>a</span></code> and <code><span class='Value'>i</span><span class='Function'>⊏</span><span class='Value'>b</span></code>, but not <code><span class='Value'>a</span></code> and <code><span class='Value'>i</span><span class='Function'>⊏</span><span class='Value'>b</span></code>. I could use a form of <a href="https://en.wikipedia.org/wiki/Hungarian_notation">Hungarian notation</a> for this, and write <code><span class='Value'>ia</span><span class='Gets'>←</span><span class='Value'>i</span><span class='Function'>⊏</span><span class='Value'>a</span></code> and <code><span class='Value'>ib</span><span class='Gets'>←</span><span class='Value'>i</span><span class='Function'>⊏</span><span class='Value'>b</span></code>, but it's inconvenient to rewrite the axis every time the variable appears, and I'd much prefer a computer checking agreement rather than my own fallible self.</p>
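A minimal sketch of what such an axis check could look like, written in Python rather than BQN (the <code>Tagged</code> class, its operations, and all names here are my own invention, not an existing system): each vector carries an axis tag, elementwise operations demand matching tags, and selection derives the result's tag from the index vector.

```python
# Hypothetical "axis system" check: a sketch of the static guarantee
# described above, not an implementation anyone has written.
from dataclasses import dataclass

@dataclass
class Tagged:
    axis: str
    data: list

def zip_and(a, b):
    # Elementwise ∧ only makes sense along a shared axis.
    assert a.axis == b.axis, f"axis mismatch: {a.axis} vs {b.axis}"
    return Tagged(a.axis, [x and y for x, y in zip(a.data, b.data)])

def select(i, a):
    # i⊏a: the result is indexed along i's axis, not a's.
    return Tagged(i.axis, [a.data[j] for j in i.data])

a = Tagged("tok", [1, 0, 1])
b = Tagged("tok", [1, 1, 0])
i = Tagged("id", [2, 0])
ia, ib = select(i, a), select(i, b)
print(zip_and(ia, ib).data)  # fine: both results lie on axis "id"
# zip_and(a, ib) would fail the check: "tok" vs "id"
```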
<h3 id="performance"><a class="header" href="#performance">Performance</a></h3>
<p>In his Co-dfns paper Aaron compares to nanopass implementations of his compiler passes. Running on the CPU and using Chez Scheme (not Racket, which is also presented) for nanopass, he finds Co-dfns is up to <strong>10 times faster</strong> for large programs. The GPU is of course slower for small programs and faster for larger ones, breaking even above 100,000 AST nodes—quite a large program. I think comparing the self-hosted BQN compiler to the one in dzaima/BQN shows that this large improvement is caused as much by nanopass being slow as by Co-dfns being fast.</p>
-<p>The self-hosted compiler running in CBQN achieves full performance at about 1KB of dense source code. On large files it achieves speeds around 2MB/s, slightly less than <strong>half as fast</strong> as dzaima/BQN's compiler. This compiler was written in Java by dzaima in a much shorter time than the self-hosted compiler, and is equivalent for benchmarking purposes. While there are minor differences in syntax accepted and the exact bytecode output, I'm sure that either compiler could be modified to match the other with negligible changes in compilation time. The Java compiler is written with performance in mind, but dzaima has expended only a moderate amount of effort to optimize it.</p>
-<p>A few factors other than the speed of the nanopass compiler might partly cause the discrepancy, or otherwise be worth taking into account. I doubt that these can add up to a factor of 20, so I think that nanopass is simply not as fast as more typical imperative compiler methods.</p>
+<p>The self-hosted compiler running in CBQN reaches full performance at about 1KB of dense source code. On large files it achieves speeds around 3MB/s, about <strong>two-thirds as fast</strong> as dzaima/BQN's compiler. This compiler was written in Java by dzaima in a much shorter time than the self-hosted compiler, and is equivalent for benchmarking purposes. While there are minor differences in the syntax accepted and the exact bytecode output, I'm sure that either compiler could be modified to match the other with negligible changes in compilation time. The Java compiler is written with performance in mind, but dzaima has expended only a moderate amount of effort to optimize it.</p>
+<p>A few factors other than the speed of the nanopass compiler might partly cause the discrepancy, or otherwise be worth taking into account. I doubt that these can add up to a factor of 15, so I think that nanopass is simply not as fast as more typical imperative compiler methods.</p>
<ul>
-<li>The CBQN runtime is still suboptimal, with no integer types smaller than 4 bytes (in particular no bit booleans). Small types would improve things but this is limited as the bottleneck shifts over to operations like selection that don't scale well to small types. My estimate is a factor of 3 improvement from improving speed to match Dyalog, and I think more than a factor of 5 is unlikely.</li>
+<li>The CBQN runtime is still suboptimal, missing SIMD implementations for some primitives used in the compiler. But improvements will be limited for operations like selection that don't vectorize as well. My estimate is that a little less than a factor of 2 remains to be gained by matching Dyalog's speed, and I think more than a factor of 4 is unlikely.</li>
<li>On the other hand Java isn't the fastest language for a compiler and a C-based compiler would likely be faster. I don't have an estimate for the size of the difference.</li>
<li>Co-dfns and BQN use different compilation strategies. I think that my methods are at least as fast, and scale better to a full compiler.</li>
<li>The portion of the compiler implemented by Co-dfns could be a better fit for array programming than other sections, which in fact seems likely to me—I would say parsing is where array programming fares worst relative to scalar programming. I think the whole-compiler comparison is more informative, although if this effect is very strong (I don't think it is), then hybrid array-scalar compilers could make sense.</li>
-<li>The Co-dfns and nanopass implementations are pass-for-pass equivalent, while BQN and Java are only comparable as a whole. As the passes were chosen to be ideal for Co-dfns, I think this could be slowing down nanopass in a way that doesn't translate to real-world performance.</li>
+<li>The Co-dfns and nanopass implementations are pass-for-pass equivalent, while BQN and Java are only comparable as a whole. As the passes were chosen to be ideal for Co-dfns, there's some chance this could be slowing down nanopass in a way that doesn't translate to real-world performance.</li>
</ul>
<p>Overall, it seems safe to say that an ideal array-based compiler would be competitive with a scalar one on the CPU, not vastly better as Aaron's benchmarks suggest. This result is still remarkable! APL and BQN are high-level dynamically-typed languages, and wouldn't be expected to go near the performance of a compiled language like Java. However, it makes it much harder to recommend array-based compilation than the numbers from the Co-dfns paper would.</p>
<p>I stress here that I don't think there's anything wrong about the way Aaron has conducted or presented his research. The considerations described above are speculative even in (partial) hindsight. I think Aaron chose nanopass because of his familiarity with the functional programming compiler literature and because its multi-pass system is more similar to Co-dfns. And I know that he actively sought out other programmers willing to implement the compiler in other ways including imperative methods; apparently these efforts didn't pan out. Aaron even mentioned to me that his outstanding results were something of a problem for him, because reviewers found them unbelievable!</p>
<h4 id="what-about-the-gpu"><a class="header" href="#what-about-the-gpu">What about the GPU?</a></h4>
<p>BQN's compiler could certainly be made to run on a GPU, and it's fascinating that this is possible merely because I stuck to an array-based style. In Co-dfns, Aaron found a maximum factor of 6 improvement by running on the GPU, and this time it's the GPU runtime that we should expect to be slower than Dyalog. So we could expect an array-based compiler to run faster on large source files in this case. The problem is this: who could benefit from this speed?</p>
-<p>Probably not BQN. Recall that the BQN compiler runs at 2MB/s. This is fast enough that it almost certainly takes much longer to compile the program than to run it. The exception would be when a lot of code is loaded but not used, which can be solved by splitting the code into smaller files and only using those which are needed.</p>
-<p>The programmers who complain about compile times are those who use low-level languages like C++, Rust, and Julia. The things that these compilers spend time on aren't the things that the BQN compiler does! BQN has a rigid syntax, no metaprogramming, and compiles to bytecode. The slow compilers are churning to perform tasks like:</p>
+<p>Probably not BQN. Recall that the BQN compiler runs at 3MB/s. This is fast enough that it almost certainly takes much longer to compile the program than to run it. The exception would be when a lot of code is loaded but not used, which can be solved by splitting the code into smaller files and only using those which are needed.</p>
+<p>The programmers who complain about compile times are using languages like C++, Rust, and Julia that compile to native code, often through LLVM. The things that these compilers spend time on aren't the things that the BQN compiler does! BQN has a rigid syntax, no metaprogramming, and compiles to bytecode. The slow compilers are churning to perform tasks like:</p>
<ul>
<li>Type checking</li>
<li>Templates, polymorphism, and other code generation</li>
<li>Reachability computations, for optimization</li>
<li>Automatic vectorization</li>
</ul>
-<p>I don't know how to implement these in an array style. I suspect many are simply not possible. They tend to involve relationships between distant parts of the program, and are approached with graph-based methods for which efficient parallel implementations aren't known.</p>
-<p>It might be possible to translate a compiler with less optimization such as Go to an array style, partially or completely. But Go compiles very fast, so speeding it up isn't as desirable. In slower languages attacking problems like the above strikes me as many levels beyond the work on Co-dfns or BQN.</p>
+<p>I don't know how to implement these in an array style. I suspect many are simply not possible. They tend to involve relationships between distant parts of the program, and are approached with graph-based methods for which efficient array-style implementations aren't known.</p>
+<p>It might be possible to translate a compiler with less optimization such as Go to an array style, partially or completely. But Go compiles very fast, so speeding it up isn't as desirable. In slower compilers, attacking problems like the above strikes me as many levels beyond the work on Co-dfns or BQN.</p>