From 147299b6861e59e06d028b9eca4ded3ac63be120 Mon Sep 17 00:00:00 2001
From: Marshall Lochbaum <mwlochbaum@gmail.com>
Date: Wed, 7 Sep 2022 21:14:04 -0400
Subject: CBQN reaches Dyalog-level performance in running the compiler,
 probably

---
 docs/implementation/codfns.html | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

(limited to 'docs')
diff --git a/docs/implementation/codfns.html b/docs/implementation/codfns.html
index 4e9647fe..066b99ce 100644
--- a/docs/implementation/codfns.html
+++ b/docs/implementation/codfns.html
@@ -10,7 +10,7 @@
 <p>The shared goals of BQN and Co-dfns are to implement a compiler for an array language with whole-array operations. This provides the theoretical benefit of a short <em>critical path</em>, which in practice means that both compilers can make good use of a GPU or a CPU's vector instructions simply by providing an appropriate runtime (Co-dfns has good runtimes—an ArrayFire program on the GPU and Dyalog APL on the CPU—while CBQN isn't at the same level yet). The two implementations also share a preference for working &quot;close to the metal&quot; by passing around arrays of numbers rather than creating abstract types to work with data. Objects are right out. These choices lead to a compact source code implementation, and may have some benefits in terms of how easy it is to write and understand the compiler.</p>
 <h2 id="compilation-strategy"><a class="header" href="#compilation-strategy">Compilation strategy</a></h2>
 <p>Co-dfns development historically focused on the core compiler, and not parsing, code generation, or the runtime. The associated Ph.D. thesis and famous 17 lines figure refer to this section, which transforms the abstract syntax tree (AST) of a program to a lower-level form, and resolves lexical scoping by linking variables to their definitions. While all of Co-dfns is written in APL, other sections aren't necessarily data-parallel, and could perform differently. As of late 2021, tokenization and most of parsing also follow the style used by the core compiler, but looser about nesting in my estimation. This brings Co-dfns closer to BQN, which is all in a data-parallel style.</p>
-<p>The core Co-dfns compiler is based on manipulating the syntax tree, which is mostly stored as parent and sibling vectors—that is, lists of indices of other nodes in the tree. BQN is less methodical, but generally treats the source tokens as a simple list. This list is reordered in various ways, allowing operations that behave like tree traversals to be performed with scans under the right ordering. This strategy allows BQN to be much stricter in the kinds of operations it uses. Co-dfns regularly uses <code><span class='Value'>⍣</span><span class='Function'>≡</span></code> (repeat until convergence) for iteration and creates nested arrays with <code><span class='Value'>⌸</span></code> (Key, like <a href="../doc/group.html">Group</a>), but BQN has only a single instance of iteration to resolve quotes and comments, plus one complex but parallelizable scan for numeric literal processing, and only uses Group to extract identifiers and strings. This means that most primitives, if we count simple reductions and scans as single primitives, are executed a fixed number of times during execution, making complexity analysis even easier than in Co-dfns.</p>
+<p>The core Co-dfns compiler is based on manipulating the syntax tree, which is mostly stored as parent and sibling vectors—that is, lists of indices of other nodes in the tree. BQN is less methodical, but generally treats the source tokens as a simple list. This list is reordered in various ways, allowing operations that behave like tree traversals to be performed with scans under the right ordering. This strategy allows BQN to be much stricter in the kinds of operations it uses. Co-dfns regularly uses <code><span class='Value'>⍣</span><span class='Function'>≡</span></code> (repeat until convergence) for iteration and creates nested arrays with <code><span class='Value'>⌸</span></code> (Key, like <a href="../doc/group.html">Group</a>). BQN has only a single instance of iteration to resolve quotes and comments, plus one complex but parallelizable scan for numeric literal processing. And it only uses Group to extract identifiers and strings, and collect metadata for final output. This means that most primitives, if we count simple reductions and scans as single primitives, are executed a fixed number of times during execution, making complexity analysis even easier than in Co-dfns.</p>
 <h2 id="backends-and-optimization"><a class="header" href="#backends-and-optimization">Backends and optimization</a></h2>
 <p>Co-dfns was designed from the beginning to build GPU programs, and outputs code in ArrayFire (a C++ framework), which is then compiled. GPU programming is quite limiting, and as a result Co-dfns has strict limitations in functionality that are slowly being removed. It now has partial support for nested arrays and array ranks higher than 4. BQN is designed with performance in mind, but implementation effort focused on functionality first, so that arbitrary array structures as well as trains and lexical closures have been supported from the beginning. Rather than target a specific language, it outputs object code to be interpreted by a <a href="vm.html">virtual machine</a>. Another goal for BQN was to not only write the compiler in BQN but to use BQN for the runtime as much as possible. The BQN-based runtime uses a small number of basic array operations provided by the VM. The extra abstraction causes this runtime to be very slow, but this can be fixed by overwriting functions from the runtime with natively-implemented ones.</p>
 <p>Neither BQN nor Co-dfns significantly optimize their output at the time of writing (it could be said that Co-dfns relies on the ArrayFire backend to optimize). BQN does have one optimization, which is to compute variable lifetimes in functions so that the last access to a variable can clear it. Further optimizations often require finding properties such as reachability in a graph of expressions that likely can't be done efficiently in a strict array style. For this and other reasons it would probably be best to structure compiler optimization as a set of additional modules that can be provided during a given compilation.</p>
@@ -29,21 +29,20 @@
 <p><a href="https://www.youtube.com/watch?v=z8MVKianh54">Here's Aaron's take</a>. Honestly I can't really go along with this: I think he ignores a lot of real distinctions between array and typed functional programming because it's convenient for the point he wants to make. On the other hand, it's abundantly clear that C-style types would be useless for an array compiler, because nearly every variable is a list of integers.</p>
 <p>The sort of static guarantee I want is not really a type system but an <em>axis</em> system. That is, if I take <code><span class='Value'>a</span><span class='Function'>∧</span><span class='Value'>b</span></code> I want to know that the arithmetic mapping makes sense because the two variables use the same axis. And I want to know that if <code><span class='Value'>a</span></code> and <code><span class='Value'>b</span></code> are compatible, then so are <code><span class='Value'>i</span><span class='Function'>⊏</span><span class='Value'>a</span></code> and <code><span class='Value'>i</span><span class='Function'>⊏</span><span class='Value'>b</span></code>, but not <code><span class='Value'>a</span></code> and <code><span class='Value'>i</span><span class='Function'>⊏</span><span class='Value'>b</span></code>. I could use a form of <a href="https://en.wikipedia.org/wiki/Hungarian_notation">Hungarian notation</a> for this, and write <code><span class='Value'>ia</span><span class='Gets'>←</span><span class='Value'>i</span><span class='Function'>⊏</span><span class='Value'>a</span></code> and <code><span class='Value'>ib</span><span class='Gets'>←</span><span class='Value'>i</span><span class='Function'>⊏</span><span class='Value'>b</span></code>, but it's inconvenient to rewrite the axis every time the variable appears, and I'd much prefer a computer checking agreement rather than my own fallible self.</p>
 <h3 id="performance"><a class="header" href="#performance">Performance</a></h3>
-<p>In his Co-dfns paper Aaron compares to nanopass implementations of his compiler passes. Running on the CPU and using Chez Scheme (not Racket, which is also presented) for nanopass, he finds Co-dfns is up to <strong>10 times faster</strong> for large programs. The GPU is of course slower for small programs and faster for larger ones, breaking even above 100,000 AST nodes—quite a large program. I think comparing the self-hosted BQN compiler to the one in dzaima/BQN shows that this large improvement is caused as much by nanopass being slow as Co-dfns being fast.</p>
-<p>The self-hosted compiler running in CBQN reaches full performance at about 1KB of dense source code. Handling over 3MB/s, it's around <strong>half as fast</strong> as dzaima/BQN's compiler (but it's complicated—dbqn is usually slower on the first run but gets up to 3 times faster with some files, after hundreds of runs and with 3GB of memory use). This compiler was written in Java by dzaima in a much shorter time than the self-hosted compiler, and is equivalent for benchmarking purposes. While there are minor differences in syntax accepted and the exact bytecode output, I'm sure that either compiler could be modified to match the other with negligible changes in compilation time. The Java compiler is written with performance in mind, but dzaima has expended only a moderate amount of effort to optimize it.</p>
-<p>A few factors other than the speed of the nanopass compiler might partly cause the discrepancy, or otherwise be worth taking into account. I doubt that these can add up to a factor of 20, so I think that nanopass is simply not as fast as more typical imperative compiler methods.</p>
+<p>In his Co-dfns paper Aaron compares to nanopass implementations of his compiler passes. Running on the CPU and using Chez Scheme (not Racket, which is also presented) for nanopass, he finds Co-dfns is up to <strong>10 times faster</strong> for large programs. The GPU is of course slower for small programs and faster for larger ones, breaking even above 100,000 AST nodes—quite a large program.</p>
+<p>The self-hosted compiler running in CBQN reaches full performance at 3KB or so of dense source code, and slows down slightly at larger sizes as the type for indices gets larger. Handling at least 5MB/s, it's between the <strong>same speed</strong> and <strong>half as fast</strong> as dzaima/BQN's compiler when that's fully warmed up (which takes hundreds of runs and uses 3GB of RAM). That compiler was written in Java by dzaima in a much shorter time than the self-hosted compiler, and is very similar for benchmarking purposes—it's missing some functionality but this shouldn't change timings more than 10% at a very conservative estimate. The Java compiler is written with performance in mind, but dzaima has expended only a moderate amount of effort to optimize it.</p>
+<p>There are a few factors to consider regarding the difference in results:</p>
 <ul>
-<li>The CBQN runtime is still suboptimal, missing SIMD implementations for some primitives used in the compiler. But improvements will be limited for operations like selection that don't vectorize as well. I estimate less than a factor of 2 improvement remains from improving speed to match Dyalog, and would find a factor of 3 unlikely.</li>
-<li>On the other hand Java isn't the fastest language for a compiler and a C-based compiler would likely be faster. I don't have an estimate for the size of the difference.</li>
+<li>CBQN is a different runtime from Dyalog. We've profiled the compiler and implemented lots of things in SIMD, and I'm confident CBQN runs the compiler at least as fast as Dyalog 17.0, the latest version for Aaron's thesis, would have.</li>
 <li>Co-dfns and BQN use different compilation strategies. I think that my methods are at least as fast, and scale better to a full compiler.</li>
-<li>The portion of the compiler implemented by Co-dfns could be better for arrays than other sections, which in fact seems likely to me—I would say parsing is the worst for array relative to scalar programming. I think the whole-compiler comparison is more informative, although if this effect is very strong (I don't think it is), then hybrid array-scalar compilers could make sense.</li>
+<li>The portion of the compiler implemented by Co-dfns could be better for arrays than other sections, which in fact seems likely to me—I would say parsing is the worst for array relative to scalar programming.</li>
 <li>The Co-dfns and nanopass implementations are pass-for-pass equivalent, while BQN and Java are only comparable as a whole. As the passes were chosen to be ideal for Co-dfns, there's some chance this could be slowing down nanopass in a way that doesn't translate to real-world performance.</li>
 </ul>
-<p>Overall, it seems safe to say that an ideal array-based compiler would be competitive with a scalar one on the CPU, not vastly better as Aaron's benchmarks suggest. This result is still remarkable! APL and BQN are high-level dynamically-typed languages, and wouldn't be expected to go near the performance of a compiled language like Java. However, it makes it much harder to recommend array-based compilation than the numbers from the Co-dfns paper would.</p>
-<p>I stress here that I don't think there's anything wrong about the way Aaron has conducted or presented his research. The considerations described above are speculative even in (partial) hindsight. I think Aaron chose nanopass because of his familiarity with the functional programming compiler literature and because its multi-pass system is more similar to Co-dfns. And I know that he actively sought out other programmers willing to implement the compiler in other ways including imperative methods; apparently these efforts didn't pan out. Aaron even mentioned to me that his outstanding results were something of a problem for him, because reviewers found them unbelievable!</p>
+<p>I reckon that most of the effect is that the nanopass compiler just isn't very fast. I see our whole-compiler results as superseding Aaron's comparison, and it's worth noting as well that C could probably beat Java for a BQN compiler. So I conclude that, with current technology, array-based compilers are competitive with scalar compilers on the CPU and not vastly better. This result is still remarkable! APL and BQN are high-level dynamically-typed languages, and wouldn't be expected to go near the performance of a compiled language like Java. However, it makes it much harder to recommend array-based compilation than the numbers from the Co-dfns paper would.</p>
+<p>I stress here that I don't think there's anything wrong about the way Aaron has conducted or presented his research, even if it now appears that nanopass was a poor baseline. I think he chose nanopass because of his familiarity with the functional programming compiler literature and because its multi-pass system is more similar to Co-dfns. And I know that he actively sought out other programmers willing to implement the compiler in other ways including imperative methods; apparently these efforts didn't pan out. Aaron even mentioned to me that his outstanding results were something of a problem for him, because reviewers found them unbelievable!</p>
 <h4 id="what-about-the-gpu"><a class="header" href="#what-about-the-gpu">What about the GPU?</a></h4>
 <p>BQN's compiler could certainly be made to run on a GPU, and it's fascinating that this is possible merely because I stuck to an array-based style. In Co-dfns, Aaron found a maximum factor of 6 improvement by running on the GPU, and this time it's the GPU runtime that we should expect to be slower than Dyalog. So we could expect an array-based compiler to run faster on large source files in this case. The problem is this: who could benefit from this speed?</p>
-<p>Probably not BQN. Recall that the BQN compiler runs at 3MB/s. This is fast enough that it almost certainly takes much longer to run the program than to compile it. The exception would be when a lot of code is loaded but not used, which can be solved by splitting the code into smaller files and only using those which are needed.</p>
+<p>Probably not BQN. Recall that the BQN compiler runs at 5MB/s. This is fast enough that it almost certainly takes much longer to run the program than to compile it. The exception would be when a lot of code is loaded but not used, which can be solved by splitting the code into smaller files and only using those which are needed.</p>
 <p>The programmers who complain about compile times are using languages like C++, Rust, and Julia that compile to native code, often through LLVM. The things that these compilers spend time on aren't the things that the BQN compiler does! BQN has a rigid syntax, no metaprogramming, and compiles to bytecode. The slow compilers are churning to perform tasks like:</p>
 <ul>
 <li>Type checking</li>
-- 
cgit v1.2.3