The part about array compilers

author: Marshall Lochbaum <mwlochbaum@gmail.com> 2022-04-10 18:31:57 -0400
committer: Marshall Lochbaum <mwlochbaum@gmail.com> 2022-04-10 18:31:57 -0400
commit: ec3ca08adb269a91362672668fe1f9891a9cc386 (patch)
tree: 5abffa4cee5512bb9edb7222432e1fcc0916d418 /docs/implementation
parent: 8f43952c048f04f72e2e39f1611944ad8a274657 (diff)
1 files changed, 32 insertions, 1 deletions
diff --git a/docs/implementation/compile/intro.html b/docs/implementation/compile/intro.html
index ea4eaed5..0431c827 100644
--- a/docs/implementation/compile/intro.html
+++ b/docs/implementation/compile/intro.html
@@ -6,6 +6,7 @@
 <div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a> / <a href="index.html">compile</a></div>
 <h1 id="array-language-compilation-in-context"><a class="header" href="#array-language-compilation-in-context">Array language compilation in context</a></h1>
 <p>Chances are that if you're reading this, you know the difference between an interpreter and a compiler, and what separates C from interpreted Python. Good! You have a lot more to learn to understand what compilation means for the array language paradigm. The most important is that, for an array language, compiling isn't automatically better in the way that Python compilers tend to run a lot faster than interpreters. The two strategies have different advantages, and a hybrid approach could be much better than either alone. Here we'll discuss what makes BQN fast, what keeps it from always being fast, and how compiling might help.</p>
+<p>The most important piece of context is that you don't actually care about performance. On array programming forums there are many questions about how to speed up code or what the fastest approach would be. Nine times out of ten, or more, such a post is about a program that takes a fraction of a second—sometimes, under a millisecond, sometimes, under a microsecond. The program was already fast enough! In other cases the answer is almost never to switch languages, but to tweak the code slightly or use a different algorithm. The use cases that call for a general-purpose language and demand full hardware efficiency are rare, and it's much more important to make a language that is easy to write code in than one that runs it quickly.</p>
 <h2 id="array-interpreters"><a class="header" href="#array-interpreters">Array interpreters</a></h2>
 <p>Learn to run the code before you can tree-walk the source.</p>
 <h3 id="fast-and-slow-evaluation"><a class="header" href="#fast-and-slow-evaluation">Fast and slow evaluation</a></h3>
@@ -23,4 +24,34 @@
 <p>Because static typing lets the programmer say exactly what type the data should have, it ought to be faster, right? Well that's a bold hypothesis, C Programming Language, given that you spent the entire 90s working to convince programmers that it's better to let the compiler choose which assembly instructions are run and in what order.</p>
 <p>Array type detection can be both fast and accurate. First off, scanning a BQN array to get the appropriate type is very quick to do with vector instructions, particularly if it's an integer array. But also, this isn't usually necessary, because most primitives have an obvious result type. Structural manipulation leaves the type unchanged, and arithmetic can be done with built-in overflow checking, adding about 10% overhead to the operation in a typical case.</p>
 <p>And on the other side, it's hard to pick a static type that always works. Most times your forum software is run, each user will have under <code><span class='Number'>2</span><span class='Function'>⋆</span><span class='Number'>15</span></code> posts. But not always, and using a short integer for the post count wouldn't be safe. With dynamically typed arrays, your program has the performance of a fast type when possible but scales to a larger one when necessary. So it can be both faster and more reliable.</p>
-<p>However, with current array implementation technology, these advantages only apply to array-style code, when you're calling primitives that each do a lot of work. In order to get many small primitive calls working quickly, you need a compiler.</p>
+<p>However, with current array implementation technology, these advantages only apply to array-style code, when you're calling primitives that each do a lot of work. In order to get many small primitive calls working quickly, you need a compiler. Compilers have another significant benefit in the form of <a href="https://en.wikipedia.org/wiki/Loop_fusion"><strong>loop fusion</strong></a>, allowing multiple primitives to be executed in one round without writing intermediate results. This is most important for arithmetic on arrays, where the cost of doing the actual operation is much less than the cost of reading and writing the results if evaluated using SIMD instructions. I think the benefits of fusion are often overstated because it's rare for such simple operations to make up a large portion of program runtime, but there's no doubt that it can provide some benefit for most array-oriented programs and a large benefit for some of them.</p>
+<h2 id="optimizing-array-primitives"><a class="header" href="#optimizing-array-primitives">Optimizing array primitives</a></h2>
+<p>There is a bit more to say while we're still in interpreter-land (the title says &quot;compilation in context&quot;, but I'm sorry to inform you that this page is heavy on &quot;context&quot;, but not so hot on &quot;compilation&quot;, and frankly lukewarm on &quot;in&quot;). The function <code><span class='Function'>+</span><span class='Modifier'>´</span></code> isn't a primitive, it's two!</p>
+<p>The way that <code><span class='Function'>+</span><span class='Modifier'>´</span></code> is evaluated using specialized code is that <code><span class='Modifier'>´</span></code>, when invoked, checks whether its operand is one of a few known cases (at the time of writing, <code><span class='Function'>+×⌊⌈∧∨</span></code>). If so, it checks the type and applies special code accordingly. Arguably, this is a very rudimentary form of just-in-time compilation for the function <code><span class='Function'>+</span><span class='Modifier'>´</span></code>, as it takes a program that would apply <code><span class='Function'>+</span></code> many times and transforms it to special pre-compiled code that doesn't call the <code><span class='Function'>+</span></code> function. However, it's pretty different from what a compiled language would do in that this function is never associated with the object representing <code><span class='Function'>+</span><span class='Modifier'>´</span></code>, so that <code><span class='Function'>+</span><span class='Modifier'>´</span></code> is re-compiled each time it's run.</p>
+<p>Array languages often optimize primitives or arithmetic combinations (folds, scans, table, rank) better than current C compilers handle the code a user would write for them. Partly this is because with C being so low-level it's hard to extract the meaning from this code (pointer aliasing can also get in the way). An array language implementer has an easier job since <code><span class='Function'>-</span><span class='Modifier'>⌜</span></code> or <code><span class='Function'>∨</span><span class='Modifier'>`</span></code> is easy to detect. I also think C compilers just don't pay as much attention to array operations as they should. I can't imagine any reason why the C code for an integer prefix sum <code><span class='Function'>+</span><span class='Modifier'>`</span></code> shouldn't generate SIMD code, but I've tried and failed to get gcc or clang to do it.</p>
+<p>Forms like <code><span class='Function'>+</span><span class='Modifier'>´</span></code> are very simple examples of <a href="../../doc/tacit.html">tacit</a> code. It's possible to compile more complex tacit forms as well. <a href="https://github.com/mlochbaum/ILanguage">I the language</a> compiles some tacit programs directly to x86 machine code. For example, if a fold (maybe the equivalent of <code><span class='Paren'>(</span><span class='Number'>1</span><span class='Function'>+×</span><span class='Paren'>)</span><span class='Modifier'>´</span></code>) is called on a list of at least 8 elements, then it attempts to determine the result type and compile code for the entire fold. It doesn't save this machine code, so in one way it's an extension to the treatment of <code><span class='Function'>+</span><span class='Modifier'>´</span></code> that re-analyzes the function every time. However, generating code on demand makes it a true JIT compiler.</p>
+<h2 id="compiling-array-languages"><a class="header" href="#compiling-array-languages">Compiling array languages</a></h2>
+<p>With this context, what would it mean to compile an array language? I'll try to survey existing and possible approaches here. Note that I don't have first-hand experience with typed array languages as a user or an implementer, so I can't give a very detailed account of what they do and there's a higher-than-usual chance of errors when I talk about them.</p>
+<p>I'm now discussing what most people mean when they say &quot;compiled APL&quot;, which means running code using machine code that's purpose-built for that particular operation. To make this code, the compiler must know the types of the values it's dealing with, so the two strategies discussed above come into play: have the user specify types, or derive them as the code is running. Various optimizations are possible without knowing types (for example BQN has some rudimentary lifetime analysis, which frees some variables when they're no longer used), but as they're not specific to array languages I've decided they're out of scope for this page.</p>
+<h3 id="typed-array-languages"><a class="header" href="#typed-array-languages">Typed array languages</a></h3>
+<p>Most array language compilers, and certainly the most successful ones, are based on static typing. Typed array languages have become a major subject of research in the last ten years or so, and the following two groups have arisen:</p>
+<ul>
+<li>ML-family (like Haskell): <a href="https://github.com/diku-dk/futhark">Futhark</a> has size-dependent types and <a href="https://github.com/google-research/dex-lang">Dex</a> has fully dependent types</li>
+<li>MLIR and XLA, low-level drivers of deep learning frameworks like TensorFlow</li>
+</ul>
+<p>One of the reasons to cover static types first is that all of these languages are promising compilation targets for BQN code when types have been determined. They're mostly designed for large arrays, with a significant focus on GPUs. This would make them a sort of a specialized tool in an array language, where most programs use many small and medium arrays even if the goal is to manipulate large ones.</p>
+<p>Three major efforts to apply ahead-of-time compilation to APL are <a href="https://www.snakeisland.com/apexup.htm">APEX</a>, <a href="https://github.com/melsman/apltail">apltail</a>, and <a href="https://github.com/Co-dfns/Co-dfns">Co-dfns</a>. It's worth noting that none has had any apparent uptake in the APL world. I think this is because they have to leave out significant functionality to compile ahead of time, and because performance doesn't actually matter to APL programmers, as mentioned in the introduction. From my reading it appears that APEX is statically typed, since each value in the source code can have only one type, and declarations (written as APL comments) are required to disambiguate if the are multiple possibilities. apltail and Co-dfns are dynamically typed, but apltail compiles to Typed Array Intermediate Language (TAIL), which, as the name insists, is statically typed. As far as I know TAIL is used only as a target for apltail. Some effort has been expended on the <a href="https://github.com/henrikurms/tail2futhark">tail2futhark</a> project, but it's no longer maintained.</p>
+<h3 id="compiling-with-dynamic-types"><a class="header" href="#compiling-with-dynamic-types">Compiling with dynamic types</a></h3>
+<p>Moving to dynamically-typed languages, the actual compilation isn't going to change that much. What we are interested in is types. When and how are they determined for values that haven't been created yet?</p>
+<p>First I think it's worth discussing <a href="https://julialang.org/">Julia</a>, which I would describe as the most successful compiled dynamically-typed array language. Each array has a particular type, and it sticks with it: for example if you multiple two <code><span class='Function'>Int8</span></code> arrays then the results will wrap around rather than increasing the type. But functions can accept many different argument types. Julia does this by compiling a function again whenever it's called on types it hasn't seen before. The resulting function is fast, but the time spent compiling causes significant delays. The model of arrays with a fixed type chosen from many options is the same as NumPy, which follows a traditional interpreted model. But it's different from APL and BQN, which have only one number type and optimize using subsets. J and K sit somewhere in between, with a small number of logical types (such as separate integers and floats) and some possibility for optimization.</p>
+<p>The ahead-of-time compilers apltail and Co-dfns mentioned in the previous section take different approaches. apltail uses a powerful (but not dependent) type system with type inference to detect which types the program uses. Co-dfns compiles to ArrayFire code that is still somewhat dynamic, with switches on rank or types. It's possible the ArrayFire compiler can optimize some of them out. I think that while these impressive projects are definitely doing something worthwhile, ahead-of-time compilation on its own is ultimately not a good basis for an array language implementation (but it's just my opinion, and I may well be wrong! Don't let me stop you!). There's too much to gain by having access to the actual data at compilation time, and being able to fit it into a smaller type.</p>
+<p>However, I would be very interested in compiling BQN's stack-based IR using these ahead-of-time methods. The ArrayFire code from Co-dfns would be easiest to generate and can probably be adapted to BQN primitives (Dyalog has indicated in a non-committal way that they're interested in integrating Co-dfns into Dyalog APL as well). Other backends could provide better performance at the cost of type analysis, which brings us to the question of how to approach types in running dynamic code.</p>
+<p>A very early example of JIT compilation, <a href="https://aplwiki.com/wiki/APL%5C3000">APL\3000</a>, begins by compiling each function for the exact types it's called with on the first call and recompiles for more general types when its assumptions are broken. This keeps the program from spending all its time compiling, but also can't optimize a function well over multiple types; given how cheap memory is now I think it's better to compile multiple versions more readily.</p>
+<p>I wrote in more detail about various strategies in the page on <a href="https://mlochbaum.github.io/BQN/implementation/compile/dynamic.html">dynamic compilation</a>. For one example of an advanced approach, my guess as to the best framework—if implementation costs are not taken into account—to deal with large arrays on a typical computer is to layer many strategies together:</p>
+<ul>
+<li>At the largest scale, execution is dynamic</li>
+<li>When possible, a few upcoming primitives are statically analyzed together and probably fused</li>
+<li>Execution is split into large chunks which can be split across processors</li>
+<li>A single chunk is evaluated using specialized (but possibly JIT-compiled) SIMD or GPU code.</li>
+</ul>
+<p>So, both in the time domain and the space/index domain there's an outer dynamic layer and an inner fixed one. Fusion is important here but the lowest level of optimization should still probably understand array primitives rather than breaking them into scalar loops. The scalar methods don't have deep enough understanding to generate the fancy primitive algorithms that humans can.</p>
author	Marshall Lochbaum <mwlochbaum@gmail.com>	2022-04-10 18:31:57 -0400
committer	Marshall Lochbaum <mwlochbaum@gmail.com>	2022-04-10 18:31:57 -0400
commit	ec3ca08adb269a91362672668fe1f9891a9cc386 (patch)
tree	5abffa4cee5512bb9edb7222432e1fcc0916d418 /docs/implementation
parent	8f43952c048f04f72e2e39f1611944ad8a274657 (diff)