author    Marshall Lochbaum <mwlochbaum@gmail.com>    2022-07-01 09:56:13 -0400
committer Marshall Lochbaum <mwlochbaum@gmail.com>    2022-07-01 09:56:13 -0400
commit    7cdda7ebdf565ee88276592d30691f14306f9bd1 (patch)
tree      b04015587312cdeada6a3c2e3d2c4d4bae328f90
parent    683d2d1005e0e2f38880494621da800f60fcda8d (diff)
Revisit CBQN versus dzaima/BQN performance comparison
-rw-r--r--  docs/implementation/codfns.html  6
-rw-r--r--  implementation/codfns.md         6
2 files changed, 6 insertions, 6 deletions
diff --git a/docs/implementation/codfns.html b/docs/implementation/codfns.html
index 45dbfc19..b9b830dd 100644
--- a/docs/implementation/codfns.html
+++ b/docs/implementation/codfns.html
@@ -29,10 +29,10 @@
<p>The sort of static guarantee I want is not really a type system but an <em>axis</em> system. That is, if I take <code><span class='Value'>a</span><span class='Function'>∧</span><span class='Value'>b</span></code> I want to know that the arithmetic mapping makes sense because the two variables use the same axis. And I want to know that if <code><span class='Value'>a</span></code> and <code><span class='Value'>b</span></code> are compatible, then so are <code><span class='Value'>i</span><span class='Function'>⊏</span><span class='Value'>a</span></code> and <code><span class='Value'>i</span><span class='Function'>⊏</span><span class='Value'>b</span></code>, but not <code><span class='Value'>a</span></code> and <code><span class='Value'>i</span><span class='Function'>⊏</span><span class='Value'>b</span></code>. I could use a form of <a href="https://en.wikipedia.org/wiki/Hungarian_notation">Hungarian notation</a> for this, and write <code><span class='Value'>ia</span><span class='Gets'>←</span><span class='Value'>i</span><span class='Function'>⊏</span><span class='Value'>a</span></code> and <code><span class='Value'>ib</span><span class='Gets'>←</span><span class='Value'>i</span><span class='Function'>⊏</span><span class='Value'>b</span></code>, but it's inconvenient to rewrite the axis every time the variable appears, and I'd much prefer a computer checking agreement rather than my own fallible self.</p>
<h3 id="performance"><a class="header" href="#performance">Performance</a></h3>
<p>In his Co-dfns paper Aaron compares to nanopass implementations of his compiler passes. Running on the CPU and using Chez Scheme (not Racket, which is also presented) for nanopass, he finds Co-dfns is up to <strong>10 times faster</strong> for large programs. The GPU is of course slower for small programs and faster for larger ones, breaking even above 100,000 AST nodes—quite a large program. I think comparing the self-hosted BQN compiler to the one in dzaima/BQN shows that this large improvement is caused as much by nanopass being slow as Co-dfns being fast.</p>
-<p>The self-hosted compiler running in CBQN reaches full performance at about 1KB of dense source code. On large files it achieves speeds around 3MB/s, about <strong>two-thirds as fast</strong> as dzaima/BQN's compiler. This compiler was written in Java by dzaima in a much shorter time than the self-hosted compiler, and is equivalent for benchmarking purposes. While there are minor differences in syntax accepted and the exact bytecode output, I'm sure that either compiler could be modified to match the other with negligible changes in compilation time. The Java compiler is written with performance in mind, but dzaima has expended only a moderate amount of effort to optimize it.</p>
-<p>A few factors other than the speed of the nanopass compiler might partly cause the discrepancy, or otherwise be worth taking into account. I doubt that these can add up to a factor of 15, so I think that nanopass is simply not as fast as more typical imperative compiler methods.</p>
+<p>The self-hosted compiler running in CBQN reaches full performance at about 1KB of dense source code. Handling over 3MB/s, it's around <strong>half as fast</strong> as dzaima/BQN's compiler (but it's complicated—dbqn is usually slower on the first run but gets up to 3 times faster with some files, after hundreds of runs and with 3GB of memory use). This compiler was written in Java by dzaima in a much shorter time than the self-hosted compiler, and is equivalent for benchmarking purposes. While there are minor differences in syntax accepted and the exact bytecode output, I'm sure that either compiler could be modified to match the other with negligible changes in compilation time. The Java compiler is written with performance in mind, but dzaima has expended only a moderate amount of effort to optimize it.</p>
+<p>A few factors other than the speed of the nanopass compiler might partly cause the discrepancy, or otherwise be worth taking into account. I doubt that these can add up to a factor of 20, so I think that nanopass is simply not as fast as more typical imperative compiler methods.</p>
<ul>
-<li>The CBQN runtime is still suboptimal, missing SIMD implementations for some primitives used in the compiler. But improvements will be limited for operations like selection that don't vectorize as well. My estimate is a little less than a factor of 2 improvement remaining from improving speed to match Dyalog, and I think more than a factor of 4 is unlikely.</li>
+<li>The CBQN runtime is still suboptimal, missing SIMD implementations for some primitives used in the compiler. But improvements will be limited for operations like selection that don't vectorize as well. I estimate less than a factor of 2 improvement remains from improving speed to match Dyalog, and would find a factor of 3 unlikely.</li>
<li>On the other hand Java isn't the fastest language for a compiler and a C-based compiler would likely be faster. I don't have an estimate for the size of the difference.</li>
<li>Co-dfns and BQN use different compilation strategies. I think that my methods are at least as fast, and scale better to a full compiler.</li>
<li>The portion of the compiler implemented by Co-dfns could be better for arrays than other sections, which in fact seems likely to me—I would say parsing is the worst for array relative to scalar programming. I think the whole-compiler comparison is more informative, although if this effect is very strong (I don't think it is), then hybrid array-scalar compilers could make sense.</li>
diff --git a/implementation/codfns.md b/implementation/codfns.md
index 5a0ae10e..6b0d71fe 100644
--- a/implementation/codfns.md
+++ b/implementation/codfns.md
@@ -50,11 +50,11 @@ The sort of static guarantee I want is not really a type system but an *axis* sy
In his Co-dfns paper Aaron compares to nanopass implementations of his compiler passes. Running on the CPU and using Chez Scheme (not Racket, which is also presented) for nanopass, he finds Co-dfns is up to **10 times faster** for large programs. The GPU is of course slower for small programs and faster for larger ones, breaking even above 100,000 AST nodes—quite a large program. I think comparing the self-hosted BQN compiler to the one in dzaima/BQN shows that this large improvement is caused as much by nanopass being slow as Co-dfns being fast.
-The self-hosted compiler running in CBQN reaches full performance at about 1KB of dense source code. On large files it achieves speeds around 3MB/s, about **two-thirds as fast** as dzaima/BQN's compiler. This compiler was written in Java by dzaima in a much shorter time than the self-hosted compiler, and is equivalent for benchmarking purposes. While there are minor differences in syntax accepted and the exact bytecode output, I'm sure that either compiler could be modified to match the other with negligible changes in compilation time. The Java compiler is written with performance in mind, but dzaima has expended only a moderate amount of effort to optimize it.
+The self-hosted compiler running in CBQN reaches full performance at about 1KB of dense source code. Handling over 3MB/s, it's around **half as fast** as dzaima/BQN's compiler (but it's complicated—dbqn is usually slower on the first run but gets up to 3 times faster with some files, after hundreds of runs and with 3GB of memory use). This compiler was written in Java by dzaima in a much shorter time than the self-hosted compiler, and is equivalent for benchmarking purposes. While there are minor differences in syntax accepted and the exact bytecode output, I'm sure that either compiler could be modified to match the other with negligible changes in compilation time. The Java compiler is written with performance in mind, but dzaima has expended only a moderate amount of effort to optimize it.
-A few factors other than the speed of the nanopass compiler might partly cause the discrepancy, or otherwise be worth taking into account. I doubt that these can add up to a factor of 15, so I think that nanopass is simply not as fast as more typical imperative compiler methods.
+A few factors other than the speed of the nanopass compiler might partly cause the discrepancy, or otherwise be worth taking into account. I doubt that these can add up to a factor of 20, so I think that nanopass is simply not as fast as more typical imperative compiler methods.
-- The CBQN runtime is still suboptimal, missing SIMD implementations for some primitives used in the compiler. But improvements will be limited for operations like selection that don't vectorize as well. My estimate is a little less than a factor of 2 improvement remaining from improving speed to match Dyalog, and I think more than a factor of 4 is unlikely.
+- The CBQN runtime is still suboptimal, missing SIMD implementations for some primitives used in the compiler. But improvements will be limited for operations like selection that don't vectorize as well. I estimate less than a factor of 2 improvement remains from improving speed to match Dyalog, and would find a factor of 3 unlikely.
- On the other hand Java isn't the fastest language for a compiler and a C-based compiler would likely be faster. I don't have an estimate for the size of the difference.
- Co-dfns and BQN use different compilation strategies. I think that my methods are at least as fast, and scale better to a full compiler.
- The portion of the compiler implemented by Co-dfns could be better for arrays than other sections, which in fact seems likely to me—I would say parsing is the worst for array relative to scalar programming. I think the whole-compiler comparison is more informative, although if this effect is very strong (I don't think it is), then hybrid array-scalar compilers could make sense.