Adjustments to K performance information

author: Marshall Lochbaum <mwlochbaum@gmail.com> 2021-08-31 12:04:36 -0400
committer: Marshall Lochbaum <mwlochbaum@gmail.com> 2021-08-31 12:04:36 -0400
commit: 500615cf09aaa009b7f4eac34a3e53982ad7c792 (patch)
tree: 4ad69760daefae3fad330ab1410de9256f95eb3d
parent: 505715a37a0f845029758bc16955149c800850c9 (diff)
2 files changed, 20 insertions, 11 deletions
diff --git a/docs/implementation/kclaims.html b/docs/implementation/kclaims.html
index fcc0ddac..e580f6d2 100644
--- a/docs/implementation/kclaims.html
+++ b/docs/implementation/kclaims.html
@@ -9,17 +9,20 @@
 <p>This isn't meant to put down the K language! K is in fact the only APL-family language other than BQN that I would recommend without reservations. And there's nothing wrong with the K community as a whole. Go to <a href="https://chat.stackexchange.com/rooms/90748/the-k-tree">the k tree</a> and meet them! What I want to fight is the <em>myth</em> of K, which is carried around as much by those who used K once upon a time, and no longer have any connection to it, as by active users.</p>
 <p>The points I argue here are narrow. To some extent I'm picking out the craziest things said about K to argue against. Please don't assume whoever you're talking to thinks these crazy things about K just because I wrote them here. Or, if they are wrong about these topics, that they're wrong about everything. Performance is a complicated and often counter-intuitive field and it's easy to be mislead.</p>
 <p>On that note, it's possible <em>I've</em> made mistakes, such as incorrectly designing or interpreting benchmarks. If you present me with concrete evidence against something I wrote below, I promise I'll revise this page to include it, even if I just have to quote verbatim because I don't understand a word of it.</p>
+<p>This page has now been <a href="https://news.ycombinator.com/item?id=28365645">discussed</a> on Hacker News, and I have amended and added information based on comments there and elsewhere.</p>
 <h2 id="the-fastest-array-language"><a class="header" href="#the-fastest-array-language">The fastest array language</a></h2>
 <p>When you ask what the fastest array language is, chances are someone is there to answer one of k, kdb, or q. I can't offer benchmarks that contradict this, but I will argue that there's little reason to take these people at their word.</p>
-<p>The reason I have no measurements is that every contract for a commercial K includes an anti-benchmark clause. For example, Shakti's <a href="https://shakti.com/download/license">license</a> says users cannot &quot;distribute or otherwise make available to any third party any report regarding the performance of the Software benchmarks or any information from such a report&quot;. As I would be unable to share the results, I have not taken benchmarks of any commercial K. Or downloaded one for that matter. Shakti could publish benchmarks; they choose to publish a handful of comparisons with database software and none with array languages or frameworks. I do run tests with <a href="https://codeberg.org/ngn/k">ngn/k</a>, which is developed with goals similar to Whitney's K; the author says it's slower than Shakti but not by too much.</p>
+<p>The reason I have no measurements is that every contract for a commercial K includes an anti-benchmark clause. For example, Shakti's <a href="https://shakti.com/download/license">license</a> says users cannot &quot;distribute or otherwise make available to any third party any report regarding the performance of the Software benchmarks or any information from such a report&quot;. This <a href="https://en.wikipedia.org/wiki/DeWitt_Clause">&quot;DeWitt Clause&quot;</a> is common in many databases. It takes much more work to refute bad benchmarks than to produce them so perhaps in the high-stakes DBMS world it's not as crazy as it seems—even though it's a terrible outcome. As I would be unable to share the results, I have not taken benchmarks of any commercial K. Or downloaded one for that matter. So I'm limited to a small number of approved benchmarks that have been published, which focus on performance as a database and not a programming language. I also run tests with <a href="https://codeberg.org/ngn/k">ngn/k</a>, which is developed with goals similar to Whitney's K; the author says it's slower than Shakti but not by too much.</p>
 <p>The primary reason I don't give any credence to claims that K is the best is that they are always devoid of specifics. Most importantly, the same assertion is made across decades even though performance in J, Dyalog, and NumPy has improved by leaps and bounds in the meantime—I participated in advances of <a href="https://www.dyalog.com/dyalog/dyalog-versions/170/performance.htm">26%</a> and <a href="https://www.dyalog.com/dyalog-versions/180/performance.htm">10%</a> in overall Dyalog benchmarks in the last two major versions. Has K4 (the engine behind kdb and Q) kept pace? Maybe it's fallen behind since Arthur left but Shakti K is better? Which other array languages has the poster used? Doesn't matter—<em>they</em> are all the same but <em>K</em> is better.</p>
 <p>A related theme I find is equivocating between different kinds of performance. I suspect that for interpreting scalar code K is faster than APL and J but slower than Javascript, and certainly any compiled language. For operations on arrays, maybe it beats Javascript and Java but loses to current Dyalog and tensor frameworks. Simple database queries, Shakti says it's faster than Spark and Postgres but is silent about newer in-memory databases. The most extreme K advocates sweep away all this complexity by comparing K to weaker contenders in each category. Just about any language can be &quot;the best&quot; with this approach.</p>
 <p>Before getting into array-based versus scalar code, here's a simpler case. It's well known that K works on one list at a time, that is, if you have a matrix—in K, a list of lists—then applying an operation (say sum) to each row works on each one independently. If the rows are short then there's function overhead for each one. In APL, J, and BQN, the matrix is stored as one unit with a stride. The sum can use one metadata computation for all rows, and there's usually special code for many row-wise functions. I measured that Dyalog is 30 times faster than ngn/k to sum rows of a ten-million by three float (double) matrix, for one fairly representative example. It's fine to say—as many K-ers do—that these cases don't matter or can be avoided in practice; it's dishonest (or ignorant) to claim they don't exist.</p>
 <h2 id="scalar-versus-array-code"><a class="header" href="#scalar-versus-array-code">Scalar versus array code</a></h2>
 <p>I have a suspicion that users sometimes think K is faster than APL because they try out a Fibonacci function or other one-number-at-a-time code. Erm, your boat turns faster than a battleship, congratulations? <em>Python</em> beats these languages at interpreted performance. By like a factor of five. The only reason for anyone to think this is relevant is if they have a one-dimensional model where J is &quot;better&quot; than Python, so K is &quot;better&quot; than both.</p>
 <p>Popular APL and J implementations interpret source code directly, without even building an AST. This is very slow, and Dyalog has several other pathologies that get in the way as well. Like storing the execution stack in the workspace to prevent stack overflows, and the requirement that a user can save a workspace with paused code and resume it <em>in a later version</em>. But the overhead is per token executed, and a programmer can avoid the cost by working on large arrays where one token does a whole lot of work. If you want to show a language is faster than APL generally, this is the kind of code to look at.</p>
-<p>K is designed to be as fast as possible when interpreting scalar code, for example using a <a href="https://k.miraheze.org/wiki/Grammar">grammar</a> that's much simpler than <a href="../spec/grammar.html">BQN's</a> (speed isn't the only benefit of being simpler of course, but it's clearly a consideration). It succeeds at this, and K interpreters are very fast, even without bytecode compilation in advance.</p>
-<p>But K still isn't good at scalar code! It's an interpreter (if a good one) for a dynamically-typed language, and will be slower than compiled languages like C and Go, or JIT-compiled ones like Javascript and Java. A compiler generates code to do what you want, while an interpreter is code that reads data (the program) to do what you want. Once the code is compiled, the interpreter has an extra step and <em>has</em> to be slower. This is why BQN uses compiler-based strategies to speed up execution, first compiling to bytecode to make syntax overhead irrelevant and then usually post-processing that bytecode. Compilation is fast enough that it's perfectly fine to compile code every time it's run.</p>
+<p>K's design is well-suited to interpreting scalar code because of its simplicity: for example, it has only one kind of user-defined function and doesn't allow lexical closures. Its <a href="https://k.miraheze.org/wiki/Grammar">grammar</a> is much simpler than <a href="../spec/grammar.html">BQN's</a>, although <a href="https://news.ycombinator.com/item?id=6118565">it seems</a> that K uses bytecode compilation (ngn/k certainly does), making this irrelevant after one quick compilation pass. As a result, K interpreters can be very fast.</p>
+<p>But K still isn't good at scalar code! It's an interpreter (if a good one) for a dynamically-typed language, and will be slower than compiled languages like C and Go, or JIT-compiled ones like Javascript and Java. A compiler generates code to do what you want, while an interpreter (including a bytecode VM) is code that reads data (the program) to do what you want. Once the code is compiled, the interpreter has an extra step and <em>has</em> to be slower. This is why BQN uses compiler-based strategies to speed up execution, first compiling to bytecode and then usually post-processing that bytecode. Compilation is fast enough that it's perfectly fine to compile code every time it's run.</p>
+<h2 id="parallel-execution"><a class="header" href="#parallel-execution">Parallel execution</a></h2>
+<p>As of 2020, Q supports <a href="https://code.kx.com/q/kb/mt-primitives/">multithreaded primitives</a> that can run on multiple CPU cores. I think Shakti supports multi-threading as well. Oddly enough, J user Monument AI has also been working on their own parallel <a href="https://www.monument.ai/m/parallel">J engine</a>. So array languages are finally moving to multiple cores (the reason this hasn't happened sooner is probably that array language users often have workloads where they can run one instance on each core, which is easier and tends to be faster than splitting one run across multiple cores). It's interesting, and a potential reason to use K or Q, although it's too recent to be part of the &quot;K is fastest&quot; mythos. Not every K claim is a wild one!</p>
 <h2 id="instruction-cache"><a class="header" href="#instruction-cache">Instruction cache</a></h2>
 <p>A more specific claim about K is that the key to its speed is that the interpreter, or some part of it, fits in L1 cache. I know Arthur Whitney himself has said this; I can't find that now but <a href="https://kx.com/blog/what-makes-time-series-database-kdb-so-fast/">here</a>'s some material from KX about the &quot;L1/2 cache&quot;. Maybe this was a relevant factor in the early days of K around 2000—I'm doubtful. In the 2020s it's ridiculous to say that instruction caching matters.</p>
 <p>Let's clarify terms first. The CPU cache is a set of storage areas that are smaller and faster than RAM; memory is copied there when it's used so it will be faster to access it again later. L1 is the smallest and fastest level. On a typical CPU these days it might consist of 64KB of <em>data</em> cache for memory to be read and written, and 64KB of <em>instruction</em> cache for memory to be executed by the CPU. When I've seen it the L1 cache claim is specifically about the K interpreter (and not the data it works with) fitting in the cache, so it clearly refers to the instruction cache.</p>
@@ -70,11 +73,11 @@
 </pre>
 <p>Dividing the stall number by total cycles gives us percentage of program time that can be attributed to L1 instruction misses.</p>
 <a class="replLink" title="Open in the REPL" target="_blank" href="https://mlochbaum.github.io/BQN/try.html#code=Ikoi4oC/IkJRTiLigL8iUHl0aG9uIiDiiY3LmCAxMDAgw5cgNTbigL81LjTigL8yNSDDtyAxXzQ1N+KAvzI0MeKAvzQ5OQ==">↗️</a><pre>    <span class='String'>&quot;J&quot;</span><span class='Ligature'>‿</span><span class='String'>&quot;BQN&quot;</span><span class='Ligature'>‿</span><span class='String'>&quot;Python&quot;</span> <span class='Function'>≍</span><span class='Modifier'>˘</span> <span class='Number'>100</span> <span class='Function'>×</span> <span class='Number'>56</span><span class='Ligature'>‿</span><span class='Number'>5.4</span><span class='Ligature'>‿</span><span class='Number'>25</span> <span class='Function'>÷</span> <span class='Number'>1_457</span><span class='Ligature'>‿</span><span class='Number'>241</span><span class='Ligature'>‿</span><span class='Number'>499</span>
-┌─                             
-╵ "J"      3.8435140700068633  
-  "BQN"    2.240663900414938   
-  "Python" 5.01002004008016    
-                              ┘
+┌─                            
+╵ "J"      3.843514070006863  
+  "BQN"    2.240663900414938  
+  "Python" 5.01002004008016   
+                             ┘
 </pre>
 <p>So, roughly 4%, 2%, and 5%. The cache miss counts are also broadly in line with these numbers. Note that full cache misses are pretty rare, so that most misses just hit L2 or L3 and don't suffer a large penalty. Also note that instruction cache misses are mostly lower than data misses, as expected.</p>
 <p>Don't get me wrong, I'd love to improve performance even by 2%. But it's not exactly world domination, is it? And it doesn't matter how cache-friendly K is, that's the absolute limit.</p>
diff --git a/implementation/kclaims.md b/implementation/kclaims.md
index 20d40646..1eee6966 100644
--- a/implementation/kclaims.md
+++ b/implementation/kclaims.md
@@ -10,11 +10,13 @@ The points I argue here are narrow. To some extent I'm picking out the craziest
 
 On that note, it's possible *I've* made mistakes, such as incorrectly designing or interpreting benchmarks. If you present me with concrete evidence against something I wrote below, I promise I'll revise this page to include it, even if I just have to quote verbatim because I don't understand a word of it.
 
+This page has now been [discussed](https://news.ycombinator.com/item?id=28365645) on Hacker News, and I have amended and added information based on comments there and elsewhere.
+
 ## The fastest array language
 
 When you ask what the fastest array language is, chances are someone is there to answer one of k, kdb, or q. I can't offer benchmarks that contradict this, but I will argue that there's little reason to take these people at their word.
 
-The reason I have no measurements is that every contract for a commercial K includes an anti-benchmark clause. For example, Shakti's [license](https://shakti.com/download/license) says users cannot "distribute or otherwise make available to any third party any report regarding the performance of the Software benchmarks or any information from such a report". As I would be unable to share the results, I have not taken benchmarks of any commercial K. Or downloaded one for that matter. Shakti could publish benchmarks; they choose to publish a handful of comparisons with database software and none with array languages or frameworks. I do run tests with [ngn/k](https://codeberg.org/ngn/k), which is developed with goals similar to Whitney's K; the author says it's slower than Shakti but not by too much.
+The reason I have no measurements is that every contract for a commercial K includes an anti-benchmark clause. For example, Shakti's [license](https://shakti.com/download/license) says users cannot "distribute or otherwise make available to any third party any report regarding the performance of the Software benchmarks or any information from such a report". This ["DeWitt Clause"](https://en.wikipedia.org/wiki/DeWitt_Clause) is common in many databases. It takes much more work to refute bad benchmarks than to produce them so perhaps in the high-stakes DBMS world it's not as crazy as it seems—even though it's a terrible outcome. As I would be unable to share the results, I have not taken benchmarks of any commercial K. Or downloaded one for that matter. So I'm limited to a small number of approved benchmarks that have been published, which focus on performance as a database and not a programming language. I also run tests with [ngn/k](https://codeberg.org/ngn/k), which is developed with goals similar to Whitney's K; the author says it's slower than Shakti but not by too much.
 
 The primary reason I don't give any credence to claims that K is the best is that they are always devoid of specifics. Most importantly, the same assertion is made across decades even though performance in J, Dyalog, and NumPy has improved by leaps and bounds in the meantime—I participated in advances of [26%](https://www.dyalog.com/dyalog/dyalog-versions/170/performance.htm) and [10%](https://www.dyalog.com/dyalog-versions/180/performance.htm) in overall Dyalog benchmarks in the last two major versions. Has K4 (the engine behind kdb and Q) kept pace? Maybe it's fallen behind since Arthur left but Shakti K is better? Which other array languages has the poster used? Doesn't matter—*they* are all the same but *K* is better.
 
@@ -28,9 +30,13 @@ I have a suspicion that users sometimes think K is faster than APL because they
 
 Popular APL and J implementations interpret source code directly, without even building an AST. This is very slow, and Dyalog has several other pathologies that get in the way as well. Like storing the execution stack in the workspace to prevent stack overflows, and the requirement that a user can save a workspace with paused code and resume it *in a later version*. But the overhead is per token executed, and a programmer can avoid the cost by working on large arrays where one token does a whole lot of work. If you want to show a language is faster than APL generally, this is the kind of code to look at.
 
-K is designed to be as fast as possible when interpreting scalar code, for example using a [grammar](https://k.miraheze.org/wiki/Grammar) that's much simpler than [BQN's](../spec/grammar.md) (speed isn't the only benefit of being simpler of course, but it's clearly a consideration). It succeeds at this, and K interpreters are very fast, even without bytecode compilation in advance.
+K's design is well-suited to interpreting scalar code because of its simplicity: for example, it has only one kind of user-defined function and doesn't allow lexical closures. Its [grammar](https://k.miraheze.org/wiki/Grammar) is much simpler than [BQN's](../spec/grammar.md), although [it seems](https://news.ycombinator.com/item?id=6118565) that K uses bytecode compilation (ngn/k certainly does), making this irrelevant after one quick compilation pass. As a result, K interpreters can be very fast.
+
+But K still isn't good at scalar code! It's an interpreter (if a good one) for a dynamically-typed language, and will be slower than compiled languages like C and Go, or JIT-compiled ones like Javascript and Java. A compiler generates code to do what you want, while an interpreter (including a bytecode VM) is code that reads data (the program) to do what you want. Once the code is compiled, the interpreter has an extra step and *has* to be slower. This is why BQN uses compiler-based strategies to speed up execution, first compiling to bytecode and then usually post-processing that bytecode. Compilation is fast enough that it's perfectly fine to compile code every time it's run.
+
+## Parallel execution
 
-But K still isn't good at scalar code! It's an interpreter (if a good one) for a dynamically-typed language, and will be slower than compiled languages like C and Go, or JIT-compiled ones like Javascript and Java. A compiler generates code to do what you want, while an interpreter is code that reads data (the program) to do what you want. Once the code is compiled, the interpreter has an extra step and *has* to be slower. This is why BQN uses compiler-based strategies to speed up execution, first compiling to bytecode to make syntax overhead irrelevant and then usually post-processing that bytecode. Compilation is fast enough that it's perfectly fine to compile code every time it's run.
+As of 2020, Q supports [multithreaded primitives](https://code.kx.com/q/kb/mt-primitives/) that can run on multiple CPU cores. I think Shakti supports multi-threading as well. Oddly enough, J user Monument AI has also been working on their own parallel [J engine](https://www.monument.ai/m/parallel). So array languages are finally moving to multiple cores (the reason this hasn't happened sooner is probably that array language users often have workloads where they can run one instance on each core, which is easier and tends to be faster than splitting one run across multiple cores). It's interesting, and a potential reason to use K or Q, although it's too recent to be part of the "K is fastest" mythos. Not every K claim is a wild one!
 
 ## Instruction cache
author	Marshall Lochbaum <mwlochbaum@gmail.com>	2021-08-31 12:04:36 -0400
committer	Marshall Lochbaum <mwlochbaum@gmail.com>	2021-08-31 12:04:36 -0400
commit	500615cf09aaa009b7f4eac34a3e53982ad7c792 (patch)
tree	4ad69760daefae3fad330ab1410de9256f95eb3d
parent	505715a37a0f845029758bc16955149c800850c9 (diff)