From ccaa196d6cfa19b9e7a49b1572f0608bb085c1e8 Mon Sep 17 00:00:00 2001
From: Marshall Lochbaum <mwlochbaum@gmail.com>
Date: Thu, 19 Jan 2023 20:38:06 -0500
Subject: Performance updates

---
 docs/implementation/perf.html | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/docs/implementation/perf.html b/docs/implementation/perf.html
index a949429f..76241e34 100644
--- a/docs/implementation/perf.html
+++ b/docs/implementation/perf.html
@@ -5,14 +5,14 @@
 </head>
 <div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../index.html">BQN</a> / <a href="index.html">implementation</a></div>
 <h1 id="how-does-bqn-perform"><a class="header" href="#how-does-bqn-perform">How does BQN perform?</a></h1>
-<p>How fast is the performance-oriented BQN implementation, <a href="https://github.com/dzaima/CBQN">CBQN</a>? I must ask, why do you care? People are out there looking for the fastest array language before they've tried any one to see if it works for them. Which is kind of strange: most programs have a point where they are just fast enough, and CPUs have gotten pretty good at reaching that point. When that's not true, there's often a concentrated slow part that's easily handed off to a specialized tool like LAPACK. Regardless, a laser focus on performance from the beginning will cause you to miss the fast solutions you'd find by deeply understanding the problem. Start with clean code in the most expressive language to work out strategy, and begin thinking about tactics once you know when and how the performance falls short. Without this understanding, benchmarks are just a dick measuring contest. And it's not even your own dick. It's public, you're just using it.</p>
-<p>Anyway, BQN's dick is pretty fast. Compiles its own compiler in 3ms. Builds this whole site—a megabyte or so of markdown—in a second and a half. Lists the primes under a hundred million in a second. That sort of thing. For CBQN right now, performance splits into three major cases:</p>
+<p>How fast is the performance-oriented BQN implementation, <a href="https://github.com/dzaima/CBQN">CBQN</a>? I must ask, why do you care? People are out there looking for the fastest array language before they've tried any one to see if it works for them. Fact is, most programs have a point where they are just fast enough, and CPUs have gotten pretty good at reaching that point. Or maybe there's a concentrated slow part that's easily handed off to a specialized tool like LAPACK. No matter what, a laser focus on performance from the beginning will cause you to miss the fast solutions you'd find if you really understood the problem. So, start with clean code in the most expressive language to work out strategy, and move to tactics once you know when and how the performance falls short. Without this understanding, benchmarks are just a dick measuring contest. It's not even your own dick. It's public, you're just using it.</p>
+<p>Anyway, BQN's dick is pretty fast. Compiles its own compiler in 3ms. Builds this whole site—a megabyte or so of markdown—in a second and a half. Lists the primes under a billion in two seconds. That sort of thing. For CBQN right now, performance splits into three major cases:</p>
 <ul>
 <li>Scalar code, mostly using atoms. CBQN is faster than other array languages and on par with lightweight interpreters (not JIT compilers).</li>
-<li>Flat lists, particularly integers and characters. CBQN is rarely too slow for these and often beats other array languages, as well as idiomatic C.</li>
-<li>Multidimensional arrays. These are slow, but not pathologically so. CBQN has few optimizations for them, and often falls back to the runtime which has implementations using a lot of scalar code.</li>
+<li>Flat lists, particularly integers and characters. CBQN rarely loses to other array languages, and can beat idiomatic C.</li>
+<li>Multidimensional arrays. CBQN has less optimization, and sometimes falls back to the self-hosted runtime which has implementations using a lot of scalar code. These can be slow, but not pathologically so.</li>
 </ul>
-<p>Currently we aim for high performance on a single CPU core, and are focusing on 64-bit x86. CBQN won't use additional cores or a GPU for acceleration. It does make substantial use of x86 vector instructions up to AVX2 (2013) in the Singeli build, and will have more slow cases if built without Singeli. Comparisons are the hardest hit, as they rarely take too long with Singeli but can become a bottleneck without it.</p>
+<p>Currently we aim for high performance on a single CPU core, and are focusing on 64-bit x86 and ARM. CBQN doesn't use additional cores or a GPU for acceleration. The Singeli build does use x86 vector instructions up to AVX2 (2013) if present, and has preliminary support for ARM NEON. Singeli is assumed for the discussion here, and without it there are a few more slow cases, particularly comparisons.</p>
 <h2 id="performance-resources"><a class="header" href="#performance-resources">Performance resources</a></h2>
 <p>The spotty optimization coverage means that it's more accurate to say CBQN can be fast, not that it will be fast. Have to learn how to use it. Definitely ask on the forum if you're having performance troubles so you can find some tricks to use or request improvements.</p>
 <p>There are two measurement tools in the <a href="../spec/system.html#time">time</a> system values. <code><span class='Function'>•MonoTime</span></code> is a high-precision timer for performance measurements; you can take a time before and after some operation or section of a program and subtract them to get a time in seconds (a profiling tool to do this automatically would be nice, but we don't have one). More convenient for small snippets, <code><span class='Modifier'>•_timed</span></code> returns the time to evaluate <code><span class='Function'>𝔽</span><span class='Value'>𝕩</span></code>, averaging over <code><span class='Value'>𝕨</span></code> runs if given. For two-argument functions you can write <code><span class='Value'>w</span><span class='Modifier2'>⊸</span><span class='Function'>F</span><span class='Modifier'>•_timed</span> <span class='Value'>x</span></code> or <code><span class='Function'>F</span><span class='Modifier'>´•_timed</span> <span class='Value'>w</span><span class='Ligature'>‿</span><span class='Value'>x</span></code>. CBQN also has a <code><span class='Paren'>)</span><span class='Value'>time</span></code> command that prints the time taken by an entire expression, not counting compilation time.</p>
@@ -21,11 +21,11 @@
 </span></pre>
 <p>The <a href="https://mlochbaum.github.io/bencharray/pages/summary.html">bencharray</a> tool has a page showing primitive benchmarks with some explanations.</p>
 <h2 id="versus-other-array-languages"><a class="header" href="#versus-other-array-languages">Versus other array languages</a></h2>
-<p>Things get hard when you try to put array languages up next to each other. You can get completely different results depending on what sort of problems you want to solve and how you write code, and all those different results are valid. Because people ask for it, I'll try to give some description for the implementations I'm familiar with. I'm of course biased towards the languages I've worked on, Dyalog and BQN; if nothing else, these tend to prioritize just the features I find important! Note also that the situation can change over time; these comments are from 2022.</p>
+<p>Things get hard when you try to put array languages up next to each other. You can get completely different results depending on what sort of problems you want to solve and how you write code, and all those different results are valid. Because people ask for it, I'll try to give some description for the implementations I'm familiar with. I'm of course biased towards the languages I've worked on, Dyalog and BQN; if nothing else, these tend to prioritize just the features I find important! Note also that the situation can change over time; these comments are from 2023.</p>
 <p>The implementations I use for comparison are Dyalog APL, ngn/k, and J. I don't benchmark against proprietary K implementations because the anti-benchmarking clauses in their licenses would prevent me from sharing the results (discussed <a href="kclaims.html">here</a>).</p>
 <p>Array operations are the way to get the most value out of an array language (<a href="https://aplwiki.com/wiki/Performance">background reading</a>), so these languages tend to focus on them. But BQN tries to be usable in less array-oriented situations as well, and is faster for scalar code in the simple cases I've measured—things like naive Fibonacci or folding with a function that does some arithmetic. Dyalog is uniformly slow on such things, 5–10x worse than BQN. J is a bit better with tacit code and worse with explicit, 3–15x worse than BQN. And I measure ngn/k around 2x worse than BQN. For context, BQN is just slower than LuaJIT with the JIT off (which is still a fast interpreter), and I usually expect it to be about 10x slower than C in cases where C operations are compiling to single instructions (e.g. excluding auto-vectorization).</p>
-<p>I publish BQN benchmarks of array operations in <a href="https://mlochbaum.github.io/bencharray/pages/summary.html">bencharray</a>, which also allows me to compare against J and Dyalog to some extent. I find that in all cases, if BQN is better it's because of fundamental superiority, and if it's worse it's just a case that we're meaning to improve but haven't gotten to yet. Both happen a fair amount. In the best cases BQN can be faster by 2x or more, but these benchmarks have an extreme bias because I tend to benchmark things that dzaima or I are actively working on speeding up. We do sometimes compare translated code in the forum. Dyalog has generally been faster than CBQN when larger array operations are involved, but BQN is also quickly getting new special code, so things may be turning around!</p>
-<p>We've been working on list operations instead of getting into multi-dimensional stuff. Dyalog and J are definitely better at operations that make significant use of higher-rank arrays. BQN can also have some slow cases with booleans or floats.</p>
+<p>I publish BQN benchmarks of array operations in <a href="https://mlochbaum.github.io/bencharray/pages/summary.html">bencharray</a>, and also use it to compare against J and Dyalog. I find that in all cases, if BQN is better it's because of fundamental superiority, and if it's worse it's just a case that we're meaning to improve but haven't gotten to yet. Mostly BQN is ahead, even by 2x or more in many cases. Now, I do tend to benchmark things that dzaima or I are actively working on speeding up, but at this point I've gotten to all the list operations that are important for performance. The slow cases remaining are almost all searching and sorting on larger types, 4-byte integers and floats.</p>
+<p>We've been mainly working on list operations instead of getting into multi-dimensional stuff. Dyalog is definitely better when making significant use of higher-rank arrays. Not sure about J.</p>
 <h2 id="faster-than-c"><a class="header" href="#faster-than-c">Faster than C?</a></h2>
 <p>It's inappropriate to say a language is faster than C. Public indecency kind of stuff. On the other hand, suppose a programmer who can handle both C and BQN solves the same problem in each, and runs the C program with clang or gcc and the BQN one with CBQN. BQN might just finish first.</p>
 <p>I don't mean that it's common! Just, it's not that weird, and could happen to anyone.</p>
-- 
cgit v1.2.3