Reorganize Replicate notes; better treatment of scan-based methods

author: Marshall Lochbaum <mwlochbaum@gmail.com> 2022-09-21 21:54:23 -0400
committer: Marshall Lochbaum <mwlochbaum@gmail.com> 2022-09-21 21:54:23 -0400
commit: 571f307a396ae52f23e996e89db3e36d1f939cea (patch)
tree: 5c1d7cd9b49b9519d6e25f2e3b5c1e2f54198cc4 /docs/implementation
parent: 555adb1ae538013bb220df56451e4255716b9d18 (diff)
1 files changed, 39 insertions, 14 deletions
diff --git a/docs/implementation/primitive/replicate.html b/docs/implementation/primitive/replicate.html
index f0c27952..2eba512e 100644
--- a/docs/implementation/primitive/replicate.html
+++ b/docs/implementation/primitive/replicate.html
@@ -6,33 +6,58 @@
 <div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a> / <a href="index.html">primitive</a></div>
 <h1 id="implementation-of-indices-and-replicate"><a class="header" href="#implementation-of-indices-and-replicate">Implementation of Indices and Replicate</a></h1>
 <p>The replicate family of functions contains not just primitives but powerful tools for implementing other functionality. The most important is converting <a href="#booleans-to-indices">bits to indices</a>: AVX-512 extensions implement this natively for various index sizes, and even with no SIMD support at all there are surprisingly fast table-based algorithms for it.</p>
-<p><a href="#replicate">General replication</a> is more complex. Branching will slow many useful cases down considerably when using the obvious solution. However, branch-free techniques introduce overhead for larger replication amounts. Hybridizing these seems to be the only way, but it's finicky.</p>
+<p><a href="#replicate">General replication</a> is more complex. The main enemy is branching, but there are reasonable approaches.</p>
 <p>Replicate by a <a href="#constant-replicate">constant amount</a> (so <code><span class='Value'>𝕨</span></code> is a single number) is not too common in itself, but it's notable because it can be the fastest way to implement outer products and arithmetic with prefix agreement.</p>
+<table>
+<thead>
+<tr>
+<th>Normal</th>
+<th>Boolean</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td><a href="#indices">Indices</a></td>
+<td><a href="#where">Where</a></td>
+</tr>
+<tr>
+<td><a href="#replicate">Replicate</a></td>
+<td><a href="#compress">Compress</a></td>
+</tr>
+<tr>
+<td>(<a href="#constant-replicate">by constant</a>)</td>
+<td></td>
+</tr>
+</tbody>
+</table>
 <h2 id="indices"><a class="header" href="#indices">Indices</a></h2>
-<p>Branchless algorithms are fastest, but with unbounded values in <code><span class='Value'>𝕨</span></code> a fully branchless algorithm is impossible because you can't write an arbitrary amount of memory without branching. So the best algorithms depend on bounding <code><span class='Value'>𝕨</span></code>. Fortunately the most useful case is that <code><span class='Value'>𝕨</span></code> is boolean.</p>
+<p>Because it's somewhat simpler to discuss, we'll begin with the case <code><span class='Function'>/</span><span class='Value'>𝕩</span></code> where <code><span class='Value'>𝕩</span></code> has an integer type (the boolean case is discussed <a href="#compress">below</a>). The obvious C loop works fine when the average of <code><span class='Value'>𝕩</span></code> is large enough, because it auto-vectorizes to write many values at a time. When the average is smaller, this vectorization becomes less effective, but the main problem is branching, which takes many cycles for each element in <code><span class='Value'>𝕩</span></code> if the values aren't predictable.</p>
+<p>Indices is half of a <a href="sort.html#distribution-sorts">counting sort</a>: for sparse values, it's the slower half. Making it fast makes counting sort viable for much larger range-to-length ratios.</p>
+<p>I know two main ways to tackle the branching problem. The elegant way is a three-pass method computing <code><span class='Function'>+</span><span class='Modifier'>`</span><span class='Function'>/</span><span class='Modifier'>⁼</span><span class='Function'>+</span><span class='Modifier'>`</span><span class='Value'>𝕩</span></code>. First, zero out the result array. Then traverse <code><span class='Value'>𝕩</span></code> with a running sum index and increment the result value at that index at each step. Then sum the result. Somehow C compilers still don't know how to vectorize a prefix sum so you'll need to do it manually for best performance. Three passes is bad for caching so this method needs to be done in blocks to work well for large arrays. A slightly faster variation is that instead of incrementing you can write indices and take a max-scan <code><span class='Function'>⌈</span><span class='Modifier'>`</span></code> at the end.</p>
+<p>The other way is to try to make the lengths less variable by rounding up. Later writes will overwrite earlier ones anyway. This gets messy. If the maximum value in <code><span class='Value'>𝕩</span></code> is, say, 8, then generating indices is fairly fast: for each element, write 8 indices and then move the output pointer forward by that much. But if it's not bounded (and why would it be?) you'll end up with gaps. You could just accept some branching and write 8 more indices. You could also use a sparse <em>where</em> algorithm to get the indices of large elements in <code><span class='Value'>𝕩</span></code>, and do the long writes for those either before or after the short ones. Overall I'm kind of skeptical of these approaches here. However, they are definitely a valid approach to constant Replicate, where <code><span class='Value'>𝕨</span></code> is inherently bounded.</p>
+<h2 id="replicate"><a class="header" href="#replicate">Replicate</a></h2>
+<p>Most techniques for Indices can be adapted to Replicate, and the same considerations about branching apply.</p>
+<p>An additional approach that becomes available is essentially <code><span class='Function'>/</span><span class='Modifier2'>⊸</span><span class='Function'>⊏</span></code>: apply Indices to portions of <code><span class='Value'>𝕨</span></code> with the result in a temporary buffer, and select to produce the result. With small enough sections you can use 8-bit indices which can save time. As far as I can tell this method isn't an improvement for Replicate but is for the boolean case, Compress.</p>
+<p>The running sum method needs to be modified slightly: instead of incrementing result values by one always, add the difference between the current value in <code><span class='Value'>𝕩</span></code> and the previous one. It's possible to use xor instead of addition and subtraction but it shouldn't ever make much of a difference to performance. In the boolean case xor-ing trailing bits instead of single bits allows part of an xor-scan to be skipped; see <a href="https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrinking-time/">Expanding Bits in Shrinking Time</a>.</p>
+<h3 id="constant-replicate"><a class="header" href="#constant-replicate">Constant replicate</a></h3>
+<p>The case where <code><span class='Value'>𝕨</span></code> is constant is useful for outer products and leading-axis extension, where elements of one argument need to be repeated a few times. This connection is also discussed in <a href="https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrinking-time/">Expanding Bits</a>.</p>
+<p>The same approaches apply, but the branches in the branching ones become a lot more predictable. So the obvious loops are now okay instead of bad even for small values. C compilers will generate decent code for constant small numbers as well, but I think they're still not as good as specialized code with shuffle, and can sometimes be beaten by scan-based methods.</p>
+<h2 id="booleans"><a class="header" href="#booleans">Booleans</a></h2>
+<p>The case where the replication amount is boolean is called Where or Compress based on older APL names for these functions before Replicate was extended to natural numbers.</p>
+<p>When the amounts to replicate are natural numbers you're pretty much stuck going one at a time. With booleans there are huge advantages to doing bytes or larger units at once. This tends to lead to an outer replicate-like pattern where the relevant amount is the <em>sum</em> of a group of booleans, as well as an inner pattern based on the individual 0s and 1s.</p>
 <h3 id="booleans-to-indices"><a class="header" href="#booleans-to-indices">Booleans to indices</a></h3>
 <p>Indices (<code><span class='Function'>/</span></code>) on a boolean <code><span class='Value'>𝕩</span></code> of 256 or fewer bits can be made very fast on generic 64-bit hardware using a lookup table on 8 bits at a time. This algorithm can write past the end by up to 8 bytes (7 if trailing 0s are excluded), but never writes more than 256 bytes total. This means it's suitable for writing to an overallocated result array or a 256-byte buffer.</p>
 <p>To generate indices, use a 256×8-byte lookup table that goes from bytes to 8-byte index lists, and either a popcount instruction or another lookup table to get the sum of each byte. For each byte in <code><span class='Value'>𝕨</span></code>, get the corresponding indices, add an increment, and write them to the current index in the output. Then increase the output index by the byte's sum. The next indices will overlap the 8 bytes written, with the actual indices kept and junk values at the end overwritten. The increment added is an 8-byte value where each byte contains the current input index (always a multiple of 8); it can be added or bitwise or-ed with the lookup value.</p>
 <p>Some other methods are discussed by <a href="https://branchfree.org/2018/05/22/bits-to-indexes-in-bmi2-and-avx-512/">Langdale</a> and <a href="https://lemire.me/blog/2018/03/08/iterating-over-set-bits-quickly-simd-edition/">Lemire</a>. I think very large lookup tables are not good for an interpreter because they cause too much cache pressure if used occasionally on smaller arrays. This rules out many of these strategies.</p>
-<h3 id="non-booleans-to-indices"><a class="header" href="#non-booleans-to-indices">Non-booleans to indices</a></h3>
-<p>If the maximum value in <code><span class='Value'>𝕩</span></code> is, say, 8, then generating indices is fairly fast: for each element, write 8 indices and then move the output pointer forward by that much. This is much like the lookup table algorithm above, minus the lookup table. If the indices need to be larger than one byte, it's fine to expand them, and possibly add an offset, after generation (probably in chunks).</p>
-<p>There are two ways I know to fill in the gaps that this method would leave with elements that are too large. First is to stop after such an element and fill remaining space branchfully (maybe with <code><span class='Value'>memset</span></code>). This is maximally efficient if <code><span class='Value'>𝕩</span></code> is dominated by large elements—particularly for 2-byte indices when it skips index expansion—but not good if there are a lot of elements near the threshold. Second, initialize the buffer with 0 and perform <code><span class='Function'>⌈</span><span class='Modifier'>`</span></code> afterwards, or other variations. This eliminates all but a fixed amount of branching, but it's a lot of overhead and I think unless a more sophisticated strategy arises it's best to stick with the first method.</p>
-<p>Indices is half of a counting sort: for sparse values, it's the slower half. Making it fast makes counting sort viable for much larger range-to-length ratios.</p>
-<h2 id="replicate"><a class="header" href="#replicate">Replicate</a></h2>
-<p>For the most part, understanding Indices is the best way to implement Replicate quickly. But this is not the case if <code><span class='Value'>𝕩</span></code> is boolean because then its elements are smaller than any useful index, and faster methods are available.</p>
 <h3 id="compress"><a class="header" href="#compress">Compress</a></h3>
 <p>Most of the methods listed below can be performed in place.</p>
 <p>For booleans, use BMI2's PEXT (parallel bits extract) instruction, or an emulation of it. The result can be built recursively alongside the also-required popcount using masked shifts.</p>
-<p>The generally best method for small elements seems to be to generate 1-byte indices into a buffer 256 at a time and select with those. There's a branchless method on one bit at a time which is occasionally better, but I don't think the improvement is enough to justify using it.</p>
+<p>A good general method is to generate 1-byte indices into a buffer 256 at a time and select with those. There's a branchless method on one bit at a time which is occasionally better, but I don't think the improvement is enough to justify using it.</p>
 <p>For 1- and 2-byte elements, a shuffle-based solution is a substantial improvement, if a vector shuffle is available. AVX-512 has compresses on several sizes built-in.</p>
 <p>Odd-sized cells could be handled with an index buffer like small elements, using oversized writes and either overallocating or handling the last element specially.</p>
 <p>For medium-sized cells copying involves partial writes and so is somewhat inefficient. It's better to split <code><span class='Value'>𝕨</span></code> into groups of 1s in order to copy larger chunks from <code><span class='Value'>𝕩</span></code> at once. So the algorithm repeatedly searches <code><span class='Value'>𝕨</span></code> for the next 1, then the next 0, then copies the corresponding value from <code><span class='Value'>𝕩</span></code> to the result. This might be better for small odd-sized cells as well; I haven't implemented the algorithm with oversized writes to compare.</p>
 <p>The grouped algorithm, as well as a simpler sparse algorithm that just finds each 1 in <code><span class='Value'>𝕨</span></code>, can also be better for small elements. Whether to use these depends on the value of <code><span class='Function'>+</span><span class='Modifier'>´</span><span class='Value'>𝕨</span></code> (sparse) or <code><span class='Function'>+</span><span class='Modifier'>´</span><span class='Function'>»</span><span class='Modifier2'>⊸</span><span class='Function'>&lt;</span><span class='Value'>𝕨</span></code> (clumped). The checking is fast and these cases are common, but the general case is also fast enough that this is not a particularly high priority.</p>
-<h3 id="replicate"><a class="header" href="#replicate">Replicate</a></h3>
-<p>Like Compress I think the best algorithm is often to generate small indices in a buffer and then select. But this is inefficient when <code><span class='Value'>𝕨</span></code> contains large values, so those need to be detected and handled. Very tricky.</p>
-<h4 id="constant-replicate"><a class="header" href="#constant-replicate">Constant replicate</a></h4>
-<p>Useful for outer products and leading-axis extension. See <a href="https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrinking-time/">Expanding Bits in Shrinking Time</a> for the boolean case. C compilers will generate decent code for constant small numbers and variable large ones, but I think specialized code with shuffle would be better for small numbers.</p>
-<h3 id="higher-ranks"><a class="header" href="#higher-ranks">Higher ranks</a></h3>
+<h2 id="higher-ranks"><a class="header" href="#higher-ranks">Higher ranks</a></h2>
 <p>When replicating along the first axis only, additional axes only change the element size (these are the main reason why a large element method is given). Replicating along a later axis offers a few opportunities for improvement relative to replicating each cell individually.</p>
 <p>Particularly for boolean <code><span class='Value'>𝕨</span></code>, Select is usually faster than Replicate (a major exception is for a boolean <code><span class='Value'>𝕩</span></code>). Simply replacing <code><span class='Function'>/</span></code> with <code><span class='Function'>/</span><span class='Modifier'>¨</span><span class='Modifier2'>⊸</span><span class='Function'>⊏</span></code> (after checking conformability) could be an improvement. It's probably best to compute the result shape first to avoid doing any work if it's empty. Similarly, if early result axes are small then the overhead of separating out Indices might make it worse than just doing the small number of Replicates.</p>
 <p>A technique when <code><span class='Value'>𝕨</span></code> is processed one or more bytes at a time, and applies to many rows, is to repeat it up to an even number of bytes and combine rows of <code><span class='Value'>𝕩</span></code> into longer virtual rows (the last one can be short). I think this only ends up being useful when <code><span class='Value'>𝕩</span></code> is boolean.</p>
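
Illustration of the scan-based method from the new Indices and Replicate paragraphs. This is a minimal C sketch, not code from the BQN sources: the function names `indices_scan` and `replicate_scan`, the `uint32_t` element type, and the convention that the caller supplies `total` (the sum of the counts) and a result buffer with `total + 1` slots are all assumptions made for the example. The extra slot absorbs the mark written at position `total`, so the marking pass needs no bounds check. The prefix sums are plain scalar loops; as the text notes, a fast version would vectorize them by hand and work in blocks.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Indices (/x) in three passes: zero, mark, prefix-sum.
   x: n counts; r: buffer with total + 1 slots, where total is the sum of x.
   (Names and the total + 1 convention are hypothetical, for illustration.) */
static void indices_scan(const uint32_t *x, size_t n, uint32_t *r, size_t total) {
    /* Pass 1: zero the result, including the spare slot at r[total]. */
    memset(r, 0, (total + 1) * sizeof *r);

    /* Pass 2: walk x with a running sum and increment the result there.
       After adding x[i], pos is where the run for element i + 1 starts. */
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        pos += x[i];
        r[pos]++;          /* pos <= total, so this stays inside the buffer */
    }

    /* Pass 3: prefix sum turns the marks into indices. */
    uint32_t acc = 0;
    for (size_t j = 0; j < total; j++) {
        acc += r[j];
        r[j] = acc;
    }
}

/* Replicate (w/x) with the same structure: instead of adding 1 at each run
   start, add the difference between the current value and the previous one.
   Unsigned arithmetic wraps, so decreasing values are handled correctly. */
static void replicate_scan(const uint32_t *w, const uint32_t *x, size_t n,
                           uint32_t *r, size_t total) {
    memset(r, 0, (total + 1) * sizeof *r);

    size_t pos = 0;
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        r[pos] += x[i] - prev;   /* mark the start of the run for x[i] */
        prev = x[i];
        pos += w[i];
    }

    uint32_t acc = 0;
    for (size_t j = 0; j < total; j++) {
        acc += r[j];
        r[j] = acc;
    }
}
```

The marking loop has no data-dependent branch: a zero count simply makes two marks land on the same slot, and the prefix sum adds them together.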
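
Illustration of the table-based "Booleans to indices" paragraph (unchanged context in this patch). Again a hedged sketch rather than the actual implementation: `init_tables`, `where_bytes`, `idx_tab` and `cnt_tab` are hypothetical names, the bit order (bit j of byte i is element 8*i + j) is an assumption, and a second lookup table stands in for the popcount instruction so the code stays portable. The input is limited to 32 bytes (256 bits) so every index fits in one byte, and the output buffer must hold 256 bytes because each step stores a full 8 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint8_t idx_tab[256][8]; /* positions of set bits per byte value, zero-padded */
static uint8_t cnt_tab[256];    /* number of set bits per byte value */

/* Build both tables once before use. */
static void init_tables(void) {
    for (int b = 0; b < 256; b++) {
        int k = 0;
        memset(idx_tab[b], 0, 8);
        for (int j = 0; j < 8; j++)
            if ((b >> j) & 1) idx_tab[b][k++] = (uint8_t)j;
        cnt_tab[b] = (uint8_t)k;
    }
}

/* bits: nbytes <= 32 input bytes; out: buffer of at least 256 bytes.
   Returns the number of valid indices at the start of out. */
static size_t where_bytes(const uint8_t *bits, size_t nbytes, uint8_t *out) {
    size_t pos = 0;
    for (size_t i = 0; i < nbytes; i++) {
        uint64_t v;
        memcpy(&v, idx_tab[bits[i]], 8);
        /* Add the input offset 8*i to every byte lane. No lane exceeds
           7 + 248 = 255, so no carry crosses between lanes. */
        v += (uint64_t)(8 * i) * 0x0101010101010101ULL;
        memcpy(out + pos, &v, 8);   /* always write 8 bytes; junk at the end */
        pos += cnt_tab[bits[i]];    /* advance by the number actually kept */
    }
    return pos;
}
```

Each store overlaps the junk tail of the previous one, so only the real indices survive. Because the lanes never carry into each other, the result does not depend on byte order.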