More Where/Compress implementation notes

author: Marshall Lochbaum <mwlochbaum@gmail.com> 2022-09-24 22:13:05 -0400
committer: Marshall Lochbaum <mwlochbaum@gmail.com> 2022-09-24 22:13:05 -0400
commit: 23889db9006eae3033c4590e8c328aa739322ad1 (patch)
tree: 8aaf03c59c6abb759cd49bc24492e2aaf27c0695 /docs/implementation/primitive
parent: 571f307a396ae52f23e996e89db3e36d1f939cea (diff)
1 files changed, 17 insertions, 13 deletions
diff --git a/docs/implementation/primitive/replicate.html b/docs/implementation/primitive/replicate.html
index 2eba512e..e8b051ab 100644
--- a/docs/implementation/primitive/replicate.html
+++ b/docs/implementation/primitive/replicate.html
@@ -5,9 +5,7 @@
 </head>
 <div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a> / <a href="index.html">primitive</a></div>
 <h1 id="implementation-of-indices-and-replicate"><a class="header" href="#implementation-of-indices-and-replicate">Implementation of Indices and Replicate</a></h1>
-<p>The replicate family of functions contains not just primitives but powerful tools for implementing other functionality. The most important is converting <a href="#booleans-to-indices">bits to indices</a>: AVX-512 extensions implement this natively for various index sizes, and even with no SIMD support at all there are surprisingly fast table-based algorithms for it.</p>
-<p><a href="#replicate">General replication</a> is more complex. The main enemy is branching but there are reasonable approaches.</p>
-<p>Replicate by a <a href="#constant-replicate">constant amount</a> (so <code><span class='Value'>𝕨</span></code> is a single number) is not too common in itself, but it's notable because it can be the fastest way to implement outer products and arithmetic with prefix agreement.</p>
+<p>The replicate family of functions contains not just primitives but powerful tools for implementing other functionality. Most important is the boolean case, which can be used to ignore unwanted values without branching. Replicate by a constant amount (so <code><span class='Value'>𝕨</span></code> is a single number) is not too common in itself, but it's notable because it can be the fastest way to implement outer products and arithmetic with prefix agreement. Fast implementations can be much better than the obvious C code, particularly for the boolean case.</p>
 <table>
 <thead>
 <tr>
@@ -18,7 +16,7 @@
 <tbody>
 <tr>
 <td><a href="#indices">Indices</a></td>
-<td><a href="#where">Where</a></td>
+<td><a href="#booleans-to-indices">Where</a></td>
 </tr>
 <tr>
 <td><a href="#replicate">Replicate</a></td>
@@ -43,20 +41,26 @@
 <p>The case where <code><span class='Value'>𝕨</span></code> is constant is useful for outer products and leading-axis extension, where elements of one argument need to be repeated a few times. This connection is also discussed in <a href="https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrinking-time/">Expanding Bits</a>.</p>
 <p>The same approaches apply, but the branches in the branchless ones become a lot more predictable. So the obvious loops are now okay instead of bad even for small values. C compilers will generate decent code for constant small numbers as well, but I think they're still not as good as specialized code with shuffle, and can sometimes be beaten by scan-based methods.</p>
 <h2 id="booleans"><a class="header" href="#booleans">Booleans</a></h2>
-<p>The case where the replication amount is boolean is called Where or Compress based on older APL names for these functions before Replicate was extended to natural numbers.</p>
-<p>When the amounts to replicate are natural numbers you're pretty much stuck going one at a time. With booleans there are huge advantages to doing bytes or larger units at once. This tends to lead to an outer replicate-like pattern where the relevant amount is the <em>sum</em> of a group of booleans, as well as an inner pattern based on the individual 0s and 1s.</p>
+<p>The case with a boolean replication amount is called Where or Compress, based on APL names for these functions from before Replicate was extended to natural numbers.</p>
+<p>When the amounts to replicate are natural numbers you're pretty much stuck going one at a time. With booleans there are huge advantages to doing bytes or larger units at once. This tends to lead to a two-level model: an outer replicate-like pattern where the relevant amount is the <em>sum</em> of a group of booleans, as well as an inner pattern based on the individual 0s and 1s.</p>
+<p>The standard branchless strategy is to write each result value regardless of whether it should actually be included, then increment the result pointer only if it is. This works best if done in an unrolled loop handling 8 bits at a time. But it's still not competitive with other methods using hardware extensions or lookup tables, so these should be used when there's more implementation time available.</p>
 <h3 id="booleans-to-indices"><a class="header" href="#booleans-to-indices">Booleans to indices</a></h3>
-<p>Indices (<code><span class='Function'>/</span></code>) on a boolean <code><span class='Value'>𝕩</span></code> of 256 or fewer bits can be made very fast on generic 64-bit hardware using a lookup table on 8 bits at a time. This algorithm can write past the end by up to 8 bytes (7 if trailing 0s are excluded), but never writes more than 256 bytes total. This means it's suitable for writing to an overallocated result array or a 256-byte buffer.</p>
+<p>Generally, Compress implementations can be adapted to implement Where, and this tends to be the best way to do it. In the other direction, a Where implementation on 1 or 2 bytes can be used to implement Where or Compress on larger types. To do this, traverse the boolean argument in blocks; for each, get the indices and either add the current offset to them (Where) or use them to select from the argument (Compress). This is close to but not necessarily the fastest method for 4-byte and 8-byte outputs. It's a bargain given the low amount of effort required though.</p>
+<p>Indices from 256 or fewer bits is fastest with hardware extensions as discussed in the next section. But it can be done pretty fast on generic 64-bit hardware using a lookup table on 8 bits at a time. This algorithm can write past the end by up to 8 bytes (7 if trailing 0s are excluded), but never writes more than 256 bytes total. This means it's suitable for writing to an overallocated result array or a 256-byte buffer.</p>
 <p>To generate indices, use a 256×8-byte lookup table that goes from bytes to 8-byte index lists, and either a popcount instruction or another lookup table to get the sum of each byte. For each byte in <code><span class='Value'>𝕨</span></code>, get the corresponding indices, add an increment, and write them to the current index in the output. Then increase the output index by the byte's sum. The next indices will overlap the 8 bytes written, with the actual indices kept and junk values at the end overwritten. The increment added is an 8-byte value where each byte contains the current input index (always a multiple of 8); it can be added or bitwise or-ed with the lookup value.</p>
 <p>Some other methods discussed by <a href="https://branchfree.org/2018/05/22/bits-to-indexes-in-bmi2-and-avx-512/">Langdale</a> and <a href="https://lemire.me/blog/2018/03/08/iterating-over-set-bits-quickly-simd-edition/">Lemire</a>. I think very large lookup tables are not good for an interpreter because they cause too much cache pressure if used occasionally on smaller arrays. This rules out many of these strategies.</p>
 <h3 id="compress"><a class="header" href="#compress">Compress</a></h3>
-<p>Most of the methods listed below can be performed in place.</p>
+<p>Branchless writes and blocked indices are possible methods as discussed above. The ones below use extended instruction sets. Note that older AMD processors do BMI2 operations very slowly. Compress can often be performed in-place.</p>
 <p>For booleans, use BMI2's PEXT (parallel bits extract) instruction, or an emulation of it. The result can be built recursively alongside the also-required popcount using masked shifts.</p>
-<p>A good general method is to generate 1-byte indices into a buffer 256 at a time and select with those. There's a branchless method on one bit at a time which is occasionally better, but I don't think the improvement is enough to justify using it.</p>
-<p>For 1- and 2-byte elements, a shuffle-based solution is a substantial improvement, if a vector shuffle is available. AVX-512 has compresses on several sizes built-in.</p>
-<p>Odd-sized cells could be handled with an index buffer like small elements, using oversized writes and either overallocating or handling the last element specially.</p>
-<p>For medium-sized cells copying involves partial writes and so is somewhat inefficient. It's better to split <code><span class='Value'>𝕨</span></code> into groups of 1s in order to copy larger chunks from <code><span class='Value'>𝕩</span></code> at once. So the algorithm repeatedly searches <code><span class='Value'>𝕨</span></code> for the next 1, then the next 0, then copies the corresponding value from <code><span class='Value'>𝕩</span></code> to the result. This might be better for small odd-sized cells as well; I haven't implemented the algorithm with oversized writes to compare.</p>
-<p>The grouped algorithm, as well as a simpler sparse algorithm that just finds each 1 in <code><span class='Value'>𝕨</span></code>, can also better for small elements. Whether to use these depends on the value of <code><span class='Function'>+</span><span class='Modifier'>´</span><span class='Value'>𝕨</span></code> (sparse) or <code><span class='Function'>+</span><span class='Modifier'>´</span><span class='Function'>»</span><span class='Modifier2'>⊸</span><span class='Function'>&lt;</span><span class='Value'>𝕨</span></code> (clumped). The checking is fast and these cases are common, but the general case is also fast enough that this is not a particularly high priority.</p>
+<p>AVX-512 has compresses on all sizes up to 8 bytes, split across several sub-extensions.</p>
+<p>For 1- and 2-byte elements and no AVX-512 support, there are two methods with similar performance on Intel hardware. Both work with one byte at a time from the boolean argument. One uses BMI2, first using PDEP (1-byte case) or a lookup table (2-byte case) to expand each bit to the result element size, then applies PEXT to that and a word from the compressed argument. The other is to use a 256-entry lookup table where each entry is 8 indices for a vector shuffle instruction, a total of 2KB of data. Shuffling may also be competitive for 4-byte elements but it's no longer automatically the best.</p>
+<h3 id="sparse-compress"><a class="header" href="#sparse-compress">Sparse compress</a></h3>
+<p>When <code><span class='Value'>𝕨</span></code> is sparse (or <code><span class='Value'>𝕩</span></code> for Indices), that is, has a small sum relative to its length, there are methods that lower the per-input cost by doing more work for each output.</p>
+<p>The best known sparse method is to work on a full word of bits. At each step, find the first index with count-trailing-zeros, and then remove that bit with a bitwise and, <code><span class='Value'>w</span> <span class='Value'>&amp;</span><span class='Function'>=</span> <span class='Value'>w</span><span class='Function'>-</span><span class='Number'>1</span></code> in C. However, this method has a loop whose length is the number of 1s in the word, a variable. CPUs are very good at predicting this length in benchmarks, but in practice it's likely to be less predictable! In CBQN it's only used for densities below 1/128, one bit every two words.</p>
+<p>For marginal cases, I found a branchless algorithm that can work on blocks of up to <code><span class='Number'>2</span><span class='Function'>⋆</span><span class='Number'>11</span></code> elements. The idea is to split each word into a few segments, and write the bits and relative offset for each segment to the appropriate position in the result of a zeroed buffer. Then traverse the buffer, maintaining bits and a cumulative offset. At each step, the index is obtained from those bits with count-trailing-zeros just as in the branching algorithm. The bits will all be removed exactly when the next segment is reached, so new values from the buffer can be incorporated just by adding them.</p>
+<h3 id="grouped-compress"><a class="header" href="#grouped-compress">Grouped compress</a></h3>
+<p>The sparse method can also be adapted to find groups of 1s instead of individual 1s, by searching for the first 1 and then the first 0 after that. This is useful if <code><span class='Value'>𝕨</span></code> changes value rarely, that is, if <code><span class='Function'>+</span><span class='Modifier'>´</span><span class='Function'>»</span><span class='Modifier2'>⊸</span><span class='Function'>&lt;</span><span class='Value'>𝕨</span></code> is small. Computing this value can be expensive so it's best to compute the threshold first, then update it in blocks and stop if it exceeds the threshold.</p>
+<p>For copying medium-sized cells with memcpy, all the branching here is pretty cheap relative to the actual operation, and it may as well be used all the time. This may not be true for smaller cells copied with overwriting, but I haven't implemented overwriting so I'm not sure.</p>
 <h2 id="higher-ranks"><a class="header" href="#higher-ranks">Higher ranks</a></h2>
 <p>When replicating along the first axis only, additional axes only change the element size (these are the main reason why a large element method is given). Replicating along a later axis offers a few opportunities for improvement relative to replicating each cell individually.</p>
 <p>Particularly for boolean <code><span class='Value'>𝕨</span></code>, Select is usually faster than Replicate (a major exception is for a boolean <code><span class='Value'>𝕩</span></code>). Simply replacing <code><span class='Function'>/</span></code> with <code><span class='Function'>/</span><span class='Modifier'>¨</span><span class='Modifier2'>⊸</span><span class='Function'>⊏</span></code> (after checking conformability) could be an improvement. It's probably best to compute the result shape first to avoid doing any work if it's empty. Similarly, if early result axes are small then the overhead of separating out Indices might make it worse than just doing the small number of Replicates.</p>
author	Marshall Lochbaum <mwlochbaum@gmail.com>	2022-09-24 22:13:05 -0400
committer	Marshall Lochbaum <mwlochbaum@gmail.com>	2022-09-24 22:13:05 -0400
commit	23889db9006eae3033c4590e8c328aa739322ad1 (patch)
tree	8aaf03c59c6abb759cd49bc24492e2aaf27c0695 /docs/implementation/primitive
parent	571f307a396ae52f23e996e89db3e36d1f939cea (diff)