From 7fa6f9077057152e2e6c654df003cb6e0aa97d22 Mon Sep 17 00:00:00 2001
From: Marshall Lochbaum
Date: Sun, 25 Sep 2022 21:55:17 -0400
Subject: Finish discussion of constant Replicate

---
 docs/implementation/primitive/replicate.html | 5 +++--
 implementation/primitive/replicate.md        | 6 ++++--
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/docs/implementation/primitive/replicate.html b/docs/implementation/primitive/replicate.html
index e8b051ab..4342345d 100644
--- a/docs/implementation/primitive/replicate.html
+++ b/docs/implementation/primitive/replicate.html
@@ -39,7 +39,8 @@

The running sum method needs to be modified slightly: instead of always incrementing result values by one, add the difference between the current value in 𝕩 and the previous one. It's possible to use xor instead of addition and subtraction, but it shouldn't ever make much of a difference to performance. In the boolean case, xor-ing trailing bits instead of single bits allows part of an xor-scan to be skipped; see Expanding Bits in Shrinking Time.
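As a rough illustration (a sketch with invented names, not CBQN's code), the whole method is two passes over a zeroed result: scatter each difference at the position where its run starts, then prefix-sum to fill the runs.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

// Modified running sum: each run boundary gets the difference from the
// previous value instead of a constant 1, so one prefix-sum pass fills
// the runs with the right values. out must have length sum(w).
static void replicate_sum(const size_t *w, const int32_t *x, size_t n,
                          int32_t *out, size_t outlen) {
  memset(out, 0, outlen * sizeof(int32_t));
  size_t pos = 0;
  int32_t prev = 0;
  for (size_t i = 0; i < n; i++) {
    if (w[i] == 0) continue;   // zero counts contribute nothing
    out[pos] += x[i] - prev;   // the difference, not always +1
    prev = x[i];
    pos += w[i];
  }
  int32_t sum = 0;             // branchless fill via prefix sum
  for (size_t j = 0; j < outlen; j++) { sum += out[j]; out[j] = sum; }
}
```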

Constant replicate

The case where 𝕨 is constant is useful for outer products and leading-axis extension, where elements of one argument need to be repeated a few times. This connection is also discussed in Expanding Bits.

-The same approaches apply, but the branches in the branchless ones become a lot more predictable. So the obvious loops are now okay instead of bad even for small values. C compilers will generate decent code for constant small numbers as well, but I think they're still not as good as specialized code with shuffle, and can sometimes be beaten by scan-based methods.

+The same approaches apply, but the branches in the branchless ones become a lot more predictable. So the obvious loops are now okay instead of bad even for small values. C compilers will generate decent code for constant small numbers—better for powers of two, but still not optimal, it seems.

+
+For top performance, the result should be constructed from one shuffle per output, and some haggling with lanes for odd values in AVX. This means the number of shuffle constants is the value of 𝕨, so as the numbers get larger there's a question of how much space in the binary you're willing to devote to it. Although it's not really so bad because the overhead for non-specialized code gets lower as 𝕨 increases. But JIT compiling would be very useful here, of course.
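For a concrete picture, here's a hypothetical sketch (not from the commit) of one shuffle per output vector for the fixed case 𝕨=2 on 1-byte elements, using SSSE3 rather than AVX so lane handling doesn't come up:

```c
#include <emmintrin.h>   // SSE2: _mm_loadl_epi64, _mm_storeu_si128
#include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8
#include <stdint.h>
#include <stddef.h>

// One pshufb per 16-byte output vector: the constant index pattern
// 0,0,1,1,...,7,7 duplicates each of 8 input bytes. The load advances
// by 8 bytes while the store advances by 16.
static void replicate2_bytes(const uint8_t *x, size_t n, uint8_t *out) {
  const __m128i idx = _mm_set_epi8(7,7,6,6,5,5,4,4,3,3,2,2,1,1,0,0);
  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    __m128i v = _mm_loadl_epi64((const __m128i *)(x + i));  // 8-byte load
    _mm_storeu_si128((__m128i *)(out + 2*i), _mm_shuffle_epi8(v, idx));
  }
  for (; i < n; i++) out[2*i] = out[2*i+1] = x[i];          // scalar tail
}
```

For a general constant 𝕨 the shuffle constants above would be one table entry per output vector position, which is where the binary-size tradeoff comes in.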

Booleans

The case with a boolean replication amount is called Where or Compress, based on APL names for these functions from before Replicate was extended to natural numbers.

When the amounts to replicate are natural numbers you're pretty much stuck going one at a time. With booleans there are huge advantages to doing bytes or larger units at once. This tends to lead to a two-level model: an outer replicate-like pattern where the relevant amount is the sum of a group of booleans, and an inner pattern based on the individual 0s and 1s.

@@ -47,7 +48,7 @@

Booleans to indices

Generally, Compress implementations can be adapted to implement Where, and this tends to be the best way to do it. In the other direction, a Where implementation on 1 or 2 bytes can be used to implement Where or Compress on larger types. To do this, traverse the boolean argument in blocks; for each, get the indices and either add the current offset to them (Where) or use them to select from the argument (Compress). This is close to, but not necessarily, the fastest method for 4-byte and 8-byte outputs. It's a bargain given the low amount of effort required, though.
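A sketch of this blocked adaptation for Compress on 4-byte elements (hypothetical code; byte_indices stands in for any bits-to-indices kernel, such as the lookup table below, and n is assumed to be a multiple of 8 for simplicity):

```c
#include <stdint.h>
#include <stddef.h>

// Stand-in helper: write the positions of the set bits of b to idx and
// return the count. (The lookup-table method below is one fast option.)
static unsigned byte_indices(uint8_t b, uint8_t idx[8]) {
  unsigned c = 0;
  for (unsigned j = 0; j < 8; j++)
    if (b >> j & 1) idx[c++] = (uint8_t)j;
  return c;
}

// Blocked Compress on 4-byte elements: for each mask byte, get the indices
// of its 1s and use them to select from the matching 8 elements of x.
static size_t compress32(const uint8_t *mask, const int32_t *x, size_t n,
                         int32_t *out) {
  size_t outpos = 0;
  for (size_t i = 0; i < n; i += 8) {
    uint8_t idx[8];
    unsigned c = byte_indices(mask[i / 8], idx);
    for (unsigned j = 0; j < c; j++) out[outpos + j] = x[i + idx[j]];
    outpos += c;
  }
  return outpos;  // number of elements kept
}
```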

Generating indices from 256 or fewer bits is fastest with hardware extensions as discussed in the next section. But it can be done pretty fast on generic 64-bit hardware using a lookup table on 8 bits at a time. This algorithm can write past the end by up to 8 bytes (7 if trailing 0s are excluded), but never writes more than 256 bytes total. This means it's suitable for writing to an overallocated result array or a 256-byte buffer.

-To generate indices, use a 256×8-byte lookup table that goes from bytes to 8-byte index lists, and either a popcount instruction or another lookup table to get the sum of each byte. For each byte in 𝕨, get the corresponding indices, add an increment, and write them to the current index in the output. Then increase the output index by the byte's sum. The next indices will overlap the 8 bytes written, with the actual indices kept and junk values at the end overwritten. The increment added is an 8-byte value where each byte contains the current input index (always a multiple of 8); it can be added or bitwise or-ed with the lookup value.

+To generate indices, use a 256×8-byte lookup table that goes from bytes to 8-byte index lists, and either a popcount instruction or another lookup table to get the sum of each byte. For each byte, get the corresponding indices, add an increment, and write them to the current index in the output. Then increase the output index by the byte's sum. The next indices will overlap the 8 bytes written, with the actual indices kept and junk values at the end overwritten. The increment added is an 8-byte value where each byte contains the current input index (always a multiple of 8); it can be added or bitwise or-ed with the lookup value.
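A minimal sketch of that loop under the stated constraints (hypothetical names; little-endian assumed so the 8-byte increment lines up with the table's bytes, indices stored as single bytes so it covers at most 256 input bits, and 8 bytes of slack at the end of out):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

// tab[b] holds the positions of the set bits of byte b (junk past the count).
static uint8_t tab[256][8];
static void init_tab(void) {
  for (unsigned b = 0; b < 256; b++) {
    unsigned c = 0;
    for (unsigned j = 0; j < 8; j++)
      if (b >> j & 1) tab[b][c++] = (uint8_t)j;
  }
}

// Table-driven bits-to-indices. No carries cross byte lanes in idx += inc,
// since each table byte is below 8 and each inc byte is a multiple of 8
// no larger than 248.
static size_t bits_to_indices(const uint8_t *w, size_t nbytes, uint8_t *out) {
  uint64_t inc = 0;  // each byte holds the current input index, 8*i
  size_t pos = 0;
  for (size_t i = 0; i < nbytes; i++) {
    uint64_t idx;
    memcpy(&idx, tab[w[i]], 8);
    idx += inc;                              // add (or bitwise-or) increment
    memcpy(out + pos, &idx, 8);              // junk tail overwritten next time
    pos += (size_t)__builtin_popcount(w[i]); // advance by this byte's sum
    inc += 0x0808080808080808ull;            // every byte now holds 8*(i+1)
  }
  return pos;
}
```

Note how the full 8-byte write always lands at pos, so the junk indices from one byte are overwritten by the next byte's real ones, which is exactly the overlap described above.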

Some other methods are discussed by Langdale and Lemire. I think very large lookup tables are not good for an interpreter because they cause too much cache pressure if used occasionally on smaller arrays. This rules out many of these strategies.

Compress

Branchless writes and blocked indices are possible methods as discussed above. The ones below use extended instruction sets. Note that older AMD processors do BMI2 operations very slowly. Compress can often be performed in-place.
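As one example of the BMI2 route (a sketch, not CBQN's implementation): pext compresses each 64-bit word of a boolean argument against its mask word, and the surviving bits are appended at the current bit offset. The output buffer is assumed zero-initialized.

```c
#include <immintrin.h>   // BMI2: _pext_u64
#include <stdint.h>
#include <stddef.h>

// Compress on a boolean array. Returns the result length in bits.
static size_t compress_bits(const uint64_t *mask, const uint64_t *x,
                            size_t nwords, uint64_t *out) {
  size_t bits = 0;  // result length so far, in bits
  for (size_t i = 0; i < nwords; i++) {
    uint64_t v = _pext_u64(x[i], mask[i]);  // keep bits where mask is 1
    unsigned c = (unsigned)__builtin_popcountll(mask[i]);
    unsigned s = bits % 64;
    out[bits / 64] |= v << s;               // append at the bit offset
    if (s && s + c > 64)                    // spill into the next word
      out[bits / 64 + 1] = v >> (64 - s);
    bits += c;
  }
  return bits;
}
```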

diff --git a/implementation/primitive/replicate.md b/implementation/primitive/replicate.md
index ec4d505b..18f6a332 100644
--- a/implementation/primitive/replicate.md
+++ b/implementation/primitive/replicate.md
@@ -32,7 +32,9 @@ The running sum method needs to be modified slightly: instead of incrementing re

The case where `𝕨` is constant is useful for outer products and leading-axis extension, where elements of one argument need to be repeated a few times. This connection is also discussed in [Expanding Bits](https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrinking-time/).

-The same approaches apply, but the branches in the branchless ones become a lot more predictable. So the obvious loops are now okay instead of bad even for small values. C compilers will generate decent code for constant small numbers as well, but I think they're still not as good as specialized code with shuffle, and can sometimes be beaten by scan-based methods.
+The same approaches apply, but the branches in the branchless ones become a lot more predictable. So the obvious loops are now okay instead of bad even for small values. C compilers will generate decent code for constant small numbers—better for powers of two, but still not optimal, it seems.
+
+For top performance, the result should be constructed from one shuffle per output, and some haggling with lanes for odd values in AVX. This means the number of shuffle constants is the value of `𝕨`, so as the numbers get larger there's a question of how much space in the binary you're willing to devote to it. Although it's not really so bad because the overhead for non-specialized code gets lower as `𝕨` increases. But JIT compiling would be very useful here, of course.

## Booleans

@@ -48,7 +50,7 @@ Generally, Compress implementations can be adapted to implement Where, and this

Generating indices from 256 or fewer bits is fastest with hardware extensions as discussed in the next section. But it can be done pretty fast on generic 64-bit hardware using a lookup table on 8 bits at a time. This algorithm can write past the end by up to 8 bytes (7 if trailing 0s are excluded), but never writes more than 256 bytes total. This means it's suitable for writing to an overallocated result array or a 256-byte buffer.

-To generate indices, use a 256×8-byte lookup table that goes from bytes to 8-byte index lists, and either a popcount instruction or another lookup table to get the sum of each byte. For each byte in `𝕨`, get the corresponding indices, add an increment, and write them to the current index in the output. Then increase the output index by the byte's sum. The next indices will overlap the 8 bytes written, with the actual indices kept and junk values at the end overwritten. The increment added is an 8-byte value where each byte contains the current input index (always a multiple of 8); it can be added or bitwise or-ed with the lookup value.
+To generate indices, use a 256×8-byte lookup table that goes from bytes to 8-byte index lists, and either a popcount instruction or another lookup table to get the sum of each byte. For each byte, get the corresponding indices, add an increment, and write them to the current index in the output. Then increase the output index by the byte's sum. The next indices will overlap the 8 bytes written, with the actual indices kept and junk values at the end overwritten. The increment added is an 8-byte value where each byte contains the current input index (always a multiple of 8); it can be added or bitwise or-ed with the lookup value.
Some other methods are discussed by [Langdale](https://branchfree.org/2018/05/22/bits-to-indexes-in-bmi2-and-avx-512/) and [Lemire](https://lemire.me/blog/2018/03/08/iterating-over-set-bits-quickly-simd-edition/). I think very large lookup tables are not good for an interpreter because they cause too much cache pressure if used occasionally on smaller arrays. This rules out many of these strategies.

--
cgit v1.2.3