diff options
| author | Marshall Lochbaum <mwlochbaum@gmail.com> | 2022-09-01 20:20:21 -0400 |
|---|---|---|
| committer | Marshall Lochbaum <mwlochbaum@gmail.com> | 2022-09-01 20:20:21 -0400 |
| commit | dc2c7cd33d2adb701945b050a8dff1ff47003eec (patch) | |
| tree | 9953e9349b413dc6d4ee79cb7edfc736e4d28767 /docs/implementation/primitive | |
| parent | 99f3205c897c82b378b9b022ac295085005cef5e (diff) | |
Typo
Diffstat (limited to 'docs/implementation/primitive')
| -rw-r--r-- | docs/implementation/primitive/replicate.html | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/docs/implementation/primitive/replicate.html b/docs/implementation/primitive/replicate.html index 64cf73a7..f0c27952 100644 --- a/docs/implementation/primitive/replicate.html +++ b/docs/implementation/primitive/replicate.html @@ -12,7 +12,7 @@ <p>Branchless algorithms are fastest, but with unbounded values in <code><span class='Value'>𝕨</span></code> a fully branchless algorithm is impossible because you can't write an arbitrary amount of memory without branching. So the best algorithms depend on bounding <code><span class='Value'>𝕨</span></code>. Fortunately the most useful case is that <code><span class='Value'>𝕨</span></code> is boolean.</p> <h3 id="booleans-to-indices"><a class="header" href="#booleans-to-indices">Booleans to indices</a></h3> <p>Indices (<code><span class='Function'>/</span></code>) on a boolean <code><span class='Value'>𝕩</span></code> of 256 or fewer bits can be made very fast on generic 64-bit hardware using a lookup table on 8 bits at a time. This algorithm can write past the end by up to 8 bytes (7 if trailing 0s are excluded), but never writes more than 256 bytes total. This means it's suitable for writing to an overallocated result array or a 256-byte buffer.</p> -<p>To generate indices, use a 256×8-byte lookup table that goes from bytes to 8-byte index lists, and either a popcount instruction or another lookup table to get the sum of each byte. For each byte in <code><span class='Value'>𝕨</span></code>, get the corresponding indices, add an increment, and write them to the current index in the output. Then incease the output index by the byte's sum. The next indices will overlap the 8 bytes written, with the actual indices kept and junk values at the end overwritten. The increment added is an 8-byte value where each byte contains the current input index (always a multiple of 8); it can be added or bitwise or-ed with the lookup value.</p> +<p>To generate indices, use a 256×8-byte lookup table that goes from bytes to 8-byte index lists, and either a popcount instruction or another lookup table to get the sum of each byte. For each byte in <code><span class='Value'>𝕨</span></code>, get the corresponding indices, add an increment, and write them to the current index in the output. Then increase the output index by the byte's sum. The next indices will overlap the 8 bytes written, with the actual indices kept and junk values at the end overwritten. The increment added is an 8-byte value where each byte contains the current input index (always a multiple of 8); it can be added or bitwise or-ed with the lookup value.</p> <p>Some other methods discussed by <a href="https://branchfree.org/2018/05/22/bits-to-indexes-in-bmi2-and-avx-512/">Langdale</a> and <a href="https://lemire.me/blog/2018/03/08/iterating-over-set-bits-quickly-simd-edition/">Lemire</a>. I think very large lookup tables are not good for an interpreter because they cause too much cache pressure if used occasionally on smaller arrays. This rules out many of these strategies.</p> <h3 id="non-booleans-to-indices"><a class="header" href="#non-booleans-to-indices">Non-booleans to indices</a></h3> <p>If the maximum value in <code><span class='Value'>𝕩</span></code> is, say, 8, then generating indices is fairly fast: for each element, write 8 indices and then move the output pointer forward by that much. This is much like the lookup table algorithm above, minus the lookup table. If the indices need to be larger than one byte, it's fine to expand them, and possibly add an offset, after generation (probably in chunks).</p> |
