aboutsummaryrefslogtreecommitdiff
path: root/implementation
diff options
context:
space:
mode:
authorMarshall Lochbaum <mwlochbaum@gmail.com>2022-09-01 20:20:21 -0400
committerMarshall Lochbaum <mwlochbaum@gmail.com>2022-09-01 20:20:21 -0400
commitdc2c7cd33d2adb701945b050a8dff1ff47003eec (patch)
tree9953e9349b413dc6d4ee79cb7edfc736e4d28767 /implementation
parent99f3205c897c82b378b9b022ac295085005cef5e (diff)
Typo
Diffstat (limited to 'implementation')
-rw-r--r--implementation/primitive/replicate.md2
1 files changed, 1 insertions, 1 deletions
diff --git a/implementation/primitive/replicate.md b/implementation/primitive/replicate.md
index d6141096..202dd334 100644
--- a/implementation/primitive/replicate.md
+++ b/implementation/primitive/replicate.md
@@ -16,7 +16,7 @@ Branchless algorithms are fastest, but with unbounded values in `𝕨` a fully b
Indices (`/`) on a boolean `𝕩` of 256 or fewer bits can be made very fast on generic 64-bit hardware using a lookup table on 8 bits at a time. This algorithm can write past the end by up to 8 bytes (7 if trailing 0s are excluded), but never writes more than 256 bytes total. This means it's suitable for writing to an overallocated result array or a 256-byte buffer.
-To generate indices, use a 256×8-byte lookup table that goes from bytes to 8-byte index lists, and either a popcount instruction or another lookup table to get the sum of each byte. For each byte in `𝕨`, get the corresponding indices, add an increment, and write them to the current index in the output. Then incease the output index by the byte's sum. The next indices will overlap the 8 bytes written, with the actual indices kept and junk values at the end overwritten. The increment added is an 8-byte value where each byte contains the current input index (always a multiple of 8); it can be added or bitwise or-ed with the lookup value.
+To generate indices, use a 256×8-byte lookup table that goes from bytes to 8-byte index lists, and either a popcount instruction or another lookup table to get the sum of each byte. For each byte in `𝕨`, get the corresponding indices, add an increment, and write them to the current index in the output. Then increase the output index by the byte's sum. The next indices will overlap the 8 bytes written, with the actual indices kept and junk values at the end overwritten. The increment added is an 8-byte value where each byte contains the current input index (always a multiple of 8); it can be added or bitwise or-ed with the lookup value.
Some other methods discussed by [Langdale](https://branchfree.org/2018/05/22/bits-to-indexes-in-bmi2-and-avx-512/) and [Lemire](https://lemire.me/blog/2018/03/08/iterating-over-set-bits-quickly-simd-edition/). I think very large lookup tables are not good for an interpreter because they cause too much cache pressure if used occasionally on smaller arrays. This rules out many of these strategies.