<head>
<link href="../../favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="../../style.css" rel="stylesheet"/>
<title>BQN: Implementation of Indices and Replicate</title>
</head>
<div class="nav"><a href="https://github.com/mlochbaum/BQN">BQN</a> / <a href="../../index.html">main</a> / <a href="../index.html">implementation</a> / <a href="index.html">primitive</a></div>
<h1 id="implementation-of-indices-and-replicate">Implementation of Indices and Replicate</h1>
<p>The replicate family of functions contains not just primitives but powerful tools for implementing other functionality. The most important is converting <a href="#booleans-to-indices">bits to indices</a>: AVX-512 extensions implement this natively for various index sizes, and even with no SIMD support at all there are surprisingly fast table-based algorithms for it.</p>
<p><a href="#replicate">General replication</a> is more complex. Branching will slow many useful cases down considerably when using the obvious solution. However, branch-free techniques introduce overhead for larger replication amounts. Hybridizing these seems to be the only way, but it's finicky.</p>
<p>Replicate by a <a href="#constant-replicate">constant amount</a> (so <code><span class='Value'>𝕨</span></code> is a single number) is not too common in itself, but it's notable because it can be the fastest way to implement outer products and scalar dyadics with prefix agreement.</p>
<h2 id="indices">Indices</h2>
<p>Branchless algorithms are fastest, but with unbounded values in <code><span class='Value'>𝕩</span></code> a fully branchless algorithm is impossible because you can't write an arbitrary amount of memory without branching. So the best algorithms depend on bounding <code><span class='Value'>𝕩</span></code>. Fortunately the most useful case is that <code><span class='Value'>𝕩</span></code> is boolean.</p>
<h3 id="booleans-to-indices">Booleans to indices</h3>
<p>Indices (<code><span class='Function'>/</span></code>) on a boolean <code><span class='Value'>𝕩</span></code> of 256 or fewer bits can be made very fast on generic 64-bit hardware using a lookup table on 8 bits at a time. This algorithm can write past the end by up to 8 bytes (7 if trailing 0s are excluded), but never writes more than 256 bytes total. This means it's suitable for writing to an overallocated result array or a 256-byte buffer.</p>
<p>To generate indices, use a 256×8-byte lookup table that goes from bytes to 8-byte index lists, and either a popcount instruction or another lookup table to get the sum of each byte. For each byte in <code><span class='Value'>𝕩</span></code>, get the corresponding indices, add an increment, and write them to the current index in the output. Then increase the output index by the byte's sum. The next write will overlap the 8 bytes just written, keeping the actual indices and overwriting the junk values after them. The increment added is an 8-byte value where each byte contains the current input index (always a multiple of 8); it can be added or bitwise or-ed with the lookup value.</p>
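<p>A minimal C sketch of this approach, to make the moving parts concrete. The table layout and helper names are illustrative choices, it assumes a little-endian target, and it uses the GCC/Clang popcount builtin; none of this is a description of any particular implementation.</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

// tab[b] packs the positions of the set bits of byte b, one per output byte;
// entries past the popcount of b are junk and get overwritten later on.
static uint64_t tab[256];
static void init_tab(void) {
  for (int b = 0; b &lt; 256; b++) {
    uint64_t e = 0; int k = 0;
    for (int i = 0; i &lt; 8; i++)
      if (b &gt;&gt; i &amp; 1) e |= (uint64_t)i &lt;&lt; (8 * k++);
    tab[b] = e;
  }
}

// Indices of the set bits in an n-bit mask x (n &lt;= 256), as 1-byte values.
// out must allow 8 bytes of overwrite past the true result length.
static size_t bits_to_indices(const uint8_t *x, size_t n, uint8_t *out) {
  size_t len = 0;
  for (size_t i = 0; i &lt; (n + 7) / 8; i++) {
    // Each byte of the increment holds this input byte's bit offset, always a
    // multiple of 8, so bitwise or would work in place of the addition.
    uint64_t idx = tab[x[i]] + 0x0101010101010101u * (uint64_t)(8 * i);
    memcpy(out + len, &amp;idx, 8);               // oversized 8-byte write
    len += (size_t)__builtin_popcount(x[i]);   // or a second lookup table
  }
  return len;
}
</pre>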
<p>Some other methods are discussed by <a href="https://branchfree.org/2018/05/22/bits-to-indexes-in-bmi2-and-avx-512/">Langdale</a> and <a href="https://lemire.me/blog/2018/03/08/iterating-over-set-bits-quickly-simd-edition/">Lemire</a>. I think very large lookup tables are not good for an interpreter because they cause too much cache pressure when used only occasionally on smaller arrays. This rules out many of those strategies.</p>
<h3 id="non-booleans-to-indices">Non-booleans to indices</h3>
<p>If the maximum value in <code><span class='Value'>𝕩</span></code> is, say, 8, then generating indices is fairly fast: for each element, write 8 copies of its index, then move the output pointer forward by that element's value. This is much like the lookup table algorithm above, minus the lookup table. If the indices need to be larger than one byte, it's fine to expand them, and possibly add an offset, after generation (probably in chunks).</p>
<p>There are two ways I know to fill in the gaps that this method would leave with elements that are too large. First is to stop after such an element and fill remaining space branchfully (maybe with <code><span class='Value'>memset</span></code>). This is maximally efficient if <code><span class='Value'>𝕩</span></code> is dominated by large elements—particularly for 2-byte indices when it skips index expansion—but not good if there are a lot of elements near the threshold. Second, initialize the buffer with 0 and perform <code><span class='Function'>⌈</span><span class='Modifier'>`</span></code> afterwards, or other variations. This eliminates all but a fixed amount of branching, but it's a lot of overhead and I think unless a more sophisticated strategy arises it's best to stick with the first method.</p>
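<p>A rough C sketch of the first method. The threshold of 8 and the function name are illustrative, and it simplifies by testing each element for overflow rather than scanning branchlessly until the first oversized one.</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

// /𝕩 for non-negative 1-byte 𝕩 with n &lt;= 256, producing 1-byte indices.
// Each element writes 8 copies of its index unconditionally; the pointer then
// advances by the element's value, so extra copies are overwritten. Elements
// above 8 take a (rarely hit) branch that fills the rest with memset.
// out must allow 8 bytes of overwrite past the true result length.
static size_t indices_small(const uint8_t *x, size_t n, uint8_t *out) {
  size_t len = 0;
  for (size_t i = 0; i &lt; n; i++) {
    uint8_t v = x[i];
    memset(out + len, (int)i, 8);                      // typically one 8-byte store
    if (v &gt; 8) memset(out + len + 8, (int)i, v - 8);   // branchful fill for large v
    len += v;
  }
  return len;
}
</pre>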
<p>Indices is half of a counting sort: for sparse values, it's the slower half. Making it fast makes counting sort viable for much larger range-to-length ratios.</p>
<h2 id="replicate">Replicate</h2>
<p>For the most part, understanding Indices is the best way to implement Replicate quickly. But this is not the case if <code><span class='Value'>𝕩</span></code> is boolean because then its elements are smaller than any useful index, and faster methods are available.</p>
<h3 id="compress">Compress</h3>
<p>Most of the methods listed below can be performed in place.</p>
<p>For a boolean <code><span class='Value'>𝕩</span></code>, use BMI2's PEXT (parallel bits extract) instruction, or an emulation of it. The emulation can build the result recursively with masked shifts, alongside the popcount that's needed anyway.</p>
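<p>A sketch of the PEXT route in C, processing 64 bits at a time (requires BMI2 and GCC/Clang builtins; the bit-accumulation details are illustrative):</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;
#include &lt;immintrin.h&gt;   // _pext_u64 requires BMI2

// Compress a boolean 𝕩 by a boolean 𝕨: PEXT keeps the bits of 𝕩 where 𝕨 is 1,
// and the popcount of 𝕨 says how far the output position advances.
static size_t compress_bits(const uint64_t *w, const uint64_t *x,
                            size_t nwords, uint64_t *out) {
  uint64_t acc = 0;              // bits waiting to be flushed to out
  size_t pos = 0, outw = 0;      // output position in bits; words flushed
  for (size_t i = 0; i &lt; nwords; i++) {
    uint64_t bits = _pext_u64(x[i], w[i]);
    unsigned count = (unsigned)__builtin_popcountll(w[i]);
    unsigned off = pos % 64;
    acc |= bits &lt;&lt; off;
    if (off + count &gt;= 64) {     // current output word is full: flush it
      out[outw++] = acc;
      acc = off ? bits &gt;&gt; (64 - off) : 0;
    }
    pos += count;
  }
  if (pos % 64) out[outw] = acc; // final partial word
  return pos;                    // result length in bits
}
</pre>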
<p>The generally best method for small elements seems to be to generate 1-byte indices into a buffer 256 at a time and select with those. There's a branchless method on one bit at a time which is occasionally better, but I don't think the improvement is enough to justify using it.</p>
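<p>A sketch of that method for 1-byte elements, reusing a bits-to-indices routine like the one sketched earlier (the 256-element block size and the names are assumptions):</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;

// From the earlier sketch: writes 1-byte indices of the set bits, with 8 bytes
// of overwrite slack, for masks of at most 256 bits.
size_t bits_to_indices(const uint8_t *x, size_t n, uint8_t *out);

// Compress 1-byte elements of 𝕩 by a boolean 𝕨: for each 256-bit block of 𝕨,
// turn the mask into byte indices, then select through them. Assumes unused
// trailing bits in the last byte of 𝕨 are zero.
static size_t compress_u8(const uint8_t *w, const uint8_t *x,
                          size_t n, uint8_t *out) {
  uint8_t idx[256 + 8];                    // slack for oversized index writes
  size_t len = 0;
  for (size_t blk = 0; blk &lt; n; blk += 256) {
    size_t bits = n - blk &lt; 256 ? n - blk : 256;
    size_t k = bits_to_indices(w + blk / 8, bits, idx);
    for (size_t j = 0; j &lt; k; j++)         // simple selection loop
      out[len + j] = x[blk + idx[j]];
    len += k;
  }
  return len;
}
</pre>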
<p>For 1- and 2-byte elements, a shuffle-based solution is a substantial improvement, if a vector shuffle is available. AVX-512 has compress instructions built in for several element sizes.</p>
<p>Odd-sized cells could be handled with an index buffer like small elements, using oversized writes and either overallocating or handling the last element specially.</p>
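<p>As an illustration of the oversized-write idea, here 3-byte cells are selected through an index buffer using 4-byte loads and stores. Both the read and the write go one byte past the cell, so <code><span class='Value'>𝕩</span></code> and the destination each need a byte of slack, or the last cell must be handled specially; the cell size and names are assumptions.</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

// Copy the 3-byte cells of x listed in idx to out, 4 bytes at a time.
// Each copy spills one byte past the cell; the next copy overwrites it.
static void select_3byte(const char *x, const uint8_t *idx, size_t k, char *out) {
  for (size_t j = 0; j &lt; k; j++) {
    uint32_t v;
    memcpy(&amp;v, x + 3 * (size_t)idx[j], 4);   // oversized read
    memcpy(out + 3 * j, &amp;v, 4);              // oversized write
  }
}
</pre>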
<p>For medium-sized cells, copying cell by cell involves partial writes and so is somewhat inefficient. It's better to split <code><span class='Value'>𝕨</span></code> into groups of 1s in order to copy larger chunks from <code><span class='Value'>𝕩</span></code> at once. So the algorithm repeatedly searches <code><span class='Value'>𝕨</span></code> for the next 1, then the next 0, and copies the corresponding range of cells from <code><span class='Value'>𝕩</span></code> to the result. This might be better for small odd-sized cells as well; I haven't implemented the algorithm with oversized writes to compare.</p>
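<p>A sketch of the grouped method in C. A real version would scan <code><span class='Value'>𝕨</span></code> a word at a time with count-trailing-zeros rather than bit by bit, and the bit order within bytes is an assumption here.</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

// Compress cells of csz bytes: find each run of 1s in 𝕨 and copy the whole
// run of cells from 𝕩 with a single memcpy. Returns the number of cells kept.
static size_t compress_runs(const uint8_t *w, size_t n,
                            const char *x, size_t csz, char *out) {
  size_t len = 0, i = 0;
  while (i &lt; n) {
    while (i &lt; n &amp;&amp; !(w[i / 8] &gt;&gt; (i % 8) &amp; 1)) i++;  // find the next 1
    size_t start = i;
    while (i &lt; n &amp;&amp;  (w[i / 8] &gt;&gt; (i % 8) &amp; 1)) i++;  // find the next 0
    memcpy(out + len * csz, x + start * csz, (i - start) * csz);
    len += i - start;
  }
  return len;
}
</pre>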
<p>The grouped algorithm, as well as a simpler sparse algorithm that just finds each 1 in <code><span class='Value'>𝕨</span></code>, can also be better for small elements. Whether to use these depends on the value of <code><span class='Function'>+</span><span class='Modifier'>´</span><span class='Value'>𝕨</span></code> (sparse) or <code><span class='Function'>+</span><span class='Modifier'>´</span><span class='Function'>»</span><span class='Modifier2'>⊸</span><span class='Function'><</span><span class='Value'>𝕨</span></code> (clumped). The checking is fast and these cases are common, but the general case is also fast enough that this is not a particularly high priority.</p>
<h3 id="replicate">Replicate</h3>
<p>As with Compress, I think the best algorithm is often to generate small indices in a buffer and then select. But this is inefficient when <code><span class='Value'>𝕨</span></code> contains large values, so those need to be detected and handled. Very tricky.</p>
<h4 id="constant-replicate">Constant replicate</h4>
<p>Useful for outer products and leading-axis extension. See <a href="https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrinking-time/">Expanding Bits in Shrinking Time</a> for the boolean case. C compilers will generate decent code for constant small numbers and variable large ones, but I think specialized code with shuffle would be better for small numbers.</p>
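<p>For reference, a scalar loop of the kind a compiler is expected to handle well when the amount is a small compile-time constant; the 4-byte element type is an assumption, and a shuffle-based version would replace the inner loop.</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;

// 𝕨/𝕩 where 𝕨 is a single constant amount k: each element of 𝕩 is written
// k times. Instantiating this with k as a compile-time constant (one copy per
// small k) lets the compiler unroll the inner loop into straight stores.
static void replicate_const_u32(const uint32_t *x, size_t n,
                                size_t k, uint32_t *out) {
  for (size_t i = 0; i &lt; n; i++)
    for (size_t j = 0; j &lt; k; j++)
      out[i * k + j] = x[i];
}
</pre>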
<h3 id="higher-ranks">Higher ranks</h3>
<p>When replicating along the first axis only, additional axes only change the element size (these are the main reason why a large element method is given). Replicating along a later axis offers a few opportunities for improvement relative to replicating each cell individually.</p>
<p>Particularly for boolean <code><span class='Value'>𝕨</span></code>, Select is usually faster than Replicate (a major exception is for a boolean <code><span class='Value'>𝕩</span></code>). Simply replacing <code><span class='Function'>/</span></code> with <code><span class='Function'>/</span><span class='Modifier'>¨</span><span class='Modifier2'>⊸</span><span class='Function'>⊏</span></code> (after checking conformability) could be an improvement. It's probably best to compute the result shape first to avoid doing any work if it's empty. Similarly, if early result axes are small then the overhead of separating out Indices might make it worse than just doing the small number of Replicates.</p>
<p>When <code><span class='Value'>𝕨</span></code> is processed one or more bytes at a time and applies to many rows, one technique is to repeat it up to a whole number of bytes and combine rows of <code><span class='Value'>𝕩</span></code> into longer virtual rows (the last one can be short). I think this only ends up being useful when <code><span class='Value'>𝕩</span></code> is boolean.</p>