<head>
<link href="../../favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="../../style.css" rel="stylesheet"/>
<title>BQN: Implementation of array transpose</title>
</head>
<h1 id="implementation-of-array-transpose"><a class="header" href="#implementation-of-array-transpose">Implementation of array transpose</a></h1>
<p>The <a href="../../doc/transpose.html">Transpose</a> primitive <code><span class='Function'>⍉</span></code> reorders axes of an array. It's a fairly deep primitive: even swapping two axes has some complexity, while rearranging many axes presents lots of different scenarios, making it hard to figure out which strategy will be best.</p>
<h2 id="exchanging-axes"><a class="header" href="#exchanging-axes">Exchanging axes</a></h2>
<p>Monadic transpose in BQN always exchanges one axis with the rest. As mentioned <a href="#axis-simplification">below</a>, that's equivalent to swapping exactly two axes. This case is easier to implement, and ideas that are useful for this simple transpose generally apply to more complex ones as well.</p>
<p>The scalar approach is simply to set <code><span class='Value'>res</span><span class='Bracket'>[</span><span class='Value'>i</span><span class='Bracket'>][</span><span class='Value'>j</span><span class='Bracket'>]</span> <span class='Function'>=</span> <span class='Value'>arg</span><span class='Bracket'>[</span><span class='Value'>j</span><span class='Bracket'>][</span><span class='Value'>i</span><span class='Bracket'>]</span></code> in a loop. Doing the writes in order and the reads out of order (so, <code><span class='Value'>i</span></code> is the outer loop here) seems to be slightly faster. This is fine most of the time, but suffers slightly when the inner loop is very short, and can be improved with SIMD.</p>
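<p>As a concrete reference, here's that scalar loop in C (a minimal sketch; the function name and <code>double</code> element type are just for illustration), with the result row index as the outer loop so that writes are in order:</p>
<pre>
#include &lt;stddef.h&gt;

// Transpose the h×w matrix src into the w×h matrix dst.
// dst is filled sequentially, while reads from src jump by a whole row.
void transpose_scalar(double *dst, const double *src, size_t h, size_t w) {
    for (size_t i = 0; i &lt; w; i++)      // i: row of dst, column of src
        for (size_t j = 0; j &lt; h; j++)  // j: column of dst, row of src
            dst[i*h + j] = src[j*w + i];
}
</pre>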
<p>My talk "Moving Bits Faster in Dyalog 16.0" (<a href="https://dyalog.tv/Dyalog17/?v=2KnrDmZov4U">video</a>, <a href="https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D15_The_Interpretive_Advantage.zip">slides</a>) discusses optimizations for the boolean case, where a scalar loop is particularly bad because reading or writing a single bit is so slow. Booleans can be a little easier to work with than larger units, although there are also some unique issues to be addressed because they can only be accessed in aligned groups, while SIMD vectors don't have to be aligned.</p>
<h3 id="kernels"><a class="header" href="#kernels">Kernels</a></h3>
<p>A transpose kernel is a unit that's transposed with a single loopless section of code. Typically kernels are square, with a power-of-two length on the sides. For example, the boolean talk shows subarrays of 8×8 bits being transposed, since each fits in one 64-bit register. But now I think that's a pretty small kernel, and one spanning 4 or more registers would be better, even sticking to general-purpose registers. Vector instructions both increase the size that can be held and speed up the transposing itself, so SIMD kernels are a big step up.</p>
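<p>For reference, the classic 8×8 boolean kernel (as found in <em>Hacker's Delight</em>, among other places) transposes a word's worth of bits with three rounds of shift-and-mask:</p>
<pre>
#include &lt;stdint.h&gt;

// Transpose an 8×8 bit matrix packed row-major into one 64-bit word:
// swap bits within 2×2 blocks, then 2×2 blocks within 4×4 blocks,
// then the two off-diagonal 4×4 blocks.
uint64_t transpose_8x8_bits(uint64_t x) {
    uint64_t t;
    t = (x ^ (x &gt;&gt; 7))  &amp; 0x00AA00AA00AA00AAull;  x ^= t ^ (t &lt;&lt; 7);
    t = (x ^ (x &gt;&gt; 14)) &amp; 0x0000CCCC0000CCCCull;  x ^= t ^ (t &lt;&lt; 14);
    t = (x ^ (x &gt;&gt; 28)) &amp; 0x00000000F0F0F0F0ull;  x ^= t ^ (t &lt;&lt; 28);
    return x;
}
</pre>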
<p>The major benefit of transposing with kernels is to combine multiple reads that would be scattered in ordinary scalar iteration (if writes are in-order). It costs about the same to read one element as one 32-byte register, so the larger read clearly gets a lot more done with the same memory load. But transposing huge numbers of elements in registers gets expensive, so at some point the CPU load will be big enough that increasing the kernel size is no longer worth it.</p>
<p>A matrix in SIMD registers can be transposed with unpack instructions in a <a href="https://en.wikipedia.org/wiki/Butterfly_diagram">butterfly</a> pattern. This takes <code><span class='Value'>k</span></code> layers for a <code><span class='Number'>2</span><span class='Function'>⋆</span><span class='Value'>k</span></code> by <code><span class='Number'>2</span><span class='Function'>⋆</span><span class='Value'>k</span></code> matrix, where each layer pairs up all registers. So the asymptotic cost for <code><span class='Value'>n</span></code> elements is linear times an extra factor of <code><span class='Number'>4</span><span class='Function'>⋆</span><span class='Modifier'>⁼</span><span class='Value'>n</span></code>.</p>
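<p>For example, the well-known SSE sequence for a 4×4 float matrix (essentially what the <code>_MM_TRANSPOSE4_PS</code> macro expands to) uses two unpack layers in this pattern:</p>
<pre>
#include &lt;immintrin.h&gt;

// Transpose a 4×4 float matrix held in four SSE registers, one row each.
static inline void transpose_4x4_f32(__m128 *r0, __m128 *r1,
                                     __m128 *r2, __m128 *r3) {
    __m128 t0 = _mm_unpacklo_ps(*r0, *r1);  // a0 b0 a1 b1
    __m128 t1 = _mm_unpackhi_ps(*r0, *r1);  // a2 b2 a3 b3
    __m128 t2 = _mm_unpacklo_ps(*r2, *r3);  // c0 d0 c1 d1
    __m128 t3 = _mm_unpackhi_ps(*r2, *r3);  // c2 d2 c3 d3
    *r0 = _mm_movelh_ps(t0, t2);            // a0 b0 c0 d0
    *r1 = _mm_movehl_ps(t2, t0);            // a1 b1 c1 d1
    *r2 = _mm_movelh_ps(t1, t3);            // a2 b2 c2 d2
    *r3 = _mm_movehl_ps(t3, t1);            // a3 b3 c3 d3
}
</pre>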
<h3 id="interleaving"><a class="header" href="#interleaving">Interleaving</a></h3>
<p>A single fixed-size kernel breaks down when one axis is smaller than the kernel length. Handling these cases well calls for a different strategy that I call interleaving or uninterleaving, depending on orientation. Interleaving can transpose a few long rows into a long matrix with short rows, while uninterleaving goes the other way. Generally, an approach that works for one can be run backwards to do the other. The differences between interleaving and a general kernel are that writes (or reads for uninterleaving) are entirely contiguous, and that the two subarray dimensions aren't the same. One dimension is forced by the matrix shape, and the other should be chosen as large as possible while fitting in a register. Or multiple registers, if you're brave enough. Alignment is a significant annoyance here, unless the smaller axis length is a power of two.</p>
<p>The boolean case can use methods similar to those described <a href="take.html#bit-interleaving-and-uninterleaving">for Take and Drop</a>. For example, to transpose <code><span class='Value'>k</span></code> long rows, interleave each row to place <code><span class='Value'>k</span><span class='Function'>-</span><span class='Number'>1</span></code> zeros between each bit, then or results from the different rows together at the appropriate offsets.</p>
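<p>As an illustration (a sketch, not CBQN's actual code) of the case <code><span class='Value'>k</span><span class='Function'>=</span><span class='Number'>2</span></code>, BMI2's <code>_pdep_u64</code> deposits each row's bits into every other position, and the rows are then or'd together:</p>
<pre>
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;immintrin.h&gt;

// Transpose a 2×(64·n) bit matrix into (64·n)×2 by interleaving: each
// 32-bit half of a row word is spread to every other bit position with
// pdep (BMI2), and the two rows are or'd together at offsets 0 and 1.
void transpose_2xn_bits(const uint64_t *row0, const uint64_t *row1,
                        uint64_t *dst, size_t n) {
    for (size_t i = 0; i &lt; n; i++) {
        dst[2*i]   = _pdep_u64(row0[i] &amp; 0xFFFFFFFF, 0x5555555555555555ull)
                   | _pdep_u64(row1[i] &amp; 0xFFFFFFFF, 0xAAAAAAAAAAAAAAAAull);
        dst[2*i+1] = _pdep_u64(row0[i] &gt;&gt; 32, 0x5555555555555555ull)
                   | _pdep_u64(row1[i] &gt;&gt; 32, 0xAAAAAAAAAAAAAAAAull);
    }
}
</pre>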
<p>The case with non-booleans and SIMD should be simpler, as it can be done with byte shuffles. The complication is reading from or writing to the side with long rows. I think the right approach is to read or write most of a register at once, and split it up as part of the shuffling. It's probably enough to have one precomputed index vector and add an offset to it as necessary.</p>
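<p>As a small example of that shuffle approach (a sketch, with the row length fixed at 4 bytes so the whole subarray fits one register), a single <code>pshufb</code> with a precomputed index vector does the split:</p>
<pre>
#include &lt;immintrin.h&gt;

// Transpose a 4×4 byte matrix held row-major in one SSE register;
// requires SSSE3 for _mm_shuffle_epi8.
__m128i transpose_4x4_u8(__m128i m) {
    const __m128i idx = _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13,
                                      2, 6, 10, 14, 3, 7, 11, 15);
    return _mm_shuffle_epi8(m, idx);
}
</pre>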
<h3 id="cache-efficient-orderings"><a class="header" href="#cache-efficient-orderings">Cache-efficient orderings</a></h3>
<p>Particularly with large elements and a good SIMD kernel, cache behavior begins to be a problem: by writing in index order, for instance, writes are perfectly cached but reads are very poorly cached.</p>
<p>A simple solution to this problem is blocking: for example, split the array into blocks of a few kilobytes, transposing each before moving to the next. If one of these small blocks stays in the L1 cache, then L1 is used efficiently over the course of a block. But moving between blocks is inefficient again: we'd want it to use L2 when L1 is overloaded, but writes won't do that.</p>
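<p>A sketch of that blocking structure (the block size is a placeholder to tune, and in practice the per-block work would use a kernel):</p>
<pre>
#include &lt;stddef.h&gt;

enum { BLOCK = 64 };  // a placeholder: tune so two blocks fit in L1

// Transpose the h×w matrix src into the w×h matrix dst, block by block.
void transpose_blocked(double *dst, const double *src, size_t h, size_t w) {
    for (size_t jb = 0; jb &lt; w; jb += BLOCK)
        for (size_t ib = 0; ib &lt; h; ib += BLOCK)
            for (size_t j = jb; j &lt; jb + BLOCK &amp;&amp; j &lt; w; j++)
                for (size_t i = ib; i &lt; ib + BLOCK &amp;&amp; i &lt; h; i++)
                    dst[j*h + i] = src[i*w + j];
}
</pre>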
<p>Adding more layers of blocking one at a time might help, but a cache-oblivious layout (named for its ability to perform well regardless of what layers of cache exist) skips over this mess and adds layers fractally at all scales. Here's <a href="https://en.algorithmica.org/hpc/external-memory/oblivious/#matrix-transposition">one presentation</a> of this idea. A simpler recursive implementation is to just halve the longer side of the matrix at each step. At the other complexity extreme, Hilbert curves offer better locality and can be traced without recursion. A recent <a href="https://dl.acm.org/doi/10.1145/3555353">paper</a> with <a href="https://github.com/JoaoAlves95/HPC-Cache-Oblivious-Transposition">source code</a> offers a SIMD scheme for generating the curve, which even the authors say is pointless for transpose because the non-SIMD generation overhead was already low. But it also explains the non-SIMD aspects well, if you have the time.</p>
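<p>Here's a sketch of that halve-the-longer-side recursion (the leaf size of 16 is arbitrary; a real implementation would bottom out in a kernel):</p>
<pre>
#include &lt;stddef.h&gt;

// Transpose rows [i0,i1) and columns [j0,j1) of the h×w matrix src into
// the w×h matrix dst; call with i0=0, i1=h, j0=0, j1=w for the whole array.
void transpose_rec(double *dst, const double *src, size_t h, size_t w,
                   size_t i0, size_t i1, size_t j0, size_t j1) {
    size_t di = i1 - i0, dj = j1 - j0;
    if (di &gt;= dj &amp;&amp; di &gt; 16) {            // halve the longer side
        transpose_rec(dst, src, h, w, i0, i0 + di/2, j0, j1);
        transpose_rec(dst, src, h, w, i0 + di/2, i1, j0, j1);
    } else if (dj &gt; 16) {
        transpose_rec(dst, src, h, w, i0, i1, j0, j0 + dj/2);
        transpose_rec(dst, src, h, w, i0, i1, j0 + dj/2, j1);
    } else {                              // small enough for any cache level
        for (size_t i = i0; i &lt; i1; i++)
            for (size_t j = j0; j &lt; j1; j++)
                dst[j*h + i] = src[i*w + j];
    }
}
</pre>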
<h4 id="cache-associativity"><a class="header" href="#cache-associativity">Cache associativity</a></h4>
<p>A more severe problem than just limited cache capacity is that when an array's stride is a power of 2 or very close to it, different accesses compete for the same small set of cache lines in a <a href="https://en.wikipedia.org/wiki/Cache_associativity#Set-associative_cache">set-associative</a> cache. This can slow down something like a naive 512×512 transpose to the point that it's faster to pad the matrix, transpose, and remove the padding. Fortunately it's also addressed by a cache-oblivious ordering, but if you don't have such an ordering you should try to choose block sizes so that side lengths aren't near powers of two.</p>
<p>A set-associative cache can only store a fixed number of lines whose addresses share the same trailing bits. So an 8-way associative cache might have space for just 8 lines with addresses ending in 0x85f0. In the 512×512 case with 4-byte elements, every read in a column has the same last 9+2 bits as the previous one, and <code><span class='Number'>2</span><span class='Function'>⋆</span><span class='Number'>9</span></code> elements are read before getting to the next column. Keeping that whole column resident would take an 8-way cache with <code><span class='Paren'>(</span><span class='Number'>2</span><span class='Function'>⋆</span><span class='Number'>9</span><span class='Function'>+</span><span class='Number'>2</span><span class='Function'>+</span><span class='Number'>9</span><span class='Paren'>)</span><span class='Function'>÷</span><span class='Number'>8</span></code>, or <code><span class='Number'>2</span><span class='Function'>⋆</span><span class='Number'>17</span></code>, bytes in each of its 8 ways (a megabyte in total), which it doesn't have. So cache lines from the beginning of the column start getting kicked out before reaching the next column, and those are exactly the ones that were needed again: each 64-byte line covers 16 columns.</p>
<h2 id="arbitrary-reordering"><a class="header" href="#arbitrary-reordering">Arbitrary reordering</a></h2>
<p>Moving lots of axes around quickly gets hard to think about. Here's what I've found so far. Most of these methods work in some way for repeated axis indices (that is, taking diagonals) too.</p>
<h3 id="axis-simplification"><a class="header" href="#axis-simplification">Axis simplification</a></h3>
<p>Sometimes the transpose isn't as complicated as it appears. The following are useful simplifications to apply:</p>
<ul>
<li>Any length-1 axis can be ignored; it doesn't affect ordering.</li>
<li>Two axes that are adjacent both before and after the transpose can be treated as one larger axis.</li>
<li>Axes at the end that don't change can be absorbed into a total "element size".</li>
</ul>
<p>Particularly after dropping length-1 axes, some transposes might get down to 1 or 0 axes, which means they're no-ops.</p>
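<p>A sketch of the first two simplifications, with a hypothetical interface: <code>shape</code> holds the result axis lengths, and <code>perm[i]</code> is the argument axis that result axis <code>i</code> draws from:</p>
<pre>
#include &lt;stddef.h&gt;

// Drop length-1 axes and merge result axes that are also adjacent in the
// argument, returning the simplified axis count. A fuller version would
// renumber perm after dropping axes, which can expose further merges.
size_t simplify_axes(size_t *shape, size_t *perm, size_t n) {
    size_t m = 0;
    for (size_t i = 0; i &lt; n; i++) {
        if (shape[i] == 1) continue;              // length-1: no reordering
        if (m &gt; 0 &amp;&amp; perm[i] == perm[m-1] + 1) {
            shape[m-1] *= shape[i];               // adjacent on both sides
            perm[m-1] = perm[i];                  // track last merged axis
        } else {
            shape[m] = shape[i]; perm[m] = perm[i]; m++;
        }
    }
    return m;
}
</pre>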
<h3 id="the-bottom-line"><a class="header" href="#the-bottom-line">The bottom line</a></h3>
<p>When there's a large enough fixed axis at the bottom, most of the work of transpose is just moving large chunks of data. If these moves are done with memcpy or similar, it doesn't matter what the structure around them is, as the limiting factor is memory bandwidth.</p>
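<p>For instance, treating everything under the two reordered axes as one cell of <code>cell</code> bytes (a sketch; the name is illustrative):</p>
<pre>
#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

// Exchange the two leading axes of an h×w array of fixed-size cells.
// With large cells each memcpy is a long contiguous move, so the loop
// structure around it hardly matters: memory bandwidth is the limit.
void transpose_cells(char *dst, const char *src,
                     size_t h, size_t w, size_t cell) {
    for (size_t i = 0; i &lt; h; i++)
        for (size_t j = 0; j &lt; w; j++)
            memcpy(dst + (j*h + i)*cell, src + (i*w + j)*cell, cell);
}
</pre>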
<p>A similar principle applies if, for example, the last <em>two</em> axes switch places, but stay at the end, provided their product is large enough. Then the transpose is a bunch of 2D transposes. If each of these is done efficiently, then the outer structure can be anything.</p>
<h3 id="decomposition"><a class="header" href="#decomposition">Decomposition</a></h3>
<p>Before spending a tremendous amount of effort on optimizing strange transpose arrangements, it's probably worth looking into transposing multiple times instead. I don't know whether this can ever beat a single-pass method, but it's certainly simpler a lot of the time. Still, finding an algorithm that makes good decompositions can be pretty hard.</p>
<p>Dyalog splits a general boolean transpose into an axis rotation, which swaps two groups of axes and thus works like a matrix transpose, and a second transpose that fixes the last axis. Since length-1 axes have been eliminated, that last transpose works with at least 2 bits at a time, so the worst case is <em>slightly</em> better than moving single bits around, but still not great.</p>
<h3 id="recursion"><a class="header" href="#recursion">Recursion</a></h3>
<p>Unlike a 2D transpose, the basic implementation of an arbitrary reordering is not so obvious. It "should" be a loop with a variable amount of nesting. Of course, you could get rid of this nesting by working with index vectors, but that's very slow. The way Dyalog handles it is with a recursive function that does a layer of looping. Inside the loop it either calls itself or a base case, which is passed by a function pointer for flexibility. There are actually two loops, one that does one axis and one that does two at once. The single-axis version is only called at the top level in case there's an axis left over, so most of the work is done in this double-loop case.</p>
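<p>Here's a sketch of that recursive shape (not Dyalog's actual code, and with only a single-axis loop rather than the double one):</p>
<pre>
#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

// Recursion in place of a loop nest of variable depth. The result is
// written in order: shape[k] is the length of result axis k, and stride[k]
// is its step through the argument, in elements of elem bytes each.
void transpose_axes(char *dst, const char *src, size_t elem,
                    const size_t *shape, const size_t *stride, size_t axes) {
    if (axes == 1) {  // base case: copy one strided axis; a real
                      // implementation would use an optimized 2D case here
        for (size_t i = 0; i &lt; shape[0]; i++)
            memcpy(dst + i*elem, src + i*stride[0]*elem, elem);
    } else {
        size_t sub = elem;  // bytes of result under one step of axis 0
        for (size_t k = 1; k &lt; axes; k++) sub *= shape[k];
        for (size_t i = 0; i &lt; shape[0]; i++)
            transpose_axes(dst + i*sub, src + i*stride[0]*elem, elem,
                           shape + 1, stride + 1, axes - 1);
    }
}
</pre>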
<p>Of course larger loops are possible as well, but it's really the base case that's important. The base case typically handles two axes too, and can incorporate all the 2D optimizations like kernels.</p>
<h3 id="indices"><a class="header" href="#indices">Indices</a></h3>
<p>One thing I noticed when working with BQN is that CBQN's dyadic <code><span class='Function'>⍉</span></code> sometimes beat Dyalog's, despite using the BQN runtime instead of a native implementation. The runtime just constructs all the (ravel) indices for the output with a big addition table <code><span class='Function'>+</span><span class='Modifier'>⌜´</span></code>, then selects. I think this is actually a good strategy for handling trailing axes if there are a lot of small ones. Construct the index table for the bottom few axes of the result (which don't have to correspond to the bottom axes of the argument, even though for cache reasons it'd be nice if they did). Then the base case for transpose is simply selecting using these indices.</p>
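<p>Here's a C rendering of the idea (a sketch: the runtime builds the whole table at once with <code><span class='Function'>+</span><span class='Modifier'>⌜´</span></code> rather than looping):</p>
<pre>
#include &lt;stdlib.h&gt;

// Build the argument's ravel indices in result order by folding addition
// tables from the last result axis upward, then select through them.
// n is the product of shape[0..axes-1]; strides count elements.
void transpose_by_indices(double *dst, const double *src,
                          const size_t *shape, const size_t *stride,
                          size_t axes, size_t n) {
    size_t *idx = malloc(n * sizeof *idx);
    size_t built = 1;  // size of the table built so far
    idx[0] = 0;
    for (size_t a = axes; a-- &gt; 0; ) {
        // one fold step: add each offset of axis a to the whole table,
        // expanding it in place from back to front
        for (size_t i = shape[a]; i-- &gt; 0; )
            for (size_t j = 0; j &lt; built; j++)
                idx[i*built + j] = i*stride[a] + idx[j];
        built *= shape[a];
    }
    for (size_t i = 0; i &lt; n; i++) dst[i] = src[idx[i]];  // then select
    free(idx);
}
</pre>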