author    Marshall Lochbaum <mwlochbaum@gmail.com>  2023-03-12 22:30:35 -0400
committer Marshall Lochbaum <mwlochbaum@gmail.com>  2023-03-12 22:30:35 -0400
commit    1e332be5f70c657d362711673ac5250b19054001 (patch)
tree      e663fd03c69bed3b8be84c3b823e1bfab2c35e47
parent    3577976fda09e50a2efb1ea40afe9ea3a8c08f82 (diff)

Transpose implementation: notes on SIMD kernels and cache-friendly ordering (HEAD, master)
-rw-r--r--  docs/implementation/primitive/transpose.html  24
-rw-r--r--  implementation/primitive/transpose.md         30
2 files changed, 36 insertions(+), 18 deletions(-)
diff --git a/docs/implementation/primitive/transpose.html b/docs/implementation/primitive/transpose.html
index a1968e8f..7eb68356 100644
--- a/docs/implementation/primitive/transpose.html
+++ b/docs/implementation/primitive/transpose.html
@@ -10,15 +10,21 @@
<p>Monadic transpose in BQN always exchanges one axis with the rest. As mentioned <a href="#axis-simplification">below</a>, that's equivalent to swapping exactly two axes. This case is easier to implement, but ideas that are useful for this simple transpose always apply to more complex ones as well.</p>
<p>The scalar approach is simply to set <code><span class='Value'>res</span><span class='Bracket'>[</span><span class='Value'>i</span><span class='Bracket'>][</span><span class='Value'>j</span><span class='Bracket'>]</span> <span class='Function'>=</span> <span class='Value'>arg</span><span class='Bracket'>[</span><span class='Value'>j</span><span class='Bracket'>][</span><span class='Value'>i</span><span class='Bracket'>]</span></code> in a loop. Doing the writes in order and the reads out of order (so, <code><span class='Value'>i</span></code> is the outer loop here) seems to be slightly faster. This is fine most of the time, but suffers slightly when the inner loop is very short, and can be improved with SIMD.</p>
<p>My talk &quot;Moving Bits Faster in Dyalog 16.0&quot; (<a href="https://dyalog.tv/Dyalog17/?v=2KnrDmZov4U">video</a>, <a href="https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D15_The_Interpretive_Advantage.zip">slides</a>) discusses optimizations for the boolean case, where a scalar loop is particularly bad because reading or writing a single bit is so slow. Booleans can be a little easier to work with than larger units, although there are also some unique issues to be addressed because they can only be accessed in aligned groups, while SIMD vectors don't have to be aligned.</p>
-<h3 id="blocking"><a class="header" href="#blocking">Blocking</a></h3>
-<p>The boolean talk shows blocks (that is, subarrays) of 8×8 bits being transposed. SIMD can handle similar-sized or smaller blocks, depending on element size. This requires using multiple registers; I think the best approach is to use quite a few—that is, sticking to one register for booleans was probably not optimal.</p>
-<p>Blocking is useful because it combines multiple reads that would be scattered in the ordinary scalar iteration (if writes are in-order). It costs about the same to read one element as one 32-byte register, so the larger read clearly gets a lot more done with the same memory load. But transposing huge numbers of elements in registers gets complicated, so at some point the CPU load will be big enough that increasing the block size is no longer worth it.</p>
+<h3 id="kernels"><a class="header" href="#kernels">Kernels</a></h3>
+<p>A transpose kernel is a unit that's transposed with a single loopless section of code. Typically kernels are square, with a power-of-two length on the sides. For example, the boolean talk shows subarrays of 8×8 bits being transposed, since each fits in one 64-bit register. But now I think that's a pretty small kernel, and working with 4 or more registers would be better, even sticking to general-purpose ones. Vector instructions both increase the size that can be held and speed up the transposing itself, so SIMD kernels are a big step up.</p>
+<p>The major benefit of transposing with kernels is to combine multiple reads that would be scattered in ordinary scalar iteration (if writes are in-order). It costs about the same to read one element as one 32-byte register, so the larger read clearly gets a lot more done with the same memory load. But transposing huge numbers of elements in registers gets expensive, so at some point the CPU load will be big enough that increasing the kernel size is no longer worth it.</p>
+<p>A matrix in SIMD registers can be transposed with unpack instructions in a <a href="https://en.wikipedia.org/wiki/Butterfly_diagram">butterfly</a> pattern. This takes <code><span class='Value'>k</span></code> layers for a <code><span class='Number'>2</span><span class='Function'>⋆</span><span class='Value'>k</span></code> by <code><span class='Number'>2</span><span class='Function'>⋆</span><span class='Value'>k</span></code> matrix, where each layer pairs up all registers. So the asymptotic cost for <code><span class='Value'>n</span></code> elements is linear times an extra factor of <code><span class='Number'>4</span><span class='Function'>⋆</span><span class='Modifier'>⁼</span><span class='Value'>n</span></code>.</p>
<h3 id="interleaving"><a class="header" href="#interleaving">Interleaving</a></h3>
-<p>When one axis is smaller than the wanted block length, the ordinary blocking method clearly isn't going to work so well. Handling these cases well calls for a different strategy that I've called interleaving/uninterleaving in the past. Interleaving is used to transpose a few long rows into a long matrix with short rows, while uninterleaving goes the other way. Generally, an approach that works for one can be run backwards to do the other. The differences between interleaving and the general block strategy are that writes (or reads for uninterleaving) are entirely contiguous, and that the two block dimensions aren't the same. One dimension is forced by the matrix shape, and the other should be chosen as large as possible while fitting in a register. Or multiple registers, if you're brave enough. Alignment is a significant annoyance here, unless the smaller axis length is a power of two.</p>
-<p>I never explained how to implement the boolean case in a Dyalog publication, but it's mentioned in the boolean talk, and brought up in <a href="https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrinking-time/">Expanding Bits in Shrinking Time</a> (section &quot;Small expansion factors&quot;). The two methods used in Dyalog are pdep/pext instructions with a precomputed mask if available (this is straightforward), and shifting power-of-two-sized sections if not (harder, similar to various things shown in Hacker's Delight). It's possible that a SIMD implementation of the shifting logic could beat pdep/pext, but I haven't attempted this.</p>
-<p>The case with non-booleans and SIMD should be a lot simpler, as it can be done with byte shuffles. The complication is reading from or writing to the side with long rows. I think the right approach is to read or write most of a register at once, and split it up as part of the shuffling. It's probably enough to have one precomputed index vector and add an offset to it as necessary.</p>
-<h3 id="cache-associativity"><a class="header" href="#cache-associativity">Cache associativity</a></h3>
-<p>A problem to be aware of is that when an array is accessed with a stride that's a power of 2 or very close to it, different accesses compete for the same small set of cache lines in a <a href="https://en.wikipedia.org/wiki/Cache_associativity#Set-associative_cache">set-associative</a> cache. This can slow down something like a 512×512 transpose to the point that it's faster to pad the matrix, transpose, and remove the padding. But I don't have a good general model for when this requires action or what exactly to do about it. I'm not aware of any array language that tries to mitigate it.</p>
+<p>A single fixed-size kernel breaks down when one axis is smaller than the kernel length. Handling these cases well calls for a different strategy that I call interleaving or uninterleaving, depending on orientation. Interleaving can transpose a few long rows into a long matrix with short rows, while uninterleaving goes the other way. Generally, an approach that works for one can be run backwards to do the other. The differences between interleaving and a general kernel are that writes (or reads for uninterleaving) are entirely contiguous, and that the two subarray dimensions aren't the same. One dimension is forced by the matrix shape, and the other should be chosen as large as possible while fitting in a register. Or multiple registers, if you're brave enough. Alignment is a significant annoyance here, unless the smaller axis length is a power of two.</p>
+<p>The boolean case can use methods similar to those described <a href="take.html#bit-interleaving-and-uninterleaving">for Take and Drop</a>. For example, to transpose <code><span class='Value'>k</span></code> long rows, interleave each row to place <code><span class='Value'>k</span><span class='Function'>-</span><span class='Number'>1</span></code> zeros between consecutive bits, then OR the results from the different rows together at the appropriate offsets.</p>
+<p>The case with non-booleans and SIMD should be simpler, as it can be done with byte shuffles. The complication is reading from or writing to the side with long rows. I think the right approach is to read or write most of a register at once, and split it up as part of the shuffling. It's probably enough to have one precomputed index vector and add an offset to it as necessary.</p>
+<h3 id="cache-efficient-orderings"><a class="header" href="#cache-efficient-orderings">Cache-efficient orderings</a></h3>
+<p>Particularly with large elements and a good SIMD kernel, cache behavior becomes a problem: writing in index order, for instance, makes writes perfectly cached but leaves reads very poorly cached.</p>
+<p>A simple solution to this problem is blocking: for example, splitting the array into blocks of a few kilobytes and transposing each before moving to the next. If one of these small blocks stays in the L1 cache, then L1 is used efficiently over the course of a block. But moving between blocks is inefficient again: we'd want the traffic between blocks to use L2 when L1 is overloaded, but writes won't do that.</p>
+<p>Adding more layers of blocking one at a time might help, but a cache-oblivious layout—named for its ability to perform well regardless of what layers of cache exist—skips over this mess and adds layers fractally at all scales. Here's <a href="https://en.algorithmica.org/hpc/external-memory/oblivious/#matrix-transposition">one presentation</a> of this idea. A simpler recursive implementation is to just halve the longer side of the matrix at each step. At the other complexity extreme, Hilbert curves offer better locality and can be traced without recursion. A recent <a href="https://dl.acm.org/doi/10.1145/3555353">paper</a> with <a href="https://github.com/JoaoAlves95/HPC-Cache-Oblivious-Transposition">source code</a> offers a SIMD scheme to generate the curve that even the authors say is pointless for transpose because the overhead was low enough already. But it also explains the non-SIMD aspects well, if you have the time.</p>
+<h4 id="cache-associativity"><a class="header" href="#cache-associativity">Cache associativity</a></h4>
+<p>A more severe problem than just limited cache capacity is that when an array's stride is a power of 2 or very close to it, different accesses compete for the same small set of cache lines in a <a href="https://en.wikipedia.org/wiki/Cache_associativity#Set-associative_cache">set-associative</a> cache. This can slow down something like a naive 512×512 transpose to the point that it's faster to pad the matrix, transpose, and remove the padding. Fortunately it's also addressed by a cache-oblivious ordering, but if you don't have such an ordering you should try to choose block sizes so that side lengths aren't near powers of two.</p>
+<p>A set-associative cache can only store a fixed number of lines that have the same trailing bits in their address. So an 8-way associative cache might have space for 8 lines with addresses ending in 0x85c0. In the 512×512 case with 4-byte elements, every read in a column has the same last 9+2 bits as the previous one, and <code><span class='Number'>2</span><span class='Function'>⋆</span><span class='Number'>9</span></code> elements are read before getting to the next column. The cache would need <code><span class='Paren'>(</span><span class='Number'>2</span><span class='Function'>⋆</span><span class='Number'>9</span><span class='Function'>+</span><span class='Number'>2</span><span class='Function'>+</span><span class='Number'>9</span><span class='Paren'>)</span><span class='Function'>÷</span><span class='Number'>8</span></code>, or <code><span class='Number'>2</span><span class='Function'>⋆</span><span class='Number'>17</span></code>, bytes in each of its 8 ways to hold those elements, which it doesn't have, so cache lines from the beginning of the column start getting kicked out before reaching the next column, and those are exactly the ones you needed to keep around.</p>
<h2 id="arbitrary-reordering"><a class="header" href="#arbitrary-reordering">Arbitrary reordering</a></h2>
<p>Moving lots of axes around quickly gets hard to think about. Here's what I've found so far. Most of these methods work in some way for repeated axis indices (that is, taking diagonals) too.</p>
<h3 id="axis-simplification"><a class="header" href="#axis-simplification">Axis simplification</a></h3>
@@ -37,6 +43,6 @@
<p>Dyalog splits a general boolean transpose into an axis rotation, which swaps two groups of axes and thus works like a matrix transpose, and a second transpose that fixes the last axis. Since length-1 axes have been eliminated, that last transpose works with at least 2 bits at a time, so the worst case is <em>slightly</em> better than moving single bits around but still not great.</p>
<h3 id="recursion"><a class="header" href="#recursion">Recursion</a></h3>
<p>Unlike a 2D transpose, the basic implementation of an arbitrary reordering is not so obvious. It &quot;should&quot; be a loop with a variable amount of nesting. Of course, you could get rid of this nesting by working with index vectors, but that's very slow. The way Dyalog handles it is with a recursive function that does a layer of looping. Inside the loop it either calls itself or a base case, which is passed by a function pointer for flexibility. There are actually two loops, one that does one axis and one that does two at once. The single-axis version is only called at the top level in case there's an axis left over, so most of the work is done in this double-loop case.</p>
-<p>Of course larger loops are possible as well. But most of the time it's really the base case that's important. The base case also handles two axes most of the time, but can incorporate all the 2D optimizations like blocking.</p>
+<p>Of course larger loops are possible as well. But most of the time it's really the base case that's important. The base case also handles two axes most of the time, but can incorporate all the 2D optimizations like kernels.</p>
<h3 id="indices"><a class="header" href="#indices">Indices</a></h3>
<p>One thing I noticed when working with BQN is that CBQN's dyadic <code><span class='Function'>⍉</span></code> sometimes beat Dyalog's, despite using the BQN runtime instead of a native implementation. The runtime just constructs all the (ravel) indices for the output with a big addition table <code><span class='Function'>+</span><span class='Modifier'>⌜´</span></code>, then selects. I think this is actually a good strategy for handling trailing axes if there are a lot of small ones. Construct the index table for the bottom few axes of the result—which don't have to correspond to the bottom axes of the argument, even though for cache reasons it'd be nice if they did. Then the base case for transpose is simply selecting using these axes.</p>
diff --git a/implementation/primitive/transpose.md b/implementation/primitive/transpose.md
index 6e23ab73..2dedb3be 100644
--- a/implementation/primitive/transpose.md
+++ b/implementation/primitive/transpose.md
@@ -12,23 +12,35 @@ The scalar approach is simply to set `res[i][j] = arg[j][i]` in a loop. Doing th
My talk "Moving Bits Faster in Dyalog 16.0" ([video](https://dyalog.tv/Dyalog17/?v=2KnrDmZov4U), [slides](https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D15_The_Interpretive_Advantage.zip)) discusses optimizations for the boolean case, where a scalar loop is particularly bad because reading or writing a single bit is so slow. Booleans can be a little easier to work with than larger units, although there are also some unique issues to be addressed because they can only be accessed in aligned groups, while SIMD vectors don't have to be aligned.
-### Blocking
+### Kernels
-The boolean talk shows blocks (that is, subarrays) of 8×8 bits being transposed. SIMD can handle similar-sized or smaller blocks, depending on element size. This requires using multiple registers; I think the best approach is to use quite a few—that is, sticking to one register for booleans was probably not optimal.
+A transpose kernel is a unit that's transposed with a single loopless section of code. Typically kernels are square, with a power-of-two length on the sides. For example, the boolean talk shows subarrays of 8×8 bits being transposed, since each fits in one 64-bit register. But now I think that's a pretty small kernel, and working with 4 or more registers would be better, even sticking to general-purpose ones. Vector instructions both increase the size that can be held and speed up the transposing itself, so SIMD kernels are a big step up.
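+
+As a concrete sketch of this smallest kernel, here are the delta-swap layers for 8×8 bits in one 64-bit word, essentially the method from Hacker's Delight rather than any particular implementation's code. Bit `8*i+j` holds element `(i,j)`:
+
+```c
+#include <stdint.h>
+
+// 8x8 bit-matrix transpose in a 64-bit word: three delta-swap layers
+// exchange 1x1, then 2x2, then 4x4 sub-blocks across the diagonal.
+uint64_t transpose8x8(uint64_t x) {
+    uint64_t t;
+    t = (x ^ (x >>  7)) & 0x00AA00AA00AA00AAULL; x ^= t ^ (t <<  7);
+    t = (x ^ (x >> 14)) & 0x0000CCCC0000CCCCULL; x ^= t ^ (t << 14);
+    t = (x ^ (x >> 28)) & 0x00000000F0F0F0F0ULL; x ^= t ^ (t << 28);
+    return x;
+}
+```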
-Blocking is useful because it combines multiple reads that would be scattered in the ordinary scalar iteration (if writes are in-order). It costs about the same to read one element as one 32-byte register, so the larger read clearly gets a lot more done with the same memory load. But transposing huge numbers of elements in registers gets complicated, so at some point the CPU load will be big enough that increasing the block size is no longer worth it.
+The major benefit of transposing with kernels is to combine multiple reads that would be scattered in ordinary scalar iteration (if writes are in-order). It costs about the same to read one element as one 32-byte register, so the larger read clearly gets a lot more done with the same memory load. But transposing huge numbers of elements in registers gets expensive, so at some point the CPU load will be big enough that increasing the kernel size is no longer worth it.
+
+A matrix in SIMD registers can be transposed with unpack instructions in a [butterfly](https://en.wikipedia.org/wiki/Butterfly_diagram) pattern. This takes `k` layers for a `2⋆k` by `2⋆k` matrix, where each layer pairs up all registers. So the asymptotic cost for `n` elements is linear times an extra factor of `4⋆⁼n`.
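+
+For instance, a 4×4 matrix of 4-byte floats takes two unpack layers in SSE registers. This is a sketch of the well-known pattern; the same shape scales up to wider vectors:
+
+```c
+#include <immintrin.h>
+
+// Butterfly transpose of a 4x4 f32 matrix held in r[0..3], one row each.
+// Layer 1 interleaves adjacent rows; layer 2 interleaves pairs of pairs.
+static inline void transpose4x4_ps(__m128 r[4]) {
+    __m128 t0 = _mm_unpacklo_ps(r[0], r[1]);  // a0 b0 a1 b1
+    __m128 t1 = _mm_unpackhi_ps(r[0], r[1]);  // a2 b2 a3 b3
+    __m128 t2 = _mm_unpacklo_ps(r[2], r[3]);  // c0 d0 c1 d1
+    __m128 t3 = _mm_unpackhi_ps(r[2], r[3]);  // c2 d2 c3 d3
+    r[0] = _mm_movelh_ps(t0, t2);             // a0 b0 c0 d0
+    r[1] = _mm_movehl_ps(t2, t0);             // a1 b1 c1 d1
+    r[2] = _mm_movelh_ps(t1, t3);             // a2 b2 c2 d2
+    r[3] = _mm_movehl_ps(t3, t1);             // a3 b3 c3 d3
+}
+```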
### Interleaving
-When one axis is smaller than the wanted block length, the ordinary blocking method clearly isn't going to work so well. Handling these cases well calls for a different strategy that I've called interleaving/uninterleaving in the past. Interleaving is used to transpose a few long rows into a long matrix with short rows, while uninterleaving goes the other way. Generally, an approach that works for one can be run backwards to do the other. The differences between interleaving and the general block strategy are that writes (or reads for uninterleaving) are entirely contiguous, and that the two block dimensions aren't the same. One dimension is forced by the matrix shape, and the other should be chosen as large as possible while fitting in a register. Or multiple registers, if you're brave enough. Alignment is a significant annoyance here, unless the smaller axis length is a power of two.
+A single fixed-size kernel breaks down when one axis is smaller than the kernel length. Handling these cases well calls for a different strategy that I call interleaving or uninterleaving, depending on orientation. Interleaving can transpose a few long rows into a long matrix with short rows, while uninterleaving goes the other way. Generally, an approach that works for one can be run backwards to do the other. The differences between interleaving and a general kernel are that writes (or reads for uninterleaving) are entirely contiguous, and that the two subarray dimensions aren't the same. One dimension is forced by the matrix shape, and the other should be chosen as large as possible while fitting in a register. Or multiple registers, if you're brave enough. Alignment is a significant annoyance here, unless the smaller axis length is a power of two.
+
+The boolean case can use methods similar to those described [for Take and Drop](take.md#bit-interleaving-and-uninterleaving). For example, to transpose `k` long rows, interleave each row to place `k-1` zeros between consecutive bits, then OR the results from the different rows together at the appropriate offsets.
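+
+One way to do the spreading step on x86 is BMI2's `pdep` with a precomputed mask; here's a sketch for `k=2` (two rows of 32 bits into one 64-bit word), with the shift-based methods as the fallback when `pdep` is slow or missing:
+
+```c
+#include <stdint.h>
+#include <immintrin.h>  // BMI2: _pdep_u64
+
+// _pdep_u64 deposits each row's bits k=2 apart, i.e. with one zero after
+// each bit; OR-ing the rows at offsets 0 and 1 interleaves them.
+uint64_t interleave2(uint32_t row0, uint32_t row1) {
+    uint64_t a = _pdep_u64(row0, 0x5555555555555555ULL);  // even bit positions
+    uint64_t b = _pdep_u64(row1, 0xAAAAAAAAAAAAAAAAULL);  // odd bit positions
+    return a | b;
+}
+```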
+
+The case with non-booleans and SIMD should be simpler, as it can be done with byte shuffles. The complication is reading from or writing to the side with long rows. I think the right approach is to read or write most of a register at once, and split it up as part of the shuffling. It's probably enough to have one precomputed index vector and add an offset to it as necessary.
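+
+A sketch of that index-vector idea for `k=4` byte columns (shapes and interface assumed): SSSE3's `pshufb` zeroes any lane whose index has its high bit set, so one base vector plus a per-column offset does all the splitting:
+
+```c
+#include <stdint.h>
+#include <string.h>
+#include <immintrin.h>  // SSSE3: _mm_shuffle_epi8
+
+// Uninterleave one 16-byte load into 4 columns of 4 bytes. The base index
+// vector selects every 4th byte; adding c selects column c, and the -128
+// filler lanes stay negative so pshufb zeroes them.
+void uninterleave4(const uint8_t *src, uint8_t out[4][4]) {
+    const __m128i base = _mm_setr_epi8(
+        0, 4, 8, 12, -128, -128, -128, -128,
+        -128, -128, -128, -128, -128, -128, -128, -128);
+    __m128i v = _mm_loadu_si128((const __m128i *)src);
+    for (int c = 0; c < 4; c++) {
+        __m128i idx = _mm_add_epi8(base, _mm_set1_epi8((char)c));
+        uint32_t w = (uint32_t)_mm_cvtsi128_si32(_mm_shuffle_epi8(v, idx));
+        memcpy(out[c], &w, sizeof w);  // column c from the low 4 bytes
+    }
+}
+```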
+
+### Cache-efficient orderings
+
+Particularly with large elements and a good SIMD kernel, cache behavior becomes a problem: writing in index order, for instance, makes writes perfectly cached but leaves reads very poorly cached.
+
+A simple solution to this problem is blocking: for example, splitting the array into blocks of a few kilobytes and transposing each before moving to the next. If one of these small blocks stays in the L1 cache, then L1 is used efficiently over the course of a block. But moving between blocks is inefficient again: we'd want the traffic between blocks to use L2 when L1 is overloaded, but writes won't do that.
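+
+A sketch of a single level of blocking, with the tile size as a tuning parameter; the plain inner loops are where a kernel would go:
+
+```c
+#include <stddef.h>
+
+enum { B = 32 };  // 32*32*4 = 4KB per tile, comfortably inside L1
+
+// Transpose an n x m float matrix one B x B tile at a time.
+void transpose_blocked(float *dst, const float *src, size_t n, size_t m) {
+    for (size_t ib = 0; ib < n; ib += B)
+        for (size_t jb = 0; jb < m; jb += B)
+            for (size_t i = ib; i < n && i < ib + B; i++)
+                for (size_t j = jb; j < m && j < jb + B; j++)
+                    dst[j*n + i] = src[i*m + j];
+}
+```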
-I never explained how to implement the boolean case in a Dyalog publication, but it's mentioned in the boolean talk, and brought up in [Expanding Bits in Shrinking Time](https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrinking-time/) (section "Small expansion factors"). The two methods used in Dyalog are pdep/pext instructions with a precomputed mask if available (this is straightforward), and shifting power-of-two-sized sections if not (harder, similar to various things shown in Hacker's Delight). It's possible that a SIMD implementation of the shifting logic could beat pdep/pext, but I haven't attempted this.
+Adding more layers of blocking one at a time might help, but a cache-oblivious layout—named for its ability to perform well regardless of what layers of cache exist—skips over this mess and adds layers fractally at all scales. Here's [one presentation](https://en.algorithmica.org/hpc/external-memory/oblivious/#matrix-transposition) of this idea. A simpler recursive implementation is to just halve the longer side of the matrix at each step. At the other complexity extreme, Hilbert curves offer better locality and can be traced without recursion. A recent [paper](https://dl.acm.org/doi/10.1145/3555353) with [source code](https://github.com/JoaoAlves95/HPC-Cache-Oblivious-Transposition) offers a SIMD scheme to generate the curve that even the authors say is pointless for transpose because the overhead was low enough already. But it also explains the non-SIMD aspects well, if you have the time.
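+
+A sketch of the halve-the-longer-side recursion (base-case size assumed; its loop is again where a kernel would go):
+
+```c
+#include <stddef.h>
+
+// Transpose the sub-rectangle [r0,r1) x [c0,c1) of an n x m matrix,
+// splitting the longer side in half until the piece is small.
+// Top-level call: transpose_co(dst, src, n, m, 0, n, 0, m).
+void transpose_co(float *dst, const float *src, size_t n, size_t m,
+                  size_t r0, size_t r1, size_t c0, size_t c1) {
+    if (r1 - r0 <= 16 && c1 - c0 <= 16) {
+        for (size_t i = r0; i < r1; i++)
+            for (size_t j = c0; j < c1; j++)
+                dst[j*n + i] = src[i*m + j];
+    } else if (r1 - r0 >= c1 - c0) {
+        size_t rm = r0 + (r1 - r0)/2;
+        transpose_co(dst, src, n, m, r0, rm, c0, c1);
+        transpose_co(dst, src, n, m, rm, r1, c0, c1);
+    } else {
+        size_t cm = c0 + (c1 - c0)/2;
+        transpose_co(dst, src, n, m, r0, r1, c0, cm);
+        transpose_co(dst, src, n, m, r0, r1, cm, c1);
+    }
+}
+```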
-The case with non-booleans and SIMD should be a lot simpler, as it can be done with byte shuffles. The complication is reading from or writing to the side with long rows. I think the right approach is to read or write most of a register at once, and split it up as part of the shuffling. It's probably enough to have one precomputed index vector and add an offset to it as necessary.
+#### Cache associativity
-### Cache associativity
+A more severe problem than just limited cache capacity is that when an array's stride is a power of 2 or very close to it, different accesses compete for the same small set of cache lines in a [set-associative](https://en.wikipedia.org/wiki/Cache_associativity#Set-associative_cache) cache. This can slow down something like a naive 512×512 transpose to the point that it's faster to pad the matrix, transpose, and remove the padding. Fortunately it's also addressed by a cache-oblivious ordering, but if you don't have such an ordering you should try to choose block sizes so that side lengths aren't near powers of two.
-A problem to be aware of is that when an array is accessed with a stride that's a power of 2 or very close to it, different accesses compete for the same small set of cache lines in a [set-associative](https://en.wikipedia.org/wiki/Cache_associativity#Set-associative_cache) cache. This can slow down something like a 512×512 transpose to the point that it's faster to pad the matrix, transpose, and remove the padding. But I don't have a good general model for when this requires action or what exactly to do about it. I'm not aware of any array language that tries to mitigate it.
+A set-associative cache can only store a fixed number of lines that have the same trailing bits in their address. So an 8-way associative cache might have space for 8 lines with addresses ending in 0x85c0. In the 512×512 case with 4-byte elements, every read in a column has the same last 9+2 bits as the previous one, and `2⋆9` elements are read before getting to the next column. The cache would need `(2⋆9+2+9)÷8`, or `2⋆17`, bytes in each of its 8 ways to hold those elements, which it doesn't have, so cache lines from the beginning of the column start getting kicked out before reaching the next column, and those are exactly the ones you needed to keep around.
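+
+A sketch of the padding workaround (buffer handling simplified; 513 is just 512 plus one element):
+
+```c
+#include <string.h>
+
+enum { N = 512, P = N + 1 };  // padded row stride: no longer a power of 2
+
+// Copy into a padded buffer, then transpose out of it. Column reads now
+// move by P*4 bytes, so they spread over many cache sets instead of one.
+void transpose512(float *dst, const float *src) {
+    static float buf[N * P];
+    for (int i = 0; i < N; i++)
+        memcpy(&buf[i*P], &src[i*N], N * sizeof(float));
+    for (int i = 0; i < N; i++)
+        for (int j = 0; j < N; j++)
+            dst[i*N + j] = buf[j*P + i];
+}
+```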
## Arbitrary reordering
@@ -60,7 +72,7 @@ Dyalog splits a general boolean transpose into an axis rotation, which swaps tw
Unlike a 2D transpose, the basic implementation of an arbitrary reordering is not so obvious. It "should" be a loop with a variable amount of nesting. Of course, you could get rid of this nesting by working with index vectors, but that's very slow. The way Dyalog handles it is with a recursive function that does a layer of looping. Inside the loop it either calls itself or a base case, which is passed by a function pointer for flexibility. There are actually two loops, one that does one axis and one that does two at once. The single-axis version is only called at the top level in case there's an axis left over, so most of the work is done in this double-loop case.
-Of course larger loops are possible as well. But most of the time it's really the base case that's important. The base case also handles two axes most of the time, but can incorporate all the 2D optimizations like blocking.
+Of course larger loops are possible as well. But most of the time it's really the base case that's important. The base case also handles two axes most of the time, but can incorporate all the 2D optimizations like kernels.
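+
+A sketch of the shape of this recursion, with assumed signatures rather than Dyalog's actual code: the result is written in order while each level walks one result axis by its stride into the argument, and the base case comes in as a function pointer so an optimized 2D kernel can be swapped in:
+
+```c
+#include <stddef.h>
+#include <string.h>
+
+typedef char *(*BaseFn)(char *dst, const char *src, size_t esz,
+                        const size_t *shape, const size_t *stride);
+
+// shape is the result shape; stride[k] is the argument's element stride
+// along result axis k; esz is the element size. Writes are contiguous.
+// Axis simplification has already guaranteed at least 2 axes.
+char *reorder(char *dst, const char *src, size_t esz,
+              const size_t *shape, const size_t *stride,
+              int axes, BaseFn base) {
+    if (axes <= 2) return base(dst, src, esz, shape, stride);
+    for (size_t i = 0; i < shape[0]; i++)
+        dst = reorder(dst, src + i*stride[0]*esz, esz,
+                      shape+1, stride+1, axes-1, base);
+    return dst;
+}
+
+// Plain element-at-a-time base case for the last two axes.
+char *base2(char *dst, const char *src, size_t esz,
+            const size_t *shape, const size_t *stride) {
+    for (size_t i = 0; i < shape[0]; i++)
+        for (size_t j = 0; j < shape[1]; j++) {
+            memcpy(dst, src + (i*stride[0] + j*stride[1])*esz, esz);
+            dst += esz;
+        }
+    return dst;
+}
+```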
### Indices
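+
+A sketch of the addition-table index strategy (interface assumed): build every result ravel index with one outer-product layer per axis, like `+⌜´`, then select:
+
+```c
+#include <stdlib.h>
+#include <string.h>
+
+// rshape is the result shape; astride[k] is the argument's element stride
+// along result axis k. Builds the full index table in place, then selects.
+void reorder_indices(char *dst, const char *src, size_t esz,
+                     const size_t *rshape, const size_t *astride, int rank) {
+    size_t total = 1;
+    for (int k = 0; k < rank; k++) total *= rshape[k];
+    size_t *ix = malloc(total * sizeof *ix);
+    size_t len = 1;
+    ix[0] = 0;
+    for (int k = rank - 1; k >= 0; k--) {  // one addition-table layer per axis
+        for (size_t i = rshape[k]; i-- > 0; )   // descending order lets the
+            for (size_t j = len; j-- > 0; )     // table grow in place
+                ix[i*len + j] = i*astride[k] + ix[j];
+        len *= rshape[k];
+    }
+    for (size_t t = 0; t < total; t++)     // the selection
+        memcpy(dst + t*esz, src + ix[t]*esz, esz);
+    free(ix);
+}
+```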