Diffstat (limited to 'implementation/primitive')
-rw-r--r-- implementation/primitive/transpose.md | 30
1 file changed, 21 insertions(+), 9 deletions(-)
diff --git a/implementation/primitive/transpose.md b/implementation/primitive/transpose.md
index 6e23ab73..2dedb3be 100644
--- a/implementation/primitive/transpose.md
+++ b/implementation/primitive/transpose.md
@@ -12,23 +12,35 @@ The scalar approach is simply to set `res[i][j] = arg[j][i]` in a loop. Doing th
My talk "Moving Bits Faster in Dyalog 16.0" ([video](https://dyalog.tv/Dyalog17/?v=2KnrDmZov4U), [slides](https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D15_The_Interpretive_Advantage.zip)) discusses optimizations for the boolean case, where a scalar loop is particularly bad because reading or writing a single bit is so slow. Booleans can be a little easier to work with than larger units, although there are also some unique issues to be addressed because they can only be accessed in aligned groups, while SIMD vectors don't have to be aligned.
-### Blocking
+### Kernels
-The boolean talk shows blocks (that is, subarrays) of 8×8 bits being transposed. SIMD can handle similar-sized or smaller blocks, depending on element size. This requires using multiple registers; I think the best approach is to use quite a few—that is, sticking to one register for booleans was probably not optimal.
+A transpose kernel is a unit that's transposed with a single loopless section of code. Typically kernels are square, with a power-of-two length on each side. For example, the boolean talk shows subarrays of 8×8 bits being transposed, since each fits in one 64-bit register. But now I think that's a pretty small kernel, and using 4 or more registers would be better, even sticking to general-purpose ones. Vector instructions both increase the size that can be held and speed up the transposing itself, so SIMD kernels are a big step up.
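+
+As a concrete reference point, here's a minimal sketch of that one-register 8×8 boolean kernel as three delta swaps with the well-known masks (the same shifting idea as in Hacker's Delight; the function name is just for illustration):
+
+```c
+#include <stdint.h>
+
+// Transpose an 8x8 bit matrix packed one row per byte, with bit j of
+// byte i holding element (i,j). Each delta swap exchanges blocks across
+// the diagonal: 1x1 blocks, then 2x2, then 4x4.
+static uint64_t transpose8x8(uint64_t x) {
+    uint64_t t;
+    t = (x ^ (x >> 7))  & 0x00AA00AA00AA00AA; x ^= t ^ (t << 7);
+    t = (x ^ (x >> 14)) & 0x0000CCCC0000CCCC; x ^= t ^ (t << 14);
+    t = (x ^ (x >> 28)) & 0x00000000F0F0F0F0; x ^= t ^ (t << 28);
+    return x;
+}
+```
+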
-Blocking is useful because it combines multiple reads that would be scattered in the ordinary scalar iteration (if writes are in-order). It costs about the same to read one element as one 32-byte register, so the larger read clearly gets a lot more done with the same memory load. But transposing huge numbers of elements in registers gets complicated, so at some point the CPU load will be big enough that increasing the block size is no longer worth it.
+The major benefit of transposing with kernels is to combine multiple reads that would be scattered in ordinary scalar iteration (if writes are in-order). It costs about the same to read one element as one 32-byte register, so the larger read clearly gets a lot more done with the same memory load. But transposing huge numbers of elements in registers gets expensive, so at some point the CPU load will be big enough that increasing the kernel size is no longer worth it.
+
+A matrix in SIMD registers can be transposed with unpack instructions in a [butterfly](https://en.wikipedia.org/wiki/Butterfly_diagram) pattern. This takes `k` layers for a `2⋆k` by `2⋆k` matrix, where each layer pairs up all registers. So the asymptotic cost for `n` elements is linear times an extra factor of `4⋆⁼n`.
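+
+For example, here's a sketch of the smallest interesting SIMD case: a 4×4 kernel of 32-bit elements in SSE2 registers, where the two unpack layers form the `k=2` butterfly (the function name is illustrative):
+
+```c
+#include <emmintrin.h>  // SSE2
+
+// Transpose a 4x4 matrix of 32-bit elements held in four SSE registers.
+// Layer 1 interleaves 32-bit elements; layer 2 interleaves 64-bit pairs.
+static inline void transpose_4x4_epi32(__m128i r[4]) {
+    __m128i t0 = _mm_unpacklo_epi32(r[0], r[1]);  // a0 b0 a1 b1
+    __m128i t1 = _mm_unpackhi_epi32(r[0], r[1]);  // a2 b2 a3 b3
+    __m128i t2 = _mm_unpacklo_epi32(r[2], r[3]);  // c0 d0 c1 d1
+    __m128i t3 = _mm_unpackhi_epi32(r[2], r[3]);  // c2 d2 c3 d3
+    r[0] = _mm_unpacklo_epi64(t0, t2);  // a0 b0 c0 d0
+    r[1] = _mm_unpackhi_epi64(t0, t2);  // a1 b1 c1 d1
+    r[2] = _mm_unpacklo_epi64(t1, t3);  // a2 b2 c2 d2
+    r[3] = _mm_unpackhi_epi64(t1, t3);  // a3 b3 c3 d3
+}
+```
+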
### Interleaving
-When one axis is smaller than the wanted block length, the ordinary blocking method clearly isn't going to work so well. Handling these cases well calls for a different strategy that I've called interleaving/uninterleaving in the past. Interleaving is used to transpose a few long rows into a long matrix with short rows, while uninterleaving goes the other way. Generally, an approach that works for one can be run backwards to do the other. The differences between interleaving and the general block strategy are that writes (or reads for uninterleaving) are entirely contiguous, and that the two block dimensions aren't the same. One dimension is forced by the matrix shape, and the other should be chosen as large as possible while fitting in a register. Or multiple registers, if you're brave enough. Alignment is a significant annoyance here, unless the smaller axis length is a power of two.
+A single fixed-size kernel breaks down when one axis is smaller than the kernel length. Handling these cases well calls for a different strategy that I call interleaving or uninterleaving, depending on orientation. Interleaving can transpose a few long rows into a long matrix with short rows, while uninterleaving goes the other way. Generally, an approach that works for one can be run backwards to do the other. The differences between interleaving and a general kernel are that writes (or reads for uninterleaving) are entirely contiguous, and that the two subarray dimensions aren't the same. One dimension is forced by the matrix shape, and the other should be chosen as large as possible while fitting in a register. Or multiple registers, if you're brave enough. Alignment is a significant annoyance here, unless the smaller axis length is a power of two.
+
+The boolean case can use methods similar to those described [for Take and Drop](take.md#bit-interleaving-and-uninterleaving). For example, to transpose `k` long rows, interleave each row to place `k-1` zeros between adjacent bits, then OR the results from the different rows together at the appropriate offsets.
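+
+Here's a sketch of the `k=4` case by the pdep route (assuming BMI2 is available; the helper is hypothetical):
+
+```c
+#include <stdint.h>
+#include <immintrin.h>  // BMI2: _pdep_u64
+
+// Interleave k=4 boolean rows of 16 bits each: spread every row's bits
+// to each 4th position with pdep, then OR the rows at offsets 0..3.
+// The result is the transpose packed as 16 rows of 4 bits.
+static uint64_t interleave4(uint16_t r0, uint16_t r1,
+                            uint16_t r2, uint16_t r3) {
+    const uint64_t m = 0x1111111111111111;  // bits 0, 4, 8, ...
+    return _pdep_u64(r0, m)      | _pdep_u64(r1, m << 1)
+         | _pdep_u64(r2, m << 2) | _pdep_u64(r3, m << 3);
+}
+```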
+
+The case with non-booleans and SIMD should be simpler, as it can be done with byte shuffles. The complication is reading from or writing to the side with long rows. I think the right approach is to read or write most of a register at once, and split it up as part of the shuffling. It's probably enough to have one precomputed index vector and add an offset to it as necessary.
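+
+I haven't tested this, but the read side for a row length like 3 might look like the following SSSE3 sketch, with one precomputed index vector gathering a column and an added offset selecting which one:
+
+```c
+#include <stdint.h>
+#include <string.h>
+#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8
+
+// Uninterleave a matrix with 3-byte rows into 3 long rows of n bytes.
+// Each load covers 5 full rows (15 of 16 bytes used, so it may read one
+// byte past the input); the tail would be handled separately.
+static void uninterleave3(const uint8_t *src, uint8_t *dst[3], size_t n) {
+    // Positions of column 0 in a load; -128 lanes shuffle to zero.
+    const __m128i idx = _mm_setr_epi8(0, 3, 6, 9, 12, -128, -128, -128,
+                          -128, -128, -128, -128, -128, -128, -128, -128);
+    for (size_t i = 0; i + 5 <= n; i += 5) {
+        __m128i v = _mm_loadu_si128((const __m128i *)(src + 3 * i));
+        for (int j = 0; j < 3; j++) {
+            // Adding j to the index vector gathers column j instead.
+            __m128i c = _mm_shuffle_epi8(v, _mm_add_epi8(idx, _mm_set1_epi8(j)));
+            memcpy(dst[j] + i, &c, 5);  // first 5 bytes are column j
+        }
+    }
+}
+```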
+
+### Cache-efficient orderings
+
+Particularly with large elements and a good SIMD kernel, cache behavior begins to be a problem: if the result is written in index order, for instance, the writes are perfectly cached but the scattered reads are cached very poorly.
+
+A simple solution to this problem is blocking: for example, split the array into blocks of a few kilobytes each, and transpose each block before moving to the next. If one of these small blocks stays in the L1 cache, then L1 is used efficiently over the course of a block. But moving between blocks is inefficient again: we'd want those accesses to use L2 when L1 is overloaded, but the writes won't do that.
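+
+One level of blocking might look like the following sketch, where the block side is illustrative (32×32 doubles is 8KB, comfortably inside a 32KB L1):
+
+```c
+#include <stddef.h>
+
+enum { B = 32 };  // block side, in elements
+
+// Transpose an h-by-w row-major matrix tile by tile, so that both the
+// reads and the writes of a tile stay cache-resident while it's handled.
+static void transpose_blocked(const double *src, double *dst,
+                              size_t h, size_t w) {
+    for (size_t ib = 0; ib < h; ib += B)
+        for (size_t jb = 0; jb < w; jb += B) {
+            size_t imax = ib + B < h ? ib + B : h;
+            size_t jmax = jb + B < w ? jb + B : w;
+            for (size_t i = ib; i < imax; i++)
+                for (size_t j = jb; j < jmax; j++)
+                    dst[j * h + i] = src[i * w + j];
+        }
+}
+```
+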
-I never explained how to implement the boolean case in a Dyalog publication, but it's mentioned in the boolean talk, and brought up in [Expanding Bits in Shrinking Time](https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrinking-time/) (section "Small expansion factors"). The two methods used in Dyalog are pdep/pext instructions with a precomputed mask if available (this is straightforward), and shifting power-of-two-sized sections if not (harder, similar to various things shown in Hacker's Delight). It's possible that a SIMD implementation of the shifting logic could beat pdep/pext, but I haven't attempted this.
+Adding more layers of blocking one at a time might help, but a cache-oblivious layout—named for its ability to perform well regardless of what layers of cache exist—skips over this mess and adds layers fractally at all scales. Here's [one presentation](https://en.algorithmica.org/hpc/external-memory/oblivious/#matrix-transposition) of this idea. A simpler recursive implementation is to just halve the longer side of the matrix at each step. At the other complexity extreme, Hilbert curves offer better locality and can be traced without recursion. A recent [paper](https://dl.acm.org/doi/10.1145/3555353) with [source code](https://github.com/JoaoAlves95/HPC-Cache-Oblivious-Transposition) offers a SIMD scheme to generate the curve that even the authors say is pointless for transpose because the overhead was low enough already. But it also explains the non-SIMD aspects well, if you have the time.
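+
+Here's a sketch of that simpler recursive form, cutting the longer side in half until the block is small enough to handle directly (a real implementation would put a kernel in the base case):
+
+```c
+#include <stddef.h>
+
+// Cache-oblivious transpose of the block [i0,i1)×[j0,j1) of an h-by-w
+// row-major matrix: halve the longer side until the base case.
+static void transpose_rec(const double *src, double *dst,
+                          size_t i0, size_t i1, size_t j0, size_t j1,
+                          size_t w, size_t h) {
+    size_t di = i1 - i0, dj = j1 - j0;
+    if (di <= 16 && dj <= 16) {
+        for (size_t i = i0; i < i1; i++)
+            for (size_t j = j0; j < j1; j++)
+                dst[j * h + i] = src[i * w + j];
+    } else if (di >= dj) {  // halve the longer side
+        transpose_rec(src, dst, i0, i0 + di / 2, j0, j1, w, h);
+        transpose_rec(src, dst, i0 + di / 2, i1, j0, j1, w, h);
+    } else {
+        transpose_rec(src, dst, i0, i1, j0, j0 + dj / 2, w, h);
+        transpose_rec(src, dst, i0, i1, j0 + dj / 2, j1, w, h);
+    }
+}
+// Whole matrix: transpose_rec(src, dst, 0, h, 0, w, w, h);
+```
+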
-The case with non-booleans and SIMD should be a lot simpler, as it can be done with byte shuffles. The complication is reading from or writing to the side with long rows. I think the right approach is to read or write most of a register at once, and split it up as part of the shuffling. It's probably enough to have one precomputed index vector and add an offset to it as necessary.
+#### Cache associativity
-### Cache associativity
+A more severe problem than just limited cache capacity is that when an array's stride is a power of 2 or very close to it, different accesses compete for the same small set of cache lines in a [set-associative](https://en.wikipedia.org/wiki/Cache_associativity#Set-associative_cache) cache. This can slow down something like a naive 512×512 transpose to the point that it's faster to pad the matrix, transpose, and remove the padding. Fortunately it's also addressed by a cache-oblivious ordering, but if you don't have such an ordering you should try to choose block sizes so that side lengths aren't near powers of two.
-A problem to be aware of is that when an array is accessed with a stride that's a power of 2 or very close to it, different accesses compete for the same small set of cache lines in a [set-associative](https://en.wikipedia.org/wiki/Cache_associativity#Set-associative_cache) cache. This can slow down something like a 512×512 transpose to the point that it's faster to pad the matrix, transpose, and remove the padding. But I don't have a good general model for when this requires action or what exactly to do about it. I'm not aware of any array language that tries to mitigate it.
+A set-associative cache can only store a fixed number of lines that have the same trailing bits in their address. So an 8-way associative cache might have space for 8 lines with addresses ending in 0x85f0. In the 512×512 case with 4-byte elements, every read in a column has the same last 9+2 bits as the previous one, and `2⋆9` elements are read before getting to the next column. Those reads span `2⋆9+2+9` bytes, so each of the cache's 8 ways would need `(2⋆9+2+9)÷8`, or `2⋆17`, bytes to keep them all resident, which it doesn't have. Cache lines from the beginning of the column get kicked out before reaching the next column, and those are exactly the ones you needed to keep around.
## Arbitrary reordering
@@ -60,7 +72,7 @@ Dyalog splits a general boolean transpose into an axis rotation, which swaps tw
Unlike a 2D transpose, the basic implementation of an arbitrary reordering is not so obvious. It "should" be a loop with a variable amount of nesting. Of course, you could get rid of this nesting by working with index vectors, but that's very slow. The way Dyalog handles it is with a recursive function that does a layer of looping. Inside the loop it either calls itself or a base case, which is passed by a function pointer for flexibility. There are actually two loops, one that does one axis and one that does two at once. The single-axis version is only called at the top level in case there's an axis left over, so most of the work is done in this double-loop case.
-Of course larger loops are possible as well. But most of the time it's really the base case that's important. The base case also handles two axes most of the time, but can incorporate all the 2D optimizations like blocking.
+Of course larger loops are possible as well. But most of the time it's really the base case that's important. The base case also handles two axes most of the time, but can incorporate all the 2D optimizations like kernels.
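+
+Here's a simplified single-axis sketch of the shape of this recursion (not Dyalog's actual code; strides here are in bytes):
+
+```c
+#include <stddef.h>
+
+// Base case supplied as a function pointer, so different kernels can be
+// plugged in under the same looping structure.
+typedef void (*base_fn)(const char *src, char *dst);
+
+// Each level loops over one axis and recurses on the rest.
+static void reorder_rec(const char *src, char *dst, int axes,
+                        const size_t *len, const size_t *sstride,
+                        const size_t *dstride, base_fn base) {
+    if (axes == 0) { base(src, dst); return; }
+    for (size_t i = 0; i < *len; i++)
+        reorder_rec(src + i * *sstride, dst + i * *dstride, axes - 1,
+                    len + 1, sstride + 1, dstride + 1, base);
+}
+```
+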
### Indices