Diffstat (limited to 'implementation/primitive/sort.md')
 implementation/primitive/sort.md | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)
diff --git a/implementation/primitive/sort.md b/implementation/primitive/sort.md
index 66426a40..713b4266 100644
--- a/implementation/primitive/sort.md
+++ b/implementation/primitive/sort.md
@@ -12,7 +12,7 @@ Merge sort is better. It is deterministic, stable, and has optimal worst-case pe
But that doesn't mean merge sort is always faster. Quicksort seems to work a little better branchlessly. For sorting, quicksort's partitioning can reduce the range of the data enough to use an extremely quick counting sort. Partitioning is also a natural fit for binary search, where it's mandatory for sensible cache behavior with large enough arguments. So it can be useful. But it doesn't merge, and can't easily be made to merge, and that's a shame.
-The same applies to the general categories of partitioning sorts (quicksort, radix sort, samplesort) and merging sorts (mergesort, timsort, multimerges). Radix sorts are definitely the best for some types and lengths, although the scattered accesses make their performance unpredictable and I think overall they're not worth it. A million uniformly random 4-byte integers is nearly the best possible case for radix sort, so the fact that this seems to be the go-to sorting benchmark means radix sorting looks better than it is.
+The same applies to the general categories of partitioning sorts (quicksort, radix sort, samplesort) and merging sorts (mergesort, timsort, multimerges). Radix sort is a weird one: very fast on lots of inputs, but not adaptive at all, and in fact actively un-adaptive on common patterns because of cache associativity. It's very hard to know when it's a good choice.
## On binary search
@@ -30,19 +30,33 @@ For **Bins**, use a branching binary search: see [On binary search](#on-binary-s
The name of the game here is "branchless".
-For sorting, the fastest algorithms for generic data and generic hardware are branchless [quicksorts](#quicksort). Fluxsort is new and very exciting because it's a *stable* algorithm that's substantially faster than runner-up pdqsort on random arrays. However, pdqsort still has some advantages, and there are unstable techniques that could be used to improve Fluxsort when stability doesn't matter.
+For **Sort**:
+- Insertion sort is best for small data (*maybe* [quadsort](https://github.com/scandum/quadsort), but I worry about the cache footprint if you just sort a small array somewhere in a long loop).
+- Then [counting sort](#distribution-sorts) for 1-byte types (and obviously for 1-bit regardless of length).
+- Branchless [quicksorts](#quicksort) are the solid choice for larger types, particularly since they can track ranges and call counting and other distribution sorts when appropriate.
+- But for 2- and 4-byte data, [radix sort](#radix-sort) might be a lot faster. For 2-byte sort, I think it's a better bridge than fluxsort between insertion and counting sort (but scan for sortedness first); for 4-byte, it's hard to say.
-A branchless binary search is adequate for Bins but in many cases—very small or large `𝕨`, and small range—there are better methods.
+**Grade** is basically the same (now that fluxsort gives us a good stable quicksort), except you can't use radix sort and have to replace counting sort with the slower bucket sort.
-### Counting and bucket sort
+A branchless binary search is adequate for **Bins** but in many cases—very small or large `𝕨`, and small range—there are better methods.
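+
+As a point of reference, a branchless lower-bound search might look like the sketch below. This is just the well-known technique, not code from any particular implementation, and whether the conditional move actually materializes depends on the compiler.
+
+```c
+#include <stddef.h>
+
+// Branchless binary search: number of elements of sorted x less than key.
+// The loop body has no branch on the data; the ?: is meant to become a
+// conditional move.
+size_t lower_bound(const int *x, size_t n, int key) {
+    const int *base = x;
+    while (n > 1) {
+        size_t half = n / 2;
+        base += (base[half - 1] < key) ? half : 0;  // cmov, not a branch
+        n -= half;
+    }
+    return (size_t)(base - x) + (n > 0 && base[0] < key);
+}
+```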
+
+### Distribution sorts
Both counting and bucket sort are small-range algorithms that begin by counting the number of each possible value. Bucket sort, as used here, means that the counts are then used to place values in the appropriate position in the result in another pass. Counting sort does not read from the initial values again and instead reconstructs them from the counts. It might be written `(//⁼)⌾(-⟜min)` in BQN, relying on the extension of `/⁼` to unsorted arguments.
Bucket sort can be used for Grade or sort-by (`⍋⊸⊏`), but counting sort only works for sorting itself. It's not-even-unstable: there's no connection between result values and the input values except that they are constructed to be equal. But with [fast Indices](replicate.md#non-booleans-to-indices), counting sort is vastly more powerful, and is effective with a range four to eight times the argument length. This is large enough that it might pose a memory usage problem, but the memory use can be made arbitrarily low by partitioning.
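+
+As a concrete sketch of the two (for 1-byte keys; illustrative code, not taken from any implementation): counting sort regenerates the values from the histogram alone, while bucket sort turns the histogram into offsets and scatters, which is what lets it produce a grade.
+
+```c
+#include <stdint.h>
+#include <string.h>
+
+// Counting sort: count the values, then reconstruct them from the
+// counts, never reading the input again.
+void counting_u8(uint8_t *x, size_t n) {
+    size_t count[256] = {0};
+    for (size_t i = 0; i < n; i++) count[x[i]]++;
+    uint8_t *dst = x;
+    for (int v = 0; v < 256; v++) { memset(dst, v, count[v]); dst += count[v]; }
+}
+
+// Bucket sort used as a grade: counts become starting offsets, and a
+// second pass scatters each index to its position (stable).
+void grade_u8(const uint8_t *x, uint32_t *r, size_t n) {
+    size_t off[256] = {0};
+    for (size_t i = 0; i < n; i++) off[x[i]]++;
+    size_t sum = 0;
+    for (int v = 0; v < 256; v++) { size_t c = off[v]; off[v] = sum; sum += c; }
+    for (size_t i = 0; i < n; i++) r[off[x[i]]++] = (uint32_t)i;
+}
+```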
+I developed [Robin Hood Sort](https://github.com/mlochbaum/rhsort) as an algorithm with similar properties to bucket sort that relies on uniformly-distributed data rather than a small range. It uses a buffer a few times larger than the input array, and inserts values in a manner similar to a hash table with linear probing, shifting large clumps out if they appear—they're merge-sorted back in at the end. Like counting sort, the substantial memory use can be cut down by partitioning. And it should be cheap to detect probable uniformity during median selection, making this a good fit for quicksorts. Emphasis on probable: it's still very important that RHsort has decent worst-case performance.
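+
+A heavily simplified sketch of the core insertion step is below. It omits RHsort's clump eviction and final merge, so unlike the real algorithm it degrades badly on non-uniform data; the sentinel and buffer sizing are illustrative choices, not RHsort's.
+
+```c
+#include <stdint.h>
+#include <stdlib.h>
+
+#define EMPTY INT32_MIN  // sentinel, assumed not to occur in the data
+
+void rh_sketch(int32_t *x, size_t n) {
+    if (n < 2) return;
+    int32_t min = x[0], max = x[0];
+    for (size_t i = 1; i < n; i++) {
+        if (x[i] < min) min = x[i];
+        if (x[i] > max) max = x[i];
+    }
+    size_t slots = 4 * n;                 // buffer a few times the input,
+    int32_t *buf = malloc((slots + n) * sizeof(int32_t));  // plus shift slack
+    for (size_t j = 0; j < slots + n; j++) buf[j] = EMPTY;
+    double scale = (double)(slots - 1) / ((double)max - (double)min + 1);
+    for (size_t i = 0; i < n; i++) {
+        int32_t v = x[i];
+        size_t j = (size_t)(scale * ((double)v - (double)min));
+        while (buf[j] != EMPTY && buf[j] <= v) j++;  // linear probing
+        for (int32_t c = v; c != EMPTY; j++) {       // insert, shifting the
+            int32_t t = buf[j]; buf[j] = c; c = t;   // rest of the clump right
+        }
+    }
+    size_t k = 0;                         // occupied slots are now sorted
+    for (size_t j = 0; j < slots + n && k < n; j++)
+        if (buf[j] != EMPTY) x[k++] = buf[j];
+    free(buf);
+}
+```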
+
+#### Radix sort
+
+LSD radix sort is really fast, like three times faster than fluxsort on random 4-byte data. The idea is: bucket sort by the lowest byte, then the next-lowest, on up to the highest byte. The array is now sorted, after most likely having been scrambled substantially along the way (definitely not stable). It's tricky to implement right, though. The `sort_inline` functions from `ska_sort_copy` [here](https://github.com/skarupke/ska_sort) are good: they count buckets for every step in one pass, and move back and forth between the array and a buffer instead of adding more memory. Radix sort uses memory proportional to the input array length, plus a constant. But that constant is a liability on short arrays, so it's only really useful for sizes above a few hundred.
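+
+The overall shape is sketched below. This is the general structure, not ska_sort_copy's actual code; a real implementation would also use the histograms to skip passes where a single bucket holds everything.
+
+```c
+#include <stdint.h>
+#include <stdlib.h>
+
+// LSD radix sort for unsigned 4-byte values: one pass gathers all four
+// byte histograms, then each pass scatters between the array and a
+// single buffer. Four passes means the data ends back in x.
+void radix_u32(uint32_t *x, size_t n) {
+    uint32_t *buf = malloc(n * sizeof(uint32_t));
+    size_t count[4][256] = {0};
+    for (size_t i = 0; i < n; i++)        // one counting pass for all bytes
+        for (int b = 0; b < 4; b++) count[b][(x[i] >> (8*b)) & 0xFF]++;
+    uint32_t *src = x, *dst = buf;
+    for (int b = 0; b < 4; b++) {         // lowest byte first
+        size_t off[256], sum = 0;         // exclusive prefix sums
+        for (int c = 0; c < 256; c++) { off[c] = sum; sum += count[b][c]; }
+        for (size_t i = 0; i < n; i++)    // scatter by byte b
+            dst[off[(src[i] >> (8*b)) & 0xFF]++] = src[i];
+        uint32_t *t = src; src = dst; dst = t;
+    }
+    free(buf);
+}
+```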
+
+LSD radix sort suffers from problems of cache associativity. Now, usually (for, say, non-blocked transpose) such problems strike only at power-of-2 lengths. But by picking out input bytes, radix sort tends to create its own powers of 2. Consider an input consisting of the ascending natural numbers `↕n`. The lowest byte is fine: the bucket lengths are around `n÷256`. Next byte up, problems: this byte only changes once every 256 inputs, so every bucket but one has a length that's a multiple of 256! And writes will cycle around these buckets, so they stay roughly in sync. This is enough to overwhelm any [set-associative](https://en.wikipedia.org/wiki/Cache_associativity#Set-associative_cache) cache. I measured a degradation of about 5 times on that pass and 3 times overall. The case with bucket lengths near multiples of 256—they need to be separated by an entire cache line not to conflict—is detectable after the cheap counting pass, but it's not the only way this pattern can arise. For example, put a bunch of zeros at the beginning of the array. The first bucket now has some arbitrary length, but once the zeros are processed the gap between it and the next is back to being a multiple of 256. The good news is that it still requires a lot of space to start kicking out a bunch of cache lines: below 10,000 4-byte elements I could never measure significant degradation. So if the instability and lack of adaptivity (and O(n) memory, of course) don't bother you, radix sort is kind of the best thing going for 4-byte values in the 500 to 20,000 range.
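+
+To put numbers on this (assuming a typical 32KiB, 8-way L1 cache with 64-byte lines, so 64 sets): buckets whose lengths are multiples of 256 4-byte elements put consecutive write pointers 1024 bytes, or 16 sets, apart. The 256 pointers then land on only 64/16 = 4 distinct sets, 64 pointers competing for 8 ways in each, so nearly every bucket write evicts a line that another bucket still needs.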
+
### Quicksort
-[Fluxsort](https://github.com/scandum/fluxsort) attains high performance with a branchless stable partition that places one half on top of existing data and the other half somewhere else. One half ends up in the appropriate place in the sorted array. The other is in swap memory, and will be shifted back by subsequent partitions and base-case sorting. Aside from the partitioning strategy, Fluxsort makes a number of other decisions differently from pdqsort, including a fairly complicated merge sort ([Quadsort](https://github.com/scandum/quadsort)) as the base case. I haven't looked into Quadsort, but did discuss other features with the author in [this issue](https://github.com/scandum/fluxsort/issues/1). Pivot selection is an important one—it seems pdqsort uses far fewer pivots than it should.
+[Fluxsort](https://github.com/scandum/fluxsort) attains high performance with a branchless stable partition that places one half on top of existing data and the other half somewhere else. One half ends up in the appropriate place in the sorted array. The other is in swap memory, and will be shifted back by subsequent partitions and base-case sorting. Aside from the partitioning strategy, Fluxsort makes a number of other decisions differently from pdqsort, including a fairly complicated merge sort ([Quadsort](https://github.com/scandum/quadsort)) as the base case. I haven't looked into Quadsort, but did discuss other features with the author in [this issue](https://github.com/scandum/fluxsort/issues/1). Pivot selection is an important one—it seems pdqsort uses far fewer pivots than it should. Picking out a larger sample of pivots also opens up the opportunity of performing statistics on them, or checking for a run while the cache line's hot.
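+
+The partition itself might look like the following sketch (the idea as I understand it, not fluxsort's code; fluxsort also defers the copy-back rather than doing it immediately as here):
+
+```c
+#include <string.h>
+
+// Branchless stable partition: every element is written to both possible
+// destinations, but only one pointer advances, so there is no branch on
+// the comparison. Elements <= pivot end up in place at the front of x;
+// the rest land in swap, in order, to be handled later.
+size_t flux_partition(int *x, int *swap, size_t n, int pivot) {
+    size_t lo = 0, hi = 0;
+    for (size_t i = 0; i < n; i++) {
+        int v = x[i];
+        int c = v <= pivot;   // comparison result as 0/1
+        x[lo] = v;            // safe: lo <= i, and x[i] was already read
+        swap[hi] = v;
+        lo += c; hi += 1 - c;
+    }
+    memcpy(x + lo, swap, hi * sizeof(int));  // simplification; see above
+    return lo;
+}
+```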
[This paper](https://arxiv.org/abs/2106.05123) gives a good description of [pdqsort](https://github.com/orlp/pdqsort). I'd start with the [Rust version](https://github.com/rust-lang/rust/blob/master/library/core/src/slice/sort.rs), which has some advantages but can still be improved further. The subsections below describe improved [partitioning](#partitioning) and an [initial pass](#initial-pass) with several benefits. I also found that the pivot randomization methods currently used are less effective because they swap elements that won't become pivots soon; the pivot candidates and randomization targets need to be chosen to overlap. The optimistic insertion sort can also be improved: when a pair of elements is swapped the smaller one should be inserted as usual but the larger one can also be pushed forward at little cost, potentially saving many swaps and handling too-large elements as gracefully as too-small ones.
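+
+That last suggestion might look like this sketch (my reading of the idea, not code from pdqsort or any fork; a real optimistic version would also cap the number of moves and bail out):
+
+```c
+#include <stddef.h>
+
+// Insertion sort where a displaced large element is also walked forward:
+// the smaller of a swapped pair sifts backward as usual, and the larger
+// one is pushed ahead past any smaller elements that follow, so a lone
+// too-large element costs one forward walk instead of a swap at every
+// later position.
+void insertion_2way(int *x, size_t n) {
+    for (size_t i = 1; i < n; i++) {
+        if (x[i] >= x[i-1]) continue;
+        int small = x[i], big = x[i-1];
+        size_t j = i - 1;                 // sift the smaller element back
+        while (j > 0 && x[j-1] > small) { x[j] = x[j-1]; j--; }
+        x[j] = small;
+        size_t k = i;                     // push the larger one forward
+        while (k + 1 < n && x[k+1] < big) { x[k] = x[k+1]; k++; }
+        x[k] = big;
+        if (k > i) i--;  // the element shifted into x[i] still needs insertion
+    }
+}
+```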