| author | Marshall Lochbaum <mwlochbaum@gmail.com> | 2022-02-27 17:25:57 -0500 |
|---|---|---|
| committer | Marshall Lochbaum <mwlochbaum@gmail.com> | 2022-02-27 17:25:57 -0500 |
| commit | 08a21b1512fd1e392695673059904eca8c6d099f | |
| tree | 8fba06b1e57fd9c58dc2a69125d3b7668cac9cf2 | /implementation/primitive |
| parent | 33a5fad736daee03ca45281a438fc270279c056e | |
Radix sort is stable, oops
Diffstat (limited to 'implementation/primitive')
| -rw-r--r-- | implementation/primitive/sort.md | 6 |
1 file changed, 3 insertions, 3 deletions
```diff
diff --git a/implementation/primitive/sort.md b/implementation/primitive/sort.md
index 713b4266..f7273ffc 100644
--- a/implementation/primitive/sort.md
+++ b/implementation/primitive/sort.md
@@ -36,7 +36,7 @@ For **Sort**:
 - Branchless [quicksorts](#quicksort) are the solid choice for larger types, particularly since they can track ranges and call counting and other distribution sorts when appropriate.
 - But for 2- and 4-byte data, [radix sort](#radix-sort) can be a lot faster? For 2-byte sort, I think it's a better bridge than fluxsort between insertion and counting sort (but scan for sortedness first); for 4-byte, hard to say.
 
-**Grade** is basically the same (now that fluxsort gives us a good stable quicksort), except you can't use radix sort and have to replace counting sort with the slower bucket sort.
+**Grade** is basically the same (now that fluxsort gives us a good stable quicksort), except moves get more expensive relative to comparisons. Counting sort needs to be switched to the much slower bucket sort.
 
 A branchless binary search is adequate for **Bins** but in many cases—very small or large `𝕨`, and small range—there are better methods.
@@ -50,9 +50,9 @@ I developed [Robin Hood Sort](https://github.com/mlochbaum/rhsort) as an algorit
 
 #### Radix sort
 
-LSD radix sort is really fast, like three times faster than fluxsort on random 4-byte data. The idea is: bucket sort according to the last byte, then the second-to-last, on up to the first byte. Array is now sorted, after most likely having been scrambled substantially (definitely not stable). It's tricky to implement right though. The `sort_inline` functions from `ska_sort_copy` [here](https://github.com/skarupke/ska_sort) are good. They count buckets for every step in one pass, and move back and forth from the array to a buffer instead of adding more memory. Radix sort uses memory proportional to the input array length, plus a constant. But that constant is a liability on short arrays, so it's only really useful for sizes above a few hundred.
+LSD radix sort is really fast, like three times faster than fluxsort on random 4-byte data. The idea is: bucket sort according to the last byte, then the second-to-last, on up to the first byte. Array is now sorted, after most likely having been scrambled substantially (but stably!). It's tricky to implement right though. The `sort_inline` functions from `ska_sort_copy` [here](https://github.com/skarupke/ska_sort) are good. They count buckets for every step in one pass, and move back and forth from the array to a buffer instead of adding more memory. Radix sort uses memory proportional to the input array length, plus a constant. But that constant is a liability on short arrays, so it's only really useful for sizes above a few hundred.
 
-LSD radix sort suffers from problems of cache associativity. Now, usually (for, say, non-blocked transpose) such problems strike only at power of 2 lengths. But by picking out input bytes, radix sort tends to create its own powers of 2. Consider an input consisting of ascending natural numbers `↕n`. Lowest byte is fine: the lengths are around `n÷256`. Next byte up, problems: this byte only changes once every 256 inputs, so every bucket but one has a multiple of 256 length! And writes will cycle around these buckets, so they stay roughly in sync. This is enough to overwhelm any [set-associative](https://en.wikipedia.org/wiki/Cache_associativity#Set-associative_cache) cache. I measured a degradation of about 5 times on that pass and 3 times overall. The case with bucket lengths near multiples of 256—they need to be separated by an entire cache line not to conflict—is detectable after the cheap counting pass, but it's not the only way this pattern can arise. For example, put a bunch of zeros at the beginning of the array. The first bucket now has some arbitrary length, but once the zeros are processed the gap between it and the next is back to being a multiple of 256. The good news is that it still requires a lot of space to start kicking out a bunch of cache lines: below 10,000 4-byte elements I could never measure significant degradation. So if the instability and lack of adaptivity (and O(n) memory of course) doesn't bother you, radix sort is kind of the best thing going for 4-byte values in the 500 to 20,000 range.
+LSD radix sort suffers from problems of cache associativity. Now, usually (for, say, non-blocked transpose) such problems strike only at power of 2 lengths. But by picking out input bytes, radix sort tends to create its own powers of 2. Consider an input consisting of ascending natural numbers `↕n`. Lowest byte is fine: the lengths are around `n÷256`. Next byte up, problems: this byte only changes once every 256 inputs, so every bucket but one has a multiple of 256 length! And writes will cycle around these buckets, so they stay roughly in sync. This is enough to overwhelm any [set-associative](https://en.wikipedia.org/wiki/Cache_associativity#Set-associative_cache) cache. I measured a degradation of about 5 times on that pass and 3 times overall. The case with bucket lengths near multiples of 256—they need to be separated by an entire cache line not to conflict—is detectable after the cheap counting pass, but it's not the only way this pattern can arise. For example, put a bunch of zeros at the beginning of the array. The first bucket now has some arbitrary length, but once the zeros are processed the gap between it and the next is back to being a multiple of 256. The good news is that it still requires a lot of space to start kicking out a bunch of cache lines: below 10,000 4-byte elements I could never measure significant degradation. So if the lack of adaptivity (and O(n) memory of course) doesn't bother you, radix sort is kind of the best thing going for 4-byte values in the 500 to 20,000 range.
 
 ### Quicksort
```
