| author    | Marshall Lochbaum <mwlochbaum@gmail.com> | 2022-03-09 15:23:23 -0500 |
|-----------|------------------------------------------|---------------------------|
| committer | Marshall Lochbaum <mwlochbaum@gmail.com> | 2022-03-09 15:40:04 -0500 |
| commit    | 0539dbf1c8ed11e32f2a111c5d6da928c0b61f9f |                           |
| tree      | 3a489d0e13893e05bda892b004f67ebf010d41dd | /implementation/primitive/sort.md |
| parent    | 43462da01c44e0f5ab1a292c8ac0237480e287db |                           |
Candidate selection and sampling-based heuristics
Diffstat (limited to 'implementation/primitive/sort.md')
| Mode | File | Lines changed |
|---|---|---|
| -rw-r--r-- | implementation/primitive/sort.md | 28 |

1 file changed, 22 insertions, 6 deletions
diff --git a/implementation/primitive/sort.md b/implementation/primitive/sort.md
index f7273ffc..27f98571 100644
--- a/implementation/primitive/sort.md
+++ b/implementation/primitive/sort.md
@@ -36,7 +36,7 @@ For **Sort**:
 - Branchless [quicksorts](#quicksort) are the solid choice for larger types, particularly since they can track ranges and call counting and other distribution sorts when appropriate.
 - But for 2- and 4-byte data, [radix sort](#radix-sort) can be a lot faster? For 2-byte sort, I think it's a better bridge than fluxsort between insertion and counting sort (but scan for sortedness first); for 4-byte, hard to say.

-**Grade** is basically the same (now that fluxsort gives us a good stable quicksort), excepth moves get more expensive relative to comparisons. Counting sort needs to be switch to the much slower bucket sort.
+**Grade** is basically the same (now that fluxsort gives us a good stable quicksort), except moves get more expensive relative to comparisons. Counting sort needs to be switched to the much slower bucket sort.

 A branchless binary search is adequate for **Bins** but in many cases—very small or large `𝕨`, and small range—there are better methods.
@@ -46,21 +46,23 @@ Both counting and bucket sort are small-range algorithms that begin by counting
 Bucket sort can be used for Grade or sort-by (`⍋⊸⊏`), but counting sort only works for sorting itself. It's not-even-unstable: there's no connection between result values and the input values except that they are constructed to be equal. But with [fast Indices](replicate.md#non-booleans-to-indices), counting sort is vastly more powerful, and is effective with a range four to eight times the argument length. This is large enough that it might pose a memory usage problem, but the memory use can be made arbitrarily low by partitioning.

-I developed [Robin Hood Sort](https://github.com/mlochbaum/rhsort) as an algorithm with similar properties to bucket sort that relies on uniformly-distributed data rather than a small range. It uses a buffer a few times larger than the input array, and inserts values in a manner similar to a hash table with linear probing, shifting large clumps out if they appear—they're merge-sorted back in at the end. Like counting sort, the substantial memory use can be cut down by partitioning. And it should be cheap to detect probable uniformity during median selection, making this a good fit for quicksorts. Emphasis on probable: it's still very important that RHsort has decent worst-case performance.
+I developed [Robin Hood Sort](https://github.com/mlochbaum/rhsort) as an algorithm with similar properties to bucket sort that relies on uniformly-distributed data rather than a small range. It uses a buffer a few times larger than the input array, and inserts values in a manner similar to a hash table with linear probing, shifting large clumps out if they appear—they're merge-sorted back in at the end. Like counting sort, the substantial memory use can be cut down by partitioning. And a random selection of `√n` samples is enough to make a good decision about whether to use it (see [candidate selection](#candidate-selection)), which is a good fit for quicksorts. Of course this can only be probabilistic, so it's still important that RHsort has decent worst-case performance. When quadsort is used for merging, the worst case appears to be about half as fast as fluxsort, very solid.

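To make the counting sort in the hunk above concrete, here is a minimal C sketch of the small-range case on plain `int32_t` data. It's an illustration of the idea only, not CBQN's implementation, and it leaves out the fast-Indices expansion and the partitioning used to bound memory.

```c
#include <stdint.h>
#include <stdlib.h>

// Sort 32-bit values known to lie in [min, min+range), with range small
// enough that a count table is cheap. Counting sort never moves the inputs:
// it only counts them and writes out fresh equal values, which is why it
// works for sorting itself but not for Grade.
static void counting_sort_i32(int32_t *x, size_t n, int32_t min, size_t range) {
  size_t *count = calloc(range, sizeof(size_t));   // error handling omitted
  for (size_t i = 0; i < n; i++) count[x[i] - min]++;    // histogram pass
  size_t j = 0;
  for (size_t v = 0; v < range; v++)                     // expand counts back into x
    for (size_t c = count[v]; c--; ) x[j++] = min + (int32_t)v;
  free(count);
}
```

Since the output is rebuilt from the counts alone, none of this carries over to Grade or sort-by; those need bucket sort, which moves the original elements.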
 #### Radix sort

-LSD radix sort is really fast, like three times faster than fluxsort on random 4-byte data. The idea is: bucket sort according to the last byte, then the second-to-last, on up to the first byte. Array is now sorted, after most likely having been scrambled substantially (but stably!). It's tricky to implement right though. The `sort_inline` functions from `ska_sort_copy` [here](https://github.com/skarupke/ska_sort) are good. They count buckets for every step in one pass, and move back and forth from the array to a buffer instead of adding more memory. Radix sort uses memory proportional to the input array length, plus a constant. But that constant is a liability on short arrays, so it's only really useful for sizes above a few hundred.
+LSD radix sort is really fast, like three times faster than fluxsort on random 4-byte data. The idea is: bucket sort according to the last byte, then the second-to-last, on up to the first byte. Array is now sorted, after most likely having been scrambled substantially (but stably!). It's tricky to implement right though. The `sort_inline` functions from `ska_sort_copy` [here](https://github.com/skarupke/ska_sort) are good. They count buckets for every step in one pass, and move back and forth from the array to a buffer instead of adding more memory. Radix sort uses memory proportional to the input array length, plus a constant. But that constant is a liability on short arrays, so it's only really useful for sizes above a hundred or so (to get down to this limit, use 1-byte counts and sum with SIMD or at least SWAR).

 LSD radix sort suffers from problems of cache associativity. Now, usually (for, say, non-blocked transpose) such problems strike only at power of 2 lengths. But by picking out input bytes, radix sort tends to create its own powers of 2. Consider an input consisting of ascending natural numbers `↕n`. Lowest byte is fine: the lengths are around `n÷256`. Next byte up, problems: this byte only changes once every 256 inputs, so every bucket but one has a multiple of 256 length! And writes will cycle around these buckets, so they stay roughly in sync. This is enough to overwhelm any [set-associative](https://en.wikipedia.org/wiki/Cache_associativity#Set-associative_cache) cache. I measured a degradation of about 5 times on that pass and 3 times overall.

 The case with bucket lengths near multiples of 256—they need to be separated by an entire cache line not to conflict—is detectable after the cheap counting pass, but it's not the only way this pattern can arise. For example, put a bunch of zeros at the beginning of the array. The first bucket now has some arbitrary length, but once the zeros are processed the gap between it and the next is back to being a multiple of 256. The good news is that it still requires a lot of space to start kicking out a bunch of cache lines: below 10,000 4-byte elements I could never measure significant degradation. So if the lack of adaptivity (and O(n) memory of course) doesn't bother you, radix sort is kind of the best thing going for 4-byte values in the 500 to 20,000 range.

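A bare-bones C sketch of the LSD scheme described above, with all four byte histograms gathered in a single counting pass and data ping-ponging between the array and one buffer. It handles unsigned 32-bit keys only and omits the small-size and cache-aliasing concerns discussed in this hunk; it is not the `ska_sort_copy` code.

```c
#include <stdint.h>
#include <stdlib.h>

// LSD radix sort on unsigned 32-bit keys: one counting pass builds all four
// byte histograms, then four stable bucket passes alternate x -> buf -> x -> buf -> x.
static void radix_sort_u32(uint32_t *x, size_t n) {
  uint32_t *buf = malloc(n * sizeof(uint32_t));    // error handling omitted
  size_t count[4][256] = {0};
  for (size_t i = 0; i < n; i++) {                 // single counting pass
    uint32_t v = x[i];
    count[0][ v        & 255]++;
    count[1][(v >>  8) & 255]++;
    count[2][(v >> 16) & 255]++;
    count[3][(v >> 24) & 255]++;
  }
  uint32_t *src = x, *dst = buf;
  for (int b = 0; b < 4; b++) {                    // one pass per byte, low to high
    size_t off[256], sum = 0;
    for (int v = 0; v < 256; v++) { off[v] = sum; sum += count[b][v]; }  // exclusive prefix sum
    for (size_t i = 0; i < n; i++) {
      uint32_t k = src[i];
      dst[off[(k >> (8*b)) & 255]++] = k;          // stable scatter into buckets
    }
    uint32_t *t = src; src = dst; dst = t;         // swap roles; after 4 passes data is back in x
  }
  free(buf);
}
```

Signed or floating-point keys would first be mapped to an order-preserving unsigned encoding (flip the sign bit, or more for floats), which is left out here.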
 ### Quicksort

-[Fluxsort](https://github.com/scandum/fluxsort) attains high performance with a branchless stable partition that places one half on top of existing data and the other half somewhere else. One half ends up in the appropriate place in the sorted array. The other is in swap memory, and will be shifted back by subsequent partitions and base-case sorting. Aside from the partitioning strategy, Fluxsort makes a number of other decisions differently from pdqsort, including a fairly complicated merge sort ([Quadsort](https://github.com/scandum/quadsort)) as the base case. I haven't looked into Quadsort, but did discuss other features with the author in [this issue](https://github.com/scandum/fluxsort/issues/1). Pivot selection is an important one—it seems pdqsort uses far fewer pivots than it should. Picking out a larger sample of pivots also opens up the opportunity of performing statistics on them, or checking for a run while the cache line's hot.
+[Fluxsort](https://github.com/scandum/fluxsort) attains high performance with a branchless stable partition that places one half on top of existing data and the other half somewhere else. One half ends up in the appropriate place in the sorted array. The other is in swap memory, and will be shifted back by subsequent partitions and base-case sorting. Aside from the partitioning strategy, fluxsort makes a number of other decisions differently from pdqsort, including a fairly complicated merge sort ([Quadsort](https://github.com/scandum/quadsort)) as the base case.
+
+[Crumsort](https://github.com/scandum/crumsort) is an in-place adaptation of fluxsort, which uses a (constant) small amount of external memory to take an approach to partitioning that's slightly more ordered than Hoare partitioning. It's more complicated than fluxsort but performs similarly at small sizes and faster at large ones.

 [This paper](https://arxiv.org/abs/2106.05123) gives a good description of [pdqsort](https://github.com/orlp/pdqsort). I'd start with the [Rust version](https://github.com/rust-lang/rust/blob/master/library/core/src/slice/sort.rs), which has some advantages but can still be improved further. The subsections below describe improved [partitioning](#partitioning) and an [initial pass](#initial-pass) with several benefits. I also found that the pivot randomization methods currently used are less effective because they swap elements that won't become pivots soon; the pivot candidates and randomization targets need to be chosen to overlap. The optimistic insertion sort can also be improved: when a pair of elements is swapped the smaller one should be inserted as usual but the larger one can also be pushed forward at little cost, potentially saving many swaps and handling too-large elements as gracefully as too-small ones.

-While the stable partitioning for Fluxsort seems to be an overall better choice, pdqsort's unstable partitioning is what I've worked with in the past. The following sections are written from the perspective of pdqsort and will be rewritten for Fluxsort as the methods are adapted.
+Most likely fluxsort/crumsort partitioning makes pdqsort's partitioning obsolete for sorting. Because it's easily invertible if comparison results are saved (it's a self-inverse), it's still useful for the partitioning approach to binary search mentioned later.

 #### Partitioning

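The stable branchless partition can be sketched as follows, in the spirit of fluxsort's strategy as described above but not its actual code: every element is written to both destinations each iteration, and the comparison only decides which store index advances, so there is no branch to mispredict.

```c
#include <stdint.h>
#include <stddef.h>

// Stable partition of x[0..n) by "v <= pivot": the low half is packed at the
// start of x itself, the high half goes to swap[]. Returns the low count.
// Since lo <= i always holds, the store to x[lo] never clobbers an unread element.
static size_t flux_style_partition(int32_t *x, int32_t *swap, size_t n, int32_t pivot) {
  size_t lo = 0, hi = 0;
  for (size_t i = 0; i < n; i++) {
    int32_t v = x[i];
    int c = v <= pivot;        // 1 if v belongs to the low half
    x[lo]    = v;              // unconditional store to the low side
    swap[hi] = v;              // unconditional store to the high side
    lo += c;
    hi += 1 - c;
  }
  // The caller recurses on x[0..lo) in place and on swap[0..hi), which gets
  // copied back behind the low half as it is sorted.
  return lo;
}
```

One half ends up in its final region of `x` and the other sits in swap memory, matching the description above; both halves keep their original relative order, so the partition is stable.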
@@ -70,6 +72,10 @@ It's simplest to define an operation `P` that partitions a list `𝕩` according
 For binary search `𝕨⍋𝕩`, partitioning allows one pivot element `t` from `𝕨` to be compared to all of `𝕩` at once, instead of the normal strategy of working with one element from `𝕩` at a time. `𝕩` is partitioned according to `t≤𝕩`, then result values are found by searching the first half of `𝕨` for the smaller elements and the second half for the larger ones, and then they are put back in the correct positions by reversing the partitioning. Because Hoare partitioning works by swapping independent pairs of elements, `P` is a self inverse, identical to `P⁼`. So the last step is simple, provided the partitioning information `t≤𝕩` is saved.

+### Scans and heuristics
+
+Because quicksort does its work before recursing, it's well suited to statistical techniques that allow the algorithm to be changed.
+
 #### Initial pass

 An initial pass for pdqsort (or another in-place quicksort) provides a few advantages:
@@ -85,7 +91,17 @@ Finding an initial run is fast as well. Compare the first two elements to determ
 At the start of sorting, swap the smallest element to the beginning and the largest to the end, and shrink the size of the array by one in each direction. Now the element before the array is a lower bound and the one after is an upper bound. This property can also be maintained as the array is partitioned, by placing a pivot element between the two halves (swap it to one side of the array before partitioning and to the middle afterwards). As a result, it's always safe to use unguarded insertion sort, and an upper bound for the range of the array can always be found using the difference between the elements before and after it. Now finding the range is fast enough to check for counting sort at every recursion.

-This is a very simple initial pass; a more sophisticated one might be beneficial. If the array starts with a large run then there could be more of them. There may also be sampling-based tests to find when merge sort is better, even if the runs aren't perfect (but is this actually common?).
+This is a very simple initial pass; a more sophisticated one might be beneficial. If the array starts with a large run then there could be more of them.
+
+#### Candidate selection
+
+Traditionally, performance-oriented quicksort algorithms have chosen pivots using a small number of candidates, such as a median of 3 or pseudomedian of 9. Scandum's work with fluxsort showed that this is leaving performance on the table, because it results in less even partitions. My own [analysis](https://github.com/scandum/fluxsort/issues/1#issuecomment-890540150) found an optimal pivot number approximately equal to `√n÷1+4⋆⁼n`, which also seems to do well empirically. Both costs and benefits here are proportional to `√n` before an O(n) partition, so there's significant room for variation in this number. The cost of increasing the number of pivots further can also be decreased by maintaining candidates as the array is partitioned.
+
+Candidates should be chosen in a pseudo-random or at least not significantly patterned distribution to avoid getting a useless biased median. Randomness is also important for heuristics that select between different algorithms. These will be faster for sorting some patterns or distributions and slower for others. In order to avoid slow cases, the requirement should be that any particular slow case has a low probability of passing the heuristic. This approach doesn't require a guess at what properties the input is likely to have, while an attempt to minimize expected time or some other average metric would.
+
+Robin Hood sorting has a sharp threshold at about `√n` where it's feasible to distinguish good and bad cases, that is, reject arrays with many collisions and accept uniformly random ones. This requires the candidate sampling method to be able to pick candidates close to each other, or it'll fail to notice things like an input that consists of many short blocks of similar values.
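As a rough illustration of these two ideas, here is how the candidate count `√n÷1+4⋆⁼n` and a sample-based Robin Hood check might look in C. The collision test is a guess at one plausible shape for such a heuristic, not rhsort's or CBQN's logic, and the thresholds are placeholders.

```c
#include <math.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>

// Number of pivot candidates ~ sqrt(n) / (1 + log4(n)), the optimum found in
// the fluxsort discussion linked above. The exact constant matters little,
// since both cost and benefit scale with sqrt(n) ahead of an O(n) partition.
static size_t num_candidates(size_t n) {
  double d = sqrt((double)n) / (1.0 + log2((double)n) / 2.0);  // log4(n) = log2(n)/2
  return d < 3 ? 3 : (size_t)d;
}

// Heuristic check for Robin Hood sort: take ~sqrt(n) random samples, map each
// to one of n buckets assuming a uniform distribution over [min,max] (the
// actual array minimum and maximum), and count collisions. Uniform data gives
// O(1) expected collisions among sqrt(n) samples (birthday bound), so more
// than a few collisions indicates clumpy data and a likely slow case.
static bool rh_looks_uniform(const int32_t *x, size_t n, int32_t min, int32_t max) {
  size_t m = (size_t)sqrt((double)n);
  if (m < 16 || max <= min) return false;
  double scale = (double)n / ((double)max - (double)min + 1);
  uint8_t *seen = calloc(n, 1);        // occupancy table; a reused scratch bitset in real code
  size_t collisions = 0;
  for (size_t i = 0; i < m; i++) {
    size_t j = (size_t)rand() % n;     // random sample position (use a better RNG in practice)
    size_t b = (size_t)(((double)x[j] - min) * scale);
    collisions += seen[b];
    seen[b] = 1;
  }
  free(seen);
  return collisions <= 2 + m / 64;     // tunable threshold
}
```

The requirement is only that any particular bad input has a low probability of slipping through, which is why the sample positions must be random rather than evenly spaced.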
+
+Inspecting the pivot candidates before or while sorting could also be used to choose a merge-based recursive step over a partition-based one. That is, rather than partitioning then recursing on each half, the algorithm would recurse on each half and then merge them together. If the halves don't overlap by much, the merge can skip a lot of elements with some binary searches and potentially be much faster than partitioning, which at a minimum needs to compare every element to the pivot. This allows the overall algorithm to exploit large-scale structure for merging even if there's a lot of noise at the small scale that would make merge sorting a bad choice overall. Unlike partitioning, it doesn't reduce the range of the halves.

 ### Other sorting algorithms
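A sketch of the merge half of that idea: assuming the two halves have already been sorted by the recursive calls, binary searches bound the overlapping region so that only it is merged through a scratch buffer. The function names and scratch-buffer convention are illustrative, not taken from an existing implementation.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

// First index in sorted x[0..n) whose value is >= v (plain binary search).
static size_t lower_bound(const int32_t *x, size_t n, int32_t v) {
  size_t lo = 0, hi = n;
  while (lo < hi) { size_t m = lo + (hi - lo) / 2; if (x[m] < v) lo = m + 1; else hi = m; }
  return lo;
}

// Merge step for the merge-based recursion described above: x[0..h) and
// x[h..n) are each sorted. Binary searches find the region where they
// overlap; only that region is merged (via scratch, which needs room for h
// elements). If the halves barely overlap this touches far fewer elements
// than a partitioning pass, which must compare every element to the pivot.
static void merge_sorted_halves(int32_t *x, size_t n, size_t h, int32_t *scratch) {
  if (h == 0 || h == n || x[h - 1] <= x[h]) return;  // halves already in order
  size_t s = lower_bound(x, h, x[h]);                // left prefix below the right half stays put
  size_t e = lower_bound(x + h, n - h, x[h - 1]);    // right elements below the left max must move
  size_t ln = h - s;
  memcpy(scratch, x + s, ln * sizeof(int32_t));      // move the left overlap out of the way
  size_t i = 0, j = 0, o = s;
  while (i < ln && j < e)                            // stable merge of the two overlapping runs
    x[o++] = (scratch[i] <= x[h + j]) ? scratch[i++] : x[h + j++];
  while (i < ln) x[o++] = scratch[i++];              // rest of left; rest of right is already in place
}
```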

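Finally, a small sketch of the unguarded insertion sort that the sentinel trick in the initial-pass hunk enables: because a value no larger than any element has been placed just before the array, the inner loop needs no bounds check. This is illustrative only, and leaves out the "push the larger element forward" refinement mentioned for the optimistic pdqsort variant.

```c
#include <stdint.h>
#include <stddef.h>

// Precondition: x[-1] exists and is <= every element of x[0..n), e.g. because
// the array minimum was swapped there by the initial pass. The sentinel stops
// the inner loop, so no "j > 0" check is needed.
static void insertion_sort_unguarded(int32_t *x, size_t n) {
  for (size_t i = 1; i < n; i++) {
    int32_t v = x[i];
    size_t j = i;
    while (x[j - 1] > v) {     // terminates at the sentinel below x[0] at the latest
      x[j] = x[j - 1];
      j--;
    }
    x[j] = v;
  }
}
```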