docs/implementation/primitive/sort.html


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74

<head>
  <link href="../../favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <link href="../../style.css" rel="stylesheet"/>
  <title>BQN: Implementation of ordering functions</title>
</head>
<div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a> / <a href="index.html">primitive</a></div>
<h1 id="implementation-of-ordering-functions"><a class="header" href="#implementation-of-ordering-functions">Implementation of ordering functions</a></h1>
<p>The <a href="../../doc/order.html">ordering functions</a> are Sort (<code><span class='Function'>∧∨</span></code>), Grade (<code><span class='Function'>⍋⍒</span></code>), and Bins (<code><span class='Function'>⍋⍒</span></code>). Although these are well-studied—particularly sorting, and then binary search or &quot;predecessor search&quot;—there are many recent developments, as well as techniques that I have not found in the literature. The three functions are closely related but have important differences in what algorithms are viable. Sorting is a remarkably deep problem with different algorithms able to do a wide range of amazing things, and sophisticated ways to combine those. It is by no means solved. In comparison, Bins is pretty tame.</p>
<p>There's a large divide between ordering compound data and simple data. For compound data comparisons are expensive, and the best algorithm will generally be the one that uses the fewest comparisons. For simple data they fall somewhere between cheap and extremely cheap, and fancy branchless and vectorized algorithms are the best.</p>
<h2 id="on-quicksort-versus-merge-sort"><a class="header" href="#on-quicksort-versus-merge-sort">On quicksort versus merge sort</a></h2>
<p>Merge sort is better. It is deterministic, stable, and has optimal worst-case performance. Its pattern handling is better: while merge sort handles &quot;horizontal&quot; patterns and quicksort does &quot;vertical&quot; ones, merge sort gets useful work out of <em>any</em> sequence of runs but in-place quicksort will quickly mangle its analogue until it may as well be random.</p>
<p>But that doesn't mean merge sort is always faster. Quicksort seems to work a little better branchlessly. For sorting, quicksort's partitioning can reduce the range of the data enough to use an extremely quick counting sort. Partitioning is also a natural fit for binary search, where it's mandatory for sensible cache behavior with large enough arguments. So it can be useful. But it doesn't merge, and can't easily be made to merge, and that's a shame.</p>
<p>The same applies to the general categories of partitioning sorts (quicksort, radix sort, samplesort) and merging sorts (mergesort, timsort, multimerges). Radix sort is a weird one: very fast on lots of inputs, but not adaptive at all, and in fact actively un-adaptive on common patterns from cache associativity. Very hard to know when it's a good choice.</p>
<h2 id="on-binary-search"><a class="header" href="#on-binary-search">On binary search</a></h2>
<p>Binary searches are very easy to get wrong. Do not write <code><span class='Paren'>(</span><span class='Value'>hi</span><span class='Function'>+</span><span class='Value'>lo</span><span class='Paren'>)</span><span class='Function'>/</span><span class='Number'>2</span></code>: it's not safe from overflows. I always follow the pattern given in the first code block <a href="https://pvk.ca/Blog/2015/11/29/retrospective-on-binary-search-and-on-compression-slash-compilation/">here</a>. This code will never access the value <code><span class='Value'>*base</span></code>, so it should be considered a search on the <code><span class='Value'>n</span><span class='Function'>-</span><span class='Number'>1</span></code> values beginning at <code><span class='Value'>base</span><span class='Function'>+</span><span class='Number'>1</span></code> (the perfect case is when the number of values is one less than a power of two, which is in fact how it has to go). It's branchless and always takes the same number of iterations. To get a version that stops when the answer is known, subtract <code><span class='Value'>n%</span><span class='Number'>2</span></code> from <code><span class='Value'>n</span></code> in the case that <code><span class='Value'>*mid</span> <span class='Function'>&lt;</span> <span class='Value'>x</span></code>.</p>
<h2 id="compound-data"><a class="header" href="#compound-data">Compound data</a></h2>
<p>Array comparisons are expensive. The goal here is almost entirely to minimize the number of comparisons. Which is a much less complex goal than to get the most out of modern hardware, so the algorithms here are simpler.</p>
<p>For <strong>Sort</strong> and <strong>Grade</strong>, use Timsort. It's time-tested and shows no signs of weakness (but do be sure to pick up a fix for the bug discovered in 2015 in formal verification). Hardly different from optimal comparison numbers on random data, and outstanding pattern handling. Grade can be done either by selecting from the original array to order indices or by moving the data around in the same order as the indices. I think the second of these ends up being substantially better for small-ish elements.</p>
<p>For <strong>Bins</strong>, use a branching binary search: see <a href="#on-binary-search">On binary search</a> above. But there are also interesting (although, I expect, rare) cases where only one argument is compound. Elements of this argument should be reduced to fit the type of the other argument, then compared to multiple elements. For the right argument, this just means reducing before doing whatever binary search is appropriate to the left argument. If the left argument is compound, its elements should be used as partitions. Then switch back to binary search only when the partitions get very small—probably one element.</p>
<h2 id="simple-data"><a class="header" href="#simple-data">Simple data</a></h2>
<p>The name of the game here is &quot;branchless&quot;.</p>
<p>For <strong>Sort</strong>:</p>
<ul>
<li>Insertion sort is best for small data (<em>maybe</em> <a href="https://github.com/scandum/quadsort">quadsort</a>, but I worry about the cache footprint if you just sort a small array somewhere in a long loop).</li>
<li>Then <a href="#distribution-sorts">counting sort</a> for 1-byte types (and obviously for 1-bit regardless of length).</li>
<li>Branchless <a href="#quicksort">quicksorts</a> are the solid choice for larger types, particularly since they can track ranges and call counting and other distribution sorts when appropriate.</li>
<li>But for 2- and 4-byte data, <a href="#radix-sort">radix sort</a> can be a lot faster? For 2-byte sort, I think it's a better bridge than fluxsort between insertion and counting sort (but scan for sortedness first); for 4-byte, hard to say.</li>
</ul>
<p><strong>Grade</strong> is basically the same (now that fluxsort gives us a good stable quicksort), except you can't use radix sort and have to replace counting sort with the slower bucket sort.</p>
<p>A branchless binary search is adequate for <strong>Bins</strong> but in many cases—very small or large <code><span class='Value'>𝕨</span></code>, and small range—there are better methods.</p>
<h3 id="distribution-sorts"><a class="header" href="#distribution-sorts">Distribution sorts</a></h3>
<p>Both counting and bucket sort are small-range algorithms that begin by counting the number of each possible value. Bucket sort, as used here, means that the counts are then used to place values in the appropriate position in the result in another pass. Counting sort does not read from the initial values again and instead reconstructs them from the counts. It might be written <code><span class='Paren'>(</span><span class='Function'>//</span><span class='Modifier'>⁼</span><span class='Paren'>)</span><span class='Modifier2'>⌾</span><span class='Paren'>(</span><span class='Function'>-</span><span class='Modifier2'>⟜</span><span class='Value'>min</span><span class='Paren'>)</span></code> in BQN, relying on the extension of <code><span class='Function'>/</span><span class='Modifier'>⁼</span></code> to unsorted arguments.</p>
<p>Bucket sort can be used for Grade or sort-by (<code><span class='Function'>⍋</span><span class='Modifier2'>⊸</span><span class='Function'>⊏</span></code>), but counting sort only works for sorting itself. It's not-even-unstable: there's no connection between result values and the input values except that they are constructed to be equal. But with <a href="replicate.html#non-booleans-to-indices">fast Indices</a>, counting sort is vastly more powerful, and is effective with a range four to eight times the argument length. This is large enough that it might pose a memory usage problem, but the memory use can be made arbitrarily low by partitioning.</p>
<p>I developed <a href="https://github.com/mlochbaum/rhsort">Robin Hood Sort</a> as an algorithm with similar properties to bucket sort that relies on uniformly-distributed data rather than a small range. It uses a buffer a few times larger than the input array, and inserts values in a manner similar to a hash table with linear probing, shifting large clumps out if they appear—they're merge-sorted back in at the end. Like counting sort, the substantial memory use can be cut down by partitioning. And it should be cheap to detect probable uniformity during median selection, making this a good fit for quicksorts. Emphasis on probable: it's still very important that RHsort has decent worst-case performance.</p>
<h4 id="radix-sort"><a class="header" href="#radix-sort">Radix sort</a></h4>
<p>LSD radix sort is really fast, like three times faster than fluxsort on random 4-byte data. The idea is: bucket sort according to the last byte, then the second-to-last, on up to the first byte. Array is now sorted, after most likely having been scrambled substantially (definitely not stable). It's tricky to implement right though. The <code><span class='Value'>sort_inline</span></code> functions from <code><span class='Value'>ska_sort_copy</span></code> <a href="https://github.com/skarupke/ska_sort">here</a> are good. They count buckets for every step in one pass, and move back and forth from the array to a buffer instead of adding more memory. Radix sort uses memory proportional to the input array length, plus a constant. But that constant is a liability on short arrays, so it's only really useful for sizes above a few hundred.</p>
<p>LSD radix sort suffers from problems of cache associativity. Now, usually (for, say, non-blocked transpose) such problems strike only at power of 2 lengths. But by picking out input bytes, radix sort tends to create its own powers of 2. Consider an input consisting of ascending natural numbers <code><span class='Function'>↕</span><span class='Value'>n</span></code>. Lowest byte is fine: the lengths are around <code><span class='Value'>n</span><span class='Function'>÷</span><span class='Number'>256</span></code>. Next byte up, problems: this byte only changes once every 256 inputs, so every bucket but one has a multiple of 256 length! And writes will cycle around these buckets, so they stay roughly in sync. This is enough to overwhelm any <a href="https://en.wikipedia.org/wiki/Cache_associativity#Set-associative_cache">set-associative</a> cache. I measured a degradation of about 5 times on that pass and 3 times overall. The case with bucket lengths near multiples of 256—they need to be separated by an entire cache line not to conflict—is detectable after the cheap counting pass, but it's not the only way this pattern can arise. For example, put a bunch of zeros at the beginning of the array. The first bucket now has some arbitrary length, but once the zeros are processed the gap between it and the next is back to being a multiple of 256. The good news is that it still requires a lot of space to start kicking out a bunch of cache lines: below 10,000 4-byte elements I could never measure significant degradation. So if the instability and lack of adaptivity (and O(n) memory of course) doesn't bother you, radix sort is kind of the best thing going for 4-byte values in the 500 to 20,000 range.</p>
<h3 id="quicksort"><a class="header" href="#quicksort">Quicksort</a></h3>
<p><a href="https://github.com/scandum/fluxsort">Fluxsort</a> attains high performance with a branchless stable partition that places one half on top of existing data and the other half somewhere else. One half ends up in the appropriate place in the sorted array. The other is in swap memory, and will be shifted back by subsequent partitions and base-case sorting. Aside from the partitioning strategy, Fluxsort makes a number of other decisions differently from pdqsort, including a fairly complicated merge sort (<a href="https://github.com/scandum/quadsort">Quadsort</a>) as the base case. I haven't looked into Quadsort, but did discuss other features with the author in <a href="https://github.com/scandum/fluxsort/issues/1">this issue</a>. Pivot selection is an important one—it seems pdqsort uses far fewer pivots than it should. Picking out a larger sample of pivots also opens up the opportunity of performing statistics on them, or checking for a run while the cache line's hot.</p>
<p><a href="https://arxiv.org/abs/2106.05123">This paper</a> gives a good description of <a href="https://github.com/orlp/pdqsort">pdqsort</a>. I'd start with the <a href="https://github.com/rust-lang/rust/blob/master/library/core/src/slice/sort.rs">Rust version</a>, which has some advantages but can still be improved further. The subsections below describe improved <a href="#partitioning">partitioning</a> and an <a href="#initial-pass">initial pass</a> with several benefits. I also found that the pivot randomization methods currently used are less effective because they swap elements that won't become pivots soon; the pivot candidates and randomization targets need to be chosen to overlap. The optimistic insertion sort can also be improved: when a pair of elements is swapped the smaller one should be inserted as usual but the larger one can also be pushed forward at little cost, potentially saving many swaps and handling too-large elements as gracefully as too-small ones.</p>
<p>While the stable partitioning for Fluxsort seems to be an overall better choice, pdqsort's unstable partitioning is what I've worked with in the past. The following sections are written from the perspective of pdqsort and will be rewritten for Fluxsort as the methods are adapted.</p>
<h4 id="partitioning"><a class="header" href="#partitioning">Partitioning</a></h4>
<p>In-place quicksort relies on a partitioning algorithm that exchanges elements in order to split them into two contiguous groups. The <a href="https://en.wikipedia.org/wiki/Quicksort#Hoare_partition_scheme">Hoare partition scheme</a> does this, and <a href="https://github.com/weissan/BlockQuicksort">BlockQuicksort</a> showed that it can be performed quickly with branchless index generation; this method was then adopted by pdqsort. But the <a href="replicate.html#booleans-to-indices">bit booleans to indices</a> method is faster and fits well with vectorized comparisons.</p>
<p>It's simplest to define an operation <code><span class='Function'>P</span></code> that partitions a list <code><span class='Value'>𝕩</span></code> according to a boolean list <code><span class='Value'>𝕨</span></code>. Partitioning permutes <code><span class='Value'>𝕩</span></code> so that all elements corresponding to 0 in <code><span class='Value'>𝕨</span></code> come before those corresponding to 1. The quicksort partition step, with pivot <code><span class='Value'>t</span></code>, is <code><span class='Paren'>(</span><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span><span class='Paren'>)</span><span class='Function'>P</span><span class='Value'>𝕩</span></code>, and the comparison can be vectorized. Interleaving comparison and partitioning in chunks would save memory (a fraction of the size of <code><span class='Value'>𝕩</span></code>, which should have 32- or 64-bit elements because plain counting sort is best for smaller ones) but hardly speeds things up: only a few percent, and only for huge lists with hundreds of millions of elements. The single-step <code><span class='Function'>P</span></code> is also good for Bins, where the boolean <code><span class='Value'>𝕨</span></code> will have to be saved.</p>
<p>For binary search <code><span class='Value'>𝕨</span><span class='Function'>⍋</span><span class='Value'>𝕩</span></code>, partitioning allows one pivot element <code><span class='Value'>t</span></code> from <code><span class='Value'>𝕨</span></code> to be compared to all of <code><span class='Value'>𝕩</span></code> at once, instead of the normal strategy of working with one element from <code><span class='Value'>𝕩</span></code> at a time. <code><span class='Value'>𝕩</span></code> is partitioned according to <code><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span></code>, then result values are found by searching the first half of <code><span class='Value'>𝕨</span></code> for the smaller elements and the second half for the larger ones, and then they are put back in the correct positions by reversing the partitioning. Because Hoare partitioning works by swapping independent pairs of elements, <code><span class='Function'>P</span></code> is a self inverse, identical to <code><span class='Function'>P</span><span class='Modifier'>⁼</span></code>. So the last step is simple, provided the partitioning information <code><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span></code> is saved.</p>
<h4 id="initial-pass"><a class="header" href="#initial-pass">Initial pass</a></h4>
<p>An initial pass for pdqsort (or another in-place quicksort) provides a few advantages:</p>
<ul>
<li>Recognize sorted and reverse-sorted arrays as fast as possible</li>
<li>Always use unguarded insertion sort</li>
<li>Find and maintain range information to switch to counting sort</li>
</ul>
<p>The main purpose of the pass is to find the range of the array. For an insignificant additional cost, the smallest and largest elements can be swapped to the edges of the array and sorted arrays can be detected.</p>
<p>To find the smallest and largest elements, compute the range in blocks, on the order of 1KB each. Record the maximum and minimum, as well as the index of the block that contained these. After each block, update the range as well as the indices. Then, after traversing all blocks, search the appropriate blocks for these values to find exact indices. It may also be possible to skip the search and jump straight to counting sort.</p>
<p>Finding an initial run is fast as well. Compare the first two elements to determine direction, then search for the first pair that have the opposite direction (this can be vectorized because overreading is fine). This run can be used as the first range block, because the maximum and minimum are the two elements at the ends of the run.</p>
<p>At the start of sorting, swap the smallest element to the beginning and the largest to the end, and shrink the size of the array by one in each direction. Now the element before the array is a lower bound and the one after is an upper bound. This property can also be maintained as the array is partitioned, by placing a pivot element between the two halves (swap it to one side of the array before partitioning and to the middle afterwards). As a result, it's always safe to use unguarded insertion sort, and an upper bound for the range of the array can always be found using the difference between the elements before and after it. Now finding the range is fast enough to check for counting sort at every recursion.</p>
<p>This is a very simple initial pass; a more sophisticated one might be beneficial. If the array starts with a large run then there could be more of them. There may also be sampling-based tests to find when merge sort is better, even if the runs aren't perfect (but is this actually common?).</p>
<h3 id="other-sorting-algorithms"><a class="header" href="#other-sorting-algorithms">Other sorting algorithms</a></h3>
<p><a href="https://github.com/ips4o/ips4o">IPS⁴o</a> is a horrifyingly complicated samplesort thing. Unstable, but there's also a stable not-in-place version PS⁴o. For very large arrays it probably has the best memory access patterns, so a few samplesort passes could be useful.</p>
<p><a href="https://github.com/Morwenn/vergesort">Vergesort</a> has another useful first-pass strategy, which spends an asymptotically small amount of time searching for runs before sorting. Since it only detects perfect runs it won't give the full adaptivity of a good merge sort.</p>
<p>Sorting networks compare and swap elements in a fixed pattern, and so can be implemented with branchless or even vectorized code. They're great for sorting many small arrays of the same size, but the limit before insertion sort beats it will be pretty small without hardware specialization.</p>
<h4 id="simd-sorting"><a class="header" href="#simd-sorting">SIMD sorting</a></h4>
<p>A few people have done some work on merge sorting with AVX2 or AVX-512: <a href="https://github.com/sid1607/avx2-merge-sort">two</a> <a href="https://github.com/PatwinchIR/ultra-sort">examples</a>. Pretty complicated, and still mostly in the proof of concept stage, but the benchmarks on uniform random arrays are good. Can these be made adaptive?</p>
<p><a href="https://github.com/nlw0/ChipSort.jl">ChipSort</a> seems further along than those. It uses sorting networks, comb sort, and merging, which all fit nicely with SIMD and should work well together.</p>
<p>Or AVX can <a href="https://github.com/WojciechMula/simd-sort">speed up</a> quicksort. I suspect this is more of a marginal improvement (over branchless quicksorts) relative to merge sort.</p>
<h3 id="binary-search"><a class="header" href="#binary-search">Binary search</a></h3>
<p>Reminder that we're talking about simple, not <a href="#compound-data">compound</a> data. The most important thing is just to have a good branchless binary search (see <a href="#on-binary-search">above</a>), but there are other possible optimizations.</p>
<p>If <code><span class='Value'>𝕨</span></code> is extremely small, use a vector binary search as described in &quot;Sub-nanosecond Searches&quot; (<a href="https://dyalog.tv/Dyalog18/?v=paxIkKBzqBU">video</a>, <a href="https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D08_Searches_Using_Vector_Instructions.zip">slides</a>). For 1-byte elements there's also a vectorized method that works whenever <code><span class='Value'>𝕨</span></code> has no duplicates: create two lookup tables that go from multiples of 8 (5-bit values, after shifting) to bytes. One is a bitmask of <code><span class='Value'>𝕨</span></code>, so that a lookup gives 8 bits indicating which possible choices of the remaining 3 bits are in <code><span class='Value'>𝕨</span></code>. The other gives the number of values in <code><span class='Value'>𝕨</span></code> less than the multiple of 8. To find the result of Bins, look up these two bytes. Mask off the bitmask to include only bits for values less than the target, and sum it (each of these steps can be done with another lookup, or other methods depending on instruction set). The result is the sum of these two counts.</p>
<p>It's cheap and sometimes worthwhile to trim <code><span class='Value'>𝕨</span></code> down to the range of <code><span class='Value'>𝕩</span></code>. After finding the range of <code><span class='Value'>𝕩</span></code>, binary cut <code><span class='Value'>𝕨</span></code> to a smaller list that contains the range. Stop when the middle element fits inside the range, and search each half of <code><span class='Value'>𝕨</span></code> for the appropriate endpoint of the range.</p>
<p>If <code><span class='Value'>𝕩</span></code> is small-range, then a lookup table method is possible. Check the length of <code><span class='Value'>𝕨</span></code> because if it's too large then this method is slower—binary search doesn't have to hit every element! The approach is simply to create a table of the number of elements in <code><span class='Value'>𝕨</span></code> with each value, then take a prefix sum. In BQN, <code><span class='Value'>𝕩</span><span class='Function'>⊏+</span><span class='Modifier'>`</span><span class='Paren'>(</span><span class='Number'>1</span><span class='Function'>+⌈</span><span class='Modifier'>´</span><span class='Value'>𝕩</span><span class='Paren'>)</span><span class='Function'>↑/</span><span class='Modifier'>⁼</span><span class='Value'>𝕨</span></code>, assuming a minimum of 0.</p>
<p><a href="#partitioning">Partitioning</a> allows one pivot <code><span class='Value'>t</span></code> from <code><span class='Value'>𝕨</span></code> to be compared with all of <code><span class='Value'>𝕩</span></code> at once. Although the comparison <code><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span></code> can be vectorized, the overhead of partitioning still makes this method a little slower per-comparison than sequential binary search <em>when</em> <code><span class='Value'>𝕨</span></code> <em>fits in L1 cache</em>. For larger <code><span class='Value'>𝕨</span></code> (and randomly positioned <code><span class='Value'>𝕩</span></code>) cache churn is a huge cost and partitioning can be many times faster. It should be performed recursively, switching to sequential binary search when <code><span class='Value'>𝕨</span></code> is small enough. Unlike quicksort there is no difficulty in pivot selection: always take it from the middle of <code><span class='Value'>𝕨</span></code> as in a normal binary search. However, there is a potential issue with memory. If <code><span class='Value'>𝕩</span></code> is unbalanced with respect to <code><span class='Value'>𝕨</span></code>, then the larger part can be nearly the whole length of <code><span class='Value'>𝕩</span></code> (if it's all of <code><span class='Value'>𝕩</span></code> partitioning isn't actually needed and it doesn't need to be saved). This can require close to <code><span class='Number'>2</span><span class='Function'>⋆</span><span class='Modifier'>⁼</span><span class='Function'>≠</span><span class='Value'>𝕨</span></code> saved partitions of length <code><span class='Function'>≠</span><span class='Value'>𝕩</span></code>, while the expected use would be a total length <code><span class='Function'>≠</span><span class='Value'>𝕩</span></code>.</p>
<h4 id="interpolation-search"><a class="header" href="#interpolation-search">Interpolation search?</a></h4>
<p>Binary search is the optimal approach for truly unknown data. However, if the searched array has an approximately uniform distribution, giving an approximately linear relationship between index and value, <a href="https://en.wikipedia.org/wiki/Interpolation_search">interpolation search</a> uses asymptotically fewer comparisons. Because of the high overhead of index computation (a division!!), it could only ever be useful for large <code><span class='Value'>𝕨</span></code>, and <code><span class='Value'>𝕩</span></code> small enough that partitioning isn't viable.</p>
<p>Interpolation is generally held to be counterproductive because of how badly it fails for non-uniform data. But the continuous-domain equivalent, bracketed root-finding, is better studied, with hybrid approaches developed to fix this. The recently published <a href="https://en.wikipedia.org/wiki/ITP_Method">ITP method</a> (2020) gets the advantages of interpolation with the same maximum number of comparisons as binary search, plus a configurable constant (even if set to 0, it can take advantage of the gap between the length and the next power of 2, but 1 is a better setting). And once the search range is small enough that memory access stops being very expensive, it should switch to binary search and avoid the division.</p>