Notes on partitioning and simple-data binary search

author: Marshall Lochbaum <mwlochbaum@gmail.com> 2021-06-17 22:08:13 -0400
committer: Marshall Lochbaum <mwlochbaum@gmail.com> 2021-06-17 22:08:13 -0400
commit: 189ddf99b155f117d73e55a1230794d0331ec0fb (patch)
tree: 39b8a83f579531afc69fe77b326eb71f00cd976d /docs/implementation
parent: 0dd15451548abe63d70b061309b261b51df36313 (diff)
1 files changed, 13 insertions, 3 deletions
diff --git a/docs/implementation/primitive/sort.html b/docs/implementation/primitive/sort.html
index 3842ecb7..848b6edb 100644
--- a/docs/implementation/primitive/sort.html
+++ b/docs/implementation/primitive/sort.html
@@ -10,19 +10,29 @@
 <h2 id="on-quicksort-versus-merge-sort">On quicksort versus merge sort</h2>
 <p>Merge sort is better. It is deterministic, stable, and has optimal worst-case performance. Its pattern handling is better: while merge sort handles &quot;horizontal&quot; patterns and quicksort does &quot;vertical&quot; ones, merge sort gets useful work out of <em>any</em> sequence of runs but in-place quicksort will quickly mangle its analogue until it may as well be random.</p>
 <p>But that doesn't mean merge sort is always faster. Quicksort seems to work a little better branchlessly. For sorting, quicksort's partitioning can reduce the range of the data enough to use an extremely quick counting sort. Partitioning is also a natural fit for binary search, where it's mandatory for sensible cache behavior with large enough arguments. So it can be useful. But it doesn't merge, and can't easily be made to merge, and that's a shame.</p>
-<p>The same applies to the general categories of partitioning sorts (quicksort, radix sort, samplesort) and merging sorts (mergesort, timsort, multimerges). Radix sorts are definitely the best for some types and lengths, although the scattered accesses make their performance unpredictable and I think overall they're not worth it. A million uniformly random 4-byte integers is nearly the best possible case for radix sort, so the fact that this seems to be the go-to sorting benchmarks means radix sorting looks better than it is.</p>
+<p>The same applies to the general categories of partitioning sorts (quicksort, radix sort, samplesort) and merging sorts (mergesort, timsort, multimerges). Radix sorts are definitely the best for some types and lengths, although the scattered accesses make their performance unpredictable and I think overall they're not worth it. A million uniformly random 4-byte integers is nearly the best possible case for radix sort, so the fact that this seems to be the go-to sorting benchmark means radix sorting looks better than it is.</p>
 <h2 id="on-binary-search">On binary search</h2>
 <p>Binary searches are very easy to get wrong. Do not write <code><span class='Paren'>(</span><span class='Value'>hi</span><span class='Function'>+</span><span class='Value'>lo</span><span class='Paren'>)</span><span class='Function'>/</span><span class='Number'>2</span></code>: it's not safe from overflows. I always follow the pattern given in the first code block <a href="https://pvk.ca/Blog/2015/11/29/retrospective-on-binary-search-and-on-compression-slash-compilation/">here</a>. This code will never access the value <code><span class='Value'>*base</span></code>, so it should be considered a search on the <code><span class='Value'>n</span><span class='Function'>-</span><span class='Number'>1</span></code> values beginning at <code><span class='Value'>base</span><span class='Function'>+</span><span class='Number'>1</span></code> (the perfect case is when the number of values is one less than a power of two, which is in fact how it has to go). It's branchless and always takes the same number of iterations. To get a version that stops when the answer is known, subtract <code><span class='Value'>n%</span><span class='Number'>2</span></code> from <code><span class='Value'>n</span></code> in the case that <code><span class='Value'>*mid</span> <span class='Function'>&lt;</span> <span class='Value'>x</span></code>.</p>
 <h2 id="compound-data">Compound data</h2>
 <p>Array comparisons are expensive. The goal here is almost entirely to minimize the number of comparisons. Which is a much less complex goal than to get the most out of modern hardware, so the algorithms here are simpler.</p>
-<p>For <strong>Sort</strong> and <strong>Grade</strong>, use Timsort. It's time-tested and shows no signs of weakness (but do be sure to pick up a fix for the bug discovered in 2015 in formal verification). Hardly different from optimal comparison numbers on random data, and outstanding pattern handling. Grade can either by selecting from the original array to order indices or by moving the data around in the same order as the indices. I think the second of these ends up being substantially better for small-ish elements.</p>
+<p>For <strong>Sort</strong> and <strong>Grade</strong>, use Timsort. It's time-tested and shows no signs of weakness (but do be sure to pick up a fix for the bug discovered in 2015 in formal verification). Hardly different from optimal comparison numbers on random data, and outstanding pattern handling. Grade can be done either by selecting from the original array to order indices or by moving the data around in the same order as the indices. I think the second of these ends up being substantially better for small-ish elements.</p>
 <p>For <strong>Bins</strong>, use a branching binary search: see <a href="#on-binary-search">On binary search</a> above. But there are also interesting (although, I expect, rare) cases where only one argument is compound. Elements of this argument should be reduced to fit the type of the other argument, then compared to multiple elements. For the right argument, this just means reducing before doing whatever binary search is appropriate to the left argument. If the left argument is compound, its elements should be used as partitions. Then switch back to binary search only when the partitions get very small—probably one element.</p>
 <h2 id="simple-data">Simple data</h2>
 <p>The name of the game here is &quot;branchless&quot;.</p>
 <p>Sorting algorithms of interest are counting sort and <a href="https://github.com/orlp/pdqsort">pdqsort</a> (some improvements of my own to be described here later). However, these are both unusable for Grade.</p>
 <p>For small-range Grade, counting sort must be replaced with bucket sort, at a significant performance cost. I don't have any method I'm happy with for other data. Stabilizing pdqsort by sorting the indices at the end is possible but slow. <a href="https://github.com/scandum/wolfsort">Wolfsort</a> is a hybrid radix/merge sort that might be better.</p>
 <p><a href="https://github.com/ips4o/ips4o">IPS⁴o</a> is a horrifyingly complicated samplesort thing. Stable, I think. For very large arrays it probably has the best memory access patterns, so a few samplesort passes could be useful.</p>
-<p>Vector binary search described in &quot;Sub-nanosecond Searches&quot; (<a href="https://dyalog.tv/Dyalog18/?v=paxIkKBzqBU">video</a>, <a href="https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D15_The_Interpretive_Advantage.zip">slides</a>).</p>
+<p>A branchless binary search is adequate for Bins but in many cases—very small or large <code><span class='Value'>𝕨</span></code>, and small range—there are better methods.</p>
 <h3 id="counting-and-bucket-sort">Counting and bucket sort</h3>
 <p>Both counting and bucket sort are small-range algorithms that begin by counting the number of each possible value. Bucket sort, as used here, means that the counts are then used to place values in the appropriate position in the result in another pass. Counting sort does not read from the initial values again and instead reconstructs them from the counts. It might be written <code><span class='Paren'>(</span><span class='Function'>/≠</span><span class='Modifier'>¨</span><span class='Modifier2'>∘</span><span class='Function'>⊔</span><span class='Paren'>)</span><span class='Modifier2'>⌾</span><span class='Paren'>(</span><span class='Function'>-</span><span class='Modifier2'>⟜</span><span class='Value'>min</span><span class='Paren'>)</span></code> in BQN, with <code><span class='Function'>≠</span><span class='Modifier'>¨</span><span class='Modifier2'>∘</span><span class='Function'>⊔</span></code> as a single efficient operation.</p>
 <p>Bucket sort can be used for Grade or sort-by (<code><span class='Function'>⍋</span><span class='Modifier2'>⊸</span><span class='Function'>⊏</span></code>), but counting sort only works for sorting itself. It's not-even-unstable: there's no connection between result values and the input values except that they are constructed to be equal. But with <a href="replicate.html#non-booleans-to-indices">fast Indices</a>, Counting sort is vastly more powerful, and is effective with a range four to eight times the argument length. This is large enough that it might pose a memory usage problem, but the memory use can be made arbitrarily low by partitioning.</p>
+<h3 id="partitioning">Partitioning</h3>
+<p>In-place quicksort relies on a partitioning algorithm that exchanges elements in order to split them into two contiguous groups. The <a href="https://en.wikipedia.org/wiki/Quicksort#Hoare_partition_scheme">Hoare partition scheme</a> does this, and <a href="https://github.com/weissan/BlockQuicksort">BlockQuicksort</a> showed that it can be performed quickly with branchless index generation; this method was then adopted by pdqsort. But the <a href="replicate.html#booleans-to-indices">bit booleans to indices</a> method is faster and fits well with vectorized comparisons.</p>
+<p>It's simplest to define an operation <code><span class='Function'>P</span></code> that partitions a list <code><span class='Value'>𝕩</span></code> according to a boolean list <code><span class='Value'>𝕨</span></code>. Partitioning permutes <code><span class='Value'>𝕩</span></code> so that all elements corresponding to 0 in <code><span class='Value'>𝕨</span></code> come before those corresponding to 1. The quicksort partition step, with pivot <code><span class='Value'>t</span></code>, is <code><span class='Paren'>(</span><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span><span class='Paren'>)</span><span class='Function'>P</span><span class='Value'>𝕩</span></code>, and the comparison can be vectorized. Interleaving comparison and partitioning in chunks would save memory (a fraction of the size of <code><span class='Value'>𝕩</span></code>, which should have 32- or 64-bit elements because plain counting sort is best for smaller ones) but hardly speeds things up: only a few percent, and only for huge lists with hundreds of millions of elements. The single-step <code><span class='Function'>P</span></code> is also good for Bins, where the boolean <code><span class='Value'>𝕨</span></code> will have to be saved.</p>
+<p>For binary search <code><span class='Value'>𝕨</span><span class='Function'>⍋</span><span class='Value'>𝕩</span></code>, partitioning allows one pivot element <code><span class='Value'>t</span></code> from <code><span class='Value'>𝕨</span></code> to be compared to all of <code><span class='Value'>𝕩</span></code> at once, instead of the normal strategy of working with one element from <code><span class='Value'>𝕩</span></code> at a time. <code><span class='Value'>𝕩</span></code> is partitioned according to <code><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span></code>, then result values are found by searching the first half of <code><span class='Value'>𝕨</span></code> for the smaller elements and the second half for the larger ones, and then they are put back in the correct positions by reversing the partitioning. Because Hoare partitioning works by swapping independent pairs of elements, <code><span class='Function'>P</span></code> is a self inverse, identical to <code><span class='Function'>P</span><span class='Modifier'>⁼</span></code>. So the last step is simple, provided the partitioning information <code><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span></code> is saved.</p>
+<h3 id="binary-search">Binary search</h3>
+<p>Reminder that we're talking about simple, not <a href="#compound-data">compound</a> data. The most important thing is just to have a good branchless binary search (see <a href="#on-binary-search">above</a>), but there are other possible optimizations.</p>
+<p>If <code><span class='Value'>𝕨</span></code> is extremely small, use a vector binary search as described in &quot;Sub-nanosecond Searches&quot; (<a href="https://dyalog.tv/Dyalog18/?v=paxIkKBzqBU">video</a>, <a href="https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D15_The_Interpretive_Advantage.zip">slides</a>). For 1-byte elements there's also a vectorized method that works whenever <code><span class='Value'>𝕨</span></code> has no duplicates: create two lookup tables that go from multiples of 8 (5-bit values, after shifting) to bytes. One is a bitmask of <code><span class='Value'>𝕨</span></code>, so that a lookup gives 8 bits indicating which possible choices of the remaining 3 bits are in <code><span class='Value'>𝕨</span></code>. The other gives the number of values in <code><span class='Value'>𝕨</span></code> less than the multiple of 8. To find the result of Bins, look up these two bytes. Mask off the bitmask to include only bits for values less than the target, and sum it (each of these steps can be done with another lookup, or other methods depending on instruction set). The result is the sum of these two counts.</p>
+<p>It's cheap and sometimes worthwhile to trim <code><span class='Value'>𝕨</span></code> down to the range of <code><span class='Value'>𝕩</span></code>. After finding the range of <code><span class='Value'>𝕩</span></code>, binary cut <code><span class='Value'>𝕨</span></code> to a smaller list that contains the range. Stop when the middle element fits inside the range, and search each half of <code><span class='Value'>𝕨</span></code> for the appropriate endpoint of the range.</p>
+<p>If <code><span class='Value'>𝕩</span></code> is small-range, then a lookup table method is possible. Check the length of <code><span class='Value'>𝕨</span></code> because if it's too large then this method is slower! The approach is simply to create a table of the number of elements in <code><span class='Value'>𝕨</span></code> with each value, then take a prefix sum. In BQN, <code><span class='Value'>𝕩</span><span class='Function'>⊏+</span><span class='Modifier'>`</span><span class='Function'>∾≠</span><span class='Modifier'>¨</span><span class='Function'>⊔</span><span class='Value'>𝕨</span></code>, assuming a minimum of 0.</p>
+<p><a href="#partitioning">Partitioning</a> allows one pivot <code><span class='Value'>t</span></code> from <code><span class='Value'>𝕨</span></code> to be compared with all of <code><span class='Value'>𝕩</span></code> at once. Although the comparison <code><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span></code> can be vectorized, the overhead of partitioning still makes this method a little slower per-comparison than sequential binary search <em>when</em> <code><span class='Value'>𝕨</span></code> <em>fits in L1 cache</em>. For larger <code><span class='Value'>𝕨</span></code> (and randomly positioned <code><span class='Value'>𝕩</span></code>) cache churn is a huge cost and partitioning can be many times faster. It should be performed recursively, switching to sequential binary search when <code><span class='Value'>𝕨</span></code> is small enough. Unlike quicksort there is no difficulty in pivot selection: always take it from the middle of <code><span class='Value'>𝕨</span></code> as in a normal binary search. However, there is a potential issue with memory. If <code><span class='Value'>𝕩</span></code> is unbalanced with respect to <code><span class='Value'>𝕨</span></code>, then the larger part can be nearly the whole length of <code><span class='Value'>𝕩</span></code> (if it's all of <code><span class='Value'>𝕩</span></code> partitioning isn't actually needed and it doesn't need to be saved). This can require close to <code><span class='Number'>2</span><span class='Function'>⋆</span><span class='Modifier'>⁼</span><span class='Function'>≠</span><span class='Value'>𝕨</span></code> saved partitions of length <code><span class='Function'>≠</span><span class='Value'>𝕩</span></code>, while the expected use would be a total length <code><span class='Function'>≠</span><span class='Value'>𝕩</span></code>.</p>
author	Marshall Lochbaum <mwlochbaum@gmail.com>	2021-06-17 22:08:13 -0400
committer	Marshall Lochbaum <mwlochbaum@gmail.com>	2021-06-17 22:08:13 -0400
commit	189ddf99b155f117d73e55a1230794d0331ec0fb (patch)
tree	39b8a83f579531afc69fe77b326eb71f00cd976d /docs/implementation
parent	0dd15451548abe63d70b061309b261b51df36313 (diff)