1 files changed, 20 insertions, 6 deletions
diff --git a/docs/implementation/primitive/sort.html b/docs/implementation/primitive/sort.html
index f726461c..def9c975 100644
--- a/docs/implementation/primitive/sort.html
+++ b/docs/implementation/primitive/sort.html
@@ -19,12 +19,30 @@
 <p>For <strong>Bins</strong>, use a branching binary search: see <a href="#on-binary-search">On binary search</a> above. But there are also interesting (although, I expect, rare) cases where only one argument is compound. Elements of this argument should be reduced to fit the type of the other argument, then compared to multiple elements. For the right argument, this just means reducing before doing whatever binary search is appropriate to the left argument. If the left argument is compound, its elements should be used as partitions. Then switch back to binary search only when the partitions get very small—probably one element.</p>
 <h2 id="simple-data">Simple data</h2>
 <p>The name of the game here is &quot;branchless&quot;.</p>
-<p>Sorting algorithms of interest are counting sort and <a href="https://github.com/orlp/pdqsort">pdqsort</a> (some improvements of my own to be described here later). However, these are both unusable for Grade.</p>
-<p>For small-range Grade, counting sort must be replaced with bucket sort, at a significant performance cost. I don't have any method I'm happy with for other data. Stabilizing pdqsort by sorting the indices at the end is possible but slow. <a href="https://github.com/scandum/wolfsort">Wolfsort</a> is a hybrid radix/merge sort that might be better.</p>
+<p>For sorting the best general algorithm I know of is a hybrid of pdqsort with counting sort, described below. I've found some improvements to pdqsort, but they all maintain its basic strategy and pattern-defeating nature, so I think the name still fits.</p>
+<p>Both pdqsort and counting sort are not stable, so they can't be used for Grade. For small-range Grade, counting sort must be replaced with bucket sort, at a significant performance cost. I don't have any method I'm happy with for other data. Stabilizing pdqsort by sorting the indices at the end is possible but slow. <a href="https://github.com/scandum/wolfsort">Wolfsort</a> is a hybrid radix/merge sort that might be better.</p>
 <p>A branchless binary search is adequate for Bins but in many cases—very small or large <code><span class='Value'>𝕨</span></code>, and small range—there are better methods.</p>
 <h3 id="counting-and-bucket-sort">Counting and bucket sort</h3>
 <p>Both counting and bucket sort are small-range algorithms that begin by counting the number of each possible value. Bucket sort, as used here, means that the counts are then used to place values in the appropriate position in the result in another pass. Counting sort does not read from the initial values again and instead reconstructs them from the counts. It might be written <code><span class='Paren'>(</span><span class='Function'>/≠</span><span class='Modifier'>¨</span><span class='Modifier2'>∘</span><span class='Function'>⊔</span><span class='Paren'>)</span><span class='Modifier2'>⌾</span><span class='Paren'>(</span><span class='Function'>-</span><span class='Modifier2'>⟜</span><span class='Value'>min</span><span class='Paren'>)</span></code> in BQN, with <code><span class='Function'>≠</span><span class='Modifier'>¨</span><span class='Modifier2'>∘</span><span class='Function'>⊔</span></code> as a single efficient operation.</p>
 <p>Bucket sort can be used for Grade or sort-by (<code><span class='Function'>⍋</span><span class='Modifier2'>⊸</span><span class='Function'>⊏</span></code>), but counting sort only works for sorting itself. It's not-even-unstable: there's no connection between result values and the input values except that they are constructed to be equal. But with <a href="replicate.html#non-booleans-to-indices">fast Indices</a>, Counting sort is vastly more powerful, and is effective with a range four to eight times the argument length. This is large enough that it might pose a memory usage problem, but the memory use can be made arbitrarily low by partitioning.</p>
+<h3 id="quicksort">Quicksort</h3>
+<p><a href="https://arxiv.org/abs/2106.05123">This paper</a> gives a good description of <a href="https://github.com/orlp/pdqsort">pdqsort</a>. I'd start with the <a href="https://github.com/rust-lang/rust/blob/master/library/core/src/slice/sort.rs">Rust version</a>, which has some advantages but can still be improved further. The subsections below describe improved <a href="#partitioning">partitioning</a> and an <a href="#initial-pass">initial pass</a> with several benefits. I also found that the pivot randomization methods currently used are less effective because they swap elements that won't become pivots soon; the pivot candidates and randomization targets need to be chosen to overlap. The optimistic insertion sort can also be improved: when a pair of elements is swapped the smaller one should be inserted as usual but the larger one can also be pushed forward at little cost, potentially saving many swaps and handling too-large elements as gracefully as too-small ones.</p>
+<h4 id="partitioning">Partitioning</h4>
+<p>In-place quicksort relies on a partitioning algorithm that exchanges elements in order to split them into two contiguous groups. The <a href="https://en.wikipedia.org/wiki/Quicksort#Hoare_partition_scheme">Hoare partition scheme</a> does this, and <a href="https://github.com/weissan/BlockQuicksort">BlockQuicksort</a> showed that it can be performed quickly with branchless index generation; this method was then adopted by pdqsort. But the <a href="replicate.html#booleans-to-indices">bit booleans to indices</a> method is faster and fits well with vectorized comparisons.</p>
+<p>It's simplest to define an operation <code><span class='Function'>P</span></code> that partitions a list <code><span class='Value'>𝕩</span></code> according to a boolean list <code><span class='Value'>𝕨</span></code>. Partitioning permutes <code><span class='Value'>𝕩</span></code> so that all elements corresponding to 0 in <code><span class='Value'>𝕨</span></code> come before those corresponding to 1. The quicksort partition step, with pivot <code><span class='Value'>t</span></code>, is <code><span class='Paren'>(</span><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span><span class='Paren'>)</span><span class='Function'>P</span><span class='Value'>𝕩</span></code>, and the comparison can be vectorized. Interleaving comparison and partitioning in chunks would save memory (a fraction of the size of <code><span class='Value'>𝕩</span></code>, which should have 32- or 64-bit elements because plain counting sort is best for smaller ones) but hardly speeds things up: only a few percent, and only for huge lists with hundreds of millions of elements. The single-step <code><span class='Function'>P</span></code> is also good for Bins, where the boolean <code><span class='Value'>𝕨</span></code> will have to be saved.</p>
+<p>For binary search <code><span class='Value'>𝕨</span><span class='Function'>⍋</span><span class='Value'>𝕩</span></code>, partitioning allows one pivot element <code><span class='Value'>t</span></code> from <code><span class='Value'>𝕨</span></code> to be compared to all of <code><span class='Value'>𝕩</span></code> at once, instead of the normal strategy of working with one element from <code><span class='Value'>𝕩</span></code> at a time. <code><span class='Value'>𝕩</span></code> is partitioned according to <code><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span></code>, then result values are found by searching the first half of <code><span class='Value'>𝕨</span></code> for the smaller elements and the second half for the larger ones, and then they are put back in the correct positions by reversing the partitioning. Because Hoare partitioning works by swapping independent pairs of elements, <code><span class='Function'>P</span></code> is a self inverse, identical to <code><span class='Function'>P</span><span class='Modifier'>⁼</span></code>. So the last step is simple, provided the partitioning information <code><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span></code> is saved.</p>
+<h4 id="initial-pass">Initial pass</h4>
+<p>An initial pass for pdqsort (or another in-place quicksort) provides a few advantages:</p>
+<ul>
+<li>Recognize sorted and reverse-sorted arrays as fast as possible</li>
+<li>Always use unguarded insertion sort</li>
+<li>Find and maintain range information to switch to counting sort</li>
+</ul>
+<p>The main purpose of the pass is to find the range of the array. For an insignificant additional cost, the smallest and largest elements can be swapped to the edges of the array and sorted arrays can be detected.</p>
+<p>To find the smallest and largest elements, compute the range in blocks, on the order of 1KB each. Record the maximum and minimum, as well as the index of the block that contained these. After each block, update the range as well as the indices. Then, after traversing all blocks, search the appropriate blocks for these values to find exact indices. It may also be possible to skip the search and jump straight to counting sort.</p>
+<p>Finding an initial run is fast as well. Compare the first two elements to determine direction, then search for the first pair that have the opposite direction (this can be vectorized because overreading is fine). This run can be used as the first range block, because the maximum and minimum are the two elements at the ends of the run.</p>
+<p>At the start of sorting, swap the smallest element to the beginning and the largest to the end, and shrink the size of the array by one in each direction. Now the element before the array is a lower bound and the one after is an upper bound. This property can also be maintained as the array is partitioned, by placing a pivot element between the two halves (swap it to one side of the array before partitioning and to the middle afterwards). As a result, it's always safe to use unguarded insertion sort, and an upper bound for the range of the array can always be found using the difference between the elements before and after it. Now finding the range is fast enough to check for counting sort at every recursion.</p>
+<p>This is a very simple initial pass; a more sophisticated one might be beneficial. If the array starts with a large run then there could be more of them. There may also be sampling-based tests to find when merge sort is better, even if the runs aren't perfect (but is this actually common?).</p>
 <h3 id="other-sorting-algorithms">Other sorting algorithms</h3>
 <p><a href="https://github.com/ips4o/ips4o">IPS⁴o</a> is a horrifyingly complicated samplesort thing. Unstable, but there's also a stable not-in-place version PS⁴o. For very large arrays it probably has the best memory access patterns, so a few samplesort passes could be useful.</p>
 <p><a href="https://github.com/Morwenn/vergesort">Vergesort</a> has another useful first-pass strategy, which spends an asymptotically small amount of time searching for runs before sorting. Since it only detects perfect runs it won't give the full adaptivity of a good merge sort.</p>
@@ -33,10 +51,6 @@
 <p>A few people have done some work on merge sorting with AVX2 or AVX-512: <a href="https://github.com/sid1607/avx2-merge-sort">two</a> <a href="https://github.com/PatwinchIR/ultra-sort">examples</a>. Pretty complicated, and still mostly in the proof of concept stage, but the benchmarks on uniform random arrays are good. Can these be made adaptive?</p>
 <p><a href="https://github.com/nlw0/ChipSort.jl">ChipSort</a> seems further along than those. It uses sorting networks, comb sort, and merging, which all fit nicely with SIMD and should work well together.</p>
 <p>Or AVX can <a href="https://github.com/WojciechMula/simd-sort">speed up</a> quicksort. I suspect this is more of a marginal improvement (over BlockQuicksort/pdqsort discussed below) relative to merge sort. If partitioning is fast enough it might make stable quicksort viable.</p>
-<h3 id="partitioning">Partitioning</h3>
-<p>In-place quicksort relies on a partitioning algorithm that exchanges elements in order to split them into two contiguous groups. The <a href="https://en.wikipedia.org/wiki/Quicksort#Hoare_partition_scheme">Hoare partition scheme</a> does this, and <a href="https://github.com/weissan/BlockQuicksort">BlockQuicksort</a> showed that it can be performed quickly with branchless index generation; this method was then adopted by pdqsort. But the <a href="replicate.html#booleans-to-indices">bit booleans to indices</a> method is faster and fits well with vectorized comparisons.</p>
-<p>It's simplest to define an operation <code><span class='Function'>P</span></code> that partitions a list <code><span class='Value'>𝕩</span></code> according to a boolean list <code><span class='Value'>𝕨</span></code>. Partitioning permutes <code><span class='Value'>𝕩</span></code> so that all elements corresponding to 0 in <code><span class='Value'>𝕨</span></code> come before those corresponding to 1. The quicksort partition step, with pivot <code><span class='Value'>t</span></code>, is <code><span class='Paren'>(</span><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span><span class='Paren'>)</span><span class='Function'>P</span><span class='Value'>𝕩</span></code>, and the comparison can be vectorized. Interleaving comparison and partitioning in chunks would save memory (a fraction of the size of <code><span class='Value'>𝕩</span></code>, which should have 32- or 64-bit elements because plain counting sort is best for smaller ones) but hardly speeds things up: only a few percent, and only for huge lists with hundreds of millions of elements. The single-step <code><span class='Function'>P</span></code> is also good for Bins, where the boolean <code><span class='Value'>𝕨</span></code> will have to be saved.</p>
-<p>For binary search <code><span class='Value'>𝕨</span><span class='Function'>⍋</span><span class='Value'>𝕩</span></code>, partitioning allows one pivot element <code><span class='Value'>t</span></code> from <code><span class='Value'>𝕨</span></code> to be compared to all of <code><span class='Value'>𝕩</span></code> at once, instead of the normal strategy of working with one element from <code><span class='Value'>𝕩</span></code> at a time. <code><span class='Value'>𝕩</span></code> is partitioned according to <code><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span></code>, then result values are found by searching the first half of <code><span class='Value'>𝕨</span></code> for the smaller elements and the second half for the larger ones, and then they are put back in the correct positions by reversing the partitioning. Because Hoare partitioning works by swapping independent pairs of elements, <code><span class='Function'>P</span></code> is a self inverse, identical to <code><span class='Function'>P</span><span class='Modifier'>⁼</span></code>. So the last step is simple, provided the partitioning information <code><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span></code> is saved.</p>
 <h3 id="binary-search">Binary search</h3>
 <p>Reminder that we're talking about simple, not <a href="#compound-data">compound</a> data. The most important thing is just to have a good branchless binary search (see <a href="#on-binary-search">above</a>), but there are other possible optimizations.</p>
 <p>If <code><span class='Value'>𝕨</span></code> is extremely small, use a vector binary search as described in &quot;Sub-nanosecond Searches&quot; (<a href="https://dyalog.tv/Dyalog18/?v=paxIkKBzqBU">video</a>, <a href="https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D08_Searches_Using_Vector_Instructions.zip">slides</a>). For 1-byte elements there's also a vectorized method that works whenever <code><span class='Value'>𝕨</span></code> has no duplicates: create two lookup tables that go from multiples of 8 (5-bit values, after shifting) to bytes. One is a bitmask of <code><span class='Value'>𝕨</span></code>, so that a lookup gives 8 bits indicating which possible choices of the remaining 3 bits are in <code><span class='Value'>𝕨</span></code>. The other gives the number of values in <code><span class='Value'>𝕨</span></code> less than the multiple of 8. To find the result of Bins, look up these two bytes. Mask off the bitmask to include only bits for values less than the target, and sum it (each of these steps can be done with another lookup, or other methods depending on instruction set). The result is the sum of these two counts.</p>