Start updating sorting implementation notes for Fluxsort

author: Marshall Lochbaum <mwlochbaum@gmail.com> 2021-07-29 21:59:25 -0400
committer: Marshall Lochbaum <mwlochbaum@gmail.com> 2021-07-29 21:59:25 -0400
commit: c2afaad9f6951c2f635f2ac63842f190dd799573 (patch)
tree: abebe841b652ec910ae987d5c4ca13d35cab275d
parent: 585acb05a97146389dfa1d8f5b2f87fcfa501214 (diff)
2 files changed, 8 insertions, 5 deletions
diff --git a/docs/implementation/primitive/sort.html b/docs/implementation/primitive/sort.html
index def9c975..86c36738 100644
--- a/docs/implementation/primitive/sort.html
+++ b/docs/implementation/primitive/sort.html
@@ -19,14 +19,15 @@
 <p>For <strong>Bins</strong>, use a branching binary search: see <a href="#on-binary-search">On binary search</a> above. But there are also interesting (although, I expect, rare) cases where only one argument is compound. Elements of this argument should be reduced to fit the type of the other argument, then compared to multiple elements. For the right argument, this just means reducing before doing whatever binary search is appropriate to the left argument. If the left argument is compound, its elements should be used as partitions. Then switch back to binary search only when the partitions get very small—probably one element.</p>
 <h2 id="simple-data">Simple data</h2>
 <p>The name of the game here is &quot;branchless&quot;.</p>
-<p>For sorting the best general algorithm I know of is a hybrid of pdqsort with counting sort, described below. I've found some improvements to pdqsort, but they all maintain its basic strategy and pattern-defeating nature, so I think the name still fits.</p>
-<p>Both pdqsort and counting sort are not stable, so they can't be used for Grade. For small-range Grade, counting sort must be replaced with bucket sort, at a significant performance cost. I don't have any method I'm happy with for other data. Stabilizing pdqsort by sorting the indices at the end is possible but slow. <a href="https://github.com/scandum/wolfsort">Wolfsort</a> is a hybrid radix/merge sort that might be better.</p>
+<p>For sorting, the fastest algorithms for generic data and generic hardware are branchless <a href="#quicksort">quicksorts</a>. Fluxsort is new and very exciting because it's a <em>stable</em> algorithm that's substantially faster than runner-up pdqsort on random arrays. However, it's immature and is missing a lot of the specialized strategies pdqsort has. I'm working on adapting these improvements to work for stable sorting and also on hybridizing with counting/bucket sort.</p>
 <p>A branchless binary search is adequate for Bins but in many cases—very small or large <code><span class='Value'>𝕨</span></code>, and small range—there are better methods.</p>
 <h3 id="counting-and-bucket-sort">Counting and bucket sort</h3>
 <p>Both counting and bucket sort are small-range algorithms that begin by counting the number of each possible value. Bucket sort, as used here, means that the counts are then used to place values in the appropriate position in the result in another pass. Counting sort does not read from the initial values again and instead reconstructs them from the counts. It might be written <code><span class='Paren'>(</span><span class='Function'>/≠</span><span class='Modifier'>¨</span><span class='Modifier2'>∘</span><span class='Function'>⊔</span><span class='Paren'>)</span><span class='Modifier2'>⌾</span><span class='Paren'>(</span><span class='Function'>-</span><span class='Modifier2'>⟜</span><span class='Value'>min</span><span class='Paren'>)</span></code> in BQN, with <code><span class='Function'>≠</span><span class='Modifier'>¨</span><span class='Modifier2'>∘</span><span class='Function'>⊔</span></code> as a single efficient operation.</p>
 <p>Bucket sort can be used for Grade or sort-by (<code><span class='Function'>⍋</span><span class='Modifier2'>⊸</span><span class='Function'>⊏</span></code>), but counting sort only works for sorting itself. It's not-even-unstable: there's no connection between result values and the input values except that they are constructed to be equal. But with <a href="replicate.html#non-booleans-to-indices">fast Indices</a>, Counting sort is vastly more powerful, and is effective with a range four to eight times the argument length. This is large enough that it might pose a memory usage problem, but the memory use can be made arbitrarily low by partitioning.</p>
 <h3 id="quicksort">Quicksort</h3>
+<p><a href="https://github.com/scandum/fluxsort">Fluxsort</a> attains high performance with a branchless stable partition that places one half on top of existing data and the other half somewhere else. One half ends up in the appropriate place in the sorted array. The other is in swap memory, and will be shifted back by subsequent partitions and base-case sorting. Aside from the partitioning strategy, Fluxsort makes a number of other decisions differently from pdqsort, including a fairly complicated merge sort (<a href="https://github.com/scandum/quadsort">Quadsort</a>) as the base case. I haven't fully evaluated these.</p>
 <p><a href="https://arxiv.org/abs/2106.05123">This paper</a> gives a good description of <a href="https://github.com/orlp/pdqsort">pdqsort</a>. I'd start with the <a href="https://github.com/rust-lang/rust/blob/master/library/core/src/slice/sort.rs">Rust version</a>, which has some advantages but can still be improved further. The subsections below describe improved <a href="#partitioning">partitioning</a> and an <a href="#initial-pass">initial pass</a> with several benefits. I also found that the pivot randomization methods currently used are less effective because they swap elements that won't become pivots soon; the pivot candidates and randomization targets need to be chosen to overlap. The optimistic insertion sort can also be improved: when a pair of elements is swapped the smaller one should be inserted as usual but the larger one can also be pushed forward at little cost, potentially saving many swaps and handling too-large elements as gracefully as too-small ones.</p>
+<p>While the stable partitioning for Fluxsort seems to be an overall better choice, pdqsort's unstable partitioning is what I've worked with in the past. The following sections are written from the perspective of pdqsort and will be rewritten for Fluxsort as the methods are adapted.</p>
 <h4 id="partitioning">Partitioning</h4>
 <p>In-place quicksort relies on a partitioning algorithm that exchanges elements in order to split them into two contiguous groups. The <a href="https://en.wikipedia.org/wiki/Quicksort#Hoare_partition_scheme">Hoare partition scheme</a> does this, and <a href="https://github.com/weissan/BlockQuicksort">BlockQuicksort</a> showed that it can be performed quickly with branchless index generation; this method was then adopted by pdqsort. But the <a href="replicate.html#booleans-to-indices">bit booleans to indices</a> method is faster and fits well with vectorized comparisons.</p>
 <p>It's simplest to define an operation <code><span class='Function'>P</span></code> that partitions a list <code><span class='Value'>𝕩</span></code> according to a boolean list <code><span class='Value'>𝕨</span></code>. Partitioning permutes <code><span class='Value'>𝕩</span></code> so that all elements corresponding to 0 in <code><span class='Value'>𝕨</span></code> come before those corresponding to 1. The quicksort partition step, with pivot <code><span class='Value'>t</span></code>, is <code><span class='Paren'>(</span><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span><span class='Paren'>)</span><span class='Function'>P</span><span class='Value'>𝕩</span></code>, and the comparison can be vectorized. Interleaving comparison and partitioning in chunks would save memory (a fraction of the size of <code><span class='Value'>𝕩</span></code>, which should have 32- or 64-bit elements because plain counting sort is best for smaller ones) but hardly speeds things up: only a few percent, and only for huge lists with hundreds of millions of elements. The single-step <code><span class='Function'>P</span></code> is also good for Bins, where the boolean <code><span class='Value'>𝕨</span></code> will have to be saved.</p>
diff --git a/implementation/primitive/sort.md b/implementation/primitive/sort.md
index 2a1c0c27..0a4c5211 100644
--- a/implementation/primitive/sort.md
+++ b/implementation/primitive/sort.md
@@ -30,9 +30,7 @@ For **Bins**, use a branching binary search: see [On binary search](#on-binary-s
 
 The name of the game here is "branchless".
 
-For sorting the best general algorithm I know of is a hybrid of pdqsort with counting sort, described below. I've found some improvements to pdqsort, but they all maintain its basic strategy and pattern-defeating nature, so I think the name still fits.
-
-Both pdqsort and counting sort are not stable, so they can't be used for Grade. For small-range Grade, counting sort must be replaced with bucket sort, at a significant performance cost. I don't have any method I'm happy with for other data. Stabilizing pdqsort by sorting the indices at the end is possible but slow. [Wolfsort](https://github.com/scandum/wolfsort) is a hybrid radix/merge sort that might be better.
+For sorting, the fastest algorithms for generic data and generic hardware are branchless [quicksorts](#quicksort). Fluxsort is new and very exciting because it's a *stable* algorithm that's substantially faster than runner-up pdqsort on random arrays. However, it's immature and is missing a lot of the specialized strategies pdqsort has. I'm working on adapting these improvements to work for stable sorting and also on hybridizing with counting/bucket sort.
 
 A branchless binary search is adequate for Bins but in many cases—very small or large `𝕨`, and small range—there are better methods.
 
@@ -44,8 +42,12 @@ Bucket sort can be used for Grade or sort-by (`⍋⊸⊏`), but counting sort on
 
 ### Quicksort
 
+[Fluxsort](https://github.com/scandum/fluxsort) attains high performance with a branchless stable partition that places one half on top of existing data and the other half somewhere else. One half ends up in the appropriate place in the sorted array. The other is in swap memory, and will be shifted back by subsequent partitions and base-case sorting. Aside from the partitioning strategy, Fluxsort makes a number of other decisions differently from pdqsort, including a fairly complicated merge sort ([Quadsort](https://github.com/scandum/quadsort)) as the base case. I haven't fully evaluated these.
+
 [This paper](https://arxiv.org/abs/2106.05123) gives a good description of [pdqsort](https://github.com/orlp/pdqsort). I'd start with the [Rust version](https://github.com/rust-lang/rust/blob/master/library/core/src/slice/sort.rs), which has some advantages but can still be improved further. The subsections below describe improved [partitioning](#partitioning) and an [initial pass](#initial-pass) with several benefits. I also found that the pivot randomization methods currently used are less effective because they swap elements that won't become pivots soon; the pivot candidates and randomization targets need to be chosen to overlap. The optimistic insertion sort can also be improved: when a pair of elements is swapped the smaller one should be inserted as usual but the larger one can also be pushed forward at little cost, potentially saving many swaps and handling too-large elements as gracefully as too-small ones.
 
+While the stable partitioning for Fluxsort seems to be an overall better choice, pdqsort's unstable partitioning is what I've worked with in the past. The following sections are written from the perspective of pdqsort and will be rewritten for Fluxsort as the methods are adapted.
+
 #### Partitioning
 
 In-place quicksort relies on a partitioning algorithm that exchanges elements in order to split them into two contiguous groups. The [Hoare partition scheme](https://en.wikipedia.org/wiki/Quicksort#Hoare_partition_scheme) does this, and [BlockQuicksort](https://github.com/weissan/BlockQuicksort) showed that it can be performed quickly with branchless index generation; this method was then adopted by pdqsort. But the [bit booleans to indices](replicate.md#booleans-to-indices) method is faster and fits well with vectorized comparisons.
author	Marshall Lochbaum <mwlochbaum@gmail.com>	2021-07-29 21:59:25 -0400
committer	Marshall Lochbaum <mwlochbaum@gmail.com>	2021-07-29 21:59:25 -0400
commit	c2afaad9f6951c2f635f2ac63842f190dd799573 (patch)
tree	abebe841b652ec910ae987d5c4ca13d35cab275d
parent	585acb05a97146389dfa1d8f5b2f87fcfa501214 (diff)