From 2afb23928e1984d475cc460e1672e8f6fa0e4dbe Mon Sep 17 00:00:00 2001 From: Marshall Lochbaum Date: Wed, 11 Aug 2021 17:21:31 -0400 Subject: Allow clicking on header to get fragment link --- docs/implementation/primitive/sort.html | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) (limited to 'docs/implementation/primitive/sort.html') diff --git a/docs/implementation/primitive/sort.html b/docs/implementation/primitive/sort.html index 2a417770..5c6c3d70 100644 --- a/docs/implementation/primitive/sort.html +++ b/docs/implementation/primitive/sort.html @@ -4,35 +4,35 @@ BQN: Implementation of ordering functions -

Implementation of ordering functions

+

Implementation of ordering functions

The ordering functions are Sort (∧∨), Grade (⍋⍒), and Bins (⍋⍒). Although these are well-studied—particularly sorting, and then binary search or "predecessor search"—there are many recent developments, as well as techniques that I have not found in the literature. The three functions are closely related but have important differences in what algorithms are viable. Sorting is a remarkably deep problem with different algorithms able to do a wide range of amazing things, and sophisticated ways to combine those. It is by no means solved. In comparison, Bins is pretty tame.

There's a large divide between ordering compound data and simple data. For compound data comparisons are expensive, and the best algorithm will generally be the one that uses the fewest comparisons. For simple data they fall somewhere between cheap and extremely cheap, and fancy branchless and vectorized algorithms are the best.

-

On quicksort versus merge sort

+

On quicksort versus merge sort

Merge sort is better. It is deterministic, stable, and has optimal worst-case performance. Its pattern handling is better: while merge sort handles "horizontal" patterns and quicksort does "vertical" ones, merge sort gets useful work out of any sequence of runs but in-place quicksort will quickly mangle its analogue until it may as well be random.

But that doesn't mean merge sort is always faster. Quicksort seems to work a little better branchlessly. For sorting, quicksort's partitioning can reduce the range of the data enough to use an extremely quick counting sort. Partitioning is also a natural fit for binary search, where it's mandatory for sensible cache behavior with large enough arguments. So it can be useful. But it doesn't merge, and can't easily be made to merge, and that's a shame.

The same applies to the general categories of partitioning sorts (quicksort, radix sort, samplesort) and merging sorts (mergesort, timsort, multimerges). Radix sorts are definitely the best for some types and lengths, although the scattered accesses make their performance unpredictable and I think overall they're not worth it. A million uniformly random 4-byte integers is nearly the best possible case for radix sort, so the fact that this seems to be the go-to sorting benchmark means radix sorting looks better than it is.

- +

Binary searches are very easy to get wrong. Do not write (hi+lo)/2: it's not safe from overflows. I always follow the pattern given in the first code block here. This code will never access the value *base, so it should be considered a search on the n-1 values beginning at base+1 (the perfect case is when the number of values is one less than a power of two, which is in fact how it has to go). It's branchless and always takes the same number of iterations. To get a version that stops when the answer is known, subtract n%2 from n in the case that *mid < x.

-

Compound data

+

Compound data

Array comparisons are expensive. The goal here is almost entirely to minimize the number of comparisons. Which is a much less complex goal than to get the most out of modern hardware, so the algorithms here are simpler.

For Sort and Grade, use Timsort. It's time-tested and shows no signs of weakness (but do be sure to pick up a fix for the bug discovered in 2015 in formal verification). Hardly different from optimal comparison numbers on random data, and outstanding pattern handling. Grade can be done either by selecting from the original array to order indices or by moving the data around in the same order as the indices. I think the second of these ends up being substantially better for small-ish elements.

For Bins, use a branching binary search: see On binary search above. But there are also interesting (although, I expect, rare) cases where only one argument is compound. Elements of this argument should be reduced to fit the type of the other argument, then compared to multiple elements. For the right argument, this just means reducing before doing whatever binary search is appropriate to the left argument. If the left argument is compound, its elements should be used as partitions. Then switch back to binary search only when the partitions get very small—probably one element.

-

Simple data

+

Simple data

The name of the game here is "branchless".

For sorting, the fastest algorithms for generic data and generic hardware are branchless quicksorts. Fluxsort is new and very exciting because it's a stable algorithm that's substantially faster than runner-up pdqsort on random arrays. However, it's immature and is missing a lot of the specialized strategies pdqsort has. I'm working on adapting these improvements to work for stable sorting and also on hybridizing with counting/bucket sort.

A branchless binary search is adequate for Bins but in many cases—very small or large 𝕨, and small range—there are better methods.

-

Counting and bucket sort

+

Counting and bucket sort

Both counting and bucket sort are small-range algorithms that begin by counting the number of each possible value. Bucket sort, as used here, means that the counts are then used to place values in the appropriate position in the result in another pass. Counting sort does not read from the initial values again and instead reconstructs them from the counts. It might be written (/≠¨)(-min) in BQN, with ¨ as a single efficient operation.

Bucket sort can be used for Grade or sort-by (), but counting sort only works for sorting itself. It's not-even-unstable: there's no connection between result values and the input values except that they are constructed to be equal. But with fast Indices, Counting sort is vastly more powerful, and is effective with a range four to eight times the argument length. This is large enough that it might pose a memory usage problem, but the memory use can be made arbitrarily low by partitioning.

-

Quicksort

+

Quicksort

Fluxsort attains high performance with a branchless stable partition that places one half on top of existing data and the other half somewhere else. One half ends up in the appropriate place in the sorted array. The other is in swap memory, and will be shifted back by subsequent partitions and base-case sorting. Aside from the partitioning strategy, Fluxsort makes a number of other decisions differently from pdqsort, including a fairly complicated merge sort (Quadsort) as the base case. I haven't fully evaluated these.

This paper gives a good description of pdqsort. I'd start with the Rust version, which has some advantages but can still be improved further. The subsections below describe improved partitioning and an initial pass with several benefits. I also found that the pivot randomization methods currently used are less effective because they swap elements that won't become pivots soon; the pivot candidates and randomization targets need to be chosen to overlap. The optimistic insertion sort can also be improved: when a pair of elements is swapped the smaller one should be inserted as usual but the larger one can also be pushed forward at little cost, potentially saving many swaps and handling too-large elements as gracefully as too-small ones.

While the stable partitioning for Fluxsort seems to be an overall better choice, pdqsort's unstable partitioning is what I've worked with in the past. The following sections are written from the perspective of pdqsort and will be rewritten for Fluxsort as the methods are adapted.

-

Partitioning

+

Partitioning

In-place quicksort relies on a partitioning algorithm that exchanges elements in order to split them into two contiguous groups. The Hoare partition scheme does this, and BlockQuicksort showed that it can be performed quickly with branchless index generation; this method was then adopted by pdqsort. But the bit booleans to indices method is faster and fits well with vectorized comparisons.

It's simplest to define an operation P that partitions a list 𝕩 according to a boolean list 𝕨. Partitioning permutes 𝕩 so that all elements corresponding to 0 in 𝕨 come before those corresponding to 1. The quicksort partition step, with pivot t, is (t𝕩)P𝕩, and the comparison can be vectorized. Interleaving comparison and partitioning in chunks would save memory (a fraction of the size of 𝕩, which should have 32- or 64-bit elements because plain counting sort is best for smaller ones) but hardly speeds things up: only a few percent, and only for huge lists with hundreds of millions of elements. The single-step P is also good for Bins, where the boolean 𝕨 will have to be saved.

For binary search 𝕨𝕩, partitioning allows one pivot element t from 𝕨 to be compared to all of 𝕩 at once, instead of the normal strategy of working with one element from 𝕩 at a time. 𝕩 is partitioned according to t𝕩, then result values are found by searching the first half of 𝕨 for the smaller elements and the second half for the larger ones, and then they are put back in the correct positions by reversing the partitioning. Because Hoare partitioning works by swapping independent pairs of elements, P is a self inverse, identical to P. So the last step is simple, provided the partitioning information t𝕩 is saved.

-

Initial pass

+

Initial pass

An initial pass for pdqsort (or another in-place quicksort) provides a few advantages: