From 9d6009137b08c2faa2006fe2d98ec856c99236ec Mon Sep 17 00:00:00 2001
From: Marshall Lochbaum
Date: Sat, 10 Jul 2021 22:13:03 -0400
Subject: More literature review

---
 docs/implementation/primitive/sort.html | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

(limited to 'docs/implementation/primitive')

diff --git a/docs/implementation/primitive/sort.html b/docs/implementation/primitive/sort.html
index 42783a08..f726461c 100644
--- a/docs/implementation/primitive/sort.html
+++ b/docs/implementation/primitive/sort.html
@@ -21,11 +21,18 @@

The name of the game here is "branchless".

Sorting algorithms of interest are counting sort and pdqsort (some improvements of my own to be described here later). However, these are both unusable for Grade.

For small-range Grade, counting sort must be replaced with bucket sort, at a significant performance cost. I don't have any method I'm happy with for other data. Stabilizing pdqsort by sorting the indices at the end is possible but slow. Wolfsort is a hybrid radix/merge sort that might be better.

-

IPS⁴o is a horrifyingly complicated samplesort thing. Stable, I think. For very large arrays it probably has the best memory access patterns, so a few samplesort passes could be useful.

A branchless binary search is adequate for Bins but in many cases—very small or large 𝕨, and small range—there are better methods.
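As a sketch of what "branchless" means for a single lookup, here is a plain branchless binary search in C. The names and the exact comparison convention are illustrative only, not taken from the actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

// Branchless binary search over a sorted list w of length n: returns the
// number of elements of w that are less than x (flip the comparison for the
// other Bins convention). The select is a multiply by a 0/1 comparison
// result, so there is no data-dependent branch to mispredict.
static size_t bins_one(const int32_t *w, size_t n, int32_t x) {
  const int32_t *base = w;
  size_t len = n;
  while (len > 1) {
    size_t half = len / 2;
    base += (size_t)(base[half - 1] < x) * half;  // advance or stay, branchlessly
    len -= half;
  }
  return (size_t)(base - w) + (len == 1 && base[0] < x);
}
```

Each halving is a conditional move rather than a branch, so the running time depends only on the length of 𝕨, not on where x lands.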

Counting and bucket sort

Both counting and bucket sort are small-range algorithms that begin by counting the number of each possible value. Bucket sort, as used here, means that the counts are then used to place values in the appropriate position in the result in another pass. Counting sort does not read from the initial values again and instead reconstructs them from the counts. It might be written (/≠¨)(-min) in BQN, with ¨ as a single efficient operation.
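To make the distinction concrete, here is a minimal counting sort sketch in C for 32-bit data. It assumes the caller supplies min and max, and it omits everything a real implementation would add (overflow checks, a vectorized counting loop, fill-based output).

```c
#include <stdint.h>
#include <stdlib.h>

// Counting sort sketch for small-range int32 data. After the counting pass
// the input values are never read again: the result is rebuilt purely from
// the counts, which is why this approach cannot produce a grade.
static void counting_sort_i32(int32_t *x, size_t n, int32_t min, int32_t max) {
  size_t range = (size_t)(max - min) + 1;
  size_t *count = calloc(range, sizeof *count);
  for (size_t i = 0; i < n; i++) count[x[i] - min]++;      // count each value
  size_t j = 0;
  for (size_t v = 0; v < range; v++)                       // rewrite from counts
    for (size_t c = count[v]; c--; ) x[j++] = min + (int32_t)v;
  free(count);
}
```

The second loop writes runs of a constant, so in practice it is a sequence of fills rather than element-by-element stores.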

Bucket sort can be used for Grade or sort-by, but counting sort only works for sorting itself. It's not-even-unstable: there's no connection between result values and the input values except that they are constructed to be equal. But with fast Indices, counting sort is vastly more powerful, and is effective with a range four to eight times the argument length. This is large enough that it might pose a memory usage problem, but the memory use can be made arbitrarily low by partitioning.
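A corresponding bucket-sort pass that produces a grade instead might look like the following sketch (again with caller-supplied min and max; the stable placement loop is what makes it usable for Grade).

```c
#include <stdint.h>
#include <stdlib.h>

// Bucket-sort grade sketch for small-range data: r[k] receives the input
// index of the k-th smallest element, with ties kept in input order.
static void bucket_grade_i32(const int32_t *x, size_t n,
                             int32_t min, int32_t max, size_t *r) {
  size_t range = (size_t)(max - min) + 1;
  size_t *off = calloc(range + 1, sizeof *off);
  for (size_t i = 0; i < n; i++) off[x[i] - min + 1]++;     // counts, shifted by one
  for (size_t v = 0; v < range; v++) off[v + 1] += off[v];  // exclusive prefix sum
  for (size_t i = 0; i < n; i++) r[off[x[i] - min]++] = i;  // stable placement
  free(off);
}
```

The extra placement pass over the input is the performance cost mentioned above: unlike counting sort, every element has to be read again and scattered to its slot.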

+

Other sorting algorithms

+

IPS⁴o is a horrifyingly complicated samplesort thing. Unstable, but there's also a stable not-in-place version PS⁴o. For very large arrays it probably has the best memory access patterns, so a few samplesort passes could be useful.

+

Vergesort has another useful first-pass strategy, which spends an asymptotically small amount of time searching for runs before sorting. Since it only detects perfect runs it won't give the full adaptivity of a good merge sort.
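The core test is just a scan for an already-sorted run; vergesort keeps the cost asymptotically small by probing in large jumps (on the order of n/log n) and only scanning fully around probes that look sorted. A minimal version of the scan itself, as a sketch:

```c
#include <stddef.h>
#include <stdint.h>

// Report the length of the nondecreasing run starting at x[0].
// A reverse-sorted run would be detected the same way with >=.
static size_t run_length(const int32_t *x, size_t n) {
  size_t i = 1;
  while (i < n && x[i - 1] <= x[i]) i++;
  return i;  // x[0..i) is already sorted
}
```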

+

Sorting networks compare and swap elements in a fixed pattern, and so can be implemented with branchless or even vectorized code. They're great for sorting many small arrays of the same size, but without hardware specialization the size limit before insertion sort beats them will be pretty small.
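For example, a 4-element network needs only five compare-exchanges, each of which can be written without branches. This is a scalar sketch; a vectorized version would run many such networks in parallel across lanes.

```c
#include <stdint.h>

// Branchless compare-exchange: afterwards *a <= *b. With optimization the
// ternaries compile to min/max or conditional moves, not branches.
static inline void cex(int32_t *a, int32_t *b) {
  int32_t lo = *a < *b ? *a : *b;
  int32_t hi = *a < *b ? *b : *a;
  *a = lo; *b = hi;
}

// Optimal 5-comparator network for 4 elements.
static void sort4(int32_t v[4]) {
  cex(&v[0], &v[1]); cex(&v[2], &v[3]);   // sort the two pairs
  cex(&v[0], &v[2]); cex(&v[1], &v[3]);   // merge across pairs
  cex(&v[1], &v[2]);                      // final middle exchange
}
```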

+

SIMD sorting

+

A few people have done some work on merge sorting with AVX2 or AVX-512: two examples. Pretty complicated, and still mostly in the proof of concept stage, but the benchmarks on uniform random arrays are good. Can these be made adaptive?

+

ChipSort seems further along than those. It uses sorting networks, comb sort, and merging, which all fit nicely with SIMD and should work well together.

+

Or AVX can speed up quicksort. I suspect this is more of a marginal improvement (over BlockQuicksort/pdqsort discussed below) relative to merge sort. If partitioning is fast enough it might make stable quicksort viable.

Partitioning

In-place quicksort relies on a partitioning algorithm that exchanges elements in order to split them into two contiguous groups. The Hoare partition scheme does this, and BlockQuicksort showed that it can be performed quickly with branchless index generation; this method was then adopted by pdqsort. But the bit-booleans-to-indices method is faster and fits well with vectorized comparisons.
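A scalar sketch of the BlockQuicksort-style index generation, collecting within one block the offsets of elements that need to move to the other side. The block size and types are illustrative; the booleans-to-indices variant would instead turn a packed comparison mask directly into these offsets.

```c
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 64 };  // illustrative block size

// For one block, record the positions of elements on the wrong side of the
// pivot. The comparison result is used as a cursor increment: every slot is
// written, but the cursor only advances for a "wrong side" element, so
// there is no data-dependent branch.
static size_t collect_offsets(const int32_t *x, int32_t pivot,
                              uint8_t offsets[BLOCK]) {
  size_t num = 0;
  for (size_t i = 0; i < BLOCK; i++) {
    offsets[num] = (uint8_t)i;   // always store...
    num += (x[i] >= pivot);      // ...but only keep it when the element must move
  }
  return num;
}
```

BlockQuicksort gathers offsets like these from both ends of the array and then swaps the flagged elements pairwise; only one side is shown here.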

It's simplest to define an operation P that partitions a list 𝕩 according to a boolean list 𝕨. Partitioning permutes 𝕩 so that all elements corresponding to 0 in 𝕨 come before those corresponding to 1. The quicksort partition step, with pivot t, is (t<𝕩)P𝕩, and the comparison can be vectorized. Interleaving comparison and partitioning in chunks would save memory (a fraction of the size of 𝕩, which should have 32- or 64-bit elements because plain counting sort is best for smaller ones) but hardly speeds things up: only a few percent, and only for huge lists with hundreds of millions of elements. The single-step P is also good for Bins, where the boolean 𝕨 will have to be saved.
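A sketch of P on 32-bit data, using a scratch buffer for the 1 side so the result stays stable. The names and the buffer strategy are illustrative assumptions, not a description of the actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

// Partition x by a boolean mask: elements with mask 0 end up before those
// with mask 1, each side keeping its original relative order. Both writes
// happen every iteration; only the cursor increments depend on the mask,
// so the loop has no data-dependent branches.
static void partition_by_mask(int32_t *x, const uint8_t *mask, size_t n,
                              int32_t *scratch) {
  size_t lo = 0, hi = 0;
  for (size_t i = 0; i < n; i++) {
    int32_t v = x[i];
    x[lo] = v;        lo += !mask[i];   // pack the 0s to the left in place
    scratch[hi] = v;  hi += mask[i];    // stage the 1s
  }
  for (size_t i = 0; i < hi; i++) x[lo + i] = scratch[i];  // append the 1s
}
```

For the quicksort step the caller would fill the mask with a vectorized pivot comparison first; for Bins that same mask is the part that has to be kept.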

--
cgit v1.2.3