More literature review

author: Marshall Lochbaum <mwlochbaum@gmail.com> 2021-07-10 22:13:03 -0400
committer: Marshall Lochbaum <mwlochbaum@gmail.com> 2021-07-10 22:13:03 -0400
commit: 9d6009137b08c2faa2006fe2d98ec856c99236ec (patch)
tree: dd88f7a6c1bdcf0b93c9f73476c5b68cc50256d3 /implementation
parent: 10ec9ac75cd7fd73428a913b54ff0206c54773a9 (diff)
1 files changed, 16 insertions, 2 deletions
diff --git a/implementation/primitive/sort.md b/implementation/primitive/sort.md
index cd6f470e..8a5c99e7 100644
--- a/implementation/primitive/sort.md
+++ b/implementation/primitive/sort.md
@@ -34,8 +34,6 @@ Sorting algorithms of interest are counting sort and [pdqsort](https://github.co
 
 For small-range Grade, counting sort must be replaced with bucket sort, at a significant performance cost. I don't have any method I'm happy with for other data. Stabilizing pdqsort by sorting the indices at the end is possible but slow. [Wolfsort](https://github.com/scandum/wolfsort) is a hybrid radix/merge sort that might be better.
 
-[IPS⁴o](https://github.com/ips4o/ips4o) is a horrifyingly complicated samplesort thing. Stable, I think. For very large arrays it probably has the best memory access patterns, so a few samplesort passes could be useful.
-
 A branchless binary search is adequate for Bins but in many cases—very small or large `𝕨`, and small range—there are better methods.
 
 ### Counting and bucket sort
@@ -44,6 +42,22 @@ Both counting and bucket sort are small-range algorithms that begin by counting
 
 Bucket sort can be used for Grade or sort-by (`⍋⊸⊏`), but counting sort only works for sorting itself. It's not-even-unstable: there's no connection between result values and the input values except that they are constructed to be equal. But with [fast Indices](replicate.md#non-booleans-to-indices), Counting sort is vastly more powerful, and is effective with a range four to eight times the argument length. This is large enough that it might pose a memory usage problem, but the memory use can be made arbitrarily low by partitioning.
 
+### Other sorting algorithms
+
+[IPS⁴o](https://github.com/ips4o/ips4o) is a horrifyingly complicated samplesort thing. Unstable, but there's also a stable not-in-place version PS⁴o. For very large arrays it probably has the best memory access patterns, so a few samplesort passes could be useful.
+
+[Vergesort](https://github.com/Morwenn/vergesort) has another useful first-pass strategy, which spends an asymptotically small amount of time searching for runs before sorting. Since it only detects perfect runs it won't give the full adaptivity of a good merge sort.
+
+Sorting networks compare and swap elements in a fixed pattern, and so can be implemented with branchless or even vectorized code. They're great for sorting many small arrays of the same size, but the limit before insertion sort beats it will be pretty small without hardware specialization.
+
+#### SIMD sorting
+
+A few people have done some work on merge sorting with AVX2 or AVX-512: [two](https://github.com/sid1607/avx2-merge-sort) [examples](https://github.com/PatwinchIR/ultra-sort). Pretty complicated, and still mostly in the proof of concept stage, but the benchmarks on uniform random arrays are good. Can these be made adaptive?
+
+[ChipSort](https://github.com/nlw0/ChipSort.jl) seems further along than those. It uses sorting networks, comb sort, and merging, which all fit nicely with SIMD and should work well together.
+
+Or AVX can [speed up](https://github.com/WojciechMula/simd-sort) quicksort. I suspect this is more of a marginal improvement (over BlockQuicksort/pdqsort discussed below) relative to merge sort. If partitioning is fast enough it might make stable quicksort viable.
+
 ### Partitioning
 
 In-place quicksort relies on a partitioning algorithm that exchanges elements in order to split them into two contiguous groups. The [Hoare partition scheme](https://en.wikipedia.org/wiki/Quicksort#Hoare_partition_scheme) does this, and [BlockQuicksort](https://github.com/weissan/BlockQuicksort) showed that it can be performed quickly with branchless index generation; this method was then adopted by pdqsort. But the [bit booleans to indices](replicate.md#booleans-to-indices) method is faster and fits well with vectorized comparisons.
author	Marshall Lochbaum <mwlochbaum@gmail.com>	2021-07-10 22:13:03 -0400
committer	Marshall Lochbaum <mwlochbaum@gmail.com>	2021-07-10 22:13:03 -0400
commit	9d6009137b08c2faa2006fe2d98ec856c99236ec (patch)
tree	dd88f7a6c1bdcf0b93c9f73476c5b68cc50256d3 /implementation
parent	10ec9ac75cd7fd73428a913b54ff0206c54773a9 (diff)