author    Marshall Lochbaum <mwlochbaum@gmail.com>  2021-08-11 17:21:31 -0400
committer Marshall Lochbaum <mwlochbaum@gmail.com>  2021-08-11 17:25:04 -0400
commit    2afb23928e1984d475cc460e1672e8f6fa0e4dbe (patch)
tree      ebd2cc514294d30b6fa4b36c2ee638326f06ef72 /docs/implementation/primitive
parent    eac61ca02074c218667754d5f4ef562e780bae85 (diff)
Allow clicking on header to get fragment link
Diffstat (limited to 'docs/implementation/primitive')
-rw-r--r--  docs/implementation/primitive/index.html     |  2
-rw-r--r--  docs/implementation/primitive/random.html    |  6
-rw-r--r--  docs/implementation/primitive/replicate.html | 18
-rw-r--r--  docs/implementation/primitive/sort.html      | 24
4 files changed, 25 insertions, 25 deletions
diff --git a/docs/implementation/primitive/index.html b/docs/implementation/primitive/index.html
index b5256302..f139e5f4 100644
--- a/docs/implementation/primitive/index.html
+++ b/docs/implementation/primitive/index.html
@@ -4,7 +4,7 @@
<title>BQN: Primitive implementation notes</title>
</head>
<div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a></div>
-<h1 id="primitive-implementation-notes">Primitive implementation notes</h1>
+<h1 id="primitive-implementation-notes"><a class="header" href="#primitive-implementation-notes">Primitive implementation notes</a></h1>
<p>Commentary on the best methods I know for implementing various primitives. Often there are many algorithms that are viable in different situations, and in these cases I try to discuss the tradeoffs.</p>
<ul>
<li><a href="replicate.html">Replicate</a></li>
diff --git a/docs/implementation/primitive/random.html b/docs/implementation/primitive/random.html
index 7f660792..c907ee1c 100644
--- a/docs/implementation/primitive/random.html
+++ b/docs/implementation/primitive/random.html
@@ -4,12 +4,12 @@
<title>BQN: Implementation of random stuff</title>
</head>
<div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a> / <a href="index.html">primitive</a></div>
-<h1 id="implementation-of-random-stuff">Implementation of random stuff</h1>
+<h1 id="implementation-of-random-stuff"><a class="header" href="#implementation-of-random-stuff">Implementation of random stuff</a></h1>
<p>Not a primitive, but CBQN's <code><span class='Function'>•MakeRand</span></code> initializes a random number generator that has some built-in utilities. For clarity we'll call a result of this initialization <code><span class='Value'>rand</span></code> in the text below.</p>
-<h2 id="random-number-generation">Random number generation</h2>
+<h2 id="random-number-generation"><a class="header" href="#random-number-generation">Random number generation</a></h2>
<p>CBQN is currently using wyrand, part of the <a href="https://github.com/wangyi-fudan/wyhash">wyhash</a> library. It's extremely fast, passes the expected test suites, and no one's raised any concerns about it yet (but it's very new). It uses only 64 bits of state and doesn't have extra features like jump ahead.</p>
<p>Other choices are <a href="https://prng.di.unimi.it/">xoshiro++</a> and <a href="https://www.pcg-random.org/">PCG</a>. The authors of these algorithms (co-author for xoshiro) hate each other very much and have spent quite some time slinging mud at each other. As far as I can tell they both have the normal small bias in favor of their own algorithms but are wildly unfair towards the other side, choosing misleading examples and inflating minor issues. I think both generators are good but find the case for xoshiro a little more convincing, and I think it's done better in third-party benchmarks.</p>
-<h2 id="simple-random-sample">Simple random sample</h2>
+<h2 id="simple-random-sample"><a class="header" href="#simple-random-sample">Simple random sample</a></h2>
<p>A <a href="https://en.wikipedia.org/wiki/Simple_random_sample">simple random sample</a> from a set is a subset with a specified size, chosen so that each subset of that size has equal probability. <code><span class='Value'>rand.</span><span class='Function'>Deal</span></code> gets a sample of size <code><span class='Value'>𝕨</span></code> from the set <code><span class='Function'>↕</span><span class='Value'>𝕩</span></code> with elements in a uniformly random order, and <code><span class='Value'>rand.</span><span class='Function'>Subset</span></code> does the same but sorts the elements.</p>
<p><code><span class='Function'>Deal</span></code> uses a <a href="https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle">Knuth shuffle</a>, stopping after the first <code><span class='Value'>𝕨</span></code> elements have been shuffled, as the algorithm won't touch them again. Usually it creates <code><span class='Function'>↕</span><span class='Value'>𝕩</span></code> explicitly for this purpose, but if <code><span class='Value'>𝕨</span></code> is very small then initializing it is too slow. In this case we initialize <code><span class='Function'>↕</span><span class='Value'>𝕨</span></code>, but use a &quot;hash&quot; table with an identity hash—the numbers are already random—for <code><span class='Value'>𝕨</span><span class='Function'>↓↕</span><span class='Value'>𝕩</span></code>. The default is that every value in the table is equal to its key, so that only entries where a swap has happened need to be stored. The hash table is the same design as for self-search functions, with open addressing and linear probing.</p>
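<p>As a rough C sketch (not CBQN's code), the truncated shuffle with an explicit <code><span class='Function'>↕</span><span class='Value'>𝕩</span></code> looks like this, where <code>rand_below</code> stands in for whatever uniform generator is available:</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;

extern uint64_t rand_below(uint64_t bound);  /* assumed: uniform in [0,bound) */

/* Write a uniformly ordered w-element sample of 0..n-1 to dst (w &lt;= n). */
void deal(uint32_t *dst, uint64_t w, uint64_t n) {
    uint32_t *idx = malloc(n * sizeof(uint32_t));
    for (uint64_t i = 0; i &lt; n; i++) idx[i] = (uint32_t)i;  /* explicit iota */
    for (uint64_t i = 0; i &lt; w; i++) {    /* Knuth shuffle, stopped early */
        uint64_t j = i + rand_below(n - i);
        uint32_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        dst[i] = idx[i];                  /* slots below i are never revisited */
    }
    free(idx);
}
</pre>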
<p><code><span class='Function'>Subset</span></code> uses <a href="https://math.stackexchange.com/questions/178690/whats-the-proof-of-correctness-for-robert-floyds-algorithm-for-selecting-a-sin">Floyd's method</a>, which is sort of a modification of shuffling where only the selected elements need to be stored, not what they were swapped with. This requires a lookup structure that can be updated efficiently and output all elements in sorted order. The choices are a bitset for large <code><span class='Value'>𝕨</span></code> and another not-really-hash table for small <code><span class='Value'>𝕨</span></code>. The table uses a right shift—that is, division by a power of two—as a hash so that hashing preserves the ordering, and inserts like an insertion sort: any larger entries are pushed forward. Really this is an online sorting algorithm that works because we know the input distribution is well-behaved (it degrades to quadratic performance only in very unlikely cases). When <code><span class='Value'>𝕨</span><span class='Function'>&gt;</span><span class='Value'>𝕩</span><span class='Function'>÷</span><span class='Number'>2</span></code>, we always use a bitset, but select <code><span class='Value'>𝕩</span><span class='Function'>-</span><span class='Value'>𝕨</span></code> elements and invert the selection.</p>
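<p>A sketch of Floyd's method on the bitset structure (the large-<code><span class='Value'>𝕨</span></code> path); scanning the set bits afterwards yields the sorted sample, and <code>rand_below</code> is the same assumed helper as above:</p>
<pre>
/* Mark a w-element subset of 0..n-1 in a zeroed n-bit bitset. */
void subset(uint64_t *bits, uint64_t w, uint64_t n) {
    for (uint64_t j = n - w; j &lt; n; j++) {
        uint64_t t = rand_below(j + 1);        /* candidate in 0..j */
        if (bits[t/64] &gt;&gt; (t%64) &amp; 1) t = j;   /* already chosen: take j */
        bits[t/64] |= 1ull &lt;&lt; (t%64);
    }
}
</pre>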
diff --git a/docs/implementation/primitive/replicate.html b/docs/implementation/primitive/replicate.html
index 9d539d40..9e07a754 100644
--- a/docs/implementation/primitive/replicate.html
+++ b/docs/implementation/primitive/replicate.html
@@ -4,23 +4,23 @@
<title>BQN: Implementation of Indices and Replicate</title>
</head>
<div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a> / <a href="index.html">primitive</a></div>
-<h1 id="implementation-of-indices-and-replicate">Implementation of Indices and Replicate</h1>
+<h1 id="implementation-of-indices-and-replicate"><a class="header" href="#implementation-of-indices-and-replicate">Implementation of Indices and Replicate</a></h1>
<p>The replicate family of functions contains not just primitives but powerful tools for implementing other functionality. The most important is converting <a href="#booleans-to-indices">bits to indices</a>: AVX-512 extensions implement this natively for various index sizes, and even with no SIMD support at all there are surprisingly fast table-based algorithms for it.</p>
<p><a href="#replicate">General replication</a> is more complex. Branching will slow many useful cases down considerably when using the obvious solution. However, branch-free techniques introduce overhead for larger replication amounts. Hybridizing these seems to be the only way, but it's finicky.</p>
<p>Replicate by a <a href="#constant-replicate">constant amount</a> (so <code><span class='Value'>𝕨</span></code> is a single number) is not too common in itself, but it's notable because it can be the fastest way to implement outer products and scalar dyadics with prefix agreement.</p>
-<h2 id="indices">Indices</h2>
+<h2 id="indices"><a class="header" href="#indices">Indices</a></h2>
<p>Branchless algorithms are fastest, but with unbounded values in <code><span class='Value'>𝕨</span></code> a fully branchless algorithm is impossible because you can't write an arbitrary amount of memory without branching. So the best algorithms depend on bounding <code><span class='Value'>𝕨</span></code>. Fortunately the most useful case is that <code><span class='Value'>𝕨</span></code> is boolean.</p>
-<h3 id="booleans-to-indices">Booleans to indices</h3>
+<h3 id="booleans-to-indices"><a class="header" href="#booleans-to-indices">Booleans to indices</a></h3>
<p>Indices (<code><span class='Function'>/</span></code>) on a boolean <code><span class='Value'>𝕩</span></code> of 256 or fewer bits can be made very fast on generic 64-bit hardware using a lookup table on 8 bits at a time. This algorithm can write past the end by up to 8 bytes (7 if trailing 0s are excluded), but never writes more than 256 bytes total. This means it's suitable for writing to an overallocated result array or a 256-byte buffer.</p>
<p>To generate indices, use a 256×8-byte lookup table that goes from bytes to 8-byte index lists, and either a popcount instruction or another lookup table to get the sum of each byte. For each byte in <code><span class='Value'>𝕩</span></code>, get the corresponding indices, add an increment, and write them to the current index in the output. Then increase the output index by the byte's sum. The next indices will overlap the 8 bytes written, with the actual indices kept and junk values at the end overwritten. The increment added is an 8-byte value where each byte contains the current input index (always a multiple of 8); it can be added or bitwise or-ed with the lookup value.</p>
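<p>A sketch of that loop in C, assuming a hypothetical precomputed table <code>tab</code> whose entry for byte <code>b</code> packs the positions of the set bits of <code>b</code> into 8 bytes; the 32-byte input limit keeps every packed index within a byte:</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

extern const uint64_t tab[256];  /* assumed: packed bit positions per byte */

/* x: boolean bits, nbytes &lt;= 32. Returns one past the last index written. */
uint8_t *bits_to_indices(uint8_t *dst, const uint8_t *x, size_t nbytes) {
    uint64_t inc = 0;                     /* each byte holds the input offset */
    for (size_t i = 0; i &lt; nbytes; i++) {
        uint64_t idx = tab[x[i]] | inc;   /* or +: offsets are multiples of 8 */
        memcpy(dst, &amp;idx, 8);             /* oversized write; junk overwritten */
        dst += __builtin_popcount(x[i]);
        inc += 0x0808080808080808ull;     /* advance every byte's offset by 8 */
    }
    return dst;
}
</pre>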
<p>Some other methods are discussed by <a href="https://branchfree.org/2018/05/22/bits-to-indexes-in-bmi2-and-avx-512/">Langdale</a> and <a href="https://lemire.me/blog/2018/03/08/iterating-over-set-bits-quickly-simd-edition/">Lemire</a>. I think very large lookup tables are not good for an interpreter because they cause too much cache pressure if used occasionally on smaller arrays. This rules out many of these strategies.</p>
-<h3 id="non-booleans-to-indices">Non-booleans to indices</h3>
+<h3 id="non-booleans-to-indices"><a class="header" href="#non-booleans-to-indices">Non-booleans to indices</a></h3>
<p>If the maximum value in <code><span class='Value'>𝕩</span></code> is, say, 8, then generating indices is fairly fast: for each element, write 8 indices and then move the output pointer forward by that much. This is much like the lookup table algorithm above, minus the lookup table. If the indices need to be larger than one byte, it's fine to expand them, and possibly add an offset, after generation (probably in chunks).</p>
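<p>A sketch of the gapless core for byte values bounded by 8, before any index expansion; the length bound keeps indices within a byte:</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* x has values 0..8 and n &lt;= 256, so indices fit in one byte. */
uint8_t *indices_small(uint8_t *dst, const uint8_t *x, size_t n) {
    for (size_t i = 0; i &lt; n; i++) {
        uint64_t eight = 0x0101010101010101ull * i;  /* 8 copies of index i */
        memcpy(dst, &amp;eight, 8);    /* oversized write */
        dst += x[i];               /* keep only x[i] of them */
    }
    return dst;
}
</pre>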
<p>There are two ways I know to fill in the gaps that this method would leave with elements that are too large. First is to stop after such an element and fill remaining space branchfully (maybe with <code><span class='Value'>memset</span></code>). This is maximally efficient if <code><span class='Value'>𝕩</span></code> is dominated by large elements—particularly for 2-byte indices when it skips index expansion—but not good if there are a lot of elements near the threshold. Second, initialize the buffer with 0 and perform <code><span class='Function'>⌈</span><span class='Modifier'>`</span></code> afterwards, or other variations. This eliminates all but a fixed amount of branching, but it's a lot of overhead and I think unless a more sophisticated strategy arises it's best to stick with the first method.</p>
<p>Indices is half of a counting sort: for sparse values, it's the slower half. Making it fast makes counting sort viable for much larger range-to-length ratios.</p>
-<h2 id="replicate">Replicate</h2>
+<h2 id="replicate"><a class="header" href="#replicate">Replicate</a></h2>
<p>For the most part, understanding Indices is the best way to implement Replicate quickly. But this is not the case if <code><span class='Value'>𝕩</span></code> is boolean because then its elements are smaller than any useful index, and faster methods are available.</p>
-<h3 id="compress">Compress</h3>
+<h3 id="compress"><a class="header" href="#compress">Compress</a></h3>
<p>Most of the methods listed below can be performed in place.</p>
<p>For booleans, use BMI2's PEXT (parallel bits extract) instruction, or an emulation of it. The result can be built recursively alongside the also-required popcount using masked shifts.</p>
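<p>A word-at-a-time sketch using the hardware instruction (compile with BMI2); the second store handles extracted bits that straddle an output word boundary:</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;
#include &lt;immintrin.h&gt;  /* _pext_u64 */

/* Compress bits of x by mask w into dst; returns the output length in bits. */
size_t compress_bits(uint64_t *dst, const uint64_t *x,
                     const uint64_t *w, size_t nw) {
    memset(dst, 0, nw * sizeof(uint64_t));
    size_t out = 0;                                /* bits written so far */
    for (size_t i = 0; i &lt; nw; i++) {
        uint64_t v = _pext_u64(x[i], w[i]);        /* pack selected bits low */
        unsigned c = __builtin_popcountll(w[i]);   /* number selected */
        unsigned sh = out % 64;
        dst[out/64] |= v &lt;&lt; sh;
        if (sh &amp;&amp; sh + c &gt; 64) dst[out/64 + 1] = v &gt;&gt; (64 - sh);
        out += c;
    }
    return out;
}
</pre>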
<p>The generally best method for small elements seems to be to generate 1-byte indices into a buffer 256 at a time and select with those. There's a branchless method on one bit at a time which is occasionally better, but I don't think the improvement is enough to justify using it.</p>
@@ -28,11 +28,11 @@
<p>Odd-sized cells could be handled with an index buffer like small elements, using oversized writes and either overallocating or handling the last element specially.</p>
<p>For medium-sized cells copying involves partial writes and so is somewhat inefficient. It's better to split <code><span class='Value'>𝕨</span></code> into groups of 1s in order to copy larger chunks from <code><span class='Value'>𝕩</span></code> at once. So the algorithm repeatedly searches <code><span class='Value'>𝕨</span></code> for the next 1, then the next 0, then copies the corresponding value from <code><span class='Value'>𝕩</span></code> to the result. This might be better for small odd-sized cells as well; I haven't implemented the algorithm with oversized writes to compare.</p>
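<p>A scalar sketch of the grouped method; a real implementation would scan <code><span class='Value'>𝕨</span></code> a word at a time with count-trailing-zeros, but the structure is the same:</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Copy the cells of x flagged by boolean bytes w; csz is bytes per cell. */
char *compress_grouped(char *dst, const char *x, const uint8_t *w,
                       size_t n, size_t csz) {
    size_t i = 0;
    while (i &lt; n) {
        while (i &lt; n &amp;&amp; !w[i]) i++;   /* find the next 1 */
        size_t start = i;
        while (i &lt; n &amp;&amp; w[i]) i++;    /* find the next 0 */
        memcpy(dst, x + start*csz, (i - start)*csz);  /* one copy per run */
        dst += (i - start) * csz;
    }
    return dst;
}
</pre>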
<p>The grouped algorithm, as well as a simpler sparse algorithm that just finds each 1 in <code><span class='Value'>𝕨</span></code>, can also be better for small elements. Whether to use these depends on the value of <code><span class='Function'>+</span><span class='Modifier'>´</span><span class='Value'>𝕨</span></code> (sparse) or <code><span class='Function'>+</span><span class='Modifier'>´</span><span class='Function'>»</span><span class='Modifier2'>⊸</span><span class='Function'>&lt;</span><span class='Value'>𝕨</span></code> (clumped). The checking is fast and these cases are common, but the general case is also fast enough that this is not a particularly high priority.</p>
-<h3 id="replicate">Replicate</h3>
+<h3 id="replicate"><a class="header" href="#replicate">Replicate</a></h3>
<p>Like Compress I think the best algorithm is often to generate small indices in a buffer and then select. But this is inefficient when <code><span class='Value'>𝕨</span></code> contains large values, so those need to be detected and handled. Very tricky.</p>
-<h4 id="constant-replicate">Constant replicate</h4>
+<h4 id="constant-replicate"><a class="header" href="#constant-replicate">Constant replicate</a></h4>
<p>Useful for outer products and leading-axis extension. See <a href="https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrinking-time/">Expanding Bits in Shrinking Time</a> for the boolean case. C compilers will generate decent code for constant small numbers and variable large ones, but I think specialized code with shuffle would be better for small numbers.</p>
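<p>For reference, the plain loop that compilers already handle reasonably when <code>k</code> is a small constant; a specialized version would replace the inner loop with shuffles:</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;

/* Repeat each of the n elements of x exactly k times. */
void replicate_const(uint32_t *dst, const uint32_t *x, size_t n, size_t k) {
    for (size_t i = 0; i &lt; n; i++)
        for (size_t j = 0; j &lt; k; j++)
            *dst++ = x[i];
}
</pre>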
-<h3 id="higher-ranks">Higher ranks</h3>
+<h3 id="higher-ranks"><a class="header" href="#higher-ranks">Higher ranks</a></h3>
<p>When replicating along the first axis only, additional axes only change the element size (these are the main reason why a large element method is given). Replicating along a later axis offers a few opportunities for improvement relative to replicating each cell individually.</p>
<p>Particularly for boolean <code><span class='Value'>𝕨</span></code>, Select is usually faster than Replicate (a major exception is for a boolean <code><span class='Value'>𝕩</span></code>). Simply replacing <code><span class='Function'>/</span></code> with <code><span class='Function'>/</span><span class='Modifier'>¨</span><span class='Modifier2'>⊸</span><span class='Function'>⊏</span></code> (after checking conformability) could be an improvement. It's probably best to compute the result shape first to avoid doing any work if it's empty. Similarly, if early result axes are small then the overhead of separating out Indices might make it worse than just doing the small number of Replicates.</p>
<p>A technique that applies when <code><span class='Value'>𝕨</span></code> is processed one or more bytes at a time and covers many rows is to repeat it up to a whole number of bytes and combine rows of <code><span class='Value'>𝕩</span></code> into longer virtual rows (the last one can be short). I think this only ends up being useful when <code><span class='Value'>𝕩</span></code> is boolean.</p>
diff --git a/docs/implementation/primitive/sort.html b/docs/implementation/primitive/sort.html
index 2a417770..5c6c3d70 100644
--- a/docs/implementation/primitive/sort.html
+++ b/docs/implementation/primitive/sort.html
@@ -4,35 +4,35 @@
<title>BQN: Implementation of ordering functions</title>
</head>
<div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a> / <a href="index.html">primitive</a></div>
-<h1 id="implementation-of-ordering-functions">Implementation of ordering functions</h1>
+<h1 id="implementation-of-ordering-functions"><a class="header" href="#implementation-of-ordering-functions">Implementation of ordering functions</a></h1>
<p>The <a href="../../doc/order.html">ordering functions</a> are Sort (<code><span class='Function'>∧∨</span></code>), Grade (<code><span class='Function'>⍋⍒</span></code>), and Bins (<code><span class='Function'>⍋⍒</span></code>). Although these are well-studied—particularly sorting, and then binary search or &quot;predecessor search&quot;—there are many recent developments, as well as techniques that I have not found in the literature. The three functions are closely related but have important differences in what algorithms are viable. Sorting is a remarkably deep problem with different algorithms able to do a wide range of amazing things, and sophisticated ways to combine those. It is by no means solved. In comparison, Bins is pretty tame.</p>
<p>There's a large divide between ordering compound data and simple data. For compound data comparisons are expensive, and the best algorithm will generally be the one that uses the fewest comparisons. For simple data they fall somewhere between cheap and extremely cheap, and fancy branchless and vectorized algorithms are the best.</p>
-<h2 id="on-quicksort-versus-merge-sort">On quicksort versus merge sort</h2>
+<h2 id="on-quicksort-versus-merge-sort"><a class="header" href="#on-quicksort-versus-merge-sort">On quicksort versus merge sort</a></h2>
<p>Merge sort is better. It is deterministic, stable, and has optimal worst-case performance. Its pattern handling is better: while merge sort handles &quot;horizontal&quot; patterns and quicksort does &quot;vertical&quot; ones, merge sort gets useful work out of <em>any</em> sequence of runs but in-place quicksort will quickly mangle its analogue until it may as well be random.</p>
<p>But that doesn't mean merge sort is always faster. Quicksort seems to work a little better branchlessly. For sorting, quicksort's partitioning can reduce the range of the data enough to use an extremely quick counting sort. Partitioning is also a natural fit for binary search, where it's mandatory for sensible cache behavior with large enough arguments. So it can be useful. But it doesn't merge, and can't easily be made to merge, and that's a shame.</p>
<p>The same applies to the general categories of partitioning sorts (quicksort, radix sort, samplesort) and merging sorts (mergesort, timsort, multimerges). Radix sorts are definitely the best for some types and lengths, although the scattered accesses make their performance unpredictable and I think overall they're not worth it. A million uniformly random 4-byte integers is nearly the best possible case for radix sort, so the fact that this seems to be the go-to sorting benchmark means radix sorting looks better than it is.</p>
-<h2 id="on-binary-search">On binary search</h2>
+<h2 id="on-binary-search"><a class="header" href="#on-binary-search">On binary search</a></h2>
<p>Binary searches are very easy to get wrong. Do not write <code><span class='Paren'>(</span><span class='Value'>hi</span><span class='Function'>+</span><span class='Value'>lo</span><span class='Paren'>)</span><span class='Function'>/</span><span class='Number'>2</span></code>: it's not safe from overflows. I always follow the pattern given in the first code block <a href="https://pvk.ca/Blog/2015/11/29/retrospective-on-binary-search-and-on-compression-slash-compilation/">here</a>. This code will never access the value <code><span class='Value'>*base</span></code>, so it should be considered a search on the <code><span class='Value'>n</span><span class='Function'>-</span><span class='Number'>1</span></code> values beginning at <code><span class='Value'>base</span><span class='Function'>+</span><span class='Number'>1</span></code> (the perfect case is when the number of values is one less than a power of two, which is in fact how it has to go). It's branchless and always takes the same number of iterations. To get a version that stops when the answer is known, subtract <code><span class='Value'>n%</span><span class='Number'>2</span></code> from <code><span class='Value'>n</span></code> in the case that <code><span class='Value'>*mid</span> <span class='Function'>&lt;</span> <span class='Value'>x</span></code>.</p>
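<p>A paraphrase of that pattern for 4-byte values (the linked post's code, not CBQN's): it compares against the midpoint and conditionally advances the base pointer, which compilers turn into a conditional move:</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;

/* Branchless search over the n-1 values at base+1; base[0] is never read.
   Returns the position of the last value &lt; x (0 if there is none). */
size_t search(const uint32_t *base, size_t n, uint32_t x) {
    const uint32_t *ptr = base;
    while (n &gt; 1) {
        size_t half = n / 2;
        ptr = (ptr[half] &lt; x) ? ptr + half : ptr;  /* cmov, no branch */
        n -= half;
    }
    return (size_t)(ptr - base);
}
</pre>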
-<h2 id="compound-data">Compound data</h2>
+<h2 id="compound-data"><a class="header" href="#compound-data">Compound data</a></h2>
<p>Array comparisons are expensive. The goal here is almost entirely to minimize the number of comparisons. That's a much less complex goal than getting the most out of modern hardware, so the algorithms here are simpler.</p>
<p>For <strong>Sort</strong> and <strong>Grade</strong>, use Timsort. It's time-tested and shows no signs of weakness (but do be sure to pick up a fix for the bug discovered by formal verification in 2015). It comes close to the optimal comparison count on random data, and its pattern handling is outstanding. Grade can be done either by selecting from the original array to order indices or by moving the data around in the same order as the indices. I think the second of these ends up being substantially better for small-ish elements.</p>
<p>For <strong>Bins</strong>, use a branching binary search: see <a href="#on-binary-search">On binary search</a> above. But there are also interesting (although, I expect, rare) cases where only one argument is compound. Elements of this argument should be reduced to fit the type of the other argument, then compared to multiple elements. For the right argument, this just means reducing before doing whatever binary search is appropriate to the left argument. If the left argument is compound, its elements should be used as partitions. Then switch back to binary search only when the partitions get very small—probably one element.</p>
-<h2 id="simple-data">Simple data</h2>
+<h2 id="simple-data"><a class="header" href="#simple-data">Simple data</a></h2>
<p>The name of the game here is &quot;branchless&quot;.</p>
<p>For sorting, the fastest algorithms for generic data and generic hardware are branchless <a href="#quicksort">quicksorts</a>. Fluxsort is new and very exciting because it's a <em>stable</em> algorithm that's substantially faster than runner-up pdqsort on random arrays. However, it's immature and is missing a lot of the specialized strategies pdqsort has. I'm working on adapting these improvements to work for stable sorting and also on hybridizing with counting/bucket sort.</p>
<p>A branchless binary search is adequate for Bins but in many cases—very small or large <code><span class='Value'>𝕨</span></code>, and small range—there are better methods.</p>
-<h3 id="counting-and-bucket-sort">Counting and bucket sort</h3>
+<h3 id="counting-and-bucket-sort"><a class="header" href="#counting-and-bucket-sort">Counting and bucket sort</a></h3>
<p>Both counting and bucket sort are small-range algorithms that begin by counting the number of each possible value. Bucket sort, as used here, means that the counts are then used to place values in the appropriate position in the result in another pass. Counting sort does not read from the initial values again and instead reconstructs them from the counts. It might be written <code><span class='Paren'>(</span><span class='Function'>/≠</span><span class='Modifier'>¨</span><span class='Modifier2'>∘</span><span class='Function'>⊔</span><span class='Paren'>)</span><span class='Modifier2'>⌾</span><span class='Paren'>(</span><span class='Function'>-</span><span class='Modifier2'>⟜</span><span class='Value'>min</span><span class='Paren'>)</span></code> in BQN, with <code><span class='Function'>≠</span><span class='Modifier'>¨</span><span class='Modifier2'>∘</span><span class='Function'>⊔</span></code> as a single efficient operation.</p>
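<p>A sketch of counting sort for 1-byte values, which never rereads the input after the counting pass:</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;

/* Count each value, then rebuild the array from the counts alone. */
void counting_sort_u8(uint8_t *x, size_t n) {
    size_t count[256] = {0};
    for (size_t i = 0; i &lt; n; i++) count[x[i]]++;
    uint8_t *dst = x;
    for (size_t v = 0; v &lt; 256; v++)            /* /≠¨∘⊔ from the text */
        for (size_t c = count[v]; c--; ) *dst++ = (uint8_t)v;
}
</pre>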
<p>Bucket sort can be used for Grade or sort-by (<code><span class='Function'>⍋</span><span class='Modifier2'>⊸</span><span class='Function'>⊏</span></code>), but counting sort only works for sorting itself. It's not-even-unstable: there's no connection between result values and the input values except that they are constructed to be equal. But with <a href="replicate.html#non-booleans-to-indices">fast Indices</a>, counting sort is vastly more powerful, and is effective with a range four to eight times the argument length. This is large enough that it might pose a memory usage problem, but the memory use can be made arbitrarily low by partitioning.</p>
-<h3 id="quicksort">Quicksort</h3>
+<h3 id="quicksort"><a class="header" href="#quicksort">Quicksort</a></h3>
<p><a href="https://github.com/scandum/fluxsort">Fluxsort</a> attains high performance with a branchless stable partition that places one half on top of existing data and the other half somewhere else. One half ends up in the appropriate place in the sorted array. The other is in swap memory, and will be shifted back by subsequent partitions and base-case sorting. Aside from the partitioning strategy, Fluxsort makes a number of other decisions differently from pdqsort, including a fairly complicated merge sort (<a href="https://github.com/scandum/quadsort">Quadsort</a>) as the base case. I haven't fully evaluated these.</p>
<p><a href="https://arxiv.org/abs/2106.05123">This paper</a> gives a good description of <a href="https://github.com/orlp/pdqsort">pdqsort</a>. I'd start with the <a href="https://github.com/rust-lang/rust/blob/master/library/core/src/slice/sort.rs">Rust version</a>, which has some advantages but can still be improved further. The subsections below describe improved <a href="#partitioning">partitioning</a> and an <a href="#initial-pass">initial pass</a> with several benefits. I also found that the pivot randomization methods currently used are less effective because they swap elements that won't become pivots soon; the pivot candidates and randomization targets need to be chosen to overlap. The optimistic insertion sort can also be improved: when a pair of elements is swapped the smaller one should be inserted as usual but the larger one can also be pushed forward at little cost, potentially saving many swaps and handling too-large elements as gracefully as too-small ones.</p>
<p>While the stable partitioning for Fluxsort seems to be an overall better choice, pdqsort's unstable partitioning is what I've worked with in the past. The following sections are written from the perspective of pdqsort and will be rewritten for Fluxsort as the methods are adapted.</p>
-<h4 id="partitioning">Partitioning</h4>
+<h4 id="partitioning"><a class="header" href="#partitioning">Partitioning</a></h4>
<p>In-place quicksort relies on a partitioning algorithm that exchanges elements in order to split them into two contiguous groups. The <a href="https://en.wikipedia.org/wiki/Quicksort#Hoare_partition_scheme">Hoare partition scheme</a> does this, and <a href="https://github.com/weissan/BlockQuicksort">BlockQuicksort</a> showed that it can be performed quickly with branchless index generation; this method was then adopted by pdqsort. But the <a href="replicate.html#booleans-to-indices">booleans to indices</a> method is faster and fits well with vectorized comparisons.</p>
<p>It's simplest to define an operation <code><span class='Function'>P</span></code> that partitions a list <code><span class='Value'>𝕩</span></code> according to a boolean list <code><span class='Value'>𝕨</span></code>. Partitioning permutes <code><span class='Value'>𝕩</span></code> so that all elements corresponding to 0 in <code><span class='Value'>𝕨</span></code> come before those corresponding to 1. The quicksort partition step, with pivot <code><span class='Value'>t</span></code>, is <code><span class='Paren'>(</span><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span><span class='Paren'>)</span><span class='Function'>P</span><span class='Value'>𝕩</span></code>, and the comparison can be vectorized. Interleaving comparison and partitioning in chunks would save memory (a fraction of the size of <code><span class='Value'>𝕩</span></code>, which should have 32- or 64-bit elements because plain counting sort is best for smaller ones) but hardly speeds things up: only a few percent, and only for huge lists with hundreds of millions of elements. The single-step <code><span class='Function'>P</span></code> is also good for Bins, where the boolean <code><span class='Value'>𝕨</span></code> will have to be saved.</p>
<p>For binary search <code><span class='Value'>𝕨</span><span class='Function'>⍋</span><span class='Value'>𝕩</span></code>, partitioning allows one pivot element <code><span class='Value'>t</span></code> from <code><span class='Value'>𝕨</span></code> to be compared to all of <code><span class='Value'>𝕩</span></code> at once, instead of the normal strategy of working with one element from <code><span class='Value'>𝕩</span></code> at a time. <code><span class='Value'>𝕩</span></code> is partitioned according to <code><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span></code>, then result values are found by searching the first half of <code><span class='Value'>𝕨</span></code> for the smaller elements and the second half for the larger ones, and then they are put back in the correct positions by reversing the partitioning. Because Hoare partitioning works by swapping independent pairs of elements, <code><span class='Function'>P</span></code> is self-inverse, identical to <code><span class='Function'>P</span><span class='Modifier'>⁼</span></code>. So the last step is simple, provided the partitioning information <code><span class='Value'>t</span><span class='Function'>≤</span><span class='Value'>𝕩</span></code> is saved.</p>
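<p>A scalar sketch of <code><span class='Function'>P</span></code> that makes the self-inverse property visible: the pairs swapped depend only on <code><span class='Value'>𝕨</span></code>, so running it twice with the same flags restores the original order. A fast version would generate the swap positions with the booleans-to-indices method:</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;

/* Permute x so elements flagged 0 in w come before those flagged 1,
   swapping misplaced pairs inward from both ends (Hoare style). */
void partition(const uint8_t *w, uint32_t *x, size_t n) {
    size_t i = 0, j = n;
    for (;;) {
        while (i &lt; j &amp;&amp; !w[i]) i++;    /* left element already in place */
        while (i &lt; j &amp;&amp; w[j-1]) j--;   /* right element already in place */
        if (j - i &lt; 2) break;
        uint32_t t = x[i]; x[i] = x[j-1]; x[j-1] = t;
        i++; j--;
    }
}
</pre>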
-<h4 id="initial-pass">Initial pass</h4>
+<h4 id="initial-pass"><a class="header" href="#initial-pass">Initial pass</a></h4>
<p>An initial pass for pdqsort (or another in-place quicksort) provides a few advantages:</p>
<ul>
<li>Recognize sorted and reverse-sorted arrays as fast as possible</li>
@@ -44,15 +44,15 @@
<p>Finding an initial run is fast as well. Compare the first two elements to determine direction, then search for the first pair that have the opposite direction (this can be vectorized because overreading is fine). This run can be used as the first range block, because the maximum and minimum are the two elements at the ends of the run.</p>
<p>At the start of sorting, swap the smallest element to the beginning and the largest to the end, and shrink the size of the array by one in each direction. Now the element before the array is a lower bound and the one after is an upper bound. This property can also be maintained as the array is partitioned, by placing a pivot element between the two halves (swap it to one side of the array before partitioning and to the middle afterwards). As a result, it's always safe to use unguarded insertion sort, and an upper bound for the range of the array can always be found using the difference between the elements before and after it. Now finding the range is fast enough to check for counting sort at every recursion.</p>
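<p>A sketch of the unguarded insertion sort this enables, assuming the slot just before the array holds a lower bound as described:</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;

/* Insertion sort with no bounds check: x[-1] must be a lower bound. */
void insertion_unguarded(uint32_t *x, size_t n) {
    for (size_t i = 1; i &lt; n; i++) {
        uint32_t v = x[i];
        uint32_t *p = x + i;
        while (p[-1] &gt; v) { *p = p[-1]; p--; }  /* sentinel stops the scan */
        *p = v;
    }
}
</pre>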
<p>This is a very simple initial pass; a more sophisticated one might be beneficial. If the array starts with a large run then there could be more of them. There may also be sampling-based tests to find when merge sort is better, even if the runs aren't perfect (but is this actually common?).</p>
-<h3 id="other-sorting-algorithms">Other sorting algorithms</h3>
+<h3 id="other-sorting-algorithms"><a class="header" href="#other-sorting-algorithms">Other sorting algorithms</a></h3>
<p><a href="https://github.com/ips4o/ips4o">IPS⁴o</a> is a horrifyingly complicated samplesort thing. Unstable, but there's also a stable not-in-place version PS⁴o. For very large arrays it probably has the best memory access patterns, so a few samplesort passes could be useful.</p>
<p><a href="https://github.com/Morwenn/vergesort">Vergesort</a> has another useful first-pass strategy, which spends an asymptotically small amount of time searching for runs before sorting. Since it only detects perfect runs it won't give the full adaptivity of a good merge sort.</p>
<p>Sorting networks compare and swap elements in a fixed pattern, and so can be implemented with branchless or even vectorized code. They're great for sorting many small arrays of the same size, but the limit before insertion sort beats them will be pretty small without hardware specialization.</p>
-<h4 id="simd-sorting">SIMD sorting</h4>
+<h4 id="simd-sorting"><a class="header" href="#simd-sorting">SIMD sorting</a></h4>
<p>A few people have done some work on merge sorting with AVX2 or AVX-512: <a href="https://github.com/sid1607/avx2-merge-sort">two</a> <a href="https://github.com/PatwinchIR/ultra-sort">examples</a>. Pretty complicated, and still mostly in the proof of concept stage, but the benchmarks on uniform random arrays are good. Can these be made adaptive?</p>
<p><a href="https://github.com/nlw0/ChipSort.jl">ChipSort</a> seems further along than those. It uses sorting networks, comb sort, and merging, which all fit nicely with SIMD and should work well together.</p>
<p>Or AVX can <a href="https://github.com/WojciechMula/simd-sort">speed up</a> quicksort. I suspect this is more of a marginal improvement (over BlockQuicksort/pdqsort discussed above) relative to merge sort. If partitioning is fast enough it might make stable quicksort viable.</p>
-<h3 id="binary-search">Binary search</h3>
+<h3 id="binary-search"><a class="header" href="#binary-search">Binary search</a></h3>
<p>Reminder that we're talking about simple, not <a href="#compound-data">compound</a> data. The most important thing is just to have a good branchless binary search (see <a href="#on-binary-search">above</a>), but there are other possible optimizations.</p>
<p>If <code><span class='Value'>𝕨</span></code> is extremely small, use a vector binary search as described in &quot;Sub-nanosecond Searches&quot; (<a href="https://dyalog.tv/Dyalog18/?v=paxIkKBzqBU">video</a>, <a href="https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D08_Searches_Using_Vector_Instructions.zip">slides</a>). For 1-byte elements there's also a vectorized method that works whenever <code><span class='Value'>𝕨</span></code> has no duplicates: create two lookup tables that go from multiples of 8 (5-bit values, after shifting) to bytes. One is a bitmask of <code><span class='Value'>𝕨</span></code>, so that a lookup gives 8 bits indicating which possible choices of the remaining 3 bits are in <code><span class='Value'>𝕨</span></code>. The other gives the number of values in <code><span class='Value'>𝕨</span></code> less than the multiple of 8. To find the result of Bins, look up these two bytes. Mask off the bitmask to include only bits for values less than the target, and sum it (each of these steps can be done with another lookup, or other methods depending on instruction set). The result is the sum of these two counts.</p>
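<p>A sketch of the duplicate-free 1-byte method, using popcount where the text's variant would use further table lookups; <code>presence</code> and <code>below</code> are the two assumed precomputed tables, and the strict comparison here would be flipped for the other Bins direction:</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;

/* presence[b]: bitmask of which of the values 8b..8b+7 occur in w.
   below[b]: how many elements of w are less than 8b. */
void bins_u8(const uint8_t presence[32], const uint8_t below[32],
             const uint8_t *x, uint8_t *r, size_t n) {
    for (size_t i = 0; i &lt; n; i++) {
        uint8_t b = x[i] &gt;&gt; 3;                     /* 5-bit block index */
        uint8_t in = presence[b] &amp; ((1u &lt;&lt; (x[i] &amp; 7)) - 1);  /* w values &lt; x[i] */
        r[i] = below[b] + (uint8_t)__builtin_popcount(in);
    }
}
</pre>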
<p>It's cheap and sometimes worthwhile to trim <code><span class='Value'>𝕨</span></code> down to the range of <code><span class='Value'>𝕩</span></code>. After finding the range of <code><span class='Value'>𝕩</span></code>, binary cut <code><span class='Value'>𝕨</span></code> to a smaller list that contains the range. Stop when the middle element fits inside the range, and search each half of <code><span class='Value'>𝕨</span></code> for the appropriate endpoint of the range.</p>