aboutsummaryrefslogtreecommitdiff
path: root/docs/implementation/primitive
diff options
context:
space:
mode:
Diffstat (limited to 'docs/implementation/primitive')
-rw-r--r--docs/implementation/primitive/sort.html2
1 files changed, 1 insertions, 1 deletions
diff --git a/docs/implementation/primitive/sort.html b/docs/implementation/primitive/sort.html
index 848b6edb..42783a08 100644
--- a/docs/implementation/primitive/sort.html
+++ b/docs/implementation/primitive/sort.html
@@ -32,7 +32,7 @@
<p>For binary search <code><span class='Value'>๐•จ</span><span class='Function'>โ‹</span><span class='Value'>๐•ฉ</span></code>, partitioning allows one pivot element <code><span class='Value'>t</span></code> from <code><span class='Value'>๐•จ</span></code> to be compared to all of <code><span class='Value'>๐•ฉ</span></code> at once, instead of the normal strategy of working with one element from <code><span class='Value'>๐•ฉ</span></code> at a time. <code><span class='Value'>๐•ฉ</span></code> is partitioned according to <code><span class='Value'>t</span><span class='Function'>โ‰ค</span><span class='Value'>๐•ฉ</span></code>, then result values are found by searching the first half of <code><span class='Value'>๐•จ</span></code> for the smaller elements and the second half for the larger ones, and then they are put back in the correct positions by reversing the partitioning. Because Hoare partitioning works by swapping independent pairs of elements, <code><span class='Function'>P</span></code> is a self inverse, identical to <code><span class='Function'>P</span><span class='Modifier'>โผ</span></code>. So the last step is simple, provided the partitioning information <code><span class='Value'>t</span><span class='Function'>โ‰ค</span><span class='Value'>๐•ฉ</span></code> is saved.</p>
<h3 id="binary-search">Binary search</h3>
<p>Reminder that we're talking about simple, not <a href="#compound-data">compound</a> data. The most important thing is just to have a good branchless binary search (see <a href="#on-binary-search">above</a>), but there are other possible optimizations.</p>
-<p>If <code><span class='Value'>๐•จ</span></code> is extremely small, use a vector binary search as described in &quot;Sub-nanosecond Searches&quot; (<a href="https://dyalog.tv/Dyalog18/?v=paxIkKBzqBU">video</a>, <a href="https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D15_The_Interpretive_Advantage.zip">slides</a>). For 1-byte elements there's also a vectorized method that works whenever <code><span class='Value'>๐•จ</span></code> has no duplicates: create two lookup tables that go from multiples of 8 (5-bit values, after shifting) to bytes. One is a bitmask of <code><span class='Value'>๐•จ</span></code>, so that a lookup gives 8 bits indicating which possible choices of the remaining 3 bits are in <code><span class='Value'>๐•จ</span></code>. The other gives the number of values in <code><span class='Value'>๐•จ</span></code> less than the multiple of 8. To find the result of Bins, look up these two bytes. Mask off the bitmask to include only bits for values less than the target, and sum it (each of these steps can be done with another lookup, or other methods depending on instruction set). The result is the sum of these two counts.</p>
+<p>If <code><span class='Value'>๐•จ</span></code> is extremely small, use a vector binary search as described in &quot;Sub-nanosecond Searches&quot; (<a href="https://dyalog.tv/Dyalog18/?v=paxIkKBzqBU">video</a>, <a href="https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D08_Searches_Using_Vector_Instructions.zip">slides</a>). For 1-byte elements there's also a vectorized method that works whenever <code><span class='Value'>๐•จ</span></code> has no duplicates: create two lookup tables that go from multiples of 8 (5-bit values, after shifting) to bytes. One is a bitmask of <code><span class='Value'>๐•จ</span></code>, so that a lookup gives 8 bits indicating which possible choices of the remaining 3 bits are in <code><span class='Value'>๐•จ</span></code>. The other gives the number of values in <code><span class='Value'>๐•จ</span></code> less than the multiple of 8. To find the result of Bins, look up these two bytes. Mask off the bitmask to include only bits for values less than the target, and sum it (each of these steps can be done with another lookup, or other methods depending on instruction set). The result is the sum of these two counts.</p>
<p>It's cheap and sometimes worthwhile to trim <code><span class='Value'>๐•จ</span></code> down to the range of <code><span class='Value'>๐•ฉ</span></code>. After finding the range of <code><span class='Value'>๐•ฉ</span></code>, binary cut <code><span class='Value'>๐•จ</span></code> to a smaller list that contains the range. Stop when the middle element fits inside the range, and search each half of <code><span class='Value'>๐•จ</span></code> for the appropriate endpoint of the range.</p>
<p>If <code><span class='Value'>๐•ฉ</span></code> is small-range, then a lookup table method is possible. Check the length of <code><span class='Value'>๐•จ</span></code> because if it's too large then this method is slower! The approach is simply to create a table of the number of elements in <code><span class='Value'>๐•จ</span></code> with each value, then take a prefix sum. In BQN, <code><span class='Value'>๐•ฉ</span><span class='Function'>โŠ+</span><span class='Modifier'>`</span><span class='Function'>โˆพโ‰ </span><span class='Modifier'>ยจ</span><span class='Function'>โŠ”</span><span class='Value'>๐•จ</span></code>, assuming a minimum of 0.</p>
<p><a href="#partitioning">Partitioning</a> allows one pivot <code><span class='Value'>t</span></code> from <code><span class='Value'>๐•จ</span></code> to be compared with all of <code><span class='Value'>๐•ฉ</span></code> at once. Although the comparison <code><span class='Value'>t</span><span class='Function'>โ‰ค</span><span class='Value'>๐•ฉ</span></code> can be vectorized, the overhead of partitioning still makes this method a little slower per-comparison than sequential binary search <em>when</em> <code><span class='Value'>๐•จ</span></code> <em>fits in L1 cache</em>. For larger <code><span class='Value'>๐•จ</span></code> (and randomly positioned <code><span class='Value'>๐•ฉ</span></code>) cache churn is a huge cost and partitioning can be many times faster. It should be performed recursively, switching to sequential binary search when <code><span class='Value'>๐•จ</span></code> is small enough. Unlike quicksort there is no difficulty in pivot selection: always take it from the middle of <code><span class='Value'>๐•จ</span></code> as in a normal binary search. However, there is a potential issue with memory. If <code><span class='Value'>๐•ฉ</span></code> is unbalanced with respect to <code><span class='Value'>๐•จ</span></code>, then the larger part can be nearly the whole length of <code><span class='Value'>๐•ฉ</span></code> (if it's all of <code><span class='Value'>๐•ฉ</span></code> partitioning isn't actually needed and it doesn't need to be saved). This can require close to <code><span class='Number'>2</span><span class='Function'>โ‹†</span><span class='Modifier'>โผ</span><span class='Function'>โ‰ </span><span class='Value'>๐•จ</span></code> saved partitions of length <code><span class='Function'>โ‰ </span><span class='Value'>๐•ฉ</span></code>, while the expected use would be a total length <code><span class='Function'>โ‰ </span><span class='Value'>๐•ฉ</span></code>.</p>