From a280b8e04c8b4021e8f47b3eee0a7128e591e1e7 Mon Sep 17 00:00:00 2001 From: Marshall Lochbaum Date: Wed, 23 Jun 2021 20:45:04 -0400 Subject: Fix slide link --- docs/implementation/primitive/sort.html | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'docs/implementation') diff --git a/docs/implementation/primitive/sort.html b/docs/implementation/primitive/sort.html index 848b6edb..42783a08 100644 --- a/docs/implementation/primitive/sort.html +++ b/docs/implementation/primitive/sort.html @@ -32,7 +32,7 @@

For binary search ๐•จโ‹๐•ฉ, partitioning allows one pivot element t from ๐•จ to be compared to all of ๐•ฉ at once, instead of the normal strategy of working with one element from ๐•ฉ at a time. ๐•ฉ is partitioned according to tโ‰ค๐•ฉ, then result values are found by searching the first half of ๐•จ for the smaller elements and the second half for the larger ones, and then they are put back in the correct positions by reversing the partitioning. Because Hoare partitioning works by swapping independent pairs of elements, P is a self inverse, identical to Pโผ. So the last step is simple, provided the partitioning information tโ‰ค๐•ฉ is saved.

Reminder that we're talking about simple, not compound data. The most important thing is just to have a good branchless binary search (see above), but there are other possible optimizations.

-

If ๐•จ is extremely small, use a vector binary search as described in "Sub-nanosecond Searches" (video, slides). For 1-byte elements there's also a vectorized method that works whenever ๐•จ has no duplicates: create two lookup tables that go from multiples of 8 (5-bit values, after shifting) to bytes. One is a bitmask of ๐•จ, so that a lookup gives 8 bits indicating which possible choices of the remaining 3 bits are in ๐•จ. The other gives the number of values in ๐•จ less than the multiple of 8. To find the result of Bins, look up these two bytes. Mask off the bitmask to include only bits for values less than the target, and sum it (each of these steps can be done with another lookup, or other methods depending on instruction set). The result is the sum of these two counts.

+

If ๐•จ is extremely small, use a vector binary search as described in "Sub-nanosecond Searches" (video, slides). For 1-byte elements there's also a vectorized method that works whenever ๐•จ has no duplicates: create two lookup tables that go from multiples of 8 (5-bit values, after shifting) to bytes. One is a bitmask of ๐•จ, so that a lookup gives 8 bits indicating which possible choices of the remaining 3 bits are in ๐•จ. The other gives the number of values in ๐•จ less than the multiple of 8. To find the result of Bins, look up these two bytes. Mask off the bitmask to include only bits for values less than the target, and sum it (each of these steps can be done with another lookup, or other methods depending on instruction set). The result is the sum of these two counts.

It's cheap and sometimes worthwhile to trim ๐•จ down to the range of ๐•ฉ. After finding the range of ๐•ฉ, binary cut ๐•จ to a smaller list that contains the range. Stop when the middle element fits inside the range, and search each half of ๐•จ for the appropriate endpoint of the range.

If ๐•ฉ is small-range, then a lookup table method is possible. Check the length of ๐•จ because if it's too large then this method is slower! The approach is simply to create a table of the number of elements in ๐•จ with each value, then take a prefix sum. In BQN, ๐•ฉโŠ+`โˆพโ‰ ยจโŠ”๐•จ, assuming a minimum of 0.

Partitioning allows one pivot t from ๐•จ to be compared with all of ๐•ฉ at once. Although the comparison tโ‰ค๐•ฉ can be vectorized, the overhead of partitioning still makes this method a little slower per-comparison than sequential binary search when ๐•จ fits in L1 cache. For larger ๐•จ (and randomly positioned ๐•ฉ) cache churn is a huge cost and partitioning can be many times faster. It should be performed recursively, switching to sequential binary search when ๐•จ is small enough. Unlike quicksort there is no difficulty in pivot selection: always take it from the middle of ๐•จ as in a normal binary search. However, there is a potential issue with memory. If ๐•ฉ is unbalanced with respect to ๐•จ, then the larger part can be nearly the whole length of ๐•ฉ (if it's all of ๐•ฉ partitioning isn't actually needed and it doesn't need to be saved). This can require close to 2โ‹†โผโ‰ ๐•จ saved partitions of length โ‰ ๐•ฉ, while the expected use would be a total length โ‰ ๐•ฉ.

-- cgit v1.2.3