Diffstat (limited to 'docs/implementation'):
 docs/implementation/primitive/search.html | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/docs/implementation/primitive/search.html b/docs/implementation/primitive/search.html
index 128f27e8..6f2c3a42 100644
--- a/docs/implementation/primitive/search.html
+++ b/docs/implementation/primitive/search.html
@@ -13,6 +13,6 @@
 <h2 id="partitioning"><a class="header" href="#partitioning">Partitioning</a></h2>
 <p><a href="https://github.com/mlochbaum/rhsort">Robin Hood sort</a> sorts small uniform arrays quickly by considering hash tables as a way of sorting hashes. This cuts both ways: RH sort <a href="https://github.com/mlochbaum/rhsort/blob/master/images/rand.svg">slows down</a> far more than other sorting methods on large arrays because of its random access patterns, and so do hash table operations. For large enough hash tables, it ought to make sense to bring in sorting-based methods in order to reduce the search size.</p>
 <p>Of course, it's possible to implement searches using only sorting and no hashing: <code><span class='Function'>∧</span><span class='Modifier2'>⊸</span><span class='Function'>⍋⊏⍋</span><span class='Modifier2'>∘</span><span class='Function'>⊣</span></code> with some adjustments to the binary search. A hash takes advantage of the fact that what ordering is used doesn't matter to rearrange things and get expected equal distribution of unique keys. It's usually going to be best to use a hash table as the base case, so that it's the hashes being sorted. With small element sizes and a bijective hash, only the hashes need to be compared, so the arguments can be hashed at the start and the original values discarded.</p>
-<p>The simplest sorting technique to use is probably <a href="sort.html#partitioning">partitioning</a> as in quicksort. However, I'm guessing that a <a href="sort.html#radix-sort">radix</a> pass would be faster. The primary reason to avoid MSD radix for sorting is that arguments are often clustered in a small number of bins, making initial passes useless, but this only applies to hashed data if there are many duplicates. Stable partitioning is often more convenient as it ensures that equal elements are still encountered in the right order.</p>
+<p>One option is <a href="sort.html#partitioning">partitioning</a> as in quicksort, but the unstable version doesn't work for many functions, and comparing with a pivot is wasted effort as top bits can be used directly. Stable <a href="sort.html#radix-sort">radix</a> passes are ideal here. However, they do use a lot of extra memory: twice the size of the hashed array (or equal to it if it can be reused). To undo a radix sort in a cache-friendly way, the original hash bits need to be kept around to retrace the steps. Even in cases where the original indices are known, traversing the radix-ed values and writing back to these indices makes many loops around the original array, moving too quickly for the cache to keep up.</p>
 <p>Partitioning doesn't really have to interact with hash insertions or lookups: after the data is partitioned at a suitable size, it will just hash faster because it only accesses part of the hash table at a time. It's also possible to save space by using a smaller hash table and doing one partition with it, then the next, and so on.</p>
-<p>The complication is that the result comes out partitioned like the searched-for argument, and this needs to be undone at the end. I think it's best to just keep the partitioning information (first few hash bits) around and run partitioning in reverse.</p>
+<p>If enough data is stored in the hash table to distinguish one pass from the next, the hash doesn't need to be re-initialized on each step. This means a very low load factor can be used, which is a big deal because a basic hash is very fast in these conditions!</p>
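The paragraphs added above can be illustrated with a minimal C sketch (not the actual BQN/CBQN code): hash both arguments, run one stable radix (counting) pass on the top PART_BITS of the hashes, then process one partition at a time with a small shared hash table whose slots are tagged with the partition number, so the table is never re-initialized and can keep a very low load factor. The names (member_partitioned, hash64), the constants, and the 64-bit element type are illustrative assumptions; since the toy hash is bijective, only hashes are compared, as the existing text notes. For brevity this sketch scatters results back through saved indices, whereas the text prefers keeping the hash bits and reversing the radix pass for cache friendliness.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PART_BITS 4                  // top hash bits used for partitioning
#define NPART ((size_t)1 << PART_BITS)
#define TAB_BITS 12                  // small shared table; keep the load factor low
#define TAB_SIZE ((size_t)1 << TAB_BITS)

// Toy bijective hash (splitmix64 finalizer), so hashes alone can be compared.
static uint64_t hash64(uint64_t v) {
  v ^= v >> 30; v *= 0xbf58476d1ce4e5b9ULL;
  v ^= v >> 27; v *= 0x94d049bb133111ebULL;
  return v ^ (v >> 31);
}

// out[i] = 1 if w[i] is an element of x, else 0. Assumes each partition fits
// comfortably in the table; real code would size the constants from m.
void member_partitioned(const uint64_t *x, size_t m,
                        const uint64_t *w, size_t n, uint8_t *out) {
  uint64_t *hx = malloc(m * sizeof *hx), *hw = malloc(n * sizeof *hw);
  uint64_t *sx = malloc(m * sizeof *sx), *sw = malloc(n * sizeof *sw);
  uint32_t *iw = malloc(n * sizeof *iw);     // original positions of w's hashes
  for (size_t i = 0; i < m; i++) hx[i] = hash64(x[i]);
  for (size_t i = 0; i < n; i++) hw[i] = hash64(w[i]);

  // One stable radix (counting sort) pass on the top PART_BITS of the hashes.
  size_t cx[NPART + 1] = {0}, cw[NPART + 1] = {0};
  for (size_t i = 0; i < m; i++) cx[(hx[i] >> (64 - PART_BITS)) + 1]++;
  for (size_t i = 0; i < n; i++) cw[(hw[i] >> (64 - PART_BITS)) + 1]++;
  for (size_t p = 0; p < NPART; p++) { cx[p+1] += cx[p]; cw[p+1] += cw[p]; }
  size_t ox[NPART], ow[NPART];
  memcpy(ox, cx, sizeof ox); memcpy(ow, cw, sizeof ow);
  for (size_t i = 0; i < m; i++) sx[ox[hx[i] >> (64 - PART_BITS)]++] = hx[i];
  for (size_t i = 0; i < n; i++) {
    size_t d = ow[hw[i] >> (64 - PART_BITS)]++;
    sw[d] = hw[i]; iw[d] = (uint32_t)i;      // remember where it came from
  }

  // One small open-addressing table shared by all partitions. A slot tagged
  // with an older partition number counts as empty, so no clearing is needed.
  uint64_t *key = malloc(TAB_SIZE * sizeof *key);
  uint32_t *tag = calloc(TAB_SIZE, sizeof *tag);   // 0 = never used
  for (size_t p = 0; p < NPART; p++) {
    uint32_t t = (uint32_t)p + 1;
    for (size_t i = cx[p]; i < cx[p+1]; i++) {     // insert x's partition p
      size_t s = sx[i] & (TAB_SIZE - 1);           // low bits; top bits are the partition
      while (tag[s] == t && key[s] != sx[i]) s = (s + 1) & (TAB_SIZE - 1);
      key[s] = sx[i]; tag[s] = t;
    }
    for (size_t i = cw[p]; i < cw[p+1]; i++) {     // probe with w's partition p
      size_t s = sw[i] & (TAB_SIZE - 1);
      while (tag[s] == t && key[s] != sw[i]) s = (s + 1) & (TAB_SIZE - 1);
      out[iw[i]] = (tag[s] == t);                  // found iff we stopped on a live match
    }
  }
  free(hx); free(hw); free(sx); free(sw); free(iw); free(key); free(tag);
}

The tagging scheme is what makes the low load factor cheap: the table is allocated once, never cleared, and each partition only ever touches the slots its own hashes map to.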
