path: root/docs/implementation/primitive/search.html
author    Marshall Lochbaum <mwlochbaum@gmail.com>  2023-01-20 22:15:08 -0500
committer Marshall Lochbaum <mwlochbaum@gmail.com>  2023-01-20 22:15:08 -0500
commit    0e76b157375a8cc596c8a89a1cc01fb2a46e6419 (patch)
tree      a3f3c8eead5e6acb4e61c9cfa87cd8a69294be95 /docs/implementation/primitive/search.html
parent    8396fbed1ac2c0358bb0c0d77d6ebffc8ee4d269 (diff)
Self-search hashes turned out much faster than partitioning, and other updates
Diffstat (limited to 'docs/implementation/primitive/search.html')
-rw-r--r--  docs/implementation/primitive/search.html  |  12
1 file changed, 6 insertions, 6 deletions
diff --git a/docs/implementation/primitive/search.html b/docs/implementation/primitive/search.html
index debd8a7a..a0ed196b 100644
--- a/docs/implementation/primitive/search.html
+++ b/docs/implementation/primitive/search.html
@@ -16,13 +16,13 @@
</ul>
<p>Each of these easily allows for SIMD comparison. A more effective use of SIMD (unless the size of the small argument is 1, or up to maybe 3) is a vector binary search as presented in the first half of my &quot;Sub-nanosecond Searches&quot; talk (<a href="https://dyalog.tv/Dyalog18/?v=paxIkKBzqBU">video</a>, <a href="https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D08_Searches_Using_Vector_Instructions.zip">slides</a>). This search of sorted values is performed by using shuffle instructions to independently select a pivot for each value, and adding a pivot index bit based on the comparison result. It can be used as a <a href="#sparse-and-reverse-lookups">reverse lookup</a> as well, performing a non-SIMD update whenever a match is found. Multiple registers can be searched, but each one has to be touched at every step, leading to O(mn) asymptotic performance despite the binary search.</p>
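A scalar model of this technique might look like the following sketch (the talk's exact table layout and bit order may differ; the function name and 16-pivot size are illustrative). Each level of the search contributes one index bit, and the SIMD version replaces the table load with a shuffle so that many lanes search in lockstep:

```c
#include <stdint.h>

// Branchless binary search over 16 sorted pivots: each level
// contributes one bit of the result, the "pivot index bit" above.
// In a SIMD version, the load t[r + b - 1] becomes a shuffle
// (e.g. pshufb), so every lane runs its own search in lockstep.
static unsigned rank16(const int32_t t[16], int32_t x) {
    unsigned r = 0;
    for (unsigned b = 8; b > 0; b >>= 1)  // one level per bit: 8,4,2,1
        if (t[r + b - 1] < x) r += b;     // set the bit when x is past the pivot
    return r + (t[r] < x);                // final fix-up: r in [0,16]
}
```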
<h2 id="lookup-tables"><a class="header" href="#lookup-tables">Lookup tables</a></h2>
-<p>For the purposes of these notes, a lookup table is storage, indexed by some key, that contains at most one entry per key. This means reading the value for a given key is a simple load—differing from a hash table, which might have collisions where multiple keys indicate the same entry. Lookup table operations are very fast, but they need to cache table accesses to be fast. So they're useful when the number of possible values (that is, size of the table) is small: a 1-byte or 2-byte type, or small-range integers. You might expect the entire table has to be initialized, but it doesn't always: see <a href="#sparse-and-reverse-lookups">sparse lookups</a>.</p>
+<p>For the purposes of these notes, a lookup table is storage, indexed by some key, that contains at most one entry per key. This means reading the value for a given key is a simple load—differing from a hash table, which might have collisions where multiple keys indicate the same entry. Lookup table operations are very fast, but only if the table remains in cache. So they're useful when the number of possible values (that is, size of the table) is small: a 1-byte or 2-byte type, or small-range integers. You might expect the entire table has to be initialized, but it doesn't always: see <a href="#sparse-and-reverse-lookups">sparse lookups</a>.</p>
<p>For example, a lookup table algorithm for dyadic <code><span class='Function'>⊐</span></code> might traverse <code><span class='Value'>𝕨</span></code>, writing each value's index to the table. Doing this step in reverse index order makes sure the lowest index &quot;wins&quot;. Similarly, empty entries must be initialized to <code><span class='Function'>≠</span><span class='Value'>𝕨</span></code> beforehand. Then the result is <code><span class='Value'>𝕩</span><span class='Function'>⊏</span><span class='Value'>t</span></code> where <code><span class='Value'>t</span></code> is the table constructed this way. A nonzero minimum value can be handled for free by subtracting it from the table pointer.</p>
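A minimal sketch of that algorithm for 1-byte values (names are illustrative; <code>n</code> plays the role of <code>≠𝕨</code>):

```c
#include <stddef.h>
#include <stdint.h>

// Lookup-table index-of (dyadic ⊐) on bytes, as described above.
static void index_of_u8(const uint8_t *w, size_t n,    // searched-in: 𝕨
                        const uint8_t *x, size_t m,    // searched-for: 𝕩
                        size_t *r) {                   // result indices
    size_t t[256];
    for (size_t i = 0; i < 256; i++) t[i] = n;      // empty entries: ≠𝕨
    for (size_t i = n; i-- > 0; ) t[w[i]] = i;      // reverse: lowest index wins
    for (size_t i = 0; i < m; i++) r[i] = t[x[i]];  // r = 𝕩⊏t
}
```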
-<p>Set operations can be handled with a packed bit table, but reading a bit is slower so this should be done only if the space savings are really needed.</p>
+<p>Set operations can be handled with a packed bit table, but reading a bit is slower so this should be done only if the space savings are really needed. With sparse lookups this seems to be very rare.</p>
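For instance, membership (<code>∊</code>) on 2-byte values with a bit table needs 8KB of table rather than 64KB, at the cost of a shift and mask per read. A hedged sketch, not any implementation's actual code:

```c
#include <stddef.h>
#include <stdint.h>

// Membership (∊) of 2-byte values via a packed bit table: one bit per
// possible key. The extra shift and mask per read is the slowdown
// mentioned above.
static void member_u16(const uint16_t *w, size_t n,
                       const uint16_t *x, size_t m, uint8_t *r) {
    uint8_t bits[65536 / 8] = {0};  // 8KB instead of a 64KB byte table
    for (size_t i = 0; i < n; i++) bits[w[i] >> 3] |= 1 << (w[i] & 7);
    for (size_t i = 0; i < m; i++) r[i] = (bits[x[i] >> 3] >> (x[i] & 7)) & 1;
}
```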
<p>A 1-byte lookup can be packed into vector registers for extra-fast searching. To look up a byte, select the appropriate byte from the table with the top 5 bits, and a mask from another table with the bottom 3. AND these together and pack into bits with compare-movemask.</p>
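In scalar terms the trick looks like the sketch below; the register version replaces both selections with shuffles and the final test with compare and movemask:

```c
#include <stdint.h>

// Scalar model of the in-register byte lookup: the set is 256 bits
// stored as 32 bytes. In the SIMD version both selections are
// shuffles, and the AND-and-test becomes compare-movemask.
static int in_set(const uint8_t set[32], uint8_t b) {
    uint8_t byte = set[b >> 3];   // top 5 bits: which byte of the table
    uint8_t mask = 1 << (b & 7);  // bottom 3 bits: which bit of that byte
    return (byte & mask) != 0;
}
```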
<h2 id="hash-tables"><a class="header" href="#hash-tables">Hash tables</a></h2>
-<p>A hash table is a more sophisticated design where there are more possible keys than table entries. For good performance it depends on not having too many <em>actual</em> keys packed into a small space, which is why this method is named after the hash function. If the data is expected to be random then no hash function is needed (the identity function can be used), but I don't think that happens much with searching. Hash tables generally degrade to the performance of a linear lookup if the hash is defeated, so it's ideal to have a way to escape and use a sorting-based method if too many hashes collide.</p>
-<p>Hashing is really the only way to get a performant lookup on arbitrary data. For 2-byte and small-range data, lookups are better, and in several 4-byte cases, lookup with <a href="#partitioning">partitioning</a> is okay for smaller arrays and much better for large ones as the hash table outgrows the cache (&gt;1e5 elements to be hashed or so).</p>
+<p>A hash table is a more sophisticated design where there are more possible keys than table entries. For good performance it depends on not having too many <em>actual</em> keys packed into a small space, which is why this method is named after the hash function. If the data is expected to be random then no hash function is needed (the identity function can be used), but that's no good for search functions. Hash tables generally degrade to the performance of a linear lookup if the hash is defeated, so it's ideal to have a way to escape and use a sorting-based method if too many hashes collide.</p>
+<p>Hashing is really the only way to get a performant lookup on arbitrary data. For 2-byte and small-range data, lookups are better, and lookup with <a href="#partitioning">partitioning</a> is better for 4-byte arguments that outgrow the cache (&gt;1e5 elements to be hashed or so).</p>
<p>While hash tables are well studied, almost all the work is focused on large persistent tables, meaning that they're not too suited for a one-shot search function. Abseil's <a href="https://github.com/abseil/abseil-cpp/blob/master/absl/container/flat_hash_map.h">flat_hash_map</a> is fine, I guess. Roger Hui's <a href="https://www.jsoftware.com/papers/indexof/indexof.htm">Index-Of, a 30-Year Quest</a> works as an introduction to hashing in APL, although it has begun to suffer from the small number of years in places, and some details have serious issues (with a power-of-two table size, multiplying by a prime causes high bits to be lost and so is hardly better than no hash). The second half of my &quot;Sub-nanosecond Searches&quot; talk (<a href="https://dyalog.tv/Dyalog18/?v=paxIkKBzqBU">video</a>, <a href="https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D08_Searches_Using_Vector_Instructions.zip">slides</a>) covers a difficult 4-byte design that's very good for membership and negative lookups (in particular, it's perfect for the reverse lookup as described in the next section).</p>
<p>I'd take the following choices as a given for an array language hash design:</p>
<ul>
@@ -30,7 +30,7 @@
<li>Open addressing</li>
<li>Linear probing</li>
</ul>
-<p>The main cost for larger data is the hashing itself; <a href="https://github.com/wangyi-fudan/wyhash">wyhash</a> appears to be one of the best choices at the time of writing. 4- and 8-byte lookups are where all the fancy optimizations are wanted. Hashes on these fixed sizes should be reversible and are often called mixing functions in the literature. A CRC instruction makes a good one if available.</p>
+<p>The main cost for larger data is the hashing itself; <a href="https://github.com/wangyi-fudan/wyhash">wyhash</a> appears to be one of the best choices at the time of writing. 4- and 8-byte lookups are where all the fancy optimizations are wanted. Hashes on these fixed sizes should be reversible and are often called mixing functions in the literature. A CRC instruction makes a good one if available. Reversibility means they can be stored in the table for comparison instead of the original data.</p>
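One example of such a mixer is murmur3's well-known 32-bit finalizer: every xor-shift and odd multiplication is invertible, so the mixed value can stand in for the key.

```c
#include <stdint.h>

// Reversible 32-bit mixing function (murmur3 finalizer constants).
// Because each step is invertible, storing mix32(key) in the table
// loses no information: equal hashes imply equal keys.
static uint32_t mix32(uint32_t x) {
    x ^= x >> 16; x *= 0x85ebca6bu;
    x ^= x >> 13; x *= 0xc2b2ae35u;
    x ^= x >> 16;
    return x;
}
```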
<h3 id="resizing-hash-tables"><a class="header" href="#resizing-hash-tables">Resizing hash tables</a></h3>
<p>When performing a one-shot array operation, the maximum hash table size needed is known in advance. However, if the array to be hashed doesn't have many unique elements, allocating a table this large will have extra costs because it needs to be initialized and the hash probes won't cache as well (they're random by design!). Starting with a smaller hash table and resizing mitigates these costs. But! Getting the resizing policy right is difficult, and any way you do it there's going to be some overhead from resizing when the large size is needed. And hash tables are already somewhat adaptive to small numbers of unique values, because these hit fewer memory regions, avoiding cache penalties. With a factor of 2 to 4 upside, this is not a very high-value use of time.</p>
<p>Hash table resizing is usually based on a counter of the number of values in the hash. I like using a collision counter instead, because tracking it has the same cost (possibly a little less) and can protect from a compromised hash function. Whenever a value is inserted, add the difference between the index computed from the hash and the index it was actually inserted at. This is the number of collisions for that value and measures the cost of inserting or looking it up. It can be checked for a resizing condition periodically, since checking at every step would be expensive.</p>
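Sketched as an insertion step (field names and the check policy are hypothetical; <code>EMPTY</code> and <code>mix32</code> are from the sketches above):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t *keys;     // EMPTY-initialized, power-of-two size
    size_t mask;        // table size minus 1
    size_t collisions;  // running displacement total
} Table;

// Insert one key, charging its probe displacement to the collision
// counter; the caller checks the counter every so often and resizes
// (or bails out to a sort-based method) past some threshold.
static size_t insert(Table *t, uint32_t key) {
    size_t home = mix32(key) & t->mask, j = home;
    while (t->keys[j] != EMPTY && t->keys[j] != key)
        j = (j + 1) & t->mask;
    t->collisions += (j - home) & t->mask;  // probes past the home slot
    if (t->keys[j] == EMPTY) t->keys[j] = key;
    return j;
}
```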
@@ -86,5 +86,5 @@
<h2 id="partitioning"><a class="header" href="#partitioning">Partitioning</a></h2>
<p><a href="https://github.com/mlochbaum/rhsort">Robin Hood sort</a> sorts small uniform arrays quickly by considering hash tables as a way of sorting hashes. This cuts both ways: RH sort <a href="https://github.com/mlochbaum/rhsort/blob/master/images/rand.svg">slows down</a> far more than other sorting methods on large arrays because of its random access patterns, and so do hash table operations. For large enough hash tables, it ought to make sense to bring in sorting-based methods in order to reduce the search size. As in quicksort, partitioning has two advantages in that it reduces the number of values to be compared and also their range. It does use a lot of memory, because entries need to be copied rather than swapped for stability, and ordering data has to be kept around.</p>
<p><a href="sort.html#partitioning">Partitioning</a> as in quicksort is possible, but I think stable <a href="sort.html#radix-sort">radix</a> passes are almost always better. The typical problem with radix passes on high bits is that they might distribute values unevenly, but the values will either be searched with constant memory (lookup table) or redistributed (hash table). To undo a radix sort in a cache-friendly way, the original hash bits need to be kept around to retrace the steps. Even in cases where the original indices are known, traversing the radix-ed values and writing back to these indices makes many loops around the original array, moving too quickly for the cache to keep up.</p>
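One stable 8-bit pass might look like the following sketch; values and their original indices move together so the search result can be traced back, as the paragraph above requires:

```c
#include <stddef.h>
#include <stdint.h>

// Stable radix pass on the high byte: count, exclusive prefix sum,
// then scatter. Each value carries its original index along.
static void radix_pass_hi(const uint32_t *src, const uint32_t *srci,
                          uint32_t *dst, uint32_t *dsti, size_t n) {
    size_t count[256] = {0};
    for (size_t i = 0; i < n; i++) count[src[i] >> 24]++;
    size_t sum = 0;
    for (size_t b = 0; b < 256; b++) {  // exclusive prefix sum
        size_t c = count[b]; count[b] = sum; sum += c;
    }
    for (size_t i = 0; i < n; i++) {    // stable scatter
        size_t d = count[src[i] >> 24]++;
        dst[d] = src[i]; dsti[d] = srci[i];
    }
}
```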
-<p>For 4-byte values, two 8-bit passes reduce the range to two bytes, lookup table range. With <a href="#sparse-and-reverse-lookups">sparse</a> initialization, this is very fast, and useful even for small sizes under 100 elements. I think it just about obsoletes 4-byte hash tables for self-search functions. But hash tables favoring lookup speed can be better for search functions as long as the argument sizes aren't too close.</p>
+<p>For 4-byte values, two 8-bit passes reduce the range to two bytes, lookup table range. With <a href="#sparse-and-reverse-lookups">sparse</a> initialization, this is fairly fast, even for sizes under 100 elements. But hash tables are better until cache effects of a large table kick in. At least partitioning is a good backup for cases where the hash function fails. And one advantage it does have over hash tables is that for small-range data it might be possible to skip one or both rounds of radix passes. I haven't found a way to make it work for Classify (<code><span class='Function'>⊐</span></code>). Deriving it from <code><span class='Function'>⊐</span><span class='Modifier'>˜</span></code> requires a cache-incoherent final pass, so <code><span class='Function'>∊</span><span class='Modifier2'>⊸</span><span class='Function'>⊐</span></code> might be the best option.</p>
<p>Partitioning doesn't really have to interact with hash insertions or lookups: after the data is partitioned at a suitable size, it will just hash faster because it only accesses part of the hash table at a time. But it's better to use a smaller hash table, doing one partition with it, then the next, and so on. There are a few tricks to avoid having to re-initialize the table on each pass. This is a big deal because it means a very low load factor can be used, which speeds things up! First, and more generally, the table can be cleaned up <em>after</em> a pass by walking through the elements again (possibly in reverse for some hash designs). Second, if enough data is stored in the hash table to distinguish one pass from the next, then values from previous passes can be interpreted as empty so that no re-initialization is necessary.</p>
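The second trick could be realized by tagging entries with a pass number; a hypothetical layout, not necessarily what an implementation would store:

```c
#include <stdint.h>

// Slots tagged with the pass that wrote them: a tag from an earlier
// pass reads as empty, so the table is initialized only once, before
// pass 0 (with a tag no real pass uses).
typedef struct { uint32_t key, pass; } Slot;

static int slot_empty(const Slot *s, uint32_t pass) {
    return s->pass != pass;  // stale entries count as empty
}
static void slot_fill(Slot *s, uint32_t key, uint32_t pass) {
    s->key = key; s->pass = pass;
}
```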