Self-search hash resizing math

author: Marshall Lochbaum <mwlochbaum@gmail.com> 2022-11-06 20:04:48 -0500
committer: Marshall Lochbaum <mwlochbaum@gmail.com> 2022-11-06 20:04:48 -0500
commit: 5dbe5b502b760b5d87c043b5ab836751b2c3e5e8 (patch)
tree: a1b4f63d6ac913991f2b6bd9e31b8fd3b6e7d193 /docs
parent: dd3f993803386d28b3966c6b6ebcc5d1d372b99d (diff)
1 files changed, 22 insertions, 1 deletions
diff --git a/docs/implementation/primitive/search.html b/docs/implementation/primitive/search.html
index 08140eb3..debd8a7a 100644
--- a/docs/implementation/primitive/search.html
+++ b/docs/implementation/primitive/search.html
@@ -22,7 +22,7 @@
 <p>A 1-byte lookup can be packed into vector registers for extra-fast searching. To look up a byte, select the appropriate byte from the table with the top 5 bits, and a mask from another table with the bottom 3. Put these together and pack into bits with compare-movemask.</p>
 <h2 id="hash-tables"><a class="header" href="#hash-tables">Hash tables</a></h2>
 <p>A hash table is a more sophisticated design where there are more possible keys than table entries. For good performance it depends on not having too many <em>actual</em> keys packed into a small space, which is why this method is named after the hash function. If the data is expected to be random then no hash function is needed (the identity function can be used), but I don't think that happens much with searching. Hash tables generally degrade to the performance of a linear lookup if the hash is defeated, so it's ideal to have a way to escape and use a sorting-based method if too many hashes collide.</p>
-<p>Hashing is really the only way to get a performant lookup on arbitrary data. For 2-byte and small-range data, lookups are better, and in several 4-byte cases, lookup with <a href="#partitioning">partitioning</a> is competitive for smaller arrays and much better for large ones as the hash table outgrows the cache (&gt;1e5 elements to be hashed or so).</p>
+<p>Hashing is really the only way to get a performant lookup on arbitrary data. For 2-byte and small-range data, lookups are better, and in several 4-byte cases, lookup with <a href="#partitioning">partitioning</a> is okay for smaller arrays and much better for large ones as the hash table outgrows the cache (&gt;1e5 elements to be hashed or so).</p>
 <p>While hash tables are well studied, almost all the work is focused on large persistent tables, meaning that they're not too suited for a one-shot search function. Abseil's <a href="https://github.com/abseil/abseil-cpp/blob/master/absl/container/flat_hash_map.h">flat_hash_map</a> is fine, I guess. Roger Hui's <a href="https://www.jsoftware.com/papers/indexof/indexof.htm">Index-Of, a 30-Year Quest</a> works as an introduction to hashing in APL, although it has begun to suffer from the small number of years in places, and some details have serious issues (with a power-of-two table size, multiplying by a prime causes high bits to be lost and so is hardly better than no hash). The second half of my &quot;Sub-nanosecond Searches&quot; talk (<a href="https://dyalog.tv/Dyalog18/?v=paxIkKBzqBU">video</a>, <a href="https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D08_Searches_Using_Vector_Instructions.zip">slides</a>) covers a difficult 4-byte design that's very good for membership and negative lookups (in particular, it's perfect for the reverse lookup as described in the next section).</p>
 <p>I'd take the following choices as a given for an array language hash design:</p>
 <ul>
@@ -31,6 +31,27 @@
 <li>Linear probing</li>
 </ul>
 <p>The main cost for larger data is the hashing itself; <a href="https://github.com/wangyi-fudan/wyhash">wyhash</a> appears to be one of the best choices at the time of writing. 4- and 8-byte lookups are where all the fancy optimizations are wanted. Hashes on these fixed sizes should be reversible and are often called mixing functions in the literature. A CRC instruction makes a good one if available.</p>
+<h3 id="resizing-hash-tables"><a class="header" href="#resizing-hash-tables">Resizing hash tables</a></h3>
+<p>When performing a one-shot array operation, the maximum hash table size needed is known in advance. However, if the array to be hashed doesn't have many unique elements, allocating a table this large will have extra costs because it needs to be initialized and the hash probes won't cache as well (they're random by design!). Starting with a smaller hash table and resizing mitigates these costs. But! Getting the resizing policy right is difficult, and any way you do it there's going to be some overhead from resizing when the large size is needed. And hash tables are already somewhat adaptive to small numbers of unique values, because these hit fewer memory regions, avoiding cache penalties. With a factor of 2 to 4 upside, this is not a very high-value use of time.</p>
+<p>Hash model resizing is usually based on a counter of the number of values in the hash. I like using a collision counter instead, because tracking it has the same cost (possibly a little less) and can protect from a compromised hash function. Whenever a value is inserted, add the difference between the index computed from the hash and the index it was actually inserted at. This is the number of collisions for that value and measures the cost of inserting or looking it up. It can be checked for a resizing condition periodically, since checking at every step would be expensive.</p>
+<h4 id="resizing-math"><a class="header" href="#resizing-math">Resizing math</a></h4>
+<p>Here's a model for when to resize in a self-search function, assuming the average number of collisions increases linearly as values are inserted. Suppose we've reached position <code><span class='Value'>i</span></code> of <code><span class='Value'>n</span></code> and observed an average of <code><span class='Value'>c</span><span class='Function'>÷</span><span class='Number'>2</span></code> collisions per element. When is resizing now better than resizing later? There will be two changes: cache-related costs which we'll say will change by <code><span class='Value'>d</span></code> per element, and collision related costs which will change by <code><span class='Value'>c</span></code> per element (times a constant; we can absorb this into <code><span class='Value'>d</span></code> and later <code><span class='Value'>s</span></code> as well). So it's better to switch now instead of a step later as soon as <code><span class='Value'>c</span> <span class='Function'>&gt;</span> <span class='Value'>d</span></code>.</p>
+<p>But resizing also incurs a one-time cost. We'll call it <code><span class='Value'>s</span></code>, which is some coefficient times the table size, and we now need to look at all future hashes to see if the averaged cost <code><span class='Value'>s</span><span class='Function'>÷</span><span class='Value'>n</span><span class='Function'>-</span><span class='Value'>i</span></code> per element is worth it. We assume the collision-related costs will go down by a constant factor, but the future cost remaining is proportional to an adjustment factor <code><span class='Paren'>(</span><span class='Value'>i</span><span class='Function'>+</span><span class='Value'>n</span><span class='Paren'>)</span><span class='Function'>÷</span><span class='Value'>i</span></code>, because with linearly increasing collisions we'd average <code><span class='Value'>i</span><span class='Function'>÷</span><span class='Number'>2</span></code> before and <code><span class='Paren'>(</span><span class='Value'>i</span><span class='Function'>+</span><span class='Value'>n</span><span class='Paren'>)</span><span class='Function'>÷</span><span class='Number'>2</span></code> after the current position. Setting <code><span class='Value'>r</span><span class='Gets'>←</span><span class='Value'>n</span><span class='Function'>-</span><span class='Value'>i</span></code>, this is <code><span class='Paren'>(</span><span class='Number'>2n</span><span class='Function'>-</span><span class='Value'>r</span><span class='Paren'>)</span><span class='Function'>÷</span><span class='Paren'>(</span><span class='Value'>n</span><span class='Function'>-</span><span class='Value'>r</span><span class='Paren'>)</span></code>, or <code><span class='Number'>2</span><span class='Function'>×</span><span class='Paren'>(</span><span class='Number'>1</span><span class='Function'>-</span><span class='Value'>r</span><span class='Function'>÷</span><span class='Number'>2n</span><span class='Paren'>)</span><span class='Function'>÷</span><span class='Paren'>(</span><span class='Number'>1</span><span class='Function'>-</span><span class='Value'>r</span><span class='Function'>÷</span><span class='Value'>n</span><span class='Paren'>)</span></code>. Now we want to resize at some point if:</p>
+<pre><span class='Number'>0</span> <span class='Function'>&gt;</span> <span class='Value'>d</span> <span class='Function'>-</span> <span class='Paren'>(</span><span class='Value'>c</span><span class='Function'>×</span><span class='Paren'>(</span><span class='Number'>1</span><span class='Function'>-</span><span class='Value'>r</span><span class='Function'>÷</span><span class='Number'>2n</span><span class='Paren'>)</span><span class='Function'>÷</span><span class='Paren'>(</span><span class='Number'>1</span><span class='Function'>-</span><span class='Value'>r</span><span class='Function'>÷</span><span class='Value'>n</span><span class='Paren'>))</span> <span class='Function'>+</span> <span class='Value'>s</span><span class='Function'>÷</span><span class='Value'>r</span>
+</pre>
+<p>Solve for <code><span class='Value'>c</span></code>, and drop several terms to give an approximation that's exact at <code><span class='Value'>r</span><span class='Function'>=</span><span class='Number'>0</span></code> and larger when <code><span class='Value'>r</span><span class='Function'>&gt;</span><span class='Number'>0</span></code>. This is good because the assumption of linearly increasing <code><span class='Value'>c</span></code> is definitely not always going to hold so being more hesitant to resize early is good.</p>
+<pre><span class='Value'>c</span> <span class='Function'>&gt;</span> <span class='Paren'>(</span><span class='Value'>d</span> <span class='Function'>+</span> <span class='Value'>s</span><span class='Function'>÷</span><span class='Value'>r</span><span class='Paren'>)</span> <span class='Function'>×</span> <span class='Paren'>(</span><span class='Number'>1</span> <span class='Function'>-</span> <span class='Value'>r</span><span class='Function'>÷</span><span class='Value'>n</span><span class='Paren'>)</span><span class='Function'>÷</span><span class='Paren'>(</span><span class='Number'>1</span> <span class='Function'>-</span> <span class='Value'>r</span><span class='Function'>÷</span><span class='Number'>2n</span><span class='Paren'>)</span>
+  <span class='Function'>&gt;</span> <span class='Paren'>(</span><span class='Value'>d</span> <span class='Function'>+</span> <span class='Value'>s</span><span class='Function'>÷</span><span class='Value'>r</span><span class='Paren'>)</span> <span class='Function'>×</span> <span class='Paren'>(</span><span class='Number'>1</span> <span class='Function'>-</span> <span class='Value'>r</span><span class='Function'>÷</span><span class='Value'>n</span><span class='Paren'>)</span><span class='Function'>×</span><span class='Paren'>(</span><span class='Number'>1</span> <span class='Function'>+</span> <span class='Value'>r</span><span class='Function'>÷</span><span class='Number'>2n</span><span class='Paren'>)</span>
+  <span class='Function'>&gt;</span> <span class='Paren'>(</span><span class='Value'>d</span> <span class='Function'>+</span> <span class='Value'>s</span><span class='Function'>÷</span><span class='Value'>r</span><span class='Paren'>)</span> <span class='Function'>×</span> <span class='Paren'>(</span><span class='Number'>1</span> <span class='Function'>-</span> <span class='Value'>r</span><span class='Function'>÷</span><span class='Number'>2n</span><span class='Paren'>)</span>
+  <span class='Function'>=</span> <span class='Value'>d</span> <span class='Function'>+</span> <span class='Paren'>(</span><span class='Function'>-</span><span class='Value'>d</span><span class='Function'>×</span><span class='Value'>r</span><span class='Function'>÷</span><span class='Number'>2n</span><span class='Paren'>)</span> <span class='Function'>+</span> <span class='Paren'>(</span><span class='Value'>s</span><span class='Function'>÷</span><span class='Value'>r</span><span class='Paren'>)</span> <span class='Function'>-</span> <span class='Value'>s</span><span class='Function'>÷</span><span class='Number'>2n</span>
+  <span class='Function'>&gt;</span> <span class='Value'>d</span> <span class='Function'>+</span> <span class='Paren'>(</span><span class='Value'>s</span><span class='Function'>÷</span><span class='Value'>r</span><span class='Paren'>)</span> <span class='Function'>-</span> <span class='Value'>s</span><span class='Function'>÷</span><span class='Number'>2n</span>
+  <span class='Function'>=</span> <span class='Value'>d</span> <span class='Function'>+</span> <span class='Value'>s</span><span class='Function'>×</span><span class='Paren'>(</span><span class='Number'>2n</span><span class='Function'>-</span><span class='Value'>r</span><span class='Paren'>)</span><span class='Function'>÷</span><span class='Paren'>(</span><span class='Number'>2n</span><span class='Function'>×</span><span class='Value'>r</span><span class='Paren'>)</span>
+  <span class='Function'>=</span> <span class='Value'>d</span> <span class='Function'>+</span> <span class='Value'>s</span><span class='Function'>×</span><span class='Paren'>(</span><span class='Value'>n</span><span class='Function'>+</span><span class='Value'>i</span><span class='Paren'>)</span><span class='Function'>÷</span><span class='Paren'>(</span><span class='Number'>2n</span><span class='Function'>×</span><span class='Value'>r</span><span class='Paren'>)</span>
+</pre>
+<p>Some rearrangement gives this form, which doesn't divide by any variables. <code><span class='Value'>ct</span></code> is the total number of collisions, <code><span class='Value'>i</span><span class='Function'>×</span><span class='Value'>c</span></code>.</p>
+<pre><span class='Paren'>(</span><span class='Value'>r</span> <span class='Function'>×</span> <span class='Value'>ct</span><span class='Function'>-</span><span class='Value'>d</span><span class='Function'>×</span><span class='Value'>i</span><span class='Paren'>)</span> <span class='Function'>&gt;</span> <span class='Value'>i</span><span class='Function'>×</span><span class='Paren'>(</span><span class='Value'>n</span><span class='Function'>+</span><span class='Value'>i</span><span class='Paren'>)</span><span class='Function'>×</span><span class='Value'>s</span><span class='Function'>÷</span><span class='Value'>n</span>
+</pre>
+<p>Now you have two parameters to tune, <code><span class='Value'>d</span></code> and <code><span class='Value'>s</span></code>. And the coefficient for <code><span class='Value'>d</span></code> depends on the current size of the table. Try not to go insane.</p>
 <h2 id="sparse-and-reverse-lookups"><a class="header" href="#sparse-and-reverse-lookups">Sparse and reverse lookups</a></h2>
 <p>The classic pattern for searching is to build an index of the data to be searched, then use it for each searched-for value. This is an optimization though: the obvious way is to search for one value at a time. Two methods flip this around in a sense, beginning with the searched-for array. Of course this requires that there <em>is</em> such an array, that is, all lookups are done at one time. So these methods work for one-shot lookups as in array languages but not persistent search structures used in others.</p>
 <table>
author	Marshall Lochbaum <mwlochbaum@gmail.com>	2022-11-06 20:04:48 -0500
committer	Marshall Lochbaum <mwlochbaum@gmail.com>	2022-11-06 20:04:48 -0500
commit	5dbe5b502b760b5d87c043b5ab836751b2c3e5e8 (patch)
tree	a1b4f63d6ac913991f2b6bd9e31b8fd3b6e7d193 /docs
parent	dd3f993803386d28b3966c6b6ebcc5d1d372b99d (diff)