Diffstat (limited to 'docs/implementation/primitive/search.html')
 docs/implementation/primitive/search.html | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/docs/implementation/primitive/search.html b/docs/implementation/primitive/search.html
index 8110170f..2ac94521 100644
--- a/docs/implementation/primitive/search.html
+++ b/docs/implementation/primitive/search.html
@@ -6,10 +6,22 @@
<div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a> / <a href="index.html">primitive</a></div>
<h1 id="implementation-of-search-functions"><a class="header" href="#implementation-of-search-functions">Implementation of search functions</a></h1>
<p>This page covers the <a href="../../doc/search.html">search functions</a>, dyadic <code><span class='Function'>⊐⊒∊</span></code>, and <a href="../../doc/selfcmp.html">self-search functions</a>, monadic <code><span class='Function'>⊐⊒∊⍷</span></code>. Generally speaking, hash tables or plain lookup tables are the fastest way to implement these functions, because they transform the problem of searching into random access, which is something computers specialize in. In some edge cases, and when the search becomes large enough that caches can't speed up random access, other methods can be relevant.</p>
-<h2 id="lookup-table"><a class="header" href="#lookup-table">Lookup table</a></h2>
+<p>Searching is closely related to <a href="sort.html">sorting</a>. I've found it helpful to think of a search function as an ordered lookup according to an unspecified order. The advantage over a predefined order is that the order can be shuffled with a hash function to randomize the distribution. The fact that the wanted result is in the original argument order, not sorted, also means that sparse tables are more effective—there is never any need to traverse the table to pack ordered values as in sorting.</p>
+<h2 id="lookup-tables"><a class="header" href="#lookup-tables">Lookup tables</a></h2>
<p>For the purposes of these notes, a lookup table is storage, indexed by some key, that contains at most one entry per key. This means reading the value for a given key is a simple load—differing from a hash table, which might have collisions where multiple keys indicate the same entry. Lookup table operations are very fast, but the entire table needs to be initialized and stay in cache. So they're useful when the number of possible values (that is, size of the table) is small: a 1-byte or 2-byte type, or small-range integers.</p>
<p>For example, a lookup table algorithm for dyadic <code><span class='Function'>⊐</span></code> might traverse <code><span class='Value'>𝕨</span></code>, writing each value's index to the table. Doing this step in reverse index order makes sure the lowest index &quot;wins&quot;. Empty entries must be initialized to <code><span class='Function'>≠</span><span class='Value'>𝕨</span></code> beforehand, since that's the result for values not found in <code><span class='Value'>𝕨</span></code>. Then the result is <code><span class='Value'>𝕩</span><span class='Function'>⊏</span><span class='Value'>t</span></code> where <code><span class='Value'>t</span></code> is the table constructed this way. A nonzero minimum value can be handled for free by subtracting it from the table pointer.</p>
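+<p>As a concrete sketch, here's roughly what that looks like in C for 1-byte data. The function name and signature are illustrative, not taken from any real implementation:</p>
+<pre>
+#include <stddef.h>
+#include <stdint.h>
+
+// Lookup-table index-of (dyadic ⊐) for 1-byte data: r[i] gets the index
+// of the first match of x[i] in w, or wn if there is none.
+void index_of_u8(uint32_t *r, const uint8_t *w, size_t wn,
+                 const uint8_t *x, size_t xn) {
+  uint32_t tab[256];
+  for (size_t k = 0; k < 256; k++) tab[k] = (uint32_t)wn; // init empty entries to ≠𝕨
+  for (size_t i = wn; i--; ) tab[w[i]] = (uint32_t)i;     // reverse order: lowest index wins
+  for (size_t i = 0; i < xn; i++) r[i] = tab[x[i]];       // the 𝕩⊏t step
+}
+</pre>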
<p>Set operations can be handled with a packed bit table, but reading a bit is slower so this should be done only if the space savings are really needed.</p>
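+<p>A minimal sketch of the bit-table version, for dyadic <code>∊</code> on 2-byte data (again with hypothetical names; the table is 65536 bits, or 8KiB):</p>
+<pre>
+#include <stddef.h>
+#include <stdint.h>
+#include <string.h>
+
+// Bit-table member-of (dyadic ∊) for 2-byte data: r[i] is 1 if w[i]
+// appears anywhere in x, else 0.
+void member_of_u16(uint8_t *r, const uint16_t *w, size_t wn,
+                   const uint16_t *x, size_t xn) {
+  uint64_t bits[65536/64];
+  memset(bits, 0, sizeof(bits));
+  for (size_t i = 0; i < xn; i++)
+    bits[x[i]>>6] |= (uint64_t)1 << (x[i]&63);   // mark each value in 𝕩
+  for (size_t i = 0; i < wn; i++)
+    r[i] = (bits[w[i]>>6] >> (w[i]&63)) & 1;     // shift and mask: slower than a byte load
+}
+</pre>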
+<h2 id="hash-tables"><a class="header" href="#hash-tables">Hash tables</a></h2>
+<p>A hash table is a more sophisticated design where there are more possible keys than table entries. For good performance it depends on not having too many <em>actual</em> keys packed into a small space, which is why this method is named after the hash function. If the data is expected to be random then no hash function is needed (the identity function can be used), but I don't think that happens much with searching. Hash tables generally degrade to the performance of a linear lookup if the hash is defeated, so it's ideal to have a way to escape and use a sorting-based method if too many hashes collide.</p>
+<p>Hashing is really the only way to get a performant lookup on arbitrary data. For 2-byte and small-range data, lookup tables are better, and in several 4-byte cases, a lookup table with <a href="#partitioning">partitioning</a> is competitive for smaller arrays and much better for large ones, where the hash table outgrows the cache (&gt;1e5 elements to be hashed or so).</p>
+<p>While hash tables are well studied, almost all the work is focused on large persistent tables, meaning that they're not too suited for a one-shot search function. Abseil's <a href="https://github.com/abseil/abseil-cpp/blob/master/absl/container/flat_hash_map.h">flat_hash_map</a> is fine, I guess. Roger Hui's <a href="https://www.jsoftware.com/papers/indexof/indexof.htm">Index-Of, a 30-Year Quest</a> works as an introduction to hashing in APL, although it has begun to suffer from the small number of years in places, and some details have serious issues (with a power-of-two table size, multiplying by a prime causes high bits to be lost and so is hardly better than no hash). The second half of my &quot;Sub-nanosecond Searches&quot; talk (<a href="https://dyalog.tv/Dyalog18/?v=paxIkKBzqBU">video</a>, <a href="https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D08_Searches_Using_Vector_Instructions.zip">slides</a>) covers a difficult 4-byte design that's very good for membership and negative lookups (in particular, it's perfect for the reverse lookup as described in the next section).</p>
+<p>I'd take the following choices as a given for an array language hash design (a sketch combining them follows the list):</p>
+<ul>
+<li>Power-of-two size</li>
+<li>Open addressing</li>
+<li>Linear probing</li>
+</ul>
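+<p>Here's a sketch of a table built on those three choices, for dyadic <code>⊐</code> on 4-byte data. Everything is illustrative: the names are made up, allocation failures aren't checked, and there's no escape hatch for defeated hashes, which a real implementation would want.</p>
+<pre>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+// Mixing function for 4-byte keys (see below); this is the murmur3
+// finalizer, which is reversible.
+static inline uint32_t mix32(uint32_t h) {
+  h ^= h >> 16; h *= 0x85EBCA6B;
+  h ^= h >> 13; h *= 0xC2B2AE35;
+  h ^= h >> 16; return h;
+}
+
+void index_of_i32(uint32_t *r, const int32_t *w, size_t wn,
+                  const int32_t *x, size_t xn) {
+  size_t sz = 16;                               // power-of-two size,
+  while (sz < 2*wn) sz *= 2;                    // load factor at most 1/2
+  size_t mask = sz - 1;
+  int32_t  *keys = malloc(sz * sizeof(int32_t));
+  uint32_t *vals = malloc(sz * sizeof(uint32_t));
+  uint8_t  *used = calloc(sz, 1);               // open addressing: slots in one flat array
+  for (size_t i = 0; i < wn; i++) {             // build from 𝕨; first occurrence wins
+    size_t j = mix32((uint32_t)w[i]) & mask;
+    while (used[j] && keys[j] != w[i]) j = (j+1) & mask;   // linear probing
+    if (!used[j]) { used[j] = 1; keys[j] = w[i]; vals[j] = (uint32_t)i; }
+  }
+  for (size_t i = 0; i < xn; i++) {             // probe for each value of 𝕩
+    size_t j = mix32((uint32_t)x[i]) & mask;
+    while (used[j] && keys[j] != x[i]) j = (j+1) & mask;
+    r[i] = used[j] ? vals[j] : (uint32_t)wn;    // empty slot means not found
+  }
+  free(keys); free(vals); free(used);
+}
+</pre>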
+<p>The main cost for larger data is the hashing itself; <a href="https://github.com/wangyi-fudan/wyhash">wyhash</a> appears to be one of the best choices at the time of writing. 4- and 8-byte lookups are where all the fancy optimizations are wanted. Hashes on these fixed sizes should be reversible and are often called mixing functions in the literature. A CRC instruction makes a good one if available.</p>
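+<p>For instance, on x86 with SSE4.2 (ARM's crc32cw works the same way), a 4-byte mix might be nothing more than the following; the seed here is an arbitrary constant, not anything standard:</p>
+<pre>
+#include <nmmintrin.h>   // SSE4.2 intrinsics; compile with -msse4.2
+#include <stdint.h>
+
+// Single-instruction mixing function for 4-byte keys. CRC32C is linear
+// over GF(2), so with a fixed seed this map is reversible.
+static inline uint32_t hash_crc32(uint32_t key) {
+  return _mm_crc32_u32(0x9E3779B9u, key);
+}
+</pre>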
<h2 id="reverse-lookups"><a class="header" href="#reverse-lookups">Reverse lookups</a></h2>
<p>The classic pattern for searching is to build an index of the data to be searched, then use it for each searched-for value. This is an optimization though: the obvious way is to search for one value at a time. What I call a reverse lookup returns to this method in a sense, and is useful if the searched-in array is larger than the searched-for one by a factor of 2 or so.</p>
<p>The method is to build an index of all searched-for values, then iterate over searched-in values one at a time. For each one, check if it matches any searched-for values, update results for those accordingly, and remove that value from the index. Since each value only has one first match, the total number of removals is at most the number of searched-for values. The traversal can stop early if all these values are found, and it could also switch to a faster index if most of them are found, although I haven't tried this.</p>
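+<p>A sketch of this method for dyadic <code>⊐</code> on 4-byte data, reusing the hypothetical <code>mix32</code> from the hash table sketch above. Duplicate searched-for values are chained through a <code>next</code> array so that all of them resolve when their first match is seen:</p>
+<pre>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+// Reverse lookup for w⊐x when w is much longer than x: index x, then
+// scan w once, stopping as soon as every value of x has been found.
+// Requires mix32 as defined in the earlier sketch.
+void index_of_rev_i32(uint32_t *r, const int32_t *w, size_t wn,
+                      const int32_t *x, size_t xn) {
+  size_t sz = 16; while (sz < 2*xn) sz *= 2;
+  size_t mask = sz - 1;
+  uint32_t NONE = (uint32_t)xn;                   // chain terminator
+  int32_t  *keys = malloc(sz * sizeof(int32_t));
+  uint32_t *head = malloc(sz * sizeof(uint32_t)); // first x index for each key
+  uint32_t *next = malloc(xn * sizeof(uint32_t)); // chains duplicate x values
+  uint8_t  *used = calloc(sz, 1);
+  for (size_t i = 0; i < xn; i++) {             // index the searched-for values
+    size_t j = mix32((uint32_t)x[i]) & mask;
+    while (used[j] && keys[j] != x[i]) j = (j+1) & mask;
+    if (!used[j]) { used[j] = 1; keys[j] = x[i]; head[j] = NONE; }
+    next[i] = head[j]; head[j] = (uint32_t)i;   // push i onto this key's chain
+    r[i] = (uint32_t)wn;                        // default: not found
+  }
+  size_t left = xn;                             // number of unresolved x entries
+  for (size_t i = 0; i < wn && left > 0; i++) { // early exit once all are found
+    size_t j = mix32((uint32_t)w[i]) & mask;
+    while (used[j] && keys[j] != w[i]) j = (j+1) & mask;
+    if (!used[j]) continue;                     // w[i] was never searched for
+    for (uint32_t k = head[j]; k != NONE; k = next[k]) { r[k] = (uint32_t)i; left--; }
+    head[j] = NONE;                             // remove the value from the index
+  }
+  free(keys); free(head); free(next); free(used);
+}
+</pre>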