From ffa4337909898f3a54469bee1baf0294b7bcffa3 Mon Sep 17 00:00:00 2001
From: Marshall Lochbaum
Date: Fri, 2 Sep 2022 22:35:56 -0400
Subject: Notes on linear searches (including vector binary) and vector 256-bit tables

---
 docs/implementation/primitive/search.html | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/docs/implementation/primitive/search.html b/docs/implementation/primitive/search.html
index dcbe6704..08140eb3 100644
--- a/docs/implementation/primitive/search.html
+++ b/docs/implementation/primitive/search.html
@@ -7,10 +7,19 @@

Implementation of search functions

This page covers the search functions, dyadic ⊐⊒∊, and self-search functions, monadic ⊐⊒∊⍷. Generally speaking, hash tables or plain lookup tables are the fastest way to implement these functions, because they transform the problem of searching into random access, which is something computers specialize in. In some edge cases, and when the search becomes large enough that caches can't speed up random access, other methods can be relevant.

Searching is closely related to sorting. I've found it helpful to think of a search function as an ordered lookup according to an unspecified order. The advantage over a predefined order is that the order can be shuffled with a hash function to randomize the distribution. The fact that the wanted result is in the original argument order, not sorted, also means that sparse tables are more effective—there is never any need to traverse the table to pack ordered values as in sorting.


Small arguments


When one argument to a search function is small, it's best to use a simple method that's linear in each argument, O(mn) in total—effectively linear, since one factor is small. The simplest versions loop over one argument (either one), comparing each of its elements against every element of the other; a sketch follows.

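As a concrete illustration, here's a minimal C sketch of one of these for a membership test on 4-byte values (names and fixed types are illustrative, not from the BQN sources). The inner loop is branchless, so a compiler can auto-vectorize it, or it can be rewritten directly with SIMD compares.

```c
#include <stddef.h>
#include <stdint.h>

// Linear membership test: r[i] = 1 if ndl[i] appears anywhere in hay.
// O(mn) comparisons, but with a tiny constant when one side is small.
void member_of_linear(const uint32_t *hay, size_t hn,
                      const uint32_t *ndl, size_t nn, uint8_t *r) {
    for (size_t i = 0; i < nn; i++) {
        uint8_t found = 0;
        for (size_t j = 0; j < hn; j++)
            found |= (hay[j] == ndl[i]);  // branchless accumulate
        r[i] = found;
    }
}
```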

Each of these easily allows for SIMD comparison. A more effective use of SIMD (unless the small argument has only one element, or maybe up to three) is a vector binary search as presented in the first half of my "Sub-nanosecond Searches" talk (video, slides). This searches a register of sorted values by using shuffle instructions to independently select a pivot for each searched element, then adding a bit to that element's pivot index based on the result of the comparison. It can be used as a reverse lookup as well, performing a non-SIMD update whenever a match is found. Multiple registers can be searched, but each one has to be touched at every step, so asymptotic performance is still O(mn) despite the binary search.
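
A rough sketch of the core step in C, assuming SSSE3 and 1-byte data (a reconstruction of the idea, not the exact code from the talk): the sorted table lives in one register, and each lane builds its result index one bit at a time, gathering its pivot with a byte shuffle.

```c
#include <tmmintrin.h>  // SSSE3, for _mm_shuffle_epi8

// For each of 16 int8 lanes of x, find the index of the last entry of the
// 16-element sorted table tab that is <= x (0 if none; a caller would
// disambiguate that case with one extra compare, or reserve tab[0] as a
// minimal sentinel). Unsigned data would need biasing by 0x80 first.
static __m128i vbinsearch16(__m128i tab, __m128i x) {
    __m128i idx = _mm_setzero_si128();
    for (int b = 8; b > 0; b >>= 1) {               // 4 steps: bits 8,4,2,1
        __m128i bit  = _mm_set1_epi8((char)b);
        __m128i cand = _mm_or_si128(idx, bit);      // candidate pivot index
        __m128i piv  = _mm_shuffle_epi8(tab, cand); // per-lane pivot gather
        __m128i gt   = _mm_cmpgt_epi8(piv, x);      // pivot > x (signed)
        idx = _mm_or_si128(idx, _mm_andnot_si128(gt, bit)); // keep bit if pivot <= x
    }
    return idx;
}
```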

Lookup tables

For the purposes of these notes, a lookup table is storage, indexed by some key, that contains at most one entry per key. This means reading the value for a given key is a simple load—differing from a hash table, which might have collisions where multiple keys indicate the same entry. Lookup table operations are very fast, but only if table accesses stay in cache. So they're useful when the number of possible values (that is, the size of the table) is small: a 1-byte or 2-byte type, or small-range integers. You might expect the entire table has to be initialized, but it doesn't always: see sparse lookups.

For example, a lookup table algorithm for dyadic ⊐ might traverse 𝕨, writing each value's index to the table. Doing this step in reverse index order makes sure the lowest index "wins". Similarly, empty entries must be initialized to ≠𝕨 (the not-found result) beforehand. Then the result is 𝕩⊏t, where t is the table constructed this way. A nonzero minimum value can be handled for free by subtracting it from the table pointer.
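
A minimal C sketch of this algorithm for 1-byte values (function name and fixed types are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

// Dyadic index-of: r[i] = index of the first occurrence of x[i] in w,
// or wn if x[i] doesn't appear.
void index_of_u8(const uint8_t *w, size_t wn,
                 const uint8_t *x, size_t xn, uint32_t *r) {
    uint32_t tab[256];
    for (int v = 0; v < 256; v++) tab[v] = (uint32_t)wn;    // "not found": ≠𝕨
    for (size_t i = wn; i-- > 0; ) tab[w[i]] = (uint32_t)i; // reverse: lowest index wins
    for (size_t i = 0; i < xn; i++) r[i] = tab[x[i]];       // the 𝕩⊏t step
}
```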

Set operations can be handled with a packed bit table, but reading a bit is slower so this should be done only if the space savings are really needed.
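
For comparison, a sketch of the packed version for a 1-byte membership test, where the whole table is 32 bytes at the cost of a shift and mask on every read:

```c
#include <stddef.h>
#include <stdint.h>

// Membership via a 256-bit table packed into 32 bytes.
void member_of_u8(const uint8_t *w, size_t wn,
                  const uint8_t *x, size_t xn, uint8_t *r) {
    uint8_t tab[32] = {0};
    for (size_t i = 0; i < wn; i++)
        tab[w[i] >> 3] |= (uint8_t)(1 << (w[i] & 7));  // set bit w[i]
    for (size_t i = 0; i < xn; i++)
        r[i] = (tab[x[i] >> 3] >> (x[i] & 7)) & 1;     // read bit x[i]
}
```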


For 1-byte values, the 256-bit set table can be packed into vector registers for extra-fast searching. To look up a byte, select the appropriate table byte using the value's top 5 bits, and a single-bit mask from a small constant table using the bottom 3. AND these together, and pack the results into bits with a compare and movemask.
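
A sketch of one step with SSE intrinsics, testing 16 bytes at once; this assumes a two-register table layout (tab_lo for values 0–127, tab_hi for 128–255), since a 16-byte shuffle can only index one register at a time.

```c
#include <tmmintrin.h>  // SSSE3

// Membership test for 16 bytes against a 256-bit table held in two
// 16-byte registers: bit (v&7) of table byte (v>>3) is set iff v is in
// the set. Returns a 16-bit mask, one bit per input byte.
static int in_set_16bytes(__m128i tab_lo, __m128i tab_hi, __m128i x) {
    // Table byte index: top 5 bits of each value (16-bit shift, then
    // mask off bits leaking in from the neighboring byte).
    __m128i i5 = _mm_and_si128(_mm_srli_epi16(x, 3), _mm_set1_epi8(0x1f));
    // Select from the right half: pshufb zeroes any lane whose index
    // byte has bit 7 set, which masks off the wrong half here.
    __m128i hi = _mm_cmpgt_epi8(i5, _mm_set1_epi8(15)); // 0xff where index >= 16
    __m128i b  = _mm_or_si128(
        _mm_shuffle_epi8(tab_lo, _mm_or_si128(i5, hi)),
        _mm_shuffle_epi8(tab_hi, _mm_or_si128(i5, _mm_xor_si128(hi, _mm_set1_epi8(-1)))));
    // Mask bit within the byte: 1 << (bottom 3 bits), from a constant table.
    __m128i bit = _mm_shuffle_epi8(
        _mm_setr_epi8(1,2,4,8,16,32,64,-128,1,2,4,8,16,32,64,-128),
        _mm_and_si128(x, _mm_set1_epi8(7)));
    // Member iff the selected bit is set; pack with compare-movemask.
    return _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_and_si128(b, bit), bit));
}
```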

Hash tables

A hash table is a more sophisticated design where there are more possible keys than table entries. Good performance depends on not having too many actual keys packed into a small part of the table, which is why this method is named after the hash function: the hash is what spreads keys out. If the data is expected to be random then no hash function is needed (the identity function can be used), but I don't think that happens much with searching. Hash tables generally degrade to the performance of a linear lookup if the hash is defeated, so it's ideal to have a way to escape and use a sorting-based method if too many hashes collide.
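
To make the escape hatch concrete, here's an illustrative C sketch (not the actual implementation) of a self-search—marking first occurrences—that counts probes and tells the caller to fall back when collisions pile up; the hash mixer and the threshold are arbitrary choices.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Example bit mixer (a well-known 32-bit hash step); any decent mixer works.
static uint32_t hash32(uint32_t v) { v = (v ^ (v >> 16)) * 0x45d9f3b; return v ^ (v >> 16); }

// Self-search: r[i] = 1 if x[i] is the first occurrence of its value.
// Returns 0 on success, 1 if probing got bad enough that the caller
// should fall back to a sort-based method. (Error handling omitted.)
int mark_firsts_u32(const uint32_t *x, size_t n, uint8_t *r) {
    size_t sz = 1; while (sz < 2 * n) sz *= 2;  // power-of-two size, load < 1/2
    uint32_t *tab  = malloc(sz * sizeof *tab);
    uint8_t  *used = calloc(sz, 1);
    size_t probes = 0, limit = 8 * n;           // arbitrary escape threshold
    for (size_t i = 0; i < n; i++) {
        size_t h = hash32(x[i]) & (sz - 1);
        while (used[h] && tab[h] != x[i]) {     // linear probing
            h = (h + 1) & (sz - 1);
            if (++probes > limit) { free(tab); free(used); return 1; }
        }
        r[i] = !used[h];                        // first occurrence of this value?
        used[h] = 1; tab[h] = x[i];
    }
    free(tab); free(used);
    return 0;
}
```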

Hashing is really the only way to get a performant lookup on arbitrary data. For 2-byte and small-range data, lookup tables are better, and in several 4-byte cases a lookup with partitioning is competitive for smaller arrays and much better for large ones, as the hash table outgrows the cache (at around 1e5 or more elements to be hashed).

-- cgit v1.2.3