Diffstat (limited to 'docs')
-rw-r--r-- docs/implementation/primitive/search.html | 46
1 file changed, 36 insertions(+), 10 deletions(-)
diff --git a/docs/implementation/primitive/search.html b/docs/implementation/primitive/search.html
index 2ac94521..dcbe6704 100644
--- a/docs/implementation/primitive/search.html
+++ b/docs/implementation/primitive/search.html
@@ -8,7 +8,7 @@
<p>This page covers the <a href="../../doc/search.html">search functions</a>, dyadic <code><span class='Function'>⊐⊒∊</span></code>, and <a href="../../doc/selfcmp.html">self-search functions</a>, monadic <code><span class='Function'>⊐⊒∊⍷</span></code>. Generally speaking, hash tables or plain lookup tables are the fastest way to implement these functions, because they transform the problem of searching into random access, which is something computers specialize in. In some edge cases, and when the search becomes large enough that caches can't speed up random access, other methods can be relevant.</p>
<p>Searching is closely related to <a href="sort.html">sorting</a>. I've found it helpful to think of a search function as an ordered lookup according to an unspecified order. The advantage over a predefined order is that the order can be shuffled with a hash function to randomize the distribution. The fact that the wanted result is in the original argument order, not sorted, also means that sparse tables are more effective—there is never any need to traverse the table to pack ordered values as in sorting.</p>
<h2 id="lookup-tables"><a class="header" href="#lookup-tables">Lookup tables</a></h2>
-<p>For the purposes of these notes, a lookup table is storage, indexed by some key, that contains at most one entry per key. This means reading the value for a given key is a simple load—differing from a hash table, which might have collisions where multiple keys indicate the same entry. Lookup table operations are very fast, but the entire table needs to be initialized and stay in cache. So they're useful when the number of possible values (that is, size of the table) is small: a 1-byte or 2-byte type, or small-range integers.</p>
+<p>For the purposes of these notes, a lookup table is storage, indexed by some key, that contains at most one entry per key. This means reading the value for a given key is a simple load—differing from a hash table, which might have collisions where multiple keys indicate the same entry. Lookup table operations are very fast, but only while table accesses stay in cache. So they're useful when the number of possible values (that is, the size of the table) is small: a 1-byte or 2-byte type, or small-range integers. You might expect that the entire table has to be initialized, but it doesn't always: see <a href="#sparse-and-reverse-lookups">sparse lookups</a>.</p>
<p>For example, a lookup table algorithm for dyadic <code><span class='Function'>⊐</span></code> might traverse <code><span class='Value'>𝕨</span></code>, writing each value's index to the table. Doing this step in reverse index order makes sure the lowest index &quot;wins&quot;. Similarly, empty entries must be initialized to <code><span class='Function'>≠</span><span class='Value'>𝕨</span></code> beforehand. Then the result is <code><span class='Value'>𝕩</span><span class='Function'>⊏</span><span class='Value'>t</span></code> where <code><span class='Value'>t</span></code> is the table constructed this way. A nonzero minimum value can be handled for free by subtracting it from the table pointer.</p>
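+<p>As a concrete sketch, here's what that algorithm might look like in C for a 1-byte element type (names and types are illustrative, not taken from any real implementation):</p>
+<pre>
+#include &lt;stddef.h&gt;
+#include &lt;stdint.h&gt;
+
+// Dyadic ⊐ via a lookup table: w is searched-in (𝕨), x is searched-for (𝕩).
+void index_of_u8(const uint8_t *w, size_t wn,
+                 const uint8_t *x, size_t xn, size_t *r) {
+    size_t table[256];
+    for (size_t i = 0; i &lt; 256; i++) table[i] = wn;  // empty entries mean ≠𝕨
+    for (size_t i = wn; i-- &gt; 0; ) table[w[i]] = i;  // reverse: lowest index wins
+    for (size_t i = 0; i &lt; xn; i++) r[i] = table[x[i]];  // each lookup is one load
+}
+</pre>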
<p>Set operations can be handled with a packed bit table, but reading a bit is slower so this should be done only if the space savings are really needed.</p>
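+<p>A minimal sketch of such a bit table, for dyadic <code><span class='Function'>∊</span></code> on a 2-byte type (again with illustrative names):</p>
+<pre>
+#include &lt;stdbool.h&gt;
+#include &lt;stddef.h&gt;
+#include &lt;stdint.h&gt;
+
+// Membership (dyadic ∊) with one bit per possible key: 2^16 bits = 8KiB.
+void member_of_u16(const uint16_t *w, size_t wn,
+                   const uint16_t *x, size_t xn, bool *r) {
+    uint64_t bits[1 &lt;&lt; 10] = {0};
+    for (size_t i = 0; i &lt; wn; i++)
+        bits[w[i] &gt;&gt; 6] |= (uint64_t)1 &lt;&lt; (w[i] &amp; 63);
+    for (size_t i = 0; i &lt; xn; i++)  // the slower part: shift and mask per read
+        r[i] = (bits[x[i] &gt;&gt; 6] &gt;&gt; (x[i] &amp; 63)) &amp; 1;
+}
+</pre>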
<h2 id="hash-tables"><a class="header" href="#hash-tables">Hash tables</a></h2>
@@ -22,13 +22,39 @@
<li>Linear probing</li>
</ul>
<p>The main cost for larger data is the hashing itself; <a href="https://github.com/wangyi-fudan/wyhash">wyhash</a> appears to be one of the best choices at the time of writing. 4- and 8-byte lookups are where all the fancy optimizations are wanted. Hashes on these fixed sizes should be reversible and are often called mixing functions in the literature. A CRC instruction makes a good one if available.</p>
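+<p>As a generic example of such a mixing function, here's the widely used splitmix64 finalizer (shown for illustration; not necessarily the choice of any given implementation):</p>
+<pre>
+#include &lt;stdint.h&gt;
+
+// Reversible mix for 8-byte keys: every step is invertible, so distinct
+// keys always hash to distinct values.
+static inline uint64_t mix64(uint64_t x) {
+    x ^= x &gt;&gt; 30; x *= 0xbf58476d1ce4e5b9ULL;
+    x ^= x &gt;&gt; 27; x *= 0x94d049bb133111ebULL;
+    x ^= x &gt;&gt; 31;
+    return x;
+}
+</pre>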
-<h2 id="reverse-lookups"><a class="header" href="#reverse-lookups">Reverse lookups</a></h2>
-<p>The classic pattern for searching is to build an index of the data to be searched, then use it for each searched-for value. This is an optimization though: the obvious way is to search for one value at a time. What I call a reverse lookup returns to this method in a sense, and is useful if the searched-in array is larger than searched-for by a factor of 2 or so.</p>
-<p>The method is to build an index of all searched-for values, then iterate over searched-in values one at a time. For each one, check if it matches any searched-for values, update results for those accordingly, and remove that value from the index. Since each value only has one first match, the total number of removals is at most the number of searched-for values. The traversal can stop early if all these values are found, and it could also switch to a faster index if most of them are found, although I haven't tried this.</p>
-<p>When the searched-in array is much larger, performance tends to the speed of a <em>set</em> lookup on that array, which can be several times faster than an index lookup for smaller types. The overhead for the searched-in values is usually higher than normal hash table insertion.</p>
+<h2 id="sparse-and-reverse-lookups"><a class="header" href="#sparse-and-reverse-lookups">Sparse and reverse lookups</a></h2>
+<p>The classic pattern for searching is to build an index of the data to be searched, then use it for each searched-for value. This is an optimization though: the obvious way is to search for one value at a time. Two methods flip this around in a sense, beginning with the searched-for array. Of course this requires that there <em>is</em> such an array, that is, that all lookups are done at one time. So these methods work for one-shot lookups as in array languages, but not for the persistent search structures used in other languages. In the traversal orders below, &quot;table&quot; means initializing the entire table.</p>
+<table>
+<thead>
+<tr>
+<th>Largest</th>
+<th>Method</th>
+<th>Traversal order</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Searched-for</td>
+<td>Normal lookup</td>
+<td>table - in - for</td>
+</tr>
+<tr>
+<td>Searched-in</td>
+<td>Reverse lookup</td>
+<td>table - for - in - maybe for/table</td>
+</tr>
+<tr>
+<td>Table</td>
+<td>Sparse lookup</td>
+<td>for - in - for</td>
+</tr>
+</tbody>
+</table>
+<p>A sparse lookup applies mainly to lookup tables, and pays off when the table size is much larger than the data. The idea is that the main cost would be initializing the table, but it will only ever be read to look up searched-for values. So initialize only the entries keyed by the searched-for array, <em>then</em> write searched-in values and look up searched-for ones (in a self-search function these last two steps are combined, making the method even more effective). It's not so great for hash tables, because probing means a lookup can read entries other than the one initialized for its value, so initializing one entry per value isn't enough; and anyway a hash table can be sized to avoid a large initialization cost.</p>
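+<p>A sketch of a sparse lookup in C, under the same illustrative conventions as above. The table is never cleared, so the total work is proportional to the argument lengths rather than the table size:</p>
+<pre>
+#include &lt;stddef.h&gt;
+#include &lt;stdint.h&gt;
+
+// Sparse index-of for a 2-byte type: initialize only the entries that
+// will be read, i.e. those keyed by 𝕩. Stale entries are never read.
+void sparse_index_of_u16(const uint16_t *w, size_t wn,
+                         const uint16_t *x, size_t xn, size_t *r) {
+    static size_t table[1 &lt;&lt; 16];  // never cleared between calls
+    for (size_t i = 0; i &lt; xn; i++) table[x[i]] = wn;  // init 𝕩's entries only
+    for (size_t i = wn; i-- &gt; 0; )  table[w[i]] = i;   // reverse: lowest index wins
+    for (size_t i = 0; i &lt; xn; i++) r[i] = table[x[i]];
+}
+</pre>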
+<p>A reverse lookup has a similar pattern but does require table initialization. It works for lookup tables, but is more impressive for hash tables, and is useful if the searched-in array is larger than the searched-for array by a factor of 2 or so. The method is to build an index of all searched-for values, then iterate over searched-in values one at a time. For each one, check whether it matches any searched-for values, update the results for those accordingly, and remove that value from the index. Since each value only has one first match, the total number of removals is at most the number of searched-for values. The traversal can stop early if all these values are found, and it could also switch to a faster index once most of them are found, although I haven't tried this.</p>
+<p>As the searched-in array gets larger in a reverse lookup, ideal performance tends to the speed of a <em>set</em> lookup on that array, which can be several times faster than an index lookup for smaller types. Since each searched-for value can only be found once, this performance is achieved with a set lookup followed by a slow path that does an index lookup (other methods might be better depending on table specifics, but this one is general). However, the overhead for the searched-for values is usually higher than normal hash table insertion.</p>
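+<p>A sketch of the basic reverse lookup, again over a lookup table for simplicity (a hash table would take its place for larger types; names are illustrative). Chaining through <code>next</code> lets duplicate searched-for values all be resolved at the first match:</p>
+<pre>
+#include &lt;stddef.h&gt;
+#include &lt;stdint.h&gt;
+#include &lt;stdlib.h&gt;
+
+// Reverse index-of: index the searched-for values, then traverse
+// searched-in values, resolving and removing matches as they appear.
+void reverse_index_of_u8(const uint8_t *w, size_t wn,
+                         const uint8_t *x, size_t xn, size_t *r) {
+    size_t head[256];                            // value to first position in x
+    size_t *next = malloc(xn * sizeof(size_t));  // chains of equal values
+    for (size_t i = 0; i &lt; 256; i++) head[i] = SIZE_MAX;
+    for (size_t i = xn; i-- &gt; 0; ) { next[i] = head[x[i]]; head[x[i]] = i; }
+    for (size_t i = 0; i &lt; xn; i++) r[i] = wn;   // default: not found
+    size_t found = 0;
+    for (size_t i = 0; i &lt; wn &amp;&amp; found &lt; xn; i++) {  // stop early when all found
+        for (size_t j = head[w[i]]; j != SIZE_MAX; j = next[j]) { r[j] = i; found++; }
+        head[w[i]] = SIZE_MAX;                   // remove: at most xn removals total
+    }
+    free(next);
+}
+</pre>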
<h2 id="partitioning"><a class="header" href="#partitioning">Partitioning</a></h2>
-<p><a href="https://github.com/mlochbaum/rhsort">Robin Hood sort</a> sorts small uniform arrays quickly by considering hash tables as a way of sorting hashes. This cuts both ways: RH sort <a href="https://github.com/mlochbaum/rhsort/blob/master/images/rand.svg">slows down</a> far more than other sorting methods on large arrays because of its random access patterns, and so do hash table operations. For large enough hash tables, it ought to make sense to bring in sorting-based methods in order to reduce the search size.</p>
-<p>Of course, it's possible to implement searches using only sorting and no hashing: <code><span class='Function'>∧</span><span class='Modifier2'>⊸</span><span class='Function'>⍋⊏⍋</span><span class='Modifier2'>∘</span><span class='Function'>⊣</span></code> with some adjustments to the binary search. A hash takes advantage of the fact that what ordering is used doesn't matter to rearrange things and get expected equal distribution of unique keys. It's usually going to be best to use a hash table as the base case, so that it's the hashes being sorted. With small element sizes and a bijective hash, only the hashes need to be compared, so the arguments can be hashed at the start and the original values discarded.</p>
-<p>One option is <a href="sort.html#partitioning">partitioning</a> as in quicksort, but the unstable version doesn't work for many functions, and comparing with a pivot is wasted effort as top bits can be used directly. Stable <a href="sort.html#radix-sort">radix</a> passes are ideal here. However, they do use a lot of extra memory: twice the size of the hashed array (or equal to it if it can be reused). To undo a radix sort in a cache-friendly way, the original hash bits need to be kept around to retrace the steps. Even in cases where the original indices are known, traversing the radix-ed values and writing back to these indices makes many loops around the original array, moving too quickly for the cache to keep up.</p>
-<p>Partitioning doesn't really have to interact with hash insertions or lookups: after the data is partitioned at a suitable size, it will just hash faster because it only accesses part of the hash table at a time. It's also possible to save space by using a smaller hash table and doing one partition with it, then the next, and so on.</p>
-<p>There are a few tricks to avoid having to re-initialize the table on each pass. This is a big deal because it means a very low load factor can be used, which allows for faster hash or even a lookup table! First, and more generally, the table can be cleaned up <em>after</em> a pass by walking through the elements again (possibly in reverse for some hash designs). Second, if enough data is stored in the hash table to distinguish one pass from the next, then values from previous passes can be interpreted as empty so that no re-initialization is necessary.</p>
+<p><a href="https://github.com/mlochbaum/rhsort">Robin Hood sort</a> sorts small uniform arrays quickly by considering hash tables as a way of sorting hashes. This cuts both ways: RH sort <a href="https://github.com/mlochbaum/rhsort/blob/master/images/rand.svg">slows down</a> far more than other sorting methods on large arrays because of its random access patterns, and so do hash table operations. For large enough hash tables, it ought to make sense to bring in sorting-based methods in order to reduce the search size. As in quicksort, partitioning has two advantages in that it reduces the number of values to be compared and also their range. It does have a lot of memory usage because entries need to be copied rather than swapped for stability, and ordering data has to be kept around.</p>
+<p><a href="sort.html#partitioning">Partitioning</a> as in quicksort is possible, but I think stable <a href="sort.html#radix-sort">radix</a> passes are almost always better. The typical problem with radix passes on high bits is that they might distribute values unevenly, but the values will either be searched with constant memory (lookup table) or redistributed (hash table). To undo a radix sort in a cache-friendly way, the original hash bits need to be kept around to retrace the steps. Even in cases where the original indices are known, traversing the radix-ed values and writing back to these indices makes many loops around the original array, moving too quickly for the cache to keep up.</p>
+<p>For 4-byte values, two 8-bit passes reduce the range to two bytes, which is within lookup table range. With <a href="#sparse-and-reverse-lookups">sparse</a> initialization this is very fast, and useful even for small sizes under 100 elements. I think it just about obsoletes 4-byte hash tables for self-search functions. But hash tables favoring lookup speed can be better for search functions as long as the argument sizes aren't too close.</p>
+<p>Partitioning doesn't really have to interact with hash insertions or lookups: after the data is partitioned at a suitable size, it will just hash faster because it only accesses part of the hash table at a time. But it's better to use a smaller hash table, doing one partition with it, then the next, and so on. There are a few tricks to avoid having to re-initialize the table on each pass. This is a big deal because it means a very low load factor can be used, which speeds things up! First, and more generally, the table can be cleaned up <em>after</em> a pass by walking through the elements again (possibly in reverse for some hash designs). Second, if enough data is stored in the hash table to distinguish one pass from the next, then values from previous passes can be interpreted as empty so that no re-initialization is necessary.</p>
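+<p>A sketch of that second trick (illustrative C): each entry is tagged with the pass that wrote it, so entries left over from previous passes read as empty with no clearing step.</p>
+<pre>
+#include &lt;stdbool.h&gt;
+#include &lt;stddef.h&gt;
+#include &lt;stdint.h&gt;
+
+// Zero the table once and start passes at 1: then any entry written in
+// an earlier pass is treated as empty by the current one.
+typedef struct { uint32_t pass; uint32_t index; } Entry;
+
+static inline bool entry_empty(Entry e, uint32_t pass) {
+    return e.pass != pass;
+}
+
+static inline void entry_set(Entry *table, size_t slot,
+                             uint32_t pass, uint32_t index) {
+    table[slot] = (Entry){ pass, index };
+}
+</pre>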