Notes on reverse and partitioned search implementation

author: Marshall Lochbaum <mwlochbaum@gmail.com> 2022-08-21 18:41:31 -0400
committer: Marshall Lochbaum <mwlochbaum@gmail.com> 2022-08-21 18:41:31 -0400
commit: a23e8be6fec9510be6fbba6b3c3201e88e9785d2 (patch)
tree: 405f50f433bf95c0b69bcdec3f983013f4536bbb /docs/implementation/primitive/search.html
parent: 522593291751c89db2cab0db7e6522973894cafd (diff)
1 files changed, 18 insertions, 0 deletions
diff --git a/docs/implementation/primitive/search.html b/docs/implementation/primitive/search.html
new file mode 100644
index 00000000..128f27e8
--- /dev/null
+++ b/docs/implementation/primitive/search.html
@@ -0,0 +1,18 @@
+<head>
+  <link href="../../favicon.ico" rel="shortcut icon" type="image/x-icon"/>
+  <link href="../../style.css" rel="stylesheet"/>
+  <title>BQN: Implementation of search functions</title>
+</head>
+<div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a> / <a href="index.html">primitive</a></div>
+<h1 id="implementation-of-search-functions"><a class="header" href="#implementation-of-search-functions">Implementation of search functions</a></h1>
+<p>This page covers the <a href="../../doc/search.html">search functions</a>, dyadic <code><span class='Function'>⊐⊒∊</span></code>, and <a href="../../doc/selfcmp.html">self-search functions</a>, monadic <code><span class='Function'>⊐⊒∊⍷</span></code>. Generally speaking, hash tables or plain lookup tables are the fastest way to implement these functions, because they transform the problem of searching into random access, which is something computers specialize in. In some edge cases, and when the search becomes large enough that caches can't speed up random access, other methods can be relevant.</p>
+<h2 id="reverse-lookups"><a class="header" href="#reverse-lookups">Reverse lookups</a></h2>
+<p>The classic pattern for searching is to build an index of the data to be searched, then use it for each searched-for value. This is an optimization though: the obvious way is to search for one value at a time. What I call a reverse lookup returns to this method in a sense, and is useful if the searched-in array is larger than searched-for by a factor of 2 or so.</p>
+<p>The method is to build an index of all searched-for values, then iterate over searched-in values one at a time. For each one, check if it matches any searched-for values, update results for those accordingly, and remove that value from the index. Since each value only has one first match, the total number of removals is at most the number of searched-for values. The traversal can stop early if all these values are found, and it could also switch to a faster index if most of them are found, although I haven't tried this.</p>
+<p>When the searched-in array is much larger, performance tends to the speed of a <em>set</em> lookup on that array, which can be several times faster than an index lookup for smaller types. The overhead for the searched-in values is usually higher than normal hash table insertion.</p>
+<h2 id="partitioning"><a class="header" href="#partitioning">Partitioning</a></h2>
+<p><a href="https://github.com/mlochbaum/rhsort">Robin Hood sort</a> sorts small uniform arrays quickly by considering hash tables as a way of sorting hashes. This cuts both ways: RH sort <a href="https://github.com/mlochbaum/rhsort/blob/master/images/rand.svg">slows down</a> far more than other sorting methods on large arrays because of its random access patterns, and so do hash table operations. For large enough hash tables, it ought to make sense to bring in sorting-based methods in order to reduce the search size.</p>
+<p>Of course, it's possible to implement searches using only sorting and no hashing: <code><span class='Function'>∧</span><span class='Modifier2'>⊸</span><span class='Function'>⍋⊏⍋</span><span class='Modifier2'>∘</span><span class='Function'>⊣</span></code> with some adjustments to the binary search. A hash takes advantage of the fact that what ordering is used doesn't matter to rearrange things and get expected equal distribution of unique keys. It's usually going to be best to use a hash table as the base case, so that it's the hashes being sorted. With small element sizes and a bijective hash, only the hashes need to be compared, so the arguments can be hashed at the start and the original values discarded.</p>
+<p>The simplest sorting technique to use is probably <a href="sort.html#partitioning">partitioning</a> as in quicksort. However, I'm guessing that a <a href="sort.html#radix-sort">radix</a> pass would be faster. The primary reason to avoid MSD radix for sorting is that arguments are often clustered in a small number of bins, making initial passes useless, but this only applies to hashed data if there are many duplicates. Stable partitioning is often more convenient as it ensures that equal elements are still encountered in the right order.</p>
+<p>Partitioning doesn't really have to interact with hash insertions or lookups: after the data is partitioned at a suitable size, it will just hash faster because it only accesses part of the hash table at a time. It's also possible to save space by using a smaller hash table and doing one partition with it, then the next, and so on.</p>
+<p>The complication is that the result comes out partitioned like the searched-for argument, and this needs to be undone at the end. I think it's best to just keep the partitioning information (first few hash bits) around and run partitioning in reverse.</p>
author	Marshall Lochbaum <mwlochbaum@gmail.com>	2022-08-21 18:41:31 -0400
committer	Marshall Lochbaum <mwlochbaum@gmail.com>	2022-08-21 18:41:31 -0400
commit	a23e8be6fec9510be6fbba6b3c3201e88e9785d2 (patch)
tree	405f50f433bf95c0b69bcdec3f983013f4536bbb /docs/implementation/primitive/search.html
parent	522593291751c89db2cab0db7e6522973894cafd (diff)