From fec8ec1131ab0797e138a6c3c72ab5a441304e12 Mon Sep 17 00:00:00 2001
From: Marshall Lochbaum
Date: Sat, 12 Feb 2022 11:27:11 -0500
Subject: =?UTF-8?q?Stop=20using=20=E2=89=A0=C2=A8=E2=88=98=E2=8A=94=20sinc?=
 =?UTF-8?q?e=20/=E2=81=BC=20does=20it=20now?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 docs/implementation/compile/dynamic.html | 2 +-
 docs/implementation/primitive/sort.html  | 4 ++--
 implementation/compile/dynamic.md        | 2 +-
 implementation/primitive/sort.md         | 4 ++--
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/implementation/compile/dynamic.html b/docs/implementation/compile/dynamic.html
index 0ffb6c46..59f6ad75 100644
--- a/docs/implementation/compile/dynamic.html
+++ b/docs/implementation/compile/dynamic.html
@@ -45,7 +45,7 @@

Derived functions

Like blocks, it can be valuable to optimize derived functions if they are run many times. Derived functions are often known at the program start by constant folding, but might also be constructed dynamically, particularly to bind an argument to a function.

Compound arithmetic functions like +´, ⌈`, or =⌜ are essential to array programming, and have fast SIMD implementations, so they need to be recognized wherever they are found.
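
For concreteness, here's what those derived functions compute (standard BQN, shown REPL-style):

    +´ 2‿3‿5          # fold with +: the sum, 10
    ⌈` 3‿1‿4‿1‿5      # scan with ⌈: running maximum ⟨ 3 3 4 4 5 ⟩
    2‿3 =⌜ 1‿2‿3      # table with =: a 2×3 comparison matrix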

-In addition to these, there are patterns like ≠¨∘⊔ that can be implemented faster than their components, and bindings like l⊸⊐ where a computation (here, a hash table) on the left argument can be saved. These can be handled by inspecting the function. However, it's more robust to convert it to a canonical form, so this possibility should also be considered.

+In addition to these, there are patterns like ∨`∘≠ that can be implemented faster than their components, and bindings like l⊸⊐ where a computation (here, a hash table) on the left argument can be saved. These can be handled by inspecting the function. However, it's more robust to convert it to a canonical form, so this possibility should also be considered.
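
A sketch of the binding case: once l is fixed, later calls can reuse a lookup structure built from it. The caching itself happens inside the implementation; this only shows the shape of the pattern, with l and F as invented names.

    l ← "ab"‿"cd"‿"ef"
    F ← l⊸⊐            # binds l; an implementation can cache a hash table for l here
    F "ef"‿"xx"        # ⟨ 2 3 ⟩: index of each element in l, with ≠l for misses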

Tacit code can be converted to SSA form very easily. To translate it into stack-based bytecode it would need a way to reuse values from the stack in multiple places; instructions to duplicate or extract a value from higher in the stack are an obvious candidate. Either of these forms is a natural step on the way to native compilation, and a bytecode representation would make it easier to optimize mixed tacit and explicit code—but it's easier to do the optimizations on SSA-form rather than stack-based code, so perhaps the right path is to convert both bytecode and derived functions to SSA.

Compile in another thread

A simple and widely-used strategy to reduce slowdown due to dynamic compilation is to compile blocks in a separate thread from the one that runs them. The new code needs to be added in a thread-safe manner, which is not hard as the set of optimized implementations is a small lookup table of some sort with only one writer.

diff --git a/docs/implementation/primitive/sort.html b/docs/implementation/primitive/sort.html
index 5c6c3d70..8e69a9ea 100644
--- a/docs/implementation/primitive/sort.html
+++ b/docs/implementation/primitive/sort.html
@@ -22,7 +22,7 @@

For sorting, the fastest algorithms for generic data and generic hardware are branchless quicksorts. Fluxsort is new and very exciting because it's a stable algorithm that's substantially faster than runner-up pdqsort on random arrays. However, it's immature and is missing a lot of the specialized strategies pdqsort has. I'm working on adapting these improvements to work for stable sorting and also on hybridizing with counting/bucket sort.

A branchless binary search is adequate for Bins but in many cases—very small or large 𝕨, and small range—there are better methods.

Counting and bucket sort

-Both counting and bucket sort are small-range algorithms that begin by counting the number of each possible value. Bucket sort, as used here, means that the counts are then used to place values in the appropriate position in the result in another pass. Counting sort does not read from the initial values again and instead reconstructs them from the counts. It might be written (/≠¨∘⊔)⌾(-⟜min) in BQN, with ≠¨∘⊔ as a single efficient operation.

+Both counting and bucket sort are small-range algorithms that begin by counting the number of each possible value. Bucket sort, as used here, means that the counts are then used to place values in the appropriate position in the result in another pass. Counting sort does not read from the initial values again and instead reconstructs them from the counts. It might be written (//⁼)⌾(-⟜min) in BQN, relying on the extension of /⁼ to unsorted arguments.

Bucket sort can be used for Grade or sort-by (⍋⊸⊏), but counting sort only works for sorting itself. It's not-even-unstable: there's no connection between result values and the input values except that they are constructed to be equal. But with fast Indices, Counting sort is vastly more powerful, and is effective with a range four to eight times the argument length. This is large enough that it might pose a memory usage problem, but the memory use can be made arbitrarily low by partitioning.
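
For reference, ⍋⊸⊏ composes Grade with Select: monadically it sorts 𝕩, and dyadically it reorders 𝕩 by the ordering of 𝕨.

    ⍋⊸⊏ 3‿1‿2          # ⟨ 1 2 3 ⟩: grade, then select
    "bca" ⍋⊸⊏ 1‿2‿3    # ⟨ 3 1 2 ⟩: sort 𝕩 by 𝕨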

Quicksort

Fluxsort attains high performance with a branchless stable partition that places one half on top of existing data and the other half somewhere else. One half ends up in the appropriate place in the sorted array. The other is in swap memory, and will be shifted back by subsequent partitions and base-case sorting. Aside from the partitioning strategy, Fluxsort makes a number of other decisions differently from pdqsort, including a fairly complicated merge sort (Quadsort) as the base case. I haven't fully evaluated these.
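
In array terms the partition itself is just a stable two-way split; this model keeps each side's original order but ignores the in-place layout (one half written over the input, the other to swap memory) that gives Fluxsort its speed:

    Part ← {((𝕩<𝕨)/𝕩) ∾ (𝕩≥𝕨)/𝕩}   # 𝕨 is the pivot; both sides keep input order
    5 Part 3‿8‿1‿9‿5‿2               # ⟨ 3 1 2 8 9 5 ⟩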

@@ -56,5 +56,5 @@

Reminder that we're talking about simple, not compound data. The most important thing is just to have a good branchless binary search (see above), but there are other possible optimizations.

If 𝕨 is extremely small, use a vector binary search as described in "Sub-nanosecond Searches" (video, slides). For 1-byte elements there's also a vectorized method that works whenever 𝕨 has no duplicates: create two lookup tables that go from multiples of 8 (5-bit values, after shifting) to bytes. One is a bitmask of 𝕨, so that a lookup gives 8 bits indicating which possible choices of the remaining 3 bits are in 𝕨. The other gives the number of values in 𝕨 less than the multiple of 8. To find the result of Bins, look up these two bytes. Mask off the bitmask to include only bits for values less than the target, and sum it (each of these steps can be done with another lookup, or other methods depending on instruction set). The result is the sum of these two counts.

It's cheap and sometimes worthwhile to trim 𝕨 down to the range of 𝕩. After finding the range of 𝕩, binary cut 𝕨 to a smaller list that contains the range. Stop when the middle element fits inside the range, and search each half of 𝕨 for the appropriate endpoint of the range.
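
One way to model the trimming (a sketch with an invented name; the real code finds the cut points by binary search rather than scanning all of 𝕨):

    TrimBins ← {
      k ← +´ 𝕨 ≤ ⌊´𝕩     # 𝕨 elements at or below all of 𝕩: each adds 1 to every result
      h ← +´ 𝕨 ≤ ⌈´𝕩     # 𝕨 elements past index h can never be counted
      k + (k↓h↑𝕨) ⍋ 𝕩    # search only the trimmed middle, then restore the offset
    }
    1‿3‿5‿7‿9 TrimBins 4‿6   # ⟨ 2 3 ⟩, same as 1‿3‿5‿7‿9 ⍋ 4‿6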

-If 𝕩 is small-range, then a lookup table method is possible. Check the length of 𝕨 because if it's too large then this method is slower! The approach is simply to create a table of the number of elements in 𝕨 with each value, then take a prefix sum. In BQN, 𝕩⊏+`∾≠¨⊔𝕨, assuming a minimum of 0.

+If 𝕩 is small-range, then a lookup table method is possible. Check the length of 𝕨 because if it's too large then this method is slower—binary search doesn't have to hit every element! The approach is simply to create a table of the number of elements in 𝕨 with each value, then take a prefix sum. In BQN, 𝕩⊏+`(1+⌈´𝕩)↑/⁼𝕨, assuming a minimum of 0.

Partitioning allows one pivot t from 𝕨 to be compared with all of 𝕩 at once. Although the comparison t≤𝕩 can be vectorized, the overhead of partitioning still makes this method a little slower per-comparison than sequential binary search when 𝕨 fits in L1 cache. For larger 𝕨 (and randomly positioned 𝕩) cache churn is a huge cost and partitioning can be many times faster. It should be performed recursively, switching to sequential binary search when 𝕨 is small enough. Unlike quicksort there is no difficulty in pivot selection: always take it from the middle of 𝕨 as in a normal binary search. However, there is a potential issue with memory. If 𝕩 is unbalanced with respect to 𝕨, then the larger part can be nearly the whole length of 𝕩 (if it's all of 𝕩 partitioning isn't actually needed and it doesn't need to be saved). This can require close to 2⋆⁼≠𝕨 saved partitions of length ≠𝕩, while the expected use would be a total length ≠𝕩.
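
A recursive BQN model of the scheme (illustrative only: the name and the threshold are invented, and a real implementation works on flat buffers rather than reordering lists at every level):

    PartBins ← {
      64 ≥ ≠𝕨 ? 𝕨⍋𝕩 ;               # small 𝕨: plain sequential binary search
      m ← ⌈2÷˜≠𝕨 ⋄ b ← (m⊑𝕨) ≤ 𝕩    # pivot from the middle of 𝕨 splits all of 𝕩 at once
      (⍋⍋b) ⊏ ((m↑𝕨) 𝕊 (¬b)/𝕩) ∾ m + (m↓𝕨) 𝕊 b/𝕩   # recurse, then undo the reordering
    }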

diff --git a/implementation/compile/dynamic.md b/implementation/compile/dynamic.md
index bd49a98f..a605545e 100644
--- a/implementation/compile/dynamic.md
+++ b/implementation/compile/dynamic.md
@@ -71,7 +71,7 @@ Like blocks, it can be valuable to optimize derived functions if they are run ma
 
 Compound arithmetic functions like `+´`, `` ⌈` ``, or `=⌜` are essential to array programming, and have fast SIMD implementations, so they need to be recognized wherever they are found.
 
-In addition to these, there are patterns like `≠¨∘⊔` that can be implemented faster than their components, and bindings like `l⊸⊐` where a computation (here, a hash table) on the left argument can be saved. These can be handled by inspecting the function. However, it's more robust to convert it to a canonical form, so this possibility should also be considered.
+In addition to these, there are patterns like ``∨`∘≠`` that can be implemented faster than their components, and bindings like `l⊸⊐` where a computation (here, a hash table) on the left argument can be saved. These can be handled by inspecting the function. However, it's more robust to convert it to a canonical form, so this possibility should also be considered.
 
 Tacit code can be converted to [SSA](https://en.wikipedia.org/wiki/Static_single_assignment_form) form very easily. To translate it into stack-based bytecode it would need a way to reuse values from the stack in multiple places; instructions to duplicate or extract a value from higher in the stack are an obvious candidate. Either of these forms is a natural step on the way to native compilation, and a bytecode representation would make it easier to optimize mixed tacit and explicit code—but it's easier to do the optimizations on SSA-form rather than stack-based code, so perhaps the right path is to convert both bytecode and derived functions to SSA.
 
diff --git a/implementation/primitive/sort.md b/implementation/primitive/sort.md
index 0a4c5211..36b9c916 100644
--- a/implementation/primitive/sort.md
+++ b/implementation/primitive/sort.md
@@ -36,7 +36,7 @@ A branchless binary search is adequate for Bins but in many cases—very small o
 
 ### Counting and bucket sort
 
-Both counting and bucket sort are small-range algorithms that begin by counting the number of each possible value. Bucket sort, as used here, means that the counts are then used to place values in the appropriate position in the result in another pass. Counting sort does not read from the initial values again and instead reconstructs them from the counts. It might be written `(/≠¨∘⊔)⌾(-⟜min)` in BQN, with `≠¨∘⊔` as a single efficient operation.
+Both counting and bucket sort are small-range algorithms that begin by counting the number of each possible value. Bucket sort, as used here, means that the counts are then used to place values in the appropriate position in the result in another pass. Counting sort does not read from the initial values again and instead reconstructs them from the counts. It might be written `(//⁼)⌾(-⟜min)` in BQN, relying on the extension of `/⁼` to unsorted arguments.
 
 Bucket sort can be used for Grade or sort-by (`⍋⊸⊏`), but counting sort only works for sorting itself. It's not-even-unstable: there's no connection between result values and the input values except that they are constructed to be equal. But with [fast Indices](replicate.md#non-booleans-to-indices), Counting sort is vastly more powerful, and is effective with a range four to eight times the argument length. This is large enough that it might pose a memory usage problem, but the memory use can be made arbitrarily low by partitioning.
 
@@ -97,6 +97,6 @@ If `𝕨` is extremely small, use a vector binary search as described in "Sub-na
 
 It's cheap and sometimes worthwhile to trim `𝕨` down to the range of `𝕩`. After finding the range of `𝕩`, binary cut `𝕨` to a smaller list that contains the range. Stop when the middle element fits inside the range, and search each half of `𝕨` for the appropriate endpoint of the range.
 
-If `𝕩` is small-range, then a lookup table method is possible. Check the length of `𝕨` because if it's too large then this method is slower! The approach is simply to create a table of the number of elements in `𝕨` with each value, then take a prefix sum. In BQN, ``𝕩⊏+`∾≠¨⊔𝕨``, assuming a minimum of 0.
+If `𝕩` is small-range, then a lookup table method is possible. Check the length of `𝕨` because if it's too large then this method is slower—binary search doesn't have to hit every element! The approach is simply to create a table of the number of elements in `𝕨` with each value, then take a prefix sum. In BQN, ``𝕩⊏+`(1+⌈´𝕩)↑/⁼𝕨``, assuming a minimum of 0.
 
 [Partitioning](#partitioning) allows one pivot `t` from `𝕨` to be compared with all of `𝕩` at once. Although the comparison `t≤𝕩` can be vectorized, the overhead of partitioning still makes this method a little slower per-comparison than sequential binary search *when* `𝕨` *fits in L1 cache*. For larger `𝕨` (and randomly positioned `𝕩`) cache churn is a huge cost and partitioning can be many times faster. It should be performed recursively, switching to sequential binary search when `𝕨` is small enough. Unlike quicksort there is no difficulty in pivot selection: always take it from the middle of `𝕨` as in a normal binary search. However, there is a potential issue with memory. If `𝕩` is unbalanced with respect to `𝕨`, then the larger part can be nearly the whole length of `𝕩` (if it's all of `𝕩` partitioning isn't actually needed and it doesn't need to be saved). This can require close to `2⋆⁼≠𝕨` saved partitions of length `≠𝕩`, while the expected use would be a total length `≠𝕩`.
-- 
cgit v1.2.3