Implementing bit interleaving and uninterleaving

author: Marshall Lochbaum <mwlochbaum@gmail.com> 2023-02-11 15:04:26 -0500
committer: Marshall Lochbaum <mwlochbaum@gmail.com> 2023-02-11 15:04:26 -0500
commit: 29c036e86130e35a473bd9c0867b53c470a0aa08 (patch)
tree: f4aabc1adae724d899154a105a6794153b5b58b9
parent: 68fd472402e3b192402a2a23673fc1e6f83fe94b (diff)
2 files changed, 83 insertions, 0 deletions
diff --git a/docs/implementation/primitive/take.html b/docs/implementation/primitive/take.html
new file mode 100644
index 00000000..032d9952
--- /dev/null
+++ b/docs/implementation/primitive/take.html
@@ -0,0 +1,38 @@
+<head>
+  <link href="../../favicon.ico" rel="shortcut icon" type="image/x-icon"/>
+  <link href="../../style.css" rel="stylesheet"/>
+  <title>BQN: Implementation of Take and Drop</title>
+</head>
+<div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a> / <a href="index.html">primitive</a></div>
+<h1 id="implementation-of-take-and-drop"><a class="header" href="#implementation-of-take-and-drop">Implementation of Take and Drop</a></h1>
+<p>The function <a href="../../doc/take.html">Take</a> on multidimensional arrays can be an important utility for working with arrays that have an odd cell shape. For example, a sorting algorithm on 25-bit cells would be very hard to write, but it's fast to expand each cell to 32 bits, sort, and trim back to 25 bits.</p>
+<h2 id="bit-interleaving-and-uninterleaving"><a class="header" href="#bit-interleaving-and-uninterleaving">Bit interleaving and uninterleaving</a></h2>
+<p>When the argument and result cells fit in a machine word, Take performs an operation I call bit interleaving if the width increases, or bit uninterleaving if it decreases. That's because it inserts some number of zero bits between every few bits of <code><span class='Function'>⥊</span><span class='Value'>𝕩</span></code>, or undoes this process. Bit interleaving with nonzero bits might be used for <code><span class='Function'>⍉</span><span class='Value'>𝕩</span></code> when <code><span class='Function'>≠</span><span class='Value'>𝕩</span></code> is small, or <code><span class='Value'>𝕩</span><span class='Function'>∾</span><span class='Modifier'>˘</span><span class='Value'>𝕨</span></code> when both arguments have small cells.</p>
+<p><strong>Careful!</strong> A cell under 64 bits wide might not fit into any single machine word. For example, 57 bits starting at a 6-bit offset span 9 bytes. The first bit is bit 6 of byte 0, and the last is bit 0 of byte 8. Assuming the entire array is byte aligned, each cell always fits in a word for sizes ≤58, and 60. Cell sizes ≥61, and 59, might not. <strong>Beware 59!</strong></p>
+<p>Interleaving can be implemented with pdep, and uninterleaving with pext, in the BMI2 instructions. And these operations can be performed generically with a series of shifts and masks. Consider <code><span class='Number'>7</span> <span class='Function'>↑</span> <span class='Value'>𝕩</span></code> where a cell of <code><span class='Value'>𝕩</span></code> is 5 bits. Here are the input and expected result, labelling zeros with <code><span class='Value'>.</span></code> and argument bits with letters:</p>
+<pre><span class='Value'>...................</span><span class='Function'>ABCDEabcdeABCDEabcdeABCDEabcdeABCDEabcdeABCDE</span>
+<span class='Value'>...</span><span class='Function'>ABCDE</span><span class='Value'>..abcde..</span><span class='Function'>ABCDE</span><span class='Value'>..abcde..</span><span class='Function'>ABCDE</span><span class='Value'>..abcde..</span><span class='Function'>ABCDE</span><span class='Value'>..abcde..</span><span class='Function'>ABCDE</span>
+</pre>
+<p>The number of cells that can be widened at a time is <code><span class='Function'>⌊</span><span class='Number'>64</span><span class='Function'>÷</span><span class='Number'>7</span></code>, or <code><span class='Number'>9</span></code>. In some cases I suppose it'd be possible to pack in one more by letting the leading zeros run past the top bit; that sounds complicated.</p>
+<p>With the pdep instruction all we need to do is construct the appropriate mask indicating where the output cells should go. Let <code><span class='Function'>K</span> <span class='Value'>m</span></code> be <code><span class='Paren'>(</span><span class='Number'>2</span><span class='Function'>⋆</span><span class='Value'>m</span><span class='Paren'>)</span><span class='Function'>-</span><span class='Number'>1</span></code>, that is, a number consisting of <code><span class='Value'>m</span></code> ones in binary. Then the appropriate mask is <code><span class='Paren'>(</span><span class='Function'>K</span> <span class='Number'>5</span><span class='Paren'>)</span> <span class='Function'>×</span> <span class='Paren'>(</span><span class='Function'>K</span> <span class='Number'>7</span><span class='Function'>×</span><span class='Number'>9</span><span class='Paren'>)</span> <span class='Function'>÷</span> <span class='Paren'>(</span><span class='Function'>K</span> <span class='Number'>7</span><span class='Paren'>)</span></code>. The mask <code><span class='Function'>K</span> <span class='Number'>7</span><span class='Function'>×</span><span class='Number'>9</span></code> has 9 groups of 7 1s, and division by <code><span class='Function'>K</span> <span class='Number'>7</span></code> converts each group to its bottom bit. Then multiplying by <code><span class='Function'>K</span> <span class='Number'>5</span></code> converts each bit to 5 of them.</p>
+<p>Because interleaving and uninterleaving are useful even on short arrays, it's best to precompute the division <code><span class='Paren'>(</span><span class='Function'>K</span> <span class='Number'>7</span><span class='Function'>×</span><span class='Number'>9</span><span class='Paren'>)</span> <span class='Function'>÷</span> <span class='Paren'>(</span><span class='Function'>K</span> <span class='Number'>7</span><span class='Paren'>)</span></code>. Since <code><span class='Number'>9</span></code> was computed as <code><span class='Function'>⌊</span><span class='Number'>64</span><span class='Function'>÷</span><span class='Number'>7</span></code>, this value depends only on the width 7 so a table of 64 words is enough. And <code><span class='Number'>7</span><span class='Function'>|</span><span class='Number'>64</span></code>, which might be useful for alignment, can be computed from the word as <code><span class='Value'>l</span><span class='Function'>¬</span><span class='Number'>7</span></code>, where <code><span class='Value'>l</span></code> is the number of leading 0 bits. Other similar schemes are possible.</p>
+<p>On generic hardware these operations take more work. If we have <code><span class='Value'>n</span></code> cells in a word (9 here), then it can be done with <code><span class='Function'>⌈</span><span class='Number'>2</span><span class='Function'>⋆</span><span class='Modifier'>⁼</span><span class='Value'>n</span></code> steps. Numbering the cells starting at 0 on the right (little-endian) and the steps <em>ending</em> at 0 for interleaving, step <code><span class='Value'>j</span></code> moves cells that have a 1 in bit position <code><span class='Value'>j</span></code>.</p>
+<pre><span class='Value'>...................</span><span class='Function'>ABCDEabcdeABCDEabcdeABCDEabcdeABCDEabcdeABCDE</span>
+<span class='Value'>...</span><span class='Function'>ABCDE</span><span class='Value'>................abcdeABCDEabcdeABCDEabcdeABCDEabcdeABCDE</span>
+<span class='Value'>...</span><span class='Function'>ABCDE</span><span class='Value'>........abcdeABCDEabcdeABCDE........abcdeABCDEabcdeABCDE</span>
+<span class='Value'>...</span><span class='Function'>ABCDE</span><span class='Value'>....abcdeABCDE....abcdeABCDE....abcdeABCDE....abcdeABCDE</span>
+<span class='Value'>...</span><span class='Function'>ABCDE</span><span class='Value'>..abcde..</span><span class='Function'>ABCDE</span><span class='Value'>..abcde..</span><span class='Function'>ABCDE</span><span class='Value'>..abcde..</span><span class='Function'>ABCDE</span><span class='Value'>..abcde..</span><span class='Function'>ABCDE</span>
+</pre>
+<p>The amount to move is <code><span class='Number'>2</span><span class='Function'>⋆</span><span class='Value'>j</span></code> times the difference between the argument and result widths. To move the appropriate cells but not others, we need to blend with a mask, as in <code><span class='Paren'>(</span><span class='Value'>w</span><span class='Function'>&lt;&lt;</span><span class='Value'>sh</span> <span class='Value'>&amp;~</span> <span class='Value'>mask</span><span class='Paren'>)</span> <span class='Function'>|</span> <span class='Paren'>(</span><span class='Value'>w</span> <span class='Value'>&amp;</span> <span class='Value'>mask</span><span class='Paren'>)</span></code>. To go backwards, shift first, like <code><span class='Paren'>(</span><span class='Value'>w</span> <span class='Value'>&amp;~</span> <span class='Value'>mask</span><span class='Paren'>)</span><span class='Function'>&gt;&gt;</span><span class='Value'>sh</span> <span class='Function'>|</span> <span class='Paren'>(</span><span class='Value'>w</span> <span class='Value'>&amp;</span> <span class='Value'>mask</span><span class='Paren'>)</span></code>. Interleaving leaves some junk that needs to be cleared out with a final mask (same as the one used for pdep), and likewise uninterleaving requires the initial word to be cleaned with that mask. Here are the masks interleaved (heh) with results:</p>
+<pre><span class='Value'>...................</span><span class='Function'>ABCDEabcdeABCDEabcdeABCDEabcdeABCDEabcdeABCDE</span>
+<span class='Number'>0000000011111111111111111111111111111111111111111111111111111111</span>
+<span class='Value'>...</span><span class='Function'>ABCDE</span><span class='Value'>................abcdeABCDEabcdeABCDEabcdeABCDEabcdeABCDE</span>
+<span class='Number'>1111111100000000000000000000000000001111111111111111111111111111</span>
+<span class='Value'>...</span><span class='Function'>ABCDE</span><span class='Value'>........abcdeABCDEabcdeABCDE........abcdeABCDEabcdeABCDE</span>
+<span class='Number'>1111111100000000000000111111111111110000000000000011111111111111</span>
+<span class='Value'>...</span><span class='Function'>ABCDE</span><span class='Value'>....abcdeABCDE....abcdeABCDE....abcdeABCDE....abcdeABCDE</span>
+<span class='Number'>0111111100000001111111000000011111110000000111111100000001111111</span>
+<span class='Value'>...</span><span class='Function'>ABCDE</span><span class='Value'>..abcde..</span><span class='Function'>ABCDE</span><span class='Value'>..abcde..</span><span class='Function'>ABCDE</span><span class='Value'>..abcde..</span><span class='Function'>ABCDE</span><span class='Value'>..abcde..</span><span class='Function'>ABCDE</span>
+<span class='Number'>0001111100111110011111001111100111110011111001111100111110011111</span>
+</pre>
+<p>The masks aren't very quick to generate, so it's best to do it once for all cells and save them. One way is to start with a mask <code><span class='Value'>m</span></code> of all ones, then repeatedly take <code><span class='Value'>m</span> <span class='Value'>^</span> <span class='Paren'>(</span><span class='Value'>m</span><span class='Function'>&lt;&lt;</span><span class='Value'>sh</span><span class='Paren'>)</span></code> with a series of shifts <code><span class='Value'>sh</span></code> that decrease by factors of 2.</p>
diff --git a/implementation/primitive/take.md b/implementation/primitive/take.md
new file mode 100644
index 00000000..37cc9e41
--- /dev/null
+++ b/implementation/primitive/take.md
@@ -0,0 +1,45 @@
+*View this file with results and syntax highlighting [here](https://mlochbaum.github.io/BQN/implementation/primitive/take.html).*
+
+# Implementation of Take and Drop
+
+The function [Take](../../doc/take.md) on multidimensional arrays can be an important utility for working with arrays that have an odd cell shape. For example, a sorting algorithm on 25-bit cells would be very hard to write, but it's fast to expand each cell to 32 bits, sort, and trim back to 25 bits.
+
+## Bit interleaving and uninterleaving
+
+When the argument and result cells fit in a machine word, Take performs an operation I call bit interleaving if the width increases, or bit uninterleaving if it decreases. That's because it inserts some number of zero bits between every few bits of `⥊𝕩`, or undoes this process. Bit interleaving with nonzero bits might be used for `⍉𝕩` when `≠𝕩` is small, or `𝕩∾˘𝕨` when both arguments have small cells.
+
+**Careful!** A cell under 64 bits wide might not fit into any single machine word. For example, 57 bits starting at a 6-bit offset span 9 bytes. The first bit is bit 6 of byte 0, and the last is bit 0 of byte 8. Assuming the entire array is byte aligned, each cell always fits in a word for sizes ≤58, and 60. Cell sizes ≥61, and 59, might not. **Beware 59!**
+
+Interleaving can be implemented with pdep, and uninterleaving with pext, in the BMI2 instructions. And these operations can be performed generically with a series of shifts and masks. Consider `7 ↑ 𝕩` where a cell of `𝕩` is 5 bits. Here are the input and expected result, labelling zeros with `.` and argument bits with letters:
+
+    ...................ABCDEabcdeABCDEabcdeABCDEabcdeABCDEabcdeABCDE
+    ...ABCDE..abcde..ABCDE..abcde..ABCDE..abcde..ABCDE..abcde..ABCDE
+
+The number of cells that can be widened at a time is `⌊64÷7`, or `9`. In some cases I suppose it'd be possible to pack in one more by letting the leading zeros run past the top bit; that sounds complicated.
+
+With the pdep instruction all we need to do is construct the appropriate mask indicating where the output cells should go. Let `K m` be `(2⋆m)-1`, that is, a number consisting of `m` ones in binary. Then the appropriate mask is `(K 5) × (K 7×9) ÷ (K 7)`. The mask `K 7×9` has 9 groups of 7 1s, and division by `K 7` converts each group to its bottom bit. Then multiplying by `K 5` converts each bit to 5 of them.
+
+Because interleaving and uninterleaving are useful even on short arrays, it's best to precompute the division `(K 7×9) ÷ (K 7)`. Since `9` was computed as `⌊64÷7`, this value depends only on the width 7 so a table of 64 words is enough. And `7|64`, which might be useful for alignment, can be computed from the word as `l¬7`, where `l` is the number of leading 0 bits. Other similar schemes are possible.
+
+On generic hardware these operations take more work. If we have `n` cells in a word (9 here), then it can be done with `⌈2⋆⁼n` steps. Numbering the cells starting at 0 on the right (little-endian) and the steps _ending_ at 0 for interleaving, step `j` moves cells that have a 1 in bit position `j`.
+
+    ...................ABCDEabcdeABCDEabcdeABCDEabcdeABCDEabcdeABCDE
+    ...ABCDE................abcdeABCDEabcdeABCDEabcdeABCDEabcdeABCDE
+    ...ABCDE........abcdeABCDEabcdeABCDE........abcdeABCDEabcdeABCDE
+    ...ABCDE....abcdeABCDE....abcdeABCDE....abcdeABCDE....abcdeABCDE
+    ...ABCDE..abcde..ABCDE..abcde..ABCDE..abcde..ABCDE..abcde..ABCDE
+
+The amount to move is `2⋆j` times the difference between the argument and result widths. To move the appropriate cells but not others, we need to blend with a mask, as in `(w<<sh &~ mask) | (w & mask)`. To go backwards, shift first, like `(w &~ mask)>>sh | (w & mask)`. Interleaving leaves some junk that needs to be cleared out with a final mask (same as the one used for pdep), and likewise uninterleaving requires the initial word to be cleaned with that mask. Here are the masks interleaved (heh) with results:
+
+    ...................ABCDEabcdeABCDEabcdeABCDEabcdeABCDEabcdeABCDE
+    0000000011111111111111111111111111111111111111111111111111111111
+    ...ABCDE................abcdeABCDEabcdeABCDEabcdeABCDEabcdeABCDE
+    1111111100000000000000000000000000001111111111111111111111111111
+    ...ABCDE........abcdeABCDEabcdeABCDE........abcdeABCDEabcdeABCDE
+    1111111100000000000000111111111111110000000000000011111111111111
+    ...ABCDE....abcdeABCDE....abcdeABCDE....abcdeABCDE....abcdeABCDE
+    0111111100000001111111000000011111110000000111111100000001111111
+    ...ABCDE..abcde..ABCDE..abcde..ABCDE..abcde..ABCDE..abcde..ABCDE
+    0001111100111110011111001111100111110011111001111100111110011111
+
+The masks aren't very quick to generate, so it's best to do it once for all cells and save them. One way is to start with a mask `m` of all ones, then repeatedly take `m ^ (m<<sh)` with a series of shifts `sh` that decrease by factors of 2.
author	Marshall Lochbaum <mwlochbaum@gmail.com>	2023-02-11 15:04:26 -0500
committer	Marshall Lochbaum <mwlochbaum@gmail.com>	2023-02-11 15:04:26 -0500
commit	29c036e86130e35a473bd9c0867b53c470a0aa08 (patch)
tree	f4aabc1adae724d899154a105a6794153b5b58b9
parent	68fd472402e3b192402a2a23673fc1e6f83fe94b (diff)