| author | Marshall Lochbaum &lt;mwlochbaum@gmail.com&gt; | 2021-07-10 22:32:07 -0400 |
|---|---|---|
| committer | Marshall Lochbaum &lt;mwlochbaum@gmail.com&gt; | 2021-07-10 22:32:07 -0400 |
| commit | 3402beec2227154d8c82445e679c9851e69d2704 | |
| tree | e7659a200a6261c2c78831c3c26fd1144c2a68ff | |
| parent | 9d6009137b08c2faa2006fe2d98ec856c99236ec | |
Note about data versus instruction cache
Diffstat (limited to 'docs/implementation/kclaims.html')
| -rw-r--r-- | docs/implementation/kclaims.html | 1 |
|---|---|---|

1 file changed, 1 insertion, 0 deletions
```diff
diff --git a/docs/implementation/kclaims.html b/docs/implementation/kclaims.html
index 5afe7efe..c84c1886 100644
--- a/docs/implementation/kclaims.html
+++ b/docs/implementation/kclaims.html
@@ -23,6 +23,7 @@
 <h2 id="instruction-cache">Instruction cache</h2>
 <p>A more specific claim about K is that the key to its speed is that the interpreter, or some part of it, fits in L1 cache. I know Arthur Whitney himself has said this; I can't find that now but <a href="https://kx.com/blog/what-makes-time-series-database-kdb-so-fast/">here</a>'s some material from KX about the "L1/2 cache". Maybe this was a relevant factor in the early days of K around 2000—I'm doubtful. In the 2020s it's ridiculous to say that instruction caching matters.</p>
 <p>Let's clarify terms first. The CPU cache is a set of storage areas that are smaller and faster than RAM; memory is copied there when it's used so it will be faster to access it again later. L1 is the smallest and fastest level. On a typical CPU these days it might consist of 64KB of <em>data</em> cache for memory to be read and written, and 64KB of <em>instruction</em> cache for memory to be executed by the CPU. When I've seen it the L1 cache claim is specifically about the K interpreter (and not the data it works with) fitting in the cache, so it clearly refers to the instruction cache.</p>
+<p>(Unlike the instruction cache, the data cache is a major factor that makes array languages faster. It's what terms like "cache-friendly" typically refer to. I think the reason KX prefers to talk about the instruction cache is that it allows them to link this well-known consideration to the size of the kdb binary, which is easily measured and clearly different from other products. Anyone can claim to use cache-friendly algorithms.)</p>
 <p>A K interpreter will definitely benefit from the instruction cache. Unfortunately, that's where the truth of this claim runs out. Any other interpreter you use will get just about the same benefit, because the most used code will fit in the cache with plenty of room to spare. And the best case you get from a fast core interpreter loop is fast handling of scalar code—exactly the case that array languages typically ignore.</p>
 <p>So, 64KB of instruction cache. That would be small even for a K interpreter. Why is it enough? I claim specifically that while running a program might cause a cache miss once in a while, the total cost of these will only ever be a small fraction of execution time. This is because an interpreter is made of loops: a core loop to run the program as a whole and usually smaller loops for some specific instructions. These loops are small, with the core loop being on the larger side. In fact it can be pretty huge if the interpreter has a lot of exotic instructions, but memory is brought to the cache in lines of around 64 bytes, so that unused regions can be ignored. The active portions might take up a kilobyte or two. Furthermore, you've got the L2 and L3 caches as backup, which are many times larger than L1 and not much slower.</p>
 <p>So a single loop doesn't overflow the cache. And the meaning of a loop is that it's loaded once but run multiple times—for array operations, it could be a huge number. The body of an interpreter loop isn't likely to be fast either, typically performing some memory accesses or branches or both. An L1 instruction cache miss costs tens of cycles if it's caught by another cache layer and hundreds if it goes to memory. Twenty cycles would be astonishingly fast for a go around the core interpreter loop, and array operation loops are usually five cycles or more, plus a few tens in setup. It doesn't take many loops to overcome a cache miss, and interpreting any program that doesn't finish instantly will take millions of iterations or more, spread across various loops.</p>
```
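
To make the loop structure described in the changed section concrete, here is a minimal C sketch of a two-level interpreter. It is not taken from this commit or from any K implementation; the opcodes and the `add_vec` helper are invented for illustration. The point is how little code is actually hot: the dispatch loop plus one array-operation loop compile to a few hundred bytes, far under a 64KB L1 instruction cache.

```c
/* Hypothetical two-level interpreter: a core dispatch loop plus a small
   per-instruction loop for an array operation. Not from the kclaims commit. */
#include <stdint.h>
#include <stdio.h>

enum { OP_ADD_VEC, OP_HALT };   /* invented opcodes */

typedef struct { uint8_t op; } Instr;

/* Per-instruction loop: loaded into the instruction cache once, then run for
   every element, so any miss is amortized over the whole array. */
static void add_vec(int64_t *dst, const int64_t *a, const int64_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = a[i] + b[i];
}

int main(void) {
    int64_t a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, dst[4];
    Instr prog[] = {{OP_ADD_VEC}, {OP_HALT}};

    /* Core loop: only the switch cases a program actually uses get pulled
       into cache (in 64-byte lines), so even a large opcode set leaves the
       active portion at a kilobyte or two. */
    for (Instr *ip = prog;; ip++) {
        switch (ip->op) {
            case OP_ADD_VEC: add_vec(dst, a, b, 4); break;
            case OP_HALT:    goto done;
        }
    }
done:
    printf("%lld %lld %lld %lld\n", (long long)dst[0], (long long)dst[1],
           (long long)dst[2], (long long)dst[3]);
    return 0;
}
```

Plugging in the figures from the diff's context: an L1 instruction cache miss caught by L2 costs tens of cycles, while the array loop runs at five or more cycles per element, so one miss is repaid within a few dozen elements, and a program that runs for millions of iterations makes the total miss cost negligible.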
