From 08ad1e1bb0a30c6bb4164f5665aa4d0f7f158742 Mon Sep 17 00:00:00 2001 From: Marshall Lochbaum Date: Wed, 16 Jun 2021 21:44:54 -0400 Subject: Add namespace instructions to VM docs --- docs/implementation/vm.html | 72 +++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 69 insertions(+), 3 deletions(-) (limited to 'docs/implementation/vm.html') diff --git a/docs/implementation/vm.html b/docs/implementation/vm.html index 5a6e77c4..48135fa3 100644 --- a/docs/implementation/vm.html +++ b/docs/implementation/vm.html @@ -5,7 +5,7 @@

The BQN virtual machine and runtime

-

BQN's self-hosted compiler and runtime mean that only a small amount of native code is needed to run BQN on any given platform. For example, the Javascript environment requires about 300 lines of Javascript code even though it compiles BQN bytecode to Javascript, a more complex strategy than interpreting it directly. This makes it fairly easy to port BQN to new platforms, allowing BQN to be embedded within other programming languages and interact with arrays or functions in those languages.

+

BQN's self-hosted compiler and runtime mean that only a small amount of native code is needed to run BQN on any given platform. The Javascript environment requires about 500 lines of Javascript code including system functions and performance improvements; probably around 250 would be required just to run the core language. This makes it fairly easy to port BQN to new platforms, allowing BQN to be embedded within other programming languages and interact with arrays or functions in those languages.

The way data is represented is part of the VM implementation: it can use native arrays or a custom data structure, depending on what the language supports. An initial implementation will be very slow, but can be improved by replacing functions from the BQN-based runtime with native code. As the VM system can be hard to work with if you're not familiar with it, I advise you to contact me to discuss implementing a VM if you are interested.

Bytecode

The BQN implementation here and dzaima/BQN share a stack-based object code format used to represent compiled code. This format is a list of numbers of unspecified precision (small precision will limit the length of list literals and number of locals per block, blocks, and constants). Previously it was encoded as bytes with the LEB128 format; while it no longer has anything to do with bytes it's called a "bytecode" because this is shorter than "object code".

@@ -27,11 +27,14 @@
  • Block immediateness: (1) immediate or (0) deferred
  • Block starting index in code
  • Number of variables the block needs to allocate
  • +
  • Variable names, as indices into the program's symbol list
  • +
  • A mask indicating which variables are exported
  • Compilation separates blocks so that they are not nested in bytecode. All compiled code is contained in some block. The self-hosted compiler compiles the entire program into an immediate block, and the program is run by evaluating this block. Blocks are terminated with the RETN instruction.

    The starting index refers to the position where execution starts in order to evaluate the block. When the block is evaluated depends on its type and immediateness. An immediate block (0,1) is evaluated as soon as it is pushed; a function (0,0) is evaluated when called on arguments, an immediate modifier (1 or 2, 1) is evaluated when called on operands, and a deferred modifier (1 or 2, 0) creates a derived function when called on operands and is evaluated when this derived function is called on arguments.

    +

    The program's symbol list is included in the tokenization information t: it is 02t. Since the entire program (the source code passed in one compiler call) uses this list, namespace field accesses can be performed with indices alone within a program. The symbol list is needed for cross-program access, for example if •BQN returns a namespace.

    Instructions

    -

    The following instructions are defined by dzaima/BQN. The ones emitted by the self-hosted BQN compiler are marked in the "used" column.

    +

    The following instructions are defined by dzaima/BQN. The ones emitted by the self-hosted BQN compiler are marked in the "used" column. Instructions marked NS are used only in programs with namespaces, and so are not needed to support the compiler or self-hosted runtime.

    @@ -253,6 +256,46 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + @@ -375,11 +418,29 @@ + + + + + + + + + + + + + + + + + +
    Returns top of stack
    26FLDONSIRead field I from namespace
    27FLDM26IPush mutable field I from namespace
    28NSPMNSIMutable target aliases field I
    29RETDNS25Return the running scope's namespace
    30SYSV0IPush dynamic system value I
    31 LOCU Xx x Returns from current block
    26FLDOn n.i
    28NSPMr sReference s gets field i and puts in r
    29RETDx? nClears stack, dropping 0 or 1 value

    Many instructions just call functions or modifiers or otherwise have fairly obvious implementations. Instructions to handle variables and blocks are more complicated (although very typical of bytecode representations for lexically-scoped languages) and are described in more detail below.

    Local variables: DFND LOCO LOCU LOCM RETN

    -

    The bytecode representation is designed with the assumption that variables will be stored in frames, one for each instance of a block. dzaima/BQN has facilities to give frame slots names, in order to support dynamic execution, but self-hosted BQN doesn't. A new frame is created when the block is evaluated (see #blocks) and in general has to be cleaned up by garbage collection, because a lexical closure might need to refer to the frame even after the corresponding block finishes. Lexical closures can form loops, so simple reference counting can leak memory, but it could be used in addition to less frequent tracing garbage collection or another strategy.

    +

    The bytecode representation is designed with the assumption that variables will be stored in frames, one for each time an instance of a block is run. dzaima/BQN has facilities to give frame slots names, in order to support dynamic execution, but self-hosted BQN doesn't. A new frame is created when the block is evaluated (see #blocks) and in general has to be cleaned up by garbage collection, because a lexical closure might need to refer to the frame even after the corresponding block finishes. Lexical closures can form loops, so simple reference counting can leak memory, but it could be used in addition to less frequent tracing garbage collection or another strategy.

    A frame is a mutable list of slots for variable values. It has slots for any special names that are available during the blocks execution followed by the local variables it defines. Special names use the ordering 𝕤𝕩𝕨𝕣𝕗𝕘; the first three of these are available in non-immediate blocks while 𝕣 and 𝕗 are available in modifiers and 𝕘 in 2-modifiers specifically.

    When a block is pushed with DFND, an instance of the block is created, with its parent frame set to be the frame of the currently-executing block. Setting the parent frame when the block is first seen, instead of when it's evaluated, is what distinguishes lexical from dynamic scoping. If it's an immediate block, it's evaluated immediately, and otherwise it's pushed onto the stack. When the block is evaluated, its frame is initialized using any arguments passed to it, the next instruction's index is pushed onto the return stack, and execution moves to the first instruction in the block. When the RETN instruction is encountered, an index is popped from the return stack and execution returns to this location. As an alternative to maintaining an explicit return stack, a block can be implemented as a native function that creates a new execution stack and returns the value in it when the RETN instruction is reached. This approach uses the implementation language's call stack for the return stack.

    Local variables are manipulated with the LOCO (or LOCU) and LOCM instructions, which load the value of a variable and a reference to it (see the next section) respectively. These instructions reference variables by frame depth and slot index. The frame depth indicates in which frame the variable is found: the current frame has depth 0, its block's parent frame has depth 1, and so on. The slot index is an index within that frame.

    @@ -389,6 +450,11 @@

    A reference list is a list of variable references or reference lists. It's created with the ARRM instruction. In the Javascript VM there's no difference between a reference list and an ordinary BQN list other than the contents.

    The SETN, SETU, and SETM instructions set a value for a reference. If the reference is to a variable, they simply set its value. For a reference list, the value needs to be destructured. It must be a list of the same length, and each reference in the reference list is set to the corresponding element of the value list.

    SETM additionally needs to get the current value of a reference. For a variable reference this is its current value (with an error if it's not defined yet); for a reference list it's a list of the values of each reference in the list.

    +

    Namespaces: FLDO FLDM NSPM RETD

    +

    A namespace is the collection of variables in a particular scope, along with an association mapping some exported symbols (case- and underscore-normalized strings) to a subset of these. It can be represented using a frame for the variables, plus the variable name list and mask of exported variables from that block's properties in the bytecode. RETD finishes executing the block and returns the namespace for that execution.

    +

    The variable name list is split into a program-level list of names and a list of indices into these names: within-program field accesses can be done with the indices only while cross-program ones must use names. One way to check whether an access is cross-program is to compare the accessor's program-level name list with the one for the accessed namespace as references or pointers. Then a lookup should be performed as appropriate. A persistent lookup table is needed to make this efficient.

    +

    FLDO reads a field from a namespace. The parameter I is a program-level identifier index for this field. The VM must ensure that the field is exported, possibly by leaving unexported identifiers out of the namespace's lookup table. FLDM does the same but pushes a reference to the field, to be modified by assignment.

    +

    NSPM is used for aliased assignment such as abcns. It tags a reference with a namespace field, identified with a program-level index. A value assigned to the tagged reference must be a namespace. The relevant field is extracted, and then stored in the original reference.

    Runtime

    Primitive functions and modifiers used in a program are stored in its consts array. The compiler needs to be passed a runtime with the value of every primitive so that these functions and modifiers are available.

    While it's perfectly possible to implement the runtime from scratch, the pseudo-BQN files r0.bqn and r1.bqn implement the full runtime in terms of a core runtime consisting of a smaller number of much simpler functions. pr.bqn converts this file so that it can be compiled. It changes values in the core runtime to primitives and primitives to generated identifiers, so that the first 22 values in the output's consts array are exactly the core runtime, and no other primitives are required. The result is a list of two elements: first the list of all primitive values, and then a function that can be called to pass in two additional core functions used for inferred properties.

    -- cgit v1.2.3