Interpreter Progress: lertulo

Nov 08, 2008 19:51

Things are going well on the scripting front: the parser has been enriched to handle array definitions properly, the compiler can generate about a dozen different bytecodes, and the interpreter is implemented fully enough to run scripts using those codes.

The result is that you can actually run simple scripts now: global variables and local variables all work, mathematical expressions are properly evaluated, and even things like += and -- are compiled and interpreted properly. Yay!

There's still a lot to go: in particular, classes, methods, namespaces and array object references are still pending. Oh, and things that involve jumping around at runtime: if/else, for loops, the ternary operator, logical-or/-and, etc. But things are looking good.

For the die-hard curious out there, I'm posting my current bytecode docs below.

-------------------------------------------------------------------------------
BYTECODE FILE FORMAT
-------------------------------------------------------------------------------

The output of the compiler takes the form of a dualscript bytecode blob,
whether saved as a .DSB file or simply passed in raw form directly to the
interpreter. The content is identical either way.

The DSB file begins with a variable-length header, which is followed by an
array of variable-length records: the header contains data about the
overall script and the names used within it, while each record corresponds
to a class constructor or a class- or global-scope method. The first record
contains global-scope commands.

All structures are stored little-endian; text is UTF-8 encoded. The first
name record is always zero-length, and represents unnamed (global) space.

struct DsbFileHeader {
    uint32_t signature;        // 0xB1DEC0DE
    uint32_t nameCount;        // number of names stored in header
    uint32_t methodCount;      // number of method records
    struct {
        uint8_t nameLength;         // length does not include terminator
        char nameText[nameLength];  // text is not null-terminated
    } names[nameCount];
};

struct DsbMethodRecord {
    uint32_t signature;   // 0xFEEDC0DE
    uint32_t nameIndex;   // References the method name (0 for global)
    uint32_t byteCount;   // Bytes in bytecode data
    uint8_t bytecode[byteCount];
};
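
To make the layout concrete, here is a minimal Python sketch of a reader for
the two structures above. It assumes the fields are packed back-to-back with
no padding, that nameLength counts UTF-8 bytes, and that the method records
start immediately after the last name. The function names parse_dsb and
build_dsb are made up for illustration; they are not part of Dualscript.

```python
import struct

HEADER_SIG = 0xB1DEC0DE
METHOD_SIG = 0xFEEDC0DE


def parse_dsb(blob):
    """Parse a DSB blob into (names, methods), assuming a tightly packed layout."""
    sig, name_count, method_count = struct.unpack_from("<III", blob, 0)
    if sig != HEADER_SIG:
        raise ValueError("bad header signature")
    pos = 12
    names = []
    for _ in range(name_count):
        length = blob[pos]                      # uint8_t nameLength
        pos += 1
        names.append(blob[pos:pos + length].decode("utf-8"))
        pos += length
    methods = []
    for _ in range(method_count):
        sig, name_index, byte_count = struct.unpack_from("<III", blob, pos)
        if sig != METHOD_SIG:
            raise ValueError("bad method signature")
        pos += 12
        methods.append((names[name_index], bytes(blob[pos:pos + byte_count])))
        pos += byte_count
    return names, methods


def build_dsb(names, methods):
    """Inverse of parse_dsb; only here to produce sample input."""
    out = struct.pack("<III", HEADER_SIG, len(names), len(methods))
    for name in names:
        raw = name.encode("utf-8")
        out += struct.pack("<B", len(raw)) + raw
    for name_index, code in methods:
        out += struct.pack("<III", METHOD_SIG, name_index, len(code)) + code
    return out
```

Round-tripping build_dsb(["", "foo"], [(0, b"\x01\x02"), (1, b"\x03")])
through parse_dsb gives back the zero-length global name first, followed by
one record for global scope and one for "foo".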

-------------------------------------------------------------------------------
OPCODE CONVENTIONS
-------------------------------------------------------------------------------

As stored in the bytecode file, each script starts at instruction address
zero. These instructions are rebased to lie end-to-end in memory when the
scripts are loaded by the interpreter.

TOS - top-of-stack. The most recently pushed item; first to be popped.

NOS - next-on-stack. The item just below TOS on the stack.

opcode - the first byte of an instruction. Tells the interpreter what to do.
Also implicitly indicates the number of params; for example,
the "pop" opcode has no params, while "push" takes one value.

params - content that follows the opcode, providing details of exactly
what should happen.

instr - a full instruction, composed of opcode and params. Instructions
and params are unpadded; everything is byte-aligned.

addr - a reference to the start of a bytecode instruction. This value is
always a dword; in each compiled script addr=0 refers to the
beginning of the file's content, while in memory addr=0 refers to
the beginning of the first file's content--all others are rebased
to concatenate from there.

type - a reference to a type. This value is again a dword, as described
in the Types section above.

value - a pointer to a value object; for example, TOS and NOS refer to
values. A value is always 4 bytes. In the bytecode script, the
value TODO
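
The rebasing described under "addr" can be sketched in a few lines of Python.
This is only an illustration of the arithmetic, assuming scripts are simply
concatenated in load order; rebase_scripts and to_memory_addr are hypothetical
names, and the sketch does not patch any addr params embedded in the bytecode
(none of the opcodes documented so far carry one).

```python
def rebase_scripts(scripts):
    """Concatenate bytecode blobs; return the memory image and each blob's base."""
    image = bytearray()
    bases = []
    for code in scripts:
        bases.append(len(image))   # in-file addr 0 of this script lands here
        image += code
    return bytes(image), bases


def to_memory_addr(bases, script_index, file_addr):
    """Translate an in-file address into an in-memory address."""
    return bases[script_index] + file_addr
```

With three scripts of 3, 2 and 4 bytes, the bases come out as 0, 3 and 5, so
file address 1 in the third script becomes memory address 6.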

-------------------------------------------------------------------------------
DUALSCRIPT OPCODES
-------------------------------------------------------------------------------

assign {mathcode}

Pops top two values off the stack, copies *(NOS) = TOS, pushes NOS.
Runtime faults if the assignment involves bogus types of values
(e.g., a namespace reference, a function pointer, or a const-lhs).
The mathcode argument can indicate whether this is a simple
assignment, or whether the LHS value is being adjusted in some
manner first.

x = 5;

lookup-cg {nameid:x}
constant-int {int:5}
assign
pop

x <<= 5;

lookup-cg {nameid:x}
constant-int {int:5}
assign {leftshift}
pop

clone

Pops TOS, creates an unnamed equivalent and pushes that, then re-pushes
TOS. This opcode is used for post-increment and post-decrement
expressions, which are otherwise problematic since Dualscript is
reference-based (so it's hard to leave a value on the stack while the
variable itself changes content afterwards). In the example below,
note that the pre-increment and post-increment are encoded identically
except for the addition of a clone and a pop instruction.

// x starts as 3
y = ++x; // x=4, then y=4
z = x++; // z=4, but x=5

lookup-cg {nameid:y} // stack has &y
lookup {nameid:x} // stack has &y, &x
constant-int {int:1} // stack has &y, &x, 1
assign {+} // stack has &y, &x (x now == 4)
assign // stack has &y (y now == 4 too)
pop

lookup-cg {nameid:z} // stack has &z
lookup {nameid:x} // stack has &z, &x
->clone // stack has &z, 4, &x
constant-int {int:1} // stack has &z, 4, &x, 1
assign {+} // stack has &z, 4, &x (x now == 5)
->pop // stack has &z, 4
assign // stack has &z (z now == 4)
pop

Note also that this has one side-effect: a post-modified variable leaves
an *unnamed* equivalent on the stack, while the pre-modified variable
leaves the properly-named variable on the stack. The only time this
matters is if you try to use the thing as a function parameter:

method foo requires x { ... }
x = 5;
foo (++x); // works fine; passes x==6
foo (x++); // runtime error: foo requires "x", you passed unnamed
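
The two listings above can be traced with a toy stack machine. The Python
below is a sketch under assumptions, not the real interpreter: the Value
class, the run function, and the use of a Python callable in place of a
mathcode are all invented for illustration.

```python
import operator


class Value:
    """A reference-style value; name is None for unnamed temporaries."""

    def __init__(self, data, name=None):
        self.data = data
        self.name = name


def run(program, variables):
    """Execute (opcode, argument) pairs and return the final stack."""
    stack = []
    for op, arg in program:
        if op == "lookup-cg":                 # create on miss
            stack.append(variables.setdefault(arg, Value(0, arg)))
        elif op == "lookup":                  # fault (KeyError) on miss
            stack.append(variables[arg])
        elif op == "constant-int":
            stack.append(Value(arg))          # unnamed constant
        elif op == "assign":
            tos = stack.pop()
            nos = stack.pop()
            nos.data = tos.data if arg is None else arg(nos.data, tos.data)
            stack.append(nos)                 # the LHS goes back on the stack
        elif op == "clone":
            tos = stack.pop()
            stack.append(Value(tos.data))     # unnamed snapshot of the value
            stack.append(tos)                 # then the original reference
        elif op == "pop":
            stack.pop()
        else:
            raise ValueError(op)
    return stack


PRE_INCREMENT = [   # y = ++x;
    ("lookup-cg", "y"), ("lookup", "x"), ("constant-int", 1),
    ("assign", operator.add), ("assign", None), ("pop", None),
]
POST_INCREMENT = [  # z = x++;
    ("lookup-cg", "z"), ("lookup", "x"), ("clone", None),
    ("constant-int", 1), ("assign", operator.add), ("pop", None),
    ("assign", None), ("pop", None),
]
```

Starting from x == 3 and running both programs in order leaves x == 5,
y == 4 and z == 4, matching the comments in the script above.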

constant-int {int}
constant-float {float}
constant-string {string}

Pushes a dsvalue representing a constant onto the stack. The new value
is unnamed, attached to the current scope but not added to the current
namespace.

x = 5;

lookup-cg {nameid:x}
constant-int {int:5}
assign
pop

invert

Reverses the position of TOS and NOS.

x = ~y;

lookup-cg {nameid:x} // stack has &x
lookup {nameid:y} // stack has &x, &y
constant-int {int:0} // stack has &x, &y, 0
invert // stack has &x, 0, &y
math {~} // stack has &x, ~y
assign
pop

lookup-cg {nameid}
lookup-cl {nameid}
lookup {nameid}

Looks up the specified name and pushes the corresponding dsvalue onto
the stack. The difference between the ops reflects what should happen
when the lookup fails: -cg creates a new global variable, -cl creates
a new local variable, and the bland variant just runtime-faults.

lhs = rhs;

lookup-cg {nameid:lhs}
lookup {nameid:rhs}
assign
pop
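
The three lookup variants differ only in what happens on a miss. Here is a
hedged Python sketch of that decision; the search order (locals before
globals), the one-element list used as a value box, and the names lookup and
LookupFault are assumptions made for the example.

```python
class LookupFault(Exception):
    """Stands in for the interpreter's runtime fault."""


def lookup(name, global_vars, local_vars, create=None):
    """create is None (plain lookup), "global" (-cg) or "local" (-cl)."""
    for table in (local_vars, global_vars):
        if name in table:
            return table[name]
    if create is None:
        raise LookupFault(name)       # plain lookup: nothing to fall back on
    box = [None]                      # fresh, not-yet-assigned value
    target = global_vars if create == "global" else local_vars
    target[name] = box
    return box
```

For "lhs = rhs;" this means the left side is created as a global if it is
missing, while a missing right side faults before the assign ever runs.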

math {mathcode}

Pops top two values off the stack, performs *(NOS) (op) TOS, pushes
unnamed result. Runtime faults if the operation involves bogus types
of values (e.g., a namespace reference, a function pointer, or a
const-lhs). The mathcode argument indicates what kind of operation
is being done. The difference between math and assign is that assign
modifies the LHS variable and pushes it; math does not modify either
existing value, but rather pushes an unnamed result.

x = 3 + 5;

lookup-cg {nameid:x}
constant-int {int:3}
constant-int {int:5}
math {+}
assign
pop

x <<= 5 | 7;

lookup-cg {nameid:x}
constant-int {int:5}
constant-int {int:7}
math {|}
assign {leftshift}
pop

member

Resolves a name within an already-on-stack namespace. At first this
might seem stupid: if the caller asks for "x = mypkg.string;" then
why not encode the whole RHS as a compound identifier and look it
up at once? The rationale is that the would-be namespace portions
might be function calls or object members, which can't be resolved
properly until runtime.

namespace one.two;
x = abc.mypkg.globalmethod().datamember.submember;

// This first command is encoded as looking up "one.two.x"
// since the nameid we record maps to that string. In
// practice the interpreter doesn't lookup "one.two.x"
// starting at the namespace root; rather, it looks up "x"
// starting in the "one.two" namespace and working upwards.

lookup-cg {nameid:one.two.x} // stack: &x
lookup {nameid:one.two.abc} // stack: &x, &abc
lookup {nameid:mypkg} // stack: &x, &abc, &mypkg
member // stack: &x, &abc.mypkg
lookup {nameid:globalmethod} // stack: &x, &abc.mypkg, &globalmethod
member // stack: &x, &abc.mypkg.globalmethod
constant-int {int:0} // stack: &x, &abc.mypkg.globalmethod, 0 (no parameters)
call-function // stack: &x, &result
lookup {nameid:datamember} // stack: &x, &result, &datamember
member // stack: &x, &result.datamember
lookup {nameid:submember} // stack: &x, &result.datamember, &submember
member // stack: &x, &result.datamember.submember
assign // stack: &x
pop
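
To see why the chain has to be resolved one step at a time, here is a small
Python sketch. It models namespaces and objects as nested dicts and treats
the operand pushed just before each member as a bare name; both of those are
simplifying assumptions, and resolve_chain is a made-up helper rather than
anything in the interpreter.

```python
def resolve_chain(root, steps):
    """Resolve a dotted chain one step at a time, as 'member' would."""
    stack = [root]
    for kind, arg in steps:
        if kind == "name":
            stack.append(arg)                 # name to resolve on the next member
        elif kind == "member":
            name = stack.pop()
            container = stack.pop()
            stack.append(container[name])     # resolve inside what is on the stack
        elif kind == "call":
            function = stack.pop()
            stack.append(function())          # result only exists at runtime
        else:
            raise ValueError(kind)
    return stack.pop()


abc = {"mypkg": {"globalmethod": lambda: {"datamember": {"submember": 42}}}}
steps = [
    ("name", "mypkg"), ("member", None),
    ("name", "globalmethod"), ("member", None),
    ("call", None),
    ("name", "datamember"), ("member", None),
    ("name", "submember"), ("member", None),
]
```

Nothing after the call step could have been looked up in advance, because the
object holding "datamember" does not exist until globalmethod has run.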

pop

Pops the top value from the stack and throws it away.

;

pop

return

Pops the top address off the call stack and continues execution there.
If the call stack is empty, then execution halts and returns to
your program. This opcode does not affect the exprStack, so anything
you want on the exprStack as a return value had better be there
already. It also implies a scope-end opcode.

method foobar {
return 5;
}

constant-int {int:5}
// Note: no "pop" instruction: want to leave 5 on the stack!
return
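
The interplay between the call stack and the exprStack can be sketched as a
tiny dispatch loop. This Python version is an illustration only: it indexes
instructions by list position instead of byte address, and its "call" opcode
with an explicit target is a hypothetical stand-in for call-function, whose
real encoding is not documented here.

```python
def execute(code, entry=0):
    """Run (opcode, argument) pairs until 'return' finds an empty call stack."""
    expr_stack = []
    call_stack = []
    pc = entry
    while True:
        op, arg = code[pc]
        pc += 1
        if op == "constant-int":
            expr_stack.append(arg)
        elif op == "pop":
            expr_stack.pop()
        elif op == "call":                # hypothetical: jump to arg, remember pc
            call_stack.append(pc)
            pc = arg
        elif op == "return":
            if not call_stack:
                return expr_stack         # halt; hand control back to the host
            pc = call_stack.pop()         # exprStack is left untouched
        else:
            raise ValueError(op)


CODE = [
    ("call", 2),          # 0: global scope calls foobar
    ("return", None),     # 1: global scope finishes, execution halts
    ("constant-int", 5),  # 2: foobar: return 5;
    ("return", None),     # 3
]
```

Running execute(CODE) halts with the 5 still sitting on the exprStack, which
is how the return value survives the return.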

scope-begin
scope-inner
scope-end

Controls visibility and release of local variables. Scope-begin hides
all existing local variables, creating instead a new local-variable
scope; this behavior is common with function invocation, and in fact is
implied by the 'call-function' opcode. Scope-inner provides an
interior scope that does not hide existing locals, but allows new locals
to be destroyed on the next scope-end; this behavior is coded explicitly
when a statement is written as a series of statements wrapped in braces.
Scope-end closes a local variable scope; this behavior is implied by
the 'return' opcode, and is explicitly encoded for other circumstances.

x = 5;
{
y = 3;
}
x = y; // runtime exception: "y" not found

lookup-cg {nameid:x}
constant-int {int:5}
assign
pop
scope-inner
lookup-cl {nameid:y}
constant-int {int:3}
assign
pop
scope-end
lookup-cg {nameid:x}
lookup {nameid:y} // runtime exception: "y" not found
assign
pop
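
One way to picture the three scope opcodes is as a stack of frames, each
holding a stack of nested scopes. The Python class below is a sketch of that
idea under my own assumptions about the bookkeeping; the Scopes class and its
method names are hypothetical.

```python
class Scopes:
    """Hypothetical local-variable bookkeeping for scope-begin/-inner/-end."""

    def __init__(self):
        # Each frame is a list of nested scopes; only the last frame is visible.
        self.frames = [[{}]]

    def scope_begin(self):
        self.frames.append([{}])      # new frame hides every existing local

    def scope_inner(self):
        self.frames[-1].append({})    # nested scope; outer locals stay visible

    def scope_end(self):
        frame = self.frames[-1]
        frame.pop()                   # destroy the innermost scope's locals
        if not frame:                 # that scope was opened by scope-begin
            self.frames.pop()

    def create_local(self, name, value):
        self.frames[-1][-1][name] = value

    def lookup(self, name):
        for scope in reversed(self.frames[-1]):
            if name in scope:
                return scope[name]
        raise KeyError(name)
```

A local created after scope_inner disappears at the matching scope_end while
outer locals stay reachable throughout; after scope_begin, the outer locals
are hidden until the matching scope_end.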

source

Specifies where the next set of commands came from. Useless at runtime
except to report where we are when errors occur.

foo();
bar();

source {nameid:file.ds} {1} // line 1 of file.ds
lookup {nameid:foo}
call-function
pop
source {nameid:file.ds} {2} // line 2 of file.ds
lookup {nameid:bar}
call-function
pop
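
Here is a small Python sketch of how a source record might be used when
something goes wrong. It only models failed lookups, and the first_error
helper and the message format are invented for the example.

```python
def first_error(program, known_names):
    """Return 'file:line: message' for the first failed lookup, else None."""
    location = ("<unknown>", 0)
    for op, *args in program:
        if op == "source":
            location = (args[0], args[1])   # remember where we are
        elif op == "lookup" and args[0] not in known_names:
            return f"{location[0]}:{location[1]}: '{args[0]}' not found"
    return None


program = [
    ("source", "file.ds", 1),
    ("lookup", "foo"), ("call-function",), ("pop",),
    ("source", "file.ds", 2),
    ("lookup", "bar"), ("call-function",), ("pop",),
]
```

If only foo is defined, the failure is reported against line 2 of file.ds,
which is the most recent source record seen before the bad lookup.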

I'm filling in the docs as I go, so this describes accurately what the script does today--but it lacks several bytecodes that will be necessary to finish this thing off.