Introducing the Cult of the Bound Variable to the Cult of Forth
[The twelfth in a series of posts on the evolution of TransForth]
It’s been quite fun playing with this Universal Machine from the Cult of the Bound Variable. In this post we’re going to continue the journey toward building a full Forth for this machine by assembling a Forth inner interpreter; turning this register machine into a stack machine.
Last time we finished up a single-pass macro assembler for the UM-32 (in 15 lines of F# and 40 lines of Forth!) but did nothing more than a simple “Hello, world” with it. If you remember when we built an inner interpreter for TransForth, you know that this is the heart of Forth. If you’re unclear on the mechanics of a direct threaded inner interpreter, you’d probably enjoy the video from that earlier post.
Register Machines (what a pain)
If you’re used to dealing with a stack machine and its essentially zero-operand instructions, a register machine can be a bit annoying. But that’s what we have with the UM-32. Let’s assign purposes to the remaining registers (remember, z is already the zero constant, and registers 1 and 2, t and y, are temps). We’ll allocate one more temp along with the standard inner interpreter and stack pointers:
: x 3 ; \ Temp register
: w 4 ; \ Working register
: i 5 ; \ Interpreter register
: s 6 ; \ Stack (data) register
: r 7 ; \ Return stack register
In a way you can think of what we’re doing as implementing stack machine mechanics on top of a register machine. There are, of course, many real stack machines implemented in hardware. Koopman’s book is an excellent resource (free online) by the way. For the UM-32, we’ll need to handle the stacks ourselves along with the mechanics of walking nested sequences of stack operations.
Given the data or return stack pointer and a value register, these words will handle the stacks in a few instructions:
: push, ( ab-m ) \ b.push(a)
DUP
dec, \ b--
z store, \ M[b] = a
;
: pop, ( ab-m ) \ b = a.pop()
OVER SWAP
z SWAP \ aazb
fetch, \ b = M[a]
inc, \ a++
;
These don’t emit any UM-32 code at this point. Instead they are assembler “macros” used to inline these instructions as needed later.
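In Python terms, here’s a sketch of the mechanics these macros inline (illustrative only; the memory array and pointer names stand in for the UM-32’s):

```python
# Sketch of the push,/pop, mechanics against a flat memory array,
# with the stack growing downward (names here are illustrative).
M = [0] * 16384                 # 16K 32-bit cells, like our image

def push(sp: int, value: int) -> int:
    """b.push(a): decrement the pointer, then store through it."""
    sp -= 1                     # dec,    b--
    M[sp] = value               # store,  M[b] = a
    return sp

def pop(sp: int) -> tuple[int, int]:
    """b = a.pop(): fetch through the pointer, then increment."""
    value = M[sp]               # fetch,  b = M[a]
    sp += 1                     # inc,    a++
    return sp, value

s = 0x2FFF                      # top of the data stack, as used later
s = push(s, 42)
s, top = pop(s)                 # top is 42, s is back at 0x2FFF
```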
Inner Interpreter
Now we’re going to emit the inner interpreter and dictionary structure. Following that, we’ll initialize and kick things off. We’ll start by jumping to the initialization code (yet to be defined):
forward branch, \ over dictionary
It happens to be patched to address 60 later and the following ends up in the image:
00000 60 y literal
00001 z y loadjump
The triumvirate of an inner interpreter is enter, next and exit (in the past I’ve referred to them by their more classic names docol, next and dosemi).
The job of enter is to go into word definitions. It pushes the interpreter (i) register onto the return stack so we can get back out upon exit. It then points i at the body of the word and does a next.
VARIABLE &enter address &enter !
i r push, \ r.push(i)
2 t literal, t w i add, \ i = w + 2 (skip over the enter, jump)
\ falls through to next,
It emits the following UM-32 code to the image (falling through to next):
00002 0 t literal
00003 t t t nand
00004 r t r add
00005 z r i store
00006 2 t literal
00007 i w t add
To make it easy to jump to this, here’s a convenience word:
: enter, &enter @ x literal, x jump, ;
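Notice the 0 t literal / t t t nand / r t r add triple in the emitted code above: the UM-32 has no subtract instruction, so dec, builds -1 by NANDing zero with itself and then adds it. A quick Python check of that arithmetic (a sketch, not the author’s code):

```python
# The UM-32 has no subtract, so dec, constructs -1 with NAND:
# NAND(0, 0) = ~0 = 0xFFFFFFFF, and adding that modulo 2^32 subtracts 1.
MASK = 0xFFFFFFFF

def nand(a: int, b: int) -> int:
    return ~(a & b) & MASK

t = nand(0, 0)          # 0 t literal; t t t nand
r = 16383               # top of the return stack (0x3FFF)
r = (r + t) & MASK      # r t r add  -- effectively r--
```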
Next comes next, whose job is to jump to the next word in a sequence and advance the interpreter to the one following.
VARIABLE &next address &next !
i z w fetch, \ w = M[i]
i inc, \ i++
w jump,
Emitting:
00008 w z i fetch
00009 1 t literal
00010 i t i add
00011 z w loadjump
And the same kind of convenience macro:
: next, &next @ x literal, x jump, ;
Remember that all primitive words are plain machine code followed by a next. All secondary words begin with enter (that is, with machine code to jump to &enter) and are terminated with exit. The job of exit is to leave and rejoin a parent sequence; simply recovering the return address from the return stack and doing next from there.
VARIABLE &exit address &exit !
r i pop, \ i = r.pop()
next,
Emitting an inlined pop and a jump to next:
00012 i z r fetch
00013 1 t literal
00014 r t r add
00015 8 x literal
00016 z x loadjump
That’s it for the inner interpreter! Just fifteen instructions (plus two for the jump at the start).
Hand-packed Primitives
Now we want to pack a dictionary with primitive and secondary words to exercise things a bit. Using the same example as in the direct threading demo video, we’ll need lit, pick and add primitives; then we can define a secondary dup (in terms of lit and pick) and double (in terms of dup and add).
Starting with lit, we push the literal which is packed alongside word addresses and advance the interpreter to skip ahead:
VARIABLE &lit address &lit !
i z y fetch, \ y = M[i]
y s push, \ s.push(y)
i inc, \ i++
next,
Emitting:
00017 y z i fetch
00018 0 t literal
00019 t t t nand
00020 s t s add
00021 z s y store
00022 1 t literal
00023 i t i add
00024 8 x literal
00025 z x loadjump
The pick word is used to pull a copy of the nth element on the stack to the top:
VARIABLE &pick address &pick !
s y pop, \ y = s.pop()
s x move, \ x = s
y x x add, \ x = x + y
x z x fetch, \ x = M[x]
x s push, \ s.push(x)
next,
Emitting:
00026 y z s fetch
00027 1 t literal
00028 s t s add
00029 1 t literal
00030 x s t cmove
00031 x x y add
00032 x z x fetch
00033 0 t literal
00034 t t t nand
00035 s t s add
00036 z s x store
00037 8 x literal
00038 z x loadjump
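In list terms, n pick copies the nth element, counting from the top of the stack, so 0 pick is a dup. A sketch (the real word operates on the memory-based stack, where x = s + n does the same indexing):

```python
# Sketch of pick on a Python list standing in for the data stack
# (top of stack is the end of the list).
def pick(stack: list[int]) -> None:
    n = stack.pop()                 # y = s.pop()  -- which element to copy
    stack.append(stack[-1 - n])     # s.push(nth element from the top)

stack = [10, 20, 30]
stack.append(0)     # 0 pick duplicates the top...
pick(stack)         # -> [10, 20, 30, 30]

stack = [10, 20, 30]
stack.append(2)     # ...while 2 pick copies the third element from the top
pick(stack)         # -> [10, 20, 30, 10]
```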
The add word essentially pops two values and pushes back their sum. It could be implemented as:
s y pop, \ y = s.pop()
s x pop, \ x = s.pop()
y x x add, \ x = x + y
x s push, \ s.push(x)
next,
But a more efficient implementation just fetches the top of the stack for the second operand (without popping) and pokes back the result (without pushing):
VARIABLE &add address &add !
s y pop, \ y = s.pop()
s z x fetch, \ x = M[s]
y x x add, \ x = x + y
x s z store, \ M[s] = x
next,
Emitting:
00039 y z s fetch
00040 1 t literal
00041 s t s add
00042 x z s fetch
00043 x x y add
00044 z s x store
00045 8 x literal
00046 z x loadjump
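In Python terms (a sketch against the same kind of downward-growing memory stack), the peek-and-poke version looks like this; versus the pop/pop/push variant it trims the primitive from thirteen emitted instructions to eight, since an inlined pop costs three instructions and an inlined push four:

```python
# Sketch of the efficient add: pop one operand, then fetch the new
# top in place and store the sum back over it (no second pop/push).
M = [0] * 16384
s = 0x2FFF                # top of the data stack

for v in (42, 42):        # push two operands the long way
    s -= 1
    M[s] = v

y = M[s]; s += 1          # s y pop,       y = s.pop()
x = M[s]                  # s z x fetch,   x = M[s]  (peek, no pop)
x = (x + y) & 0xFFFFFFFF  # y x x add,     x = x + y
M[s] = x                  # x s z store,   M[s] = x  (poke, no push)
```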
For our final primitive in this exercise, we just need a halt. It’s just a single instruction terminating execution (so it doesn’t need to be followed by a next). Later we’ll seed the process by pointing the interpreter here so that once all the nested sequences are complete the machine halts:
VARIABLE &halt address &halt !
halt,
Emitting:
00047 halt
Hand-packed Secondary Words
Now on to the secondary words, which are quite different. They’re a branch to enter followed by a sequence of addresses rather than of machine code. It’s starting to gain a Forth-like feel; a little more comfortable:
VARIABLE &dup address &dup !
enter,
&lit @ m,
0 m,
&pick @ m,
&exit @ m,
Emitting:
00048 2 x literal
00049 z x loadjump
00050 17
00051 0
00052 26
00053 12
Notice how the literal zero is packed in there but otherwise it’s just a list of addresses. This is the essence of threaded code. If you think of the word addresses as op codes for a zero-operand stack machine, you’re not far off!
The double word is another secondary, this time defined in terms of another secondary (the dup from above) and primitives:
VARIABLE &double address &double !
enter,
&dup @ m,
&add @ m,
&exit @ m,
Emitting:
00054 2 x literal
00055 z x loadjump
00056 48
00057 39
00058 12
That’s it for the dictionary.
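To make the opcodes-for-a-stack-machine analogy concrete, here is a small Python sketch of the same dictionary as threaded code (illustrative only, not the author’s code; the addresses are borrowed from the image above, but primitives are plain Python functions rather than machine code):

```python
# Minimal direct-threaded interpreter sketch. "Memory" maps addresses
# to either Python functions (primitives) or lists of addresses
# (secondary bodies).
mem = {}

def define(addr, body):
    mem[addr] = body
    return addr

def lit(ctx):                        # push the inline literal and skip it
    ctx["s"].append(ctx["body"][ctx["ip"]]); ctx["ip"] += 1

def pick(ctx):                       # copy the nth element to the top
    n = ctx["s"].pop(); ctx["s"].append(ctx["s"][-1 - n])

def add(ctx):                        # pop one operand, add into the top
    y = ctx["s"].pop(); ctx["s"][-1] += y

LIT, PICK, ADD = define(17, lit), define(26, pick), define(39, add)
EXIT = define(12, "exit")
DUP    = define(48, [LIT, 0, PICK, EXIT])   # secondary: 0 pick
DOUBLE = define(54, [DUP, ADD, EXIT])       # secondary: dup add

def run(word, s):
    body, ip, rstack = mem[word], 0, []     # enter the word's body
    while True:
        w, ip = body[ip], ip + 1            # next: fetch and advance
        target = mem[w]
        if target == "exit":                # exit: resume the caller
            if not rstack:
                return
            body, ip = rstack.pop()
        elif callable(target):              # primitive
            ctx = {"s": s, "body": body, "ip": ip}
            target(ctx)
            ip = ctx["ip"]
        else:                               # secondary: enter it
            rstack.append((body, ip))
            body, ip = target, 0

s = [42]
run(DOUBLE, s)    # leaves [84] on the stack
```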
Sample Program
Now we want to prime the mechanics to run a sample program. We’ll push a 42 to the stack and execute double. This will dup the 42 by doing a 0 pick and add them together. With success we’ll end up with an 84 on the stack and halt.
To kick off the interpreter, we’ll point it at a location in memory containing the address of the halt word in the dictionary. This will be pushed to the return stack upon initially entering double and will be popped and executed upon completion just as we want:
VARIABLE terminate address terminate !
&halt @ m,
Emitting:
00059 47
Remember the jump at the very start of the image? That is patched to the current address and so here finally begins execution of our sample:
tohere
We’ll partition memory much as we did before with return and data stacks in high memory and the dictionary (already packed) in low memory:
16383 r literal, \ top of return stack, 3FFF
12287 s literal, \ top of data stack, 2FFF
And like we said, seed the interpreter pointer with a cell containing the halt address:
terminate @ i literal,
Emitted so far:
00060 16383 r literal
00061 12287 s literal
00062 59 i literal
To set up our little scenario, we manually push a 42 to the stack, point the working (w) register at double and branch into it. The inner interpreter takes it from there; walking the nested primitive and secondary words to completion and halting.
42 x literal, x s push,
&double @ w literal,
&double @ branch,
Emitting:
00063 42 x literal
00064 0 t literal
00065 t t t nand
00066 s t s add
00067 z s x store
00068 54 w literal
00069 54 y literal
00070 z y loadjump
We’re done!
We can’t quite just load up this 71-cell image in the UM-32 and go, however, because we’re using memory at much higher addresses for the stacks. We could add code to allocate and copy over the image, but I think I’ll just take the easy route and pad the rest of the image with zeros; making a nice little 64KB (16K 32-bit cells) image file:
: pad, 16384 address DO 0 m, LOOP ;
pad, msave
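As a sketch of what pad, and msave amount to (the big-endian word order follows the original UM “scroll” format; the helper and filename here are made up):

```python
import struct

# Sketch of padding and saving the image: zero-fill to 16K cells and
# write each cell as a big-endian 32-bit word (the word order of the
# original UM "scrolls"). The helper and filename are made up.
def save_image(cells: list[int], path: str, size: int = 16384) -> None:
    cells = cells + [0] * (size - len(cells))   # pad, : zero-fill
    with open(path, "wb") as f:
        for cell in cells:
            f.write(struct.pack(">I", cell))    # one 32-bit cell

save_image([1, 2, 3], "example.um")
# The result is a 64KB file: 16384 cells x 4 bytes each.
```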
Tracing
I would never have been able to get this all working without some visibility into the UM-32 in action. To get some tracing I added the following to Luke’s UM-32 implementation (inside his cycle function):
//(* Debugging
let name = function 0 -> "z" | 1 -> "t" | 2 -> "y" | 3 -> "x" | 4 -> "w" | 5 -> "i" | 6 -> "s" | 7 -> "r"
let an, bn, cn, a2n = (name a), (name b), (name c), (name a2)
printf "%05i " finger
match code with
| 0u -> printfn " %s %s %s cmove %s = %s:%i if %s:%i" an bn cn an bn reg.[b] cn reg.[c]
| 1u -> printfn " %s %s %s fetch %s = M[%s:%i][%s:%i]" an bn cn an bn reg.[b] cn reg.[c]
| 2u -> printfn " %s %s %s store M[%s:%i][%s:%i] = %s:%i" an bn cn an reg.[a] bn reg.[b] cn reg.[c]
| 3u -> printfn " %s %s %s add %s = %s:%i + %s:%i" an bn cn an bn reg.[b] cn reg.[c]
| 4u -> printfn " %s %s %s mult %s = %s:%i * %s:%i" an bn cn an bn reg.[b] cn reg.[c]
| 5u -> printfn " %s %s %s div %s = %s:%i / %s:%i" an bn cn an bn reg.[b] cn reg.[c]
| 6u -> printfn " %s %s %s nand %s = %s:%i ~& %s:%i" an bn cn an bn reg.[b] cn reg.[c]
| 7u -> printfn " halt"
| 8u -> printfn " %s %s alloc new(%s:%i) -> %s" bn cn cn reg.[c] bn
| 9u -> printfn " %s free %s:%i" cn cn reg.[c]
| 10u -> printfn " %s echo %s:%i" cn cn reg.[c]
| 11u -> printfn " %s key %s:%i" cn cn reg.[c]
| 12u -> printfn " %s %s loadjump load(%s:%i), jump(%s:%i)" bn cn bn reg.[b] cn reg.[c]
| 13u -> printfn "%5i %s literal %s = %i" value a2n a2n value
//Console.ReadLine() |> ignore
//*)
Now we can run our image and see all the gloriously gory details!
00000 60 y literal y = 60
00001 z y loadjump load(z:0), jump(y:60)
00060 16383 r literal r = 16383
00061 12287 s literal s = 12287
00062 59 i literal i = 59
00063 42 x literal x = 42
00064 0 t literal t = 0
00065 t t t nand t = t:0 ~& t:0
00066 s t s add s = t:4294967295 + s:12287
00067 z s x store M[z:0][s:12286] = x:42
00068 54 w literal w = 54
00069 54 y literal y = 54
00070 z y loadjump load(z:0), jump(y:54)
00054 2 x literal x = 2
00055 z x loadjump load(z:0), jump(x:2)
00002 0 t literal t = 0
00003 t t t nand t = t:0 ~& t:0
00004 r t r add r = t:4294967295 + r:16383
00005 z r i store M[z:0][r:16382] = i:59
00006 2 t literal t = 2
00007 i w t add i = w:54 + t:2
00008 w z i fetch w = M[z:0][i:56]
00009 1 t literal t = 1
00010 i t i add i = t:1 + i:56
00011 z w loadjump load(z:0), jump(w:48)
00048 2 x literal x = 2
00049 z x loadjump load(z:0), jump(x:2)
00002 0 t literal t = 0
00003 t t t nand t = t:0 ~& t:0
00004 r t r add r = t:4294967295 + r:16382
00005 z r i store M[z:0][r:16381] = i:57
00006 2 t literal t = 2
00007 i w t add i = w:48 + t:2
00008 w z i fetch w = M[z:0][i:50]
00009 1 t literal t = 1
00010 i t i add i = t:1 + i:50
00011 z w loadjump load(z:0), jump(w:17)
00017 y z i fetch y = M[z:0][i:51]
00018 0 t literal t = 0
00019 t t t nand t = t:0 ~& t:0
00020 s t s add s = t:4294967295 + s:12286
00021 z s y store M[z:0][s:12285] = y:0
00022 1 t literal t = 1
00023 i t i add i = t:1 + i:51
00024 8 x literal x = 8
00025 z x loadjump load(z:0), jump(x:8)
00008 w z i fetch w = M[z:0][i:52]
00009 1 t literal t = 1
00010 i t i add i = t:1 + i:52
00011 z w loadjump load(z:0), jump(w:26)
00026 y z s fetch y = M[z:0][s:12285]
00027 1 t literal t = 1
00028 s t s add s = t:1 + s:12285
00029 1 t literal t = 1
00030 x s t cmove x = s:12286 if t:1
00031 x x y add x = x:12286 + y:0
00032 x z x fetch x = M[z:0][x:12286]
00033 0 t literal t = 0
00034 t t t nand t = t:0 ~& t:0
00035 s t s add s = t:4294967295 + s:12286
00036 z s x store M[z:0][s:12285] = x:42
00037 8 x literal x = 8
00038 z x loadjump load(z:0), jump(x:8)
00008 w z i fetch w = M[z:0][i:53]
00009 1 t literal t = 1
00010 i t i add i = t:1 + i:53
00011 z w loadjump load(z:0), jump(w:12)
00012 i z r fetch i = M[z:0][r:16381]
00013 1 t literal t = 1
00014 r t r add r = t:1 + r:16381
00015 8 x literal x = 8
00016 z x loadjump load(z:0), jump(x:8)
00008 w z i fetch w = M[z:0][i:57]
00009 1 t literal t = 1
00010 i t i add i = t:1 + i:57
00011 z w loadjump load(z:0), jump(w:39)
00039 y z s fetch y = M[z:0][s:12285]
00040 1 t literal t = 1
00041 s t s add s = t:1 + s:12285
00042 x z s fetch x = M[z:0][s:12286]
00043 x x y add x = x:42 + y:42
00044 z s x store M[z:0][s:12286] = x:84
00045 8 x literal x = 8
00046 z x loadjump load(z:0), jump(x:8)
00008 w z i fetch w = M[z:0][i:58]
00009 1 t literal t = 1
00010 i t i add i = t:1 + i:58
00011 z w loadjump load(z:0), jump(w:12)
00012 i z r fetch i = M[z:0][r:16382]
00013 1 t literal t = 1
00014 r t r add r = t:1 + r:16382
00015 8 x literal x = 8
00016 z x loadjump load(z:0), jump(x:8)
00008 w z i fetch w = M[z:0][i:59]
00009 1 t literal t = 1
00010 i t i add i = t:1 + i:59
00011 z w loadjump load(z:0), jump(w:47)
00047 halt
And voilà! We indeed end up with 84 on the stack. Not a particularly impressive feat, but it shows that everything is working as expected.
To be continued…
Next, we’ll work on building an outer interpreter to process Forth source code and get out of this assembly language business. However, we’ll always keep the assembler handy and we’ll always be free to dip down to defining Forth words at this lowest level; for performance and to take advantage of machine-level functionality as needed. This spanning from bare metal all the way up to very high levels of abstraction is a unique beauty of Forth.