Eighth - a Forth for NES dev (formerly Bytecode Interpreter)

Discuss technical or other issues relating to programming the Nintendo Entertainment System, Famicom, or compatible systems. See the NESdev wiki for more information.

Moderator: Moderators

User avatar
sonder
Posts: 116
Joined: Wed Jun 26, 2013 12:35 pm
Location: Baltimore
Contact:

Eighth - a Forth for NES dev (formerly Bytecode Interpreter)

Post by sonder »

So I've been developing a bytecode interpreter for the NES to solve a few problems. I'm using it to do r&d. Here's why I did it:

1) 6502 is hard - too hard to try out ideas on or get something running quickly.

2) C is not much easier and not space-efficient at all.

3) NES has tight storage, so space-maximizing measures are advantageous

I'd been a fan of Forth for several years and became a minor expert in it. I thought it was the perfect solution to primarily the problem of difficulty of coding. Forth simplifies things a lot, at the cost of speed. The savings in size are almost a side benefit, but they are substantial.

What I settled on is a virtual machine with 8-bit instructions and an 8-bit stack. It's almost fully functional - still tweaking the instruction set. I optimized it as best I could - the estimated speed is about 600 900 instructions on average per frame. This is enough for rapid prototyping of ideas ... but also useful where speed is not critical, like decompression. I estimate that the size savings it provides over plain asm is about 30%. That ratio, as well as the speed can be improved with peephole optimization (when the compiler combines commonly used pairs of "words" into faster hybrids)

A "compiler" coded in SwiftForth (PC) will generate asm6 assembly code. It should run on the trial version.

I thought I might create a thread to let people know what I'm doing, gauge interest, maybe get help with any issues I run into or just hear ideas and answer questions.

8/15/2013
I threw the runtime source up in a Google Code project - https://code.google.com/p/eighth6502/ sorry no docs, compiler, or example yet

8/18/2013
Updated OP to be a little more general and up-to-date
Last edited by sonder on Fri Aug 23, 2013 9:11 am, edited 4 times in total.
sonder
User avatar
sonder
Posts: 116
Joined: Wed Jun 26, 2013 12:35 pm
Location: Baltimore
Contact:

Re: Bytecode Interpreter

Post by sonder »

I had a eureka moment and figured out a way to speed it up by 1.5x using a tiny bit of self-modifying code, while also saving a significant amount of ROM space. Like hundreds of bytes.

An additional improvement is it can now call arbitrary asm, so asm and bytecode can be freely mixed. I am looking to add 16-bit math next.

The only thing that needs doing now is testing it. I spent a long time debugging the first version, and I'm not really keen on doing that again, so I'd love it if anyone was interested in maybe helping beta test it.

To give an idea of how optimized this is... here is the heart of it, the "inner loop"

Code: Select all

_vmnext:
	ldx	$0000, y        ; 4
	iny                   ; 2
	stx   VMNEXT_JMP      ; 3
	jmp   opcode_page     ; 3
                         ; +3 for the jmp to this routine = 15
VMNEXT_JMP is actually the ZP address of the lower byte of the literal within the next instruction, only one page of ROM used for opcode implementations. one of the virtual registers IPBASE is actually the 2nd and 3rd bytes of the first instruction in this listing.
Last edited by sonder on Mon Aug 12, 2013 7:57 pm, edited 3 times in total.
sonder
User avatar
Dwedit
Posts: 4921
Joined: Fri Nov 19, 2004 7:35 pm
Contact:

Re: Bytecode Interpreter

Post by Dwedit »

Ever read about Steve Wozniak's Sweet16 interpreter?
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!
User avatar
sonder
Posts: 116
Joined: Wed Jun 26, 2013 12:35 pm
Location: Baltimore
Contact:

Re: Bytecode Interpreter

Post by sonder »

Dwedit wrote:Ever read about Steve Wozniak's Sweet16 interpreter?
Neato - no I hadn't heard of that. It isn't practical to have a full 16-bit VM for realtime code (rather I'm going to just have some extra instructions for doing a couple 16-bit operations) but I will have to look into this later for ideas or code. Thanks.

Btw... I noticed your alias... Dan Weiss right? We've collabed before. :)
sonder
dullahan
Posts: 96
Joined: Mon Dec 07, 2009 11:08 am
Location: USA

Re: Bytecode Interpreter

Post by dullahan »

Thanks for the updates sonder. Is your implementation living somewhere out on the web for perusal? I am interested in checking it out.
User avatar
sonder
Posts: 116
Joined: Wed Jun 26, 2013 12:35 pm
Location: Baltimore
Contact:

Re: Bytecode Interpreter

Post by sonder »

Update:

During the past couple days I debugged it and optimized it more. Comparison tests, if/then, for loops and while/until control structures are supported and functioning perfectly. A very small set of opcodes and routines for working with 16-bit numbers were added.

On the testing side of things I was able to get significantly more game features going on in roughly same number of cycles with tons to spare (currently using around 1/4 of the frame. About half of that is just the metasprite rendering (about 16 sprites), which is asm, for perspective. One nice thing is calling arbitrary asm routines is pretty fast so they are not limited to being behemoths to justify their existence. Surprisingly, because the interpreter is so fast, a lot of the time trying to squeeze out more speed by turning something into asm doesn't actually buy much. But as the complexity of a bytecode routine grows, the benefit of turning it into asm increases by a lot. So it is still best suited to stuff that can be slow, or simple high-level logic.

Ready for some example code? Here is an excerpt, just really basic moving a guy around. Doing it by hand for now, since the interpreter was basically rehauled and requires a major update to the compiler, but it's actually not too bad.

Code: Select all

do_player_movement:

	mcallasm andcase
	.db PAD_LEFT
mif +then
	mcallasm xfetch
	mdouble $FFF0
	.db <dplus
	mcallasm xstore
	.db <drop,<exit
+then

	mcallasm andcase
	.db PAD_RIGHT
mif +then
	mcallasm xfetch
	mdouble $0010
	.db <dplus
	mcallasm xstore
	.db <drop,<exit
+then

	mcallasm andcase
	.db PAD_UP
mif +then
	mcallasm yfetch
	mdouble $FFF0
	.db <dplus
	mcallasm ystore
	.db <drop,<exit
+then

	mcallasm andcase
	.db PAD_DOWN
mif +then
	mcallasm yfetch
	mdouble $0010
	.db <dplus
	mcallasm ystore
+then
	.db <drop,<exit

doplayer:
	.db <zero                     ; push a zero
	mcallasm padstate
	mcall do_player_movement
	mcallasm screenxy
	mdouble ms_player_idle_fwd
	mcallasm metaspr
	.db <oplus                    ; go to next object
	.db <exit
Once the compiler is done this would be how the same code will look in Forth-style syntax:

Code: Select all


: do_player_movement  ( padstate -- padstate )
   andcase b: PAD_LEFT if 
      xfetch $FFF0 d+ xstore exit
   then
   andcase b: PAD_RIGHT if 
      xfetch $0010 d+ xstore exit
   then
   andcase b: PAD_UP if 
      yfetch $FFF0 d+ ystore exit
   then
   andcase b: PAD_RIGHT if 
      yfetch $0010 d+ ystore 
   then ;

: doplayer
   0 padstate do_player_movement drop
   screenxy ms_player_idle_fwd metaspr
   o+ ;

sonder
User avatar
sonder
Posts: 116
Joined: Wed Jun 26, 2013 12:35 pm
Location: Baltimore
Contact:

Re: Bytecode Interpreter

Post by sonder »

dullahan wrote:Thanks for the updates sonder. Is your implementation living somewhere out on the web for perusal? I am interested in checking it out.
No not yet. It just needs to be taken out of the game project folder and cleaned up. I will do that tomorrow since it's in a good working state and throw it in a SVN repository.
sonder
User avatar
sonder
Posts: 116
Joined: Wed Jun 26, 2013 12:35 pm
Location: Baltimore
Contact:

Re: Bytecode Interpreter

Post by sonder »

Just made the code available, see OP.
sonder
User avatar
sonder
Posts: 116
Joined: Wed Jun 26, 2013 12:35 pm
Location: Baltimore
Contact:

Re: Bytecode Interpreter

Post by sonder »

Well the desire for speed and more efficient nesting has led to yet another "rehaul" (though not as big as the last one) and I am thinking of moving to 16-bit "opcodes" (really 16-bit direct threaded code, classic Forth)

I discovered a way to do the inner loop for threaded code (basically lists of addresses) that is probably the fastest possible. It's 20 cycles so a bit slower, but, now there is absolutely no penalty for calling arbitrary asm where as before it was ~30 (on top of the 15 cycles for the inner loop that's 45 cycles just to execute something as simple as fetching an object property). As a side bonus nesting interpreted routines is a little faster. But fetching literals is almost twice as slow.

The size savings would be likely less than bytecode but not that bad. This is because of how often custom routines are called. Because all calls are now 2 bytes instead of some being 1 and some being 3, it nearly balances out. I probably don't care about size as much as before because I've moved from NROM to UNROM.

I originally envisioned a bytecode interpreter because I wanted to also use it for data compression but it's looking like I might end up doing what would have seemed counter-intuitive before my pursuit of speed started - two interpreters one for reasonably fast Forth scripting and one specifically tailored for size over speed. The bytecode interpreter may actually become a slower, limited subset of the DTC one, relying on a lot of the same code and variables just a different separate inner loop. The idea was to set it up so the opcodes are extensible, to for instsance specify data using basically simple inline scripts within the map data.
sonder
User avatar
Dwedit
Posts: 4921
Joined: Fri Nov 19, 2004 7:35 pm
Contact:

Re: Bytecode Interpreter

Post by Dwedit »

By the way, how is Forth as a programming language? I've never tried it myself. Seems like it uses RPN extensively, but how different would it be from Basic or C if it used infix notation?
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!
User avatar
sonder
Posts: 116
Joined: Wed Jun 26, 2013 12:35 pm
Location: Baltimore
Contact:

Re: Bytecode Interpreter

Post by sonder »

Dwedit wrote:By the way, how is Forth as a programming language? I've never tried it myself. Seems like it uses RPN extensively, but how different would it be from Basic or C if it used infix notation?
It is slower than C (in Eighth's case probably 2x), but faster than Basic. Its big draw is ease of implementation, but it does also have the benefit of teaching you to be a better coder if you stick with it. Mostly because of how the requirement to learn to factor well to effectively use it beefs up your abstraction practices.

It's like C in that it grants you a powerful gun with which to shoot yourself, but like Basic in that it is doesn't have nearly as much punctuation and required boilerplate.

It has no types, which can be seen as a benefit. The one type is the cell - the width of the data stack. In my interpreter Eighth, that's just 8 bits. Any others are derivative and managed by you. This isn't hard if you just notate things using stack diagrams or use certain name conventions (e.g. "<name>?" returns either 0 or -1). Address and 16-bit math are accomplished with "doubles", pairs of values on the stack. You learn not to think in more than a handful of types which are fluid and morph into each other frequently.

It can be implemented in a couple days, like I did. The runtime portion anyway. Doing a complete system takes a couple weeks to months.

It lets routines return multiple values for no extra work. You don't have to name them, just push them on the stack.

It's eminently extensible. Want a custom control structures, inline data (like strings), or custom compiler keywords? Trivial. Many language features can be added, but you get the advantage of being able to tailor it specifically to your app. Many things can't be easily added in full (like C-like structs) but you learn to work around, for instance by using its equivalent of namespaces, or prefixes. Or figuring out some other way to do what you want to do in the end entirely.

Being fully concatenative it lends itself to very natural-reading code (admittedly in your hands.) It frees you to think in broad strokes when reading code, instead of poring through it line-by-line like a computer. You pay more attention to the structure and logic of the program than the nitty gritty. Oftentimes rewriting subroutines or entire modules from scratch rather than trying to shoehorn in kludgy tweaks. Because ideally they should be really short and clean. This gave it its "write-only" rep but I disagree I think it is very readable, just not always easy to make changes to a complex routine, which would be your fault anyway. Keeping your subroutines simple (you're supposed to use lots of subroutines) frees you from having to think like a compiler. (Unfortunately Eighth's subroutines are currently kind of expensive...)

Not requiring named parameters makes code really fluid. Bits and pieces can be easily lifted out and turned into subroutines far easier than any other language I've used (I've heard Lisp is similar though). That contributes to the fun factor. Local named variables/params are supported in fancier Forths.

So TL;DR, actually, Forth is really, really different than both C and Basic. In my opinion it's both easier and more powerful if your goal is writing a program fast. The fundamentals can be learned in a day, but mastering it takes time. You have to unlearn a lot of assumptions and learn all the tricks. In the meantime writing simple scripts should be a cinch for just about anyone - in the old days they'd give customers (like scientists and businesspeople) app-specific vocabularies and let them write scripts (which were really just Forth programs!) all the time.
sonder
User avatar
sonder
Posts: 116
Joined: Wed Jun 26, 2013 12:35 pm
Location: Baltimore
Contact:

Re: Bytecode Interpreter

Post by sonder »

sonder wrote:Well the desire for speed and more efficient nesting has led to yet another "rehaul" (though not as big as the last one) and I am thinking of moving to 16-bit "opcodes" (really 16-bit direct threaded code, classic Forth)
Right now I'm not sure if it'll be worth the effort to implement 16-bit DTC (direct threaded code). Free arbitary asm, 1.5x faster subroutines, and much faster for/nextare the advantages. The big disadvantage is actually that literals are slow. More than 2x actually. And they are ubiquitous, so that's no bueno. It's totally unclear how it balances out right now, if there's enough overall speed gain to justify. This may be a future upgrade. Kind of tired of rehauling this thing.
Last edited by sonder on Sun Aug 18, 2013 10:35 pm, edited 1 time in total.
sonder
User avatar
sonder
Posts: 116
Joined: Wed Jun 26, 2013 12:35 pm
Location: Baltimore
Contact:

Re: Bytecode Interpreter

Post by sonder »

sonder wrote:
sonder wrote:Well the desire for speed and more efficient nesting has led to yet another "rehaul" (though not as big as the last one) and I am thinking of moving to 16-bit "opcodes" (really 16-bit direct threaded code, classic Forth)
Right now I'm not sure if it'll be worth the effort to implement 16-bit DTC (direct threaded code). Free arbitary asm, 2x faster subroutines, and much faster for/next are the advantages. The big disadvantage is actually that literals are slow. More than 2x actually. And they are ubiquitous, so that's no bueno. It's totally unclear how it balances out right now, if there's enough overall speed gain to justify. This may be a future upgrade. Kind of tired of rehauling this thing.
Just did a quick analysis of the test script and literals are far less common than callasm's and call's. Moving to 16-bit DTC would eliminate 903 cycles (8 scanlines) of overhead. (originally 1759 - 16 scanlines - so around 2x faster)

That was a naive assumption. I just spent an hour creating a DTC version of the interpreter. I'll post actual test results soon.
sonder
User avatar
sonder
Posts: 116
Joined: Wed Jun 26, 2013 12:35 pm
Location: Baltimore
Contact:

Re: Bytecode Interpreter

Post by sonder »

The speed increase, as I thought, wasn't as dramatic as I originally hoped. Debugging the DTC version was a BITCH due to the instruction pointer being pre-incrementing now. But the savings in the test program after everything was about 4 scanlines. I tried to get literals down as much as I could, but I could only get them down to 36 cycles, down from ... like 50. I forget. (but compare to the old VM's 25... and i considered that slow...)

It's starting to get to the point of frustration - come up with a great idea, bang it out, then reality hits and after all the work that goes into debugging it, it yields only an incremental improvement. Bleh. Well, what's done is done and now it's slightly faster. Besides, I'm sore from parkour class yesterday so I was fine with coding on the couch half the day :)
sonder
User avatar
sonder
Posts: 116
Joined: Wed Jun 26, 2013 12:35 pm
Location: Baltimore
Contact:

Re: Bytecode Interpreter

Post by sonder »

Banged out a brand new compiler supporting DTC and several useful bonus features. Converted the player's script, and got it working fairly quickly:

Code: Select all


\ these aren't even used here ... just an example of adding asm routines

code{ uplus
	tsx
	clc
	adc $102,x
	sta $102,x
	vmnext_pla
}

code{ twostore
	tax
	pla
	sta 0,x
	pla
	sta 1,x
	vmnext_pla
}

asm{
enum TEMP
	NX: .db 0
	NY: .db 0
	PX: .db 0
	PY: .db 0
ende
}


\ hardwired obj size = 16x16 for now

: solid_obj_move
;


bindlit PAD_LEFT PAD_LEFT
bindlit PAD_RIGHT PAD_RIGHT
bindlit PAD_UP PAD_UP
bindlit PAD_DOWN PAD_DOWN
bindlit OVERWORLD_POS OVERWORLD_POS

bind2lit ms_player_idle_right ms_player_idle_right

bindword vx! vxstore
bindword x! xstore
bindword vy! vystore
bindword y! ystore
bindword vx@ vxfetch
bindword x@ xfetch
bindword vy@ vyfetch
bindword y@ yfetch
bindword load_screen
bindword screenxy
bindword padstate
bindword metaspr

: do_player_control
	w: 0 2dup vx! vy!
	andcase dw: PAD_LEFT if $ffec vx! exit then
	andcase dw: PAD_RIGHT if $0014 vx! exit then
	andcase dw: PAD_UP if $ffec vy! exit then
	andcase dw: PAD_DOWN if $0014 vy! exit then
;

: load_overworld_screen
	OVERWORLD_POS @ load_screen
;

: do_player_world
	x@ drop $0f > if 1 OVERWORLD_POS +! 0 dup x! load_overworld_screen exit then
	x@ drop $00 < if -1 OVERWORLD_POS +! $0f00 x! load_overworld_screen exit then
	y@ drop $0f > if 4 OVERWORLD_POS +! 0 dup y! load_overworld_screen exit then
	y@ drop $00 < if -4 OVERWORLD_POS +! $0b00 y! load_overworld_screen exit then
;

: doplayer
	0 padstate do_player_control drop
	x@ vx@ d+ x!   y@ vy@ d+ y!
        do_player_world
	screenxy ms_player_idle_right metaspr
	o+
;
sonder
Post Reply