It is currently Tue Dec 12, 2017 10:08 pm

All times are UTC - 7 hours





Post new topic Reply to topic  [ 17 posts ]  Go to page 1, 2  Next
Author Message
PostPosted: Thu Aug 04, 2016 11:08 am 
Offline

Joined: Sun Jan 31, 2016 9:55 pm
Posts: 38
I've been working a bit on a compiler that targets the 6502 (yes, I know there are plenty of things that are already good enough, but its mostly just a fun project to work on for me). It supports operations on 1, 2, 3, and 4 byte values, including the usual 6502 operations, plus variable shift operations (including arithmetic shift right), multiply, divide, modulo, logical operations, and others.

David Wheeler's ideas were a big help in figuring out the runtime for a high level language. I decided to build a split stack machine.

There are 4 data stacks -- STACK0, STACK1, STACK2, and STACK3. STACK0 could be on the zero page to save code space and allow some ldy ZP, x instructions, and STACK1, STACK2, STACK3 can be elsewhere. There is only one stack pointer, stored in the X register. This allows stack access quite quickly using indexed absolute addressing. Multi-byte values are stored across multiple stacks. For example, the bytes of a 24-bit integer would be stored at STACK0+x, STACK1+x, and STACK2+x. Manipulating the stack pointer is also very cheap, it only takes a dex or inx.

As an optimization, the current top-of-stack value is stored contiguously in the zero page at TOS. This means that most operations, which conceptually pop 2 values from the stack, then push the result back on the stack, only have to modify TOS in-place and then increment x in order to calculate their values.

Here's an example, the 2-byte addition operation:

Code:
add2:                  ; 30 cycles total
        clc            ; 2 cycles
        lda STACK0, x  ; 4 cycles
        adc TOS        ; 3 cycles
        sta TOS        ; 3 cycles
        lda STACK1, x  ; 4 cycles
        adc TOS+1      ; 3 cycles
        sta TOS+1      ; 3 cycles
        inx            ; 2 cycles
        rts            ; 6 cycles


A generated program would look something like this:

Code:
main: ; calculates $1234 + $1337
        ; put $1234 on the stack
        lda #<$1234
        sta STACK0, x
        lda #>$1234
        sta STACK1, x

        ; put $1337 on the top of the stack
        lda #<$1337
        sta TOS
        lda #>$1337
        sta TOS+1

        jmp add2


There are a few things that I am considering to improve the performance. Currently, the stack usage is very wasteful. If you use 1 byte operands, then STACK1, STACK2, and STACK3 are unused. Because the 1 and 2 byte operations take less code space than the 3 and 4 byte operations, I could create, for example, 1 byte operations the operate on STACK1 instead of STACK0, and 2 byte operations that operate on STACK2 and 3 instead of 0 and 1. Packing multiple values into one stack slot could be powerful, but it would also be very difficult to write a compiler that can take full advantage of it. A good compiler would generate normal 6502 code for 1 byte instructions anyway, rather than using the stack.


Top
 Profile  
 
PostPosted: Thu Aug 04, 2016 12:28 pm 
Offline
User avatar

Joined: Fri Nov 12, 2004 2:49 pm
Posts: 7314
Location: Chexbres, VD, Switzerland
Why would you even need those LDA/STAs in your main code? Just encode them as byte codes to push values on your stack.

I doubt 24-bit and 32-bit calculations are very useful, ever, in the context of Nesdev. 16-bit however is extremely useful so using 2 separate stack for LSB and MSB and avoid the overhead of separating 16-bit and 8-bit operations and just do everything on 16-bit and ignore the MSB when appropritate is actually a good idea, in my opinion.


Top
 Profile  
 
PostPosted: Thu Aug 04, 2016 1:11 pm 
Offline

Joined: Sun Jan 31, 2016 9:55 pm
Posts: 38
It's true that 24 and especially 32 bit instructions aren't often useful, but I've sometimes wanted to use them. For example, storing a position on a large map. Super Mario Bros 3 uses a 24 bit number for each coordinate of a player's position. (Actually, more like 20 bits, because it only goes to 1/16 of a pixel precision).

I don't embed the constants because there is a trade-off on speed and code size. I could use a byte code interpreter, like Sweet16, and it could support all of the operations in one byte easily since a stack machine doesn't need to encode the operands in an instruction. This would be slower. However, they aren't mutually exclusive. I could sometime use an interpreter for code compactness, and direct subroutine calls for speed.


Top
 Profile  
 
PostPosted: Thu Aug 04, 2016 1:25 pm 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19335
Location: NE Indiana, USA (NTSC)
Bregalad wrote:
I doubt 24-bit and 32-bit calculations are very useful, ever, in the context of Nesdev.

Horizontal coordinate of a character in a long scrolling level: 0-7 for subpixel, 8-15 for pixel, 16-23 for screen.


Top
 Profile  
 
PostPosted: Thu Aug 04, 2016 1:36 pm 
Offline
User avatar

Joined: Fri Nov 12, 2004 2:49 pm
Posts: 7314
Location: Chexbres, VD, Switzerland
Ok, but it's the ONLY purpose of more than 16-bit I can think of, and you probably don't want to do the whole collision detection in 24-bit, especially if it's not hand-optimized assembly, as this code will run every frame.


Top
 Profile  
 
PostPosted: Thu Aug 04, 2016 2:16 pm 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19335
Location: NE Indiana, USA (NTSC)
Yeah, collision detection with 24-bit coordinates typically disregards subpixels.


Top
 Profile  
 
PostPosted: Fri Aug 05, 2016 8:40 pm 
Offline

Joined: Wed May 19, 2010 6:12 pm
Posts: 2424
Bregalad wrote:
Why would you even need those LDA/STAs in your main code? Just encode them as byte codes to push values on your stack.

How does that work?


Top
 Profile  
 
PostPosted: Sun Aug 07, 2016 12:49 pm 
Offline

Joined: Sun Jan 31, 2016 9:55 pm
Posts: 38
psycopathicteen wrote:
Bregalad wrote:
Why would you even need those LDA/STAs in your main code? Just encode them as byte codes to push values on your stack.

How does that work?


I imagine it would work like in Sweet16. Sweet16 is a virtual assembly language. Most instructions are a single byte, but the byte code $1X followed by a 2 byte number represents LOAD, with loads the two byte number into the register X. Similarly, this could have a byte code for pushing 1, 2, 3, or 4 byte values onto the stack directly.


Top
 Profile  
 
PostPosted: Sun Aug 07, 2016 12:59 pm 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19335
Location: NE Indiana, USA (NTSC)
russellsprouts wrote:
Similarly, this could have a byte code for pushing 1, 2, 3, or 4 byte values onto the stack directly.

Like in "The Princess and the pea"?


Top
 Profile  
 
PostPosted: Sun Aug 07, 2016 3:53 pm 
Offline

Joined: Wed May 19, 2010 6:12 pm
Posts: 2424
How is that different from loading and storing?


Top
 Profile  
 
PostPosted: Sun Aug 07, 2016 8:09 pm 
Offline
User avatar

Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
The main difference would be that a 16-bit lda #imm + sta $abs takes 8 cycles and 6 bytes total, while pea $val would only take 5 cycles + 3 bytes total.

Two complications (in my experience with pea): 1) the endian of the bytes pushed might not match what your existing code expects, 2) the values pushed are statically-defined at assemble time (i.e. instruction pushes a predetermined value, not what's in the accumulator); the "workaround" for the latter is to use self-modifying code so that the pea+operand bytes are in RAM and the latter can be changed dynamically.

Whether or not this provides you any cycle-count or space benefits over, stay, a classic lda/sta or lda/pha etc. is, obviously, indeterminable at this time.

P.S. -- Man, I haven't heard of Sweet16 in decades. Blast from the past hitting me hard. :-)


Top
 Profile  
 
PostPosted: Mon Aug 08, 2016 1:18 pm 
Offline

Joined: Wed May 19, 2010 6:12 pm
Posts: 2424
I thought only the 65816 had PEA.


Top
 Profile  
 
PostPosted: Mon Aug 08, 2016 1:39 pm 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19335
Location: NE Indiana, USA (NTSC)
Among the 6502 family, only the 65816 has pea as a hardware instruction. Perhaps I failed to make my point clear: an interpreted bytecode implemented on any processor could have an instruction with the same semantics (push 16 bits immediate), and it would be recognizable to those with 65816 experience as being pea.


Top
 Profile  
 
PostPosted: Mon Aug 08, 2016 2:11 pm 
Offline

Joined: Wed May 19, 2010 6:12 pm
Posts: 2424
I'm still confused on what Bregalad said.


Top
 Profile  
 
PostPosted: Tue Aug 09, 2016 5:56 pm 
Offline

Joined: Wed May 19, 2010 6:12 pm
Posts: 2424
I had an idea with implementing math with long integers on the 65816.

If you have the C code:

Code:
long a;
long b;
long c;

a = a + b + c;


It could be compiled to:

Code:
lda a_lo
clc
adc b_lo
tax
lda a_hi
adc b_hi
tay
txa
clc
adc c_lo
sta a_lo
tya
adc c_hi
sta a_hi


Basically using X and Y together as a 32-bit accumulator, and switching back and forth.


Top
 Profile  
 
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 17 posts ]  Go to page 1, 2  Next

All times are UTC - 7 hours


Who is online

Users browsing this forum: No registered users and 3 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
Powered by phpBB® Forum Software © phpBB Group