6502 ASM trick

Post by tepples »

Go ahead and make the wiki page.

Post by bogax »

Bregalad wrote:Sorry - my bad - I didn't think hard enough.

This is entirely my fault for not commenting the code better.

Bregalad wrote:I haven't really decided if "my" (really Nintendo's) solution or your solution is better. I should take the time to analyze how many bytes each solution saves.

Anyway I think your solution is very elegant, while mine would need a "jsr FetchArguments" at the start of every routine that needs arguments - so I think your solution is probably better.


With my code you avoid dealing with an extra return address inside the fetch-arguments routine, at the expense of pushing that address out of the routine and into the JSR/argument list every time you call a routine that uses it. The timing is probably about the same, but I think it will end up costing a few extra bytes of code.

Speed-wise, using the same index for source and destination is faster, so here's some code that gets the number of parameters to transfer from the head of the parameter list - sort of like Pascal-style strings rather than C-style strings.

Code: Select all

 ; routine to fetch parameters inline from
 ; the return address currently on the stack;
 ; to be called from within a routine that will
 ; use the parameters.
 ; the number of bytes to fetch is the first
 ; inline byte (the byte at the address on the
 ; stack + 1).
 ; that count does not include itself, i.e. if
 ; the first byte is 4 then there are 5 inline
 ; bytes in total and the last 4 are fetched as
 ; parameters; the first byte should always be
 ; one less than the total number of inline bytes.
 ; the return address is adjusted to point to the
 ; last inline byte by adding the first byte
 ; plus one.
 ; no provision is made for a carry, so the
 ; parameters should not cross a page boundary

GET_PARAMETERS
 ; get the return address into pointer
 tsx
 lda $104,x
 sta pointer+1
 lda $103,x
 sta pointer           ; the return address is one short
 inc pointer           ; so fix the pointer

 ldy #$00              ; pointer points to the first byte inline

 ; adjust the return address
 sec                   ; plus one for the first byte
 adc (pointer),y 
 sta $103,x            ; and put it back on the stack

 lda (pointer),y       ; now get the number of bytes to transfer
 tay                   ; into y

LOOP
 lda (pointer),y       ; copy the parameter bytes, last to first,
 sta parameters-1,y    ; into parameters .. parameters+count-1
 dey
 bne LOOP
 rts
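
For illustration, here's a caller/callee sketch using the routine above (SetSpritePos and the three data bytes are made-up examples; "parameters" is whatever zero-page area GET_PARAMETERS copies into, and the .byte directives assume a ca65-style assembler):

Code: Select all

        jsr SetSpritePos      ; call a routine that takes inline arguments
        .byte 3               ; count: 3 parameter bytes follow
        .byte $10, $20, $01   ; the parameters themselves
        ; execution resumes here after SetSpritePos returns

SetSpritePos:
        jsr GET_PARAMETERS    ; must run before anything else is pushed;
                              ; copies the 3 bytes to parameters..parameters+2
        lda parameters        ; $10
        ldx parameters+1      ; $20
        ldy parameters+2      ; $01
        ; ...
        rts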

Post by Bregalad »

I've got an awesome way to test individual bits of A without destroying A, simulating the nonexistent BIT #$xx opcode: just use BIT $xxxx on known opcode bytes! (A sketch follows the list.)

Test bit 6 :
Since your code is extremely likely to have an RTI instruction somewhere, just BIT that instruction and it will test bit 6 (RTI = $40).

Test bit 5 :
BIT any JSR opcode ($20).

Test bit 4 :
BIT any BPL opcode ($10).

Test bit 3 :
BIT a PHP opcode ($08) - okay, PHP isn't as common, but chances are you'll still use it at least once.
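
For illustration, a minimal sketch of the bit 6 case (isr_exit is a made-up label for an RTI that already exists in the program):

Code: Select all

my_irq_handler:
        ; ...
isr_exit:
        rti                   ; the opcode byte at isr_exit is $40

; elsewhere - test bit 6 of A without destroying A:
        bit isr_exit          ; Z = ((A AND $40) == 0); also sets N=0, V=1
        bne bit6_is_set       ; branch taken when bit 6 of A is set
        ; bit 6 of A is clear
bit6_is_set:
        ; ...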

Post by Bregalad »

OK, I just made another awesome discovery.

For a function that needs a boolean input, use V as the input flag. Since so few instructions affect V (adc, sbc, bit, and obviously clv, plp and rti), it's handy when you want a routine to do something slightly different depending on the caller but can't conveniently use C the way it's more often done: C tends to get clobbered before the part that can take two different routes, whereas V is much more likely to survive until that point.

This can really save you a few bytes compared to writing two separate routines, or to using C together with php/plp, which is less efficient.

Also, BIT an RTI instruction to set V (and clv to clear it).

The same could be done for returning a boolean, although I'm so used to returning C as a bool that the habit will be hard to change.
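
A minimal sketch of the input-flag idea (some_rti labels any RTI opcode byte already in the ROM, and DrawWindow is a made-up routine with two slightly different paths):

Code: Select all

        clv                   ; caller asks for the normal path
        jsr DrawWindow
        ; ...
        bit some_rti          ; set V (bit 6 of the $40 opcode), A untouched
        jsr DrawWindow        ; caller asks for the alternate path

DrawWindow:
        ; ... setup code that avoids adc/sbc/bit/clv/plp/rti,
        ; so V still holds the caller's choice ...
        bvs alternate_path
        ; normal path
        rts
alternate_path:
        ; alternate path
        rts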


Second part: I analyzed the passing of >4 bytes of parameters very carefully.
bogax's solution is officially the best.

I counted bytes as follows, assuming you have M functions that use N bytes of parameters:

- Standard solution (the last 3 args in registers and anything else stored to ZP before calling): 9+N bytes for each call, 0 overhead for the callee.
Overall: (9+N)*M bytes.

- Nintendo's solution (store the arguments after the jsr and have the callee call a function that copies them to ZP): 3+N bytes for each call, 3 bytes of overhead for the callee, and the argument-retrieval function takes 32 bytes.
Overall: (6+N)*M + 32 bytes.

- bogax's solution (jsr to a special function, followed by the address of the actual function, the number of arguments, and the arguments - call-site layout sketched below): 6+N bytes for each call, 0 overhead for the callee, and the argument-retrieval function takes 29 bytes.
Overall: (6+N)*M + 29 bytes.
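
For illustration, the call site of that third scheme could be laid out like this (CallWithArgs and SomeRoutine are made-up names, ca65-style directives assumed):

Code: Select all

        jsr CallWithArgs      ; 3 bytes: the shared dispatch/fetch routine
        .word SomeRoutine     ; 2 bytes: the routine actually being called
        .byte 3               ; 1 byte : number of argument bytes
        .byte $40, $12, $F0   ; N bytes: the arguments themselves
        ; execution resumes here after SomeRoutine returns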

So bogax's is the winner (by a narrow margin) if M > 3, that is, if a program needs at least 4 functions with more than 3 bytes of arguments.
I haven't checked speed, but I'm sure bogax's is the less slow of the two, since it avoids an extra jsr/rts of overhead each time.

His routine could possibly be improved even further by using BRK to simulate a JSR to the argument-retrieval function, if IRQs aren't used.

Post by Jarhmander »

I just thought of something. I don't think it's revolutionary or previously undiscovered, but I'll share the basic idea nevertheless.

The idea is that the zero page can be used as a data stack, for passing parameters to subroutines and allocating auto variables (auto as in the B and C programming languages).

Random points about it:

1- Let's call the data stack pointer "dSP" in what follows (strange capitalization so it isn't confused with DSP, which has nothing to do with this).
dSP normally lives in register X, but a zero-page address is reserved to hold its value whenever we need X for other purposes.
The data stack grows downward like the regular stack. Let's use a post-decrement/pre-increment scheme for pushing/pulling values to/from the data stack, again like the regular stack.

2- X is chosen because, with RMW instructions, zero-page indexed addressing can only use X.

3- For non-RMW instructions, zero-page indexed takes no more cycles than absolute indexed (and one fewer for stores) while taking less space. For RMW instructions, zero-page indexed takes both less space and fewer cycles.

4- As an added benefit, pointers (to data) on the stack can be used directly thanks to the rarely used (indirect,X) addressing mode.

5- Example of allocating/deallocating space on data stack:

Code: Select all

    lda dSP       ; if dSP already in X, do "txa" instead
    sec           ; can be optimized-out
    sbc #space_needed
    sta dSP
    tax
    lda #init_val ; now you can do whatever you want with
    sta $01,X     ; allocated space
    ...
    ...
    txa           ; time to free space
    clc
    adc #space_needed
    sta dSP
    tax
If X is used to hold dSP at all times, then we can omit storing its value at its zero-page location, and ISRs can safely use X as dSP. Otherwise, you must keep a valid dSP somewhere in memory at all times so ISRs can load it, because inside an ISR you can't entirely rely on X holding a valid dSP. Of course, if the data stack is not used in any ISR, then there's no trouble either way. (A sketch of the second arrangement follows.)
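
For illustration, a sketch of that second arrangement (my_irq is a made-up handler; dSP is the zero-page mirror from point 1):

Code: Select all

my_irq:
        pha                   ; save the interrupted code's A and X
        txa
        pha
        ldx dSP               ; pick up a known-good data stack pointer
        ; ... the ISR may allocate on the data stack below dSP here,
        ; provided it frees everything (and restores dSP) before leaving ...
        pla                   ; restore X and A for the interrupted code
        tax
        pla
        rti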

6- A (slightly) more complicated example, the memcpy function:

Code: Select all

; void *memcpy(void *dest, void *src, uint8_t n)
; (cdecl)
; A non-conforming implementation of memcpy
; Copies n bytes from src to dest, returns the original dest
; Data stack layout (X = dSP): n @ $01,X ; src @ $02-$03,X ; dest @ $04-$05,X
memcpy_cdecl:
    lda $04,X              ; return value: copy the original dest
    sta ret0               ; into the fixed return locations
    lda $05,X
    sta ret1
    ldy $01,X              ; fetch 'n'
    bne memcpy_cdecl_loop  ; protect against n = 0
    rts
memcpy_cdecl_loop:
    lda ($02,X)            ; *src
    sta ($04,X)            ; -> *dest
    dey
    bne :+                 ; done?
    rts
:   inc $02,X              ; src++ (16-bit increment)
    bne :+
    inc $03,X
:   inc $04,X              ; dest++ (16-bit increment)
    bne memcpy_cdecl_loop
    inc $05,X
    jmp memcpy_cdecl_loop
Here I introduced ret0 and ret1, which are fixed zero-page locations for return values. This isn't part of my idea; one could use only the stack for return values, or registers if the datatype of the result is small enough, just to give some examples - it will work as long as you're consistent with your calling convention. The key here is that parameters are "pushed" from left to right, so n sits at dSP+1, src at dSP+2 and dest at dSP+4. (A caller sketch follows.)
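
For illustration, a caller sketch under the same convention (buf_dst and buf_src are made-up labels; X is assumed to hold dSP on entry, with "dSP" as its zero-page mirror):

Code: Select all

        txa                   ; reserve 5 bytes: dest(2) + src(2) + n(1)
        sec
        sbc #5
        sta dSP
        tax
        lda #<buf_dst         ; dest, the leftmost argument, ends up highest
        sta $04,X
        lda #>buf_dst
        sta $05,X
        lda #<buf_src         ; then src
        sta $02,X
        lda #>buf_src
        sta $03,X
        lda #16               ; then n
        sta $01,X
        jsr memcpy_cdecl      ; X (the dSP) is preserved by the routine
        txa                   ; free the argument space again
        clc
        adc #5
        sta dSP
        tax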

7- To be really useful in a project, one could mix several calling conventions, say "cdecl" (my idea) and "fastcall" (where the regular registers are used to pass values, so it's quicker), and suffix the name of each subroutine with "_cdecl" or "_fastc" so the convention is explicit. Moreover, if one later changes the calling convention of a given subroutine and renames it accordingly, the assembler errors will show everywhere the old subroutine was called and where the code has to be adapted to the new calling convention.

Sorry if my code examples don't compile or don't work because I messed something up, but I hope you get the basic idea :P

Post by Bregalad »

Yes, I always found it was retarded that they used (),Y to address the data stack in CC65 when they could have done what you describe instead.

Also, is there any reason the stack should go "upwards" like the hardware stack? I think it would make just as much sense to have a downward stack instead - I don't think it would affect performance in any way - but it would be better because when you push pointers onto the stack you'd push the low byte first, and then you can use (,X). If we do it as you describe, we'd have to be sure to push the high byte first.

Or even better, get rid of the data stack completely, like SDCC does, at the small price of being unable to write re-entrant functions, which nobody ever uses anyway.

Also, I think there's a 99% chance X has to be used for something else once in a while, so you should definitely save it somewhere.

Post by Jarhmander »

Bregalad, I think you misunderstood something. I described the data stack as a "downward" stack: the data stack pointer initially points at a "high" address, and the more you "push" onto it, the lower the pointer gets - hence it works like the regular stack. That has nothing to do with the order of the individual bytes "pushed" onto it; obviously, you put pointers on the stack in little-endian order, so that (indirect,X) addressing can be used correctly.

Using stack-based memory allocation has the benefit of allowing recursion, but that wasn't the intent. In the absence of an intelligent linker and a keyword to define a memory location used only within a subroutine¹, you have to ensure yourself that any "temp" variable is used by only one active subroutine at a time (including ISRs), or you can welcome some nasty bugs. Using a data stack removes that burden. Note that you don't have to manipulate the return address or the regular stack to get parameters, and because the data stack is located in zero page, accesses are fairly fast (slightly slower than non-indexed zero-page access, though), compact, and you can use pointers directly from the data stack without having to move them into zero page first. You can manipulate your data directly in zero page, and when you're done with it, you can free it automagically by adjusting your data stack pointer.

But another idea I've just had is to simply treat part of the zero page as a big set of slow virtual registers: part of it saved by the caller and the other part saved by the callee, if necessary. ISRs conforming to that model must then save any of those zero-page locations before using them. That way no data stack is needed; the caller-saved part must be big enough for any return type, and the callee-saved part big enough for temp data and parameters. With care, it can be both efficient and easy to maintain.

[1] An external tool, separate from the assembler, that analyses the usage of marked memory locations and tracks when the subroutines using them are active would also count.

Post by tepples »

~J-@D!~ (Jarhmander) wrote:But another idea I've just had is to simply treat part of the zero page as a big set of slow virtual registers: part of it saved by the caller and the other part saved by the callee, if necessary.
Yeah, that's exactly what I adopted for parts of Thwaite: $0000-$0007 saved by caller, $0008-$000F saved by callee.
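
For illustration, a sketch of that split (addresses taken from the post above; some_routine is made up):

Code: Select all

; $00-$07: caller-saved - a routine may clobber these freely,
;          so the caller must not expect them to survive a jsr
; $08-$0F: callee-saved - a routine that uses one must restore it
some_routine:
        lda $08               ; borrow a callee-saved location
        pha
        ; ... use $08 as a local here ...
        pla
        sta $08               ; put it back before returning
        rts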

Post by Bregalad »

I don't think I misunderstood anything, but it's possible I got "downward" and "upward" backwards.

If you use a stack that works like the hardware stack at $100 - for example, say the pointer points to $FF and you push a pointer - you'll have to push the MSB first, because:
$FF = MSB of pointer
$FE = LSB of pointer

If X = $FE:
lda ($00,X) -> loads from the pointer

If you had pushed the LSB first, this would not have worked.

However, if you opt for the other order - say the stack starts at $00 and pushes go toward higher addresses - then you push your pointer in the regular little-endian format:

$00 = LSB of pointer
$01 = MSB of pointer
X = 0;
lda ($00,X) -> loads from the pointer

I'm not saying one is better than the other; it's just that you'll have to be careful about it.

And you're completely right, and I 100% agree about the issues that come up if you don't use a data stack. I know those problems very well, given how many bitchy bugs I've had because I called a routine that called another routine that modified the temp variables I relied on.

However, a good compiler such as SDCC could compile for the 6502 while avoiding those clashes by, as you said, treating part of the zero page as registers. I think this would be the most efficient solution, even though the zero-page data stack would still be reasonably efficient, while the current solution implemented in CC65 is very inefficient.

Post by Movax12 »

I'm going to use this one:


Normally, you would do this (“flag” is a byte in the zero page):

Code: Select all

function   code          bytes   cycles

set        lda #1
           sta flag      4       5

clear      lda #0
           sta flag      4       5

test       lda flag
           beq clear     4       5/6
Woz did it like this:

Code: Select all

function   code          bytes   cycles

set        lda #$80
           sta flag      4       5

clear      lsr flag      2       5

test       bit flag
           bpl clear     4       5/6
This assumes only bit 7 is important. (You could also use the overflow flag for a second flag without much more effort - see the sketch below.)
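
For example, a sketch of reading two flags packed into one byte with a single BIT ("flags" stands for such a byte; the labels are made up):

Code: Select all

        bit flags             ; N = bit 7 of flags, V = bit 6, A untouched
        bmi flag7_is_set
        bvs flag6_is_set
        ; neither flag is set
flag7_is_set:
        ; ...
flag6_is_set:
        ; ...

Note that the lsr-to-clear trick would then shift the old bit 7 into bit 6, so clearing one flag without disturbing the other takes a little more care.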

For setting the flag you could also use:

sec
ror flag

So you never have to touch a register, but it is slower.

Credited to Wozniak. Thank you, Apple Computer.
http://www.pagetable.com/?p=33

Post by 3gengames »

^ But that's a RAM waster, considering that a whole byte is basically sacrificed for a 1-bit flag. I don't think that's useful on the NES, which has only 2KB of CPU-side RAM.

Post by Movax12 »

Yeah, speed/convenience vs. space is always an issue, but many games do use a whole byte for some flags. You do save ROM space.

Post by 3gengames »

Maybe on a personal computer of the '80s, where there's 8KB+ of system RAM and around 32KB or so for programs - but not on the NES.

Post by Bregalad »

I often use "inc" to set flags... you just don't want to do this if you risk setting the flag more than 255 times between two clears.
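
A minimal sketch of that pattern ("flag" as in the earlier examples):

Code: Select all

        inc flag              ; set: 2 bytes, no register touched
        ; ...
        lda flag              ; test for non-zero rather than for bit 7
        bne flag_is_set
        ; ...
        lda #0                ; clear
        sta flag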

Post by Movax12 »

3gengames wrote:Maybe on a personal computer of the '80s, where there's 8KB+ of system RAM and around 32KB or so for programs - but not on the NES.
I'm hardly any kind of authority, but looking over the source of SMB and Metroid I see lots of one-byte flags.