6502 ASM trick
Bregalad wrote:Sorry - my bad - I didn't think hard enough.
This is entirely my fault for not commenting
the code better.
Bregalad wrote:I haven't really decided if "my" (really Nintendo's) solution or your solution is better.
I should take the time to analyze how many bytes each solution saves.
Anyway I think your solution is very elegant, while mine would need a "jsr FetchArguments" at the start of every routine which needs arguments - so I think your solution is probably better.
With my code you avoid dealing with an extra
return address inside the fetch arguments routine
at the expense of pushing that address out of the
routine and into the JSR/argument list for each time
you call a routine that uses the fetch arguments
routine.
The timing is probably approximately the same but
I think it will end up costing some extra bytes
of code.
Speed-wise, using the same index for source and destination is faster, so here's some code that gets the number of parameters to transfer from the head of the parameter list.
Sort of like Pascal-style strings rather than C-style strings.
; routine to fetch parameters inline from
; the return address currently on the stack
; to be called from within a routine that will
; use the parameters
; the number of bytes to fetch is the first byte
; inline, i.e. the byte at (the address on the
; stack) + 1
; the count does not include itself, so it's the
; number of bytes following the first byte; e.g.
; if the first byte is 4 there should be 5 inline
; bytes total and the last 4 are fetched as
; parameters
; the first byte should always be one less than
; the total number of inline bytes
; the return address is adjusted to point to the
; last inline byte by adding the first byte
; plus one
; no provision is made for a carry, so the
; parameters should not cross a page boundary
GET_PARAMETERS
; get the return address into pointer
tsx
lda $104,x
sta pointer+1
lda $103,x
sta pointer ; the return address is one short
inc pointer ; so fix the pointer
ldy #$00 ; pointer points to the first byte inline
; adjust the return address
sec ; plus one for the first byte
adc (pointer),y
sta $103,x ; and put it back on the stack
lda (pointer),y ; now get the number of bytes to transfer
tay ; into y
LOOP
lda (pointer),Y
sta parameters-1,Y
dey
bne LOOP
rts
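To make the calling convention concrete, here's a hypothetical call site: the routine name DrawSprite and the `parameters` zero-page area are assumptions, and the `.byte` directive is assembler-dependent (ca65-style shown).

```asm
; hypothetical caller: the jsr is followed by a count
; byte and then that many parameter bytes
        jsr DrawSprite
        .byte 3             ; 3 parameter bytes follow
        .byte $10, $24, $01
        ; execution resumes here: GET_PARAMETERS has
        ; adjusted the return address past the inline data

; hypothetical callee: fetches its arguments first
DrawSprite:
        jsr GET_PARAMETERS  ; copies the 3 bytes to
                            ; parameters..parameters+2
        lda parameters      ; first argument ($10)
        ; ...
        rts
```

Note that GET_PARAMETERS reads the stack at $103,x/$104,x, i.e. one return address above its own, so it must be called directly from the routine that owns the inline arguments, exactly as above.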
I got an awesome way to test individual bits without destroying A, simulating the nonexistent BIT #$xx opcode. Just use bit $xxxx on known opcode bytes!
Test bit 6 :
Since your code is extremely likely to have a RTI instruction somewhere, just bit this instruction, and it will test bit 6. (RTI = $40)
Test bit 5 :
BIT any JSR opcode
Test bit 4 :
BIT any BPL opcode
Test bit 3 :
BIT a PHP opcode (okay PHP isn't as common but chances are you'll still use it at least once)
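For instance, testing bit 6 of A could look like this (the labels and `value` are hypothetical; any ROM byte that happens to be $40 works just as well):

```asm
my_rti: rti             ; the RTI opcode byte is $40 (bit 6 set)

        lda value       ; value: some zero-page byte to test
        bit my_rti      ; Z = ((A AND $40) == 0), A untouched
        beq bit6_clear  ; taken when bit 6 of A is clear
        ; bit 6 of A was set
bit6_clear:
```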
Useless, lumbering half-wits don't scare us.
OK I just made another awesome discovery.
For a function that needs a bool input, use V as the input flag. Since so few instructions affect V (adc, sbc, bit, and obviously clv, plp and rti), it's handy when you want a routine to do something slightly different: C is more likely to be clobbered before the part that can take two different routes, while V is more likely to survive unaffected until that point.
This can really save you a few bytes if you would otherwise write two different routines, or pass the flag in C and protect it with php/plp, which is not as efficient.
Also, bit an RTI instruction to set V.
The same could be done for returning a bool value, although I'm so much used to return C as a bool that this habit will be hard to change.
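A sketch of the idea; DoThing is hypothetical, and my_rti labels any RTI instruction (opcode $40, so its bit 6 sets V via BIT):

```asm
        clv             ; V = 0 : normal behaviour
        jsr DoThing
        ; ...
        bit my_rti      ; V = 1 : alternate behaviour
        jsr DoThing
        ; ...

DoThing:
        ; loads, stores, transfers, inc/dec, and/ora/eor,
        ; shifts and branches all leave V alone
        bvc normal
        ; alternate route here
normal:
        rts
my_rti: rti             ; opcode byte = $40
```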
Second part, I analyzed the passing of >4 bytes of parameters very carefully.
bogax's solution is officially the best.
I counted bytes as follows, assuming you have M functions that use N bytes of parameters:
- Standard solution (last 3 args in registers, anything else stored to ZP before calling): 9+N bytes for each call, 0 overhead for the callee.
Overall: (9+N)*M bytes.
- Nintendo's solution (store arguments after the jsr and have the callee call a function that copies the args to ZP): 3+N bytes for each call, 3 bytes of overhead for the callee; the argument-retrieval function takes 32 bytes.
Overall: (6+N)*M + 32 bytes.
- bogax's solution (jsr to a special function, followed by the address of the actual function, the # of arguments and the arguments): 6+N bytes for each call, 0 overhead for the callee; the argument-retrieval function takes 29 bytes.
Overall: (6+N)*M + 29 bytes.
So bogax's is the winner (by a close call): it always beats Nintendo's by 3 bytes, and it beats the standard solution once 3M > 29, that is, when a program needs at least 10 functions with more than 3 bytes of arguments.
I haven't checked speed, but I'm sure bogax's is faster since it avoids an extra jsr/rts of overhead each time.
His routine could possibly be improved even further by using BRK to simulate a JSR to the argument-retrieving function, if IRQs aren't used.
- Jarhmander
- Formerly ~J-@D!~
- Posts: 568
- Joined: Sun Mar 12, 2006 12:36 am
- Location: Rive nord de Montréal
I just thought of something. I don't think it's revolutionary or newly discovered, but I'll share the basic idea nevertheless.
I thought that the zero page can be used as a data stack, to pass parameters to subroutines and to allocate auto variables (auto as in the B and C programming languages).
Random points about it:
1- Let's call the data stack pointer "dSP" in the following (odd capitalization so it isn't confused with DSP, which has nothing to do with this).
dSP normally lives in register X, but a zero-page address is reserved to keep its value if we need X for other purposes.
The data stack grows downward like the regular stack. Let's use a post-decrement/pre-increment scheme for pushing/pulling values to/from the data stack, like the regular stack.
2- X is chosen because with RMW instructions, zero-page indexed can only use X.
3- For non-RMW instructions, zero-page indexed takes the same cycles as absolute indexed, but takes less space. For RMW instructions, zero-page indexed takes both less space and less cycles to execute.
4- As an added benefit, pointers (to data) in the stack can be directly used thanks to the rarely used (Indirect, X) addressing.
5- Example of allocating/deallocating space on data stack:
lda dSP ; if dSP already in X, do "txa" instead
sec ; can be optimized-out
sbc #space_needed
sta dSP
tax
lda #init_val ; now you can do whatever you want with
sta $01,X ; allocated space
...
...
txa ; time to free space
clc
adc #space_needed
sta dSP
tax
If X is used to keep dSP at all times, then we can omit storing its value at its zero-page location, and ISRs can then safely use X as the dSP. Otherwise, you must at all times keep a valid dSP somewhere in memory so ISRs can load it, because you can't entirely rely on X holding a valid dSP inside an ISR. Of course, if the data stack is not used in any ISR, then you have no trouble at all.
6- A (slightly) more complicated example, the memcpy function:
; void *memcpy(void *dest, void *src, uint8_t n)
; (cdecl)
; A non-conforming implementation of memcpy
; Copies src into dest, returns original dest
memcpy_cdecl:
lda $04,X
sta ret0
lda $05,X
sta ret1
ldy $01,X ; fetch 'n'
bne memcpy_cdecl_loop ; protect against n = 0
rts
memcpy_cdecl_loop:
lda ($02,X)
sta ($04,X)
dey
bne :+ ; end?
rts
: inc $02,X
bne :+
inc $03,X
: inc $04,X
bne memcpy_cdecl_loop
inc $05,X
jmp memcpy_cdecl_loop
Here I introduced ret0 and ret1, which are fixed zero-page locations storing return values. That part isn't essential to the idea; one could use only the stack for return values, or registers when the result's datatype is small enough, to give some examples; it works as long as you're consistent with your calling convention. The key here is that parameters are "pushed" from left to right, so n is at dSP+1, src at dSP+2 and dest at dSP+4.
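A hypothetical call site, assuming X alone holds dSP and that dest_buf/src_buf are buffers defined elsewhere: the arguments are "pushed" by allocating 5 bytes and filling the slots, then freed afterwards.

```asm
        txa
        sec
        sbc #5             ; allocate: n (1), src (2), dest (2)
        tax
        lda #<dest_buf
        sta $04,X          ; dest, low byte (leftmost arg, pushed first)
        lda #>dest_buf
        sta $05,X          ; dest, high byte
        lda #<src_buf
        sta $02,X          ; src, low byte
        lda #>src_buf
        sta $03,X          ; src, high byte
        lda #16
        sta $01,X          ; n = 16 bytes to copy
        jsr memcpy_cdecl   ; result (original dest) lands in ret0/ret1
        txa
        clc
        adc #5             ; free the arguments
        tax
```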
7- To be really useful in a project, one could mix several calling conventions, say "cdecl" (my idea) and "fastcall" (where the real registers are used to pass values, so it's quicker), and suffix each subroutine's name with "_cdecl" or "_fastc" so the convention is explicit. Moreover, if you change your mind about the calling convention of a given subroutine and rename it accordingly, the assembler's errors will point out everywhere the old name was called, i.e. everywhere the code must be adapted to the changed convention.
Sorry if my code examples don't compile or don't work because I messed something up, but I hope you get the basic idea.
Yes, I always found it silly that CC65 uses (),Y to address its data stack when they could do what you describe instead.
Also, is there any reason the stack should go "upwards" like the hardware stack? I think it would make just as much sense to have a downwards stack instead; I don't think it would affect performance in any way. But it would be better if, when you push pointers on the stack, you push the low byte first, so you can use (,X). If we do it as you describe, we would have to be sure to push the high byte first.
Or even better, get rid of data stacks completely, like SDCC does, at the small price of not supporting re-entrant functions, which nobody ever uses anyway.
Also, I think there's a 99% chance X has to be used for something else once in a while, so you should definitely save it somewhere.
Bregalad, I think you misunderstood something. I described the data stack as a "downward stack", that is, the data stack pointer initially points at a "high" address, and the more you "push" things onto it, the lower the pointer gets, just like the regular stack. It has nothing to do with the order of the individual bytes "pushed" onto it; obviously, you put pointers on the stack in little-endian order, so you can correctly use the (indirect,X) addressing.
Using stack-based memory allocation has the benefit of allowing recursion, but that wasn't the intent. In the absence of an intelligent linker and a keyword to define a memory location used only within a subroutine¹, you have to ensure yourself that any "temp" variable is used by only one active subroutine at a time (including ISRs), or you can welcome bugs, especially the elusive ones. Using a data stack removes that burden. Note that you don't have to manipulate the return address or the regular stack to get parameters, and because it is located in zero page, accesses are fairly fast (slower than plain zero-page access, though) and compact, and you can use pointers directly from your data stack without having to move them into zero page first. You can manipulate your data directly in zero page, and when you're done with it, you can free it automagically by adjusting your data stack pointer.
But another idea I've just got is to simply treat part of the zero page as a big set of slow virtual registers: part of it saved by the caller, and the other part saved by the callee, if necessary. ISRs conforming to that model must save any of those zero-page locations before using them. There's then no data stack needed; the caller-saved part must be big enough for any return type, and the callee-saved part big enough for temp data and parameters. With care, it can be both efficient and easy to maintain.
—
[1] An external tool for an assembler that analyses the usage of marked memory locations, and which of the subroutines using them can be active at the same time, would count.
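A sketch of that convention, with a hypothetical split ($00-$07 caller-saved, $08-$0F callee-saved) and a made-up routine name:

```asm
; caller side: the callee may clobber $00-$07 freely,
; so the caller saves what it still needs
        lda $00
        pha
        jsr helper
        pla
        sta $00         ; $00 survives the call

; callee side: it must preserve $08-$0F if it uses them
helper:
        lda $08
        pha
        ; ... free to use $08 as scratch here ...
        pla
        sta $08         ; restore before returning
        rts
```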
~J-@D!~ wrote:But another idea I've just got is to treat simply part of the zero page as a big set of slow virtual registers, part of it must be saved by the caller and the other part saved by the callee, if necessary.
Yeah, that's exactly what I adopted for parts of Thwaite: $0000-$0007 saved by caller, $0008-$000F saved by callee.
I don't think I misunderstood anything, but it's possible I got "downward" and "upward" backwards.
If you use a stack like the hardware stack at $100, for example with the pointer pointing to $ff, and you push a pointer, you'll have to push the MSB first because:
$ff = MSB of pointer
$fe = LSB of pointer
If X = $fe :
lda ($00,X) -> load from pointer
If you would have pushed the LSB first this would not have worked.
However, if you opt for the other order, let's say the stack starts at $00 and grows the other way.
Then if you push your pointer, you'll have to push it in the regular little-endian format:
$00 = LSB of pointer
$01 = MSB of pointer
X = 0;
lda ($00,X) -> load from pointer
I'm not saying one is better than the other; you just have to be careful about it.
And you're completely right, I 100% agree about the issues that come up if you don't use a data stack. I know those problems very well, given how many nasty bugs I've had because I called a routine that called another routine that modified the temp variables I relied on.
However, a good compiler such as SDCC could target the 6502 while avoiding those clashes, as you said, by treating part of the zero page as registers. I think this would be the most efficient solution, even though the zero-page data stack would still be reasonably efficient, while the current solution implemented in CC65 is very inefficient.
I'm going to use this one:
Normally, you would do this (“flag” is a byte in the zero page):
function  code        bytes  cycles
set       lda #1
          sta flag    4      5
clear     lda #0
          sta flag    4      5
test      lda flag
          beq clear   4      5/6
Woz did it like this, assuming only bit 7 is important (you could also use the overflow flag without much more effort for two flags):
function  code        bytes  cycles
set       lda #$80
          sta flag    4      5
clear     lsr flag    2      5
test      bit flag
          bpl clear   4      5/6
For setting the flag you could also use:
sec
ror flag
So you never have to touch a register, but it is slower.
Credited to Wozniak. Thank you, Apple Computer.
http://www.pagetable.com/?p=33
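And for the two-flag variant, one BIT polls both flags at once without touching A. Here `flags` is a hypothetical zero-page byte holding one flag in bit 7 and another in bit 6, and the branch targets are made up:

```asm
        bit flags       ; N = bit 7 of flags, V = bit 6, A untouched
        bmi do_seven    ; taken when the bit-7 flag is set
        bvs do_six      ; taken when the bit-6 flag is set
        ; neither flag set
```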