6502 vdelay - cycle delay routine with variable length at runtime
Moderator: Moderators
Re: 6502 vdelay - cycle delay routine with variable length at runtime
Sure, please use these if they're useful! When I can, I like to license my code as permissively as possible. Attribution is not necessary, but if you'd like, you're welcome to credit me as Fiskbit. I do have a github profile, but don't have anything public on it at this time.
I can look at extending the code to 16-bit, but haven't thought yet about how it would need to change. Happy to leave it to you, too.
I can look at extending the code to 16-bit, but haven't thought yet about how it would need to change. Happy to leave it to you, too.
Re: 6502 vdelay - cycle delay routine with variable length at runtime
I came up with another improvement. Since the problem with clockslide is that odd lengths will read from $EA, which could have side effects, I've opted to change the instruction to an always-taken branch, which is also 2 bytes and takes 3 cycles. The operand needs to both lead to an RTS instruction and be effectively a NOP when executed itself.
Here's the non-self-modifying version, which I believe is the same number of bytes as before:
And here's the self-modifying one, which is 1 byte longer than before:
Here's the non-self-modifying version, which I believe is the same number of bytes as before:
Code: Select all
VDELAY_MINIMUM = 46
; Waits for A cycles. JSR/RTS is included. Minimum is 46 cycles.
; Input: A: Number of cycles to delay.
; Clobbers: A/Y
VDelay: ; +6 = 6 (jsr)
; If the requested length is too low, wait the minimum time.
SEC ; +2 = 8
SBC #VDELAY_MINIMUM ; +2 = 10
BCC VDelay_TooLow ; +2 = 12
; Wait in 5-cycle amounts until we've waited one time too many.
-
; SEC
SBC #$05 ; +2 = 14
BCS - ; +2 = 16
; Push the high byte of the clockslide address.
TAY ; +2 = 18
LDA #>(VDelay_Clockslide-1) ; +2 = 20
PHA ; +3 = 23
TYA ; +2 = 25
; Use the remainder from the wait to calculate the low byte of the clockslide address.
EOR #$FF ; +2 = 27
ADC #<(VDelay_Clockslide-1) ; +2 = 29
PHA ; +3 = 32
; Clockslide to do the less-than-5 cycle portion.
RTS ; +6 = 38
; This spends 0-4 cycles plus 2 cycles of overhead.
VDelay_Clockslide:
.db $C9,$C9,$C9,$90,$0A ; +2 = 40
VDelay_Clockslide_End:
RTS ; +6 = 46
; Wait a fixed time for calls with an argument that's too low for us to service properly.
VDelay_TooLow: ; +3 = 13 (from branch)
; 8 cycle loop. Cycles = length * iterations + 1.
LDY #3 ; +(8*3 + 1) = +25 = 38
-
JMP +
+
DEY
BNE -
NOP ; +2 = 40
VDelay_ClockslideBranchTarget:
RTS ; +6 = 46
.IF VDelay_ClockslideBranchTarget - VDelay_Clockslide_End != $0A
.ERROR VDelay_ClockslideBranchTarget wrong distance from VDelay_Clockslide_End
.ENDIF
Code: Select all
VDELAY_MINIMUM = 33
; Waits for A cycles. JSR/RTS is included. Minimum is 33 cycles. Must be executed from RAM.
; Input: A: Number of cycles to delay.
; Clobbers: A/Y
VDelay_Start:
VDelay_ClockslideBranchTarget:
RTS
VDelay: ; +6 = 6 (jsr)
; If the requested length is too low, wait the minimum time.
SEC ; +2 = 8
SBC #VDELAY_MINIMUM ; +2 = 10
BCC VDelay_TooLow ; +2 = 12
; Wait in 5-cycle amounts until we've waited one time too many.
-
; SEC
SBC #$05 ; +2 = 14
BCS - ; +2 = 16
; Set up the target location in the clockslide for the last (less than 5-cycle) wait.
EOR #$FF ; +2 = 18
STA VDelay_RAM + (VDelay_ClockslideBranch+1 - VDelay) ; +4 = 22
VDelay_ClockslideBranch:
BPL VDelay_Clockslide ; +3 = 25
; This spends 0-4 cycles plus 2 cycles of overhead.
VDelay_Clockslide:
.db $C9,$C9,$C9,$90,$EA ; +2 = 27
VDelay_Clockslide_End:
RTS ; +6 = 33
; Wait a fixed time for calls with an argument that's too low for us to service properly.
VDelay_TooLow: ; +3 = 13 (from branch)
NOP ; +2 = 15
NOP ; +2 = 17
NOP ; +2 = 19
NOP ; +2 = 21
NOP ; +2 = 23
NOP ; +2 = 25
NOP ; +2 = 27
RTS ; +6 = 33
VDelay_End:
.IF VDelay_Clockslide_End - VDelay_ClockslideBranchTarget != ($100 - $EA)
.ERROR VDelay_ClockslideBranchTarget wrong distance from VDelay_Clockslide_End
.ENDIF
- rainwarrior
- Posts: 7979
- Joined: Sun Jan 22, 2012 12:03 pm
- Location: Canada
- Contact:
Re: 6502 vdelay - cycle delay routine with variable length at runtime
Ah, yes that's a very nice idea for eliminating the spurious read!
I've now incorporated those two variations into the project, with some small modifications: current version
The 16-bit versions actually only need 2 more cycles of overhead. The SEC gets replaced by CPX #0, which also sets carry, and after that there's just an extra branch to the "full" version.
I was trying to puzzle out how to create meaningful asserts for the way carry was always clear for the clockslide, but then I realized we can use LDA instead of CMP and not affect carry. (The code has much lighter alignment restrictions this way too.)
The alignment of the branch target seems a little bit "lucky" in your two versions, but at least there's clear ways to move it around if needed. For the 16-bit self-modifying version I couldn't use $EA but the padding needed for $0A was an insignificant 4 bytes.
I've now incorporated those two variations into the project, with some small modifications: current version
The 16-bit versions actually only need 2 more cycles of overhead. The SEC gets replaced by CPX #0, which also sets carry, and after that there's just an extra branch to the "full" version.
I was trying to puzzle out how to create meaningful asserts for the way carry was always clear for the clockslide, but then I realized we can use LDA instead of CMP and not affect carry. (The code has much lighter alignment restrictions this way too.)
The alignment of the branch target seems a little bit "lucky" in your two versions, but at least there's clear ways to move it around if needed. For the 16-bit self-modifying version I couldn't use $EA but the padding needed for $0A was an insignificant 4 bytes.

Re: 6502 vdelay - cycle delay routine with variable length at runtime
Luck is a big part of this level of optimization. :) These seem very lean now. Really cool that the 16-bit one only needs 2 extra cycles.
Looking at vdelay_modify.s, I think if you move vdelay_toolow to the end, you can use #$18 (CLC) as the clockslide's branch operand/instruction and put the vdelay_clockslide_branch RTS right before it. This removes the 4 bytes of padding. I haven't tested it, but it should look like this:
Looking at vdelay_modify.s, I think if you move vdelay_toolow to the end, you can use #$18 (CLC) as the clockslide's branch operand/instruction and put the vdelay_clockslide_branch RTS right before it. This removes the 4 bytes of padding. I haven't tested it, but it should look like this:
Code: Select all
; This "clockslide" overlaps instructions so that each byte adds one cycle to the tally.
; 0-4 cycles + 2 cycles of overhead (A clobbered)
vdelay_clockslide: ; +2 = 29
.byte $A9 ; 0 LDA #$A9 (+2)
.byte $A9 ; 1 LDA #$A9 (+2)
.byte $A9 ; 0,2 LDA #$90 (+2)
.byte $90 ; 1,3 BCC *+2+$0A (+3, carry guaranteed clear)
.byte $18 ; 0,2,4 CLC (+2)
.assert >(vdelay_clockslide) = >(vdelay_clockslide+4), error, "Clockslide crosses page."
.assert >(*+$18) = >*, error, "Clockslide branch page crossed!"
.assert (*+$18) = vdelay_clockslide_branch, error, "Clockslide branch misplaced!"
rts ; +6 = 35 (end)
vdelay_full: ; +3 = 11
sec ; +2 = 13
sbc #VDELAY_FULL_OVERHEAD ; +2 = 15
tay ; +2 = 17
txa ; +2 = 19
sbc #0 ; +2 = 21
BRPAGE beq, vdelay_high_none ; +2 = 23
: ; 256 cycles each iteration
ldx #50 ; +2 = 2
: ; 5 cycle loop +250 = 252
dex
BRPAGE bne, :- ; -1 = 251
sbc #1 ; +2 = 253 (carry always set)
BRPAGE bne, :-- ; +3 = 256 -1 = 22 (on last iteration)
nop ; +2 = 24
vdelay_high_none: ; +3 = 24 (from branch)
tya ; +2 = 26
jmp vdelay_low ; +3 = 29
; -14+35 = 50 (full end)
vdelay_clockslide_branch: ; exactly 24 bytes past the clockslide branch
rts ; +6 = 35 (end)
vdelay_toolow: ; +3 = 15 (from branch)
php ; +3 = 18
plp ; +4 = 22
php ; +3 = 25
plp ; +4 = 29
rts ; +6 = 35 (end)
- rainwarrior
- Posts: 7979
- Joined: Sun Jan 22, 2012 12:03 pm
- Location: Canada
- Contact:
Re: 6502 vdelay - cycle delay routine with variable length at runtime
Nice! I remember considering CLC but I think I miscounted, heh. Thanks for giving another look to it.
Re: 6502 vdelay - cycle delay routine with variable length at runtime
I've not dug too much into this thread, but I saw the stack management that powers the clockslide and questioned if there weren't shorter+faster ways of dealing with the small values. Using branches seems to work pretty well. The 8-bit version is just stripped down from the 16-bit version without any real extra work, so there may be a way to squeeze a few cycles out of the toolow handling.
I handle the 3 least significant bits with shifts and conditions and then use an 8 cycle loop to consume the rest of the low byte. The branches at the end of loops always go back to the beginning of the loop, which makes it easy to calculate clocks spent and keeps a single flow through the code. It may be possible to squeeze out a bit more by changing the loop structures; I've not really tried yet. This is a "looks interesting and is a good stopping place."
I do some branches that jump to the next instruction. From what I can tell, that is allowed as the encoding doesn't handle positive vs negative jumps differently, but it is strange enough I worried it would break something. It at least assembles and runs on the simulator without issue.
vdelay: 46 bytes
vdelay_short: 30 bytes
I handle the 3 least significant bits with shifts and conditions and then use an 8 cycle loop to consume the rest of the low byte. The branches at the end of loops always go back to the beginning of the loop, which makes it easy to calculate clocks spent and keeps a single flow through the code. It may be possible to squeeze out a bit more by changing the loop structures; I've not really tried yet. This is a "looks interesting and is a good stopping place."
I do some branches that jump to the next instruction. From what I can tell, that is allowed as the encoding doesn't handle positive vs negative jumps differently, but it is strange enough I worried it would break something. It at least assembles and runs on the simulator without issue.
vdelay: 46 bytes
Code: Select all
VDELAY_OVERHEAD = 46
.export vdelay
vdelay: ; +6 = 6 (jsr)
sec ; +2 = 8
sbc #VDELAY_OVERHEAD ; +2 = 10
tay ; +2 = 12
txa ; +2 = 14
sbc #0 ; +2 = 16
BRPAGE bcc, @toolow ; +2 = 18
@high: BRPAGE beq, @low ; +3 = 21
ldx #31
: bit $0
dex
BRPAGE bne, :-
sbc #1
BRPAGE bcs, @high
@toolow:ldy #0
@low: tya ; +2 = 23
lsr ; +2 = 25
BRPAGE bcs, @twos ; +2 = 27
@twos: lsr ; +2 = 29
BRPAGE bcc, @fours ; +3 = 32
bit $0
@fours: lsr ; +2 = 34
BRPAGE bcs, @waste4 ; +2 = 36
@rest: sec ; +2 = 38
@rest2: BRPAGE bne, @waste8 ; +2 = 40
rts ; +6 = 46
@waste4:BRPAGE bcs, @rest ; +3 = 3 (+1 from branch to get here)
@waste8:sbc #1 ; +2 = 2
BRPAGE bcs, @rest2 ; +3 = 5 (+3 from branch to get here)
Code: Select all
VDELAY_OVERHEAD = 38
.export vdelay
vdelay: ; +6 = 6 (jsr)
sec ; +2 = 8
sbc #VDELAY_OVERHEAD ; +2 = 10
BRPAGE bcc, @toolow ; +2 = 12
BRPAGE bcs, @low ; +3 = 15
@toolow:lda #0
@low: lsr ; +2 = 17
BRPAGE bcs, @twos ; +2 = 19
@twos: lsr ; +2 = 21
BRPAGE bcc, @fours ; +3 = 24
bit $0
@fours: lsr ; +2 = 26
BRPAGE bcs, @waste4 ; +2 = 28
@rest: sec ; +2 = 30
@rest2: BRPAGE bne, @waste8 ; +2 = 32
rts ; +6 = 38
@waste4:BRPAGE bcs, @rest ; +3 = 3 (+1 from branch to get here)
@waste8:sbc #1 ; +2 = 2
BRPAGE bcs, @rest2 ; +3 = 5 (+3 from branch to get here)
- rainwarrior
- Posts: 7979
- Joined: Sun Jan 22, 2012 12:03 pm
- Location: Canada
- Contact:
Re: 6502 vdelay - cycle delay routine with variable length at runtime
That's a very interesting approach! So, decomposing the number into bits and climbing a ladder of branches of power-of-two lengths?
Noting that your "short" version has an overhead of 38 and starts with a sec, it can trivially become a "long" verison with 2 more cycles in the way the others were done:
at the start, becomes:
After that the rest of the short version follows, and otherwise the "long" version branch subtracts some larger overhead, counts out the 256-cycle loop, then can finish the remainder with the "short" version.
So... may I use this code on the github project, and if yes how would you like to be attributed?
Noting that your "short" version has an overhead of 38 and starts with a sec, it can trivially become a "long" verison with 2 more cycles in the way the others were done:
Code: Select all
sec
Code: Select all
cpx #0 ; sets carry
bne long_version
Yes, a branch that adds 0 to the PC is perfectly fine.
So... may I use this code on the github project, and if yes how would you like to be attributed?
Re: 6502 vdelay - cycle delay routine with variable length at runtime
Yeah. I think the slides are better when you have a larger number of clocks to consume to outweigh the setup cost (or you can modify the code).So, decomposing the number into bits and climbing a ladder of branches of power-of-two lengths?
Yeah, for a bit of code size. That would bring it closer to vdelay_modify. It would also mean we can use the stack for temporary and stop clobbering y. At that point, it looks like we could then reduce the overhead for both versions by 3 cycles by using a separate toolow flow and removing "beq, @low". It looks like that would not even increase code size if we jump back to the "bcc, @fours" line just after @twos.Noting that your "short" version has an overhead of 38 and starts with a sec, it can trivially become a "long" verison with 2 more cycles in the way the others were done:
It's disappointing the repo is using a custom license. It seems the zlib License is very similar. Would it make sense to use it instead? The current terms are quite appropriate, other than the custom aspect.So... may I use this code on the github project, and if yes how would you like to be attributed?
- rainwarrior
- Posts: 7979
- Joined: Sun Jan 22, 2012 12:03 pm
- Location: Canada
- Contact:
Re: 6502 vdelay - cycle delay routine with variable length at runtime
I do not wish to alter the license. Though at this point I would remove the Patreon link from the readme, since it's become a community project.ejona wrote: ↑Mon Oct 19, 2020 8:46 amIt's disappointing the repo is using a custom license. It seems the zlib License is very similar. Would it make sense to use it instead? The current terms are quite appropriate, other than the custom aspect.rainwarrior wrote: ↑Mon Oct 19, 2020 12:47 ammay I use this code on the github project, and if yes how would you like to be attributed?
- rainwarrior
- Posts: 7979
- Joined: Sun Jan 22, 2012 12:03 pm
- Location: Canada
- Contact:
Re: 6502 vdelay - cycle delay routine with variable length at runtime
Here's a branch with your technique incorporated into vdelay / vdelay_short:
Edit: that branch is now part of the master: https://github.com/bbbradsmith/6502vdelay
I credited you as "Ejona" and gave a link to this thread in the readme. Let me know if you'd like to be attributed differently than that. With your consent I'll push it to the master. (Or if you don't consent I can delete the branch.)
Edit: that branch is now part of the master: https://github.com/bbbradsmith/6502vdelay
I credited you as "Ejona" and gave a link to this thread in the readme. Let me know if you'd like to be attributed differently than that. With your consent I'll push it to the master. (Or if you don't consent I can delete the branch.)
Last edited by rainwarrior on Tue Oct 20, 2020 6:16 pm, edited 1 time in total.
Re: 6502 vdelay - cycle delay routine with variable length at runtime
You are free to incorporate it into the 6502vdelay project. Ejona is fine, although "Eric Anderson" may be better. You may instead want to link to my github id: https://github.com/ejona86. But I'm not too picky.I credited you as "Ejona" and gave a link to this thread in the readme. Let me know if you'd like to be attributed differently than that. With your consent I'll push it to the master.
Re: 6502 vdelay - cycle delay routine with variable length at runtime
I noticed there was previously an "extreme" version that used 826 bytes. With that mentality in mind, here is a vdelay_short_modify_extreme version that has an overhead of 30 cycles and is 132 bytes. It uses the lsr+branch technique to manage even vs odd and then a simple nopslide.
If you weaken how toolow is handled to allow a +/-1 variation for small values, then you can have an rts run after the nopslide and remove the VDELAY_OVERHEAD cmp+branch which reduces overhead to 26 cycles. It would take 4 cycles to make it a full version, so wouldn't probably wouldn't be worth it compared to the current vdelay_modify's 35 cycles, unless we weaken the handling for small values to get a 30 cycle overhead. The nopslide can't be reused for a full version, so we'd use one of the other delay variations for larger values.
I'd send a PR, but I know there's quite a bit of styling decisions and choices of what to incorporate. The ".MOD 2" stuff could be removed once the code is "settled" and unlikely to change overhead. I left it in in case y'all see an improvement.
If you weaken how toolow is handled to allow a +/-1 variation for small values, then you can have an rts run after the nopslide and remove the VDELAY_OVERHEAD cmp+branch which reduces overhead to 26 cycles. It would take 4 cycles to make it a full version, so wouldn't probably wouldn't be worth it compared to the current vdelay_modify's 35 cycles, unless we weaken the handling for small values to get a 30 cycle overhead. The nopslide can't be reused for a full version, so we'd use one of the other delay variations for larger values.
I'd send a PR, but I know there's quite a bit of styling decisions and choices of what to incorporate. The ".MOD 2" stuff could be removed once the code is "settled" and unlikely to change overhead. I left it in in case y'all see an improvement.
Code: Select all
VDELAY_OVERHEAD_RAW = 29
; Only even overheads can be accurately accounted. An extra cycle is inserted if
; odd
VDELAY_OVERHEAD = VDELAY_OVERHEAD_RAW + (VDELAY_OVERHEAD_RAW .MOD 2)
.if !(VDELAY_OVERHEAD_RAW .MOD 2)
vdelay_toolow:
lda #(VDELAY_OVERHEAD>>1)^$7F
BRPAGE bcc, vdelay_toolow_resume
.endif
.export vdelay
vdelay: ; +6 = 6 (jsr)
cmp #VDELAY_OVERHEAD ; +2 = 8
.if VDELAY_OVERHEAD_RAW .MOD 2
BRPAGE bcs, @nottoolow ; (alt) +3 (if cycles would be odd)
nop
lda #(VDELAY_OVERHEAD>>1)^$7F
BRPAGE bcc, vdelay_toolow_resume
@nottoolow:
.else
BRPAGE bcc, vdelay_toolow ; +2 = 10
.endif
lsr ; +2 = 12
BRPAGE bcs, :+ ; +2 = 14
: eor #$7F ; +2 = 16
vdelay_toolow_resume:
sta @modify+1 ; +4 = 20
@modify:
; @clockslide_end-1 is not used at runtime; it is just for the BRPAGE
; and assembler range checks
BRPAGE bpl, @clockslide_end-1 ; +3 = 23
@clockslide:
.repeat 127-(VDELAY_OVERHEAD/2)
nop
.endrepeat
rts ; +6 = 29
@clockslide_end:
- rainwarrior
- Posts: 7979
- Joined: Sun Jan 22, 2012 12:03 pm
- Location: Canada
- Contact:
Re: 6502 vdelay - cycle delay routine with variable length at runtime
Having trouble following the "mod 2" part, but the clockslide technique was to split a 2-byte 2-cycle instruction, rather than a 1-byte one like NOP. With that in mind 29 is definitely possible. Not sure if I missed any opportunities here to get it lower. (Should probably pack the "vdelay" part into the unused part of the clockslide to reach an even 256 bytes, too...)
I wouldn't relax the restriction on including toolow (or making MINIMUM implied, etc.) for the "extreme" variation because I think that optimization/compromise trivially applies to all versions? Would rather it be comparing apples to apples, and leave that last inch to be traded off manually.
Code: Select all
VDELAY_MINIMUM = 29
.align 256
vdelay_clockslide: ; +2 = 23
.repeat 254
.byte $C9 ; CMP imm
.endrepeat
.byte $C5 ; CMP $EA
nop
rts ; +6 = 29
vdelay: ; +6 = 6
sec ; +2 = 8
sbc #VDELAY_MINIMUM ; +2 = 10
BRPAGE bcc, vdelay_toolow ; +2 = 12
vdelay_toolow_return:
eor #$FF ; +2 = 14
sta vdelay_modify+1 ; +4 = 18
vdelay_modify:
jmp vdelay_clockslide ; +3 = 21
.assert (<vdelay_clockslide)=0, error, "Clockslide must be 256-byte aligned."
vdelay_toolow: ; +3 = 13
lda #VDELAY_MINIMUM-6 ; +2 = 15
jmp vdelay_toolow_return ; +3 = 18
- rainwarrior
- Posts: 7979
- Joined: Sun Jan 22, 2012 12:03 pm
- Location: Canada
- Contact:
Re: 6502 vdelay - cycle delay routine with variable length at runtime
Alright, ejona's technique is now in the master branch.
Re: 6502 vdelay - cycle delay routine with variable length at runtime
It appears this suggestion wasn't applied to the master branch. So there's 3 cycles available to be shaved off still.
Code: Select all
vdelay: ; +6 = 6 (jsr)
sec ; +2 = 8
sbc #VDELAY_MINIMUM ; +2 = 10
BRPAGE bcc, vdelay_toolow ; +2 = 12
lsr ; +2 = 14
BRPAGE bcs, vdelay_2s ; +2 = 16 (1 extra if bit 1 set)
vdelay_2s:
lsr ; +2 = 18
vdelay_toolow_resume:
BRPAGE bcc, vdelay_4s ; +3 = 21 (2 extra if bit 2 set)
...
vdelay_toolow:
lda #0
BRPAGE bcc, vdelay_toolow_resume
It just adds a clock of delay to make the overhead even. Since the "lsr" rounds down to the nearest even I had to pad to an even overhead.
Yeah, totally. If you're referring to the label before the nops, yes, it is the wrong term. I noticed it later.but the clockslide technique was to split a 2-byte 2-cycle instruction
Not actually. Most of them behave very poorly, either underflowing or jumping who-knows where. I totally accept the idea of a level playing field, which is why I included the checks in the version I posted. But only being off-by-one for toolow values seems pretty close to useful.I wouldn't relax the restriction on including toolow (or making MINIMUM implied, etc.) for the "extreme" variation because I think that optimization/compromise trivially applies to all versions?
For that particular code, it was non-trivial to remove the check, because the ".mod 2" thing is mixed in and would need to be worked in elsewhere. (I tried to have the mod 2 management separate from the toolow check, but it made things really complicated.)Would rather it be comparing apples to apples, and leave that last inch to be traded off manually.
Last edited by ejona on Wed Oct 21, 2020 8:07 am, edited 1 time in total.