6502 vdelay - cycle delay routine with variable length at runtime
- rainwarrior
- Posts: 7978
- Joined: Sun Jan 22, 2012 12:03 pm
- Location: Canada
- Contact:
I finally found myself needing one of these so I wrote one:
https://github.com/bbbradsmith/6502vdelay
Pretty simple: just load the 16-bit number of cycles you want to delay into X:A and jsr vdelay will take that long.
There's some overhead involved, so there's a 48 cycle minimum here. (Or 35 if self-modifying code is allowed.)
Last edited by rainwarrior on Tue Oct 13, 2020 2:14 am, edited 3 times in total.
Re: 6502 vdelay - cycle delay routine with variable length at runtime
Check out also this slick one from Bruce Clark. The delay is 9*(256*A+Y)+8 cycles (plus 12 more for JSR & RTS if you make it a subroutine). This assumes that the BCS does not cross a page boundary.
He writes: "A and Y are the high and low bytes (respectively) of a 16-bit value; multiply that 16-bit value by 9, then add 8 and you get the cycle count. So the delay can range from 8 to 589823 cycles, with a resolution of 9 cycles. One of the nice things about this code is that it's easy to figure out what values to put in A and Y when you want a delay of, e.g. (approximately) 10000 cycles." Here's the same thing with my structure macros (the resulting machine code being identical):
Code:
loop: CPY #1
      DEY
      SBC #0
      BCS loop
Code:
BEGIN
    CPY #1
    DEY
    SBC #0
UNTIL_CARRY_CLEAR
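As a sanity check on the 9*(256*A+Y)+8 formula, here is a minimal Python sketch (not from the thread) that steps through the four instructions and counts cycles, assuming the usual 6502 timings (2 cycles each for CPY #, DEY and SBC #, and 3 or 2 cycles for a taken or untaken BCS with no page crossing). The helper names `clark_loop_cycles` and `clark_args_for` are made up for illustration.

```python
def clark_loop_cycles(a, y):
    """Count cycles by stepping the 4-instruction loop (no page crossing)."""
    cycles = 0
    while True:
        carry = 1 if y >= 1 else 0      # CPY #1 sets carry when Y >= 1
        y = (y - 1) & 0xFF              # DEY
        diff = a - (1 - carry)          # SBC #0 subtracts only the borrow
        carry = 1 if diff >= 0 else 0
        a = diff & 0xFF
        cycles += 2 + 2 + 2             # CPY, DEY, SBC
        if carry:
            cycles += 3                 # BCS taken
        else:
            cycles += 2                 # BCS not taken: loop exits
            return cycles


def clark_args_for(target):
    """Return (A, Y, actual_cycles) for the longest delay not exceeding target."""
    if target < 8:
        raise ValueError("minimum delay is 8 cycles")
    count = min((target - 8) // 9, 0xFFFF)
    return count >> 8, count & 0xFF, 9 * count + 8


print(clark_args_for(10000))
```

For a target of 10000 cycles this picks A=4, Y=86, which the stepped model agrees takes 9998 cycles.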
http://WilsonMinesCo.com/ lots of 6502 resources
- rainwarrior
Re: 6502 vdelay - cycle delay routine with variable length at runtime
That's an interesting one, though it's limited to a resolution of 9 cycles? I like how minimal that code is for a 16-bit countdown.
At the core, my version has a short 8-cycle loop that's a similar idea, but the additional overhead/code does all the work to get you cycle-accurate resolution.
I had another variation with a 16-cycle loop that's overall a bit smaller because it does one 16-bit countdown instead of 2 x 8-bit countdowns... makes me wonder if a variation of the technique you quoted could be applied to shrink the code slightly? Hmm.
If there exists a practical fixed-cycle 16-bit divide+modulo it would open up other possibilities too...
Re: 6502 vdelay - cycle delay routine with variable length at runtime
It's only seven bytes; so you could even straightline it (instead of calling it as a subroutine) and add a NOP or two, and if necessary, a trailing BCC to the next instruction (meaning a 3-cycle NOP, assuming it doesn't cross a page boundary), in quite a few places to get the needed resolution without the longer routines, and it'd still pay for itself. A macro could be made to lay down the right code to get the exact cycle count, all in one line.
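A rough sketch of the arithmetic such a macro would need, written in Python rather than in any assembler's macro language. The layout is an assumption for illustration: LDA #hi and LDY #lo (2 cycles each), the 9*count+8 loop, then padding made of NOPs (2 cycles) and at most one BCC to the next instruction (3 cycles, always taken because the loop exits with carry clear). The names `plan_fixed_delay` and `planned_cycles` are hypothetical.

```python
def plan_fixed_delay(n):
    """Return (count, nops, use_bcc) so the inline sequence takes exactly n cycles."""
    body = n - 4 - 8                    # cycles left for 9*count plus padding
    if body < 0:
        raise ValueError("too short for this sequence")
    count, pad = divmod(body, 9)
    if pad == 1:                        # 1 cycle cannot be built from 2s and 3s
        if count == 0:
            raise ValueError("no exact sequence for this length")
        count, pad = count - 1, 10      # trade one loop pass for five NOPs
    if count > 0xFFFF:
        raise ValueError("too long for a 16-bit count")
    use_bcc = pad % 2 == 1
    nops = (pad - 3) // 2 if use_bcc else pad // 2
    return count, nops, use_bcc


def planned_cycles(count, nops, use_bcc):
    """Total cycles of the assumed layout, for checking a plan."""
    return 4 + 9 * count + 8 + 2 * nops + (3 if use_bcc else 0)


plan = plan_fixed_delay(10000)
print(plan, planned_cycles(*plan))
```

Under these assumptions every length from 14 cycles upward has an exact plan; 12 works with no padding and 13 does not. The sequence clobbers A and Y.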
- rainwarrior
Re: 6502 vdelay - cycle delay routine with variable length at runtime
Garth wrote (Sat Oct 10, 2020 1:45 am):
"It's only seven bytes; so you could even straightline it (instead of calling it as a subroutine) and add a NOP or two, and if necessary, a trailing BCC to the next instruction (meaning a 3-cycle NOP, assuming it doesn't cross a page boundary), in quite a few places to get the needed resolution without the longer routines, and it'd still pay for itself. A macro could be made to lay down the right code to get the exact cycle count, all in one line."
My whole goal here was to be able to vary the length at runtime, though. Fixed delays are a different problem entirely.
...though my thing might be convenient if you have a lot of fixed delays in one program? I dunno. Not quite the intended purpose here.
I wonder if Bisqwit's fixed delay generator could make use of it? (Though probably not... "The macros define the guaranteed*-to-be-smallest code for all delays from 2 to 20000 cycles.")
Re: 6502 vdelay - cycle delay routine with variable length at runtime
I haven't fully digested the code yet, but it looks pretty neat. Have you considered using a clockslide instead of the jump table for delaying 0-7 cycles? You could put an 8-byte clockslide between vdelay_low and vdelay_low_rest and remove 3 cycles of overhead by not needing to jump to vdelay_low_rest anymore.
I've had to put 8-bit variable delays into some of my projects. I initially used a jump table like your code, but settled on the clockslide approach for later projects. My code is below. I opted not to remove the overhead from the input, which was fine for my purposes. This uses an indirect jump to get into the clockslide, with the high byte of the pointer being fixed and written during program init. The total overhead within the function itself is 20 cycles.
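Code:
-
    SEC
    SBC #$07                 ; count down in 7-cycle steps
    BCS -                    ; loop until the subtraction borrows
+
    EOR #$FF                 ; A = 6 - remainder: byte offset into the slide
    ADC #<(Clockslide)       ; carry is clear here, so this is a plain add
    STA cycle_delay_ptr+0    ; high byte was written during program init
    JMP (cycle_delay_ptr)
Clockslide:
    .db $C9,$C9,$C9,$C9,$C9,$C5,$EA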
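To see why jumping into different bytes of the clockslide gives different delays, here is a small Python sketch (not from the thread) that decodes the slide from each entry offset. It assumes the standard encodings: $C9 is CMP #imm (2 bytes, 2 cycles), $C5 is CMP zp (2 bytes, 3 cycles) and $EA is NOP (1 byte, 2 cycles). `slide_cycles` is a made-up helper name.

```python
# (length in bytes, cycles) for the three opcodes used in the slide.
OPCODES = {
    0xC9: (2, 2),  # CMP #imm
    0xC5: (2, 3),  # CMP zp
    0xEA: (1, 2),  # NOP
}


def slide_cycles(slide, entry):
    """Cycles spent from byte offset `entry` until execution runs off the end."""
    pc, cycles = entry, 0
    while pc < len(slide):
        length, cost = OPCODES[slide[pc]]
        pc += length
        cycles += cost
    return cycles


SLIDE = bytes([0xC9, 0xC9, 0xC9, 0xC9, 0xC9, 0xC5, 0xEA])
for entry in range(len(SLIDE)):
    print(entry, slide_cycles(SLIDE, entry))
```

Each byte further into the slide costs one cycle less, from 8 cycles at offset 0 down to 2 cycles at offset 6, which is why the remainder is inverted with EOR #$FF before being added to the slide address.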
- rainwarrior
Re: 6502 vdelay - cycle delay routine with variable length at runtime
Ah, I'd actually forgotten about the "clockslide" chain of CMPs. Thank you for reminding me of it! It's weird that it's on the programming with unofficial opcodes page of the Wiki, because it doesn't use any? Maybe the Wiki could use some sort of "cycle counting" page instead where we could give rule-of-thumb about why instructions take 2/3/4 cycles and other stuff like this. (Edit: created the new page Wiki: Cycle counting)
I was definitely avoiding illegal instructions, because this is specifically intended for the extended compatible 6502 family. Indirect JMP is also avoided because of the 65C02 timing incompatibility.
I was also avoiding having any memory requirement except stack usage, which was a second reason against indirect JMP. I also avoided doing any extraneous reads, but maybe CMP $EA is acceptable. The situations where a ZP read has a side effect I think are rare.
It occurs to me that self-modifying code could dispatch the slide even more quickly? This gives me a bunch of ideas to try. Thanks!
Re: 6502 vdelay - cycle delay routine with variable length at runtime
And even when zero page reads do have side effects, it's probably in a machine small enough that zero page and the stack page are mirrors of the same memory. The Atari 2600, for example, puts a 6532 RIOT (128-byte RAM with joystick I/O and timer) and a TIA (minimalist picture and audio generator) in pages $00 and $01. (The I/O and timer side of the RIOT ends up in $0280-$0297.) Fortunately the RAM is in the second half of the page, making cmp $EA hit the RAM instead of the TIA.
Re: 6502 vdelay - cycle delay routine with variable length at runtime
I remember looking at a number of other 6502-based machines and didn't find many (any?) others that had side effects on reading from a register.
- rainwarrior
Re: 6502 vdelay - cycle delay routine with variable length at runtime
Apple II has tons of MMIO stuff that's done by reading an address. Nothing on ZP though... but IIGS with a 65816 and DP makes it possible. Is there a situation where it matters? Seems unlikely.
Though, without relaxing my other constraints, I can't seem to knock off more than 1 extra cycle by switching from the nop-slide to cmp-slide. However, allowing a self-modifying JMP instead of the RTS jump table got it down to a 46 cycle overhead with the cmp-slide.
Re: 6502 vdelay - cycle delay routine with variable length at runtime
rainwarrior wrote (Sun Oct 11, 2020 1:20 pm):
"Apple II has tons of MMIO stuff that's done by reading an address."
I could have sworn I checked this, but I should have looked at MESS's source instead of trying to make heads or tails of various technical documents. All of these tentatively seem to be registers that don't decode R/W, so fair enough. MESS seems to gate all of them with side_effects_disabled(), which ... I guess is for badly-behaved software that wasn't tested on hardware??
Re: 6502 vdelay - cycle delay routine with variable length at runtime
Based solely on the name and its use to gate read side effects, I'm assuming side_effects_disabled() is used for reading memory or registers in a debugger. For example, an NES emulator running an MMC2 game needs to disable side effects when reading tiles $FD and $FE in the PPU viewer. Likewise, it needs to disable PA12-triggered side effects for MMC3 and PA13-triggered side effects for MMC5.
Re: 6502 vdelay - cycle delay routine with variable length at runtime
I tweaked my clockslide code to operate like yours and to use RTS and the stack and got 46 cycles in total. If you can guarantee that VDelay_Clockslide starts at $xx01, then you can shave off another 2 cycles by removing the ADC.
Code:
VDELAY_MINIMUM = 46
; Waits for A cycles. JSR/RTS is included. Minimum is 46 cycles.
; Input: A: Number of cycles to delay.
; Clobbers: A/Y
VDelay:                             ; +6 = 6 (jsr)
    ; If the requested length is too low, wait the minimum time.
    SEC                             ; +2 = 8
    SBC #VDELAY_MINIMUM             ; +2 = 10
    BCC VDelay_TooLow               ; +2 = 12
    ; Wait in 5-cycle amounts until we've waited one time too many.
-
    ; SEC
    SBC #$05                        ; +2 = 14
    BCS -                           ; +2 = 16
    ; Push the high byte of the clockslide address.
    TAY                             ; +2 = 18
    LDA #>(VDelay_Clockslide-1)     ; +2 = 20
    PHA                             ; +3 = 23
    TYA                             ; +2 = 25
    ; Use the remainder from the wait to calculate the low byte of the clockslide address.
    EOR #$FF                        ; +2 = 27
    ADC #<(VDelay_Clockslide-1)     ; +2 = 29
    PHA                             ; +3 = 32
    ; Clockslide to do the less-than-5 cycle portion.
    RTS                             ; +6 = 38
; Wait a fixed time for calls with an argument that's too low for us to service properly.
VDelay_TooLow:                      ; +3 = 13 (from branch)
    ; 8 cycle loop. Cycles = length * iterations + 1.
    LDY #3                          ; +(8*3 + 1) = +25 = 38
-
    JMP +
+
    DEY
    BNE -
    NOP                             ; +2 = 40
    RTS                             ; +6 = 46
; This spends 0-4 cycles plus 2 cycles of overhead.
VDelay_Clockslide:
    .db $C9,$C9,$C9,$C5,$EA         ; +2 = 40
    RTS                             ; +6 = 46
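The cycle annotations above can be cross-checked with a short Python model (not from the thread) that adds up the same per-instruction costs for every 8-bit input. It assumes standard 6502 timings, no page crossings on the branches, and that adding the offset to the low byte of VDelay_Clockslide-1 never carries into the high byte. `vdelay_cycles` and `slide_cost` are names made up for this sketch.

```python
VDELAY_MINIMUM = 46


def slide_cost(offset):
    """Cycles for $C9,$C9,$C9,$C5,$EA entered at byte `offset` (0..4)."""
    return 6 - offset


def vdelay_cycles(n):
    """Total cycles from the JSR through the final RTS for an 8-bit request n."""
    cycles = 6 + 2 + 2                  # JSR, SEC, SBC #VDELAY_MINIMUM
    a = n - VDELAY_MINIMUM
    if a < 0:                           # BCC taken: fixed minimum path
        return cycles + 3 + 25 + 2 + 6  # BCC, LDY plus loop (25), NOP, RTS
    cycles += 2                         # BCC not taken
    while True:
        a -= 5                          # SBC #$05 (carry is set on entry)
        cycles += 2
        if a >= 0:
            cycles += 3                 # BCS taken
        else:
            cycles += 2                 # BCS falls through after the borrow
            break
    a &= 0xFF                           # A holds remainder - 5, i.e. 251..255
    cycles += 2 + 2 + 3 + 2             # TAY, LDA #>, PHA, TYA
    offset = a ^ 0xFF                   # EOR #$FF gives 4 - remainder
    cycles += 2 + 2 + 3 + 6             # EOR, ADC, PHA, RTS into the slide
    cycles += slide_cost(offset) + 6    # clockslide, then the final RTS
    return cycles


mismatches = [n for n in range(256) if vdelay_cycles(n) != max(n, VDELAY_MINIMUM)]
print(mismatches)
```

Under those assumptions the model finds no mismatches: every request of 46 or more takes exactly the requested number of cycles, and anything lower takes 46.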
Re: 6502 vdelay - cycle delay routine with variable length at runtime
I made a self-modifying version that requires a 33 cycle minimum. I don't have ideas at the moment for further improvements.
Code:
VDelay_RAM := $6000
VDELAY_MINIMUM = 33
; Waits for A cycles. JSR/RTS is included. Minimum is 33 cycles.
; Input: A: Number of cycles to delay.
; Clobbers: A/Y
VDelay:                             ; +6 = 6 (jsr)
    ; If the requested length is too low, wait the minimum time.
    SEC                             ; +2 = 8
    SBC #VDELAY_MINIMUM             ; +2 = 10
    BCC VDelay_TooLow               ; +2 = 12
    ; Wait in 5-cycle amounts until we've waited one time too many.
-
    ; SEC
    SBC #$05                        ; +2 = 14
    BCS -                           ; +2 = 16
    ; Set up the target location in the clockslide for the last (less than 5-cycle) wait.
    EOR #$FF                        ; +2 = 18
    STA VDelay_RAM + (VDelay_ClockslideBranch+1 - VDelay) ; +4 = 22
VDelay_ClockslideBranch:
    BPL VDelay_Clockslide           ; +3 = 25
; This spends 0-4 cycles plus 2 cycles of overhead.
VDelay_Clockslide:
    .db $C9,$C9,$C9,$C5,$EA         ; +2 = 27
    RTS                             ; +6 = 33
; Wait a fixed time for calls with an argument that's too low for us to service properly.
VDelay_TooLow:                      ; +3 = 13 (from branch)
    NOP                             ; +2 = 15
    NOP                             ; +2 = 17
    NOP                             ; +2 = 19
    NOP                             ; +2 = 21
    NOP                             ; +2 = 23
    NOP                             ; +2 = 25
    NOP                             ; +2 = 27
    RTS                             ; +6 = 33
VDelay_End:
Last edited by Fiskbit on Mon Oct 12, 2020 1:43 am, edited 1 time in total.
- rainwarrior
Re: 6502 vdelay - cycle delay routine with variable length at runtime
Ah, I tried a more naive version (not self-modifying) applying the technique and it has a 51 cycle minimum.
The idea to reorder and use the decrementing loop to do double duty as a modulo was a really good one!
So, would you like me to use your versions in my github repository? If yes, how would you like to be attributed? Do you have a website or other public profile I should link?