6502 vdelay - cycle delay routine with variable length at runtime

Post by rainwarrior » Fri Oct 09, 2020 10:17 pm

I finally found myself needing one of these so I wrote one:
https://github.com/bbbradsmith/6502vdelay

Pretty simple: just load the 16-bit number of cycles you want to delay into X:A and jsr vdelay will take that long.

There's some overhead involved, so there's a 48 cycle minimum here. (Or 35 if self-modifying code is allowed.)
Last edited by rainwarrior on Tue Oct 13, 2020 2:14 am, edited 3 times in total.

Post by Garth » Fri Oct 09, 2020 10:37 pm

Also check out this slick one from Bruce Clark. The delay is 9*(256*A+Y)+8 cycles (plus 12 more for JSR & RTS if you make it a subroutine). This assumes that the BCS does not cross a page boundary.

Code:

loop:   CPY  #1
        DEY
        SBC  #0
        BCS  loop

He writes: "A and Y are the high and low bytes (respectively) of a 16-bit value; multiply that 16-bit value by 9, then add 8 and you get the cycle count. So the delay can range from 8 to 589832 cycles, with a resolution of 9 cycles. One of the nice things about this code is that it's easy to figure out what values to put in A and Y when you want a delay of, e.g. (approximately) 10000 cycles."

Here's the same thing with my structure macros (the resulting machine code being identical):

Code:

        BEGIN
           CPY  #1
           DEY
           SBC  #0
        UNTIL_CARRY_CLEAR
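
The quoted formula can be checked by stepping through the loop. What follows is a minimal Python sketch, not something from either post: it models only what the three instructions do to A, Y and carry, and it assumes the usual 6502 costs (2 cycles each for CPY #imm, DEY and SBC #imm, 3 for a taken BCS in the same page, 2 when it falls through). The names `clark_delay_cycles` and `counter_for` are made up for illustration.

```python
def clark_delay_cycles(a, y):
    """Count cycles spent in the CPY/DEY/SBC/BCS loop for 8-bit inputs A and Y."""
    cycles = 0
    while True:
        carry = 1 if y >= 1 else 0      # CPY #1: carry set when Y >= 1
        y = (y - 1) & 0xFF              # DEY
        diff = a - (1 - carry)          # SBC #0: subtracts only the borrow
        carry = 1 if diff >= 0 else 0   # carry clear means the 16-bit count ran out
        a = diff & 0xFF
        cycles += 2 + 2 + 2             # CPY, DEY, SBC
        if carry:
            cycles += 3                 # BCS taken (same page assumed)
        else:
            return cycles + 2           # BCS falls through, loop ends


def counter_for(target_cycles):
    """Largest 16-bit count whose delay does not exceed target_cycles, as (A, Y)."""
    n = (target_cycles - 8) // 9
    return n >> 8, n & 0xFF
```

For a target of about 10000 cycles, `counter_for(10000)` gives A=4, Y=86, and the model reports 9998 cycles for those inputs. With A = Y = $FF the model reports 9*65535+8 = 589823 cycles, so the 589832 in the quote looks like a transposition.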

http://WilsonMinesCo.com/ lots of 6502 resources

Post by rainwarrior » Fri Oct 09, 2020 10:47 pm

That's an interesting one, though it's limited to a resolution of 9 cycles? I like how minimal that code is for a 16-bit countdown.

At the core, my version has a short 8-cycle loop that's a similar idea, but the additional overhead/code does all the work to get you cycle-accurate resolution.

I had another variation with a 16-cycle loop that's overall a bit smaller because it does one 16-bit countdown instead of 2 x 8-bit countdowns... makes me wonder if a variation of that technique you quoted can't be applied to shrink the code slightly? Hmm.

If there exists a practical fixed-cycle 16-bit divide+modulo it would open up other possibilities too...

Post by Garth » Sat Oct 10, 2020 1:45 am

It's only seven bytes; so you could even straightline it (instead of calling it as a subroutine) and add a NOP or two, and if necessary, a trailing BCC to the next instruction (meaning a 3-cycle NOP, assuming it doesn't cross a page boundary), in quite a few places to get the needed resolution without the longer routines, and it'd still pay for itself. A macro could be made to lay down the right code to get the exact cycle count, all in one line.
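
As a rough idea of the arithmetic such a macro would have to do, here is a sketch in Python rather than in any assembler's macro language. It is only an illustration under stated assumptions: `plan_inline_delay` is a hypothetical name, the count covers the loop and the padding only (not the LDA/LDY that load the counter), and the padding is limited to 2-cycle NOPs plus at most one always-taken 3-cycle BCC.

```python
def plan_inline_delay(target):
    """Return (n, nops, use_bcc) with 9*n + 8 + 2*nops + 3*use_bcc == target.

    n is the 16-bit loop count (A = n >> 8, Y = n & 0xFF).
    """
    if target < 8:
        raise ValueError("the loop alone already takes 8 cycles")
    n, pad = divmod(target - 8, 9)
    if pad == 1:
        # One cycle of padding cannot be built from NOP (2) and BCC (3),
        # so give back one loop pass and pad 10 cycles instead.
        if n == 0:
            raise ValueError("9 cycles is not reachable with this padding")
        n -= 1
        pad = 10
    if n > 0xFFFF:
        raise ValueError("delay too long for a 16-bit count")
    use_bcc = pad % 2 == 1              # an odd remainder needs the 3-cycle BCC
    nops = (pad - 3 * use_bcc) // 2     # the rest is made of 2-cycle NOPs
    return n, nops, use_bcc
```

For example, `plan_inline_delay(18)` returns `(0, 5, False)`: one pass of the loop (8 cycles) followed by five NOPs.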

Post by rainwarrior » Sat Oct 10, 2020 1:55 am

Garth wrote:
Sat Oct 10, 2020 1:45 am
It's only seven bytes; so you could even straightline it (instead of calling it as a subroutine) and add a NOP or two, and if necessary, a trailing BCC to the next instruction (meaning a 3-cycle NOP, assuming it doesn't cross a page boundary), in quite a few places to get the needed resolution without the longer routines, and it'd still pay for itself. A macro could be made to lay down the right code to get the exact cycle count, all in one line.

My whole goal here was to be able to vary the length at runtime, though. Fixed delays are a different problem entirely.

...though my thing might be convenient if you have a lot of fixed delays in one program? I dunno. Not quite the intended purpose here.

I wonder if Bisqwit's fixed delay generator could make use of it? (Though probably not... "The macros define the guaranteed*-to-be-smallest code for all delays from 2 to 20000 cycles.")

Post by Fiskbit » Sat Oct 10, 2020 10:01 am

I haven't fully digested the code yet, but it looks pretty neat. Have you considered using a clockslide instead of the jump table for delaying 0-7 cycles? You could put an 8-byte clockslide between vdelay_low and vdelay_low_rest and remove 3 cycles of overhead by not needing to jump to vdelay_low_rest anymore.

I've had to put 8-bit variable delays into some of my projects. I initially used a jump table like your code, but settled on the clockslide approach for later projects. My code is below. I opted not to remove the overhead from the input, which was fine for my purposes. This uses an indirect jump to get into the clockslide, with the high byte of the pointer being fixed and written during program init. The total overhead within the function itself is 20 cycles.

Code:

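   -
    SEC                     ; 7-cycle loop: take 7 off A per pass until it underflows
    SBC #$07
    BCS -

   +
    EOR #$FF                ; A is (remainder - 7) here; inverting gives 6 - remainder, carry is clear
    ADC #<(Clockslide)      ; Low byte of the entry point (the slide must not cross a page)
    STA cycle_delay_ptr+0   ; The high byte was written during program init
    JMP (cycle_delay_ptr)

   Clockslide:
    .db $C9,$C9,$C9,$C9,$C9,$C5,$EA   ; Entering later in the slide burns fewer cycles (2 to 8)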
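
To see why one byte string gives a different delay for each entry point, it helps to decode it from every offset. The following Python sketch does that; it is an illustration only, its opcode table covers just the three opcodes that appear in the slide (CMP #imm, CMP zp, NOP), and `slide_cycles` is a made-up name.

```python
# opcode -> (instruction length in bytes, cycles); only what the slide uses
OPCODES = {
    0xC9: (2, 2),  # CMP #imm
    0xC5: (2, 3),  # CMP zp
    0xEA: (1, 2),  # NOP
}


def slide_cycles(slide, entry):
    """Cycles spent when execution enters `slide` at byte offset `entry`."""
    pc = entry
    cycles = 0
    while pc < len(slide):
        length, cost = OPCODES[slide[pc]]
        cycles += cost
        pc += length
    if pc != len(slide):
        raise ValueError("an instruction ran past the end of the slide")
    return cycles


SLIDE7 = [0xC9, 0xC9, 0xC9, 0xC9, 0xC9, 0xC5, 0xEA]
per_entry = [slide_cycles(SLIDE7, e) for e in range(len(SLIDE7))]
```

Here `per_entry` comes out as `[8, 7, 6, 5, 4, 3, 2]`: each byte of offset trades one cycle, because an odd-parity entry ends on the 3-cycle `CMP $EA` and an even-parity entry ends on `CMP #$C5` followed by the 2-cycle NOP.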

Post by rainwarrior » Sat Oct 10, 2020 3:16 pm

Ah, I'd actually forgotten about the "clockslide" chain of CMPs. Thank you for reminding me of it! It's weird that it's on the programming with unofficial opcodes page of the Wiki, because it doesn't use any? Maybe the Wiki could use some sort of "cycle counting" page instead where we could give rule-of-thumb about why instructions take 2/3/4 cycles and other stuff like this. (Edit: created the new page Wiki: Cycle counting)

I was definitely avoiding illegal instructions, because this is specifically intended to be compatible with the extended 6502 family. Indirect JMP is also avoided because of the 65C02 timing incompatibility (it takes 5 cycles on the NMOS 6502 but 6 on the 65C02).

I was also avoiding having any memory requirement except stack usage, which was a second reason against indirect JMP. I also avoided doing any extraneous reads, but maybe CMP $EA is acceptable. I think the situations where a ZP read has a side effect are rare.

It occurs to me that self-modifying code could dispatch the slide even more quickly? This gives me a bunch of ideas to try. Thanks!

Post by tepples » Sun Oct 11, 2020 7:04 am

And even when zero page reads do have side effects, it's probably in a machine small enough that zero page and the stack page are mirrors of the same memory. The Atari 2600, for example, puts a 6532 RIOT (128-byte RAM with joystick I/O and timer) and a TIA (minimalist picture and audio generator) in pages $00 and $01. (The I/O and timer side of the RIOT ends up in $0280-$0297.) Fortunately the RAM is in the second half of the page, making cmp $EA hit the RAM instead of the TIA.
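
The address split described here can be written down as a small decode function. This is a simplified Python model that goes beyond what the post states: it assumes the commonly documented 2600 decoding (A12 selects the cartridge, A7 chooses between TIA and RIOT, A9 chooses between RIOT RAM and its I/O/timer registers), and `vcs_decode` is a hypothetical name.

```python
def vcs_decode(addr):
    """Classify a CPU address on the Atari 2600 (simplified model)."""
    addr &= 0x1FFF                  # the 6507 only brings out 13 address lines
    if addr & 0x1000:
        return "cartridge"
    if not addr & 0x0080:
        return "TIA"                # A7 low: lower half of each page
    if addr & 0x0200:
        return "RIOT I/O and timer"
    return "RIOT RAM"               # A7 high, A9 low: $80-$FF and its mirrors
```

Under this model `vcs_decode(0xEA)` and `vcs_decode(0x01EA)` both return `"RIOT RAM"`, which is the zero page / stack page mirroring mentioned above, while `vcs_decode(0x0280)` lands on the I/O and timer side.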

Post by lidnariq » Sun Oct 11, 2020 11:53 am

I remember looking at a number of other 6502-based machines and didn't find many (any?) others that had side effects on reading from a register.

Post by rainwarrior » Sun Oct 11, 2020 1:20 pm

Apple II has tons of MMIO stuff that's done by reading an address. Nothing on ZP though... but IIGS with a 65816 and DP makes it possible. Is there a situation where it matters? Seems unlikely.

Though, without relaxing my other constraints, I can't seem to knock off more than 1 extra cycle by switching from the nop-slide to cmp-slide. However, allowing a self-modifying JMP instead of the RTS jump table got it down to a 46 cycle overhead with the cmp-slide.

Post by lidnariq » Sun Oct 11, 2020 2:14 pm

rainwarrior wrote:
Sun Oct 11, 2020 1:20 pm
Apple II has tons of MMIO stuff that's done by reading an address.

I could have sworn I checked this, but I should have looked at MESS's source instead of trying to make heads or tails of various technical documents. All of these tentatively seem to be registers that don't decode R/W, so fair enough. MESS seems to gate all of them with side_effects_disabled(), which ... I guess is for badly-behaved software that wasn't tested on hardware??

Post by tepples » Sun Oct 11, 2020 3:32 pm

Based solely on the name and its use to gate read side effects, I'm assuming side_effects_disabled() is used for reading memory or registers in a debugger. For example, an NES emulator running an MMC2 game needs to disable side effects when reading tiles $FD and $FE in the PPU viewer. Likewise, it needs to disable PA12-triggered side effects for MMC3 and PA13-triggered side effects for MMC5.

Post by Fiskbit » Sun Oct 11, 2020 8:14 pm

I tweaked my clockslide code to operate like yours and to use RTS and the stack and got 46 cycles in total. If you can guarantee that VDelay_Clockslide starts at $xx01, then you can shave off another 2 cycles by removing the ADC.

Code:

VDELAY_MINIMUM = 46

  ; Waits for A cycles. JSR/RTS is included. Minimum is 46 cycles.
  ; Input: A: Number of cycles to delay.
  ; Clobbers: A/Y
  VDelay:                          ; +6 = 6 (jsr)
    ; If the requested length is too low, wait the minimum time.
    SEC                              ; +2 = 8
    SBC #VDELAY_MINIMUM              ; +2 = 10
    BCC VDelay_TooLow                ; +2 = 12

    ; Wait in 5-cycle amounts until we've waited one time too many.
   -
    ; SEC
    SBC #$05                         ; +2 = 14
    BCS -                            ; +2 = 16

    ; Push the high byte of the clockslide address.
    TAY                              ; +2 = 18
    LDA #>(VDelay_Clockslide-1)      ; +2 = 20
    PHA                              ; +3 = 23
    TYA                              ; +2 = 25

    ; Use the remainder from the wait to calculate the low byte of the clockslide address.
    EOR #$FF                         ; +2 = 27
    ADC #<(VDelay_Clockslide-1)      ; +2 = 29
    PHA                              ; +3 = 32

    ; Clockslide to do the less-than-5 cycle portion.
    RTS                              ; +6 = 38


   ; Wait a fixed time for calls with an argument that's too low for us to service properly.
   VDelay_TooLow:                    ; +3 = 13 (from branch)
    ; 8 cycle loop. Cycles = length * iterations + 1.
    LDY #3                           ; +(8*3 + 1) = +25 = 38
   -
    JMP +
   +
    DEY
    BNE -

    NOP                              ; +2 = 40

    RTS                              ; +6 = 46


   ; This spends 0-4 cycles plus 2 cycles of overhead.
   VDelay_Clockslide:
    .db $C9,$C9,$C9,$C5,$EA          ; +2 = 40

    RTS                              ; +6 = 46
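
The per-line cycle annotations can be added up for every possible input to check that the routine really takes A cycles. Below is a minimal Python model of that bookkeeping, offered as a sketch rather than a substitute for testing on hardware or in an emulator: it hard-codes the cycle costs from the comments above, assumes no branch crosses a page, and treats the 5-byte slide as costing 6 minus the entry offset. `vdelay_cycles` is a made-up name.

```python
def vdelay_cycles(a, minimum=46):
    """Total cycles, JSR to final RTS, for the RTS-dispatch VDelay given input A."""
    cycles = 6                  # JSR
    cycles += 2 + 2             # SEC, SBC #VDELAY_MINIMUM
    a -= minimum
    if a < 0:                   # BCC taken: fixed-length path
        cycles += 3
        cycles += 25            # LDY #3 plus three passes of the 8-cycle loop
        cycles += 2 + 6         # NOP, RTS
        return cycles
    cycles += 2                 # BCC not taken
    while True:                 # 5 cycles per extra pass
        a -= 5                  # SBC #$05
        cycles += 2
        if a < 0:
            cycles += 2         # BCS not taken
            break
        cycles += 3             # BCS taken
    entry = -a - 1              # EOR #$FF on the value -5..-1 gives 4..0
    cycles += 2 + 2 + 3 + 2     # TAY, LDA #>, PHA, TYA
    cycles += 2 + 2 + 3         # EOR #$FF, ADC #<, PHA
    cycles += 6                 # RTS into the clockslide
    cycles += 6 - entry         # slide: 2 cycles at offset 4 up to 6 at offset 0
    cycles += 6                 # final RTS
    return cycles
```

In this model the subtract loop does double duty: it burns the bulk of the delay 5 cycles at a time, and the value left in A when it underflows is the remainder that picks the clockslide entry point.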

Post by Fiskbit » Sun Oct 11, 2020 10:43 pm

I made a self-modifying version that requires a 33 cycle minimum. I don't have ideas at the moment for further improvements.

Code:

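; Assumes the routine has been copied to RAM at this address, so the STA below can patch the branch operand.
VDelay_RAM := $6000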

VDELAY_MINIMUM = 33

  ; Waits for A cycles. JSR/RTS is included. Minimum is 33 cycles.
  ; Input: A: Number of cycles to delay.
  ; Clobbers: A
  VDelay:                          ; +6 = 6 (jsr)
    ; If the requested length is too low, wait the minimum time.
    SEC                              ; +2 = 8
    SBC #VDELAY_MINIMUM              ; +2 = 10
    BCC VDelay_TooLow                ; +2 = 12

    ; Wait in 5-cycle amounts until we've waited one time too many.
   -
    ; SEC
    SBC #$05                         ; +2 = 14
    BCS -                            ; +2 = 16

    ; Set up the target location in the clockslide for the last (less than 5-cycle) wait.
    EOR #$FF                         ; +2 = 18
    STA VDelay_RAM + (VDelay_ClockslideBranch+1 - VDelay)    ; +4 = 22
   VDelay_ClockslideBranch:
    BPL VDelay_Clockslide            ; +3 = 25

   ; This spends 0-4 cycles plus 2 cycles of overhead.
   VDelay_Clockslide:
    .db $C9,$C9,$C9,$C5,$EA          ; +2 = 27

    RTS                              ; +6 = 33


   ; Wait a fixed time for calls with an argument that's too low for us to service properly.
   VDelay_TooLow:                    ; +3 = 13 (from branch)
    NOP                              ; +2 = 15
    NOP                              ; +2 = 17
    NOP                              ; +2 = 19
    NOP                              ; +2 = 21
    NOP                              ; +2 = 23
    NOP                              ; +2 = 25
    NOP                              ; +2 = 27
    RTS                              ; +6 = 33
   VDelay_End:
Last edited by Fiskbit on Mon Oct 12, 2020 1:43 am, edited 1 time in total.

Post by rainwarrior » Sun Oct 11, 2020 11:03 pm

Ah, I tried a more naive version (not self-modifying) applying the technique and it has a 51 cycle minimum.

The idea to reorder and use the decrementing loop to do double duty as a modulo was a really good one!

So, would you like me to use your versions in my github repository? If yes, how would you like to be attributed? Do you have a website or other public profile I should link?
