6502 vdelay - cycle delay routine with variable length at runtime

Discuss technical or other issues relating to programming the Nintendo Entertainment System, Famicom, or compatible systems. See the NESdev wiki for more information.

Moderator: Moderators

ejona
Posts: 15
Joined: Sun Jul 14, 2019 9:45 pm

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by ejona »

After I posted the nopslide version, I realized a clockslide has only 1 instruction (the nop before the rts) overhead compared to the nopslide's 2 (lsr+bcs) (other than the 1 cycle rounding thing). I had to work out a bug before posting it. rainwarrior, your version is very close, but uses sec+sbc instead of cmp for toolow handling. I also spent effort to avoid touching the zero page. 254 bytes:

Code: Select all

VDELAY_OVERHEAD = 27

.align 256
vdelay_clockslide:
        .repeat 256-VDELAY_OVERHEAD-2
        .byte   $A9     ; lda
        .endrepeat
        .byte   $B0     ; bcs
        .byte   $18     ; clc           ; +2 = 21
        rts                             ; +6 = 27

.export vdelay  
vdelay:                                 ; +6 = 6 (jsr)
        cmp     #VDELAY_OVERHEAD        ; +2 = 8
        bcc     vdelay_toolow           ; +2 = 10
        eor     #$FF                    ; +2 = 12
        sta     vdelay_modify+1         ; +4 = 16
vdelay_modify:
        jmp     vdelay_clockslide       ; +3 = 19

vdelay_toolow:
        sec
        jmp     vdelay_clockslide+$FF-(VDELAY_OVERHEAD-1+4)

        ; 17 bytes from bcs+2
        .res    $18-17
        rts
Note that it is also not trivial to adapt to remove the toolow handling. (It's not that hard, but you need to be comfortable with the approach.)

Is there any interest in documenting approaches that inline well (those that only have easy alignment, only one rts, and ideally the rts at the end)? Inlining removes the jsr+rts overhead (although may need 3 cycles for a branch). vdelay_short is 30 bytes (net 26, after jsr+rts removal), which is low enough that inlining multiple times doesn't seem outrageous.
ejona
Posts: 15
Joined: Sun Jul 14, 2019 9:45 pm

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by ejona »

The non-modifying vdelay has an 'sec' right before the 8-cycle loop. At the cost of one cycle we can force the carry to be cleared after the 4s vdelay_short which allows us to remove the 'sec' and net one cycle. After the next sbc though, the carry will be set which needs to be managed. Combined with the vdelay_low branch removal brings vdelay_short to 34 cycles.

Code: Select all

    ...
vdelay_4s:
    lsr                                ; +2 = 26
    BRPAGE bcc, vdelay_8s              ; +3 = 29 (4 extra if bit 3 set)
    clc
    BRPAGE bcc, vdelay_8s
vdelay_8s:                             ;         (8 extra per loop, countdown)
    BRPAGE bne, vdelay_wait8_clc       ; +2 = 31
    rts                                ; +6 = 37 (end)

vdelay_wait4:
    BRPAGE bcs, vdelay_8s              ; +3 (branch always)
vdelay_wait8_clc:
    sbc #0                             ; +2
    BRPAGE bcs, vdelay_loop8           ; +3 (branch always)

vdelay_loop8:
    BRPAGE bne, vdelay_wait8_sec       ; +2
    rts
vdelay_wait8_sec:
    sbc #1                             ; +2
    BRPAGE bcs, vdelay_loop8           ; +3 (branch always)
User avatar
rainwarrior
Posts: 8734
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by rainwarrior »

I put the branch-over-toolow 3 cycle optimization in a few minutes ago. I don't have time consider and test the carry optimization you posted just now but I'll try to look at it tomorrow. The extreme version I've been leaving out for now because it seems to still be in flux.
ejona wrote: Wed Oct 21, 2020 7:22 amIs there any interest in documenting approaches that inline well...
I don't want to go down a rabbit hole of permuations here. TBH having 4 versions right now already feels borderline excessive, and I worry about option-paralysis for users who just want something easy they can drop into their code.

I have to draw an arbitrary line somewhere. Inlining, making the overhead implied, or removing toolow safety/consistency are all relatively easy modifications that apply to all the methods, more or less, so to me it's good enough to let an advanced user do those for themself.

I wouldn't mind linking to this thread or other useful code examples though (e.g. the readme links to Bisqwit's delay generator). I just don't want the main repository to grow much more than it already has.
ejona
Posts: 15
Joined: Sun Jul 14, 2019 9:45 pm

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by ejona »

rainwarrior wrote: Wed Oct 21, 2020 11:08 am I don't have time consider and test the carry optimization you posted just now but I'll try to look at it tomorrow.
I honestly wouldn't mind if you didn't integrate it at all. A bit convoluted for the small gain. But you may find it interesting.
I don't want to go down a rabbit hole of permuations here. TBH having 4 versions right now already feels borderline excessive, and I worry about option-paralysis for users who just want something easy they can drop into their code.
Quite fair and seems apt. I do think it would help the current 4 versions if the short version had a different label than the full version (e.g., vdelay vs svdelay). Normally I'd also suggest just putting them both in the same source file and share some of the code. A user can strip the full version if they don't want it. But it appears any worthwhile code sharing would increase the full version by 1 cycle. Maybe put them in one file without code sharing?

I don't think I can contribute much more; things are pretty minimal. Only breakthrough I see available is to find a way to avoid the toolow handling. I'm not holding my breath.
User avatar
rainwarrior
Posts: 8734
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by rainwarrior »

Okay, the split clc/sec optimization is in. I've also added the extreme versions. ( github )
ejona wrote: Thu Oct 22, 2020 9:03 pmI do think it would help the current 4 versions if the short version had a different label than the full version (e.g., vdelay vs svdelay). Normally I'd also suggest just putting them both in the same source file and share some of the code. A user can strip the full version if they don't want it. But it appears any worthwhile code sharing would increase the full version by 1 cycle. Maybe put them in one file without code sharing?
I've thought about giving the short ones a different name, since they're functionally different, but keeping them with the same name was more convenient for me for testing. The user who needs both of these in the same project is yet another possible permutation here, but to me that doesn't seem worth pursuing at this point?

If someone does need both versions at at same time, all it takes to accomplish this is 2 letters changed. There shouldn't be reason to put in the same source file if they're not sharing code? Just include both objects in the link.
User avatar
rainwarrior
Posts: 8734
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by rainwarrior »

So I discovered that 4 years ago Bisqwit added 2 articles to the wiki about fixed and variable delay routines. There are some very interesting ones here:

Delay Code: Callable functions

I think adjusted for the constraints they're 3 cycles faster and several bytes smaller.
ejona
Posts: 15
Joined: Sun Jul 14, 2019 9:45 pm

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by ejona »

I saw you added a NULL test to the repo. It makes debugging a lot more straight-forward. Thank you for that.

Those functions are really nice. When I looked at "A + 25" and saw "beq :+", it became obvious the benefit of reducing input range, even if it doesn't reduce to the next power of two. I had considered that earlier, but forgot to dig into it since power-of-two was working well. So I ran off and tried some ideas and reduced the cycles to 33 cycles. Neat, but not earth-shattering.

But it made an earlier idea more viable: splitting out another tier of processing for very small numbers. I was able to add a "low" case to the toolow logic that only supported inputs between 0-3. So an approach of mine with 33 cycle overhead was able to get down to 29. If you make the toolow handling one cycle less than the rest of the code flow, it brought it down to 28.

There's a lot of interaction between the "low" and "large" code paths to get them align. "Low" values are also so limited there's opportunity for interesting combinations of toolow and low code paths. In the end, simple stuff happened to work well, but there's many options still open.

Playing it with myself, I got mine down to 28 cycles:

Code: Select all

VDELAY_MINIMUM_LARGE = 33

vdelay:                                ; +6 = 6 (jsr)
    sec                                ; +2 = 8
    sbc #VDELAY_MINIMUM_LARGE          ; +2 = 10
    BRPAGE bcc, vdelay_low             ; +2 = 12

:   sbc #5                             ; +2 = 14
    BRPAGE bcs, :-                     ; +2 = 16
    adc #5                             ; +2 = 18
    lsr                                ; +2 = 20
    BRPAGE bcs, vdelay_2s              ; +2 = 22 (1 extra if bit 1 set)
vdelay_2s:
    BRPAGE beq, vdelay_ret0            ; +3 = 25 (if no more bits set)
    lsr                                ;         (2 extra if bit 2 or 3 is set)
    BRPAGE beq, vdelay_ret             ;         (2 extra if bit 3 is set)
vdelay_ret0:
    BRPAGE bne, vdelay_ret             ; +2 = 27
vdelay_ret:
    rts                                ; +6 = 33 (end)

vdelay_low:                            ; +1 = 13 (bcc)
    adc #4                             ; +2 = 15
    BRPAGE bcc, vdelay_toolow          ; +2 = 17
    BRPAGE beq, :+
    lsr
:   BRPAGE beq, :+
    BRPAGE bcs, :+
:   rts                                ; +6 = 29 (end)

vdelay_toolow:                         ; +1 = 18 (bcc)
    nop                                ; +2 = 20
    nop                                ; +2 = 22
    rts                                ; +6 = 28 (end)
The vdelay_low flow was a byte reduction of this approach, which is more clear:

Code: Select all

    BRPAGE beq, end   ; special case 0 entirely
    lsr
    BRPAGE beq, :+    ; if zero, then carry is guaranteed set
    BRPAGE bcs, :+
:   rts

end:
    BRPAGE beq, *+2
    rts
Using the wiki's delay routine, I got things down to 27 cycles:

Code: Select all

VDELAY_MINIMUM_LARGE = 31
vdelay:                                ; +6 = 6 (jsr)
    sec                                ; +2 = 8
    sbc #VDELAY_MINIMUM_LARGE          ; +2 = 10
    BRPAGE bcc, vdelay_low             ; +2 = 12 

    ... delay_a_27_clocks without the 'sec'...

vdelay_low:                            ; +1 = 13 (bcc)
    adc #3                             ; +2 = 15
    BRPAGE bcc, vdelay_toolow          ; +2 = 17
    BRPAGE beq, :+                     ; +3 = 20 (when zero)
    lsr
:   BRPAGE bne, *+2                    ; +2 = 22
    rts                                ; +6 = 28

vdelay_toolow:                         ; +1 = 18 (bcc)
    BRPAGE bcc, *+2                    ; +3 = 21 (always branch)
    rts                                ; +6 = 27 (end)
I wouldn't be surprised if someone shaves off 1-2 more cycles relatively quickly.
User avatar
rainwarrior
Posts: 8734
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by rainwarrior »

Ahh! Very interesting.

Version 10 is up now.

vdelay_short.s now is pretty much what you posted, but with vdelay_toolow rolled into vdelay_low to save a couple of bytes and keep under align 32.
Post Reply