6502 vdelay - cycle delay routine with variable length at runtime

Discuss technical or other issues relating to programming the Nintendo Entertainment System, Famicom, or compatible systems.

Moderator: Moderators

Fiskbit
Posts: 199
Joined: Sat Nov 18, 2017 9:15 pm

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by Fiskbit » Sun Oct 11, 2020 11:25 pm

Sure, please use these if they're useful! When I can, I like to license my code as permissively as possible. Attribution is not necessary, but if you'd like, you're welcome to credit me as Fiskbit. I do have a github profile, but don't have anything public on it at this time.

I can look at extending the code to 16-bit, but haven't thought yet about how it would need to change. Happy to leave it to you, too.

Fiskbit
Posts: 199
Joined: Sat Nov 18, 2017 9:15 pm

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by Fiskbit » Mon Oct 12, 2020 1:43 am

I came up with another improvement. Since the problem with clockslide is that odd lengths will read from $EA, which could have side effects, I've opted to change the instruction to an always-taken branch, which is also 2 bytes and takes 3 cycles. The operand needs to both lead to an RTS instruction and be effectively a NOP when executed itself.

Here's the non-self-modifying version, which I believe is the same number of bytes as before:

Code: Select all

VDELAY_MINIMUM = 46

  ; Waits for A cycles. JSR/RTS is included. Minimum is 46 cycles.
  ; Input: A: Number of cycles to delay.
  ; Clobbers: A/Y
  VDelay:                          ; +6 = 6 (jsr)
    ; If the requested length is too low, wait the minimum time.
    SEC                              ; +2 = 8
    SBC #VDELAY_MINIMUM              ; +2 = 10
    BCC VDelay_TooLow                ; +2 = 12

    ; Wait in 5-cycle amounts until we've waited one time too many.
   -
    ; SEC
    SBC #$05                         ; +2 = 14
    BCS -                            ; +2 = 16

    ; Push the high byte of the clockslide address.
    TAY                              ; +2 = 18
    LDA #>(VDelay_Clockslide-1)      ; +2 = 20
    PHA                              ; +3 = 23
    TYA                              ; +2 = 25

    ; Use the remainder from the wait to calculate the low byte of the clockslide address.
    EOR #$FF                         ; +2 = 27
    ADC #<(VDelay_Clockslide-1)      ; +2 = 29
    PHA                              ; +3 = 32

    ; Clockslide to do the less-than-5 cycle portion.
    RTS                              ; +6 = 38


   ; This spends 0-4 cycles plus 2 cycles of overhead.
   VDelay_Clockslide:
    .db $C9,$C9,$C9,$90,$0A          ; +2 = 40
   VDelay_Clockslide_End:
    RTS                              ; +6 = 46


   ; Wait a fixed time for calls with an argument that's too low for us to service properly.
   VDelay_TooLow:                    ; +3 = 13 (from branch)
    ; 8 cycle loop. Cycles = length * iterations + 1.
    LDY #3                           ; +(8*3 + 1) = +25 = 38
   -
    JMP +
   +
    DEY
    BNE -

    NOP                              ; +2 = 40

   VDelay_ClockslideBranchTarget:
    RTS                              ; +6 = 46


  .IF VDelay_ClockslideBranchTarget - VDelay_Clockslide_End != $0A
  .ERROR VDelay_ClockslideBranchTarget wrong distance from VDelay_Clockslide_End
  .ENDIF
And here's the self-modifying one, which is 1 byte longer than before:

Code: Select all

VDELAY_MINIMUM = 33

  ; Waits for A cycles. JSR/RTS is included. Minimum is 33 cycles. Must be executed from RAM.
  ; Input: A: Number of cycles to delay.
  ; Clobbers: A/Y

   VDelay_Start:
   VDelay_ClockslideBranchTarget:
    RTS
  VDelay:                          ; +6 = 6 (jsr)
    ; If the requested length is too low, wait the minimum time.
    SEC                              ; +2 = 8
    SBC #VDELAY_MINIMUM              ; +2 = 10
    BCC VDelay_TooLow                ; +2 = 12

    ; Wait in 5-cycle amounts until we've waited one time too many.
   -
    ; SEC
    SBC #$05                         ; +2 = 14
    BCS -                            ; +2 = 16

    ; Set up the target location in the clockslide for the last (less than 5-cycle) wait.
    EOR #$FF                         ; +2 = 18
    STA VDelay_RAM + (VDelay_ClockslideBranch+1 - VDelay)    ; +4 = 22
   VDelay_ClockslideBranch:
    BPL VDelay_Clockslide            ; +3 = 25

   ; This spends 0-4 cycles plus 2 cycles of overhead.
   VDelay_Clockslide:
    .db $C9,$C9,$C9,$90,$EA          ; +2 = 27
   VDelay_Clockslide_End:

    RTS                              ; +6 = 33


   ; Wait a fixed time for calls with an argument that's too low for us to service properly.
   VDelay_TooLow:                    ; +3 = 13 (from branch)
    NOP                              ; +2 = 15
    NOP                              ; +2 = 17
    NOP                              ; +2 = 19
    NOP                              ; +2 = 21
    NOP                              ; +2 = 23
    NOP                              ; +2 = 25
    NOP                              ; +2 = 27
    RTS                              ; +6 = 33
   VDelay_End:


  .IF VDelay_Clockslide_End - VDelay_ClockslideBranchTarget != ($100 - $EA)
  .ERROR VDelay_ClockslideBranchTarget wrong distance from VDelay_Clockslide_End
  .ENDIF 

User avatar
rainwarrior
Posts: 7877
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by rainwarrior » Tue Oct 13, 2020 1:46 am

Ah, yes that's a very nice idea for eliminating the spurious read!

I've now incorporated those two variations into the project, with some small modifications: current version

The 16-bit versions actually only need 2 more cycles of overhead. The SEC gets replaced by CPX #0, which also sets carry, and after that there's just an extra branch to the "full" version.

I was trying to puzzle out how to create meaningful asserts for the way carry was always clear for the clockslide, but then I realized we can use LDA instead of CMP and not affect carry. (The code has much lighter alignment restrictions this way too.)

The alignment of the branch target seems a little bit "lucky" in your two versions, but at least there's clear ways to move it around if needed. For the 16-bit self-modifying version I couldn't use $EA but the padding needed for $0A was an insignificant 4 bytes. ;)

Fiskbit
Posts: 199
Joined: Sat Nov 18, 2017 9:15 pm

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by Fiskbit » Tue Oct 13, 2020 3:10 pm

Luck is a big part of this level of optimization. :) These seem very lean now. Really cool that the 16-bit one only needs 2 extra cycles.

Looking at vdelay_modify.s, I think if you move vdelay_toolow to the end, you can use #$18 (CLC) as the clockslide's branch operand/instruction and put the vdelay_clockslide_branch RTS right before it. This removes the 4 bytes of padding. I haven't tested it, but it should look like this:

Code: Select all

; This "clockslide" overlaps instructions so that each byte adds one cycle to the tally.
; 0-4 cycles + 2 cycles of overhead (A clobbered)
vdelay_clockslide:                     ; +2 = 29
	.byte $A9           ; 0     LDA #$A9 (+2)
	.byte $A9           ; 1     LDA #$A9 (+2)
	.byte $A9           ; 0,2   LDA #$90 (+2)
	.byte $90           ; 1,3   BCC *+2+$0A (+3, carry guaranteed clear)
	.byte $18           ; 0,2,4 CLC (+2)
	.assert >(vdelay_clockslide) = >(vdelay_clockslide+4), error, "Clockslide crosses page."
	.assert >(*+$18) = >*, error, "Clockslide branch page crossed!"
	.assert (*+$18) = vdelay_clockslide_branch, error, "Clockslide branch misplaced!"
	rts                                ; +6 = 35 (end)

vdelay_full:                           ; +3 = 11
	sec                                ; +2 = 13
	sbc #VDELAY_FULL_OVERHEAD          ; +2 = 15
	tay                                ; +2 = 17
	txa                                ; +2 = 19
	sbc #0                             ; +2 = 21
	BRPAGE beq, vdelay_high_none       ; +2 = 23
	: ; 256 cycles each iteration
		ldx #50            ; +2 = 2
		: ; 5 cycle loop   +250 = 252
			dex
			BRPAGE bne, :- ; -1 = 251
		sbc #1             ; +2 = 253 (carry always set)
		BRPAGE bne, :--    ; +3 = 256    -1 = 22 (on last iteration)
	nop                                ; +2 = 24
vdelay_high_none:                      ; +3 = 24 (from branch)
	tya                                ; +2 = 26
	jmp vdelay_low                     ; +3 = 29
	;                                -14+35 = 50 (full end)

vdelay_clockslide_branch: ; exactly 24 bytes past the clockslide branch
	rts                                ; +6 = 35 (end)

vdelay_toolow:                         ; +3 = 15 (from branch)
	php                                ; +3 = 18
	plp                                ; +4 = 22
	php                                ; +3 = 25
	plp                                ; +4 = 29
	rts                                ; +6 = 35 (end)	

User avatar
rainwarrior
Posts: 7877
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by rainwarrior » Tue Oct 13, 2020 4:36 pm

Nice! I remember considering CLC but I think I miscounted, heh. Thanks for giving another look to it.

ejona
Posts: 14
Joined: Sun Jul 14, 2019 9:45 pm

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by ejona » Sun Oct 18, 2020 11:41 pm

I've not dug too much into this thread, but I saw the stack management that powers the clockslide and questioned if there weren't shorter+faster ways of dealing with the small values. Using branches seems to work pretty well. The 8-bit version is just stripped down from the 16-bit version without any real extra work, so there may be a way to squeeze a few cycles out of the toolow handling.

I handle the 3 least significant bits with shifts and conditions and then use an 8 cycle loop to consume the rest of the low byte. The branches at the end of loops always go back to the beginning of the loop, which makes it easy to calculate clocks spent and keeps a single flow through the code. It may be possible to squeeze out a bit more by changing the loop structures; I've not really tried yet. This is a "looks interesting and is a good stopping place."

I do some branches that jump to the next instruction. From what I can tell, that is allowed as the encoding doesn't handle positive vs negative jumps differently, but it is strange enough I worried it would break something. It at least assembles and runs on the simulator without issue.

vdelay: 46 bytes

Code: Select all

VDELAY_OVERHEAD = 46

.export vdelay
vdelay:                                 ; +6 = 6 (jsr)
        sec                             ; +2 = 8
        sbc     #VDELAY_OVERHEAD        ; +2 = 10
        tay                             ; +2 = 12
        txa                             ; +2 = 14
        sbc     #0                      ; +2 = 16
        BRPAGE bcc, @toolow             ; +2 = 18
@high:  BRPAGE beq, @low                ; +3 = 21
        ldx     #31
:       bit     $0
        dex
        BRPAGE bne, :-
        sbc     #1
        BRPAGE bcs, @high
@toolow:ldy     #0
@low:   tya                             ; +2 = 23
        lsr                             ; +2 = 25
        BRPAGE bcs, @twos               ; +2 = 27
@twos:  lsr                             ; +2 = 29
        BRPAGE bcc, @fours              ; +3 = 32
        bit     $0
@fours: lsr                             ; +2 = 34
        BRPAGE bcs, @waste4             ; +2 = 36
@rest:  sec                             ; +2 = 38
@rest2: BRPAGE bne, @waste8             ; +2 = 40
        rts                             ; +6 = 46

@waste4:BRPAGE bcs, @rest       ; +3 = 3 (+1 from branch to get here)
@waste8:sbc     #1              ; +2 = 2
        BRPAGE bcs, @rest2      ; +3 = 5 (+3 from branch to get here)
vdelay_short: 30 bytes

Code: Select all

VDELAY_OVERHEAD = 38

.export vdelay
vdelay:                                 ; +6 = 6 (jsr)
        sec                             ; +2 = 8
        sbc     #VDELAY_OVERHEAD        ; +2 = 10
        BRPAGE bcc, @toolow             ; +2 = 12
        BRPAGE bcs, @low                ; +3 = 15
@toolow:lda     #0
@low:   lsr                             ; +2 = 17
        BRPAGE bcs, @twos               ; +2 = 19
@twos:  lsr                             ; +2 = 21
        BRPAGE bcc, @fours              ; +3 = 24
        bit     $0
@fours: lsr                             ; +2 = 26
        BRPAGE bcs, @waste4             ; +2 = 28
@rest:  sec                             ; +2 = 30
@rest2: BRPAGE bne, @waste8             ; +2 = 32
        rts                             ; +6 = 38

@waste4:BRPAGE bcs, @rest       ; +3 = 3 (+1 from branch to get here)
@waste8:sbc     #1              ; +2 = 2
        BRPAGE bcs, @rest2      ; +3 = 5 (+3 from branch to get here)

User avatar
rainwarrior
Posts: 7877
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by rainwarrior » Mon Oct 19, 2020 12:47 am

That's a very interesting approach! So, decomposing the number into bits and climbing a ladder of branches of power-of-two lengths?

Noting that your "short" version has an overhead of 38 and starts with a sec, it can trivially become a "long" verison with 2 more cycles in the way the others were done:

Code: Select all

sec
at the start, becomes:

Code: Select all

cpx #0 ; sets carry
bne long_version
After that the rest of the short version follows, and otherwise the "long" version branch subtracts some larger overhead, counts out the 256-cycle loop, then can finish the remainder with the "short" version.
ejona wrote:
Sun Oct 18, 2020 11:41 pm
I do some branches that jump to the next instruction. From what I can tell, that is allowed as the encoding doesn't handle positive vs negative jumps differently, but it is strange enough I worried it would break something.
Yes, a branch that adds 0 to the PC is perfectly fine.


So... may I use this code on the github project, and if yes how would you like to be attributed?

ejona
Posts: 14
Joined: Sun Jul 14, 2019 9:45 pm

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by ejona » Mon Oct 19, 2020 8:46 am

So, decomposing the number into bits and climbing a ladder of branches of power-of-two lengths?
Yeah. I think the slides are better when you have a larger number of clocks to consume to outweigh the setup cost (or you can modify the code).
Noting that your "short" version has an overhead of 38 and starts with a sec, it can trivially become a "long" verison with 2 more cycles in the way the others were done:
Yeah, for a bit of code size. That would bring it closer to vdelay_modify. It would also mean we can use the stack for temporary and stop clobbering y. At that point, it looks like we could then reduce the overhead for both versions by 3 cycles by using a separate toolow flow and removing "beq, @low". It looks like that would not even increase code size if we jump back to the "bcc, @fours" line just after @twos.
So... may I use this code on the github project, and if yes how would you like to be attributed?
It's disappointing the repo is using a custom license. It seems the zlib License is very similar. Would it make sense to use it instead? The current terms are quite appropriate, other than the custom aspect.

User avatar
rainwarrior
Posts: 7877
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by rainwarrior » Mon Oct 19, 2020 9:51 am

ejona wrote:
Mon Oct 19, 2020 8:46 am
rainwarrior wrote:
Mon Oct 19, 2020 12:47 am
may I use this code on the github project, and if yes how would you like to be attributed?
It's disappointing the repo is using a custom license. It seems the zlib License is very similar. Would it make sense to use it instead? The current terms are quite appropriate, other than the custom aspect.
I do not wish to alter the license. Though at this point I would remove the Patreon link from the readme, since it's become a community project.

User avatar
rainwarrior
Posts: 7877
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by rainwarrior » Tue Oct 20, 2020 12:21 am

Here's a branch with your technique incorporated into vdelay / vdelay_short:

Edit: that branch is now part of the master: https://github.com/bbbradsmith/6502vdelay

I credited you as "Ejona" and gave a link to this thread in the readme. Let me know if you'd like to be attributed differently than that. With your consent I'll push it to the master. (Or if you don't consent I can delete the branch.)
Last edited by rainwarrior on Tue Oct 20, 2020 6:16 pm, edited 1 time in total.

ejona
Posts: 14
Joined: Sun Jul 14, 2019 9:45 pm

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by ejona » Tue Oct 20, 2020 8:50 am

I credited you as "Ejona" and gave a link to this thread in the readme. Let me know if you'd like to be attributed differently than that. With your consent I'll push it to the master.
You are free to incorporate it into the 6502vdelay project. Ejona is fine, although "Eric Anderson" may be better. You may instead want to link to my github id: https://github.com/ejona86. But I'm not too picky.

ejona
Posts: 14
Joined: Sun Jul 14, 2019 9:45 pm

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by ejona » Tue Oct 20, 2020 9:37 am

I noticed there was previously an "extreme" version that used 826 bytes. With that mentality in mind, here is a vdelay_short_modify_extreme version that has an overhead of 30 cycles and is 132 bytes. It uses the lsr+branch technique to manage even vs odd and then a simple nopslide.

If you weaken how toolow is handled to allow a +/-1 variation for small values, then you can have an rts run after the nopslide and remove the VDELAY_OVERHEAD cmp+branch which reduces overhead to 26 cycles. It would take 4 cycles to make it a full version, so wouldn't probably wouldn't be worth it compared to the current vdelay_modify's 35 cycles, unless we weaken the handling for small values to get a 30 cycle overhead. The nopslide can't be reused for a full version, so we'd use one of the other delay variations for larger values.

I'd send a PR, but I know there's quite a bit of styling decisions and choices of what to incorporate. The ".MOD 2" stuff could be removed once the code is "settled" and unlikely to change overhead. I left it in in case y'all see an improvement.

Code: Select all

VDELAY_OVERHEAD_RAW = 29
; Only even overheads can be accurately accounted. An extra cycle is inserted if
; odd
VDELAY_OVERHEAD = VDELAY_OVERHEAD_RAW + (VDELAY_OVERHEAD_RAW .MOD 2)

.if !(VDELAY_OVERHEAD_RAW .MOD 2)
vdelay_toolow:
        lda     #(VDELAY_OVERHEAD>>1)^$7F
        BRPAGE  bcc, vdelay_toolow_resume
.endif

.export vdelay
vdelay:                                 ; +6 = 6 (jsr)
        cmp     #VDELAY_OVERHEAD        ; +2 = 8
.if VDELAY_OVERHEAD_RAW .MOD 2
        BRPAGE  bcs, @nottoolow         ; (alt) +3 (if cycles would be odd)
        nop
        lda     #(VDELAY_OVERHEAD>>1)^$7F
        BRPAGE  bcc, vdelay_toolow_resume
@nottoolow:
.else
        BRPAGE  bcc, vdelay_toolow      ; +2 = 10
.endif
        lsr                             ; +2 = 12
        BRPAGE  bcs, :+                 ; +2 = 14
:       eor #$7F                        ; +2 = 16
vdelay_toolow_resume:
        sta @modify+1                   ; +4 = 20
@modify:
        ; @clockslide_end-1 is not used at runtime; it is just for the BRPAGE
        ; and assembler range checks
        BRPAGE bpl, @clockslide_end-1   ; +3 = 23

@clockslide:
        .repeat 127-(VDELAY_OVERHEAD/2)
        nop
        .endrepeat
        rts                             ; +6 = 29
@clockslide_end:

User avatar
rainwarrior
Posts: 7877
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by rainwarrior » Tue Oct 20, 2020 12:13 pm

Having trouble following the "mod 2" part, but the clockslide technique was to split a 2-byte 2-cycle instruction, rather than a 1-byte one like NOP. With that in mind 29 is definitely possible. Not sure if I missed any opportunities here to get it lower. (Should probably pack the "vdelay" part into the unused part of the clockslide to reach an even 256 bytes, too...)

Code: Select all

VDELAY_MINIMUM = 29

.align 256

vdelay_clockslide:              ; +2 = 23
    .repeat 254
        .byte $C9 ; CMP imm
    .endrepeat
    .byte $C5 ; CMP $EA
    nop
    rts                         ; +6 = 29

vdelay:                         ; +6 = 6
    sec                         ; +2 = 8
    sbc #VDELAY_MINIMUM         ; +2 = 10
    BRPAGE bcc, vdelay_toolow   ; +2 = 12
vdelay_toolow_return:
    eor #$FF                    ; +2 = 14
    sta vdelay_modify+1         ; +4 = 18
vdelay_modify:
    jmp vdelay_clockslide       ; +3 = 21
    .assert (<vdelay_clockslide)=0, error, "Clockslide must be 256-byte aligned."

vdelay_toolow:                  ; +3 = 13
    lda #VDELAY_MINIMUM-6       ; +2 = 15
    jmp vdelay_toolow_return    ; +3 = 18
I wouldn't relax the restriction on including toolow (or making MINIMUM implied, etc.) for the "extreme" variation because I think that optimization/compromise trivially applies to all versions? Would rather it be comparing apples to apples, and leave that last inch to be traded off manually.

User avatar
rainwarrior
Posts: 7877
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by rainwarrior » Tue Oct 20, 2020 5:19 pm

Alright, ejona's technique is now in the master branch.

ejona
Posts: 14
Joined: Sun Jul 14, 2019 9:45 pm

Re: 6502 vdelay - cycle delay routine with variable length at runtime

Post by ejona » Wed Oct 21, 2020 7:21 am

ejona wrote:
Mon Oct 19, 2020 8:46 am
it looks like we could then reduce the overhead for both versions by 3 cycles by using a separate toolow flow and removing "beq, @low". It looks like that would not even increase code size if we jump back to the "bcc, @fours" line just after @twos.
It appears this suggestion wasn't applied to the master branch. So there's 3 cycles available to be shaved off still.

Code: Select all

vdelay:                                ; +6 = 6 (jsr)
    sec                                ; +2 = 8
    sbc #VDELAY_MINIMUM                ; +2 = 10
    BRPAGE bcc, vdelay_toolow          ; +2 = 12
    lsr                                ; +2 = 14
    BRPAGE bcs, vdelay_2s              ; +2 = 16 (1 extra if bit 1 set)
vdelay_2s:
    lsr                                ; +2 = 18
vdelay_toolow_resume:
    BRPAGE bcc, vdelay_4s              ; +3 = 21 (2 extra if bit 2 set)
    ...

vdelay_toolow:
    lda #0
    BRPAGE bcc, vdelay_toolow_resume
rainwarrior wrote:
Tue Oct 20, 2020 12:13 pm
Having trouble following the "mod 2" part
It just adds a clock of delay to make the overhead even. Since the "lsr" rounds down to the nearest even I had to pad to an even overhead.
but the clockslide technique was to split a 2-byte 2-cycle instruction
Yeah, totally. If you're referring to the label before the nops, yes, it is the wrong term. I noticed it later.
I wouldn't relax the restriction on including toolow (or making MINIMUM implied, etc.) for the "extreme" variation because I think that optimization/compromise trivially applies to all versions?
Not actually. Most of them behave very poorly, either underflowing or jumping who-knows where. I totally accept the idea of a level playing field, which is why I included the checks in the version I posted. But only being off-by-one for toolow values seems pretty close to useful.
Would rather it be comparing apples to apples, and leave that last inch to be traded off manually.
For that particular code, it was non-trivial to remove the check, because the ".mod 2" thing is mixed in and would need to be worked in elsewhere. (I tried to have the mod 2 management separate from the toolow check, but it made things really complicated.)
Last edited by ejona on Wed Oct 21, 2020 8:07 am, edited 1 time in total.

Post Reply