I saw you added a NULL test to the repo. It makes debugging a lot more straightforward. Thank you for that.
Those functions are really nice. When I looked at "A + 25" and saw "beq :+", the benefit of reducing the input range became obvious, even when it doesn't reduce to the next power of two. I had considered that earlier but forgot to dig into it, since power-of-two was working well. So I ran off, tried some ideas, and got the overhead down to 33 cycles. Neat, but not earth-shattering.
But it made an earlier idea more viable: splitting out another tier of processing for very small numbers. I was able to add a "low" case to the toolow logic that only supports inputs from 0 to 3. With that, an approach of mine with 33 cycles of overhead got down to 29. Making the toolow handling one cycle shorter than the rest of the code flow brought it down to 28.
There's a lot of interaction between the "low" and "large" code paths to get them aligned. "Low" values are also so limited that there's opportunity for interesting combinations of the toolow and low code paths. In the end, simple stuff happened to work well, but there are many options still open.
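To make the tiers concrete, here is a rough Python sketch of how the input range splits up. The cutoffs are taken from the 28-cycle routine (a large-path minimum of 33 and four "low" values); the names `tier` and `delivered_cycles` are made up for illustration and are not in the repo.

```python
MIN_LARGE = 33      # smallest input the large path handles exactly
LOW_COUNT = 4       # the "low" tier covers the 4 inputs just below MIN_LARGE
TOOLOW_CYCLES = 28  # fixed cost of the toolow path, one less than the smallest low case


def tier(n):
    """Name the code path that an 8-bit input n would take."""
    if n >= MIN_LARGE:
        return "large"
    if n >= MIN_LARGE - LOW_COUNT:
        return "low"
    return "toolow"


def delivered_cycles(n):
    """Cycles actually spent: exact for large/low, a fixed floor for toolow."""
    return TOOLOW_CYCLES if tier(n) == "toolow" else n


for n in (0, 28, 29, 32, 33, 255):
    print(n, tier(n), delivered_cycles(n))
```

The low tier alone makes 29 through 32 exact. Because toolow costs one cycle less than the cheapest low case, a request for 28 is exact as well.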
Playing around with it myself, I got mine down to 28 cycles:
Code:
VDELAY_MINIMUM_LARGE = 33

vdelay:                             ; +6 = 6 (jsr)
    sec                             ; +2 = 8
    sbc #VDELAY_MINIMUM_LARGE       ; +2 = 10
    BRPAGE bcc, vdelay_low          ; +2 = 12
:   sbc #5                          ; +2 = 14
    BRPAGE bcs, :-                  ; +2 = 16
    adc #5                          ; +2 = 18
    lsr                             ; +2 = 20
    BRPAGE bcs, vdelay_2s           ; +2 = 22 (1 extra if bit 1 set)
vdelay_2s:
    BRPAGE beq, vdelay_ret0         ; +3 = 25 (if no more bits set)
    lsr                             ; (2 extra if bit 2 or 3 is set)
    BRPAGE beq, vdelay_ret          ; (2 extra if bit 3 is set)
vdelay_ret0:
    BRPAGE bne, vdelay_ret          ; +2 = 27
vdelay_ret:
    rts                             ; +6 = 33 (end)

vdelay_low:                         ; +1 = 13 (bcc)
    adc #4                          ; +2 = 15
    BRPAGE bcc, vdelay_toolow       ; +2 = 17
    BRPAGE beq, :+
    lsr
:   BRPAGE beq, :+
    BRPAGE bcs, :+
:   rts                             ; +6 = 29 (end)

vdelay_toolow:                      ; +1 = 18 (bcc)
    nop                             ; +2 = 20
    nop                             ; +2 = 22
    rts                             ; +6 = 28 (end)
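For anyone who wants to double-check the cycle comments, here is a small Python model of the routine above. It is only a sketch: it tracks A and the carry flag by hand, and it assumes no branch crosses a page boundary (which is what I take BRPAGE to be guarding), so a branch costs 2 cycles when not taken and 3 when taken. The helper names `branch` and `vdelay_cycles` are mine.

```python
MIN_LARGE = 33  # VDELAY_MINIMUM_LARGE in the listing


def branch(taken):
    """Branch cost, assuming no page crossing: 3 if taken, else 2."""
    return 3 if taken else 2


def vdelay_cycles(n):
    """Total cycles from jsr through rts when A = n on entry."""
    cyc = 6 + 2                       # jsr, sec
    a = n - MIN_LARGE                 # sbc #VDELAY_MINIMUM_LARGE
    carry = a >= 0                    # carry clear means a borrow happened
    a &= 0xFF
    cyc += 2
    cyc += branch(not carry)          # bcc vdelay_low

    if carry:                         # large path
        while carry:
            a -= 5                    # sbc #5 (carry is set on entry)
            carry = a >= 0
            a &= 0xFF
            cyc += 2
            cyc += branch(carry)      # bcs :-
        a = (a + 5) & 0xFF            # adc #5, leaves remainder 0..4
        cyc += 2
        carry, a = bool(a & 1), a >> 1
        cyc += 2                      # lsr
        cyc += branch(carry)          # bcs vdelay_2s (next instruction)
        taken = a == 0
        cyc += branch(taken)          # beq vdelay_ret0
        if taken:
            cyc += branch(False)      # bne vdelay_ret, never taken here
        else:
            a >>= 1
            cyc += 2                  # lsr
            taken = a == 0
            cyc += branch(taken)      # beq vdelay_ret
            if not taken:
                cyc += branch(True)   # bne vdelay_ret, always taken here
        return cyc + 6                # rts

    a += 4                            # vdelay_low: adc #4 with carry clear
    carry = a > 0xFF
    a &= 0xFF
    cyc += 2
    cyc += branch(not carry)          # bcc vdelay_toolow
    if not carry:
        return cyc + 2 + 2 + 6        # nop, nop, rts
    taken = a == 0
    cyc += branch(taken)              # beq :+
    if not taken:
        carry, a = bool(a & 1), a >> 1
        cyc += 2                      # lsr
    taken = a == 0
    cyc += branch(taken)              # beq :+
    if not taken:
        cyc += branch(carry)          # bcs :+
    return cyc + 6                    # rts


print([vdelay_cycles(n) for n in (0, 28, 29, 32, 33, 37, 38)])
```

Under those assumptions the model gives exactly `n` cycles for every input from 28 to 255, and a flat 28 for anything smaller.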
The vdelay_low flow is a byte-reduced version of this approach, which is clearer:
Code:
    BRPAGE beq, end                 ; special case 0 entirely
    lsr
    BRPAGE beq, :+                  ; if zero, then carry is guaranteed set
    BRPAGE bcs, :+
:   rts
end:
    BRPAGE beq, *+2
    rts
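Here is a quick Python comparison of the two low flows, again only as a sketch. It counts cycles from the first `beq` through the `rts`, assumes A is 0..3 with carry set on entry (as it is after the `adc`), and assumes no page crossings. The function names `low_compact` and `low_clear` are made up here.

```python
def branch(taken):
    """Branch cost, assuming no page crossing: 3 if taken, else 2."""
    return 3 if taken else 2


def low_compact(a):
    """Byte-reduced flow: beq :+ / lsr / : beq :+ / bcs :+ / : rts."""
    carry = True                      # left set by the preceding adc
    taken = a == 0
    cyc = branch(taken)               # beq :+ (skips the lsr when A is 0)
    if not taken:
        carry, a = bool(a & 1), a >> 1
        cyc += 2                      # lsr
    taken = a == 0
    cyc += branch(taken)              # beq :+
    if not taken:
        cyc += branch(carry)          # bcs :+
    return cyc + 6                    # rts


def low_clear(a):
    """Clearer flow: zero is handled by its own beq end / beq *+2 / rts."""
    if a == 0:
        return branch(True) + branch(True) + 6
    cyc = branch(False)               # beq end, not taken
    carry, a = bool(a & 1), a >> 1
    cyc += 2                          # lsr
    taken = a == 0
    cyc += branch(taken)              # beq :+
    if not taken:
        cyc += branch(carry)          # bcs :+
    return cyc + 6                    # rts


for a in range(4):
    print(a, low_compact(a), low_clear(a))
```

Both flows come out to 12 + A cycles for A = 0..3. By my count the compact one is three bytes shorter, since it drops the extra `beq` and `rts`.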
Using the wiki's delay routine, I got things down to 27 cycles:
Code:
VDELAY_MINIMUM_LARGE = 31

vdelay:                             ; +6 = 6 (jsr)
    sec                             ; +2 = 8
    sbc #VDELAY_MINIMUM_LARGE       ; +2 = 10
    BRPAGE bcc, vdelay_low          ; +2 = 12
    ; ... delay_a_27_clocks without the 'sec'...

vdelay_low:                         ; +1 = 13 (bcc)
    adc #3                          ; +2 = 15
    BRPAGE bcc, vdelay_toolow       ; +2 = 17
    BRPAGE beq, :+                  ; +3 = 20 (when zero)
    lsr
:   BRPAGE bne, *+2                 ; +2 = 22
    rts                             ; +6 = 28

vdelay_toolow:                      ; +1 = 18 (bcc)
    BRPAGE bcc, *+2                 ; +3 = 21 (always branch)
    rts                             ; +6 = 27 (end)
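The same kind of Python check works for the 27-cycle version. Since the large path is elided in the listing, this sketch does not model it instruction by instruction. It simply assumes, going by the routine's name, that delay_a_27_clocks costs A + 27 cycles including its jsr and sec, so the inlined body contributes A + 19. No page crossings are assumed, and `vdelay27_cycles` is a made-up name.

```python
MIN_LARGE = 31  # VDELAY_MINIMUM_LARGE in the listing


def branch(taken):
    """Branch cost, assuming no page crossing: 3 if taken, else 2."""
    return 3 if taken else 2


def vdelay27_cycles(n):
    """Total cycles from jsr through rts when A = n on entry."""
    cyc = 6 + 2 + 2                   # jsr, sec, sbc #VDELAY_MINIMUM_LARGE
    a = n - MIN_LARGE
    carry = a >= 0
    a &= 0xFF
    cyc += branch(not carry)          # bcc vdelay_low

    if carry:
        # ASSUMPTION: the elided body costs A + 27 - 6 (jsr) - 2 (sec) cycles.
        return cyc + a + 19

    a += 3                            # vdelay_low: adc #3 with carry clear
    carry = a > 0xFF
    a &= 0xFF
    cyc += 2
    cyc += branch(not carry)          # bcc vdelay_toolow
    if not carry:
        cyc += branch(True)           # bcc *+2, always taken
        return cyc + 6                # rts
    taken = a == 0
    cyc += branch(taken)              # beq :+
    if not taken:
        a >>= 1
        cyc += 2                      # lsr
    cyc += branch(a != 0)             # bne *+2
    return cyc + 6                    # rts


print([vdelay27_cycles(n) for n in (0, 27, 28, 29, 30, 31)])
```

With that assumption about the large path, every input from 27 to 255 gets exactly `n` cycles and anything below 27 gets a flat 27.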
I wouldn't be surprised if someone shaves off 1-2 more cycles relatively quickly.