Here's 8 ticks of the $002D LFSR computed at once as a multiply and XOR. ($2D = %101101, so this works out to 4 operations of shift + XOR.)
Code:
lfsr8:
ldx seed+0
lda seed+1
sta seed+0 ; low x << 0 (9 cycles)
lda seed+1
asl
asl
pha
eor seed+0
sta seed+0 ; low x << 2 (+16 = 25 cycles)
pla
asl
pha
eor seed+0
sta seed+0 ; low x << 3 (+15 = 40 cycles)
pla
asl
asl
eor seed+0
sta seed+0 ; low x << 5 (+14 = 54 cycles)
lda seed+1
stx seed+1 ; high x << 8
lsr
lsr
lsr
pha
eor seed+1
sta seed+1 ; high x << 5 (+21 = 75 cycles)
pla
lsr
lsr
pha
eor seed+1
sta seed+1 ; high x << 3 (+17 = 92 cycles)
pla
lsr
eor seed+1
sta seed+1 ; high x << 2 (+12 = 104 cycles)
lda seed+0
rts ; total: 113 cycles including rts (57 bytes)
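To sanity-check the idea at a higher level, here's a small Python model (mine, not from the routine above; names are made up) comparing one batched step against eight single ticks of the left-shifting Galois LFSR with feedback word $002D:

```python
POLY = 0x002D  # feedback word: taps at bits 0, 2, 3, 5

def tick(s):
    """One tick of the left-shifting 16-bit Galois LFSR."""
    carry = s >> 15
    s = (s << 1) & 0xFFFF
    if carry:
        s ^= POLY
    return s

def lfsr8(s):
    """Eight ticks at once: the low byte shifts up into the high byte,
    and the old high byte is carry-less multiplied by $2D, i.e. XORed
    in at shifts 0, 2, 3 and 5 -- the same terms as the assembly."""
    lo, hi = s & 0xFF, s >> 8
    return (lo << 8) ^ hi ^ (hi << 2) ^ (hi << 3) ^ (hi << 5)

# Check the equivalence over a stretch of the sequence.
s = 1
for _ in range(1000):
    stepped = s
    for _ in range(8):
        stepped = tick(stepped)
    assert stepped == lfsr8(s)
    s = stepped
```

The batch needs no extra reduction step because the highest tap is bit 5: feedback injected on the first of the eight ticks only ever reaches bit 12, so it never crosses bit 15 again within the batch. That's what lets the assembly get away with plain shifts and EORs.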
This runs in constant time, but its average time is almost the same as the .repeat version (it also requires one byte of temporary storage: X, in this example). Unlike the other version, though, it can't be broken down by the number of bits needed like I suggested, so I think this way is inferior, at least in that respect: it optimizes only the 8-bit case at the expense of cases requiring fewer bits. (Warning: this code is untested and could have errors; I was only investigating how many cycles it should take. If it's incorrect, a corrected version should be comparable.)
Actually, now that I've done this comparison, I can see where Cook's CRC16 gets its added efficiency. If I eliminated one of the taps here, I could probably cut ~25 cycles (putting it a lot closer to the CRC16 algorithm), but I'd also lose maximal length in the process (just like CRC16 does).
So, really, making an LFSR efficient at 8 bits per step is mostly a matter of finding a polynomial that minimizes the number of shifts required. There's probably some extra optimization hiding in Cook's method versus my naive multiply and XOR, but it's minor compared to what that method gains by sacrificing sequence length.
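As a rough follow-up experiment (my own sketch, not from the post, and taking "maximal" to mean the full 65535-step sequence), one can brute-force how sparse the tap word can get when the taps are confined to the low byte, as the 8-at-once trick wants:

```python
def period(taps, s=1):
    """Cycle length of the left-shifting 16-bit Galois LFSR with this
    feedback word, starting from state 1; 0 if it never returns."""
    start = s
    for n in range(1, 65537):
        s = ((s << 1) & 0xFFFF) ^ (taps if s & 0x8000 else 0)
        if s == start:
            return n
    return 0

assert period(0x002D) == 65535  # $2D is maximal, as assumed above

# Dropping any single tap from $2D loses maximal length:
for bit in (0, 2, 3, 5):
    assert period(0x2D ^ (1 << bit)) < 65535

# Fewest taps in the low byte that still give the full sequence
# (a tap at bit 0 is required, so only odd words can qualify):
best = min(bin(t).count("1") for t in range(1, 256, 2)
           if period(t) == 65535)
```

In this search `best` comes out to 4, so $2D's four taps are already as sparse as a maximal low-byte configuration gets; shaving shifts below that means giving up sequence length, exactly the CRC16 trade-off.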