That sounds about right. Assuming the SD card uses SPI over controller 2 bit 1, with a passthrough receptacle to connect the actual controller 2 to bit 0, I count 108 master clocks per bit, or 32 + (108 * 8) - 6 = 890 clocks to read a byte, less than one 1364-clock scanline. (Correct me if I'm wrong.)
Code:
; Assume 8-bit accumulator (sep #$20) and fast ROM
lda #$01 ; 12
sta tmp ; 20 ; loop counter: once this 1 reaches carry, we're done
loop:
lda $4017 ; 30 ; 401x access takes 12 clocks or two fast cycles
lsr a ; 12
lsr a ; 12 ; select adapter (bit 1), not controller 2 (bit 0)
rol tmp ; 36
bcc loop ; 18
; -6 ; last untaken
Equivalent NES code runs in 60 + 192 * 8 - 12 = 1584 clocks, just over one 1364-clock scanline.
Code:
lda #$01 ; 24
sta tmp ; 36 ; loop counter: once this 1 reaches carry, we're done
loop:
lda $4017 ; 48
and #$08 ; 24
cmp #$01 ; 24 ; carry set if bit 3 was set: select adapter (bit 3), not controller 2 (bit 0)
rol tmp ; 60
bcc loop ; 36
; -12 ; last untaken
This is why a few SPI-on-NES proposals give the SPI serialization job to cart hardware.
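To double-check the totals, here is a rough Python tally of the per-instruction clock counts taken from the comments in the two listings above. The helper name `clocks_per_byte` and the constant names are mine, just for illustration, and the 1364-clock scanline is assumed for both consoles.

```python
# Master clocks per instruction, copied from the comments in the listings.
# All names here are hypothetical, chosen for this sketch only.
SNES_SETUP = [12, 20]              # lda #$01, sta tmp
SNES_LOOP = [30, 12, 12, 36, 18]   # lda $4017, lsr a, lsr a, rol tmp, bcc (taken)
SNES_UNTAKEN_SAVING = 6            # final bcc falls through

NES_SETUP = [24, 36]               # lda #$01, sta tmp
NES_LOOP = [48, 24, 24, 60, 36]    # lda $4017, and #$08, cmp #$01, rol tmp, bcc (taken)
NES_UNTAKEN_SAVING = 12            # final bcc falls through

SCANLINE = 1364  # master clocks per scanline (assumed for both consoles)


def clocks_per_byte(setup, loop, untaken_saving, bits=8):
    """Setup cost, plus one loop pass per bit, minus the saving on the last branch."""
    return sum(setup) + sum(loop) * bits - untaken_saving


snes = clocks_per_byte(SNES_SETUP, SNES_LOOP, SNES_UNTAKEN_SAVING)
nes = clocks_per_byte(NES_SETUP, NES_LOOP, NES_UNTAKEN_SAVING)

print("SNES:", sum(SNES_LOOP), "clocks/bit,", snes, "clocks/byte")
print("NES: ", sum(NES_LOOP), "clocks/bit,", nes, "clocks/byte")
print("SNES fits in a scanline:", snes < SCANLINE)
print("NES fits in a scanline: ", nes < SCANLINE)
```

With those numbers the SNES loop comes to 108 clocks per bit and 890 per byte, and the NES loop to 192 per bit and 1584 per byte, so only the SNES version fits inside one scanline.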