rainwarrior wrote: Well, if you want to compare them by speed alone, LDA immediate is the fastest possible, yes. It's a tradeoff of size (either ROM or RAM) for speed though.
I considered using the immediate method, in ROM, for tiles that need to be updated constantly, like the main character and animated level objects. This is an incredible waste of ROM though (each data byte expands to 5 bytes of code: a 2-byte LDA immediate plus a 3-byte STA $2007!), so it's only worth it if you have lots of space available. I started coding a tool to minimize this expansion a little, by using all 3 registers as a small dictionary of recently used values and using INX, DEX, INY, DEY, ASL, LSR, ROR and ROL instead of loading new values whenever possible, but never implemented most of the ideas. The hard part is figuring out which register is the best one to load a new value into, considering the transformations it will go through in the future and how soon the old value will be needed again.
The PLA method is reasonably fast, and makes conservative use of RAM. A partially unrolled LDA abs, X is similar in speed to using PLA, but the code required is slightly larger.
Unrolled PLAs are interesting because they use less ROM, and when you have the code completely unrolled, you can easily jump into the middle of it to copy variable amounts of bytes, if you ever need that.
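To make the "jump in the middle" idea concrete, here is a minimal sketch of the arithmetic, assuming every unrolled step is a PLA followed by STA $2007 (1 + 3 = 4 bytes). The function name and parameters are hypothetical, only there for illustration:

```python
PAIR_SIZE = 4  # PLA (1 byte) + STA $2007 (3 bytes)

def entry_offset(unrolled_pairs: int, bytes_to_copy: int) -> int:
    """Offset from the start of the unrolled code at which to enter
    so that exactly bytes_to_copy PLA/STA pairs are executed."""
    if not 0 <= bytes_to_copy <= unrolled_pairs:
        raise ValueError("cannot copy more bytes than there are unrolled pairs")
    # Skip the pairs we do not need; the remaining ones run to the end.
    return (unrolled_pairs - bytes_to_copy) * PAIR_SIZE

# With 64 unrolled pairs, copying only 16 bytes means skipping 48 pairs.
print(entry_offset(64, 16))
```

On the NES side you would look this offset up (or compute it) and jump to the start of the unrolled code plus that offset.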
Another place you can put the buffer is zero page, and as long as you use ZP addressing (meaning the code has to be completely unrolled) you can transfer each byte in 7 cycles, slightly faster than the 8 cycles of the PLA or absolute indexed methods.
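For reference, the per-byte costs of the methods discussed so far can be tallied like this. It is only a sketch using the standard 6502 timings and instruction sizes, and it assumes the absolute indexed load never crosses a page boundary:

```python
# (cycles, code bytes) for each way of loading one byte
LOAD = {
    "lda #imm":  (2, 2),
    "lda zp":    (3, 2),
    "pla":       (4, 1),
    "lda abs,x": (4, 3),  # assumes no page crossing
}
STORE = (4, 3)  # sta $2007 (absolute addressing)

def cost(load: str) -> tuple:
    """Return (cycles, code bytes) for one load + sta $2007 pair."""
    cycles, size = LOAD[load]
    return cycles + STORE[0], size + STORE[1]

for name in LOAD:
    cycles, size = cost(name)
    print(f"{name:10s} {cycles} cycles, {size} bytes of code per data byte")
```

The immediate method wins on speed but stores the data inside the code itself, while PLA gives the smallest unrolled code.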
Outside the vblank there is probably some increased complexity somewhere to organize the data (ideally: done offline by a tool).
Yes, the complexity is all about converting the CHR data from linear to interleaved, which should be done by a little script before the ROM is assembled, so this doesn't have any negative impact on the NES program itself. The main advantage of this method is that you don't have to waste CPU time or RAM preparing and buffering the data: you can blast it straight from ROM to VRAM without any speed penalty.
How exactly? I don't understand how to address groups of 4, do you do something like this below or is it more involved than that?
$8000 $8100 $8200 $8300
A1 A2 A3 A4
B1 B2 B3 B4
------------
LDA $8000, x
STA $2007
LDA $8100, x
STA $2007
LDA $8200, x
(...)
Something like this, but I'm not sure what A1, A2, etc. mean in your diagram. Let me draw a diagram of what I have in mind:
G = INDEX OF THE GROUP;
B = INDEX OF THE BYTE WITHIN THE GROUP;
$8000: G$00 B$00, G$01 B$00, G$02 B$00, G$03 B$00, G$04 B$00 (...) G$FE B$00, G$FF B$00
$8100: G$00 B$01, G$01 B$01, G$02 B$01, G$03 B$01, G$04 B$01 (...) G$FE B$01, G$FF B$01
$8200: G$00 B$02, G$01 B$02, G$02 B$02, G$03 B$02, G$04 B$02 (...) G$FE B$02, G$FF B$02
(...)
$BD00: G$00 B$3D, G$01 B$3D, G$02 B$3D, G$03 B$3D, G$04 B$3D (...) G$FE B$3D, G$FF B$3D
$BE00: G$00 B$3E, G$01 B$3E, G$02 B$3E, G$03 B$3E, G$04 B$3E (...) G$FE B$3E, G$FF B$3E
$BF00: G$00 B$3F, G$01 B$3F, G$02 B$3F, G$03 B$3F, G$04 B$3F (...) G$FE B$3F, G$FF B$3F
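The layout above could be produced by a small offline script along these lines. This is a rough sketch, not a finished tool: it assumes groups of 4 tiles (64 bytes), at most 256 groups, and zero padding for unused slots, and the function name is made up for the example.

```python
GROUP_SIZE = 64   # 4 tiles x 16 bytes each
PAGE_SIZE = 256   # one 256-byte page per byte position, indexed by X

def interleave(chr_data: bytes) -> bytes:
    """Rearrange linear CHR data so that byte B of group G
    ends up at offset B * 256 + G of the output."""
    if len(chr_data) % GROUP_SIZE != 0:
        raise ValueError("CHR data must be a whole number of 64-byte groups")
    group_count = len(chr_data) // GROUP_SIZE
    if group_count > PAGE_SIZE:
        raise ValueError("at most 256 groups fit an 8-bit index")
    out = bytearray(GROUP_SIZE * PAGE_SIZE)  # 16 KB; unused slots stay zero
    for g in range(group_count):
        for b in range(GROUP_SIZE):
            out[b * PAGE_SIZE + g] = chr_data[g * GROUP_SIZE + b]
    return bytes(out)

# Example: two groups, the first counting 0..63, the second filled with $AA.
table = interleave(bytes(range(64)) + bytes([0xAA] * 64))
print(len(table))
```

The 16 KB result would then be included at $8000 so that each page lines up with one of the table addresses shown above.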
Since each tile is 16 bytes, a group of 4 would be 64 bytes, so you'd need an unrolled loop that transfers that many bytes (this code would use 384 bytes of ROM). The code would look like this:
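ldx GroupIndex       ; X = index of the first group to copy
ldy GroupCount       ; Y = number of groups to copy (must not be 0)
TransferBlock:
lda $8000, x         ; byte $00 of group X
sta $2007
lda $8100, x         ; byte $01 of group X
sta $2007
lda $8200, x         ; byte $02 of group X
sta $2007
(...)
lda $BE00, x         ; byte $3E of group X
sta $2007
lda $BF00, x         ; byte $3F of group X
sta $2007
dey                  ; one group done
beq Done
inx                  ; move on to the next group
jmp TransferBlock
Done: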
Yes, it does use quite a bit of ROM, but it's the only way I'm aware of to avoid the (slow) indirect indexed addressing and still copy the data straight from ROM to VRAM, not wasting CPU time on buffering. If you can program using any language on the PC you can make a tool to interleave the CHR data in 5 minutes.
Each iteration of this loop takes 521 cycles to execute, so you can safely copy 4 blocks of 4 tiles = 16 tiles each frame if you're doing nothing else during VBlank.
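The numbers above can be double-checked with a few lines of arithmetic. This is just a sketch: the instruction timings are the standard 6502 ones, the tables are assumed to be page-aligned (so the indexed loads never pay a page-crossing penalty), and the NTSC VBlank length of roughly 2273 CPU cycles is an approximation I'm assuming here. Setup code (setting the VRAM address and so on) is not counted.

```python
BYTES_PER_GROUP = 64          # 4 tiles x 16 bytes
CYCLES_PER_BYTE = 4 + 4       # lda abs,x + sta $2007
LOOP_OVERHEAD = 2 + 2 + 2 + 3 # dey, beq (not taken), inx, jmp
CODE_BYTES_PER_BYTE = 3 + 3   # lda abs,x + sta abs

VBLANK_CYCLES_NTSC = 2273     # approximate, assumed value

cycles_per_group = BYTES_PER_GROUP * CYCLES_PER_BYTE + LOOP_OVERHEAD
unrolled_rom_bytes = BYTES_PER_GROUP * CODE_BYTES_PER_BYTE
groups_per_frame = VBLANK_CYCLES_NTSC // cycles_per_group

print(cycles_per_group, unrolled_rom_bytes, groups_per_frame)
```

Four groups take about 2084 cycles, which leaves only a little room for anything else during VBlank.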