edit: Here's his post with great explanations. I understand almost all of it!
Yes! Sweet, thanks... y must = 0 to access the first element? If I set y to zero before the loop... that won't help, right? I need to go away and think about this some more... bye.
Kasumi wrote:
The two pieces of code accomplish the same goal. (Though mine sets up the buffers differently. That different layout would be faster for your NMI to read as well, though.) You don't need to store/restore X in goodlocation because it just stays in X. (I mean... you may still have to load it before the loop, but you no longer have to do it IN the loop.) You come out ahead because the code I added takes fewer cycles than the unneeded code I removed. (storing/restoring goodlocation)
After loading the metatile index, you did tax. Mine does this too (well... tay instead), in addition to storing the position to temp RAM. That store takes 3 extra cycles.
I omitted some stuff, but the full thing would be like:
Code:
  ldx #29                ; Before everything, so not during the loop. This is like goodlocation,
                         ; but we load it with #29 instead of #59 for other reasons.
loop:
  lda ($10),y            ; Originally omitted. Still have to do that to get the index, of course.
  sty pointerposition    ; This wasn't needed before, so we're 3 cycles behind.
  tay                    ; This was needed before. We overwrote what was in Y, which is why we stored it above.
                         ; Metatile index is in Y. Location in RAM buffer is in X.
  lda MetatileTile0,y    ; Assuming this is the top left tile
  sta RAMbuffereven,x    ; Even buffer
  lda MetatileTile1,y    ; Assuming this is the top right tile
  sta RAMbufferodd,x     ; Odd buffer
  dex                    ; Takes us to the next tile for BOTH buffers
  lda MetatileTile2,y    ; Assuming this is the bottom left tile
  sta RAMbuffereven,x    ; Even buffer
  lda MetatileTile3,y    ; Assuming this is the bottom right tile
  sta RAMbufferodd,x     ; Odd buffer
  lda pointerposition    ; Used to be tya. You lose just one cycle doing this instead,
                         ; but you gain that back by not having
                         ; ldx goodLocation and stx goodLocation (which would take 6 cycles),
                         ; because X doesn't change jobs in mine. It's always where you are in the buffer.
  clc
  adc #$10               ; Increment Y by 16!!!!
  tay
  dex
  bpl loop
Later, you did tya because you can only add to A. Mine does lda tempRAM instead, which takes 1 more cycle than tya. (if zero page)
All together, I've made your metatile index transfer work another way. It takes 4 cycles extra.
Right. You need X/Y for three tasks. 1. Loading from the pointer. (can only be done with Y) 2. Loading tiles from the metatile. 3. Storing the tiles to the buffer. This means either X or Y must change jobs, because two things can't do three jobs without changing. This is true for mine, and it was true for yours.
I needed to move y anyway, but I didn't need to move x?
Because of how I preserved X instead of Y (which needed to change jobs in both because it's needed to access the pointer), I've eliminated stx goodLocation and ldx goodLocation (DURING the loop anyway), which would have taken 6 cycles. So it ends up 2 cycles faster.
But mine is also faster for other reasons related to why I did the transfers that way. I dex once for every two times you do, because I do both even and odd at once using separate buffers. I avoid storing each tile of the metatile in the very beginning of the buffer RAM, because there's no need. I have where I am in the buffer in X already when I load the metatile index in y (you load where you are in the buffer later), so they're just stored exactly where they need to be. No need for the temp stores.
It saves a lot of cycles per loop. I think 42. 4 for doing dex twice instead of four times, 9*4=36 for not doing the indexed temp stores, 6 for not storing/restoring goodlocation. -4 for things I added.
This loops 15 times, so that's 630 cycles. 630 more if you do it twice for two 16x16 columns like it seems you're planning.
All that said, I make no guarantees this will work verbatim. There may be some extra stuff you need to do before/after the loop I'm forgetting, but I can't imagine any of it not making the savings worth it.
Edit: Heck, I was being safe, but you can move the clc from inside the loop to before the loop if the pointer is set up such that y = 0 to access the first element. Nothing in the loop changes the carry except the add, and the adds during the loop will NEVER set the carry. (You add 16 to Y 15 times, which would only make it 240. Not greater than 255, so carry would be clear throughout.) This saves another 28 cycles over the whole loop: 2*15 for not doing it in the loop, -2 because you still need to do it once before the loop.
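For reference, a rough sketch of what the loop might look like with the clc hoisted out. It reuses the labels from the code above and assumes Y is small enough on entry that the adds never carry, and that nothing else you add inside the loop touches the carry flag. Untested, so treat it as an illustration and not drop-in code.

```asm
; Sketch only: same loop, but clc runs once before it instead of 15 times in it.
; ASSUMPTION: Y is 0 (or at least below 16) on entry, so adc #$10 never carries.
; ASSUMPTION: no instruction added to the loop body changes the carry flag.
  ldx #29
  clc                    ; Done once here. lda/sta/sty/tay/dex/bpl leave carry alone.
loop:
  lda ($10),y            ; Metatile index via the pointer
  sty pointerposition    ; Save the pointer position
  tay                    ; Y now indexes the metatile tables
  lda MetatileTile0,y
  sta RAMbuffereven,x
  lda MetatileTile1,y
  sta RAMbufferodd,x
  dex
  lda MetatileTile2,y
  sta RAMbuffereven,x
  lda MetatileTile3,y
  sta RAMbufferodd,x
  lda pointerposition
  adc #$10               ; Carry is still clear from before the loop
  tay
  dex
  bpl loop
```

If the starting Y could ever be 16 or more, one of the adds would wrap past 255 and set the carry, and every later add would be off by one, so in that case the clc has to stay inside the loop.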
edit2: I guess y can be guaranteed to be less than 16... that would work, right? The worst case would be 240+15=255, so the carry would stay clear because 255 is not greater than 255. I'm not ever going to draw the bottom half of a column... it will start near 0 each time, I think.