8x16 and whatever else unreg wants to know

Post by Kasumi » Fri Dec 06, 2013 11:47 pm

So you can partially unroll, fully unroll (probably not worth it) or do stack stuff.


Partially unrolled looks like this:

Code:

   ldx #5
-  lda RAMbufferW, y
   sta PPUDATA7
   dey
   lda RAMbufferW, y
   sta PPUDATA7
   dey
   lda RAMbufferW, y
   sta PPUDATA7
   dey
   lda RAMbufferW, y
   sta PPUDATA7
   dey
   lda RAMbufferW, y
   sta PPUDATA7
   dey
   dex
   bpl -

It has the inner part of your loop copied five times, so it loops 6 times at 55 cycles per loop. (330 total. Well... minus 1 because the branch not taken to end the loop takes 1 less cycle than a branch taken to loop again.)
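
For anyone who wants to check the arithmetic, here is a rough Python sketch of the cycle count for different amounts of unrolling. It assumes the usual 6502 timings (lda absolute,y is 4 cycles with no page cross, sta absolute 4, dey and dex 2 each, bpl 3 when taken and 2 when not); the function name is made up for illustration.

```python
# Cycle cost of copying `total` bytes when the lda/sta/dey body is repeated
# `unroll` times per pass. Assumed timings: lda abs,y = 4 (no page cross),
# sta abs = 4, dey = 2, dex = 2, bpl = 3 taken / 2 not taken.
def copy_cycles(total, unroll):
    if total % unroll != 0:
        raise ValueError("unroll factor must divide the byte count")
    passes = total // unroll
    per_pass = unroll * (4 + 4 + 2) + 2 + 3   # body copies + dex + bpl
    return passes * per_pass - 1              # the final bpl is not taken

for unroll in (1, 2, 3, 5, 6, 10, 15):
    print(unroll, copy_cycles(30, unroll))
```

For 30 bytes this gives 449 cycles with no unrolling and 329 with five copies. A fully unrolled version with no loop at all would be 30 x 10 = 300 cycles, only 29 fewer than the five-copy loop, which is why going all the way is probably not worth the space.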

Stack is tricky, because you have to save and restore the stack pointer. When your program returns from the NMI, the stack pointer needs to be what it was when the NMI started, since it uses values on the stack to return.

But it looks a little like this:

Code:

;assume your data is at $0100-whatever instead of
;RAMbufferW-whatever

   ldx #$FF   ;$FF not $00 because the stack pointer sets where
   txs        ;a value is pushed to. Pulling from the stack takes the value
              ;one greater than that with wrap.
   ldy #29
-  pla        ;automatically increases the stack pointer
   sta PPUDATA7
   dey
   bpl -

It's 13 cycles per loop, loops 30 times for 390-1 total. Only 60 cycles faster than what you've got now (assuming no page cross). dex is 2 cycles, and we avoid doing it 30 times. It's worth noting that pla is 2 bytes smaller than lda absolute,y, and removing the dex saves 1 more. So it doesn't take as much space to partially unroll or unroll.

Here's what a stack partially unrolled approach would look like:

Code:

   ldx #$FF   ;$FF not $00 because the stack pointer sets where
   txs        ;a value is pushed to. Pulling from the stack takes the value
              ;one greater than that with wrap.
   ldy #2
-  pla        ;automatically increases the stack pointer
   sta PPUDATA7
   pla
   sta PPUDATA7
   pla
   sta PPUDATA7
   pla
   sta PPUDATA7
   pla
   sta PPUDATA7
   pla
   sta PPUDATA7
   pla
   sta PPUDATA7
   pla
   sta PPUDATA7
   pla
   sta PPUDATA7
   pla
   sta PPUDATA7
   dey
   bpl -

It loops 3 times, at 85 cycles a loop (ten pla/sta pairs at 8 cycles each, plus 5 for the dey and bpl). 255-1 total. Also, even though it has the inner part of the loop copied ten times instead of the 5 used in the absolute,y unrolled example, it's only 5 bytes larger. Not quite twice as fast as what you have now, but pretty close. Requires some additional work to set up and restore the stack, though.
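
To compare size and speed side by side, here is a small Python sketch that adds up bytes and cycles per instruction for the four loops discussed in this post. The byte and cycle figures are the standard 6502 ones and assume no page crossing; the table and function names are only for illustration, and the setup instructions before each loop (ldx, txs, ldy) are not counted.

```python
# (bytes, cycles) per instruction; timings assume no page crossing.
OPS = {
    "lda abs,y": (3, 4),
    "sta abs":   (3, 4),
    "pla":       (1, 4),
    "dey":       (1, 2),
    "dex":       (1, 2),
    "bpl":       (2, 3),  # 3 cycles when taken, 2 when it falls through
}

def loop_cost(body, passes):
    """Return (size in bytes, total cycles) for a loop body run `passes` times."""
    size = sum(OPS[op][0] for op in body)
    cycles = passes * sum(OPS[op][1] for op in body) - 1  # last bpl not taken
    return size, cycles

indexed = ["lda abs,y", "sta abs", "dey"]
stack = ["pla", "sta abs"]

variants = {
    "indexed loop":      (indexed + ["dex", "bpl"], 30),
    "indexed, 5 copies": (indexed * 5 + ["dex", "bpl"], 6),
    "stack loop":        (stack + ["dey", "bpl"], 30),
    "stack, 10 copies":  (stack * 10 + ["dey", "bpl"], 3),
}

for name, (body, passes) in variants.items():
    size, cycles = loop_cost(body, passes)
    print(f"{name:18} {size:3} bytes {cycles:4} cycles")
```

The size column shows the ten-copy stack loop at 43 bytes against 38 for the five-copy indexed loop, which is the 5-byte difference mentioned above.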

Another option is to update just one column in the NMI. Just save the attribute bytes to RAM to make those updates easier when you're only doing one. This would be faster for your NMI than doing anything above twice.

You could even do both! Have a faster update method AND use it for only one column at once.

Post by unregistered » Mon Dec 09, 2013 11:17 am

unregistered wrote:edit: ok... each of these loops runs 30 times

Code:

   sta $401e
-  lda RAMbufferW, y
   sta PPUDATA7
   dey
   dex
   bpl -
   sta $401f

The upper loop runs 449 cycles and the lower loop runs 479 cycles cause I guess it crosses a page boundary.
How do I make it not cross a page boundary?... Then I'd save 30 cycles! :)

Post by tokumaru » Mon Dec 09, 2013 11:35 am

unregistered wrote:How do I make it not cross a page boundary?... Then I'd save 30 cycles! :)
Make sure that the branch and the address branched to are in the same page. You might need to move some code/data around.
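
Two instructions in a loop like this can pay a page-crossing penalty: the taken bpl, and the lda absolute,y. Here is a minimal Python sketch of both checks, with made-up addresses purely as examples. A crossing branch costs 1 extra cycle each time it is taken (29 times over 30 passes), while a crossing indexed read costs 1 extra cycle per read (30 if every read crosses), so a gap of exactly 30 cycles may point at the buffer address rather than the branch. It is worth checking both.

```python
def branch_crosses_page(branch_addr, target_addr):
    """True if a 2-byte relative branch at branch_addr lands in another page.

    The 6502 compares the target with the address of the instruction
    after the branch, so that is what gets compared here.
    """
    next_instr = branch_addr + 2
    return (next_instr >> 8) != (target_addr >> 8)

def indexed_read_crosses_page(base_addr, index):
    """True if base_addr + index carries out of the low byte (abs,x / abs,y)."""
    return (base_addr & 0xFF) + index > 0xFF

# Hypothetical addresses, only to exercise the checks.
print(branch_crosses_page(0x80F9, 0x80F0))     # False
print(branch_crosses_page(0x80FE, 0x80F5))     # True
print(indexed_read_crosses_page(0x03F0, 29))   # True
print(indexed_read_crosses_page(0x0300, 29))   # False
```

For the indexed read, the fix is to place the buffer so that the low byte of its address plus the largest Y value used stays at or below $FF.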

I often group routines that can't cross pages together, before all the code that's not timing sensitive, so that I can mix and match until I find the best order for them. You obviously can't keep changing the routines after you find a place for them, otherwise you might break the page alignment. During development you can use the .align command to align subroutines to page boundaries.

Post by unregistered » Mon Dec 09, 2013 4:55 pm

tokumaru wrote:I often group routines that can't cross pages together, before all the code that's not timing sensitive, so that I can mix and match until I find the best order for them. You obviously can't keep changing the routines after you find a place for them, otherwise you might break the page alignment. During development you can use the .align command to align subroutines to page boundaries.
How do I use the .align command? :?

Post by tokumaru » Tue Dec 10, 2013 7:22 am

unregistered wrote:How do I use the .align command? :?
Say that the PC (program counter) is $8d54. If you put an ".align 256" command, the assembler will pad the ROM to $8e00 (the next 256-byte boundary). This means that 172 bytes go to waste, which is why you shouldn't use this in your final program, just during development while you still don't know the sizes of all timing-sensitive routines.

Also, don't use .align in the middle of a routine, because the CPU will try to execute the padding bytes and will most likely crash. Use .align only between tables and subroutines.
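
The padding arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation an assembler would do for a 256-byte alignment; the function name is invented for the example.

```python
def align_up(pc, boundary=256):
    """Return (aligned address, padding bytes) for the next multiple of boundary."""
    padding = -pc % boundary
    return pc + padding, padding

addr, wasted = align_up(0x8D54)
print(f"${addr:04X}, {wasted} bytes of padding")   # $8E00, 172 bytes of padding
```

If the PC is already on a boundary, the padding comes out as 0, so aligning a routine that already starts a page costs nothing.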

Post by rainwarrior » Tue Dec 10, 2013 7:59 am

If using ca65 I prefer to create a segment with the desired alignment and then use that segment.

I usually combine with something like this to make sure the code is placed where I expect:

Code:

.assert * = $8000, error, "This code is not at the correct position."

Post by unregistered » Tue Dec 10, 2013 10:31 am

tokumaru wrote:
unregistered wrote:How do I use the .align command? :?
Say that the PC (program counter) is $8d54. If you put an ".align 256" command, the assembler will pad the ROM to $8e00 (the next 256-byte boundary). This means that 172 bytes go to waste, which is why you shouldn't use this in your final program, just during development while you still don't know the sizes of all timing-sensitive routines.

Also, don't use .align in the middle of a routine, because the CPU will try to execute the padding bytes and will most likely crash. Use .align only between tables and subroutines.
Excellent! Thanks tokumaru!! :D You answered both of my questions... thank you kind sir. :D

Post by unregistered » Tue Dec 10, 2013 8:48 pm

Kasumi wrote:So you can partially unroll, fully unroll (probably not worth it) or do stack stuff.


Partially unrolled looks like this:

Code:

   ldx #5
-  lda RAMbufferW, y
   sta PPUDATA7
   dey
   lda RAMbufferW, y
   sta PPUDATA7
   dey
   lda RAMbufferW, y
   sta PPUDATA7
   dey
   lda RAMbufferW, y
   sta PPUDATA7
   dey
   lda RAMbufferW, y
   sta PPUDATA7
   dey
   dex
   bpl -

It has the inner part of your loop copied five times, so it loops 6 times at 55 cycles per loop. (330 total. Well... minus 1 because the branch not taken to end the loop takes 1 less cycle than a branch taken to loop again.)
KASUMI!!!!!!!!!!!!!!!!!!!!!!!!!!!! THANK YOU SO INCREDIBLY MUCH!!! :D :D
Well, I tried measuring my entire vblank and the max is 2259. That's using your partially unrolled loop for each column. It's crazy how that works unrolling partially... That 2259 is too high... looks like I'm going to have to draw one 16-pixel-wide column per vblank. And I'm happy about that. :D

Post by tepples » Tue Dec 10, 2013 9:04 pm

16px per vblank is how fast Sonic scrolls, so you're in good company.

Post by unregistered » Wed Dec 11, 2013 12:12 am

I want to do this the correct way... and I feel that 2 16bit columns per vblank is possible... maybe next could go away. 2223 cycles. That is without my attribute table coloring code. And that is ok because I haven't started writing it yet. And... there isn't any scroll_screen code to remove from that number because scroll_screen is after SkipUpdates: and Kasumi, you had me end cycle counting at SkipUpdates.

Post by Kasumi » Wed Dec 11, 2013 12:24 am

I feel that 2 16bit columns per vblank is possible.
Tokumaru says it can be done. But if your game never actually scrolls that fast, doing the work it requires is wasted effort. For the record, my game is prepared to scroll 8 pixels in both directions every frame, and isn't optimized enough to even do 16 pixels in one direction and 8 in another. And it's already pretty large/optimized. Doing four 8 pixel wide updates is pretty intense.
Kasumi, you had me end cycle counting at SkipUpdates.
The scroll stuff could take just 14 cycles, so it shouldn't be a problem. But technically the writes to $2005 should also be done before ~2270 cycles pass. It just won't break terribly much if you don't make it. (Maybe vertical scrolling will be off for a frame, with a wrong scanline or two of horizontal scrolling.) Unlike $2006 and $2007, $2005 is safe to write to during rendering.
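
For reference, the ~2270 figure can be worked out from NTSC timing. This is a rough Python sketch assuming 20 scanlines between the start of vblank and the pre-render line, 341 PPU dots per scanline, and 3 PPU dots per CPU cycle; the measured values are the ones posted earlier in this thread.

```python
# NTSC assumptions: 20 scanlines from the start of vblank to the pre-render
# line, 341 PPU dots per scanline, 3 PPU dots per CPU cycle.
VBLANK_CYCLES = 20 * 341 / 3          # about 2273.3

def headroom(measured_cycles):
    """CPU cycles left in vblank after an update routine of the given length."""
    return VBLANK_CYCLES - measured_cycles

for measured in (2259, 2223):
    print(measured, round(headroom(measured), 1))
```

So 2259 cycles leaves only about 14 cycles of slack and 2223 leaves about 50. Depending on where the measurement starts, the 7-cycle interrupt entry may need to come out of that budget too.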

Post by tokumaru » Wed Dec 11, 2013 8:33 am

Kasumi wrote:
I feel that 2 16bit columns per vblank is possible.
Tokumaru says it can be done.
Indeed, but you won't be doing much else (sprite DMA is OK, but that's it)! =)

The way I did it is pretty intense, because of the 8-way scrolling, which means that new rows/columns can cross name table boundaries... To unroll the code without having to check for the boundaries I had to use a jump table (to jump to the middle of the unrolled data transfer routine, with the correct amount of bytes left to copy) and some index trickery (mostly explained here and here - man, this thing is OLD!).
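
As a very rough illustration of the jump-table idea (not the actual routine described above, whose details aren't given here), this Python sketch computes where to enter a hypothetical unrolled copy routine so that only a given number of bytes get transferred. The 7-byte unit (lda buffer,y / sta $2007 / dey), the 30-copy length, and the start address are all assumptions made up for the example.

```python
# Hypothetical layout: an unrolled routine of MAX_BYTES copies of
#   lda buffer,y / sta $2007 / dey    (3 + 3 + 1 = 7 bytes per copy)
# Entering partway through skips the copies that are not needed.
UNIT_SIZE = 7
MAX_BYTES = 30

def entry_offset(count):
    """Byte offset into the unrolled routine that leaves `count` copies to run."""
    if not 0 <= count <= MAX_BYTES:
        raise ValueError("count out of range")
    return (MAX_BYTES - count) * UNIT_SIZE

# A jump table would hold routine_start + entry_offset(n) for each n.
routine_start = 0x8000   # made-up address
table = [routine_start + entry_offset(n) for n in range(MAX_BYTES + 1)]
print(hex(table[30]), hex(table[12]), hex(table[0]))
```

In a real ROM the table would be built by the assembler from labels rather than computed at run time; the point is only that each entry sits a fixed number of bytes after the previous one.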
But if your game never actually scrolls that fast, doing the work it requires is wasted effort.
Yeah, I see no point in maxing out the amount of bytes you send to VRAM each frame if you don't need that much. You could better spend the time on CHR-RAM animations and things like that.

Post by unregistered » Wed Dec 11, 2013 1:07 pm

Thank you both Kasumi and tokumaru! You two have helped me realize that scrolling 16 px per vblank will be an effort supported by a welcoming relief. We will soundly build our 16 pixel column and save part of our coloring in RAM for the next vblank. Ah ok, it is 2 pm... time to begin! :)

Post by unregistered » Fri Dec 13, 2013 7:05 pm

tepples wrote:I'd be tempted to allocate five 32-byte buffers $0100, $0120, $0140, $0160, and $0180.
Ah, I understand now why you would use 32 byte buffers... it's really confusing trying to compare three thirty-byte columns. Would have been easier to compare the buffers you suggested! :) Thank you very much tepples!! :D
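
A quick Python sketch shows why the 32-byte spacing is easier on the eyes. It lists the start address of five back-to-back buffers for both sizes; the function name is made up, and the base of $0100 is taken from the quote above.

```python
def buffer_starts(base, size, count):
    """Start addresses of `count` back-to-back buffers of `size` bytes."""
    return [base + i * size for i in range(count)]

print([f"${a:04X}" for a in buffer_starts(0x0100, 32, 5)])
print([f"${a:04X}" for a in buffer_starts(0x0100, 30, 5)])
```

With 32-byte buffers every start address ends in a round $x0, so in a hex view showing 16 bytes per row each buffer begins at the start of a row. With 30-byte buffers the starts drift to $011E, $013C, $015A, and $0178.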

---
And GRAND EXCELLENT GREATNESS!!! :D :D :mrgreen: Y'all's help has been excellent! Kasumi, I relearned how to use FCEUX's hex editor... and I had to check my buffers to see if they were correct... and they were! So thank you Kasumi!! :D MY GRAFFITI IS GONE AND IT'S RUNNING EXTRA SMOOTH WITH YOUR UNROLLED LOOP!! I bet that's the cause of all the extra graffiti... now that my loop is running within vblank time (my MAX is 1410 cycles! :)) the extra graffiti art that happened in nametable 01 while drawing nametable 00 columns is gone!! So if you are ever shown graffiti during column drawing don't worry about it... it will go away once your vblank is within the vblank limit (I'm sorry I don't know what the limit is right now) ...I'm ready to post this and take a break... you know relax! Sigh... :D

Post by Kasumi » Fri Dec 13, 2013 8:04 pm

Cool, glad you've got it sorted!
