8x16 and whatever else unreg wants to know

unregistered
Posts: 1075
Joined: Thu Apr 23, 2009 11:21 pm
Location: cypress, texas

Re: 8x16 and whatever else unreg wants to know

Post by unregistered » Sun Dec 01, 2013 9:48 pm

THANK YOU SO MUCH FOR YOUR EXPLANATIONS KASUMI! :D goodnight! :)

edit: Here's his post with great explanations. I understand almost all of it! :)
Kasumi wrote:The two pieces of code accomplish the same goal. (Though mine sets up the buffers differently. The different way would be faster for your NMI to read as well, though.) You don't need to store/restore X in goodlocation because it just stays in X. (I mean... you may still have to load it before the loop, but you no longer have to do it IN the loop.) You come out ahead because the code I added takes fewer cycles than the unneeded code I removed. (storing/restoring goodlocation)

I omitted some stuff, but the full thing would be like:

Code: Select all

  ldx #29               ; Before everything, so not during the loop. This is like goodlocation,
                        ; but we load it with #29 instead of #59 for other reasons.
loop:
  lda ($10), y          ; Originally omitted. Still needed to get the metatile index, of course.
  sty pointerposition   ; This wasn't needed before, so we're 3 cycles behind.
  tay                   ; This was needed before. It overwrites Y, which is why we stored Y above.

  ; Metatile index is in Y. Location in RAM buffer is in X.
  lda MetatileTile0, y  ; Assuming this is the top left tile
  sta RAMbuffereven, x  ; Even buffer
  lda MetatileTile1, y  ; Assuming this is the top right tile
  sta RAMbufferodd, x   ; Odd buffer
  dex                   ; Takes us to the next tile for BOTH buffers

  lda MetatileTile2, y  ; Assuming this is the bottom left tile
  sta RAMbuffereven, x  ; Even buffer
  lda MetatileTile3, y  ; Assuming this is the bottom right tile
  sta RAMbufferodd, x   ; Odd buffer

  lda pointerposition   ; Used to be tya. You lose just one cycle doing this instead,
                        ; but you gain that back by not having
                        ; ldx goodLocation and stx goodLocation (which would take 6 cycles),
                        ; because X doesn't change jobs in mine. It's always where you are in the buffer.

  clc
  adc #$10              ; increment y by 16!!!!
  tay

  dex
  bpl loop
After loading the metatile index, you did tax. Mine does this too (well... tay instead), in addition to storing the position to temp ram. That takes 3 extra cycles.

Later, you did tya because you can only add to A. Mine does lda tempRAM instead, which takes 1 cycle more than tya (if tempRAM is in zero page).

All together, I've made your metatile index transfer work another way. It takes 4 cycles extra.
I needed to move y anyway, but i didn't need to move x? :?
Right. You need X/Y for three tasks. 1. Loading from the pointer. (can only be done with Y) 2. Loading tiles from the metatile. 3. Storing the tiles to the buffer. This means either X or Y must change jobs, because two things can't do three jobs without changing. This is true for mine, and it was true for yours.

Because of how I preserved X instead of Y (which needed to change jobs in both versions because it's needed to access the pointer), I've eliminated stx goodLocation and ldx goodLocation (DURING the loop anyway), which would have taken 6 cycles. So it ends up 2 cycles faster.

But mine is also faster for other reasons related to why I did the transfers that way. I dex once for every two times you do, because I do both even and odd at once using separate buffers. I avoid storing each tile of the metatile in the very beginning of the buffer RAM, because there's no need. I have where I am in the buffer in X already when I load the metatile index in y (you load where you are in the buffer later), so they're just stored exactly where they need to be. No need for the temp stores.

It saves a lot of cycles per pass through the loop. I think 42: 4 for doing dex twice instead of four times, 9*4=36 for not doing the indexed temp stores, and 6 for not storing/restoring goodlocation, minus 4 for the things I added.

This loops 15 times, so that's 630 cycles. 630 more if you do it twice for two 16x16 columns like it seems you're planning.

All that said, I make no guarantees this will work verbatim. There may be some extra stuff you need to do before/after the loop I'm forgetting, but I can't imagine any of it not making the savings worth it.

Edit: Heck, I was being safe, but you can move the clc before the add from the loop to before the loop if the pointer is set up such that y = 0 to access the first element. Nothing in the loop changes the carry except the add, and the adds during the loop will NEVER set the carry. (You add 16 to Y 15 times, which would only make it 240. Not greater than 255, so carry would be clear throughout.) This saves another 28 cycles in total: 2*15 for not doing it in the loop, minus 2 because you still need to do it once before the loop.
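For illustration only, here is a sketch of the loop with the clc hoisted as described. It assumes Y is below 16 on entry (0 in the simplest case), so the running total never passes 255. All labels are the ones from the code above.

Code: Select all

  ldx #29
  clc                   ; once, before the loop (assumes Y < 16 on entry)
loop:
  lda ($10), y
  sty pointerposition
  tay
  lda MetatileTile0, y
  sta RAMbuffereven, x
  lda MetatileTile1, y
  sta RAMbufferodd, x
  dex
  lda MetatileTile2, y
  sta RAMbuffereven, x
  lda MetatileTile3, y
  sta RAMbufferodd, x
  lda pointerposition
  adc #$10              ; carry is still clear: nothing else in the loop touches it
  tay
  dex
  bpl loop

None of lda, sta, sty, tay, dex or bpl changes the carry flag, so the only instruction that could set it is the adc itself.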
Yes! Sweet, thanks... y must = 0 to access the first element? If I set y to zero before the loop... that won't help, right? I need to go away and think about this some more... bye.

edit2: I guess y can be guaranteed to be less than 16... that would work, right? It would be 240+15=255, so the carry would be clear because 255 is not greater than 255. :) I'm not ever going to draw the bottom half of a column... it will start near 0 each time, I think.

Post by unregistered » Mon Dec 02, 2013 6:23 pm

Ok, how do I somehow create RAMbufferW? I've got this part...

Code: Select all

;======================v
RAMbufferw0even .dsb 30
RAMbufferw1even .dsb 30
RAMbufferw0odd  .dsb 30
RAMbufferw1odd  .dsb 30
;======================^
... ok, now I want to add

Code: Select all

RAMbufferW .dsb 120
in the exact same place where those are... I've thought about this part so much... I've figured out so much :mrgreen: :mrgreen: :mrgreen: using yall's info Kasumi, 3gengames, and tepples... thank yall so much!!! :D Need to understand this blah :? help me please.

3gengames
Formerly 65024U
Posts: 2277
Joined: Sat Mar 27, 2010 12:57 pm

Post by 3gengames » Mon Dec 02, 2013 6:29 pm

You can use the first variable's name for the entire space, because it sits at the same address the new name would. Just make sure you comment "This variable is used as (whatever) in this code." or mark the label so you know. :)

Post by unregistered » Mon Dec 02, 2013 6:48 pm

THANK YOU INCREDIBLY SO MUCH 3GENGAMES!!! :D

thefox
Posts: 3141
Joined: Mon Jan 03, 2005 10:36 am
Location: Tampere, Finland

Post by thefox » Tue Dec 03, 2013 7:41 am

Rather than using a comment, it's usually a better idea to define the label so that the code is easier to read:

Code: Select all

RAMbufferW = RAMbufferw0even

Post by unregistered » Tue Dec 03, 2013 10:08 am

WOAH That's what I wanted.....!!!!! Thank you thefox!! :D
This is becoming so much fun creating labels and overextending my variable! SWEET! :D

edit: so here it is :D :mrgreen:

Code: Select all

0C6A9                           ;************************************************************************
0C6A9                           ; receives x... x selects which buffers to use: 
0C6A9                           ;  0 for RAMbufferw0even and RAMbufferw0odd
0C6A9                           ;  1 for RAMbufferw1even and RAMbufferw1odd
0C6A9                           ; destroys a x and y   ooooooooooohhhhh  nooooooooooooooooo
0C6A9                           ;************************************************************************
0C6A9                           draw_us_a_column: ;(please)
0C6A9                           
0C6A9 20 73 C6                     jsr next
0C6AC A5 3B                        lda columnLo
0C6AE C5 45                        cmp playdough
0C6B0 F0 3F                        beq +end
0C6B2 A5 3A                        lda columnHi
0C6B4 8D 06 20                     sta $2006
0C6B7 A5 3B                        lda columnLo
0C6B9 8D 06 20                     sta $2006
0C6BC                              
0C6BC 86 47                        stx t10
0C6BE                              ;start with the even column
0C6BE                           
0C6BE                              
0C6BE                              ;x should be 0 or 1.  Set it before hand, please. :)
0C6BE BC C9 DD                     ldy bufferoffsettable, x ;lda RAMbufferw0even, y
0C6C1 A2 1D                        ldx #29
0C6C3 B9 A8 05                   - lda RAMbufferW, y
0C6C6 8D 07 20                     sta $2007
0C6C9 88                           dey
0C6CA CA                           dex
0C6CB 10 F6                        bpl -
0C6CD                              
0C6CD A6 3B                        ldx columnLo
0C6CF E8                           inx
0C6D0 A5 3A                        lda columnHi
0C6D2 8D 06 20                     sta $2006
0C6D5 8E 06 20                     stx $2006
0C6D8                              ;then do the odd one
0C6D8                              
0C6D8 A6 47                        ldx t10
0C6DA E8                           inx
0C6DB E8                           inx
0C6DC                              
0C6DC BC C9 DD                     ldy bufferoffsettable, x
0C6DF A2 1D                        ldx #29
0C6E1 B9 A8 05                   - lda RAMbufferW, y
0C6E4 8D 07 20                     sta $2007
0C6E7 88                           dey
0C6E8 CA                           dex
0C6E9 10 F6                        bpl -
0C6EB                              
0C6EB                           +complete
0C6EB A5 3B                        lda columnLo
0C6ED 85 45                        sta playdough
0C6EF                              
0C6EF E6 30                        inc valid_left
0C6F1                           
0C6F1 60                        +end: rts; end of draw_us_a_column
lunch time :)
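
One way the pieces above could fit together, as a sketch only. The $05A8 start address is read off the listing (lda RAMbufferW, y assembles to B9 A8 05). The bufferoffsettable values are a guess: since the loops count down, each entry is assumed to be the index of the last byte of its 30-byte buffer.

Code: Select all

.enum $05A8
RAMbufferw0even .dsb 30   ; RAMbufferW+0  .. RAMbufferW+29
RAMbufferw1even .dsb 30   ; RAMbufferW+30 .. RAMbufferW+59
RAMbufferw0odd  .dsb 30   ; RAMbufferW+60 .. RAMbufferW+89
RAMbufferw1odd  .dsb 30   ; RAMbufferW+90 .. RAMbufferW+119
.ende

RAMbufferW = RAMbufferw0even  ; one name for the whole 120 bytes

bufferoffsettable:
  .db 29, 59, 89, 119     ; assumed: last index of w0even, w1even, w0odd, w1odd

With that layout, x = 0 picks RAMbufferw0even and x + 2 picks RAMbufferw0odd, while x = 1 picks RAMbufferw1even and x + 2 picks RAMbufferw1odd, which matches the comment at the top of draw_us_a_column.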

Post by unregistered » Tue Dec 03, 2013 2:46 pm

unregistered wrote:Wow! Once my game becomes alive... the Max part in Nintendulator debug says 2872... which is no good cause you said
Kasumi wrote:If all the input is now good (i.e. You verified the RAM now has the correct 60 values, and you can't find issues with the code that reads from it) and you're getting bad output, it's possible you're trying to write to $2007 when rendering has begun. ~2270 cycles before rendering begins after the NMI. ~513 of those are eaten by sprite DMA. Anything from the start of the NMI to the last write to $2006/$2007 should happen in less.

Code: Select all

                     vblank: sta $401E
                            pha
               
                            tya
                            pha
                            txa
                            pha
*********SNIP*************
                      SkipUpdates: sta $401F
so 2270-513 = 1757, which is well under 2000. Thank you for your example code... it helped me realize that my vblank is overused. :shock: Need to spend some time on using it less. :?: I am going to reread your reply. :) edit: Ok I'm going to think more about your simple question. It seems like a hard question right now. Ok it's supper time! Goodnight and thank you for all of this theory! :D
Kasumi, now my NintendulatorDX Max says 2776... but average says 1630.26.... this still is much better than 2872. Maybe it would be better to just have draw_us_a_column and draw_us_another_column... both have same code with either (RAMbufferw1even and RAMbufferw1odd) or ( RAMbufferw0even and RAMbufferw0odd). Then my loops would be quicker by 2 cycles each time through, right? :? :)

edit: Average is at 1034... then it keeps rising... now at 2409! :shock:
edit2: it always keeps rising.. now at 2626 max is still 2776. :)
last edit: Average starts out at like 609 when my game comes alive!! :shock: :D That's with no changes; the code is exactly the same as above.

Kasumi
Posts: 1292
Joined: Wed Apr 02, 2008 2:09 pm

Post by Kasumi » Tue Dec 03, 2013 5:54 pm

609 is probably when you are only updating sprites. Anything else is when you're updating tiles+sprites, which yeah, would result in an average in between. And if you never move so that new tiles scroll, the average would keep getting lower because no frames with tile updates are happening. And vice versa.

Focus only on max, which is still way too high. It's over the limit and you haven't even added attribute stuff. Benchmark each part of your NMI.

Code: Select all

sta $401E
jsr draw_us_a_column
sta $401F
How many cycles is that taking? If you're doing it twice, how much does each take? That's a way you can find out what to optimize. Maybe your loops are fine, but the next routine that's run inside draw_us_a_column is really slow or something. Benchmark stuff to find out what's eating all your cycles.

And remember, you can prepare everything outside of the vblank. You can do what next does (prepare columnhi and columnlo) outside of vblank, store it in RAM guaranteed to not be used by anything else, then have the NMI read the decision rather than have the NMI MAKE the decision. Anytime you can avoid your NMI making a decision (before screen updates are finished at least), you should do it.

Things your NMI needs to do:
Store registers A, X and Y so your program doesn't break upon return.
Decide whether or not the tile buffers are READY to update, so it can skip them if not.
Any writes to $2006 or $2007.
Sprite DMA.
Scroll Writes. (Well... sorta)
Restore registers A, X, and Y.

Anything not in the above list, including the preparation of said buffers can probably be moved out.
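
As a rough skeleton, that list could translate into an NMI shaped like the one below. This is a sketch, not a drop-in handler: columnReady, columnBuffer, scrollX and scrollY are made-up names for variables the main code would have to maintain, and sprite is assumed to be the label of the sprite page.

Code: Select all

vblank:
  pha                   ; store A, X and Y
  tya
  pha
  txa
  pha

  lda columnReady       ; hypothetical flag: nonzero once the main code has filled the buffers
  beq +
  ldx columnBuffer      ; hypothetical: 0 or 1, also decided outside the NMI
  jsr draw_us_a_column  ; all the $2006/$2007 writes
  lda #0
  sta columnReady       ; buffers have been used up
+ lda #0
  sta $2003
  lda #>sprite
  sta $4014             ; sprite DMA

  lda scrollX           ; scroll writes (a $2000 write may be needed as well)
  sta $2005
  lda scrollY
  sta $2005

  pla                   ; restore X, Y and A in reverse order
  tax
  pla
  tay
  pla
  rti

The main code would set columnReady only after the last buffer byte is written, so the NMI never sends a half-filled buffer to the PPU.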

Here's a thought: Instead of having your buffers contain JUST the tiles, why not have them contain the address too? So now they're 32 bytes instead of 30. And you do this above the loop:

Code: Select all

   ldy bufferoffsettable, x ;(Now each entry is a multiple of 32, minus 1, rather than a multiple of 30, minus 1.)
   lda RAMbufferW, y
   sta $2006
   dey
   lda RAMbufferW, y
   sta $2006
   dey
   ldx #29
 - lda RAMbufferW, y
   sta $2007
Then you have no need for columnhi or columnlo and next doesn't need to be run in the NMI. (You still need to prepare the addresses with the buffer OUTSIDE the NMI, of course.)
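
Outside the NMI, filling in the two address bytes might look like the sketch below. It assumes the first buffer is now 32 bytes with the address stored above the 30 tiles. The high byte sits at the very top because the NMI code reads downward and $2006 wants the high byte first.

Code: Select all

  lda columnHi
  sta RAMbufferW+31   ; read first by the NMI, so this is the high byte of the PPU address
  lda columnLo
  sta RAMbufferW+30   ; read second, so this is the low byte
  ; the 30 tiles still go in RAMbufferW+0 .. RAMbufferW+29

For the other buffers the same two stores would just use offsets 32 bytes further along.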

One other really tiny thing that won't save you a lot, but is worth seeing.

Code: Select all

jsr update_sprite
This adds 12 cycles and 3 bytes to your NMI, as opposed to just putting sprite update stuff directly in.

Like so:

Code: Select all

	ldy #$00	
	sty $2003

	lda #;Whatever page you're using for sprite updates
	sta $4014
I understand it's sometimes nice to have a 1 line thing rather than more, but it costs you time and space if the subroutine is called from just that one place. If you really like the 1 line thing, you can still do that by making update_sprite or anything else like it a macro.
Maybe it would be better to just have draw_us_a_column and draw_us_another_column...
...
Then my loops would be quicker by 2 cycles each time through, right? :? :)
Really there's no shame in doubling up code sometimes. And yes, you'd save 2 cycles per pass through each loop, but always be careful! Sometimes the cycles you save will be overcome by the additional setup you need to do. (Probably not in this case, but think about it.)

Code: Select all

	lda RAMbufferW, y
0C6C6 8D 07 20                     sta $2007
0C6C9 88                           dey
0C6CA CA                           dex
0C6CB 10 F6                        bpl -
This is 10 bytes. If you had one of these loops for each of the buffers, it's not a huge hit. And like you said, you wouldn't need both dex and dey (because both would always be the same value between 0 and 29), so each would only really be 9 bytes. (A small bit more logic at the beginning to select which piece of the quadrupled up code you want, though.)
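
A sketch of one such dedicated loop, assuming it only ever handles RAMbufferw0even. With the buffer fixed, X can be both the index and the counter.

Code: Select all

  ldx #29
- lda RAMbufferw0even, x  ; 3 bytes
  sta $2007               ; 3 bytes
  dex                     ; 1 byte
  bpl -                   ; 2 bytes, so 9 for the loop itself

Each pass is also 2 cycles shorter than the two-register version, because one decrement is gone.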

I quadrupled a larger routine (42 bytes, so 168...) for the sole reason I save a compare at the end of the loops. Didn't think twice about it. That said, I could probably use a pointer and come out VERY ahead on the current 168 bytes, and only slightly behind on cycles.

Anyway, what to do in optimization is a choice for you. I focus on speed at the expense of size 9 times out of 10. Now I'm out of space in my ROM and have to dial things back. :lol: Sometimes fortune smiles on you and you find a way to make something both smaller and faster, but generally when you're making a thing faster, you're making it bigger. When you're making it smaller, you're making it slower.

See unrolled loops vs rolled loops, quadrupled code vs not, huge tables vs loops. You can find a middle ground most of the time, with semi unrolled loops, doubled up code with some extra stuff to setup each version, smaller tables + loops that don't run as many times...
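
As an illustration of that middle ground, here is a semi-unrolled version of a 30-byte copy loop. It is a sketch that assumes a fixed buffer with X as the only index. Two tiles go out per pass, so the branch runs 15 times instead of 30.

Code: Select all

  ldx #29
- lda RAMbufferw0even, x
  sta $2007
  dex
  lda RAMbufferw0even, x
  sta $2007
  dex
  bpl -                   ; falls through once X wraps below 0

The loop grows from 9 bytes to 16, and skipping 15 taken branches saves roughly 45 cycles. It only works as written because 30 is even, so X always wraps on the second dex.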

Your NMI routine doesn't necessarily need to be AS FAST AS POSSIBLE, but right now it's not fast enough. And even the 60 or so cycles you'd save by doubling up the code doesn't get you anywhere near where you need to be. And I have no idea what's taking up all the time, so find out!

Post by unregistered » Tue Dec 03, 2013 6:53 pm

Kasumi wrote:609 is probably when you are only updating sprites. Anything else is when you're updating tiles+sprites, which yeah, would result in an average in between. And if you never move so that new tiles scroll, the average would keep getting lower because no frames with tile updates are happening. And vice versa.
I've only read this so far and I want to say that 609 is after my game comes alive... after 2 16-pixel-wide columns are drawn my screen said Avg 609. That's very impressive!!! Thank you so much Kasumi and some others... 3gengames, tokumaru, tepples, thefox, and qbradq!! :D And... ...well secondly, after coming alive, my game's average grows, grows, and grows. That's all it does... there aren't any tiles scrolling... I'm just standing there after 2 16-pixel-wide columns are drawn and it grows! Up up up up up up up and up. It almost got higher than my new lower Max of around 2776... but I ended it. Ok that's all I can say now; I'm going to be viewing Naruto Shippuden. Bye. :)

edit: It's Wednesday afternoon now. Ok I'm going to find out what is taking up all this time; thank you, Kasumi, you bless me with all this knowledge! :D There's past words that I'm going to try too. :)

Post by unregistered » Wed Dec 04, 2013 4:46 pm

Kasumi wrote:One other really tiny thing that won't save you a lot, but is worth seeing.

Code: Select all

jsr update_sprite
This adds 12 cycles and 3 bytes to your NMI, as opposed to just putting sprite update stuff directly in.

Like so:

Code: Select all

	ldy #$00	
	sty $2003

	lda #;Whatever page you're using for sprite updates
	sta $4014
I understand it's sometimes nice to have a 1 line thing rather than more, but it costs you time and space if the subroutine is called from just that one place. If you really like the 1 line thing, you can still do that by making update_sprite or anything else like it a macro.

I'm trying to make update_sprite a macro...

Code: Select all

.MACRO update_sprite
        lda #>sprite
        sta $4014       ; OAM_DMA register: jam sprite page ($200-$2FF) into SPR-RAM.
                        ; Takes 513 cycles.
.ENDM
but it says "Label already defined." I guess the code thinks my other "update_sprite" is a label instead of recognizing it's a macro? :?

Post by Kasumi » Wed Dec 04, 2013 4:52 pm

unregistered wrote: but it says "Label already defined." I guess the code thinks my other "update_sprite" is a label instead of recognizing it's a macro?
Sounds right. You can't have both a macro called update_sprite and a label called update_sprite. Just get rid of the version with the label, you don't need it when you have the macro.

Got a text editor that can find all instances of a thing in a directory? Search for it if it's not where you think.

Edit: Or wait, how are you calling the macro? Not sure how asm6 works, but you need to have space before it or it will think it's a label.

Code: Select all

update_sprite;Is different than
	update_sprite;
;at least in nesasm. 

Post by unregistered » Wed Dec 04, 2013 5:03 pm

Kasumi wrote:Sounds right. You can't have both a macro called update_sprite and a label called update_sprite. Just get rid of the version with the label, you don't need it when you have the macro.

Got a text editor that can find all instances of a thing in a directory? Search for it if it's not where you think.
but it's my method call...

Code: Select all

MACRO / ENDM

        MACRO name args...

        Define a macro.  Macro arguments are comma separated.
        Labels defined inside macros are local (visible only to that macro).

                MACRO setAXY x,y,z
                    LDA #x
                    LDX #y
                    LDY #z
                ENDM

                setAXY $12,$34,$56
                        ;expands to LDA #$12
                        ;           LDX #$34
                        ;           LDY #$56
That's what README.TXT says. My macro doesn't have any arguments... so when I call it it is just

Code: Select all

update_sprite
edit: there are two spaces in front of update_sprite... I guess asm6 has a problem with an argument-less macro call... I bet tokumaru would know.

Post by Kasumi » Wed Dec 04, 2013 5:17 pm

Define your macro before you use it.

Code: Select all

 

macro mario
	lda #$00
endm

mario
works
and

Code: Select all

mario

macro mario
	lda #$00
endm
doesn't.

If it ain't that, I have NO idea.

edit: Anyway, typically when I do things like this (put a bunch of code into a macro instead for one line convenience), I put the hide-code macros in a different file and include that file at the top of the asm file that uses it.
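
For example, with asm6's include directive (the file name macros.asm is made up for this sketch, and sprite is assumed to be the sprite page label):

Code: Select all

; ---- macros.asm ----
macro update_sprite
  lda #>sprite
  sta $4014
endm

; ---- top of the main .asm file, before anything uses the macro ----
include macros.asm

Putting the include at the very top also takes care of the define-before-use rule, since every macro is known before the first line that calls one.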

Post by unregistered » Wed Dec 04, 2013 5:38 pm

That was it! Thank you Kasumi! :D

edit: Your edit is a great idea!

Post by unregistered » Fri Dec 06, 2013 11:17 am

Kasumi wrote:Focus only on max, which is still way too high. It's over the limit and you haven't even added attribute stuff. Benchmark each part of your NMI.

Code: Select all

sta $401E
jsr draw_us_a_column
sta $401F
How many cycles is that taking? If you're doing it twice, how much does each take?
The top one's Max is 1051 cycles. And the lower one's Max is 1068 cycles. Is that too much? I guess it is. :( I'll continue benchmarking each part of draw_us_a_column after lunch.

edit: ok... each of these loops runs 30 times

Code: Select all

 sta $401e
 - lda RAMbufferW, y
   sta PPUDATA7
   dey
   dex
   bpl -
sta $401f
The upper loop runs 449 cycles and the lower loop runs 479 cycles, I guess because it crosses a page boundary. The code in between runs 29 cycles once. So that's a Max of 449 + 479 + 29 = 957 cycles for the two loops and the in-between. That's close to 1051 and 1068... :?
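
Those two numbers line up with the standard 6502 timings. The breakdown below is a sketch that assumes PPUDATA7 is $2007, that RAMbufferW is still at $05A8 as in the earlier listing, and that the four 30-byte buffers sit in the order they were declared.

Code: Select all

- lda RAMbufferW, y   ; 4 cycles, 5 when RAMbufferW+y lands in the next page
  sta PPUDATA7        ; 4
  dey                 ; 2
  dex                 ; 2
  bpl -               ; 3 when taken, 2 on the last pass
; no page cross on any pass: 30 * 15 - 1 = 449
; page cross on every pass:  30 * 16 - 1 = 479
; $05A8 + y reaches $0600 at y = 88, so indexes 90-119 (RAMbufferw1odd) all cross

If that is what is happening, keeping all 120 bytes inside a single page would bring both loops down to 449 cycles.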
