All times are UTC - 7 hours





Post new topic Reply to topic  [ 29 posts ]  Go to page 1, 2  Next
Author Message
Posted: Fri Jan 12, 2018 10:28 am

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
So, after some profiling I've realized that a lot of my precious CPU time is being spent on calculating sprites. Not just the reads and writes, but also all the meta stuff: picking the right palette, doing offsets to read out animation data, flipping the sprite if it's turned around, etc. I've got a neat animation system that I'm very happy with, but it's a little costly.

I've been optimizing all of this to be as fast and clever as possible, but the one thing that really irks me is that most objects' sprites end up being exactly the same each frame, so I'm constantly recalculating the same results over and over.

So I've been trying to implement some sort of caching mechanism, so that I don't have to recalculate everything if nothing has changed. I usually know when things have changed (changed object state, scrolled the background, etc) so I know exactly when to reuse the cache and when to refresh it.

But building an efficient and lightweight cache has proved difficult.

The simplest and fastest way would be to simply reuse my DMA's Sprite RAM, and not clear it every frame, maybe adjusting some x and y positions if the game has scrolled. But all the sprite flickering techniques I know involve scrambling the order of the sprites every frame, which means no object ever gets the same Sprite RAM position twice in a row. This ruins everything.

Next up I tried allocating some more RAM as a temporary buffer, so that objects could put their sprite data there, and then it could be copy-scrambled over to the real Sprite RAM right before the DMA. But, to allow for all 64 sprites that's 256 bytes of RAM down the drain. Ouch. Not sure I want to spend that much memory.

According to the wiki, a "simple OAM cycling technique" can be implemented by using a write to OAMADDR before the DMA transfer. However, due to OAMADDR writes also having a "corruption" effect this technique is not recommended. Also, if the technique works like how I think it does, the OAM cycling would be very crude and might leave objects invisible for several frames.

So, is my quest impossible, or are there some other ideas or techniques? :D


Posted: Fri Jan 12, 2018 11:03 am

Joined: Thu Mar 31, 2016 11:15 am
Posts: 335
Post some code? It's surprising to me that sprites would be so expensive.

Quote:
But all the sprite flickering techniques I know involve scrambling the order of the sprites every frame

I've never implemented flickering, but I thought you could write OAMADDR before OAMDMA to do the shuffle on the DMA. I don't think you have to reorder the sprites in CPU RAM, but maybe I'm wrong.


Posted: Fri Jan 12, 2018 11:14 am

Joined: Sun Nov 09, 2008 9:18 pm
Posts: 1101
Location: Pennsylvania, USA
Metasprite rendering continues to be the bottleneck for me, too. There are some improvements I could make, but the biggest gain so far came from simply moving to 8x16 sprites, halving the number of iterations the metasprite drawing routines must do. That's been good enough for now and has given me the performance I want for the game I'm building.


Posted: Fri Jan 12, 2018 11:18 am

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
pubby wrote:
Post some code? It's surprising to me that sprites would be so expensive.

I'll write up an explanation, it's a tad complex so it might take some minutes.

Quote:
I've never implemented flickering, but I thought you could write OAMADDR before OAMDMA to do the shuffle on the DMA. I don't think you have to reorder the sprites in CPU RAM, but maybe I'm wrong.


The wiki recommends against this technique, but maybe it's too conservative? Does anybody know the ups and downs in more detail?


Posted: Fri Jan 12, 2018 11:19 am

Joined: Fri Nov 12, 2004 2:49 pm
Posts: 7471
Location: Chexbres, VD, Switzerland
Quote:
But all the sprite flickering techniques I know involve scrambling the order of the sprites every frame, which means no object ever gets the same Sprite RAM position twice in a row. This ruins everything.

Well, as usual in computing and in particular in retro-computing, you have to sacrifice things in order to get the desired features. You should just have two OAM pages: one where the sprites are not shuffled, which is your cache, and one where you shuffle the sprites from the cache so they're re-ordered and flicker properly when there are more than 8 per scanline instead of disappearing. That sounds rather simple to do.

Quote:
Ouch. Not sure I want to spend that much memory.

Well, that's the price for your sprite caching system. You can save memory by caching only some of the 4 parameters if RAM usage is really that much of a problem.
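
A minimal sketch of that copy pass, to make the cost concrete. It assumes a 256-byte unshuffled cache at a hypothetical page-aligned address OAMcache, the DMA page at OAM, and a hypothetical zero-page byte oamstart; none of these names come from the thread. It only rotates the starting slot each frame rather than doing a full shuffle:
Code:
shufflecopy:
   lda <oamstart       ;Destination offset used last frame
   clc
   adc #68             ;Advance 17 slots (17*4 bytes), 8-bit wrap keeps it a multiple of 4
   sta <oamstart
   tax                 ;X = destination offset in the DMA page
   ldy #0              ;Y = source offset in the cache
shufflecopy.loop:
   lda OAMcache,y
   sta OAM,x
   inx                 ;X wraps around the page by itself
   iny
   bne shufflecopy.loop;All 256 bytes are copied once Y wraps to 0
   rts

By my count this rolled form costs roughly 16 cycles per byte, so on the order of 4,000 cycles per frame; unrolling it or copying only the slots in use would cut that down.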


Posted: Fri Jan 12, 2018 11:24 am

Joined: Wed Apr 02, 2008 2:09 pm
Posts: 1175
Sprite updates are indeed pretty expensive. Some switch to 8x16 sprites purely so less time is spent rendering an object.

If your game has larger objects, having a separate render routine when you know the object is entirely onscreen can skip a lot of extra logic for checking offscreen per sprite. (Alternatively... don't check for offscreen per sprite.)

Code:
.macro DMSNORMALBODY
   ;35 bytes
   iny
   
   lda [reserved4],y;Y position
   clc
   adc <reserved2
   sta OAM,x

   iny
   
   lda [reserved4],y;X Position
   ;clc
   adc <reserved0
   sta OAM+3,x
   
   iny

   
   lda [reserved4],y
   ;clc
   adc <reserved7;This should guarantee a clear carry
   sta OAM+1,x
   iny
   
   lda [reserved4],y
   sta OAM+2,x
   
   txa
   ;clc;guaranteed clear above
   adc #4
   tax;Carry not guaranteed anything after that add since oampos wraps
   .endm

Reserved0 is low X, reserved2 is low Y, reserved7 is a tile offset. You can totally get rid of that if the tiles used to render your object don't "move" in CHR.

Checking offscreen is not much harder, but you lose the guarantee of the clear carry.
Code:
dms.partial.o.loop:;{
   iny
   
   lda [reserved4],y;Y position
   clc
   adc <reserved2
   sta OAM,x
   
   lda <reserved3
   adc #$00
   bne dms.partial.o.yoffscreen

   iny
   
   lda [reserved4],y;X Position
   clc
   adc <reserved0
   sta OAM+3,x
   
   lda <reserved1
   adc #$00
   bne dms.partial.o.skipsprite.twoiny
   
   iny

   
   lda [reserved4],y
   clc
   adc <reserved7
   sta OAM+1,x
   iny
   
   lda [reserved4],y
   and #%11111100;Clear out the palette
   ora <reserved8
   sta OAM+2,x
   
   txa
   ;clc;guaranteed clear above
   adc #4
   tax;Carry not guaranteed anything after that add since oampos wraps

   dec <reserved6
   bne dms.partial.o.loop
dms.partial.o.end:
   rts
dms.partial.o.yoffscreen:
   iny;move to next OAM entry
dms.partial.o.skipsprite.twoiny:
   iny
   iny
   
   lda #$FF
   sta OAM,x;Set it offscreen

   ;Just repeat what we'd branch to and save a branch
   dec <reserved6
   bne dms.partial.o.loop
   rts;}

Reserved1 is high X, reserved3 is high Y. Reserved6 is how many sprites are left.

But I made this note as an optimization (untested, so I'm not including it with the code I know works)
Quote:
Indivisible's rolled drawmetasprite loop can probably be made faster. They end with

dec <reserved6
bne dms.o.loop

But:
cpy <reserved6; (or some other zero page variable, since reserved6 can't really be changed
bne dms.o.loop; without making the other code slower)

cpy is 2 cycles faster than dec, but it also ensures a clear carry when the loop begins again.

Basically the setup code should just add <reserved6 *4 to y and store it somewhere. I imagine the reason I didn't is because reserved6 is technically variable (due to the greater than 64 sprite stuff), but it wouldn't really affect this if the loop were set up properly.
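
A minimal and equally untested sketch of that setup, assuming a hypothetical zero-page variable dmsend, that Y already holds the starting data offset, and that the start offset plus four bytes per sprite stays below 256:
Code:
   lda <reserved6      ;Number of sprites to draw (assumed below 64)
   asl a
   asl a               ;*4 bytes per sprite, carry is clear for counts below 64
   sty <dmsend         ;Park the starting Y so it can be added
   adc <dmsend         ;A = starting Y + count*4
   sta <dmsend         ;The loop would then end with: cpy <dmsend / bne dms.o.loop

Since every path through the loop advances Y by exactly four, Y equals dmsend only after the last sprite, and the cpy leaves the carry clear on every earlier pass.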


I also have a separate subroutine when I want to do "versatile" things like dynamically changing the palette of every sprite in the object.
Edit: Oh wait, no it's just the one above. That's what the reserved8 thing is. So basically I have a fast and a slow one.

Basically I recommend having different routines for every case. Usually you don't want to do anything advanced, so at least have one for the fastest possible case. (Guaranteed on screen, no dynamic anything.)

But post some of your advanced code, maybe we can improve it.

_________________
https://kasumi.itch.io/indivisible


Last edited by Kasumi on Fri Jan 12, 2018 11:34 am, edited 1 time in total.

Posted: Fri Jan 12, 2018 11:29 am

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10714
Location: Rio de Janeiro - Brazil
The use of OAMADDR with values other than zero is heavily discouraged, since that can result in sprite corruption.

The best method for caching sprites I can think of is indeed using another 256 bytes for a second OAM shadow, so you can alternate between them every frame and copy data from one to the other if the sprites are known to not have changed.

What kind of sprite cycling method are you currently using? Are you willing to change that to accommodate the sprite caching? Maybe you can come up with a solution that swaps individual OAM entries when they need to be kept, and simply overwrites the ones that don't.


Posted: Fri Jan 12, 2018 11:30 am

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
Bregalad wrote:
Quote:
Ouch. Not sure I want to spend that much memory.

Well, that's the price for your sprite caching system. You can save memory by caching only some of the 4 parameters if RAM usage is really that much of a problem.


That's fair enough, and it's probably what I will fall back to if no other secret technique pops up. I've been playing around with doing an "in place" shuffle of the original Sprite RAM so that objects have a new position in the buffer every frame, yet still retain their old values. The shuffling process is fairly expensive though, since you gotta shuffle 256 different values.


Posted: Fri Jan 12, 2018 11:33 am

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
pubby wrote:
Post some code? It's surprising to me that sprites would be so expensive.


Now, I haven't posted any code in this post. It's not like my code is secret, but I'd much rather explain what it does (and why it's slow) instead of posting a big blob of asm and forcing everybody to decrypt what's going on. Also, I haven't commented it yet :oops:

Some games like Super Mario Bros 3 have a lot of restrictions on what game objects can exist where, doing tricks like hardcoding the available palettes and CHR banks to the level. So if this is a Goomba+Koopa Troopa level, you simply can't use the Boo or Thwomp enemies or they will look strange and miscolored, and vice versa.

I've been working on a system to defeat such restrictions, by dynamically loading and unloading CHR and palettes as they are needed. The way things work, when an ingame object is created, it attempts to grab an 8k sprite CHR page and a palette for itself. I use a lot of techniques to maximize their potential like reusing as much as possible, having optional alternative graphics and color schemes, and even splitting palettes in two (and if it's just utterly impossible to fit in, the object simply despawns before it's seen).

But all of this only happens when the object is created, not every frame, so it's not the expensive part. The point is, any object might end up with any of the CHR pages or any of the palettes. So, while SMB3 can optimize its Koopa Troopa drawing routine by always referring to palette 2, my system has to do a lookup to see which of the four palettes my object was assigned.

Then there is animation data, and some extra goodies I've baked in there like allowing small x/y offsets on the sprites or vflip/hflip flags on individual sprites in the meta-sprite. Despite having so much more stuff than SMB3, my system is still faster due to better coding.

Still, I can see the potential for massive gains by reusing those values rather than having to recalculate them every frame.


Last edited by Drakim on Fri Jan 12, 2018 11:40 am, edited 1 time in total.

Posted: Fri Jan 12, 2018 11:38 am

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
Kasumi wrote:
Sprite updates are indeed pretty expensive...


Thanks for the routines, I'll compare them to my own and see if there are any places I can shave off some cycles.

I do indeed have different drawing routines, with varying levels of functionality. Objects, in their initialization routine, pick one that they know will be enough for them.


Posted: Fri Jan 12, 2018 11:39 am

Joined: Wed Apr 02, 2008 2:09 pm
Posts: 1175
Quote:
Then there is animation data, and some extra goodies I've baked in there like allowing small x/y offsets on the sprites or vflip/hflip flags on individual sprites in the meta-sprite.

That shouldn't really affect rendering at all. Whether an individual sprite is flipped or not doesn't matter to the block copy, whether an individual sprite is offset a little doesn't matter to the block copy.

Is the issue that you're also trying to save space? I stored every frame twice, once flipped and once not, rather than flipping it at runtime and I don't feel bad about it.
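
Picking between the two copies can be as cheap as choosing a pointer table before rendering. A minimal sketch, not the code from my game: frame, facing and the four frames.* tables are hypothetical names, and reserved4 is the data pointer the render loops read through:
Code:
setframe:
   ldx <frame              ;Hypothetical: current animation frame index
   lda <facing             ;Hypothetical: 0 = facing right, nonzero = facing left
   bne setframe.left
   lda frames.right.lo,x   ;Pointer to the unflipped copy of this frame
   sta <reserved4
   lda frames.right.hi,x
   sta <reserved4+1
   rts
setframe.left:
   lda frames.left.lo,x    ;Pointer to the pre-flipped copy of this frame
   sta <reserved4
   lda frames.left.hi,x
   sta <reserved4+1
   rts

That runs once per object instead of an XOR per sprite, at the price of storing each frame's data twice in ROM.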



Posted: Fri Jan 12, 2018 11:47 am

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
Kasumi wrote:
That shouldn't really affect rendering at all. Whether an individual sprite is flipped or not doesn't matter to the block copy, whether an individual sprite is offset a little doesn't matter to the block copy.


The thing is, I don't block copy my animation data right to the sprite buffer, I do stuff like XOR the global flip flags with the individual flip flags, and add the global x/y coordinate with the sprite's local x/y offset. I also have to fish out the correct palette since it's not hardcoded.

Quote:
Is the issue that you're also trying to save space? I stored every frame twice, once flipped and once not, rather than flipping it at runtime and I don't feel bad about it.


Huh....I hadn't thought about that. That's genius! It totally saves me the XOR of the flip flags. I could even use a macro, and potentially do it for other things too. Thanks mate!


Posted: Fri Jan 12, 2018 12:03 pm

Joined: Wed Apr 02, 2008 2:09 pm
Posts: 1175
Quote:
I also have to fish out the correct palette since it's not hardcoded.

The palette thing is easy, if the object only uses one palette. (Which, if you're dynamically allocating palettes is probably the common case) It's one instruction:
Code:
lda [reserved4],y
ora SPRpalette
sta OAM+2,x

You store the palette the object wants to SPRpalette before rendering and lose 3 cycles per sprite, oh well.

In a case where palette 0 is like... a shared palette (reserved for player one or something, that enemies can also use)... you could maybe get cute with the data and store it shifted right one bit.

Now the highest bit is free to use as a flag for that.
Essentially:
Bit 7: Use Palette 0
Bit 6: Flip Sprite Vertically
Bit 5: Flip Sprite Horizontally
etc.
Code:
lda [reserved4],y
asl a;Whether to use palette 0 is now in the carry, flip sprite vertically is in bit 7 where OAM expects it, etc.
bcs storepalette;If the high bit was set, we use palette zero
ora SPRpalette
storepalette:
sta OAM+2,x

Admittedly that's still a bit heavy 64 times, but well...
Edit: Just to say it, I'm not sure how much help caching sprite data will be in a game that scrolls. But I'll think about it.



Posted: Fri Jan 12, 2018 12:08 pm

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
Kasumi wrote:
The palette thing is easy, if the object only uses one palette. (Which, if you're dynamically allocating palettes is probably the common case) It's one instruction.


I am saving the palette per sprite per frame, for the entire metasprite, but in hindsight it's just as you say, maybe 90% of all metasprites use only one palette. I'll make a faster drawing routine they can use instead of the full one, that doesn't need to look up the palette byte each time. :beer:

Quote:
Edit: Just to say it, I'm not sure how much help caching sprite data will be in a game that scrolls. But I'll think about it.


That's a good point, but I think it could be worked around since scrolling most of the time only happens on one axis even for games with multi-directional scrolling like SMB3. You'd simply have to loop over every 4th byte and do an addition :D The carry flag will tell you if the sprite is now off-screen.

But even if it's not viable, just being able to reuse the pattern and attribute bytes would still be a boon.
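
A rough sketch of what I mean, for the case where the camera moved right so every sprite moves left (which makes it a subtraction, and the borrow is the off-screen signal). OAMcache and scrolldx are hypothetical names for the cached sprite page and the number of pixels scrolled this frame:
Code:
scrollcache:
   ldx #0
scrollcache.loop:
   lda OAMcache+3,x     ;Cached X position
   sec
   sbc <scrolldx        ;Move the sprite left by the scroll amount
   sta OAMcache+3,x
   bcs scrollcache.next ;No borrow, the sprite is still at X >= 0
   lda #$FF
   sta OAMcache,x       ;Borrow, it crossed the left edge, so park it offscreen
scrollcache.next:
   inx
   inx
   inx
   inx
   bne scrollcache.loop ;X wraps to 0 after all 64 entries
   rts

Parking a sprite overwrites its cached Y, and sprites entering from the right edge aren't handled, so anything crossing a screen edge would still need a full redraw.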


Posted: Fri Jan 12, 2018 12:52 pm

Joined: Wed May 19, 2010 6:12 pm
Posts: 2731
For each object, do you have a register holding the palette bits?

