Sprite Drawing Function Performance

Goose2k
Posts: 107
Joined: Wed Dec 11, 2019 9:38 pm

Sprite Drawing Function Performance

Post by Goose2k » Wed Jun 24, 2020 10:55 pm

I am starting to profile my game a bit, as I am hitting cases where my update is not finishing within the frame.

During this process I found that my sprite drawing routine takes nearly half of my frame. Considering that it does nothing but update OAM sprites (the logic to calculate final positions etc. is done elsewhere), this seems really expensive, especially considering I am using half the actual hardware limit of sprites.

Anyway, I just wanted to gauge whether this is normal or whether I am totally out of whack. If it's the latter, can you provide any general ideas on how best to update OAM sprites to avoid such large frame times?

My current attempt just loops through all the objects in the game and draws each 8x8 sprite relative to its world position.

void draw_gameplay_sprites(void)
{
	static unsigned char start_x;
	static unsigned char start_y;
	static unsigned char ix;
	static unsigned char iy;
	static unsigned int t;
	static unsigned char bit;
	static unsigned char row_status;
	static const unsigned char OOB_TOP = (BOARD_START_Y_PX + (BOARD_OOB_END << 3));
POKE(0x2001,0x5f); // green
	// clear all sprites from sprite buffer
	oam_clear();
POKE(0x2001,0x9f); // blue

	start_x = (cur_block.x << 3) + BOARD_START_X_PX;
	start_y = (cur_block.y << 3) + BOARD_START_Y_PX;

	// 255 means hide.
	if (cur_block.y != 255)
	{
		for (iy = 0; iy < 4; ++iy)
		{	
			for (ix = 0; ix < 4; ++ix)
			{
				// essentially an index into a bit array.
				bit = ((iy * 4) + (ix & 3)); // &3 = %4

				if (cur_cluster.layout & (0x8000 >> bit))
				{
					// Don't draw the current cluster if it is above the top of the board.
					// We want it to be able to function and move up there, but should not
					// be visible.
					if (start_y + (iy << 3) > OOB_TOP)
					{
						oam_spr(start_x + (ix << 3), start_y + (iy << 3), cur_cluster.sprite, 0);
					}
				}
			}
		}
	}

	start_x = 15 << 3;
	start_y = 1 << 3;

	for (iy = 0; iy < 4; ++iy)
	{	
		for (ix = 0; ix < 4; ++ix)
		{
			// essentially an index into a bit array.
			bit = ((iy * 4) + (ix & 3)); // &3 = %4

			if (next_cluster.layout & (0x8000 >> bit))
			{
				oam_spr(start_x + (ix << 3), start_y + (iy << 3), next_cluster.sprite, 0);
			}
		}
	}

POKE(0x2001,0x1f); // white

	// Loop through the attack columns and draw the off board portion as sprites.
	for (ix = 0; ix < BOARD_WIDTH; ++ix)
	{
		row_status = attack_row_status[ix];
		if (row_status > 0)
		{
			for (iy = 0; iy < row_status /*&& iy < ATTACK_QUEUE_SIZE*/; ++iy)
			{
				// gross. Try to detect if this is the last piece, and also the end of the arm.
				if (iy == row_status - 1)
				{
				oam_spr(
					BOARD_START_X_PX + (ix << 3), 
					(BOARD_END_Y_PX) + (ATTACK_QUEUE_SIZE << 3) - (iy << 3),
					0xf9, 
					1);
				}
				else
				{
				oam_spr(
					BOARD_START_X_PX + (ix << 3), 
					(BOARD_END_Y_PX) + (ATTACK_QUEUE_SIZE << 3) - (iy << 3),
					0xf8, 
					1);
				}
				
			}
		}
	}
	
POKE(0x2001,0x3f); // red

	// HIT REACTION
	if (hit_reaction_remaining > 0)
	{
		// -1, 0, 1
		//r = (rand8() % 3) - 1;
		oam_spr((3 << 3) /*+ r*/, (24 << 3), 0x65, 1);
		oam_spr(3 << 3, 25 << 3, 0x64, 1);
		oam_spr(3 << 3, 26 << 3, 0x74, 1);		
	}
	// BLINKING
	else
	{
		t = tick_count_large % BLINK_LEN;

		if (t > BLINK_LEN - 5)
		{
			oam_spr(3 << 3, 25 << 3, 0x62, 1);
			oam_spr(3 << 3, 26 << 3, 0x72, 1);
		}
		else if (t > (BLINK_LEN - 10))
		{
			oam_spr(3 << 3, 25 << 3, 0x63, 1);
			oam_spr(3 << 3, 26 << 3, 0x73, 1);
		}
		else if (t > BLINK_LEN - 15)
		{
			oam_spr(3 << 3, 25 << 3, 0x62, 1);
			oam_spr(3 << 3, 26 << 3, 0x72, 1);
		}
	}

	// FLAGS
	t = tick_count & 63;
	if (t > 48)
	{
		ix = 0x69;
	}
	else if (t > 32)
	{
		ix = 0x68;
	}
	else if (t > 16)
	{
		ix = 0x67;
	}
	else
	{
		ix = 0x66;
	}

	oam_spr(8 << 3, 1 << 3, ix, 2);
	oam_spr(24 << 3, 1 << 3, ix, 2);
	oam_spr(3 << 3, 10 << 3, ix, 0);
	oam_spr(27 << 3, 10 << 3, ix, 0);
}
Attachments
timings.png

lidnariq
Posts: 9779
Joined: Sun Apr 13, 2008 11:12 am
Location: Seattle

Re: Sprite Drawing Function Performance

Post by lidnariq » Wed Jun 24, 2020 11:24 pm

Goose2k wrote:
Wed Jun 24, 2020 10:55 pm
	for (iy = 0; iy < 4; ++iy) {
		for (ix = 0; ix < 4; ++ix) {
			bit = ((iy * 4) + (ix & 3)); // &3 = %4
Don't reconstruct the thing from its components; instead iterate on bit and keep ix and iy in bounds.
... or actually, make it something more like
	for (mask = 0x8000; mask; mask >>= 1)

Goose2k wrote:
	if (cur_cluster.layout & (0x8000 >> bit))
Bitshifts by a variable number of bits are comparatively expensive. Use a lookup table instead.

All those bitshifts by 3 aren't great either; try more lookup tables.
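
A minimal sketch of the suggestions above, assuming the 16-bit layout from the original post (bit 15 is the top-left cell, stored row by row). The `oam_spr` here is only a recording stand-in for the neslib call so the snippet compiles off-target, and `draw_layout`, `cell_set`, `PX` and `BIT_MASK` are hypothetical names, not part of the original code.

```c
/* Stand-in for neslib's oam_spr: records positions instead of writing OAM. */
static unsigned char spr_x[64], spr_y[64];
static unsigned char spr_count;

void oam_spr(unsigned char x, unsigned char y,
             unsigned char tile, unsigned char attr)
{
	(void)tile;
	(void)attr;
	spr_x[spr_count] = x;
	spr_y[spr_count] = y;
	++spr_count;
}

/* Lookup table replacing (n << 3) for n = 0..3. */
static const unsigned char PX[4] = { 0, 8, 16, 24 };

/* Lookup table replacing (0x8000 >> bit) when random access is still needed. */
static const unsigned int BIT_MASK[16] = {
	0x8000, 0x4000, 0x2000, 0x1000, 0x0800, 0x0400, 0x0200, 0x0100,
	0x0080, 0x0040, 0x0020, 0x0010, 0x0008, 0x0004, 0x0002, 0x0001
};

/* Returns nonzero if the cell at the given bit index (0..15) is set. */
unsigned char cell_set(unsigned int layout, unsigned char bit)
{
	return (layout & BIT_MASK[bit]) != 0;
}

/* Walk the layout one bit at a time instead of rebuilding the bit index
   from ix and iy. The mask only ever shifts by a constant 1. */
void draw_layout(unsigned int layout, unsigned char start_x,
                 unsigned char start_y, unsigned char tile)
{
	static unsigned int mask;
	static unsigned char ix, iy;

	ix = 0;
	iy = 0;
	for (mask = 0x8000; mask; mask >>= 1)
	{
		if (layout & mask)
		{
			oam_spr(start_x + PX[ix], start_y + PX[iy], tile, 0);
		}
		/* Keep ix and iy in bounds: wrap the column, advance the row. */
		if (++ix == 4)
		{
			ix = 0;
			++iy;
		}
	}
}
```

Calling `draw_layout(next_cluster.layout, 15 << 3, 1 << 3, next_cluster.sprite)` would stand in for the second nested loop in the original function. Whether the tables actually beat the shifts depends on the compiler output, so it is worth profiling both.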

calima
Posts: 1211
Joined: Tue Oct 06, 2015 10:16 am

Re: Sprite Drawing Function Performance

Post by calima » Thu Jun 25, 2020 12:35 am

In addition to the above, put your loop variables ix, iy, and t in zero page (ZP).
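
One way this is commonly done with cc65 is sketched below, assuming the linker config has a ZEROPAGE segment with free space. The variables become file-scope globals rather than function-local statics, and `count_cells` is a hypothetical helper that only shows the counter in use. The `__CC65__` guard lets other compilers skip the pragma.

```c
/* cc65-specific: place the following uninitialized globals in ZEROPAGE. */
#ifdef __CC65__
#pragma bss-name (push, "ZEROPAGE")
#endif

unsigned char ix;
unsigned char iy;
unsigned int t;

#ifdef __CC65__
#pragma bss-name (pop)
#endif

/* Hypothetical helper: counts the set cells in a 16-bit layout,
   using the zero-page ix as the loop counter. */
unsigned char count_cells(unsigned int layout)
{
	static unsigned char n;

	n = 0;
	for (ix = 0; ix < 16; ++ix)
	{
		if (layout & 1)
		{
			++n;
		}
		layout >>= 1;
	}
	return n;
}
```

If another C file needs the same variables, it would declare them `extern` and add `#pragma zpsym ("ix")` so the compiler knows they live in zero page.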

vnsbr
Posts: 28
Joined: Sun Feb 17, 2019 5:18 pm

Re: Sprite Drawing Function Performance

Post by vnsbr » Sun Jul 05, 2020 6:48 pm

Local variables are bad even if they are static. Better to put them in ZP if possible. Also, are all your sprites 8x8? If not, you could benefit from using oam_meta_spr, which lets you draw a chunk in one go. Also no good in having 3 << 3 when you can put the literal using a define or just the plain value. Iy*4 could be << 2. But even then, it is not very complicated code. What flags are you compiling with (afaik -O gives the best performance, sacrificing a bit of size)? As a last resort, you could try updating every other frame.
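
As a rough illustration of the metasprite idea, the three hit-reaction sprites from the original function could be described as one data table. This assumes the usual neslib data layout (x offset, y offset, tile, attribute per sprite, ending with 128) and a neslib build whose `oam_meta_spr` takes `(x, y, data)`; check your own neslib.h, since some versions take an extra sprite-id argument. `oam_meta_spr_stub` is a hypothetical off-target stand-in, not the real routine.

```c
/* Metasprite data: x offset, y offset, tile, attribute; 128 ends the list. */
static const unsigned char hit_reaction_meta[] = {
	0,  0, 0x65, 1,
	0,  8, 0x64, 1,
	0, 16, 0x74, 1,
	128
};

/* Off-target stand-in for oam_meta_spr: appends to a fake OAM buffer. */
static unsigned char oam_buf[256];
static unsigned char oam_pos;

void oam_meta_spr_stub(unsigned char x, unsigned char y,
                       const unsigned char *data)
{
	while (data[0] != 128)
	{
		/* NES OAM byte order: Y, tile, attribute, X. */
		oam_buf[oam_pos++] = y + data[1];
		oam_buf[oam_pos++] = data[2];
		oam_buf[oam_pos++] = data[3];
		oam_buf[oam_pos++] = x + data[0];
		data += 4;
	}
}

/* One call replaces the three oam_spr calls in the hit-reaction branch. */
void draw_hit_reaction(void)
{
	oam_meta_spr_stub(3 << 3, 24 << 3, hit_reaction_meta);
}
```

On the NES itself the call would go to neslib's `oam_meta_spr` instead of the stub. The saving comes from paying the C call overhead once per group rather than once per 8x8 sprite.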

lidnariq
Posts: 9779
Joined: Sun Apr 13, 2008 11:12 am
Location: Seattle

Re: Sprite Drawing Function Performance

Post by lidnariq » Sun Jul 05, 2020 7:01 pm

vnsbr wrote:
Sun Jul 05, 2020 6:48 pm
Also no good in having 3 << 3 when you can put the literal using a define or just the plain value
CC65 is clever enough to recognize constant expressions and evaluate them at compile time. (if they don't involve function calls...)
Iy*4 could be << 2
And also to replace multiplications and divisions by powers of two with bit shifts.

That said, he did switch from multiplication by a power of two to bit shifts in this commit.

vnsbr
Posts: 28
Joined: Sun Feb 17, 2019 5:18 pm

Re: Sprite Drawing Function Performance

Post by vnsbr » Sun Jul 05, 2020 7:05 pm

lidnariq wrote:
Sun Jul 05, 2020 7:01 pm
vnsbr wrote:
Sun Jul 05, 2020 6:48 pm
Also no good in having 3 << 3 when you can put the literal using a define or just the plain value
CC65 is clever enough to recognize constant expressions and evaluate them at compile time. (if they don't involve function calls...)
Iy*4 could be << 2
And also to replace multiplications and divisions by powers of two with bit shifts.

That said, he did switch from multiplication by a power of two to bit shifts in this commit.
Cool, didn't know about that :)

Goose2k
Posts: 107
Joined: Wed Dec 11, 2019 9:38 pm

Re: Sprite Drawing Function Performance

Post by Goose2k » Sun Jul 05, 2020 8:59 pm

I ended up getting things running well, and it was 90% because of 2 changes:

1) I was able to collapse multiple nested for loops, into a single loop using the suggestion of walking through a bitmask, rather than reconstructing it from x and y. I was doing this twice in the sprite drawing function, and was able to collapse it all into a single loop. I was able to apply this in a couple other places in code as well!
Thanks for this suggestion lidnariq!
lidnariq wrote:
Wed Jun 24, 2020 11:24 pm
Don't reconstruct the thing from its components; instead iterate on bit and keep ix and iy in bounds.
... or actually, make it something more like
for (mask = 0x8000; mask; mask >>= 1)
2) I was making some additional calls to draw_sprites(), which was very expensive. Instead, I now copy just the small piece of draw_sprites() that I was interested in, inline. Much, much faster.

I also tried moving locals to globals, but I found statics to be much cleaner, and in my profiling appear to be just as performant as globals (after the first time the function is called). I'm sure it must be worse, because everyone keeps telling me it is, but it seemed insignificant enough that I would take the more readable code over the optimization at this point.
Last edited by Goose2k on Sun Jul 05, 2020 9:17 pm, edited 1 time in total.

rainwarrior
Posts: 7878
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada

Re: Sprite Drawing Function Performance

Post by rainwarrior » Sun Jul 05, 2020 9:05 pm

Statics and globals are the same, performance wise, the only difference is visibility in the code. A static that is local to a function has allocated dedicated RAM that you can't reuse for anything else.

Goose2k
Posts: 107
Joined: Wed Dec 11, 2019 9:38 pm

Re: Sprite Drawing Function Performance

Post by Goose2k » Sun Jul 05, 2020 9:22 pm

rainwarrior wrote:
Sun Jul 05, 2020 9:05 pm
Statics and globals are the same, performance wise, the only difference is visibility in the code. A static that is local to a function has allocated dedicated RAM that you can't reuse for anything else.
Wouldn't that be true for Globals as well?

In fact, the more I read about this, the more it seems like static locals and globals are treated exactly the same. (yes, I spent 5 minutes on stack overflow and now I am an expert :wink: )

rainwarrior
Posts: 7878
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada

Re: Sprite Drawing Function Performance

Post by rainwarrior » Mon Jul 06, 2020 2:30 am

Goose2k wrote:
Sun Jul 05, 2020 9:22 pm
Wouldn't that be true for Globals as well?
I don't know what "that" refers to, because I just said that statics and globals are the same except...
rainwarrior wrote:
Sun Jul 05, 2020 9:05 pm
...the only difference is visibility in the code. A static that is local to a function has allocated dedicated RAM that you can't reuse for anything else.
A static that is local to a function can't be used outside that function, but has dedicated storage. A global is not confined to a scope and is available in as many places as you want to use it.

Globals can also be static. A static global can't be exported and used by another module (another C or assembly file). Thus they are limited to the file rather than to the whole program.


So these all have dedicated storage. The actual generated machine code won't treat any of them differently, but your C code is restricted from accessing anything outside its scope. Stuff in the global scope can be reused, stuff in a function-local scope can't, hence it's a greater potential burden on the RAM budget. If you're not running out of RAM, then that difference is unimportant.
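
The three cases described above can be sketched as follows. All names here are hypothetical and only illustrate scope; each variable gets its own fixed RAM location regardless of where it is declared.

```c
/* Global: dedicated storage, usable from any function and, via extern,
   from other C or assembly modules. */
unsigned char scratch;

/* Static global: dedicated storage, usable anywhere in this file only. */
static unsigned char file_counter;

/* Function-local static: dedicated storage, usable in this function only. */
unsigned char bump_local(void)
{
	static unsigned char local_counter;

	++local_counter;
	return local_counter;
}

unsigned char bump_file(void)
{
	++file_counter;
	return file_counter;
}

/* Two unrelated functions sharing the same global as temporary space.
   This is the "reuse" that a function-local static does not allow. */
unsigned char sum_to(unsigned char n)
{
	scratch = 0;
	while (n)
	{
		scratch += n;
		--n;
	}
	return scratch;
}

unsigned char twice(unsigned char v)
{
	scratch = v;
	scratch += v;
	return scratch;
}
```

Here `sum_to` and `twice` together cost one byte of RAM for temporaries, whereas giving each its own local static would cost two. Sharing like this is only safe when the functions never need the value at the same time.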

Goose2k
Posts: 107
Joined: Wed Dec 11, 2019 9:38 pm

Re: Sprite Drawing Function Performance

Post by Goose2k » Mon Jul 06, 2020 9:59 am

Sorry, I should have been more clear. I was wondering how this (in quote) is not also true for a global.
rainwarrior wrote:
Sun Jul 05, 2020 9:05 pm
A static that is local to a function has allocated dedicated RAM that you can't reuse for anything else.
I was confused by the term "reuse". I didn't realize this sentence was still talking about the scope of the variable. I thought you literally meant that RAM (if assigned to a global) could be used for something different (like re-allocated or something), not simply describing the difference between global and local scope. :D

Sorry for the confusion, and thanks for taking the time with such a detailed explanation!

vnsbr
Posts: 28
Joined: Sun Feb 17, 2019 5:18 pm

Re: Sprite Drawing Function Performance

Post by vnsbr » Tue Jul 07, 2020 9:11 pm

Moving them to zero page should give a speed increase; just making them global does not. I think that was what everyone was pointing out. But nice that it is fast already :D
