Speeding up my graphics routine

Discuss emulation of the Nintendo Entertainment System and Famicom.

Moderator: Moderators

Undubbed
Posts: 24
Joined: Sun Aug 09, 2009 7:46 am

Speeding up my graphics routine

Post by Undubbed » Tue Sep 22, 2009 9:19 am

It seems that the demo mode for donkey does in fact come up but my graphics routine is apparently super super slow so it takes like a minute to get up.My graphics are probably like around 5 frames a second going by how slow donkey kong is moving.

Don't know how I'm gonna make it faster but hey I've gotten this far eh.

GlowCode, a profiler that I got is telling me that my graphics function is taking up 97% of the program time soooooo I think I know what to focus on here.

And 98% of my UpdateGraphics function is taken from within the function itself so it seems like the main culprit is stretch_blit(). Though I'm pretty sure the real problem is in my UpdateBackGround() function and how the Buffer BITMAP gets blitted.

Code: Select all

void Video::UpdateGraphics() {

	UpdateBackGround(); // Buffer comes from this function

// just to show how many times this function got called
	textprintf_centre_ex(Buffer, font, 75, 0, makecol(100,100,100), -1, "Update Graphics: %i", count); 
	stretch_blit(Buffer, screen, 0, 0, Buffer->w, Buffer->h, 0, 0, SCREEN_W, SCREEN_H);
	count++;

}
now here's my background graphics routine. Now don't laugh, this just might be the most newbie, unoptimized crap you've ever seen....

Code: Select all

void Video::UpdateBackGround() {

	
	int NTval = 0;	// The value of the current NameTable address
	int PTaddr = 0; // Pattern table address
	int ATaddr = 0; // The Attribute Table address
	
	int PalleteVal = 0;
	int COLOR = 0;	// Current color of the pixel
	int y = 0; // For use with the PutPixel function

	// For blitting the Pattern Table onto the Background buffer
	int TileX = 0;
	int TileY = 0;

	int NTstart = NameTableAddr;
	int NTend = NameTableAddr + 0x3C0;

	int bit = 7; // used for getting the first 2 bits of Color Index


		// MAKING OF THE BACKGROUND
		// 1) Find the current Name Table Value, Pattern Table address, Attribute Table address, and the current
		//    Attribute Table byte
		// 2) Assemble all the 16 bytes of Pattern Table values to make one tile
		// 3) Figure out first 2 bits for Color Index
		// 4) Minus the bit variable for the next loop
		// 5) Get last 2 bits from the Attribute Table. ATquadrant[] is a map used as a look-up table
		//    to see in which quadrant the Name Table byte is in
		// 6) Get the Pallete value determined by the ColorIndex number which points to where
		//    in the Pallete table that color is
		// 7) Find the color using the color from the int array look up table Color[0x40]
		// 8) Plot that one pixel in the appropriate spot in the PatternTable BITMAP
		// 9) Loop the x variable until the x coordinate reaches 8. Then up the y coordinate by one
		// 10) Loop the PTbyte until all pixels have filled the tile
		// 11) Now we have a Tile!
		// 12) blit that tile(PatternTable) onto the BGbuffer BITMAP(256 x 240)
		// 13) up the TileX coordinate by 8 to make room for the next tile
		// 14) Loop back up to the 'addr' for loop(the very top loop) and move on to the next tile
		// 15) rinse and repeat until all tiles are placed
		// 16) FINALY: blit the fully tiled BGbuffer BITMAP onto the Buffer BITMAP 

	// Looping the current selected NameTable from beginning to end
	for (int addr = NTstart; addr < NTend; addr++) {

		NTval = VideoMem->GetValue(addr);
		PTaddr = (NTval*0x10) + ScreenPTA;
		ATaddr = FindATaddr(addr);
		ATbyte = VideoMem->GetValue(ATaddr);

		// Getting all the values needed to make one tile from the Pattern Table called
		// from the value in the current Name Table byte.
		for (int PToffset = 0; PToffset < 16; PToffset++) {

			PTpixels[PToffset] = VideoMem->GetValue(PTaddr+PToffset);
			
		}

		
		for (int PTbyte = 0; PTbyte < 8; PTbyte++) {
			for (int x = 0; x < 8; x++) {

				ColorIndex.set(0, PTpixels[PTbyte].test(bit));
				ColorIndex.set(1, PTpixels[PTbyte+8].test(bit));

				bit--;
				if (bit < 0) bit = 7;

				switch(ATquadrant[addr]) {

					case 0: 
						ColorIndex.set(2, ATbyte.test(0)); 
						ColorIndex.set(3, ATbyte.test(1)); break;
					case 1: 
						ColorIndex.set(2, ATbyte.test(2));
						ColorIndex.set(3, ATbyte.test(3)); break;
					case 2:
						ColorIndex.set(2, ATbyte.test(4));
						ColorIndex.set(3, ATbyte.test(5)); break;
					case 3:
						ColorIndex.set(2, ATbyte.test(6));
						ColorIndex.set(3, ATbyte.test(7)); break;

				}

				switch(ColorIndex.to_ulong()) {

					case 0: PalleteVal = VideoMem->GetValue(0x3F00); break;
					case 1: PalleteVal = VideoMem->GetValue(0x3F01); break;
					case 2: PalleteVal = VideoMem->GetValue(0x3F02); break;
					case 3: PalleteVal = VideoMem->GetValue(0x3F03); break;
					case 4: PalleteVal = VideoMem->GetValue(0x3F04); break;
					case 5: PalleteVal = VideoMem->GetValue(0x3F05); break;
					case 6: PalleteVal = VideoMem->GetValue(0x3F06); break;
					case 7: PalleteVal = VideoMem->GetValue(0x3F07); break;
					case 8: PalleteVal = VideoMem->GetValue(0x3F08); break;
					case 9: PalleteVal = VideoMem->GetValue(0x3F09); break;
					case 10: PalleteVal = VideoMem->GetValue(0x3F0A); break;
					case 11: PalleteVal = VideoMem->GetValue(0x3F0B); break;
					case 12: PalleteVal = VideoMem->GetValue(0x3F0C); break;
					case 13: PalleteVal = VideoMem->GetValue(0x3F0D); break;
					case 14: PalleteVal = VideoMem->GetValue(0x3F0E); break;
					case 15: PalleteVal = VideoMem->GetValue(0x3F0F); break;

				}

				
				for (int col = 0; col < 0x40; col++) {

					if (PalleteVal == col) {COLOR = Color[col]; break;}

				}
				putpixel(PatternTable, x, y, COLOR);


			}
			y++;
			if (y == 8) y = 0;

				

				
			}

		stretch_blit(PatternTable, BGbuffer, 0, 0, PatternTable->w, PatternTable->h, TileX, TileY, 8, 8);
		TileX += 8;
		if (TileX == 256) {TileY += 8; TileX = 0;}
	}

	blit(BGbuffer, Buffer, 0, 0, 0, 0, 256, 240);
	 

}
Anyone that's using allegro, please help me at least somewhat optimize this mess cause I really have no idea here sadly :(

User avatar
thefox
Posts: 3141
Joined: Mon Jan 03, 2005 10:36 am
Location: Tampere, Finland
Contact:

Re: Speeding up my graphics routine

Post by thefox » Tue Sep 22, 2009 10:46 am

Undubbed wrote:

Code: Select all

putpixel(PatternTable, x, y, COLOR);
There's your biggest problem. Calling putpixel for every single pixel you're drawing -- not good, as putpixel will do a lot of unnecessary work like calculating the pixel position and so on. Instead retrieve a pointer to the bitmap contents and modify it manually. I'm not sure how to do this in Allegro but you probably have to "lock" the bitmap to get the pointer. Then to move in X direction you usually just do pointer++ and to move in Y direction pointer += width and so on.

Good luck :)
Download STREEMERZ for NES from fauxgame.com! — Some other stuff I've done: fo.aspekt.fi

User avatar
MottZilla
Posts: 2832
Joined: Wed Dec 06, 2006 8:18 pm

Post by MottZilla » Tue Sep 22, 2009 11:14 am

With Allegro there are "inline" versions of putpixel which are ALOT faster. However to use them you must know the color depth you are running in. Look them up in the allegro.txt file.

Another thing you should think about is the PatternTables. In alot of games CHR-ROM is used meaning that Pattern Tables never change. Therefore it makes sense on ROM loading to convert the CHR-ROM into a faster to use format, some sort of cache. For CHR-RAM games you can still do this and it still makes sense but you have to keep track of when you need to update the tile's decoded/cached data because the game updated the pattern.

Other than that it's just basic optimization to avoid doing unneeded calculations. The graphics part of your emulator will probably always take the majority of calculation time though so don't like that fool you. The CPU really shouldn't take that much. Maybe sound could compete with graphics but I doubt it.

Undubbed
Posts: 24
Joined: Sun Aug 09, 2009 7:46 am

Post by Undubbed » Tue Sep 22, 2009 12:08 pm

Thanks guys I'll look into those for sure, but it turns out that a certain part of the code is causing the slowdown for some reason.

GlowCode shows it's a std::map that's taking about 80%+ of my UpdateBackground function. So I took this switch away just to see what happens. ATquadrant[] is the std::map btw. ATbyte and ColorIndex are std::bitset

Code: Select all

switch(ATquadrant[addr]) {

					case 0: 
						ColorIndex.set(2, ATbyte.test(0)); 
						ColorIndex.set(3, ATbyte.test(1)); break;
					case 1: 
						ColorIndex.set(2, ATbyte.test(2));
						ColorIndex.set(3, ATbyte.test(3)); break;
					case 2:
						ColorIndex.set(2, ATbyte.test(4));
						ColorIndex.set(3, ATbyte.test(5)); break;
					case 3:
						ColorIndex.set(2, ATbyte.test(6));
						ColorIndex.set(3, ATbyte.test(7)); break;

				}
And it shot up SIGNIFICANTLY. But that's not saying much cause it's still kinda slow. probably 15 or 20 FPS I think.

Seems like the problem is using a std::map as a lookup table.

qeed
Posts: 61
Joined: Tue Jun 17, 2008 11:51 am

Post by qeed » Tue Sep 22, 2009 12:18 pm

If you want the best speed, do away with the abstractions like std::map, std::bitset or whatever, and do manipulations by raw array/bitwise operation by hand. Casting char pointers to long or even long long pointers when copying will also help. How you do timing is also one of the biggest contributor to speed, like doing it cycle based run cpu one cycle, then run the ppu, that will be veryyyyy slow. Probably using the catch up timing method is one of the best speedups I can recommend, if you are not doing that already. Also, take a look into decoding the chr rom before hand or decoding the chr ram upon writes to the pattern tables, since drawing takes place alot more often then updating, that way you dont have to decode everything again.

User avatar
thefox
Posts: 3141
Joined: Mon Jan 03, 2005 10:36 am
Location: Tampere, Finland
Contact:

Post by thefox » Tue Sep 22, 2009 12:19 pm

The obvious optimization that comes to mind is moving the switch() above this loop which should work because "addr" doesn't change in the loop.

Code: Select all

      for (int PTbyte = 0; PTbyte < 8; PTbyte++) { 
         for (int x = 0; x < 8; x++) { 
Download STREEMERZ for NES from fauxgame.com! — Some other stuff I've done: fo.aspekt.fi

Undubbed
Posts: 24
Joined: Sun Aug 09, 2009 7:46 am

Post by Undubbed » Tue Sep 22, 2009 12:29 pm

thefox wrote:The obvious optimization that comes to mind is moving the switch() above this loop which should work because "addr" doesn't change in the loop.

Code: Select all

      for (int PTbyte = 0; PTbyte < 8; PTbyte++) { 
         for (int x = 0; x < 8; x++) { 
WHOA! Holy crap, you're right! FPS shot up quite a bit cause of that.

And I'll certainly be trying to find a different way than using the abstractions.

Man, what would I do without you guys?

User avatar
MottZilla
Posts: 2832
Joined: Wed Dec 06, 2006 8:18 pm

Post by MottZilla » Tue Sep 22, 2009 2:25 pm

qeed wrote:How you do timing is also one of the biggest contributor to speed, like doing it cycle based run cpu one cycle, then run the ppu, that will be veryyyyy slow. Probably using the catch up timing method is one of the best speedups I can recommend, if you are not doing that already.
"Very slow" is subjective. On any half decent PC system there is no reason not to do this or even more accurate timing. Really on any modern system (say 2ghz) you should be able to handle running a cpu cycle then ppu cycles to catch up. The whole catch up method you mention is probably better but as I recall the main benefit or rather the drawback to doing cpu cycle then puu cycle is on older CPUs cache was small and access to RAM was quite slow. Nowdays caches are much bigger, RAM is much faster.

All I'm saying is that this approach is not really that slow at all. It was the method I used and it never performed slow on any of my PCs I tried it on except for a Pentium 266mhz laptop I tried it on for amusement.

qeed
Posts: 61
Joined: Tue Jun 17, 2008 11:51 am

Post by qeed » Tue Sep 22, 2009 3:24 pm

I should've said slow relative on other timing methods. Sorry if I wasn't clear. Yeah, per cycle isn't that bad of an option anymore, i get ~200 fps here with a cycle based emu 2.0 ghz pentium core, though it can probably be alot more if I used a properly optimized emu, ie, not mine :P.

tepples
Posts: 21752
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Post by tepples » Tue Sep 22, 2009 4:24 pm

MottZilla wrote:Really on any modern system (say 2ghz) you should be able to handle running a cpu cycle then ppu cycles to catch up.
At that point, the bottleneck becomes switching among 12 threads, each emulating an NES in a file chooser reminiscent of old PS1 demo discs or the Wii Menu. Or between your NES and your post-processing filter (e.g. hq3x or TV emulation). Or between your NES and your video encoder if you're recording in DivX (.avi) or Theora (.ogg) format.
The whole catch up method you mention is probably better but as I recall the main benefit or rather the drawback to doing cpu cycle then puu cycle is on older CPUs cache was small and access to RAM was quite slow. Nowdays caches are much bigger
Register files on x86 are still small: ax bx cx dx si di bp sp, unless you go for a 64-bit executable.

User avatar
koitsu
Posts: 4216
Joined: Sun Sep 19, 2004 9:28 pm
Location: A world gone mad

Post by koitsu » Tue Sep 22, 2009 7:38 pm

tepples wrote:Register files on x86 are still small: ax bx cx dx si di bp sp, unless you go for a 64-bit executable.
Err, what? http://en.wikipedia.org/wiki/X86#32-bit

Mednafen
Posts: 60
Joined: Wed Sep 13, 2006 12:45 pm

Post by Mednafen » Wed Sep 23, 2009 3:05 am

Regarding qeed's suggestion of typecasting a single-byte pointer to a multi-byte pointer and dereferencing it for faster memory access, you need to be aware of alignment. x86 handles misaligned multi-byte memory access in hardware with a reasonable performance penalty, but some other architectures like SPARC(like you care about that ;b) and ARM(maybe newer support it?) do not(at all, or with good performance). malloc(), calloc(), new, and new[] should return memory aligned to the greatest multi-byte unit addressable by the CPU(ignoring some larger FP types on some archs, and SIMD types), unless your compiler or libc is doing something stupid, and refer to your compiler's manual for directives to align variables themselves.

Also, using generic C types: long(especially long, considering differences in certain 64-bit ABIs...)/int(relatively safe, unless you're compiling with 16-bit DOS compilers)/short/char(char's generally safe, if you always explicitly define signedness, such as "signed char" and "unsigned char") can be unsafe in practice(I won't bring up really arcane architectures that will warp your mind). I prefer using uint32_t, int16_t, int8_t, uint8_t, etc. as defined in the header file inttypes.h(though older versions of MSVC are missing it IIRC).

tepples
Posts: 21752
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Post by tepples » Wed Sep 23, 2009 5:29 am

koitsu wrote:
tepples wrote:Register files on x86 are still small: ax bx cx dx si di bp sp, unless you go for a 64-bit executable.
Err, what? http://en.wikipedia.org/wiki/X86#32-bit
So I forgot the 'E' in front of each name. Forgive me; all the asm I do is for 6502 and occasionally ARM. But my point is that x86 doesn't have R8 through R15 of x86-64.

Undubbed
Posts: 24
Joined: Sun Aug 09, 2009 7:46 am

Post by Undubbed » Tue Sep 29, 2009 4:54 pm

I'm sure a lot of the techniques I used from you guys made it faster, but it was still slow. Then I just found out that running the program under debugging makes the program much slower :/. So I tried using the EXE outside the IDE and now it's running at more than perfect speed :) I feel like such a n00b sometimes.

Updated routine just for closure:

Code: Select all

void Video::UpdateBackGround() {
	
	int NTval = 0;	// The value of the current NameTable address
	int PTaddr = 0; // Pattern table address
	int ATaddr = 0; // The Attribute Table address
	
        unsigned int PalleteVal = 0;
	int COLOR = 0;	// Current color of the pixel
	int y = 0; // For use with the PutPixel function

	// For blitting the Pattern Table onto the Background buffer
	int TileX = 0;
	int TileY = 0;

	int NTstart = NameTableAddr;
	int NTend = NameTableAddr + 0x3C0;

	int bit = 0x80; // used for getting the first 2 bits of Color Index

	// Looping the current selected NameTable from beginning to end
	for (int addr = NTstart; addr < NTend; addr++) {

		NTval = VideoMem->GetValue(addr);
		ATaddr = FindATaddr(addr);
		ATbyte = VideoMem->GetValue(ATaddr);

      // preventing tiles from being needlessly rendered
		if (NTval == NTchange[addr - NTstart]) {

			if (ATbyte == ATchange[ATaddr - NTend]) {
				TileX += 8;
				if (TileX == 256) {TileY += 8; TileX = 0;}
				continue;	
			}
			else ATchange[ATaddr - NTend] = ATbyte;
		
		}
		else {

			NTchange[addr-NTstart] = NTval;
			if (ATbyte != ATchange[ATaddr-NTend]) ATchange[ATaddr - NTend] = ATbyte;
		}

		PTaddr = ((NTval*0x10) + ScreenPTA) >> 4;
		

		
		switch(ATquadrant[addr]) {

					case 0: 
						ColorIndex = ATbyte & 0x01 ? ColorIndex | 0x04 : ColorIndex & (~0x04);
						ColorIndex = ATbyte & 0x02 ? ColorIndex | 0x08 : ColorIndex & (~0x08); break;
					case 1: 
						ColorIndex = ATbyte & 0x04 ? ColorIndex | 0x04 : ColorIndex & (~0x04);
						ColorIndex = ATbyte & 0x08 ? ColorIndex | 0x08 : ColorIndex & (~0x08); break;
					case 2:
						ColorIndex = ATbyte & 0x10 ? ColorIndex | 0x04 : ColorIndex & (~0x04);
						ColorIndex = ATbyte & 0x20 ? ColorIndex | 0x08 : ColorIndex & (~0x08); break;
					case 3:
						ColorIndex = ATbyte & 0x40 ? ColorIndex | 0x04 : ColorIndex & (~0x04);
						ColorIndex = ATbyte & 0x80 ? ColorIndex | 0x08 : ColorIndex & (~0x08); break;

				}

		for (int PTbyte = 0; PTbyte < 8; PTbyte++) {
			for (int x = 0; x < 8; x++) {

				ColorIndex = PatternTable[PTaddr][PTbyte] & bit ? ColorIndex | 0x01 : ColorIndex & (~0x01);
				ColorIndex = PatternTable[PTaddr][PTbyte+8] & bit ? ColorIndex | 0x02 : ColorIndex & (~0x02);

				bit >>= 1; // bit/2
				if (bit < 1) bit = 0x80;
				
				
				
				switch(ColorIndex) {

					case 0: PalleteVal = VideoMem->GetValue(0x3F00); break;
					case 1: PalleteVal = VideoMem->GetValue(0x3F01); break;
					case 2: PalleteVal = VideoMem->GetValue(0x3F02); break;
					case 3: PalleteVal = VideoMem->GetValue(0x3F03); break;
					case 4: PalleteVal = VideoMem->GetValue(0x3F04); break;
					case 5: PalleteVal = VideoMem->GetValue(0x3F05); break;
					case 6: PalleteVal = VideoMem->GetValue(0x3F06); break;
					case 7: PalleteVal = VideoMem->GetValue(0x3F07); break;
					case 8: PalleteVal = VideoMem->GetValue(0x3F08); break;
					case 9: PalleteVal = VideoMem->GetValue(0x3F09); break;
					case 10: PalleteVal = VideoMem->GetValue(0x3F0A); break;
					case 11: PalleteVal = VideoMem->GetValue(0x3F0B); break;
					case 12: PalleteVal = VideoMem->GetValue(0x3F0C); break;
					case 13: PalleteVal = VideoMem->GetValue(0x3F0D); break;
					case 14: PalleteVal = VideoMem->GetValue(0x3F0E); break;
					case 15: PalleteVal = VideoMem->GetValue(0x3F0F); break;

				}

				
				for (int col = 0; col < 0x40; col++) {

					if (PalleteVal == col) {COLOR = Color[col]; break;}

				}
				_putpixel16(PTbmp, x, y, COLOR);


			}
			y++;
			if (y == 8) y = 0;

				
			}

		stretch_blit(PTbmp, BGbuffer, 0, 0, PTbmp->w, PTbmp->h, TileX, TileY, 8, 8);
		TileX += 8;
		if (TileX == 256) {TileY += 8; TileX = 0;}
	}

	blit(BGbuffer, Buffer, 0, 0, 0, 0, 256, 240);
	 

}

User avatar
ehguacho
Posts: 83
Joined: Tue Mar 09, 2010 11:12 pm
Location: Rosario, Argentina
Contact:

Post by ehguacho » Fri May 21, 2010 5:59 pm

thanks Undubbed for your routine. hope it'll helpful for me!
sorry about my english, i'm from argentina...

http://nestate.uuuq.com

Post Reply