Multiple switch for CPU emulation

Discuss emulation of the Nintendo Entertainment System and Famicom.

Moderator: Moderators

User avatar
Disch
Posts: 1849
Joined: Wed Nov 10, 2004 6:47 pm

Multiple switch for CPU emulation

Post by Disch » Mon Nov 30, 2009 8:36 am

I'm sure almost everyone has a big switch() for their opcodes in their CPU emulator.

I was recently thinking about this approach qeed showed me (which I guess was blargg's idea?)

Code: Select all

switch (cpu.opcode)
{
  case lda_imm:
    get_imm();
    goto lda;
  case lda_zp:
    get_zp():
    goto lda;
  case adc_imm:
    get_imm();
    goto adc;

  lda: /* do stuff */
    break;
  adc: /* do stuff */
    break;
}
At first I was like "really? goto? wtf". But now that I think about it, it makes some sense. It avoids duplicate code and keeps the switch smaller because there's less inline.

But still... goto? There has to be a better way.

So I thought about it... and what if you made two separate switches... one for the addressing mode lookup, and another for the opcode execution?

Code: Select all

switch(cpu.opcode)
{
  case lda_imm:
  case adc_imm:
    get_imm();
    break;
 //...
}

switch(cpu.opcode)
{
  case lda_imm:
    lda();
    break;

  case adc_imm:
    adc();
    break;
}
This seems like it would keep the switches even smaller, which might help speed up the code. Although I'm not sure if the double-switch would slow it down.

Has anyone tried something like this? I'm curious as to how well it would work.

Nessie
Posts: 134
Joined: Mon Sep 20, 2004 11:13 am
Location: Sweden
Contact:

Post by Nessie » Mon Nov 30, 2009 9:12 am

If you're thinking of performance, I would guess that if the total size of two switches is smaller than one switch, they could possibly run faster. The few cycles wasted on table lookup are probably insignificant as long as your code fits in the cache.

The goto version looks like a good option. If you dislike the goto keyword I suppose you could replace it with a function call (fastcall) followed by a return, perhaps your compiler will be clever enough to translate that into an absolute jmp.


I've implemented this by splitting addressing mode lookup and operation lookup using a function pointer. pseudo code:

Code: Select all

function execute(opcode)
{
  operation = operations[opcode]
  mode = modes[opcode]
  mode(operation)
}

// Example addressing mode
function zeroPageRMW(operation)
{
  addr = nextbyte()
  value = read(addr)
  write(addr, value) // Write back unmodified :)
  value = operation(value)
  write(addr, value)
}

// Example operation
function inc(value)
{
  value++;
  saveSomeFlags(value)
  return value;
}

User avatar
Zepper
Formerly Fx3
Posts: 3223
Joined: Fri Nov 12, 2004 4:59 pm
Location: Brazil
Contact:

Post by Zepper » Mon Nov 30, 2009 1:26 pm

- RockNES uses a switch statement for the addressing mode, then a jump table to the instruction, which can jump into another depending of the case.

User avatar
Dwedit
Posts: 4409
Joined: Fri Nov 19, 2004 7:35 pm
Contact:

Post by Dwedit » Mon Nov 30, 2009 3:07 pm

PocketNES used a 256-word jump table, then used the jump table inside every instruction to go on to the next. It only escaped the CPU loop by running out of cycles before the next event. Also separate 8-word call tables for memory reads and writes.
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!

User avatar
koitsu
Posts: 4218
Joined: Sun Sep 19, 2004 9:28 pm
Location: A world gone mad

Post by koitsu » Mon Nov 30, 2009 4:24 pm

Weird. The disassembler I wrote for 6502 (not TRaCER) in C was done unlike any of the above... well, the closest thing would be how PocketNES's is done. I'll try to dig up the source to confuse people. :-)

I don't think ultimately it matters though, given that most compilers will optimise a large switch/case statement into a jumptable.

User avatar
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Post by Near » Mon Nov 30, 2009 5:17 pm

Code: Select all

(this->*opcode_table[op_readpc()])();
Table is built off template functions.

User avatar
Disch
Posts: 1849
Joined: Wed Nov 10, 2004 6:47 pm

Post by Disch » Mon Nov 30, 2009 8:25 pm

I was more interested in how this would work efficiency wise.

Additional function calls (especially if they're done via a function pointer) add some overhead. And inlining the addressing mode and/or instruction code for all 256 opcodes would make "the big switch" huge with a lot of duplicate / redundant code.

The idea here was that you have no duplicate code with the switches, and also have no additional function calls (provided your code is inlined). So the code is small and potentially faster. But I don't know if having a second switch (and thusly a second jump table) would negate the benefits.

I dunno. It was just a thought.

User avatar
Dwedit
Posts: 4409
Joined: Fri Nov 19, 2004 7:35 pm
Contact:

Post by Dwedit » Mon Nov 30, 2009 9:10 pm

Wouldn't 64K of cache fit an unrollled 6502 CPU core?
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!

tepples
Posts: 22281
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Post by tepples » Tue Dec 01, 2009 6:51 am

That and how much of the PPU and how much of the APU? Some programs do tricky things that need a lot of catchups.

User avatar
essial
Posts: 72
Joined: Thu Dec 03, 2009 8:20 am

Post by essial » Thu Dec 03, 2009 9:55 am

I don't have a switch in my CPU emulator (and it is 99% finished, and passes all CPU test roms).

Here' is the code:

Code: Select all

bool CNES2A03::processOpcode(void)
{
	if (clockDuty > 0) {
		clockDuty--;

		if (clockDuty == 0) {
			switch (dutyIntType) {
				case NES_INTTYPE_NMI:
					{
						NES_FLAGCLEAR(regs.P, NES_PROCESSOR_STATUS_BREAK_COMMAND)
						SETFLAG(regs.P, NES_PROCESSOR_STATUS_UNUSED, false)
						stackPush(regs.PC >> 8);
						stackPush(regs.PC & 0xFF);
						stackPush(regs.P);
						regs.PC = readMem(0xFFFA) + (readMem(0xFFFB) << 8);
					} break;
				case NES_INTTYPE_BRK:
					{
						NES_FLAGSET(regs.P, NES_PROCESSOR_STATUS_BREAK_COMMAND)
						NESADDRESS16 addr = regs.PC + 1;
						SETFLAG(regs.P, NES_PROCESSOR_STATUS_UNUSED, true)
						stackPush(addr >> 8);
						stackPush(addr & 0xFF);
						stackPush(regs.P);
						regs.PC = readMem(0xFFFE) + (readMem(0xFFFF) << 8);

					} break;
			}
		}
	}

	bool result = false;
	OPCODE opcode = readMem(regs.PC++);
	if (opFunc[opcode] == NULL) {
		return false; // TODO: handle invalid opcode
	} else {
		if (!opFunc[opcode](this, opcode, opFuncParam[opcode]))
			return false;
	}
	return true;
}
So instead of a huge switch my code simply says:

Code: Select all

opFunc[opcode](this, opcode, opFuncParam[opcode])
Here's an example of an opcode implimentation:

Code: Select all

OPDEF(JMP)
{
	/* JMP Jump. The program jumps to a label and continues from there.
	
	   An original 6502 has does not correctly fetch the target address if the indirect vector falls 
	   on a page boundary (e.g. $xxFF where xx is and value from $00 to $FF). In this case fetches 
	   the LSB from $xxFF as expected but takes the MSB from $xx00. This is fixed in some later chips 
	   like the 65SC02. */

	if (opParam == AFP_ABSOLUTE) {
		NESADDRESS16 addr = NEXTOP;
		addr += (NEXTOP << 8);
		parent->regs.PC = addr;
	} else {// Indirect
		NESADDRESS16 srcAddr = NEXTOP;
		srcAddr += (NEXTOP << 8);
		NESADDRESS16 finalAddr;
#ifdef ENABLE_6502_BUGS
		if ((srcAddr & 0xFF) == 0xFF)
		{
			// Emulate the original 6502 bug where fetching an indirect address on a page boundery
			// for JMP opcodes causes the address to be loaded improperly.
			finalAddr = parent->readMem(srcAddr);
			finalAddr += (parent->readMem((srcAddr & 0xFF00)) << 8); // Emulate the page wrap bug
		} else {
			// Not on a page boundery -- do it the normal way.
			finalAddr = parent->readMem(srcAddr) +  (parent->readMem(srcAddr+1) << 8);
		}
#else 
		finalAddr = parent->readMem(srcAddr);
		finalAddr += (parent->readMem(srcAddr+1) << 8);
#endif // ENABLE_6502_BUGS
		parent->regs.PC = finalAddr;
	}

	return true;
}
And the OPDEF macro...

Code: Select all

#define OPDEF(x) __inline bool CNES2A03::op ## x(CNES2A03* parent, OPCODE opcode, VAL8 opParam)
Runs super fast. Switch statements require a LOT of branching (one per case) that kills the (real) CPU's prefetch cache. Having the higher used opcodes at the top of the switch will help, but the speed results will be different for different opcodes.

I decided to go with a global function pointer array. There's only one call, and no branching (in regards to finding the proper opcode functionality). There may be some duplicated code in certain situations, but this method works for me and keeps things nice and clean. Besides the CPU isn't going to change once it's done, so a few extra lines is better than dotting your code with gotos.

Hamburgler
Posts: 36
Joined: Wed Jul 04, 2007 8:40 am

Post by Hamburgler » Thu Dec 03, 2009 11:14 am

Large switch statements usually get turned into jump tables by the compiler. Implementing your own jump table may be reimplementing the wheel.

Code size is the more important optimization. If you can keep the main execution loop in cache, the cost of branching and jumping is minimized.

Having a bunch of #defined macros and inline code can lead to a lot of bloat. Be careful.

User avatar
essial
Posts: 72
Joined: Thu Dec 03, 2009 8:20 am

Post by essial » Thu Dec 03, 2009 12:03 pm

Hamburgler wrote:Large switch statements usually get turned into jump tables by the compiler. Implementing your own jump table may be reimplementing the wheel.

Code size is the more important optimization. If you can keep the main execution loop in cache, the cost of branching and jumping is minimized.

Having a bunch of #defined macros and inline code can lead to a lot of bloat. Be careful.
In the case of jump tables, that will simply mean that there will be no positive benefit speed-wise in the best-case of compiler optimization. I prefer to take the safer route and ensure the jumps work as expected. Branching is always an issue as a mis-predicted branch will invalidate the cache.

As far as the #define macros go, I know how they work, and the "bloat" is just the textual representation. I use the #defines in the case of opcodes to standardize the header format, so any future changes (if needed) don't require updating the same thing on every opcode instruction.

User avatar
Zepper
Formerly Fx3
Posts: 3223
Joined: Fri Nov 12, 2004 4:59 pm
Location: Brazil
Contact:

Post by Zepper » Thu Dec 03, 2009 2:14 pm

Disch wrote:The idea here was that you have no duplicate code with the switches, and also have no additional function calls (provided your code is inlined).
- Yup, that's my idea and the code is pretty compact.
Disch wrote:So the code is small and potentially faster.
- I don't know if mine is "potentially faster" or "potentially crap" as someone could reply, heh... ^_^;; Anyway, I can send you my CPU sources if you want to give an easy look into my style.

Hamburgler
Posts: 36
Joined: Wed Jul 04, 2007 8:40 am

Post by Hamburgler » Thu Dec 03, 2009 7:23 pm

essial wrote:In the case of jump tables, that will simply mean that there will be no positive benefit speed-wise in the best-case of compiler optimization. I prefer to take the safer route and ensure the jumps work as expected.
While absolutely true, I still prefer to work the other way around: Let the compiler worry about optimization and how it will fit in with the program as a whole. I'll concentrate on building simple, readable and easily maintainable code. Fancy hand-made optimization can come later if needed.

Anyway, either method is essentially equal in performance. I just wanted to point out that switch statements aren't always evil anymore.

User avatar
essial
Posts: 72
Joined: Thu Dec 03, 2009 8:20 am

Post by essial » Thu Dec 03, 2009 7:39 pm

Hamburgler wrote:While absolutely true, I still prefer to work the other way around: Let the compiler worry about optimization and how it will fit in with the program as a whole. I'll concentrate on building simple, readable and easily maintainable code. Fancy hand-made optimization can come later if needed.
Guilty :D I also help my friend with developing an x86 operating system (for fun of course), and I sort of gotten used to the compiler doing bad things when optimizations were turned on, so we simply have the optimizations disabled and do that by hand. But secondly, it's a matter of style. I like my style and you like yours -- no argument either way on either style from me :)
Hamburgler wrote: Anyway, either method is essentially equal in performance. I just wanted to point out that switch statements aren't always evil anymore.
Agreed :)

Post Reply