All times are UTC - 7 hours

PostPosted: Tue Aug 23, 2005 2:13 pm 
Joined: Mon Sep 27, 2004 8:33 am
Posts: 3715
Location: Central Texas, USA
In my NES 6502 CPU core I read opcodes (instructions) directly from memory rather than using the usual memory read function that handles memory-mapped I/O devices like the PPU and APU. My reasoning is that nothing will intentionally execute from an I/O device and that doing so wouldn't be useful anyway. This approach improves performance by avoiding a function call and the need to keep track of the timestamp for opcode reads.

I divide memory into pages (currently 4K in size), and direct memory access simply goes through a mapping table. The table covers the entire 64K address space, not just the upper half where the ROM is usually mapped. Unmapped and I/O pages all point to a special page filled with a byte value that isn't a legal opcode, which helps catch any attempt to execute from them. I also use this optimization for zero-page and the stack, since I don't handle cartridges with hardware that does anything special when those areas are accessed (and I don't know of any that do).

Code:
typedef unsigned char byte;

const unsigned page_size = 0x1000;
byte* pages [0x10000 / page_size]; // one pointer per 4K page, covering all 64K

// the usual memory read function
int emulate_read( unsigned addr );

inline int read_mem( unsigned addr )
{
    return pages [addr / page_size] [addr % page_size];
}

void emulate_cpu()
{
    int opcode = read_mem( pc );
    switch ( opcode )
    {
        case 0xA9: // LDA #imm
            a = read_mem( pc + 1 );
            set_nz( a );
            pc += 2;
            break;
       
        case 0xA5: // LDA zp
            a = read_mem( read_mem( pc + 1 ) );
            set_nz( a );
            pc += 2;
            break;
           
        case 0xAD: { // LDA abs
            unsigned addr = read_mem( pc + 2 ) * 0x100 + read_mem( pc + 1 );
            a = emulate_read( addr );
            set_nz( a );
            pc += 3;
            break;
        }
       
        case 0x68: // PLA
            sp = (sp + 1) & 0xff;
            a = read_mem( sp + 0x100 );
            set_nz( a );
            pc += 1;
            break;
       
        // ...
    }
}
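
The table setup described in the post (every unmapped or I/O page pointing at one shared filler page) might be sketched as below. This is only an illustration under stated assumptions: `reset_pages`, `map_pages`, `bad_page` and `bad_opcode` are hypothetical names, and the filler value 0x02 is a guess, chosen because it is not a documented 6502 opcode. The post does not say which value is actually used.

```cpp
#include <cstdint>
#include <cstring>

typedef std::uint8_t byte;

const unsigned page_size  = 0x1000;              // 4K pages, as in the post
const unsigned page_count = 0x10000 / page_size; // 16 entries cover all 64K

// Assumption: 0x02 is used as filler; it is not a documented 6502 opcode.
const byte bad_opcode = 0x02;

static byte  bad_page[page_size]; // shared by every unmapped or I/O page
static byte* pages[page_count];

// Point every page at the filler page; call before mapping RAM or ROM.
void reset_pages()
{
    std::memset( bad_page, bad_opcode, sizeof bad_page );
    for ( unsigned i = 0; i < page_count; i++ )
        pages[i] = bad_page;
}

// Map `size` bytes starting at `addr` to `data`.
// Both `addr` and `size` must be multiples of page_size.
void map_pages( unsigned addr, unsigned size, byte* data )
{
    for ( unsigned offset = 0; offset < size; offset += page_size )
        pages[(addr + offset) / page_size] = data + offset;
}

inline int read_mem( unsigned addr )
{
    return pages[addr / page_size][addr % page_size];
}
```

With this arrangement an opcode fetch from an I/O address such as $2002 returns the filler byte, so the core's handling of that one opcode value is the only place that needs to detect the error.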


Any questions, problems, or improvements?


PostPosted: Tue Aug 23, 2005 3:42 pm 
Joined: Thu Oct 21, 2004 4:02 pm
Posts: 210
Location: San Diego
Just as a clarification, when a mapper does a bankswitch how do you handle it? I'm assuming it's by updating the pages[] array?
Also, have you profiled this performance improvement? You're still doing a divide operation and a table lookup on each read operation, so I wouldn't really expect a huge improvement over a fully decoded implementation. I guess you are saving either a function call or a bunch of conditionals (depending on how your memory decoder works), so maybe it will speed some things up. Do you have any numbers on this?


PostPosted: Tue Aug 23, 2005 4:21 pm 
Joined: Mon Sep 27, 2004 8:33 am
Posts: 3715
Location: Central Texas, USA
teaguecl wrote:
Just as a clarification, when a mapper does a bankswitch how do you handle it? I'm assuming it's by updating the pages[] array?


Right. In fact, the table is how I implement mapping: the normal memory read function also goes through it for ROM accesses.
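
As a rough sketch of what "updating the pages[] array" on a bankswitch could look like, assume a hypothetical mapper with a 16K switchable PRG bank at $8000-$BFFF. The helper name `select_prg_bank` and the bank layout are assumptions for illustration, not taken from the post.

```cpp
#include <cstdint>

typedef std::uint8_t byte;

const unsigned page_size = 0x1000; // 4K pages, as in the post
const unsigned bank_size = 0x4000; // hypothetical 16K PRG bank

static byte* pages[0x10000 / page_size];

// Hypothetical helper: make 16K bank `bank` of `prg` visible at $8000-$BFFF
// by repointing the four 4K page entries that cover that range.
void select_prg_bank( byte* prg, unsigned bank )
{
    byte* base = prg + bank * bank_size;
    for ( unsigned offset = 0; offset < bank_size; offset += page_size )
        pages[(0x8000 + offset) / page_size] = base + offset;
}

inline int read_mem( unsigned addr )
{
    return pages[addr / page_size][addr % page_size];
}
```

A bankswitch then costs four pointer stores, and every later read through `read_mem` sees the new bank without any extra check.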

Quote:
Also, have you profiled this performance improvement? You're still doing a divide operation and a table lookup on each read operation, so I wouldn't really expect a huge improvement over a fully decoded implementation.


The divide is of an unsigned value by a power of two, so it should be optimized into a shift, and likewise the modulo becomes a mask: pages [addr >> 12] [addr & 0x0fff].
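
The equivalence can be checked exhaustively over the 16-bit address range. The function names below are made up for this sketch. It only shows that the two forms give the same results for unsigned operands; whether a given compiler actually emits a shift and a mask is something to confirm in its output.

```cpp
const unsigned page_size = 0x1000; // 1 << 12

// Division form, as written in the original read_mem().
inline unsigned page_index_div( unsigned addr )  { return addr / page_size; }
inline unsigned page_offset_mod( unsigned addr ) { return addr % page_size; }

// Shift/mask form that the division is expected to compile down to.
inline unsigned page_index_shift( unsigned addr ) { return addr >> 12; }
inline unsigned page_offset_mask( unsigned addr ) { return addr & 0x0fff; }

// Returns true if both forms agree for every 16-bit address.
bool forms_agree()
{
    for ( unsigned addr = 0; addr < 0x10000; addr++ )
        if ( page_index_div( addr )  != page_index_shift( addr ) ||
             page_offset_mod( addr ) != page_offset_mask( addr ) )
            return false;
    return true;
}
```

The unsigned type matters here: with a signed address, division rounds toward zero for negative values, so the compiler has to add fix-up code around the shift.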

The full read emulation function is going to have some kind of table or switch statement based on the address, so there is always an unknown branch involved. If the read is from ROM, then you have the above table lookup in addition to address decoding.

In my CPU I use a table of function pointers, similar to the page table, and do the lookup in the CPU core (read_funcs [addr >> 12] ( addr )), so using the page table generates less code in addition to being faster. I use lots of goto statements to reuse common sections of code, so the inline lookups don't expand the code by much (the whole CPU core compiles to 4K).
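
A minimal sketch of such a function-pointer table, assuming the standard NES layout of 2K internal RAM mirrored through $0000-$1FFF and PPU registers at $2000-$3FFF. The handler names and their placeholder return values are hypothetical; only the `read_funcs [addr >> 12] ( addr )` lookup comes from the post.

```cpp
typedef unsigned address_t;
typedef int (*read_func_t)( address_t );

static unsigned char ram[0x800]; // 2K internal RAM

// Hypothetical handlers, one per kind of region.
static int read_ram( address_t addr )  { return ram[addr & 0x7ff]; } // handles mirroring
static int read_io( address_t )        { return 0; }    // placeholder for PPU register reads
static int read_unmapped( address_t )  { return 0xFF; } // placeholder value

static read_func_t read_funcs[0x10000 >> 12];

void init_read_funcs()
{
    for ( unsigned i = 0; i < (0x10000 >> 12); i++ )
        read_funcs[i] = read_unmapped;
    read_funcs[0] = read_ram; // $0000-$0FFF
    read_funcs[1] = read_ram; // $1000-$1FFF (mirrors)
    read_funcs[2] = read_io;  // $2000-$2FFF
    read_funcs[3] = read_io;  // $3000-$3FFF
}

// Full read: one table lookup plus an indirect call.
inline int emulate_read( address_t addr )
{
    return read_funcs[addr >> 12]( addr );
}
```

Compared with this, the direct page lookup replaces the indirect call with a second array index, which is where the smaller and faster code comes from.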

I don't have any performance measurements handy. I'll look into gathering some.

I keep the implementation hidden behind an abstract interface that has a few functions that map a range of bytes, given the starting address and the number of bytes. This makes the setup code clear and insensitive to changes in the page size (as long as it's not too large).

Code:
typedef unsigned address_t;
typedef int (*read_func_t)( address_t );
typedef void (*write_func_t)( address_t, int data );

void map_memory( address_t, unsigned size, read_func_t, write_func_t );
void map_code( address_t, unsigned size );
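
One way the interface above could be implemented on top of per-page tables is sketched here. It is not the actual implementation: the table names are invented, and `map_code` takes an extra `data` argument that the declaration in the post does not have, since the post does not show where the bytes come from.

```cpp
#include <cassert>

typedef unsigned address_t;
typedef unsigned char byte;
typedef int  (*read_func_t)( address_t );
typedef void (*write_func_t)( address_t, int data );

const unsigned page_bits  = 12;
const unsigned page_size  = 1u << page_bits;
const unsigned page_count = 0x10000 >> page_bits;

static read_func_t  read_funcs [page_count];
static write_func_t write_funcs[page_count];
static byte*        code_pages [page_count];

// Callers pass byte ranges; only these helpers know the page size.
static void check_range( address_t addr, unsigned size )
{
    assert( addr % page_size == 0 && size % page_size == 0 );
    assert( addr + size <= 0x10000 );
}

void map_memory( address_t addr, unsigned size, read_func_t read, write_func_t write )
{
    check_range( addr, size );
    for ( unsigned offset = 0; offset < size; offset += page_size )
    {
        read_funcs [(addr + offset) >> page_bits] = read;
        write_funcs[(addr + offset) >> page_bits] = write;
    }
}

// Assumption: the `data` parameter is added for this sketch only.
void map_code( address_t addr, unsigned size, byte* data )
{
    check_range( addr, size );
    for ( unsigned offset = 0; offset < size; offset += page_size )
        code_pages[(addr + offset) >> page_bits] = data + offset;
}
```

Because callers only ever name a start address and a byte count, changing `page_bits` does not touch the setup code, as long as every mapped range stays a multiple of the page size.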


Last edited by blargg on Wed Aug 24, 2005 8:50 am, edited 1 time in total.

PostPosted: Tue Aug 23, 2005 9:07 pm 
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19348
Location: NE Indiana, USA (NTSC)
blargg wrote:
I also use this optimization for zero-page and the stack, since I don't handle cartridges with hardware that does anything special when those areas are accessed (and I don't know of any that do).

Though it isn't exactly relevant on nesdev.com, a lot of Atari 2600 carts watch the stack for accesses in the range of $1FD-$1FF to perform bankswitching so that they can put main() in one bank and subroutines in the other.


PostPosted: Wed Aug 24, 2005 10:03 am 
Joined: Mon Sep 27, 2004 8:33 am
Posts: 3715
Location: Central Texas, USA
I ran some basic performance profiling on my CPU core using emulated read/write for all memory accesses, and using direct access to zero-page, stack, and instructions. The average time per frame includes CPU emulation and PPU emulation* with rendering turned on (sound is disabled), running on a 400 MHz PowerPC G3 Mac, playing The Guardian Legend for a minute or so. If I can get my sampling profiler running, I can generate more thorough data with timing for the CPU core alone.

All emulated memory accesses: 3.22 msec/frame, 4808 bytes of code in the core CPU emulation function.

Direct memory optimization: 2.05 msec/frame, 3524 bytes of code in the core CPU emulation function.

* PPU emulation is by no means complete. It handles basic games like Castlevania.

EDIT: Corrected byte counts to include only the CPU emulation function.
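
For reference, the relative improvement implied by these figures can be worked out directly. The constants are the numbers quoted in this post; the function names are just for this sketch.

```cpp
// Figures quoted in the post (400 MHz PowerPC G3, CPU + PPU, sound disabled).
const double emulated_ms = 3.22, direct_ms = 2.05; // msec per frame
const int emulated_bytes = 4808, direct_bytes = 3524; // CPU emulation function size

// How many times faster a frame completes with direct memory access.
double speedup()        { return emulated_ms / direct_ms; }

// Percentage of per-frame time saved.
double time_saved_pct() { return 100.0 * (emulated_ms - direct_ms) / emulated_ms; }

// Percentage reduction in code size.
double code_saved_pct() { return 100.0 * (emulated_bytes - direct_bytes) / emulated_bytes; }
```

That works out to roughly a 1.57x speedup (about 36% less time per frame) and about 27% less code. Since the frame time also includes PPU emulation, the gain for the CPU core alone is presumably larger, but these measurements don't isolate it.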


Last edited by blargg on Sat Aug 27, 2005 10:14 pm, edited 2 times in total.

PostPosted: Wed Aug 24, 2005 11:59 am 
Joined: Thu Oct 21, 2004 4:02 pm
Posts: 210
Location: San Diego
blargg wrote:
All emulated memory accesses: 3.22 msec/frame, 6020 bytes of code in the CPU core.

Direct memory optimization: 2.05 msec/frame, 4252 bytes of code in the CPU core.


Thanks Blargg, that's very cool of you to do that testing. I think it might be a good idea to start documenting stuff like this somewhere. Obviously it's all very platform specific, but it could lead to some "rules of thumb" for certain aspects of NES emulation based on your hardware.


PostPosted: Thu Aug 25, 2005 1:19 pm 
Joined: Mon Sep 27, 2004 8:33 am
Posts: 3715
Location: Central Texas, USA
Thanks for the feedback; it motivated me to completely rewrite my NES 6502 emulation page. Here's the current draft:

http://www.slack.net/~ant/nes-emu/6502.html

I should probably move it over to the Nesdev wiki and put each technique on its own page to allow more thorough discussion.


PostPosted: Thu Aug 25, 2005 4:01 pm 
Formerly Fx3
Joined: Fri Nov 12, 2004 4:59 pm
Posts: 3076
Location: Brazil
"For every CPU cycle, the NTSC PPU renders 3 pixels". Look at my draft:

Code:
unsigned char readvalue(int address)
{
   ppu_run(); apu_run();
   return cpu->readmem(address);
}

void cpu_run()
{
   data = readvalue(PC);
   //do stuff
}


blargg, you mean something like:

Code:
void cpu_run()
{
   ppu_run(); apu_run();
   data = cpu->bank[PC>>13][PC & 0x1fff];
   //do stuff
}


Is this it?? Well, I was already doing that. Let me share what I do... ^_^;; Opcodes that access RAM (or the stack) use a direct pointer rather than a (*hook)(). After fetching the opcode, the core jumps to the proper addressing mode (goto _address_mode_XX), then goes through another jump table to execute the opcode. This let me reduce the code size by more than 75%, because several addressing modes (after the proper opcode/address data fetching) do the same work as others.
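
The two-stage dispatch described here (addressing mode first, then the operation) can be illustrated roughly as follows. This is not RockNES code: it uses plain switch statements instead of goto and jump tables, a flat 64K array instead of real memory mapping, ignores flags and timing, and covers only six opcodes. All names are hypothetical.

```cpp
#include <cstdint>

typedef std::uint8_t byte;

enum Mode { IMM, ZP, ABS };
enum Op   { LDA, LDX };

struct Decoded { Mode mode; Op op; byte length; };
struct Cpu     { unsigned pc; byte a, x; byte mem[0x10000]; };

// Hypothetical decode step covering a handful of opcodes only.
static bool decode( byte opcode, Decoded* out )
{
    switch ( opcode )
    {
        case 0xA9: *out = Decoded{ IMM, LDA, 2 }; return true;
        case 0xA5: *out = Decoded{ ZP,  LDA, 2 }; return true;
        case 0xAD: *out = Decoded{ ABS, LDA, 3 }; return true;
        case 0xA2: *out = Decoded{ IMM, LDX, 2 }; return true;
        case 0xA6: *out = Decoded{ ZP,  LDX, 2 }; return true;
        case 0xAE: *out = Decoded{ ABS, LDX, 3 }; return true;
    }
    return false;
}

// Executes one instruction; returns false on an opcode this sketch doesn't know.
bool step( Cpu& cpu )
{
    Decoded d;
    if ( !decode( cpu.mem[cpu.pc & 0xffff], &d ) )
        return false;

    unsigned lo = cpu.mem[(cpu.pc + 1) & 0xffff];
    unsigned hi = cpu.mem[(cpu.pc + 2) & 0xffff];

    // Stage 1: addressing mode, written once and shared by every operation.
    unsigned addr = 0;
    switch ( d.mode )
    {
        case IMM: addr = (cpu.pc + 1) & 0xffff; break;
        case ZP:  addr = lo;                    break;
        case ABS: addr = lo | (hi << 8);        break;
    }

    // Stage 2: the operation, written once and shared by every addressing mode.
    byte value = cpu.mem[addr];
    switch ( d.op )
    {
        case LDA: cpu.a = value; break;
        case LDX: cpu.x = value; break;
    }
    cpu.pc = (cpu.pc + d.length) & 0xffff;
    return true;
}
```

Six opcodes are handled by three addressing-mode cases plus two operation cases, which is the kind of sharing that shrinks the code.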

_________________
Zepper
RockNES developer


PostPosted: Thu Aug 25, 2005 5:41 pm 
Joined: Mon Sep 27, 2004 8:33 am
Posts: 3715
Location: Central Texas, USA
I posted my replies to two other threads since they weren't on-topic for this thread:

http://nesdev.com/bbs/viewtopi ... =4058#4058
http://nesdev.com/bbs/viewtopi ... =4059#4059


PostPosted: Sun Oct 02, 2005 12:39 pm 
Joined: Mon Sep 27, 2004 8:33 am
Posts: 3715
Location: Central Texas, USA
I was working on my CPU core today and realized that by pre-biasing the pointers in the table, the masking operation could be eliminated. Below I've written the code to be closer to the machine code output:

Code:
const int low_bits = 12; // page_size = 1 << 12 = 4096
byte* pages [0x10000 >> low_bits];

void set_page( int index, byte* data )
{
    pages [index] = data - (index << low_bits);
}

inline int read_mem( unsigned addr )
{
    int index = addr >> low_bits;
    byte* page = pages [index];
    return page [addr]; // no masking needed!
}
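
Why the bias works: for an address in page `index`, `(data - (index << low_bits)) + addr` equals `data + (addr - (index << low_bits))`, which is `data` plus the offset within the page. The sketch below checks this against the masked lookup. It is a variation rather than the code above: the biased base is kept as a `uintptr_t` so that no out-of-range pointer is ever formed (strictly speaking, C and C++ don't define pointer arithmetic that leaves the array). It assumes a flat address space where integer and pointer conversions round-trip. The names `biased`, `plain`, `read_masked` and `read_biased` are invented here.

```cpp
#include <cstdint>

typedef std::uint8_t byte;

const int low_bits = 12; // page_size = 1 << 12 = 4096

// Biased bases are stored as integers; plain pointers are kept for comparison.
static std::uintptr_t biased[0x10000 >> low_bits];
static byte*          plain [0x10000 >> low_bits];

void set_page( int index, byte* data )
{
    plain [index] = data;
    biased[index] = reinterpret_cast<std::uintptr_t>( data )
                  - (static_cast<std::uintptr_t>( index ) << low_bits);
}

// Masked lookup, as in the first version of read_mem().
inline int read_masked( unsigned addr )
{
    return plain[addr >> low_bits][addr & ((1u << low_bits) - 1)];
}

// Pre-biased lookup: (index << low_bits) was already subtracted from the
// base, so adding the full address lands on the right byte.
inline int read_biased( unsigned addr )
{
    return *reinterpret_cast<byte*>( biased[addr >> low_bits] + addr );
}
```

Unsigned wraparound makes the subtraction safe even when the page's buffer sits at a lower address than `index << low_bits`, because the later addition of `addr` brings the value back into the buffer.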

