Newbie to emulation questions

Discuss emulation of the Nintendo Entertainment System and Famicom.


tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Newbie to emulation questions

Post by tepples »

kyuusaku wrote:
tepples wrote:For the CPU built-in I/O registers, set your $4000 handler to check if the write is in $4000-$4017 and use a separate pair of 24-entry tables for those.
Wouldn't compare logic and additional tables be far slower than a pointer array which could directly point to separate memory ignore/conflict/success/open bus handlers and PPU/APU functions?
Having 65536 pointers, where you jump to a separate function for every different address, will wreck your host CPU's cache. Unlike the NES, Super NES, Game Boy, and Game Boy Advance, PCs depend on cache locality for speed. If the pointer table can be kept in cache, the CPU can jump through it much more quickly.
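Concretely, something along these lines is what I mean - just a sketch, and the handler names are made up:

    /* Sketch: one read pointer and one write pointer per 4 KB bank.
       A NULL entry means "no plain memory here, call a handler instead"
       (PPU/APU/mapper registers, open bus). Handler names are invented. */
    #include <stdint.h>
    #include <stddef.h>

    static uint8_t *read_bank[16];   /* 16 x 4 KB covers the 64 KB CPU map */
    static uint8_t *write_bank[16];

    static uint8_t io_read(uint16_t addr)             { (void)addr; return 0; }
    static void    io_write(uint16_t addr, uint8_t d) { (void)addr; (void)d; }

    static uint8_t cpu_read(uint16_t addr)
    {
        uint8_t *bank = read_bank[addr >> 12];
        return bank ? bank[addr & 0x0FFF] : io_read(addr);
    }

    static void cpu_write(uint16_t addr, uint8_t data)
    {
        uint8_t *bank = write_bank[addr >> 12];
        if (bank)
            bank[addr & 0x0FFF] = data;
        else
            io_write(addr, data);
    }

The two tables are tiny, so they stay in cache; io_read/io_write are where the $2000-$3FFF and $4000-$4017 dispatch would go.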
NSF, for one. Crazy Japanese Famicom mappers, for another.
I didn't know; right now NSF isn't in the picture. The dynamic map idea was to allow individual mapper modules (NSF or FDS, for example) to change the map on a case-by-case basis. Which FC mappers are these with 4KB banks? Are you sure they aren't pirate originals?
Are you sure pirate originals aren't worthy of emulation?
I was thinking about running both the CPU and PPU in real time by interleaving them, which would let me skip the look-ahead logic for the PPU and just evaluate the interrupts before each instruction. It's unfortunate that this would be so intense for even current desktops, but "interesting" detection algorithms not only seem like a lot of work but are pretty complex, which sorta goes against the simplicity philosophy.
Then make your interleaved engine first. Then when you get time to make your "interesting" engine, use regression testing by running the precise engine and the "interesting" engine in parallel and comparing the state of RAM each time the engine catches up.
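The RAM comparison itself can be as dumb as a memcmp; a rough sketch (the names are made up):

    /* Sketch: whenever the "interesting" engine catches up to the precise
       interleaved one, compare their 2 KB of work RAM and report the first
       divergence. */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    static void check_sync(const uint8_t *precise_ram, const uint8_t *fast_ram,
                           long cycle)
    {
        if (memcmp(precise_ram, fast_ram, 0x800) != 0)
            fprintf(stderr, "engines diverged by CPU cycle %ld\n", cycle);
    }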
n6 wrote:
Try 16 pointers, each referencing a 4 KB bank, for reads, and 16 pointers for writes.
Isn't it even better with 2 KB blocks, because of the size of WRAM?
Wouldn't 1-byte blocks be best, since that's the size of I/O ports? :P
And wreck your cache.
kyuusaku
Posts: 1665
Joined: Mon Sep 27, 2004 2:13 pm

Post by kyuusaku »

laughy wrote:In my emulator the most complex thing would actually be the JIT compiler, but I suggest NOT writing one of those.
Aye aye; I wouldn't know where to even start with that.
tepples wrote: Having 65536 pointers, where you jump to a separate function for every different address, will wreck your host CPU's cache. Unlike the NES, Super NES, Game Boy, and Game Boy Advance, PCs depend on cache locality for speed. If the pointer table can be kept in cache, the CPU can jump through it much more quickly.
Would a cache miss be slower than the 17 compares for the r/w and address plus figuring out what to do with it? Can't the compiler or a profiler do something to optimize for a CPU with a humble cache? I can see how WRAM, which allows both r/w, can be decoded very quickly, but $4000 onwards is trouble. Is the pointer idea really as bad as you make it out to be?
tepples wrote:Are you sure pirate originals aren't worthy of emulation?
Some pirate originals are worthy. Even so, I haven't seen one with 4KB banks.
mattmatteh
Posts: 345
Joined: Fri Jul 29, 2005 3:40 pm
Location: near chicago

Post by mattmatteh »

kyuusaku wrote:Would a cache miss be slower than the 17 compares
i would guess yes, but the only real test is to benchmark it yourself. also, where do you get 17 compares from?

also, i am using sdl right now. my emulator is not ready for ppu optimizations yet. still working on other stuff.

matt
Quietust
Posts: 1920
Joined: Sun Sep 19, 2004 10:59 pm

Post by Quietust »

mattmatteh wrote:where do you get 17 compares from?
Probably from doing an if/elseif/elseif on $4000, $4001, $4002, ..., $4017 (and counting in hex instead of decimal), which is not what you'd get from such a switch() - you'd get one range check for anything outside $4000-$4017 (the 'default' case) and a 24-entry jump table to handle $4000-$4017. At least with a reasonably sane compiler, that's what I'd expect to get.
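For illustration, the shape I mean (only a few of the 24 cases shown; the register comments are just reminders):

    /* A dense switch over $4000-$4017; a reasonable compiler emits one
       bounds check plus a jump table for the full 24-case version. */
    #include <stdint.h>

    static void write_4000_4017(uint16_t addr, uint8_t data)
    {
        switch (addr) {
        case 0x4000: /* pulse 1 duty/envelope */ break;
        case 0x4001: /* pulse 1 sweep */         break;
        /* ... cases $4002-$4013 ... */
        case 0x4014: /* OAM DMA */               break;
        case 0x4015: /* APU channel enables */   break;
        case 0x4016: /* controller strobe */     break;
        case 0x4017: /* APU frame counter */     break;
        default:     /* outside $4000-$4017 */   break;
        }
        (void)data;
    }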
Quietust, QMT Productions
P.S. If you don't get this note, let me know and I'll write you another.
blargg
Posts: 3715
Joined: Mon Sep 27, 2004 8:33 am
Location: Central Texas, USA

Post by blargg »

Would a cache miss be slower than the 17 compares for the r/w and address plus figuring out what to do with it?
If you're using a switch statement, the compiler should generate at most 5 compares (binary search), or two (table lookup). The real question is, how would this affect the available cache for the cases where it really helps? If you push out all the often-accessed data with these huge tables, the CPU can really slow down.
Can't the compiler or a profiler do something to optimize for a CPU with a humble cache?
If a CPU with a humble cache could work just as well, then why would a CPU have anything more?
I can see how WRAM, which allows both r/w, can be decoded very quickly, but $4000 onwards is trouble. Is the pointer idea really as bad as you make it out to be?
Only way to really find out is to profile the two. I tried changing the page size from 2K to 1 byte in my emulator and it uses 460% more CPU time (5.6x slower). This is on a PowerPC G3 with 32K data cache and 1MB secondary cache. The table size goes from 128 bytes to 256K, and I only use it for memory reads.
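For reference, the arithmetic behind those table sizes (assuming 4-byte pointers, as on a 32-bit G3):

    2 KB pages:   65536 / 2048 = 32 entries    x 4 bytes = 128 bytes
    1-byte pages: 65536 / 1    = 65536 entries x 4 bytes = 262144 bytes (256K)

so the one-byte-page table alone is eight times the 32K data cache.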
kyuusaku
Posts: 1665
Joined: Mon Sep 27, 2004 2:13 pm

Post by kyuusaku »

I was thinking:

if (addr < 0x8000)
{
    if (addr < 0x4000)
    {
        ...

etc. etc., plus R/W, until you're in your target range. Using a switch block is better, though, until you must break down that range further.
kyuusaku
Posts: 1665
Joined: Mon Sep 27, 2004 2:13 pm

Post by kyuusaku »

Bump.

What's the simplest way to get timing for a program? Are there easy to use libraries or must I do the math myself? What should I be Googling for? Thanks
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Post by tepples »

A decent emulator with support for breakpoints (either 'brk' instructions or opcode fetch watches) should tell you the number of cycles that have elapsed between breakpoints.
kyuusaku
Posts: 1665
Joined: Mon Sep 27, 2004 2:13 pm

Post by kyuusaku »

Oops, I should have been more clear. I mean timing for the actual program, like how I might do "virtual interrupts" in C with time.h (or something else) as a timebase.
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Post by tepples »

Decent compilers have something called "profiling" which will tell you how long the CPU spends in each function.
mattmatteh
Posts: 345
Joined: Fri Jul 29, 2005 3:40 pm
Location: near chicago

Post by mattmatteh »

well, something blargg touched on before, and something i have kinda done too: profile the memory reads. find out what address ranges are accessed the most and put those at the top of your if-else block, if that is what you are using. i was thinking of doing that in a few places.
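something like this is all it takes to get the counts (rough sketch, names made up; hook count_read() into your cpu read function):

    /* rough sketch: tally CPU reads per 2 KB region, dump them at the end */
    #include <stdint.h>
    #include <stdio.h>

    static unsigned long read_counts[32];   /* 64 KB / 2 KB = 32 regions */

    static void count_read(uint16_t addr)
    {
        read_counts[addr >> 11]++;
    }

    static void dump_read_counts(void)
    {
        int i;
        for (i = 0; i < 32; i++)
            printf("$%04X-$%04X: %lu\n",
                   (unsigned)(i << 11), (unsigned)((i << 11) + 0x7FF),
                   read_counts[i]);
    }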

as for the cache misses, i used valgrind. the cpu wasn't really a problem, it's the ppu for mine. but still working on it. i need to understand the ppu and mappers more before i can optimise. first get it working, then speed it up.

matt
kyuusaku
Posts: 1665
Joined: Mon Sep 27, 2004 2:13 pm

Post by kyuusaku »

Right now I'm not concerned with profiling for speed; what I need help with is more basic: getting everything synched and calling step_PPU() every ~0.18624 us (one PPU cycle).

This is what I've come up with so far:

loop()
{
    /* wait out the remainder of one PPU cycle, then time the next step */
    wait((1.0 / (21477270.0 / 4.0)) - ((double)(time2 - time1) / CLOCKS_PER_SEC));
    time1 = clock();
    step_NES();
    time2 = clock();
}

Edit: bad algo
Last edited by kyuusaku on Mon Sep 18, 2006 1:53 pm, edited 2 times in total.
mattmatteh
Posts: 345
Joined: Fri Jul 29, 2005 3:40 pm
Location: near chicago

Post by mattmatteh »

i use sdl right now for approximate timing. also, you can't profile with time(). it's only good to the nearest 10 milliseconds i think.

is that 0.18624 microseconds? that will not work. the cpu can't switch applications that fast, it's more like a few milliseconds. i do an entire render frame then sleep.
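roughly the shape of my loop (just a sketch; run_frame() stands in for whatever emulates one frame of cpu/ppu and draws it):

    /* sketch: emulate one whole video frame, then sleep off the rest of
       the ~16.6 ms. run_frame() is a stand-in, not a real function. */
    #include "SDL.h"

    #define FRAME_MS 17                 /* rounded; SDL_Delay has ms resolution */

    extern void run_frame(void);        /* hypothetical: CPU+PPU for one frame */

    void main_loop(void)
    {
        Uint32 now;
        Uint32 next = SDL_GetTicks() + FRAME_MS;

        for (;;) {
            run_frame();
            now = SDL_GetTicks();
            if (next > now)
                SDL_Delay(next - now);  /* sleep the remainder of the frame */
            next += FRAME_MS;
        }
    }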

matt
kyuusaku
Posts: 1665
Joined: Mon Sep 27, 2004 2:13 pm

Post by kyuusaku »

Do you not handle controller input and sound output in real time, then? How long do you wait? Just until the next frame? What's the benefit of using SDL's timers over time.h?

To clarify, my routine doesn't use time() but clock(), which returns clock ticks since the program started, and translates that into the execution time from time1 to time2. By timing the emulation step over and over, I figure it will execute at about the desired speed, assuming it can complete a loop that fast.
Last edited by kyuusaku on Sun Sep 17, 2006 10:10 pm, edited 1 time in total.
mattmatteh
Posts: 345
Joined: Fri Jul 29, 2005 3:40 pm
Location: near chicago

Post by mattmatteh »

each render frame is 16.6 milliseconds. i round to that. and even if you/i didn't, i would guess that a single-core cpu would, since it can only run either the input handling or the emulator at a time. unless you are polling. but i also haven't perfected my input that much. i am still working on getting the core done.

matt