All times are UTC - 7 hours





Posted: Thu Sep 07, 2006 5:54 am

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19326
Location: NE Indiana, USA (NTSC)
kyuusaku wrote:
tepples wrote:
For the CPU built-in I/O registers, set your $4000 handler to check if the write is in $4000-$4017 and use a separate pair of 24-entry tables for those.

Wouldn't compare logic and additional tables be far slower than a pointer array which could directly point to separate memory ignore/conflict/success/open bus handlers and PPU/APU functions?

Having 65536 pointers, where you jump to a separate function for every different address, will wreck your host CPU's cache. Unlike the NES, Super NES, Game Boy, and Game Boy Advance, PCs use cache locality. If the pointer table can be kept in cache, the CPU can more quickly jump through it.

Quote:
Quote:
NSF, for one. Crazy Japanese Famicom mappers, for another.

I didn't know, right now NSF isn't in the picture. The dynamic map idea was to allow individual mapper modules (NSF or FDS for example) to change the map on a case by case basis. Which FC mappers are these with 4KB banks? Are you sure they aren't pirate originals?

Are you sure pirate originals aren't worthy of emulation?

Quote:
I was thinking about running both the CPU and PPU in real time by interleaving which would allow me to skip the look ahead logic for the PPU, just evaluate the interrupts before each instruction. It's unfortunate that this would be so intense for even current desktops but "interesting" detection algorithms not only seem like a lot of work but are pretty complex, which sorta goes against the simplicity philosophy.

Then make your interleaved engine first. Then when you get time to make your "interesting" engine, use regression testing by running the precise engine and the "interesting" engine in parallel and comparing the state of RAM each time the engine catches up.
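
A rough sketch of that regression idea in C, assuming each engine exposes a function that advances it to a given cycle count and keeps its own copy of the 2 KB of RAM. The `Engine` struct, `run_until`, `find_divergence`, and the two toy engines are hypothetical names made up for illustration, not taken from any real emulator.

```c
#include <string.h>

#define RAM_SIZE 0x800  /* 2 KB of internal RAM */

/* Hypothetical engine interface that both engines would implement. */
typedef struct Engine {
    unsigned char ram[RAM_SIZE];
    unsigned long cycles;  /* cycles emulated so far */
    void (*run_until)(struct Engine *e, unsigned long target);
} Engine;

/* Run both engines to each checkpoint and compare RAM.
   Returns the first checkpoint at which they differ, or 0 if none. */
unsigned long find_divergence(Engine *precise, Engine *fast,
                              unsigned long step, unsigned long end)
{
    unsigned long t;
    for (t = step; t <= end; t += step) {
        precise->run_until(precise, t);
        fast->run_until(fast, t);
        if (memcmp(precise->ram, fast->ram, RAM_SIZE) != 0)
            return t;
    }
    return 0;
}

/* Toy engines for demonstration: each stores the low byte of the
   cycle count in RAM. The buggy one goes wrong after cycle 300. */
void toy_run(Engine *e, unsigned long target)
{
    e->cycles = target;
    e->ram[0] = (unsigned char)(target & 0xFF);
}

void toy_run_buggy(Engine *e, unsigned long target)
{
    e->cycles = target;
    e->ram[0] = (unsigned char)((target > 300 ? target + 1 : target) & 0xFF);
}
```

With checkpoints every 100 cycles, comparing `toy_run` against `toy_run_buggy` reports the first mismatch at the checkpoint after the bug kicks in, which narrows down where to start looking.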

Quote:
n6 wrote:
Quote:
Try 16 pointers, each referencing a 4 KB bank, for reads, and 16 pointers for writes.

Isn't it even better with 2 KB blocks, because of the size of WRAM?

Wouldn't 1 byte blocks be best since that's the size of i/o ports? :P

And wreck your cache.
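
For comparison, a minimal sketch of the 16-entry layout in C. The names (`read_bank`, `write_bank`, `map_bank`, `cpu_read`, `cpu_write`) are made up for illustration, and it assumes plain memory behind every bank; I/O registers would still need their own handler, and returning 0 for unmapped reads is a simplification of open bus.

```c
#include <stddef.h>
#include <stdint.h>

#define BANK_SHIFT 12                         /* 4 KB banks: 2^12 bytes */
#define BANK_COUNT 16                         /* 64 KB / 4 KB */
#define BANK_MASK  ((1u << BANK_SHIFT) - 1)

static const uint8_t *read_bank[BANK_COUNT];  /* NULL = unmapped */
static uint8_t       *write_bank[BANK_COUNT]; /* NULL = writes ignored (e.g. ROM) */

/* Point one 4 KB bank of the CPU address space at host memory. */
void map_bank(unsigned bank, uint8_t *mem, int writable)
{
    read_bank[bank]  = mem;
    write_bank[bank] = writable ? mem : NULL;
}

uint8_t cpu_read(uint16_t addr)
{
    const uint8_t *p = read_bank[addr >> BANK_SHIFT];
    return p ? p[addr & BANK_MASK] : 0;       /* simplification: unmapped reads give 0 */
}

void cpu_write(uint16_t addr, uint8_t value)
{
    uint8_t *p = write_bank[addr >> BANK_SHIFT];
    if (p)
        p[addr & BANK_MASK] = value;
}
```

Both tables together are only 32 pointers, so they should stay in cache easily, and a mapper can remap a bank by changing one or two entries.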


Posted: Thu Sep 07, 2006 11:09 am

Joined: Mon Sep 27, 2004 2:13 pm
Posts: 1667
Location: .ma.us
laughy wrote:
In my emulator the most complex thing would actually be the JIT compiler, but I suggest NOT writing one of those.

Aye aye; I wouldn't know where to even start with that.

tepples wrote:
Having 65536 pointers, where you jump to a separate function for every different address, will wreck your host CPU's cache. Unlike the NES, Super NES, Game Boy, and Game Boy Advance, PCs use cache locality. If the pointer table can be kept in cache, the CPU can more quickly jump through it.

Would a cache miss be slower than the 17 compares for the r/w and address, plus figuring out what to do with it? Can't the compiler or a profiler do something to optimize for a CPU with a humble cache? I can see how WRAM, which allows both r/w, can be decoded very quickly, but $4000 onwards is trouble. Is the pointer idea really as bad as you make it out to be?

tepples wrote:
Are you sure pirate originals aren't worthy of emulation?

Some pirate originals are worthy. Even so, I haven't seen one with 4KB banks.


Posted: Thu Sep 07, 2006 12:04 pm

Joined: Fri Jul 29, 2005 3:40 pm
Posts: 345
Location: near chicago
kyuusaku wrote:
Would a cache miss be slower than the 17 compares

i would guess yes, but the only real test is to benchmark it yourself. also, where do you get 17 compares from?

also, i am using sdl right now. my emulator is not ready for ppu optimizations yet. still working on other stuff.

matt


Posted: Thu Sep 07, 2006 12:12 pm

Joined: Sun Sep 19, 2004 10:59 pm
Posts: 1393
mattmatteh wrote:
where do you get 17 compares from?


Probably from doing an if/elseif/elseif on $4000, $4001, $4002, ..., $4017 (and counting in hex instead of decimal), which is not what you'd get from such a switch() - you'd get one IF check for >$4017 (for the 'default' case) and a 24-entry jump table to handle $4000-$4017. At least with a reasonably sane compiler, that's what I'd expect to get.
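
A small sketch of the kind of switch meant here, in C. The handler only counts which register group was hit; the grouping comments follow the standard NES register layout, while `io_write` and the counters are placeholder names. Whether a given compiler actually emits a jump table depends on the compiler and its flags.

```c
#include <stdint.h>

/* Placeholder counters standing in for real APU/DMA/joypad handlers. */
unsigned apu_writes, dma_writes, joy_writes, other_writes;

/* Dense, contiguous case values let a compiler emit one range check
   plus a jump table rather than a chain of compares. */
void io_write(uint16_t addr, uint8_t value)
{
    (void)value;  /* unused in this sketch */
    switch (addr) {
    case 0x4000: case 0x4001: case 0x4002: case 0x4003:  /* pulse 1 */
    case 0x4004: case 0x4005: case 0x4006: case 0x4007:  /* pulse 2 */
    case 0x4008: case 0x4009: case 0x400A: case 0x400B:  /* triangle */
    case 0x400C: case 0x400D: case 0x400E: case 0x400F:  /* noise */
    case 0x4010: case 0x4011: case 0x4012: case 0x4013:  /* DMC */
    case 0x4015:                                         /* channel enable */
    case 0x4017:                                         /* frame counter */
        apu_writes++;
        break;
    case 0x4014:                                         /* sprite DMA */
        dma_writes++;
        break;
    case 0x4016:                                         /* joypad strobe */
        joy_writes++;
        break;
    default:                                             /* outside $4000-$4017 */
        other_writes++;
        break;
    }
}
```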

_________________
Quietust, QMT Productions
P.S. If you don't get this note, let me know and I'll write you another.


Posted: Thu Sep 07, 2006 12:14 pm

Joined: Mon Sep 27, 2004 8:33 am
Posts: 3715
Location: Central Texas, USA
Quote:
Would a cache miss be slower than the 17 compares for the r/w and address, plus figuring out what to do with it?
If you're using a switch statement, the compiler should generate at most 5 compares (binary search), or two (table lookup). The real question is, how would this affect the available cache for the cases where it really helps? If you push out all the often-accessed data with these huge tables, the CPU can really slow down.

Quote:
Can't the compiler or a profiler do something to optimize for a CPU with a humble cache?
If a CPU with a humble cache could work just as well, then why would a CPU have anything more?

Quote:
I can see how WRAM, which allows both r/w, can be decoded very quickly, but $4000 onwards is trouble. Is the pointer idea really as bad as you make it out to be?
Only way to really find out is to profile the two. I tried changing the page size from 2K to 1 byte in my emulator and it uses 460% more CPU time (5.6x slower). This is on a PowerPC G3 with 32K data cache and 1MB secondary cache. The table size goes from 128 bytes to 256K, and I only use it for memory reads.
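
The table sizes quoted above follow from simple arithmetic, sketched here in C. It assumes 4-byte pointers, which fits a 32-bit PowerPC G3; `table_bytes` is just an illustrative helper.

```c
#include <stddef.h>

/* Size in bytes of a page table covering the 64 KB address space,
   given the page size and the size of one table entry. */
size_t table_bytes(size_t page_size, size_t entry_size)
{
    return (0x10000 / page_size) * entry_size;
}
```

With 2 KB pages, `table_bytes(2048, 4)` is 32 entries of 4 bytes, or 128 bytes. With 1-byte pages, `table_bytes(1, 4)` is 65536 entries of 4 bytes, or 256 KB, which is eight times the size of a 32 KB data cache.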


Posted: Thu Sep 07, 2006 12:54 pm

Joined: Mon Sep 27, 2004 2:13 pm
Posts: 1667
Location: .ma.us
I was thinking:

if (addr < 0x8000)
{
    if (addr < 0x4000)
    {
        /* etc. etc. + R/W until you're in your target range */
    }
}

Using a switch block is better, though, until you must break down that range further.
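
Filled out, that nested-compare idea might look like the following C sketch. The region boundaries are the usual NES CPU memory map; the `Region` enum and `decode` are invented names, and the R/W split is left out to keep it short.

```c
#include <stdint.h>

typedef enum {
    REGION_RAM,
    REGION_PPU,
    REGION_IO,
    REGION_CART_LOW,
    REGION_PRG
} Region;

/* Binary-tree style decode: at most three compares per address. */
Region decode(uint16_t addr)
{
    if (addr < 0x8000) {
        if (addr < 0x4000)
            return addr < 0x2000 ? REGION_RAM        /* $0000-$1FFF: 2 KB RAM, mirrored */
                                 : REGION_PPU;       /* $2000-$3FFF: PPU registers, mirrored */
        return addr < 0x4018 ? REGION_IO             /* $4000-$4017: APU and I/O */
                             : REGION_CART_LOW;      /* $4018-$7FFF: mostly cartridge space */
    }
    return REGION_PRG;                               /* $8000-$FFFF: PRG ROM */
}
```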


Posted: Sun Sep 17, 2006 4:30 pm

Joined: Mon Sep 27, 2004 2:13 pm
Posts: 1667
Location: .ma.us
Bump.

What's the simplest way to get timing for a program? Are there easy to use libraries or must I do the math myself? What should I be Googling for? Thanks


Posted: Sun Sep 17, 2006 5:16 pm

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19326
Location: NE Indiana, USA (NTSC)
A decent emulator with support for breakpoints (either 'brk' instructions or opcode fetch watches) should tell you the number of cycles that have elapsed between breakpoints.


Posted: Sun Sep 17, 2006 5:22 pm

Joined: Mon Sep 27, 2004 2:13 pm
Posts: 1667
Location: .ma.us
Oops, I should have been clearer. I mean timing for the actual program, like how I might do "virtual interrupts" in C with time.h (or other) as a timebase.


Posted: Sun Sep 17, 2006 5:55 pm

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19326
Location: NE Indiana, USA (NTSC)
Decent compilers have something called "profiling" which will tell you how long the CPU spends in each function.


Posted: Sun Sep 17, 2006 9:07 pm

Joined: Fri Jul 29, 2005 3:40 pm
Posts: 345
Location: near chicago
well, as blargg said before, and something i have kinda done too, is to profile the memory reads too. find out what address ranges are accessed the most and you can put those at the top of your if-else block if that is what you are using. i was thinking of doing that in a few places.
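
One cheap way to gather those numbers, sketched in C: count reads per 2 KB page and look for the hottest page. `count_read` and `hottest_page` are hypothetical hooks that you would call from your own read path; the 2 KB page size is just an assumption to match the WRAM size mentioned earlier.

```c
#include <stdint.h>

#define PAGE_SHIFT 11                       /* 2 KB pages */
#define PAGE_COUNT (0x10000 >> PAGE_SHIFT)  /* 32 pages in 64 KB */

static unsigned long read_count[PAGE_COUNT];

/* Call this from the emulator's read path for every CPU read. */
void count_read(uint16_t addr)
{
    read_count[addr >> PAGE_SHIFT]++;
}

/* Returns the index of the most-read page (lowest index wins ties). */
unsigned hottest_page(void)
{
    unsigned best = 0, i;
    for (i = 1; i < PAGE_COUNT; i++)
        if (read_count[i] > read_count[best])
            best = i;
    return best;
}
```

Multiplying the returned index by 0x800 gives the start address of the range worth testing first in an if-else chain.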

as for the cache misses, i used valgrind. the cpu wasn't really a problem. it's the ppu for mine. but still working on it. i need to understand the ppu and mappers more before i can optimise. first get it working, then speed it up

matt


Posted: Sun Sep 17, 2006 9:21 pm

Joined: Mon Sep 27, 2004 2:13 pm
Posts: 1667
Location: .ma.us
Right now I'm not concerned with profiling for speed; what I need help with is more basic: getting everything synched and calling step_PPU() every ~0.18624 µs (one PPU cycle).

This is what I've come up with so far:

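loop()
{
    /* target time per PPU cycle minus the time the last step took, in seconds */
    wait((1.0 / (21477270.0 / 4)) - ((double)(time2 - time1) / CLOCKS_PER_SEC));
    time1 = clock();
    step_NES();
    time2 = clock();
}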

Edit: bad algo
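
For reference, the arithmetic behind that wait() expression, as a small C sketch (the function names are made up). One thing to watch: with integer operands, 1/(21477270/4) truncates to 0, so the division has to be done in floating point.

```c
#define MASTER_CLOCK_HZ 21477270L  /* NTSC master clock, as used in the loop above */

/* Duration of one PPU cycle in nanoseconds (master clock / 4). */
double ppu_cycle_ns(void)
{
    return 1e9 / (MASTER_CLOCK_HZ / 4.0);
}

/* The same period in whole seconds using integer arithmetic: always 0. */
long ppu_cycle_int(void)
{
    return 1 / (MASTER_CLOCK_HZ / 4);
}
```

The floating-point version comes to roughly 186 ns per PPU cycle, far below what a sleep or wait call can resolve on a desktop OS.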


Last edited by kyuusaku on Mon Sep 18, 2006 1:53 pm, edited 2 times in total.

Posted: Sun Sep 17, 2006 9:46 pm

Joined: Fri Jul 29, 2005 3:40 pm
Posts: 345
Location: near chicago
i use sdl right now for approximate timing. also, you can not profile with time(). it's only good to the nearest 10 milliseconds i think.

is that 18.624 micro seconds ? that will not work. the cpu can not switch applications that fast, its more like in a few milliseconds. i do an entire render frame then sleep.

matt


Posted: Sun Sep 17, 2006 9:57 pm

Joined: Mon Sep 27, 2004 2:13 pm
Posts: 1667
Location: .ma.us
Do you not handle controller input and sound output in realtime then? How long do you wait? Just until the next frame? What's the benefit of using SDL's timers over time.h?

To clarify, my routine does not use time() but clock(), a function that returns clock ticks since the program started, and translates that into the execution time from time1 to time2. By timing the emulation step over and over, I figure it will run at about the desired speed, assuming it can complete a loop within the desired period.


Last edited by kyuusaku on Sun Sep 17, 2006 10:10 pm, edited 1 time in total.

Posted: Sun Sep 17, 2006 10:08 pm

Joined: Fri Jul 29, 2005 3:40 pm
Posts: 345
Location: near chicago
each render frame is 16.6 milliseconds. i round to that. and even if you/i didn't, i would guess that a single core cpu would, since it can do the input or emulator. unless you are polling. but i also haven't perfected my input that much. i am still working on getting the core done.
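
A sketch of the run-a-frame-then-sleep approach in C, assuming a POSIX system (`clock_gettime` and `nanosleep` are POSIX, not standard C). `run_paced` and `add_ns` are invented names, `count_frame` is a toy stand-in for whatever emulates one video frame, and the frame time is my own derivation for NTSC from 341 x 262 PPU cycles at 21477270/4 Hz.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* NTSC frame: 341 x 262 PPU cycles at 21477270/4 Hz, about 16.64 ms. */
#define FRAME_NS ((long)(341.0 * 262.0 * 1e9 / (21477270.0 / 4.0)))

/* Advance a timespec by ns nanoseconds (ns must be under one second). */
void add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    if (t->tv_nsec >= 1000000000L) {
        t->tv_nsec -= 1000000000L;
        t->tv_sec += 1;
    }
}

/* Run `frames` frames, sleeping after each one until its deadline.
   Deadlines are absolute, so sleep jitter does not accumulate. */
void run_paced(void (*run_frame)(void), int frames)
{
    struct timespec deadline, now, wait;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &deadline);
    for (i = 0; i < frames; i++) {
        run_frame();
        add_ns(&deadline, FRAME_NS);
        clock_gettime(CLOCK_MONOTONIC, &now);
        wait.tv_sec  = deadline.tv_sec - now.tv_sec;
        wait.tv_nsec = deadline.tv_nsec - now.tv_nsec;
        if (wait.tv_nsec < 0) {
            wait.tv_nsec += 1000000000L;
            wait.tv_sec  -= 1;
        }
        if (wait.tv_sec >= 0)       /* skip the sleep if already late */
            nanosleep(&wait, NULL); /* may return early on a signal; ignored here */
    }
}

/* Toy frame function: a real emulator would emulate one frame here. */
int frames_run;
void count_frame(void)
{
    frames_run++;
}
```

Calling `run_paced(count_frame, 60)` should take about one second of wall time. Input can be polled once per frame inside the frame function, which is coarse but usually good enough for a 60 Hz console.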

matt

