> I don't see running the PPU ahead of the CPU as an option. How do you solve this problem with your cooperative model?
Sorry, I had to pick CPU<>PPU because the NES doesn't have a separate processor for audio.
With my emulators, I typically don't run the PPU ahead of the CPU, because we'd have to check our clock value after every read of a non-cached register value.
But one of the fun things is that cooperative threading works great even when only one thread can run ahead of the other. Aside from spin loops (-; lda $200x; bpl -), and hell, even then really, the CPU executes several cycles per instruction. So when we switch to the PPU, it gets to run that many CPU cycles' worth of operations before switching back to the CPU.
So basically, your add_clocks() or step() or whatever function for the PPU is inserted after every time-stepping event (like after your nametable read, or whatever); and that function always switches back to the CPU once the PPU has caught up (clock >= 0).
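To make that concrete, here's a minimal sketch of the relative-clock idea. The names (PPU::step, CPU::step) are hypothetical, and the co_switch() back to the CPU is reduced to a return value so the sketch is self-contained:

```cpp
#include <cstdint>

// ppu.clock is the PPU's time relative to the CPU.
// Negative = the PPU is behind and must keep running;
// zero or positive = the PPU has caught up.
struct PPU {
  int64_t clock = 0;

  // Called after every time-stepping event (nametable fetch, etc.).
  // Returns true when the PPU has caught up to the CPU -- the point
  // where a cooperative emulator would co_switch() back to the CPU.
  bool step(int cycles) {
    clock += cycles;
    return clock >= 0;
  }
};

struct CPU {
  PPU& ppu;

  // Every CPU cycle pushes the PPU further into the past.
  // The NES PPU runs at 3x the CPU clock, so 1 CPU cycle = 3 PPU cycles.
  void step(int cycles) { ppu.clock -= cycles * 3; }
};
```

The CPU just keeps subtracting from the PPU's clock as it runs ahead; when control finally passes to the PPU, it loops on step() until the clock goes non-negative and then hands control back.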
When both can run ahead of each other (SNES CPU<>SMP): each can run ~100 opcodes ahead of the other. You'll probably end up with a context switch every ~200 executed instructions.
When only one can run ahead of the other (NES CPU<>PPU): one runs ~100 opcodes ahead, then the other catches up. Repeat. You end up with a context switch every ~100 executed instructions.
When neither can realistically run ahead of the other (SNES SMP<>DSP): disaster. You end up with a context switch every cycle.
There are no worst-case scenarios on the NES, at least. But I'll elaborate more on the SNES. Both the SMP (a processor that executes instructions) and the DSP (a sound processor that decodes BRR samples ... you know, you wrote an SNES emu ;) can access APURAM. Both read it and write it.
Nearly every SMP cycle contains a read or a write to APURAM. Lots of DSP cycles do as well. The odds that both are touching the same memory at the same time are slim to none, but it's possible. So without a "rewind" (save state) mechanism like Nemesis had, we have to sync all the time. Since the SMP executes at 1MHz and the DSP at 2MHz, that means we end up switching contexts ~2 million times per second (1 million each way).
Cooperative threading really isn't a model you should use everywhere just for consistency. What I do with my compatibility profile is turn the DSP into a state machine. It's simple enough that I only need a single switch() right inside the main loop. Some clever coding hides the fact that it's a state machine, and makes it 'run' as if it were a thread.
The main difference is that a state machine -has- to be able to exit at every cycle. So in this case, since we do have to exit at virtually every cycle, we just make that always the case. So instead of co_switch(dsp), we just do: while(dsp.clock < 0) { dsp_step(); dsp.clock += cycle; }
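A rough sketch of that shape (the four states and the DSP's starting offset here are made up purely for illustration; the real DSP does far more per cycle):

```cpp
#include <cstdint>

// A resumable state machine: each call to step_cycle() does one
// cycle's slice of work and advances the state, so the machine can
// exit after every cycle and pick up exactly where it left off.
struct DSP {
  int64_t clock = -8;  // relative to the SMP; negative = behind (hypothetical start)
  int state = 0;
  int samples = 0;     // placeholder for real BRR decoding output

  void step_cycle() {
    switch (state) {
      case 0: /* fetch BRR header */  break;
      case 1: /* decode nibbles */    break;
      case 2: /* apply envelope */    break;
      case 3: /* mix output */ samples++; break;
    }
    state = (state + 1) & 3;
  }
};

// The thread-like interface: instead of co_switch(dsp), just run the
// state machine until it catches up with the SMP.
inline void dsp_run_until_caught_up(DSP& dsp, int cycle = 2) {
  while (dsp.clock < 0) {
    dsp.step_cycle();
    dsp.clock += cycle;
  }
}
```

From the caller's point of view this looks just like syncing a thread; the difference is that every cycle boundary is an exit point, which is exactly what the SMP<>DSP case forces on you anyway.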
But get a situation like the SNES CPU<>SMP, and cooperative threading really shines. When you have each one running hundreds, if not thousands, of instructions ahead of the other, you get massive speed gains over the traditional state machine, even at opcode-level precision. If you consider a full SNES CPU state machine (breaking not only per opcode, not only per cycle, but in the middle of reads and writes for bus hold delays), the threading model turns 3-level nested state machines into linear code. It turns ~10 million forced stackless context switches per second into ~50,000 stackful context switches per second. So you get a major speed boost and way cleaner code.
It's also easy to get carried away. You may be tempted to make the math unit a separate thread, because they act in parallel on real hardware. This is all well and good, but you can easily bog down your performance this way. Threads are not cheap.
> I guess I see your point. It seems to me, though, that it'd be relatively easy to make a design that works preemptively on a multicore system, but works more like the traditional "catch up" approach for a single core system.
I dislike complexity. Detecting whether a system is single or dual core and acting differently is more complicated, and thus, more error prone.
Also note that when you're preemptive, every location that CPU A reads and CPU B writes has to be accessed atomically, and vice versa. Atomic operations are more costly than regular ones.
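For instance, a shared status register would have to become something like this (a hypothetical illustration of the preemptive case, not how my emulator works; the register layout is loosely modeled on the NES $2002 vblank flag):

```cpp
#include <atomic>
#include <cstdint>

// With real host threads, any register one emulated chip writes and
// the other reads must be atomic, or the compiler and CPU are free to
// tear, cache, or reorder the accesses.
struct PPUStatus {
  std::atomic<uint8_t> value{0};

  // PPU thread: raise the vblank flag (bit 7).
  void set_vblank() { value.fetch_or(0x80, std::memory_order_release); }

  // CPU thread: read the status register; reading clears the flags.
  uint8_t cpu_read() { return value.exchange(0, std::memory_order_acquire); }
};
```

Every one of those atomic read-modify-write operations is more expensive than a plain load or store, and in an emulator they sit on the hottest paths you have.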
> To me it's more about taking advantage of what's readily available.
But is it wise to drive 100% utilization of a quad core to run an NES emulator, when you could easily do it with 25% on one core?
Even if it gets you a higher FPS on that quad core chip, it won't be 400% faster. It'll be more like 20-40% faster or something. And it won't continue to scale, because the NES doesn't have very many chips.
> And it's not like this wouldn't run on a single core system. It just wouldn't get as good of performance. A program getting worse performance on a lower performance machine is not really something I'll lose sleep over.
In the ideal multi-threaded model, two threads would never have to stuff all their non-volatile regs on the stack, invalidate the pipeline, effectively flush much of their data cache, and perform a ring 3 -> ring 0 -> ring 3 transition to swap threads. One would just wait for the other. On a single core system, you have to do all of that. It's not going to be a little bit slower. It's going to be substantially slower.
I feel you really need to understand the costs here. When people talk about multithreading as the way of the future, they're invariably giving you examples like web servers where a thread wakes up, what, ten times a second? A hundred? Maybe a website even gets a thousand hits per second? That is child's play. We are talking about processors that achieve MILLIONS of synchronizations/context switches per second.
Write yourself some simple test programs before you choose your model: just do "dummy CPU A" + "dummy CPU B", and have each one increment a counter and then sync to the other. Watch in horror as the traditional multithreaded model gets you something like 100,000 increments per second. On a 3GHz processor. Then try the same with my cooperative threaded model, and see it reach up to 10,000,000 increments per second. Then do a barebones switch(state) on each, and observe 100,000,000 increments per second. Then try three nested states like you'd need for a complex processor like the SNES, and see that drop to 1,000,000 increments per second.
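Here's a minimal sketch of the preemptive variant of that test: two host threads ping-pong through a mutex and condition variable, each incrementing its counter once per hand-off. Time the run() call yourself; the cooperative and switch(state) variants are the same loop with the hand-off replaced by co_switch() or a plain state dispatch:

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

struct PingPong {
  std::mutex m;
  std::condition_variable cv;
  int turn = 0;  // whose turn it is: 0 = dummy CPU A, 1 = dummy CPU B
  uint64_t count_a = 0, count_b = 0;

  void run(uint64_t iterations) {
    auto worker = [&](int self, uint64_t& counter) {
      for (uint64_t i = 0; i < iterations; i++) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return turn == self; });  // block until it's our turn
        counter++;        // one unit of "emulation" work
        turn = self ^ 1;  // hand off to the other dummy CPU
        cv.notify_one();  // wake it up -- this is the expensive part
      }
    };
    std::thread a(worker, 0, std::ref(count_a));
    std::thread b(worker, 1, std::ref(count_b));
    a.join();
    b.join();
  }
};
```

Every hand-off here pays for a kernel-assisted wakeup; that's the cost you're measuring against a co_switch() or a switch(state) dispatch doing the exact same counting.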
Once you have your numbers, see how they would work with how many context switches you'd need in the real world for emulating a given system. Realize that those numbers are -without- any actual emulation. This is just the overhead of keeping things synchronized.
Also, you still seem interested in yield()'ing a thread. Be sure to try that with your tests above. You will -not- be able to yield a thread ten million times a second. It will never happen.
> I am very intrigued by your cooperative approach. If you can explain your solution for the PPU running ahead of the CPU problem I brought up above, I'm very heavily considering it.
Please take a look at my emulator source. fc/cpu and fc/ppu.
Or for the lazy ;)
http://pastebin.com/t3id0NP7
That's my cycle-accurate PPU renderer (it doesn't handle the crazy sprite fetching stuff blargg found.) It looks exactly like a scanline-based renderer, and is written exactly the same way.
And be sure to look at the performance. I get ~300fps on my Core i7. That's very, very bad for an NES emulator. A lot of it is due to audio processing at 1.78MHz with a polyphase sinc resampler, though. But even at ~500-600fps, that'd still be very bad for an NES emu.