A few questions:
1) What OS is this running under? Or more specifically, what preemptive threading model are you using?
2) How many processor cores were available to the program?
3) How many threads did you use? Would you expect the results to change as you add more threads for things like the APU?
4) What optimization level did you use when compiling the code? Have you tried optimizing for size vs speed?
My prediction: it helps preemptive a little to a fair bit, but hurts cooperative (the timestamp computation, and choosing between the FIFO and a direct read all the time, will be painful.)
I see that the frame rate is identical between Balloon Fight and SMB despite the thread switching increasing by over 40x. This says to me that at the levels you are getting with NES emulation, that cothreading's overhead is negligible.
> That's ~11 context switches per frame. Not far from my estimate XD.
That is really interesting that thread switching is so low on NES. I have to say, I never bothered to count the switch counts at all on anything but SNES. But I will say there, SNES is very very very much worse. I've seen 200,000 - 4,000,000 syncs a second there (you can have three CPUs, two PPUs and a DSP.) Your test data basically confirms that there's no point in a preemptive bsnes. So that's one more reason I really appreciate you putting this together.
Preemptive can definitely help emulators, but only at much less fine scales. It's a balance between the amount of time spent emulating to amount of time spent switching. NES is so trivial that your cache stalls and sync primitives are much more computationally expensive. PS2+ would obviously be a different story, plus later systems don't need as fine-grained syncing.
> 2) Detect common $2002 spin loops and run the PPU ahead until $2002 status changes.
Ooooh, that sounds wild. Snes9X tried that with SA-1 sync loops, and it ended up needing to be disabled in every other game. Very tricky to get something like that 100% right.
> Do you have numbers for a non-multithreaded version?
A state machine design would have to be a completely separate emulator, unfortunately.
It would also be quite difficult to do a precise measurement, because the syncing point style would be totally different, but you could get fairly close.
It's hard to say in this case. Given those amazingly low sync counts, I'd be tempted to say it'd be faster than a state machine, at least in Balloon Fight. But generally, for most emulators, cothreading will be slower. You would only use cothreading because it makes the code *substantially* cleaner and less error prone (not hyperbole; you basically remove the state machine entirely yet gain effectively infinite precision to break out anywhere transparently), as I'm sure anyone who's tried it will attest.
Hamburgler wrote:
> 1) What OS is this running under? Or more specifically, what preemptive threading model are you using?

Win7 Home Premium. The emu is written in C++ with Boost for threading.
> 2) How many processor cores were available to the program?

Intel i3, dual core, hyperthreaded, so 4 "cores" (technically 2 cores with 2 threads each) at 3.2 GHz.
(I realize that ~230 fps in an nes emu is pretty poor performance for this processor, but I am emulating as many of the nitty gritty details as possible).
> 3) How many threads did you use? Would you expect the results to change as you add more threads for things like the APU?

2: PPU and CPU. There is no APU implementation yet. I didn't want to get too far with development until I had an idea on performance.
> 4) What optimization level did you use when compiling the code? Have you tried optimizing for size vs speed?

I didn't really stray from the default "Release" settings in Visual Studio.
James wrote:
> Do you have numbers for a non-multithreaded version?

Nope. As byuu mentioned, that would require an entirely separate emulator.
byuu wrote:
> I see that the frame rate is identical between Balloon Fight and SMB despite the thread switching increasing by over 40x. This says to me that at the levels you are getting with NES emulation, that cothreading's overhead is negligible.

I interpreted it the same way, which is why I'm pretty sure I'm going to gut out the preemptive support and switch to cooperative exclusively.
> That is really interesting that thread switching is so low on NES.

I'm not surprised at all. There's very little communication between the CPU and PPU usually.
It's not always the case, though. Games like Rad Racer do a lot of mid-frame scroll changes, so sync ups will be more frequent there. But I still suspect that $2002 polling is the biggest culprit.
> Your test data basically confirms that there's no point in a preemptive bsnes. So that's one more reason I really appreciate you putting this together.

Yeah, after seeing this I wouldn't consider it for a SNES emu either.
I'm glad it's appreciated. Really this is more of an exercise and curiosity rather than a practicality, but I'm glad it has some value apart from that.
> Ooooh, that sounds wild. Snes9X tried that with SA-1 sync loops, and it ended up needing to be disabled in every other game. Very tricky to get something like that 100% right.

It's not very complicated at all on the NES. I'd probably just look for one of the following templates on $2002 reads:
Code:

    loop:   LDA $2002
            AND #xx
            Bxx loop

    ; and...

    loop:   BIT $2002
            Bxx loop

    ; and possibly...

    loop:   LDA $2002
            ASL A
            Bxx loop
Code:

    s0wait: BIT $2002
            BMI skip_rastereffect  ; we missed sprite 0 entirely this frame
            BVC s0wait
So I'm going to call an end to the experiment. Switch to cooperative exclusively, and continue on the emulator.
It was fun and informative though!
byuu wrote:
> You would only use cothreading because it makes the code *substantially* cleaner and less error prone (not hyperbole; you basically remove the state machine entirely yet gain effectively infinite precision to break out anywhere transparently), as I'm sure anyone who's tried it will attest.

It really is an intuitive method for writing an emulator. I don't know if yours is the first implementation, but I learned about this method from your posts a few years ago, so you get the credit; nice job!
For a scanline renderer, instead of blitting one line at a time ... cache the MMIO registers to a "line" buffer, and do this for the entire frame, under the knowledge that VRAM isn't writable, and OAM isn't supposed to be (only one game does it anyway.) You could pack the state down to ~50 bytes per scanline, plus possibly extra for palette changes (store as patches for each change, start with the CGRAM copy at start of frame.)
Now when you go to render the entire frame at once, use OpenMP (or better, roll your own lighter-weight version; OpenMP has crazy overhead) to split up the scanlines.
The thing with OpenMP is that it's almost always slower than non-OpenMP, except when things get really intense. Like for instance, it really helps with my HQ2x graphics filter, but it tends to hurt a simple 2X bilinear interpolation scale.
The idea probably won't work as well on the NES, since that tends to demand a cycle-based renderer for a lot more than just one game.
But even with one thread, having all that code running in a tight loop with no contextual changes should lead to a nice performance boost by way of caching.
It seems to be the way to have the advantages of cooperative threading with multicore backing.
byuu wrote:
> [...] Write yourself some simple test programs before you choose your model: just do "dummy CPU A" + "dummy CPU B", and have each one increment a counter and then sync to the other. Watch in horror as the traditional multithreaded model gets you something like 100,000 increments per second. On a 3GHz processor. Then try the same with my cooperative threaded model, and see it reach up to 10,000,000 increments per second. Then do a barebones switch(state) on each, and observe 100,000,000 increments per second. Then try three nested states like you'd need for a complex processor like the SNES, and see that drop to 1,000,000 increments per second.
> Once you have your numbers, see how they would work with how many context switches you'd need in the real world for emulating a given system. Realize that those numbers are -without- any actual emulation. This is just the overhead of keeping things synchronized.

As a fun weekend project I thought I'd give this a try... except that:
- there aren't 2 counters but 5 (65c816, PPU, S-SMP, S-DSP, MSU1)
- the mainboard clock syncs at 2 * 21.477 MHz
- the audioboard clock syncs at 3,075,840 Hz instead of the full 24.6 MHz
Usage: open & compile the *.lpr file in Lazarus or use the binaries. The program logic details are in the big comment at the top. (I think byuu had a similar design once before switching to extremely small fractions of a second for time keeping?)
On an i7-4790K (May 2014 CPU) @ 4.4 GHz I get ~9.x * 60 = ~550 fps (without MSU1) and ~6.7 * 60 = ~400 fps (with MSU1).
On an i5-3570 (June 2012 CPU) @ 3.8 GHz I get ~5.8 * 60 = ~350 fps (without MSU1) and ~4.3 * 60 = ~255 fps (with MSU1).
The main advantage is of course the ability to stop anywhere, even between PHI1 and PHI2 (making it a phase-accurate emulator instead of just cycle-accurate). The main problems are the number of components that must run in parallel (those 550 fps will go down faster than a brick) and getting into the relevant state handlers quickly. Making good use of the CPU caches and modern branch prediction is key here. With this many cycles to dispatch, a simple switch's jump tables would probably exceed most CPU caches at current pointer sizes.
Super Famicom ("2/1/3" SNS-CPU-GPM-02) → SCART → OSSC → StarTech USB3HDCAP → AmaRecTV 3.10