All times are UTC - 7 hours





Post new topic Reply to topic  [ 55 posts ]  Go to page Previous  1, 2, 3, 4
Author Message
 Post subject:
PostPosted: Wed Jun 27, 2012 5:14 am 

Joined: Wed Jul 04, 2007 8:40 am
Posts: 36
Very interesting work! I'm currently toying with preemptive threading on my own code (pthreads/osx/c++/opengl) so this thread has been particularly interesting to me. I would have figured preemptive would win but these results are hard to argue with.

A few questions:

1) What OS is this running under? Or more specifically, what preemptive threading model are you using?

2) How many processor cores were available to the program?

3) How many threads did you use? Would you expect the results to change as you add more threads for things like the APU?

4) What optimization level did you use when compiling the code? Have you tried optimizing for size vs speed?


 
 Post subject:
PostPosted: Wed Jun 27, 2012 7:22 am 

Joined: Sat Jan 22, 2005 8:51 am
Posts: 429
Location: Chicago, IL
Do you have numbers for a non-multithreaded version?

_________________
get nemulator
http://nemulator.com


 
 Post subject:
PostPosted: Wed Jun 27, 2012 5:52 pm 

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1518
> 1) Put register writes in a FIFO queue and have the ppu pull them out and apply them as it runs, rather than having to switch/sync up on every write.

My prediction: it helps out preemptive a little-to-fair bit, but hurts cooperative (timestamp computation and choosing whether to use FIFO or direct read all the time will be painful.)
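
As a rough illustration of the FIFO idea quoted above, here is a minimal single-threaded sketch. All names (`RegWrite`, `PpuWriteQueue`, `push`, `drain`) are hypothetical and not taken from the emulator under discussion, and a preemptive build would additionally need a lock or a lock-free single-producer/single-consumer queue around it. The CPU stamps each write with its current cycle; the PPU applies everything stamped at or before its own clock as it catches up.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical record of one CPU write to a PPU register.
struct RegWrite {
    uint64_t cycle;  // timestamp of the write, in a clock both sides share
    uint16_t addr;   // register address, e.g. $2000-$2007
    uint8_t  value;  // byte written
};

class PpuWriteQueue {
public:
    // CPU side: record the write instead of syncing the PPU immediately.
    void push(uint64_t cycle, uint16_t addr, uint8_t value) {
        // Writes must arrive in timestamp order for drain() to be correct.
        assert(pending_.empty() || pending_.back().cycle <= cycle);
        pending_.push_back({cycle, addr, value});
    }

    // PPU side: apply every write stamped at or before `now`, oldest first.
    template <typename Apply>
    void drain(uint64_t now, Apply apply) {
        while (!pending_.empty() && pending_.front().cycle <= now) {
            apply(pending_.front());
            pending_.pop_front();
        }
    }

    bool empty() const { return pending_.empty(); }

private:
    std::deque<RegWrite> pending_;
};
```

Register reads such as $2002 would still force a sync in this scheme, because the CPU needs the PPU's answer before it can continue; only writes can be deferred.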

I see that the frame rate is identical between Balloon Fight and SMB despite the thread switching increasing by over 40x. This says to me that at the levels you are getting with NES emulation, cothreading's overhead is negligible.

> That's ~11 context switches per frame. Not far from my estimate XD.

That is really interesting that thread switching is so low on NES. I have to say, I never bothered to count the switch counts at all on anything but SNES. But I will say there, SNES is very very very much worse. I've seen 200,000 - 4,000,000 syncs a second there (you can have three CPUs, two PPUs and a DSP.) Your test data basically confirms that there's no point in a preemptive bsnes. So that's one more reason I really appreciate you putting this together.

Preemptive can definitely help emulators, but only at much less fine scales. It's a balance between the amount of time spent emulating and the amount of time spent switching. NES is so trivial that your cache stalls and sync primitives are much more computationally expensive than the emulation work itself. PS2+ would obviously be a different story, plus later systems don't need as fine-grained syncing.

> 2) Detect common $2002 spin loops and run the PPU ahead until $2002 status changes.

Ooooh, that sounds wild. Snes9X tried that with SA-1 sync loops, and it ended up needing to be disabled in every other game. Very tricky to get something like that 100% right.

> Do you have numbers for a non-multithreaded version?

A state machine design would have to be a completely separate emulator, unfortunately.
It would also be quite difficult to do a precise measurement, because the syncing point style would be totally different, but you could get fairly close.

It's hard to say in this case. Given those amazingly low sync counts, I'd be tempted to say it'd be faster than a state machine, at least in Balloon Fight. But generally, for most emulators, cothreading will be slower. You would only use cothreading because it makes the code *substantially* cleaner and less error-prone (not hyperbole; you basically remove the state machine entirely yet gain effectively infinite precision to break out anywhere transparently), as I'm sure anyone who's tried it will attest.


 
 Post subject:
PostPosted: Wed Jun 27, 2012 8:19 pm 

Joined: Wed Nov 10, 2004 6:47 pm
Posts: 1849
Hamburgler wrote:
1) What OS is this running under? Or more specifically, what preemptive threading model are you using?


Win7 Home premium. Emu is written in C++ with Boost for threading.

Quote:
2) How many processor cores were available to the program?

Intel i3, dual core, hyperthreaded. So 4 "cores" (technically 2 cores with 2 threads each). 3.2 GHz.

(I realize that ~230 fps in an NES emu is pretty poor performance for this processor, but I am emulating as many of the nitty-gritty details as possible.)

Quote:
3) How many threads did you use? Would you expect the results to change as you add more threads for things like the APU?


2. PPU and CPU. There is no APU implementation yet. I didn't want to get too far with development until I had an idea on performance.

Quote:
4) What optimization level did you use when compiling the code? Have you tried optimizing for size vs speed?


I didn't really stray from the default "Release" settings in Visual Studio.



James wrote:
Do you have numbers for a non-multithreaded version?


Nope. As byuu mentioned, that would require an entirely separate emulator.

byuu wrote:
I see that the frame rate is identical between Balloon Fight and SMB despite the thread switching increasing by over 40x. This says to me that at the levels you are getting with NES emulation, cothreading's overhead is negligible.


I interpreted it the same way, which is why I'm pretty sure I'm going to gut the preemptive support and switch to cooperative exclusively.

Quote:
That is really interesting that thread switching is so low on NES.


I'm not surprised at all. There's very little communication between CPU and PPU usually.

It's not always the case, though. Games like Rad Racer do a lot of mid-frame scroll changes, so sync ups will be more frequent there. But I still suspect that $2002 polling is the biggest culprit.

Quote:
Your test data basically confirms that there's no point in a preemptive bsnes. So that's one more reason I really appreciate you putting this together.


Yeah after seeing this I wouldn't consider it for a SNES emu either.

I'm glad it's appreciated. Really this is more of an exercise and curiosity rather than a practicality, but I'm glad it has some value apart from that.

Quote:
Ooooh, that sounds wild. Snes9X tried that with SA-1 sync loops, and it ended up needing to be disabled in every other game. Very tricky to get something like that 100% right.


It's not very complicated at all on the NES. I'd probably just look for one of the following templates on $2002 reads:

Code:
loop:
  LDA $2002
  AND #xx
  Bxx loop

; and...

loop:
  BIT $2002
  Bxx loop

; and possibly...

loop:
  LDA $2002
  ASL A
  Bxx loop


If I don't find those exact templates, I'd just sync up the PPU normally.
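
A rough sketch of how that template matching could look on the raw opcode bytes. The function names are hypothetical, and it assumes the bytes at the CPU's program counter are available as one contiguous buffer (no bank boundary in the middle of the loop) and that the game reads $2002 itself rather than one of its mirrors. The opcode values are the standard 6502 encodings: `LDA abs` = $AD, `BIT abs` = $2C, `AND #imm` = $29, `ASL A` = $0A, and all eight conditional branches share the bit pattern xxx10000.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// All 6502 conditional branches (BPL, BMI, BVC, BVS, BCC, BCS, BNE, BEQ)
// have the low five bits equal to 10000.
inline bool is_branch(uint8_t op) { return (op & 0x1F) == 0x10; }

// True if p points at `opcode` followed by the little-endian address $2002.
inline bool reads_2002(const uint8_t* p, uint8_t opcode) {
    return p[0] == opcode && p[1] == 0x02 && p[2] == 0x20;
}

// The branch operand is the last byte of the loop. Branch offsets are
// relative to the address after the branch, so jumping back to the first
// byte of a `len`-byte loop needs an offset of -len.
inline bool loops_to_start(const uint8_t* code, std::size_t len) {
    return static_cast<int8_t>(code[len - 1]) == -static_cast<int>(len);
}

// Returns the length in bytes of a recognized $2002 polling loop starting
// at code[0], or 0 if none of the three templates matches.
std::size_t match_2002_spin(const uint8_t* code, std::size_t avail) {
    assert(code != nullptr);
    constexpr uint8_t LDA_ABS = 0xAD, BIT_ABS = 0x2C;
    constexpr uint8_t AND_IMM = 0x29, ASL_A = 0x0A;

    // BIT $2002 / Bxx loop
    if (avail >= 5 && reads_2002(code, BIT_ABS) &&
        is_branch(code[3]) && loops_to_start(code, 5)) return 5;

    // LDA $2002 / ASL A / Bxx loop
    if (avail >= 6 && reads_2002(code, LDA_ABS) && code[3] == ASL_A &&
        is_branch(code[4]) && loops_to_start(code, 6)) return 6;

    // LDA $2002 / AND #xx / Bxx loop
    if (avail >= 7 && reads_2002(code, LDA_ABS) && code[3] == AND_IMM &&
        is_branch(code[5]) && loops_to_start(code, 7)) return 7;

    return 0;  // unknown shape: fall back to a normal PPU sync
}
```

Anything that returns 0 simply takes the existing sync path, so an unrecognized loop costs performance but not correctness.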


 
 Post subject:
PostPosted: Wed Jun 27, 2012 8:52 pm 

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 21560
Location: NE Indiana, USA (NTSC)
Another template with a fail-safe inside can be found in some of my own projects:
Code:
s0wait:
  BIT $2002
  BMI skip_rastereffect  ; we missed sprite 0 entirely this frame
  BVC s0wait


 
 Post subject:
PostPosted: Wed Jun 27, 2012 9:18 pm 

Joined: Wed Nov 10, 2004 6:47 pm
Posts: 1849
Actually, now that I've mulled over the details... I've decided I'm not going to bother with those optimization attempts. I doubt they would push preemptive over where cooperative already is, and they would severely complicate the code.

So I'm going to call an end to the experiment. Switch to cooperative exclusively, and continue on the emulator.


It was fun and informative though!


 
 Post subject:
PostPosted: Thu Jun 28, 2012 7:26 am 

Joined: Sat Jan 22, 2005 8:51 am
Posts: 429
Location: Chicago, IL
byuu wrote:
You would only use cothreading because it makes the code *substantially* cleaner and less error-prone (not hyperbole; you basically remove the state machine entirely yet gain effectively infinite precision to break out anywhere transparently), as I'm sure anyone who's tried it will attest.

It really is an intuitive method for writing an emulator. I don't know if yours is the first implementation, but I learned about this method from your posts a few years ago, so you get the credit; nice job!



 
 Post subject:
PostPosted: Thu Jun 28, 2012 6:42 pm 

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1518
Oh, before I forget ... I actually did have an interesting multithreading idea for an SNES PPU renderer (there, the PPU is an absolute beast.)

For a scanline renderer, instead of blitting one line at a time ... cache the MMIO registers to a "line" buffer, and do this for the entire frame, under the knowledge that VRAM isn't writable, and OAM isn't supposed to be (only one game does it anyway.) You could pack the state down to ~50 bytes per scanline, plus possibly extra for palette changes (store as patches for each change, start with the CGRAM copy at start of frame.)

Now when you go to render the entire frame at once, use OpenMP (or better to roll your own lighter weight version; OpenMP has crazy overhead) to split up the scanlines.

The thing with OpenMP is that it's almost always slower than non-OpenMP, except when things get really intense. Like for instance, it really helps with my HQ2x graphics filter, but it tends to hurt a simple 2X bilinear interpolation scale.

The idea probably won't work as well on the NES, since that tends to demand a cycle-based renderer for a lot more than just one game.

But even with one thread, having all that code running in a tight loop with no contextual changes should lead to a nice performance boost by way of caching.
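
To make the shape of that idea concrete, here is a heavily reduced sketch. Everything in it is an assumption for illustration: the two-field `LineState` stands in for the ~50 bytes of real MMIO state, the 256x224 frame size is assumed, and the "renderer" only fills each line with a backdrop color. The point it tries to show is that once the registers are latched per scanline, each line depends only on its own snapshot, so disjoint ranges can be rendered in any order or handed to separate worker threads.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int kLines = 224;  // assumed visible scanlines
constexpr int kWidth = 256;  // assumed pixels per scanline

// Hypothetical, tiny stand-in for the per-scanline MMIO snapshot.
struct LineState {
    uint8_t  brightness;  // 0 = screen blanked to black
    uint16_t backdrop;    // 15-bit backdrop color
};

struct Frame {
    std::vector<LineState> lines  = std::vector<LineState>(kLines);
    std::vector<uint16_t>  pixels = std::vector<uint16_t>(kLines * kWidth);
};

// During emulation: snapshot the registers once per scanline, draw nothing.
void latch_line(Frame& f, int y, LineState regs) {
    assert(y >= 0 && y < kLines);
    f.lines[y] = regs;
}

// At end of frame: render scanlines [begin, end) in one tight loop.
// Ranges that do not overlap touch disjoint parts of `pixels`.
void render_range(Frame& f, int begin, int end) {
    assert(0 <= begin && begin <= end && end <= kLines);
    for (int y = begin; y < end; ++y) {
        const LineState& s = f.lines[y];
        // Placeholder for real background/sprite composition.
        const uint16_t color = (s.brightness == 0) ? 0 : s.backdrop;
        for (int x = 0; x < kWidth; ++x) {
            f.pixels[y * kWidth + x] = color;
        }
    }
}
```

A single call `render_range(frame, 0, kLines)` gives the one-thread tight loop; splitting the range into N chunks is where a thread pool or OpenMP would come in.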


 
 Post subject:
PostPosted: Sat Jun 30, 2012 2:34 pm 

Joined: Sat Jun 30, 2012 2:30 pm
Posts: 1
Has anyone tried User-Mode Scheduling http://msdn.microsoft.com/en-us/library ... p/dd627187 for an emulator project, or is anyone aware of someone else doing it?
It seems to be the way to get the advantages of cooperative threading with multicore backing.


 
PostPosted: Sun Jul 21, 2019 3:27 pm 

Joined: Mon Jan 23, 2006 7:47 am
Posts: 201
Location: Germany
*mother-of-all-bumps*

byuu wrote:
[...] Write yourself some simple test programs before you choose your model: just do "dummy CPU A" + "dummy CPU B", and have each one increment a counter and then sync to the other. Watch in horror as the traditional multithreaded model gets you something like 100,000 increments per second. On a 3GHz processor. Then try the same with my cooperative threaded model, and see it reach up to 10,000,000 increments per second. Then do a barebones switch(state) on each, and observe 100,000,000 increments per second. Then try three nested states like you'd need for a complex processor like the SNES, and see that drop to 1,000,000 increments per second.

Once you have your numbers, see how they would work with how many context switches you'd need in the real world for emulating a given system. Realize that those numbers are -without- any actual emulation. This is just the overhead of keeping things synchronized.

As a fun weekend project I thought I'd give this a try... except that
  • there aren't 2 counters but 5 (65c816, PPU, S-SMP, S-DSP, MSU1)
  • the mainboard clock syncs at 2 * 21.477 MHz
  • the audioboard clock syncs at 3,075,840 Hz instead of the full 24.6 MHz

http://www.mediafire.com/folder/1nr1soivjggkh/
Usage: open & compile the *.lpr file in Lazarus or use the binaries. The program logic details are in the big comment at the top. (I think byuu had a similar design once before switching to extremely small fractions of a second for time keeping?)

Results:
On an i7-4790K (May 2014 CPU) @ 4.4 GHz I get ~9.x * 60 = ~550 fps (without MSU1) and ~6.7 * 60 = ~400 fps (with MSU1).
On an i5-3570 (June 2012 CPU) @ 3.8 GHz I get ~5.8 * 60 = ~350 fps (without MSU1) and ~4.3 * 60 = ~255 fps (with MSU1).

The main advantage is of course the ability to stop anywhere, even between PHI1 and PHI2 (making it a phase-accurate emulator instead of just cycle-accurate :D ). The main problems are the number of components that must run in parallel (those 550 fps will go down faster than a brick) and quickly getting into the relevant state handlers. Using CPU caches and modern branch prediction is key here. Given how many cycles there are, a simple switch would probably exceed most CPU caches given the current pointer sizes.
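
For anyone who wants to start from byuu's simpler two-processor experiment quoted above, the "barebones switch(state)" variant could look something like the sketch below. It is C++ rather than the Pascal of the linked program, the names `DummyCpu` and `run_lockstep` are made up, and it deliberately contains no timing code: wrap the call in `std::chrono::steady_clock` reads to get increments per second, and be aware that an optimizing compiler may collapse a loop this trivial.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical dummy processor: no emulation, just a counter and two
// resume points standing in for "where was I when I last yielded".
struct DummyCpu {
    int      state   = 0;
    uint64_t counter = 0;

    // Run up to the next sync point, then return to the scheduler.
    void step() {
        switch (state) {
        case 0: ++counter; state = 1; return;
        case 1: ++counter; state = 0; return;
        default: assert(false && "unreachable state");
        }
    }
};

// Alternate A and B, syncing after every increment, until A has counted
// to `target`. Returns the number of hand-offs between the two.
uint64_t run_lockstep(DummyCpu& a, DummyCpu& b, uint64_t target) {
    uint64_t switches = 0;
    while (a.counter < target) {
        a.step(); ++switches;
        b.step(); ++switches;
    }
    return switches;
}
```

Swapping `step()` for a cooperative-thread switch or a pair of preemptive threads with a sync primitive, while keeping `run_lockstep`'s contract the same, gives the three-way comparison the quote describes.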

_________________
My current setup:
Super Famicom ("2/1/3" SNS-CPU-GPM-02) → SCART → OSSC → StarTech USB3HDCAP → AmaRecTV 3.10

