Multithreaded emu designs

Discuss emulation of the Nintendo Entertainment System and Famicom.

User avatar
Disch
Posts: 1848
Joined: Wed Nov 10, 2004 6:47 pm

Multithreaded emu designs

Post by Disch »

Hey everyone. Long time no see. I've been busy with "real life" nonsense... and emudev (and really, all hobby programming) has sort of drifted out of my life.

However, there's just that "something" about the NES that fascinates me and keeps bringing me back.

Anyway, I was kicking around ideas for a new emu. But, with today's multicore and multithreaded CPUs, making a single-threaded emulator seems rather antiquated. Especially since emulators have to run several different systems in parallel. So I figure if I start a new emu project I'm going to try to take advantage of a multithreaded setup.

Of course, multithreading is trickier, so I thought it'd be fun/useful to open a design discussion on the topic. Has anyone here made such an emu? I know byuu has. How did you go about it?

My idea is fundamentally the same as the "catch up" approach most people here are probably familiar with. The difference is, you don't catch up the PPU on register reads/writes... instead it's constantly catching up in a parallel thread. The time it's "catching up" to is the current CPU timestamp, which would constantly be increasing as the CPU executes instructions.

There's still the same sync issues. The PPU can't surpass the CPU and we need a way to sync them on register accesses.

I'm thinking that since the CPU is clocked slower and performs generally simpler tasks, if given the same CPU time, the CPU's timestamp will advance much more quickly than the PPU timestamp, which is beneficial. This means we can probably get away with having the PPU thread do a "dumb spin" while waiting for the CPU to advance. Something like:

Code: Select all

while(ppu_time >= cpu_time);
... or if you want to be a little more intelligent... possibly this instead:

Code: Select all

while(ppu_time >= cpu_time) std::this_thread::yield();
This would be done every time the PPU emulates a cycle, to ensure it doesn't surpass the CPU.

On the CPU side, however, I don't think we'd want to do this. The CPU will have to wait for the PPU to catch up on register accesses so that the two systems are synced. Since the PPU is likely going to take longer to catch up, a dumb spin on the CPU end probably wouldn't be very effective.

I'm thinking that something like C++11's condition_variable could be used. The CPU would effectively sleep until the PPU emits a signal that it has caught up.

The same thing could be done for the other subsystems, with each running in its own thread.

My main beef with the catch-up approach was that you'd have to write your PPU logic in a way that it could enter and exit at any given cycle. With a separate thread, that's no longer the case. You can write the logic cleanly and straightforwardly, without having to allow it to be interrupted and restarted later. That's the hope, anyway... this is still all theory. I'm not sure how well it'd work in practice.



Anyone have any thoughts?



PS. This system actually would have quite a bit of "thrashing" between threads on things like $2002 spin loops. Maybe it would be wiser to have $2002 status predicted so that the CPU can read it and resume without having to wait for the PPU to catch up. Although that gets tricky with the weird sprite overflow behavior. Maybe the thrashing wouldn't be so bad... I'd have to try it out and see.
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Post by Near »

> Has anyone here made such an emu? I know byuu has. How did you go about it?

I went with cooperative multithreading instead of preemptive multithreading. In English: only one thread actually runs at a time, and it decides when to switch to another thread. But they are actual unique threads, with their own stack frame and all.

For my model, every two processors that communicate share an int64_t clock value, starting at zero. When CPU A runs for X clocks, you add X * CPU B's frequency. When CPU B runs for X clocks, you subtract X * CPU A's frequency. If the two frequencies are the same (there's only one oscillator in the NES, so there they always are), you can drop the multiplication.
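As a sketch (my names, not actual bsnes code), that shared relative clock boils down to:

```cpp
#include <cstdint>

// Single-oscillator case (NES): the frequency factors cancel, so plain
// adds and subtracts suffice. clock >= 0 means CPU A is ahead of CPU B.
struct RelativeClock {
    int64_t clock = 0;

    void a_ran(int64_t cycles) { clock += cycles; }  // CPU A executed
    void b_ran(int64_t cycles) { clock -= cycles; }  // CPU B executed
    bool a_is_ahead() const    { return clock >= 0; }
};
```

The switch rule then reads directly off a_is_ahead(): when A is about to do something visible to B and a_is_ahead() is true, you switch to B first.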

The goal is to minimize context switches between threads (they're very slow.)

Allow CPU A to run as long as possible. When it does something that would be visible to CPU B (say A = CPU, B = PPU; in this case, CPU writes to or reads from a PPU register), you check the clock. If it's >= 0, then A is ahead of B. Switch to B here before doing the write. B now runs and keeps subtracting from the clock. B runs until it does something visible to CPU A (let's say PPU sets a flag that the CPU might read), at which time you see if clock < 0. If so, switch back to A. And now A performs that write.

I don't do it due to complication, but you could take the idea further: when CPU A writes to CPU B, but is ahead of CPU B, you could store that in a 'read buffer' with a timestamp and keep going. You would then only sync when CPU A -reads- from CPU B while -ahead- of CPU B. Then CPU B would read from your buffer and compensate based on the timestamp. Done right, it could help speed a lot. Done wrong, it may be slower than not doing this at all.

That said, the cooperative run-out-of-order thing works amazingly well on narrow interfaces: CPU<>PPU is a great example, only eight registers that belong to the PPU. You know when the CPU is talking to the PPU. But take the SNES SMP + DSP, or CPU + SA1: they share RAM. So even if they're both doing their own things with their own areas of RAM (most likely), you won't have any way to know for sure. That forces you to sync all the time. Threading can get -very- painful in these cases.

-----

Now if you're looking for preemptive instead ... you definitely can't yield threads waiting for events. That requires OS intervention (ring 3 -> 0 -> 3 transitions on x86) ... you'll be able to call yield, at most, ~100,000 times per second, with major latency just waiting for a thread to resume when it should be immediate. The same goes for mutexes; you can't use those.

You are going to have to use your own lock values that you read from and write to using atomic operations. It's the only way you'll get things fast enough for real emulation. So follow something like the cooperative model, but set "lock" flags that do things like: while(atomic_read(cpu_lock) == true);
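A sketch of those lock values with C++11 atomics (illustrative names only; a real design would also want the two counters padded onto separate cache lines to avoid false sharing):

```cpp
#include <atomic>
#include <cstdint>

// Each thread publishes its own timestamp; the other spins on it entirely
// in user space -- no yield, no mutex, no kernel transition.
std::atomic<int64_t> cpu_time{0};
std::atomic<int64_t> ppu_time{0};

// PPU thread: dumb-spin so the PPU never passes the CPU.
inline void ppu_wait() {
    while (ppu_time.load(std::memory_order_acquire) >=
           cpu_time.load(std::memory_order_acquire)) {
        // busy-wait until the CPU's timestamp moves ahead again
    }
}
```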

-----

Both approaches have major pros and cons.

Cooperative is great in that you get the totally clean code, you don't have to allow enter+exit at every cycle, you don't have to worry about race conditions or any of that stuff, and all of the operations are in user space so it's super fast. You do have to keep in mind that you will pay a performance penalty compared to switch tables (state machines), but it's not a massive penalty. It's up to what you value more: code speed or code clarity.

Preemptive is the only way to actually take advantage of more than one core. The downside is complexity, and all of the cores you use will be pegged at 100%. It may run faster, but it'll drain a lot more watts of power.

My personal opinion: cooperative works great on systems with only one CPU. Preemptive works terribly there. If you are emulating a simple enough system (NES, SNES, GB(C), GBA, etc); why -require- a multi-core processor to use the emulator? Just use cooperative. Easier, quicker, better. If you are emulating a system that likely won't run at full speed on single core systems (be it because the host hardware is slow like an ARM quad core cell phone; or because the emulated system is a beast, like the PS2), you have no choice. Go preemptive.

> Maybe it would be wiser to have $2002 status predicted so that the CPU can read it and resume without having to wait for the PPU to catch up.

Nemesis wrote a Sega Genesis emulator that used something similar to prediction. He'd save that processor's state when a prediction occurred, and then would eventually switch to the other thread. If it turns out the prediction caused a problem, he'd load that processor's state back to the old value to 'unwind' running ahead to fix it.

It was a pretty amazing design, but I will note that it -required- a quad? core CPU. I don't know how much was due to the accuracy of his emulation, and how much was due to cost of his emulation model. I presume it was a little bit of both.

-----

Sorry for rambling, hopefully this helps.
User avatar
Dwedit
Posts: 4924
Joined: Fri Nov 19, 2004 7:35 pm
Contact:

Post by Dwedit »

I guess you could think of PocketNES as "multi-threaded", in that it uses the GBA's tilemapping hardware to render the screen images based on data collected during the previous frame.

Collected data includes scrolling locations for every scanline, CHR bankswitching changes, sprite table, and of course, map and character data. But not detailed enough (only scanline-level accuracy) to run MMC2/4 games.

PPU interactions are limited to sprite 0 hit. MMC3 is simulated with simple rules based on which pattern tables the sprites and BG use. Accurate enough to avoid shaky screens, not accurate enough to avoid failing on Mario Adventure.

Even an inaccurate emulator can still pass many of the emulation tests. Sprite 0 hit works well, passes all the timing tests and everything, even though it is incorrect for any scanline where you had done unclean horizontal scrolling before the sprite.
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Post by tepples »

I'd recommend using timestamping and prediction with some sort of lightweight thread-safe queue. Have functions that peek at OAM and the mapper state to find the earliest possible time that the PPU could have an effect on the CPU (e.g. $2002 change due to sprite 0, mapper IRQ, or Zapper light detection). Run the CPU until that time, storing PPU- or mapper-relevant write timestamps in that queue, and then have it peek again. Then run the PPU until the most recent timestamped write from the CPU. The APU can run similarly, up until the last timestamped write or the next length counter change or the next frame counter IRQ.
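A sketch of that timestamped queue (hypothetical names; a production version would want a lock-free single-producer/single-consumer queue rather than std::queue):

```cpp
#include <cstdint>
#include <queue>

// CPU side logs PPU- or mapper-relevant writes with timestamps; the PPU
// later consumes them in order while running up to the CPU's position.
struct TimedWrite {
    int64_t  time;   // CPU cycle at which the write occurred
    uint16_t addr;   // e.g. $2000-$2007
    uint8_t  value;
};

struct WriteQueue {
    std::queue<TimedWrite> q;

    void push(int64_t t, uint16_t a, uint8_t v) { q.push({t, a, v}); }

    // PPU side: apply every write that happened at or before 'now'.
    template <typename ApplyFn>
    void drain_until(int64_t now, ApplyFn apply) {
        while (!q.empty() && q.front().time <= now) {
            apply(q.front());
            q.pop();
        }
    }
};
```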
User avatar
Disch
Posts: 1848
Joined: Wed Nov 10, 2004 6:47 pm

Post by Disch »

Thanks for all the responses everyone!
byuu wrote:Switch to B here before doing the write. B now runs and keeps subtracting from the clock. B runs until it does something visible to CPU A (let's say PPU sets a flag that the CPU might read), at which time you see if clock < 0. If so, switch back to A. And now A performs that write.
I don't see running the PPU ahead of the CPU as an option. Too many things the CPU can do will need to change PPU execution. Any kind of midframe scroll change, bankswitch, etc. If you have the PPU ahead of the CPU and then the CPU makes such a state changing write... you'd have to "rewind" the PPU back to the CPU's timestamp. That doesn't really seem practical.

How do you solve this problem with your cooperative model?
byuu wrote:My personal opinion: cooperative works great on systems with only one CPU. Preemptive works terribly there.
I guess I see your point. It seems to me, though, that it'd be relatively easy to make a design that works preemptively on a multicore system, but works more like the traditional "catch up" approach for a single core system. The only change you'd need to make is how/when the master thread (CPU) signals to the other threads that they can start running again.
byuu wrote:why -require- a multi-core processor to use the emulator?
To me it's more about taking advantage of what's readily available. Even low-end single cores these days are at least hyperthreaded. And the number of available cores is just going to go up with time.

Partially I'm using this as an exercise to practice writing things for a multithreaded environment, since that really is the way of the future (or rather, the present). But I think it's also practical in this situation.

And it's not like this wouldn't run on a single core system. It just wouldn't get as good of performance. A program getting worse performance on a lower performance machine is not really something I'll lose sleep over.
byuu wrote:Just use cooperative. Easier, quicker, better
I am very intrigued by your cooperative approach. If you can explain your solution for the PPU running ahead of the CPU problem I brought up above, I'm very heavily considering it.
byuu wrote:Sorry for rambling, hopefully this helps.
It does! This was exactly the kind of post I was hoping for. Discussions like this are super informative and fascinating. I appreciate you sharing your wisdom/experience.
Bisqwit wrote:Thread switching by yield() or similar mechanisms is considered harmful.
Many of the articles you linked to seem to say the exact opposite... that yielding should be preferred over dumb looping when waiting in a multithreaded environment. I tend to agree, as well.
tepples wrote:Run the CPU until that time, storing PPU- or mapper-relevant write timestamps in that queue, and then have it peek again. Then run the PPU until the most recent timestamped write from the CPU
I've considered this in the past, as well, but in a single threaded environment. For some reason it never really sat right with me.

But now that you mention it, it does seem like it would scale up to a preemptive multithreaded environment very well. This would greatly reduce the number of times the CPU would have to sync with the PPU.

Ordinarily the CPU would have to stop and wait for the PPU to catch up on such register writes (although granted, only under very specific circumstances -- like if rendering is enabled and we're not in VBlank). But by logging writes, the CPU can just plow ahead at full steam and the PPU can process those writes as they come.

The downside to this is that it requires the PPU to do a lot of additional checking after every cycle. But I don't think that'd be significant.




Something else I just thought about. The PPU acts entirely differently when rendering is enabled vs. disabled. If you're running it with a single code path where the rendering state can change at any time, this means that each individual cycle will need to be wrapped in an if(rendering) statement. Furthermore, 'rendering' would have to be volatile.

That seems like a performance killer....
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Post by tepples »

Changes to 'rendering' are stored in the queue as timestamped writes to $2001.
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Post by Near »

> I don't see running the PPU ahead of the CPU as an option. How do you solve this problem with your cooperative model?

Sorry, I had to pick CPU<>PPU because the NES doesn't have a separate processor for audio.

With my emulators, I typically don't run the PPU ahead of the CPU, because we'd have to check our clock value after every read of a non-cached register value.

But one of the fun things is that cooperative threading works great as long as at least one thread can run ahead of the other for a good while. Aside from spin loops (-; lda $200x; bpl -) -- and hell, even then, really -- you are executing several cycles in the CPU at once. So when we switch back to the PPU, we get to run that many CPU cycles' worth of operations before switching back to the CPU.

So basically, your add_clocks() or step() or whatever function for the PPU is inserted after every time-stepping event (like after your nametable read, or whatever); and that function will always switch to the CPU if the PPU is now caught up (clock >= 0)

When both can run ahead of each other (SNES CPU<>SMP): each can run ~100 opcodes ahead of each other. You'll probably end up with a context switch every ~200 executed instructions.
When only one can run ahead of the other (NES CPU<>PPU): one runs ~100 opcodes ahead, then the other catches up. Repeat. You end up with a context switch every ~100 executed instructions.
When neither can realistically run ahead of the other (SNES SMP<>DSP): disaster. You end up with a context switch every cycle.

There are no worst-case scenarios on the NES, at least. But I'll elaborate more on the SNES. Both the SMP (a processor that executes instructions) and the DSP (a sound processor that decodes BRR samples ... you know, you wrote an SNES emu ;) can access APURAM. Both to read it and to write it.

Nearly every cycle for the SMP contains a read or a write to APURAM. Lots of DSP cycles do as well. The odds both are touching the same memory at the same time are slim to none, but it's possible. So without a "rewind" (save state) mechanism like Nemesis had, we have to sync all the time. Since the SMP executes at 1MHz and the DSP at 2MHz, that means we are switching contexts ~2 million times per second (1 million each way.)

Cooperative threading really isn't a model you should use everywhere for consistency. What I do with my compatibility profile is turn the DSP into a state machine. It's simple enough that I only need a single switch() right inside the main loop. Some clever coding hides the fact that it's a state machine, and makes it 'run' like it were a thread.

The main difference is that a state machine -has- to be able to exit at every cycle. So in this case, since we do have to exit at virtually every cycle, we just make that always the case. So instead of co_switch(dsp), we just do: while(dsp.clock < 0) { dsp_step(); dsp.clock += cycle; }
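That trick -- a state machine dressed up to "run" like a thread -- might be sketched like this (the phases are completely made up; the real DSP is far more involved):

```cpp
#include <cstdint>

// A chip simple enough for one switch() can exit after every cycle,
// so we just always exit, and the caller drives it with the clock.
struct DSP {
    int64_t clock = 0;
    int     phase = 0;  // which cycle of the current sample we're on

    void step() {       // emulate exactly one cycle, then return
        switch (phase) {
        case 0: /* fetch BRR byte   */ break;
        case 1: /* decode sample    */ break;
        case 2: /* mix voices       */ break;
        }
        phase = (phase + 1) % 3;
    }
};

// The caller "runs" it the way a thread would be co_switch()'d to:
inline void dsp_catch_up(DSP& dsp, int64_t cycles_per_step) {
    while (dsp.clock < 0) { dsp.step(); dsp.clock += cycles_per_step; }
}
```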

But get a situation like the SNES CPU <> SMP, and cooperative threading really shines. When you have each one running hundreds, if not thousands, of instructions ahead of the other, you get massive speed gains over the traditional state machine, even at the opcode precision level. If you consider a full SNES CPU state machine (breaking not only per opcode, not only per cycle, but in the middle of reads and writes for bus hold delays), the threading model turns 3-level nested state machines into linear code. It turns a forced stackless context switch ~10 million times/second into a stack context switch ~50,000 times a second. So you get a major speed boost and way cleaner code.

It's also easy to get carried away. You may be tempted to make the math unit a separate thread, because they act in parallel on real hardware. This is all well and good, but you can easily bog down your performance this way. Threads are not cheap.

> I guess I see your point. It seems to me, though, that it'd be relatively easy to make a design that works preemptively on a multicore system, but works more like the traditional "catch up" approach for a single core system.

I dislike complexity. Detecting whether a system is single or dual core and acting differently is more complicated, and thus, more error prone.

Also note that when you're preemptive, every write by CPU B that CPU A can read has to be atomic, and vice versa. Atomic operations are more costly than regular ones.

> To me it's more about taking advantage of what's readily available.

But is it wise to drive 100% utilization of a quad core to run an NES emulator, when you could easily do it with 25% on one core?

Even if it gets you a higher FPS on that quad core chip, it won't be 400% faster. It'll be more like 20-40% faster or something. And it won't continue to scale, because the NES doesn't have very many chips.

> And it's not like this wouldn't run on a single core system. It just wouldn't get as good of performance. A program getting worse performance on a lower performance machine is not really something I'll lose sleep over.

In the ideal multi-threaded model, two threads would never have to stuff all their non-volatile regs on the stack, invalidate the pipeline, effectively flush much of their data cache, and perform a ring 3 -> ring 0 -> ring 3 transition to swap threads. One would just wait for the other. On a single core system, you have to do all of that. It's not going to be a little bit slower. It's going to be substantially slower.

I feel you really need to understand the costs here. When people talk about multithreading as the way of the future, they're invariably giving you examples like web servers where a thread wakes up, what, ten times a second? A hundred? Maybe a website even gets a thousand hits per second? That is child's play. We are talking about processors that achieve MILLIONS of synchronizations/context switches per second.

Write yourself some simple test programs before you choose your model: just do "dummy CPU A" + "dummy CPU B", and have each one increment a counter and then sync to the other. Watch in horror as the traditional multithreaded model gets you something like 100,000 increments per second. On a 3GHz processor. Then try the same with my cooperative threaded model, and see it reach up to 10,000,000 increments per second. Then do a barebones switch(state) on each, and observe 100,000,000 increments per second. Then try three nested states like you'd need for a complex processor like the SNES, and see that drop to 1,000,000 increments per second.
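A minimal version of that experiment, for the atomic/preemptive handoff case only (the cooperative variant needs a fiber library such as libco; all names here are mine):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

// Two "processors" each bump a counter, then hand control to the other via
// a shared atomic -- pure synchronization overhead, no emulation work.
int64_t run_pingpong(int64_t iterations) {
    std::atomic<int> turn{0};
    int64_t a_count = 0, b_count = 0;

    std::thread b([&] {
        for (int64_t i = 0; i < iterations; ++i) {
            while (turn.load(std::memory_order_acquire) != 1) { /* spin */ }
            ++b_count;
            turn.store(0, std::memory_order_release);
        }
    });

    auto start = std::chrono::steady_clock::now();
    for (int64_t i = 0; i < iterations; ++i) {
        while (turn.load(std::memory_order_acquire) != 0) { /* spin */ }
        ++a_count;
        turn.store(1, std::memory_order_release);
    }
    b.join();
    double secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    std::printf("%.0f handoffs/sec\n", (a_count + b_count) / secs);
    return a_count + b_count;
}
```

Swapping the atomic handoff for a mutex + condition variable, or for yield() in the spin, is a one-line change -- which makes it easy to compare the models on your own hardware.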

Once you have your numbers, see how they would work with how many context switches you'd need in the real world for emulating a given system. Realize that those numbers are -without- any actual emulation. This is just the overhead of keeping things synchronized.

Also, you still seem interested in yield()'ing a thread. Be sure to try that with your tests above. You will -not- be able to yield a thread ten million times a second. It will never happen.

> I am very intrigued by your cooperative approach. If you can explain your solution for the PPU running ahead of the CPU problem I brought up above, I'm very heavily considering it.

Please take a look at my emulator source. fc/cpu and fc/ppu.

Or for the lazy ;)
http://pastebin.com/t3id0NP7

That's my cycle-accurate PPU renderer (it doesn't handle the crazy sprite fetching stuff blargg found.) It looks exactly like a scanline-based renderer, and is written exactly the same way.

And be sure to look at the performance. I get ~300fps on my Core i7. That's very, very bad for an NES emulator. A lot of it is due to audio processing at 1.78MHz with a polyphase sinc resampler, though. But even at ~500-600fps, that'd still be very bad for an NES emu.
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Post by tepples »

If anything, the following can run in a separate thread because the needed one-way synchronization is especially suited to a lock-free queue:
  • APU tone generation (because $4018-$401A is disabled in a stock NES)
  • Audio filtering, preferably using blargg's blip-buf or another implementation of band-limited step resampling
  • NTSC filtering or Scale2x filtering or both (yes, "both" is possible using something like Super NES hi-res mode)
User avatar
Disch
Posts: 1848
Joined: Wed Nov 10, 2004 6:47 pm

Post by Disch »

@byuu:

Your code looks exactly like what I had envisioned when I wrote my first post. That's cooperative? I must be misunderstanding the difference between cooperative and preemptive.

You're running the PPU in a simple linear fashion without having to enter/exit at every cycle. Your 'tick' function I assume checks to make sure you aren't running ahead of the CPU, and waits for the CPU to get further ahead before resuming.

This is more or less exactly what I had planned. Add onto that tepples' suggestion of queuing register writes, and you'd greatly decrease the number of times the threads would have to sync.


I'll go over this more when I get home from work tonight.



@tepples

Yes, but changing whether rendering is enabled or not will change the logic flow of the PPU, so the logic would have to be wrapped in a conditional.

Bah I'm not explaining it well. I'll post again later.
Nessie
Posts: 133
Joined: Mon Sep 20, 2004 11:13 am
Location: Sweden
Contact:

Post by Nessie »

Cooperative threading seems to simplify the source code by an order of magnitude compared to a typical state machine.
But how do you handle save states? The thread context can't really be saved. Did you decide on some sort of "safe" point (e.g. start of vblank) where all the threads can be told to stop/resume simultaneously?
Also, how feasible is it to implement this on other platforms such as Android or Wii?
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Post by Near »

> Your code looks exactly like what I had envisioned when I wrote my first post. That's cooperative? I must be misunderstanding the difference between cooperative and preemptive.

Cooperative is just like preemptive, only it's much easier. You never have to worry about deadlocks or race conditions. You don't have to perform atomic operations, and you have absolute control over the scheduler. Context switches are a hundred times faster. The only real drawback is that you only use one physical processor core this way. A lot of software has cooperative modes; HTTP servers and SQL databases often include them. You use that mode when the cost of switching contexts becomes more expensive than the work being done. Rare for web servers and databases, very very common for emulators.

But yeah, the idea is the same in both. You have:

Code: Select all

void CPU::main() {
  while(true) {  //threads 'run forever', they switch out control on occasion though
    while(interrupt_pending) execute_interrupt();
    execute_opcode();
  }
}

Code: Select all

void PPU::main() {
  while(true) raster_scanline();
}

Code: Select all

void APU::main() {
  while(true) {
    run_square();
    run_triangle();
    run_dmc();
    run_sequencer();
    //.. made up example, it's a lot more complex, obviously
  }
}
Every thread has its own main(), and it's as if that main function -is- the entire program. It can call 30 levels deep into functions, and switch to another thread right then and there. No need for state machines. When you switch back, it resumes right where it left off.

> Your 'tick' function I assume checks to make sure you aren't running ahead of the CPU, and waits for the CPU to get further ahead before resuming.

Yep, this is your typical tick():

Code: Select all

void PPU::tick() {
  clock += 4;  //4:1 PPU to CPU
  if(clock >= 0) co_switch(cpu.thread);  //we always sync no matter what when PPU is ahead, since it's hard to run the PPU ahead of the CPU
}
In my emulator, I set and clear IRQ flags based on X/Y position too, but that's kind of a lazy hack :P

> This is more or less exactly what I had planned. Adding onto that tepples' suggestion of queuing register writes -- and you would greatly decrease the number of times the threads would have to sync.

I believe I mentioned register write timestamps in my first post ... although perhaps tepples' method is easier than mine?

But yes, if you queue writes with a timestamp, you don't have to switch threads. It's only the reads that can affect the way eg the CPU executes.

> Cooperative threading seems to simplify the source code by an order of magnitude compared to a typical state machine.

It's a little better, but practically identical if you can put all your logic in a single function with a single switch/case. Most chips are not that simple, but NES APU + SNES DSP are. SNES SMP is close.

But when you get into cases like the SNES CPU that can execute an HDMA in the middle of a 10-frame DMA in the middle of a bus read delay inside the middle of a cycle right inside the middle of an instruction ... yeah, it can become several orders of magnitude more simple than nested state machines. Even if you think you can do nested state machines, you'll often end up with subtle bugs you don't realize. I went that route for 2-3 years and could tell you horror stories. Again though, there's nothing in the NES that complicated.

> But how do you handle save states? The thread context can't really be saved. Did you decide on some sort of "safe" point (e.g. start of vblank) where all the threads can be told to stop/resume simultaneously?

You lose just a tiny bit of accuracy when you save a state, unfortunately. But in the five years I've had save states, I've never had a problem reported because of it.

How I do it: you pick a master thread. This is the most complicated, and most delicate, thread of all. The one that tends to run the longest without stopping. Usually it's the CPU. But on the NES, I actually make it the PPU. All other threads are slaves.

So the user asks to save a state. We set a flag and everything proceeds to run as normal. As soon as we hit the entry point of eg CPU::main(), we can exit that thread right then. The magic is that as long as you save all of the CPU class variables and load them again, the code executes from the same point as before. You can even delete and recreate the CPU thread, it doesn't matter.

Here is where we lose a tiny bit of accuracy. Now we run each slave thread, one by one. With the rule that it cannot synchronize with other chips again. So even if they read something that the CPU might change ... too bad. Allow the read anyway, and allow us to reach eg APU::main()'s top. Serialize that thread too. Repeat until all threads are done. In practice, the -worst case- desync is the longest time a thread can run for. For the SNES SMP, that's one instruction. For the NES APU, it's one audio sample.

This sucks, I know. But I can guarantee you it's not a problem in practice. Not even for TAS'ers that save hundreds of times per frame, as bsnes is the #1 emulator at tasvideos for SNES. It basically means your emulator goes from cycle accuracy to opcode accuracy for -one instruction- at the very, very worst case. Usually it doesn't even matter. I've rigged bsnes to save a state after every single instruction, and was able to play any game I tried with no visible issues.

The longest it takes to save a state is the longest it takes every thread to reach the entry point. So for most systems, that ends up being one PPU scanline. So for -extreme- debugging, this is probably not acceptable. You should keep that in mind.

> Also, how feasible is it to to implement this on other platforms such as Android or Wii?

You have to have a cooperative threading library available for any target you want.

I use (and wrote) libco. Lots of people have helped me (Aaron Giles, Bisqwit, Nach, vladitx, blargg, etc.)
It's used by bsnes (NES+SNES+GB+GBC+GBA); twoMbit (SMS); daifukatt.su (arcade); and MAME/MESS (sparingly so far.)
It works on x86 and amd64 (Windows, OS X, Linux, BSD); PowerPC 32-bit and 64-bit (OS X, Linux, BSD, PS3, Wii); MIPS (Loongson); SPARC (although it only runs well if you disable register windows here); ARM (Raspberry Pi); etc. It has assembly implementations for many processors, a Windows Fibers implementation that runs on any version of Windows, and a setjmp/longjmp version that works on virtually any Unix-alike operating system.

But if you're targeting a low power device, you'll want to be careful. State machines, especially at one level deep, are typically quite a bit faster. If you want a highly accurate emulator plus clean code, you may have trouble getting full speed on low powered systems.
User avatar
Dwedit
Posts: 4924
Joined: Fri Nov 19, 2004 7:35 pm
Contact:

Post by Dwedit »

How is libco different from pushing a big char array on the stack and using setjmp/longjmp?
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Post by Near »

> How is libco different from pushing a big char array on the stack and using setjmp/longjmp?

longjmp only modifies the PC. You can push a big block of memory on the stack and go to a new "thread", but how do you return from it when you're done? You can't pull a big char array back off the stack arbitrarily, nor can you really keep track of how many bytes you're supposed to push and pull for any arbitrary point you'd longjmp at.

libco, however, is very very simple. It swaps all non-volatile registers, including the stack pointer and program counter.

Code: Select all

; fastcall ABI:
; return = eax
; arguments = ecx,edx
; non-volatile registers = ebp,esi,edi,ebx

co_swap_fastcall(to = ecx, from = edx):
  mov [edx],esp
  mov esp,[ecx]
  pop eax ;get return address as soon as possible (so the CPU can start caching) [big boost over 'ret' at the end]

  mov [edx+ 4],ebp ;push / pop is slower on most archs
  mov [edx+ 8],esi ;with the notable exception being the Pentium 4
  mov [edx+12],edi
  mov [edx+16],ebx

  mov ebp,[ecx+ 4]
  mov esi,[ecx+ 8]
  mov edi,[ecx+12]
  mov ebx,[ecx+16]

  jmp eax
Twelve instructions. You can do the rest (thread creation and deletion) in ISO C.
User avatar
Dwedit
Posts: 4924
Joined: Fri Nov 19, 2004 7:35 pm
Contact:

Post by Dwedit »

Really? Longjmp is just a PC reassignment? Never seen that before. On Newlib for ARM, longjmp swaps all registers, including the stack pointer, but not r0-r3. I'm not as familiar with other implementations of setjmp/longjmp.