All times are UTC - 7 hours





PostPosted: Wed Jun 12, 2019 1:47 am 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1475
(I also posted this to Spritesmind for more feedback.)

I know I'm a bit of an outlier here in using cooperative threading to design emulators, and that most people use state machines that run each chip for the minimum slice of time, but I ran into a rather interesting challenge with threads, and was curious if anyone here had some input.

The Sega Genesis + Sega CD turns out to be a bit of a nightmare of a design to emulate.

You have two 68Ks, a Z80, a VDP graphics chip, PSG audio, YM2612 audio. And tons of components in the Sega CD that I basically roll into its 68K: a graphics rotozoom ASIC, CD controller, CD driver, CD-DA playback, etc. Then there are controllers that may need threads of their own, cartridges as well, etc.

The way my scheduler works is: whenever the running thread does something that takes time, I increment that thread's clock counter. There's some extra math in there to scale the value so that each thread can run at a different clock rate.
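To make that scaling concrete, here's a minimal sketch of one way it could work (Thread, setFrequency, and timebase are hypothetical names for illustration, not necessarily the actual code): pick a timebase that is a common multiple of every component's frequency, and advance each thread's clock in that shared unit, so clocks from threads at different rates stay directly comparable.

```cpp
#include <cassert>
#include <cstdint>

//sketch: each thread's clock is kept in a shared time unit,
//so threads running at different frequencies stay directly comparable
struct Thread {
  uint64_t clock = 0;   //current time, in shared units
  uint64_t scalar = 0;  //shared units per native clock cycle

  //pick the scalar so that (scalar * frequency) is identical for all threads
  void setFrequency(uint64_t frequency, uint64_t timebase) {
    scalar = timebase / frequency;  //assumes timebase is a common multiple
  }

  void step(uint64_t cycles) { clock += cycles * scalar; }
};
```

With this, one emulated second on a 7.67MHz 68K and one emulated second on a 3.58MHz Z80 land on exactly the same clock value, and the `clock >= other.clock` comparisons in step() need no per-comparison math.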

Many chips share RAM and resources, and so you need to make sure the current thread doesn't perform an operation that could affect the other chips while it's ahead in time. So every time the CPU accesses a PSG register, it has to catch up the Z80 *and* the PSG, since the Z80 can also access the PSG. The same for the YM2612. In fact, the Z80 can access almost the entire bus of the 68K, and vice versa, so it ends up being a lot of synchronization.

The naive approach looks like this:

Code:
CPU::step(uint clocks) {
  clock += clocks;
  //resume(thread) is a thread switch:
  //after resume(cpu) later, CPU code execution continues here:
  if(clock >= apu.clock) resume(apu);
  if(clock >= psg.clock) resume(psg);
  if(clock >= ym2612.clock) resume(ym2612);
}

APU::step(uint clocks) {
  clock += clocks;
  if(clock >= cpu.clock) resume(cpu);
  if(clock >= psg.clock) resume(psg);
  if(clock >= ym2612.clock) resume(ym2612);
}

YM2612::step(uint clocks) {
  clock += clocks;
  if(clock >= cpu.clock) resume(cpu);
  if(clock >= apu.clock) resume(apu);
}


Once every emulated frame, I find the smallest clock value of all the threads, and then subtract that value from every thread's clock value, which prevents the clock counters from ever overflowing.
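That once-per-frame normalization might look something like this (Thread and allThreads are assumed names for the sketch): find the oldest clock, then rebase everyone against it, preserving all relative distances.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

//sketch: once per frame, rebase all thread clocks against the oldest one
//so the counters never overflow; relative ordering is unchanged
struct Thread { uint64_t clock; };

void normalize(std::vector<Thread*>& allThreads) {
  uint64_t minimum = UINT64_MAX;
  for(auto thread : allThreads) minimum = std::min(minimum, thread->clock);
  for(auto thread : allThreads) thread->clock -= minimum;
}
```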

So notice how the YM2612 can't modify the PSG. As a result, it doesn't need to sync the PSG right now and can happily keep running even if it's ahead of the PSG, so long as it doesn't run so long that the PSG runs out of buffered audio samples.

The design works well, but with one glaring flaw: let's say the APU runs for a few clock cycles and calls step. It is currently behind the CPU, but ahead of the YM2612, so it switches to the YM2612. The YM2612 runs for quite a while (my core is sample-based, so it eats 144 clock cycles per sample), and is now quite a ways ahead of the CPU, so it switches over to the CPU. BUT!! The CPU is still behind the APU, and now the CPU goes on to perform a write that affects the APU.

The naive solution to that is: every time any thread is about to do something that would influence another processor, we scan through every single thread, and find the thread with the smallest clock value, and if that's not the current thread, switch to it.

Code:
Scheduler::synchronize() {
  uint minimum = ~0;
  Thread* context = nullptr;
  for(auto thread : allThreads) {
    if(thread->clock < minimum) {
      minimum = thread->clock;
      context = thread;
    }
  }
  //should never happen, don't check in release builds
  assert(context);
  if(context != currentThread) resume(*context);
}


But that means we have to loop over a list of threads, compare their clock values, and keep track of which thread has the smallest value. And when we're the YM2612, that means synchronizing the PSG potentially, even though we usually don't care about that. And synchronizations are *painful*, so this approach ends up costing me about 20% more performance and I absolutely cannot spare a hit that big anymore.

My initial thought was something like a binary min-heap array for the threads, so I can find min(list) in O(1) time, but that doesn't solve the problem with synchronizing threads that aren't important.
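For reference, a minimal version of that idea might look like this (Thread and Older are hypothetical names): with a binary min-heap, the oldest thread sits at the front and is found in O(1), though as noted, it still drags in threads we don't actually need to synchronize.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

//sketch of the min-heap idea: keep thread pointers heap-ordered by clock
//so the oldest thread is found in O(1); updates after a clock change
//still cost O(log n), and it syncs threads that don't matter
struct Thread { uint64_t clock; };

struct Older {
  bool operator()(const Thread* a, const Thread* b) const {
    return a->clock > b->clock;  //inverted: smallest clock at the front
  }
};

Thread* oldest(std::vector<Thread*>& heap) {
  //assumes std::make_heap(heap.begin(), heap.end(), Older{}) was called,
  //and re-established after any thread's clock changed
  return heap.front();
}
```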

I can also unroll the synchronization logic, for instance in the YM2612's case:

Code:
YM2612::step(uint clocks) {
  clock += clocks;
  if(clock >= cpu.clock) {
    if(clock >= apu.clock) {
      //both are behind: resume the older of the two first
      resume(apu.clock <= cpu.clock ? apu : cpu);
    } else {
      resume(cpu);
    }
  } else if(clock >= apu.clock) {
    resume(apu);
  }
}


Which is an annoying increase of code that I can hide in a library function, but mostly works ... until you have three threads. Or four. Or in the case of the Genesis CPU, the APU+PSG+YM2612+VDP+Mega CD+Controller 1+Controller 2. Unrolling all of that is ... not for the faint of heart. And not even possible in some instances: if you don't have controller 2 plugged in, there is no controller 2 thread.

So right now, the naive approach seems to be the most likely to work. But as it's way too slow, I was curious if anyone had better ideas?


PostPosted: Wed Jun 12, 2019 8:51 am 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1475
Alright, I came up with a workable solution.

Flow: A needs to be in sync with B and C, so A jumps to B. B needs to be in sync with D, so B jumps to D. D only needs to be in sync with A, so D jumps to A. But in this scenario, B hasn't caught up to A yet, so the a.syncTo(b) call returns before B is really fully synced up.

In other words, a.syncTo(b) doesn't guarantee B is caught up. Because even though B won't swap to A before being caught up, B may switch to something else that will swap to A.

So we need to turn this:
Code:
Thread::synchronize(Thread& thread) { if(clock >= thread.clock) thread.resume(); }

Into this:
Code:
Thread::synchronize(Thread& thread) { while(clock >= thread.clock) thread.resume(); }


A much, much smaller amount of overhead, and we can go back to only synchronizing the components that need it, rather than always searching for the oldest timeslice, and syncing everything always.
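To illustrate why the while loop matters, here's a toy model (not real coroutines; resume() here just advances the target by a fixed stride, standing in for a context switch that may return before the target has fully caught up):

```cpp
#include <cassert>
#include <cstdint>

//toy model of the while-loop fix: resume() may return control while the
//target is still behind, just like a context switch back from B that
//detoured through D before B had caught up to A
struct Thread {
  uint64_t clock = 0;
  uint64_t stride = 0;  //how far one resume() advances this thread

  void resume() { clock += stride; }  //may still be behind after returning

  void synchronize(Thread& other) {
    //a single if() would stop after one resume(), possibly still behind;
    //the loop keeps resuming until 'other' has genuinely passed us
    while(clock >= other.clock) other.resume();
  }
};
```

With a single if(), the target below would stop after one 144-cycle step, still well behind; the loop carries it all the way past the caller.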


PostPosted: Wed Jun 12, 2019 9:57 am 
Offline
User avatar

Joined: Fri Nov 24, 2017 2:40 pm
Posts: 163
Hmm, yeah. I was going to say maybe a min-heap, but if you only have a small number of chips a linear search might be faster anyway?

It almost sounds like you are describing transactional memory, but need a lite version of it. If you aren't familiar, it's a way to do lockless threading whereby each thread does whatever it wants inside of a memory transaction. If it tries to close a transaction and another thread has written to the same block of memory, then the transaction fails and the thread has to rerun its computation. In the usual case there's very little overhead; it's only expensive when something needs to be rerun.

Your synchronization is a bit more complicated, but maybe you could do something similar. Keep a record of what memory is read/written for each thread (a pair of coarse bitmasks maybe) and compare them at the end of the frame. If there is a conflict, you only need to re-run the threads that had a conflict. If most of the chips have no violations most of the time then they can be run efficiently in big coarse loops.
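A minimal sketch of that conflict check, assuming one coarse bit per memory block (all names here are hypothetical): each thread accumulates read and write masks over the frame, and two threads conflict when one wrote a block the other touched.

```cpp
#include <cassert>
#include <cstdint>

//sketch: per-thread coarse access log, one bit per memory block
//(64 blocks here; scale the mask width for real memory maps)
struct AccessLog {
  uint64_t reads = 0;
  uint64_t writes = 0;

  void read(unsigned block)  { reads  |= 1ull << block; }
  void write(unsigned block) { writes |= 1ull << block; }
};

bool conflicts(const AccessLog& a, const AccessLog& b) {
  //a write in one thread conflicts with any access to the same block
  //in the other thread; read/read overlap is harmless
  return (a.writes & (b.reads | b.writes)) || (b.writes & a.reads);
}
```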


PostPosted: Wed Jun 12, 2019 12:38 pm 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1475
Yeah, so without relying on any special hardware ...

Right now I can run the SNES CPU basically as far ahead of the SNES SMP as I want, as long as the SNES CPU doesn't access the four register ports on the SMP. Since it costs essentially nothing and keeps audio sample latency down, I force a sync once per scanline.

But then you get something like the SA-1 where the two CPUs share the same RAM, and that goes out the window. Any memory access to RAM has to sync, even though 99.9% of the time they're not trampling over each other. But if you don't sync on the 0.1% of exceptions, games start breaking.

This can be refined further by queuing timestamped writes into a fixed-size buffer, with a forced sync when the buffer is full (memory is cheap, so the full-buffer case is unlikely as heck). Then when the other chip runs, its read handler pulls from the buffer until the queue is empty. Then we only have to synchronize on reads when a chip is ahead of another, since we can't predict those values.

I don't actually know if this would be faster or not. It's a lot of work, reads are much more common than writes anyway, and now reads carry a lot of overhead trying to find a given value at a given time in a queue.

If we tried to run ahead on reads as well, then we would need rollback support for when we run the CPU that's behind and discover a modification lost us the value we actually needed. The cheapest way of doing rollback would be savestates, but that's far too heavy. So then you'd have to design a partial save-state system to save/restore only what's absolutely needed, which is even more complex. And the idea of save-state rollback is incompatible with cooperative threading, so this one's out completely.

If we relied on hardware extensions, Intel's TSX looked like it could be promising for both reads and writes, but that's very out of my depth, experimental, and breaks portability to ARM CPUs. It is almost certainly also incompatible with cooperative threading. At that point, you'd probably be designing the emulator with one hardware core per emulated thread, and using lockless algorithms for blocking. But even that idea is screwed up now because modern chips aren't N-core, they're hierarchical designs where the CPU is split into dies, which are split into context queues, which are split into processors, which are split into threads. And each has increasing latencies to talk to one another.

Even if we had a nice system to run-ahead on writes or even on reads+writes, there's more things that need to stay synchronized than just memory accesses: think things like a PSG's DAC register, a PPU's Vsync/Hsync pins, a CD controller's interrupt pin, etc. Then you have the SA-1 that doesn't actually allow both chips to hit the RAM at the same time, so it needs to track the address bus lines during accesses and will insert SA-1 wait states into its CPU core until the CPU stops accessing the given memory location, because the CPU is not capable of blocking in this way.

So the only really surefire way to do this is through Verilog, which then means 99.999% of people can't run your emulator, and FPGAs don't really scale anyway past -maybe- N64-era emulation.

In conclusion, this is why we can't have nice things :P

