In terms of max overclock values, on the NES, FCEUX, Mesen, and puNES all support up to 1000 lines of either type, iirc, so that's what I've implemented at the moment
Hmm ... I'm doing my overclocking as a % value. For NTSC, each frame gets:
uint clocks = 262*1364;
uint extraclocks = (clocks*overclock)-clocks; //overclock = 1.0 - 4.0
So that's at most 786 extra scanlines per frame currently (262 × 3 at overclock = 4.0).
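To make the arithmetic above concrete, here's a minimal sketch. The 262 and 1364 figures come from the snippet above; the function and constant names are made up for illustration and aren't taken from any emulator's source.

```cpp
#include <cassert>
#include <cstdint>

// From the post: 262 scanlines per NTSC frame, 1364 master clocks per scanline.
constexpr uint32_t kScanlines = 262;
constexpr uint32_t kClocksPerScanline = 1364;

// Hypothetical helper: extra master clocks per frame for a given multiplier.
// overclock = 1.0 means stock speed, 4.0 means 400%.
uint32_t extraClocks(double overclock) {
  assert(overclock >= 1.0);  // underclocking is not modelled in this sketch
  const uint32_t clocks = kScanlines * kClocksPerScanline;
  return static_cast<uint32_t>(clocks * overclock) - clocks;
}

// The same quantity expressed in whole scanlines, for comparison with
// emulators that configure overclocking as a line count.
uint32_t extraScanlines(double overclock) {
  return extraClocks(overclock) / kClocksPerScanline;
}
```

With these numbers, extraScanlines(4.0) gives 786 and extraScanlines(5.0) gives 1048, so a 500% cap would land just past the 1000-line limit mentioned for the NES emulators.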
I guess I'll make the upper limit 500%, but ... damn that's a lot of overhead to the emulation, heh.
400% already drops max framerate from ~370fps to ~190fps.
Now throw in a 500% SA-1 overclock to go with it, or better yet, a 500% overclock of the ARM6 ... 105MHz 32-bit CPU, anyone? MSU2? :P
FYI, AxlRocks has been testing some SNES titles on Mesen-S to see how they behave with different values of the before/after NMI settings (he's done this on the NES for like 100+ games on FCEUX/Mesen in the past, too).
Something I've been doing with bsnes' speed hacks is detecting games that don't like them to selectively disable them (eg pixel renderer for Air Strike Patrol, cycle DSP for Koushien 2, etc.)
I'd like the idea of us building out a database of 'metadata' for games like this, and we could include information like "maximum stable overclock%", etc. Stuff like this is great for the end-user experience, instead of them having to guess and change settings per-game.
In other news, just committed Super FX support (including the %-based overclocking for it, though mine only supports multiples of 100%, up to 1000% atm).
The SuperFX overclocks so much better than other CPUs. And the way it sleeps when done means you aren't completely murdering performance like with overclocking the main CPU. You can really go to town and games don't care. The one exception is that the Stunt Race FX menus fail past about 400%, but in-game benefits so much that I'd rather not cap it.
I haven't personally noticed any gains past 800%, but I guess if we're allowing 500% on the CPU, 1000% on the SFX seems reasonable.
this is definitely one of the scenarios where using something like libco can really simplify the code when that much accuracy is needed.
The areas where libco was a lifesaver:
1. I prefer to not enslave the other processors to the main CPU. This is because the main CPU can start a DMA that is ten frames in length. I know, no game ever will. But it can. Supporting the bus hold delays on CPU reads plus the DMA/HDMA sync handling to exit the CPU resulted in a nightmarishly complex state machine. If someone wants to be a troll, a test ROM that verifies proper DMA/HDMA sync and then does consecutive 10-frame DMAs would be a good one :P
2. building on 1, the CPU and SMP talk so infrequently, and only over a limited 4-byte range, that you can run them way out of order of each other. It ends up being a decent speed *boost* to use libco to be able to treat the SMP like a regular opcode-based interpreter and just context switch when it's (rarely) needed.
3. as you mentioned, the SuperFX ROM/RAM buffering is very pesky otherwise.
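Point 2 above can be sketched roughly like this. It's a toy model, not libco and not bsnes code: all names are hypothetical, and the "catch up" step just bumps a clock where a real core would switch to the SMP thread (with libco, that's the co_switch call) and let it execute until it reaches the CPU's time.

```cpp
#include <cstdint>

// Toy model of running the SMP lazily. The SMP is only caught up when the
// CPU touches the 4-byte port window, assumed here to be $2140-$2143.
struct LazySync {
  int64_t cpuClock = 0;  // how far the CPU has run
  int64_t smpClock = 0;  // how far the SMP has been run
  unsigned syncs = 0;    // catch-ups performed (context switches in a libco design)

  static bool isPort(uint16_t addr) { return addr >= 0x2140 && addr <= 0x2143; }

  // Stand-in for "switch to the SMP thread until it reaches cpuClock".
  void catchUpSmp() {
    if (smpClock < cpuClock) {
      smpClock = cpuClock;
      ++syncs;
    }
  }

  // Called for every CPU bus access; clocks is the cost of that access.
  void cpuAccess(uint16_t addr, int64_t clocks) {
    cpuClock += clocks;
    if (isPort(addr)) catchUpSmp();  // only port accesses force a sync
  }
};
```

A hundred accesses to non-port addresses followed by one port access cost a single sync here, which is where the speed boost comes from: the SMP gets to run in one long uninterrupted stretch.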
And the areas where libco has proven a hindrance:
1. the SA1 shares all of ROM and RAM, so you can effectively never run one CPU ahead of the other. All of that context switching is unbelievably painful.
2. the CPU and PPU have a lot of trouble, too. Even though there's a limited 64-byte window for communication, there's also the H/Vblank signals. I ended up implementing a PPUcounter class for the CPU to inherit which basically predicts what the PPU H/Vblank statuses would be for any given cycle, because otherwise the CPU wouldn't know if it could run more before the PPU blanking signals would change, and the PPU couldn't run ahead because it wouldn't know if the CPU would write to one of its registers.
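A rough idea of what that kind of counter prediction looks like, as a sketch only. This is not the actual PPUcounter class: the names are hypothetical, it only handles Vblank, it assumes Vblank starts at line 225 (non-overscan), and it ignores interlace and the odd short/long scanlines.

```cpp
#include <cstdint>

constexpr uint32_t kClocksPerLine = 1364;    // NTSC, as used earlier in the thread
constexpr uint32_t kLinesPerFrame = 262;     // NTSC, as used earlier in the thread
constexpr uint32_t kVblankStartLine = 225;   // assumption: non-overscan mode

// Lets the CPU predict PPU counters without running the PPU itself.
struct PpuCounterSketch {
  uint32_t frameClock = 0;  // master clocks since line 0, clock 0 of the frame

  // Counter values `ahead` master clocks into the future.
  uint32_t vcounter(uint32_t ahead = 0) const {
    return ((frameClock + ahead) / kClocksPerLine) % kLinesPerFrame;
  }
  uint32_t hcounter(uint32_t ahead = 0) const {
    return (frameClock + ahead) % kClocksPerLine;
  }
  bool vblank(uint32_t ahead = 0) const {
    return vcounter(ahead) >= kVblankStartLine;
  }

  // How many clocks the CPU may run before the Vblank status flips.
  uint32_t clocksUntilVblankChange() const {
    const uint32_t frame = kClocksPerLine * kLinesPerFrame;
    const uint32_t start = kClocksPerLine * kVblankStartLine;
    const uint32_t pos = frameClock % frame;
    return pos < start ? start - pos : frame - pos;
  }
};
```

The CPU can then run freely for clocksUntilVblankChange() clocks (as far as the blanking signal is concerned) before it has to care what the PPU is doing.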
libco works amazingly when either thread can run well ahead of the other. Only one is enough. But it falls apart when neither thread can, because you just end up context switching every cycle of each thread.
Probably the best idea, if someone were willing, would be to use cooperative threading where it excels, and state machines where it does not.
Further, libco doesn't solve the problem of processors that do multiple things in parallel. Eg the CPU ALU, the PPU running backgrounds and sprites separately, etc. It would if we ran a thread for each of those things, but there is no fricking way we can afford the overhead of that on modern CPUs =(
Will probably do S-DD1 next since it seems like that'll be fairly straightforward to add using the existing public domain implementation. And it won't require any debug tools either, which certainly helps a lot!
One of these days I'd like to simplify Andreas Naive's SDD1 decompression code. Talarubi did that to neviksti's SPC7110 decompression code, and it's probably my favorite code in higan to look at. It's just been a low priority.
The one tricky thing about the SDD1 is that it spies on $4300-437f in order to operate the decompression. If you're not ideologically opposed to crude hacks, this is no problem at all in practice to have the SDD1 core peek inside the CPU core's internal state, but otherwise it's a bit pesky.
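For the non-hack route, the spying could look something like the sketch below: the chip keeps its own shadow copy of the DMA registers by watching CPU writes to $4300-437f. All names are hypothetical and this isn't how any particular emulator does it. It assumes the usual per-channel layout of 16 bytes per channel, with the source address at offsets 2-4 and the transfer size at offsets 5-6.

```cpp
#include <array>
#include <cstdint>

// Shadow copy of the DMA source address and size for each of the 8 channels,
// updated by snooping CPU bus writes rather than peeking into the CPU core.
struct DmaSnoop {
  struct Channel {
    uint32_t address = 0;  // 24-bit source address (offsets 2, 3, 4)
    uint16_t size = 0;     // transfer size in bytes (offsets 5, 6)
  };
  std::array<Channel, 8> channels{};

  // Call this for every CPU write; writes outside $4300-$437f are ignored.
  void write(uint16_t addr, uint8_t data) {
    if (addr < 0x4300 || addr > 0x437f) return;
    Channel& ch = channels[(addr >> 4) & 7];
    switch (addr & 15) {
      case 2: ch.address = (ch.address & 0xffff00u) | data; break;
      case 3: ch.address = (ch.address & 0xff00ffu) | (uint32_t(data) << 8); break;
      case 4: ch.address = (ch.address & 0x00ffffu) | (uint32_t(data) << 16); break;
      case 5: ch.size = static_cast<uint16_t>((ch.size & 0xff00u) | data); break;
      case 6: ch.size = static_cast<uint16_t>((ch.size & 0x00ffu) | (data << 8)); break;
      default: break;  // other DMA registers are not tracked in this sketch
    }
  }
};
```

The cost is that every CPU write has to pass through the snoop, which is exactly the peskiness the peek-into-the-CPU hack avoids.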
(Eg for the NES, mappers are a hundred times easier when you can just steal the NES PPU H/Vcounter.)