As for performance ... unfortunately SNES emulation lives in the shadow of ZSNES, and likely always will. It has completely distorted what the general public thinks of how demanding SNES emulation really is. I like the analogy of Nesticle versus Nestopia. We went from needing a 25MHz CPU to an 800MHz CPU, but it was necessary. Thankfully, everyone has an 800MHz+ CPU, so it was never an issue. puNES and Mesen need even more resources, because they do far more. It shouldn't be a surprise that SNES emulation went from needing 200MHz to needing 2-4GHz, but now we're butting up against a nearly stagnant IPC rise over the past decade and a half, and on the other end folks are pushing run-ahead and Raspberry Pis, and so things have only gotten worse, not better, over time. I have never found an effective way to explain to the general public that we're not just throwing cycles away. That we're not just terrible programmers that can't write optimized code.
I'm currently maintaining three separate SNES emulators. I match Snes9X's speed when I match their accuracy. But it is obscenely difficult to compete with them on performance without making those sacrifices. It's not for lack of trying: cooperative threading lets me run the CPU and SMP in huge blocks out of order, and a multi-threaded PPU lets me utilize today's multi-core CPUs, but that only closes the gap. Stuff like CPU ALU cycles, DMA<>HDMA sync, the SMP TEST register, cycle-accurate CPU synchronization, per-byte bus remapping, bus hold delays on memory accesses (the thing you just did for Rendering Ranger), IRQ pin holds, true SA1 memory conflict stalls, etc completely gut performance.
And regrettably for us, including for this PPU research, games just don't need any of this stuff (save of course for Air Strike Patrol, and that only needs the bare minimum), so it's easy to dismiss it. I don't want to cede this ground to the domain of FPGAs, however, so I'll keep trying.
The bsnes/higan split has been the most helpful thing I've done in a long time. I'd recommend you to consider the same, if you were willing. Most of the code can still be shared, possibly even gated behind #ifdefs. If nothing else, just keeping both a pixel-based and a scanline-based PPU will allow you to remain at least fractionally competitive with Snes9X. And even if you don't want to write your own scanline PPU, you're welcome to use mine (hey, free HD mode 7 gimmick!) Or at least lift the idea: I cache all the PPU registers (only 0x34 bytes) + CGRAM once per scanline, then I can render each scanline all at once using OpenMP. The only trick is you have to flush the queued scanlines when games try to force blank to change VRAM, but there's no games that ever turn off the display in the -middle- of the frame, so in practice it's fine.
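A rough sketch of the "cache registers + CGRAM once per scanline, render later" idea from the post above, in Python for brevity. This is not the bsnes code: the class and method names (`ScanlineQueue`, `end_of_scanline`, `before_vram_write`) and the `render_line` callback are made up for illustration, and CGRAM is assumed to be copied as 512 raw bytes. Only the 0x34-byte register count comes from the post.

```python
from dataclasses import dataclass

REG_COUNT = 0x34    # PPU register bytes cached per scanline (figure from the post)
CGRAM_SIZE = 512    # assumption: 256 colors x 2 bytes, copied raw


@dataclass
class ScanlineSnapshot:
    y: int
    regs: bytes
    cgram: bytes


class ScanlineQueue:
    """Hypothetical queue of per-scanline snapshots that are rendered in a batch."""

    def __init__(self, render_line):
        self.render_line = render_line  # callback: (snapshot, vram) -> rendered line
        self.pending = []
        self.output = {}

    def end_of_scanline(self, y, regs, cgram):
        # Copy the state so later register writes cannot alter a queued line.
        self.pending.append(ScanlineSnapshot(y, bytes(regs), bytes(cgram)))

    def flush(self, vram):
        # Each queued line only depends on its own snapshot plus VRAM,
        # so this loop is the part that could be handed to worker threads.
        for snap in self.pending:
            self.output[snap.y] = self.render_line(snap, vram)
        self.pending.clear()

    def before_vram_write(self, vram):
        # Queued lines must see VRAM as it was when they were captured,
        # so render them before the game changes it.
        self.flush(vram)
```

The emulation core would call `end_of_scanline` once per line, `before_vram_write` ahead of any VRAM change, and `flush` at the end of the frame.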
But if you're not willing to maintain two Mesen-S cores ... I'd offer to team up with you on higan? I'd like higan to be a test bench where, if we drop below 60fps, it is what it is. We have to understand the exact cycle-level behavior of the SNES before we can perfectly optimize it. If you'd like a place to emulate absolutely anything, no matter how demanding, we can use my core for that. And then you can take those findings and make the best daily driver for playing games in the scene. Well, just an offer anyway ^-^;
Any approach you take is fine, but I'd hate to see PPU findings drop off due to how resource demanding it is.
My current sore point for the PPU is ... I feel like all BGs, the sprites, etc should be separate threads running in parallel. But that's just extreme overkill. They will likely need to be at least separate state machines, however.
Lastly, a fair warning: the SA-1 (done right with memory conflicts), the ST018 (21MHz ARM6), and, when overclocked (which people love to do), the SuperFX ... they will add a whole new world of pain to performance. They generally cut my performance in half =(
I've refactored a lot of the PPU's code for this to reduce the amount of duplicated work (e.g. before, subscreens and mainscreens were rendered separately, even though you can basically process both of them at the same time with just a couple of conditions). It also reduces a lot of the excessive templating I had originally used for the PPU, which thankfully speeds up compilation as well.
The mosaic effect for high res modes should also be fixed - processing the main & sub screens at the same time actually made it a lot easier to fix. I *think* the mosaic effect should be mostly ok now - it's still not implemented in mode 7, though.
As far as maintaining multiple versions of the core, time will tell, but I'd rather stick to a single core as much as possible, for the sake of simplicity (both mine and users'). If I can manage to get it to run at fullspeed on a RPi4, great, otherwise, one day a RPi5 will come out :p (RPi3 vs 4 boosted the Mesen libretro core's speed by 200%!)
In terms of multithreading, I'm still hoping that I can get away with rendering the entire picture on a separate thread while the emulation core is allowed to continue. I think logging the state of vram/cgram/oam once per frame + all register writes would be enough for this. The only obvious scenario where I can see this being a problem is games that read the sprite range/time flags - but those flags can be calculated on-the-fly (with proper timing) based on the contents of OAM if that ever happens.
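To make the "snapshot once per frame + log all register writes" idea concrete, here is a minimal sketch. It is not Mesen-S code: `FrameLog`, `replay` and `render_worker` are invented names, writes are tagged with a scanline only (a real implementation would presumably need finer timing), and a write is assumed to take effect from its own scanline onward.

```python
import queue
import threading
from dataclasses import dataclass, field


@dataclass
class FrameLog:
    """Hypothetical per-frame capture: memory snapshots plus ordered register writes."""
    vram: bytes
    cgram: bytes
    oam: bytes
    writes: list = field(default_factory=list)  # (scanline, address, value)


def replay(log, initial_regs, lines):
    """Rebuild the register state visible on each scanline from the write log."""
    regs = dict(initial_regs)
    pending = sorted(log.writes, key=lambda w: w[0])  # stable: keeps write order
    states = []
    i = 0
    for y in range(lines):
        while i < len(pending) and pending[i][0] <= y:
            _, address, value = pending[i]
            regs[address] = value
            i += 1
        states.append(dict(regs))
    return states


def render_worker(jobs, results):
    """Runs on the render thread; the emulation thread keeps going meanwhile."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut down
            break
        log, initial_regs, lines = job
        results.put(replay(log, initial_regs, lines))
```

The emulation thread would push one `(log, regs, lines)` job per frame onto `jobs` and pick up finished frames from `results`; the actual pixel rendering would replace the `replay` call's return value.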
If it works, it would split the workload in about half and might get me a 60-80% increase in performance with relatively little change to accuracy. Obviously it can introduce a small amount of additional input lag (only if the system can't reach 60fps on a single thread), but a little input lag is better than running at 50 fps. This is essentially what I do with HD packs on Mesen.
RE: working on higan, I'm probably nowhere near knowledgeable enough about the SNES to really help on higan just yet, unfortunately. I still need to get around to implementing (at least some of) the enhancement chips and the like before I can get a better picture of how everything interacts - at the moment I still don't really know how any of them work. Now that I've fixed most other CPU/SPC/PPU issues, I might try to start working on DSP emulation soon since it sounds like that would be the most simple one.
That being said, I am looking at higan's code fairly often, so if I happen to spot anything that looks like it might be incorrect/incomplete, I'll be sure to mention it!
Just to be clear, bsnes does run the "bugged" version properly, too, it's just that the (random) contents of RAM impact the test, so sometimes you get $525, sometimes you get $560. (Mesen-S should do the same too, if you turn on the random power on ram option)
I'll expose a randomness setting in the next higan release. The core actually has three modes: none (runs a lot of homebrew, including my own older demo_* test ROMs), low (tries to mimic the patterns you see in RAM on real SNES decks, probably not great though), and high (full-on PCG randomization of everything. Clever ROMs can detect an emulator this way, but it's also the best way to suss out uninitialized memory accesses in homebrew short of a debugger keeping track of that and directly informing you. Which, hint hint with the usage map if you implemented that ^-^;)
From my end ... I have optimized the SMP and PPU to its limits, and the DSP from anomie and blargg aren't going to be beaten by my attempts. So what's left is to try optimizing the CPU and GSU. Attempts at things like 16/24-bit block reads didn't work well at all. I think I'll need bigger guns. Opcode decoding's not really a bottleneck on the 65816 like it would be with an ARM7, and there's no prefetcher or 3+-stage pipeline. So it's pretty much IRQ testing and just raw computations. We can't do the 6502 NZ=result delayed computation trick because the 65816 can swap between 8-bit and 16-bit modes for both A and X/Y. I'm using a binary min-heap for scheduling events and testing for them in O(1) time, and some fancy range-testing for IRQ trigger events. Idle loop optimizations seem like the most likely for large gains, but those are notoriously difficult to get right, and they only end up hurting performance in 100% CPU load cases, which are quite common. Hmm :/
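For readers unfamiliar with the scheduling approach mentioned above, a minimal sketch of a binary min-heap event queue using Python's `heapq`. The `Scheduler` class and the event names are hypothetical and not taken from bsnes/higan; the point is only that the earliest event sits at index 0, so the per-step "is anything due?" test is a single comparison.

```python
import heapq


class Scheduler:
    """Hypothetical event scheduler backed by a binary min-heap."""

    def __init__(self):
        self.clock = 0
        self.heap = []   # entries: (due_time, sequence, name)
        self.seq = 0     # tie-breaker so equal due times fire in insertion order

    def schedule(self, delay, name):
        heapq.heappush(self.heap, (self.clock + delay, self.seq, name))
        self.seq += 1

    def step(self, cycles):
        """Advance the clock and return the names of all events that came due."""
        self.clock += cycles
        fired = []
        # heap[0] is always the earliest event: an O(1) peek.
        while self.heap and self.heap[0][0] <= self.clock:
            fired.append(heapq.heappop(self.heap)[2])
        return fired
```

Inserting or removing an event costs O(log n), but the common case (nothing due yet) only pays for the peek.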
Sour wrote: That's my fault - the last 2 stack relative tests were incorrect in the original version. They used the wrong addressing mode, and also gave random results (because it loaded a random address, that could end up being 12-cycle or 6-cycle registers). The updated version is here and koitsu also recorded that version of the test in this video (starting at 0:24)
Thanks, indeed the updated version works fine.
With timing_test.sfc I found it strange that it showed random results (56-57 higan / 52 bsnes) at V-pos for the HDMA test, but the V-pos of the following tests stayed the same.
Oops, it's overwriting the results of the HDMA test here? https://github.com/SourMesen/SnesTests/ ... ain.s#L370
byuu wrote: but it's also the best way to suss out uninitialized memory accesses in homebrew short of a debugger keeping track of that and directly informing you. Which, hint hint with the usage map if you implemented that ^-^;)
This is an option in Mesen (Break on uninit memory reads), and it's almost available in Mesen-S - it's only missing a UI option to turn it on and a tiny bit of code to trigger the breakpoint. It's near the top of my list of debugger features I need to finish/add.
I actually haven't paid much attention to the actual CPU core so far (and there are a number of things that aren't really optimal in terms of flag handling, etc.). The majority of the time (other than SPC/PPU) seems to be spent on "clocking" the entire system (e.g. incrementing PPU position, checking for IRQs, checking for DMAs, etc).
At the moment SPC seems to be 5%, DSP 5%, too. Clocking + processing CPU reads are ~15% together. Then the PPU tends to be a good 30-40%, I think; I would have to profile again and break it down into categories to really know. It's eerily similar to Mesen in general, though. (The saving grace on the SNES is that you can run the PPU in batches, whereas on the NES this becomes a nightmare because cartridges can spy on the VRAM bus)
For the "events" I'm currently just using a boolean array with a few hclock values set to true (and they get updated as needed) and then I process all possible events whenever I hit a true value. The CPU time spent on checking that boolean array is fairly high, though, so I need to see if I can somehow find a better solution for this.
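A small sketch of the boolean-array approach described above, mostly to show where the per-clock cost comes from. The names (`EventFlags`, `advance`) are invented, the loop steps one hclock at a time for simplicity, and 1364 hclocks per scanline is assumed (the usual NTSC line length).

```python
HCLOCKS_PER_LINE = 1364  # assumption: master clocks on a normal NTSC scanline


class EventFlags:
    """Hypothetical per-hclock event table: True means 'check all events here'."""

    def __init__(self, handler):
        self.flags = [False] * HCLOCKS_PER_LINE
        self.handler = handler  # called with the hclock that was flagged

    def set_event(self, hclock):
        self.flags[hclock] = True

    def clear_event(self, hclock):
        self.flags[hclock] = False

    def advance(self, hclock, cycles):
        """Step forward, running the handler on flagged positions; returns the new hclock."""
        for _ in range(cycles):
            hclock = (hclock + 1) % HCLOCKS_PER_LINE
            if self.flags[hclock]:   # this lookup runs on every single step
                self.handler(hclock)
        return hclock
```

The array lookup itself is cheap, but it is paid on every step whether or not an event is near, which is consistent with the profile described in the post.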
paulb_nl wrote: Oops, it's overwriting the results of the HDMA test here? :)
And suddenly the results make so much more sense! I've been wondering for weeks why that particular value was different. Thank you!
With this, it looks like Mesen-S matches bsnes/higan pretty closely, but it looks like I might have an issue with the very last test (DMA with fast rom turned on) since the timings diverge quite a bit there compared to higan 106. According to koitsu's run of the test, it looks like higan has it right though, so I'll have to investigate.
It seems to be working for the DSP games I've tested, except Super Bases Loaded 2 (the game screen is broken once gameplay starts). But I'm not sure if that's due to a bug in the DSP code or just another general emulation issue (I haven't really looked into it).
It supports loading the bios files (in the "Bios" subfolder) as a single file or 2 separate files (using what seems to be higan's current implementation + what it used to be before, I think?)
Code:
memory type=ROM content=Program
  map address=00-3f,80-bf:8000-ffff mask=0x8000
memory type=RAM content=Save
  map address=70-7d,f0-ff:0000-7fff mask=0x8000
processor architecture=uPD7725
  map address=60-6f,e0-ef:0000-7fff mask=0x3fff
  memory type=ROM content=Program architecture=uPD7725
  memory type=ROM content=Data architecture=uPD7725
  memory type=RAM content=Data architecture=uPD7725
  oscillator
The ST011 requires additional uPD96050 instructions that aren't in the 7725, to account for the larger ROM and RAM.
Quote: It supports loading the bios files (in the "Bios" subfolder) as a single file or 2 separate files (using what seems to be higan's current implementation + what it used to be before, I think?)
Yeah. st011.rom == st011.program.rom + st011.data.rom concatenated.
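The relationship between the combined and split firmware files can be sketched as below. The function names are made up, and the sizes in the usage note are my assumption (a 16384-word x 24-bit program ROM and a 2048-word x 16-bit data ROM for the uPD96050); only the program-then-data order comes from the post.

```python
def split_firmware(blob, program_size, data_size):
    """Split a combined firmware image into (program, data).

    Assumes the layout from the post: program ROM first, data ROM second.
    The sizes are supplied by the caller; they are not detected here.
    """
    if len(blob) != program_size + data_size:
        raise ValueError("unexpected firmware size: %d bytes" % len(blob))
    return blob[:program_size], blob[program_size:]


def join_firmware(program, data):
    """Inverse of split_firmware: concatenate program and data ROMs."""
    return program + data
```

Under the size assumption above, a combined st011.rom would be split with `split_firmware(blob, 0xC000, 0x1000)`.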
Another fair warning, very few people have the firmware and there's no way to make this easy for them. After about seven years of trying, I gave in and restored the DSP HLE code for when firmware wasn't available, but that's about 600KiB of code and doesn't emulate the DSP-3, ST-011, or ST-018. It also lacks timing, so every DSP operation completes instantly, meaning games run a bit too fast.
EDIT: removed formats talk. I shouldn't have brought it up in this thread.
byuu wrote: Awesome work! Super Bases Loaded 2 uses a different memory map than the other DSP games.
That was exactly it, thank you!
And thank you for your previous work on this - would have taken me an exponentially longer amount of time to figure this out if all I had was the data sheet & nocash' docs without any actual implementation to reference.
Just got the ST01X versions working as well (wasted 2 hours on the Shougi game because I accidentally gave the chip 2kb of RAM instead of 4kb...).
Adding the bios at the end of the file seems pretty reasonable - it also sounded like nocash' snes emulator might support this based on his documentation:
Quote: Ideally, the uPD77C25 ROM-Image should be appended at the end of the SNES ROM-Image. In practice, it's often not there, so there's no way to detect if the game uses this or that uPD77C25 ROM
RE: HLE emulation. I'm not really planning on supporting HLE emulation. Like you said, it's a huge amount of effort in terms of coding and far more error-prone, too. Plus HLE just doesn't mesh well with the debugger tools - kind of hard to support trace logging & debugging the execution when there is no real code running. And with movies (or even netplay), everything can easily fall apart unless you start requiring that a movie recorded with LLE needs the bios to be played back, etc. There are just far too many other things I need to get done to consider putting any time into HLE at this point - maybe in a few years! :p
Quote: And thank you for your previous work on this - would have taken me an exponentially longer amount of time to figure this out if all I had was the data sheet & nocash' docs without any actual implementation to reference.
The hail mary fix that blew our damned minds at the time was that the uPD7725's left shift instruction shifts in ones, not zeroes! So 3<<2=15, not 12. I was tracking down missing tiles in Top Gear 3000, and after an obscene amount of debugging I started getting desperate when I thought to try that one.
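The shift quirk is easy to state in code. A tiny sketch, assuming a 16-bit accumulator; `shl_ones` is a made-up helper name and this does not claim to model which uPD7725 opcodes behave this way, only the arithmetic from the post:

```python
def shl_ones(value, count, width=16):
    """Left shift that fills the vacated low bits with ones instead of zeroes."""
    mask = (1 << width) - 1
    fill = (1 << count) - 1          # e.g. count=2 -> 0b11
    return ((value << count) | fill) & mask


def shl_zeroes(value, count, width=16):
    """Ordinary left shift, for comparison."""
    return (value << count) & ((1 << width) - 1)
```

With these, `shl_ones(3, 2)` gives 15 while `shl_zeroes(3, 2)` gives 12, matching the example above.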
Quote: Just got the ST01X versions working as well (wasted 2 hours on the Shougi game because I accidentally gave the chip 2kb of RAM instead of 4kb...).
Tangent: there's a Japanese supermarket near me that plays the same song you hear on the Shougi game title screens over their loudspeaker on a loop. Some kind of public domain song I guess. I have that song burned into my mind, so it's amusing when I go there.
Okay, moving on ... the HG51B169 in Mega Man X2/X3 is also really quite simple. It gets a little tricky with the program ROM caching that affects the timing in the intro sequence, but it's not much worse than the uPD96050.
The ARM6 in Hayazashi Nidan Morita Shougi 2 ... honestly ... just don't support that, unless you're a masochist :P
I've never found anyone that actually tried to play it. And when I implemented it, it led to me making a GBA emulator since I had it, and now here I am 24 emulators and counting later wondering what the heck happened to my life :P
But, if you really want it ... you're free to use my core, or to write your own, just ... fair warning.
Oh, credits! I'd like to credit Talarubi, AWJ, Lord Nightmare for their help with uPD7725/96050 emulation. segher, Overload, Jonas Quinn for their help with HG51B169 emulation. Talarubi for hir help with ARM6 emulation. I couldn't have done it without those folks. Apologies if I forgot anyone! ^-^;;
byuu wrote: If we can convince No-Intro to adopt this format now that apparently three emulators and counting do, I'll remove my HLE code again.
I signed up for the No-Intro forums to politely argue for and request this. The SNES datfile now has separate "combined" and "split" variants, so it's technically possible, but the only game that actually *has* combined and split variants is... PowerFest '94. And it doesn't include the DSP firmware, it just combines (or splits) the control ROM and the three game ROMs.
Maybe another round of polite requesting is in order.
byuu wrote: It's also easy to figure out what firmware you have [...]
It's also easy to figure out if the ROM has a copier header, and yet we mostly have "cleaned" files now...
I'd like appended firmware much more if there was something else in the file that describes the file layout. For example a directory footer. Just going by file size isn't enough imo.