It is currently Tue Jul 16, 2019 5:50 am

All times are UTC - 7 hours



PostPosted: Thu Jul 11, 2019 10:02 pm 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1498
I edited my previous post just before your reply: I do want to say that I appreciate new approaches here very much. SNES emulation has been stagnant for a long time, and it's been a fear of mine that I might end up holding things back. I've heard several prominent devs state they didn't make an SNES emulator because mine existed, which is the worst thing I could hear ^^;

As for performance ... unfortunately SNES emulation lives in the shadow of ZSNES, and likely always will. It has completely distorted what the general public thinks of how demanding SNES emulation really is. I like the analogy of Nesticle versus Nestopia. We went from needing a 25MHz CPU to an 800MHz CPU, but it was necessary. Thankfully, everyone has an 800MHz+ CPU, so it was never an issue. pNES and Mesen need even more resources, because they do far more. It shouldn't be a surprise that SNES emulation went from needing 200MHz to needing 2-4GHz, but now we're butting up against a nearly stagnant IPC rise over the past decade and a half, and on the other end folks are pushing run-ahead and Raspberry Pis, and so things have only gotten worse, not better, over time. I have never found an effective way to explain to the general public that we're not just throwing cycles away. That we're not just terrible programmers that can't write optimized code.

I'm currently maintaining three separate SNES emulators. I match Snes9X's speed when I match their accuracy. But it is obscenely difficult to compete with them on performance without making those sacrifices. It's not for lack of trying: cooperative threading lets me run the CPU and SMP in huge blocks out of order, and a multi-threaded PPU lets me utilize today's multi-core CPUs, but that only closes the gap. Stuff like CPU ALU cycles, DMA<>HDMA sync, the SMP TEST register, cycle-accurate CPU synchronization, per-byte bus remapping, bus hold delays on memory accesses (the thing you just did for Rendering Ranger), IRQ pin holds, true SA1 memory conflict stalls, etc completely gut performance.

And regrettably for us, including for this PPU research, games just don't need any of this stuff (save of course for Air Strike Patrol, and that only needs the bare minimum), so it's easy to dismiss it. I don't want to cede this ground to the domain of FPGAs, however, so I'll keep trying.

The bsnes/higan split has been the most helpful thing I've done in a long time. I'd recommend you consider the same, if you're willing. Most of the code can still be shared, possibly even gated behind #ifdefs. If nothing else, just keeping both a pixel-based and a scanline-based PPU will allow you to remain at least fractionally competitive with Snes9X. And even if you don't want to write your own scanline PPU, you're welcome to use mine (hey, free HD mode 7 gimmick!) Or at least lift the idea: I cache all the PPU registers (only 0x34 bytes) + CGRAM once per scanline, then I can render the queued scanlines all at once using OpenMP. The only trick is you have to flush the queued scanlines when games try to force blank to change VRAM, but there are no games that ever turn off the display in the -middle- of the frame, so in practice it's fine.
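A minimal sketch of the scanline-caching idea described above, not the actual bsnes code. All type and function names are hypothetical, the renderer is a placeholder that only paints the backdrop color, and the 256x240 frame size is an assumption made for illustration. The point it shows is that each queued line carries its own register and CGRAM snapshot, so lines can be rendered independently, and that the queue has to be drained before VRAM changes because VRAM itself is not snapshotted.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int kWidth = 256;   // assumed frame width
constexpr int kHeight = 240;  // assumed frame height

// Hypothetical snapshot of what one scanline needs in order to render.
struct LineState {
  int y = 0;                          // scanline number
  std::array<uint8_t, 0x34> regs{};   // the 0x34 bytes of PPU registers
  std::array<uint16_t, 256> cgram{};  // palette RAM, 256 colors
};

struct ScanlineQueue {
  std::vector<LineState> pending;
  std::vector<uint16_t> frame = std::vector<uint16_t>(kWidth * kHeight);
  int flushes = 0;

  // Called once per scanline by the emulation core.
  void capture(const LineState& s) {
    assert(s.y >= 0 && s.y < kHeight);
    pending.push_back(s);
  }

  // Placeholder renderer: fills the line with CGRAM color 0 (the backdrop).
  void renderLine(const LineState& s) {
    for (int x = 0; x < kWidth; x++) frame[s.y * kWidth + x] = s.cgram[0];
  }

  // Render every queued line. The pragma is ignored when OpenMP is disabled,
  // in which case the loop simply runs serially.
  void flush() {
    const int count = static_cast<int>(pending.size());
    #pragma omp parallel for
    for (int i = 0; i < count; i++) renderLine(pending[i]);
    pending.clear();
    flushes++;
  }

  // VRAM is shared rather than snapshotted, so drain the queue before a write.
  void beforeVramWrite() {
    if (!pending.empty()) flush();
  }
};
```

In this sketch the core would call `capture()` at the end of every scanline, `beforeVramWrite()` from the VRAM write handler, and `flush()` once at the end of the frame.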

But if you're not willing to maintain two Mesen-S cores ... I'd offer to team up with you on higan? I'd like higan to be a test bench where, if we drop below 60fps, it is what it is. We have to understand the exact cycle-level behavior of the SNES before we can perfectly optimize it. If you'd like a place to emulate absolutely anything, no matter how demanding, we can use my core for that. And then you can take those findings and make the best daily driver for playing games in the scene. Well, just an offer anyway ^-^;

Any approach you take is fine, but I'd hate to see PPU findings drop off due to how resource demanding it is.

My current sore point for the PPU is ... I feel like all BGs, the sprites, etc should be separate threads running in parallel. But that's just extreme overkill. They will likely need to be at least separate state machines, however.

Lastly, a fair warning: the SA-1 (done right with memory conflicts), ST018 (21MHz ARM6), and when overclocked which people love to do, SuperFX ... they will add a whole new world of pain to performance. They generally cut my performance in half =(


PostPosted: Sat Jul 13, 2019 7:58 am 
Offline

Joined: Sun Feb 07, 2016 6:16 pm
Posts: 706
Spent the past couple of days optimizing a lot of stuff (mostly the PPU) - managed to boost performance by a good 30% or so on average (went from ~190fps to ~260fps in FF2 on a first-gen Ryzen 7 CPU). This gets me up to 350fps with frame skipping (rendering ~60fps and skipping the rest), which isn't bad for a single thread. It's about as fast as the 0.1.0 release in most scenarios, despite being way more accurate.

I've refactored a lot of the PPU's code for this to reduce the amount of duplicated work (e.g before subscreens and mainscreens were rendered separately, even though you can basically process both of them at the same time with just a couple of conditions). It also reduces a lot of the excessive templating I had originally used for the PPU, which thankfully speeds up compilation as well.

The mosaic effect for high res modes should also be fixed - processing the main & sub screens at the same time actually made it a lot easier to fix. I *think* the mosaic effect should be mostly ok now - it's still not implemented in mode 7, though.

As far as maintaining multiple versions of the core, time will tell, but I'd rather stick to a single core as much as possible, for the sake of simplicity (both mine and users'). If I can manage to get it to run at full speed on a RPi4, great, otherwise, one day a RPi5 will come out :p (RPi3 vs 4 boosted the Mesen libretro core's speed by 200%!)

In terms of multithreading, I'm still hoping that I can get away with rendering the entire picture on a separate thread while the emulation core is allowed to continue. I think logging the state of vram/cgram/oam once per frame + all register writes would be enough for this. The only obvious scenario where I can see this being a problem is games that read the sprite range/time flag - but those flags can be calculated on-the-fly (with proper timing) based on the contents of OAM if that ever happens.
If it works, it would split the workload in about half and might get me a 60-80% increase in performance with relatively little change to accuracy. Obviously it can introduce a small amount of additional input lag (only if the system can't reach 60fps on a single thread), but a little input lag is better than running at 50 fps. This is essentially what I do with HD packs on Mesen.
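A rough sketch of the "snapshot plus write log" idea, under heavy simplification. The struct and function names are hypothetical, the log only records the scanline of each write (no hclock), and the replay just tracks the brightness bits of INIDISP ($2100) per line as a stand-in for real rendering. What it illustrates is that the replay reads nothing but the job object, which is what would make it safe to hand to a worker thread while the core keeps running.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One logged PPU register write (hypothetical layout).
struct RegWrite {
  uint16_t scanline;  // line on which the write happened
  uint16_t addr;      // e.g. 0x2100 for INIDISP
  uint8_t value;
};

// Everything the renderer needs for one frame; it never touches live state.
struct FrameJob {
  std::vector<uint8_t> vram, cgram, oam;  // copied once per frame
  std::vector<RegWrite> writes;           // in emulation order
};

constexpr int kVisibleLines = 224;  // assumed visible line count

// Replays the log line by line and returns the brightness seen by each line.
std::vector<uint8_t> replayBrightness(const FrameJob& job) {
  std::vector<uint8_t> perLine(kVisibleLines, 0);
  uint8_t inidisp = 0;
  std::size_t next = 0;
  for (int y = 0; y < kVisibleLines; y++) {
    // Apply every write logged on or before this line.
    while (next < job.writes.size() && job.writes[next].scanline <= y) {
      // The log is expected to be sorted by scanline.
      assert(next == 0 ||
             job.writes[next - 1].scanline <= job.writes[next].scanline);
      if (job.writes[next].addr == 0x2100) inidisp = job.writes[next].value;
      next++;
    }
    perLine[y] = static_cast<uint8_t>(inidisp & 0x0f);  // low 4 bits: brightness
  }
  return perLine;
}
```

A real implementation would also need the log to carry horizontal timing, and would need a separate path for the sprite range/time flags mentioned above, since those are read back by the CPU.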

RE: working on higan, I'm probably nowhere near knowledgeable enough about the SNES to really help on higan just yet, unfortunately. I still need to get around to implementing (at least some of) the enhancement chips and the like before I can get a better picture of how everything interacts - at the moment I still don't really know how any of them work. Now that I've fixed most other CPU/SPC/PPU issues, I might try to start working on DSP emulation soon since it sounds like that would be the most simple one.
That being said, I am looking at higan's code fairly often, so if I happen to spot anything that looks like it might be incorrect/incomplete, I'll be sure to mention it!


PostPosted: Sun Jul 14, 2019 3:23 am 
Offline

Joined: Fri Nov 18, 2016 7:57 am
Posts: 20
The videos of op_timing_test_v2 on console show LDA XM $0525 on the Stack Rel Ind Idx Y Timing page but on Mesen-S and Bsnes the result is $0560.


PostPosted: Sun Jul 14, 2019 4:57 am 
Offline

Joined: Sun Feb 07, 2016 6:16 pm
Posts: 706
That's my fault - the last 2 stack relative tests were incorrect in the original version. They used the wrong addressing mode, and also gave random results (because they loaded from a random address, which could end up being a 12-cycle or 6-cycle register). The updated version is here and koitsu also recorded that version of the test in this video (starting at 0:24)

Just to be clear, bsnes does run the "bugged" version properly, too, it's just that the (random) contents of RAM impact the test, so sometimes you get $525, sometimes you get $560. (Mesen-S should do the same too, if you turn on the random power on ram option)


PostPosted: Sun Jul 14, 2019 7:49 am 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1498
Great work on speeding up the PPU! I was going to write my concerns with rendering video in a truly separate thread but it seems like you covered it all already. I'd like to do it with bsnes anyway as an option, knowing it will add up to one frame of latency, since it does give a nice speed boost for non-special-chip games.

I'll expose a randomness setting in the next higan release. The core actually has three modes: none (runs a lot of homebrew, including my own older demo_* test ROMs), low (tries to mimic the patterns you see in RAM on real SNES decks, probably not great though), and high (full-on PCG randomization of everything. Clever ROMs can detect an emulator this way, but it's also the best way to suss out uninitialized memory accesses in homebrew short of a debugger keeping track of that and directly informing you. Which, hint hint with the usage map if you implemented that ^-^;)

From my end ... I have optimized the SMP and PPU to their limits, and the DSP from anomie and blargg isn't going to be beaten by my attempts. So what's left is to try optimizing the CPU and GSU. Attempts at things like 16/24-bit block reads didn't work well at all. I think I'll need bigger guns. Opcode decoding's not really a bottleneck on the 65816 like it would be with an ARM7, and there's no prefetcher or 3+-stage pipeline. So it's pretty much IRQ testing and just raw computations. We can't do the 6502 NZ=result delayed computation trick because the 65816 can swap between 8-bit and 16-bit modes for both A and X/Y. I'm using a binary min-heap for scheduling events and testing for them in O(1) time, and some fancy range-testing for IRQ trigger events. Idle loop optimizations seem like the most likely for large gains, but those are notoriously difficult to get right, and they only end up hurting performance in 100% CPU load cases, which are quite common. Hmm :/
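A minimal sketch of a min-heap event scheduler of the kind mentioned above, built on `std::priority_queue`. This is not the bsnes implementation; the event names and the `Scheduler` interface are made up for illustration. The property it demonstrates is that checking whether anything is due only looks at the heap's top element, which is O(1), while inserting or removing an event costs O(log n).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

enum class EventId { HBlank, VBlank, Irq };  // hypothetical event kinds

struct Event {
  uint64_t clock;  // absolute master clock at which the event fires
  EventId id;
  bool operator>(const Event& other) const { return clock > other.clock; }
};

class Scheduler {
  // std::greater turns the default max-heap into a min-heap on `clock`.
  std::priority_queue<Event, std::vector<Event>, std::greater<Event>> heap;

public:
  // O(log n): insert a new event.
  void schedule(uint64_t clock, EventId id) { heap.push(Event{clock, id}); }

  // O(1): only the earliest event has to be inspected.
  bool due(uint64_t now) const {
    return !heap.empty() && heap.top().clock <= now;
  }

  // O(log n): remove and return the earliest event.
  Event pop() {
    assert(!heap.empty());
    Event e = heap.top();
    heap.pop();
    return e;
  }
};
```

The emulation loop would then call `due(now)` after each clock step and only enter the slower `pop()` path when it returns true.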


PostPosted: Sun Jul 14, 2019 8:13 am 
Offline

Joined: Fri Nov 18, 2016 7:57 am
Posts: 20
Sour wrote:
That's my fault - the last 2 stack relative tests were incorrect in the original version. They used the wrong addressing mode, and also gave random results (because they loaded from a random address, which could end up being a 12-cycle or 6-cycle register). The updated version is here and koitsu also recorded that version of the test in this video (starting at 0:24)


Thanks, indeed the updated version works fine.

With timing_test.sfc I found it strange that it showed random results (56-57 higan / 52 bsnes) at V-pos for the HDMA test, but the V-pos of the following tests stayed the same.

Oops, it's overwriting the results of the HDMA test here? :) https://github.com/SourMesen/SnesTests/ ... ain.s#L370


PostPosted: Sun Jul 14, 2019 8:45 am 
Offline

Joined: Sun Feb 07, 2016 6:16 pm
Posts: 706
byuu wrote:
but it's also the best way to suss out uninitialized memory accesses in homebrew short of a debugger keeping track of that and directly informing you. Which, hint hint with the usage map if you implemented that ^-^;)
This is an option in Mesen (Break on uninit memory reads), and it's almost available in Mesen-S - it's only missing a UI option to turn it on and a tiny bit of code to trigger the breakpoint. It's near the top of my list of debugger features I need to finish/add.

I actually haven't paid much attention to the actual CPU core so far (and there are a number of things that aren't really optimal in terms of flag handling, etc.). The majority of the time (other than SPC/PPU) seems to be spent on "clocking" the entire system (e.g incrementing PPU position, checking for IRQs, checking for DMAs, etc).
At the moment SPC seems to be 5%, DSP 5%, too. Clocking + processing CPU reads are ~15% together. Then the PPU tends to be a good 30-40%, I think, would have to profile again and break it down into categories to really know. It's eerily similar to Mesen in general, though. (The saving grace on the SNES is that you can run the PPU in batches, whereas on the NES this becomes a nightmare because cartridges can spy on the VRAM bus)

For the "events" I'm currently just using a boolean array with a few hclock values set to true (and they get updated as needed) and then I process all possible events whenever I hit a true value. The CPU time spent on checking that boolean array is fairly high, though, so I need to see if I can somehow find a better solution for this.
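A small sketch of the boolean-array approach described above, not Mesen-S's actual code. The names are hypothetical and the 1364 hclocks per scanline is an assumption used to size the array. It shows why the check is cheap per step but still adds up: the array is consulted on every single clock step, even though only a handful of entries are ever true.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kHclocksPerLine = 1364;  // assumed master clocks per scanline

struct EventFlags {
  std::array<bool, kHclocksPerLine> flagged{};  // true where some event may fire
  int processed = 0;                            // how often the slow path ran

  // Mark or unmark an hclock; called whenever an event position changes.
  void set(int hclock, bool on) {
    assert(hclock >= 0 && hclock < kHclocksPerLine);
    flagged[hclock] = on;
  }

  // Hot path: one array lookup per clock step.
  void step(int hclock) {
    if (flagged[hclock]) processAllEvents(hclock);
  }

  // Slow path placeholder: a real core would test IRQs, DMA, HDMA, etc. here.
  void processAllEvents(int /*hclock*/) { processed++; }
};
```

One alternative, in the spirit of the min-heap scheduler mentioned earlier in the thread, would be to keep only the nearest pending hclock and compare against that single value on each step.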

paulb_nl wrote:
Oops, it's overwriting the results of the HDMA test here? :)
And suddenly the results make so much more sense! I've been wondering for weeks why that particular value was different. Thank you!
With this, it looks like Mesen-S matches bsnes/higan pretty closely, but it looks like I might have an issue with the very last test (DMA with fast rom turned on) since the timings diverge quite a bit there compared to higan 106. According to koitsu's run of the test, it looks like higan has it right though, so I'll have to investigate.


PostPosted: Sun Jul 14, 2019 7:00 pm 
Offline

Joined: Sun Feb 07, 2016 6:16 pm
Posts: 706
Just finished adding LLE DSP support, so Mario Kart finally works.
It seems to be working for the DSP games I've tested, except Super Bases Loaded 2 (the game screen is broken once gameplay starts). But I'm not sure if that's due to a bug in the DSP code or just another general emulation issue (I haven't really looked into it).

It supports loading the bios files (in the "Bios" subfolder) as a single file or 2 separate files (using what seems to be higan's current implementation + what it used to be before, I think?)


PostPosted: Sun Jul 14, 2019 10:46 pm 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1498
Awesome work! Super Bases Loaded 2 uses a different memory map than the other DSP games.

https://preservation.byuu.org/games/Lj4b/BxxE98BT
https://preservation.byuu.org/boards/SycB

SHVC-2B3B-01
Code:
memory type=ROM content=Program
  map address=00-3f,80-bf:8000-ffff mask=0x8000
memory type=RAM content=Save
  map address=70-7d,f0-ff:0000-7fff mask=0x8000
processor architecture=uPD7725
  map address=60-6f,e0-ef:0000-7fff mask=0x3fff
  memory type=ROM content=Program architecture=uPD7725
  memory type=ROM content=Data architecture=uPD7725
  memory type=RAM content=Data architecture=uPD7725
  oscillator


(the mask selects between DR and SR.)
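A small sketch of how that DR/SR selection could look for this board, based on one reading of the manifest above: with the 0000-7fff window and mask=0x3fff, bit 14 is the only address bit left to distinguish the two ports. The function and enum names are hypothetical, and the assumption that the lower half maps to DR and the upper half to SR should be checked against a real implementation.

```cpp
#include <cassert>
#include <cstdint>

enum class DspPort { DR, SR };  // data register / status register

// `offset` is the address inside the 0000-7fff window of banks 60-6f / e0-ef.
// Assumption: after masking out 0x3fff, bit 14 picks the port.
inline DspPort selectPort(uint16_t offset) {
  assert(offset < 0x8000);
  return (offset & 0x4000) ? DspPort::SR : DspPort::DR;
}
```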

The ST011 requires additional uPD96050 instructions that aren't in the 7725, to account for the larger ROM and RAM.

Quote:
It supports loading the bios files (in the "Bios" subfolder) as a single file or 2 separate files (using what seems to be higan's current implementation + what it used to be before, I think?)


Yeah. st011.rom == st011.program.rom + st011.data.rom concatenated.
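A minimal sketch of splitting a combined firmware image back into its two parts, following the "program then data" order stated above. The names are hypothetical. The caller supplies the program ROM size; for the uPD96050 a value of 0xc000 bytes is assumed here (leaving 0x1000 bytes of data out of the 0xd000 total), which should be verified against real firmware before relying on it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<uint8_t>;

struct Firmware {
  Bytes program;  // first part of the combined file
  Bytes data;     // everything after the program ROM
};

// Returns false if the file is too small to contain a program ROM of the
// requested size plus at least one byte of data ROM.
bool splitFirmware(const Bytes& combined, std::size_t programBytes, Firmware& out) {
  assert(programBytes > 0);
  if (combined.size() <= programBytes) return false;
  const auto split = combined.begin() + static_cast<std::ptrdiff_t>(programBytes);
  out.program.assign(combined.begin(), split);
  out.data.assign(split, combined.end());
  return true;
}
```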

The format I tried (and failed) to propose was for No-Intro to store games with the firmware appended onto them, but it didn't gain any traction since only bsnes supports it. I support that plus the split external firmware files currently.

Another fair warning, very few people have the firmware and there's no way to make this easy for them. After about seven years of trying, I gave in and restored the DSP HLE code for when firmware wasn't available, but that's about 600KiB of code and doesn't emulate the DSP-3, ST-011, or ST-018. It also lacks timing, so every DSP operation completes instantly, meaning games run a bit too fast.


PostPosted: Mon Jul 15, 2019 9:36 pm 
Offline

Joined: Sun Feb 07, 2016 6:16 pm
Posts: 706
byuu wrote:
Awesome work! Super Bases Loaded 2 uses a different memory map than the other DSP games.
That was exactly it, thank you!
And thank you for your previous work on this - would have taken me an exponentially longer amount of time to figure this out if all I had was the data sheet & nocash' docs without any actual implementation to reference.

Just got the ST01X versions working as well (wasted 2 hours on the Shougi game because I accidentally gave the chip 2kb of RAM instead of 4kb...).

Adding the bios at the end of the file seems pretty reasonable - it also sounded like nocash' snes emulator might support this based on his documentation:
Quote:
Ideally, the uPD77C25 ROM-Image should be appended at the end of the SNES ROM-Image. In practice, it's often not there, so there's no way to detect if the game uses this or that uPD77C25 ROM


RE: HLE emulation. I'm not really planning on supporting HLE emulation. Like you said, it's a huge amount of effort in terms of coding and far more error-prone, too. Plus HLE just doesn't mesh well with the debugger tools - kind of hard to support trace logging & debugging the execution when there is no real code running. And with movies (or even netplay), everything can easily fall apart unless you start requiring that a movie recorded with LLE needs the bios to be played back, etc. There are just far too many other things I need to get done to consider putting any time into HLE at this point - maybe in a few years! :p


PostPosted: Mon Jul 15, 2019 10:20 pm 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1498
Quote:
And thank you for your previous work on this - would have taken me an exponentially longer amount of time to figure this out if all I had was the data sheet & nocash' docs without any actual implementation to reference.


The hail mary fix that blew our damned minds at the time was that the uPD7725's left shift instruction shifts in ones, not zeroes! So 3<<2=15, not 12. I was tracking down missing tiles in Top Gear 3000, and after an obscene amount of debugging I started getting desperate when I thought to try that one.
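To make the arithmetic concrete, here is a tiny sketch of a left shift that fills the vacated low bits with ones instead of zeroes. The helper name is hypothetical and the 16-bit width is an assumption; it is an illustration of the behavior described above, not code from any emulator.

```cpp
#include <cassert>
#include <cstdint>

// Shift `value` left by `count` bits, shifting in ones at the bottom.
// The result is truncated to 16 bits.
inline uint16_t shiftLeftOnes(uint16_t value, unsigned count) {
  assert(count < 16);
  const uint32_t shifted = static_cast<uint32_t>(value) << count;
  const uint32_t fill = (1u << count) - 1u;  // `count` low bits set
  return static_cast<uint16_t>(shifted | fill);
}
```

With this helper, `shiftLeftOnes(3, 2)` gives 0b1111 = 15, where an ordinary `3 << 2` gives 0b1100 = 12.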

Quote:
Just got the ST01X versions working as well (wasted 2 hours on the Shougi game because I accidentally gave the chip 2kb of RAM instead of 4kb...).


Tangent: there's a Japanese supermarket near me that plays the same song you hear on the Shougi game title screens over their loudspeaker on a loop. Some kind of public domain song I guess. I have that song burned into my mind, so it's amusing when I go there.

Quote:
Adding the bios at the end of the file seems pretty reasonable - it also sounded like nocash' snes emulator might support this based on his documentation:


I didn't want to impose my design on you, but if you're willing to support that, I'd appreciate it.

The nice thing about appended firmware is every single game continues to work in ZSNES and Snes9X.

It's also easy to figure out what firmware you have:

Code:
auto SuperFamicom::romSize() const -> uint {
  if((size() & 0x7fff) == 0x200) return size() - 0x200;  //copier header
  //subtract appended firmware size, if firmware is present
  if((size() &  0x7fff) ==   0x100) return size() -   0x100;  //SGB1, SGB2
  if((size() &  0x7fff) ==   0xc00) return size() -   0xc00;  //Cx4 (HG51B169)
  if((size() &  0x7fff) ==  0x2000) return size() -  0x2000;  //DSP1-4 (uPD7725)
  if((size() &  0xffff) ==  0xd000) return size() -  0xd000;  //ST010-011 (uPD96050)
  if((size() & 0x3ffff) == 0x28000) return size() - 0x28000;  //ST018 (ARM6)
  return size();  //no firmware
}
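For readers who want to try the size checks outside of the class, here is a standalone sketch that mirrors the function above but also reports which firmware it thinks is appended. The enum, struct, and function names are hypothetical; the masks and sizes are copied from the snippet above, including its behavior of returning early when a copier header is found.

```cpp
#include <cassert>
#include <cstddef>

enum class AppendedFirmware { None, SGB, Cx4, uPD7725, uPD96050, ARM6 };

struct RomLayout {
  std::size_t romSize;        // size of the game ROM itself
  AppendedFirmware firmware;  // firmware detected after the ROM, if any
};

// Same checks, in the same order, as the romSize() snippet above.
inline RomLayout detectLayout(std::size_t size) {
  RomLayout r{size, AppendedFirmware::None};
  if ((size & 0x7fff) == 0x200) r.romSize = size - 0x200;  // copier header
  else if ((size & 0x7fff) == 0x100) r = {size - 0x100, AppendedFirmware::SGB};
  else if ((size & 0x7fff) == 0xc00) r = {size - 0xc00, AppendedFirmware::Cx4};
  else if ((size & 0x7fff) == 0x2000) r = {size - 0x2000, AppendedFirmware::uPD7725};
  else if ((size & 0xffff) == 0xd000) r = {size - 0xd000, AppendedFirmware::uPD96050};
  else if ((size & 0x3ffff) == 0x28000) r = {size - 0x28000, AppendedFirmware::ARM6};
  assert(r.romSize <= size);
  return r;
}
```

For example, a 512 KiB ROM with 0x2000 bytes appended (0x82000 bytes total) comes back as a 0x80000-byte ROM with uPD7725 firmware.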


Another benefit of appended firmware is it's the only way to support both the DSP1 and DSP1B variants that exist of both Super Mario Kart and Pilotwings, which otherwise have identical ROMs in No-Intro, yet they behave differently (eg the intro plane crash in Pilotwings.)

We are missing the firmware used in the MCU for Campus Challenge '92 and Powerfest '94, and I'd put the odds of us ever getting it dumped at 1,000,000:1 ... so unfortunately if you want to support those, it's HLE or nothing :c

Quote:
RE: HLE emulation. I'm not really planning on supporting HLE emulation. Like you said, it's a huge amount of effort in terms of coding and far more error-prone, too.


Well, thank you for that. And in that case, please accept my apologies that I now do.

I fought the good fight for eight years to get people to move to DSP LLE and completely failed. I gave up a few months before you announced Mesen-S. I sincerely hope me supporting HLE doesn't give me any unfair 'advantage' or hold back DSP LLE.

If we can convince No-Intro to adopt this format now that apparently three emulators and counting do, I'll remove my HLE code again.

...

Okay, moving on ... the HG51B169 in Mega Man X2/X3 is also really quite simple. It gets a little tricky with the program ROM caching that affects the timing in the intro sequence, but it's not much worse than the uPD96050.

The ARM6 in Hayazashi Nidan Morita Shougi 2 ... honestly ... just don't support that, unless you're a masochist :P
I've never found anyone that actually tried to play it. And when I implemented it, it led to me making a GBA emulator since I had it, and now here I am 24 emulators and counting later wondering what the heck happened to my life :P
But, if you really want it ... you're free to use my core, or to write your own, just ... fair warning.

...

Oh, credits! I'd like to credit Talarubi, AWJ, Lord Nightmare for their help with uPD7725/96050 emulation. segher, Overload, Jonas Quinn for their help with HG51B169 emulation. Talarubi for hir help with ARM6 emulation. I couldn't have done it without those folks. Apologies if I forgot anyone! ^-^;;

