hex_usr
Posted: Tue Dec 04, 2018 3:45 am
Joined: Sat May 09, 2015 7:21 pm
Posts: 92
byuu is suffering from FreeBSD lag spikes on his new custom-built computer, severe enough to halt higan development until they're fixed (“I am dead in the water here until this is fixed”). He is offering a $250+ bounty to anyone who sends him a working fix. Full details on byuu's message board.

_________________
bsnes-mcfly: the bsnes v073 and bsnes-classic killer (GitLab repository)


byuu
Posted: Sat Dec 08, 2018 11:21 am
Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1524
Sorry, I was a bit desperate due to the cost of the new server. I found a sysctl workaround for a scheduling issue, and created a kernel patch for an ACPI shutdown issue, so barring more nasty future surprises, I'm good at this point. But thank you to everyone who helped!

Quote:
What's up with byuu?


Taking a break from social media. All is going well.


Last edited by byuu on Mon Dec 10, 2018 2:43 pm, edited 1 time in total.

Posted: Sun Dec 09, 2018 2:12 pm
Joined: Thu Oct 26, 2017 12:29 pm
Posts: 77
Seems like you're in a great space right now.

Best of luck to you.



Posted: Tue Feb 19, 2019 5:58 am
Joined: Tue Feb 19, 2019 4:04 am
Posts: 3
Hey all,

While exploring and working on various miscellaneous SNES-related interests of mine, I found bsnes-mcfly. I switched to bsnes-mcfly from higan due to its increased usability, but it was still lacking a few features I was really hoping for. Thankfully, hex_usr kindly put all the code on GitLab, so I was able to fork the repo and just do it myself.

It's been a couple of months, and I've since found the time to refine my improvements and modifications into a form more suitable for a pull request. Unfortunately, hex_usr's repository (or maybe it's GitLab) doesn't seem set up for the pull-request interaction I'm familiar with on GitHub; further, byuu recently removed the forums from his site, and I don't really have a way to contact him to send a message. As such, I've decided to announce my changes here, hoping they may be useful to a wider audience. I hope to start a discussion about seeing these changes -- or at least the concepts -- migrate back into bsnes-mcfly and, perhaps even more appropriately, bsnes/higan as well.

Right, enough build-up. Here's the short list of what I've done:

    [1]- Compilation fixes for Qt 5.11+. Resolves paintEngine warnings and fixes the segfault that occurs when the program window is closed.
    [2]- Add UI setting to control GPU pipeline flushing. This noticeably reduced input latency for me (the maximal reduction would be a 1-frame period, or ~16ms; I'd guess this yields a ~10ms reduction on average).
    [3]- Add UI settings to adjust rewind history buffer: both length and maximal granularity, in snapshot quantity and number of frames, respectively.
    [4]- Major performance improvement: eliminating OpenMP's ridiculous thrashing. I've repeatedly profiled & measured a significant drop in CPU usage. Before: anywhere from 5-9 CPU cores each @ 60-80%. After: 1 CPU core @ 80%.
    [5]- Add support for adaptive VSync intervals. This helps avoid a client delay while swapping buffers if the VSync time slice is ever missed.
    [6]- Add support for GLsync token injection into render pipeline. (Further reduced CPU usage for me, from 1 core @ 80% down to a comfortable ~40%).
    [7]- Fix build issues arising from LTO optimizer. Enable LTO for builds.


More on [5]: Roughly, VSync continues normally until FPS < 60, at which point the GPU lets the buffer-swap clock slide and returns to the CPU immediately, just as if VSync were disabled. Basically, you only perform VSync while you have performance to spare.
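For reference, the standard mechanism for this is the EXT_swap_control_tear extension, where a negative swap interval means "sync while making rate, tear when the deadline is missed". Here's a minimal sketch of the GLX flavor (WGL is analogous via wglSwapIntervalEXT(-1)) -- this just illustrates the extension, it isn't my actual patch:
Code:
#include <GL/glx.h>
#include <cstring>

bool enableAdaptiveVsync(Display* display, GLXDrawable drawable, int screen) {
  const char* extensions = glXQueryExtensionsString(display, screen);
  if(!extensions || !strstr(extensions, "GLX_EXT_swap_control_tear")) return false;

  auto glXSwapIntervalEXT = (void (*)(Display*, GLXDrawable, int))
    glXGetProcAddress((const GLubyte*)"glXSwapIntervalEXT");
  if(!glXSwapIntervalEXT) return false;

  //negative interval: synchronize to vblank while frames arrive on time,
  //but swap immediately (as if VSync were off) whenever one is late
  glXSwapIntervalEXT(display, drawable, -1);
  return true;
}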
More on [6]: The primitive way to do this, as was already in higan actually, is glFinish(). This change adds support for GL_ARB_sync (conveniently made mandatory as of GL 3.2), so the CPU can poll instead of block. I still process window & input events, but stop the emulator core cycles; once the fence is signaled, emulator clocking resumes. Some gfx stacks are kind of dumb, apparently (*cough* my nVidia Quadro *cough*) and use a heavy, resilient spinlock for glFinish(), which naturally drives CPU usage up a bit as well.
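The fence-polling shape looks roughly like this -- a sketch using the GL_ARB_sync entry points; the surrounding frame/loop structure is illustrative rather than my exact code:
Code:
//requires an OpenGL 3.2+ context (or GL_ARB_sync) and a loader that
//provides glFenceSync / glClientWaitSync / glDeleteSync

GLsync frameFence = nullptr;

void submitFrame() {
  //... draw calls for the frame, then:
  frameFence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
  glFlush();  //make sure the fence command actually reaches the GPU
}

//called from the UI loop; returns true once the GPU has finished the frame
bool framePresented() {
  if(!frameFence) return true;
  //timeout of zero => pure poll; the CPU never blocks here
  GLenum status = glClientWaitSync(frameFence, 0, 0);
  if(status == GL_ALREADY_SIGNALED || status == GL_CONDITION_SATISFIED) {
    glDeleteSync(frameFence);
    frameFence = nullptr;
    return true;   //safe to resume emulator core cycles
  }
  return false;    //keep pumping window & input events in the meantime
}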
More on [7]: byuu refers to what is actually WPO (whole-program optimization) via a setting called 'LTO'. WPO and LTO are similar, but not the same. Both work, but since LTO is more aggressive, I had to adjust the linker settings a bit so a 'more suitable' ELF header would be generated and the binary could run properly. I'll need a bit more thought and testing before I feel comfortable committing the LTO bit, though.


So, give it a try if you'd like. Here's the repository: ar/bsnes-mcfly.

hex_usr: there are 7 ar/* branches of mine that group all my changes by general feature/change. I derived my master branch simply by taking yours and rebasing all the ar/* branches on top of it.
byuu: I'd like to recommend my VSync and GLsync token changes for use in bsnes/higan. Oh, and definitely the OpenMP pruning, as well. I haven't compared bsnes-mcfly to your HEAD, but those primitives were...not suitable for the places and patterns they were in (or, my CPU is very fast or very different, heh). One or two could possibly be made more suitable with some loop rewrites, but from what I saw, OpenMP isn't exactly going to be the right tool for the job.

Thoughts welcome.

Thanks!


hex_usr
Posted: Tue Feb 19, 2019 6:48 am
Joined: Sat May 09, 2015 7:21 pm
Posts: 92
I was just thinking about how to bring bsnes-mcfly back. I noticed the disappearance of byuu's message board too, and I'm a little worried about the future of higan.

The problem is, I don't have Windows 7 anymore, so producing Windows builds is not trivial. I suppose I could use Windows 10 to make bsnes-mcfly builds... but I'm afraid. We all know how little Microsoft cares about privacy, even going so far as to make Visual Studio inject telemetry into compiled builds. It's a good thing I use MinGW instead of Visual Studio, but I still can't shake the fear that any Windows build I make could be compromised.

If anyone using Windows 7 or Windows 8.1 wants to test, I've made bsnes-mcfly v106r14c. This version is based on higan v106r88, which is the last version available on the semi-official higan Git repository. It adds the much-coveted Blur Emulation option (called "Simulate hires blurring") to support Kirby's Dream Land 3's use of pseudo-512 for transparency, among other things.



creaothceann
Posted: Tue Feb 19, 2019 7:27 am
Joined: Mon Jan 23, 2006 7:47 am
Posts: 205
Location: Germany
hex_usr wrote:
producing Windows builds is not trivial

VM?

_________________
My current setup:
Super Famicom ("2/1/3" SNS-CPU-GPM-02) → SCART → OSSC → StarTech USB3HDCAP → AmaRecTV 3.10


tepples
Posted: Tue Feb 19, 2019 7:44 am
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 21595
Location: NE Indiana, USA (NTSC)
creaothceann wrote:
hex_usr wrote:
producing Windows builds is not trivial

VM?

RAM to have the host and guest loaded at once costs money, VMware costs money if you need any of the features that aren't in the GPL version of VirtualBox, and Windows to run in a VM costs money to activate and phones home. Technically, one can use unactivated Windows and the PUEL version of VirtualBox[1] without charge, but legally I wouldn't recommend it.

I have yet to play with sudo apt install mingw-w64, but it could be an option.


[1] The VirtualBox Extension Pack may be used without charge under a license that forbids anything resembling commercial use. A skilled lawyer might convince a judge that even fulfilling bug bounties on free software or taking subscriptions on Buy Me a Coffee or Patreon could be considered commercial use. Licenses for commercial use of the Extension Pack start at $5000 for 100 seats. Oracle appears to have made a business decision not to cater to home businesses or other small businesses.

_________________
Pin Eight | Twitter | GitHub | Patreon


creaothceann
Posted: Tue Feb 19, 2019 8:45 am
Joined: Mon Jan 23, 2006 7:47 am
Posts: 205
Location: Germany
tepples wrote:
phones home

The network adapter can be configured to be intranet-only.



Posted: Tue Feb 19, 2019 12:03 pm
Joined: Wed Feb 13, 2008 9:10 am
Posts: 703
Location: Estonia, Rapla city (50 and 60Hz compatible :P)
Windows Vista and newer will deactivate themselves at some point when they cannot phone home. There's software around that mimics the licensing server and keeps things working, but it's still legally troublesome.

_________________
http://www.tmeeco.eu


byuu
Posted: Fri Feb 22, 2019 2:02 pm
Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1524
Quote:
I don't really have a way to contact him to send a message.


The changes you've made sound interesting, but it's going to be quite difficult to search through your repo to find changes to a fork of code I haven't touched in over eight years ... if you have a single diff file of your changes, I can try to go through them for things that are still relevant to v107. Of course, I'd be happy with patches against the official GitLab as well ^-^;

I'm most interested in the OpenMP stuff. I'm guessing you're referring to the nall::image usage? OpenMP indeed has some truly intensive overhead, and most of the time it's not worth using it at all. I've been thinking of stripping it out from there, but I've also been planning to rewrite nall::image to a new class that takes template type parameters for each channel. The extreme support in nall::image for format conversion massively cripples its performance. 99% of the time one just wants ARGB8888, and 1% of the time 30-bit or 64-bit images.
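To sketch the idea (hypothetical; none of these names exist in nall today) -- with the format fixed at compile time via per-channel template type parameters, pixel access compiles to plain loads, with none of the runtime format-conversion bookkeeping:
Code:
#include <cstdint>
#include <vector>

template<typename A, typename R, typename G, typename B>
struct Image {
  struct Pixel { A a; R r; G g; B b; };
  std::vector<Pixel> pixels;
  uint32_t width = 0, height = 0;

  auto operator()(uint32_t x, uint32_t y) -> Pixel& {
    return pixels[y * width + x];  //no per-pixel shift/mask bookkeeping
  }
};

using ImageARGB8888 = Image<uint8_t, uint8_t, uint8_t, uint8_t>;  //the 99% case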

I wish I knew how much it was OpenMP not being good versus threading not being good ... because the new bsnes v107's scanline-based renderer gets hamstrung a lot in its multi-threading, and I am not sure if it's worth writing per-platform threading primitives to replace OpenMP or not.

I ... don't know what you're referring to with these GLsync tokens, sorry. Also no familiarity with GL_ARB_sync, but it sounds interesting!

Quote:
WPO and LTO are similar, but not the same.


What are the differences, if I may ask? I always expect WPO/LTO to help dramatically with all the base CPU classes and their virtualization, and indeed if I just manually merge the processor/ classes with their respective CPUs (duplicating the code in many places), I do get 3-5% program speed-ups, but -fwhole-program and -flto optimizations do not seem to convey any benefits in practice. The code runs at the same speed, or slower, and now linking is a nightmare.

Quote:
I noticed the disappearance of byuu's message board too, and I'm a little worried about the future of higan.


higan will be fine. I don't need a forum, and I really shouldn't have a Twitter account either. The less I post online, the better it is for everyone.


Posted: Sat Feb 23, 2019 7:05 am
Joined: Tue Feb 19, 2019 4:04 am
Posts: 3
byuu wrote:
The changes you've made sound interesting, but it's going to be quite difficult to search through your repo to find changes to a fork of code I haven't touched in over eight years ... if you have a single diff file of your changes, I can try to go through them for things that are still relevant to v107. Of course, I'd be happy with patches against the official GitLab as well ^-^;


hex_usr has been keeping bsnes-mcfly approximately up to date with your repository, so the changes are actually on top of v106r88, which is your current master at the time I'm writing this. I can set up a GitLab mirror and make some pull requests, if that'd be easier for you?

byuu wrote:
I'm most interested in the OpenMP stuff. I'm guessing you're referring to the nall::image usage? OpenMP indeed has some truly intensive overhead, and most of the time it's not worth using it at all. I've been thinking of stripping it out from there, but I've also been planning to rewrite nall::image to a new class that takes template type parameters for each channel. The extreme support in nall::image for format conversion massively cripples its performance. 99% of the time one just wants ARGB8888, and 1% of the time 30-bit or 64-bit images.

I wish I knew how much it was OpenMP not being good versus threading not being good ... because the new bsnes v107's scanline-based renderer gets hamstrung a lot in its multi-threading, and I am not sure if it's worth writing per-platform threading primitives to replace OpenMP or not.


I profiled the emulator to identify the bottlenecks rather than seeking out parallelization opportunities, though that would be an interesting exercise. What I can say is that OpenMP is a great way to realize performance gains for the right workloads, but the amount of work has to outweigh the synchronization cost. The best workloads would be some O(n^2) or O(n^3) reduction, e.g. a matrix multiply. On modern Intel CPUs, N usually needs to be on the order of 1,000 to 10,000 before there's any noticeable improvement; using small kernels over the 240 scanlines is not going to be a good candidate.

However, there could be gains from running multiple workers in parallel at the top level -- per-platform threading would achieve this type of parallelization. You can probably get away with not writing it yourself, though: with newer implementations of OpenMP you can have a top-level #pragma omp parallel with #pragma omp task (see the sketch below), and there's also #pragma omp for simd. The latter suggests GCC should try extra hard to use any on-board SIMD units, but it may not be any better than simply adding -ftree-vectorize to -O3.
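The task pattern looks roughly like this -- a minimal sketch, where runWorkers and workItem are placeholder names, not anything in higan:
Code:
#include <omp.h>

void workItem(int i);  //hypothetical independent unit of work

void runWorkers(int N) {
  #pragma omp parallel       //spin up the thread team once, at the top level
  #pragma omp single nowait  //one thread creates tasks; the rest execute them
  {
    for(int i = 0; i < N; i++) {
      #pragma omp task firstprivate(i)
      workItem(i);
    }
  }
  //the implicit barrier at the end of the parallel region joins all tasks
}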


byuu wrote:
What are the differences, if I may ask? I always expect WPO/LTO to help dramatically with all the base CPU classes and their virtualization, and indeed if I just manually merge the processor/ classes with their respective CPUs (duplicating the code in many places), I do get 3-5% program speed-ups, but -fwhole-program and -flto optimizations do not seem to convey any benefits in practice. The code runs at the same speed, or slower, and now linking is a nightmare.


LTO and WPO both deal mostly with the IPA/IPO optimizer passes (inter-procedural analysis/inter-procedural optimization), particularly with regard to references to external functions or variables. Usually, the compiler can't be sure that such a reference resolves to the same piece of code that actually runs at runtime. This is typically due to unresolved dynamic references in other .so files, satisfied by the OS loader at load time, but it can also arise with global variables and anything considered common data (i.e. anything in .comdat).

WPO has the optimizer treat each translation unit's (TU's) references to external functions or variables as static for the purposes of optimizing that TU. Note that in C++, inline functions are marked as COMDAT, and at link time each file's COMDAT gets merged into the global .comdat. At link time, WPO will localize whichever inline functions it can.

LTO takes this idea a step further by having GCC dump a bytecode form of its internal GIMPLE tree into the object file. At final link, all the GIMPLE trees are merged into a single representation of the union of all TUs, and a final IPA pass is performed on this whole tree (equivalent to everything having been compiled as one single TU). The side effect is that IPA gets performed for references between files, inlining can be applied more widely, and the COMDAT contents can be reduced more aggressively. The downsides to LTO are that it's inherently serial and so can be slower (though -fno-fat-lto-objects speeds it up), and that it needs more RAM than WPO to store everything during the link. One typically wouldn't use either WPO or LTO when creating a .so file, and the two flags cannot be used simultaneously.

How effective WPO and LTO are usually depends fairly heavily on symbol visibility. By default, all symbols have global, or 'default', visibility. Usually one either marks symbols not meant to be exported as hidden via __attribute__((visibility("hidden"))), or defaults everything to hidden with -fvisibility=hidden and explicitly exports the public API as __attribute__((visibility("default"))). higan is a bit unconventional in how it's laid out, but adding something like -fvisibility-inlines-hidden might be a good way to test for any effect.
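In code, that convention looks like this (a sketch; the macro and function names are made up for illustration):
Code:
//build with: -fvisibility=hidden (and optionally -fvisibility-inlines-hidden)
#define PUBLIC_API __attribute__((visibility("default")))

PUBLIC_API void emulatorRun();  //re-exported: part of the public API

void internalHelper();  //hidden by the -fvisibility=hidden default, so the
                        //IPA pass may freely inline, clone, or localize it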

Edit: Also, for what it's worth, my experience is that -O2 outperforms -O3 for most workloads. I'm not a fan of -O3, as it tends to inflate code size to the point that it reduces the effectiveness of the CPU's branch-target buffer (BTB).

---

I've committed my previously mentioned fix for bsnes/higan's LTO linking issues to my fork. I had to modify the symbols in the OpenGL binding code, as the global symbols weren't actually being globalized correctly (the ELF symbol table had STT_FUNC where STT_OBJECT should've been used).


byuu
Posted: Sat Feb 23, 2019 8:07 pm
Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1524
Quote:
I can set up a GitLab mirror and make some pull requests, if that'd be easier for you?


A set of "diff -ru" patches with descriptions of how and why you're changing things, thrown onto e.g. pastebin, would be easiest, if you don't mind. Otherwise, however you want to do it, I'll make it work.

Quote:
using small kernels over the 240 scanlines is not going to be a good candidate.


Each scanline is demanding enough that in practice, it is. Not nearly by as much as we'd like, of course. But it definitely helps with the PPU.

Quote:
it may not be any better than simply adding -ftree-vectorize to -O3.


It may be better now, but -ftree-vectorize has caused me no end of pain in the past. It broke a lot of valid code in very subtle ways.

...

Thanks for the WPO/LTO explanation. I'm still the biggest fan of PGO, but it's just way too much work per release. Especially when I have 21 cores to exercise now.

I'll give -O2 a try again. -O3 has always been faster, and if that's still the case, I'll stick with -O3.


creaothceann
Posted: Sun Feb 24, 2019 2:38 am
Joined: Mon Jan 23, 2006 7:47 am
Posts: 205
Location: Germany
byuu wrote:
I'm still the biggest fan of PGO, but it's just way too much work per release. Especially when I have 21 cores to exercise now.

Automated playback of an input file (e.g. a TAS) could help there.

byuu wrote:
I'll give -O2 a try again. -O3 has always been faster, and if that's still the case, I'll stick with -O3.

Might get different results on different CPUs, e.g. AMD vs Intel.



Posted: Mon Feb 25, 2019 6:11 am
Joined: Tue Feb 19, 2019 4:04 am
Posts: 3
Sure thing, byuu. I'll gather together the raw changes with corresponding descriptions. I just need to check whether they apply (relatively) cleanly atop higan's master, in case any bsnes-mcfly-specific stuff has worked its way in there over time.

Quote:
Each scanline is demanding enough that in practice, it is. Not nearly by as much as we'd like, of course. But it definitely helps with the PPU.
...
It may be better now, but -ftree-vectorize has caused me no end of pain in the past. It broke a lot of valid code in very subtle ways.

I wanted to let you know I spent some time looking into your claims and whether it'd be possible to better leverage OpenMP, and I have some findings to shoot your way. Kindly forgive any mistaken assumptions below, as I don't know the details of the render procedure or how the various modes should work -- I'm an ARM firmware/architecture expert in my day job, but I know jack all about the SNES.

Oh, just to get this out of the way first: I tried -ftree-vectorize and, while the code seemed to still run okay, it did absolutely destroy performance, akin to what you claimed. Well, it's not a default optimizer flag for a reason. There are a few patterns it can really help, but I guess I didn't realize how few those were.

Now, I saw two immediate possibilities for scanline rendering parallelization:
    [1] Parallelize the renderBackground and renderWindow functions, ensuring a sync barrier before continuing on to update the above/below window data. Unfortunately, this would require an intelligent map reduction in order to respect the various priority values (a z-order, I think). That's feasible for a GPU compute unit, but more trouble than it's worth on a CPU or general compute platform.
    [2] Parallelize at the top level, which I believe is flush() -- roughly the shape sketched below. The PPU memory points to different locations for each scanline, so there's no interdependence and all state is effectively thread-local. Unfortunately, even using OpenMP only for >16 scanlines, with a 2-thread limit and various scheduling hints to loosen constraints, CPU usage grew by ~70%. The threads have to sync before flush() returns, and the overhead from that thread barrier is still substantial. Well, I don't have to sync them, but I'm pretty sure things wouldn't look right otherwise.
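For concreteness, attempt [2] was shaped roughly like this (a sketch; flushScanlines and renderScanline stand in for whatever flush() actually iterates over):
Code:
void renderScanline(int y);  //placeholder for the real per-scanline work

void flushScanlines(int scanlineCount) {
  //stay serial for small batches; cap the team at two threads
  #pragma omp parallel for schedule(dynamic) num_threads(2) if(scanlineCount > 16)
  for(int y = 0; y < scanlineCount; y++) {
    renderScanline(y);  //state is per-scanline, so iterations are independent
  }
  //implicit join barrier here -- the overhead described above
}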

Do you think these can be pushed up any higher so there's more data to run through? The thread dispatch/idle overhead appears fine -- work does get dispatched properly and there is some threaded improvement -- it's just the context switches from that 'join' barrier at the end that are killing things. It appears flush() gets called whenever VRAM or OAM is written in writeIO(), but as this is determined by bus address, it seems it'd probably be inherently serial, right?

I also investigated caching and hit/miss performance. On my CPU, an Intel Coffee Lake (nearly equivalent to Kaby Lake and Skylake), CPU usage appears highly dependent on caching. I looked mostly at the data cache rather than the instruction cache, but nothing seemed too amiss. I found a couple of places to insert prefetch hints, and will re-test to see if they do anything. I'd suspect the performance differences from -O3 or WPO/LTO are largely down to ICACHE behavior and code positioning within the binary. There are a few low-effort ways to help the linker better position code, but higan's cascading fractal web of includes and header implementations somewhat overwhelms me, so I didn't explore further.

Finally, I checked branch predictor performance. A couple things stood out with absurdly high misprediction rates.

Firstly, for indirect mispredictions: the SFC::CPU::read(uint24) function generates many of them, usually via bus.read() and nall::function::operator() const, apparently uncertain whether to assume readAPU or readCPU. The same read(uint24) function is also called very often via vptr from WDC65816::fetch() on each step(), and that call site generates many more mispredictions from the virtual call itself.

This happening from a virtual call is very common, but certain call sites can be improved via a speculative-devirtualization optimizer pass, i.e. -fdevirtualize-speculatively and possibly -fdevirtualize-at-ltrans at >= -O2 (or -findirect-inlining).
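Written out by hand, the transformation that pass performs looks like this (illustrative class names, not higan's; the compiler compares vtable pointers directly, typeid being the closest portable stand-in):
Code:
#include <cstdint>
#include <typeinfo>

struct Bus { virtual ~Bus() = default; virtual uint8_t read(uint32_t addr) = 0; };
struct CPUBus : Bus { uint8_t read(uint32_t addr) override; };

uint8_t speculativeRead(Bus* bus, uint32_t addr) {
  //guess the hot target, guard it with a cheap type check...
  if(typeid(*bus) == typeid(CPUBus))
    return static_cast<CPUBus*>(bus)->CPUBus::read(addr);  //direct, inlinable
  //...and fall back to the genuine indirect call on the rare path
  return bus->read(addr);
}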

Secondly, for conditional mispredictions, consider the below code from higan/cpu/timing.cpp:
Code:
 20 auto CPU::step(uint clocks) -> void {
 21   status.irqLock = false;
 22   uint ticks = clocks >> 1;
 23   while(ticks--) {
 24     counter.cpu += 2;
 25     tick();
 26     if(hcounter() & 2) pollInterrupts();
 27     if(joypadCounter() == 0) joypadEdge();
 28   }
[snip]

The biggest sources of conditional mispredictions were line 26 (22% of executions mispredicted) and line 23 (10% mispredicted). Ideally we could eliminate these pipeline stalls, either by informing GCC of the branch probabilities, or by unrolling/specializing the loop and explicitly hoisting the inner conditions out.

I think I've managed to glean that hcounter tracks the horizontal scanline position, joypads are sampled every 256 cycles, and pollInterrupts() is supposed to run every 4 cycles (once per, uh, PPU 'dot clock'). As written, the 4-cycle check fires on every other loop iteration, which explains the high misprediction rate on line 26. The mispredictions on line 23 are then almost certainly due to many calls to CPU::step() with a small argument value, especially e.g. step(2) or step(4).

What's the reason for stepping the CPU/PPU counter 2 cycles at a time? hcounter() appears to always change by 2 -- is it correct to assume it cannot be odd? If so, a good solution is probably to add a special loop variant handling clocks values divisible by 4, calling tick() twice and removing the pollInterrupts check entirely. From there, one can survey the callers of step() and augment line 23 with an approximate branch probability. One could also specialize a clone of step() for certain constant values -- see the sketch below -- and, if properly referenced by the callers, the line 26 condition could be removed as well.
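The specialization would look something like this (member names are from the quoted snippet; the template form itself is just my sketch):
Code:
template<uint clocks> auto CPU::step() -> void {
  status.irqLock = false;
  uint ticks = clocks >> 1;               //now a compile-time constant...
  while(ticks--) {                        //...so GCC can fully unroll the loop,
                                          //removing the line-23 mispredictions
    counter.cpu += 2;
    tick();
    if(hcounter() & 2) pollInterrupts();  //removable only if the entry phase
                                          //of hcounter() is known; see above
    if(joypadCounter() == 0) joypadEdge();
  }
}
//constant-argument call sites, e.g. step(4), would become step<4>()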

Anyway, do with that what you will... I'm happy to talk and explore this further, I just want to first make sure that I'm not being annoying by pestering you with text walls you aren't interested in! :-)

Quote:
Automated playback of an input file (e.g. a TAS) could help there.

This, or some bypass-all-UI-and-just-go mode, would be lovely. It can be hard to measure changes without also picking up a moderate amount of noise and/or variance. I've just been letting an intro play for longer to average this out, but running while sampling all the CPU counters isn't free (i.e. it's slow or eats disk, depending).


byuu
Posted: Mon Feb 25, 2019 5:24 pm
Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1524
> CPU usage grew by ~70%

I certainly understand that the overall power usage will go up far more than the framerate gains, but we're so desperate for performance with my emulators that it's better to take the small FPS bump than to leave all the other cores idle.

> Do you think these can be pushed up any higher so there's more data to run through?

Unfortunately not. Most games, 99% probably, will batch all 224 or 240 scanlines of a frame at once. Usually you'll get a game that disables the top and bottom edges of the screen, for extra Vblank time, and there's a fallback to stop OpenMP from trying to run on batches that are too small.

> This happening from a virtual call is very common

The problem here is that many, many CPU cores are shared all over the place in higan. The Z80 core is the most common, with probably six systems using it. So I have to make it a base class with virtual calls to complete it.

If I tried CRTP, or a direct #include of each CPU core inside each processor, the code size and compilation times would skyrocket.

> What's the reason for stepping the CPU/PPU counter 2 cycles at a time?

The actual CPU oscillator runs at ~21.4MHz, but in practice there's no real way to step by one cycle, so it's effectively ~10.7MHz. However, I leave it as 21.4MHz for the sake of documentation: everyone who talks about raw SNES clock cycles works in these 21.4MHz units.

> If this is true, a good solution is probably to add a special loop variant to handle clocks argument values divisible by 4, calling tick() twice and allowing the pollInterrupts check to be entirely removed

It still can't be removed. You may enter step<4>() with hcounter()&3 == 0 or 2. If you call pollInterrupts() on the wrong one, it won't work; the values won't match up.

In my commercial forks, my step functions are indeed turned into templatized versions to unroll the while loops.

In higan, it's important to minimize code repetition, so I often pay performance penalties to avoid manual unrolling.

What this loop would really benefit from is a binary min-heap priority queue that fires interrupts when they're supposed to trigger, rather than testing for them every four clock cycles. However, that's a huge, monstrous can of worms, because the SNES is packed full of special edge cases that games actually rely on, where the interrupt trigger points change after scheduling. Having to walk the min heap to pull out stale events every time this could occur would be very painful. And even then, the interrupts are very nuanced: they don't just fire ... they start by raising the IRQ line, where a line-status read will abort it. Then they hold the IRQ line steady, where reading the line reveals it's set but doesn't lower it. Then finally they've fired, and the read reveals it was set and lowers the status line. Hard to describe in text, but the point is: it's a huge, massive pain to get the details 100% right.
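Just to pin down the data structure under discussion, here's a generic sketch -- none of these names are higan's, and the lazy-invalidation trick shown is merely one standard answer to the stale-events problem; it does nothing for the IRQ line-state nuances above:
Code:
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

struct Event {
  uint64_t fireAt;      //absolute clock at which the event triggers
  uint32_t source;      //which interrupt source scheduled it
  uint64_t generation;  //stamp taken at scheduling time
  bool operator>(const Event& o) const { return fireAt > o.fireAt; }
};

struct Scheduler {
  std::priority_queue<Event, std::vector<Event>, std::greater<Event>> heap;
  std::vector<uint64_t> generation;  //current stamp per source

  explicit Scheduler(size_t sources) : generation(sources, 0) {}

  //rescheduling just bumps the stamp: any earlier heap entry for this
  //source is silently orphaned rather than dug out of the heap
  void schedule(uint32_t source, uint64_t fireAt) {
    heap.push({fireAt, source, ++generation[source]});
  }

  //run from the main loop instead of testing every four clock cycles
  template<typename F> void runUntil(uint64_t clock, F&& fire) {
    while(!heap.empty() && heap.top().fireAt <= clock) {
      Event e = heap.top(); heap.pop();
      if(e.generation == generation[e.source]) fire(e);  //skip stale entries
    }
  }
};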

If you look at Snes9X, the project's been in development since 1996/97 and it still gets IRQ fixes to this day. It's really, really hard code to optimize without unintended regressions. I had IRQs working correctly within 2-3 years, but at a tremendous performance penalty. I'm not saying either approach is better; I just prefer not to spend decades fighting stubborn IRQ edge cases.

> Anyway, do with that what you will... I'm happy to talk and explore this further, I just want to first make sure that I'm not being annoying by pestering you with text walls you aren't interested in!

I've considered all of this before, of course. But I'm happy someone's taking an interest in my code, and you may still come up with something I have not, so cheers ^-^

