Yeah, the 16 KB loading is a bug I noticed after posting, but haven't fixed yet. It copies the 16 KB bank to $8000-$BFFF only, so nothing ends up at $C000-$FFFF (including the reset vector at $FFFC), and that obviously doesn't work. Also, I forgot to mention it, but at the moment it's only meant to run mapper 0 stuff, though I suppose very simple mappers could be added easily with HLE without having much impact on the accuracy.
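For reference, a minimal sketch of what the fix could look like, assuming the loader just builds a flat 32 KB view of $8000-$FFFF. The names `mapNromPrg` and `readPrg` are made up for illustration and are not the actual code in the project; the only hard fact used is that mapper 0 mirrors a 16 KB PRG bank into both halves of the window.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: builds the 32 KB CPU view of $8000-$FFFF for mapper 0.
// A 16 KB PRG image is mirrored into both halves; a 32 KB image is copied as-is.
std::vector<uint8_t> mapNromPrg(const std::vector<uint8_t>& prg) {
    constexpr std::size_t kBank = 16 * 1024;
    assert(prg.size() == kBank || prg.size() == 2 * kBank);

    std::vector<uint8_t> window(2 * kBank);
    for (std::size_t i = 0; i < window.size(); ++i) {
        // Wrapping the index is what mirrors a 16 KB image at $C000-$FFFF.
        window[i] = prg[i % prg.size()];
    }
    return window;
}

// Hypothetical helper: reads a byte at a CPU address in $8000-$FFFF.
uint8_t readPrg(const std::vector<uint8_t>& window, uint16_t addr) {
    assert(addr >= 0x8000);
    return window[addr - 0x8000];
}
```

With this, the reset vector of a 16 KB ROM (offset $3FFC in the bank) shows up at both $BFFC and $FFFC, which is what the CPU needs to start executing.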
For the alignment, I just tried changing the soft reset logic so that it doesn't alter the state of the chip other than holding the reset signal low for a given number of cycles, and it seems to yield 6 (out of a possible 8) different alignments (at the half-master-clock level). I was under the impression there were only 4 possible alignments, so maybe I'm doing this wrong.
calima wrote: This is exactly the kind of project that would benefit greatly from PGO. Perhaps even 2x or more.
A quick test with PGO seems to yield ~15% faster code (4900 Hz -> 5500 Hz on my machine), which is pretty similar to what I get on Mesen with PGO, too.
At the moment ~50-60% of the time is spent in this recursive function. I haven't been able to find any way to make it faster though. Converting "group" from a vector to a hashset makes it slower (presumably because "group" is usually very small), and the way it works makes it pretty hard, if not impossible, to split the work across multiple threads without a ton of lock contention.
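To give an idea of the shape of the problem, here is a rough sketch of a recursive group walk, assuming it resembles the one used in visual6502-style switch-level simulators. This is not the actual function; the `Chip` and `Transistor` layouts and the `addNodeToGroup` name are assumptions for illustration. It does show why a vector can beat a hashset here: the membership test is a linear scan, which is cheap when "group" only holds a handful of nodes.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical data layout, for illustration only.
struct Transistor {
    int c1;   // first channel node
    int c2;   // second channel node
    bool on;  // true when the gate makes the channel conduct
};

struct Chip {
    std::vector<Transistor> transistors;
    // For each node, indices of the transistors whose c1 or c2 is that node.
    std::vector<std::vector<int>> nodeTransistors;
    int gnd = 0;
    int pwr = 1;
};

// Collects every node electrically connected to `node` through conducting
// transistors. The walk stops at ground and power, which are added to the
// group but never traversed through.
void addNodeToGroup(const Chip& chip, int node, std::vector<int>& group) {
    // Linear scan: fast while the group stays small.
    if (std::find(group.begin(), group.end(), node) != group.end()) return;
    group.push_back(node);
    if (node == chip.gnd || node == chip.pwr) return;

    for (int ti : chip.nodeTransistors[node]) {
        const Transistor& t = chip.transistors[ti];
        if (!t.on) continue;
        int other = (t.c1 == node) ? t.c2 : t.c1;
        addNodeToGroup(chip, other, group);
    }
}
```

In a layout like this, every call appends to the same "group" and reads the shared transistor state, which is the kind of dependency that makes it awkward to hand pieces of the walk to different threads.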