I spoke to a PhD-holding math expert (one who frequently does vector calculus), yet we weren't able to come up with a good way to determine the maximum error rates for this type of scheduler through plain old computation.
So I had to fall back on a testing framework again. This is the best I can do:
http://hastebin.com/raw/isekeyuxec

Requirements:
* must allow frequencies between 1hz and 2^32-1hz
* must allow at least one full emulated second to run before normalizing all the time counters
I tried many variations. What seemed to maximize the rounding errors the most was this scenario:
* precision[attoseconds] / frequencyA[hz] = xxxx.5
* precision[attoseconds] / frequencyB[hz] = xxxx.0
Where frequencyA and frequencyB were as close to the maximum (4GHz-1) as possible.
I used a brute force search to find the best values for this: 4294967262hz and 2000000000hz.
With this, I found the following:
* the iteration count doesn't really affect the error rate; just the precision of reporting it
* with picosecond precision, 7742318 clocks or 0.244283% error rate
* with femtoseconds, 6009 clocks or 0.000189% error rate
* with attoseconds, we end up with 4 clocks of rounding errors in 10 billion iterations; or basically 0.00000025% out of spec
* with 2^96, there are zero errors. I don't think I could stand to run the test long enough to find an error
For comparison: a cesium fountain clock is accurate to 10^-15, and rubidium to 10^-12, but those are atomic clocks; a typical crystal oscillator is only accurate to between 10^-4 and 10^-5. With attoseconds, we are better than 10^-6 accuracy (my math could definitely be wrong here when converting between 0.x and x%), and with uint64 we can run up to 18 seconds between normalizing all the clock values to prevent overflow. Though since some counters can be as high as half the value they were before, I would halve that to 9 seconds, which means I wouldn't trust going to 10^-19 for the precision here.
The best estimate I can give from my results is that a precision of 10^-n means we are accurate to a minimum of 10^-(n-12) for frequencies of up to 4GHz. That would be off by one second every 85 years.
So given this, I concur through empirical evidence with AWJ: there's no need at all for 128-bit integers here. Unless, of course, you want to brag about being more accurate than the best atomic clock in the world.
If anyone sees any errors in my test harness, results or reasoning, please let me know. I am, after all, quite bad with math :/
...
EDIT: actually, I was mistaken: we don't need 2^32 of headroom for uint128_t. We only need enough that 2^128 / precision >= 2. So in this case, we can easily do 10^-38, which is 10^20 (one hundred quintillion) times more precise than attoseconds. That gets us an error rate of 10^-26, which is 10^11 (one hundred billion) times more precise than the most precise cesium-based atomic clock mankind has ever invented.
That's an error rate of one second every 2 quintillion years. And we can push it to 2^-127 in that case, which is 70% more accurate than even that.
So we might as well use 2^-63 instead of 10^-18 for our uint64_t version. That gives us about 9.2x the precision (2^63 / 10^18 ≈ 9.22) for free, and gets me down to an error of 1 cycle in my test harness over 10 billion iterations.
I think 2^63 might be cutting it a bit too close. A 1 Hz clock will go from zero to overflow in just two ticks. I've got a nagging feeling that if you have two or more 1 Hz clocks (e.g. a hypothetical SNES cart with both types of RTC in it) there might be some circumstance in which an overflow could occur even with regular rebasing. Even 2^63-1 would be substantially safer.
Also, since you've apparently convinced yourself that uint64_t is precise enough, I would strongly recommend against any kind of platform-dependent compile-time selection (e.g. defaulting to 128-bit on x86_64 and 64-bit on i686 and ARM). Problems include:
* Potential input-movie incompatibility between platforms (there's a tiny but nonzero chance that a movie recorded on a more precise build will desync on a less precise build or vice-versa. Why risk it?)
* Bug reproducibility between platforms (you have enough trouble with compiler bugs given all the bleeding-edge C++ you use; why compound it by intentionally introducing more platform-dependent behaviour right in the core of the emulation?)
ETA: Also, you're using uintmax in nall/windows/detour.hpp, so now it behaves differently in 32-bit and 64-bit builds. I have no fucking idea what detour.hpp is for, but it looks very low-level to me, and I would guess it wasn't your intention to change it...