All times are UTC - 7 hours
PostPosted: Fri Jul 29, 2016 6:01 pm 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
This is the best I can do:

http://hastebin.com/raw/ovoxadugef

At 79,228 times the precision of the yoctosecond (10^-24), I bring you ... the byuusecond, or 2^-96.

I reserved 2^32 headroom so that up to a 2GHz* clock can run for up to one emulated second before all the clocks need to be normalized. (* since normalization can leave some values at up to 50% of capacity, we cut the 4GHz range in half.)

Basically, what I've observed from my testing is ...

Attoseconds just don't cut it. We're getting misses ~292 times a second against the CPU<>SMP clock rates, stepping 4 and 5 clocks each, respectively. Adding clock=1 to the SMP, toggling between clock >= target.clock and clock > target.clock, etc. does not help.
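For reference, the absolute-clock scheme being tested reduces to something like this sketch (the names and structure are my assumption, not the actual test program): each thread accumulates emulated time in fixed sub-second units, and whichever thread is furthest behind runs next.

```cpp
#include <cstdint>

// Hypothetical sketch of the reciprocal-scalar scheduler under test;
// not the actual hastebin code.
struct Thread {
  uint64_t clock  = 0;  // emulated time consumed, in base units
  uint64_t scalar = 0;  // base units per cycle = timeBase / frequency
};

constexpr uint64_t timeBase = 1000000000000000000ull;  // 10^18 = attoseconds

void setFrequency(Thread& t, uint64_t hz) {
  t.scalar = timeBase / hz;  // truncation here is the rounding error at issue
}

void step(Thread& t, uint64_t cycles) {
  t.clock += cycles * t.scalar;
}

// The thread with the lower clock is furthest behind and runs next.
bool cpuRunsNext(const Thread& cpu, const Thread& smp) {
  return cpu.clock <= smp.clock;
}
```

With the SNES rates, setFrequency truncates 10^18 / 21477272 = 46560848137.5 down to 46560848137, which is exactly where the sub-attosecond error creeps in.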

The errors are not extra cycles, they're just chips running in the wrong order. Instead of ABABABAB we get ABABBAAB periodically.

There's basically no way to stop this. Even if I move all the way up to the byuusecond, it's not enough. In my test, it runs in sync as long as I can bear to run the test, but ... if I drop the CPU counter value by 1, it swaps a sync order once every 10,000,000 iterations or so.

Since attoseconds are insufficient, we have to move to a 128-bit type. And since we're doing that, there's no point in stopping at zeptoseconds. Might as well use the full 2^96 range available to us.

Surprisingly, the uint128 type didn't result in any slowdown in the profiling tests. I'm sure this will break compatibility with non-x86 and non-amd64 systems, though; and to hell with writing Booth's algorithm for software multiplication, so ARM and similar systems will have to fall back to uint64 types and 2^32 precision.

> The bias thing only helps when the two clocks are dividers of the same base frequency (which is also the only time that level of perfection should matter).

Yeah ... I see that now in my tests. When I set cpuFreq and smpFreq to the same value, I start seeing the bias errors you're talking about when they don't run in perfect lockstep (e.g. cpuStep and smpStep are different for this test suite.) I still need to try to grok your explanation as to why; I've been busy working on the test suite.

> I don't think it's a big deal if one chip executes a cycle early or late every now and then compared to the previous method

Yeah, it's the latter, as explained above. And true ... it's probably not a big deal, especially given they're separate oscillators. Still, it's a shame to give up my old method, which was basically total perfection. But at 2^96, the precision is quite great.

I guess if one wanted, they could reduce the timing to 2^64 and never worry about normalizing the counters. If someone runs the emulator for four billion seconds straight, they'll be dead anyway; humans don't live that long.


PostPosted: Fri Jul 29, 2016 7:15 pm 
Online

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 429
byuu wrote:
The errors are not extra cycles, they're just chips running in the wrong order. Instead of ABABABAB we get ABABBAAB periodically.


If you really are getting AABB (the slower chip getting two cycles in a row sometimes), that would be bad, but I'm not seeing that. I rewrote your test program in plain nall-free C++, using plain 64-bit integer math (no floating point and no exotic 128-bit integers), added more verbose output, and all the differences that I'm seeing look like the following, once every 30000 cycles or so:

Code:
old: 121211212121
new: 121212112121


i.e. the faster chip's extra cycle when it "laps" the slower one occasionally happens 1 cycle later with the new scheduler than with the old one. That is really, really not something to worry about. Given the imperfections of real hardware oscillators, you can't even really claim that one is "right" and the other is "wrong". It's the difference between rounding exactly 0.5000000000 up and rounding it down. It's less than the difference between powering the same SNES on twice (and unlike hardware, it's completely deterministic).

Also, my logging shows that clock ties basically never happen between devices with unrelated frequencies (search for "!" and "@" in the output, which mark cycles where the clocks are exactly zero (for the old scheduler, on the left) or equal (for the new scheduler, on the right)). That's another reason you only have to worry about them between devices that are running on different dividers of the same base frequency.


PostPosted: Sat Jul 30, 2016 12:30 am 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
> I rewrote your test program in plain nall-free C++

That ... seems like a lot of effort for no benefit, but ... okay. I appreciate you trying out the test suite on your end regardless.

This was a hacked-up version, but I might implement something much cleaner in C (well, as clean as you can be in a language without classes) to distribute with libco itself, which would also serve as a better usage example than what's there now.

> i.e. the faster chip's extra cycle when it "laps" the slower one occasionally happens 1 cycle later with the new scheduler than with the old one.

Yes, that's what I was trying to express. My AB example had four As and four Bs in both strings. It probably wasn't a long enough stream to show it off, but it's the same result you're getting.

> Given the imperfections of real hardware oscillators, you can't even really claim that one is "right" and the other is "wrong".

True, I've always struggled with the justification in higan for one chip to use >= and the other to use < on my int64 clock values. There was no compelling reason why the latter shouldn't be <=; it was just chosen because it was the opposite of >=.
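To make that concrete, the old pairing reduces to something like this sketch (my assumed shape based on the thread, not higan's actual source): one signed counter tracks the CPU's time relative to the SMP, scaled by the two frequencies so the comparison is exact whenever the clocks are integer-related.

```cpp
#include <cstdint>

// Sketch of the older signed/paired approach discussed here; a single
// counter holds the CPU's position relative to the SMP, in units of
// 1 / (cpuFreq * smpFreq) seconds, so all math stays in integers.
struct Pair {
  int64_t clock = 0;  // > 0: the CPU is ahead of the SMP
  uint64_t cpuFreq = 21477272;
  uint64_t smpFreq = 24576000;
};

void cpuStep(Pair& p, uint64_t cycles) { p.clock += cycles * p.smpFreq; }
void smpStep(Pair& p, uint64_t cycles) { p.clock -= cycles * p.cpuFreq; }

// The asymmetric >= / < pairing under discussion: exact ties go to the SMP.
bool smpRunsNext(const Pair& p) { return p.clock >= 0; }
```

The tie-breaking asymmetry lives entirely in that single `>=`; flipping it to `>` hands ties to the other chip instead.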

> It's the difference between rounding exactly 0.5000000000 up and rounding it down.

Mathematically speaking, that always rounds to 1.0. It's a pretty basic rule.

I think you're right; as I've said, real oscillators have some variance. The important thing is that it stays deterministic. I was just surprised that no amount of precision seems to be enough to match the simple behavior I had before, at least for a million iterations. Rounding errors of fractional bits are a real bitch. I kind of wonder just how many bits of precision I'd need to make it to a million matches ... 256? 1024? 2097152? >__<

Given that I agree, I'm still going to go with this timing method, but it still bugs me that we can't get an exact match, at least for a million iterations, no matter how precise we get. I might toy around with some other ideas (like making the scalar the product of all oscillator frequencies.)

> Also, my logging shows that clock ties basically never happen between devices with unrelated frequencies

Right, that's what I perceive as well ... it should be psychotically rare (and probably even impossible with irrational numbers.) So why are we getting output that mismatches the old ThreadA style if they're not clock ties?

Even if we decide it's not important ... I think a mismatch every 30,000 cycles is a very high margin of error for time measured in yoctoseconds (my version) or attoseconds (your version.)

................................................................................

EDIT: wow ... yeah, that actually works really well. In the example test suite, we end up with a CPU scalar equal to the SMP frequency, and an SMP scalar equal to the CPU frequency.

If I add in two more oscillators ... 44100hz (MSU1) and 8000000hz (DSP1), I end up with:
8670412800000000000
7577181561600000000
Which I can divide by 100000000 and get:
86704128000
75771815616
And indeed, these run perfectly in sync forever ... I think?? I don't know if I can round down products like that. I feel like I can, but if I factor the numbers or try GCD calculations on them, I end up reducing it right back to the original frequencies, which is obviously wrong.

I think there's a significant opportunity here to have a smart frequency-generation tool. Start by using clock dividers (multipliers in the clock step values) wherever possible ... the Genesis only has one, the base SNES only has two. Then build a list of all the unique frequency values. Then go through and take GCDs wherever we can. E.g. GCD(21477272, 24576000) = 8, so we can drop the multipliers to 2684659 * 3072000. Or across the bigger example it's 4, so:
5369318*6144000*11025*2000000 = 727409429913600000000000, which is still below the headroom of 2^96.

So let's say we keep adding in more oscillators ... and eventually the number gets too big, then we can start sacrificing some precision through rounding things until we can fit nicely into the 2^96 headroom.

The downside to this approach is that adding and removing oscillators dynamically won't really be possible, because if we change the scalars, then we invalidate the current clock value meanings of all threads. That's not the end of the world ... that will only affect hotplug peripherals, which currently I have no cases where they have oscillators that are different from ones used in the base system.

................................................................................

EDIT 2: okay, changes are in for the next WIP.

The SNES retains about 80% of its performance boost over v100 official. I'm betting the reason is the simplification of the SNES CPU step() function. We've gone from adjusting the SMP, PPU, two controllers, expansion port, and all coprocessors (the peripherals and coprocessors being separate vector lists to iterate over) with 64-bit multiply-adds to just a single 128-bit multiply-add. I guess the SMP<>DSP wobbling wasn't as big a portion of the speed boost as I thought it was.

But now the NES and WSC cores are a bit slower than they were before. The psychotic ~40% speed boost for the SNES on my Intel Celeron box remains as well.

higan may no longer run on 32-bit CPUs, but at least I can claim that it now has ~80 billion times more accurate timing than MAME :P (joking aside, I'm still going to keep working on this. There's gotta be a better way to come up with approximations that have very good precision and yet fit into 64-bit types.)


PostPosted: Sat Jul 30, 2016 10:21 am 
Online

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 429
byuu wrote:
True, I've always struggled with the justification in higan for one chip to use >= and the other to use < on my int64 clock values. There was no compelling reason why the latter shouldn't be <=; it was just chosen because it was the opposite of >=.


If you used <= and >=, you'd have the same wobbling issue I demonstrated in the previous post. You'd never have gotten all your PPU counter tests to match hardware.

byuu wrote:
Right, that's what I perceive as well ... it should be psychotically rare (and probably even impossible with irrational numbers.) So why are we getting output that mismatches the old ThreadA style if they're not clock ties?


Rounding errors because 1e+18 / 21477272 and 1e+18 / 24576000 aren't exact integers. The CPU is running very, very, very slightly slower relative to the SMP (less than one part in 100 billion) than with the non-reciprocal algorithm. The error isn't large enough for the total cycles executed to diverge over 20 million or even 20 billion iterations; if you ran the test for hundreds of billions of iterations (omitting the per-cycle logging because you'd run out of RAM, and twiddling your thumbs because the test will take hours) you'd eventually see the SMP gain a whole cycle on the CPU relative to the old scheduler. As you got closer to that point, the single-cycle divergences (if you were capable of logging them) would get closer and closer together.

1e+18 / 21477272 is 46560848137.5. The error from truncating to 46560848137 is 0.000000001%, which is orders of magnitude smaller than the tolerance of a crystal oscillator, let alone a ceramic resonator. You're taking an error of 1 in 100 billion and exaggerating it out of all proportion by using inappropriate test criteria. That level of error is really not worth introducing complexity and killing portability over. What's important is that chips that are at an exact integer frequency ratio (because they're clocked off the same oscillator) are emulated at that exact ratio and don't wobble, which I've explained how to ensure; for chips on different oscillators an error of 1 part in 100 billion is vastly better than good enough.

Try this: Run the test twice using the same algorithm both times (either the old one or the new one), but the second time change the SMP frequency to 24576001. You'll see that the first divergence in which cycle a thread switch happens on will happen after only a few thousand iterations, even though the clock speed difference is 1 part in 24 million. "how long does it take before the first thread switch happens on a different cycle than with the old scheduler" is simply not a reasonable or meaningful measure of accuracy.

ETA: I checked when the first divergence happens in my test program (it scrolls out of the terminal window when I log them all) and it happens at 284019 cycles. It looks like you're actually *losing* precision by using 128-bit integers if you're really getting a divergence after only 15151 cycles (I would guess that converting from long double to uint128_t is where it's being lost)

Quote:
And indeed, these run perfectly in sync forever ... I think?? I don't know if I can round down products like that. I feel like I can, but if I factor the numbers or try GCD calculations on them, I end up reducing it right back the original frequencies which is obviously wrong.


Ugh.. listen to yourself, man. You're doing convoluted math you don't even pretend to understand, with only brute-force and highly artificial tests to convince yourself it's correct. I thought you liked to keep your software simple?


PostPosted: Sat Jul 30, 2016 2:07 pm 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19112
Location: NE Indiana, USA (NTSC)
Is there a reason you can't have a mode that pretends everything in the SNES is on one crystal, at least if no Japan-only coprocessors are used? APU=8/7*CPU would be well within the 0.5% tolerance of a ceramic resonator. That is, unless you want to simulate change in frequency as the thing heats up.


PostPosted: Sat Jul 30, 2016 7:10 pm 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
> If you used <= and >=, you'd have the same wobbling issue I demonstrated in the previous post.

Oh, that's a duh moment for me >_<

> As you got closer to that point, the single-cycle divergences (if you were capable of logging them) would get closer and closer together.

Still, it's impressive to me that I'm seeing ten instances of rounding errors every one million context switches even at the yoctosecond level. It's fine since we aren't actually introducing a real extra cycle, just shuffling the order that the two chips run in. Just, I didn't expect it to be that common.

> 1e+18 / 21477272 is 46560848137.5. The error from truncating to 46560848137 is 0.000000001%

Which is why I would have expected the rounding error to happen only once every ~billion or so synchronizations, not every 16,000 - 100,000. Even if it doesn't matter, and it doesn't, fine. Just didn't expect to see that.

> That level of error is really not worth introducing complexity and killing portability over.

The main issue here though is that attoseconds are too precise for one 64-bit number.

If I were to have a frequency of 1hz (which I do for RTCs), then I'd eat up 1/18th of the total headroom for each tick. Yes, I know there would only be one tick per second, obviously. And we normalize once per second, sure. (Well, the framerate number of times per second.) Then if we increase the frequency higher toward 4GHz, the precision falls off a cliff as rounding errors become more pronounced.

MAME manages with attoseconds, but it also keeps a seconds counter as well and doesn't worry about wrapping.

In my case, I'd want to leave 2^32 headroom, so that I can easily handle a full second of any frequency from 1Hz to 4GHz (that which fits into an unsigned integer) with strong precision. That would only leave 2^32, which is only 10^9.6 or so.

It obviously works anyway with uint64 and attoseconds and normalizing the counters after every frame. But the 128-bit version is more precise and runs at basically the same framerate. I'll probably write some code that picks the sizes and time base off the largest available integer size, so it compiles on 32-bit platforms to 64-bit, and on 64-bit platforms to 128-bit.
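The width-selection idea might look like this sketch (the macro test and names are my assumptions, not shipped code):

```cpp
#include <cstdint>

// Pick the counter width from what the compiler offers: 128-bit counters
// with 2^-96 precision where available, otherwise 64-bit with 2^-32.
#if defined(__SIZEOF_INT128__)
  using Time = unsigned __int128;
  constexpr unsigned precisionBits = 96;  // leaves 2^32 headroom in 128 bits
#else
  using Time = uint64_t;
  constexpr unsigned precisionBits = 32;  // leaves 2^32 headroom in 64 bits
#endif

constexpr Time secondLength = Time(1) << precisionBits;  // one emulated second
```

Either way the headroom above one emulated second is the same 2^32; only the fractional precision changes between the two builds.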

Like you said, I'm not that great with math at this scale. It's hard for me to calculate exact limitations where things would be "acceptable" or not, so I use the test code as a brute force (and inaccurate) crutch. If it's not any slower, and I have a 32-bit fallback without adding more than a single ifdef on two lines of code, then why not do it anyway, just to be 100000000000% sure?

> "how long does it take before the first thread switch happens on a different cycle than with the old scheduler" is simply not a reasonable or meaningful measure of accuracy.

I realize the first hit could be anywhere. It was the recurring hits that were worrisome. I was getting ten per 1m-iteration loop with yoctoseconds.

> It looks like you're actually *losing* precision by using 128-bit integers if you're really getting a divergence after only 15151 cycles (I would guess that converting from long double to uint128_t is where it's being lost)

Yeah, I believe that was the case, and I switched to pure integer math. Current results are:
//femtosecond (10^15) = 16306
//attosecond (10^18) = 688838
//zeptosecond (10^21) = 13712691
//yoctosecond (10^24) = 13712691 (hitting a dead-end on a rounding error causing a wobble)
//byuusecond? ( 2^96) = (perfect? 79,228 times more precise than a yoctosecond)
And yes, adding one to the SMP causes even the "byuusecond" to start desyncing ~10 times per 1m iterations. Nasty.

> Ugh.. listen to yourself, man. You're doing convoluted math you don't even pretend to understand

Right, but this is how we learn.

I didn't know anything about blackman windowed sinc FIR filters or butterworth biquad IIR filters, yet I was able to learn and implement both fairly effectively to eliminate audio aliasing during resampling.

> I thought you liked to keep your software simple?

I do. And honestly, I think my old method was better in terms of simplicity (not needing normalization nor introducing any errors.) It was just sloppier because it was more work to manage more than 1:1 timing relationships.

It's just a thought experiment. If we can get 128-bit levels of precision in 64-bit levels of space, though, I think that'd be pretty neat and *may* justify the code necessary to compute it. Or maybe not ... its Achilles' heel is adding/removing oscillator frequencies at run-time (from peripherals.)

> Is there a reason you can't have a mode that pretends everything in the SNES is on one crystal, at least if no Japan-only coprocessors are used? APU=8/7*CPU would be well within the 0.5% tolerance of a ceramic resonator. That is, unless you want to simulate change in frequency as the thing heats up.

higan allows the end-user to set the frequencies of every oscillator by editing manifest files. Along with per-byte mapping granularity, it's a conscious design choice, so I'm not going to cut corners on it.


PostPosted: Sat Jul 30, 2016 8:40 pm 
Online

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 429
byuu wrote:
Which is why I would have expected the rounding error to happen only once every ~billion or so synchronizations, not every 16,000 - 100,000. Even if it doesn't matter, and it doesn't, fine. Just didn't expect to see that.


See, you're not understanding the problem at all. With an error of one part in a billion, after one billion cycles one chip would be a full cycle ahead of where it would be with theoretical perfect timing. After half a billion cycles, the chip would be half a cycle ahead, meaning the context switches would be 180 degrees out of phase--every single one would be "wrong" the way you're measuring them. In between those points, the "errors" gradually increase and then decrease again.

"What percentage of context switches happen at exactly the same time" is a completely bogus, irrelevant statistic. To make a rough analogy, you're trying to change the amplitude of the beat between two sine waves by twiddling their frequencies.

Here's another analogy: Think about two kilometer-long rulers, one marked with lines dividing it into 100000 segments (each segment is exactly 1cm) and one marked with lines dividing it into 100001 segments (each segment is 0.001% shorter than a cm). If you put the rulers next to each other, the lines line up just about perfectly at one end. As you walk along, the lines diverge more and more, until in the middle of the two rulers the 50000th line on one ruler is exactly between two of the lines of the other ruler.

You're looking at the lines somewhere in the middle of the rulers and saying "don't try to tell me the difference between these rulers is only 0.001%; look how far apart the lines are!". And you're proceeding to demand that someone make you a pair of rulers long enough that the lines never get farther than 1mm apart.

Quote:
If I were to have a frequency of 1hz (which I do for RTCs), then I'd eat up 1/18th of the total headroom for each tick. Yes, I know there would only be one tick per second, obviously. And we normalize once per second, sure. (Well, the framerate number of times per second.) Then if we increase the frequency higher toward 4GHz, the precision falls off a cliff as rounding errors become more pronounced.


4 GHz is an exact divisor of 1e+18, so let's try 3 GHz.

1e+18 / 3000000000 = 333333333.333

0.333 / 333333333.333 = 1e-9. Which is still four orders of magnitude smaller than the tolerance of a quartz oscillator.


PostPosted: Sat Jul 30, 2016 9:11 pm 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
Quote:
You're looking at the lines somewhere in the middle of the rulers and saying "don't try to tell me the difference between these rulers is only 0.001%; look how far apart the lines are!". And you're proceeding to demand that someone make you a pair of rulers long enough that the lines never get farther than 1mm apart.


Thanks, that analogy worked perfectly. I get what you're saying now.

I have to say, it's intimidating working in a field with people who are all 20-30 IQ points higher than me where this stuff is obvious to them. But I do appreciate you taking the time to help explain it to me.

I don't know how to do the math to work it out, but ... is 10^-18 the best time unit to use for a 64-bit clock/scalar? I don't really care if my unit of time has a snazzy name like "attoseconds."

I want something that works well for all frequencies from 1 Hz to 4 GHz (I know this is overkill, but I don't want any hidden gotchas or certain frequencies we shouldn't use), that won't overflow for one entire second (so I think we need 2 seconds of headroom, since normalization could leave other counters near the limit), and that has the most precision we can get away with.

Ideally, it'd be great if there were a series of formulas we could compute to tell us these things. I'd like to have official documentation right there in the thread/scheduler that explains the rate of error for the worst case scenario, the amount of time before an overflow event would trigger, etc; instead of just saying, "well four orders of magnitude sounds like a lot and I don't *think* the counters could ever get close to the limit ..."

Given that I don't really know how to compute this stuff, I hope you at least understand why I feel more comfortable just brute-forcing it to batshit insane overkill of 2^-96 until I do, especially given that it comes at no discernible performance penalty (the penalty I received was due to fixing the wobbling issue that reduced my context switch overhead.)


PostPosted: Sat Jul 30, 2016 10:19 pm 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19112
Location: NE Indiana, USA (NTSC)
The problem is that if two cores have their event at exactly the same time, which runs first is unpredictable. To fix this, disallow things happening at the same time. Clock the CPU and PPU at 43 MHz and run CPU on even clocks and PPU on odd.


PostPosted: Sat Jul 30, 2016 11:29 pm 
Online

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 429
I wrote another test program to demonstrate what I was talking about in the previous couple of posts: http://hastebin.com/odijenugim.cpp

It uses the old signed/paired scheduler algorithm. It runs 100 million cycles with a 21477272 Hz CPU and a 24576000 Hz SMP, then repeats the run with the SMP bumped to 24576001 Hz, and compares the results. I strongly suggest redirecting the output to a text file--printing tens of millions of lines to a terminal window takes surprisingly long.

Results summary: After 100 million cycles (~50 million cycles per chip), the total cycles diverge by just 1 (i.e. the run with the 1 Hz-faster SMP runs one more SMP cycle and one fewer CPU cycle than the other run). However, the first desync of the thread switches is only 14381 cycles in, and they quickly climb in frequency. Around the 50000000-60000000 cycle neighbourhood, the two runs are maximally out of phase. By the end of the runs, they're fairly close to being in phase again--the desyncs are back down to one every hundred or so cycles.

Quote:
I have to say, it's intimidating working in a field with people who are all 20-30 IQ points higher than me where this stuff is obvious to them. But I do appreciate you taking the time to help explain it to me.


There are probably only 4 or 5 people on the MAME mailing list who really understand the MAME scheduler and one of them, the man who originally wrote it, is retired from hobbyist emulation. A major emulator-wide regression that occurred when the scheduler was converted from C to C++ took three years to get fixed.

Quote:
I don't know how to do the math to work it out, but ... is 10^-18 the best time unit to use for a 64-bit clock/scalar? I don't really care if my unit of time has a snazzy name like "attoseconds."


If you really wanted to maximize the number of frequencies that yielded exact integer reciprocals, you could open up an electronics supplier's catalogue of oscillators and start multiplying them all (dropping trailing zeroes, and skipping ones that are multiples of ones you've already included) until the product got close to 2^59. But since you're allowing users to provide arbitrary frequencies that aren't even necessarily integers, you're not going to gain much by doing that. Any arbitrary number in the 2^59~2^60 neighbourhood is as good as any other. I think Aaron Giles chose attoseconds because it's the smallest named unit of time whose reciprocal fits into a 64-bit integer.

It sounds like you might be confused on this point, but rebasing the clocks every frame is no problem even when you have clocks slower than 60 Hz. For every time that 1 Hz RTC clock gets incremented by 1 s, the rebaser will decrement it 60 times by on average 1/60 s each time. It doesn't matter whether the increments are less frequent but larger than the rebases (clocks < 60 Hz) or vice-versa (clocks > 60 Hz). None of the clocks will ever overflow, as long as the rebasing happens more frequently than your headroom (for attoseconds in a uint64, that's 18 seconds) and as long as one chip doesn't get drastically behind the others due to never being synced.
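One way to implement the rebasing described here, as a sketch (assumed shape, not any emulator's actual code):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Per-frame rebase: subtract the minimum clock from every thread's clock.
// Relative order is preserved and no counter can grow without bound, as long
// as rebasing happens more often than the headroom allows (18 s for
// attoseconds in a uint64) and no thread falls drastically behind.
void rebase(std::vector<uint64_t>& clocks) {
  uint64_t minimum = *std::min_element(clocks.begin(), clocks.end());
  for (uint64_t& c : clocks) c -= minimum;
}
```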


PostPosted: Sat Jul 30, 2016 11:33 pm 
Online

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 429
tepples wrote:
The problem is that if two cores have their event at exactly the same time, which runs first is unpredictable. To fix this, disallow things happening at the same time.


Yes, I've already explained this problem and how to solve it: initialize each device's time base with a small, unique bias (e.g. the index of its position in the scheduler's array) so that the time bases of two devices with integer-ratio clocks never become exactly equal.


PostPosted: Sat Aug 06, 2016 9:40 pm 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
Spoke to a PhD-holding math expert (who frequently does vector calculus), yet we weren't able to come up with a good way to determine the maximum error rates for this type of scheduler through plain old computation.

So I had to fall back on a testing framework again. This is the best I can do:

http://hastebin.com/raw/isekeyuxec

Requirements:
* must allow frequencies between 1hz and 2^32-1hz
* must allow at least one full emulated second to run before normalizing all the time counters

I tried many variations. What seemed to maximize the rounding errors the most was this scenario:
* precision[attoseconds] / frequencyA[hz] = xxxx.5
* precision[attoseconds] / frequencyB[hz] = xxxx.0
Where frequencyA and frequencyB were as close to the maximum (4GHz-1) as possible.

I used a brute force search to find the best values for this: 4294967262hz and 2000000000hz.

With this, I found the following:
* the iteration count doesn't really affect the error rate; just the precision of reporting it
* with picosecond precision, 7742318 clocks or 0.244283% error rate
* with femtoseconds, 6009 clocks or 0.000189% error rate
* with attoseconds, we end up with 4 clocks of rounding errors in 10 billion iterations; or basically 0.00000025% out of spec
* with 2^96, there are zero errors. I don't think I could stand to run the test long enough to find an error

With a cesium fountain clock, that's 10^-15 accuracy. Rubidium is 10^-12. But those two are atomic clocks; a typical crystal is going to be accurate to between 10^-4 and 10^-5. With attoseconds, we are above a 10^-6 accuracy rate (my math could definitely be wrong here when converting between 0.x and x%), and we can run up to 18 seconds with uint64 between normalizing all the clock values to prevent overflow (though since some counters can be as high as half the value they were before, I would halve that to 9 seconds, which means I wouldn't trust going to 10^-19 for the precision here.)

The best estimate I can give from my results is that a precision of 10^-n means we are accurate to a minimum of 10^-(n-12) for frequencies of up to 4GHz. That would be off by one second every 85 years.

So given this, I concur through empirical evidence with AWJ: there's no need at all for 128-bit integers here. Unless of course you want to brag about being more accurate than the best atomic clock in the world ;)

If anyone sees any errors in my test harness, results or reasoning, please let me know. I am, after all, quite bad with math :/

...

EDIT: but actually, I was mistaken in that we don't need 2^32 headroom for uint128_t. We only need enough such that 2^128/precision >= 2. So in this case, we can easily do 10^-38; or 10^20 (one hundred quintillion) times more precise than attoseconds. Which gets us an error rate of 10^-26. Which is 10^11, or one hundred billion times more precise than the most precise cesium-based atomic clock mankind has ever invented :D That's an error rate of one second every 2 quintillion years. And we can push it to 2^-127 in that case, which is 70% more accurate than even that.

So we might as well use 2^-63 instead of 10^-18 for our uint64_t version. That gives us 9.22x the precision for free, and gets me down to an error of 1 cycle on my test harness over 10 billion iterations.
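The arithmetic behind that, as a sketch (my reading of the suggestion, not shipped code):

```cpp
#include <cstdint>

// 2^63 base units per second instead of 10^18: the scalar is the same
// truncating division as before, but since 2^63 / 10^18 is about 9.22,
// roughly 9.22x more base units fit into each second for free.
constexpr uint64_t base63 = 1ull << 63;

uint64_t scalarFor(uint64_t hz) { return base63 / hz; }
```

A nice side effect is that any power-of-two frequency now gets an exact scalar with zero truncation error.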


PostPosted: Mon Aug 08, 2016 1:21 pm 
Online

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 429
byuu wrote:
I spoke to a PhD-holding math expert (who frequently does vector calculus), yet we weren't able to come up with a good way to determine the maximum error rate for this type of scheduler through plain old computation.

So I had to fall back on a testing framework again. This is the best I can do:

http://hastebin.com/raw/isekeyuxec

Requirements:
* must allow frequencies between 1 Hz and 2^32-1 Hz
* must allow at least one full emulated second to run before normalizing all the time counters

I tried many variations. What seemed to maximize the rounding errors the most was this scenario:
* precision[attoseconds] / frequencyA[hz] = xxxx.5
* precision[attoseconds] / frequencyB[hz] = xxxx.0
Where frequencyA and frequencyB were as close to the maximum (4GHz-1) as possible.

I used a brute-force search to find the best values for this: 4294967262 Hz and 2000000000 Hz.

With this, I found the following:
* the iteration count doesn't really affect the error rate; just the precision of reporting it
* with picosecond precision, 7742318 clocks or 0.244283% error rate
* with femtoseconds, 6009 clocks or 0.000189% error rate
* with attoseconds, we end up with 4 clocks of rounding errors in 10 billion iterations; or basically 0.00000025% out of spec
* with 2^96, there are zero errors. I don't think I could stand to run the test long enough to find an error

For comparison: a cesium fountain clock is accurate to about 10^-15, and rubidium to about 10^-12, but those are atomic clocks. A typical crystal oscillator is only accurate to between 10^-4 and 10^-5. With attoseconds, we're better than a 10^-6 error rate (my math could definitely be wrong here when converting between 0.x and x%). We can also run up to 18 seconds with uint64 between normalizing all the clock values to prevent overflow; though since some counters can still be as high as half their previous value afterward, I'd halve that to 9 seconds. Which means I wouldn't trust going to 10^-19 for the precision here.

The best estimate I can give from my results is that a precision of 10^-n means we are accurate to a minimum of 10^-(n-12) for frequencies of up to 4GHz. That would be off by one second every 85 years.

So given this, I concur through empirical evidence with AWJ: there's no need at all for 128-bit integers here. Unless of course you want to brag about being more accurate than the best atomic clock in the world ;)

If anyone sees any errors in my test harness, results or reasoning, please let me know. I am, after all, quite bad with math :/

...

EDIT: but actually, I was mistaken in that we don't need 2^32 headroom for uint128_t. We only need enough such that 2^128/precision >= 2. So in this case, we can easily do 10^-38, which is 10^20 (one hundred quintillion) times more precise than attoseconds. That gets us an error rate of 10^-26, which is 10^11 (one hundred billion) times more precise than the most precise cesium-based atomic clock mankind has ever invented :D That's an error rate of one second every 2 quintillion years. And we can push it to 2^-127, which is 70% more accurate than even that.

So we might as well use 2^-63 instead of 10^-18 for our uint64_t version. Since 2^63 ≈ 9.22 × 10^18, that's roughly 9.2× the precision for free, and it gets me down to an error of 1 cycle on my test harness over 10 billion iterations.


I think 2^63 might be cutting it a bit too close. A 1 Hz clock will go from zero to overflow in just two ticks. I've got a nagging feeling that if you have two or more 1 Hz clocks (e.g. a hypothetical SNES cart with both types of RTC in it) there might be some circumstance in which an overflow could occur even with regular rebasing. Even 2^63-1 would be substantially safer.

Also, since you've apparently convinced yourself that uint64_t is precise enough, I would strongly recommend against any kind of platform-dependent compile-time selection (e.g. defaulting to 128-bit on x86_64 and 64-bit on i686 and ARM). Problems include:

* Savestate incompatibility between platforms
* Potential input-movie incompatibility between platforms (there's a tiny but nonzero chance that a movie recorded on a more precise build will desync on a less precise build or vice-versa. Why risk it?)
* Bug reproducibility between platforms (you have enough trouble with compiler bugs given all the bleeding-edge C++ you use; why compound it by intentionally introducing more platform-dependent behaviour right in the core of the emulation?)

ETA: Also, you're using uintmax in nall/windows/detour.hpp, so now it behaves differently in 32-bit and 64-bit builds. I have no fucking idea what detour.hpp is for, but it looks very low-level to me, and I would guess it wasn't your intention to change it...


PostPosted: Mon Aug 08, 2016 4:57 pm 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
> I think 2^63 might be cutting it a bit too close. A 1 Hz clock will go from zero to overflow in just two ticks. I've got a nagging feeling that if you have two or more 1 Hz clocks (e.g. a hypothetical SNES cart with both types of RTC in it) there might be some circumstance in which an overflow could occur even with regular rebasing. Even 2^63-1 would be substantially safer.

Neat, I thought about that exact same thing. My fear was that 2^63-1 results in a number that's far more likely to round poorly, resulting in more fractional error accumulation. Especially with clocks that are already powers of two (I have 32*1024, 4*1024*1024, etc. clocks.) However, at the margin of error we're dealing with, that's probably statistically insignificant. Still, I wish I were better at math to know for sure whether this even matters or not.

The reason why I believe it should be safe: I don't actually wait one full second before normalizing counters. I actually do it once per frame. It's just that framerates vary per system, so I wanted to leave the headroom for "up to one second", and "up to one second minus one cycle" is close enough.

If you're sure it won't have adverse effects on the precision, then I'll switch it to 2^63-1 just to be safe.

> ETA: Also, you're using uintmax in nall/windows/detour.hpp, so now it behaves differently in 32-bit and 64-bit builds. I have no fucking idea what detour.hpp is for, but it looks very low-level to me, and I would guess it wasn't your intention to change it...

detour is dead code. Just kept around in case I need it again.

I used it for the automation of a proprietary work application, and again against Lunar Magic. I can talk about the latter.

FuSoYa refused to add headerless ROM support to Lunar Magic, and it's closed source. So I wrote a program called Header Magic that opens Lunar Magic as a debuggee process, then uses DLL injection to get code into its process. That injected code uses detour.hpp to hijack all the Win32 file APIs to pretend that ".smc|.sfc" files have an extra 0x200-byte header that doesn't really exist. And voila, Lunar Magic now supports headerless ROMs anyway :D (And yes, that really works. Perfectly, in fact. A few smwcentral.net people are using it.)

...

Good catch though on intmax ... this is something that's been bothering me.

The idea of intmax is to hold the biggest number the platform supports, yet on a 64-bit platform that's really 128 bits (the compiler's __int128), not 64 bits. So now all of my code that uses intmax_t can't work with 128-bit numbers without truncation.

detour is just dead code that happened to not be updated from back when I had non-_t versions of intmax_t/intptr_t. But my intention now is to go in and update all of my intmax_t functions to be compatible with 128-bits. Obviously to do that, I have to use my own non-_t versions, since I can't override the official ones from stdint.h now.

Now obviously, it's immediately apparent that relying on more than 64 bits makes your code no longer portable to 32-bit systems. But that was always the idea behind the max types in the first place: if you want portability, you use intN_t. This mess is why int is still 32 bits instead of 64; that type is basically stuck as shorthand for int32_t now.

NOTE: I understand that technically uint128_t is the compiler operating on two uint64_t fields. But the same can be said of 32-bit systems, where intmax_t is 64 bits. If that were the reasoning, then intmax_t should be 32 bits on 32-bit systems, right?


PostPosted: Mon Aug 08, 2016 5:46 pm 
Online

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 429
byuu wrote:
Neat, I thought about that exact same thing. My fear was that 2^63-1 results in a number that's far more likely to round poorly and result in more fractional error accumulation.


2^63 (or any power of 2, for that matter) won't divide evenly by any divisor that isn't itself a power of 2. Likewise, 10^18 (or any power of 10) won't divide evenly by anything that has prime factors other than 2 and 5. This is basic number theory. On top of the fact that worrying about which numbers are "more likely to round poorly" is silly when you're dealing with maximum rounding errors of parts in 100 billion, you aren't even worrying about the right numbers :mrgreen: "nice round numbers" like 2^n or 10^n have the property that very few divisors will divide them evenly.

