It is currently Sun Dec 17, 2017 10:25 pm

All times are UTC - 7 hours





Post new topic Reply to topic  [ 16 posts ]  Go to page 1, 2  Next
Author Message
PostPosted: Tue Oct 11, 2016 2:06 pm 
Offline
User avatar

Joined: Mon Dec 29, 2014 1:46 pm
Posts: 753
Location: New York, NY
The wiki contains an approximation that uses 2 lookup tables. However, since pulse, triangle and noise are 4-bit channels and DMC is a 7-bit channel, doesn't that mean a single lookup table with 2^23 = 8,388,608 entries could provide exact values? Assuming that 16-bit values are required, the lookup table only needs to be 16 MB. Are there any cache considerations for randomly reading from an array of that size?


Top
 Profile  
 
PostPosted: Tue Oct 11, 2016 2:13 pm 
Offline

Joined: Sun Apr 13, 2008 11:12 am
Posts: 6540
Location: Seattle
... Yes?
Enormous ones?

The largest in-die L3 cache I've heard of is on the order of 8 MiB, and you're talking about using more than twice that just for fixing APU non-linearity?


... That said, where did 2²³ come from? 4+4+7 = 15...

Honestly, even using 32 KiW isn't worthwhile. The lookup table approach is only useful on architectures where floating point or division on integers is really expensive, i.e. "before the pentium".


Top
 Profile  
 
PostPosted: Tue Oct 11, 2016 3:04 pm 
Offline
User avatar

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 5899
Location: Canada
The lookup tables are only one way to implement the given non-linearity formulae, which themselves are only an approximation of the actual output (which still needs more research, I think).

I don't think the wiki really needs to prescribe how to implement the non-linearity; the formula is the important part. The lookup table is just an implementation detail. Whether or not it's worth doing depends entirely on the context of implementation.

...and yes, those tables are much, much smaller than what you're saying.


Top
 Profile  
 
PostPosted: Tue Oct 11, 2016 3:15 pm 
Offline
User avatar

Joined: Mon Dec 29, 2014 1:46 pm
Posts: 753
Location: New York, NY
lidnariq wrote:
Honestly, even using 32 KiW isn't worthwhile. The lookup table approach is only useful on architectures where floating point or division on integers is really expensive, i.e. "before the pentium".


Let's put that claim to the test. Below is a Java application that compares 32-bit floating-point operations to lookup tables:

Code:
public final class BenchmarkApuMixers {
 
  private static final float[] pulseTable = new float[31];
  private static final float[] tndTable = new float[203];
 
  static {
    for(int i = pulseTable.length - 1; i >= 0; i--) {
      pulseTable[i] = 95.52f / (8128f / i + 100f);
    }
    for(int i = tndTable.length - 1; i >= 0; i--) {
      tndTable[i] = 163.67f / (24329f / i + 100f);
    }
  }

  public static float noTableTest(final float pulse1, final float pulse2,
      final float triangle, final float noise, final float dmc) {
    return 95.88f / (8128f / (pulse1 + pulse2) + 100f) + 159.79f
        / ((1f / (triangle / 8227f + noise / 12241f + dmc / 22638f)) + 100f);
  }
 
  public static float lookupTableTest(final int pulse1, final int pulse2,
      final int triangle, final int noise, final int dmc) {
    return pulseTable[pulse1 + pulse2] + tndTable[3 * triangle + (noise << 1)
        + dmc];
  } 

  public static void main(final String... args) throws Throwable {
    float result = 0;
    int x = 0;
    int y = 0;
    long startTime = System.nanoTime();   
    for(int i = 200_000_000; i >= 0; i--) {
      if (++x == 16) {
        x = 0;
      }
      if (++y == 128) {
        y = 0;
      }
      result += noTableTest(x, x, x, x, y);
    }   
    System.out.format("%f nanos/iteration%n", (System.nanoTime() - startTime)
        / 200_000_000.0);
    result = 0;
    x = 0;
    y = 0;
    startTime = System.nanoTime();
    for(int i = 200_000_000; i >= 0; i--) {
      if (++x == 16) {
        x = 0;
      }
      if (++y == 128) {
        y = 0;
      }
      result += lookupTableTest(x, x, x, x, y);
    }
    System.out.format("%f nanos/iteration%n", (System.nanoTime() - startTime)
        / 200_000_000.0);
    System.out.println(result);
  }
}


The results:

floating-point: 30.239289 nanos/iteration

lookup tables: 2.060711 nanos/iteration

That can't be right. The iterations are way too brief. This is only a 3 GHz machine.

Any suggestions for better ways to benchmark this?


Top
 Profile  
 
PostPosted: Tue Oct 11, 2016 6:24 pm 
Offline
User avatar

Joined: Mon Dec 29, 2014 1:46 pm
Posts: 753
Location: New York, NY
I tried both ways in my emulator and I can't visually see a difference in the task manager. This is a i5-2400 @ 3.1 GHz 64-bit Win7 box from mid-2012. My emulator is not particularly efficient. One of the 4 cores that appear in the task manager bounces between 30% and 40% load. Overall, it uses up about 15% of the available CPU. Maybe that's fine considering today's desktops.

To my ears, there is no audible difference. But, my hearing is not that great :)


Top
 Profile  
 
PostPosted: Tue Oct 11, 2016 7:49 pm 
Offline

Joined: Sun Apr 13, 2008 11:12 am
Posts: 6540
Location: Seattle
A 203-entry LUT is short enough to fit into a few lines of L1 cache, so that might be the difference you're seeing.

A tiny LUT is definitely a very different story than a multi-megabyte one.

(FWIW, adapting your core loop to C and compiling it with gcc -O3 -ffast-math produces 15ns/loop with floats, 1.9ns/loop with LUTs, and 14ns/loop (as well as wrong results) with integers)


Top
 Profile  
 
PostPosted: Tue Oct 11, 2016 7:56 pm 
Offline
User avatar

Joined: Mon Dec 29, 2014 1:46 pm
Posts: 753
Location: New York, NY
lidnariq wrote:
(FWIW, adapting your core loop to C and compiling it with gcc -O3 -ffast-math produces 15ns/loop with floats, 1.9ns/loop with LUTs, and 14ns/loop (as well as wrong results) with integers)


On what box?


Top
 Profile  
 
PostPosted: Tue Oct 11, 2016 8:05 pm 
Offline

Joined: Sun Apr 13, 2008 11:12 am
Posts: 6540
Location: Seattle
Haswell i3-4330. (nominally 3.5GHz peak, and 64B/line of cache)


Top
 Profile  
 
PostPosted: Tue Oct 11, 2016 8:32 pm 
Offline
User avatar

Joined: Fri Nov 19, 2004 7:35 pm
Posts: 3969
Huge lookup tables = tons of cache misses = slow

_________________
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!


Top
 Profile  
 
PostPosted: Wed Oct 12, 2016 6:50 am 
Offline
User avatar

Joined: Mon Dec 29, 2014 1:46 pm
Posts: 753
Location: New York, NY
The reason that I am reviewing the mixer code is that I want to provide sound options that enable the user to adjust the mix via sliders similar to many other emulators. Using floats, the code is straightforward and clean. Using integer math, things become messy. For ints only, a pre-scaled LUT could be used for each channel. Or, I could multiply by a constant and right-shift. But, it's not clear if either is really faster than a multiplication by a 32-bit float. Or, in my case, if it is even worth it; after mixing, my emulator applies filtering and decimation, both of which use a ton of 64-bit float multiplications, additions and table look-ups. In the fancy mixer formula, it's really the divisions that make me cringe. We're all taught to stay away from them for speed reasons, but everyone owns a personal super computer these days.


Top
 Profile  
 
PostPosted: Wed Oct 12, 2016 7:16 am 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19355
Location: NE Indiana, USA (NTSC)
zeroone wrote:
In the fancy mixer formula, it's really the divisions that make me cringe. We're all taught to stay away from them for speed reasons, but everyone owns a personal super computer these days.

Perhaps it's a matter of whether you want to target PCs or smartphones. On the one hand, people tend to want to play simpler games on the go rather than at home, and mobile devices with ARM CPUs have tended to perform far worse than desktop PCs at floating-point division in the past. On the other hand, the mobile devices that people are carrying nowadays have a touch screen and tilting the whole device as their only input methods, and these don't map quite so well to the twitchy actions that NES games tend to require on the Control Pad and A and B Buttons. Clip-on gamepads for phones exist, but I've never seen one in use in person.


Top
 Profile  
 
PostPosted: Wed Oct 12, 2016 10:30 am 
Offline

Joined: Sun Apr 13, 2008 11:12 am
Posts: 6540
Location: Seattle
I'm not certain exactly how the reciprocal-of-reciprocals model came about, but given how MOSFETs work it really "ought" to involve a square root instead. (wikipedia)


Top
 Profile  
 
PostPosted: Wed Oct 12, 2016 2:28 pm 
Offline
User avatar

Joined: Mon Dec 29, 2014 1:46 pm
Posts: 753
Location: New York, NY
The LUT approximation deviates from the float version by more than 10% when DMC is not being used. For instance, plug-in the max values for pulse1, pulse2, triangle, and noise, and set dmc to 0. Maybe there is an audible difference.


Top
 Profile  
 
PostPosted: Wed Oct 12, 2016 2:48 pm 
Offline
User avatar

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 5899
Location: Canada
zeroone wrote:
The LUT approximation deviates from the float version by more than 10% when DMC is not being used. For instance, plug-in the max values for pulse1, pulse2, triangle, and noise, and set dmc to 0. Maybe there is an audible difference.

Yeah, it's unclear why the lookup table implementation on the wiki is using different numbers than blargg's formula, maybe mostly because it's using 3 magic values instead of summing triangle + noise + dmc. (Is the actual mix an analog combination of 3 DACs or a digital sum output to 1 DAC?)

I hinted at this earlier, but you should know that this formula is an old approximation that blargg made up years ago to fit the data available at that time. It's not really that accurate to begin with, but it's better than the linear approximation that was used before it. Between Visual2A03, and the wealth of knowledge and available expertise this community has accumulated since then I'm sure someone could do a lot better if they were to take a crack at it now. (I intend to try once I'm finished my game project, if nobody else gets to it before then.)

As for whether you can hear a difference, probably you won't. The analog mix between the two APU outputs is highly variable from system to system anyway; every release of NSFPlay I get complaints on both sides of "this channel is too loud, it's quieter on my NES" and the other way around. Every NES is a little different in that way, and it's a much stronger difference than the kind of thing you'll notice between these approximations of the same formula.


Top
 Profile  
 
PostPosted: Wed Oct 12, 2016 3:05 pm 
Offline
User avatar

Joined: Mon Dec 29, 2014 1:46 pm
Posts: 753
Location: New York, NY
According to the wiki, "When the values for one of the groups are all zero, the result for that group should be treated as zero rather than undefined due to the division by 0 that otherwise results."

This is not the case. In the limit, as the pulse channels approach 0, pulse_out smoothly approaches 0. Similarly, tnd_out smoothly approaches a constant. Consequentially, the checks are not necessary if the formulas are reduced to the expression below:

Code:
  public static float mix(final int pulse1, final int pulse2,
      final int triangle, final int noise, final int dmc) {   

    return (0.9588f * (pulse1 + pulse2)) / (pulse1 + pulse2 + 81.28f)
        - 361.733f / (2.75167f * triangle + 1.84936f * noise + dmc + 226.38f)
            + 1.5979f;
  }


Also, with fewer divisions, it should run a lot faster.


Top
 Profile  
 
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 16 posts ]  Go to page 1, 2  Next

All times are UTC - 7 hours


Who is online

Users browsing this forum: Gilbert and 2 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
Powered by phpBB® Forum Software © phpBB Group