APU mixer function efficiency

Discuss emulation of the Nintendo Entertainment System and Famicom.

Moderator: Moderators

User avatar
zeroone
Posts: 939
Joined: Mon Dec 29, 2014 1:46 pm
Location: New York, NY
Contact:

APU mixer function efficiency

Post by zeroone »

The wiki contains an approximation that uses 2 lookup tables. However, since pulse, triangle and noise are 4-bit channels and DMC is a 7-bit channel, doesn't that mean a single lookup table with 2^23 = 8,388,608 entries could provide exact values? Assuming that 16-bit values are required, the lookup table only needs to be 16 MB. Are there any cache considerations for randomly reading from an array of that size?
lidnariq
Posts: 11432
Joined: Sun Apr 13, 2008 11:12 am

Re: APU mixer function efficiency

Post by lidnariq »

... Yes?
Enormous ones?

The largest in-die L3 cache I've heard of is on the order of 8 MiB, and you're talking about using more than twice that just for fixing APU non-linearity?


... That said, where did 2²³ come from? 4+4+7 = 15...

Honestly, even using 32 KiW isn't worthwhile. The lookup table approach is only useful on architectures where floating point or division on integers is really expensive, i.e. "before the pentium".
User avatar
rainwarrior
Posts: 8734
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: APU mixer function efficiency

Post by rainwarrior »

The lookup tables are only one way to implement the given non-linearity formulae, which themselves are only an approximation of the actual output (which still needs more research, I think).

I don't think the wiki really needs to prescribe how to implement the non-linearity; the formula is the important part. The lookup table is just an implementation detail. Whether or not it's worth doing depends entirely on the context of implementation.

...and yes, those tables are much, much smaller than what you're saying.
User avatar
zeroone
Posts: 939
Joined: Mon Dec 29, 2014 1:46 pm
Location: New York, NY
Contact:

Re: APU mixer function efficiency

Post by zeroone »

lidnariq wrote:Honestly, even using 32 KiW isn't worthwhile. The lookup table approach is only useful on architectures where floating point or division on integers is really expensive, i.e. "before the pentium".
Let's put that claim to the test. Below is a Java application that compares 32-bit floating-point operations to lookup tables:

Code: Select all

public final class BenchmarkApuMixers {
  
  private static final float[] pulseTable = new float[31];
  private static final float[] tndTable = new float[203];
  
  static {
    for(int i = pulseTable.length - 1; i >= 0; i--) {
      pulseTable[i] = 95.52f / (8128f / i + 100f);
    }
    for(int i = tndTable.length - 1; i >= 0; i--) {
      tndTable[i] = 163.67f / (24329f / i + 100f);
    }
  }

  public static float noTableTest(final float pulse1, final float pulse2, 
      final float triangle, final float noise, final float dmc) {
    return 95.88f / (8128f / (pulse1 + pulse2) + 100f) + 159.79f 
        / ((1f / (triangle / 8227f + noise / 12241f + dmc / 22638f)) + 100f);
  }
  
  public static float lookupTableTest(final int pulse1, final int pulse2, 
      final int triangle, final int noise, final int dmc) {
    return pulseTable[pulse1 + pulse2] + tndTable[3 * triangle + (noise << 1) 
        + dmc];
  }  

  public static void main(final String... args) throws Throwable {
    float result = 0;
    int x = 0;
    int y = 0;
    long startTime = System.nanoTime();    
    for(int i = 200_000_000; i >= 0; i--) {
      if (++x == 16) {
        x = 0;
      }
      if (++y == 128) {
        y = 0;
      }
      result += noTableTest(x, x, x, x, y);
    }    
    System.out.format("%f nanos/iteration%n", (System.nanoTime() - startTime) 
        / 200_000_000.0);
    result = 0;
    x = 0;
    y = 0;
    startTime = System.nanoTime();
    for(int i = 200_000_000; i >= 0; i--) {
      if (++x == 16) {
        x = 0;
      }
      if (++y == 128) {
        y = 0;
      }
      result += lookupTableTest(x, x, x, x, y);
    }
    System.out.format("%f nanos/iteration%n", (System.nanoTime() - startTime) 
        / 200_000_000.0);
    System.out.println(result);
  }
}
The results:

floating-point: 30.239289 nanos/iteration

lookup tables: 2.060711 nanos/iteration

That can't be right. The iterations are way too brief. This is only a 3 GHz machine.

Any suggestions for better ways to benchmark this?
User avatar
zeroone
Posts: 939
Joined: Mon Dec 29, 2014 1:46 pm
Location: New York, NY
Contact:

Re: APU mixer function efficiency

Post by zeroone »

I tried both ways in my emulator and I can't visually see a difference in the task manager. This is a i5-2400 @ 3.1 GHz 64-bit Win7 box from mid-2012. My emulator is not particularly efficient. One of the 4 cores that appear in the task manager bounces between 30% and 40% load. Overall, it uses up about 15% of the available CPU. Maybe that's fine considering today's desktops.

To my ears, there is no audible difference. But, my hearing is not that great :)
lidnariq
Posts: 11432
Joined: Sun Apr 13, 2008 11:12 am

Re: APU mixer function efficiency

Post by lidnariq »

A 203-entry LUT is short enough to fit into a few lines of L1 cache, so that might be the difference you're seeing.

A tiny LUT is definitely a very different story than a multi-megabyte one.

(FWIW, adapting your core loop to C and compiling it with gcc -O3 -ffast-math produces 15ns/loop with floats, 1.9ns/loop with LUTs, and 14ns/loop (as well as wrong results) with integers)
User avatar
zeroone
Posts: 939
Joined: Mon Dec 29, 2014 1:46 pm
Location: New York, NY
Contact:

Re: APU mixer function efficiency

Post by zeroone »

lidnariq wrote:(FWIW, adapting your core loop to C and compiling it with gcc -O3 -ffast-math produces 15ns/loop with floats, 1.9ns/loop with LUTs, and 14ns/loop (as well as wrong results) with integers)
On what box?
lidnariq
Posts: 11432
Joined: Sun Apr 13, 2008 11:12 am

Re: APU mixer function efficiency

Post by lidnariq »

Haswell i3-4330. (nominally 3.5GHz peak, and 64B/line of cache)
User avatar
Dwedit
Posts: 4924
Joined: Fri Nov 19, 2004 7:35 pm
Contact:

Re: APU mixer function efficiency

Post by Dwedit »

Huge lookup tables = tons of cache misses = slow
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!
User avatar
zeroone
Posts: 939
Joined: Mon Dec 29, 2014 1:46 pm
Location: New York, NY
Contact:

Re: APU mixer function efficiency

Post by zeroone »

The reason that I am reviewing the mixer code is that I want to provide sound options that enable the user to adjust the mix via sliders similar to many other emulators. Using floats, the code is straightforward and clean. Using integer math, things become messy. For ints only, a pre-scaled LUT could be used for each channel. Or, I could multiply by a constant and right-shift. But, it's not clear if either is really faster than a multiplication by a 32-bit float. Or, in my case, if it is even worth it; after mixing, my emulator applies filtering and decimation, both of which use a ton of 64-bit float multiplications, additions and table look-ups. In the fancy mixer formula, it's really the divisions that make me cringe. We're all taught to stay away from them for speed reasons, but everyone owns a personal super computer these days.
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Re: APU mixer function efficiency

Post by tepples »

zeroone wrote:In the fancy mixer formula, it's really the divisions that make me cringe. We're all taught to stay away from them for speed reasons, but everyone owns a personal super computer these days.
Perhaps it's a matter of whether you want to target PCs or smartphones. On the one hand, people tend to want to play simpler games on the go rather than at home, and mobile devices with ARM CPUs have tended to perform far worse than desktop PCs at floating-point division in the past. On the other hand, the mobile devices that people are carrying nowadays have a touch screen and tilting the whole device as their only input methods, and these don't map quite so well to the twitchy actions that NES games tend to require on the Control Pad and A and B Buttons. Clip-on gamepads for phones exist, but I've never seen one in use in person.
lidnariq
Posts: 11432
Joined: Sun Apr 13, 2008 11:12 am

Re: APU mixer function efficiency

Post by lidnariq »

I'm not certain exactly how the reciprocal-of-reciprocals model came about, but given how MOSFETs work it really "ought" to involve a square root instead. (wikipedia)
User avatar
zeroone
Posts: 939
Joined: Mon Dec 29, 2014 1:46 pm
Location: New York, NY
Contact:

Re: APU mixer function efficiency

Post by zeroone »

The LUT approximation deviates from the float version by more than 10% when DMC is not being used. For instance, plug-in the max values for pulse1, pulse2, triangle, and noise, and set dmc to 0. Maybe there is an audible difference.
User avatar
rainwarrior
Posts: 8734
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: APU mixer function efficiency

Post by rainwarrior »

zeroone wrote:The LUT approximation deviates from the float version by more than 10% when DMC is not being used. For instance, plug-in the max values for pulse1, pulse2, triangle, and noise, and set dmc to 0. Maybe there is an audible difference.
Yeah, it's unclear why the lookup table implementation on the wiki is using different numbers than blargg's formula, maybe mostly because it's using 3 magic values instead of summing triangle + noise + dmc. (Is the actual mix an analog combination of 3 DACs or a digital sum output to 1 DAC?)

I hinted at this earlier, but you should know that this formula is an old approximation that blargg made up years ago to fit the data available at that time. It's not really that accurate to begin with, but it's better than the linear approximation that was used before it. Between Visual2A03, and the wealth of knowledge and available expertise this community has accumulated since then I'm sure someone could do a lot better if they were to take a crack at it now. (I intend to try once I'm finished my game project, if nobody else gets to it before then.)

As for whether you can hear a difference, probably you won't. The analog mix between the two APU outputs is highly variable from system to system anyway; every release of NSFPlay I get complaints on both sides of "this channel is too loud, it's quieter on my NES" and the other way around. Every NES is a little different in that way, and it's a much stronger difference than the kind of thing you'll notice between these approximations of the same formula.
User avatar
zeroone
Posts: 939
Joined: Mon Dec 29, 2014 1:46 pm
Location: New York, NY
Contact:

Re: APU mixer function efficiency

Post by zeroone »

According to the wiki, "When the values for one of the groups are all zero, the result for that group should be treated as zero rather than undefined due to the division by 0 that otherwise results."

This is not the case. In the limit, as the pulse channels approach 0, pulse_out smoothly approaches 0. Similarly, tnd_out smoothly approaches a constant. Consequentially, the checks are not necessary if the formulas are reduced to the expression below:

Code: Select all

  public static float mix(final int pulse1, final int pulse2, 
      final int triangle, final int noise, final int dmc) {    

    return (0.9588f * (pulse1 + pulse2)) / (pulse1 + pulse2 + 81.28f) 
        - 361.733f / (2.75167f * triangle + 1.84936f * noise + dmc + 226.38f) 
            + 1.5979f;
  }
Also, with fewer divisions, it should run a lot faster.
Post Reply