Looking more closely at it:
1. Super Mario Bros. 3 writes $4007 and $4003 on every note, not just every loop.
2. The timing between this pair of writes varies by about 100 cycles (though with the same variation every loop). This might account for a difference of about 2.5% in phase on the highest frequency notes used in the track. (Probably not the culprit; it's not a strong enough effect.)
3. tepples suggested here that writing $4003/$4007 does not reset the state of the clock divider. This could account for a much greater difference. If the pitch was the same, it could be up to 12.5% (1/8) off, but if the previous pitch was lower it could cause even wider variation (every octave down doubles the width here). This could potentially vary between loops.
4. As I said before, the triangle is free running and will always have more or less random phase, but I don't think it accounts for what you're hearing. (It's an effect, but not terribly strong compared to what the squares are doing.)
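To put rough numbers on #2 and #3, here's a minimal Python sketch of the arithmetic. It assumes the usual square channel timing: a full 8-step waveform takes 16 * (t + 1) CPU cycles for timer value t, and the divider reloads every 2 * (t + 1) CPU cycles. The function names are made up for this post, and the timer value 249 is picked only because it makes 100 cycles come out to 2.5%; it isn't taken from the SMB3 music data. The divider function also assumes tepples' suggestion is right (the divider keeps counting out the old period after the write), which is exactly the thing that still needs verifying.

```python
def square_period_cycles(timer: int) -> int:
    """CPU cycles for one full 8-step square waveform at this timer value."""
    return 16 * (timer + 1)


def jitter_phase_fraction(jitter_cycles: int, timer: int) -> float:
    """Fraction of one waveform covered by a write-timing jitter (item 2)."""
    return jitter_cycles / square_period_cycles(timer)


def divider_phase_fraction(prev_timer: int, new_timer: int) -> float:
    """Worst-case phase uncertainty if the divider is NOT reset (item 3).

    Assumption: the divider finishes counting the previous note's period
    (up to 2 * (prev_timer + 1) CPU cycles) before the new period applies.
    """
    leftover_cycles = 2 * (prev_timer + 1)
    return leftover_cycles / square_period_cycles(new_timer)


if __name__ == "__main__":
    t = 249  # hypothetical timer value, chosen for illustration only
    print(jitter_phase_fraction(100, t))          # 100-cycle jitter
    print(divider_phase_fraction(t, t))           # previous note at same pitch
    print(divider_phase_fraction(2 * t + 1, t))   # previous note one octave lower
```

With the same pitch before and after, the worst case is one sequencer step (1/8 of the waveform), and each octave the previous note sits below the new one doubles that.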
So, really, if you're interested, I'd suggest digging into #3. The question I'd have there is whether the clock divider eventually halts if it runs out while the channel is halted, or whether it's always running, in which case every single note is subject to a rather larger potential phase variation. The Visual 2A03 project might illuminate this. (Also, once determined, the information should be mentioned on the wiki, if it isn't already.)
This SMB3 track is an interesting edge case because it's using both channels identically, which is rather strange. Usually doublings have a small difference in pitch for an intentional chorus effect, or maybe a change of octave, but SMB3 is just doing the exact same thing on both.
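To illustrate why phase matters so much when both squares play the identical thing, here's a small Python sketch that adds two copies of the same 8-step sequence at different step offsets and measures how much signal is left. It makes several simplifying assumptions: a 50% duty sequence (I haven't checked which duty SMB3 uses on this track), a plain linear sum (the real APU mixer is nonlinear), and offsets in whole sequencer steps only. The names are hypothetical.

```python
# Assumed 50% duty sequence, one entry per sequencer step.
DUTY_50 = [0, 1, 1, 1, 1, 0, 0, 0]


def mixed(offset_steps: int, seq=DUTY_50) -> list:
    """Linear sum of a sequence with a copy of itself shifted by offset_steps."""
    n = len(seq)
    return [seq[i] + seq[(i + offset_steps) % n] for i in range(n)]


def ac_power(samples) -> float:
    """Mean squared deviation from the average level (signal left after DC)."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)


if __name__ == "__main__":
    for offset in range(8):
        print(offset, mixed(offset), ac_power(mixed(offset)))
```

In this toy model an offset of 0 steps gives the full doubled square, while an offset of 4 steps cancels the waveform entirely, so the same note data can sound quite different depending on where the two sequencers happen to land.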
You keep asking which emulators are accurate, which is not a fair question. Most likely nobody has tried to refine this very specific and subtle aspect of its behaviour against this very specific edge case. If you want to know the answer, make a test ROM that can expose the difference and test it on real hardware and emulators you'd like to know more about.
SMB3 itself is not a sufficient test. As mentioned, in-game there are a lot of variables, but even the NSF, which should be deterministic, can't have consistent timing between different hardware NSF players, or software NSF players. The NSF specification is not strong enough to specify exactly when the play routine should start. (You mentioned a test case by Blargg, but I'm not certain that it's supposed to be a test of this specific thing either.) You can't just record SMB3 from hardware and expect "accurate" emulators to sound identical; that in itself would probably be an unfair and incorrect test (similar to that strange pattern of memory initialization people used to use because it was measured that way on one particular NES). Even the idea that it should "change over time" is not necessarily correct for all reasonable timings: things like this can often end up falling on some coincidental integer division of the timing loop, the change might accidentally be due to something else, etc.