FCEUX 2.2.2 provides better trace logging that includes the CPU cycle count for each instruction. I started recording when the text boxes appear at the start of the first stage of Marble Madness and compared that log against a log from my emulator.
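For anyone who wants to repeat this kind of comparison, here is a minimal sketch of finding the first instruction where two traces disagree. It assumes each log has already been parsed into (cycle, PC) pairs with cycle counts measured from the start of the recording. The parsing step and the names `TraceEntry` and `first_divergence` are my own placeholders, not anything FCEUX provides.

```python
from typing import NamedTuple, Optional, Sequence


class TraceEntry(NamedTuple):
    """One executed instruction in a hypothetical, already-parsed form."""
    cycle: int  # CPU cycle count when the instruction started
    pc: int     # program counter of the instruction


def first_divergence(a: Sequence[TraceEntry],
                     b: Sequence[TraceEntry]) -> Optional[int]:
    """Return the index of the first entry that differs, or None if none does."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One log is a strict prefix of the other: they diverge where it ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Made-up entries for illustration only (three spins of the JMP at $FC5A).
reference = [TraceEntry(0, 0xFC5A), TraceEntry(3, 0xFC5A), TraceEntry(6, 0xFC5A)]
mine = [TraceEntry(0, 0xFC5A), TraceEntry(3, 0xFC5A), TraceEntry(7, 0xFC5A)]
print(first_divergence(reference, mine))  # 2
```

Comparing both the cycle count and the PC means a timing drift shows up even when the instruction stream itself is identical.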
When I delay the NMI handler by 40 CPU cycles, not only does it fix the rendering, but every instruction also matches up between the logs cycle-by-cycle. Each frame ends with a spin lock waiting for the next NMI:
Code:
$FC5A:4C 5A FC JMP $FC5A
Consequently, delaying the NMI handler by 40 CPU cycles results in that spin lock spinning 13 fewer times (JMP absolute takes 3 cycles, and 40 / 3 is roughly 13).
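The arithmetic behind that spin count, as a small sketch. The 3-cycle cost of JMP absolute is standard 6502 timing. The 3 PPU dots per CPU cycle figure assumes NTSC, and the function name is just for illustration.

```python
JMP_ABS_CYCLES = 3     # JMP absolute (opcode $4C) takes 3 CPU cycles
PPU_DOTS_PER_CPU = 3   # assumption: NTSC timing


def whole_spins(delay_cycles: int, loop_cycles: int = JMP_ABS_CYCLES) -> int:
    """Number of complete loop iterations that fit into the delay."""
    return delay_cycles // loop_cycles


print(whole_spins(40))        # 13
print(40 * PPU_DOTS_PER_CPU)  # 120
```

Under the NTSC assumption, 40 CPU cycles is 120 PPU dots, a little over a third of a 341-dot scanline.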
If I remove the NMI handler hack and let NMI take place on dot 1 of scanline 241, the text box rendering gets screwed up, but every instruction still matches up between the logs cycle-by-cycle until it reaches the sprite 0 hit test at the bottom of the frame. Marble Madness appears to use a sprite 0 hit test to hide the last few scanlines, presumably to conceal graphical artifacts that would result from vertical scrolling.
In this case, the loop that is waiting for the sprite 0 hit test absorbs the difference. In fact, if I add a hack to delay the setting of the sprite 0 hit flag by 40 CPU cycles, then the logs once again fully match up. This suggests that the CPU instruction timings are correct, including things like OAM DMA stalls.
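To illustrate how a polling loop can absorb a timing difference, here is a toy model. It assumes a hypothetical 7-cycle loop (BIT $2002 at 4 cycles plus a taken BVC at 3 cycles, with the read on the last cycle of BIT). I have not shown the game's actual loop, and the cycle numbers below are made up.

```python
def polls_until_hit(loop_start: int, flag_set: int,
                    loop_cycles: int = 7, read_offset: int = 3) -> int:
    """Count how many times a hypothetical polling loop reads the status
    register before it sees the sprite 0 hit flag.

    loop_start: CPU cycle at which the loop is entered
    flag_set:   CPU cycle at which the flag becomes visible
    """
    polls = 0
    read_time = loop_start + read_offset  # cycle of the first register read
    while True:
        polls += 1
        if read_time >= flag_set:
            return polls
        read_time += loop_cycles  # next iteration reads one loop period later


reference = polls_until_hit(0, 1000)      # flag timing in the reference log
early = polls_until_hit(0, 1000 - 40)     # flag arrives 40 cycles too early
hacked = polls_until_hit(0, 960 + 40)     # same, with the 40-cycle delay hack
print(reference, early, hacked)           # 144 138 144
```

In this model a 40-cycle shift in the flag changes only the number of polls, which is why the traces can agree on everything before the loop and disagree only on how long it spins.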
How could the rendering be out of sync with the processor by 40 CPU cycles if the timings are valid?