Workaround for OAM corruption - rendering a blank frame?

Dwedit
Posts: 4924
Joined: Fri Nov 19, 2004 7:35 pm

Workaround for OAM corruption - rendering a blank frame?

Post by Dwedit »

If you render a blank frame (set a black palette, DMA some sprites, enable rendering for an entire frame until NMI), will that compensate for the OAM corruption bug caused by disabling rendering early?
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!
rainwarrior
Posts: 8735
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada

Re: Workaround for OAM corruption - rendering a blank frame?

Post by rainwarrior »

Just for context, this seems to be a direct consequence of the recent discussion in this thread: Nine sprites overflow doesn't work in the beginning

To answer your question, though: yes, as I understand it.

You could even render the frame with background on but sprites off, since having either visible causes both to go through their internal process.
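As a rough illustration of the blank-frame idea, here is a minimal Python sketch that only lists the register writes such a "flush" frame might use. The register addresses are the standard ones ($2001 PPUMASK, $2003 OAMADDR, $2006/$2007 PPUADDR/PPUDATA, $4014 OAMDMA); the helper name `blank_frame_writes`, the shadow OAM page $02, and the use of $0F as black are assumptions for illustration, and none of this has been verified on hardware.

```python
# Standard NES register addresses.
PPUMASK, OAMADDR, PPUADDR, PPUDATA, OAMDMA = 0x2001, 0x2003, 0x2006, 0x2007, 0x4014

def blank_frame_writes(oam_page=0x02, black=0x0F):
    """Return the (register, value) writes for one blank frame, issued in vblank.

    Hypothetical helper: oam_page and black are illustrative defaults.
    """
    writes = [(PPUMASK, 0x00)]                       # rendering off during setup
    writes += [(PPUADDR, 0x3F), (PPUADDR, 0x00)]     # point at palette RAM $3F00
    writes += [(PPUDATA, black)] * 32                # all 32 palette entries black
    writes += [(OAMADDR, 0x00), (OAMDMA, oam_page)]  # normal sprite upload
    writes += [(PPUMASK, 0x08)]                      # background on, sprites off
    return writes                                    # then wait for the next NMI
```

The final write enables only the background, following the suggestion that either layer being visible is enough. In a real program the scroll would need to be set again before the next visible frame, since the $2006 writes disturb it.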

If you upload OAM on every frame, the "problem" really only results in 2 missing sprites for a single frame, so in a lot of cases that might be nothing much to worry about. It's a big problem if you're using sprite overflow, or if you're causing the corruption every frame, or if you don't update OAM often, but otherwise it seems like something that could be ignored if the minor visual error for one frame isn't too objectionable. (It seems much less bad than regularly occurring sprite flicker, which we already generally accept.)

I am not sure if this problem has the potential to affect sprite-0 hit. At the end of that linked thread I had determined in the example that it was always making duplicates of sprites 0 and 1, so those two sprites seemed to be safe in every case? (I don't know if it's different if you set $2003 to some value besides $00 before OAM DMA, or fail to set it, etc.)

I also don't know what part of the process corrupts the sprites. Does it happen during sprite evaluation? Does the copy overwrite happen on every scanline or just the first one? Would it still happen if you didn't OAM DMA?
Drag
Posts: 1615
Joined: Mon Sep 27, 2004 2:57 pm

Re: Workaround for OAM corruption - rendering a blank frame?

Post by Drag »

Regarding what this bug is and why it happens:

The internal sprite OAM is DRAM instead of SRAM, and one characteristic of DRAM is the necessity to refresh it periodically. There are several strategies to do it, but the main effect is that the memory is split up into sectors of N bytes. When you read a byte from DRAM, you're fetching an entire N-byte sector and grabbing the desired byte from it, but fetching from DRAM is destructive, so that sector needs to be written back to where it was fetched from, which in the process "refreshes" that sector. Most DRAMs contain a third operation aside from read and write, which basically just fetches a sector and then rewrites it, so when DRAM sits idle and doesn't get accessed (for example, in a computer where an entire section of memory can be untouched for a while), there's supposed to be circuitry that automatically performs the "refresh" operation on successive sectors one after the other, over and over. The PPU doesn't need this though, because it touches every single byte of the OAM each scanline, which effectively does the same thing.

The effect of the bug seems consistent with a sector fetch happening (which puts it into a latch, so the appropriate byte can be copied out and then the sector can be rewritten), but then the rewrite portion puts it in the wrong spot for whatever reason. It's worth noting that dots 64-255 are the only time the primary OAM is accessed, and outside of that, only secondary OAM is being accessed. In addition, the secondary OAM is accessed when there's an in-range sprite on the scanline in question, and that lines up with the condition that the bug happens when rendering is disabled outside of dots 64-255, or when there's an in-range sprite on the scanline; these are when the secondary OAM is being accessed.

So, it seems logical to conclude that the DRAM's refresh gets goofed up when you disable rendering while the secondary OAM is being accessed. Why? Beats me! Seems like some part of the circuit is being shared between the two OAMs, and interrupting an access to secondary OAM bungles something up with the primary OAM.

I don't know how to read a decapped IC, so I can't trace this out myself and figure out exactly what's going on or if the above is true at all, but this seems like a reasonable place to start for someone who can.
lidnariq
Posts: 11432
Joined: Sun Apr 13, 2008 11:12 am

Re: Workaround for OAM corruption - rendering a blank frame?

Post by lidnariq »

Well, for one ... there's only one block of DRAM, and the memory addressing as shown in Visual2C02 shows "where" secondary OAM is relative to primary OAM (one address line apart).

I don't think reads are destructive in the specific DRAM used in the 2C02. The NMOS 4T-DRAM cells used store a copy of both Bit and NotBit (e.g. nodes 3028 and 3066). In this 4-transistor DRAM, refresh "should" just be: pclk0 charges the columns; and then one of row0 through row31 is asserted. The charge on the columns reloads the capacitor of one gate or the other, depending on the current value of both gates. Reading the node should be almost the same: pclk0 charges the column; then one of row0..31 is asserted; then column0..7 is asserted, capacitively connecting the column lines to the common bit in/out.

(four transistors per bit: t599, t562, t522, t651)

This topology isn't like historical 3-transistor DRAM, which has a separate read and write strobe (and, in fact, separate input and output data lines), nor like modern 1-transistor DRAM (where reads are unequivocally destructive).

My best guess is that most of the corruption comes from situations where the row drivers are asserted (connecting the feedback loop to the big common nodes) without first precharging the columns. This should usually end up copying the value from the higher capacitance common node (e.g. 253,273) back into the cell, rather than the desired other way.
Dwedit
Posts: 4924
Joined: Fri Nov 19, 2004 7:35 pm

Re: Workaround for OAM corruption - rendering a blank frame?

Post by Dwedit »

The thing is though, you still have corrupt sprites next frame even if you rewrite the OAM. Why would that happen?
lidnariq
Posts: 11432
Joined: Sun Apr 13, 2008 11:12 am

Re: Workaround for OAM corruption - rendering a blank frame?

Post by lidnariq »

Maybe stopping the OAM evaluation at the wrong time gets the refresh/read/write sequencer into a bad phase? Just a random guess.
thefox
Posts: 3134
Joined: Mon Jan 03, 2005 10:36 am
Location: 🇫🇮

Re: Workaround for OAM corruption - rendering a blank frame?

Post by thefox »

lidnariq wrote: Maybe stopping the OAM evaluation at the wrong time gets the refresh/read/write sequencer into a bad phase? Just a random guess.
This is how I've always understood this, but I have nothing to back up the claim (apart from not being able to think of any other explanation). More verbosely: When rendering is disabled mid-frame, the "sequencer" gets left in whatever state, which is not the state it's usually in at the end of a frame. (Also, we assume that the "sequencer" state is not completely reset at the beginning of a new frame.) Thus, when the next frame starts with freshly uploaded sprites, it ends up corrupting OAM as one of the first things it does when rendering begins, and eventually settles back into normal operation.
Download STREEMERZ for NES from fauxgame.com! — Some other stuff I've done: fo.aspekt.fi
Drag
Posts: 1615
Joined: Mon Sep 27, 2004 2:57 pm

Re: Workaround for OAM corruption - rendering a blank frame?

Post by Drag »

It doesn't seem to be secondary OAM accesses that cause the glitch, since the secondary OAM is accessed all throughout the scanline, even during the safe zone and even when no sprites are in range, so it may be a problem with sprite evaluation itself instead of the DRAM.
Checking our favorite post again, enabling rendering late into the frame instead of in vblank also avoids the glitch. So, disabling rendering during the unsafe portion of the scanline, and then reenabling rendering during the same frame (like if you were to update the palette for a status bar split) shouldn't cause the glitch.

The physics of DRAM refreshing may have been a red herring. It's entirely possible that a flaw in the sprite evaluation logic is causing a deliberate garbage-write. The glitch could simply be something goofy happening when sprite evaluation activates for the first time after vblank. Otherwise, if it depended on what sprite evaluation was doing at the time, we'd be seeing a chunk of secondary OAM being copied into primary OAM, or possibly a string of FFs being copied since that's what happens during dots 0-63. But then wouldn't all that be overwritten by a DMA? The glitch has to be occurring when rendering is being enabled.

The mention of only 6 sprites triggering the glitch really bothers me though, it's such an arbitrary number that doesn't line up to anything. It might only be referring to Blargg's specific test where only one sprite is in range.
rainwarrior
Posts: 8735
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada

Re: Workaround for OAM corruption - rendering a blank frame?

Post by rainwarrior »

Drag wrote: The physics of DRAM refreshing may have been a red herring. It's entirely possible that a flaw in the sprite evaluation logic is causing a deliberate garbage-write. The glitch could simply be something goofy happening when sprite evaluation activates for the first time after vblank. Otherwise, if it depended on what sprite evaluation was doing at the time, we'd be seeing a chunk of secondary OAM being copied into primary OAM, or possibly a string of FFs being copied since that's what happens during dots 0-63. But then wouldn't all that be overwritten by a DMA? The glitch has to be occurring when rendering is being enabled.
When I tested it in the previous thread, it doesn't appear to be copying $FF, but rather it's copying 8 bytes from sprites 0-1 over two of the other sprites. It might be actually copying from the current $2003/OAMADDR location to the affected address? (Haven't tried testing it with other values written to $2003 yet.)
Drag wrote: The mention of only 6 sprites triggering the glitch really bothers me though, it's such an arbitrary number that doesn't line up to anything. It might only be referring to Blargg's specific test where only one sprite is in range.
As I said in that thread, I don't think it's restricted to 6 sprites at all. As far as I can tell it happens whenever sprite evaluation is interrupted, not really to do with specific sprite indices. The copied-over sprites can be in any index, so far as I can tell (and sprites 0-1 were immune, since they'd be copied over themselves).
Drag wrote: Checking our favorite post again, enabling rendering late into the frame instead of in vblank also avoids the glitch. So, disabling rendering during the unsafe portion of the scanline, and then reenabling rendering during the same frame (like if you were to update the palette for a status bar split) shouldn't cause the glitch.
I have a strong suspicion that the OAM corruption happens specifically on the first scanline where rendering is enabled again, so it makes sense to me that re-enabling it for at least one scanline would flush the error if done prior to the next frame's DMA. (Have not tested this yet.)
Drag
Posts: 1615
Joined: Mon Sep 27, 2004 2:57 pm

Re: Workaround for OAM corruption - rendering a blank frame?

Post by Drag »

Writing a different value to $2003 does indeed change which sector overwrites the corrupted sector, as the linked post mentions.
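The observed effect (one 8-byte sector copied over another, with the source chosen by $2003) can be modelled with a minimal Python sketch. It assumes the source sector is `oamaddr & 0xF8`; the function name `corrupt_row` is made up for illustration, and `dst_row` is just a parameter because what selects the destination on real hardware isn't established in this thread.

```python
ROW = 8  # bytes per OAM sector, i.e. two 4-byte sprites

def corrupt_row(oam, oamaddr, dst_row):
    """Return a copy of oam with the OAMADDR-selected sector copied over dst_row.

    Assumption: the source sector starts at oamaddr & 0xF8.
    """
    src = oamaddr & 0xF8
    dst = dst_row * ROW
    out = bytearray(oam)
    out[dst:dst + ROW] = oam[src:src + ROW]
    return out

oam = bytearray(range(256))        # recognisable test pattern
hit = corrupt_row(oam, 0x00, 5)    # arbitrary sector: sprites 10-11 become copies of 0-1
same = corrupt_row(oam, 0x00, 0)   # sector 0 copied onto itself: no visible change
```

The second call shows why sprites 0-1 would look immune when $2003 is $00: the sector is overwritten with its own contents.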

The 6 sprites thing bothers me because I can't figure out how to make it plausible. Blargg was testing when only sprite 0 is in range of the scanline when rendering is disabled, so only one sprite is in range, and it's sprite 0. It corrupts predictably, but only the first 6 sprites have an effect. My thoughts were that, after rendering is disabled, the dots 64-255 portion of sprite evaluation (if active at the time) might still continue until dot 256, and that the last sprite to be checked is what determines which sector gets corrupted. That would explain why it seems that the result is the same regardless of where inside dots 64-255 you disable rendering. However, I'm only assuming that the result is the same, so that might be incorrect.

The wiki doesn't state it explicitly, but it seems that out-of-range sprites take 2 cycles to process, and in-range sprites take 8 cycles, so that means the sprite evaluated as part of dot 255 depends on how many sprites were found to be in-range. If there's only one in-range sprite, it takes 134 cycles to check the entire OAM table. There are 192 cycles total for checking the OAM table, so 58 cycles are left over. That means sprite evaluation proceeds past the end of the OAM table, secondary OAM is placed in read-only mode, and evaluation continues at sprite 0 again. In this case, if the in-range sprite is within the first couple of sprites, it'll trigger the 8-cycle in-range logic again, while the rest of the sprites will not because this duplicate evaluation ends before it can reach them. The problem is that 58 cycles is much more than 6 sprites' worth of leftover cycles.
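The arithmetic above can be checked with a small Python sketch. The 2-cycle and 8-cycle costs are the assumption made in this post, not a confirmed hardware fact, and the function names are made up for illustration. The formula only makes sense for a small number of in-range sprites.

```python
OAM_SPRITES = 64
EVAL_CYCLES = 256 - 64       # dots 64-255 inclusive
OUT_OF_RANGE_COST = 2        # assumed cost per out-of-range sprite
IN_RANGE_COST = 8            # assumed cost per in-range sprite

def first_pass_cycles(in_range_count):
    """Cycles to walk all 64 sprites once under the 2/8-cycle assumption."""
    out_of_range = OAM_SPRITES - in_range_count
    return in_range_count * IN_RANGE_COST + out_of_range * OUT_OF_RANGE_COST

def leftover_cycles(in_range_count):
    """Cycles remaining in dots 64-255 after the first full pass."""
    return EVAL_CYCLES - first_pass_cycles(in_range_count)

print(first_pass_cycles(1), leftover_cycles(1))  # 134 58
print(first_pass_cycles(0), leftover_cycles(0))  # 128 64
```

With one in-range sprite, 58 leftover cycles is enough for 29 out-of-range sprites at 2 cycles each, which is why it doesn't line up with a limit of 6.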

My idea was that the corruption always happens, but because zero sprites were in range, the Y-coordinate evaluation always ends in such a way that sprite 0 is the last accessed sprite, so when the glitch happens, the first sector overwrites itself with its own data = no effect. However, when no sprites are in range, it only takes 128 cycles to evaluate the entire OAM table, and when it wraps around, there's enough time to check half the OAM table a second time, so it would leave off in a way that the halfway point of the OAM would get corrupted, but that clearly doesn't happen, so I don't know. Even if it did work out, adding one sprite causes the fifth pair of sprites to get corrupted, but that's not consistent with where the y-coordinate evaluation leaves off during the wraparound, so something else must be happening, or I made a wrong assumption somewhere.
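To see where evaluation would leave off under this model, here is a minimal Python sketch that steps through the 192 cycles. It assumes 2 cycles per out-of-range sprite, 8 per in-range sprite, and a wrap back to sprite 0 after sprite 63; `sprite_at_end` is a hypothetical name, and the model itself is only the guess described above.

```python
def sprite_at_end(in_range, total_cycles=192, sprites=64):
    """Index of the sprite evaluation would reach next when the cycles run out.

    Assumes the 2/8-cycle costs and the wraparound described in the post.
    """
    remaining, index = total_cycles, 0
    while True:
        cost = 8 if index in in_range else 2
        if cost > remaining:
            return index
        remaining -= cost
        index = (index + 1) % sprites

print(sprite_at_end(set()))  # 32: the halfway point of OAM
print(sprite_at_end({0}))    # 26
```

Neither result lands on the first sector or on the fifth pair of sprites, which matches the conclusion that this model doesn't explain the observed corruption by itself.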

Blargg mentioned that enabling rendering late avoids the glitch, but not how late, leading me to believe the glitch occurs somewhere in the prerender scanline, though we don't know where specifically. My guesses would be at the start, or during vertical scroll initialization, or during the skipped cycle logic (which would make the glitch happen every other frame, at worst, so maybe it's not this).