PPU synchronization from NMI

Discuss technical or other issues relating to programming the Nintendo Entertainment System, Famicom, or compatible systems. See the NESdev wiki for more information.

Moderator: Moderators

User avatar
blargg
Posts: 3715
Joined: Mon Sep 27, 2004 8:33 am
Location: Central Texas, USA
Contact:

PPU synchronization from NMI

Post by blargg »

Finally, the code to synchronize to the PPU in an NMI handler, without having to cycle-time all code:

nmi_sync.zip

Has demo code, the routines to call, and documentation on how to calculate exactly when a write will occur. With this, the random latency of NMI is eliminated, so that you are always a known number of CPU cycles into the current frame. Code outside NMI doesn't need to be cycle-timed. This will allow more accurate raster effects in demos and games.

I also wrote a detailed wiki page about how it achieves this synchronization.
User avatar
Memblers
Site Admin
Posts: 4044
Joined: Mon Sep 20, 2004 6:04 am
Location: Indianapolis
Contact:

Post by Memblers »

That's really amazing, I didn't know that was possible.
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Post by tokumaru »

This is a very interesting topic, and the fact that you can have a main thread that is not constant-timed makes all the difference.

Now, I'm still trying to understand exactly how the technique works, but how does this behave on the real console (I'm having some problems that prevent me from using my PowerPak right now, so I can't test it myself)? In Nintendulator it jitters quite a bit, and in Nestopia it jitters a little less (looks like a single pixel to me). Can you really make something always happen at the exact same time every frame (i.e. the demos in the archive don't jitter at all)?
User avatar
neilbaldwin
Posts: 481
Joined: Tue Apr 28, 2009 4:12 am
Contact:

Post by neilbaldwin »

We are not worthy.

:)
User avatar
blargg
Posts: 3715
Joined: Mon Sep 27, 2004 8:33 am
Location: Central Texas, USA
Contact:

Post by blargg »

On NTSC, there is a one-pixel jitter. On PAL, there is a one or two-pixel jitter, and the left-most pixel can differ by one or two pixels after you press reset (but it doesn't change while running). This is all hardware limitation; there's no way in software to overcome this, even with fully cycle-timed code. In both cases, the frame length in CPU cycles includes a half-cycle, so every other frame every CPU cycle in the frame is shifted a pixel or two (but in NTSC, by synchronizing with the PPU at the beginning, this is always reduced to one pixel, never two).

What we need now is for people to try this out with their favorite raster technique, including ones they could never get to work due to timing, and see what is possible.

In a nutshell, we want NMI to occur at the beginning of a frame, rather than delayed until the end of the current instruction. This technique is able to determine how long the interrupted instruction was, and make a compensating delay so that it is as if NMI occurred at the beginning of the frame, without any delay. Note that the posted version can only handle interrupted instructions of two and three cycles, but that I have been able to make it work even when interrupting longer instructions.
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Post by tokumaru »

So Nestopia does a better job than Nintendulator when emulating this? Interesting.

IMO, the jitter makes it hard to find a good use for this. The first thing that came to my mind was vertical screen splits, but that would not look good with the jitter.
User avatar
Bregalad
Posts: 8055
Joined: Fri Nov 12, 2004 2:49 pm
Location: Divonne-les-bains, France

Post by Bregalad »

Well, it works just as supposed/doccumented on the real hardware, both NTSC and PAL. The amazing thing is that, on NTSC, you get the exact same result on all resets and power on (or I've been extremely lucky). Nestopia isn't too far from the truth, but doesn't emulate different reset behavior on PAL.

You're a hero Blarg, congratulations ! I was just going to ask you where was the devlopment of the NTSC version because i haven't had news for a while, and I was afraid something would have turned up not doable.

Tokumaru, you're probably missing how much of an improvement what did blargg over the common synchronization method is. Under normal circunstances, the "best" possible margin for a PPU write was about a dozen of pixels (by having an all-cycle timed code from a NMI which interrupts an endless jmp here loop). Now the best possible margin is 1 pixel NTSC and 3 pixel PAL (so I'll say this is 3 pixels - because it's the worst case that counts if you want chances to do things that are doable on both NTSC and PAL systems).

So 3 pixels fits perfectly in a tile if you're lucky. This allows you to do :
- Mid-scanline CHR-bankswitching on the BG *WITHOUT* any tiles that are shared between all banks
- Scroll writes mid-scanline *WITHOUT* any blank area. Vertical or horizontal screen split becomes perfectly possible (as long as the bottom isn't used - to have free CPU time before the next VBLank)
- Nametable switching mid-scanline *WITHOUT* any tiles that are shared between both tables
- Do funky clipping effects with $2001 WITHOUT having any shaking horizontally.
- Palette rewrites during HBlank could become much less of a nightmare to get glitch free
- The accuracy of emulators can be tested to a higer point : Because it's possible to write to registers to a very predictable pixel, if effects takes too early or too late the result can be visible and inacuracies that would have otherwise never noticed will be corrected.
What we need now is for people to try this out with their favorite raster technique, including ones they could never get to work due to timing, and see what is possible.
True, I'll dig my old "Window" demo that did palette writes during HBlank but with glitches and see if I can improve that.
Also, some research about $2007 reads during the frame should be done. Nobody has any clue of what effects those have, and to be honest it's about time some serious info about that would be published.
I'd also say some research about $2003 and $2004 should be done, but Blarg was about to become mad last time so we could just forget about those two and keep to the good old "just sta $4014 and don't ask any other questions".
Useless, lumbering half-wits don't scare us.
User avatar
blargg
Posts: 3715
Joined: Mon Sep 27, 2004 8:33 am
Location: Central Texas, USA
Contact:

Post by blargg »

Yes, for NTSC, the previous best you could get from an NMI that did sprite DMA was a jitter of 14 pixels. NMI itself has up to three (assuming it's a JMP loop). It's not just two because sometimes NMI is delayed an extra cycle on top of that. Then you have an extra possible cycle due to sprite DMA. And then there are up to two pixels depending on where in the CPU cycle the beginnings of the frames fall. That totals 4*3+2 = 14 pixels. With this, the jitter is a single pixel, and it's consistent across resets. The PAL situation is a little worse, due to the hardware timing.

I plan on trying this in Nestopia tomorrow to be sure it emulates it correctly. Until then, don't assume it's handling it correctly.

And as Bregalad mentioned, this might be possible using $2004 reads. I've done some tests, and made some progress, but still have some issues. If I could get it working, it'd be really awesome. It'd allow synchronizing without having to cycle-time the NMI handler, and it'd be able to handle NMI during any instruction. It's just that $2004 has lots of obscure things, and I'm sure it'll be somewhat different on PAL, unlike this technique, which is exactly the same, other than timing.

I've had techniques for synchronizing this well in the past for emulator test ROMs, but it required cycle-timed code. The breakthrough here is getting the same accuracy from a NMI, without the code outside NMI needing to be cycle-timed, and with only ten instructions of synchronization code in NMI (no loops either).
User avatar
Bregalad
Posts: 8055
Joined: Fri Nov 12, 2004 2:49 pm
Location: Divonne-les-bains, France

Post by Bregalad »

Blargg, do you have a generic method for easily delaying N cycles ?
Because it really is a headache to me. If the number of cycles is low enough, I go on with a M = (N-2)/5 and K = (N-2)%5 then do something like that (by hand, not using macros, although it might work with a macro):

Code: Select all

.if (K=1) M = M-1
.endif
  ldx #M
-  dex
  bne -
.if (K=1)
  nop
  nop
  nop     ;6 more cycles
.elseif (K=2)
  nop    ;2 more cycles
.elseif (K=3)
  lda $ff ;3 more cycles
.elseif (K=4)
  nop
  nop   ;4 more cycles
.endif

However, this don't work for N > 5*256 (it also don't work for smaller N but this isn't as much a problem).
And as Bregalad mentioned, this might be possible using $2004 reads. I've done some tests, and made some progress, but still have some issues. If I could get it working, it'd be really awesome. It'd allow synchronizing without having to cycle-time the NMI handler, and it'd be able to handle NMI during any instruction. It's just that $2004 has lots of obscure things, and I'm sure it'll be somewhat different on PAL, unlike this technique, which is exactly the same, other than timing.
That's not what I meant. I meant we could, by synchronize with your method, read $2004 and $2007 at very predictable times (instead of being approximate), and reverse engineer more precisely how read from those registers works.
Useless, lumbering half-wits don't scare us.
Bananmos
Posts: 552
Joined: Wed Mar 09, 2005 9:08 am
Contact:

Post by Bananmos »

Yeah, the list of things you could do with this sure gets your imagination going again at a time you thought we'd fully explored the NES from a "what cool effects are actually possible?"-point of view. I dare call it the second coolest discovery in the NES homebrew scene, ever since I learned about Loopy-scrolling more than 10 years ago :)

And speaking of that, Loopy's firefly demo does come to mind as one of the use cases for this: http://home.comcast.net/~olimar/NES/firefly.zip

The demo is MMC5 and uses scanline interrupts for the fly effect, so I don't know how much jitter there is on the a real NES... anyone who could tell me? I imagine it could be quite improved by this sync code, and it would be cool to see a more compatible NROM version rather than MMC5.
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Post by tokumaru »

Just to be clear, I'm not underestimating what blargg has done. I just want to see it being put to good use, rather than being just another discovery that looks cool at first but is never used for anything meaningful.
User avatar
Banshaku
Posts: 2417
Joined: Tue Jun 24, 2008 8:38 pm
Location: Japan
Contact:

Post by Banshaku »

I'm not sure if the issue I had was related to this but I will ask my the question.

Blargg solution seems to help regarding raster effects. In my project, while scrolling up or down, I have to skip a few scanlines to "connect" the content together. This is because of the map format used in the original game is not nes compatible.

I had to use the the MMC3 irq counter to find the right line and skip a few of them. The only problem is that there was jitters. Because of that, I had to add some nop to fix the scanline at the right position. But, I never was able to get the right scanline after because of this operation somehow.

Would this new finding help in that case?
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Post by tokumaru »

Banshaku wrote:Would this new finding help in that case?
I don't think so, because resetting the scroll is a fairly quick operation and the time window provided by an HBlank should be more than enough to do it properly. An error of 7 or 8 cycles shouldn't matter in your case.

If you can't properly reset the scroll to the desired location there must be something else wrong. Maybe the IRQs are not behaving as expected and firing at different times in the scanline or something.

I have recently made my game reset the scroll during HBlank and it works flawlessly. You just have to write to $2006 and $2005 (if you want to change the fine Y scroll, if you don't only $2006 will do the trick) as described in loopy's document and make sure that the second $2006 write falls withing HBlank (and also the $2005 write that changes the X scroll, because changes to the fine X scroll appear immediately). HBlank lasts nearly 28 CPU cycles, so making sure that the 4 or 8 cycles of 1 or 2 PPU writes fall within that time shouldn't be such a precise operation.

Now, I don't remember when exactly in the scanline MMC3 IRQs fire, but if it's too close to the start of Hblank there might not be enough time to reset the scroll for the next scanline. If that's the case, it would be better to have the IRQ fire 1 scanline early and waste some time before changing the scroll on the following scanline, a change that will only take effect on the one after that.
User avatar
blargg
Posts: 3715
Joined: Mon Sep 27, 2004 8:33 am
Location: Central Texas, USA
Contact:

Post by blargg »

This doesn't help with IRQ, only timed code running from NMI. The main benefit should be more time to do timed writes, since you don't need near as big an error window. In the litewall update I just posted in another thread, I make good use of this to do three palette updates per hblank, without any glitches, even in overscan. I also use the low jitter to do a $2001 tint update within a five-pixel window. Given that there is the natural three-pixel jagged effect you get with NTSC on successive scanlines, this only leaves two pixels of extra room.

I'm hoping it'll allow new mid-frame scroll effects that weren't possible before. At least with NTSC, the jitter is very low.
tepples
Posts: 22705
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Post by tepples »

As I understand it, the MMC3 IRQ fires at x=260, once the PPU has set up the address lines for the first sprite pattern fetch. This is already within hblank and after both the X and Y parts of Loopy_V have been updated. So I agree that "wasting" a line is probably the best choice, and it's not even really wasting if you defer computation of some of the written values until that line.
Post Reply