Optimizing scroll changes after MMC3 IRQs

tokumaru
Posts: 11772
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Post by tokumaru » Wed May 06, 2020 8:04 am

My current project requires me to change the vertical scroll every 4 scanlines using the $2006/5/5/6 technique (because the fine Y scroll is > 3). I've decided to use the MMC3 for this project because of its popularity and availability, but since MMC3 IRQs fire so late in the scanline (PPU cycle 260 if I'm not mistaken), there's not enough hblank time left to make the scroll change, even with the following code (which runs after the previous instance of the IRQ handler has taken care of the first 3 PPU writes - $2006, $2005, $2005):

IRQ:
  pha ;3 cycles
  lda Variable ;3 cycles
  sta $2006 ;4 cycles
  ;(set stuff up for the next IRQ)
  rti

That's 10 cycles. Add the 7 cycles it takes to enter the IRQ handler and up to 7 cycles of latency from the instruction that was running when the IRQ fired, and the total is 24 CPU cycles, or 72 PPU cycles, so the final $2006 write finishes on PPU cycle 260 + 72 = 332 at the latest.
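
The arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes NTSC timing (3 PPU cycles per CPU cycle) and takes the dot-260 IRQ position and the 7-cycle worst-case latency from the post as given. The names are made up for illustration.

```python
# Sketch of the worst-case timing budget (NTSC assumed: 3 PPU dots per CPU cycle).
PPU_DOTS_PER_CPU_CYCLE = 3   # NTSC ratio (assumption; PAL differs)
IRQ_DOT = 260                # approximate dot where the MMC3 IRQ fires, per the post

HANDLER_CYCLES = {"pha": 3, "lda Variable": 3, "sta $2006": 4}
IRQ_ENTRY_CYCLES = 7         # interrupt sequence before the handler's first instruction
MAX_LATENCY_CYCLES = 7       # worst case for the instruction in flight, per the post


def worst_case_landing_dot():
    """Return (total CPU cycles, PPU dot where the last $2006 write ends)."""
    cpu_cycles = sum(HANDLER_CYCLES.values()) + IRQ_ENTRY_CYCLES + MAX_LATENCY_CYCLES
    return cpu_cycles, IRQ_DOT + cpu_cycles * PPU_DOTS_PER_CPU_CYCLE


cycles, dot = worst_case_landing_dot()
print(f"{cycles} CPU cycles -> write ends around dot {dot}")
```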

This actually wouldn't be a problem in my case if the write *consistently* finished on that cycle every time, since the first 2 tiles of the whole screen are blank, and the horizontal scroll never changes, but because of the IRQ latency, the scroll change could take place between any of the automatic X scroll increments that happen at the end of the scanline, which would result in inconsistent X scrolling every time I changed the scroll.
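
To make the inconsistency concrete, here is a rough sketch that sweeps the IRQ latency and counts how many end-of-scanline coarse X increments still happen after the write lands. It assumes NTSC timing, a latency of 0 to 7 CPU cycles, and that the two prefetch increments fall on dots 328 and 336 (my reading of the NESdev wiki's PPU timing notes). The exact dot on which a $2006 write takes effect is glossed over, and all names are illustrative.

```python
# Sketch: where can the final $2006 write land, and how many of the
# end-of-scanline coarse X increments still happen after it?
IRQ_DOT = 260                          # per the post
FIXED_CPU_CYCLES = 7 + 3 + 3 + 4       # IRQ entry + pha + lda + sta
PREFETCH_INCREMENT_DOTS = (328, 336)   # assumed dots of the two coarse X increments


def landing_dot(latency_cycles):
    """PPU dot on which the write ends for a given IRQ latency (NTSC)."""
    return IRQ_DOT + 3 * (FIXED_CPU_CYCLES + latency_cycles)


def increments_after_write(latency_cycles):
    """Count prefetch increments that occur at or after the landing dot."""
    dot = landing_dot(latency_cycles)
    return sum(1 for inc in PREFETCH_INCREMENT_DOTS if inc >= dot)


outcomes = {lat: increments_after_write(lat) for lat in range(8)}
for lat, count in outcomes.items():
    print(f"latency {lat}: lands near dot {landing_dot(lat)}, {count} increment(s) follow")
```

Under these assumptions the low-latency cases are followed by both increments and the high-latency cases by only one, which is the one-tile horizontal jitter being described.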

I really don't want to have to waste an entire scanline waiting for the next hblank so I can safely update the scroll (this would mean wasting about 12% of my CPU budget), so I figured I'd ask here to see if anyone can think of a way to make that scroll change work right away. There are 2 important points that I mentioned above that may make a difference, but I will list them again for emphasis:

1- Only the vertical scroll has to change. The horizontal scroll is constant throughout the entire frame.
2- The leftmost 2 tiles of the entire screen are blank, so it doesn't matter if they're fetched using the old scroll values.

Thanks in advance for any insight you might have, even though my expectations for an ideal solution are fairly low! :lol:
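
For scale, one way to arrive at a figure of roughly 12% is to assume about 30 scroll changes per frame (the count mentioned later in the thread) on an NTSC frame of 262 scanlines, each change burning one full scanline. Both numbers are assumptions in this sketch, not something stated alongside the 12% claim.

```python
# Sketch: share of a frame spent idling one scanline per scroll change.
SCANLINES_PER_FRAME = 262   # NTSC (assumption)
SCROLL_CHANGES = 30         # "30 or so" IRQs per frame, per a later post


def wasted_fraction(changes=SCROLL_CHANGES, lines=SCANLINES_PER_FRAME):
    """Fraction of frame time lost if each scroll change burns one full scanline."""
    return changes / lines


print(f"{wasted_fraction():.1%} of the frame")
```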

aa-dav
Posts: 92
Joined: Tue Apr 14, 2020 9:45 pm
Location: Russia

Re: Optimizing scroll changes after MMC3 IRQs

Post by aa-dav » Wed May 06, 2020 9:35 am

What if the IRQ handler were placed in RAM, with lda Variable converted to lda #value and the immediate value updated in place?

Also, an extreme variant: don't use the Y (or X) register outside the IRQ handler while an IRQ can be triggered. The first instruction of the handler is then stx $2006; after that X can be used freely, but at the end it must be set to the value for the next IRQ. It's a very extreme variant, I suppose. :)

calima
Posts: 1160
Joined: Tue Oct 06, 2015 10:16 am

Re: Optimizing scroll changes after MMC3 IRQs

Post by calima » Wed May 06, 2020 10:09 am

The various modes affect when the MMC3 IRQs trigger: 8x16 sprites, using the "wrong way" halves for BG and sprites, etc. Maybe you can find a config that works for the game side and also gives earlier IRQs.

lidnariq
Posts: 9510
Joined: Sun Apr 13, 2008 11:12 am
Location: Seattle

Re: Optimizing scroll changes after MMC3 IRQs

Post by lidnariq » Wed May 06, 2020 10:59 am

tokumaru wrote:
Wed May 06, 2020 8:04 am
> My current project requires me to change the vertical scroll every 4 scanlines using the $2006/5/5/6 technique (because the fine Y scroll is > 3).
You're storing different things in the bottom half of each tile?
> but since MMC3 IRQs fire so late in the scanline (PPU cycle 260 if I'm not mistaken), there's not enough hblank time left to make the scroll change
If you flip around the sprite/background tables, you can move it 64px later, which would reduce the amount of delay in the other case...

VRC4's prescaler means that subsequent IRQs can always be at the same X position, so it'd be an easy way to see if just being able to schedule the IRQ earlier in the scanline helps enough.

Post by tokumaru » Wed May 06, 2020 12:36 pm

Thanks for the replies.
aa-dav wrote:
Wed May 06, 2020 9:35 am
> What if the IRQ handler were placed in RAM, with lda Variable converted to lda #value and the immediate value updated in place?
Yeah, I thought of that while posting, but that would save only 1 cycle...
> Also, an extreme variant: don't use the Y (or X) register outside the IRQ handler while an IRQ can be triggered. The first instruction of the handler is then stx $2006; after that X can be used freely, but at the end it must be set to the value for the next IRQ. It's a very extreme variant, I suppose. :)
Also thought of this a while back, but giving up an index register might result in a bigger loss of performance than losing 1 scanline every 4, due to all the index swapping necessary... At the very least, it will be a major annoyance to write code for this "6502" variant with a single index register.
calima wrote:
Wed May 06, 2020 10:09 am
> The various modes affect when the MMC3 IRQs trigger: 8x16 sprites, using the "wrong way" halves for BG and sprites, etc. Maybe you can find a config that works for the game side and also gives earlier IRQs.
Since sprites are fetched during hblank, I'd only be able to clock the counter then, so still too late.
lidnariq wrote:
Wed May 06, 2020 10:59 am
> You're storing different things in the bottom half of each tile?
I have soft-pixels in the first 2 and last 2 rows of each tile, so I can set the scroll to line 6, displaying rows 6, 7, 0 and 1, at which point I skip 12 lines to row 6 of the tile below. Rows 2, 3, 4 and 5 of each tile are unused.
> If you flip around the sprite/background tables, you can move it 64px later, which would reduce the amount of delay in the other case...
Yeah, that's the best I was able to come up with... slightly reducing the delay by having the IRQ fire slightly later.
> VRC4's prescaler means that subsequent IRQs can always be at the same X position, so it'd be an easy way to see if just being able to schedule the IRQ earlier in the scanline helps enough.
Switching to a much less popular mapper is a tough decision...
Switching to a much less popular mapper is a tough decision...

Post by lidnariq » Wed May 06, 2020 1:02 pm

tokumaru wrote:
Wed May 06, 2020 12:36 pm
> I have soft-pixels in the first 2 and last 2 rows of each tile, so I can set the scroll to line 6, displaying rows 6, 7, 0 and 1, at which point I skip 12 lines to row 6 of the tile below. Rows 2, 3, 4 and 5 of each tile are unused.
You're using the attribute bytes? If you can just use the nametable bytes, you could change to only having to skip 4 lines ... which you could do with three reads from $2007 at the right edge of the visible scanline. This also shifts coarse X for one scanline, so does require some finesse.
> Switching to [VRC4, ] a much less popular mapper is a tough decision...
VRC4 clones are (were) pretty ubiquitous during the massive not-sanctioned-by-Nintendo market in the late 90s and 2000s. But really I was suggesting it only for establishing whether that's good enough of a change.

Bregalad
Posts: 7892
Joined: Fri Nov 12, 2004 2:49 pm
Location: Chexbres, VD, Switzerland

Re: Optimizing scroll changes after MMC3 IRQs

Post by Bregalad » Wed May 06, 2020 1:06 pm

> because the fine Y scroll is > 3
I suppose this is for your raycaster metapixels, right? Why does the Y scroll have to be > 3? I fail to see why that'd be needed.

You say you'll be changing the scroll every 4 scanlines. To me this means only 4 of the 8 scanlines stored in the nametables are ever used. Just use two $2006 writes with no $2005 write at all, to crush the 2 vertical name tables to half of their size, with the lower half of each tile unused and the upper 4 pixels used for 2 metatile pixels.

Post by tokumaru » Wed May 06, 2020 1:54 pm

lidnariq wrote:
Wed May 06, 2020 1:02 pm
> You're using the attribute bytes? If you can just use the nametable bytes, you could change to only having to skip 4 lines ... which you could do with three reads from $2007 at the right edge of the visible scanline. This also shifts coarse X for one scanline, so does require some finesse.
I'm not sure I follow. Here's how each tile in my BG pattern table is set up:

Row 0: AAAABBBB
Row 1: AAAABBBB
Row 2: ********
Row 3: ********
Row 4: ********
Row 5: ********
Row 6: AAAABBBB
Row 7: AAAABBBB

The same 2 soft-pixels at the top (pseudo-colors A and B) are repeated at the bottom. The reason for this is that the same combination of colors may be needed at either the top or the bottom. Here's the beginning of one column of name table data, and which of their rows are visible:

Tile 0: show rows 6 and 7;
Tile 1: show rows 0 and 1;
(change scroll)
Tile 2: show rows 6 and 7;
Tile 3: show rows 0 and 1;
(change scroll)
Tile 4: show rows 6 and 7;
Tile 5: show rows 0 and 1;
(change scroll)
Tile 6: show rows 6 and 7;
Tile 7: show rows 0 and 1;
(...)

Each instance of a tile has either the top 2 or the bottom 2 lines displayed, never both.
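
The row schedule listed above can be reproduced with a small sketch. It assumes each 4-scanline band starts at pixel row 6 of an even-numbered tile row, as the listing suggests; the function names are made up for illustration.

```python
# Sketch: which (tile, pixel row) each displayed scanline comes from.
LINES_PER_BAND = 4      # scanlines shown between scroll changes
START_FINE_Y = 6        # each band starts at pixel row 6 of a tile
ROWS_PER_TILE = 8


def band_rows(band_index):
    """List (tile, row) pairs for the 4 scanlines of one band."""
    first_tile = 2 * band_index          # every band consumes two name table rows
    rows = []
    for line in range(LINES_PER_BAND):
        y = START_FINE_Y + line          # 6, 7, 8, 9
        rows.append((first_tile + y // ROWS_PER_TILE, y % ROWS_PER_TILE))
    return rows


def lines_skipped_between_bands():
    """Pattern rows jumped over by each scroll change."""
    band_stride = 2 * ROWS_PER_TILE      # each band starts 16 rows after the previous one
    return band_stride - LINES_PER_BAND


for band in range(3):
    print(band, band_rows(band))
print("skipped per scroll change:", lines_skipped_between_bands())
```

The skip works out to 12 lines, matching the figure given earlier in the thread.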

Anyway, the whole point of this setup is to create a display of 56x60 soft-pixels, each one being 4x2 hardware pixels in size, containing 1 of 16 dithered patterns from a single background palette. If you can think of a better way to achieve this (without doing pattern table updates), I'm open to other ideas.
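
As a quick sanity check on those numbers, the sketch below derives the scene size in hardware pixels and the number of scroll changes per frame. The tile count assumes every (A, B) pattern pair gets its own tile, which the post implies but does not state outright.

```python
# Sketch: dimensions implied by a 56x60 grid of 4x2 soft-pixels.
SOFT_W, SOFT_H = 56, 60          # soft-pixel grid, per the post
PIXEL_W, PIXEL_H = 4, 2          # hardware pixels per soft-pixel
PATTERNS = 16                    # dithered patterns per soft-pixel
LINES_PER_SCROLL_CHANGE = 4      # scroll changes every 4 scanlines

width_px = SOFT_W * PIXEL_W                              # scene width in pixels
height_px = SOFT_H * PIXEL_H                             # scene height in scanlines
scroll_changes = height_px // LINES_PER_SCROLL_CHANGE    # IRQs per frame
tiles_needed = PATTERNS ** 2     # assumption: one tile per (A, B) pattern pair

print(width_px, height_px, scroll_changes, tiles_needed)
```

With these inputs the scene is 224x120 pixels, needs 30 scroll changes, and would use 256 tiles, i.e. one full pattern table under the stated assumption.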

Post by tokumaru » Wed May 06, 2020 2:02 pm

Bregalad wrote:
Wed May 06, 2020 1:06 pm
> Why does the Y scroll have to be > 3? I fail to see why that'd be needed.
I explained it in the post above.
> You say you'll be changing the scroll every 4 scanlines. To me this means only 4 of the 8 scanlines stored in the nametables are ever used.
That's correct.
> Just use two $2006 writes with no $2005 write at all, to crush the 2 vertical name tables to half of their size, with the lower half of each tile unused and the upper 4 pixels used for 2 metatile pixels.
This would work if my soft pixels were 4x4 hardware pixels (which could in fact be done by changing the X name table bit via $2000 every 8 scanlines with very loose timing, no $2005 or $2006 needed at all), but I'm trying to make them 4x2 hardware pixels.

Post by Bregalad » Wed May 06, 2020 2:15 pm

Oh, so you use one tile for 1x2 metapixels. I thought it was 2x2, as this would have allowed 4 colours per metapixel in 256 tiles. But you also use dithered colours; I guess that's what complicates things.

It seems you'd need more than 4 nametables for this to fit with 2 buffers anyway. If you crush by a factor of 2 and have 4 nametables it's good, but if you crush by a factor of 4 then you'd need 4 other nametables to achieve double-buffering.

What I'll say might sound stupid, but which is worse CPU-performance-wise: changing the scroll every 4 lines but needing $2005 writes, or changing the scroll every 2 lines with just some very quick $2006 writes? You're a free man, but my intuition tells me you should avoid $2005 if you're not doing any horizontal scrolling.

Post by tokumaru » Wed May 06, 2020 2:45 pm

Bregalad wrote:
Wed May 06, 2020 2:15 pm
> But you also use dithered colours; I guess that's what complicates things.
Yeah, even my old raycaster demo used dithering for shading... The chunky pixels are already ugly as they are, I can't even imagine how bad the 3D scene would look if all I had were flat colors.
> It seems you'd need more than 4 nametables for this to fit with 2 buffers anyway.
I can do it with 4, because the 3D scene only occupies half of the vertical screen space. As for the HUD, a small portion of the scene isn't double buffered (this is written to VRAM last), leaving room in one of the name tables for it. The remaining parts of the screen, which are blank, are displayed by disabling background rendering or by switching blank patterns into the pattern tables.
> What I'll say might sound stupid, but which is worse CPU-performance-wise: changing the scroll every 4 lines but needing $2005 writes, or changing the scroll every 2 lines with just some very quick $2006 writes?
If the MMC3 forces me to wait a whole scanline until the next hblank (seems likely), definitely 4 scanlines, because I have a whole scanline to compute and write the $2005/6 values. However, if there's a way to change the scroll with little to no waiting, changing every 2 scanlines might be better.

Oziphantom
Posts: 861
Joined: Tue Feb 07, 2017 2:03 am

Re: Optimizing scroll changes after MMC3 IRQs

Post by Oziphantom » Thu May 07, 2020 1:09 am

Can you use audio DMA to trigger at the right time? So it will always make it eat a constant amount of clocks? This way the IRQ fires during the DMA, then it will ack the irq as soon as the DMA lifts.

Post by Bregalad » Thu May 07, 2020 1:21 am

Oziphantom wrote:
Thu May 07, 2020 1:09 am
> Can you use audio DMA to trigger at the right time? So it will always make it eat a constant amount of clocks? This way the IRQ fires during the DMA, then it will ack the irq as soon as the DMA lifts.
Nope, it will align poorly with scanlines and requires a significant amount of CPU time lost to "idle wait" the remaining cycles. Not to mention the very significant extra headaches and lost CPU time to synchronize properly.
> If the MMC3 forces me to wait a whole scanline until the next hblank (seems likely), definitely 4 scanlines, because I have a whole scanline to compute and write the $2005/6 values. However, if there's a way to change the scroll with little to no waiting, changing every 2 scanlines might be better.
Well, the $2006-only approach has the advantage of allowing very short IRQs, something like

IRQ:
  sta zp_save_a   ; 3
  lda scroll_l    ; 6
  sta $2006       ; 10
  clc             ; 12
  adc #$20        ; 14
  sta scroll_l    ; 17
  lda scroll_h    ; 20
  adc #$00        ; 22
  sta $2006       ; 26
  sta scroll_h
  
  ; retrigger MMC3 IRQ 2 lines later here (I'm not familiar with the mapper)
  
  lda zp_save_a
  rti

becomes possible. I don't know whether 26 cycles delay between IRQ start and the $2006.2 store is acceptable, but it's definitely faster than f***ing with $2005. The main issue is that you'd have to deal with the overhead of IRQ and setting the new MMC3 IRQs twice as often which might or might not counterbalance this.
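
For reference, the fine Y limit that separates the $2006-only approach from the $2006/5/5/6 one can be seen by decoding what two $2006 writes load into the PPU address register. This sketch follows the commonly documented register layout (yyy NN YYYYY XXXXX), with the first write supplying the upper six bits and clearing bit 14; the function name is made up for illustration.

```python
# Sketch: scroll fields loaded by a pair of $2006 writes.
def decode_2006_pair(first, second):
    """Decode the VRAM address loaded by two $2006 writes into scroll fields."""
    v = ((first & 0x3F) << 8) | (second & 0xFF)   # bit 14 is cleared by the first write
    return {
        "fine_y": (v >> 12) & 0x07,
        "nametable": (v >> 10) & 0x03,
        "coarse_y": (v >> 5) & 0x1F,
        "coarse_x": v & 0x1F,
    }


# No value of the first write can produce a fine Y above 3.
max_fine_y = max(decode_2006_pair(first, 0)["fine_y"] for first in range(256))
print("largest reachable fine Y:", max_fine_y)

# Adding $20 to the address moves down exactly one tile row.
print(decode_2006_pair(0x00, 0x00)["coarse_y"], decode_2006_pair(0x00, 0x20)["coarse_y"])
```

Since bit 14 is always cleared, starting a band at pixel row 6 of a tile is out of reach for $2006 alone, which is why the $2005 writes come into play.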

You might also consider 3-pixel tall metapixels, it seems you had only considered 2 or 4 so far. I agree 4 would be too ugly anyway. And I agree dithering is necessary for things not looking too horrible.

If this is still not enough the only suggestion I have is changing the mapper to something capable of triggering an IRQ at a given CPU cycle time, such as Konami VRC mappers or the Famicom Disk System.

You could interleave your raycasting code with the code dealing with scrolling writes, like I did for the second version of my rotation demo, but I strongly advise against this, as it makes the code extremely unmaintainable and inflexible.

Fiskbit
Posts: 125
Joined: Sat Nov 18, 2017 9:15 pm

Re: Optimizing scroll changes after MMC3 IRQs

Post by Fiskbit » Thu May 07, 2020 1:31 am

I was considering the APU IRQ, myself, but I think the variance in interrupt timing will cause it to drift relative to the screen position. This needs to fire every 4 scanlines, so that error can build up over time. Maybe you could alternate between APU IRQs and scanline IRQs to keep it synced while halving the number of scanlines you have to burn, but I don't think just APU IRQs will work.

If you do have to burn full scanlines, I'd suggest swapping the BG and sprite tables (sprites at $0000, BG at $1000) to make the scanline IRQ trigger later, reducing the amount of time you have to waste. I'd also see if maybe there's a way you can do fixed-time work during the waiting period so it's not all wasted, even if there's some overhead to it.
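
A rough estimate of how much waiting the table swap saves, as a sketch only. It assumes NTSC timing, that the scroll write is aimed at roughly dot 257 of the next scanline, and that the swap moves the IRQ 64 dots later (the figure given earlier in the thread). Handler entry overhead is ignored and the names are illustrative.

```python
# Sketch: CPU cycles between the IRQ and the next hblank, before and after the swap.
DOTS_PER_SCANLINE = 341
HBLANK_START_DOT = 257     # assumed target for the scroll write on the next line
DOTS_PER_CPU_CYCLE = 3     # NTSC (assumption)


def wait_cpu_cycles(irq_dot):
    """CPU cycles from the IRQ dot to the start of the next scanline's hblank."""
    dots = (DOTS_PER_SCANLINE - irq_dot) + HBLANK_START_DOT
    return dots / DOTS_PER_CPU_CYCLE


normal = wait_cpu_cycles(260)         # BG at $0000, sprites at $1000
swapped = wait_cpu_cycles(260 + 64)   # sprites at $0000, BG at $1000
print(f"normal: {normal:.1f} cycles, swapped: {swapped:.1f} cycles, saved: {normal - swapped:.1f}")
```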

I know you really want to use MMC3, but if there isn't a good solution here, the CPU gains in choosing a more suitable mapper might be very worth it.

Post by tokumaru » Thu May 07, 2020 6:59 am

Using DMC IRQs for raster effects requires a significant amount of voodoo that I could never fully grasp, and it comes with plenty of downsides even when pulled off correctly.
> Well, the $2006-only approach has the advantage of allowing very short IRQs, something like
>
> IRQ:
>   sta zp_save_a   ; 3
>   lda scroll_l    ; 6
>   sta $2006       ; 10
>   clc             ; 12
>   adc #$20        ; 14
>   sta scroll_l    ; 17
>   lda scroll_h    ; 20
>   adc #$00        ; 22
>   sta $2006       ; 26
>   sta scroll_h
>
>   ; retrigger MMC3 IRQ 2 lines later here (I'm not familiar with the mapper)
>
>   lda zp_save_a
>   rti
>
> becomes possible. I don't know whether 26 cycles delay between IRQ start and the $2006.2 store is acceptable, but it's definitely faster than f***ing with $2005.
You can't do that because the IRQ latency will cause the scroll change to take place between the automatic X scroll increments that happen at the end of hblank. This is the very problem I described in the first post, which makes the waiting for the next hblank necessary.

And unlike DMC IRQs, I have a lot of practice with $2006/5/5/6 scroll changes and nothing but success when doing it. It's a very simple and reliable trick, and really easy to align with hblank, considering that only the final 2 writes must take place during hblank (if the X scroll doesn't change, only the last cycle of the last write needs to happen during hblank!).
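
For readers who haven't used the trick, the four values involved can be computed as in the sketch below. It follows the commonly documented recipe for a mid-frame X/Y scroll change; the function name and the range checks are my own additions for illustration.

```python
# Sketch: values for the $2006, $2005, $2005, $2006 write sequence.
def scroll_split_writes(x, y, nametable=0):
    """Return the four byte values for $2006, $2005, $2005, $2006, in that order."""
    if not (0 <= x <= 255 and 0 <= y <= 239 and 0 <= nametable <= 3):
        raise ValueError("scroll position out of range")
    return [
        nametable << 2,                        # $2006: nametable select bits
        y,                                     # $2005: Y (fine Y taken from bits 0-2)
        x,                                     # $2005: X (fine X applies immediately)
        (((y & 0xF8) << 2) | (x >> 3)) & 0xFF, # $2006: low address byte, triggers the update
    ]


# Example: start a band at pixel row 6 of tile row 2 (Y = 2 * 8 + 6 = 22), X = 0.
print([f"${value:02X}" for value in scroll_split_writes(0, 22)])
```

Only the last value changes what the PPU is fetching, which matches the point above that the final write is the timing-critical one when X stays constant.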
> You might also consider 3-pixel tall metapixels, it seems you had only considered 2 or 4 so far.
I considered 3, but that would align poorly with the sprites, interfering with sprite clipping. And in my tests, honestly, 3 didn't look much better than 4.
> If this is still not enough the only suggestion I have is changing the mapper to something capable of triggering an IRQ at a given CPU cycle time, such as Konami VRC mappers or the Famicom Disk System.
But then that introduces the problem of IRQ latency buildup, unless there's a mapper out there that automatically reloads the previous cycle count and keeps counting without interference from the programmer, but I'm not aware of any.
> You could interleave your raycasting code with the code dealing with scrolling writes, like I did for the second version of my rotation demo, but I strongly advise against this, as it makes the code extremely unmaintainable and inflexible.
You mean using timed code for the raycasting logic? Yeah, I considered that, but I'm afraid it would cause even more cycles to be wasted than if waiting for the next hblank after an MMC3 IRQ.
Fiskbit wrote:
Thu May 07, 2020 1:31 am
> If you do have to burn full scanlines, I'd suggest swapping the BG and sprite tables (sprites at $0000, BG at $1000) to make the scanline IRQ trigger later, reducing the amount of time you have to waste. I'd also see if maybe there's a way you can do fixed-time work during the waiting period so it's not all wasted, even if there's some overhead to it.
Yeah, that's probably what I'll end up doing. I'll definitely be doing all scrolling-related tasks during that time (updating counters, computing the values to be written to the PPU registers, and so on). I'll see if I can fit anything else in there.
> I know you really want to use MMC3, but if there isn't a good solution here, the CPU gains in choosing a more suitable mapper might be very worth it.
Due to the large number of IRQs necessary (30 or so), IRQ latency buildup becomes a serious problem with mappers using cycle-based counters. The MMC3 may have its issues, but it makes it easy to trigger IRQs every N scanlines, and you almost don't have to care whether the console is NTSC or PAL.
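
To put a number on the buildup, here is a worst-case sketch. It assumes the scenario described above, where the handler re-arms a cycle counter itself so each IRQ inherits the latency of the previous ones; a counter that reloads on its own would behave differently. NTSC timing and the 7-cycle latency bound are also assumptions.

```python
# Sketch: worst-case horizontal drift after a run of IRQs.
DOTS_PER_CPU_CYCLE = 3     # NTSC (assumption)
DOTS_PER_SCANLINE = 341


def worst_case_drift_dots(irq_count, max_latency_cycles, accumulates):
    """Worst-case position error in PPU dots after irq_count interrupts."""
    per_irq = max_latency_cycles * DOTS_PER_CPU_CYCLE
    return per_irq * irq_count if accumulates else per_irq


cycle_based = worst_case_drift_dots(30, 7, accumulates=True)
scanline_based = worst_case_drift_dots(30, 7, accumulates=False)
print(f"cycle counter re-armed by hand: up to {cycle_based} dots "
      f"({cycle_based / DOTS_PER_SCANLINE:.1f} scanlines)")
print(f"scanline counter: up to {scanline_based} dots")
```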
