It is currently Sat Sep 21, 2019 7:09 pm

All times are UTC - 7 hours





Posted: Tue May 14, 2019 4:21 pm

Joined: Thu Apr 18, 2019 9:13 am
Posts: 161
Most of the approaches I've seen described for using DMC interrupts for raster timing require compensation for some rather large timing uncertainties. From what I can tell, at least on Mesen and my own console, there's a much easier and more efficient technique. I haven't figured out the best way to quickly get the DMC synchronized with the frame, but once a lock is achieved it remains very stable. I've posted a quick test that was sub-optimal, but I've since done some calculations for timings that lock things down much better, and am working on a better demo.

A key observation is that changes to the DMC rate take effect only after the bit currently being transmitted has finished. If an interrupt occurs and one immediately sets the DMC rate to a slow value, waits for the bit in progress to be sent, and then sets the DMC to a faster rate, the first bit will use whatever rate was set previously, the second will use the slow rate, and the remaining six (as well as the first bit of the next byte) will use the faster rate. If one uses $8E (72 cycles per bit) as the fast rate and alternates between $80 (428 cycles) and $81 (380 cycles) for the slow rate, then every pair of interrupts will take (428 + 7×72) + (380 + 7×72) = 1816 cycles, only about 2.7 cycles short of the 1818.67 cycles required for 16 scan lines.
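As a quick check of that arithmetic, here is a minimal Python sketch. `NTSC_RATES` is the standard NTSC DMC period table in CPU cycles, indexed by the low nibble of $4010; `irq_period` is a hypothetical helper that assumes the simplified "one slow bit plus seven fast bits" model described above.

```python
from fractions import Fraction

# NTSC (2A03) DMC bit periods in CPU cycles, indexed by the low nibble of $4010.
NTSC_RATES = (428, 380, 340, 320, 286, 254, 226, 214,
              190, 160, 142, 128, 106, 84, 72, 54)

CYCLES_PER_SCANLINE = Fraction(341, 3)  # 341 PPU dots at 3 dots per CPU cycle


def irq_period(slow_index, fast_index):
    """Cycles between IRQs when one bit uses the slow rate and seven use the fast rate."""
    return NTSC_RATES[slow_index] + 7 * NTSC_RATES[fast_index]


pair = irq_period(0x0, 0xE) + irq_period(0x1, 0xE)  # $80 then $81, fast rate $8E
target = 16 * CYCLES_PER_SCANLINE                    # 16 scanlines in CPU cycles
print(pair, float(target), float(target - pair))
```

Using exact fractions keeps the 2/3-cycle remainder of each scanline visible instead of rounding it away.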

Using a few other DMC values at the start and end of the frame, it's possible to arrange for a combination of times that will be either 0.5 cycles less than a frame or 1.5 cycles more (with rendering enabled, an NTSC frame averages 29780.5 CPU cycles, so these are 29780 and 29782 cycles). Using the former on three out of four frames keeps the jitter within 1.5 cycles of what could be achieved with a mapper's IRQ. CPU overhead will be slightly greater because of the need to perform the second rate-setting write after the first bit gets clocked out, but it should only be about 10% even with raster splits every eight lines.
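The three-short, one-long frame pattern can be checked the same way. This sketch measures the cumulative offset of the DMC schedule against the 29780.5-cycle average NTSC frame. Real frames alternate between 89342 and 89341 PPU dots, so treating every frame as the average is a simplification, and the helper names are made up for illustration.

```python
from fractions import Fraction

AVERAGE_FRAME = Fraction(59561, 2)      # 29780.5 CPU cycles (NTSC, rendering enabled)
SHORT_FRAME, LONG_FRAME = 29780, 29782  # even DMC totals: 0.5 under, 1.5 over


def frame_schedule(n_frames):
    """Short total on three frames out of four, long total on every fourth."""
    return [LONG_FRAME if i % 4 == 3 else SHORT_FRAME for i in range(n_frames)]


def drift(schedule):
    """Cumulative offset (CPU cycles) of the DMC schedule from the average frame."""
    offsets, total = [], 0
    for i, length in enumerate(schedule, start=1):
        total += length
        offsets.append(total - i * AVERAGE_FRAME)
    return offsets


offsets = drift(frame_schedule(12))
print(min(offsets), max(offsets))
```

The offset walks down by half a cycle per short frame and snaps back to zero on the long one, which is where the 1.5-cycle bound comes from.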


Posted: Tue May 14, 2019 6:59 pm

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 11412
Location: Rio de Janeiro - Brazil
Are you saying that by varying the playback rate a certain way it may be possible to achieve proper CPU-APU synchronization, so that we can start playback at a moment of our choice (this is the key part!) and have the IRQ fire a constant amount of time later? That'd be amazing! Abusing DMC IRQs for raster effects is a cool concept and all, but not very practical with all the timing compensation that's currently needed.

This interests me quite a lot, because I'm working on something that requires splits every 8 scanlines, and I'd like to avoid the MMC3 (or other "complex" mappers), if at all possible. Very precise timing is not even a requirement for me, because I only need to switch back and forth between 2 name tables, and $2000.0 doesn't affect the scroll immediately, only when hblank starts, so I have literally a whole scanline to do the switch, so even if the timing drifts with each IRQ, I have over 100 cycles of tolerance each frame.


Posted: Tue May 14, 2019 8:05 pm

Joined: Sun Apr 13, 2008 11:12 am
Posts: 8568
Location: Seattle
If I understand correctly, it's nowhere near that magical, but it's still pretty powerful. I think the important part is:
- IRQ
- DPCM timer reloaded
- set new period ("p1"), busy wait "p0" cycles until...
- DPCM timer reloaded
- set new period ("p2"), set "p0" to "p2", return from IRQ handler

So you should be able to schedule an IRQ for any number of CPU cycles that's p0 + p1 + p2*(128*w4013+6) in the future (where w4013 is the value written to $4013), at the cost of having to busy-wait one "p0" period in the IRQ handler.

You still have to generate the synchronization i.e. there's still no magic to restart the FIFO. The important improvement over tepples's previous work is that you don't have to compensate for up to 4 scanlines of slack after the IRQ, and that there's enough precision here to be able to schedule the IRQ to roughly synchronize to the NMI handler.

This "only" lets you move the IRQ that would have been generated without using this technique over a range of 3.3 scanlines ((428-54)·3÷341), depending on "p1" and "p2". So there are still a few huge holes in the resulting period table. But this is still pretty useful!
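A sketch of the scheduling formula above, plus the 3.3-scanline figure, under the post's assumptions. `cycles_until_irq` and `adjustment_range_scanlines` are hypothetical helpers; `w4013` is the value written to $4013, which makes each sample 16*w4013 + 1 bytes long.

```python
from fractions import Fraction

NTSC_RATES = (428, 380, 340, 320, 286, 254, 226, 214,
              190, 160, 142, 128, 106, 84, 72, 54)


def cycles_until_irq(p0, p1, p2, w4013=0):
    """Cycles from one DMC IRQ to the next, per p0 + p1 + p2*(128*w4013 + 6).

    p0: period already loaded when the IRQ arrives (the previous p2)
    p1: period written right away, used for the next bit
    p2: period written after the first reload, used for every remaining bit
    """
    bits_per_sample = 8 * (16 * w4013 + 1)  # $4013 selects 16*w4013 + 1 bytes
    return p0 + p1 + p2 * (bits_per_sample - 2)


def adjustment_range_scanlines(rates=NTSC_RATES):
    """Spread between the slowest and fastest periods, expressed in scanlines."""
    return (max(rates) - min(rates)) * Fraction(3, 341)


print(cycles_until_irq(72, 428, 72), float(adjustment_range_scanlines()))
```

With 1-byte samples (w4013 = 0) and p0 equal to p2, this reduces to p1 + 7*p2, the same figure used earlier for the $80/$81 alternation.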


Posted: Tue May 14, 2019 11:54 pm

Joined: Fri Nov 12, 2004 2:49 pm
Posts: 7741
Location: Chexbres, VD, Switzerland
tokumaru wrote:
This interests me quite a lot, because I'm working on something that requires splits every 8 scanlines, and I'd like to avoid the MMC3 (or other "complex" mappers), if at all possible. Very precise timing is not even a requirement for me, because I only need to switch back and forth between 2 name tables, and $2000.0 doesn't affect the scroll immediately, only when hblank starts, so I have literally a whole scanline to do the switch, so even if the timing drifts with each IRQ, I have over 100 cycles of tolerance each frame.

Also this opens the possibility of 8x16 attributes (rather than 16x16) without any special mapper, which can be interesting for quite a lot of cases.


Posted: Wed May 15, 2019 1:14 am

Joined: Fri Jan 24, 2014 9:05 am
Posts: 193
Location: Hungary
This is some great news! If this technique is demonstrated to produce those results, I might just have to go back and re-write my IRQ handler. For now, I could get away with the 2-3 scanlines of jitter (somehow I could only get this to work with rate $C) by designing my graphics such that solid colors or horizontal color bars are used around the split points, but I've always wanted to create that particular demoscene-esque sine wave distortion on things like lava or underwater segments.


Posted: Wed May 15, 2019 7:25 am

Joined: Thu Apr 18, 2019 9:13 am
Posts: 161
lidnariq wrote:
You still have to generate the synchronization i.e. there's still no magic to restart the FIFO.


Synchronization only needs to happen once, ever, if one never lets the DMC run dry. Individual IRQ responses may have up to 7 cycles of uncertainty, depending upon the longest opcode that may be running when the IRQ fires, but provided that the DMC reloads happen within fairly broad timing windows, this will not cause any cumulative timing uncertainty.

My present Ruby Runner plan is to use this to show each line in the name table twice, with different tile sets, so that pairs of rows of metatiles will require 66 stores rather than 128 [put tile data in rows 1-2, 5-6, 9-10, 13-14, 17-18, 21-22, and 25-26]. One would still need to write all 64 bytes of attribute data, but that would still represent a huge savings in time to write the name tables. If one extended vblank and exploited the "SAX" opcode ["store A and X"], the time to update a 16x12 grid of metatiles would be 2,388 cycles, which could perhaps be done in a single frame if code didn't need to update attributes.

BTW, I'd suggest not using NMI except for purposes of establishing initial sync. If a cart starts out by waiting for a couple of vblank events, sets the DMC rate to 54 cycles, synchronizes itself with an OAM DMA, and is executing a bunch of NOPs when the NMI hits, that would establish CPU synchronization to within two CPU cycles. If the IRQ vector is a JMP in RAM, and if the NMI handler loads X with 0, enables IRQs, and starts executing INX instructions, the IRQ handler could then use the value in X to determine how many pairs of cycles it needs to let the DMC "slip". Once that is done, there would be no further need for the NMI unless it is for some reason necessary to let the DMC lose synchronization [e.g. because one needs to write to flash and doesn't have enough RAM for a flash-write routine that can keep track of elapsed time].

Note also that NTSC timings will be affected by whether rendering is enabled at the end of vblank. Either condition can be accommodated, but code which disables rendering will either need to ensure that rendering is enabled, at least briefly, at the end of vblank (it could set all palette entries to the same color if desired, so nothing would be visible), or else keep track of how many frames didn't have rendering enabled and adjust display timings accordingly.
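To make the rendering dependency concrete: an NTSC frame is 262 lines of 341 PPU dots (89342 dots), and on odd frames with rendering enabled one dot is skipped. The sketch below, with hypothetical helper names, totals a run of frames in CPU cycles at 3 dots per cycle. It assumes the run starts on an even frame and that each flag records whether rendering was on at the point where the skip is decided.

```python
from fractions import Fraction

DOTS_PER_LINE = 341
LINES_PER_FRAME = 262  # NTSC 2C02: visible, post-render, vblank and pre-render lines


def frame_dots(odd_frame, rendering_enabled):
    """PPU dots in one NTSC frame; odd frames drop one dot when rendering is enabled."""
    full = DOTS_PER_LINE * LINES_PER_FRAME  # 89342
    return full - 1 if (odd_frame and rendering_enabled) else full


def frames_to_cpu_cycles(rendering_flags):
    """CPU cycles spanned by consecutive frames, assuming the first one is an even frame."""
    dots = sum(frame_dots(i % 2 == 1, on) for i, on in enumerate(rendering_flags))
    return Fraction(dots, 3)  # 3 PPU dots per CPU cycle


print(frames_to_cpu_cycles([True, True]), frames_to_cpu_cycles([False, False]))
```

Two rendered frames come to a whole number of CPU cycles, while two unrendered frames leave a third of a cycle over, which is the drift a game would have to account for.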


Last edited by supercat on Wed May 15, 2019 10:53 am, edited 1 time in total.

Posted: Wed May 15, 2019 9:06 am

Joined: Thu Apr 18, 2019 9:13 am
Posts: 161
tokumaru wrote:
Are you saying that by varying the playback rate a certain way it may be possible to achieve proper CPU-APU synchronization, so that we can start playback at a moment of our choice (this is the key part!) and have the IRQ fire a constant amount of time later? That'd be amazing! Abusing DMC IRQs for raster effects is a cool concept and all, but not very practical with all the timing compensation that's currently needed.

This interests me quite a lot, because I'm working on something that requires splits every 8 scanlines, and I'd like to avoid the MMC3 (or other "complex" mappers), if at all possible. Very precise timing is not even a requirement for me, because I only need to switch back and forth between 2 name tables, and $2000.0 doesn't affect the scroll immediately, only when hblank starts, so I have literally a whole scanline to do the switch, so even if the timing drifts with each IRQ, I have over 100 cycles of tolerance each frame.


Only "start" playback once. Never let it stop after that. If you ever know precisely where the beam was at some point in time when the DMC started sending its last byte with a known reload value, and can account for every DMC reload that occurs after that, as well as the state of rendering at the end of each vblank (to know whether that frame drops a dot), you can know precisely where the beam will be every time a DMC reload is triggered.
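Predicting the beam from a known reference point then reduces to bookkeeping. In this sketch (hypothetical names, NTSC only), `elapsed_cycles` would be the sum of the DMC reload periods since the reference, `frame_lengths` the per-frame dot counts, and scanlines are numbered 0-261 from the start of each frame with the pre-render line last, so a skipped dot falls at the very end of a short frame.

```python
DOTS_PER_LINE = 341
NTSC_DOTS_PER_CPU_CYCLE = 3


def beam_position(elapsed_cycles, frame_lengths, start_dot=0):
    """Locate the beam a known number of CPU cycles after a reference point.

    start_dot:     dots from the start of the current frame (line 0, dot 0) to the reference
    frame_lengths: dot counts of this frame and the following ones (89342, or 89341
                   on odd frames with rendering enabled)
    Returns (frames_completed, scanline, dot).
    """
    dots = start_dot + NTSC_DOTS_PER_CPU_CYCLE * elapsed_cycles
    for frames_completed, length in enumerate(frame_lengths):
        if dots < length:
            line, dot = divmod(dots, DOTS_PER_LINE)
            return frames_completed, line, dot
        dots -= length
    raise ValueError("frame_lengths does not cover the elapsed time")


# Eight pairs of 932 + 884 cycle IRQs after a reference at the top of the frame.
print(beam_position(8 * (932 + 884), [89342]))
```

The result lands just short of line 128, the accumulated 2 2/3-cycle shortfall per pair showing up as a dot offset.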

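Like you, I'm after an 8-line IRQ. Sixteen scan lines take 1818.667 cycles. IRQ events of 428+7*72 and 380+7*72 cycles together yield 1816, so each pair lands about 2.67 cycles earlier, relative to the beam, than the pair before. On each succeeding pair, add one more delay instruction before your video stores: an extra NOP (2 cycles) on 1/3 of the pairs and an extra "bit $FF" (3 cycles) on the other 2/3. All pairs should then (after compensation) start within a cycle of the same spot on each line.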
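Reading "succeeding pairs" as one extra delay instruction per pair, so the padding grows along with the drift, the pattern can be checked numerically. The (2, 3, 3) ordering and the helper name are assumptions; any ordering with one NOP per three pairs averages the same 8/3 cycles per pair.

```python
from fractions import Fraction

SIXTEEN_LINES = Fraction(16 * 341, 3)   # 1818 2/3 CPU cycles
PAIR = (428 + 7 * 72) + (380 + 7 * 72)  # 1816 cycles per IRQ pair
DRIFT_PER_PAIR = SIXTEEN_LINES - PAIR   # 8/3: how much earlier each pair lands
PADDING_PATTERN = (2, 3, 3)             # extra cycles per pair: NOP once, "bit $FF" twice


def compensation_errors(n_pairs):
    """Cumulative padding minus cumulative drift, adding one delay instruction per pair."""
    errors, padding = [], 0
    for k in range(1, n_pairs + 1):
        padding += PADDING_PATTERN[(k - 1) % len(PADDING_PATTERN)]
        errors.append(padding - k * DRIFT_PER_PAIR)
    return errors


print(max(abs(e) for e in compensation_errors(30)))
```

The residual error cycles through -2/3, -1/3 and 0, staying inside the "within a cycle" bound.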


Posted: Thu May 16, 2019 7:33 am

Joined: Thu Apr 18, 2019 9:13 am
Posts: 161
lidnariq wrote:
If I understand correctly, it's nowhere near that magical, but it's still pretty powerful.


I'd say it's pretty magical. Here's a demo which shows blips precisely every eight scan lines, vertically stacked except for the top one, and movable by joystick, without ever looking for video synchronization (via NMI or polling $2002) after it starts [it starts at a somewhat arbitrary spot in the frame, with no attempt at initial synchronization]. The code uses four sets of DMC event lengths (1/2 cycle less than a frame, 1 1/2 cycles greater than a frame, about a scan line less, and about a scan line more) and simply cycles among those as appropriate. Because the main loop includes 7-cycle "inc abs,x" instructions, IRQ response time adds six cycles of jitter, but that uncertainty is not cumulative.
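Why the response jitter doesn't accumulate can be shown with a small contrast. In this sketch (illustrative latencies of 0-6 cycles, a fixed 932-cycle period, hypothetical helper names), DMC events stay on a fixed grid no matter when the CPU responds, while a timer re-armed from inside the handler would carry every late response forward.

```python
def free_running_starts(period, latencies):
    """DMC-driven IRQs: events sit on a fixed grid; latency only delays the handler."""
    return [k * period + lat for k, lat in enumerate(latencies)]


def rearmed_starts(period, latencies):
    """Hypothetical contrast: a timer re-armed from inside each handler."""
    starts, event = [], 0
    for lat in latencies:
        start = event + lat
        starts.append(start)
        event = start + period  # the next event is scheduled from the late start
    return starts


def timing_errors(starts, period):
    """How far each handler start is from the ideal k * period."""
    return [s - k * period for k, s in enumerate(starts)]


latencies = [0, 3, 6, 1, 5, 2, 4] * 10  # illustrative 0-6 cycle response delays
print(max(timing_errors(free_running_starts(932, latencies), 932)))
print(timing_errors(rearmed_starts(932, latencies), 932)[-1])
```

The free-running error never exceeds the worst single latency, while the re-armed error is the running sum of all of them.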

This kind of code should probably not try to synchronize with the beam while the main-line code is running a mix of 2-7 cycle instructions, but limiting the mainline to shorter instructions until beam synchronization is achieved shouldn't be a real loss. Many television sets will take a noticeable fraction of a second to achieve sync after the NES powers on, so it shouldn't matter if a game cart takes a half second on startup to establish sync.


Attachments:
File comment: NTSC joystick raster demo
rasterJoystick.nes [40.02 KiB]
Posted: Thu May 16, 2019 8:17 am

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 11412
Location: Rio de Janeiro - Brazil
I'm still trying to wrap my head around this, but this demo does indeed look very impressive! This definitely looks stable enough for what I need, and the amount of jitter suggests this would work well for other kinds of raster effects too. Great job!


Posted: Thu May 16, 2019 8:55 am

Joined: Thu Apr 18, 2019 9:13 am
Posts: 161
tokumaru wrote:
I'm still trying to wrap my head around this, but this demo does indeed look very impressive! This definitely looks stable enough for what I need, and the amount of jitter suggests this would work well for other kinds of raster effects too. Great job!


Think of the DMC as two counters, the first of which counts CPU cycles and is reloaded with a configurable value from 54 to 428 every time it counts to zero, and the second of which counts once each time the first counter hits zero and is reloaded with eight each time it hits zero. The DMC will issue an IRQ each time the second counter hits zero, provided at least one value with bit 4 set (and no values with bit 4 clear) has been written to $4015 since the previous time it hit zero.

Whenever you get an IRQ, the first counter will have just been reloaded with the configured value and the second counter will be loaded with eight. If you change the configuration by writing a value in the range $80 to $8F to $4010 between when the IRQ arrived and the next time the first counter reaches zero, that change will be applied when it hits zero. If you change the configuration again, it will affect subsequent reloads until the next time the configuration is changed. The easiest way to exploit this is to write the value twice during most interrupts.
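The two-counter model translates into a short event-level simulation. This is a sketch of the model as described above, not of the hardware: it ignores DMA fetch timing and the $4015 bookkeeping, takes periods in CPU cycles rather than $4010 values, and uses made-up names.

```python
def simulate_irqs(initial_rate, writes, n_irqs):
    """Event-level model of the two DMC counters.

    initial_rate: period (CPU cycles) loaded into counter 1 at cycle 0, just after an IRQ
    writes:       (cpu_cycle, period) pairs sorted by time, standing in for $4010 writes
    Counter 1 reloads from the most recently written period each time it expires;
    counter 2 counts eight expirations and then signals an IRQ.
    """
    configured = initial_rate
    period = initial_rate
    now = 0
    bits_left = 8
    pending = list(writes)
    irq_times = []
    while len(irq_times) < n_irqs:
        now += period  # counter 1 expires
        while pending and pending[0][0] < now:
            configured = pending.pop(0)[1]  # writes before the expiry affect this reload
        period = configured
        bits_left -= 1
        if bits_left == 0:  # counter 2 hits zero: IRQ, reload with eight
            bits_left = 8
            irq_times.append(now)
    return irq_times


# IRQ at cycle 0 with 72 in effect: write 428 at once, 72 after the first reload;
# after the next IRQ (cycle 932) do the same with 380.
print(simulate_irqs(72, [(10, 428), (100, 72), (940, 380), (1040, 72)], 2))
```

Each interval comes out as prevTime2 + Time1 + 6*Time2, and the pair sums to the 1816 cycles mentioned earlier in the thread.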

For purposes of computing total durations, you can simply think of the time for each IRQ cycle as being Time1+7*Time2. Things are slightly complicated by the fact that, when Time1 and Time2 aren't equal, the minimum execution time for an IRQ handler will be the previous event's Time2, and by the fact that the time between one IRQ event and the next will actually be prevTime2 + Time1 + 6*Time2. Still, this approach offers fine enough control over IRQ lengths that by selectively using either
Code:
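    irqHandler23:
      save CPU regs
      insert delay here            ; line the PPU writes up with the beam
      update PPU regs
      set rate 1st time            ; $80-$8F to $4010, applied at the next timer reload
      prepare for irqHandler24     ; e.g. retarget the JMP in RAM used as the IRQ vector
      insert delay here            ; wait until that first reload has happened
      set rate 2nd time            ; applied to the remaining bits of the byte
      restore CPU regs and return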

and
Code:
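    irqHandler24:
      save CPU regs
      set rate 1st time            ; early, well before the first timer reload
      insert delay here
      prepare for irqHandler25
      update PPU regs
      insert delay here            ; wait until the first reload has happened
      set rate 2nd time            ; applied to the remaining bits of the byte
      restore CPU regs and return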

one can avoid having to extend the IRQ past the first timer reload (which would in most cases probably be 72 cycles).


Posted: Thu May 16, 2019 11:12 am

Joined: Sun Apr 13, 2008 11:12 am
Posts: 8568
Location: Seattle
supercat wrote:
I'd say it's pretty magical.
Given that tokumaru asked
tokumaru wrote:
so that we can start playback at a moment of our choice (this is the key part!)
I stand by "nowhere near that magical", specifically in the context in which I put it.

In order to maintain sync, you still have to be able to divide your screen into a finite number of IRQs that together sum up to the number of cycles in some integer number of vblanks. You've obviously found at least one such division, but just how generalizable is it?


Posted: Thu May 16, 2019 11:31 am

Joined: Thu Apr 18, 2019 9:13 am
Posts: 161
lidnariq wrote:
supercat wrote:
I'd say it's pretty magical.
Given that tokumaru asked
tokumaru wrote:
so that we can start playback at a moment of our choice (this is the key part!)
I stand by "nowhere near that magical", specifically in the context in which I put it.

In order to maintain sync, you still have to be able to divide your screen into a finite number of IRQs that together sum up to the number of cycles in some integer number of vblanks. You've obviously found at least one such division, but just how generalizable is it?


It generalizes very well, actually. Every even number of cycles that's greater than 1554 is achievable.


Posted: Thu May 16, 2019 2:15 pm

Joined: Sun Apr 13, 2008 11:12 am
Posts: 8568
Location: Seattle
supercat wrote:
Every even number of cycles that's greater than 1554 is achievable.
Huh, it's even better than that. Although 1552 only has one combination (1·54 + 7·214), the longest unachievable periods I found were at 1426, followed by 1342, 1338, and 1300.

The 2A07's table works a little better: the longest unachievable period is 1118 cycles. And every even period between 1120 and 2240 is achievable, so every longer even period is also achievable.
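The achievable totals can be enumerated by treating each IRQ as costing Time1 + 7*Time2 (the bookkeeping simplification from earlier in the thread) and running an unbounded-knapsack search over those lengths. The PAL table is transcribed from the commonly published 2A07 rate table and is worth double-checking; the helper names are hypothetical.

```python
NTSC_RATES = (428, 380, 340, 320, 286, 254, 226, 214,
              190, 160, 142, 128, 106, 84, 72, 54)
# 2A07 (PAL) table as commonly published; double-check before relying on it.
PAL_RATES = (398, 354, 316, 298, 276, 236, 210, 198,
             176, 148, 132, 118, 98, 78, 66, 50)


def single_irq_lengths(rates):
    """Map each IRQ length Time1 + 7*Time2 to the (Time1, Time2) pairs producing it."""
    lengths = {}
    for t1 in rates:
        for t2 in rates:
            lengths.setdefault(t1 + 7 * t2, []).append((t1, t2))
    return lengths


def reachable_totals(limit, rates):
    """reachable[n] is True if n cycles is a sum of single-IRQ lengths."""
    steps = sorted(single_irq_lengths(rates))
    reachable = [False] * (limit + 1)
    reachable[0] = True
    for total in range(1, limit + 1):
        reachable[total] = any(reachable[total - s] for s in steps if s <= total)
    return reachable


def unreachable_even(limit, rates):
    """Even totals up to limit that no sequence of IRQs can produce."""
    reachable = reachable_totals(limit, rates)
    return [n for n in range(2, limit + 1, 2) if not reachable[n]]


print(single_irq_lengths(NTSC_RATES)[1552])
print(max(unreachable_even(4000, NTSC_RATES)))
print(max(unreachable_even(4000, PAL_RATES)))
```

Every rate is even, so odd totals are never reachable; printing the largest unreachable even totals is a way to cross-check the thresholds quoted above.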


Posted: Thu May 16, 2019 3:19 pm

Joined: Thu Apr 18, 2019 9:13 am
Posts: 161
lidnariq wrote:
supercat wrote:
Every even number of cycles that's greater than 1554 is achievable.
Huh, it's even better than that. Although 1552 only has one combination (1·54 + 7·214), the longest unachievable periods I found were at 1426, followed by 1342, 1338, and 1300.

The 2A07's table works a little better: the longest unachievable period is 1118 cycles. And every even period between 1120 and 2240 is achievable, so every longer even period is also achievable.


Given that the only periods which need to be matched exactly are 29780 and 29782 for NTSC 2C02, 2C03, and 2C04; 33246 and 33248 for PAL 2C07; and 35464 for Dendy, the only arrangements that can't be resolved easily are those that would have fewer than 1552 cycles left at the end after everything else is accounted for, or those which need splits less than four lines apart. The latter would simply require busy-waiting within the IRQ between nearby splits (e.g. if one wanted splits every two scan lines, one would need to set up an IRQ for every four and handle two scan lines with each interrupt, burning about 60% CPU), and the former should simply be a non-issue given that even NTSC vblank is 2273 cycles.
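The target periods follow from the frame lengths. In this sketch, NTSC uses 262 lines with the odd-frame skipped dot averaged in (rendering enabled), PAL uses 312 lines at 3.2 dots per CPU cycle, and Dendy uses 312 lines at 3 dots per CPU cycle; `even_frame_mix` is a hypothetical helper that picks the two bracketing even totals and how often the longer one is needed.

```python
import math
from fractions import Fraction

# region: (average PPU dots per frame, PPU dots per CPU cycle)
REGIONS = {
    "NTSC": (Fraction(2 * 341 * 262 - 1, 2), Fraction(3)),  # odd-frame dot skip averaged in
    "PAL": (Fraction(341 * 312), Fraction(16, 5)),
    "Dendy": (Fraction(341 * 312), Fraction(3)),
}


def average_frame_cycles(region):
    """Average frame length in CPU cycles."""
    dots, dots_per_cycle = REGIONS[region]
    return dots / dots_per_cycle


def even_frame_mix(region):
    """Bracketing even totals (short, short + 2) and the share of frames needing the long one."""
    avg = average_frame_cycles(region)
    short = math.floor(avg)
    short -= short % 2
    return short, short + 2, (avg - short) / 2


for name in REGIONS:
    print(name, average_frame_cycles(name), even_frame_mix(name))
```

NTSC needs the long total on one frame in four, PAL on three in four, and Dendy's frame is already an even number of cycles.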


Posted: Thu May 16, 2019 3:59 pm

Joined: Sun Apr 13, 2008 11:12 am
Posts: 8568
Location: Seattle
supercat wrote:
the only arrangements that can't be resolved easily are those that would have less than 1552 [cycles] at the end after everything else is accounted for
Er, that's my point. The threshold is 1428 cycles, not 1554, because 1552 is achievable. Admittedly it's comparatively expensive, because that bit period of 214 cycles means busy-waiting for almost two scanlines in the subsequent IRQ, but it's still achievable.

I suppose, given that you have this level of precision, sometimes one might prefer to use IRQs where p1=p2 to skip busy-waiting to save on CPU time, and only use the more precise ones near the end of the frame to achieve PPU synchronization.
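The CPU cost of that trade-off is easy to estimate. This sketch counts only the busy-wait for the first reload (none when p1 equals p2, since a single rate write suffices), so it is a lower bound that ignores handler entry, exit, and PPU writes; the names are hypothetical.

```python
from fractions import Fraction

CYCLES_PER_SCANLINE = Fraction(341, 3)  # NTSC


def busy_wait_cycles(p0, p1, p2):
    """Cycles spent waiting for the first reload before the second rate write.
    With p1 == p2 one write suffices, so there is nothing to wait for."""
    return 0 if p1 == p2 else p0


def busy_wait_fraction(p0, p1, p2):
    """Busy-wait share of the interval p0 + p1 + 6*p2 (1-byte samples)."""
    return Fraction(busy_wait_cycles(p0, p1, p2), p0 + p1 + 6 * p2)


print(float(busy_wait_fraction(72, 428, 72)))  # slow-rate IRQ in the 8-line scheme
print(float(busy_wait_fraction(72, 72, 72)))   # p1 == p2: no busy-wait
print(float(214 / CYCLES_PER_SCANLINE))        # wait after a 214-cycle bit period, in lines
```

For the alternating $80/$81 scheme, two 72-cycle waits per 1816-cycle pair come to roughly 8% before any handler overhead, while the 214-cycle case costs close to two scanlines in a single IRQ.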

