Writing to VRAM during HBlank

Discussion of hardware and software development for Super NES and Super Famicom.

93143
Posts: 1190
Joined: Fri Jul 04, 2014 9:31 pm

Writing to VRAM during HBlank

Post by 93143 » Tue Mar 31, 2020 7:19 pm

93143 wrote:
Sat Nov 02, 2019 10:50 pm
turboxray wrote:TmEE had a trick where he turned off the screen during hblank that allowed him to transfer more per scanline. He turned the display back on right before the screen went active. But yeah otherwise it just stalls the cpu. Didn't cause any artifacting from what I remember. But I don't remember the exact amount he could transfer before artifacting.
That may also be possible on SNES. I don't know if it's been tested; it's one of the things I haven't gotten around to yet. As with the MD, this would almost certainly prevent sprite compositing.
It works.

I've used all 8 HDMA channels to force blank, write the VRAM address, transfer 20 bytes of data, and turn the screen back on. It appears to work as well as could reasonably be expected.

I have not stress-tested it with regular DMA to work out what the ultimate limits are, what with map and tile prefetch needing to happen before line drawing starts. I should test this.

As we've seen in other experiments, the SNES may become violent when confused. On my SNES, the sprite layer on the line following the data burst not only doesn't work properly (which was expected), it shows white flickering segments despite white being nowhere in the palette. However, turning off sprites with TM/TS ($212C/$212D) removes the artifacting.

Unfortunately, writing the VRAM address during active display (so as to save time during HBlank) seems to have weird results (black lines across the tile, implying a nonzero index corresponding to a colour I didn't specify...?). I haven't gone to great lengths to figure out why this happens. It's possible I screwed something up... However, it does appear that once properly established, the VRAM address carries over between HBlanks, and that VRAM is open for data pretty much immediately after force blank is set.

The attachment shows a BG layer repeating a single tile composed of data with only one nonzero bitplane (the highlights indicate >16 bytes transferred), with a blue background composed of 64x64 sprites. The change in the tile pattern halfway down the screen is due to the tile being overwritten by HDMA.

EDIT: in a later post in this thread, I tested what happens when you use an IRQ and try for the maximum possible bandwidth: viewtopic.php?f=12&t=19896&p=250353#p250178
Attachments
hvdma.sfc (32 KiB)
Last edited by 93143 on Sat May 30, 2020 8:41 pm, edited 1 time in total.

IGH
Posts: 3
Joined: Wed Dec 05, 2018 8:06 pm

Re: Writing to VRAM during HBlank

Post by IGH » Wed Apr 01, 2020 9:15 am

Works on my 1/1/1 :D, awesome new development!

psycopathicteen
Posts: 2935
Joined: Wed May 19, 2010 6:12 pm

Re: Writing to VRAM during HBlank

Post by psycopathicteen » Sun Apr 19, 2020 8:11 am

This will allow full screen video at 20fps. I wonder how much bandwidth per line you can get away with if you use IRQ instead of HDMA.

Has there been any progress with trying to understand SNES OAM internal timing as well?

dougeff
Posts: 2707
Joined: Fri May 08, 2015 7:17 pm
Location: DIGDUG

Re: Writing to VRAM during HBlank

Post by dougeff » Sun Apr 19, 2020 12:58 pm

psycopathicteen wrote:
Sun Apr 19, 2020 8:11 am
This will allow full screen video at 20fps. I wonder how much bandwidth per line you can get away with if you use IRQ instead of HDMA.

Has there been any progress with trying to understand SNES OAM internal timing as well?
FMV? Audio and video, or just video?

How many seconds of FMV would that allow? Would it need a MSU-1 expansion?
nesdoug.com -- blog/tutorial on programming for the NES

psycopathicteen
Posts: 2935
Joined: Wed May 19, 2010 6:12 pm

Re: Writing to VRAM during HBlank

Post by psycopathicteen » Sun Apr 19, 2020 2:34 pm

A 12 MB (96-megabit) SNES cartridge could hold about 21 seconds of uncompressed 256x224 4bpp video at 20 fps.
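
A quick check of that figure, as a sketch only. It assumes "12MB" means 12 MiB (12 × 1024 × 1024 bytes) and counts raw 4bpp pixel data alone, with no tilemap, palette, or audio. The function and variable names are made up for illustration.

```python
def frame_bytes(width, height, bits_per_pixel):
    """Raw size of one frame of packed pixel data, in bytes."""
    return width * height * bits_per_pixel // 8

def seconds_of_video(rom_bytes, width, height, bits_per_pixel, fps):
    """How many seconds of uncompressed video fit in rom_bytes."""
    return rom_bytes / (frame_bytes(width, height, bits_per_pixel) * fps)

ROM_BYTES = 12 * 1024 * 1024  # assumption: "12MB" means 12 MiB
per_frame = frame_bytes(256, 224, 4)
duration = seconds_of_video(ROM_BYTES, 256, 224, 4, 20)
print(per_frame, "bytes per frame")
print(round(duration, 1), "seconds")
```

Under those assumptions the result lands just under 22 seconds, which is consistent with the 21 quoted above.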

93143
Posts: 1190
Joined: Fri Jul 04, 2014 9:31 pm

Re: Writing to VRAM during HBlank

Post by 93143 » Tue Apr 21, 2020 5:55 am

psycopathicteen wrote:
Sun Apr 19, 2020 8:11 am
I wonder how much bandwidth per line you can get away with if you use IRQ instead of HDMA.
So do I. It depends on the S-PPU's BG data access pattern, but not in a complicated way. It shouldn't be especially hard to test, so I figure I'll take a look once I get another chance.
psycopathicteen wrote:
Sun Apr 19, 2020 8:11 am
Has there been any progress with trying to understand SNES OAM internal timing as well?
Not by me. That could get complicated...
dougeff wrote:
Sun Apr 19, 2020 12:58 pm
FMV? Audio and video, or just video?
There should be room for audio, as long as the audio transfer loop is robust enough that HDMA interrupting it doesn't cause problems. The APU doesn't care about blanking.

In fact, if the VRAM address for the HDMA transfers carries over from VBlank (and I see no reason why it shouldn't), you only need 7 channels to transfer 20 bytes per line. This leaves a free channel that could be used for HDMA audio streaming, leaving the rest of the active frame for code (I don't know what you'd need that much compute time for during an FMV, but it's there).
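
To spell out the channel count, here is a rough sketch. It assumes one channel per INIDISP write, one channel for the two-byte VRAM address, and up to 4 bytes of data on each remaining channel. The function name and this layout are illustrative; the actual HDMA tables in the test ROM are not shown in this thread.

```python
import math

HDMA_MAX_BYTES_PER_CHANNEL = 4  # an HDMA channel moves at most 4 bytes per scanline

def hdma_channels_needed(data_bytes, write_vram_address):
    """Channels needed for one HBlank burst under the layout assumed here."""
    blank_on = 1    # one channel writes INIDISP to set force blank
    blank_off = 1   # one channel writes INIDISP to turn the screen back on
    address = 1 if write_vram_address else 0  # 2-byte VRAM address write
    data = math.ceil(data_bytes / HDMA_MAX_BYTES_PER_CHANNEL)
    return blank_on + address + data + blank_off

with_address = hdma_channels_needed(20, write_vram_address=True)
with_carryover = hdma_channels_needed(20, write_vram_address=False)
print(with_address, with_carryover, 8 - with_carryover)
```

With the address written every line, 20 bytes uses all 8 channels; relying on carryover drops it to 7 and leaves one spare.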

In the case where you're using regular DMA in an IRQ to try to pack as much data into one HBlank as possible, you can't use HDMA because it interrupts the DMA. But you could probably slap on an extra DMA channel for audio streaming, because it doesn't have to happen during HBlank, and once you've aligned the DMA start to maximize bandwidth, the timing is plenty good enough.

...of course, if you're using MSU1 for video, you don't really need to stream a BRR audio track, do you?

tepples
Posts: 22013
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Writing to VRAM during HBlank

Post by tepples » Tue Apr 21, 2020 6:28 am

Unless it's MSU1 for video and background music and BRR for streamed sound effects.

Nikku4211
Posts: 63
Joined: Sun Dec 15, 2019 1:28 pm
Location: Bronx, New York

Re: Writing to VRAM during HBlank

Post by Nikku4211 » Sun May 24, 2020 12:17 pm

Awesome. I tested it on my SD2SNES on my RetroDuo, and it seems to work fine. I'll even post a photo.
Image
What could this be used for, though?
psycopathicteen wrote:
Sun Apr 19, 2020 8:11 am
This will allow full screen video at 20fps. I wonder how much bandwidth per line you can get away with if you use IRQ instead of HDMA.
Woah, really? That's awesome. Yeah, I do wonder that, though.
psycopathicteen wrote:
Sun Apr 19, 2020 2:34 pm
A 12MB (or 96 megabits) SNES cartridge could contain 21 seconds of 256x224 4bpp video at 20fps uncompressed.
So a 4MB SNES cartridge can contain a 4BPP Vine video uncompressed? That's cool, though pretty big. Then again, since a lot of Vine videos are shot in vertical aspect ratios, the video might even be smaller than 256 horizontal pixels, and possibly even smaller than that to account for the 8:7 -> 4:3 stretch TVs do.
93143 wrote:
Tue Apr 21, 2020 5:55 am
(I don't know what you'd need that much compute time for during an FMV, but it's there)
[...]
...of course, if you're using MSU1 for video, you don't really need to stream a BRR audio track, do you?
It would be really good if you did, though, because BRR is compressed, which means that it takes up less space than the normal uncompressed MSU-1 audio, which takes up wayyy too much space on my SD card.

Now how compressed is BRR? Well, thanks to SNESBRR, I actually have evidence.
  • An uncompressed .WAV of Malmen's Devotion Shuttle at 16-bit 32 kHz mono is 14,156,076 bytes.
  • The same exact sample at the same exact quality except converted to BRR is only 3,981,393 bytes.
  • The same song at 16-bit 16 kHz mono is 7,062,572 bytes.
  • The 16-bit 16 kHz mono BRR version is only 1,986,345 bytes.
  • The same .WAV at 16-bit 8 kHz mono is 3,531,308 bytes, which can basically fit in a 4 MB ROM.
  • The 16-bit 8 kHz mono BRR version is only 993,177 bytes, literally under a megabyte.
So is BRR effective enough to matter? I'll let you judge, I'm just here with the numbers.

Also, that compute time during an FMV could be used for some decompression, which combined with BRR, would be the bee's knees.
I have an ASD, so empathy is not natural for me. If I hurt you, I apologise.

93143
Posts: 1190
Joined: Fri Jul 04, 2014 9:31 pm

Re: Writing to VRAM during HBlank

Post by 93143 » Mon May 25, 2020 1:49 am

Nikku4211 wrote:
Sun May 24, 2020 12:17 pm
Awesome. I tested it on my SD2SNES on my RetroDuo, and it seems to work fine. I'll even post a photo.
Thanks! More verification is always good.
Nikku4211 wrote:
Sun May 24, 2020 12:17 pm
Now how compressed is BRR?
Not to rain on your parade, but we know the format and it's fixed-rate. It's 9 bytes for 16 samples. So 16-bit audio compresses to 9/32 of its original size. I believe you'll find that this ratio holds for your examples with a high degree of precision.
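
For anyone who wants to check, a small sketch comparing the reported file sizes against the fixed 9/32 ratio. It treats the whole .WAV file as sample data, which is a simplification (headers and any padding are not modelled).

```python
BRR_BLOCK_BYTES = 9       # one BRR block: 1 header byte + 8 data bytes
SAMPLES_PER_BLOCK = 16    # each block decodes to 16 samples
PCM_BYTES_PER_SAMPLE = 2  # 16-bit mono source

RATIO = BRR_BLOCK_BYTES / (SAMPLES_PER_BLOCK * PCM_BYTES_PER_SAMPLE)  # 9/32

# (WAV bytes, BRR bytes) as reported earlier in the thread
reported = [
    (14_156_076, 3_981_393),  # 32 kHz
    (7_062_572, 1_986_345),   # 16 kHz
    (3_531_308, 993_177),     # 8 kHz
]

for wav_bytes, brr_bytes in reported:
    predicted = wav_bytes * RATIO
    print(wav_bytes, brr_bytes, round(predicted), round(brr_bytes / wav_bytes, 6))
```

The measured ratios agree with 9/32 = 0.28125 to within a few parts per million; the leftover few bytes are presumably file headers or padding.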

Also, BRR is very difficult to further compress. Video decompression is a possibility, but could be tough in the general case...
Nikku4211 wrote:
Sun May 24, 2020 12:17 pm
psycopathicteen wrote:
Sun Apr 19, 2020 8:11 am
I wonder how much bandwidth per line you can get away with if you use IRQ instead of HDMA.
Yeah, I do wonder that, though.
So apparently either preloading starts really early, or the S-PPU takes a while to figure out what's going on when you turn on rendering. On my SNES, if I set the interrupt late enough to reliably not force blank before the end of the visible scanline, I can only fit in 29 bytes per HBlank (or perhaps 29.5 if I felt like getting cute with DMA sync detection). This does suggest that HDMA should be able to do 24 if you rely on address carryover and don't bother with an audio streaming channel.
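
A back-of-envelope sketch of where the 24 comes from and what these per-line figures add up to over a frame. It assumes two HDMA channels are spent on the INIDISP writes, 4 bytes per remaining channel, and one burst on each of 224 visible lines; the names are illustrative.

```python
HDMA_CHANNELS = 8
BYTES_PER_CHANNEL = 4   # maximum an HDMA channel transfers per scanline
VISIBLE_LINES = 224     # assumption: a burst on every visible line, no overscan

def hdma_max_data_bytes(channels_for_other_work):
    """Data bytes per line once non-data channels are set aside."""
    return (HDMA_CHANNELS - channels_for_other_work) * BYTES_PER_CHANNEL

# Two channels toggle force blank via INIDISP; the VRAM address carries over
# and no channel is reserved for audio streaming.
hdma_best = hdma_max_data_bytes(channels_for_other_work=2)

per_frame = {
    "HDMA, 20 bytes/line": 20 * VISIBLE_LINES,
    "HDMA, address carryover": hdma_best * VISIBLE_LINES,
    "IRQ + DMA, 29 bytes/line": 29 * VISIBLE_LINES,
}
for label, total in per_frame.items():
    print(label, total)
```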

Disappointing. I was hoping for 39 bytes per line, which looked to be just maybe possible assuming essentially zero wakeup time and zero prep time for the fetched graphics, but apparently the S-PPU doesn't work that way.

Attachment: hvdma_max.sfc (64 KiB)

This test triggers a DMA (consisting of a write to INIDISP, the data shot to VMDATAH*, and another write to INIDISP) in an H-IRQ (using a cycle-counted jump table to line up the DMA trigger write to a particular dot regardless of main code max instruction length) in order to determine whether the active layer wakes up in time given a particular data length. I tried all the 4bpp layers in Mode 1, 2, and 3, and apparently BG1 in Mode 1 loads the latest and thus allows the longest blanking.

If you want to mess with it, the length of the data shot ($1D) is at $80F9 in the file. Note that it is impossible to get DMA aligned to within a single dot even if the code that triggers it is, so the result will not necessarily be the same on every line once the glitching starts.

It doesn't look as interesting this time; I didn't use complex graphics. Just green. The point was not to show that the data goes into VRAM; we already know that from last time. It was to figure out how long the data burst can take before force blank is released too late.

* Odd-sized transfers are not compatible with word writes to VMDATAL/H. One could, however, upload even and odd bytes in separate transfers to VMDATAL and VMDATAH, in which case the HBlank DMA might only need to target one of them, with the other handled entirely in VBlank. I tried writing to VMDATAL instead of VMDATAH, but it didn't improve the maximum transfer size.
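
The three-part DMA described above (INIDISP write, data shot, INIDISP write) could be set up roughly as in the sketch below. This is not the code from hvdma_max.sfc: the use of channels 0 to 2, the source addresses, and the helper names are all hypothetical, and it assumes VMAIN is already configured so the VRAM address increments on writes to VMDATAH. It just lists the register writes involved.

```python
INIDISP = 0x2100   # bit 7 = force blank, bits 3-0 = brightness
VMDATAH = 0x2119   # VRAM data write, high byte
MDMAEN = 0x420B    # writing a channel bitmask here starts general DMA

def dma_channel_writes(channel, b_bus_register, source_addr, source_bank, length):
    """Register writes that set up one A-bus -> B-bus DMA channel.

    Uses transfer mode 0 (every byte goes to the same B-bus register)
    with the source address incrementing.
    """
    base = 0x4300 + 0x10 * channel
    return [
        (base + 0, 0x00),                         # DMAP: A->B, increment, mode 0
        (base + 1, b_bus_register & 0xFF),        # BBAD: low byte of $21xx target
        (base + 2, source_addr & 0xFF),           # A1TL
        (base + 3, (source_addr >> 8) & 0xFF),    # A1TH
        (base + 4, source_bank),                  # A1B
        (base + 5, length & 0xFF),                # DASL
        (base + 6, (length >> 8) & 0xFF),         # DASH
    ]

def hblank_burst_writes(data_length):
    # Hypothetical source locations; the real ROM's layout is not given here.
    blank_on_src, data_src, blank_off_src, bank = 0x8000, 0x8100, 0x8001, 0x00
    writes = []
    writes += dma_channel_writes(0, INIDISP, blank_on_src, bank, 1)    # byte $80
    writes += dma_channel_writes(1, VMDATAH, data_src, bank, data_length)
    writes += dma_channel_writes(2, INIDISP, blank_off_src, bank, 1)   # byte $0F
    writes.append((MDMAEN, 0b00000111))  # channels 0-2 run in ascending order
    return writes

for address, value in hblank_burst_writes(0x1D):
    print(f"${address:04X} <- ${value:02X}")
```

Only the final write to $420B has to land on a particular dot; how the test ROM divides the rest of the setup between the IRQ handler and other code is not stated here.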

Nikku4211
Posts: 63
Joined: Sun Dec 15, 2019 1:28 pm
Location: Bronx, New York

Re: Writing to VRAM during HBlank

Post by Nikku4211 » Tue May 26, 2020 10:49 am

93143 wrote:
Mon May 25, 2020 1:49 am
Nikku4211 wrote:
Sun May 24, 2020 12:17 pm
Now how compressed is BRR?
Not to rain on your parade, but we know the format and it's fixed-rate. It's 9 bytes for 16 samples. So 16-bit audio compresses to 9/32 of its original size. I believe you'll find that this ratio holds for your examples with a high degree of precision.

Also, BRR is very difficult to further compress. Video decompression is a possibility, but could be tough in the general case...
Cool, I just wanted to remind you how it can even apply to the bigger picture. Yeah, BRR is difficult to further compress, but I think BRR compression by itself is enough of an improvement from raw uncompressed PCM in my eyes.
And video decompression can be tough to do on the fly. You're going to need to find the fastest 65816 compatible compression algorithm and make sure it can fit in the limited time you have.
93143 wrote:
Mon May 25, 2020 1:49 am
This test triggers a DMA (consisting of a write to INIDISP, the data shot to VMDATAH, and another write to INIDISP) in an H-IRQ (using a cycle-counted jump table to line up the DMA trigger write to a particular dot regardless of main code max instruction length) in order to determine whether the active layer wakes up in time given a particular data length. I tried all the 4bpp layers in Mode 1, 2, and 3, and apparently BG1 in Mode 1 loads the latest and thus allows the longest blanking.

If you want to mess with it, the length of the data shot ($1D) is at $80F9 in the file. Note that it is impossible to get DMA aligned to within a single dot even if the code that triggers it is, so the result will not necessarily be the same on every line once the glitching starts.

It doesn't look as interesting this time; I didn't use complex graphics. Just green. The point was not to show that the data goes into VRAM; we already know that from last time. It was to figure out how long the data burst can take before force blank is released too late.
Yeah, just tested it on my SD2SNES and it was all green. Success.

93143
Posts: 1190
Joined: Fri Jul 04, 2014 9:31 pm

Re: Writing to VRAM during HBlank

Post by 93143 » Thu May 28, 2020 3:38 pm

Nikku4211 wrote:
Tue May 26, 2020 10:49 am
Yeah, BRR is difficult to further compress, but I think BRR compression by itself is enough of an improvement from raw uncompressed PCM in my eyes.
It absolutely is. And I'm not arguing against its use. I'm just clarifying that the compression ratio is known.

However, it does entail a potentially significant quality loss. You can prefilter the sample to remove the muffling from the gaussian interpolator, but you can't remove the compression noise. Prefiltering might even make the noise worse.

It's certainly more technically impressive to use BRR streaming - I'd lean towards it in a demo for that reason alone. And considering we're discussing multipalette 4bpp video, focusing on the difference between 32 kHz stereo BRR* and 44.1 kHz stereo PCM might be seen as lacking perspective...

On the other hand, even multipalette 256x224 at 20 fps is about 585 KB per second, and Red Book audio is only 172 KB per second. Using BRR instead would reduce the data rate by about 18%. Going to 256x208 at 30 fps makes the difference less than 14%.
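
The percentages can be reproduced with a short sketch. It takes the 585 KB/s video figure as given, reads "KB" as KiB, and assumes the video rate scales linearly with line count and frame rate for the 256x208 at 30 fps case; those last two points are assumptions, not something stated above.

```python
KIB = 1024

video_kib_per_s = 585                          # figure quoted above for 256x224 at 20 fps
pcm_kib_per_s = 44_100 * 2 * 2 / KIB           # 44.1 kHz, 16-bit, stereo
brr_kib_per_s = 32_000 * 2 * 9 / 16 / KIB      # 32 kHz stereo, 9 bytes per 16 samples

def saving_from_brr(video_rate):
    """Fractional drop in total data rate when PCM audio is swapped for BRR."""
    total_with_pcm = video_rate + pcm_kib_per_s
    return (pcm_kib_per_s - brr_kib_per_s) / total_with_pcm

# Assumption: video data rate scales linearly with line count and frame rate.
video_208_30 = video_kib_per_s * (208 / 224) * (30 / 20)

print(round(pcm_kib_per_s, 1), round(brr_kib_per_s, 1))
print(round(saving_from_brr(video_kib_per_s) * 100, 1), "%")
print(round(saving_from_brr(video_208_30) * 100, 1), "%")
```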

(Note that the use of an H-IRQ rather than HDMA significantly reduces the amount of compute time available during the frame.)

*I really must get around to testing my high-bandwidth HDMA streaming scheme. By my count, the d4s stack method is asymptotically 121 SPC700 cycles for every 4 bytes, so it can't handle 32 kHz stereo. Doing it manually would take a large chunk of frame time on the S-CPU side, too much to fit in between H-IRQs and almost certainly too much to permit any sort of video compression even with HDMA used for the VRAM writes.
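
A sketch of the arithmetic behind that footnote, assuming a nominal 1.024 MHz SPC700 clock and ignoring all setup overhead between bursts. The mono and 22 kHz rows are extra comparison points, and the names are illustrative.

```python
SPC700_HZ = 1_024_000        # nominal SPC700 clock; real units drift
CYCLES_PER_4_BYTES = 121     # asymptotic cost quoted for the d4s stack method

def brr_bytes_per_second(sample_rate_hz, channels):
    """BRR stream rate: 9 bytes per 16 samples per channel."""
    return sample_rate_hz * channels * 9 / 16

d4s_limit = SPC700_HZ / CYCLES_PER_4_BYTES * 4   # bytes/s, ignoring setup overhead

for label, rate, chans in [("32 kHz stereo", 32_000, 2),
                           ("32 kHz mono", 32_000, 1),
                           ("22 kHz stereo", 22_000, 2)]:
    need = brr_bytes_per_second(rate, chans)
    print(label, need, "fits" if need < d4s_limit else "does not fit")
```

Since this ignores overhead, a "fits" here is only an upper bound on what the method could sustain in practice.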
Nikku4211 wrote:
Tue May 26, 2020 10:49 am
Yeah, just tested it on my SD2SNES and it was all green. Success.
Excellent.

I just realized that it would probably have been better to use something that shows clearly where the edge of the screen is, since it's very hard to count pixels this way.

On the other hand, the force blank point shows a moving jagged edge if it's not off the RHS, and it's easy to tell when the rendering starts too late on the LHS because it happens in 8-pixel chunks. So I'm reasonably confident that if you can see the edges and they don't look jagged or chopped off, it's working.

Nikku4211
Posts: 63
Joined: Sun Dec 15, 2019 1:28 pm
Location: Bronx, New York

Re: Writing to VRAM during HBlank

Post by Nikku4211 » Thu May 28, 2020 4:14 pm

93143 wrote:
Thu May 28, 2020 3:38 pm
It absolutely is. And I'm not arguing against its use. I'm just clarifying that the compression ratio is known.

However, it does entail a potentially significant quality loss. You can prefilter the sample to remove the muffling from the gaussian interpolator, but you can't remove the compression noise. Prefiltering might even make the noise worse.
Yeah, there's some quality loss, but to be fair, nobody's expecting high fidelity audio from the SNES. To be honest, anything that's understandable like 16kHz or 11kHz is probably good enough in my opinion, and I'd be fine with mono, since it doesn't really make that much of a difference on my TV where both speakers are close together.
93143 wrote:
Thu May 28, 2020 3:38 pm
I really must get around to testing my high-bandwidth HDMA streaming scheme. By my count, the d4s stack method is asymptotically 121 SPC700 cycles for every 4 bytes, so it can't handle 32 kHz stereo.
So what can it handle, then?

93143
Posts: 1190
Joined: Fri Jul 04, 2014 9:31 pm

Re: Writing to VRAM during HBlank

Post by 93143 » Thu May 28, 2020 5:32 pm

Well, we know for certain that it can handle the voiceover in N-Warp Daisakusen... I think it's probably good for 32 kHz mono, and I suspect if you pushed it you could do 22 kHz stereo on NTSC and maybe PAL, but I haven't delved into the details of how his scheme is handled by the rest of the code (i.e., how much setup time it requires between data bursts), or how long his data bursts can be before desync starts to be a risk.

For reference, a scanline is about 65 SPC700 cycles on an NTSC SNES, or maybe half a cycle longer on PAL, so at 121 cycles per 4 bytes, plus overhead, his method can probably push an average of around two bytes per scanline using four-byte HDMA shots in multi-line bursts with multi-line gaps in between the bursts.
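
Those numbers can be sketched as follows, using nominal clock values (1.024 MHz for the SPC700, 1364 master cycles per scanline, and the usual NTSC and PAL master clocks). Real consoles vary, so treat the output as approximate.

```python
SPC700_HZ = 1_024_000            # nominal; the ceramic resonator varies per console
MASTER_CYCLES_PER_LINE = 1364    # typical scanline length in master clocks
NTSC_MASTER_HZ = 21_477_272
PAL_MASTER_HZ = 21_281_370

def spc_cycles_per_scanline(master_hz):
    """SPC700 cycles elapsed during one scanline at the given master clock."""
    line_seconds = MASTER_CYCLES_PER_LINE / master_hz
    return line_seconds * SPC700_HZ

ntsc = spc_cycles_per_scanline(NTSC_MASTER_HZ)
pal = spc_cycles_per_scanline(PAL_MASTER_HZ)
bytes_per_line = 4 / 121 * ntsc   # 121 cycles per 4 bytes, before overhead

print(round(ntsc, 2), round(pal, 2), round(bytes_per_line, 2))
```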

...

The issue is that the APU is on a separate clock from the S-CPU, and it's a ceramic oscillator so its clock rate is not reliable.

In light of this, the d4s scheme tries to do the I/O pickup as quickly as possible by dumping the data to the SPC700's stack, leaving the bulk of the scanline free for timing margin. After the multi-line data burst has completed, he then has to pull the data back off the stack and put it where it's supposed to go, adding another loop that takes 55 cycles per 4 bytes of data transferred, on top of the 66 cycles of the actual I/O pickup loop.

My scheme uses self-modifying code to put the data where it's supposed to go during the pickup loop, so as to avoid needing a cleanup loop. Unfortunately this means I have less timing margin than the d4s method. Combined with the fact that I'm targeting much higher data rates and thus potentially longer bursts, this means I need much tighter timing control, and I had to get fancy with the delay section of the pickup loop.

I also haven't tested my method (been busy with other stuff, can't spare a lot of hobby time), so it's entirely possible I've missed something important... but if it works, 32 kHz stereo or even three simultaneous 22 kHz streams should be no problem. I originally devised it as a workaround for the audio memory constraints that single-handedly ruined the Street Fighter Alpha 2 port...

...

EDIT: My post with the IRQ test upthread has gotten a little buried, so here's a link: viewtopic.php?f=12&t=19896&p=250353#p250178
This test is an attempt to get the maximum amount of data into VRAM during HBlank without disrupting the rendering of the BG layer. It turns out that you can push modestly more data with DMA in an IRQ than you can with HDMA.
