Some basic questions...

Señor Ventura
Posts: 233
Joined: Sat Aug 20, 2016 3:58 am

Some basic questions...

Post by Señor Ventura »

Hello.

I have always taken some facts about the SNES hardware for granted, but on closer analysis I've found some errors in my understanding, so I have a handful of quick questions to settle them.


1º) Why does the 65816 take 8 cycles to transfer 1 byte? I assumed the logic writes an 8-bit word in a single cycle, but I don't know whether the delay is the fault of the WRAM, SRAM, VRAM...


2º) Is the following correct?
NTSC: 1364 cycles per scanline
PAL: 1455 cycles per scanline

NTSC Bytes per scanline: 1364/8= 170.5 Bytes
PAL Bytes per scanline: 1455/8= 181.8 Bytes

NTSC Bytes per frame: (1364*38)/8= 6479 Bytes or 6.32KB.
PAL(224 lines) Bytes per frame: (1455*88)/8= 16005 Bytes or 15.6KB.
PAL(239 lines) Bytes per frame: (1455*73)/8= 13276 Bytes or 12.9 KB.


3º) While a scanline is being drawn, could there be pauses in the middle lasting... 4 cycles? 40 cycles? Is that due to memory latency?


4º) The 65816 has exactly 325996 cycles in an entire frame. Is that correct?


5º) About the sprites: are there sprites of 32x16, 16x32, 16x8, and 8x16 pixels? If the answer is "yes", can I mix them if I choose the 32x32 and 16x16 configuration?


6º) When updating tiles, is there any difference between "sprite tiles" and "layer tiles" in terms of how quickly they can be drawn? (Everything around the OAM table might have some influence, while the tile layers could be written to VRAM directly.)


7º) Is it theoretically possible to update every tile you transfer during a frame? For example: I transfer 180 tiles during a frame, and at the end of it, when the frame buffer composes the image, all of those tiles can be changed in the picture.


8º) The 4-player mode of Street Racer: does running Mode 7 in four "screens" really have no impact on the performance of the PPUs, or does it work thanks to a surplus of power?
I mean, are there 4 rotating layers at 60 fps, processed separately four times every frame, or does it not work like that?



I think this is all... thanks in advance! :)
tepples
Posts: 22705
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Some basic questions...

Post by tepples »

1. Some (3 clocks) is how long it takes for the 65816 to put an address on the address bus when requesting a read or write. The rest (3 or 5 clocks) accounts for the fact that ROM and DRAM take time for the signals to propagate through their control circuitry. The CPU has to wait for this to happen.

2. Before I can answer, I'd like to know where you got the 1455 figure.

3. The 40 clock delay compensates for the fact that WRAM (work RAM) is DRAM (dynamic RAM), which is made of capacitors. If a word of DRAM isn't read often, its value will decay and eventually become unreadable. So the CPU pauses for some time out of each scanline to refresh (read and write back) words of WRAM.

4. Do you mean during active picture or during the whole frame (active picture plus vertical blanking)?

5. The S-PPU allows only square sprites: 8x8, 16x16, 32x32, or 64x64, and sprites of two such square sizes can be used at once.

6. Sprites are always 4 bits per pixel, and a single scene can use only 16 KiB (512 distinct tiles) of video memory for sprite tiles. Backgrounds can be 2bpp, 4bpp, or 8bpp, depending on the mode.

7. There's a limit on how much data your program can upload to VRAM during one vblank. For NTSC, this is about 5 KiB, or about 160 tiles at 4bpp. But yes, if you upload several tiles, the next frame's display will immediately reflect the upload.
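
The tile count in answer 7 follows from tile size: an 8x8 tile takes 8 bytes per bitplane, so 16, 32, or 64 bytes at 2bpp, 4bpp, or 8bpp. A rough sketch of that arithmetic, taking the approximate 5 KiB budget above as given (the function name is only illustrative):

```python
def tiles_per_budget(budget_bytes: int, bits_per_pixel: int) -> int:
    """Whole 8x8 tiles that fit in a given VRAM upload budget."""
    bytes_per_tile = 8 * 8 * bits_per_pixel // 8  # 64 pixels, bpp bits each
    return budget_bytes // bytes_per_tile


budget = 5 * 1024  # approximate NTSC per-vblank figure quoted above
for bpp in (2, 4, 8):
    print(f"{bpp}bpp: {tiles_per_budget(budget, bpp)} tiles")
```

At 4bpp this gives the 160 tiles mentioned above; the 2bpp and 8bpp figures are simply double and half of that.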


Other users can add detail or answers to questions that I chose not to answer for the moment.

Based on various things I saw while reading your post, I guess that your native language is something other than English. Am I correct? Which language is it? Based on your username and nothing else, I guessed Spanish. Am I right?
Señor Ventura
Posts: 233
Joined: Sat Aug 20, 2016 3:58 am

Re: Some basic questions...

Post by Señor Ventura »

tepples wrote:2. Before I can answer, I'd like to know where you got the 1455 figure.
Sorry, it was wrong.

Assuming you have 325996 cycles, you get 1455 cycles per line in PAL by dividing 325996/224... but there are 312 lines, so it doesn't work that way, right?

If you divide all the cycles across the 224 lines of resolution to get 1455, then the extra 88 lines would effectively be multiplied by 0... something would be wrong here ^^u

And all of this without even considering that the concept is wrong too.
tepples wrote:4. Do you mean during active picture or during the whole frame (active picture plus vertical blanking)?
Could they be almost the same thing?
tepples wrote:7. There's a limit on how much data your program can upload to VRAM during one vblank. For NTSC, this is about 5 KiB, or about 160 tiles at 4bpp. But yes, if you upload several tiles, the next frame's display will immediately reflect the upload.
Based on a bandwidth of 5.72 KB, that is around 183 tiles at 4bpp, 366 tiles at 2bpp, and 91 tiles at 8bpp.

So all the transferred tiles are "active" immediately, OK. That's logical, since VRAM is inaccessible during active display. On the Mega Drive, by contrast, this causes problems, and I've sometimes heard that tile transfers aren't guaranteed to update reliably beyond a certain number of tiles (when you are close to the bandwidth limit).
tepples wrote:Based on various things I saw while reading your post, I guess that your native language is something other than English. Am I correct? Which language is it? Based on your username and nothing else, I guessed Spanish. Am I right?
My grammar betrays me xD

Yes, I'm Spanish :)

Thank you indeed!
tepples
Posts: 22705
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Some basic questions...

Post by tepples »

Señor Ventura wrote:Assuming you have 325996 cycles, you get 1455 cycles per line in PAL by dividing 325996/224... but there are 312 lines, so it doesn't work that way, right?
Correct. There are 1364 master clocks per line in both TV systems. Multiply this by 262 or 312 to get master clocks per frame. (You lose two master clocks per frame on NTSC because of chroma realignment.)
Señor Ventura wrote:
tepples wrote:4. Do you mean during active picture or during the whole frame (active picture plus vertical blanking)?
Could they be almost the same thing?
Not quite. Active picture is usually 224 out of 262 lines on NTSC or 240 out of 312 lines on PAL. There's a big difference.
Señor Ventura wrote:So all the transferred tiles are "active" immediately, OK. That's logical, since VRAM is inaccessible during active display. On the Mega Drive, by contrast, this causes problems, and I've sometimes heard that tile transfers aren't guaranteed to update reliably beyond a certain number of tiles (when you are close to the bandwidth limit).
DMA bandwidth during vertical blanking for both systems is similar. Their behavior differs during active picture: the Mega Drive VDP has a FIFO (first in, first out) queue that stores multiple writes and executes them during downtime in the scanline, whereas the Super NES S-PPU does not. Both can read 32 bytes of VRAM per 16 horizontal pixels, but what these reads correspond to differs because the S-PPU stores column scroll offsets in VRAM or has a third layer if they're not used, while the VDP stores half of the sprite display list in VRAM.

For each 16 pixels of the active portion of each line, both read 4 bytes of layer 1 nametables, 8 bytes of two layer 1 tiles, 4 bytes of layer 2 nametables, and 8 bytes of two layer 2 tiles. This leaves eight bytes, which the systems split up differently:
  • Mega Drive: Sprite tiles, VRAM refresh, and the write FIFO
  • Super NES mode 1 (no column scrolling): 4 bytes of layer 3 nametables, and 4 bytes of layer 3 tiles
  • Super NES mode 2 (column scrolling): 8 bytes of VSRAM offsets
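
The byte budget in that breakdown can be tallied in a few lines. This is only a bookkeeping sketch of the figures given in this post; the dictionary keys are illustrative labels, not hardware names:

```python
SLOT_BYTES = 32  # VRAM bytes read per 16 horizontal pixels, per the post above

# Reads that both systems perform for every 16 pixels of active picture
common_reads = {
    "layer 1 nametables": 4,
    "layer 1 tiles": 8,
    "layer 2 nametables": 4,
    "layer 2 tiles": 8,
}

remaining = SLOT_BYTES - sum(common_reads.values())
print(remaining)  # bytes left over for the system-specific uses listed above

# Super NES mode 1 spends the remainder on a third layer
snes_mode1_extra = {"layer 3 nametables": 4, "layer 3 tiles": 4}
print(sum(snes_mode1_extra.values()) == remaining)
```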
Señor Ventura
Posts: 233
Joined: Sat Aug 20, 2016 3:58 am

Re: Some basic questions...

Post by Señor Ventura »

tepples wrote:Correct. There are 1364 master clocks per line in both TV systems. Multiply this by 262 or 312 to get master clocks per frame. (You lose two master clocks per frame on NTSC because of chroma realignment.)
Then you have 38 free lines, and 224 lines inside the active resolution losing 2 cycles each.

So:
38*1364= 51832 cycles.
224*1362= 305088 cycles.
In total= 356920 CPU cycles.

Is that right?

Then, from each of the 38 lines outside the active resolution you can get 170.5 bytes, and from each of the 224 lines of the active resolution you can get 170.25 bytes (but only when those scanlines are removed).

But 38 lines multiplied by 170.5 bytes still gives 6.32 KB (6479 bytes), which must be wrong, because the total bandwidth is known to be 5.72 KB... Why does this happen?

tepples wrote:Not quite. Active picture is usually 224 out of 262 lines on NTSC or 240 out of 312 lines on PAL. There's a big difference.
Right, sorry, I was thinking of it another way xD

I meant the whole frame, the cycles the CPU has available to do things... but that question should be answered now ^^
tepples wrote:DMA bandwidth during vertical blanking for both systems is similar. Their behavior differs during active picture: the Mega Drive VDP has a FIFO (first in, first out) queue that stores multiple writes and executes them during downtime in the scanline, whereas the Super NES S-PPU does not. Both can read 32 bytes of VRAM per 16 horizontal pixels, but what these reads correspond to differs because the S-PPU stores column scroll offsets in VRAM or has a third layer if they're not used, while the VDP stores half of the sprite display list in VRAM.

For each 16 pixels of the active portion of each line, both read 4 bytes of layer 1 nametables, 8 bytes of two layer 1 tiles, 4 bytes of layer 2 nametables, and 8 bytes of two layer 2 tiles. This leaves eight bytes, which the systems split up differently:
  • Mega Drive: Sprite tiles, VRAM refresh, and the write FIFO
  • Super NES mode 1 (no column scrolling): 4 bytes of layer 3 nametables, and 4 bytes of layer 3 tiles
  • Super NES mode 2 (column scrolling): 8 bytes of VSRAM offsets
I'm still processing... xD

So, during active display both systems read nametables and tiles in the same way, but each does something different with the leftover memory accesses:
- The Mega Drive checks its VRAM for frame buffering and performs the pending writes.
- The SNES reads additional data for modes 1 and 2.
lidnariq
Posts: 11429
Joined: Sun Apr 13, 2008 11:12 am

Re: Some basic questions...

Post by lidnariq »

Señor Ventura wrote:38*1364= 51832 cycles.
224*1362= 305088 cycles.
Not per scanline; per frame.

Every second vertical retrace, one scanline (at the end of drawing) is missing one pixel, or 4 master clock cycles. On average, this means that NTSC has 262*1364-2 master cycles per frame
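
As a quick check of the per-frame totals discussed so far, here is a small sketch using only the figures from this thread (1364 master clocks per line, 262 or 312 lines, and an average loss of 2 clocks per NTSC frame). The function name is made up for illustration:

```python
CLOCKS_PER_LINE = 1364  # master clocks per scanline, both TV systems


def master_clocks_per_frame(lines: int, average_loss: int = 0) -> int:
    """Master clocks in one frame, minus any average per-frame loss."""
    return lines * CLOCKS_PER_LINE - average_loss


ntsc = master_clocks_per_frame(262, average_loss=2)  # short line every 2nd frame
pal = master_clocks_per_frame(312)

print(ntsc)  # 357366
print(pal)   # 425568
```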
niconii
Posts: 219
Joined: Sun Mar 27, 2016 7:56 pm

Re: Some basic questions...

Post by niconii »

An important distinction here, which I'm not sure if you're aware of, is the difference between master cycles and CPU cycles.

The numbers you've been talking about are master cycles, not CPU cycles. For NTSC, these cycles run on a ~21.477 MHz clock.

However, CPU cycles are different. Depending on the area of memory accessed, one CPU cycle = 6 master cycles (~3.58 MHz), 8 master cycles (~2.68 MHz), or for joypad ports, 12 master cycles (~1.79 MHz). CPU cycles that don't involve memory are always 6 master cycles.

So, when you ask why the 65816 takes 8 cycles to read or write a byte, this isn't quite accurate. The 65816 takes one CPU cycle to read or write a byte, and this is sometimes equivalent to 8 master cycles.
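
The approximate CPU speeds above are just the master clock divided by 6, 8, or 12. A minimal sketch, assuming the usual NTSC master clock value of about 21.477272 MHz (the post only gives ~21.477 MHz):

```python
MASTER_HZ = 21_477_272  # assumed NTSC master clock in Hz (~21.477 MHz)


def cpu_mhz(master_cycles_per_cpu_cycle: int) -> float:
    """Effective CPU rate in MHz for a given master-cycle divisor."""
    return MASTER_HZ / master_cycles_per_cpu_cycle / 1e6


for label, divisor in (("6 master cycles", 6),
                       ("8 master cycles", 8),
                       ("12 master cycles (joypad ports)", 12)):
    print(f"{label}: ~{cpu_mhz(divisor):.2f} MHz")
```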
93143
Posts: 1715
Joined: Fri Jul 04, 2014 9:31 pm

Re: Some basic questions...

Post by 93143 »

Señor Ventura wrote:About the sprites: are there sprites of 32x16, 16x32, 16x8, and 8x16 pixels? If the answer is "yes", can I mix them if I choose the 32x32 and 16x16 configuration?
No, but tepples wasn't entirely correct either. The SNES does have a couple of undocumented sprite size settings that allow for non-square sprites. Setting the top three bits of OBSEL to 110 gets you 16x32 and 32x64 sprites. Setting them to 111 gets you 16x32 and 32x32 sprites.

Unfortunately, according to superfamicom.org, vertical flipping doesn't work properly for non-square sprites, as each half flips separately. So you have to be careful when using these settings.
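
For reference, the size selection can be written as a lookup on the top three bits of OBSEL. The two non-square rows are the ones described above; the six square rows are filled in from commonly circulated register documentation rather than from this thread, so treat them as an assumption to verify:

```python
# (small size, large size) as (width, height), keyed by OBSEL bits 7-5
OBSEL_SIZES = {
    0b000: ((8, 8), (16, 16)),
    0b001: ((8, 8), (32, 32)),
    0b010: ((8, 8), (64, 64)),
    0b011: ((16, 16), (32, 32)),
    0b100: ((16, 16), (64, 64)),
    0b101: ((32, 32), (64, 64)),
    0b110: ((16, 32), (32, 64)),  # undocumented, per the post above
    0b111: ((16, 32), (32, 32)),  # undocumented, per the post above
}


def sprite_sizes(obsel: int):
    """Return the (small, large) sprite sizes selected by an OBSEL value."""
    return OBSEL_SIZES[(obsel >> 5) & 0b111]


print(sprite_sizes(0b110_00000))
```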
When updating tiles, is there any difference between "sprite tiles" and "layer tiles" in terms of how quickly they can be drawn? (Everything around the OAM table might have some influence, while the tile layers could be written to VRAM directly.)
Is it theoretically possible to update every tile you transfer during a frame? For example: I transfer 180 tiles during a frame, and at the end of it, when the frame buffer composes the image, all of those tiles can be changed in the picture.
The 4-player mode of Street Racer: does running Mode 7 in four "screens" really have no impact on the performance of the PPUs, or does it work thanks to a surplus of power?
I mean, are there 4 rotating layers at 60 fps, processed separately four times every frame, or does it not work like that?
It sounds like you're under the impression that the SNES has a framebuffer that the PPU draws into. That's wrong. The PPU reads data from VRAM (based on register settings and OAM) and writes directly to the TV screen. It never writes to VRAM.

The closest thing to a framebuffer is the line buffer used for sprites, which are read from VRAM and composited during HBlank so they can be combined with the other layers during the active line in which they appear. (OAM is read during active display, and the results are used during the subsequent HBlank and displayed on the next line down. This is why the screen starts on line 1 - line 0 is used to read OAM for the first actual active display line.) The 34-tile limit per scanline is simply due to the fact that the PPU can't load more than that during HBlank, which is why the relevant flag in STAT77 is called Time Over.

When you send a tile to VRAM, you are altering the data the PPU sends to the TV when it reads that memory area. So naturally it updates right away.

Mode 7 is the same - the PPU just takes the current scroll position, origin and transform matrix (all of which are set by the CPU) and uses them to look up tiles and pixels in VRAM and output them to the TV. Since the transform is affine, you can't do perspective this way, so developers used HDMA to automatically change the Mode 7 parameters after every scanline. Four layers of Mode 7 in perspective might be a somewhat greater load on the CPU (though not necessarily all that much because the number of scanlines is the same or lower, and the bulk of the work is computing transform matrices for every line), but the PPU doesn't care because Mode 7, like any other BG mode, is a constant-load-per-pixel operation regardless of what the CPU is doing to the input parameters.
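
To make the "affine transform plus per-scanline parameter changes" idea concrete, here is a simplified sketch. It uses floats instead of the hardware's fixed-point values, and the perspective formula (scale proportional to 1 / distance below the horizon) is a common textbook approach, not something taken from any particular game. All names and default values are hypothetical:

```python
import math


def mode7_map_coords(sx, sy, a, b, c, d, hofs=0, vofs=0, cx=0, cy=0):
    """Map screen pixel (sx, sy) to playfield coordinates with one affine matrix."""
    dx = sx + hofs - cx
    dy = sy + vofs - cy
    return (a * dx + b * dy + cx, c * dx + d * dy + cy)


def per_line_matrices(angle, first_line, last_line, horizon, height=64.0):
    """One (a, b, c, d) tuple per scanline, as an HDMA table would supply.

    Requires first_line > horizon so the scale stays finite.
    """
    cos_t, sin_t = math.cos(angle), math.sin(angle)
    table = []
    for line in range(first_line, last_line + 1):
        scale = height / (line - horizon)  # rows near the horizon cover more map
        table.append((scale * cos_t, scale * sin_t,
                      -scale * sin_t, scale * cos_t))
    return table


table = per_line_matrices(angle=0.0, first_line=33, last_line=224, horizon=32)
print(len(table), table[0], table[-1])
```

The per-pixel work in `mode7_map_coords` is the same whatever the matrix holds, which is the constant-load point made above; the cost that grows with the number of perspective scanlines is building the table.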
Señor Ventura wrote:Then, from each of the 38 lines outside the active resolution you can get 170.5 bytes
Nope. The DMA unit pauses with the rest of the CPU-side system for DRAM refresh, which eats 40 master clocks every line without fail. You only get 165.5 bytes per line (and it's the same in PAL).
the total bandwidth is known to be 5.72 KB...
No. Who told you that? Theoretically, you should be able to fit about 6289 bytes if you completely fill VBlank with a single massive DMA, carefully timed to start right at the beginning of VBlank. The catch is that you usually can't do that, so the available bandwidth is generally a bit lower.

If you fully update CGRAM and OAM every frame, then subtracting that from the total gets you around 5 KB per frame, or less if you have a bunch of maneuvering and a lot of little transfers to do.
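
The figures above can be reproduced with a short calculation. The 1364 clocks per line, 40-clock refresh, and 8 clocks per DMA byte come from this thread; the CGRAM and OAM sizes (512 and 544 bytes) are the standard sizes and are added here as an assumption, since the thread does not state them:

```python
CLOCKS_PER_LINE = 1364
REFRESH_CLOCKS = 40        # DRAM refresh pause on every line
CLOCKS_PER_DMA_BYTE = 8

bytes_per_line = (CLOCKS_PER_LINE - REFRESH_CLOCKS) / CLOCKS_PER_DMA_BYTE
vblank_lines = 262 - 224   # NTSC, overscan bit clear
max_vblank_bytes = vblank_lines * bytes_per_line

CGRAM_BYTES = 512          # assumed: 256 colours x 2 bytes
OAM_BYTES = 544            # assumed: 512-byte table + 32-byte high table
left_for_vram = max_vblank_bytes - CGRAM_BYTES - OAM_BYTES

print(bytes_per_line)      # 165.5
print(max_vblank_bytes)    # 6289.0
print(left_for_vram)       # 5233.0
```

That leaves a little over 5 KiB in the ideal case, before any per-transfer setup overhead is counted.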

You can extend VBlank in one or both directions using forced blank if you need more time, but this will result in black bars at the top and/or bottom of the screen - you can see this in games like Star Fox (which also has black bars at the sides to reduce fillrate requirements) and Yoshi's Island (which doesn't), not to mention Super Mario Kart. In fact, as far as I recall every Super FX game and every Capcom fighter or beat-em-up does this.
tepples wrote:Active picture is usually 224 out of 262 lines on NTSC or 240 out of 312 lines on PAL. There's a big difference.
Technically it's 224 lines if the overscan bit in SETINI is clear, and 239 lines if it's set. I note the use of the term "usually", which presumably refers to the fact that 224 is preferable in NTSC (more DMA bandwidth and you can't see the extra lines anyway) and 239 is preferable in PAL (you can see the extra lines, and DMA bandwidth isn't as restrictive because of the longer VBlank).
tepples
Posts: 22705
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Some basic questions...

Post by tepples »

Nicole wrote:However, CPU cycles are different. Depending on the area of memory accessed, one CPU cycle = 6 master cycles (~3.58 MHz), 8 master cycles (~2.68 MHz), or for joypad ports, 12 master cycles (~1.79 MHz). CPU cycles that don't involve memory are always 6 master cycles.
Compare to the Mega Drive's 68000 CPU, which nominally runs at 315÷88×15÷7 = 7.67 million T-states per second, but in practice runs at one-fourth that because of how long it takes to read or write memory. The Z80 and LR35902 (Game Boy CPU) have a similar T-state structure.
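
For anyone who wants to check that clock figure, 315÷88 MHz is the NTSC colour subcarrier frequency, and the expression can be evaluated exactly with fractions. A small sketch (the one-fourth factor is the rough rule of thumb stated above, not an exact figure):

```python
from fractions import Fraction

colorburst_mhz = Fraction(315, 88)             # NTSC colour subcarrier, in MHz
m68k_mhz = colorburst_mhz * Fraction(15, 7)    # nominal 68000 clock
memory_access_mhz = m68k_mhz / 4               # rough rate of memory accesses

print(float(m68k_mhz))
print(float(memory_access_mhz))
```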
93143 wrote:I note the use of the term "usually", which presumably refers to the fact that 224 is preferable in NTSC
That's what I meant.

If you want even more video memory bandwidth on a Super NES, you can use forced blank to letterbox the visible area down to 168 (NTSC) or 200 (PAL) lines and then claim that your game is "optimized for widescreen TVs". This gives (262-168)*161.5 = 15181 bytes per frame, enough to overwrite OAM, CGRAM, and almost the entirety of sprite VRAM. Then you can regain apparent resolution by using interlaced backgrounds.
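
A small sketch of that letterbox arithmetic. The 161.5 bytes-per-line figure is the one used in this post; it differs from the 165.5 quoted earlier in the thread, so the per-line figure is left as a parameter. The function name is illustrative only:

```python
def blank_budget(total_lines: int, visible_lines: int,
                 bytes_per_line: float) -> float:
    """DMA bytes available per frame when only visible_lines are rendered."""
    return (total_lines - visible_lines) * bytes_per_line


print(blank_budget(262, 168, 161.5))  # NTSC, figure from this post: 15181.0
print(blank_budget(312, 200, 161.5))  # PAL with the same per-line figure
print(blank_budget(262, 168, 165.5))  # NTSC with the earlier per-line figure
```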
Señor Ventura
Posts: 233
Joined: Sat Aug 20, 2016 3:58 am

Re: Some basic questions...

Post by Señor Ventura »

Thank you all for the answers! :)

I have one more question, and I can't find the answer using Google...

Can I divide the screen vertically into three sections, to get active scanlines 144 pixels wide? (the red area in the image):

Image


That is, keep two portions of the screen totally deactivated. Would it be possible to gain some bandwidth that way?
I mean, if you deactivate whole scanlines you gain bandwidth, but what if each scanline is just shorter, as in the image?
lidnariq wrote:Not per scanline; per frame.

Every second vertical retrace, one scanline (at the end of drawing) is missing one pixel, or 4 master clock cycles. On average, this means that NTSC has 262*1364-2 master cycles per frame
Thank you, that's good for me to know too.

P.S.: Yes, per frame :)
Nicole wrote:An important distinction here, which I'm not sure if you're aware of, is the difference between master cycles and CPU cycles.

The numbers you've been talking about are master cycles, not CPU cycles. For NTSC, these cycles run on a ~21.477 MHz clock.

However, CPU cycles are different. Depending on the area of memory accessed, one CPU cycle = 6 master cycles (~3.58 MHz), 8 master cycles (~2.68 MHz), or for joypad ports, 12 master cycles (~1.79 MHz). CPU cycles that don't involve memory are always 6 master cycles.

So, when you ask why the 65816 takes 8 cycles to read or write a byte, this isn't quite accurate. The 65816 takes one CPU cycle to read or write a byte, and this is sometimes equivalent to 8 master cycles.
This is very important, thank you... I was already telling myself that something didn't quite make sense.

The thing is that, effectively, with every master cycle of the CPU, the PPUs are running their own cycles in the meantime. When the CPU takes an internal cycle, the PPUs spend 6 cycles drawing the scanline; when the CPU writes to WRAM, the PPUs spend 8 cycles; and when the CPU communicates with the joypad, the PPUs spend 12 cycles.

Is that correct?
93143 wrote:No, but tepples wasn't entirely correct either. The SNES does have a couple of undocumented sprite size settings that allow for non-square sprites. Setting the top three bits of OBSEL to 110 gets you 16x32 and 32x64 sprites. Setting them to 111 gets you 16x32 and 32x32 sprites.

Unfortunately, according to superfamicom.org, vertical flipping doesn't work properly for non-square sprites, as each half flips separately. So you have to be careful when using these settings.
Sounds complicated. I'll keep it in mind, but for later ^^
93143 wrote:It sounds like you're under the impression that the SNES has a framebuffer that the PPU draws into. That's wrong. The PPU reads data from VRAM (based on register settings and OAM) and writes directly to the TV screen. It never writes to VRAM.

The closest thing to a framebuffer is the line buffer used for sprites, which are read from VRAM and composited during HBlank so they can be combined with the other layers during the active line in which they appear. (OAM is read during active display, and the results are used during the subsequent HBlank and displayed on the next line down. This is why the screen starts on line 1 - line 0 is used to read OAM for the first actual active display line.) The 34-tile limit per scanline is simply due to the fact that the PPU can't load more than that during HBlank, which is why the relevant flag in STAT77 is called Time Over.

When you send a tile to VRAM, you are altering the data the PPU sends to the TV when it reads that memory area. So naturally it updates right away.
So, at the beginning of active display the VRAM sends the scanlines to the TV. That is what a frame buffer does, but without areas to write to; whatever pool of data you send to it will behave like a frame buffer... maybe that is why VRAM is closed to incoming data during active display.

I mean, it is not a frame buffer, but it is easy to mistake for one.
93143 wrote:Mode 7 is the same - the PPU just takes the current scroll position, origin and transform matrix (all of which are set by the CPU) and uses them to look up tiles and pixels in VRAM and output them to the TV. Since the transform is affine, you can't do perspective this way, so developers used HDMA to automatically change the Mode 7 parameters after every scanline. Four layers of Mode 7 in perspective might be a somewhat greater load on the CPU (though not necessarily all that much because the number of scanlines is the same or lower, and the bulk of the work is computing transform matrices for every line), but the PPU doesn't care because Mode 7, like any other BG mode, is a constant-load-per-pixel operation regardless of what the CPU is doing to the input parameters.
Why does the CPU have extra load if HDMA is driven with its own bus?

Then the key is that only the number of scanline layers influences the volume of data that the CPU has to send through HDMA for the transformation, right?

So, with 4 players, the number of scanlines belonging to a "mode 7" layer is bigger than during normal play...

Image

Image


But what if I do this in software... With 4 players, is the only thing that matters the traffic data?
https://www.youtube.com/watch?v=Tl3gKAobaTE

93143 wrote:No. Who told you that? Theoretically, you should be able to fit about 6289 bytes if you completely fill VBlank with a single massive DMA, carefully timed to start right at the beginning of VBlank. The catch is that you usually can't do that, so the available bandwidth is generally a bit lower.
Got it... better not to think about it any further. 165.5 per scanline.

At least it was a total surprise to see that the bandwidth ends up bigger than expected... 6.14 KB.
93143 wrote:If you fully update CGRAM and OAM every frame, then subtracting that from the total gets you around 5 KB per frame, or less if you have a bunch of maneuvering and a lot of little transfers to do.

You can extend VBlank in one or both directions using forced blank if you need more time, but this will result in black bars at the top and/or bottom of the screen - you can see this in games like Star Fox (which also has black bars at the sides to reduce fillrate requirements) and Yoshi's Island (which doesn't), not to mention Super Mario Kart. In fact, as far as I recall every Super FX game and every Capcom fighter or beat-em-up does this.
Or run at 30 frames per second... that is a solution too...

Films run at 24, so... xD
tepples
Posts: 22705
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Some basic questions...

Post by tepples »

You can use a window to hide pixels outside a 144-pixel-wide strip. Kirby Super Star uses something similar to hide scrolling artifacts at the sides of the screen. But unlike hiding an entire scanline, hiding stretches of an individual scanline will not gain you any video memory bandwidth.
Why does the CPU have extra load if HDMA is driven with its own bus?
DMA uses the CPU's address bus, the I/O address bus, and the data bus. HDMA pauses the CPU for a few cycles in order to borrow the use of the bus.
Señor Ventura
Posts: 233
Joined: Sat Aug 20, 2016 3:58 am

Re: Some basic questions...

Post by Señor Ventura »

tepples wrote:You can use a window to hide pixels outside a 144-pixel-wide strip. Kirby Super Star uses something similar to hide scrolling artifacts at the sides of the screen. But unlike hiding an entire scanline, hiding stretches of an individual scanline will not gain you any video memory bandwidth.
That was logical. The only thing you gain is being able to fill more area every frame.
tepples wrote:DMA uses the CPU's address bus, the I/O address bus, and the data bus. HDMA pauses the CPU for a few cycles in order to borrow the use of the bus.
Understood... it is not about CPU load as such, but the more you use HDMA, the more time the CPU loses, leaving less time for other work before the end of the frame.

It must be something like that, I suppose...

Thank you!
93143
Posts: 1715
Joined: Fri Jul 04, 2014 9:31 pm

Re: Some basic questions...

Post by 93143 »

Señor Ventura wrote:So, at the beginning of active display the VRAM sends the scanlines to the TV
The sending happens continuously during active display, because it literally is active display. The PPU is taking the raw data (register settings, OAM, tiles and maps in VRAM, and colours in CGRAM) and using it to generate a video signal. That is, the PPU is in essence controlling the TV's electron beam in real time. This is why the sprites and backgrounds work the way they do - the PPU has to generate each pixel just in time for the TV to illuminate the corresponding phosphors, so everything it does has to be constant-load.

This is also why VRAM is locked during active display - the PPU has to read it continuously in order to generate pixels on time. Any delays would show up as visual garbage or black areas on the screen. You can in fact turn off rendering (force blank) so as to write to VRAM/OAM/CGRAM during what would normally be active display; this produces black output in the area of the screen that the electron beam is passing over during the forced blanking time, and it also prevents sprites from preloading and can thus glitch the OBJ layer for up to a scanline after rendering is turned back on.

The SNES (along with the NES, Mega Drive, etc.) is very different from a framebuffer-based system.
I mean, if you deactivate whole scanlines you gain bandwidth, but what if each scanline is just shorter, as in the image?
I would think you could use forced blank in an H-position-timed IRQ to do that, if forced blank frees VRAM quickly enough (I think it does, but I haven't tried it). But since interrupts on the SNES can't happen with pixel-perfect positioning, mostly because individual CPU instructions take several pixels to execute and an IRQ won't start until the current instruction is finished, you'd probably want to also use windowing as tepples describes to straighten the edges, or else just map black tiles outside the desired display area. And, obviously, you wouldn't be able to count on having the entire black area for DMA, because the timing inaccuracy of the interrupt would consume up to 16 pixels or so (depending on the code being interrupted and on whether mitigation measures were present in the IRQ code).

And as I just mentioned, you would probably end up killing sprites entirely by doing this, since they don't load during forced blanking.

Not to mention that H-IRQs eat CPU time for breakfast, especially if they contain any position stabilization code, because they happen 200+ times per frame...

In short, it's probably possible, but it doesn't work nearly as well as trimming scanlines off the top and bottom.
Señor Ventura wrote:Why does the CPU have extra load if HDMA is driven with its own bus?
Because unless the ROM contains ready-made coefficient lists for all possible viewing angles, the CPU has to calculate the transform matrix coefficients for every scanline in order to compile the HDMA tables for the next frame.

Also, as tepples points out, the HDMA is not "driven with its own bus"; it's part of the CPU and hogs the main system bus entirely when operating. It's much quicker than manual writes with the CPU, but just writing the transform matrix with HDMA still takes about 7% of a scanline, and if you need to write scroll or origin every line as well that goes up to about 10%.
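
Those percentages can be roughly reproduced with a simple cost model. The overhead numbers below (about 18 master clocks per line with HDMA active, 8 per active direct-mode channel, 8 per byte transferred) are an assumption taken from commonly cited timing notes, not from this thread, and the channel counts assume each channel writes two write-twice registers (4 bytes). Treat it as a sketch, not a cycle-exact model:

```python
CLOCKS_PER_LINE = 1364


def hdma_line_cost(channels: int, bytes_per_line: int,
                   line_overhead: int = 18,     # assumed fixed cost per line
                   channel_overhead: int = 8,   # assumed cost per active channel
                   clocks_per_byte: int = 8) -> int:
    """Estimated master clocks HDMA takes from the CPU on one scanline."""
    return (line_overhead + channels * channel_overhead
            + bytes_per_line * clocks_per_byte)


# Transform matrix only: 4 write-twice registers = 8 bytes over 2 channels
matrix_only = hdma_line_cost(channels=2, bytes_per_line=8)
# Matrix plus scroll (or origin): 4 more bytes on a third channel
matrix_plus_scroll = hdma_line_cost(channels=3, bytes_per_line=12)

print(f"{matrix_only / CLOCKS_PER_LINE:.1%}")
print(f"{matrix_plus_scroll / CLOCKS_PER_LINE:.1%}")
```

Under these assumptions the two cases land near the 7% and 10% figures quoted above.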
Then the key is that only the number of scanline layers influences the volume of data that the CPU has to send through HDMA for the transformation, right?
If by "scanline layers" you mean scanlines during which the PPU is set to Mode 7 and displaying the perspective playfield, yes.
So, with 4 players, the number of scanlines belonging to a "mode 7" layer is bigger than during normal play...
Okay, perhaps I should have checked what single-player looked like in that game. I was thinking of F-Zero, where the Mode 7 layer is most of the screen even in single-player mode...

Yes, 4-player in Street Racer would take more CPU time to handle, even with pre-baked HDMA (which would have eaten quite a lot of ROM, so I doubt they did that).
But what if I do this in software... With 4 players, is the only thing that matters the traffic data?
https://www.youtube.com/watch?v=Tl3gKAobaTE
I have no idea what you mean by "traffic data", but...

The bulk of the CPU load in that case is rendering the playfield. As you can see, it's only ~30 fps with big blocky pixels, and it doesn't cover nearly as much of the screen as F-Zero's playfield does. I don't think a 4-player version of that would look good.

Though ultimately it's pretty much the same situation as with real Mode 7, in that the number of players as such is largely irrelevant to the question of rendering load. A flat, texture-mapped perspective layer is mathematically simple enough that the computational load is mostly proportional to the area of the screen that has to be rendered, rather than whether that area is divided into one, two, or four such layers - as long as there's only one layer per line, since computing the transform for a line is nontrivial and could be significant (I haven't done the math). (Obviously running the rest of the game engine for four players is going to be more expensive than for just one.)

Doing that in software on SNES seems like a dubious proposition. The CPU is somewhat weaker than the Mega Drive (though not as much as the clock speed difference would seem to suggest), and the PPU can do it in hardware anyway. The advantages, I suppose, would be the ability to do corner 4-player rather than pancake 4-player (you can't change Mode 7 parameters arbitrarily in the middle of a scanline without glitching, but if you're rendering to a framebuffer in software you can do whatever you want) and the ability to use maps larger than 1024x1024 (Mode 7 only allows one map, and you can't change where it is in VRAM; this is why Super Mario Kart was a go-kart game instead of a multiplayer F-Zero sequel). I'm not sure it'd be worth it given the resolution and framerate you'd have to put up with; pancake 4-player doesn't look that horrible in comparison, and if you need a bigger map it might be better to try to pull off something like my quarter-map scheme (though for 4-player it needs 8 KB updates, which implies a reduced active display height)... Plus, in accordance with what I said above, corner 4-player might take noticeably more CPU than pancake 4-player at the same resolution because it needs twice as many transform matrices...
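To make the pancake-versus-corner point concrete, here is a toy cost model. It assumes one transform setup per view per scanline plus one texture sample per pixel; the view sizes are hypothetical examples on a 256x224 area, and the function name is made up for illustration. It says nothing about absolute speed, only about how the two layouts compare.

```python
# Toy cost model: one transform setup per view per scanline, plus one
# texture sample per pixel. View sizes are hypothetical examples.

def render_cost(views):
    """views: list of (width, height) in pixels. Returns (line_setups, pixels)."""
    line_setups = sum(h for _, h in views)
    pixels = sum(w * h for w, h in views)
    return line_setups, pixels


# Four stacked full-width strips ("pancake") on a hypothetical 256x224 area.
pancake = render_cost([(256, 56)] * 4)
# Four quadrants ("corner") covering the same area.
corner = render_cost([(128, 112)] * 4)

print("pancake:", pancake)
print("corner: ", corner)
```

Both layouts sample the same number of pixels, but the corner layout needs twice as many per-line transform setups, because every scanline crosses two views instead of one.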
Or run at 30 frames per second... that's a solution too...
I think Star Fox tops out at 20. Mind you, a lot of the frame rate issues in that game were due to software rendering on the Super FX taking a long time, but the frame data was also too big to transfer in one VBlank, even though they extended VBlank with forced blank. The catch here is that you need to double buffer some of the data in VRAM so you don't get tearing or glitching.
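As a back-of-the-envelope check on the transfer side, the sketch below reuses the simplified figures from the first post (1364 master cycles per NTSC line, 8 cycles per byte) and ignores DRAM refresh and DMA setup overhead. The 224x192 4bpp buffer is a hypothetical size picked for illustration, not a figure confirmed anywhere in this thread, and the function names are made up.

```python
# Simplified DMA budget: bytes that fit in the blanked lines of one NTSC frame.
# Ignores DRAM refresh and DMA setup overhead, so real numbers are a bit lower.
MASTER_CYCLES_PER_LINE = 1364
CYCLES_PER_BYTE = 8
TOTAL_LINES_NTSC = 262


def blank_dma_bytes(active_lines):
    """Bytes transferable while the display is blanked, for a given active height."""
    blank_lines = TOTAL_LINES_NTSC - active_lines
    return blank_lines * MASTER_CYCLES_PER_LINE // CYCLES_PER_BYTE


def frames_needed(buffer_bytes, active_lines):
    """Whole frames required to move buffer_bytes using only blanked lines."""
    per_frame = blank_dma_bytes(active_lines)
    return -(-buffer_bytes // per_frame)  # ceiling division


# Hypothetical 224x192 framebuffer at 4 bits per pixel.
buffer_bytes = 224 * 192 // 2

print(blank_dma_bytes(224), frames_needed(buffer_bytes, 224))
print(blank_dma_bytes(192), frames_needed(buffer_bytes, 192))
```

Even with the active display cut down to 192 lines, a buffer of that size doesn't fit in a single blanking period under this model, which is why some of the data has to be double buffered in VRAM.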

...

Please excuse my huge posts. I like to be precise, but I'm not very good at explaining stuff, and I get drawn off onto tangents very easily.
Oziphantom
Posts: 1565
Joined: Tue Feb 07, 2017 2:03 am

Re: Some basic questions...

Post by Oziphantom »

93143 wrote:
I mean... if you deactivate horizontal scanlines you gain bandwidth, but what if the scanline is shorter, like in the image?
I would think you could use forced blank in an H-position-timed IRQ to do that, if forced blank frees VRAM quickly enough (I think it does, but I haven't tried it). But since interrupts on the SNES can't happen with pixel-perfect positioning, mostly because individual CPU instructions take several pixels to execute and an IRQ won't start until the current instruction is finished, you'd probably want to also use windowing as tepples describes to straighten the edges, or else just map black tiles outside the desired display area. And, obviously, you wouldn't be able to count on having the entire black area for DMA, because the timing inaccuracy of the interrupt would consume up to 16 pixels or so (depending on the code being interrupted and on whether mitigation measures were present in the IRQ code).

And as I just mentioned, you would probably end up killing sprites entirely by doing this, since they don't load during forced blanking.

Not to mention that H-IRQs eat CPU time for breakfast, especially if they contain any position stabilization code, because they happen 200+ times per frame...

In short, it's probably possible, but it doesn't work nearly as well as trimming scanlines off the top and bottom.
Not quite true... the 65816's superpower is that it can give you perfect IRQs/NMIs with a single clock of delay. So if you are doing horizontal splits, you can pepper your normal code with WAI instructions, and as long as you hit a WAI before the interrupt is due to happen, you will get it with a fixed 1-clock slide.
Señor Ventura
Posts: 233
Joined: Sat Aug 20, 2016 3:58 am

Re: Some basic questions...

Post by Señor Ventura »

Oziphantom wrote: Not quite true... the 65816's superpower is that it can give you perfect IRQs/NMIs with a single clock of delay. So if you are doing horizontal splits, you can pepper your normal code with WAI instructions, and as long as you hit a WAI before the interrupt is due to happen, you will get it with a fixed 1-clock slide.
So you can shorten the scanlines... but could that increase the bandwidth?