Maybe floating point isn't that bad of an idea.

psycopathicteen
Posts: 2937
Joined: Wed May 19, 2010 6:12 pm

Re: Maybe floating point isn't that bad of an idea.

Post by psycopathicteen » Sat Jun 15, 2019 8:38 pm

I want to try out logarithms as well.

Señor Ventura
Posts: 113
Joined: Sat Aug 20, 2016 3:58 am

Re: Maybe floating point isn't that bad of an idea.

Post by Señor Ventura » Fri Jul 17, 2020 3:50 am

I have a question about those multipliers... Would it be possible to send a compressed packet of tiles to VRAM and use the PPU1 to decompress it?

It could be a good way to (virtually) increase the bandwidth.

93143
Posts: 1193
Joined: Fri Jul 04, 2014 9:31 pm

Re: Maybe floating point isn't that bad of an idea.

Post by 93143 » Fri Jul 17, 2020 3:04 pm

S-PPU can't write to VRAM and isn't programmable. You'd need to do something clever with the graphics format so that the process of displaying it could be considered decompression.

For instance, with Mode 7, if you want an image composed of unique tiles you're normally limited to 128x128, because there are only 256 tiles. But you can do 256x128 if you use tiles consisting of just two colours each in 4x8 blocks and draw the image to the tilemap. And since each tilemap byte represents two "pixels", you save 50% on DMA for a given image size. This method limits you to 4bpp if you want a fully general capability, but if you can accept lossy compression (not all possible colour combinations present in the tileset) you can go higher. More colour blocks per tile makes the image bigger, but makes it more difficult to get good results with a decent number of colours.
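
To make the Mode 7 trick above concrete, here is a minimal Python sketch (a model, not SNES code) of the fully general 4bpp case. It assumes tile number `(left << 4) | right` holds the tile whose left 4x8 half is colour `left` and whose right half is colour `right`. That numbering and the function names are my own illustration and are not from the post.

```python
def build_tileset():
    """Build 256 8x8 tiles: left 4 columns one colour, right 4 columns another."""
    tiles = []
    for index in range(256):
        left, right = index >> 4, index & 0x0F
        row = [left] * 4 + [right] * 4
        tiles.append([row[:] for _ in range(8)])
    return tiles


def image_to_tilemap(image):
    """Pack rows of 4-bit colour values (even width) into one tilemap byte per pair."""
    return [
        [(row[x] << 4) | row[x + 1] for x in range(0, len(row), 2)]
        for row in image
    ]


image = [[1, 2, 15, 0], [3, 3, 7, 8]]
tilemap = image_to_tilemap(image)
print(tilemap)  # [[18, 240], [51, 120]]
```

Each tilemap byte carries two image "pixels", which is where the 50% DMA saving comes from.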

Also, you can set up the VRAM port so that you can write 2bpp graphics into a 4bpp format, or 1bpp into 2bpp. Of course, you only get as many bits as you wrote, and there's no real way to trade off bit depth with compression fidelity like there is with the Mode 7 trick above, so it's a very lossy form of "compression"...

Señor Ventura
Posts: 113
Joined: Sat Aug 20, 2016 3:58 am

Re: Maybe floating point isn't that bad of an idea.

Post by Señor Ventura » Fri Jul 17, 2020 6:06 pm

93143 wrote:
Fri Jul 17, 2020 3:04 pm
S-PPU can't write to VRAM and isn't programmable. You'd need to do something clever with the graphics format so that the process of displaying it could be considered decompression.
I meant using the CPU to put the compressed data into VRAM and then using the multipliers to copy the new result back, not using the PPU1. But if VRAM can't be touched after the copy, then it's a waste of time.

It would have been a beautiful dream: getting about 10 KB of tile data across in every cycle, or something like that.

93143
Posts: 1193
Joined: Fri Jul 04, 2014 9:31 pm

Re: Maybe floating point isn't that bad of an idea.

Post by 93143 » Fri Jul 17, 2020 7:23 pm

I'm not entirely sure what you're even proposing, but the way the PPU multiplier works is quite limited, and I'm pretty sure it can't do anything like what you're thinking of.

The PPU multiplier has exactly two modes of use:

1) the S-PPU uses it to figure out what to draw to the TV when in BG Mode 7. This does not result in anything being written to VRAM.

2) the S-CPU can use it to obtain the result of a 16x8 signed multiplication between two numbers supplied via MMIO, if the S-PPU's BG layer handling is not in Mode 7 (so they don't step on each other trying to use the same multiplication hardware for two different things). This result shows up in dedicated MMIO registers on the CPU side; it never gets anywhere near VRAM.

It's not even that the S-PPU can't write to VRAM - it could, but it doesn't. There's nothing you can do about this because it's a fixed-function chip.

The CPU can absolutely write things to VRAM, but it has to use the VRAM port on the B bus during VBlank or forced blank. You could in principle have the CPU use the PPU multiplier to help decompress graphics, but the result would still be on the CPU side, and would still have to pass through the VRAM port at some point to get to VRAM, so you wouldn't save any bandwidth.
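
As a rough illustration of what the signed 16x8 multiplication in point 2 computes, here is a small Python model. It only mirrors the arithmetic (two's-complement inputs, 24-bit result); the function names are made up for this sketch and nothing here touches real hardware registers.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def ppu_multiply(a16, b8):
    """Signed 16-bit x signed 8-bit product, returned as a raw 24-bit value."""
    return (to_signed(a16, 16) * to_signed(b8, 8)) & 0xFFFFFF


result = ppu_multiply(0x0100, 0xFF)  # 256 * -1
print(hex(result))                   # 0xffff00
print(to_signed(result, 24))         # -256
```

The largest possible magnitude is 32768 * 128 = 2^22, so the product always fits in 24 bits.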

Señor Ventura
Posts: 113
Joined: Sat Aug 20, 2016 3:58 am

Re: Maybe floating point isn't that bad of an idea.

Post by Señor Ventura » Sat Jul 18, 2020 2:38 am

Yes, I was afraid of that. Maybe you could decompress something into WRAM and then pass it to VRAM via DMA, but then the method doesn't actually save bandwidth, and it may be more effective to use an S-DD1 or a giant mapper.

At least you can use those multipliers to offload some kind of processing and help the CPU; that's something.

How much processing power are we talking about? Could it help with calculating object AI, sprite physics, etc.? What kind of performance could be expected for that sort of thing? Could a game like Gradius III benefit from this to avoid all those slowdowns? Could it work like that? (We already know it has been patched for SA-1 and 3.58 MHz.)

93143
Posts: 1193
Joined: Fri Jul 04, 2014 9:31 pm

Re: Maybe floating point isn't that bad of an idea.

Post by 93143 » Sat Jul 18, 2020 2:57 pm

Señor Ventura wrote:
Sat Jul 18, 2020 2:38 am
How much processing power are we talking about?
It's a multiplier. It multiplies two numbers together.

If you want to do a signed multiplication between a 16-bit number and an 8-bit number, you poke the numbers into the appropriate Mode 7 matrix registers 8 bits at a time, and then read the result. The result is 24-bit, but you may only need the top 8 or 16 bits; fortunately it's easy to only read the part you want.
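
A quick Python sketch of the "only read the part you want" idea: if the 8-bit input is treated as a fraction with 8 fractional bits, then keeping just the upper 16 bits of the 24-bit product amounts to dividing by 256. The fixed-point interpretation and the helper names are assumptions made for this example.

```python
def signed(value, bits):
    """Interpret the low `bits` bits of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def product_bytes(a16, b8):
    """Return the 24-bit signed product as (low, middle, high) bytes."""
    p = (signed(a16, 16) * signed(b8, 8)) & 0xFFFFFF
    return p & 0xFF, (p >> 8) & 0xFF, p >> 16


def upper_16(a16, b8):
    """Keep only the middle and high bytes, i.e. the product divided by 256."""
    _, mid, high = product_bytes(a16, b8)
    return signed((high << 8) | mid, 16)


print(upper_16(1000, 64))   # 250  (1000 * 0.25)
print(upper_16(1000, -64))  # -250
```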

Unlike the CPU multiplier, which is unsigned 8-bit x 8-bit and takes 8 CPU cycles before you can safely read the result, I believe the PPU multiplier is fast enough that the CPU physically can't read the result before it's ready.
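
For comparison, this is roughly the arithmetic you would have to do yourself to get a signed 16x8 product out of an unsigned 8x8 multiplier: two partial products plus sign corrections. It is a Python sketch of the maths only, with hypothetical function names, and says nothing about actual cycle counts.

```python
def mul8x8_unsigned(a, b):
    """Stand-in for an unsigned 8-bit x 8-bit multiply (16-bit result)."""
    return (a & 0xFF) * (b & 0xFF)


def signed_16x8_via_8x8(a16, b8):
    """Signed 16x8 product (raw 24-bit) built from two unsigned 8x8 products."""
    a16 &= 0xFFFF
    b8 &= 0xFF
    low = mul8x8_unsigned(a16 & 0xFF, b8)
    high = mul8x8_unsigned(a16 >> 8, b8)
    product = low + (high << 8)   # unsigned 16x8 product
    if a16 & 0x8000:              # a16 was negative: subtract b8 * 2^16
        product -= b8 << 16
    if b8 & 0x80:                 # b8 was negative: subtract a16 * 2^8
        product -= a16 << 8
    return product & 0xFFFFFF


print(hex(signed_16x8_via_8x8(0x0100, 0xFF)))  # 0xffff00 (256 * -1)
```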

You can use the PPU multiplier for any application in which a reasonably quick signed 16x8 multiplication capability would be useful.

Note that this was not some deep secret during the commercial lifetime of the SNES, so it isn't safe to expect that any particular game could be sped up by using it - it very well might be using it already.

Señor Ventura
Posts: 113
Joined: Sat Aug 20, 2016 3:58 am

Re: Maybe floating point isn't that bad of an idea.

Post by Señor Ventura » Sat Jul 18, 2020 4:49 pm

I hadn't thought about that "desynchronization" between the CPU and the PPU. Basically, the PPU1 can do a lot of multiplications, but most of the results disappear before the CPU gets to that point and can read them. Am I right?

Lately I'm feeling a little frustrated... Maybe it would be faster using 16-bit multipliers than a 24-bit multiplier, but I'm lost with it.

I have a question unrelated to this thread, if you'll allow it. Now that games like Gradius III are benefiting from chips like the SA-1, could 30 fps games like Top Gear 2 take advantage of this kind of improvement? (By forcing 60 fps through hard coding, of course.)

93143
Posts: 1193
Joined: Fri Jul 04, 2014 9:31 pm

Re: Maybe floating point isn't that bad of an idea.

Post by 93143 » Sat Jul 18, 2020 6:55 pm

Señor Ventura wrote:
Sat Jul 18, 2020 4:49 pm
I hadn't thought about that "desynchronization" between the CPU and the PPU. Basically, the PPU1 can do a lot of multiplications, but most of the results disappear before the CPU gets to that point and can read them. Am I right?
Something kinda like that might happen if the PPU is in BG Mode 7. Because it needs to use the multiplier regularly to figure out how to render the image, the result may not be what the CPU expects based on what it last wrote into $211B-211C. (Especially if those registers have been clobbered by HDMA; lots of Mode 7 games use HDMA to fiddle with the matrix between scanlines...) And of course, writing to those registers as part of a CPU calculation (successful or not) could end up glitching out the graphics, because they are part of the Mode 7 transform matrix.

If the PPU is not in Mode 7, the CPU can write the inputs and the answer will appear in the PPU multiplication result registers and stay there as long as you need it to. There's no rush to read it. What I meant by "the CPU physically can't read the result before it's ready" is that the result is ready so fast that you don't need to wait.

You can't give the PPU a list of multiplications to do for the CPU. The CPU has to poke in the values a byte at a time and read the answer to each multiplication before doing the next one.
I have a question unrelated to this thread, if you'll allow it. Now that games like Gradius III are benefiting from chips like the SA-1, could 30 fps games like Top Gear 2 take advantage of this kind of improvement? (By forcing 60 fps through hard coding, of course.)
You mean by adding an SA-1? Probably. It's much faster than the base SNES CPU even in FastROM (Top Gear 2 is FastROM), so one assumes that the game could be 60 fps - unless there's a lot of DMA to VRAM going on that serves as a bottleneck. The SA-1 won't help you with that.

creaothceann
Posts: 230
Joined: Mon Jan 23, 2006 7:47 am
Location: Germany

Re: Maybe floating point isn't that bad of an idea.

Post by creaothceann » Sun Jul 19, 2020 2:08 am

93143 wrote:
Sat Jul 18, 2020 6:55 pm
You can't give the PPU a list of multiplications to do for the CPU.
Maybe via HDMA? When bit 7 of $43x0 is set, the 5A22 reads from the PPU instead of writing to it.

93143
Posts: 1193
Joined: Fri Jul 04, 2014 9:31 pm

Re: Maybe floating point isn't that bad of an idea.

Post by 93143 » Sun Jul 19, 2020 2:56 am

That should work, yeah.

But it still technically qualifies as the CPU providing a single pair of input values and then reading the result before providing new values. It's not the PPU you're giving the list to; it's the DMA unit. I was trying to counter the idea that the CPU could end up reading too slowly and miss results - I'm not sure where he got that idea, so I was trying to cover all the bases.

Señor Ventura
Posts: 113
Joined: Sat Aug 20, 2016 3:58 am

Re: Maybe floating point isn't that bad of an idea.

Post by Señor Ventura » Sun Jul 19, 2020 4:27 am

93143 wrote:
Sat Jul 18, 2020 6:55 pm
If the PPU is not in Mode 7, the CPU can write the inputs and the answer will appear in the PPU multiplication result registers and stay there as long as you need it to. There's no rush to read it. What I meant by "the CPU physically can't read the result before it's ready" is that the result is ready so fast that you don't need to wait.
So, in terms of efficiency, is it as if the CPU could do multiplications in only 8 cycles or so (plus another 8-12 cycles to copy the result into WRAM)? Other CPUs take close to a hundred cycles, I think, so the advantage is clear, but the function is still poorly exploited because of the CPU's slow clock.

How fast would the 65816 have to run to read every multiplication without making the PPU1 wait before doing the next one? 40 MHz?
93143 wrote:
Sat Jul 18, 2020 6:55 pm
You can't give the PPU a list of multiplications to do for the CPU. The CPU has to poke in the values a byte at a time and read the answer to each multiplication before doing the next one.
So the trick is to program it manually, right?
93143 wrote:
Sat Jul 18, 2020 6:55 pm
You mean by adding an SA-1? Probably. It's much faster than the base SNES CPU even in FastROM (Top Gear 2 is FastROM), so one assumes that the game could be 60 fps - unless there's a lot of DMA to VRAM going on that serves as a bottleneck. The SA-1 won't help you with that.
Games like the Nigel Mansell F1 game run at 60 fps, so I doubt that all the trees and objects in Top Gear 2 need more than 5 KB of tile updates. The reason is probably the need to manage 20 vehicles at the same time.

The same thing happens with Speed Racer. It is FastROM (Snes9x shows "30/fast rom"), but the frame rate ends up poor on PAL systems (not so much on NTSC). The 65816 doesn't seem able to deliver more performance.

93143
Posts: 1193
Joined: Fri Jul 04, 2014 9:31 pm

Re: Maybe floating point isn't that bad of an idea.

Post by 93143 » Sun Jul 19, 2020 11:57 pm

So, in terms of efficiency, is it as if the CPU could do multiplications in only 8 cycles or so?
That might be a bit optimistic. Remember, just writing a single byte to a known address is already 4 cycles, unless the address is direct page, in which case it's only 3. You need to write three bytes, and then read at least one and probably two, maybe three.

Let's see... Starting with the 16-bit value in the accumulator and the 8-bit value in X, with direct page set to $2100, we have:

sep #$20     ; set A to 8-bit
sta $1b      ; write low byte of 16-bit value
xba          ; switch bytes in AB
sta $1b      ; write high byte of 16-bit value
stx $1c      ; write 8-bit value
rep #$20     ; set A to 16-bit
lda $35      ; read upper 16 bits of result
Every instruction there is 3 cycles except for the last one, which is 4. The total comes to 22 cycles. And I didn't count loading the values into A and X. It's also only 16-bit x 8-bit, whereas something like the 68000 (or the SA-1) can do full 16x16.
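
The tally behind that total, written out as a tiny Python snippet. The per-instruction counts are simply the ones stated above (3 cycles each, 4 for the final 16-bit read), under the same starting assumptions.

```python
# (instruction, CPU cycles) as given in the post above
sequence = [
    ("sep #$20", 3),
    ("sta $1b", 3),
    ("xba", 3),
    ("sta $1b", 3),
    ("stx $1c", 3),
    ("rep #$20", 3),
    ("lda $35", 4),
]

total = sum(cycles for _, cycles in sequence)
print(total)  # 22
```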

Please don't take this estimate of the cycle count as a definitive number. My code may not be optimal, and as I said I made some starting assumptions.

If I'm not mistaken, you can write just one of the input values rather than both, and it will still do the multiplication of the new value with the old value in the other register. It's possible that it is actually doing the multiplication constantly, and the result persists only because the inputs don't change...
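
If the behaviour described above holds, it could be pictured with a toy model like the following: the inputs are just stored, and the product is derived from whatever they currently contain. This is a Python sketch of that hypothesis only; the class and method names are invented, and it deliberately ignores how the real registers are written byte by byte.

```python
class MultiplierModel:
    """Toy model: inputs persist and the product is recomputed on every read."""

    def __init__(self):
        self.a16 = 0
        self.b8 = 0

    def write_a(self, low, high):
        """Store the 16-bit input from its low and high bytes."""
        self.a16 = ((high & 0xFF) << 8) | (low & 0xFF)

    def write_b(self, value):
        """Store the 8-bit input."""
        self.b8 = value & 0xFF

    def read(self):
        """Return the signed product of the current inputs as a raw 24-bit value."""
        a = self.a16 - 0x10000 if self.a16 & 0x8000 else self.a16
        b = self.b8 - 0x100 if self.b8 & 0x80 else self.b8
        return (a * b) & 0xFFFFFF


m = MultiplierModel()
m.write_a(0x34, 0x12)  # 16-bit input = 0x1234
m.write_b(2)
print(hex(m.read()))   # 0x2468
m.write_b(3)           # change only the 8-bit input
print(hex(m.read()))   # 0x369c
```
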
speed racer
Speed Racer looks like it's using Mode 7 to draw hills. That could get expensive, because you have to do a fair bit of math to figure out all the coefficients for the HDMA tables. It's possible that's the reason, and with most of the screen taken up by Mode 7 there's not much room to use the PPU multiplier. I'd guess the SA-1 could get it to 60. (It's even possible that optimization of the S-CPU code could get it to 60, but since I haven't looked at the code or coded anything similar, that's not a prediction, just an acknowledgement of a possibility.)

The SA-1 has a better multiplier than either the SNES CPU or the PPU. It's not as fast as the PPU multiplier at doing the actual calculation (you have to wait 5 cycles to read the result) but it's a full signed 16x16 multiplier (32-bit result) and you can use 16-bit writes for the inputs so it's quicker to use overall. (It's also got an accumulation mode where you can do a series of multiplications and it will add them; this produces a 40-bit result.) It's paired with a 16/16 signed divider too, which still only takes 5 cycles between the final write and the result being ready, as compared with the SNES CPU's divider which is only 16/8 unsigned and takes 16 cycles.
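
To show what those two SA-1 modes compute, here is a small Python model of a signed 16x16 multiply with a 32-bit result and of an accumulating variant kept to 40 bits. It models only the arithmetic described above; the function names are hypothetical and no register interface is implied.

```python
def s16(value):
    """Interpret the low 16 bits of value as two's complement."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def multiply(a, b):
    """Signed 16x16 product as a raw 32-bit value."""
    return (s16(a) * s16(b)) & 0xFFFFFFFF


def multiply_accumulate(pairs):
    """Sum of signed 16x16 products, kept as a raw 40-bit value."""
    total = 0
    for a, b in pairs:
        total = (total + s16(a) * s16(b)) & 0xFFFFFFFFFF
    return total


print(hex(multiply(0x7FFF, 0x7FFF)))                       # 0x3fff0001
print(multiply_accumulate([(3, 4), (-2, 5), (100, 100)]))  # 10002
```

The accumulating form is essentially a dot product, and the extra 8 bits give headroom for summing many 32-bit products.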

...note also that when I say "cycles", I mean cycles of the CPU that's doing the task. In the case of the SA-1, a "cycle" is a lot shorter than it is in the case of the S-CPU...
