Avoiding using signed multiplication while in Mode 7

psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Avoiding using signed multiplication while in Mode 7

Post by psycopathicteen »

If I'm planning on making a Mode 7 level that uses some of the same enemies as non-Mode 7 levels, is there any reason to use $211b, $211c and $2134 in the first place?
lidnariq
Posts: 11429
Joined: Sun Apr 13, 2008 11:12 am

Re: Avoiding using signed multiplication while in Mode 7

Post by lidnariq »

To do a multiplication using either set of registers, the 65816 spends most of its time in I/O...

Comparing apples to apples restricts you to u7·u7→u14 anyway.

The fastest you could run a multiplication using the CPU registers would be STn.16 dpg ; wait 8 ; LDn.16 dpg, or (4+8+4)·6 → 96MtCy
The fastest via the PPU registers would be STZ.8 dpg ; STn.16 dpg ; LDn.16 dpg, or (3+4+4)·6 → 66MtCy
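
The two totals above can be checked with a few lines of arithmetic. This is only a sketch in Python, not SNES code. It assumes, as the figures above do, that every CPU cycle in these sequences costs 6 master cycles; the names `MASTER_PER_CYCLE`, `cpu_multiplier_cost` and `ppu_multiplier_cost` are made up for illustration.

```python
# Assumption taken from the figures above: each CPU cycle in these
# sequences is a fast cycle of 6 master clocks (MtCy).
MASTER_PER_CYCLE = 6

def cpu_multiplier_cost(store=4, wait=8, load=4):
    """STn.16 dpg ; wait 8 ; LDn.16 dpg, in master cycles."""
    return (store + wait + load) * MASTER_PER_CYCLE

def ppu_multiplier_cost(clear=3, store=4, load=4):
    """STZ.8 dpg ; STn.16 dpg ; LDn.16 dpg, in master cycles."""
    return (clear + store + load) * MASTER_PER_CYCLE

cpu = cpu_multiplier_cost()
ppu = ppu_multiplier_cost()
print(cpu, ppu)                    # 96 66
print((cpu - ppu) * 100 // cpu)    # 31: the PPU sequence is about 31% shorter
```

The per-instruction cycle counts are exposed as parameters so that other assumptions (for example, a different amount of wait time) can be tried without rewriting the formula.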

In practice, the multiplier and multiplicand are very unlikely to be packed in a useful way in your registers, and D is unlikely to point to the multiplication registers. Unless you're specifically doing some kind of SIMD-like block processing (using one multiplier on a bunch of multiplicands) you're probably going to make it worse by adding even more loads and stores.

Where the PPU multiplier wins big is just in requiring fewer total multiplications.
HihiDanni
Posts: 186
Joined: Tue Apr 05, 2016 5:25 pm

Re: Avoiding using signed multiplication while in Mode 7

Post by HihiDanni »

You don't necessarily need to wait 8 and have the CPU do nothing. Why not spend that time doing additional processing to mask that latency?
lidnariq wrote:In practice, the multiplier and multiplicand are very unlikely to be packed in a useful way in your registers
If you're doing 8x8 mult and you have A set to be 8-bit size, you effectively get two 8-bit registers out of it.
lidnariq wrote:and D is unlikely to point to the multiplication registers.
Unless, you know, you set D to point to the multiplication registers. Which you're probably going to do anyway if you're processing a bunch of stuff at once. Then I'd say it's extremely likely to point to the multiplication registers.
lidnariq wrote:Unless you're specifically doing some kind of SIMD-like block processing (using one multiplier on a bunch of multiplicands) you're probably going to make it worse by adding even more loads and stores.
Not sure what point you're trying to make here, or what it's even in reference to. Both methods require loads and stores.
SNES NTSC 2/1/3 1CHIP | serial number UN318588627
tepples
Posts: 22705
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Avoiding using signed multiplication while in Mode 7

Post by tepples »

lidnariq wrote:Unless you're specifically doing some kind of SIMD-like block processing (using one multiplier on a bunch of multiplicands) you're probably going to make it worse by adding even more loads and stores.
Holding one factor constant is the case for applications that (ab)use the multiplier as a faux barrel shifter. In VWF rendering, for example, each bitplane in a tile is "multiplied" by a particular power of two in order to shift it left by so many bits. This works with the CPU multiplier but not the PPU one because of the signedness constraint.
HihiDanni wrote:Unless, you know, you set D to point to the multiplication registers. Which you're probably going to do anyway if you're processing a bunch of stuff at once.
But if you have set D to $4200 to use multiplication, you have no place to store pointers to the data that you're processing using (dd),Y or [dd],Y addressing, unless you take the cycle and bank flexibility hit of using (dd,S),Y.
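
To make the faux barrel shifter idea concrete, here is a small Python model (a sketch, not SNES code). It assumes the CPU multiplier is unsigned 8×8→16 and that the 8-bit factor of the PPU multiplier is treated as signed, which is the signedness constraint mentioned above. The function names are invented for illustration.

```python
def cpu_mul(a, b):
    """Model of the CPU multiplier: unsigned 8-bit x unsigned 8-bit -> 16-bit."""
    assert 0 <= a <= 0xFF and 0 <= b <= 0xFF
    return a * b

def shift_row_left(row, n):
    """Shift one 8-pixel bitplane row left by n bits (0..7) by multiplying
    by 2**n. Returns (byte staying in this tile, byte of overflowed bits)."""
    product = cpu_mul(row, 1 << n)
    return product & 0xFF, product >> 8

def as_signed8(b):
    """How a signed 8-bit factor would interpret the byte b."""
    return b - 0x100 if b & 0x80 else b

low, high = shift_row_left(0b10110001, 3)
print(f"{high:08b} {low:08b}")   # 00000101 10001000
print(as_signed8(1 << 7))        # -128: the factor for a 7-bit shift goes negative
```

In this model the factor stays constant for a whole glyph, so only the bitplane byte changes from one multiplication to the next.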
lidnariq
Posts: 11429
Joined: Sun Apr 13, 2008 11:12 am

Re: Avoiding using signed multiplication while in Mode 7

Post by lidnariq »

HihiDanni wrote:If you're doing 8x8 mult and you have A set to be 8-bit size, you effectively get two 8-bit registers out of it.
Loading things into B and using XBA is as I/O intensive as just loading things separately. It is quite literally only useful if you happen to have a block of RAM where the multiplicands and multipliers are already stored interleaved.
HihiDanni wrote:Unless, you know, you set D to point to the multiplication registers. Which you're probably going to do anyway if you're processing a bunch of stuff at once. Then I'd say it's extremely likely to point to the multiplication registers.
Tautologies are tautologies? You have to take the values you're loading from somewhere, and it could be (I'd argue even likely) the case that setting D to RAM (or PPU registers) is more useful. There's nothing else useful in the $4200 block to make it obvious that saving 12-18MtCy on the store/load from the multiplier part is better than instead saving 12-18MtCy on e.g. RAM I/O for the multiplication routine itself.
HihiDanni wrote:Not sure what point you're trying to make here, or what it's even in reference to. Both methods require loads and stores.
If you use one multiplier you don't have to reload it. This is true regardless of whether you're using the CPU or PPU multiplier.
HihiDanni
Posts: 186
Joined: Tue Apr 05, 2016 5:25 pm

Re: Avoiding using signed multiplication while in Mode 7

Post by HihiDanni »

What's the use case of using indirect access here? You can encode a base address and an offset within the same Y register, as long as you don't need to differentiate between the two.
tepples
Posts: 22705
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Avoiding using signed multiplication while in Mode 7

Post by tepples »

HihiDanni wrote:What's the use case of using indirect access here? You can encode a base address and an offset within the same Y register, as long as you don't need to differentiate between the two.
For one thing, a base address used with aaaaaa,X or [dd],Y is 24-bit, whereas an offset or base plus offset is only 16-bit. Enjoy thrashing DBR when alternating between reading the source bank and writing the destination bank. For another, often you do need to differentiate between the base address and offset because your loop finishes once an index register has decreased to zero (DEX/DEY BNE) or past zero (DEX/DEY BPL). Otherwise, you would have to store the loop counter (for DEC) or end base plus offset (for CPX or CPY) somewhere, and $4200-$42FF doesn't have a good place to do so.
HihiDanni
Posts: 186
Joined: Tue Apr 05, 2016 5:25 pm

Re: Avoiding using signed multiplication while in Mode 7

Post by HihiDanni »

tepples wrote:Enjoy thrashing DBR when alternating between reading the source bank and writing the destination bank.
If the destination is the multiplication register then you can just use D, with no need to bank switch. Should you need to bank switch though, as I had mentioned before, you can do it while waiting for the multiplication result.
tepples wrote:For another, often you do need to differentiate between the base address and offset because your loop finishes once an index register has decreased to zero (DEX/DEY BNE) or past zero (DEX/DEY BPL). Otherwise, you would have to store the loop counter (for DEC) or end base plus offset (for CPX or CPY) somewhere, and $4200-$42FF doesn't have a good place to do so.
If your data set allows for terminators, you can do the loop-breaking branch statement during the data load.

Edit: Oops, I missed a reply.
lidnariq wrote:Loading things into B and using XBA is as I/O intensive as just loading things separately. It is quite literally only useful if you happen to have a block of RAM where the multiplicands and multipliers are already stored interleaved.
You can do a single 16-bit load into C. XBA to my knowledge does not require any slow cycles for FastROM (unless you're executing the code out of RAM).
lidnariq wrote:Tautologies are tautologies? You have to take the values you're loading from somewhere, and it could be (I'd argue even likely) the case that setting D to RAM (or PPU registers) is more useful. There's nothing else useful in the $4200 block to make it obvious that saving 12-18MtCy on the store/load from the multiplier part is better than instead saving 12-18MtCy on e.g. RAM I/O for the multiplication routine itself.
You were originally referring to using D to write to the multiplication registers, as though not having D for it would be the bottleneck. The answer, simply, is to set D to write to the multiplication registers. (It is true that you could be using D for something else. And thinking about this some more, much of my own programming style involves the state of D being localized within the function, so changing D is a non-issue in this case)
lidnariq wrote:If you use one multiplier you don't have to reload it. This is true regardless of whether you're using the CPU or PPU multiplier.
The question was what is "worse" referring to in your original post. Worse in comparison to what? (I guess looking at it now you meant not using two multipliers would exacerbate the losses from not using D)
tepples
Posts: 22705
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Avoiding using signed multiplication while in Mode 7

Post by tepples »

HihiDanni wrote:
tepples wrote:Enjoy thrashing DBR when alternating between reading the source bank and writing the destination bank.
If the destination is the multiplication register then you can just use D, with no need to bank switch.
The loop I had envisioned was read source, write to multiplier, read multiplier, write to destination.
HihiDanni wrote:If your data set allows for terminators, you can do the loop-breaking branch statement during the data load.
Good for reading characters out of "<Arnold> I still love Vista, baby\0", not so much for shifting the glyphs that represent each letter.
HihiDanni wrote:You can do a single 16-bit load into C.
With a 6-cycle (36mc) penalty for REP plus SEP, if needed.

I guess I might need to give an example of what this sort of compositing code might look like.
HihiDanni
Posts: 186
Joined: Tue Apr 05, 2016 5:25 pm

Re: Avoiding using signed multiplication while in Mode 7

Post by HihiDanni »

tepples wrote:With a 6-cycle (36mc) penalty for REP plus SEP, if needed.
As it turns out, REP and SEP do indeed take three cycles each (for some reason I was imagining two cycles), so that might not be the optimal way to do it after all.
lidnariq
Posts: 11429
Joined: Sun Apr 13, 2008 11:12 am

Re: Avoiding using signed multiplication while in Mode 7

Post by lidnariq »

HihiDanni wrote:You can do a single 16-bit load into C. XBA to my knowledge does not require any slow cycles for FastROM (unless you're executing the code out of RAM).
But that requires that your in-memory structure already be compatible with that (multiplier and multiplicand in adjacent bytes). There could well be reasons that's not feasible.
HihiDanni wrote:You were originally referring to using D to write to the multiplication registers, as though not having D for it would be the bottleneck. The answer, simply, is to set D to write to the multiplication registers. (It is true that you could be using D for something else. And thinking about this some more, much of my own programming style involves the state of D being localized within the function, so changing D is a non-issue in this case)
That was not the argument I was trying to make.

I was attempting to state that the performance benefit of using direct page access to $42xx is so small in comparison to all the other overhead that it doesn't matter much. And especially in the case of $4200, you get literally no other benefit from doing so; there's nothing else there that is useful to have faster access to (not even two bytes of RAM!). Additionally, changing D involves a small but noticeable number of cycles lost (PEA / PLD totals 6·6+4·8=68MtCy each time) or significant register pressure (LDA #imm / TAD is 30MtCy each time but you've destroyed A), so you have to do a bunch of multiplications in a row in order for it to be worthwhile.
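
As a rough illustration of how many multiplications "a bunch" might be, the break-even point can be estimated from figures quoted in this thread. The Python sketch below assumes the 68 and 30 MtCy switching costs above and the 12-18 MtCy saving per multiplication mentioned earlier. The function name and the option to count a second switch (to put D back afterwards) are assumptions added for illustration.

```python
import math

PEA_PLD = 6 * 6 + 4 * 8   # 68 MtCy, as quoted above
LDA_TAD = 5 * 6           # 30 MtCy, as quoted above, but A is overwritten

def break_even(switch_cost, saving_per_multiply, restore=False):
    """Smallest number of multiplications for which the direct-page saving
    covers the cost of changing D (twice if D must be put back afterwards)."""
    total = switch_cost * (2 if restore else 1)
    return math.ceil(total / saving_per_multiply)

for saving in (12, 18):
    print(saving,
          break_even(PEA_PLD, saving),
          break_even(PEA_PLD, saving, restore=True),
          break_even(LDA_TAD, saving))
# 12 6 12 3
# 18 4 8 2
```

Under these assumptions, a one-off multiplication never pays for the switch, while a loop of a dozen or more does.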
HihiDanni wrote:The question was what is "worse" referring to in your original post. Worse in comparison to what? (I guess looking at it now you meant not using two multipliers would exacerbate the losses from not using D)
What I was really trying to aim at was something to the effect of:

The way to cram the fastest speed out of either multiplier is to minimize the time spent on I/O, because I/O is the bottleneck. Setting D to point at the multiplication registers can help ... but only if there isn't somewhere else where it would be more useful to leave it. Setting the multiplier once and just updating the multiplicand can help, but only if you're doing a bunch of multiplications in a row, all by the same multiplier.


Specifically regarding the topic starter, the question was "Should I write two versions of the code, one that uses the faster PPU multiplier and one that uses the slower CPU multiplier? Or just use the slower CPU multiplier for everything?". Everything in my reply was my reasoning toward the conclusion: "IF you can do your math in the u7·u8→u15 least common denominator of both multipliers, there is no significant benefit to using one over the other (so go ahead and use the CPU multiplier exclusively). The overwhelmingly biggest benefit from the PPU multiplier comes when you need fewer total multiplications (and fewer cycles spent on I/O)."
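
The "least common denominator" can be illustrated with a small model. This Python sketch assumes the CPU multiplier is unsigned 8×8→16 and the PPU multiplier is signed 16×8→24; `cpu_mul`, `ppu_mul` and `to_signed` are illustrative names, not real routines. It checks that the two models agree whenever one factor fits in 8 unsigned bits and the other in 7, and shows them disagreeing once the 8-bit PPU factor has its top bit set.

```python
def cpu_mul(a, b):
    """Unsigned 8-bit x unsigned 8-bit -> unsigned 16-bit."""
    return ((a & 0xFF) * (b & 0xFF)) & 0xFFFF

def to_signed(value, bits):
    """Reinterpret the low `bits` bits of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def ppu_mul(a16, b8):
    """Signed 16-bit x signed 8-bit -> 24-bit result (returned as raw bits)."""
    return (to_signed(a16, 16) * to_signed(b8, 8)) & 0xFFFFFF

# Every u8 x u7 pair gives the same answer on both models.
assert all(cpu_mul(a, b) == ppu_mul(a, b)
           for a in range(256) for b in range(128))

print(hex(cpu_mul(200, 100)), hex(ppu_mul(200, 100)))  # 0x4e20 0x4e20
print(hex(cpu_mul(2, 0x80)), hex(ppu_mul(2, 0x80)))    # 0x100 0xffff00
```

The largest product in the shared range is 255·127 = 32385, which fits in 15 bits, matching the u15 result width given above.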
HihiDanni
Posts: 186
Joined: Tue Apr 05, 2016 5:25 pm

Re: Avoiding using signed multiplication while in Mode 7

Post by HihiDanni »

Your original post makes sense now; thanks for the clarification.
lidnariq wrote:Additionally, changing D involves a small but noticeable number of cycles lost (PEA / PLD totals 6·6+4·8=68MtCy each time) or significant register pressure (LDA #imm / TAD is 30MtCy each time but you've destroyed A) so you have to do a bunch of multiplications in a row in order for it to be worthwhile.
Yeah, I had envisioned the multiplications being done in a tight loop. Probably useful for Mode 7 (though it'd depend on how exactly Mode 7 is being used here, whether there will be any perspective effects or not). For object thinkers I'd leave D alone.
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: Avoiding using signed multiplication while in Mode 7

Post by psycopathicteen »

I think I should test everything out with the $42xx registers to see if I need to make 2 different multiplication routines.
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: Avoiding using signed multiplication while in Mode 7

Post by psycopathicteen »

It appears that the performance loss is negligible. :D

Now I need to make something with mode 7.
Señor Ventura
Posts: 233
Joined: Sat Aug 20, 2016 3:58 am

Re: Avoiding using signed multiplication while in Mode 7

Post by Señor Ventura »

lidnariq wrote:To do a multiplication using either set of registers, the 65816 spends most of its time in I/O...

Comparing apples to apples restricts you to u7·u7→u14 anyway.

The fastest you could run a multiplication using the CPU registers would be STn.16 dpg ; wait 8 ; LDn.16 dpg, or (4+8+4)·6 → 96MtCy
The fastest via the PPU registers would be STZ.8 dpg ; STn.16 dpg ; LDn.16 dpg, or (3+4+4)·6 → 66MtCy

In practice, the multiplier and multiplicand are very unlikely to be packed in a useful way in your registers, and D is unlikely to point to the multiplication registers. Unless you're specifically doing some kind of SIMD-like block processing (using one multiplier on a bunch of multiplicands) you're probably going to make it worse by adding even more loads and stores.

Where the PPU multiplier wins big is just in requiring fewer total multiplications.
So loading the operands is only about 1/3 faster using the PPU, but its advantage in doing the multiplications is much more noticeable once they are loaded... is that right?