Avoiding using signed multiplication while in Mode 7
-
- Posts: 3140
- Joined: Wed May 19, 2010 6:12 pm
Avoiding using signed multiplication while in Mode 7
If I'm planning on making a Mode 7 level that uses some of the same enemies as non-Mode 7 levels, is there any reason to use $211b, $211c and $2134 in the first place?
Re: Avoiding using signed multiplication while in Mode 7
To do a multiplication using either set of registers, the 65816 spends most of its time in I/O...
Comparing apples to apples restricts you to u7·u7→u14 anyway.
The fastest you could run a multiplication using the CPU registers would be STn.16 dpg ; wait 8 ; LDn.16 dpg, or (4+8+4)·6 → 96MtCy
The fastest via the PPU registers would be STZ.8 dpg ; STn.16 dpg ; LDn.16 dpg, or (3+4+4)·6 → 66MtCy
In practice, the multiplier and multiplicand are very unlikely to be packed in a useful way in your registers, and D is unlikely to point to the multiplication registers. Unless you're specifically doing some kind of SIMD-like block processing (using one multiplier on a bunch of multiplicands) you're probably going to make it worse by adding even more loads and stores.
Where the PPU multiplier wins big is just in requiring fewer total multiplications.
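The arithmetic behind those two totals can be checked with a short sketch. It assumes, as the figures above do, that every CPU cycle costs 6 master cycles (fast code and register accesses, D pointing at the registers); the function and variable names are made up for illustration.

```python
# Assumption: every CPU cycle is a 6-master-cycle access, as in the figures above.
MASTER_PER_CYCLE = 6


def master_cycles(cpu_cycles):
    """Total master cycles for a sequence of per-step CPU cycle counts."""
    return sum(cpu_cycles) * MASTER_PER_CYCLE


# CPU multiplier: 16-bit store, 8 cycles of waiting, 16-bit load.
cpu_path = master_cycles([4, 8, 4])

# PPU multiplier: 8-bit STZ, 16-bit store, 16-bit load, no waiting.
ppu_path = master_cycles([3, 4, 4])

# Difference between the two best cases, in master cycles.
saved = cpu_path - ppu_path

print(cpu_path, ppu_path, saved)
```

Under these assumptions the PPU path saves 30 master cycles per multiplication, a little under a third of the CPU path, and any extra loads and stores around either sequence dilute that difference.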
Re: Avoiding using signed multiplication while in Mode 7
You don't necessarily need to wait 8 and have the CPU do nothing. Why not spend that time doing additional processing to mask that latency?
lidnariq wrote: In practice, the multiplier and multiplicand are very unlikely to be packed in a useful way in your registers
If you're doing 8x8 mult and you have A set to be 8-bit size, you effectively get two 8-bit registers out of it.
lidnariq wrote: and D is unlikely to point to the multiplication registers.
Unless, you know, you set D to point to the multiplication registers. Which you're probably going to do anyway if you're processing a bunch of stuff at once. Then I'd say it's extremely likely to point to the multiplication registers.
lidnariq wrote: Unless you're specifically doing some kind of SIMD-like block processing (using one multiplier on a bunch of multiplicands) you're probably going to make it worse by adding even more loads and stores.
Not sure what point you're trying to make here, or what it's even in reference to. Both methods require loads and stores.
SNES NTSC 2/1/3 1CHIP | serial number UN318588627
Re: Avoiding using signed multiplication while in Mode 7
lidnariq wrote: Unless you're specifically doing some kind of SIMD-like block processing (using one multiplier on a bunch of multiplicands) you're probably going to make it worse by adding even more loads and stores.
Holding one factor constant is the case for applications that (ab)use the multiplier as a faux barrel shifter. In VWF rendering, for example, each bitplane in a tile is "multiplied" by a particular power of two in order to shift it left by so many bits. This works with the CPU multiplier but not the PPU one because of the signedness constraint.
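To make the signedness constraint concrete, here is a rough model of the shift-by-multiply trick. It assumes the CPU multiplier is unsigned 8-bit × unsigned 8-bit → 16-bit and the PPU multiplier is signed 16-bit × signed 8-bit → 24-bit; all function names are hypothetical, and putting the power of two in the 8-bit operand is just one possible arrangement.

```python
def as_signed(value, bits):
    """Reinterpret an unsigned bit pattern as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def cpu_multiply(a, b):
    """Model: unsigned 8-bit x unsigned 8-bit -> 16-bit product."""
    return ((a & 0xFF) * (b & 0xFF)) & 0xFFFF


def ppu_multiply(a, b):
    """Model: signed 16-bit x signed 8-bit -> 24-bit product."""
    return (as_signed(a, 16) * as_signed(b, 8)) & 0xFFFFFF


def shift_left_via_multiply(row, shift, multiply):
    """Shift an 8-bit bitplane row left by 0..7 bits by multiplying by 2**shift."""
    return multiply(row, 1 << shift) & 0xFFFF


row = 0xFF
for shift in range(8):
    expected = row << shift
    via_cpu = shift_left_via_multiply(row, shift, cpu_multiply)
    via_ppu = shift_left_via_multiply(row, shift, ppu_multiply)
    print(shift, hex(expected), hex(via_cpu), hex(via_ppu))
```

In this model the two agree for shifts 0 through 6, but a shift of 7 needs the factor 128, which the signed 8-bit operand reads as -128. Swapping the operands only moves the problem, since any row with bit 7 set would then be read as negative.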
HihiDanni wrote: Unless, you know, you set D to point to the multiplication registers. Which you're probably going to do anyway if you're processing a bunch of stuff at once.
But if you have set D to $4200 to use multiplication, you have no place to store pointers to the data that you're processing using (dd),Y or [dd],Y addressing, unless you take the cycle and bank flexibility hit of using (dd,S),Y.
Re: Avoiding using signed multiplication while in Mode 7
HihiDanni wrote: If you're doing 8x8 mult and you have A set to be 8-bit size, you effectively get two 8-bit registers out of it.
Loading things into B and using XBA is as I/O intensive as just loading things separately. It is quite literally only useful if you happen to have a block of RAM where the multiplicands and multipliers are already stored interleaved.
HihiDanni wrote: Unless, you know, you set D to point to the multiplication registers. Which you're probably going to do anyway if you're processing a bunch of stuff at once. Then I'd say it's extremely likely to point to the multiplication registers.
Tautologies are tautologies? You have to take the values you're loading from somewhere, and it could be (I'd argue even likely) the case that setting D to RAM (or PPU registers) is more useful. There's nothing else useful in the $4200 block to make it obvious that saving 12-18MtCy on the store/load from the multiplier part is better than instead saving 12-18MtCy on e.g. RAM I/O for the multiplication routine itself.
HihiDanni wrote: Not sure what point you're trying to make here, or what it's even in reference to. Both methods require loads and stores.
If you use one multiplier you don't have to reload it. This is true regardless of whether you're using the CPU or PPU multiplier.
Re: Avoiding using signed multiplication while in Mode 7
What's the use case of using indirect access here? You can encode a base address and an offset within the same Y register, as long as you don't need to differentiate between the two.
Re: Avoiding using signed multiplication while in Mode 7
HihiDanni wrote: What's the use case of using indirect access here? You can encode a base address and an offset within the same Y register, as long as you don't need to differentiate between the two.
For one thing, a base address used with aaaaaa,X or [dd],Y is 24-bit, whereas an offset or base plus offset is only 16-bit. Enjoy thrashing DBR when alternating between reading the source bank and writing the destination bank. For another, often you do need to differentiate between the base address and offset because your loop finishes once an index register has decreased to zero (DEX/DEY BNE) or past zero (DEX/DEY BPL). Otherwise, you would have to store the loop counter (for DEC) or end base plus offset (for CPX or CPY) somewhere, and $4200-$42FF doesn't have a good place to do so.
Re: Avoiding using signed multiplication while in Mode 7
tepples wrote: Enjoy thrashing DBR when alternating between reading the source bank and writing the destination bank.
If the destination is the multiplication register then you can just use D, with no need to bank switch. Should you need to bank switch though, as I had mentioned before, you can do it while waiting for the multiplication result.
tepples wrote: For another, often you do need to differentiate between the base address and offset because your loop finishes once an index register has decreased to zero (DEX/DEY BNE) or past zero (DEX/DEY BPL). Otherwise, you would have to store the loop counter (for DEC) or end base plus offset (for CPX or CPY) somewhere, and $4200-$42FF doesn't have a good place to do so.
If your data set allows for terminators, you can do the loop-breaking branch statement during the data load.
Edit: Oops, I missed a reply.
lidnariq wrote: Loading things into B and using XBA is as I/O intensive as just loading things separately. It is quite literally only useful if you happen to have a block of RAM where the multiplicands and multipliers are already stored interleaved.
You can do a single 16-bit load into C. XBA to my knowledge does not require any slow cycles for FastROM (unless you're executing the code out of RAM).
lidnariq wrote: Tautologies are tautologies? You have to take the values you're loading from somewhere, and it could be (I'd argue even likely) the case that setting D to RAM (or PPU registers) is more useful. There's nothing else useful in the $4200 block to make it obvious that saving 12-18MtCy on the store/load from the multiplier part is better than instead saving 12-18MtCy on e.g. RAM I/O for the multiplication routine itself.
You were originally referring to using D to write to the multiplication registers, as though not having D for it would be the bottleneck. The answer, simply, is to set D to write to the multiplication registers. (It is true that you could be using D for something else. And thinking about this some more, much of my own programming style involves the state of D being localized within the function, so changing D is a non-issue in this case)
lidnariq wrote: If you use one multiplier you don't have to reload it. This is true regardless of whether you're using the CPU or PPU multiplier.
The question was what is "worse" referring to in your original post. Worse in comparison to what? (I guess looking at it now you meant not using two multipliers would exacerbate the losses from not using D)
Re: Avoiding using signed multiplication while in Mode 7
tepples wrote: Enjoy thrashing DBR when alternating between reading the source bank and writing the destination bank.
HihiDanni wrote: If the destination is the multiplication register then you can just use D, with no need to bank switch.
The loop I had envisioned was read source, write to multiplier, read multiplier, write to destination.
HihiDanni wrote: If your data set allows for terminators, you can do the loop-breaking branch statement during the data load.
Good for reading characters out of "<Arnold> I still love Vista, baby\0", not so much for shifting the glyphs that represent each letter.
HihiDanni wrote: You can do a single 16-bit load into C.
With a 6-cycle (36mc) penalty for REP plus SEP, if needed.
I guess I might need to give an example of what this sort of compositing code might look like.
Re: Avoiding using signed multiplication while in Mode 7
tepples wrote: With a 6-cycle (36mc) penalty for REP plus SEP, if needed.
As it turns out, REP and SEP indeed take three cycles each (for some reason I was imagining two cycles) so that might not be the most optimal way to do it after all.
Re: Avoiding using signed multiplication while in Mode 7
HihiDanni wrote: You can do a single 16-bit load into C. XBA to my knowledge does not require any slow cycles for FastROM (unless you're executing the code out of RAM).
But that requires that your in-memory structure already be compatible with that (multiplier and multiplicand in adjacent bytes). There could well be reasons that's not feasible.
HihiDanni wrote: You were originally referring to using D to write to the multiplication registers, as though not having D for it would be the bottleneck. The answer, simply, is to set D to write to the multiplication registers. (It is true that you could be using D for something else. And thinking about this some more, much of my own programming style involves the state of D being localized within the function, so changing D is a non-issue in this case)
That was not the argument I was trying to make.
I was attempting to state that the performance benefit of using direct page access to $42xx is so small in comparison to all the other overhead that it doesn't matter much. And especially in the case of $4200, you get literally no other benefit from doing so; there's nothing else there that is useful to have faster access to (not even two bytes of RAM!). Additionally, changing D costs a small but noticeable number of cycles (PEA / PLD totals 6·6+4·8=68MtCy each time) or significant register pressure (LDA #imm / TAD is 30MtCy each time but you've destroyed A), so you have to do a bunch of multiplications in a row for it to be worthwhile.
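How many multiplications "a bunch" means can be estimated with a back-of-the-envelope sketch. It takes the 68MtCy and 30MtCy setup costs from this post and the 12-18MtCy per-multiplication saving quoted earlier in the thread, and it assumes D is changed once and never restored; the helper name and the `restore` parameter are invented for illustration.

```python
import math

PEA_PLD_COST = 6 * 6 + 4 * 8   # 68 master cycles per change of D
LDA_TAD_COST = 30              # master cycles per change of D, A is overwritten


def break_even(setup_cost, saving_per_multiply, restore=False):
    """Smallest number of multiplications whose savings cover the setup cost.

    restore=True assumes D must be changed back afterwards at the same cost,
    which doubles the overhead (an assumption, not something the post states).
    """
    total_setup = setup_cost * (2 if restore else 1)
    return math.ceil(total_setup / saving_per_multiply)


for saving in (12, 18):
    print(
        saving,
        break_even(PEA_PLD_COST, saving),
        break_even(LDA_TAD_COST, saving),
        break_even(PEA_PLD_COST, saving, restore=True),
    )
```

Under those assumptions the PEA / PLD route pays for itself after roughly 4 to 6 multiplications, and noticeably more if D also has to be put back afterwards.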
HihiDanni wrote: The question was what is "worse" referring to in your original post. Worse in comparison to what? (I guess looking at it now you meant not using two multipliers would exacerbate the losses from not using D)
What I was really trying to aim at was something to the effect of:
The way to squeeze the most speed out of either multiplier is to minimize the time spent on I/O, because I/O is the bottleneck. Setting D to point at the multiplication registers can help ... but only if there isn't somewhere else where it would be more useful to leave it. Setting the multiplier once and just updating the multiplicand can help, but only if you're doing a bunch of multiplications in a row all by the same multiplier.
Specifically regarding the topic starter, the question was "Should I write two versions of the code, one that uses the faster PPU multiplier and one that uses the slower CPU multiplier? Or just use the slower CPU multiplier for everything?". Everything in my reply was my reasoning toward the conclusion: "IF you can do your math in the u7·u8→u15 least common denominator of both multipliers, there is no significant benefit to using one over the other (so go ahead and use the CPU multiplier exclusively). The overwhelmingly biggest benefit of the PPU multiplier comes when it lets you do fewer total multiplications (and spend fewer cycles on I/O)."
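The u7·u8→u15 common range can be checked exhaustively with a small model. This is only a sketch: it assumes the CPU multiplier is unsigned 8-bit × unsigned 8-bit → 16-bit and the PPU multiplier is signed 16-bit × signed 8-bit → 24-bit, and the function names are hypothetical.

```python
def as_signed(value, bits):
    """Reinterpret an unsigned bit pattern as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def cpu_multiply(a, b):
    """Model: unsigned 8-bit x unsigned 8-bit -> 16-bit product."""
    return ((a & 0xFF) * (b & 0xFF)) & 0xFFFF


def ppu_multiply(a, b):
    """Model: signed 16-bit x signed 8-bit -> 24-bit product."""
    return (as_signed(a, 16) * as_signed(b, 8)) & 0xFFFFFF


# Every (u8, u7) operand pair on which the two models disagree.
mismatches = [
    (a, b)
    for a in range(256)
    for b in range(128)
    if cpu_multiply(a, b) != ppu_multiply(a, b)
]

# Largest product in the shared range; it has to fit in 15 bits.
largest = max(cpu_multiply(a, b) for a in range(256) for b in range(128))

print(len(mismatches), largest, largest < (1 << 15))
```

In this model the two multipliers never disagree inside that range, and the largest product, 255·127, still fits in 15 bits. As soon as the 8-bit operand reaches 128, the PPU model reads it as negative and the results diverge.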
Re: Avoiding using signed multiplication while in Mode 7
Your original post makes sense now; thanks for the clarification.
lidnariq wrote: Additionally, changing D involves a small but noticeable number of cycles lost (PEA / PLD totals 6·6+4·8=68MtCy each time) or significant register pressure (LDA #imm / TAD is 30MtCy each time but you've destroyed A) so you have to do a bunch of multiplications in a row in order for it to be worthwhile.
Yeah, I had envisioned the multiplications being done in a tight loop. Probably useful for Mode 7 (though it'd depend on how exactly Mode 7 is being used here, whether there will be any perspective effects or not). For object thinkers I'd leave D alone.
-
- Posts: 3140
- Joined: Wed May 19, 2010 6:12 pm
Re: Avoiding using signed multiplication while in Mode 7
I think I should test everything out with the $42xx registers to see if I need to make 2 different multiplication routines.
-
- Posts: 3140
- Joined: Wed May 19, 2010 6:12 pm
Re: Avoiding using signed multiplication while in Mode 7
It appears that the performance loss is negligible.
Now I need to make something with mode 7.
- Señor Ventura
- Posts: 233
- Joined: Sat Aug 20, 2016 3:58 am
Re: Avoiding using signed multiplication while in Mode 7
lidnariq wrote: To do a multiplication using either set of registers, the 65816 spends most of its time in I/O...
Comparing apples to apples restricts you to u7·u7→u14 anyway.
The fastest you could run a multiplication using the CPU registers would be STn.16 dpg ; wait 8 ; LDn.16 dpg, or (4+8+4)·6 → 96MtCy
The fastest via the PPU registers would be STZ.8 dpg ; STn.16 dpg ; LDn.16 dpg, or (3+4+4)·6 → 66MtCy
In practice, the multiplier and multiplicand are very unlikely to be packed in a useful way in your registers, and D is unlikely to point to the multiplication registers. Unless you're specifically doing some kind of SIMD-like block processing (using one multiplier on a bunch of multiplicands) you're probably going to make it worse by adding even more loads and stores.
Where the PPU multiplier wins big is just in requiring fewer total multiplications.
So loading the operands is only about 1/3 faster using the PPU, but its advantage in doing the multiplications is much more noticeable once they are loaded... is that right?