All times are UTC - 7 hours



[ 20 posts ] Page 1 of 2

Posted: Wed Jan 17, 2018 2:48 pm
Joined: Wed May 19, 2010 6:12 pm
Posts: 2731
If I'm planning on making a Mode 7 level that uses some of the same enemies as non-Mode 7 levels, is there any reason to use $211b, $211c and $2134 in the first place?


Posted: Wed Jan 17, 2018 3:11 pm
Joined: Sun Apr 13, 2008 11:12 am
Posts: 7391
Location: Seattle
To do a multiplication using either set of registers, the 65816 spends most of its time in I/O...

Comparing apples to apples restricts you to u7·u7→u14 anyway.

The fastest you could run a multiplication using the CPU registers would be STn.16 dpg ; wait 8 ; LDn.16 dpg, or (4+8+4)·6 → 96MtCy
The fastest via the PPU registers would be STZ.8 dpg ; STn.16 dpg ; LDn.16 dpg, or (3+4+4)·6 → 66MtCy
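
As a rough check of the arithmetic above, here is a small Python sketch. It assumes every CPU cycle costs 6 master cycles (fast I/O registers, code running from FastROM), which is the factor the two estimates use. The function name and the per-step cycle lists are only illustrative.

```python
# Master clocks per CPU cycle, as assumed in the estimate above
# (fast I/O registers, code running from FastROM).
MASTER_PER_CPU_CYCLE = 6

def master_cycles(cpu_cycles):
    """Total master cycles for a sequence of per-step CPU cycle counts."""
    return sum(cpu_cycles) * MASTER_PER_CPU_CYCLE

# CPU multiplier: 16-bit direct-page store, 8-cycle wait, 16-bit direct-page load.
cpu_multiplier = master_cycles([4, 8, 4])

# PPU multiplier: STZ (8-bit), 16-bit direct-page store, 16-bit direct-page load.
ppu_multiplier = master_cycles([3, 4, 4])

print(cpu_multiplier, ppu_multiplier)  # 96 66
```

The 30 master cycle gap is the best case; any extra loads and stores needed to get the operands into place are added on top of both figures.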

In practice, the multiplier and multiplicand are very unlikely to be packed in a useful way in your registers, and D is unlikely to point to the multiplication registers. Unless you're specifically doing some kind of SIMD-like block processing (using one multiplier on a bunch of multiplicands) you're probably going to make it worse by adding even more loads and stores.

Where the PPU multiplier wins big is just in requiring fewer total multiplications.


Posted: Wed Jan 17, 2018 3:40 pm
Joined: Tue Apr 05, 2016 5:25 pm
Posts: 186
You don't necessarily need to wait 8 and have the CPU do nothing. Why not spend that time doing additional processing to mask that latency?

lidnariq wrote:
In practice, the multiplier and multiplicand are very unlikely to be packed in a useful way in your registers

If you're doing 8x8 mult and you have A set to be 8-bit size, you effectively get two 8-bit registers out of it.

Quote:
and D is unlikely to point to the multiplication registers.

Unless, you know, you set D to point to the multiplication registers. Which you're probably going to do anyway if you're processing a bunch of stuff at once. Then I'd say it's extremely likely to point to the multiplication registers.

Quote:
Unless you're specifically doing some kind of SIMD-like block processing (using one multiplier on a bunch of multiplicands) you're probably going to make it worse by adding even more loads and stores.

Not sure what point you're trying to make here, or what it's even in reference to. Both methods require loads and stores.

_________________
SNES NTSC 2/1/3 1CHIP | serial number UN318588627


Posted: Wed Jan 17, 2018 3:43 pm
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 20412
Location: NE Indiana, USA (NTSC)
lidnariq wrote:
Unless you're specifically doing some kind of SIMD-like block processing (using one multiplier on a bunch of multiplicands) you're probably going to make it worse by adding even more loads and stores.

Holding one factor constant is the case for applications that (ab)use the multiplier as a faux barrel shifter. In VWF rendering, for example, each bitplane in a tile is "multiplied" by a particular power of two in order to shift it left by so many bits. This works with the CPU multiplier but not the PPU one because of the signedness constraint.
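
The shifter trick and the signedness problem can be sketched in Python. This is only a model under stated assumptions: the CPU multiplier is treated as unsigned 8-bit times unsigned 8-bit giving 16 bits, and the PPU multiplier as signed 16-bit times signed 8-bit giving 24 bits. The helper names are made up for illustration and nothing here has been checked against hardware.

```python
def cpu_mul(a, b):
    """Model of the CPU multiplier: unsigned 8-bit x unsigned 8-bit -> 16-bit."""
    return ((a & 0xFF) * (b & 0xFF)) & 0xFFFF

def to_signed(value, bits):
    """Interpret the low `bits` bits of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def ppu_mul(a16, b8):
    """Model of the PPU multiplier: signed 16-bit x signed 8-bit -> 24-bit."""
    return (to_signed(a16, 16) * to_signed(b8, 8)) & 0xFFFFFF

def shift_left_via_mul(row, count, mul):
    """Shift an 8-bit bitplane row left by `count` (0..7) using a multiplier."""
    return mul(row, 1 << count)

row = 0b10110001
for count in range(8):
    # Every shift amount works with the unsigned CPU model.
    assert shift_left_via_mul(row, count, cpu_mul) == row << count

# Shifts 0..6 also work on the PPU model, but a shift by 7 needs the factor
# $80, which the signed 8-bit input reads as -128.
assert shift_left_via_mul(row, 6, ppu_mul) == row << 6
assert shift_left_via_mul(row, 7, ppu_mul) != row << 7
```

In this model, swapping the operands so the power of two goes in the 16-bit input only moves the problem: any row with bit 7 set is then read as negative.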

HihiDanni wrote:
Unless, you know, you set D to point to the multiplication registers. Which you're probably going to do anyway if you're processing a bunch of stuff at once.

But if you have set D to $4200 to use multiplication, you have no place to store pointers to the data that you're processing using (dd),Y or [dd],Y addressing, unless you take the cycle and bank flexibility hit of using (dd,S),Y.


Posted: Wed Jan 17, 2018 3:52 pm
Joined: Sun Apr 13, 2008 11:12 am
Posts: 7391
Location: Seattle
HihiDanni wrote:
If you're doing 8x8 mult and you have A set to be 8-bit size, you effectively get two 8-bit registers out of it.
Loading things into B and using XBA is as I/O intensive as just loading things separately. It is quite literally only useful if you happen to have a block of RAM where the multiplicands and multipliers are already stored interleaved.

Quote:
Unless, you know, you set D to point to the multiplication registers. Which you're probably going to do anyway if you're processing a bunch of stuff at once. Then I'd say it's extremely likely to point to the multiplication registers.
Tautologies are tautologies? You have to take the values you're loading from somewhere, and it could be (I'd argue even likely) the case that setting D to RAM (or PPU registers) is more useful. There's nothing else useful in the $4200 block to make it obvious that saving 12-18MtCy on the store/load from the multiplier part is better than instead saving 12-18MtCy on e.g. RAM I/O for the multiplication routine itself.

Quote:
Not sure what point you're trying to make here, or what it's even in reference to. Both methods require loads and stores.
If you use one multiplier you don't have to reload it. This is true regardless of whether you're using the CPU or PPU multiplier.


Posted: Wed Jan 17, 2018 3:54 pm
Joined: Tue Apr 05, 2016 5:25 pm
Posts: 186
What's the use case of using indirect access here? You can encode a base address and an offset within the same Y register, as long as you don't need to differentiate between the two.

_________________
SNES NTSC 2/1/3 1CHIP | serial number UN318588627


Posted: Wed Jan 17, 2018 4:22 pm
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 20412
Location: NE Indiana, USA (NTSC)
HihiDanni wrote:
What's the use case of using indirect access here? You can encode a base address and an offset within the same Y register, as long as you don't need to differentiate between the two.

For one thing, a base address used with aaaaaa,X or [dd],Y is 24-bit, whereas an offset or base plus offset is only 16-bit. Enjoy thrashing DBR when alternating between reading the source bank and writing the destination bank. For another, often you do need to differentiate between the base address and offset because your loop finishes once an index register has decreased to zero (DEX/DEY BNE) or past zero (DEX/DEY BPL). Otherwise, you would have to store the loop counter (for DEC) or end base plus offset (for CPX or CPY) somewhere, and $4200-$42FF doesn't have a good place to do so.


Posted: Wed Jan 17, 2018 5:03 pm
Joined: Tue Apr 05, 2016 5:25 pm
Posts: 186
tepples wrote:
Enjoy thrashing DBR when alternating between reading the source bank and writing the destination bank.

If the destination is the multiplication register then you can just use D, with no need to bank switch. Should you need to bank switch though, as I had mentioned before, you can do it while waiting for the multiplication result.

tepples wrote:
For another, often you do need to differentiate between the base address and offset because your loop finishes once an index register has decreased to zero (DEX/DEY BNE) or past zero (DEX/DEY BPL). Otherwise, you would have to store the loop counter (for DEC) or end base plus offset (for CPX or CPY) somewhere, and $4200-$42FF doesn't have a good place to do so.

If your data set allows for terminators, you can do the loop-breaking branch statement during the data load.

Edit: Oops, I missed a reply.

lidnariq wrote:
Loading things into B and using XBA is as I/O intensive as just loading things separately. It is quite literally only useful if you happen to have a block of RAM where the multiplicands and multipliers are already stored interleaved.

You can do a single 16-bit load into C. XBA to my knowledge does not require any slow cycles for FastROM (unless you're executing the code out of RAM).

Quote:
Tautologies are tautologies? You have to take the values you're loading from somewhere, and it could be (I'd argue even likely) the case that setting D to RAM (or PPU registers) is more useful. There's nothing else useful in the $4200 block to make it obvious that saving 12-18MtCy on the store/load from the multiplier part is better than instead saving 12-18MtCy on e.g. RAM I/O for the multiplication routine itself.

You were originally referring to using D to write to the multiplication registers, as though not having D for it would be the bottleneck. The answer, simply, is to set D to write to the multiplication registers. (It is true that you could be using D for something else. And thinking about this some more, much of my own programming style involves the state of D being localized within the function, so changing D is a non-issue in this case)

Quote:
If you use one multiplier you don't have to reload it. This is true regardless of whether you're using the CPU or PPU multiplier.

The question was what is "worse" referring to in your original post. Worse in comparison to what? (I guess looking at it now you meant not using two multipliers would exacerbate the losses from not using D)

_________________
SNES NTSC 2/1/3 1CHIP | serial number UN318588627


Posted: Wed Jan 17, 2018 6:11 pm
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 20412
Location: NE Indiana, USA (NTSC)
HihiDanni wrote:
tepples wrote:
Enjoy thrashing DBR when alternating between reading the source bank and writing the destination bank.

If the destination is the multiplication register then you can just use D, with no need to bank switch.

The loop I had envisioned was read source, write to multiplier, read multiplier, write to destination.

HihiDanni wrote:
If your data set allows for terminators, you can do the loop-breaking branch statement during the data load.

Good for reading characters out of "<Arnold> I still love Vista, baby\0", not so much for shifting the glyphs that represent each letter.

HihiDanni wrote:
You can do a single 16-bit load into C.

With a 6-cycle (36mc) penalty for REP plus SEP, if needed.

I guess I might need to give an example of what this sort of compositing code might look like.


Posted: Wed Jan 17, 2018 6:24 pm
Joined: Tue Apr 05, 2016 5:25 pm
Posts: 186
tepples wrote:
With a 6-cycle (36mc) penalty for REP plus SEP, if needed.

As it turns out, REP and SEP indeed take three cycles each (for some reason I was imagining two cycles), so that might not be the best way to do it after all.

_________________
SNES NTSC 2/1/3 1CHIP | serial number UN318588627


Posted: Wed Jan 17, 2018 6:33 pm
Joined: Sun Apr 13, 2008 11:12 am
Posts: 7391
Location: Seattle
HihiDanni wrote:
You can do a single 16-bit load into C. XBA to my knowledge does not require any slow cycles for FastROM (unless you're executing the code out of RAM).
But that requires that your in-memory structure already be compatible with that (multiplier and multiplicand in adjacent bytes). There could well be reasons that's not feasible.

Quote:
You were originally referring to using D to write to the multiplication registers, as though not having D for it would be the bottleneck. The answer, simply, is to set D to write to the multiplication registers. (It is true that you could be using D for something else. And thinking about this some more, much of my own programming style involves the state of D being localized within the function, so changing D is a non-issue in this case)
That was not the argument I was trying to make.

I was attempting to state that the performance benefit of using direct page access to $42xx is so small in comparison to all the other overhead that it doesn't matter much. And especially in the case of $4200, you get literally no other benefit from doing so; there's nothing else there that is useful to have faster access to (not even two bytes of RAM!). Additionally, changing D involves a small but noticeable number of cycles lost (PEA / PLD totals 6·6+4·8=68MtCy each time) or significant register pressure (LDA #imm / TAD is 30MtCy each time but you've destroyed A) so you have to do a bunch of multiplications in a row in order for it to be worthwhile.
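
To put a rough number on "a bunch", here is a break-even sketch in Python using only the figures quoted in this thread: 68 or 30 master cycles per change of D, and 12 to 18 master cycles saved per multiplication. It assumes D is changed twice per batch (set before, restored after); that assumption and the function name are illustrative only.

```python
import math

def break_even(change_cost, saving_per_mul, changes=2):
    """Smallest number of multiplications whose savings cover changing D.

    `changes` is an assumption: D is set once before the batch and
    restored once after it.
    """
    return math.ceil(change_cost * changes / saving_per_mul)

PEA_PLD = 6 * 6 + 4 * 8   # 68 master cycles per change of D
LDA_TAD = 30              # master cycles per change, but A is clobbered

for name, cost in (("PEA/PLD", PEA_PLD), ("LDA #imm/TAD", LDA_TAD)):
    # Savings of 12 to 18 master cycles per multiplication, as quoted above.
    print(name, break_even(cost, 18), "to", break_even(cost, 12))
```

Under those assumptions the PEA/PLD route pays for itself after roughly 8 to 12 multiplications in a row, and the LDA/TAD route after roughly 4 to 5.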

Quote:
The question was what is "worse" referring to in your original post. Worse in comparison to what? (I guess looking at it now you meant not using two multipliers would exacerbate the losses from not using D)
What I was really trying to aim at was something to the effect of:

The way to squeeze the most speed out of either multiplier is to minimize the time spent on I/O, because I/O is the bottleneck. Setting D to point at the multiplication registers can help ... but only if there isn't somewhere else it would be more useful to leave it pointing. Setting the multiplier once and just updating the multiplicand can help, but only if you're doing a bunch of multiplications in a row all by the same multiplier.


Specifically regarding the topic starter, the question was "Should I write two versions of the code, one that uses the faster PPU multiplier and one that uses the slower CPU multiplier? Or just use the slower CPU multiplier for everything?". Everything in my reply was my reasoning toward the conclusion: "IF you can do your math in the u7·u8→u15 least common denominator of both multipliers, there is no significant benefit to using one over the other (so go ahead and use the CPU multiplier exclusively). The overwhelmingly biggest benefit of the PPU multiplier comes when you need fewer total multiplications (and fewer cycles spent on I/O)."
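
The "least common denominator" claim can be checked exhaustively with a small Python model. As a labeled assumption, the CPU multiplier is modeled as unsigned 8-bit times unsigned 8-bit giving 16 bits and the PPU multiplier as signed 16-bit times signed 8-bit giving 24 bits; the function names are hypothetical and this is a sketch, not a hardware test.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def cpu_mul(a, b):
    """Model of the CPU multiplier: unsigned 8-bit x unsigned 8-bit -> 16-bit."""
    return ((a & 0xFF) * (b & 0xFF)) & 0xFFFF

def ppu_mul(a16, b8):
    """Model of the PPU multiplier: signed 16-bit x signed 8-bit -> 24-bit."""
    return (to_signed(a16, 16) * to_signed(b8, 8)) & 0xFFFFFF

def multipliers_agree(a_bits, b_bits):
    """True if both models give the same product for all unsigned inputs
    of the given widths (a goes to the PPU's 16-bit input, b to its 8-bit one)."""
    return all(cpu_mul(a, b) == ppu_mul(a, b)
               for a in range(1 << a_bits)
               for b in range(1 << b_bits))

print(multipliers_agree(8, 7))  # True
print(multipliers_agree(8, 8))  # False
```

With an 8-bit and a 7-bit unsigned operand the largest product is 255·127 = 32385, which fits in 15 bits, so one routine written against the CPU multiplier covers that range in both kinds of level.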


Posted: Wed Jan 17, 2018 7:12 pm
Joined: Tue Apr 05, 2016 5:25 pm
Posts: 186
Your original post makes sense now; thanks for the clarification.

lidnariq wrote:
Additionally, changing D involves a small but noticeable number of cycles lost (PEA / PLD totals 6·6+4·8=68MtCy each time) or significant register pressure (LDA #imm / TAD is 30MtCy each time but you've destroyed A) so you have to do a bunch of multiplications in a row in order for it to be worthwhile.

Yeah, I had envisioned the multiplications being done in a tight loop. Probably useful for Mode 7 (though it'd depend on how exactly Mode 7 is being used here, whether there will be any perspective effects or not). For object thinkers I'd leave D alone.

_________________
SNES NTSC 2/1/3 1CHIP | serial number UN318588627


Posted: Thu Jan 18, 2018 12:19 pm
Joined: Wed May 19, 2010 6:12 pm
Posts: 2731
I think I should test everything out with the $42xx registers to see if I need to make 2 different multiplication routines.


Posted: Sat Jan 20, 2018 12:39 pm
Joined: Wed May 19, 2010 6:12 pm
Posts: 2731
It appears that the performance loss is negligible. :D

Now I need to make something with mode 7.


Posted: Sun Jan 28, 2018 8:07 pm
Joined: Sat Aug 20, 2016 3:58 am
Posts: 52
lidnariq wrote:
To do a multiplication using either set of registers, the 65816 spends most of its time in I/O...

Comparing apples to apples restricts you to u7·u7→u14 anyway.

The fastest you could run a multiplication using the CPU registers would be STn.16 dpg ; wait 8 ; LDn.16 dpg, or (4+8+4)·6 → 96MtCy
The fastest via the PPU registers would be STZ.8 dpg ; STn.16 dpg ; LDn.16 dpg, or (3+4+4)·6 → 66MtCy

In practice, the multiplier and multiplicand are very unlikely to be packed in a useful way in your registers, and D is unlikely to point to the multiplication registers. Unless you're specifically doing some kind of SIMD-like block processing (using one multiplier on a bunch of multiplicands) you're probably going to make it worse by adding even more loads and stores.

Where the PPU multiplier wins big is just in requiring fewer total multiplications.


So, issuing the instructions for the operation is only about 1/3 faster using the PPU, but its advantage in doing the multiplications themselves is much more noticeable once everything is loaded... is that right?

