It is currently Tue Aug 14, 2018 6:55 pm

All times are UTC - 7 hours



Forum rules


Related:



Post new topic Reply to topic  [ 436 posts ]  Go to page Previous  1 ... 6, 7, 8, 9, 10, 11, 12 ... 30  Next
Author Message
PostPosted: Sun Jan 03, 2016 12:41 am 
Offline
Formerly Espozo
User avatar

Joined: Mon Sep 15, 2014 4:35 pm
Posts: 3359
Location: Nacogdoches, Texas
Stef wrote:
The SNES only has 8x8 multiplication (the other unit being used by the mode 7, that is not safe for a regular alternative)

Wait, are you saying that you can do 16x8 or 16x16 multiplication with mode 7's multiplication and division registers? Is there any realistic way to "chain" two 8 bit multiplications together like you can do with addition or subtraction? Probably not...

Stef wrote:
(you have to count at least 15/16 cycles for load / read result time)

That's why I was thinking it wouldn't be that fast. 70 to 140 cycles sounds astronomically high to me, but on the SNES, it's more like half which is still ridiculously large, but it's at least somewhat reasonable.


Top
 Profile  
 
PostPosted: Sun Jan 03, 2016 1:30 am 
Offline
User avatar

Joined: Mon Jan 03, 2005 10:36 am
Posts: 3110
Location: Tampere, Finland
Espozo wrote:
Stef wrote:
The SNES only has 8x8 multiplication (the other unit being used by the mode 7, that is not safe for a regular alternative)

Wait, are you saying that you can do 16x8 or 16x16 multiplication with mode 7's multiplication and division registers? Is there any realistic way to "chain" two 8 bit multiplications together like you can do with addition or subtraction? Probably not...

Not sure if this was what you were asking, but off the top of my head: (a and b are 16-bit numbers)
Code:
a*b = (a_hi*256 + a_lo) * (b_hi*256 + b_lo)
    = a_hi*b_hi*256*256 + a_hi*256*b_lo + a_lo*b_hi*256 + a_lo*b_lo
    = ((a_hi*b_hi)<<16) + ((a_hi*b_lo + a_lo*b_hi)<<8) + a_lo*b_lo

_________________
Download STREEMERZ for NES from fauxgame.com! — Some other stuff I've done: fo.aspekt.fi


Top
 Profile  
 
PostPosted: Sun Jan 03, 2016 7:56 am 
Offline

Joined: Thu Aug 28, 2008 1:17 am
Posts: 591
This could be repurposed for '816. The routine has a large overhead of T1 loading because it's setup for sequential calls were T1 doesn't change, but you could easily switch that to indirect, removing the self-modifying code, and seriously cut that loading over head down by a lot. Also, the tables are split 8bit as well as the math, so that could be optimized into single tables with single sbc's; word wide tables and 16bit operations. A re-write/re-structure for '816 could probably get it close to 70 cycle range.

http://codebase64.org/doku.php?id=base: ... iplication 16x16->32bit mul
Code:
; Description: Unsigned 16-bit multiplication with unsigned 32-bit result.
;                                                                         
; Input: 16-bit unsigned value in T1                                     
;        16-bit unsigned value in T2                                     
;        Carry=0: Re-use T1 from previous multiplication (faster)         
;        Carry=1: Set T1 (slower)                                         
;                                                                         
; Output: 32-bit unsigned value in PRODUCT                               
;                                                                         
; Clobbered: PRODUCT, X, A, C                                             
;                                                                         
; Allocation setup: T1,T2 and PRODUCT preferably on Zero-page.           
;                   square1_lo, square1_hi, square2_lo, square2_hi must be
;                   page aligned. Each table are 512 bytes. Total 2kb.   
;                                                                         
; Table generation: I:0..511                                             
;                   square1_lo = <((I*I)/4)                               
;                   square1_hi = >((I*I)/4)                               
;                   square2_lo = <(((I-255)*(I-255))/4)                   
;                   square2_hi = >(((I-255)*(I-255))/4)                   
.proc multiply_16bit_unsigned                                             
                ; <T1 * <T2 = AAaa                                       
                ; <T1 * >T2 = BBbb                                       
                ; >T1 * <T2 = CCcc                                       
                ; >T1 * >T2 = DDdd                                       
                ;                                                         
                ;       AAaa                                             
                ;     BBbb                                               
                ;     CCcc                                               
                ; + DDdd                                                 
                ; ----------                                             
                ;   PRODUCT!                                             

                ; Setup T1 if changed
                bcc :+               
                    lda T1+0         
                    sta sm1a+1       
                    sta sm3a+1       
                    sta sm5a+1       
                    sta sm7a+1       
                    eor #$ff         
                    sta sm2a+1       
                    sta sm4a+1       
                    sta sm6a+1       
                    sta sm8a+1       
                    lda T1+1         
                    sta sm1b+1       
                    sta sm3b+1       
                    sta sm5b+1       
                    sta sm7b+1       
                    eor #$ff         
                    sta sm2b+1       
                    sta sm4b+1       
                    sta sm6b+1       
                    sta sm8b+1       
                :                   

                ; Perform <T1 * <T2 = AAaa
                ldx T2+0                 
                sec                       
sm1a:           lda square1_lo,x         
sm2a:           sbc square2_lo,x         
                sta PRODUCT+0             
sm3a:           lda square1_hi,x         
sm4a:           sbc square2_hi,x         
                sta _AA+1                 

                ; Perform >T1_hi * <T2 = CCcc
                sec                         
sm1b:           lda square1_lo,x             
sm2b:           sbc square2_lo,x             
                sta _cc+1                   
sm3b:           lda square1_hi,x             
sm4b:           sbc square2_hi,x             
                sta _CC+1                   

                ; Perform <T1 * >T2 = BBbb
                ldx T2+1                 
                sec                       
sm5a:           lda square1_lo,x         
sm6a:           sbc square2_lo,x         
                sta _bb+1                 
sm7a:           lda square1_hi,x         
sm8a:           sbc square2_hi,x         
                sta _BB+1                 

                ; Perform >T1 * >T2 = DDdd
                sec                       
sm5b:           lda square1_lo,x         
sm6b:           sbc square2_lo,x         
                sta _dd+1                 
sm7b:           lda square1_hi,x         
sm8b:           sbc square2_hi,x         
                sta PRODUCT+3             

                ; Add the separate multiplications together
                clc                                       
_AA:            lda #0                                     
_bb:            adc #0                                     
                sta PRODUCT+1                             
_BB:            lda #0                                     
_CC:            adc #0                                     
                sta PRODUCT+2                             
                bcc :+                                     
                    inc PRODUCT+3                         
                    clc                                   
                :                                         
_cc:            lda #0                                     
                adc PRODUCT+1                             
                sta PRODUCT+1                             
_dd:            lda #0                                     
                adc PRODUCT+2                             
                sta PRODUCT+2                             
                bcc :+                                     
                    inc PRODUCT+3                         
                :                                         

                rts
.endproc           

_________________
__________________________
http://pcedev.wordpress.com


Top
 Profile  
 
PostPosted: Sun Jan 03, 2016 7:58 am 
Offline

Joined: Mon Jul 01, 2013 11:25 am
Posts: 248
Espozo wrote:
Wait, are you saying that you can do 16x8 or 16x16 multiplication with mode 7's multiplication and division registers? Is there any realistic way to "chain" two 8 bit multiplications together like you can do with addition or subtraction? Probably not...


You can do 8x16 --> 24 multiplication with mode 7 registers but no division as far i know.
And of course you can chain them to obtain a 16x16 --> 32 multiplication :
(8 low) x16 = x
(8 high) x16 = y
then just do a 32 bits addition : x + (y << 8) to get the final 32 bits results.
Still even using the mode 7 registers i bet you spent a large amount of cycle just to do 16x16=32


Quote:
That's why I was thinking it wouldn't be that fast. 70 to 140 cycles sounds astronomically high to me, but on the SNES, it's more like half which is still ridiculously large, but it's at least somewhat reasonable.


Actually on 68000 the multiplication takes up to 72 cycles (for signed version) and up to 140 cycles for the signed division but given some benchmarks i did, you can assume a mean of 50 cycles for multiplication and about 90/100 cycles for division which is not that bad.
With the genesis 68000 cpu (7.67 Mhz) i can transform (3D transformation + 2D projection) about 10000 vertices per second which is not that bad (i expected 6000 max).
A single 3D transformation consist of :
- 9 16x16=32 multiplication
- 6 32bit additions
- 3 16bit additions
A single 2D projection consist of :
- 3 16bit additions
- 1 32:16=16 division
- 2 16x16=32 multiplication

The projection could be different but i handled it that way for convenience.
Still that is a big amount of complexes operations and i don't count the load / store / shift operations here.
If we just count maximum cycles of mul and div, we already obtain (11*70) + 140 which is close to 1000 cycles per vertex ! So we shouldn't be able to transform more than 8000 vertices per second just because of these operations... but hopefully that is not the case. I wonder how much we could transform with the SNES hardware using smart interlacing of operations with the different available multiplier / diviser units :)


Top
 Profile  
 
PostPosted: Sun Jan 03, 2016 8:00 am 
Offline

Joined: Mon Jul 01, 2013 11:25 am
Posts: 248
tomaitheous wrote:
This could be repurposed for '816. The routine has a large overhead of T1 loading because it's setup for sequential calls were T1 doesn't change, but you could easily switch that to indirect, removing the self-modifying code, and seriously cut that loading over head down by a lot. Also, the tables are split 8bit as well as the math, so that could be optimized into single tables with single sbc's; word wide tables and 16bit operations. A re-write/re-structure for '816 could probably get it close to 70 cycle range.

http://codebase64.org/doku.php?id=base: ... iplication 16x16->32bit mul


Interesting :) is there the same for signed multiplication (which is more useful) ? I guess it only requires different lut.

Edit: I found the signed version which add some cycles at the end of the multiplication operation, signed operands process a bit slower because of that, still that is a fast implentation for the 6502 CPU :-)


Top
 Profile  
 
PostPosted: Sun Jan 03, 2016 2:16 pm 
Offline

Joined: Thu Aug 28, 2008 1:17 am
Posts: 591
Stef wrote:
Interesting :) is there the same for signed multiplication (which is more useful) ? I guess it only requires different lut.
Edit: I found the signed version which add some cycles at the end of the multiplication operation, signed operands process a bit slower because of that, still that is a fast implentation for the 6502 CPU :-)


Code:
; Description: Signed 16-bit multiplication with signed 32-bit result.
;                                                                     
; Input: 16-bit signed value in T1                                   
;        16-bit signed value in T2                                   
;        Carry=0: Re-use T1 from previous multiplication (faster)     
;        Carry=1: Set T1 (slower)                                     
;                                                                     
; Output: 32-bit signed value in PRODUCT                             
;
; Clobbered: PRODUCT, X, A, C
.proc multiply_16bit_signed
                jsr multiply_16bit_unsigned

                ; Apply sign (See C=Hacking16 for details).
                lda T1+1
                bpl :+
                    sec
                    lda PRODUCT+2
                    sbc T2+0
                    sta PRODUCT+2
                    lda PRODUCT+3
                    sbc T2+1
                    sta PRODUCT+3
                :
                lda T2+1
                bpl :+
                    sec
                    lda PRODUCT+2
                    sbc T1+0
                    sta PRODUCT+2
                    lda PRODUCT+3
                    sbc T1+1
                    sta PRODUCT+3
                :

                rts
.endproc


Yeah, those two groups of 8bit SBCs could be optimized down to one each for '816. So worse case is two 16bit subtractions for signed overhead. Though I think a slightly more optimal routine could be written for signed input/output.

_________________
__________________________
http://pcedev.wordpress.com


Top
 Profile  
 
PostPosted: Sun Jan 03, 2016 3:14 pm 
Offline

Joined: Mon Jul 01, 2013 11:25 am
Posts: 248
Indeed, someone should rewrite an optimized version for the 65816, would be pretty useful :)


Top
 Profile  
 
PostPosted: Sun Jan 03, 2016 8:47 pm 
Offline

Joined: Wed May 19, 2010 6:12 pm
Posts: 2727
I think I figured out how Tomaitheous's code works. This is basically the math equation:

LUT1(a+b) - LUT2(255-a+b)

LUT1(x) = (x^2)/4
LUT2(x) = ((x-255)^2)/4

((a+b)^2)/4 - ((255-a+b-255)^2)/4=
((a+b)^2)/4 - ((-a+b)^2)/4 =
((a+b)^2 - (-a+b)^2)/4 =
(a^2 + 2ab + b^2 - (a^2 - 2ab + b^2))/4 =
(a^2 + 2ab + b^2 - a^2 + 2ab - b^2)/4 =
4ab/4 = ab

I don't know how the rounding error somehow gets cancelled out.


Top
 Profile  
 
PostPosted: Sun Jan 03, 2016 9:10 pm 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 20402
Location: NE Indiana, USA (NTSC)
See quarter square multiplication on Wikipedia.


Top
 Profile  
 
PostPosted: Mon Jan 04, 2016 12:51 pm 
Offline

Joined: Wed May 19, 2010 6:12 pm
Posts: 2727
I use multiplication for calculating the rotation of the bosses' joints. The sines and cosines are 16-bit signed values from -256 to 256, and the radius is a signed 8-bit values. The result is a signed 16-bit value.

I thought of a fast routine that uses non-mode-7 multiplication registers.

Code:
sine_multiplication:
lda #$0000
sep #$20
cpy #$80
bcc +
sbc sine_lo,x //clears carry for upcomming adc
+;
sty $4202
ldy sine_hi,x
sty $4203
ldy sine_lo,x
adc $004216 //wastes one cycle
sty $4203
xba
rep #$21
adc $4216
rts


Top
 Profile  
 
PostPosted: Mon Jan 25, 2016 5:29 pm 
Offline

Joined: Wed May 19, 2010 6:12 pm
Posts: 2727
What really boggles my mind about slowdown is when games lag with 4 sprites, while other games can have more than 40. You mean to tell me they programmed every routine 10 times slower than necessary?

What's even more odd is that the games with heavy slowdown seem to have a higher threshold on the second lag frame as if running the game itself takes a lot of overhead. Like, it takes 4 sprites to make it run at 30fps, but with 10 sprites onscreen, it STILL runs at 30fps? WTF?


Top
 Profile  
 
PostPosted: Mon Jan 25, 2016 5:50 pm 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 20402
Location: NE Indiana, USA (NTSC)
The data

Game A
1-3 objects: 60 fps
4-10 objects: 30 fps

Game B
40 objects: 60 fps

Speculation as to cause

For purpose of argument, I'll assume good faith, imputing no malice where incompetence is sufficient (Hanlon's razor), nor incompetence when bona fide intractability is sufficient.

As for the difference between the games: Some games have more complicated collision detection and path finding algorithms and may thus slow down with fewer active objects. I doubt that individual spread-gun bullet sprites in Contra or even ships in Recca have very complicated movement.

As for why 10 is no slower than 4: Perhaps there is a constant overhead equivalent to seven objects, such as decoding the map into background update packets. Or perhaps the player character is as complex as two or three objects. I know the walking characters in Haunted: Halloween '85 (the player and the zombies) are the most algorithmically complex because they have to read the collision map four times (bottom center, left, and right, and head-height at leading edge), compared to everything else that reads it once (bottom center) or not at all. Incidentally, I had to do a shload of optimization to the code that parses collision slabs to get it to work well with six walkers (the player and five zombies) on the most platform-filled levels (such as the barn) without slowing down.


Top
 Profile  
 
PostPosted: Mon Jan 25, 2016 6:35 pm 
Offline
Formerly Espozo
User avatar

Joined: Mon Sep 15, 2014 4:35 pm
Posts: 3359
Location: Nacogdoches, Texas
What's funny is that I just got 1943 from a local video game store, Game X Change, (don't know why I felt like sharing that) and I played it and it's a ton better about not slowing down than Gradius III which I also got, except that this one is on the SNES... That's what really boggles my mind.

Isn't the SNES's CPU supposed to be something like 4x faster than the one in the NES?


Top
 Profile  
 
PostPosted: Mon Jan 25, 2016 7:11 pm 
Offline

Joined: Wed May 19, 2010 6:12 pm
Posts: 2727
Now that I think about it, I think most people mistakenly identify object collision as speed-critical code, when it could actually be the BG collision that bogs the cpu down. I wonder if slopes have anything to do with it, because a lot of NES games didn't use slopes.

Edit: I changed the wording, because I'm not exactly sure if this is totally accurate.


Last edited by psycopathicteen on Wed Jan 27, 2016 8:15 pm, edited 1 time in total.

Top
 Profile  
 
PostPosted: Mon Jan 25, 2016 7:22 pm 
Offline
Formerly Espozo
User avatar

Joined: Mon Sep 15, 2014 4:35 pm
Posts: 3359
Location: Nacogdoches, Texas
psycopathicteen wrote:
BG collision that bogs the cpu down.

Oh yeah... There's no BG to bump into in that game. :lol:


Top
 Profile  
 
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 436 posts ]  Go to page Previous  1 ... 6, 7, 8, 9, 10, 11, 12 ... 30  Next

All times are UTC - 7 hours


Who is online

Users browsing this forum: No registered users and 6 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
Powered by phpBB® Forum Software © phpBB Group