Why no SNES homebrew scene?

Discussion of hardware and software development for Super NES and Super Famicom. See the SNESdev wiki for more information.

Moderator: Moderators

Forum rules
  • For making cartridges of your Super NES games, see Reproduction.
User avatar
Drew Sebastino
Formerly Espozo
Posts: 3496
Joined: Mon Sep 15, 2014 4:35 pm
Location: Richmond, Virginia

Re: Why no SNES homebrew scene?

Post by Drew Sebastino »

Stef wrote:The SNES only has 8x8 multiplication (the other unit being used by the mode 7, that is not safe for a regular alternative)
Wait, are you saying that you can do 16x8 or 16x16 multiplication with mode 7's multiplication and division registers? Is there any realistic way to "chain" two 8 bit multiplications together like you can do with addition or subtraction? Probably not...
Stef wrote:(you have to count at least 15/16 cycles for load / read result time)
That's why I was thinking it wouldn't be that fast. 70 to 140 cycles sounds astronomically high to me, but on the SNES, it's more like half which is still ridiculously large, but it's at least somewhat reasonable.
User avatar
thefox
Posts: 3134
Joined: Mon Jan 03, 2005 10:36 am
Location: 🇫🇮
Contact:

Re: Why no SNES homebrew scene?

Post by thefox »

Espozo wrote:
Stef wrote:The SNES only has 8x8 multiplication (the other unit being used by the mode 7, that is not safe for a regular alternative)
Wait, are you saying that you can do 16x8 or 16x16 multiplication with mode 7's multiplication and division registers? Is there any realistic way to "chain" two 8 bit multiplications together like you can do with addition or subtraction? Probably not...
Not sure if this was what you were asking, but off the top of my head: (a and b are 16-bit numbers)

Code: Select all

a*b = (a_hi*256 + a_lo) * (b_hi*256 + b_lo)
    = a_hi*b_hi*256*256 + a_hi*256*b_lo + a_lo*b_hi*256 + a_lo*b_lo
    = ((a_hi*b_hi)<<16) + ((a_hi*b_lo + a_lo*b_hi)<<8) + a_lo*b_lo
Download STREEMERZ for NES from fauxgame.com! — Some other stuff I've done: fo.aspekt.fi
tomaitheous
Posts: 592
Joined: Thu Aug 28, 2008 1:17 am
Contact:

Re: Why no SNES homebrew scene?

Post by tomaitheous »

This could be repurposed for '816. The routine has a large overhead of T1 loading because it's setup for sequential calls were T1 doesn't change, but you could easily switch that to indirect, removing the self-modifying code, and seriously cut that loading over head down by a lot. Also, the tables are split 8bit as well as the math, so that could be optimized into single tables with single sbc's; word wide tables and 16bit operations. A re-write/re-structure for '816 could probably get it close to 70 cycle range.

http://codebase64.org/doku.php?id=base: ... iplication 16x16->32bit mul

Code: Select all

; Description: Unsigned 16-bit multiplication with unsigned 32-bit result.
;                                                                         
; Input: 16-bit unsigned value in T1                                      
;        16-bit unsigned value in T2                                      
;        Carry=0: Re-use T1 from previous multiplication (faster)         
;        Carry=1: Set T1 (slower)                                         
;                                                                         
; Output: 32-bit unsigned value in PRODUCT                                
;                                                                         
; Clobbered: PRODUCT, X, A, C                                             
;                                                                         
; Allocation setup: T1,T2 and PRODUCT preferably on Zero-page.            
;                   square1_lo, square1_hi, square2_lo, square2_hi must be
;                   page aligned. Each table are 512 bytes. Total 2kb.    
;                                                                         
; Table generation: I:0..511                                              
;                   square1_lo = <((I*I)/4)                               
;                   square1_hi = >((I*I)/4)                               
;                   square2_lo = <(((I-255)*(I-255))/4)                   
;                   square2_hi = >(((I-255)*(I-255))/4)                   
.proc multiply_16bit_unsigned                                             
                ; <T1 * <T2 = AAaa                                        
                ; <T1 * >T2 = BBbb                                        
                ; >T1 * <T2 = CCcc                                        
                ; >T1 * >T2 = DDdd                                        
                ;                                                         
                ;       AAaa                                              
                ;     BBbb                                                
                ;     CCcc                                                
                ; + DDdd                                                  
                ; ----------                                              
                ;   PRODUCT!                                              

                ; Setup T1 if changed
                bcc :+               
                    lda T1+0         
                    sta sm1a+1       
                    sta sm3a+1       
                    sta sm5a+1       
                    sta sm7a+1       
                    eor #$ff         
                    sta sm2a+1       
                    sta sm4a+1       
                    sta sm6a+1       
                    sta sm8a+1       
                    lda T1+1         
                    sta sm1b+1       
                    sta sm3b+1       
                    sta sm5b+1       
                    sta sm7b+1       
                    eor #$ff         
                    sta sm2b+1       
                    sta sm4b+1       
                    sta sm6b+1       
                    sta sm8b+1       
                :                    

                ; Perform <T1 * <T2 = AAaa
                ldx T2+0                  
                sec                       
sm1a:           lda square1_lo,x          
sm2a:           sbc square2_lo,x          
                sta PRODUCT+0             
sm3a:           lda square1_hi,x          
sm4a:           sbc square2_hi,x          
                sta _AA+1                 

                ; Perform >T1_hi * <T2 = CCcc
                sec                          
sm1b:           lda square1_lo,x             
sm2b:           sbc square2_lo,x             
                sta _cc+1                    
sm3b:           lda square1_hi,x             
sm4b:           sbc square2_hi,x             
                sta _CC+1                    

                ; Perform <T1 * >T2 = BBbb
                ldx T2+1                  
                sec                       
sm5a:           lda square1_lo,x          
sm6a:           sbc square2_lo,x          
                sta _bb+1                 
sm7a:           lda square1_hi,x          
sm8a:           sbc square2_hi,x          
                sta _BB+1                 

                ; Perform >T1 * >T2 = DDdd
                sec                       
sm5b:           lda square1_lo,x          
sm6b:           sbc square2_lo,x          
                sta _dd+1                 
sm7b:           lda square1_hi,x          
sm8b:           sbc square2_hi,x          
                sta PRODUCT+3             

                ; Add the separate multiplications together
                clc                                        
_AA:            lda #0                                     
_bb:            adc #0                                     
                sta PRODUCT+1                              
_BB:            lda #0                                     
_CC:            adc #0                                     
                sta PRODUCT+2                              
                bcc :+                                     
                    inc PRODUCT+3                          
                    clc                                    
                :                                          
_cc:            lda #0                                     
                adc PRODUCT+1                              
                sta PRODUCT+1                              
_dd:            lda #0                                     
                adc PRODUCT+2                              
                sta PRODUCT+2                              
                bcc :+                                     
                    inc PRODUCT+3                          
                :                                          

                rts
.endproc           
__________________________
http://pcedev.wordpress.com
Stef
Posts: 263
Joined: Mon Jul 01, 2013 11:25 am

Re: Why no SNES homebrew scene?

Post by Stef »

Espozo wrote: Wait, are you saying that you can do 16x8 or 16x16 multiplication with mode 7's multiplication and division registers? Is there any realistic way to "chain" two 8 bit multiplications together like you can do with addition or subtraction? Probably not...
You can do 8x16 --> 24 multiplication with mode 7 registers but no division as far i know.
And of course you can chain them to obtain a 16x16 --> 32 multiplication :
(8 low) x16 = x
(8 high) x16 = y
then just do a 32 bits addition : x + (y << 8) to get the final 32 bits results.
Still even using the mode 7 registers i bet you spent a large amount of cycle just to do 16x16=32

That's why I was thinking it wouldn't be that fast. 70 to 140 cycles sounds astronomically high to me, but on the SNES, it's more like half which is still ridiculously large, but it's at least somewhat reasonable.
Actually on 68000 the multiplication takes up to 72 cycles (for signed version) and up to 140 cycles for the signed division but given some benchmarks i did, you can assume a mean of 50 cycles for multiplication and about 90/100 cycles for division which is not that bad.
With the genesis 68000 cpu (7.67 Mhz) i can transform (3D transformation + 2D projection) about 10000 vertices per second which is not that bad (i expected 6000 max).
A single 3D transformation consist of :
- 9 16x16=32 multiplication
- 6 32bit additions
- 3 16bit additions
A single 2D projection consist of :
- 3 16bit additions
- 1 32:16=16 division
- 2 16x16=32 multiplication

The projection could be different but i handled it that way for convenience.
Still that is a big amount of complexes operations and i don't count the load / store / shift operations here.
If we just count maximum cycles of mul and div, we already obtain (11*70) + 140 which is close to 1000 cycles per vertex ! So we shouldn't be able to transform more than 8000 vertices per second just because of these operations... but hopefully that is not the case. I wonder how much we could transform with the SNES hardware using smart interlacing of operations with the different available multiplier / diviser units :)
Stef
Posts: 263
Joined: Mon Jul 01, 2013 11:25 am

Re: Why no SNES homebrew scene?

Post by Stef »

tomaitheous wrote:This could be repurposed for '816. The routine has a large overhead of T1 loading because it's setup for sequential calls were T1 doesn't change, but you could easily switch that to indirect, removing the self-modifying code, and seriously cut that loading over head down by a lot. Also, the tables are split 8bit as well as the math, so that could be optimized into single tables with single sbc's; word wide tables and 16bit operations. A re-write/re-structure for '816 could probably get it close to 70 cycle range.

http://codebase64.org/doku.php?id=base: ... iplication 16x16->32bit mul
Interesting :) is there the same for signed multiplication (which is more useful) ? I guess it only requires different lut.

Edit: I found the signed version which add some cycles at the end of the multiplication operation, signed operands process a bit slower because of that, still that is a fast implentation for the 6502 CPU :-)
tomaitheous
Posts: 592
Joined: Thu Aug 28, 2008 1:17 am
Contact:

Re: Why no SNES homebrew scene?

Post by tomaitheous »

Stef wrote: Interesting :) is there the same for signed multiplication (which is more useful) ? I guess it only requires different lut.
Edit: I found the signed version which add some cycles at the end of the multiplication operation, signed operands process a bit slower because of that, still that is a fast implentation for the 6502 CPU :-)

Code: Select all

; Description: Signed 16-bit multiplication with signed 32-bit result.
;                                                                     
; Input: 16-bit signed value in T1                                    
;        16-bit signed value in T2                                    
;        Carry=0: Re-use T1 from previous multiplication (faster)     
;        Carry=1: Set T1 (slower)                                     
;                                                                     
; Output: 32-bit signed value in PRODUCT                              
;
; Clobbered: PRODUCT, X, A, C
.proc multiply_16bit_signed
                jsr multiply_16bit_unsigned

                ; Apply sign (See C=Hacking16 for details).
                lda T1+1
                bpl :+
                    sec
                    lda PRODUCT+2
                    sbc T2+0
                    sta PRODUCT+2
                    lda PRODUCT+3
                    sbc T2+1
                    sta PRODUCT+3
                :
                lda T2+1
                bpl :+
                    sec
                    lda PRODUCT+2
                    sbc T1+0
                    sta PRODUCT+2
                    lda PRODUCT+3
                    sbc T1+1
                    sta PRODUCT+3
                :

                rts
.endproc
Yeah, those two groups of 8bit SBCs could be optimized down to one each for '816. So worse case is two 16bit subtractions for signed overhead. Though I think a slightly more optimal routine could be written for signed input/output.
__________________________
http://pcedev.wordpress.com
Stef
Posts: 263
Joined: Mon Jul 01, 2013 11:25 am

Re: Why no SNES homebrew scene?

Post by Stef »

Indeed, someone should rewrite an optimized version for the 65816, would be pretty useful :)
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: Why no SNES homebrew scene?

Post by psycopathicteen »

I think I figured out how Tomaitheous's code works. This is basically the math equation:

LUT1(a+b) - LUT2(255-a+b)

LUT1(x) = (x^2)/4
LUT2(x) = ((x-255)^2)/4

((a+b)^2)/4 - ((255-a+b-255)^2)/4=
((a+b)^2)/4 - ((-a+b)^2)/4 =
((a+b)^2 - (-a+b)^2)/4 =
(a^2 + 2ab + b^2 - (a^2 - 2ab + b^2))/4 =
(a^2 + 2ab + b^2 - a^2 + 2ab - b^2)/4 =
4ab/4 = ab

I don't know how the rounding error somehow gets cancelled out.
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Re: Why no SNES homebrew scene?

Post by tepples »

psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: Why no SNES homebrew scene?

Post by psycopathicteen »

I use multiplication for calculating the rotation of the bosses' joints. The sines and cosines are 16-bit signed values from -256 to 256, and the radius is a signed 8-bit values. The result is a signed 16-bit value.

I thought of a fast routine that uses non-mode-7 multiplication registers.

Code: Select all

sine_multiplication:
lda #$0000
sep #$20
cpy #$80
bcc +
sbc sine_lo,x //clears carry for upcomming adc
+;
sty $4202
ldy sine_hi,x
sty $4203
ldy sine_lo,x
adc $004216 //wastes one cycle
sty $4203
xba
rep #$21
adc $4216
rts
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: Why no SNES homebrew scene?

Post by psycopathicteen »

What really boggles my mind about slowdown is when games lag with 4 sprites, while other games can have more than 40. You mean to tell me they programmed every routine 10 times slower than necessary?

What's even more odd is that the games with heavy slowdown seem to have a higher threshold on the second lag frame as if running the game itself takes a lot of overhead. Like, it takes 4 sprites to make it run at 30fps, but with 10 sprites onscreen, it STILL runs at 30fps? WTF?
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Re: Why no SNES homebrew scene?

Post by tepples »

The data

Game A
1-3 objects: 60 fps
4-10 objects: 30 fps

Game B
40 objects: 60 fps

Speculation as to cause

For purpose of argument, I'll assume good faith, imputing no malice where incompetence is sufficient (Hanlon's razor), nor incompetence when bona fide intractability is sufficient.

As for the difference between the games: Some games have more complicated collision detection and path finding algorithms and may thus slow down with fewer active objects. I doubt that individual spread-gun bullet sprites in Contra or even ships in Recca have very complicated movement.

As for why 10 is no slower than 4: Perhaps there is a constant overhead equivalent to seven objects, such as decoding the map into background update packets. Or perhaps the player character is as complex as two or three objects. I know the walking characters in Haunted: Halloween '85 (the player and the zombies) are the most algorithmically complex because they have to read the collision map four times (bottom center, left, and right, and head-height at leading edge), compared to everything else that reads it once (bottom center) or not at all. Incidentally, I had to do a shload of optimization to the code that parses collision slabs to get it to work well with six walkers (the player and five zombies) on the most platform-filled levels (such as the barn) without slowing down.
User avatar
Drew Sebastino
Formerly Espozo
Posts: 3496
Joined: Mon Sep 15, 2014 4:35 pm
Location: Richmond, Virginia

Re: Why no SNES homebrew scene?

Post by Drew Sebastino »

What's funny is that I just got 1943 from a local video game store, Game X Change, (don't know why I felt like sharing that) and I played it and it's a ton better about not slowing down than Gradius III which I also got, except that this one is on the SNES... That's what really boggles my mind.

Isn't the SNES's CPU supposed to be something like 4x faster than the one in the NES?
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: Why no SNES homebrew scene?

Post by psycopathicteen »

Now that I think about it, I think most people mistakenly identify object collision as speed-critical code, when it could actually be the BG collision that bogs the cpu down. I wonder if slopes have anything to do with it, because a lot of NES games didn't use slopes.

Edit: I changed the wording, because I'm not exactly sure if this is totally accurate.
Last edited by psycopathicteen on Wed Jan 27, 2016 8:15 pm, edited 1 time in total.
User avatar
Drew Sebastino
Formerly Espozo
Posts: 3496
Joined: Mon Sep 15, 2014 4:35 pm
Location: Richmond, Virginia

Re: Why no SNES homebrew scene?

Post by Drew Sebastino »

psycopathicteen wrote:BG collision that bogs the cpu down.
Oh yeah... There's no BG to bump into in that game. :lol:
Post Reply