It is currently Thu Apr 25, 2019 9:01 am

 All times are UTC - 7 hours

### Forum rules

• For making cartridges of your Super NES games, see Reproduction.

 Page 9 of 30 [ 436 posts ] Go to page Previous  1 ... 6, 7, 8, 9, 10, 11, 12 ... 30  Next
 Print view Previous topic | Next topic
Author Message
 Post subject: Re: Why no SNES homebrew scene?Posted: Sun Jan 03, 2016 12:41 am
 Formerly Espozo

Joined: Mon Sep 15, 2014 4:35 pm
Posts: 3450
Location: Richmond, Virginia
Stef wrote:
The SNES only has 8x8 multiplication (the other unit being used by the mode 7, that is not safe for a regular alternative)

Wait, are you saying that you can do 16x8 or 16x16 multiplication with mode 7's multiplication and division registers? Is there any realistic way to "chain" two 8 bit multiplications together like you can do with addition or subtraction? Probably not...

Stef wrote:
(you have to count at least 15/16 cycles for load / read result time)

That's why I was thinking it wouldn't be that fast. 70 to 140 cycles sounds astronomically high to me, but on the SNES, it's more like half which is still ridiculously large, but it's at least somewhat reasonable.

Top

 Post subject: Re: Why no SNES homebrew scene?Posted: Sun Jan 03, 2016 1:30 am

Joined: Mon Jan 03, 2005 10:36 am
Posts: 3141
Location: Tampere, Finland
Espozo wrote:
Stef wrote:
The SNES only has 8x8 multiplication (the other unit being used by the mode 7, that is not safe for a regular alternative)

Wait, are you saying that you can do 16x8 or 16x16 multiplication with mode 7's multiplication and division registers? Is there any realistic way to "chain" two 8 bit multiplications together like you can do with addition or subtraction? Probably not...

Not sure if this was what you were asking, but off the top of my head: (a and b are 16-bit numbers)
Code:
a*b = (a_hi*256 + a_lo) * (b_hi*256 + b_lo)
= a_hi*b_hi*256*256 + a_hi*256*b_lo + a_lo*b_hi*256 + a_lo*b_lo
= ((a_hi*b_hi)<<16) + ((a_hi*b_lo + a_lo*b_hi)<<8) + a_lo*b_lo

_________________
Download STREEMERZ for NES from fauxgame.com! — Some other stuff I've done: fo.aspekt.fi

Top

 Post subject: Re: Why no SNES homebrew scene?Posted: Sun Jan 03, 2016 7:56 am

Joined: Thu Aug 28, 2008 1:17 am
Posts: 591
This could be repurposed for '816. The routine has a large overhead of T1 loading because it's setup for sequential calls were T1 doesn't change, but you could easily switch that to indirect, removing the self-modifying code, and seriously cut that loading over head down by a lot. Also, the tables are split 8bit as well as the math, so that could be optimized into single tables with single sbc's; word wide tables and 16bit operations. A re-write/re-structure for '816 could probably get it close to 70 cycle range.

http://codebase64.org/doku.php?id=base: ... iplication 16x16->32bit mul
Code:
; Description: Unsigned 16-bit multiplication with unsigned 32-bit result.
;
; Input: 16-bit unsigned value in T1
;        16-bit unsigned value in T2
;        Carry=0: Re-use T1 from previous multiplication (faster)
;        Carry=1: Set T1 (slower)
;
; Output: 32-bit unsigned value in PRODUCT
;
; Clobbered: PRODUCT, X, A, C
;
; Allocation setup: T1,T2 and PRODUCT preferably on Zero-page.
;                   square1_lo, square1_hi, square2_lo, square2_hi must be
;                   page aligned. Each table are 512 bytes. Total 2kb.
;
; Table generation: I:0..511
;                   square1_lo = <((I*I)/4)
;                   square1_hi = >((I*I)/4)
;                   square2_lo = <(((I-255)*(I-255))/4)
;                   square2_hi = >(((I-255)*(I-255))/4)
.proc multiply_16bit_unsigned
; <T1 * <T2 = AAaa
; <T1 * >T2 = BBbb
; >T1 * <T2 = CCcc
; >T1 * >T2 = DDdd
;
;       AAaa
;     BBbb
;     CCcc
; + DDdd
; ----------
;   PRODUCT!

; Setup T1 if changed
bcc :+
lda T1+0
sta sm1a+1
sta sm3a+1
sta sm5a+1
sta sm7a+1
eor #\$ff
sta sm2a+1
sta sm4a+1
sta sm6a+1
sta sm8a+1
lda T1+1
sta sm1b+1
sta sm3b+1
sta sm5b+1
sta sm7b+1
eor #\$ff
sta sm2b+1
sta sm4b+1
sta sm6b+1
sta sm8b+1
:

; Perform <T1 * <T2 = AAaa
ldx T2+0
sec
sm1a:           lda square1_lo,x
sm2a:           sbc square2_lo,x
sta PRODUCT+0
sm3a:           lda square1_hi,x
sm4a:           sbc square2_hi,x
sta _AA+1

; Perform >T1_hi * <T2 = CCcc
sec
sm1b:           lda square1_lo,x
sm2b:           sbc square2_lo,x
sta _cc+1
sm3b:           lda square1_hi,x
sm4b:           sbc square2_hi,x
sta _CC+1

; Perform <T1 * >T2 = BBbb
ldx T2+1
sec
sm5a:           lda square1_lo,x
sm6a:           sbc square2_lo,x
sta _bb+1
sm7a:           lda square1_hi,x
sm8a:           sbc square2_hi,x
sta _BB+1

; Perform >T1 * >T2 = DDdd
sec
sm5b:           lda square1_lo,x
sm6b:           sbc square2_lo,x
sta _dd+1
sm7b:           lda square1_hi,x
sm8b:           sbc square2_hi,x
sta PRODUCT+3

; Add the separate multiplications together
clc
_AA:            lda #0
sta PRODUCT+1
_BB:            lda #0
sta PRODUCT+2
bcc :+
inc PRODUCT+3
clc
:
_cc:            lda #0
sta PRODUCT+1
_dd:            lda #0
sta PRODUCT+2
bcc :+
inc PRODUCT+3
:

rts
.endproc

_________________
__________________________
http://pcedev.wordpress.com

Top

 Post subject: Re: Why no SNES homebrew scene?Posted: Sun Jan 03, 2016 7:58 am

Joined: Mon Jul 01, 2013 11:25 am
Posts: 249
Espozo wrote:
Wait, are you saying that you can do 16x8 or 16x16 multiplication with mode 7's multiplication and division registers? Is there any realistic way to "chain" two 8 bit multiplications together like you can do with addition or subtraction? Probably not...

You can do 8x16 --> 24 multiplication with mode 7 registers but no division as far i know.
And of course you can chain them to obtain a 16x16 --> 32 multiplication :
(8 low) x16 = x
(8 high) x16 = y
then just do a 32 bits addition : x + (y << 8) to get the final 32 bits results.
Still even using the mode 7 registers i bet you spent a large amount of cycle just to do 16x16=32

Quote:
That's why I was thinking it wouldn't be that fast. 70 to 140 cycles sounds astronomically high to me, but on the SNES, it's more like half which is still ridiculously large, but it's at least somewhat reasonable.

Actually on 68000 the multiplication takes up to 72 cycles (for signed version) and up to 140 cycles for the signed division but given some benchmarks i did, you can assume a mean of 50 cycles for multiplication and about 90/100 cycles for division which is not that bad.
With the genesis 68000 cpu (7.67 Mhz) i can transform (3D transformation + 2D projection) about 10000 vertices per second which is not that bad (i expected 6000 max).
A single 3D transformation consist of :
- 9 16x16=32 multiplication
- 6 32bit additions
- 3 16bit additions
A single 2D projection consist of :
- 3 16bit additions
- 1 32:16=16 division
- 2 16x16=32 multiplication

The projection could be different but i handled it that way for convenience.
Still that is a big amount of complexes operations and i don't count the load / store / shift operations here.
If we just count maximum cycles of mul and div, we already obtain (11*70) + 140 which is close to 1000 cycles per vertex ! So we shouldn't be able to transform more than 8000 vertices per second just because of these operations... but hopefully that is not the case. I wonder how much we could transform with the SNES hardware using smart interlacing of operations with the different available multiplier / diviser units

Top

 Post subject: Re: Why no SNES homebrew scene?Posted: Sun Jan 03, 2016 8:00 am

Joined: Mon Jul 01, 2013 11:25 am
Posts: 249
tomaitheous wrote:
This could be repurposed for '816. The routine has a large overhead of T1 loading because it's setup for sequential calls were T1 doesn't change, but you could easily switch that to indirect, removing the self-modifying code, and seriously cut that loading over head down by a lot. Also, the tables are split 8bit as well as the math, so that could be optimized into single tables with single sbc's; word wide tables and 16bit operations. A re-write/re-structure for '816 could probably get it close to 70 cycle range.

http://codebase64.org/doku.php?id=base: ... iplication 16x16->32bit mul

Interesting is there the same for signed multiplication (which is more useful) ? I guess it only requires different lut.

Edit: I found the signed version which add some cycles at the end of the multiplication operation, signed operands process a bit slower because of that, still that is a fast implentation for the 6502 CPU

Top

 Post subject: Re: Why no SNES homebrew scene?Posted: Sun Jan 03, 2016 2:16 pm

Joined: Thu Aug 28, 2008 1:17 am
Posts: 591
Stef wrote:
Interesting is there the same for signed multiplication (which is more useful) ? I guess it only requires different lut.
Edit: I found the signed version which add some cycles at the end of the multiplication operation, signed operands process a bit slower because of that, still that is a fast implentation for the 6502 CPU

Code:
; Description: Signed 16-bit multiplication with signed 32-bit result.
;
; Input: 16-bit signed value in T1
;        16-bit signed value in T2
;        Carry=0: Re-use T1 from previous multiplication (faster)
;        Carry=1: Set T1 (slower)
;
; Output: 32-bit signed value in PRODUCT
;
; Clobbered: PRODUCT, X, A, C
.proc multiply_16bit_signed
jsr multiply_16bit_unsigned

; Apply sign (See C=Hacking16 for details).
lda T1+1
bpl :+
sec
lda PRODUCT+2
sbc T2+0
sta PRODUCT+2
lda PRODUCT+3
sbc T2+1
sta PRODUCT+3
:
lda T2+1
bpl :+
sec
lda PRODUCT+2
sbc T1+0
sta PRODUCT+2
lda PRODUCT+3
sbc T1+1
sta PRODUCT+3
:

rts
.endproc

Yeah, those two groups of 8bit SBCs could be optimized down to one each for '816. So worse case is two 16bit subtractions for signed overhead. Though I think a slightly more optimal routine could be written for signed input/output.

_________________
__________________________
http://pcedev.wordpress.com

Top

 Post subject: Re: Why no SNES homebrew scene?Posted: Sun Jan 03, 2016 3:14 pm

Joined: Mon Jul 01, 2013 11:25 am
Posts: 249
Indeed, someone should rewrite an optimized version for the 65816, would be pretty useful

Top

 Post subject: Re: Why no SNES homebrew scene?Posted: Sun Jan 03, 2016 8:47 pm

Joined: Wed May 19, 2010 6:12 pm
Posts: 2843
I think I figured out how Tomaitheous's code works. This is basically the math equation:

LUT1(a+b) - LUT2(255-a+b)

LUT1(x) = (x^2)/4
LUT2(x) = ((x-255)^2)/4

((a+b)^2)/4 - ((255-a+b-255)^2)/4=
((a+b)^2)/4 - ((-a+b)^2)/4 =
((a+b)^2 - (-a+b)^2)/4 =
(a^2 + 2ab + b^2 - (a^2 - 2ab + b^2))/4 =
(a^2 + 2ab + b^2 - a^2 + 2ab - b^2)/4 =
4ab/4 = ab

I don't know how the rounding error somehow gets cancelled out.

Top

 Post subject: Re: Why no SNES homebrew scene?Posted: Sun Jan 03, 2016 9:10 pm

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 21324
Location: NE Indiana, USA (NTSC)
See quarter square multiplication on Wikipedia.

_________________
Pin Eight | Twitter | GitHub | Patreon

Top

 Post subject: Re: Why no SNES homebrew scene?Posted: Mon Jan 04, 2016 12:51 pm

Joined: Wed May 19, 2010 6:12 pm
Posts: 2843
I use multiplication for calculating the rotation of the bosses' joints. The sines and cosines are 16-bit signed values from -256 to 256, and the radius is a signed 8-bit values. The result is a signed 16-bit value.

I thought of a fast routine that uses non-mode-7 multiplication registers.

Code:
sine_multiplication:
lda #\$0000
sep #\$20
cpy #\$80
bcc +
sbc sine_lo,x //clears carry for upcomming adc
+;
sty \$4202
ldy sine_hi,x
sty \$4203
ldy sine_lo,x
adc \$004216 //wastes one cycle
sty \$4203
xba
rep #\$21
rts

Top

 Post subject: Re: Why no SNES homebrew scene?Posted: Mon Jan 25, 2016 5:29 pm

Joined: Wed May 19, 2010 6:12 pm
Posts: 2843
What really boggles my mind about slowdown is when games lag with 4 sprites, while other games can have more than 40. You mean to tell me they programmed every routine 10 times slower than necessary?

What's even more odd is that the games with heavy slowdown seem to have a higher threshold on the second lag frame as if running the game itself takes a lot of overhead. Like, it takes 4 sprites to make it run at 30fps, but with 10 sprites onscreen, it STILL runs at 30fps? WTF?

Top

 Post subject: Re: Why no SNES homebrew scene?Posted: Mon Jan 25, 2016 5:50 pm

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 21324
Location: NE Indiana, USA (NTSC)
The data

Game A
1-3 objects: 60 fps
4-10 objects: 30 fps

Game B
40 objects: 60 fps

Speculation as to cause

For purpose of argument, I'll assume good faith, imputing no malice where incompetence is sufficient (Hanlon's razor), nor incompetence when bona fide intractability is sufficient.

As for the difference between the games: Some games have more complicated collision detection and path finding algorithms and may thus slow down with fewer active objects. I doubt that individual spread-gun bullet sprites in Contra or even ships in Recca have very complicated movement.

As for why 10 is no slower than 4: Perhaps there is a constant overhead equivalent to seven objects, such as decoding the map into background update packets. Or perhaps the player character is as complex as two or three objects. I know the walking characters in Haunted: Halloween '85 (the player and the zombies) are the most algorithmically complex because they have to read the collision map four times (bottom center, left, and right, and head-height at leading edge), compared to everything else that reads it once (bottom center) or not at all. Incidentally, I had to do a shload of optimization to the code that parses collision slabs to get it to work well with six walkers (the player and five zombies) on the most platform-filled levels (such as the barn) without slowing down.

_________________
Pin Eight | Twitter | GitHub | Patreon

Top

 Post subject: Re: Why no SNES homebrew scene?Posted: Mon Jan 25, 2016 6:35 pm
 Formerly Espozo

Joined: Mon Sep 15, 2014 4:35 pm
Posts: 3450
Location: Richmond, Virginia
What's funny is that I just got 1943 from a local video game store, Game X Change, (don't know why I felt like sharing that) and I played it and it's a ton better about not slowing down than Gradius III which I also got, except that this one is on the SNES... That's what really boggles my mind.

Isn't the SNES's CPU supposed to be something like 4x faster than the one in the NES?

Top

 Post subject: Re: Why no SNES homebrew scene?Posted: Mon Jan 25, 2016 7:11 pm

Joined: Wed May 19, 2010 6:12 pm
Posts: 2843
Now that I think about it, I think most people mistakenly identify object collision as speed-critical code, when it could actually be the BG collision that bogs the cpu down. I wonder if slopes have anything to do with it, because a lot of NES games didn't use slopes.

Edit: I changed the wording, because I'm not exactly sure if this is totally accurate.

Last edited by psycopathicteen on Wed Jan 27, 2016 8:15 pm, edited 1 time in total.

Top

 Post subject: Re: Why no SNES homebrew scene?Posted: Mon Jan 25, 2016 7:22 pm
 Formerly Espozo

Joined: Mon Sep 15, 2014 4:35 pm
Posts: 3450
Location: Richmond, Virginia
psycopathicteen wrote:
BG collision that bogs the cpu down.

Oh yeah... There's no BG to bump into in that game.

Top

 Display posts from previous: All posts1 day7 days2 weeks1 month3 months6 months1 year Sort by AuthorPost timeSubject AscendingDescending
 Page 9 of 30 [ 436 posts ] Go to page Previous  1 ... 6, 7, 8, 9, 10, 11, 12 ... 30  Next

 All times are UTC - 7 hours

#### Who is online

Users browsing this forum: creaothceann, nocash and 3 guests

 You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forumYou cannot post attachments in this forum

Search for:
 Jump to:  Select a forum ------------------ NES / Famicom    NESdev    NESemdev    NES Graphics    NES Music    Homebrew Projects       2018 NESdev Competition       2017 NESdev Competition       2016 NESdev Competition       2014 NESdev Competition       2011 NESdev Competition    Newbie Help Center    NES Hardware and Flash Equipment       Reproduction    NESdev International       FCdev       NESdev China       NESdev Middle East Other    General Stuff    Membler Industries    Other Retro Dev       SNESdev       GBDev    Test Forum Site Issues    phpBB Issues    Web Issues    nesdevWiki