Fast 2D blitting on Super FX

93143
Posts: 1715
Joined: Fri Jul 04, 2014 9:31 pm

Fast 2D blitting on Super FX

Post by 93143 »

I've been messing with some conceptual code for high-speed drawing of untransformed 2D graphics. Each routine requires its own specialized graphics format, and the formats are not interoperable; I'm thinking I'd test for speed beforehand and use whichever routine was fastest for a given object - as long as all the code can fit in the instruction cache at the same time...

Of course, this code is probably not final. I'm still not very good at Super FX.

I had also considered using a sort of data-as-code format for large quantities of small identical objects. The entire graphic would be hardcoded and require no ROM access, metadata handling, or branching. But I haven't written anything like that yet.

I've got to say, I find the 8-bit busing to be far more aggravating in the case of the Super FX than in the case of the S-CPU. What they were trying to do with the Super FX really needed more bits per word than they had...

Code: Select all

; SINGLE-PIXEL BLITTING (slowest and most general):

	to R12		 ; pixel count goes in the LOOP index register
	getb			; get pixel count for first line
	inc R14		; increment ROM address, triggering a buffer load

Start:
	getc			; get pixel data (one byte per pixel) from ROM buffer
	inc R14		; and increment ROM address
	loop			; decrement pixel count; if not zero, go to address in R13, ie: "Start"
	plot			; plot pixel and increment X-counter in R1 (since the GSU is pipelined, this byte gets executed regardless)

	getb			; get carriage return X-component (goes in R0)
	inc R14		; increment ROM address
	with R1		; update X-coordinate
	sub R0		 ; by subtracting carriage return X-component
	inc R2		 ; increment Y-coordinate
	to R12		 ; update LOOP index register
	getb			; with pixel count for next line, plus one (the LOOP below decrements it before testing)
	inc R14		; increment ROM address
	loop			; decrement pixel count and branch to Start if not zero (so a stored count of 1 ends the blit)
	nop			 ; dummy fill pipeline (nothing else to do before GETC, and the ROM buffer isn't ready anyway)

; The main loop has only two bytes between INC R14 and GETC, so in high-speed mode it's probably 6 cycles rather than 4.
; Blitting a sliver in 4bpp is probably at least 40 cycles, but that's still only 5 cycles per pixel, so this method is
; bottlenecked by code unless you're drawing in 8bpp.
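For anyone trying to picture the data this routine walks through, here is a rough Python model of it. The byte layout is inferred from the listing, not from a tested format: the first count is stored as-is, later counts are stored plus one because LOOP decrements before it tests, and a stored 1 ends the object. `blit_single` and the dict standing in for the pixel buffer are hypothetical names, and transparency is ignored.

```python
def blit_single(data, x, y):
    """Decode the run format the single-pixel loop appears to expect.

    Assumed layout: count, <count pixel bytes>, x_return, then for every
    later line: count + 1, <pixel bytes>, x_return. A stored 1 terminates.
    """
    fb = {}                        # (x, y) -> colour; stands in for PLOT's target
    pos = 0
    count = data[pos]              # first GETB into R12
    pos += 1
    while True:
        for _ in range(count):     # GETC / INC R14 / LOOP / PLOT
            fb[(x & 0xFF, y & 0xFF)] = data[pos]
            pos += 1
            x += 1                 # PLOT bumps the X coordinate in R1
        x -= data[pos]             # carriage return: WITH R1 / SUB R0
        pos += 1
        y += 1                     # INC R2
        count = data[pos] - 1      # the second LOOP decrements the fetched count
        pos += 1
        if count == 0:             # LOOP falls through when R12 reaches zero
            return fb


# Two lines of two pixels each, drawn at (10, 20).
sprite = bytes([2, 5, 6, 2,   3, 7, 8, 2,   1])
pixels = blit_single(sprite, 10, 20)
```

The point of the model is mostly the bookkeeping: every line costs two metadata bytes (carriage return and next count) on top of one byte per pixel.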

Code: Select all

; DUAL-PIXEL BLITTING (faster for long solid runs, slower for short runs, doesn't support gaps):

	to R12		 ; pixel count goes in the LOOP index register
	getb			; get pixel count for first line, plus two if odd
	inc R14		; increment ROM address, triggering a buffer load
	with R12	  ; operate on pixel count
SStart:
	lsr			 ; turn pixel count into pixel pair count
	bcc DStart	; if the pixel count was even, go to dual-pixel blitting
	nop			 ; waste a cycle, because it's better than wasting 5 cycles at the end of the loop
	getc			; fetch the first pixel from the ROM buffer
	inc R14		; increment the ROM address
	loop			; decrement pixel pair count (hence the +2 for odd pixel counts) and go to DStart if nonzero
	plot			; plot first pixel to buffer and increment X-coordinate (happens regardless of LOOP result)

	bra EndL	  ; go to end of line (at this point it's been determined that the line was only one pixel long)
	getb			; get carriage return X-component in R0 (happens after branch)
DStart:
	getc			; get pixel pair
	inc R14		; increment ROM address
	plot			; plot pixel to buffer and increment X-coordinate
	loop			; decrement pixel pair count and go to DStart if nonzero
	plot			; plot pixel to buffer (relying on dither flag to switch colours) and increment X-coordinate

	getb			; get carriage return X-component in R0
EndL:
	inc R14		; increment ROM address
	with R1		; update X-coordinate
	sub R0		 ; with carriage return value
	inc R2		 ; increment Y-coordinate
	to R12		 ; refresh pixel counter
	getb			; with next line's pixel count, plus three if odd and one if even
	inc R14		; increment ROM address
	dec R12		; decrement pixel count (hence the +1 for lines other than the first)
	bne SStart	; branch to SStart if pixel count is nonzero
	with R12	  ; set up for right shift of pixel count

; This one uses the dither functionality to plot two pixels per byte fetched from ROM.  Naturally this means all the
; graphics have to be duplicated in ROM so there's a version for each value of the dither bit (XOR of the X and Y
; bottom bits).  Also, since dither can't plot transparent with non-transparent (it always checks the bottom of the
; colour register for colour #0, because it's checking the dither bit at the same time and doesn't yet know which half
; to use), this method does not support gaps in a line.
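A small Python model may make the dither behaviour and the "two copies in ROM" requirement easier to see. It assumes what the comment above describes: the nibble is chosen by the XOR of the bottom bits of X and Y (odd picks the high nibble), and the transparency test always looks at the low nibble. `dither_plot` and `pack_pair` are hypothetical helper names, not anything from an SDK.

```python
def dither_plot(color_reg, x, y, transparent=True):
    """Return the 4bpp colour PLOT would write, or None if the pixel is skipped."""
    # Assumption: transparency is judged on the low nibble no matter which
    # nibble dither ends up selecting.
    if transparent and (color_reg & 0x0F) == 0:
        return None
    if (x ^ y) & 1:
        return (color_reg >> 4) & 0x0F    # odd position: high nibble
    return color_reg & 0x0F               # even position: low nibble


def pack_pair(first, second, x, y):
    """Pack two horizontally adjacent pixels for a pair starting at (x, y)."""
    if (x ^ y) & 1:
        return (first << 4) | second      # first pixel lands on an odd position
    return (second << 4) | first          # first pixel lands on an even position


even_copy = pack_pair(3, 9, 0, 0)   # 0x93
odd_copy = pack_pair(3, 9, 1, 0)    # 0x39
```

The same two pixels pack differently depending on where the pair starts, which is why the graphics need one copy per dither phase. Under the same assumption, a pair with a gap goes wrong in both directions: `pack_pair(0, 9, 0, 0)` drops the solid pixel along with the transparent one, and `pack_pair(9, 0, 0, 0)` plots colour 0 over whatever was underneath.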

Code: Select all

; DUAL-PIXEL WITH GAPS (a bit slower than basic dual-pixel blitting, but more flexible):

	to R12		 ; pixel count goes in the LOOP index register
	getb			; get pixel count for first line, plus two if odd
	inc R14		; increment ROM address, triggering a buffer load
	with R12	  ; operate on pixel count
SStart:
	lsr			 ; turn pixel count into pixel pair count
	bcc DStart	; if the pixel count was even, go to dual-pixel blitting
	nop			 ; waste a cycle, because it's better than wasting 5 cycles at the end of the loop
	getc			; fetch the first pixel from the ROM buffer
	inc R14		; increment the ROM address
	loop			; decrement pixel pair count (hence the +2 for odd pixel counts) and go to DStart if nonzero
	plot			; plot first pixel to buffer and increment X-coordinate (happens regardless of LOOP result)

	bra EndL	  ; go to end of line (at this point it's been determined that the line was only one pixel long)
	getb			; get X increment in R0, shifted left and added to the Y increment bit
DStart:
	getc			; get pixel pair
	inc R14		; increment ROM address
	plot			; plot pixel to buffer and increment X-coordinate
	loop			; decrement pixel pair count and go to DStart if nonzero
	plot			; plot pixel to buffer (relying on dither flag to switch colours) and increment X-coordinate

	getb			; get X increment in R0, shifted left and added to the Y increment bit
EndL:
	inc R14		; increment ROM address
	sex			 ; sign-extend so that negative X increments remain negative when shifted
	asr			 ; arithmetic shift X increment into position, pushing the Y increment out into the carry flag (LSR would clear the sign bit)
	bcs NewLine  ; if the Y increment was one, go to NewLine (duplicated code for speed)
	with R1		; update X-coordinate
	sub R0		 ; with X increment
	to R12		 ; refresh pixel counter
	getb			; with next run's pixel count, plus three if odd and one if even
	inc R14		; increment ROM address
	dec R12		; decrement pixel count (hence the +1 for runs other than the first)
	bne SStart	; branch to SStart if pixel count is nonzero
	with R12	  ; set up for right shift of pixel count
	bra EndBlit  ; branch past duplicated code
	nop			 ; fill the delay slot so the SUB R0 at NewLine doesn't execute after the branch
NewLine:
	sub R0		 ; update X-coordinate with X increment
	to R12		 ; refresh pixel counter
	getb			; with next line's pixel count, plus three if odd and one if even
	inc R14		; increment ROM address
	inc R2		 ; increment Y-coordinate
	dec R12		; decrement pixel count
	bne SStart	; branch to SStart if pixel count is nonzero
	with R12	  ; set up for right shift of pixel count
EndBlit:

; This one encodes the X-coordinate carriage return value shifted left with a Y-increment bit shoved in on the right, so
; as to allow the algorithm to jump across gaps in a line without jumping down.  This limits the size of the object
; somewhat, since there are now only 7 bits for the X-increment value, but I'm not too worried about that. I could
; encode TWO Y-increment bits this way, so as to allow vertical gaps in the object, but with what most of the graphics
; in my game look like, I doubt plotting a transparent pixel now and then is less efficient than doing a bunch of extra
; maneuvering at the end of every single run of solid pixels...
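The packed end-of-run byte can be sketched in Python like this. It assumes the layout described above (signed 7-bit X value in the upper bits, Y-increment flag in bit 0) and that PLOT only looks at the low byte of R1; `pack_step`, `unpack_step` and `low_byte_after_lsr` are hypothetical names.

```python
def pack_step(x_step, new_line):
    """Pack a signed 7-bit X value and a 1-bit Y flag into one byte."""
    if not -64 <= x_step <= 63:
        raise ValueError("X value does not fit in 7 signed bits")
    return ((x_step << 1) | int(new_line)) & 0xFF


def unpack_step(byte):
    """Mirror GETB / SEX / arithmetic shift right: returns (x_step, new_line)."""
    value = byte - 256 if byte & 0x80 else byte   # SEX: sign-extend the byte
    return value >> 1, bool(value & 1)            # bit 0 falls into the carry flag


def low_byte_after_lsr(byte):
    """Low byte of R0 if a logical shift is used after SEX instead."""
    value16 = (byte - 256 if byte & 0x80 else byte) & 0xFFFF
    return (value16 >> 1) & 0xFF


packed = pack_step(-3, True)   # 0xFB
```

`unpack_step(packed)` gives back `(-3, True)`. The last helper shows that a logical shift after the sign extension still yields the same low byte as an arithmetic one, so the choice only matters if something reads the upper byte of the X register.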
Thoughts? Have I made any obvious mistakes like misunderstanding how to use an instruction?

I suppose dumps of untested code aren't especially useful or interesting, since there's no indication of what might or might not be wrong...



EDIT: Just had an idea:

Code: Select all

	getb
	inc R14
	color
	plot
	mult R3	; where R3 contains 0010h
	swap
	color
	loop
	plot
Okay, never mind; that's a bit slow. It handles gaps fine, but it can just barely keep up with 4bpp blitting, which means that with metadata handling between lines, this method is probably bottlenecked by code. For some reason I was thinking SWAP was like XCN on the SPC700; it's actually more like XBA on the 65C816, which means you can't use it to flip the colours in a byte.

On the other hand, my single-pixel blit routine is even slower, and the extra pixel this method tacks onto odd-sized lines is transparent and can't cause a sliver overflow, so it might actually be better...
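The MULT + SWAP step in the EDIT snippet can be checked with a few lines of Python. This is only a model under stated assumptions: MULT is taken to be a signed 8-bit by signed 8-bit multiply with a 16-bit result, SWAP exchanges the two bytes of the register, and 4bpp COLOR only cares about the low nibble. `second_colour` is a hypothetical helper name.

```python
def s8(value):
    """Interpret the low byte of a register as a signed 8-bit number."""
    value &= 0xFF
    return value - 256 if value & 0x80 else value


def mult(r0, rn):
    """MULT: signed 8x8 multiply, 16-bit result."""
    return (s8(r0) * s8(rn)) & 0xFFFF


def swap(r0):
    """SWAP: exchange high and low bytes (like XBA, not a nibble swap)."""
    return ((r0 << 8) | (r0 >> 8)) & 0xFFFF


def second_colour(pixel_byte):
    """Colour seen by the second COLOR/PLOT after MULT R3 (R3 = 0010h) and SWAP."""
    return swap(mult(pixel_byte, 0x0010)) & 0x0F
```

Multiplying by 16 pushes the high nibble into the upper byte, and SWAP brings it back down to where COLOR reads it. The sign extension from the signed multiply only pollutes bits that a 4bpp plot ignores.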
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: Fast 2D blitting on Super FX

Post by psycopathicteen »

Here's a question. Does the Super FX chip really read the instruction "to" before it reads "get", or do they get assembled into a single instruction?
ikari_01
Posts: 141
Joined: Sat Jul 04, 2009 2:28 pm
Location: Wunstorf, Germany

Re: Fast 2D blitting on Super FX

Post by ikari_01 »

It's two instructions. The RISC is strong with this one.
tepples
Posts: 22705
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Fast 2D blitting on Super FX

Post by tepples »

Z80 and HuC6280 also have prefix instructions that modify the following instruction.
93143
Posts: 1715
Joined: Fri Jul 04, 2014 9:31 pm

Re: Fast 2D blitting on Super FX

Post by 93143 »

The problem is the 8-bit ROM bus. If you want single-cycle execution, you need single-cycle load, and if you want single-cycle load you need single-word instructions. With 16 general-purpose registers and 8 bits per instruction, there's only so much you can do without prefixes, and even some unary operations aren't necessarily important enough to burn 1/16 of the opcode matrix on.

At least it defaults to using R0 if you don't specify. It's a bit like having an accumulator. (But then, I had never programmed anything in assembly that wasn't accumulator-based before this, so for all I know this sort of 'preferred' register thing is common...)
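A toy Python model of the prefix scheme, for readers who haven't met it. It assumes the usual description of the GSU: TO picks the destination, FROM picks the source, WITH picks both, and both fall back to R0 once the next real instruction has run. The class and method names are hypothetical, and flags and the WITH + TO/FROM move encoding are left out.

```python
class GsuRegs:
    """Just enough of the GSU to show source/destination prefixes."""

    def __init__(self):
        self.r = [0] * 16
        self.sreg = 0    # source register index, defaults to R0
        self.dreg = 0    # destination register index, defaults to R0

    # Prefixes: each one is its own single-byte instruction.
    def to(self, n):
        self.dreg = n

    def from_(self, n):          # trailing underscore: 'from' is a Python keyword
        self.sreg = n

    def with_(self, n):
        self.sreg = self.dreg = n

    def _finish(self, value):
        self.r[self.dreg] = value & 0xFFFF
        self.sreg = self.dreg = 0    # prefixes only last for one instruction

    def add(self, n):
        self._finish(self.r[self.sreg] + self.r[n])

    def sub(self, n):
        self._finish(self.r[self.sreg] - self.r[n])


gsu = GsuRegs()
gsu.r[0], gsu.r[1] = 8, 100
gsu.with_(1)
gsu.sub(0)       # R1 = R1 - R0, as in "with R1 / sub R0"
gsu.add(0)       # no prefix: R0 = R0 + R0
```

After this runs, R1 holds 92 and R0 holds 16, which is the accumulator-like default in action.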

...

It's an interesting chip. It seems to have specific support for texture mapping, via the MERGE opcode - the idea is apparently that R7 and R8 are 8.8 fixed-point (subtexel-precision) texture coordinates, which can be updated easily by the rasterization code, and MERGE takes the high byte of each one and sticks them together into a single 16-bit index, which can then be used to pick a texel out of ROM via the ROM buffer.

Yes, this means that the addressable texture memory on the Super FX is 16 times as large as on the Nintendo 64...
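In Python terms, the MERGE idea looks roughly like this. The sketch assumes the result is the high byte of R7 on top and the high byte of R8 on the bottom; which coordinate you call U and which V is a convention, and `texel_offsets` is a hypothetical helper showing a rasterizer stepping along a span.

```python
def merge(r7, r8):
    """MERGE: high byte of R7 becomes the high byte, high byte of R8 the low byte."""
    return (r7 & 0xFF00) | ((r8 >> 8) & 0x00FF)


def texel_offsets(v, u, dv, du, count):
    """Step two 8.8 fixed-point coordinates and collect the merged 16-bit offsets."""
    offsets = []
    for _ in range(count):
        offsets.append(merge(v, u))
        v = (v + dv) & 0xFFFF    # the fractional low byte carries into the texel index
        u = (u + du) & 0xFFFF
    return offsets


# Half a texel per step in U, starting at row 1.
span = texel_offsets(0x0100, 0x0080, 0x0000, 0x0080, 4)
```

Here `span` comes out as offsets 0100h, 0101h, 0101h, 0102h: each texel is sampled twice because U only advances by half a texel per pixel. A 16-bit offset with one byte per texel covers 65536 bytes, which is where the factor of 16 over the N64's 4 KiB of texture memory comes from.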
ARM9
Posts: 57
Joined: Sun Aug 11, 2013 6:07 am

Re: Fast 2D blitting on Super FX

Post by ARM9 »

93143 wrote: EDIT: Just had an idea:

Code: Select all

	getb
	inc R14
	color
	plot
	mult R3	; where R3 contains 0010h
	swap
	color
	loop
	plot
Okay, never mind; that's a bit slow. It handles gaps fine, but it can just barely keep up with 4bpp blitting, which means that with metadata handling between lines, this method is probably bottlenecked by code. For some reason I was thinking SWAP was like XCN on the SPC700; it's actually more like XBA on the 65C816, which means you can't use it to flip the colours in a byte.
I'm not sure what your idea was but it looks like you're trying to do what the dither flag already does in hardware (doesn't work in 8bpp screen mode).

Code: Select all

ibt r0, #%00010 ; \
cmode           ; / set flag 1 (enable dither) in color mode register
ibt r0, #$21    ; \
color           ; / alternate between color 1 and 2
ibt r12, #16    ; draw 16 pixels
move r13, r15   ; set loop point to next instruction
loop
plot            ; color plotted = (r1^r2)&1 ? high 4 bits : low 4 bits
93143
Posts: 1715
Joined: Fri Jul 04, 2014 9:31 pm

Re: Fast 2D blitting on Super FX

Post by 93143 »

Yeah, but according to the manual, dither doesn't handle transparent pixels properly, because it determines which nibble to use in parallel with checking whether the bottom nibble is zero.

I want to be able to blit objects with gaps and/or odd numbers of pixels in a run without accidentally erasing part of what's underneath them. At the same time, I want to keep the RAM buffer more or less saturated, which should be easier if I can pull two pixels with a single ROM buffer cycle. Two of the three methods I posted (not counting the snippet you quoted) use dither for this, but with safeguards to prevent zero overwrites and missed pixels.

(There's admittedly not a lot of context for those methods; just assume that dither has been turned on for the "dual-pixel" ones...)

...

I've just noticed something, and tried it out. Bit 2 of the plot mode register is supposed to switch between lower and upper nibbles for COLOR or GETC, but...

Code: Select all

	getb
	inc R14
	from R3
	cmode	; set bit 2 to 0
	color
	plot
	from R4
	cmode	; set bit 2 to 1
	color
	loop
	plot
That's way slower than the one where I used multiplication to do the bit shift. Heck, it's slower than just using LSR four times. What have I missed?

...I guess it was intended for accessing compressed source graphics for transforms or texture mapping, not speeding up 1:1 pixel copying. In that context, you can just use GETC instead of GETB followed by COLOR. But it still seems hardly worth it compared to MULT+SWAP, considering it needs to be done every time instead of just on even-numbered texels...

It would have been nice to have an instruction to directly flip the nibbles in the color register.

(Also, it turns out there is no ASL instruction. They decided to burn an opcode on a sign-preserving ASR instead...)
ARM9
Posts: 57
Joined: Sun Aug 11, 2013 6:07 am

Re: Fast 2D blitting on Super FX

Post by ARM9 »

93143 wrote:Yeah, but according to the manual, dither doesn't handle transparent pixels properly, because it determines which nibble to use in parallel with checking whether the bottom nibble is zero.
That's right, reminds me of a test rom that seems to indicate emulators get this wrong, which is understandable assuming no licensed game uses this incorrectly.
http://imgur.com/a/FZCgX (higan 101 has the old behaviour)
The white blocks behind the background are sprites, black is cgram entry 0.
I don't have anything to test on so I can't confirm that the patched version is the correct behaviour.
93143 wrote:(Also, it turns out there is no ASL instruction. They decided to burn an opcode on a sign-preserving ASR instead...)
Due to the lack of a barrel shifter, asl/lsl is basically just add, except it affects the overflow flag.

Code: Select all

add r0; add r0; add r0 // r0 << 3
with r2; add r2 // r2 << 1
to r1; from r3; add r3 // r1 = r3 << 1
93143
Posts: 1715
Joined: Fri Jul 04, 2014 9:31 pm

Re: Fast 2D blitting on Super FX

Post by 93143 »

Okay, now I'm annoyed.

I should have paid more attention to the calculation method for plot addresses. Apparently you can't plot off the edges of the screen, even though the screen is less than 256 pixels high. Nor does it wrap intelligently - if you plot a pixel at Y=192 in 192-line mode, it ends up on line 0 in the next column over. If you plot a pixel at Y = -1 (ie: 255), it ends up on line 63.
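The wrapping can be reproduced with a short Python model. It assumes the tile number is computed as (X / 8) * tiles-per-column + (Y / 8), with 16, 20 or 24 tiles per column for the 128, 160 and 192-line modes; that formula is how the documentation is usually summarized, so treat it as an assumption. `plot_target` is a hypothetical name.

```python
TILES_PER_COLUMN = {128: 0x10, 160: 0x14, 192: 0x18}


def plot_target(x, y, height=192):
    """Return (tile column, line) where a plotted pixel actually lands."""
    rows = TILES_PER_COLUMN[height]
    tile = ((x & 0xFF) >> 3) * rows + ((y & 0xFF) >> 3)   # no range check on Y
    column, row = divmod(tile, rows)
    return column, row * 8 + (y & 7)


overflow = plot_target(0, 192)    # one past the bottom of a 192-line screen
underflow = plot_target(0, -1)    # Y = -1 is seen as 255
```

With this formula `overflow` is `(1, 0)` and `underflow` is `(1, 63)`: the excess tile rows simply spill into the next column of tiles, matching the two cases described above.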

Now I have to choose between drawing to sprite tables all the time (and having to rearrange the data before downloading to VRAM) and putting checks before and/or in the main drawing loop to handle partially offscreen bullets. At least the X coordinate works fine, since I'm not using the full width...
ARM9 wrote:emulators get this wrong
Well, it's a good thing I checked, then...
ARM9 wrote:Due to the lack of barrel shifter asl/lsl is basically just add
That is an excellent point. I'm not used to being able to add a register to itself. (At least, I wasn't before I marathoned a nontrivial Super FX program this past weekend - writing 65816 code felt weird after that...)

Speaking of shifting, I found that using a table of random numbers was (in my application, which required simultaneous access to a sine table in a different bank) not dramatically faster than just running xorshift16:

Code: Select all

	move R0, R1	; copy random number into accumulator
	add R0		  ; shift left 4 bits
	add R0
	add R0
	add R0
	xor R1		  ; exclusive-OR with old value
	move R1, R0	; copy result to R1
	lsr			  ; shift right 3 bits
	lsr
	lsr
	xor R1		  ; exclusive-OR with old value
	move R1, R0	; copy result to R1
	add R0		  ; shift left 7 bits
	add R0
	add R0
	add R0
	add R0
	add R0
	add R0
	xor R1		  ; exclusive-OR with old value
	move R1, R0	; copy result to R1
Actually, my initial attempt at a table of rands was slower than the above algorithm, but I'm getting better at this quickly...

Fun fact: in high-speed mode, this PRNG executes in the same amount of time it takes an S-CPU in FastROM to load a 16-bit number from direct page...
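For checking the assembly against something easier to read, here is the same xorshift16 step in Python, with the 4/3/7 shift amounts taken from the listing. `period` is a hypothetical helper that walks the sequence until the seed comes back, so the cycle length can be measured for a given seed.

```python
def xorshift16(x, a=4, b=3, c=7):
    """One step of a 16-bit xorshift: left by a, right by b, left by c."""
    x ^= (x << a) & 0xFFFF    # ADD R0 repeated a times, then XOR R1
    x ^= x >> b               # LSR repeated b times, then XOR R1
    x ^= (x << c) & 0xFFFF    # ADD R0 repeated c times, then XOR R1
    return x


def period(seed=1, **shifts):
    """Count steps until the sequence returns to the seed."""
    x, steps = xorshift16(seed, **shifts), 1
    while x != seed:
        x, steps = xorshift16(x, **shifts), steps + 1
    return steps


first = xorshift16(1)    # 0x0993
```

Each of the three steps is reversible, so a nonzero seed never collapses to zero and the sequence always comes back around to the seed; zero stays stuck at zero, so the seed must not be 0.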
93143
Posts: 1715
Joined: Fri Jul 04, 2014 9:31 pm

Re: Fast 2D blitting on Super FX

Post by 93143 »

Okay, yeah, I think I can do better than that:

Code: Select all

	move R0, R1	; copy random number into accumulator
	add R0		  ; shift left 5 bits
	add R0
	add R0
	add R0
	add R0
	xor R1		  ; exclusive-OR with old value
	move R1, R0	; copy result to R1
	hib			  ; shift right 9 bits
	lsr
	xor R1		  ; exclusive-OR with old value
	move R1, R0	; copy result to R1
	lob			  ; shift left 8 bits
	swap
	xor R1		  ; exclusive-OR with old value
	move R1, R0	; copy result to R1
23 cycles instead of 28. Nipping at the heels of a reasonable table load algorithm (repeated bankswitching is surprisingly expensive if you don't have registers to burn, and of course memory access itself is painful at five cycles per byte)...
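A quick Python check of the byte tricks and the cycle counts. It assumes HIB and LOB isolate the high and low byte, SWAP exchanges bytes, and that from cache every instruction byte costs one cycle with MOVE and XOR Rn being two bytes each (prefix plus opcode). The function and table names are hypothetical.

```python
def hib(x):
    return (x >> 8) & 0xFF

def lob(x):
    return x & 0xFF

def lsr(x):
    return (x & 0xFFFF) >> 1

def swap(x):
    return ((x << 8) | (x >> 8)) & 0xFFFF


def next_random(r1):
    """The 5/9/8 variant, written the way the listing computes it."""
    r1 ^= (r1 << 5) & 0xFFFF    # ADD R0 x5, XOR R1
    r1 ^= lsr(hib(r1))          # HIB + LSR = shift right 9
    r1 ^= swap(lob(r1))         # LOB + SWAP = shift left 8
    return r1


# Assumed instruction sizes in bytes (= cycles when running from cache).
SIZE = {"move": 2, "xor": 2, "add": 1, "lsr": 1, "hib": 1, "lob": 1, "swap": 1}

OLD = (["move"] + ["add"] * 4 + ["xor", "move"] + ["lsr"] * 3
       + ["xor", "move"] + ["add"] * 7 + ["xor", "move"])
NEW = (["move"] + ["add"] * 5 + ["xor", "move", "hib", "lsr"]
       + ["xor", "move", "lob", "swap", "xor", "move"])

old_cycles = sum(SIZE[op] for op in OLD)
new_cycles = sum(SIZE[op] for op in NEW)
```

Under those size assumptions the tallies come out to 28 and 23, matching the figures quoted above, and the byte-wise shortcuts agree with plain shifts for every 16-bit value.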

...

If my math is correct, which it may very well not be, my bullet drawing loop seems to be maximally inefficient. I added up the cycles it would take to run if it wasn't held up by RAM buffer wait states, and the difference between that and the number of cycles it's actually taking seems to be roughly equal to the average number of cycles it should take to flush the pixel caches for all of the necessary sliver blit operations. In other words, there seems to be no parallel processing advantage showing up at all.

I'm hoping I did something stupid somewhere that's eating a ton of cycles to no purpose...
93143
Posts: 1715
Joined: Fri Jul 04, 2014 9:31 pm

Re: Fast 2D blitting on Super FX

Post by 93143 »

Dodge this.
FXtest1.sfc
(128 KiB) Downloaded 536 times
640 bullets at 60 fps in 224x192. The screen flashes red if it drops a frame, but you shouldn't ever see that happen. With 656 bullets I get the occasional flash, but 640 runs perfectly in higan for over two minutes, which is longer than the period of the PRNG.
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: Fast 2D blitting on Super FX

Post by psycopathicteen »

Is this using 2bpp 8x8 bullets? Damn, 640 is a lot.
93143
Posts: 1715
Joined: Fri Jul 04, 2014 9:31 pm

Re: Fast 2D blitting on Super FX

Post by 93143 »

Yes, it's 2bpp. The bullets are actually 6x7, which is closer to the size (and shape, with the SNES PAR) of this type of bullet in the original game, and is moreover slightly quicker to draw.

I couldn't do 640 bullets under these constraints with a dither-based general-purpose drawing routine; there was too much overhead, and it topped out near 500. To get to 640 I had to forget about loading from the ROM buffer and just unroll the whole bullet in code. All the parts that need to be fast still fit in the cache, and my bullet list format allows multiple lists with dedicated handling loops to exist under a common bullet cap, so I think it's a legitimate approach for a number of pattern types.

I also didn't bother checking for collision with the player, but I imagine it wouldn't be all that onerous as I can simply leave the player's position in a pair of registers without adding more than a couple of cycles to the drawing code. Actually checking for collision shouldn't take as long as pulling the position from RAM would...

The actual game will have a 144-pixel-wide playfield, which gains me about 50,000 cycles, or roughly 25% extra compute time per frame, since I won't need to spend so much time waiting for DMA and clearing the framebuffer. And a lot of the bullet patterns need to be 4bpp and hence 30 fps, which should allow me to get much closer to the theoretical pixel buffer flush time, particularly with larger bullets. For really big ones, if the background isn't Mode 7 I can reserve part of OAM for the GSU and just use real sprites...

...

Just noting here that I was wrong about "maximally inefficient". I just didn't count the cycles carefully enough. There is a significant parallel processing bonus; it's just not as large as I was hoping, probably because the lines I'm blitting are so short...
Stef
Posts: 263
Joined: Mon Jul 01, 2013 11:25 am

Re: Fast 2D blitting on Super FX

Post by Stef »

Wow, that is really impressive :)
I think it's a kind of bullet benchmark... I guess you are using the SFX to draw the bullets; I wonder how much you could do without the SFX :) Something else I wonder: do you have any kind of double buffering with the SFX memory, so you can work in one bank while you are DMAing the other bank to VRAM? If that is the case, you can really maximize usage of the SFX chip.
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: Fast 2D blitting on Super FX

Post by psycopathicteen »

I did a demo like this a while ago and I got 256 bullets at 30fps, though I had to make the bullets really small like 5x5.