All times are UTC - 7 hours



PostPosted: Mon Dec 05, 2016 8:51 pm
Joined: Fri Jul 04, 2014 9:31 pm
Posts: 703
I've been messing with some conceptual code for high-speed drawing of untransformed 2D graphics. The routines require specialized graphics formats which are not interoperable; I'm thinking I'd test for speed beforehand and use whichever one is fastest for a given object - as long as all the code can fit in the instruction cache at the same time...

Of course, this code is probably not final. I'm still not very good at Super FX.

I had also considered using a sort of data-as-code format for large quantities of small identical objects. The entire graphic would be hardcoded and require no ROM access, metadata handling, or branching. But I haven't written anything like that yet.

I've got to say, I find the 8-bit busing to be far more aggravating in the case of the Super FX than in the case of the S-CPU. What they were trying to do with the Super FX really needed more bits per word than they had...

Code:
; SINGLE-PIXEL BLITTING (slowest and most general):

   to R12       ; pixel count goes in the LOOP index register
   getb         ; get pixel count for first line
   inc R14      ; increment ROM address, triggering a buffer load

Start:
   getc         ; get pixel data (one byte per pixel) from ROM buffer
   inc R14      ; and increment ROM address
   loop         ; decrement pixel count; if not zero, go to address in R13, ie: "Start"
   plot         ; plot pixel and increment X-counter in R1 (since the GSU is pipelined, this byte gets executed regardless)

   getb         ; get carriage return X-component (goes in R0)
   inc R14      ; increment ROM address
   with R1      ; update X-coordinate
   sub R0       ; by subtracting carriage return X-component
   inc R2       ; increment Y-coordinate
   to R12       ; update LOOP index register
   getb         ; with pixel count for next line
   inc R14      ; increment ROM address
   loop         ; decrement pixel count and branch to Start if not zero
   nop          ; dummy fill pipeline (nothing else to do before GETC, and the ROM buffer isn't ready anyway)

; The main loop has only two bytes between INC R14 and GETC, so in high-speed mode it's probably 6 cycles rather than 4.
; Blitting a sliver in 4bpp is probably at least 40 cycles, but that's still only 5 cycles per pixel, so this method is
; bottlenecked by code unless you're drawing in 8bpp.

Code:
; DUAL-PIXEL BLITTING (faster for long solid runs, slower for short runs, doesn't support gaps):

   to R12       ; pixel count goes in the LOOP index register
   getb         ; get pixel count for first line, plus two if odd
   inc R14      ; increment ROM address, triggering a buffer load
   with R12     ; operate on pixel count
SStart:
   lsr          ; turn pixel count into pixel pair count
   bcc DStart   ; if the pixel count was even, go to dual-pixel blitting
   nop          ; waste a cycle, because it's better than wasting 5 cycles at the end of the loop
   getc         ; fetch the first pixel from the ROM buffer
   inc R14      ; increment the ROM address
   loop         ; decrement pixel pair count (hence the +2 for odd pixel counts) and go to DStart if nonzero
   plot         ; plot first pixel to buffer and increment X-coordinate (happens regardless of LOOP result)

   bra EndL     ; go to end of line (at this point it's been determined that the line was only one pixel long)
   getb         ; get carriage return X-component in R0 (happens after branch)
DStart:
   getc         ; get pixel pair
   inc R14      ; increment ROM address
   plot         ; plot pixel to buffer and increment X-coordinate
   loop         ; decrement pixel pair count and go to DStart if nonzero
   plot         ; plot pixel to buffer (relying on dither flag to switch colours) and increment X-coordinate

   getb         ; get carriage return X-component in R0
EndL:
   inc R14      ; increment ROM address
   with R1      ; update X-coordinate
   sub R0       ; with carriage return value
   inc R2       ; increment Y-coordinate
   to R12       ; refresh pixel counter
   getb         ; with next line's pixel count, plus three if odd and one if even
   inc R14      ; increment ROM address
   dec R12      ; decrement pixel count (hence the +1 for lines other than the first)
   bne SStart   ; branch to SStart if pixel count is nonzero
   with R12     ; set up for right shift of pixel count

; This one uses the dither functionality to plot two pixels per byte fetched from ROM.  Naturally this means all the
; graphics have to be duplicated in ROM so there's a version for each value of the dither bit (XOR of the X and Y
; bottom bits).  Also, since dither can't plot transparent with non-transparent (it always checks the bottom of the
; colour register for colour #0, because it's checking the dither bit at the same time and doesn't yet know which half
; to use), this method does not support gaps in a line.
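
To make the transparency limitation concrete, here is a rough Python model of a dithered 4bpp PLOT as described above. It is only a sketch: `dither_plot` and its arguments are made-up names, and the assumption that the transparency test looks only at the low nibble comes from the description in this post, not from hardware testing.

```python
def dither_plot(color_reg, x, y):
    """Model one dithered 4bpp PLOT.

    Returns the colour index written, or None if the plot is skipped.
    Assumption (from the post): the transparency test always inspects the
    low nibble of the colour register, whichever nibble dither selects.
    """
    low = color_reg & 0x0F
    high = (color_reg >> 4) & 0x0F
    if low == 0:
        return None  # treated as transparent, even if the high nibble was wanted
    # The dither bit is the XOR of the bottom bits of X and Y.
    return high if (x ^ y) & 1 else low


# A pair whose high nibble is 0 still writes colour 0 where dither picks it.
print(dither_plot(0x03, 1, 0))
# A pair whose low nibble is 0 skips both pixels, losing the solid one.
print(dither_plot(0x30, 0, 0), dither_plot(0x30, 1, 0))
```

Under these assumptions neither case gives "one solid pixel next to one untouched pixel", which is why the routine above only handles unbroken runs.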

Code:
; DUAL-PIXEL WITH GAPS (a bit slower than basic dual-pixel blitting, but more flexible):

   to R12       ; pixel count goes in the LOOP index register
   getb         ; get pixel count for first line, plus two if odd
   inc R14      ; increment ROM address, triggering a buffer load
   with R12     ; operate on pixel count
SStart:
   lsr          ; turn pixel count into pixel pair count
   bcc DStart   ; if the pixel count was even, go to dual-pixel blitting
   nop          ; waste a cycle, because it's better than wasting 5 cycles at the end of the loop
   getc         ; fetch the first pixel from the ROM buffer
   inc R14      ; increment the ROM address
   loop         ; decrement pixel pair count (hence the +2 for odd pixel counts) and go to DStart if nonzero
   plot         ; plot first pixel to buffer and increment X-coordinate (happens regardless of LOOP result)

   bra EndL     ; go to end of line (at this point it's been determined that the line was only one pixel long)
   getb         ; get X increment in R0, shifted left and added to the Y increment bit
DStart:
   getc         ; get pixel pair
   inc R14      ; increment ROM address
   plot         ; plot pixel to buffer and increment X-coordinate
   loop         ; decrement pixel pair count and go to DStart if nonzero
   plot         ; plot pixel to buffer (relying on dither flag to switch colours) and increment X-coordinate

   getb         ; get X increment in R0, shifted left and added to the Y increment bit
EndL:
   inc R14      ; increment ROM address
   sex          ; ensure that negative X increments remain negative when shifted
   lsr          ; shift X increment into position, pushing the Y increment out into the carry flag
   bcs NewLine  ; if the Y increment was one, go to NewLine (duplicated code for speed)
   with R1      ; update X-coordinate
   sub R0       ; with X increment
   to R12       ; refresh pixel counter
   getb         ; with next run's pixel count, plus three if odd and one if even
   inc R14      ; increment ROM address
   dec R12      ; decrement pixel count (hence the +1 for runs other than the first)
   bne SStart   ; branch to SStart if pixel count is nonzero
   with R12     ; set up for right shift of pixel count
   bra EndBlit  ; branch past duplicated code
   nop          ; fill the branch delay slot so SUB R0 at NewLine doesn't execute on the way out
NewLine:
   sub R0       ; update X-coordinate with X increment
   to R12       ; refresh pixel counter
   getb         ; with next line's pixel count, plus three if odd and one if even
   inc R14      ; increment ROM address
   inc R2       ; increment Y-coordinate
   dec R12      ; decrement pixel count
   bne SStart   ; branch to SStart if pixel count is nonzero
   with R12     ; set up for right shift of pixel count
EndBlit:

; This one encodes the X-coordinate carriage return value shifted left with a Y-increment bit shoved in on the right, so
; as to allow the algorithm to jump across gaps in a line without jumping down.  This limits the size of the object
; somewhat, since there are now only 7 bits for the X-increment value, but I'm not too worried about that. I could
; encode TWO Y-increment bits this way, so as to allow vertical gaps in the object, but with what most of the graphics
; in my game look like, I doubt plotting a transparent pixel now and then is less efficient than doing a bunch of extra
; maneuvering at the end of every single run of solid pixels...
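
For anyone building the graphics data, the end-of-run byte described above could be packed and unpacked roughly like this. This is a Python sketch with hypothetical helper names (`encode_step`, `decode_step`); it only mirrors the sign-extend-then-shift idea, and uses an arithmetic shift where the routine relies on only the low 8 bits of the X coordinate mattering.

```python
def encode_step(dx, new_line):
    """Pack a signed 7-bit X step and a 1-bit Y increment into one byte."""
    if not -64 <= dx <= 63:
        raise ValueError("X step must fit in 7 signed bits")
    return ((dx << 1) | (1 if new_line else 0)) & 0xFF


def decode_step(byte):
    """Mirror the sign-extend + shift-right sequence; returns (dx, new_line)."""
    value = byte - 0x100 if byte & 0x80 else byte  # sign-extend the byte
    new_line = bool(value & 1)                     # the bit shifted out into carry
    dx = value >> 1                                # arithmetic shift keeps the sign
    return dx, new_line


# Example: step back 3 pixels and move down one line.
packed = encode_step(-3, True)
print(hex(packed), decode_step(packed))
```

The 7-bit range check is the "limits the size of the object somewhat" part: steps outside -64..63 simply can't be represented in this format.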

Thoughts? Have I made any obvious mistakes like misunderstanding how to use an instruction?

I suppose dumps of untested code aren't especially useful or interesting, since there's no indication of what might or might not be wrong...



EDIT: Just had an idea:

Code:
   getb
   inc R14
   color
   plot
   mult R3   ; where R3 contains 0010h
   swap
   color
   loop
   plot


Okay, never mind; that's a bit slow. It handles gaps fine, but it can just barely keep up with 4bpp blitting, which means that with metadata handling between lines, this method is probably bottlenecked by code. For some reason I was thinking SWAP was like XCN on the SPC700; it's actually more like XBA on the 65C816, which means you can't use it to flip the colours in a byte.

On the other hand, my single-pixel blit routine is even slower, and the extra pixel this method tacks onto odd-sized lines is transparent and can't cause a sliver overflow, so it might actually be better...
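
Here is a quick Python model of why MULT by 0010h followed by SWAP gets the second pixel into place, and how a byte swap differs from a nibble swap. The function names are invented for illustration, and it assumes MULT is a signed 8-bit by 8-bit multiply with a 16-bit result and that 4bpp COLOR only cares about the low nibble of the low byte.

```python
def mult_by_16(r0):
    """Assumed MULT behaviour: signed 8x8 multiply by 0010h, 16-bit result."""
    low = r0 & 0xFF
    signed = low - 0x100 if low & 0x80 else low
    return (signed * 0x10) & 0xFFFF


def swap_bytes(r0):
    """SWAP as described above: exchange high and low bytes (like XBA)."""
    return ((r0 << 8) | (r0 >> 8)) & 0xFFFF


def swap_nibbles(byte):
    """What XCN does on the SPC700: exchange the two nibbles of one byte."""
    return ((byte << 4) | (byte >> 4)) & 0xFF


def second_pixel(byte):
    """Low nibble seen by COLOR after MULT R3 (R3 = 0010h) then SWAP."""
    return swap_bytes(mult_by_16(byte)) & 0x0F


print(hex(second_pixel(0xAB)), hex(swap_bytes(0x00AB)), hex(swap_nibbles(0xAB)))
```

The sign extension from the multiply only pollutes the upper nibble of the swapped low byte, which doesn't matter in 4bpp under the assumption above.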


PostPosted: Thu Dec 08, 2016 3:18 pm
Joined: Wed May 19, 2010 6:12 pm
Posts: 2131
Here's a question. Does the Super FX chip really read the instruction "to" before it reads "get", or do they get assembled into a single instruction?


PostPosted: Thu Dec 08, 2016 4:33 pm
Joined: Sat Jul 04, 2009 2:28 pm
Posts: 139
Location: Wunstorf, Germany
It's two instructions. The RISC is strong with this one.


PostPosted: Thu Dec 08, 2016 4:39 pm
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 18332
Location: NE Indiana, USA (NTSC)
Z80 and HuC6280 also have prefix instructions that modify the following instruction.


PostPosted: Thu Dec 08, 2016 5:29 pm
Joined: Fri Jul 04, 2014 9:31 pm
Posts: 703
The problem is the 8-bit ROM bus. If you want single-cycle execution, you need single-cycle load, and if you want single-cycle load you need single-word instructions. With 16 general-purpose registers and 8 bits per instruction, there's only so much you can do without prefixes, and even some unary operations aren't necessarily important enough to burn 1/16 of the opcode matrix on.

At least it defaults to using R0 if you don't specify. It's a bit like having an accumulator. (But then, I had never programmed anything in assembly that wasn't accumulator-based before this, so for all I know this sort of 'preferred' register thing is common...)

...

It's an interesting chip. It seems to have specific support for texture mapping, via the MERGE opcode - the idea is apparently that R7 and R8 are 8.8 subtexel representations of the texture indices, which can be updated easily by the rasterization code, and MERGE takes the high byte of each one and sticks them together into a single 16-bit index, which can then be used to pick a texel out of the ROM buffer.
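
As a rough illustration of that idea, here is a Python sketch of MERGE plus a stepping loop. The helper names and the step values are made up, and the byte order (R7's high byte on top, R8's high byte on the bottom) is an assumption; the post only says the two high bytes get stuck together.

```python
def merge(r7, r8):
    """Combine the integer parts of two 8.8 fixed-point values.

    Assumed order: high byte of R7 becomes the high byte of the result,
    high byte of R8 becomes the low byte.
    """
    return (r7 & 0xFF00) | ((r8 >> 8) & 0xFF)


def texel_indices(r7, r8, step7, step8, count):
    """Step both 8.8 coordinates and collect the merged 16-bit indices."""
    indices = []
    for _ in range(count):
        indices.append(merge(r7, r8))
        r7 = (r7 + step7) & 0xFFFF  # 16-bit registers wrap
        r8 = (r8 + step8) & 0xFFFF
    return indices


# Half a texel per pixel along one axis, a whole texel along the other.
print([hex(i) for i in texel_indices(0x0000, 0x0000, 0x0080, 0x0100, 4)])
```

The fractional low bytes never reach the index, so the rasterizer can accumulate sub-texel steps freely and only pay for one MERGE per texel fetch.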

Yes, this means that the addressable texture memory on the Super FX is 16 times as large as on the Nintendo 64...


PostPosted: Wed Jan 11, 2017 8:52 am
Joined: Sun Aug 11, 2013 6:07 am
Posts: 57
93143 wrote:
EDIT: Just had an idea:

Code:
   getb
   inc R14
   color
   plot
   mult R3   ; where R3 contains 0010h
   swap
   color
   loop
   plot


Okay, never mind; that's a bit slow. It handles gaps fine, but it can just barely keep up with 4bpp blitting, which means that with metadata handling between lines, this method is probably bottlenecked by code. For some reason I was thinking SWAP was like XCN on the SPC700; it's actually more like XBA on the 65C816, which means you can't use it to flip the colours in a byte.


I'm not sure what your idea was, but it looks like you're trying to do what the dither flag already does in hardware (it doesn't work in 8bpp screen mode).
Code:
ibt r0, #%00010 ; \
cmode           ; / set flag 1 (enable dither) in color mode register
ibt r0, #$21    ; \
color           ; / alternate between color 1 and 2
ibt r12, #16    ; draw 16 pixels
move r13, r15   ; set loop point to next instruction
loop
plot            ; color plotted = (r1^r2)&1 ? high 4 bits : low 4 bits


PostPosted: Wed Jan 11, 2017 7:31 pm
Joined: Fri Jul 04, 2014 9:31 pm
Posts: 703
Yeah, but according to the manual, dither doesn't handle transparent pixels properly, because it determines which nibble to use in parallel with checking whether the bottom nibble is zero.

I want to be able to blit objects with gaps and/or odd numbers of pixels in a run without accidentally erasing part of what's underneath them. At the same time, I want to keep the RAM buffer more or less saturated, which should be easier if I can pull two pixels with a single ROM buffer cycle. Two of the three methods I posted (not counting the snippet you quoted) use dither for this, but with safeguards to prevent zero overwrites and missed pixels.

(There's admittedly not a lot of context for those methods; just assume that dither has been turned on for the "dual-pixel" ones...)

...

I've just noticed something, and tried it out. Bit 2 of the plot mode register is supposed to switch between lower and upper nibbles for COLOR or GETC, but...

Code:
   getb
   inc R14
   from R3
   cmode   ; set bit 2 to 0
   color
   plot
   from R4
   cmode   ; set bit 2 to 1
   color
   loop
   plot

That's way slower than the one where I used multiplication to do the bit shift. Heck, it's slower than just using LSR four times. What have I missed?

...I guess it was intended for accessing compressed source graphics for transforms or texture mapping, not speeding up 1:1 pixel copying. In that context, you can just use GETC instead of GETB followed by COLOR. But it still seems hardly worth it compared to MULT+SWAP, considering it needs to be done every time instead of just on even-numbered texels...

It would have been nice to have an instruction to directly flip the nibbles in the color register.

(Also, it turns out there is no ASL instruction. They decided to burn an opcode on a sign-preserving ASR instead...)


PostPosted: Sun Jan 15, 2017 2:52 pm
Joined: Sun Aug 11, 2013 6:07 am
Posts: 57
93143 wrote:
Yeah, but according to the manual, dither doesn't handle transparent pixels properly, because it determines which nibble to use in parallel with checking whether the bottom nibble is zero.

That's right. It reminds me of a test ROM that seems to indicate emulators get this wrong, which is understandable, assuming no licensed game uses this incorrectly.
http://imgur.com/a/FZCgX (higan 101 has the old behaviour)
The white blocks behind the background are sprites, black is cgram entry 0.
I don't have anything to test on so I can't confirm that the patched version is the correct behaviour.

Quote:
(Also, it turns out there is no ASL instruction. They decided to burn an opcode on a sign-preserving ASR instead...)

Due to the lack of a barrel shifter, asl/lsl is basically just add, except it affects the overflow flag.
Code:
add r0; add r0; add r0 // r0 << 3
with r2; add r2 // r2 << 1
to r1; from r3; add r3 // r1 = r3 << 1


PostPosted: Thu Feb 23, 2017 3:49 am
Joined: Fri Jul 04, 2014 9:31 pm
Posts: 703
Okay, now I'm annoyed.

I should have paid more attention to the calculation method for plot addresses. Apparently you can't plot off the edges of the screen, even though the screen is less than 256 pixels high. Nor does it wrap intelligently - if you plot a pixel at Y=192 in 192-line mode, it ends up on line 0 in the next column over. If you plot a pixel at Y = -1 (ie: 255), it ends up on line 63.
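
The wrapping described above is consistent with a column-major character layout where Y is never clipped. Here is a Python sketch of that reading for 192-line mode only; the formula (character number = column * 24 + character row) is an assumption that happens to reproduce both cases mentioned, and `plot_target_192` is a made-up name.

```python
TILE_ROWS_192 = 24  # 192 lines / 8 lines per character


def plot_target_192(x, y):
    """Return (character column, screen line) where a plot at (x, y) lands.

    Assumed layout: characters are numbered down each column first, so the
    character number is (x // 8) * 24 + (y // 8), with the full 8-bit Y used.
    """
    x &= 0xFF
    y &= 0xFF
    char = (x >> 3) * TILE_ROWS_192 + (y >> 3)
    column, tile_row = divmod(char, TILE_ROWS_192)
    return column, tile_row * 8 + (y & 7)


print(plot_target_192(0, 191))  # last visible line of column 0
print(plot_target_192(0, 192))  # spills into the next column
print(plot_target_192(0, -1))   # Y = 255 also spills into the next column
```

So a clip test on Y (or a taller buffer) is needed before plotting; the hardware, on this model, just keeps counting characters.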

Now I have to choose between drawing to sprite tables all the time (and having to rearrange the data before downloading to VRAM) and putting checks before and/or in the main drawing loop to handle partially offscreen bullets. At least the X coordinate works fine, since I'm not using the full width...

ARM9 wrote:
emulators get this wrong

Well, it's a good thing I checked, then...

Quote:
Due to the lack of barrel shifter asl/lsl is basically just add

That is an excellent point. I'm not used to being able to add a register to itself. (At least, I wasn't before I marathoned a nontrivial Super FX program this past weekend - writing 65816 code felt weird after that...)

Speaking of shifting, I found that using a table of random numbers was (in my application, which required simultaneous access to a sine table in a different bank) not dramatically faster than just running xorshift16:

Code:
   move R0, R1   ; copy random number into accumulator
   add R0        ; shift left 4 bits
   add R0
   add R0
   add R0
   xor R1        ; exclusive-OR with old value
   move R1, R0   ; copy result to R1
   lsr           ; shift right 3 bits
   lsr
   lsr
   xor R1        ; exclusive-OR with old value
   move R1, R0   ; copy result to R1
   add R0        ; shift left 7 bits
   add R0
   add R0
   add R0
   add R0
   add R0
   add R0
   xor R1        ; exclusive-OR with old value
   move R1, R0   ; copy result to R1

Actually, my initial attempt at a table of rands was slower than the above algorithm, but I'm getting better at this quickly...

Fun fact: in high-speed mode, this PRNG executes in the same amount of time it takes an S-CPU in FastROM to load a 16-bit number from direct page...
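
For checking the sequence off-target, the same generator can be modelled in a few lines of Python. This is just a reference sketch; the shift amounts are parameters, with 4, 3, 7 matching the listing above, and the `period` helper is only there for experimenting (no period value is claimed here).

```python
def xorshift16(x, a=4, b=3, c=7):
    """One step of a 16-bit xorshift: left by a, right by b, left by c."""
    x ^= (x << a) & 0xFFFF  # repeated ADD R0 doubles the value, wrapping at 16 bits
    x ^= x >> b             # LSR is a logical shift
    x ^= (x << c) & 0xFFFF
    return x


def period(seed, a=4, b=3, c=7):
    """Count steps until the sequence returns to the (nonzero) seed."""
    x = xorshift16(seed, a, b, c)
    steps = 1
    while x != seed:
        x = xorshift16(x, a, b, c)
        steps += 1
    return steps


state = 1
for _ in range(4):
    state = xorshift16(state)
    print(hex(state))
```

Note that a state of zero stays zero forever, so the seed in R1 has to be nonzero.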


PostPosted: Thu Feb 23, 2017 11:43 pm
Joined: Fri Jul 04, 2014 9:31 pm
Posts: 703
Okay, yeah, I think I can do better than that:

Code:
   move R0, R1   ; copy random number into accumulator
   add R0        ; shift left 5 bits
   add R0
   add R0
   add R0
   add R0
   xor R1        ; exclusive-OR with old value
   move R1, R0   ; copy result to R1
   hib           ; shift right 9 bits
   lsr
   xor R1        ; exclusive-OR with old value
   move R1, R0   ; copy result to R1
   lob           ; shift left 8 bits
   swap
   xor R1        ; exclusive-OR with old value
   move R1, R0   ; copy result to R1

23 cycles instead of 28. Nipping at the heels of a reasonable table load algorithm (repeated bankswitching is surprisingly expensive if you don't have registers to burn, and of course memory access itself is painful at five cycles per byte)...
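
The two totals can be reproduced with a small tally in Python. This is a sketch under stated assumptions: one cycle per opcode byte when running from the instruction cache, with MOVE and XOR each taking two bytes (prefix plus opcode) and the other instructions one byte. The `SIZES` table and list names are purely illustrative.

```python
# Assumed bytes (and cache cycles) per instruction.
SIZES = {"move": 2, "xor": 2, "add": 1, "lsr": 1, "hib": 1, "lob": 1, "swap": 1}

# The earlier 4/3/7 listing in this thread.
FIRST = (["move"] + ["add"] * 4 + ["xor", "move"]
         + ["lsr"] * 3 + ["xor", "move"]
         + ["add"] * 7 + ["xor", "move"])

# The 5/9/8 listing above, using HIB and LOB/SWAP for the byte-sized shifts.
SECOND = (["move"] + ["add"] * 5 + ["xor", "move"]
          + ["hib", "lsr", "xor", "move"]
          + ["lob", "swap", "xor", "move"])


def cycles(program):
    """Sum the assumed cycle cost of a straight-line instruction list."""
    return sum(SIZES[op] for op in program)


print(cycles(FIRST), cycles(SECOND))
```

On this model the whole saving comes from replacing the seven-ADD shift and the three-LSR shift with byte operations, at the cost of one extra ADD in the first shift.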

...

If my math is correct, which it may very well not be, my bullet drawing loop seems to be maximally inefficient. I added up the cycles it would take to run if it wasn't held up by RAM buffer wait states, and the difference between that and the number of cycles it's actually taking seems to be roughly equal to the average number of cycles it should take to flush the pixel caches for all of the necessary sliver blit operations. In other words, there seems to be no parallel processing advantage showing up at all.

I'm hoping I did something stupid somewhere that's eating a ton of cycles to no purpose...


PostPosted: Sat Mar 11, 2017 12:06 pm
Joined: Fri Jul 04, 2014 9:31 pm
Posts: 703
Dodge this.

Attachment:
FXtest1.sfc [128 KiB]

640 bullets at 60 fps in 224x192. The screen flashes red if it drops a frame, but you shouldn't ever see that happen. With 656 bullets I get the occasional flash, but 640 runs perfectly in higan for over two minutes, which is longer than the period of the PRNG.


PostPosted: Sat Mar 11, 2017 2:32 pm
Joined: Wed May 19, 2010 6:12 pm
Posts: 2131
Is this using 2bpp 8x8 bullets? Damn, 640 is a lot.


PostPosted: Sun Mar 12, 2017 12:18 pm
Joined: Fri Jul 04, 2014 9:31 pm
Posts: 703
Yes, it's 2bpp. The bullets are actually 6x7, which is closer to the size (and shape, with the SNES PAR) of this type of bullet in the original game, and is moreover slightly quicker to draw.

I couldn't do 640 bullets under these constraints with a dither-based general-purpose drawing routine; there was too much overhead, and it topped out near 500. To get to 640 I had to forget about loading from the ROM buffer and just unroll the whole bullet in code. All the parts that need to be fast still fit in the cache, and my bullet list format allows multiple lists with dedicated handling loops to exist under a common bullet cap, so I think it's a legitimate approach for a number of pattern types.

I also didn't bother checking for collision with the player, but I imagine it wouldn't be all that onerous as I can simply leave the player's position in a pair of registers without adding more than a couple of cycles to the drawing code. Actually checking for collision shouldn't take as long as pulling the position from RAM would...

The actual game will have a 144-pixel-wide playfield, which gains me about 50,000 cycles, or roughly 25% extra compute time per frame, since I won't need to spend so much time waiting for DMA and clearing the framebuffer. And a lot of the bullet patterns need to be 4bpp and hence 30 fps, which should allow me to get much closer to the theoretical pixel buffer flush time, particularly with larger bullets. For really big ones, if the background isn't Mode 7 I can reserve part of OAM for the GSU and just use real sprites...

...

Just noting here that I was wrong about "maximally inefficient". I just didn't count the cycles carefully enough. There is a significant parallel processing bonus; it's just not as large as I was hoping, probably because the lines I'm blitting are so short...


PostPosted: Mon Mar 13, 2017 3:14 am
Joined: Mon Jul 01, 2013 11:25 am
Posts: 225
Wow, that is really impressive :)
I think it's a kind of bullet benchmark... I guess you are using the SFX to draw the bullets; I wonder how much you could do without the SFX :) Something else I wonder: do you have any kind of double buffering with the SFX memory, so you can work in one bank while you are DMAing the other bank to VRAM? If that is the case, you can really maximize usage of the SFX chip.


PostPosted: Mon Mar 13, 2017 2:26 pm
Joined: Wed May 19, 2010 6:12 pm
Posts: 2131
I did a demo like this a while ago and I got 256 bullets at 30fps, though I had to make the bullets really small, like 5x5.

