All times are UTC - 7 hours





Posted: Thu Apr 05, 2018 2:51 pm

Joined: Sat Aug 28, 2010 9:01 am
Posts: 218
This was intended to be a reply to tepples' thread about OAM allocation but I figured it would make a good thread on its own.

For inspiration, I've written a stack copy routine which can copy 16 consecutive bytes to VRAM in one HBlank plus the trailing mode 2, if the line is free of sprites. When there are only a "few" sprites on the line, it's still able to safely copy 14 bytes. If there are "many" sprites, the timings are even stricter. In my particular case, I made it copy only 14 bytes and implemented logic to skip lines with "many" sprites, which was easier than varying the number of bytes being copied. I used it in my Flappy Bird clone to produce a parallax scrolling background for the scenery behind the pipes.

The setup is as follows:
Code:
   ld   A,$08         ; HBlank as LCD interrupt source
   ldh   [STAT],A

   ld   A,2         ; LCD interrupt
   ldh   [IE],A
Nothing too weird there. The code uses the HALT opcode to synchronize the copy, so IME is assumed to be 0 throughout. (I.e. interrupt execution is disabled using DI.)

Here's a slightly redacted version of the routine with some game specific logic removed:
Code:
; Copy 16 bytes in one HBlank (mode 0+mode 2)
STACKCOPY_LCD::
   ld   [RAMCODE-RAMCODE_S+ldspopcode16+1],SP   ; Save SP at the load SP opcode at the end.
   ld   SP,HL               ; Load source address from HL into SP.
   ld   H,D               ; \ Load destination address into HL.
   ld   L,E               ; /
.fastcopyloop
   ; 0
   pop   DE               ; Prefetch.
   xor   A               ; \ Clear pending interrupt flags (IF).
   ldh   [IFLAG],A            ; /

   ld   A,E               ; Prefetch.

   halt                  ; Wait for HBlank to happen.
   ld   [HL+],A
   ld   A,D
   ld   [HL+],A

   
rept   6
   pop   DE               ; Main unrolled loop body.
   ld   A,E
   ld   [HL+],A
   ld   A,D
   ld   [HL+],A
endr

   ; 7
   pop   DE
   ld   A,E
   ld   [HL+],A
   ld   [HL],D               ; Save some time on the last byte for good measure.
   inc   HL
   
   ldh   A,[skipline]

   ld   E,A

   ldh   A,[LY]
   cp   E
   jr   z,.skiplines
   
.afterskiplines
   ldh   A,[linesctr]
   dec   A
   ldh   [linesctr],A
   jr   nz,.fastcopyloop

   ld   E,L
   ld   D,H
   ld   HL,[SP+0]            ; Restore source pointer for later use.

   jp   RAMCODE-RAMCODE_S+ldspopcode16
Explanation:

First, SP is saved so it can be restored later. This part may need some explanation. I have copied code to RAM, and I'm using a bit of pointer arithmetic to point to the argument part of an LD SP,$xxxx opcode. This way, when the copy is done, the code can jump to the restoration routine, which executes ld SP,$xxxx; ret.

Code:
; The RAM code source. Somewhere in ROM...
RAMCODE_S::
   ; Maybe some other code here...
ldspopcode16::
   ld   SP,0000               ; This is overwritten at the start of the code.
   ret
RAMCODE_S_End::

; The RAM code destination. Somewhere in RAM...
SECTION "RAMCODE",BSS
RAMCODE::
   ds   RAMCODE_S_End-RAMCODE_S         ; Buffer for the RAM code.


Next SP and HL are prepared from the input parameters.

The main routine consists of an unrolled loop of 8 copies of the following code, which copies two bytes:
Code:
   pop   DE               ; Main unrolled loop body.
   ld   A,E
   ld   [HL+],A
   ld   A,D
   ld   [HL+],A
However, the first and last iterations are slightly different so only 6 of the iterations look exactly like that.

The first iteration prepares as much data as possible before the accessible period starts to prevent wasting precious cycles. It clears IF and runs HALT in order to synchronize to HBlank. When the CPU wakes up, it writes the first byte.

The last iteration also has a small difference. It writes D to [HL] directly instead of going through A, which would cost one extra machine cycle (i.e. 4 clock cycles). It means HL has to be incremented afterwards, but this is ok since the increment is not timing sensitive, unlike the write.

After that, it checks whether we need to skip any lines because they have too many sprites. This logic is omitted from this example. Then it counts down linesctr and leaves the loop when all requested data has been copied. Lastly, it sets HL and DE to the source and target addresses as they are after the copy is done.

The example code copies 16 bytes per HBlank, which requires that no sprites are shown on any line where the routine is executed. You could change rept 6 to a lower value if sprites are in use. In my Flappy Bird clone I use rept 5, which copies 14 bytes, as mentioned.

As per tepples' requirements, the routine could be adapted for use with 1 bpp tiles or OAM at a lower data rate.

Here's the clock calculation for the routine:
Code:
   halt   ; (including nop repeated due to double execution glitch.)
   ; = 8 cycles
   
   ld   [HL+],A  ;  8
   ld   A,D      ;  4
   ld   [HL+],A  ;  8
   ; = 20 cycles

   pop  DE       ; 12
   ld   A,E      ;  4
   ld   [HL+],A  ;  8
   ld   A,D      ;  4
   ld   [HL+],A  ;  8
   ; = 36 cycles (*6)

   ; Last
   pop  DE       ; 12
   ld   A,E      ;  4
   ld   [HL+],A  ;  8
   ld   [HL],D   ;  8
   ; = 32 cycles

Not counting the halt, this gives a total of 268 cycles (20 + 6*36 + 32), 16 cycles fewer than the 284 cycles an HBlank plus mode 2 would last without sprites. 4-8 of those spare cycles are used by the nop that's needed after the halt, I'm pretty sure. So this code can copy one tile per line.
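
As a cross-check of the tally above, here is a small Python sketch that sums the listed clock counts. The instruction costs and the 284-clock window are taken from the post; the dictionary and helper names are made up for illustration.

```python
# Clock costs as listed in the post's own tally.
CYCLES = {
    "pop DE": 12,
    "ld A,E": 4,
    "ld A,D": 4,
    "ld [HL+],A": 8,
    "ld [HL],D": 8,
}

FIRST = ["ld [HL+],A", "ld A,D", "ld [HL+],A"]  # runs right after the halt wakes up
INNER = ["pop DE", "ld A,E", "ld [HL+],A", "ld A,D", "ld [HL+],A"]
LAST = ["pop DE", "ld A,E", "ld [HL+],A", "ld [HL],D"]


def cost(block):
    """Sum the clock cost of a list of instruction mnemonics."""
    return sum(CYCLES[op] for op in block)


WINDOW = 284       # HBlank + mode 2 with no sprites, per the post
INNER_REPEATS = 6  # the `rept 6` count

copy_cycles = cost(FIRST) + INNER_REPEATS * cost(INNER) + cost(LAST)
spare = WINDOW - copy_cycles

print(cost(FIRST), cost(INNER), cost(LAST))  # 20 36 32
print(copy_cycles, spare)                    # 268 16
```

The halt (and its trailing nop) is deliberately left out of `copy_cycles`, so `spare` is what remains to absorb it.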

For the case of OAM, we should go by the most pessimistic value of HBlank, 201 cycles. 201-32-20-8=141 cycles left for the inner loop part. 141/36=3 (remainder 33), so this routine could run 5 iterations (first, 3 inner, last) and thus copy 10 bytes, or 2.5 entries into OAM.
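
The same budget can be worked out mechanically. This is only a sketch: the 201-clock window and the 8/20/36/32 block costs come from the post, while the function name and its parameters are hypothetical.

```python
def iterations_that_fit(window, halt, first, inner, last):
    """Return (inner_repeats, leftover) for a fixed first and last iteration
    plus as many inner iterations as the window allows."""
    remaining = window - halt - first - last
    if remaining < 0:
        raise ValueError("window too short even for the first and last iteration")
    return divmod(remaining, inner)


# Figures from the post: 201-clock window, 8 for halt, 20/36/32 per block.
inner_repeats, leftover = iterations_that_fit(201, halt=8, first=20, inner=36, last=32)
total_iterations = inner_repeats + 2  # plus the first and last iterations
bytes_copied = 2 * total_iterations   # each iteration writes two bytes
oam_entries = bytes_copied / 4        # one OAM entry is 4 bytes

print(inner_repeats, leftover)    # 3 33
print(bytes_copied, oam_entries)  # 10 2.5
```

With the 20-clock first iteration from the tally, the leftover works out to 33 clocks.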

For the case of 1 BPP graphics, the routine would look a bit different. Here we make a few assumptions:
  • The palette is set such that you only need to update one of the bytes per pixel row.
  • Additionally, this byte is at the odd address. This means we can safely use inc L to increment the destination address, because inc L will only ever be used to increment an even value, which cannot possibly carry over to the high byte. The increments that can carry are instead handled by the ld [HL+],A instruction, which does a full 16 bit increment internally.
Code:
   halt   ; (including nop repeated due to double execution glitch.)
   ; = 8 cycles

   ld   [HL+],A  ;  8
   inc  L        ;  4
   ld   A,D      ;  4
   ld   [HL+],A  ;  8
   inc  L        ;  4
   ; = 28 cycles

   pop  DE       ; 12
   ld   A,E      ;  4
   ld   [HL+],A  ;  8
   inc  L        ;  4
   ld   A,D      ;  4
   ld   [HL+],A  ;  8
   inc  L        ;  4
   ; = 44 cycles

   ; Last
   pop  DE       ; 12
   ld   A,E      ;  4
   ld   [HL+],A  ;  8
   inc  L        ;  4
   ld   [HL],D   ;  8
   ; = 36 cycles
   inc  HL       ;  8 (outside the cycle count; ld [HL],D left HL on the odd address)
   inc  L        ;  4 (outside the cycle count)
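
The reasoning behind using inc L can be checked with a small model of HL. This Python sketch only models the repeated ld [HL+],A / inc L pairs, not the special-cased last write, and the function name and start address are made up for illustration.

```python
def simulate_1bpp_writes(start, count):
    """Model HL across `count` repetitions of `ld [HL+],A` followed by `inc L`.

    Returns the list of addresses written. Raises if `inc L` would ever
    have to carry out of L, which an 8-bit increment cannot do.
    """
    if start & 1 == 0:
        raise ValueError("this sketch assumes the first write is to an odd address")
    hl = start
    written = []
    for _ in range(count):
        written.append(hl)
        hl = (hl + 1) & 0xFFFF  # ld [HL+],A: full 16-bit increment
        if hl & 0xFF == 0xFF:
            raise AssertionError("inc L would wrap without carrying into H")
        hl = (hl & 0xFF00) | ((hl + 1) & 0xFF)  # inc L: only the low byte changes
    return written


# Start just below a page boundary so the 16-bit carry is exercised.
addrs = simulate_1bpp_writes(0x80FD, 4)
print([hex(a) for a in addrs])  # ['0x80fd', '0x80ff', '0x8101', '0x8103']
```

Because every write lands on an odd address, the value that inc L sees is always even, so the wrap check never fires, even when the writes cross from $80FF to $8101.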

Doing the cycle calculation for 284 available cycles we get: 284-28-36-8=212 left for inner part. 212/44=4 (remainder 36 cycles). So this code could run 6 iterations, and copy 12 bytes, which corresponds to 1.5 tiles since tiles are 8 bytes big in 1 bpp format.
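
Another way to look at the same numbers is to add up the whole line and see where it stops fitting. A minimal sketch, using the 284-clock window and the 8/28/44/36 costs from the post; the helper names are hypothetical.

```python
WINDOW = 284                              # HBlank + mode 2 without sprites, per the post
HALT, FIRST, INNER, LAST = 8, 28, 44, 36  # clock counts from the 1 bpp listing


def total_clocks(inner_repeats):
    """Clocks used by halt + first + N inner + last iterations."""
    return HALT + FIRST + inner_repeats * INNER + LAST


def bytes_copied(inner_repeats):
    """Two bytes per iteration, counting the first and last ones."""
    return 2 * (inner_repeats + 2)


for n in (4, 5):
    used = total_clocks(n)
    print(n, used, used <= WINDOW, bytes_copied(n))
# 4 248 True 12
# 5 292 False 14
```

Four inner iterations fit with 36 clocks to spare, and a fifth would overshoot the window by 8 clocks, which is why the count stops at 12 bytes.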

All these figures could be nudged ever so slightly upward, maybe 1 extra byte per loop cycle, with more controlled timings. But at that point you get diminishing returns.

So in summary:
VRAM (full copy): 1 tile/line
VRAM (1bpp expand): 1.5 tiles/line
OAM: 2.5 entries/line

_________________
Gameboy Genius (Blog) - Gameboy development forum (+wiki and file area)


Last edited by nitro2k01 on Fri Apr 06, 2018 12:17 pm, edited 1 time in total.

Posted: Fri Apr 06, 2018 12:51 am

Joined: Tue Feb 07, 2017 2:03 am
Posts: 352
On the prefetch, why not pop into BC as well? Then on the first hit you can do
Code:
   halt                  ; Wait for HBlank to happen.
   ld   [HL+],A
   ld   A,D
   ld   [HL+],A
   ld   A,B
   ld   [HL+],A
   ld   A,C
   ld   [HL+],A

to get a few more bytes for fewer clocks?


Posted: Fri Apr 06, 2018 2:36 am

Joined: Sat Aug 28, 2010 9:01 am
Posts: 218
Good idea. That might tip the balance up to integer counts for the 1 bpp and OAM cases, so 2 tiles or 3 entries. Although I'd have to confirm this on hardware to be sure. If nothing else, it would make the routine more likely to work despite the presence of sprites.

The extra write is 24 clocks long for the OAM case, which should fit into the 33 clock remainder.

For the 1 bpp case, add 8 cycles for the incs, so 32 clocks, which should fit into the 36 clock remainder, although that's from a 284 clock base case, i.e. no overlapping sprites.
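
A quick sanity check of those two fits, as a Python sketch. The instruction costs and window sizes are the ones used earlier in the thread; the variable names are made up, and the pop BC itself is assumed to happen before the halt, so it is not counted.

```python
# Clock costs as used elsewhere in the thread.
LD_A_REG = 4  # ld A,B or ld A,C
LD_HLI_A = 8  # ld [HL+],A
INC_L = 4     # inc L (1 bpp variant only)

# Two extra bytes written from B and C after the halt.
extra_full = 2 * (LD_A_REG + LD_HLI_A)
extra_1bpp = extra_full + 2 * INC_L
print(extra_full, extra_1bpp)  # 24 32

# Leftovers recomputed from the earlier budgets (window - halt - first - last).
oam_leftover = (201 - 8 - 20 - 32) % 36
bpp1_leftover = (284 - 8 - 28 - 36) % 44
print(oam_leftover, bpp1_leftover)  # 33 36

print(extra_full <= oam_leftover, extra_1bpp <= bpp1_leftover)  # True True
```

Both fits are tight (9 and 4 clocks of slack), which matches the caveat that this would need confirming on hardware.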

For the general case, the 16 remaining cycles would not be sufficient. However, the reduction may make 16 bytes viable regardless of the presence of sprites, now with 24 clocks to spare from the 284 base case.



Posted: Fri Apr 06, 2018 10:10 am

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19919
Location: NE Indiana, USA (NTSC)
Let me get this straight: If hblank is the only source enabled in STAT, and STAT is the only source enabled in IE, and IME is off, then HALT waits for the next hblank like STA WSYNC on Atari 2600, correct? If so, you'd save old STAT and IE before a transfer begins, run the transfer, and restore STAT and IE, right? Because the application I was considering when I mentioned hblank tile copying in the other topic has to watch out for lines just above LYC in order not to get tripped up by the LYC STAT IRQ that changes which VRAM bank is used for tiles $00-$7F.

How many cycles or T-states does each sprite take away from hblank?


Posted: Fri Apr 06, 2018 10:17 am

Joined: Sun Apr 13, 2008 11:12 am
Posts: 7001
Location: Seattle
Varies, unfortunately, depending on the specific three LSBits of the sprite X, window X position, and background X scroll. See 33c3's The Ultimate Game Boy Talk, starting somewhere around 40 minutes in.

