16 byte per line hblank copy routine

Post by nitro2k01 »

This was intended to be a reply to tepples' thread about OAM allocation but I figured it would make a good thread on its own.

For inspiration, I've written a stack copy routine which can copy 16 consecutive bytes to VRAM in one HBlank plus the trailing mode 2, if the line is free of sprites. When there are only a "few" sprites on the line, it's still able to safely copy 14 bytes. If there are "many" sprites, the timings are even stricter. In my particular case, I made it copy only 14 bytes and implemented logic to skip lines with "many" sprites on them, which was easier than varying the number of bytes being copied. I used it in my Flappy Bird clone to produce a parallax scrolling background for the scenery behind the pipes.

The setup is as follows:

Code:

	ld	A,$08			; HBlank as LCD interrupt source
	ldh	[STAT],A

	ld	A,2			; LCD interrupt
	ldh	[IE],A
Nothing too weird there. The code uses the HALT opcode to synchronize the copy, so IME is assumed to be 0 throughout (i.e. interrupt dispatch is disabled using DI).

Here's a slightly redacted version of the routine with some game specific logic removed:

Code:

; Copy 16 bytes in one HBlank (mode 0+mode 2)
STACKCOPY_LCD::
	ld	[RAMCODE-RAMCODE_S+ldspopcode16+1],SP	; Save SP at the load SP opcode at the end.
	ld	SP,HL					; Load source address from HL into SP.
	ld	H,D					; \ Load destination address into HL.
	ld	L,E					; /
.fastcopyloop
	; 0
	pop	DE					; Prefetch.
	xor	A					; \ Clear pending interrupts.
	ldh	[IFLAG],A				; /

	ld	A,E					; Prefetch.

	halt						; Wait for HBlank to happen.
	ld	[HL+],A
	ld	A,D
	ld	[HL+],A

	
rept	6
	pop	DE					; Main unrolled loop body. 
	ld	A,E
	ld	[HL+],A
	ld	A,D
	ld	[HL+],A
endr

	; 7
	pop	DE
	ld	A,E
	ld	[HL+],A
	ld	[HL],D					; Save some time on the last byte for good measure.
	inc	HL
	
	ldh	A,[skipline]

	ld	E,A

	ldh	A,[LY]
	cp	E
	jr	z,.skiplines
	
.afterskiplines
	ldh	A,[linesctr]
	dec	A
	ldh	[linesctr],A
	jr	nz,.fastcopyloop

	ld	E,L
	ld	D,H
	ld	HL,[SP+0]				; Restore source pointer for later use.

	jp	RAMCODE-RAMCODE_S+ldspopcode16
Explanation:

First, SP is saved so it can be restored later. This may need some explanation: I have copied code to RAM, and I'm using a bit of pointer arithmetic to point at the operand of an LD SP,$xxxx opcode. That way, when the copy is done, the code can jump to the restoration routine, which executes ld SP,$xxxx; ret.

Code:

; The RAM code source. Somewhere in ROM...
RAMCODE_S::
	; Maybe some other code here...
ldspopcode16::
	ld	SP,0000					; This is overwritten at the start of the code.
	ret
RAMCODE_S_End::

; The RAM code destination. Somewhere in RAM...
SECTION "RAMCODE",BSS
RAMCODE::
	ds	RAMCODE_S_End-RAMCODE_S			; Buffer for the RAM code.
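The post doesn't show how the stub gets from ROM into RAM. A minimal sketch of that one-time copy, assuming it runs at startup before STACKCOPY_LCD is first called. The label CopyRamCode is made up for illustration, and the loop assumes the stub is at least one byte long:

```asm
; Hypothetical helper (not from the original post): copy the RAM code stub
; from ROM (RAMCODE_S) to its RAM buffer (RAMCODE).
CopyRamCode::
	ld	HL,RAMCODE_S				; Source in ROM.
	ld	DE,RAMCODE				; Destination in RAM.
	ld	BC,RAMCODE_S_End-RAMCODE_S		; Byte count, assumed non-zero.
.loop
	ld	A,[HL+]
	ld	[DE],A
	inc	DE
	dec	BC					; 16 bit dec does not set flags...
	ld	A,B					; ...so test BC for zero by hand.
	or	C
	jr	nz,.loop
	ret
```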
Next, SP and HL are prepared from the input parameters (HL = source, DE = destination on entry).

The main routine consists of an unrolled loop of 8 copies of the following code, which copies two bytes:

Code:

	pop	DE					; Main unrolled loop body. 
	ld	A,E
	ld	[HL+],A
	ld	A,D
	ld	[HL+],A
However, the first and last iterations are slightly different so only 6 of the iterations look exactly like that.

The first iteration prepares as much data as possible before the accessible period starts to prevent wasting precious cycles. It clears IF and runs HALT in order to synchronize to HBlank. When the CPU wakes up, it writes the first byte.

The last iteration also has a small difference. It writes D to [HL] directly instead of going through A, which would cost one extra machine cycle (i.e. 4 clock cycles). It means HL has to be incremented afterwards, but this is OK since the increment is not timing sensitive, unlike the write.

After that, it checks whether we need to skip any lines because they have too many sprites. This logic is omitted from this example. Then it counts down linesctr and returns when all requested data has been copied. Lastly, it leaves HL and DE pointing at the source and target addresses as they are after the copy is done.
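Putting the calling convention together (HL = source, DE = destination, linesctr = number of lines), a call might look like the sketch below. TileSource, the line count and the $8800 destination are made-up values. The sketch assumes that linesctr and skipline live in HRAM (as the ldh accesses imply), that the omitted .skiplines logic only triggers when LY equals skipline, that the STAT/IE setup has already been done, and that the LCD is on:

```asm
; Hypothetical call site (not from the original post).
	di						; IME must stay 0 while SP is hijacked.
	ld	A,8					; Copy 8 lines' worth: 8*16 = 128 bytes.
	ldh	[linesctr],A
	ld	A,$FF					; LY never reaches $FF, so no line is skipped.
	ldh	[skipline],A
	ld	HL,TileSource				; Made-up label for the source data.
	ld	DE,$8800				; Made-up destination in VRAM.
	call	STACKCOPY_LCD
	; On return: HL = TileSource+128, DE = $8880.
```

Since the routine saves SP after the call has pushed its return address, the final ret in the RAM stub returns to the instruction after the call.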

The example code copies 16 bytes per HBlank, which requires that no sprites are shown on any line where the routine is executed. You could change rept 6 to a lower value if sprites are in use. In my Flappy Bird clone I use rept 5, which copies 14 bytes, as mentioned.

As per tepples' requirements, the routine could be adapted for use with 1 bpp tiles or OAM at a lower data rate.

Here's the clock calculation for the routine:

Code:

   halt   ; (including nop repeated due to double execution glitch.)
   ; = 8 cycles
   
   ld   [HL+],A  ;  8
   ld   A,D      ;  4
   ld   [HL+],A  ;  8
   ; = 20 cycles

   pop  DE       ; 12
   ld   A,E      ;  4
   ld   [HL+],A  ;  8
   ld   A,D      ;  4
   ld   [HL+],A  ;  8
   ; = 36 cycles (*6)

   ; Last
   pop  DE       ; 12
   ld   A,E      ;  4
   ld   [HL+],A  ;  8
   ld   [HL],D   ;  8
   ; = 32 cycles
This gives a total of 268 cycles for the copy itself (20 + 6*36 + 32, not counting the halt), 16 cycles less than the 284 cycles an HBlank+mode 2 would last without sprites. 4-8 of those 16 cycles are used by the nop that's needed after the halt, I'm pretty sure. So this code can copy one tile per line.

For the case of OAM, we should go by the most pessimistic value of HBlank, 201 cycles. 201-32-20-8=141 cycles left for the inner loop part. 141/36=3 (remainder 33), so this routine could run 5 iterations, and thus copy 10 bytes, or 2.5 whole entries into OAM.

For the case of 1 BPP graphics, the routine would look a bit different. Here we make a few assumptions:
  • The palette is set such that you only need to update one of the two bytes per pixel row.
  • Additionally, this byte is the one at the odd address. This means we can safely use inc L to skip the other byte, because inc L will only ever be used to increment an even value, which cannot possibly carry over into the high byte. Increments that can carry are instead handled by the ld [HL+],A instruction, which does a full 16 bit increment internally.

Code:

   halt   ; (including nop repeated due to double execution glitch.)
   ; = 8 cycles

   ld   [HL+],A  ;  8
   inc  L        ;  4
   ld   A,D      ;  4
   ld   [HL+],A  ;  8
   inc  L        ;  4
   ; = 28 cycles

   pop  DE       ; 12
   ld   A,E      ;  4
   ld   [HL+],A  ;  8
   inc  L        ;  4
   ld   A,D      ;  4
   ld   [HL+],A  ;  8
   inc  L        ;  4
   ; = 44 cycles

   ; Last
   pop  DE       ; 12
   ld   A,E      ;  4
   ld   [HL+],A  ;  8
   inc  L        ;  4
   ld   [HL],D   ;  8
   ; = 36 cycles
   inc  L        ;  4 (outside the cycle count)
Doing the cycle calculation for 284 available cycles we get: 284-28-36-8=212 left for the inner part. 212/44=4 (remainder 36 cycles). So this code could run 6 iterations and copy 12 bytes, which corresponds to 1.5 tiles, since a tile is 8 bytes in 1 bpp format.

All these figures could be nudged ever so slightly upward, maybe 1 extra byte per loop cycle, with more controlled timings. But at that point you get diminishing returns.

So in summary:
VRAM (full copy): 1 tile/line
VRAM (1bpp expand): 1.5 tiles/line
OAM: 2.5 entries/line
Last edited by nitro2k01 on Fri Apr 06, 2018 12:17 pm, edited 1 time in total.

Post by Oziphantom »

On the prefetch, why not pop into BC as well? Then on the first hit you can do:

Code:

   halt                  ; Wait for HBlank to happen.
   ld   [HL+],A
   ld   A,D
   ld   [HL+],A
   ld   A,C              ; pop BC puts the lower-address byte in C, so C goes first.
   ld   [HL+],A
   ld   A,B
   ld   [HL+],A
to get a few more bytes for fewer clocks?

Post by nitro2k01 »

Good idea. That might tip the balance up to integer counts for the OAM and 1 bpp cases, so 3 entries or 2 tiles. Although I'd have to confirm this on hardware to be sure. If nothing else, it would make the routine more likely to work despite the presence of sprites.

The extra pair of writes is 24 clocks long for the OAM case, which should fit into the remainder left over in the OAM calculation.

For the 1 bpp case, add 8 cycles for the incs, so 32 clocks, which should fit into the 36 clock remainder, although that's from a 284 clock base case, i.e. no overlapping sprites.

For the general case, the 16 remaining cycles would not be sufficient. However, the reduction may make 16 bytes viable regardless of the presence of sprites, now with 24 clocks to spare from the 284 base case.
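
Spelled out, the BC prefetch would change the top of the loop roughly as follows. This is an untested sketch in the style of the routine in the first post. It assumes BC is free to clobber and that the unrolled part is shortened by one iteration (rept 5) to keep 16 bytes per line. Note that pop BC leaves the lower-address byte in C, so C is written before B:

```asm
; Hypothetical variant of the first iteration (not confirmed on hardware).
.fastcopyloop
	pop	DE					; Prefetch bytes 0-1 (E = byte 0, D = byte 1).
	pop	BC					; Prefetch bytes 2-3 (C = byte 2, B = byte 3).
	xor	A
	ldh	[IFLAG],A				; Clear pending interrupts.
	ld	A,E

	halt						; Wait for HBlank to happen.
	ld	[HL+],A					;  8
	ld	A,D					;  4
	ld	[HL+],A					;  8
	ld	A,C					;  4
	ld	[HL+],A					;  8
	ld	A,B					;  4
	ld	[HL+],A					;  8
	; = 44 cycles for 4 bytes, versus 20 + 36 = 56 when the second pop
	;   happens inside the window.
```

With rept 5 and the unchanged last iteration, the timed part comes to 44 + 5*36 + 32 = 256 cycles, 12 fewer than the 268 of the original layout.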

Post by tepples »

Let me get this straight: If hblank is the only source enabled in STAT, and STAT is the only source enabled in IE, and IME is off, then HALT waits for the next hblank like STA WSYNC on Atari 2600, correct? If so, you'd save old STAT and IE before a transfer begins, run the transfer, and restore STAT and IE, right? Because the application I was considering when I mentioned hblank tile copying in the other topic has to watch out for lines just above LYC in order not to get tripped up by the LYC STAT IRQ that changes which VRAM bank is used for tiles $00-$7F.
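
A minimal sketch of the save/run/restore sequence described in this question, for illustration only. GuardedHBlankCopy is a made-up name, and the sketch assumes IME is 0 on entry and that STACKCOPY_LCD behaves as posted above. It does not service an LYC interrupt that would have fired during the transfer, and whether dropping the stale STAT request at the end is right depends on the game:

```asm
; Hypothetical wrapper (not from the thread).
; In: HL = source, DE = destination, linesctr already set, IME = 0.
GuardedHBlankCopy::
	ldh	A,[STAT]
	push	AF					; Save old STAT.
	ldh	A,[IE]
	push	AF					; Save old IE.

	ld	A,$08					; HBlank as the only STAT source.
	ldh	[STAT],A
	ld	A,2					; STAT as the only enabled interrupt.
	ldh	[IE],A

	call	STACKCOPY_LCD				; Restores SP itself before returning.

	pop	AF
	ldh	[IE],A					; Restore old IE.
	pop	AF
	ldh	[STAT],A				; Restore old STAT (writes to read-only bits are ignored).
	ldh	A,[IFLAG]
	res	1,A					; Drop the STAT request left over from the last HBlank.
	ldh	[IFLAG],A
	ret
```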

How many cycles or T-states does each sprite take away from hblank?

Post by lidnariq »

Varies, unfortunately, depending on the specific three LSBits of the sprite X, window X position, and background X scroll. See 33c3's The Ultimate Game Boy Talk, starting somewhere around 40 minutes in.