It is currently Thu Nov 23, 2017 1:38 pm

All times are UTC - 7 hours





Post new topic Reply to topic  [ 22 posts ]  Go to page 1, 2  Next
Author Message
PostPosted: Sun Aug 14, 2016 1:54 pm 
Online

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19252
Location: NE Indiana, USA (NTSC)
I've never programmed for any 8080-family CPU before, unless you count very brief inline assembly for 8086 back in 1998 or so. But I'm thinking of learning to program the Game Boy. I gather that it'll be a huge change from 6502 assembly or C. I found ASMSchool and was startled by the dearth of addressing modes.

Say you have x attributes for each of y actors. There are two ways to organize this in memory:

  1. Some CPUs prefer an array of structures, where each actor's properties are contiguous in memory. These include machines with an addressing mode that uses a small offset to a full pointer held in a register, such as the Zilog Z80 (IX-relative modes), Motorola 68000, and ARMv4.
  2. Other CPUs prefer a structure of arrays, where (for example) all X positions, all Y positions, all frame numbers, etc. are contiguous. These include machines with an addressing mode that uses a constant full pointer plus an offset smaller than a pointer, such as the MOS/WDC 6502 family.

In another topic, adam_smasher mentioned that the LR35902 (aka GBZ80) is the oddball of the bunch. The closest thing it has to an indexed addressing mode is (HL), where the array is 256-byte aligned, H points to the table's start, and L points to an entry in the table.

So if a program uses the first few bytes of a 256-byte page for a lookup table, what does it then do with the rest of the page? If it's ROM, does it stick some subroutine or some sequentially accessed data there? Or do Game Boy games just eat the cost of calculating addresses in software, making the work-per-clock ratio between the 6502 and LR35902 even lower than I had guessed?


Top
 Profile  
 
PostPosted: Sun Aug 14, 2016 4:24 pm 
Offline
User avatar

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10118
Location: Rio de Janeiro - Brazil
I too always wondered what the options for addressing data on the GB were. A full Z80 looks very versatile, but the Game Boy CPU doesn't look particularly suited to the manipulation of large blocks of memory.


Top
 Profile  
 
PostPosted: Sun Aug 14, 2016 6:53 pm 
Offline

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 431
I'm not a very proficient Z80 programmer but I know that the Z80 indexed instructions are incredibly slow and should be avoided anywhere performance is important (i.e. in a game). The performant way to access memory on a Z80 is the same as the only way to do it on an 8080 or GB: calculate the complete effective address (base + array offset + structure member offset) and store it in a register pair (HL, DE or BC). To walk an array, use the 16-bit INC instruction, the GB's postincrement instructions (which are only available for HL), or ADD the stride to the register pair (the fastest way to do this is to store the stride in another register pair and use a 16-bit ADD).

Basically, the 8080 family is a bit more like a RISC CPU in that calculating effective addresses is a distinct operation rather than something that gets folded into an "addressing mode". You could even say that the fundamental insight of RISC was "hey, instead of having all these addressing modes, let's just go back to the 8080 but add more and wider registers, because as long as you don't register spill doing address calculations explicitly ends up just as fast as doing them implicitly".

Register pressure makes arrays-of-structures painful on the 8080. You do okay if you're only processing one array at a time, but if you've got two (even if only one is an array-of-structures) you need a register pair for one pointer, a register pair for the other pointer, a register pair to hold the strides, and a loop coun... uh oh, you're already out of registers.

The addition of the postincrement instructions on the GB only makes structures-of-arrays even more of a winner.


Top
 Profile  
 
PostPosted: Sun Aug 14, 2016 7:37 pm 
Online

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19252
Location: NE Indiana, USA (NTSC)
AWJ wrote:
Basically, the 8080 family is a bit more like a RISC CPU in that calculating effective addresses is a distinct operation rather than something that gets folded into an "addressing mode". You could even say that the fundamental insight of RISC was "hey, instead of having all these addressing modes, let's just go back to the 8080 but add more and wider registers, because as long as you don't register spill doing address calculations explicitly ends up just as fast as doing them implicitly".

Yet practical RISC architectures ended up having a 68000-style pointer + short displacement as an available addressing mode for practical struct field access, rather than requiring an explicit addition every time to seek to the element of an array holding the value for a particular actor (in structure-of-arrays) or to the field of an actor's struct (in array-of-structures). In architectures where the load/store stage of the pipeline sits after the ALU stage, such as classical MIPS, address generation like this is essentially free.

In MIPS, the most "by-the-book" RISC design, lw $rt, 4($rs) reads from address rs + 4. ARM is even more flexible:
  • ldr r0, [r1, #4] reads from address r1 + 4
  • ldr r0, [r1, #4]! (pre-increment) adds 4 to r1 and reads from the new address
  • ldr r0, [r1], #4 (post-increment) reads from address r1, then adds 4 to r1
  • ldr r0, [r1, r2, lsl #2] (register indexed scaled) reads from address r1 + (r2 << 2)

Quote:
Register pressure makes arrays-of-structures painful on the 8080.

And register pressure is one of the first things that the RISCs discarded.

Quote:
You do okay if you're only processing one array at a time, but if you've got two (even if only one is an array-of-structures) you need a register pair for one pointer, a register pair for the other pointer, a register pair to hold the strides, and a loop coun... uh oh, you're already out of registers.

The addition of the postincrement instructions on the GB only makes structures-of-arrays even more of a winner.

Say I store 16 bytes' worth of properties for each actor (player or active enemy) in a side-scrolling game across 16 parallel arrays. Position and velocity are already 10 bytes: X (screen, pixel, subpixel), X velocity (pixel/frame, subpixel/frame), Y (pixel, subpixel), Y velocity (pixel/frame, subpixel/frame), and facing direction. On top of that are actor class, state/animation frame, state transition time, and a couple more bytes related to loading the frame's tiles into VRAM. Would I then need to add the actor ID (in one register) to the base of each array to form HL for every single access?


Top
 Profile  
 
PostPosted: Mon Aug 15, 2016 1:48 am 
Offline

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 431
All right, I guess "structures-of-arrays are faster" was an oversimplification. The case I was thinking of was where you're only accessing one member of each structure in your loop. What matters is that data that you access sequentially is sequential in memory, so that you can use postincrement as much as possible and minimize the instructions spent recalculating addresses. So you'd definitely put the bytes of an array of 16/24/32-bit integers together, not interleave them like you might on a 6502. But if you've got one loop that does collision detection and another loop that does animation and metasprite assembly, and the data sets used by those loops are disjoint, it would make sense to split your "actors" into two structs, one with the collision detection data and the other with the animation data.


Top
 Profile  
 
PostPosted: Mon Aug 15, 2016 8:07 am 
Offline

Joined: Sun Mar 27, 2011 10:49 am
Posts: 206
Location: NYC
The absence of indexed addressing modes seems rough, but you can get a ton of mileage from arranging your data so that the CPU can get at it easily; the 4MHz GBZ80 can probably go pretty close to toe-to-toe with a 1.79MHz 6502 for most 2D game logic...as long as the GBZ80 programmer took care to properly layout their data.

Even if you can't keep everything page-aligned, if you can be sure your table never crosses page boundaries, you can strip out the carry logic from the most general sample code I wrote; that'll save you 12 cycles.

As AWJ suggests, if you're going to be operating on an entire structure at once, you can use the array-of-structures format, ideally writing your code to walk through the members you want to access.

You can also line up related data across pages, so that once you've indexed your first table, you can get related data just by loading a new value into H (4 cycles, if you can just use INC H).

Or, if possible, you could just try to rethink or restructure your engine to not operate on the entire structure at once.


Top
 Profile  
 
PostPosted: Mon Aug 15, 2016 9:02 am 
Online

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19252
Location: NE Indiana, USA (NTSC)
But if I rely on the fact that one field follows another, and I or someone else on my team reorders the fields so that a different subroutine can walk through them with INC, such a reorganization may end up breaking the assumption. As far as I can tell, this means that for every INC or DEC in a subroutine that accesses more than one such field, I'll have to put an assembly-time assertion that the header file didn't change the order of the fields since I wrote that particular subroutine. And when such an assertion breaks, that's a lot of code I'll have to touch and re-test.

I'm beginning to understand why some Game Boy games ran at 30 fps or slower, even apart from the slow green LCD prior to the Game Boy Pocket: the mindset for an 8080 differs greatly from that for the striped-arrays for 6502 (C64, Atari, NES, Super NES) or the pointer+offset of a 68000 (Genesis, Amiga) or full Z80 (Game Gear), and it might not have been trivial to find programmers experienced with its idiosyncrasies.

I don't see how not to operate on the entire structure at once. An actor has to move horizontally and vertically, eject itself from any obstacles, and then update its animation frame. Or are you suggesting separate loops: one to choose all actors' movement direction based on its animation frame, a second loop to move all actors horizontally, a third loop to move all actors vertically, a fourth loop to eject all actors from obstacles, and a fifth loop to update all actors' animation frames?


Top
 Profile  
 
PostPosted: Mon Aug 15, 2016 11:03 am 
Offline

Joined: Sun Mar 27, 2011 10:49 am
Posts: 206
Location: NYC
Quote:
But if I rely on the fact that one field follows another, and I or someone else on my team reorders the fields so that a different subroutine can walk through them with INC, such a reorganization may end up breaking the assumption. As far as I can tell, this means that for every INC or DEC in a subroutine that accesses more than one such field, I'll have to put an assembly-time assertion that the header file didn't change the order of the fields since I wrote that particular subroutine. And when such an assertion breaks, that's a lot of code I'll have to touch and re-test.

Yes, unfortunately, this is the case. Or, more realistically, one doesn't bother with any sort of compile-time assertions and just keeps everything in their head and prays :)
Quote:
I'm beginning to understand why some Game Boy games ran at 30 fps or slower, even apart from the slow green LCD prior to the Game Boy Pocket: the mindset for an 8080 differs greatly from that for the striped-arrays for 6502 (C64, Atari, NES, Super NES) or the pointer+offset of a 68000 (Genesis, Amiga) or full Z80 (Game Gear), and it might not have been trivial to find programmers experienced with its idiosyncrasies.

Possibly; I've never disassembled a troublesome GB game to say. Are there any particularly slow ones you can think of? I wouldn't mind at least trying to take a peek and evaluate.
Quote:
I don't see how not to operate on the entire structure at once. An actor has to move horizontally and vertically, eject itself from any obstacles, and then update its animation frame. Or are you suggesting separate loops: one to choose all actors' movement direction based on its animation frame, a second loop to move all actors horizontally, a third loop to move all actors vertically, a fourth loop to eject all actors from obstacles, and a fifth loop to update all actors' animation frames?

Something like that, sure. My game engine works much that way.


Top
 Profile  
 
PostPosted: Mon Aug 15, 2016 11:43 am 
Online

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19252
Location: NE Indiana, USA (NTSC)
adam_smasher wrote:
Quote:
I'm beginning to understand why some Game Boy games ran at 30 fps or slower, even apart from the slow green LCD prior to the Game Boy Pocket: the mindset for an 8080 differs

Possibly; I've never disassembled a troublesome GB game to say. Are there any particularly slow ones you can think of? I wouldn't mind at least trying to take a peek and evaluate.

When I played Balloon Kid and Super Mario Land 2, their slow frame rate was something I noticed.

Quote:
Quote:
Or are you suggesting separate loops: one to choose all actors' movement direction based on its animation frame, a second loop to move all actors horizontally, a third loop to move all actors vertically, a fourth loop to eject all actors from obstacles, and a fifth loop to update all actors' animation frames?

Something like that, sure. My game engine works much that way.

Even that can get troublesome, as ejection tends to touch the entire position, especially on slopes. And then you need five methods for each actor type instead of just move().


Top
 Profile  
 
PostPosted: Tue Aug 16, 2016 11:42 pm 
Offline

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 431
A lot of commercial GB games contain really terrible and inefficient code. I've seen a lot of GB code that's effectively literally-translated 6502 (the first two Final Fantasy Legend games are bad for this--the third is significantly better) I've seen games that put the stack in high RAM--I guess they were told "accessing high RAM is faster" and thoroughly misunderstood what was meant. That's the problem with using a custom CPU architecture. I wonder if Konami had trouble getting people to write efficient code for the custom 6809 derivative most of their late-80s arcade games ran on...


Top
 Profile  
 
PostPosted: Fri Aug 19, 2016 11:09 am 
Offline

Joined: Wed May 19, 2010 6:12 pm
Posts: 2366
I'm guessing most people probably would just copy the slots to a static region of memory.


Top
 Profile  
 
PostPosted: Fri Aug 19, 2016 11:21 am 
Online

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19252
Location: NE Indiana, USA (NTSC)
AWJ wrote:
I wonder if Konami had trouble getting people to write efficient code for the custom 6809 derivative most of their late-80s arcade games ran on...

Probably not. The Konami-1 variant of the 6809 reportedly just XORs each opcode (not its operand) with a formula based on A3 and A1. That could be compensated at assembly time. Anyone who had coded for a CoCo, Dragon, FM-7, or Williams arcade could probably jump right in.

But then arcade code didn't have to be quite as efficient as console code anyway, as the manufacturer could just throw more hardware at it.


Top
 Profile  
 
PostPosted: Fri Aug 19, 2016 1:26 pm 
Offline

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 431
tepples wrote:
AWJ wrote:
I wonder if Konami had trouble getting people to write efficient code for the custom 6809 derivative most of their late-80s arcade games ran on...

Probably not. The Konami-1 variant of the 6809 reportedly just XORs each opcode (not its operand) with a formula based on A3 and A1. That could be compensated at assembly time. Anyone who had coded for a CoCo, Dragon, FM-7, or Williams arcade could probably jump right in.

But then arcade code didn't have to be quite as efficient as console code anyway, as the manufacturer could just throw more hardware at it.


That's not the CPU I'm talking about. The Konami-1 was just a 6809 with scrambled opcodes, but the Konami-2 had 8 general-purpose output pins that were usually used for ROM banking (indeed they're listed as A16-A23 on schematics) and ISA changes that go beyond simple scrambling. The Konami-1 was used in early-1980s games like Gyruss; the Konami-2 was used until the early 1990s (The Simpsons was one of the last games that ran on it)


Top
 Profile  
 
PostPosted: Tue Nov 29, 2016 11:40 pm 
Online

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19252
Location: NE Indiana, USA (NTSC)
In this topic, Axelay recommends 6502-style parallel tables with start addresses aligned to a power of two, which lets instructions like set 5,L and res 5,L point HL at different tables in a page.


Top
 Profile  
 
PostPosted: Wed Nov 30, 2016 10:53 pm 
Offline

Joined: Mon Sep 27, 2004 2:57 pm
Posts: 1248
It's been a while since I last touched any z80, but every time an indexed array lookup was needed, the pointer always had to be calculated by hand.

The gameboy has a treasure trove of RAM compared to the NES, so it shouldn't be a problem to align your actor memory so that an individual actor's memory doesn't cross a page boundary. Once you do, all you have to do is store your actor's base pointer in a register pair, and the only calculations you'd need to do are on the low byte. If your actors' memory sizes are a power of two, it's a simple matter of ORing the desired index to the low byte, and then later using an AND mask to recover the base pointer so you can OR another index.

That's all for array-of-structures. Structure-of-arrays is the exact same thing, except instead of attribute offsets being 0, 1, 2, 3, etc, they're now $00, $10, $20, $30, etc. You also trade limitations; instead of actor size being a power of two and actor count being free, you now have actor count being a power of two and actor size being free. Use whatever's more efficient for the memory usage, your code will be exactly the same regardless.

If you need more than 256 bytes worth of actors, you should use array-of-structures, because you'll likely be calculating pointers for attributes (8-bit operation if no actors cross a page boundary) more often than you'll be calculating base pointers for actors (16-bit operation always).

As an interesting aside, the ZX Spectrum optimized its framebuffer specifically for Z80 HL addressing; Increasing L moves to the next 8x8 character on the screen, increasing H moves to the next scanline within that character. It might not be the same as DMG-Z80, but it might be worth checking out what ZX Spectrum programmers have to say. That's how I came across this helpful bit-twiddling reference after all.


Top
 Profile  
 
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 22 posts ]  Go to page 1, 2  Next

All times are UTC - 7 hours


Who is online

Users browsing this forum: No registered users and 4 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
Powered by phpBB® Forum Software © phpBB Group