SM83 vs. 6502
The Nintendo Entertainment System and the Super Game Boy accessory have the same 945/44 = 21.47 MHz master clock. The NTSC NES divides the master clock by 12 to make the 1.79 MHz 6502 clock. SGB divides the clock by 5 (4.30 MHz) before passing it to the Game Boy's LR35902 system on chip, whose multicycle implementation in turn divides it by 4 (1.07 MHz) for its Sharp SM83 CPU core. Thus the effective clock rate of the Game Boy is 3/5 (60%) of the NES clock rate, which opens the debate about whether SM83 makes it up in work per clock.
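The clock arithmetic above can be checked with exact fractions. This is only a sketch of the division chain described in the paragraph; the variable names are made up for illustration.

```python
from fractions import Fraction

# Master clock shared by the NTSC NES and the Super Game Boy, in MHz.
MASTER_MHZ = Fraction(945, 44)

nes_cpu_mhz = MASTER_MHZ / 12      # 6502 clock on the NTSC NES
sgb_input_mhz = MASTER_MHZ / 5     # clock passed to the LR35902
sgb_cpu_mhz = sgb_input_mhz / 4    # SM83 machine-cycle rate

# Effective Game Boy (on SGB) rate relative to the NES CPU.
ratio = sgb_cpu_mhz / nes_cpu_mhz

print(f"master  {float(MASTER_MHZ):.3f} MHz")
print(f"NES CPU {float(nes_cpu_mhz):.3f} MHz")
print(f"SGB in  {float(sgb_input_mhz):.3f} MHz")
print(f"SM83    {float(sgb_cpu_mhz):.3f} MHz")
print(f"ratio   {ratio}")
```

The ratio comes out as exactly 12/20 = 3/5 because both consoles divide the same master clock.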
Stack instructions: A push and pop on SM83 take 7 cycles total, same as 6502, but they handle 2 bytes at a time. SM83's RET is faster than 6502's RTS by 2 cycles, reducing the penalty for subroutine calls. The indirect call instruction JP (HL) takes 1 cycle, which is faster than the load high PHA load low PHA RTS on 6502.
ALU instructions: SM83's lack of a penalty cycle for "implied"-mode instructions helps. The Intel-style carry (as opposed to MOS/ARM-style carry) allows a fast idiom for sign-extending A: RLCA SBC A copies bit 7 to all bits of A. (It has to be SBC, not SUB: subtracting A from itself without the carry always leaves zero.) There's an 8-bit rotate in addition to the 6502's 9-bit one, an arithmetic right shift that copies old bit 7 to new bit 7 (no need for CMP #$80 ROR A), and a nibble swap instruction. But there's no sign flag after ALU operations, and testing bit 7 of an ALU result needs another cycle or two for a compare or bit-test instruction.
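To sanity-check the sign-extension idiom, here is a small Python model of the instructions involved. The function names are illustrative, not from any toolchain. It assumes the second instruction is SBC A,A (subtract with carry), since that is what turns the carry left by RLCA into $00 or $FF. It also models the 6502's CMP #$80 / ROR A sequence for an arithmetic right shift.

```python
def rlca(a):
    """SM83 RLCA: rotate A left by one bit; carry out is the old bit 7."""
    carry = (a >> 7) & 1
    return ((a << 1) & 0xFF) | carry, carry


def sbc_a_a(carry):
    """SM83 SBC A,A: A - A - carry, so $00 if carry is clear, $FF if set."""
    return (-carry) & 0xFF


def sign_fill(a):
    """RLCA then SBC A,A: copy bit 7 of A into all eight bits."""
    _, carry = rlca(a)
    return sbc_a_a(carry)


def asr_6502(a):
    """6502 CMP #$80 then ROR A: carry = (A >= $80), rotated into bit 7."""
    carry = 1 if a >= 0x80 else 0
    return (carry << 7) | (a >> 1)


for value in (0x00, 0x7F, 0x80, 0xFF):
    print(f"${value:02X} -> sign fill ${sign_fill(value):02X}, "
          f"asr ${asr_6502(value):02X}")
```

Both models are exhaustively checkable over all 256 input values, which is the main point of writing them this way.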
Memory instructions: With its 2-cycle 16-bit increments and autoincrement for the pointer register HL, SM83 is arguably faster than 6502 for sequential access to arrays, especially those larger than 256 bytes. But for random access, as I've mentioned elsewhere, SM83 (like Intel's 8080) lacks the 6502's rich indexed addressing modes. Thus random access to a field of a structure, such as the fields of an actor in a game, requires radical reorganization of structures in memory and more preparation in advance based on the order in which the fields will be accessed. Later I'll post the workarounds that I discovered.
RAM: Game Boy has more. This tilts some space-time tradeoffs; I'd be interested to read how this plays out in practice.
C language: Making a game that runs with minor changes between PC and either NES or Game Boy often involves writing the game logic in C and only the I/O (input, audio, graphics) and systems parts of the engine in assembly. ISSOtm has written thoughts about C on Game Boy. Instructions that involve HL and SP allow for a larger hardware stack, reducing some of the soft-stack penalty that cc65 has to pay.
VRAM bandwidth: GB and NTSC NES have almost the same count of cycles per scanline (114 vs. 113.667). NTSC NES has 20.5 lines of vertical blanking (assuming half of prerender is "borrowed") while GB has 10. GB, however, doesn't support extending blanking to add more VRAM update time. An unrolled copy to VRAM without the popslide technique is 6 cycles/byte on GB compared to 8 on NES. But because the GB PPU is faster relative to the CPU (4 dots on GB, 3 on NTSC NES) and narrower (160 dots vs. 256), GB also has the majority of its scanline (at least 64 cycles) open for VRAM reading and writing during horizontal blanking. This makes it practical for a loop to copy 8 bytes to VRAM after each of the 144 scanlines even without the GBC's CHR HDMA feature, or 1152 bytes per screen, so long as you take care about tearing. But nothing beats the bandwidth of having all your tiles in CHR ROM at a slight cost in flexibility, though GBC has banked CHR RAM.
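A rough back-of-the-envelope version of these bandwidth figures, as a Python sketch. It takes the numbers in the paragraph at face value and ignores OAM DMA, loop overhead, and register setup, so treat the results as upper bounds rather than measurements. The 341/3 figure is the NTSC NES scanline length (341 PPU dots at 3 dots per CPU cycle).

```python
# Cycles per scanline, from the figures quoted above.
GB_CYCLES_PER_LINE = 114
NES_CYCLES_PER_LINE = 341 / 3           # 113.667 on NTSC

# Vertical blanking budget in CPU cycles.
gb_vblank_cycles = 10 * GB_CYCLES_PER_LINE
nes_vblank_cycles = 20.5 * NES_CYCLES_PER_LINE

# Unrolled copy cost per byte, without popslide-style tricks.
GB_CYCLES_PER_BYTE = 6
NES_CYCLES_PER_BYTE = 8

gb_vblank_bytes = int(gb_vblank_cycles // GB_CYCLES_PER_BYTE)
nes_vblank_bytes = int(nes_vblank_cycles // NES_CYCLES_PER_BYTE)

# Extra GB budget: 8 bytes copied after each of the 144 visible scanlines.
gb_hblank_bytes = 8 * 144

print(f"GB  vblank: {gb_vblank_cycles} cycles, about {gb_vblank_bytes} bytes")
print(f"NES vblank: {nes_vblank_cycles:.0f} cycles, about {nes_vblank_bytes} bytes")
print(f"GB  hblank: {gb_hblank_bytes} bytes per frame")
```

Under these assumptions the NES moves more bytes in vblank alone, while the Game Boy's per-scanline copies add far more than its vblank budget.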
OAM DMA on Game Boy takes 160 cycles, running at 1 cycle per byte like Super NES DMA, as opposed to 514 cycles (2 per byte) on the NES. Because it doesn't pause the CPU during DMA execution, only HRAM (its counterpart to NES zero page or GBA IWRAM) is accessible, and DMA is normally done by a 10-byte subroutine in HRAM. OAM DMA is also possible mid-frame provided sprite rendering is turned off, allowing it to be moved out of vblank and into the status bar.
Scrolling: Like the Super NES, the Game Boy lacks the oddball 30-row nametable height. This simplifies some designs for nametable update packets. Monochrome doesn't have attributes at all; GBC is like MMC5 EXRAM or Super NES nametables in that it has a second byte plane of attributes whose addresses parallel those of the nametable. But without special support for +32 increment, a nametable column copy loop is slightly slower than 6 cycles/byte.
Frame rate: Both the NES and Game Boy run at close to 60 frames per second. The original green screen Game Boy (DMG) takes several refreshes to change a pixel from light to dark or vice versa. The Game Boy Pocket and Game Boy Color take fewer, but still not fast enough to make 30 Hz flicker as noticeable as it would be on SGB. This lets developers get away with engines that run on twos, on threes, or even on fours (Balloon Kid), for 30, 20, or 15 fps.
I'm interested in fleshing out the arguments both ways to make them more quantitative as opposed to hand-wavey.
The Z80 has higher code density making unrolling things more practical in tight memory situations, but the 6502 lets you throw down tables for little penalty.
Another consideration is resolution: a Game Boy is 160x144 vs. an NES at 256x240, so 23040 vs. 61440 pixels, which means the GB has 37.5% the number of pixels to update, so in bang per pixel I would think a GB slaughters an NES. There are no palettes, which means the GB screen takes even fewer bytes, and faster OAM DMA means you can probably get more done.
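The pixel-count comparison is quick to verify. This sketch also expresses it in 8x8 tiles, which is the unit a nametable update actually works in; the variable names are illustrative.

```python
GB_WIDTH, GB_HEIGHT = 160, 144
NES_WIDTH, NES_HEIGHT = 256, 240

gb_pixels = GB_WIDTH * GB_HEIGHT
nes_pixels = NES_WIDTH * NES_HEIGHT
pixel_ratio = gb_pixels / nes_pixels

# The same comparison in visible 8x8 tiles.
gb_tiles = (GB_WIDTH // 8) * (GB_HEIGHT // 8)
nes_tiles = (NES_WIDTH // 8) * (NES_HEIGHT // 8)

print(gb_pixels, nes_pixels, f"{pixel_ratio:.1%}")
print(gb_tiles, nes_tiles)
```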
More RAM always helps, although the Z80's lack of setting flags on load instructions makes storing 128/0 for a NegativeFlag variable mostly moot, as you need to do an ld followed by an and/or to get the flags set.
C on a Z80 makes a lot more sense than on a 6502; still not a great idea, but a lot more sane.
No palettes on monochrome mean no color cycling on waterfalls and the like, meaning tile rotation through CHR RAM updates needs to make up for it.
GB OAM DMA takes about 1.5 vblank scanlines out of 10, leaving 8.5. NES OAM DMA takes just shy of 5 vblank scanlines out of 20, leaving 15, still having more scanlines to do things. But GB has another edge in that OAM access is less buggy, meaning something with only a few sprites might get away with writing them directly to OAM rather than making a display list and DMAing it in. But filling that display list is another issue, and the speed of that can depend on how the address of the source data is calculated.
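Putting numbers on the scanline cost of OAM DMA, as a Python sketch using the cycle counts quoted earlier in the thread (160 cycles on GB, 514 on NES). It uses the rounded 20-line NES vblank from this post; the variable names are illustrative.

```python
GB_CYCLES_PER_LINE = 114
NES_CYCLES_PER_LINE = 341 / 3      # 113.667 on NTSC

GB_OAM_DMA_CYCLES = 160
NES_OAM_DMA_CYCLES = 514

GB_VBLANK_LINES = 10
NES_VBLANK_LINES = 20

gb_dma_lines = GB_OAM_DMA_CYCLES / GB_CYCLES_PER_LINE
nes_dma_lines = NES_OAM_DMA_CYCLES / NES_CYCLES_PER_LINE

gb_lines_left = GB_VBLANK_LINES - gb_dma_lines
nes_lines_left = NES_VBLANK_LINES - nes_dma_lines

print(f"GB:  DMA {gb_dma_lines:.2f} lines, {gb_lines_left:.2f} left")
print(f"NES: DMA {nes_dma_lines:.2f} lines, {nes_lines_left:.2f} left")
```

The exact values land a little under the rounded figures in the post, which does not change the conclusion that the NES still has more vblank lines left over.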
 "Less buggy" in that only 16-bit increment instructions write garbage to GB OAM, not just the mere act of writing beyond the first 7 bytes like on NES.
Another feature is the GB has memory-mapped VRAM, none of this port rubbish, pure direct access. The Z80 has one glorious trick up its sleeve, the one thing it will smash a 6502 in, and that is rapidly moving blocks of data around. To the point that I'm currently investigating how to do it effectively, because even a 2 MHz Z80 is about 25% faster than a 1 MHz 6502. Granted the GB80 doesn't have the other register set, which cuts the data you can move at once in half, but I would think it still beats the NES CPU + port. It's not as fast as pure immediate speed code, but it's a lot more generic and eats a lot less space, which makes it more practical. You could fit the routine into the 120? bytes of HRAM, along with the initial data you want. This would allow you to get the GB80 to set up the registers and the dest SP while the OAM DMA happens, then as soon as it drops, hit with the first PUSH set, then keep going until you hit the end of Mode 2. If only there was the 2nd set still, as you could keep it loaded for an H-BLANK update, but alas.
; 8 cycles, no extra register
  ld a,$E0
  and a,l
  xor a,member_offset
  ld l,a
  ld a,[hl]  ; or B, C, D, or E trashing A

; 8 cycles, extra HRAM variable to hold L of struct start
; ergo no real benefit
  ldh a,[Lcur_actor_base]  ; 3
  xor a,member_offset      ; 2
  ld l,a
  ld a,[hl]  ; or B, C, D, or E trashing A

; 6 cycles, extra register B to hold L of struct start
  ld a,b
  xor a,member_offset
  ld l,a
  ld a,[hl]  ; or C, D, or E trashing A

; The next three require either the programmer or macros to keep
; track of which member was accessed most recently.  This makes
; them less practical across calls or branches.

; 6 cycles, no extra register, depends on last offset
  ld a,l
  xor a,member_offset^last_member_offset
  ld l,a
  ld a,[hl]  ; or B, C, D, or E trashing A

; The next two modify L in place and leave A unchanged if B, C,
; D, or E is accessed.  But this requires predicting in advance
; in which order the members of a structure will be accessed
; and having the discipline to rewrite code every time the
; structure layout is revised to reflect new access patterns.
; If loading into A, replace [hl] with [hl+] or [hl-] to make
; last_member_offset for the next access one greater or less
; than member_offset for this access.

; 4 cycles, differs by 1 bit from last offset
; no extra register, A preserved
  set 2,l
  ld a,[hl]  ; or B, C, D, or E

; 3 cycles, differs by +/-1 from last offset
; no extra register, A preserved
  inc l
  ld a,[hl]  ; or B, C, D, or E

; The next one assumes 256-byte slots, as if memory is laid
; out in a 2D array.  If this struct uses, say, xx00-xx17 of
; pages C0-C7, other things can use xx18-xxFF of the same page.
; This method works on machines with 8K or more of RAM, such as
; Spectrum, MSX, Game Boy, and Master System/Game Gear, but not
; so well on 1K RAM machines like ColecoVision and SG-1000.

; 4 cycles
  ld l,member_offset
  ld a,[hl]  ; or B, C, D, or E

; The last one assumes copying a struct into HRAM beforehand.
; Copying an n-byte struct into HRAM beforehand and out afterward
; adds roughly 12*n cycles to calling methods on an instance.
; For a struct using (say) the first 20 bytes of its 32-byte slot,
; and eight instances in a pool, copying in and out could add up to
; 1920 cycles or 11% of a frame.

; 3 cycles, HL not needed, register A only
  ldh a,[hThis+member_offset]
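The XOR-based variants above rely on struct slots being aligned so that the slot bits and the offset bits of L never overlap. Here is a Python sketch that checks this for the 32-byte slots implied by the $E0 mask, and redoes the HRAM copy cost estimate. The function names are illustrative, and the 154-line frame (144 visible plus 10 vblank lines at 114 cycles each) is the standard Game Boy frame length.

```python
SLOT_SIZE = 32       # assumed: actor slots aligned to 32 bytes
SLOT_MASK = 0xE0     # keeps the slot bits of L, clears the offset bits


def member_low_from_any(l, member_offset):
    """ld a,$E0 / and a,l / xor a,member_offset: valid from any L in the slot."""
    return (l & SLOT_MASK) ^ member_offset


def member_low_from_last(l, member_offset, last_member_offset):
    """ld a,l / xor a,member_offset^last_member_offset."""
    return l ^ (member_offset ^ last_member_offset)


# HRAM copy-in/copy-out estimate from the last comment block.
CYCLES_PER_BYTE_IN_AND_OUT = 12
STRUCT_BYTES = 20
INSTANCES = 8
CYCLES_PER_FRAME = 154 * 114

copy_cycles = CYCLES_PER_BYTE_IN_AND_OUT * STRUCT_BYTES * INSTANCES
copy_share = copy_cycles / CYCLES_PER_FRAME

print(f"{copy_cycles} cycles of {CYCLES_PER_FRAME}, {copy_share:.1%} of a frame")
```

Because the low 5 bits of an aligned slot base are zero, XOR behaves like addition here, which is why no carry handling is needed.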
On 6502, by contrast, the usual approach is parallel arrays, one for each byte, with X identifying which member's fields will be accessed.
; 4 (read) or 5 (write) cycles, no extra register, A preserved
; Can also load into Y, or store from Y if array in direct page
  lda member_array,x  ; standard syntax
; or
  ld a,[member_array+X]  ; in nocash-style pseudo-Z80 syntax
But for (say) enemy movement in a platformer, the behaviors will differ greatly from one type of object to the next. Say an array contains data for 8 actors that can be of different subtypes, such as an enemy that walks toward the player, an enemy that flies near the top of the screen and dive bombs the player directly downward, an enemy that paces back and forth, an enemy that flies near the top and periodically swoops down toward the player, an enemy that walks in a straight line and can be turned over and picked up, or a powerup that sits still and waits for the player to collect it. All of these actor types coexist in the same actor pool. Then each loop will need an associated flag as to whether the step of computation applies to each particular object, and now you need three pointers, adding BC to point to this flag. That pretty much exhausts your registers. I guess you could interleave "which steps should be skipped" fields with data fields in these arrays.
So if accessing fields of an element of an array of structs is slow, and adam_smasher's SIMD style paradigm doesn't apply to somewhat heterogeneous arrays, the LR35902 will have to make up for the inflexibility of its sequential-access-oriented address generator in other ways in order to keep up with a 6502.
Which of the above methods to access fields of elements of actor pools are used most often in action games for Game Boy? Are there any I missed?
; copy the ent data in an area in the stack, leaving enough room for a few calls etc
  LD HL,(SP+(-Offset-ValueIndex))  ; this then gets 2 bytes or 1 word param in to HL
With a copy set up cost though
OO code is still going to be a pain though, which is why my first GB projects probably aren't going to involve situations that need OO code (such as enemy movement).
The instruction set reference I've been using says LD HL,SP+rr is 12 master clocks, or 3 cycles. This isn't enough to read the instruction, the offset, and 16 bits of data from stack memory. So I'm guessing it just adds the signed 8-bit value to SP and stores the result in HL. Then another LD A|B|C|D|E,[HL] (2 cycles) is needed, but at least it works with BCDE and thus preserves A if that's needed. The method can also be mixed with the HRAM copy method by LD HL,hThis.
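Assuming that guess is right, which matches the common opcode references, the instruction only computes an address and a separate load does the actual read. A Python sketch of that reading, with illustrative names and a made-up stack address:

```python
def ld_hl_sp_e8(sp, e8):
    """LD HL,SP+e8: HL = SP + sign-extended 8-bit offset. No memory read."""
    offset = e8 - 0x100 if e8 & 0x80 else e8
    return (sp + offset) & 0xFFFF


def read_stack_relative(mem, sp, e8):
    """LD HL,SP+e8 (3 cycles) followed by LD A,[HL] (2 cycles)."""
    hl = ld_hl_sp_e8(sp, e8)
    return mem[hl], 3 + 2


# Hypothetical layout: a byte stored 4 bytes above the stack pointer.
memory = bytearray(0x10000)
stack_pointer = 0xDFF0
memory[stack_pointer + 4] = 0x42

value, cycles = read_stack_relative(memory, stack_pointer, 0x04)
print(f"value ${value:02X} in {cycles} cycles")
```

At 5 cycles per byte this sits between the 6-cycle and 4-cycle methods listed earlier, with the advantage that A is preserved when loading into B, C, D, or E.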
The Z80 is such a 95% cpu... it almost does things in a nice way, but it just lacks that 1 opcode....
I've been asking in Z80 circles and basically I just get silence. There is no good way. The best I've been given is "align to 256 boundaries" and "use tables".
I've also been pointing out to others how to use their CPU faster ..sigh...
Seems there are no silver bullets nor horizons. But the stack move trick really does slay a 6502. It's just that the main computer I'm doing it on only has a 2 MHz Z80, which makes it only slightly faster; on a 3.5 or 4 MHz Z80 it would slay...
What exactly is this "stack move trick", and does it work even when interrupts are enabled? Or would I have to first test whether LY is close to LYC before doing a move in order to avoid graphical glitches due to missing a STAT IRQ? Because if you try using a stack move on a 6809, you get corruption from the return address and things your ISR pushes.
EDIT: I've been informed that Retrocomputing Stack Exchange deems C about as bad of a fit for Z80 as for 6502.
Load SP from SRC
Save SP to SRC
Set SP to DEST
SET SP from SRC
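One reading of these steps, sketched as a Python model rather than real Z80 or SM83 code: point SP at the source, pop a batch of register pairs, point SP just past the destination, and push the pairs back in reverse order. The batch of three pairs (BC, DE, HL on the Game Boy) is an assumption for illustration, as are all the names.

```python
def pop16(mem, sp):
    """POP rr: read low byte at SP and high byte at SP+1, then SP += 2."""
    value = mem[sp] | (mem[sp + 1] << 8)
    return value, sp + 2


def push16(mem, sp, value):
    """PUSH rr: SP -= 2, leaving low byte at SP and high byte at SP+1."""
    sp -= 2
    mem[sp] = value & 0xFF
    mem[sp + 1] = value >> 8
    return sp


def stack_copy(mem, src, dest, pairs=3):
    """Copy 2*pairs bytes from src to dest using only pops and pushes."""
    sp = src                          # load SP from SRC
    regs = []
    for _ in range(pairs):            # e.g. pop bc / pop de / pop hl
        value, sp = pop16(mem, sp)
        regs.append(value)
    sp = dest + 2 * pairs             # set SP to the END of DEST
    for value in reversed(regs):      # e.g. push hl / push de / push bc
        sp = push16(mem, sp, value)
    return sp                         # back at dest once all pushes are done


memory = bytearray(64)
memory[0:6] = bytes([1, 2, 3, 4, 5, 6])
final_sp = stack_copy(memory, src=0, dest=32)
print(list(memory[32:38]), final_sp)
```

Because pushes fill memory downward, the destination pointer has to start past the end of the block, and the pairs go back in the opposite order from how they were popped.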
And yes an IRQ could stuff it up. However typically you would be doing this during VBlank and hence be in the interrupt already, or do it at a point where you know you are safe from interrupts. Sadly somebody at SEGA decided that the PAUSE button should be on NMI... obviously from the Commodore School of Design ala the RESTORE key ...
Now the GB only has half, so not quite the bang per buck but I would expect it to still be faster. Person at the SMS forum did a badly optimised broken pop HL out (c) out (c) test and got 25% over outi to which full pop/pull would be faster again, but the SMS has "port" access to memory
I think by 256 align they meant
and then you only have to mod l in a hl pair to index into an entity. This person was using PacMan hardware and probably only had 5 entities. Somebody else pointed out that the Filmation games ( ZX spectrum ) used IX/Y. I would point out that Filmation games are slow
NMI was originally reset on SC-3000, and then got repurposed as Pause on SG-1000 and later hardware.
Oziphantom wrote: And yes an IRQ could stuff it up. However typically you would be doing this during VBlank and hence be in the interrupt already, or do it at a point where you know you are safe from interrupts.
In a 2-player game, is the Game Boy ever safe from Game Link port byte completion interrupts that can happen every 1024 cycles (9 scanlines)? Are games supposed to instead poll the transfer busy flag (SC bit 7) periodically in the stack transfer loop?
Oziphantom wrote: Person at the SMS forum did a badly optimised broken pop HL out (c) out (c) test and got 25% over outi
That sounds sort of like how the Popslide library on NES works. It handles the (remote) possibility of being interrupted by 1. treating the input buffer on the stack as consumable and 2. allocating 8 unused bytes of headroom before the buffer. (An IRQ wouldn't happen in a vblank handler unless obscure music engine tricks are in use.)
As you are probably going to "pull" from ROM, as it will have the prebuilt data you need for frames and such, any interrupt at that point is going to write to no man's land (maybe triggering a BANK change on GB, that would be evil...). You can then set a flag in RAM to trap that you have entered the IRQ while in ROM stack, and then restore back into the routine at some safe point.
If it was while writing to dest, save for the last couple of bytes, it will be safe, as it will then wind back the stack value and you just continue putting the right data back over it. Unless you are in VBlank, at which point you may no longer be in VBlank and you need to handle said case: wait for next VBlank, move back to start of handler and go again. If it was caused by data transfer then the glitch in the frames is probably not the best, but if it's the odd glitch, probably better in the long run.