Posted: Tue Mar 27, 2018 11:15 am
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 20409
Location: NE Indiana, USA (NTSC)
The relative speed of the NES and Game Boy can be calculated in several ways.

The Nintendo Entertainment System and the Super Game Boy accessory have the same 945/44 = 21.47 MHz master clock. The NTSC NES divides the master clock by 12 to make the 1.79 MHz 6502 clock. SGB divides the clock by 5 (4.30 MHz) before passing it to the Game Boy's LR35902 CPU, whose multicycle implementation in turn divides it by 4 (1.07 MHz). Thus GB effective clock rate is 3/5 (60%) of the NES clock rate, which opens the debate about whether LR35902 makes it up in work per clock.
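To make the divider arithmetic above concrete, here is a minimal Python sketch that derives each clock from the shared master clock using exact fractions. The variable names are illustrative only and are not taken from any emulator or SDK.

```python
from fractions import Fraction

# Master clock shared by the NTSC NES and the Super Game Boy, in MHz.
MASTER_MHZ = Fraction(945, 44)

nes_cpu_mhz = MASTER_MHZ / 12      # 6502 clock
sgb_input_mhz = MASTER_MHZ / 5     # clock passed to the LR35902
gb_cycle_mhz = sgb_input_mhz / 4   # one machine cycle per 4 input clocks

ratio = gb_cycle_mhz / nes_cpu_mhz

print(f"NES CPU:      {float(nes_cpu_mhz):.3f} MHz")
print(f"SGB input:    {float(sgb_input_mhz):.3f} MHz")
print(f"GB M-cycles:  {float(gb_cycle_mhz):.3f} MHz")
print(f"GB/NES ratio: {ratio} = {float(ratio):.0%}")
```

Because both dividers hang off the same master clock, the ratio reduces to exactly 12/20 regardless of the master frequency.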

Stack instructions: A push and pop on LR35902 take 7 cycles total, same as 6502, but they handle 2 bytes at a time. LR35902's RET is faster than 6502's RTS by 2 cycles, reducing the penalty for subroutine calls. The indirect jump instruction JP (HL) takes 1 cycle, which is faster than the 6502 sequence of load high byte, PHA, load low byte, PHA, RTS.
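Cycle counts alone hide the clock difference, so here is a rough sketch that converts the stack figures to wall-clock time on SGB versus NTSC NES. The 7-cycle push-plus-pop total comes from the paragraph above; RET = 4 and RTS = 6 are assumed from the usual instruction tables (consistent with "faster by 2 cycles"), and all names are made up for illustration.

```python
from fractions import Fraction

MASTER_MHZ = Fraction(945, 44)
NES_CYCLE_US = 1 / (MASTER_MHZ / 12)   # microseconds per 6502 cycle
GB_CYCLE_US = 1 / (MASTER_MHZ / 20)    # microseconds per LR35902 machine cycle (SGB)

# PUSH rr + POP rr move 2 bytes in 7 cycles; PHA + PLA move 1 byte in 7 cycles.
gb_push_pop_us_per_byte = 7 * GB_CYCLE_US / 2
nes_push_pop_us_per_byte = 7 * NES_CYCLE_US / 1

# Assumed cycle counts: RET = 4 machine cycles, RTS = 6 CPU cycles.
gb_ret_us = 4 * GB_CYCLE_US
nes_rts_us = 6 * NES_CYCLE_US

print(f"push+pop per byte: GB {float(gb_push_pop_us_per_byte):.2f} us, "
      f"NES {float(nes_push_pop_us_per_byte):.2f} us")
print(f"return:            GB {float(gb_ret_us):.2f} us, "
      f"NES {float(nes_rts_us):.2f} us")
```

Under these assumptions the two-byte push and pop still win per byte in real time, while RET's two-cycle advantage does not survive the slower clock.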

ALU instructions: LR35902's lack of a penalty cycle for "implied"-mode instructions helps. The Intel-style carry (as opposed to MOS/ARM-style carry) allows a fast idiom for sign-extending A: RLCA SBC A copies bit 7 to all bits of A. There's an 8-bit rotate in addition to the 6502's 9-bit one, an arithmetic right shift that copies old bit 7 to new bit 7 (no need for CMP #$80 ROR A), and a nibble swap instruction. But there's no sign flag after ALU operations, and testing bit 7 of an ALU result needs another cycle or two for a compare or bit-test instruction.
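The sign-extension idiom can be checked with a small Python model of the two instructions. This is only a sketch of the arithmetic (the function names are invented here): the rotate moves bit 7 into carry, and the subtract has to be the with-borrow form, since a plain subtract of A from itself would always leave zero.

```python
def rlca(a):
    """Rotate A left by one bit; old bit 7 goes to both bit 0 and carry."""
    carry = (a >> 7) & 1
    return ((a << 1) | carry) & 0xFF, carry

def sbc_a_a(carry):
    """A - A - carry, truncated to 8 bits: $00 if carry clear, $FF if set."""
    return (0 - carry) & 0xFF

def sign_fill(a):
    """RLCA followed by SBC A,A: every bit of the result equals old bit 7."""
    _, carry = rlca(a)
    return sbc_a_a(carry)

for value in (0x00, 0x7F, 0x80, 0xC3):
    print(f"${value:02X} -> ${sign_fill(value):02X}")
```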

Memory instructions: With its 2-cycle 16-bit increments and autoincrement for the pointer register HL, LR35902 is arguably faster than 6502 for sequential access to arrays, especially those larger than 256 bytes. But for random access, I've mentioned elsewhere how LR35902 lacks the 6502's rich indexed addressing modes. Thus random access to a field of a structure, such as the fields of an actor in a game, requires radical reorganization of structures in memory and more advance planning based on the order in which the fields will be accessed. Later I'll post the seven workarounds that I discovered.

RAM: Game Boy has more. This tilts some space-time tradeoffs; I'd be interested to read how this plays out in practice.

C language: Making a game that runs with minor changes between PC and either NES or Game Boy often involves writing the game logic in C and only the I/O (input, audio, graphics) and systems parts of the engine in assembly. ISSOtm has written thoughts about C on Game Boy. Instructions that involve HL and SP allow for a larger hardware stack, reducing some of the soft-stack penalty that cc65 has to pay.

VRAM bandwidth: GB and NTSC NES have almost the same count of cycles per scanline (114 vs. 113.667). NTSC NES has 20.5 lines of vertical blanking (assuming half of prerender is "borrowed") while GB has 10. GB, however, doesn't support extending blanking to add more VRAM update time. An unrolled copy to VRAM is 6 cycles/byte on GB compared to 8 on NES. But because the GB PPU is faster relative to the CPU (4 dots on GB, 3 on NTSC NES) and narrower (160 dots vs. 256), GB also has the majority of its scanline (at least 64 cycles) open for VRAM reading and writing during horizontal blanking. This makes it practical for a loop to copy 8 bytes to VRAM after each of the 144 scanlines even without the GBC's CHR HDMA feature, or 1152 bytes per screen, so long as you take care about tearing. But nothing beats the bandwidth of having all your tiles in CHR ROM at a slight cost in flexibility, though GBC has banked CHR RAM.
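As a rough upper bound on the figures above, this sketch turns the per-scanline and per-byte numbers into bytes per vertical blank. It ignores loop overhead, OAM DMA, and interrupt entry; the 341-dot NES scanline is the standard figure behind 113.667, and all names are illustrative.

```python
from fractions import Fraction

GB_CYCLES_PER_LINE = 114
NES_CYCLES_PER_LINE = Fraction(341, 3)   # 341 PPU dots / 3 dots per CPU cycle

GB_VBLANK_LINES = 10
NES_VBLANK_LINES = Fraction(41, 2)       # 20.5, counting half of the prerender line

GB_COPY_CYCLES_PER_BYTE = 6              # unrolled copy, per the text
NES_COPY_CYCLES_PER_BYTE = 8

gb_vblank_cycles = GB_CYCLES_PER_LINE * GB_VBLANK_LINES
nes_vblank_cycles = NES_CYCLES_PER_LINE * NES_VBLANK_LINES

gb_vblank_bytes = gb_vblank_cycles // GB_COPY_CYCLES_PER_BYTE
nes_vblank_bytes = int(nes_vblank_cycles // NES_COPY_CYCLES_PER_BYTE)

# Horizontal-blank copying on GB: 8 bytes after each visible scanline.
gb_hblank_bytes = 8 * 144

print("GB vblank bytes: ", gb_vblank_bytes)
print("NES vblank bytes:", nes_vblank_bytes)
print("GB hblank bytes: ", gb_hblank_bytes)
```

By this estimate the NES moves more per vblank, but the GB's hblank copying more than makes up the difference when a game can tolerate updates spread across the frame.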

OAM DMA on Game Boy takes 160 cycles, running at 1 cycle per byte like Super NES DMA, as opposed to 514 cycles (2 per byte) on the NES. Because it doesn't pause the CPU during DMA execution, only HRAM (its counterpart to NES zero page or GBA IWRAM) is accessible, and DMA is normally done by a 10-byte subroutine in HRAM.
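Expressed in scanlines, the DMA costs look like this. The sketch reuses the cycle counts from this post and assumes the same vblank lengths as above (10 lines on GB, 20.5 on NTSC NES); the names are hypothetical.

```python
from fractions import Fraction

GB_CYCLES_PER_LINE = 114
NES_CYCLES_PER_LINE = Fraction(341, 3)

gb_dma_lines = Fraction(160, GB_CYCLES_PER_LINE)
nes_dma_lines = 514 / NES_CYCLES_PER_LINE

gb_lines_left = 10 - gb_dma_lines
nes_lines_left = Fraction(41, 2) - nes_dma_lines

print(f"GB:  DMA {float(gb_dma_lines):.2f} lines, {float(gb_lines_left):.2f} left")
print(f"NES: DMA {float(nes_dma_lines):.2f} lines, {float(nes_lines_left):.2f} left")
```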

Scrolling: Like the Super NES, the Game Boy lacks the oddball 30-row nametable height. This simplifies some designs for nametable update packets. Monochrome doesn't have attributes at all; GBC is like MMC5 EXRAM or Super NES nametables in that it has a second byte plane of attributes whose addresses parallel those of the nametable. But without special support for +32 increment, a nametable column copy loop is slightly slower than 6 cycles/byte.

Frame rate: Both the NES and Game Boy run at close to 60 frames per second. The original green screen Game Boy (DMG) takes several refreshes to change a pixel from light to dark or vice versa. The Game Boy Pocket and Game Boy Color take fewer, but still not fast enough to make 30 Hz flicker as noticeable as it would be on SGB. This lets developers get away with engines that run on twos, on threes, or even on fours (Balloon Kid), for 30, 20, or 15 fps.
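For reference, the logic rates implied by running "on twos, threes, or fours" can be worked out as below. The 4,194,304 Hz handheld clock and the 154-line frame are assumptions not stated in this post (154 is the 144 visible lines plus the 10 vblank lines mentioned earlier); the SGB figure uses the master clock divided by 5.

```python
GB_CLOCK_HZ = 4_194_304          # assumed handheld clock (2**22 Hz)
SGB_CLOCK_HZ = 945e6 / 44 / 5    # master clock / 5
DOTS_PER_FRAME = 456 * 154       # 456 dots (114 cycles) per line, 154 lines

gb_hz = GB_CLOCK_HZ / DOTS_PER_FRAME
sgb_hz = SGB_CLOCK_HZ / DOTS_PER_FRAME

for divisor in (1, 2, 3, 4):
    print(f"every {divisor} frame(s): {gb_hz / divisor:.2f} updates/s")
print(f"SGB refresh: {sgb_hz:.2f} Hz")
```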

I'm interested in fleshing out the arguments both ways to make them more quantitative as opposed to hand-wavey.


Posted: Tue Mar 27, 2018 9:59 pm
Joined: Tue Feb 07, 2017 2:03 am
Posts: 510
From the Spectrum wars, a 3.58 MHz Z80 ~ 0.98 MHz 6502. The Game Boy is more of a Z80-/8080+ and the NES is a 6502-, but at 4.2 vs. 1.79 MHz my money is on the NES's CPU for raw power.
The Z80 has higher code density, making unrolling things more practical in tight memory situations, but the 6502 lets you throw down tables for little penalty.

Another consideration is resolution: a Game Boy is 160x144 vs. the NES's 256x240, so 23040 vs. 61440 pixels, which means the GB has 37.5% of the number of pixels to update. So in bang per pixel I would think a GB slaughters a NES. There are no palettes, which means the GB screen takes even fewer bytes, and faster OAM DMA means you can probably get more done.
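A quick sketch of the "bang per pixel" arithmetic: CPU cycles available per frame divided by pixels on screen. The frame lengths (154 lines of 114 cycles on GB, 262 lines of 341/3 cycles on NTSC NES, ignoring the NES's skipped dot) are assumptions added here, and the metric itself is only one way to frame the comparison.

```python
from fractions import Fraction

gb_pixels = 160 * 144
nes_pixels = 256 * 240
pixel_ratio = Fraction(gb_pixels, nes_pixels)

gb_cycles_per_frame = 114 * 154
nes_cycles_per_frame = Fraction(341, 3) * 262

gb_cycles_per_pixel = Fraction(gb_cycles_per_frame, gb_pixels)
nes_cycles_per_pixel = nes_cycles_per_frame / nes_pixels

print(f"pixel ratio:      {float(pixel_ratio):.1%}")
print(f"GB cycles/pixel:  {float(gb_cycles_per_pixel):.3f}")
print(f"NES cycles/pixel: {float(nes_cycles_per_pixel):.3f}")
```

These are raw cycle counts, so they say nothing about how much work each CPU gets done per cycle.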

More RAM always helps, although the Z80's lack of flag setting on load instructions makes storing 128/0 for a NegativeFlag variable mostly moot, as you need to do a ld followed by an and/or to get the flags set.

C on a Z80 makes a lot more sense than on a 6502; still not a great idea, but a lot more sane.


Posted: Wed Mar 28, 2018 8:09 am
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 20409
Location: NE Indiana, USA (NTSC)
I can understand "bang per pixel" as a metric for fillrate in software 3D, as the viewport is a larger fraction of the screen in Qix for Game Boy than NES and in Faceball 2000 for Game Boy than Super NES. But in a 2D game on a machine with hardware scrolling and sprites, a lot of things scale with number of objects more than their size. So if there are 8 moving things on the screen, there still need to be 8 move calls whether they're 32x32 pixels each on a Genesis (10% of screen width) or 16x16 pixels each on a Game Boy (also 10% of screen width).

No palettes on monochrome mean no color cycling on waterfalls and the like, meaning tile rotation through CHR RAM updates needs to make up for it.

GB OAM DMA takes about 1.5 vblank scanlines out of 10, leaving 8.5. NES OAM DMA takes just shy of 5 vblank scanlines out of 20, leaving 15, still having more scanlines to do things. But GB has another edge in that OAM access is less buggy,[1] meaning something with only a few sprites might get away with writing them directly to OAM rather than making a display list and DMAing it in. But filling that display list is another issue, and the speed of that can depend on how the address of the source data is calculated.


[1] "Less buggy" in that only 16-bit increment instructions write garbage to GB OAM, not just the mere act of writing beyond the first 7 bytes like on NES.


Posted: Wed Mar 28, 2018 9:01 pm
Joined: Sun Mar 19, 2006 9:44 pm
Posts: 957
Location: Japan
This is really interesting. I thought the GB's CPU just crawled compared to the NES', when just simply considering the GB using a Z80/8080 and looking at its clock speed.

_________________
http://www.chrismcovell.com


Posted: Wed Mar 28, 2018 9:23 pm
Joined: Tue Feb 07, 2017 2:03 am
Posts: 510
Sure, but a smaller screen does normally mean less on the screen; compare Mario with Wario Land, for example. Having more things "off screen" lets you lower their update rate a bit. The poor display with no backlight and ghosting means you want fewer things, moving slower, anyway, right?

Another feature is that the GB has memory-mapped VRAM: none of this port rubbish, pure direct access. The Z80 has one glorious trick up its sleeve, the one thing in which it will smash a 6502, and that is rapidly moving blocks of data around. I'm currently investigating how to do it effectively, because even a 2 MHz Z80 is about 25% faster at it than a 1 MHz 6502. Granted, the GB80 doesn't have the other register set, which cuts the data you can move at once in half, but I would think it still beats the NES CPU + port. It's not as fast as pure immediate speed code, but it's a lot more generic and eats a lot less space, which makes it more practical. You could fit the routine into the 120? bytes of HRAM, along with the initial data you want. This would allow you to get the GB80 to set up the registers and the dest SP while the OAM DMA happens, then as soon as it finishes, hit with the first PUSH set, and keep going until you hit the end of Mode 2. If only the second register set were still there, you could keep it loaded for an H-blank update, but alas.


Posted: Wed Mar 28, 2018 11:56 pm
Joined: Tue Feb 07, 2017 2:03 am
Posts: 510
Looking at the GB80 limits... it might work out slower than the 6502 to do the stack move vs LDA XXXX,x STA XXXX,x


Posted: Thu Mar 29, 2018 7:12 am
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 20409
Location: NE Indiana, USA (NTSC)
Here are ways to read or write a member of a power of two aligned data structure in RAM, as a way of compensating for the slowness (on Z80) or nonexistence (on LR35902) of indexed addressing. They assume that a pointer to some member of the struct is already in HL.
Code:
  ; 8 cycles, no extra register
  ld a,$E0
  and a,l
  xor a,member_offset
  ld l,a
  ld a,[hl]  ; or B, C, D, or E trashing A

  ; 8 cycles, extra HRAM variable to hold L of struct start
  ; ergo no real benefit
  ldh a,[Lcur_actor_base]  ; 3
  xor a,member_offset  ; 2
  ld l,a
  ld a,[hl]  ; or B, C, D, or E trashing A

  ; 6 cycles, extra register B to hold L of struct start
  ld a,b
  xor a,member_offset
  ld l,a
  ld a,[hl]  ; or C, D, or E trashing A

  ; The next three require either the programmer or macros to keep
  ; track of which member was accessed most recently.  This makes
  ; them less practical across calls or branches.

  ; 6 cycles, no extra register, depends on last offset
  ld a,l
  xor a,member_offset^last_member_offset
  ld l,a
  ld a,[hl]  ; or B, C, D, or E trashing A

  ; The next two modify L in place and leave A unchanged if B, C,
  ; D, or E is accessed.  But this requires predicting in advance
  ; in which order the members of a structure will be accessed
  ; and having the discipline to rewrite code every time the
  ; structure layout is revised to reflect new access patterns.
  ; If loading into A, replace [hl] with [hl+] or [hl-] to make
  ; last_member_offset for the next access one greater or less
  ; than member_offset for this access.

  ; 4 cycles, differs by 1 bit from last offset
  ; no extra register, A preserved
  set 2,l
  ld a,[hl]  ; or B, C, D, or E

  ; 3 cycles, differs by +/-1 from last offset
  ; no extra register, A preserved
  inc l
  ld a,[hl]  ; or B, C, D, or E

  ; The next one assumes 256-byte slots, as if memory is laid
  ; out in a 2D array.  If this struct uses, say, xx00-xx17 of
  ; pages C0-C7, other things can use xx18-xxFF of the same page.
  ; This method works on machines with 8K or more of RAM, such as
  ; Spectrum, MSX, Game Boy, and Master System/Game Gear, but not
  ; so well on 1K RAM machines like ColecoVision and SG-1000.

  ; 4 cycles
  ld l,member_offset
  ld a,[hl]  ; or B, C, D, or E

  ; The last one assumes copying a struct into HRAM beforehand.
  ; Copying an n-byte struct into HRAM beforehand and out afterward
  ; adds roughly 12*n cycles to calling methods on an instance.
  ; For a struct using (say) the first 20 bytes of its 32-byte slot,
  ; and eight instances in a pool, copying in and out could add up to
  ; 1920 cycles or 11% of a frame.

  ; 3 cycles, HL not needed, register A only
  ldh a,[hThis+member_offset]
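To sanity-check the address arithmetic behind the masking and XOR-delta methods in the listing above, here is a small Python model of what ends up in HL. It assumes 32-byte slots (matching the $E0 mask) that do not cross a 256-byte page; the function names and the sample addresses are invented for illustration. It also reproduces the HRAM copy overhead estimate, assuming a 154-line frame of 114 cycles per line.

```python
SLOT_SIZE = 32                   # assumed slot size, matching the $E0 mask
SLOT_MASK = 0x100 - SLOT_SIZE    # $E0: keeps the slot bits of L

def addr_by_mask(hl, member_offset):
    """ld a,$E0 / and a,l / xor a,member_offset / ld l,a"""
    low = ((hl & 0xFF) & SLOT_MASK) ^ member_offset
    return (hl & 0xFF00) | low

def addr_by_delta(hl, last_member_offset, member_offset):
    """ld a,l / xor a,member_offset^last_member_offset / ld l,a"""
    low = (hl & 0xFF) ^ (member_offset ^ last_member_offset)
    return (hl & 0xFF00) | low

# HL currently points at member 7 of the slot starting at $C0A0.
hl = 0xC0A7
print(hex(addr_by_mask(hl, 0x12)))
print(hex(addr_by_delta(hl, 0x07, 0x12)))

# HRAM copy overhead: 12 cycles/byte, 20 bytes, 8 instances per frame.
copy_cycles = 12 * 20 * 8
frame_cycles = 114 * 154
print(copy_cycles, f"{copy_cycles / frame_cycles:.1%}")
```

Both methods land on the same address; the delta form just trades the mask for the bookkeeping of knowing which member HL pointed at last.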


Compare to 6502, where you'd stripe the struct into a set of separate
arrays, one for each byte, with X identifying which instance's fields
will be accessed.
Code:
  ; 4 (read) or 5 (write) cycles, no extra register, A preserved
  ; Can also load into Y, or store from Y if array in direct page
  lda member_array,x  ; standard syntax
  ; or
  ld a,[member_array+X]  ; in nocash-style pseudo-Z80 syntax


The striping paradigm also applies to Z80 and LR35902 to a lesser extent if you (say) store all accelerations together, all vertical velocities together, all vertical displacements together. Then you can arrange updates to your objects in a SIMD (single instruction multiple data)-style paradigm where each step has its own loop. For example, a loop to add all vertical velocities to the corresponding vertical displacement might have the pointer to velocity in DE and the pointer to displacement in HL. This arrangement, preferred by adam_smasher, is fine if the array is homogeneous, with all elements having the same behavior and all computations applied to all elements.

But for (say) enemy movement in a platformer, the behaviors will differ greatly from one type of object to the next. Say an array contains data for 8 actors that can be of different subtypes, such as an enemy that walks toward the player, an enemy that flies near the top of the screen and dive bombs the player directly downward, an enemy that paces back and forth, an enemy that flies near the top and periodically swoops down toward the player, an enemy that walks in a straight line and can be turned over and picked up, or a powerup that sits still and waits for the player to collect it. All of these actor types coexist in the same actor pool. Then each loop will need an associated flag as to whether the step of computation applies to each particular object, and now you need three pointers, adding BC to point to this flag. That pretty much exhausts your registers. I guess you could interleave "which steps should be skipped" fields with data fields in these arrays.
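The striped, one-loop-per-step arrangement with a per-actor "does this step apply" flag can be sketched as follows. This is a Python model of the data flow only, not of register allocation; the array names and sample values are hypothetical, with the three lists standing in for what HL, DE, and BC would point at.

```python
NUM_ACTORS = 8

# One array per field (striped layout), 8-bit values.
pos_y = [100] * NUM_ACTORS
vel_y = [1, 0, 0xFF, 2, 0, 0, 3, 0xFE]     # $FF = -1, $FE = -2
applies = [True, False, True, True, False, True, True, False]

def add_velocity_step(pos, vel, flags):
    """Add each velocity to its displacement, skipping flagged-out actors."""
    for i in range(len(pos)):
        if flags[i]:
            pos[i] = (pos[i] + vel[i]) & 0xFF   # 8-bit wraparound

add_velocity_step(pos_y, vel_y, applies)
print(pos_y)
```

With every flag set, this collapses to the homogeneous case where two pointers suffice.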

So if accessing fields of an element of an array of structs is slow, and adam_smasher's SIMD style paradigm doesn't apply to somewhat heterogeneous arrays, the LR35902 will have to make up for the inflexibility of its sequential-access-oriented address generator in other ways in order to keep up with a 6502.

Which of the above methods to access fields of elements of actor pools are used most often in action games for Game Boy? Are there any I missed?


Posted: Thu Mar 29, 2018 10:31 pm
Joined: Tue Feb 07, 2017 2:03 am
Posts: 510
Code:
; copy the ent data in an area in the stack, leaving enough room for a few calls etc
LD HL,(SP+(-Offset-ValueIndex))
; this then gets 2 bytes or 1 word param in to HL
It uses 12 clocks per pair, which I guess you are converting to a 1 MHz 6502 base, so 3 clocks? Or NES speed, 2 clocks?
There is a copy setup cost, though.


Posted: Fri Mar 30, 2018 9:59 am
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 20409
Location: NE Indiana, USA (NTSC)
After implementing a proportional font drawing routine in LR35902 assembly yesterday, I've realized that keeping local variables in registers as much as possible and carefully choosing what to spill out to HRAM does speed things up. To make this practical, I ended up writing a subroutine first as a rough pass, where locals are more eagerly spilled to HRAM as if I were coding for 6502, and then doing a second optimization pass to allocate registers. (Is that the typical work flow among Z80 or LR35902 programmers?) One advantage over the 6502 is that sequential accesses through DE or HL need not spend cycles rereading the source and destination pointers from memory all the time. The same is true of locals kept in registers. This and the lack of an "internal operation" cycle for what 6502 calls "implied" mode instructions make up for the penalty of needing to constantly ld things in and out of A.

OO code is still going to be a pain though, which is why my first GB projects probably aren't going to involve situations that need OO code (such as enemy movement).

Oziphantom wrote:
Code:
LD HL,(SP+(-Offset-ValueIndex))

The instruction set reference I've been using says LD HL,SP+rr is 12 master clocks, or 3 cycles. This isn't enough to read the instruction, the offset, and 16 bits of data from stack memory. So I'm guessing it just adds the signed 8-bit value to SP and stores the result in HL. Then another LD A|B|C|D|E,[HL] (2 cycles) is needed, but at least it works with BCDE and thus preserves A if that's needed. The method can also be mixed with the HRAM copy method by LD HL,hThis.
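Under that reading of the instruction (address computation only, followed by a separate load), the two-step access can be modelled like this. It is a sketch of the semantics as guessed above, with invented function names, and it does not model flags.

```python
def ld_hl_sp_plus(sp, e8):
    """HL = SP + signed 8-bit offset, wrapped to 16 bits (3 cycles)."""
    offset = e8 - 0x100 if e8 & 0x80 else e8
    return (sp + offset) & 0xFFFF

def read_stack_local(mem, sp, e8):
    """LD HL,SP+e8 then LD r,[HL]: 3 + 2 = 5 cycles per byte read."""
    hl = ld_hl_sp_plus(sp, e8)
    return mem[hl]

mem = bytearray(0x10000)
mem[0xFFF2] = 0x5A
print(hex(ld_hl_sp_plus(0xFFF0, 0xFE)))        # offset $FE means -2
print(hex(read_stack_local(mem, 0xFFF0, 0x02)))
```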


Posted: Sun Apr 01, 2018 2:11 am
Joined: Tue Feb 07, 2017 2:03 am
Posts: 510
Just when you think you find something .. denied!

The Z80 is such a 95% CPU... it almost does things in a nice way, but it just lacks that one opcode....

I've been asking in Z80 circles and basically I just get silence. There is no good way. The best I've been given is "align to 256 boundaries" and "use tables".
I've also been pointing out to others how to use their CPU faster... sigh...

Seems there are no silver bullets nor horizons. But the stack move trick really does slay a 6502 ;) It's just that the main computer I'm doing it on only has a 2 MHz Z80, which makes it only slightly faster; on a 3.5 or 4 MHz Z80 it would slay...


Posted: Sun Apr 01, 2018 9:58 am
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 20409
Location: NE Indiana, USA (NTSC)
Essentially the problem is that of accessing elements of a 2D array with one dimension (actor ID) constant over the short term and the other dimension (actor property) not sequential. Perhaps by "align to 256 boundaries", they're telling you to put the actors at $C0E0-$C0FF, $C1E0-$C1FF, ..., $C7E0-$C7FF. This way actor ID can be held constant at $C0, $C1, ..., $C7, and actor property can be set to $E0, $E1, ..., $FF. But that isn't so convenient if you don't have other arrays (2D or otherwise) to fill the remaining 224 bytes of each page. (Nor would it work on a ColecoVision.)
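The page-per-actor layout described here amounts to putting the actor ID in the high byte and the property in the low byte of the address. A minimal sketch, with hypothetical constant and function names:

```python
FIRST_PAGE = 0xC0    # actor 0 lives in page $C0
NUM_ACTORS = 8       # pages $C0-$C7
FIRST_PROP = 0xE0    # properties occupy $E0-$FF of each page
NUM_PROPS = 32

def actor_field_addr(actor_id, prop):
    """High byte selects the actor (H), low byte selects the property (L)."""
    if not 0 <= actor_id < NUM_ACTORS:
        raise ValueError("actor_id out of range")
    if not 0 <= prop < NUM_PROPS:
        raise ValueError("prop out of range")
    return ((FIRST_PAGE + actor_id) << 8) | (FIRST_PROP + prop)

free_bytes_per_page = 256 - NUM_PROPS

print(hex(actor_field_addr(0, 0)), hex(actor_field_addr(7, 31)))
print("bytes left per page:", free_bytes_per_page)
```

Switching properties then only rewrites the low byte, which is what makes the 2-cycle ld l,member_offset method possible.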

What exactly is this "stack move trick", and does it work even when interrupts are enabled? Or would I have to first test whether LY is close to LYC before doing a move in order to avoid graphical glitches due to missing a STAT IRQ? Because if you try using a stack move on a 6809, you get corruption from the return address and things your ISR pushes.

EDIT: I've been informed that Retrocomputing Stack Exchange deems C about as bad of a fit for Z80 as for 6502.


Posted: Mon Apr 02, 2018 12:02 am
Joined: Tue Feb 07, 2017 2:03 am
Posts: 510
Basically you

Save SP
Load SP from SRC
pop AF
pop BC
pop DE
pop HL
EXX
EX AF,AF'   ; EXX swaps only BC, DE, HL
pop AF
pop BC
pop DE
pop HL
Save SP to SRC
Set SP to DEST
push HL
push DE
push BC
push AF
EXX
EX AF,AF'   ; bring back the first set, including AF
push HL
push DE
push BC
push AF
SET SP from SRC
....

And yes, an IRQ could stuff it up. However, typically you would be doing this during VBlank and hence be in the interrupt handler already, or do it at a point where you know you are safe from interrupts. Sadly somebody at SEGA decided that the PAUSE button should be on NMI... obviously from the Commodore School of Design, a la the RESTORE key...
Now the GB only has half the registers, so not quite the bang per buck, but I would expect it to still be faster. A person at the SMS forum did a badly optimised, broken pop HL / out (c) / out (c) test and got 25% over outi, and a full pop/push would be faster again, but the SMS has "port" access to memory.
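The pop/push sequence above can be modelled in Python to check that the byte order survives the round trip. This sketch assumes the destination stack pointer starts just past the end of the destination chunk (pushes write downward) and that no interrupt fires mid-copy; the function and parameter names are made up.

```python
def stack_copy(mem, src, dest, pairs):
    """Copy 2*pairs bytes by popping register pairs from src, then pushing to dest."""
    regs = []
    sp = src
    for _ in range(pairs):              # POP rr: low byte first, then high byte
        regs.append((mem[sp], mem[sp + 1]))
        sp += 2
    next_src = sp                       # saved for the next chunk

    sp = dest + 2 * pairs               # just past the end of the destination chunk
    for low, high in reversed(regs):    # PUSH in reverse order of the pops
        sp -= 1
        mem[sp] = high
        sp -= 1
        mem[sp] = low
    return next_src

mem = bytearray(range(64))
next_src = stack_copy(mem, src=0, dest=32, pairs=8)   # 8 pairs = both Z80 register sets
print(mem[32:48].hex(), next_src)
```

With a single register set the same model applies with fewer pairs per chunk, which means more stack pointer juggling per byte moved.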

I think by 256 align they meant
00XX
01XX
02XX
and then you only have to mod l in a hl pair to index into an entity. This person was using PacMan hardware and probably only had 5 entities. Somebody else pointed out that the Filmation games ( ZX spectrum ) used IX/Y. I would point out that Filmation games are slow ;)


Posted: Mon Apr 02, 2018 12:21 am
Joined: Wed Feb 13, 2008 9:10 am
Posts: 655
Location: Estonia, Rapla city (50 and 60Hz compatible :P)
On SMS you can inhibit NMI from cartslot by forcing the line high, not gonna work on JP SMS or Mark III though. Game Gear has no worries in GG mode but will barf in SMS mode when someone touches Start.
NMI was originally reset on SC-3000, and then got repurposed as Pause on SG-1000 and later hardware.

_________________
http://www.tmeeco.eu


Posted: Mon Apr 02, 2018 7:00 am
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 20409
Location: NE Indiana, USA (NTSC)
Oziphantom wrote:
And yes an IRQ could stuff it up. However typically you would be doing this during VBlank and hence be in the interrupt already, or do it at a point where you know you are safe from interrupts.

In a 2-player game, is the Game Boy ever safe from Game Link port byte completion interrupts that can happen every 1024 cycles (9 scanlines)? Are games supposed to instead poll the transfer busy flag (SC bit 7) periodically in the stack transfer loop?
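To put the 1024-cycle figure in perspective, this sketch converts it to scanlines and to interrupts per frame, assuming a 154-line frame (144 visible plus 10 vblank) at 114 cycles per line. The names are illustrative.

```python
LINK_BYTE_PERIOD_CYCLES = 1024
GB_CYCLES_PER_LINE = 114
GB_LINES_PER_FRAME = 154    # assumed: 144 visible + 10 vblank

period_lines = LINK_BYTE_PERIOD_CYCLES / GB_CYCLES_PER_LINE
frame_cycles = GB_CYCLES_PER_LINE * GB_LINES_PER_FRAME
interrupts_per_frame = frame_cycles / LINK_BYTE_PERIOD_CYCLES

print(f"one byte every {period_lines:.2f} scanlines")
print(f"about {interrupts_per_frame:.1f} completion interrupts per frame")
```

Since the 10-line vblank is longer than the roughly 9-line period, a transfer running at that rate would complete at least one byte inside every vblank.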

Oziphantom wrote:
Person at the SMS forum did a badly optimised broken pop HL out (c) out (c) test and got 25% over outi

That sounds sort of like how the Popslide library on NES works. It handles the (remote) possibility of being interrupted by 1. treating the input buffer on the stack as consumable and 2. allocating 8 unused bytes of headroom before the buffer. (An IRQ wouldn't happen in a vblank handler unless obscure music engine tricks are in use.)


Posted: Mon Apr 02, 2018 7:40 am
Joined: Tue Feb 07, 2017 2:03 am
Posts: 510
I would hope Nintendo wasn't so janky as to make pulse interrupts... but looking at the NES, hmm... If they are level-triggered instead, the interrupt is still there when you EI, as the line will be held in the active state until you clear it. Although I'm assuming the Z80 behaves in this way, I will do a test on my 128 to confirm its behaviour.

As you are probably going to "pull" from ROM, since it will have the prebuilt data you need for frames and such, any interrupt at that point is going to write to no man's land (maybe triggering a BANK change on GB, which would be evil...). You can then set a flag in RAM to trap that you entered the IRQ while the stack was in ROM, and then restore back into the routine at some safe point.
If it happened while writing to dest, then except for the last couple of bytes it will be safe, as the interrupt will wind the stack value back and you just continue putting the right data back over it. Unless you are in VBlank, at which point you may no longer be in VBlank and need to handle said case: wait for the next VBlank, move back to the start of the handler, and go again. If it was caused by data transfer then the glitch in the frames is probably not the best, but if it's the odd glitch it's probably better in the long run.

