It is currently Mon Dec 18, 2017 4:10 am

All times are UTC - 7 hours





Post new topic Reply to topic  [ 30 posts ]  Go to page Previous  1, 2
Author Message
PostPosted: Fri Dec 02, 2016 11:52 am 
Offline

Joined: Thu Aug 28, 2008 1:17 am
Posts: 591
tepples wrote:
I'm beginning to understand why some Game Boy games ran at 30 fps or slower

There were master system games that ventured into that 30fps territory too. Earlier titles.

A lot of z80 programmers tend to boast about the available regs to work with on the z80 compared to accumulate based processors (65x, 6809, etc). But I always found it to be the complete opposite. Data registers are kind of a moot point on the 65x simply because it has a lot of direct memory addressing modes (and fast mode; direct or zero page). A lot more operations actually have to go through the A reg on the z80, from another reg or from indirection (address regs). I always felt like a constraint of constantly juggling things - way more than what might be done with Acc on the 65x. And the having ZP as off processor address registers (address vectors) - feels soo free in comparison. Even the 68k felt a tiny bit cramped in this respect to the 65x (only 7 address regs; SP is the 8th address reg).

Of course, the context of 65x to me is not limited to the NES - so my view of optimization and use of quick LUTs for logic are probably expanded compared to the NES environment.

_________________
__________________________
http://pcedev.wordpress.com


Top
 Profile  
 
PostPosted: Sun Dec 18, 2016 9:12 pm 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19355
Location: NE Indiana, USA (NTSC)
C64 vs. Speccy wars concluded that a full Z80 (with IX and IY) has 1/3 of the IPC of a 6502.


Top
 Profile  
 
PostPosted: Thu Jul 27, 2017 3:03 pm 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19355
Location: NE Indiana, USA (NTSC)
In this post, TmEE wrote:
Z80 assembly came next as I needed means to play sound without tying up the main CPU. It was much more painful than 68K which pretty much had spoiled me. x86 still felt worse though... Nowdays I also do Master System and SC-3000 / SG-1000 stuff, whole game in Z80 isn't actually all that bad.

Is that based on IX/IY (SG1K/SMS/GG only) or some other way to step through fields of an actor structure?


Top
 Profile  
 
PostPosted: Fri Jul 28, 2017 1:16 am 
Offline
User avatar

Joined: Wed Feb 13, 2008 9:10 am
Posts: 597
Location: Estonia, Rapla city (50 and 60Hz compatible :P)
I pretty much only use IX(IXL,IXH) and IY(IYL,IYH) as temporary variables and all else goes by structures aligned to 256 bytes and BC/DE/HL addressing with lot of incrementing the C/E/L rather than direct specifying of elements to access through them. Autoincrement comes for free on 68K and incrementing is faster on Z80 than directly specifying the element to access too, data is always laid out in the order of use to accomodate that approach.

_________________
http://www.tmeeco.eu


Top
 Profile  
 
PostPosted: Fri Jul 28, 2017 6:38 am 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19355
Location: NE Indiana, USA (NTSC)
TmEE wrote:
data is always laid out in the order of use

That's what others were recommending but I'm somehow not fully grokking. Say there are 16 bytes of state for each actor in an action platformer:

X position (24 bits; 16.8)
X velocity (16 bits; 8.8 signed)
Y position (16 bits; 8.8)
Y velocity (16 bits; 8.8 signed)
Current frame
Timing state
Facing direction
Height of last hitbox to hit this actor relative to the actor's feet; used for collision response
Health
Actor type ID
VRAM location for actor's sprite cels

Are there some generic rules of thumb for field layout to ensure "data is always laid out in the order of use"? If not, how can I predict "the order of use" in all cases? Do I need to prototype all the routines used by a sample of the actor types in a high-level language, and then reorder the fields to be either after or one bit different from the previous field before translating the routines to Z80/LR35902?


Top
 Profile  
 
PostPosted: Fri Jul 28, 2017 8:24 am 
Offline

Joined: Sun Mar 27, 2011 10:49 am
Posts: 219
Location: NYC
Well, I feel as though I've suggested this before, but the best rule of thumb in general is probably "structs of arrays" rather than "arrays of structs".

Otherwise you do your best, focus on the needs of the most speed-sensitive code, maybe iterate on your design a few times, and if worst comes to worst, you might need to suck it up and manually index - which, as long as you keep your data page-aligned and/or so that it never crosses page boundaries, really isn't too bad: we're talking about, in 6502 terms, no more than a handful of extra cycles. Outside of a vblank handler, it's extremely rare that you really need to worry about that in the average game.


Top
 Profile  
 
PostPosted: Fri Jul 28, 2017 8:40 am 
Offline
User avatar

Joined: Wed Feb 13, 2008 9:10 am
Posts: 597
Location: Estonia, Rapla city (50 and 60Hz compatible :P)
Jaa, sometimse you just got to suck it up. My process is iterative, with refactors as new ways to improve something present themselves. Prototyping in a higher level language probably gets you somewhere sooner.

_________________
http://www.tmeeco.eu


Top
 Profile  
 
PostPosted: Thu Nov 30, 2017 12:42 pm 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19355
Location: NE Indiana, USA (NTSC)
Found this via Why "logic" is bullshit (RANT):
In this post, Stef wrote:
Both the 6502 and GB-CPU have a small instruction set, but the advantage is on the 6502 side is with the addressing modes for the instructions and less bottle necks for the processor architecture/design.
[...]
If you're specifically doing bitmap drawing effects (software blitting) at the given GB screen resolution, then relatively speaking the GB 1mhz CPU is might be faster for the task than NES at its native resolution. But in absolute terms, and a normal game engine, I would say that it's behind.

I'm continuing to read through Stef's posts in that topic to see if it addresses the problem I'm seeing, that of random access to the properties of a particular element of an array of objects.


Top
 Profile  
 
PostPosted: Thu Nov 30, 2017 1:28 pm 
Offline

Joined: Sun Mar 27, 2011 10:49 am
Posts: 219
Location: NYC
(all cycle counts below for the GBZ80 are /4, for easier comparison with the 6502):

Assuming the index you want is in A and the array is page-aligned and < 256 bytes,

Code:
LD H, ArrayBase >> 8
;; multiply A by the size of each object - the 6502 has to do the same, so this can be discarded for comparison
ADD FieldOffset
LD L, A
LD A, [HL]


That's 7 cycles.

Better yet, use arrays of one byte fields, rather than fields of arrays, and you can get rid of the ADD FieldOffset, saving 2 cycles:

Code:
LD H, ArrayBase >> 8
LD L, A
LD A, [HL]


That's 5 cycles.

If the array isn't page-aligned but doesn't wrap past page boundaries, add another two cycles to ADD ArrayBase & $00FF.

Code:
LD H, ArrayBase >> 8
;; multiply A by the size of each object - the 6502 has to do the same, so this can be discarded for comparison
ADD FieldOffset
ADD ArrayBase & $00FF
LD L, A
LD A, [HL]


That's 9 cycles.

If the array isn't page-aligned and might wrap, add two cycles on no-wrap, or four on wrap:

Code:
LD H, ArrayBase >> 8
;; multiply A by the size of each object - the 6502 has to do the same, so this can be discarded for comparison
ADD FieldOffset
ADD ArrayBase & $00FF
JR NC, .nc
INC H
.nc:
LD L, A
LD A, [HL]


That's 11-13 cycles.

So: if you don't take care with your memory layout, you lose performance. But nothing about what you want to do is impossible or even difficult. It's usually somewhat slower than the 6502 but not terribly so, depending on a bunch of platform-specific factors on both sides (is your base address in zero page? your index in one of the index registers? do you cross page boundaries? what kind of math do you need to do to get the final index? etc etc). Usually this doesn't matter too much, unless you're in a tight loop - in which case you write/arrange your memory for speed, and the GBZ80 holds its own just fine against the 6502 in that context.

Far and away the biggest problem here is that code for a fully general operation is a bit unwieldly. Wrap it in a macro if you'd like.

This is a non-issue.


Top
 Profile  
 
PostPosted: Thu Nov 30, 2017 1:31 pm 
Offline
User avatar

Joined: Thu Sep 15, 2016 6:29 am
Posts: 467
Location: Denmark (PAL)
This has probably been covered well in the thread already, but I think you can get pretty far without indexed adressing.

As I recently revealed, I've been perusing the entire source code for Donkey Kong, and it's actually surprisingly rare that it uses the IX and IY index registers that the Z80 has over the 8080. Instead it uses a lot of INC and DEC on the 16 bit HL register, and adds the DE register to cycle through object tables. It's in no way as elegant as what we are used to with the 6502, but having access to 16 bit additions on registers used for addressing opens up a whole new toolset - I'm assuming that's also possible on the Game Boy's CPU.


Top
 Profile  
 
PostPosted: Thu Nov 30, 2017 1:35 pm 
Offline

Joined: Sun Mar 27, 2011 10:49 am
Posts: 219
Location: NYC
You can, and you can also auto-increment/decrement HL on reads for free.


Top
 Profile  
 
PostPosted: Thu Nov 30, 2017 1:45 pm 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19355
Location: NE Indiana, USA (NTSC)
It might be easier for readers to appreciate how this is a non-issue if we try a concrete example. A 6502 subroutine to simulate movement of an object of under influence of gravity might look like this:
Code:
GRAVITY = 48  ; /256 pixel per frame^2
NUM_ACTORS = 16

; The following arrays are not aligned to the start of a page,
; but none cross a page boundary.
.bss
.align NUM_ACTORS

; Displacement from top of map in 1/256 pixel units
; range: 0.000 to 32767.996 pixels
actor_ysub: .res NUM_ACTORS
actor_y: .res NUM_ACTORS
actor_yhi: .res NUM_ACTORS

; Signed velocity in 1/256 pixel per frame units
; range: -16.000 to 16.000 pixels/frame
actor_dysub: .res NUM_ACTORS
actor_dy: .res NUM_ACTORS

.code

move_actor_x_vertically:
  ; Step 1: Apply acceleration due to gravity
  clc
  lda #GRAVITY
  adc actor_dysub,x
  sta actor_dysub,x
  lda #0
  adc actor_dy,x
  sta actor_dy,x

  ; Step 2: Add velocity to displacement
  clc
  lda actor_dysub,x
  adc actor_ysub,x
  sta actor_ysub,x
  lda actor_dy,x
  adc actor_y,x
  sta actor_y,x

  ; Sign extend the velocity
  lda actor_dy,x
  and #$80
  beq :+
    lda #$FF
  :
  adc actor_yhi,x
  sta actor_yhi,x
  rts


I'd like to see the idiomatic translation of this to Z80 or LR35902. Perhaps what I'm missing is some sort of insight on how "and adds the DE register to cycle through object tables" plays out in practice.


Top
 Profile  
 
PostPosted: Thu Nov 30, 2017 3:28 pm 
Offline

Joined: Sun Mar 27, 2011 10:49 am
Posts: 219
Location: NYC
Presumably you'd want to do this to each of your actors in a loop? Then I'd probably do it like this:

Code:
GRAVITY = 48  ; /256 pixel per frame^2
NUM_ACTORS = 16

.bss
; Displacement from top of map in 1/256 pixel units
; range: 0.000 to 32767.996 pixels
.align NUM_ACTORS
Actor_Y: .res NUM_ACTORS * 3

; Signed velocity in 1/256 pixel per frame units
; range: -16.000 to 16.000 pixels/frame
.align NUM_ACTORS
Actor_DY .res NUM_ACTORS * 2

.code

ApplyGravityToVelocities:
  LD HL, Actor_DY
  LD B, NUM_ACTORS
  LD C, GRAVITY
.loop:
  LD A, [HL]
  ADD C
  LD [HLI], A
  JR NC, .nc
  INC [HL]
.nc:
  INC L ; skip past high-byte
  DEC B
  JR NZ, .loop
  RET

ApplyVelocities:
  LD DE, Actor_Y
  LD HL, Actor_DY
  LD B, NUM_ACTORS
.loop
;; add low byte
  LD A, [DE] ; get ysub
  ADD [HL] ; add dysub
  LD [DE], A ; set ysub
  INC E
  INC L
;; add middle byte
  LD A, [DE] ; get y
  ADC [HL] ; add dy + carry(ysub + dysub)
  LD [DE], A ; set y
  INC E
;; adjust high byte
  LD A, [HLI] ; get dy(hi), move to next dy(lo)
  BIT 7, A
  JR Z, .pos
.neg:
  JR C, .next
  LD A, [DE]
  DEC A
  LD [DE], A
.pos:
  JR NC, .next
  LD A, [DE]
  INC A
  LD [DE], A
.next
  INC E
  DEC B
  JR NZ, .loop
  RET


I think that's right? It's not tested, and I'm not totally sure I understand the signed arithmetic bit.


Top
 Profile  
 
PostPosted: Thu Nov 30, 2017 5:36 pm 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19355
Location: NE Indiana, USA (NTSC)
I would want to apply gravity to actors with some types and not to actors with other types. Based on the code you presented, you appear to suggest structuring the actor update loops in a form that conceptually resembles the single-instruction, multiple-data (SIMD) approach used by shaders on modern GPUs:

  1. Apply step 1 to all actors.
  2. Apply step 2 to all actors.
  3. Apply step 3 to all actors.

But if the table contains both actors that fall and actors that do not fall, each step needs to include a determination of whether to apply or skip the step for a particular actor:

  1. Apply step 1 to those actors whose combination of type and state uses step 1.
  2. Apply step 2 to those actors whose combination of type and state uses step 2.
  3. Apply step 3 to those actors whose combination of type and state uses step 3.

In this sort of SIMD-like structure, I don't see how you'd enable or disable individual steps based on an actor's type and state without using additional register pairs for pointers to lookup tables from type and state to bitfields of which steps shall be performed on objects in that type and state.

Please forgive my naivete. It's just that arranging enemy AI using SIMD rather than straight-through code is an entirely new concept to me.


Top
 Profile  
 
PostPosted: Thu Nov 30, 2017 5:52 pm 
Offline

Joined: Sun Mar 27, 2011 10:49 am
Posts: 219
Location: NYC
I think you can pretty straightforwardly adapt my code to only apply the operations to one actor - the only real difference would maybe be expanding the Y position to 4 bytes, to index into the array with two quick shifts.

I just wrote the code as I would have written it, without having that additional criteria in mind. I tend to work in the "SIMD" style, and it often does help things on the Game Boy because of the auto-incrementing. But in retrospect I should have kept it more similar to your original code - I probably muddled things, especially because this code didn't end up really needing to be very SIMD-ish.

Code:
;;; A - contains actor # to apply gravity to
ApplyGravityToVelocity:
  PUSH HL
  LD H, Actor_DY >> 8
  RLA ; multiply by two, since each velocity is two bytes
  ADD Actor_DY & $00FF ; add the array base, since we're not necessarily page aligned
  LD L, A
  LD A, [HL]
  ADD GRAVITY
  LD [HL], A
  POP HL
  RET


If it's only gravity that's selective, then you can still do the actual motions using the separate routine I gave, all at once.

Code:
.bss
.align PAGE
Actor_UsesGravity: .res NUM_ACTORS

.code

UpdateActors:
  CALL ApplyGravity
  CALL ApplyVelocities
  RET ; obviously this tail call could be optimized out

ApplyGravity:
  LD HL, ActorUsesGravity ;; page aligned, so L = 0
  LD B, NUM_ACTORS
.loop:
  LD A, [HL]
  AND A
  JR NZ, .next
  LD A, L
  CALL ApplyGravity ; note that I made the above subroutine preserve HL!
.next:
  INC L
  DEC B
  JR NZ, .loop
  RET


Top
 Profile  
 
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 30 posts ]  Go to page Previous  1, 2

All times are UTC - 7 hours


Who is online

Users browsing this forum: No registered users and 1 guest


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
Powered by phpBB® Forum Software © phpBB Group