Flight Minigames (canceled)

Moderator: Moderators

tepples
Posts: 22705
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Re: Flight Minigames

Post by tepples »

43110 wrote:I think should start looking into tepples sound engine. I believe it's pretty much music.s sound.s and musicseq.* in RHDE.

I do like the 64 note range of famitone2
Patterns in my music engine support a TRANSPOSE command that changes the base note for the remainder of a pattern. For example, TRANSPOSE,12 shifts the rest of the pattern up by an octave, and RHDE's combat theme uses this. It just has the two-octave range for a more compact encoding of the common case.
and my edit that allows different tempo and pitch setups.
Which "tempo and pitch" setups are you talking about?

(And is this tune happy or crappy?)
JRoatch
Formerly 43110
Posts: 422
Joined: Wed Feb 05, 2014 7:01 am
Contact:

Re: Flight Minigames

Post by JRoatch »

ahh, two-octave local range per pattern range makes sense. The 64 note range I'm talking about is the period table.

Tempo and pitch for adjusting between NTSC, PAL, and Dendy TV systems. I hacked that in famitone2 by using the upper two bits of a variable.
tepples wrote:(And is this tune happy or crappy?)
umm, both? It'll be good for the flappy mode featuring a butterfly. Unless you want to exclusively use it for that shoot-em-up game.
BruceMcF
Posts: 4
Joined: Wed May 28, 2014 7:57 pm

Re: Flight Minigames

Post by BruceMcF »

43110 wrote: Also this is probably the one piece of code that the cpu will spend a good bulk of it's time on.

Code: Select all

;;
; multi purpose multi byte operation
; [API sniped but will be described in full source]
bignum_op:
  ldx temp+4 ; edit, don't touch temp+4
-
  lda (temp+0), y
  eor temp+5    ; a value of $ff will make this a subtraction
  adc (temp+2), y
  sta (temp+0), y
  iny
  dex ; edit, don't touch temp+4
  bne -
  rts
If you are using these for four byte values, then this lets you set the size once for a series of operations without having to reset it all the time. But count up how many times they appear ... if they appear too many times, the overhead of clear and carry and setting up the EOR byte will overwhelm the 11 bytes saved, and you are better off with:

Code: Select all

bignm_add:
  clc
  ldx temp+4
-
  lda (temp+0), y
  adc (temp+2), y
  sta (temp+0), y
  iny
  dex
  bne -
  rts

bignm_sub: ; above subtracts 0 from 2 and stores in 0 ~ is that the right way around?
  sec
  ldx temp+4
-
  lda (temp+2), y
  sbc (temp+0), y
  sta (temp+0), y
  iny
  dex
  bne -
  rts 
Its only a narrow range of number of calls of bignum_op where it actually nets saving space, when extra calling overhead is taken into account. If there are like two routines that add and two that subtract, you should be ahead, by a few bytes, but even at four routines that do each, the extra call overhead may offset the squeezing of bytes from the called routines.

Similar for clear and copy:

Code: Select all

big_cpy:
  ldx temp+4
-
  lda (temp+2), y
  sta (temp+0), y
  iny
  dex
  bne -
  rts

big_clr:
  ldx temp+4
  lda #0
-
  sta (temp+0), y
  iny
  dex
  bne -
  rts 
... allow clearing and copying blocks at a time (set temp+4=y=0), but used during four byte FP integer operations, allows temp+4 to hold the same size index and allows avoiding set-up overhead. There's a double purpose version using "AND temp+5", but at a net saving of 9 bytes, its a close run thing whether extra set-up overhead swamps the crunching of the factored operation.

In the middle case, if the program flow allows the temp+5 set-up to be stable over multiple operations, so there is not much overhead setting temp+5, but the carry flag is set or cleared more than four times total in calling big_num, you save space just by setting up distinct entry points to do that:

Code: Select all

;;
bignm_sub:
  sec
  bcs +
bignm_add:
  clc
+
  ldx temp+4
-
  lda (temp+0), y
  eor temp+5    ; a value of $ff will make this a subtraction
  adc (temp+2), y
  sta (temp+0), y
  iny
  dex
  bne -
  rts
... though I still wonder whether its:

OP2 - OP1 =: OP1

... or ... OP1 - OP2 =: OP1 ... in which case it should be "lda (temp+2),y", etc.
JRoatch
Formerly 43110
Posts: 422
Joined: Wed Feb 05, 2014 7:01 am
Contact:

Re: Flight Minigames

Post by JRoatch »

I've updated the first posted the work I have done this month (v0.2). There's nothing apparently new in the compiled ROM, but the source has changed a bit. Additionally I'm now planing for 6 games instead of 8.

About what BruceMcF posted:
My motivation for even merging the two multi byte operations was to rid of sequences of jump subroutines, and the major optimization was actually having to do with loading all the prams from three byte tables (see line 1058 in the source of v0.2).

I was trying to preserve X through the whole chain of subroutine calls so that at the top I can do a loop like
ldx #0
-
jsr do_math_step
inx
cpx #32
bcc -
but looking again at what I have now, I really should change that to a counter in zeropage, since I had to make an unnecessary stack push in the multiplication subroutine.

Also if you want to see another thing I did with a hard to compute the trade off, I took an idea of having list of addresses in stack and made my game state selector push such a list instead of having a sequence of jsr in a subroutine.
BruceMcF
Posts: 4
Joined: Wed May 28, 2014 7:57 pm

Re: Flight Minigames

Post by BruceMcF »

43110 wrote:About what BruceMcF posted:

I was trying to preserve X through the whole chain of subroutine calls so that at the top I can do a loop like
ldx #0
-
jsr do_math_step
inx
cpx #32
bcc -
but looking again at what I have now, I really should change that to a counter in zeropage, since I had to make an unnecessary stack push in the multiplication subroutine.
Yes, its frequently better to use the X register in inner loops and an index register for outer loops.
Also if you want to see another thing I did with a hard to compute the trade off, I took an idea of having list of addresses in stack and made my game state selector push such a list instead of having a sequence of jsr in a subroutine.
Also, if you have a pool of precomputed "RTS addresses" in a known page, you can set up outer factor routines as jump targets:

Code: Select all

;;
routine1: ; routine cannot span a page, cannot start on 0 location in a page
   lda #>rtn1+beg-1
   ldx #>rtn1_end+1 ; rtn1_end generated from source so points to low byte
vec_call:
   sta temp+5
-
   lda vec_pool,x
   pha
   dex
   cpx temp+5
   bne -
   rts ; call set of vectors

routine2: ; see warnings above
   lda #>rtn2+beg-1
   ldx #>rtn2_end+1
   bne vec_call
... 
And further crunching in storage of the vector sequences is possible from vectors being allowed to overlap:

Code: Select all

;;
vec_pool:
rtn1_beg:
   doop1-1 ; first operation of routine1
   doop2-1
   doop3-1
rtn2_beg:
   doop4-1 ; first operation of routine2
   doop5-1
rtn1_end:
   doop6-1 ; last operation of routine1
rtn2_end:
   doop7-1 ; last operation of routine2
   ; ... 
The fixed space overhead of that particular stack stuffing is 12 bytes (there are of course post-indexed versions that allow the vectors to be anywhere in visible memory). The outer routines break even compared to identical subroutine factor code at six unique operations calls per outer routine, with one byte saved per inner operation call after that point. And of course any overlapping operations in the pool are basically free space to all but one of the calling routines.

So that rewards programming that relies on lots of little factors that are heavily re-used, lego-block style, and factor design that favors the use of overlapping "phrases" of the operation inner factors.
JRoatch
Formerly 43110
Posts: 422
Joined: Wed Feb 05, 2014 7:01 am
Contact:

Re: Flight Minigames

Post by JRoatch »

I had to read that several times before I got it. While I am aiming for subroutine reuse, I don't think things will logically interleave like that. But you did mention putting subroutines in the same page, and I then saw that the address really only have 10 bits for a 4KiB ROM. That means I'm now wasting 6 bits for every address in the list, and that's not too far from having a whole implied byte automatically inserted by the list loader.

Edit: Nevermind, I miscounted. 4KiB is 12 bits, causing only 4 bits of waste per address.

but before I start arranging assembly code like a falling block puzzle game, I need to program the pieces first.

On a related note, I have about 640 unused bits from the note_period table. What can I put there that uses only the upper 5 bits of 128 bytes?
BruceMcF
Posts: 4
Joined: Wed May 28, 2014 7:57 pm

Re: Flight Minigames

Post by BruceMcF »

43110 wrote:I had to read that several times before I got it. While I am aiming for subroutine reuse, I don't think I'll have things logically knitted like that. But you did mention putting subroutines in the same page, and I then saw that the address really only have 10 bits for a 4KiB ROM. That means I'm now wasting 6 bits for every address in the list, and that's not too far from having a whole implied byte automatically inserted by the list loader.
But you also want the list loader to be concise.

Without the "falling block puzzle game" side of it, you can have counted vectors anywhere in the ROM without a count byte by using the high bit to flag the beginning of the list, but after working that through, its a bit more intricate and brittle than I like.

Vectors with an index to their own end in front don't let you play the "sliding block puzzle" games, but they are easier to follow what is going on.

And there really is not any binding constraint with the vector pool approach in terms of how many vectors it supports. If you have so many lists of subroutine calls that you've used up one vector pool, you've saved more than enough space to cover the overhead for a second vector pool page.

Code: Select all

;;
;    NB. A "vector pool" is a page with a set of subroutine address vectors in it.
;    The leading byte is the position in the page of the last byte in the vector.

routine1: ; the first entry in vector pool 1.
   ldx #<vector1

go_pool1:
   stx temp+5
   lda vpool1,x
   tax
-
   lda vpool1,x
   pha
   dex
   cpx temp+5
   bne -
   rts ; call subroutine vectors in turn from stack page

routine2:
   ldx #<vector2
   jmp go_pool1

routine3: ; NB. routine3 is only four ops, so there is no space saving from the vector
   jsr op3_1
   jsr op3_2
   jsr op3_3
   jmp op3_4

routine4:
   ldx #<vector4
   jmp go_pool1

; ... and so on, until & unless the vpool1 page is full up, then (supposing it was 30 routines that filled it up):

routine31: ; the first entry in vector pool 2.
   ldx #<vector31

go_pool2:
   lda vpool2,x
   tax
-
routine1: ; the first entry in vector pool 1.
   ldx #<vector1

go_pool1:
   stx temp+5
   lda vpool2,x
   tax
-
   lda vpool2,x
   pha
   dex
   cpx temp+5
   bne -
   rts ; call subroutine vectors in turn from stack page

routine32:
   ldx #<vector2
   jmp go_pool2
...
... so the limit of distinct vectors to a single binary page is not an overall limit on the number of vectors altogether.

And if (as supposed above), 30 vectors filled up a vector page, then using the vector page saved 30 bytes, since each vector pool call is one byte shorter than passing a full address, which

But if you want to really crunch ... yeah the addresses can be packed. EDIT ~ yeah, 4K is 12 bits, so not just can't be bothered packing more than two, CAN'T pack more then two ~ 2 bits to 4 bits fixed put in place w/out more edit markets ~ the original was to crunch routines into the 4K "Golden RAM" in the C64 above the ROM in the memory map and below the I/O, but it was reconstructed from memory, and I got some of the details off ~ plus multiplied when I should have been dividing (oops). /EDIT

If there are an odd number of routines, its the address at the head of the list that is not unpacked, so in the vector pool, an odd list would like like (dummy, so op_# is the place of the op in the sequence, not the actual name, psuedo-assembler syntax):
.byte
.word op1
.byte <op_2, <op_3, ((>op_2 .AND. $0F)*16)+(>op_3 .AND $0F)
.byte <op_4, <op_5, ((>op_4 .AND. $0F)*16)+(>op_5 .AND $0F)
.byte <op_6, <op_7, ((>op_6 .AND. $0F)*16)+(>op_7 .AND $0F)
...

... with the packed address business best constructed as a macro in your favorite 6502 macro assembler, to get it right once and avoid lurking typos.

Note that the high bit of the address is stripped by the packing process and replaced by the routine.

Code: Select all

;;
;    NB. The leading byte is the size of the packed string of vectors in bytes.

routine1: ; the first entry in vector pool 1.
   ldx #<vector1

go_pool1:
   lda vpool1,x
   sta temp+5
   tax
-
   lda vpool1,x
   tay
   and #$0F
   ora #$80
   pha

   lda vpool1,x
   pha
   dex
   cmp temp+5
   beq +

   lda vpool1,x
   tya
   lsr a
   lsr a
   lsr a
   sec
   ror a
   pha

   lda vpool1,x
   pha
   dex
   cmp temp+5
   bne -
+
   rts ; call subroutine vectors in turn from stack page
44 vs 16 bytes, so 28 bytes extra space required by the processing. 56 pairs of subroutine vectors crunched together and that breaks even.

So that is worth considering if the vector pool page is more than half full.

And since each vector pool has its own engine, if you have own full vector pool page and one only quarter full, you can put packed lists in the full one, and unpacked lists in the only partly full one.

Personally, if I was using any, I'd use the simpler one, because its easier to just define the routines using ".word" assembler instructions with the label for the routine.

This is all a kind of building a Forth-like inner interpreter without a Forth assembler/compiler system running in the NES.
JRoatch
Formerly 43110
Posts: 422
Joined: Wed Feb 05, 2014 7:01 am
Contact:

Re: Flight Minigames

Post by JRoatch »

Well I tried, but Summer is on me and I have no time until the start of the next school year in August. I look forward to the multi-cart production.

Before I go, here's the list of games I had planned in this:
  • Game A: like the Helicopter game
  • Game B: An upside-down version of Balloon Fight physics. What should the character be?
  • Game C: A Flying Saucer that travels in a sine wave unless you press A, (I can place the sine table in the high byte of the note period table)
  • Game D: A paper airplane that always travels ±45°.
  • Game E: An electric spark that travels on one of two lanes. The A button switch between the two.
  • Game F: flappy butterfly A fluttering butterfly.
BruceMcF
Posts: 4
Joined: Wed May 28, 2014 7:57 pm

Re: Flight Minigames

Post by BruceMcF »

BruceMcF wrote:

Code: Select all

;;
;    NB. A "vector pool" is a page with a set of subroutine address vectors in it.
;    The leading byte is the position in the page of the last byte in the vector.

routine1: ; the first entry in vector pool 1.
   ldx #<vector1

go_pool1:
   stx temp+5
   lda vpool1,x
   tax
-
   lda vpool1,x
   pha
   dex
   cpx temp+5
   bne -
   rts ; call subroutine vectors in turn from stack page
... {etc.}
Note that a second way to use the redundant 4 bits per address is as data.

List: .byte <op1, >op1 .AND $0F, <op2, >op2 .AND $0F, ...

Call:
A-reg: index to first address in list
X-reg: index of first address of following routine

Code: Select all

routine1:
   lda #<lst_rtn1
   ldx #<lst_rtn2
   jmp call_vec
...

call_vec:
   sta temp+5
-
   dex
   lda vec_pool,x ; high byte address, data in high nybble
   and #$0F
   ora #$80
   pha
   dex
   lda vec_pool,x ; low byte address
   pha
   cpx temp+5
   bne -
   rts 
This packs operation scripts of up to 128 routines total into a page, since the length information is entirely embedded by the call. Having an all-address pool makes it reasonably straightforward to use the high nybble of the even address as 128 nybbles of data.

I'm thinking of butterfly as something that would fly with a bit of unpredictable imprecision, so it's something where a wrap-around sequence of +/- trim values could be laid out to make it "fluttery", but various processes that could be given character with a cycle of signed 4 bit integers (-8 through +7) could work as well.

nnyb2int: ; signed high nybble to signed integer
php ; remember sign
lsr a
lsr a
lsr a
lsr a
plp
bpl +
ora #$F0
+ rts

; flutvflo = $07 for 16, 8-step cycles; $0F for 8, 16 step cycles, $1F for 4, 32 step cycles
flutvflo:
.byte $0F

inc_flutter: ; increment wrap around based on mask in flutvflo:
inc flutter
lda flutter
bit flutvflo
bne +
clc
sbc flutvlfo ; this is a trick ~ "clc / sbc N" is the same as "sec / sbc {N+1}", and the mask+1 is the wrap-around
sta flutter
+ lsr a
tax
inx
lda vec+pool,x ; data is in high nybble of high byte of address
rts

Have a good summer!
JRoatch
Formerly 43110
Posts: 422
Joined: Wed Feb 05, 2014 7:01 am
Contact:

Re: Flight Minigames

Post by JRoatch »

I been studying a self made disassembly of a old Nintendo game, and it's been an interesting experience, so I guess I'll bump this thread for the question I want to ask instead of making another thread. I hope I can get this game started again.

Suppose I do all the logic of the entire game by lots of small subroutines. Some subroutines might be used only once and inlined or be called as the last thing in another subroutine and jmp'ed to. A conditional jump might be optimized into a direct branch by rearranging code, or a conditional return might branch to a rts above the routine if the one below is to far away.

Is there some tool that can generate a call graph and automatically make these optimizations? If not, what parser can I work off of to quickly make my own optimizer? (I'm ok with python stuff)
tepples
Posts: 22705
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Re: Flight Minigames

Post by tepples »

Years ago, I wrote a program in Python to make a call graph of a ca65 program. This is what the graph looked like. Should I try posting the program?
JRoatch
Formerly 43110
Posts: 422
Joined: Wed Feb 05, 2014 7:01 am
Contact:

Re: Flight Minigames

Post by JRoatch »

Sure, why not.
JRoatch
Formerly 43110
Posts: 422
Joined: Wed Feb 05, 2014 7:01 am
Contact:

Re: Flight Minigames (canceled)

Post by JRoatch »

I've given up on this project so that I can clear my mind for the planning of another nes game, and to give time to gear myself up for the next competition.
Post Reply