Truth be told, after cranking this out in a week and change, my brain doesn't want to jump right back into 65xx ASM just yet. It would much rather finish up Star Ocean 3 and possibly start roughing out a new C++ game.
The block preview mode in your TG codebase is a nice idea; I might rig something similar in a later version of mine. On the whole, though, I'm certain my code is slower. Case in point: my 16-bit multiply:
Code:
sta mul16Flag2 ; what? You want I should preserve the negative flag on a junk call?
; basic shift-and-add method
; keep halving mul*1 and popping bits off mul*2
; if the bit off mul*2 is a 1, add the remaining mul*1 to result
jsr rsh16 ; since the highest power place in mul*2 is 1/2
jsr lsh16 ; which pops the shifted-out bit into carry
bcc mul16_loop_no_add ; so we can act on it right away
cpx #15 ; after 15 rshs, we're guaranteed to have 0 in the rsh input
; visual break to bookend the loop
sta mul16Flag2 ; safe, since we know we can't have overflowed, so only the sign flags might be unequal, producing a negative
; for reference, the above math is done on sign-magnitude (not 2's-complement) 16-bit fractions: highest place is 1/2, lowest is 1/65536, plus a flag byte consisting of 6 unused bits followed by an overflow flag and a negative flag
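For anyone who'd rather not read 6502, here is a rough Python model of the shift-and-add idea described in those comments. It's a sketch under assumptions, not the actual routine: the names `mul16`, `to_float`, `NEG_FLAG` and `OVERFLOW_FLAG` are made up for illustration, the flag bit positions are guessed from the "6 unused bits, then overflow, then negative" description, and the loop bound is simply 16 passes rather than whatever the real `cpx` check does.

```python
# Hypothetical flag layout: 6 unused bits, then overflow, then negative.
NEG_FLAG = 0x01       # assumed: lowest bit of the flag byte
OVERFLOW_FLAG = 0x02  # assumed: next bit up (never set by a multiply)


def mul16(mag1, flags1, mag2, flags2):
    """Multiply two sign-magnitude fractions; returns (magnitude, flags).

    Each magnitude is a 16-bit unsigned value meaning mag / 65536,
    so the top bit is worth 1/2 and the bottom bit 1/65536.
    """
    result = 0
    for _ in range(16):
        mag1 >>= 1                    # like rsh16: halve the first operand
        carry = (mag2 >> 15) & 1      # like lsh16: top bit pops into carry
        mag2 = (mag2 << 1) & 0xFFFF
        if carry:                     # that bit of mag2 was a 1
            result += mag1            # so add what is left of mag1
        if mag1 == 0:                 # nothing left to add, stop early
            break
    # Both inputs are below 1, so the product cannot overflow;
    # only the sign needs combining, and unequal signs give a negative.
    flags = (flags1 ^ flags2) & NEG_FLAG
    return result, flags


def to_float(mag, flags):
    """Convert a (magnitude, flags) pair to a Python float for checking."""
    value = mag / 65536
    return -value if flags & NEG_FLAG else value
```

As a sanity check, `to_float(*mul16(0xC000, NEG_FLAG, 0x4000, 0))` comes out to -0.1875, i.e. -0.75 times 0.25. Because the halved operand is truncated on every pass, the model can land up to (but not including) 16 units of 1/65536 below the exact product.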
Kinda important for mandelbrot, and yet every call probably spends more cycles shifting stuff between the inputs of my various other routines than accomplishing actual computation. That, and I use straight-up shift-and-add. I'd have to spend an hour parsing your innermost multiply code before it would make total sense, but it looks like you take some shortcuts at the higher levels. I almost used 32-bit precision, but by the time I wrapped my head back around the math to see how easy it was, I was too lazy to go and change all my zeropage allocation for more subroutine input bytes.
I also have a nice little restraining order in there called itersPerNMI, which I've set quite low indeed for the sake of the music. Come to think of it, I should reset my counter in the NMI routine rather than the mandelbrot loop, since that's not the only place I ever waitNMI... *changes code* ... great. Now it chugs even more.
I could just dec the address, rather than dey and reload y, to catch what are probably NMIs that occur just before I'd wait for an NMI, but that would cost 5 cycles in my inner loop as opposed to 2... bleh. Clearly more work is needed.
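To put a number on that 5-versus-2 trade-off, here is a back-of-the-envelope sketch in Python. It assumes an NTSC NES (roughly 29780.5 CPU cycles between NMIs) and that the counter lives in zeropage, where `dec` takes 5 cycles against 2 for `dey`. The function name and the iteration counts you feed it are hypothetical, not measurements from the actual program.

```python
CYCLES_PER_FRAME = 29780.5  # assumed: NTSC NES CPU cycles per video frame
DEY_CYCLES = 2              # dey (implied addressing)
DEC_ZP_CYCLES = 5           # dec on a zeropage address


def extra_counter_cost(iterations_per_frame):
    """Return (extra cycles per frame, fraction of a frame) for dec vs dey."""
    extra = (DEC_ZP_CYCLES - DEY_CYCLES) * iterations_per_frame
    return extra, extra / CYCLES_PER_FRAME
```

For instance, `extra_counter_cost(100)` gives 300 extra cycles, about 1% of a frame. So whether the slower counter hurts depends mostly on how many iterations actually fit between NMIs.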
When I allow as much frameskip as is needed to crunch out an entire tile before actively waiting, iirc it runs a good deal faster. But the music hiccoughs something fierce.
Is that an iso I see with FractalEngine? Meaning I could run it on my actual TurboDuo? Crazy talk! 8) I'll have to get by on Nestopia and good sense for mine unless I can scrounge a dev cart and/or EEPROM burner.
edit: new version uploading as I type. My iterations-per-frame counting was way off, so my wait calls were eating a lot of time. I decided to nix the whole iteration-counting deal and instead just let frameskips happen and update the music as needed. The result cuts runtime to 75% of what it used to be.