I know I'll have to do it in software, but I don't feel like reinventing the wheel. Anyone has good, fast code that would allow me to do both? Preferably something I can just jsr into.
Wiki: 8-bit Multiply
CC65 has some multiply/divide routines: (see umul/udiv/imul/idiv)
CC65 GitHub runtime module
Keldon recently shared a multiply using a lookup table:
Forum: Relatively fast multi, ...
In general, you can make it faster with lookup tables, but they take up space.
There's unfortunately a lot of different needs for multiplication (e.g. 8 bit x 8 bit = 16 bit? 16 x 16 = 32? signed/unsigned? remainder or fixed point?) so it might be tricky to find a "generic" one that does exactly what you want efficiently.
If you want to try writing your own, the simple method isn't really that tricky. If you know how to do long multiplication / long division by hand, it's basically just this technique in binary (i.e. for multiply it's a left shift x2 for each row instead of x10, but you still just add up the rows in the end).
Finally, there's an even simpler version where you can implement multiply as just a repeated add, or divide as counting a repeated subtract. This is slow and inefficient in the general case, but if your numbers are small or you're in a situation where the result is cumulatively maintained, they can be pretty effective anyway.
If you absolutely know you need multiplication of a factor of 2, bit shift is much faster.
I've used a table of answers before, for example, if I always need to multiply by 10.
LDA table, x
May I ask why do you need multiplication and division in a 6502 program? Most games can get away without them, even when more complex physics is involved. Sometimes you can make do with approximations, so you can use smaller look-up tables and faster code. The only reason I'm using multiplication is because I'm working on a raycaster, which involves trigonometry, but even then I try to use look-up tables as much as possible to avoid/simplify the real-time math.
If you add DEY to the start of the subroutine you can drop the CLC inside the loop. Saves <=14 cycles.rainwarrior wrote:I already linked Wiki: 8-bit Multiply but tepples just added an even nicer variation to it.
Also, unrolling that loop saves like ~40 cycles at the cost of ~50 bytes. Well worth it, IMO. 50 bytes is nothing.
I think that's valid, yeah.pubby wrote:If you add DEY to the start of the subroutine you can drop the CLC inside the loop. Saves <=14 cycles.
It would probably need a check for 0 before the DEY, which I guess would add 6 more cycles, so the difference is from -4 to 10 cycles? Though you also get a very quick Y=0 case in the bargain.
Edit: or wait, never mind, 255 + carry is going to be 0 anyway. I guess you don't actually need to check at all. So, 0 to 14 cycles, yeah.
I guess cc65's umul8x8r16 is actually the same as tepples' algorithm, just with slightly less parsimonious return value storage:
https://github.com/cc65/cc65/blob/maste ... ul8x8r16.s
Not a new method, just different code.
It's equivalent to tepples' method with the (non-unrolling) optimization pubby suggested. Also it doesn't bother to initialize A and treats it as an extra "add" input.Kasumi wrote:Haven't checked if it's better than the ones on the wiki, but here's another player from atariage: http://atariage.com/forums/topic/71120- ... /?p=896028
Not a new method, just different code.
prodlo will remain zero, but register A will end up non-zero. So checking for Y=0 is still necessary.rainwarrior wrote:Edit: or wait, never mind, 255 + carry is going to be 0 anyway. I guess you don't actually need to check at all. So, 0 to 14 cycles, yeah.
- Formerly WheelInventor
- Posts: 2032
- Joined: Thu Apr 14, 2016 2:55 am
- Location: Gothenburg, Sweden
Code: Select all
mult32: asl mult16: asl mult8: asl mult4: asl asl rts div32: lsr div16: lsr div8: lsr div4: lsr lsr rts
You can avoid the addtional 12 cycles that jsr and rts cost together by hardcoding it into your specific
function. If you can figure out a siginificant var or const to be a value so that the operation requires even fewer lsr/asl:s to get the desired value; even better.
Several integer and floating point math examples can be found at http://6502.org/source/ - i haven't looked them up but came across it when i was looking for a reference on how many cycles the above would take.
The "interesting" part is that It exits early whenever the right-shifting multiplier reaches zero, which is fundamentally faster than performing all the iterations for the bit width if one of your values is small. I also extended it to 8*8=16bit which can also function as 16*8=16bit, still performing the early exit. There's plenty of other multiplication & division algorithms under the math section in there as well, though I think all the other primary mult techniques have been covered here so far.