## Unsigned Integer Division Routines

Discuss technical or other issues relating to programming the Nintendo Entertainment System, Famicom, or compatible systems.

Moderator: Moderators

tepples
Posts: 22017
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

### Re: Unsigned Integer Division Routines

Unless, of course, you use tokumaru's solution of storing world-relative and nametable-relative camera positions separately.

Omegamatrix
Posts: 35
Joined: Tue Jun 10, 2014 8:15 pm

### Re: Unsigned Integer Division Routines

I've updated the divide by 22 with a quicker routine. Same amount of bytes as before.

Old:

Code: Select all

``````;Divide by 22
;1/22 = 1/11 * 1/2
;21 bytes, 37 cycles
sta  temp
lsr
lsr
ror
ror
ror
lsr
ror
lsr
lsr
lsr
lsr``````
New:

Code: Select all

``````;Divide by 22
;21 bytes, 34 cycles
lsr
cmp  #33
sta  temp
lsr
ror
ror
lsr
ror
lsr
lsr
lsr``````

chitselb
Posts: 6
Joined: Sat Jun 14, 2014 2:14 pm

### Re: Unsigned Integer Division Routines

The idea behind Rad50 http://en.wikipedia.org/wiki/RADIX-50 is that 40x40x40 = 64000, so if you limit your character set to 40 characters (26 letters + 10 digits + 4 whatevers) you can squeeze three bytes of text into a two byte unsigned, every time. 33% compression.

In these routines, the divisor is 8-bit. Scaling up to 16-bit e.g. LSR becomes "LSR x+1; ROR x" introduces inaccuracies once the divisor is greater than 256. 28741 /10 = 2862

chitselb
Posts: 6
Joined: Sat Jun 14, 2014 2:14 pm

### Re: Unsigned Integer Division Routines

I played around with it in a spreadsheet, and I noticed this. It's 32-bit math, but it doesn't have to be an obscene amount of 6502 code

(x*13107+13100)/65536 = x/5

and 13107 = 3 + 48 + 192 + 12288 (triple it; add and shift 4; add and shift 4; add and shift 4; add (and shift 4))

Here it is in C

Code: Select all

``````#include <stdio.h>
int main(void)
{
int z=0;
int i;
for (i=0; i<64000; i++) {
int t=0;
int a=i*3;
int j;
for (j=0; j<4; j++) {
t += a;
a *= 16;
}
t += 13100;  /* a fudge factor */
int x=(t/65536)>>3;
int y=(i/40);
z += abs(x-y);
printf("%6d  %6d  %6d  %6d\n",i,x,y,z);
}
}``````

tepples
Posts: 22017
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

### Re: Unsigned Integer Division Routines

In practice, I wonder whether base 32 might be quicker to decode and nearly as effective at compression. See ITA2 and Z-character.

doynax
Posts: 162
Joined: Mon Nov 22, 2004 3:24 pm
Location: Sweden
Contact:

### Re: Unsigned Integer Division Routines

Omegamatrix wrote:I had a simple idea for finding the divisions. I used a brute attack approach running every possible combination in an excel sheet. I only tested for 1/x and ignored truncation... I just looked for division combinations that were the most accurate under perfect circumstances. Those were the routines I focused on testing

From there I built a bunch of tools. I made another excel sheet to test division for all input values (0-255), some C programs to do the same thing as both excel sheets, a rom for the 2600 to test for verification and help solve correction factors for some routines. Whenever you see ADC #xx in one of the routines you are looking at a correction factor I had to introduce. Sometimes I found I could also use just CLC or SEC at certain points in the routine to do the correction.
Clever. I think I'll keep these routines handy for future needs.

I don't suppose you've considered rigging up a super-optimizer and letting it loose on the problem? It seems like a relatively limited number of permutations to search through, though I suppose the immediate fudge factor might cause trouble.

I know I've occasionally wished that someone would take the time to write a good one for the 6502. Ideally with a flexible scheme for evaluating the results, a flexible opcode/immediate search space and specifying other constraints.

It just seems like 6502 optimization in practice as often as not boils down to a puzzle of the form: shuffle bit X into register Y at precisely cycle Z while preserving carry. What remains is then a brain-dead task of working through the possible permutations until you find the best one.

chitselb
Posts: 6
Joined: Sat Jun 14, 2014 2:14 pm

### Re: Unsigned Integer Division Routines

Here's what I came up with. Probably not optimal, and at 179 bytes, way more code than I want. It's a Forth word in http://pettil.tumblr.com if you're wondering what is all that weird stuff on the fringe of the code

Code: Select all

``````;--------------------------------------------------------------
#if 0
name=40/MOD
stack=( u -- u%40 u/40 )
tags=math
Perform a divide by 40 and a modulo 40, for
#endif
slmod40
stx storex
lda #0
ldx #7
slmod40a
sta n,x
dex
bpl slmod40a                ; zero n+0..n+7
ldx #2
slmod40b
clc
lda n                       ; addend = n*3
sta n
lda n+1
sta n+1
bcc slmod40c
inc n+2
slmod40c
dex
bpl slmod40b
lda #4
sta n+8
slmod40d
clc
lda n+0
sta n+4
lda n+1
sta n+5
lda n+2
sta n+6
lda n+3
sta n+7                     ; sum += addend

ldy #4
slmod40e
asl n
rol n+1
rol n+2
rol n+3                     ; addend << 4
dey
bne slmod40e
dec n+8                     ; repeat 4x (3+48+768+12288 = 13107)
bne slmod40d

clc
lda n+4
sta n+4
lda n+5
sta n+5
bcc slmod40f
inc n+6
bne slmod40f
inc n+7                     ; sum += 13100 (fudge factor)
slmod40f
lda n+7                     ; (n+6) is now u/5
lsr
ror n+6
lsr
ror n+6
lsr
ror n+6
sta n+7                     ; (n+6) = u/40

lda n+6
sta n+0
lda n+7
sta n+1
sty n+2
sty n+3
asl n
rol n+1
asl n
rol n+1                     ; n = u/40*4
;clc
lda n
sta n
lda n+1
sta n+1                     ; n = u/40*5
asl n
rol n+1
asl n
rol n+1
asl n
rol n+1                     ; n = u/40*5*8
sec
lda tos
sbc n
sta tos
lda tos+1
sbc n+1
sta tos+1                   ; tos =  u % 40
lda n+6
ldy n+7                     ; push   u / 40
ldx storex
jmp pushya
``````

Omegamatrix
Posts: 35
Joined: Tue Jun 10, 2014 8:15 pm

### Re: Unsigned Integer Division Routines

doynax wrote:Clever. I think I'll keep these routines handy for future needs.

I don't suppose you've considered rigging up a super-optimizer and letting it loose on the problem? It seems like a relatively limited number of permutations to search through, though I suppose the immediate fudge factor might cause trouble.

I know I've occasionally wished that someone would take the time to write a good one for the 6502. Ideally with a flexible scheme for evaluating the results, a flexible opcode/immediate search space and specifying other constraints.

It just seems like 6502 optimization in practice as often as not boils down to a puzzle of the form: shuffle bit X into register Y at precisely cycle Z while preserving carry. What remains is then a brain-dead task of working through the possible permutations until you find the best one.
Thanks! It's always nice to hear people might make use of these routines. I wrote them because to me they are like solving little puzzles.

The idea came up before on AtariAge about letting a super computer mull over every usable opcode. Who knows what weird combination of opcodes would bring a faster, shorter solution? Nothing came of that idea though... I suppose though that just a small pool of opcodes might be workable on a home computer. Say just using STA, ASL, LSR, ROR, ROL, CLC, SEC, ADC zeropage, ADC immediate, SBC zeropage, sbc immediate, EOR zeropage, EOR immediate. I would propose constraining the routine to 36 bytes or less, and 36 cycles or less because most of the routines I wrote are already shorter than that. I would also constrain it to no more then two temp registers, and preferably just one. If the program could crunch all that then I would try adding in ROL, ROR, ASL, and LSR (all zeropage) next.

doynax
Posts: 162
Joined: Mon Nov 22, 2004 3:24 pm
Location: Sweden
Contact:

### Re: Unsigned Integer Division Routines

Omegamatrix wrote:The idea came up before on AtariAge about letting a super computer mull over every usable opcode. Who knows what weird combination of opcodes would bring a faster, shorter solution? Nothing came of that idea though... I suppose though that just a small pool of opcodes might be workable on a home computer. Say just using STA, ASL, LSR, ROR, ROL, CLC, SEC, ADC zeropage, ADC immediate, SBC zeropage, sbc immediate, EOR zeropage, EOR immediate. I would propose constraining the routine to 36 bytes or less, and 36 cycles or less because most of the routines I wrote are already shorter than that. I would also constrain it to no more then two temp registers, and preferably just one. If the program could crunch all that then I would try adding in ROL, ROR, ASL, and LSR (all zeropage) next.
Such devices have been used with some degree of success in the past. Mostly to find short bit-hacks abusing the architecture in some novel way or another. Massalin's original paper is a good read, and there is a PIC16 implementation for instance.

There are other other strategies of course but the brute-force method of just generating all permutations from a limited set is workable. They key is to keep the number of opcodes low (counting each immediate and temporary as a distinct instruction) and while for a general-purpose optimize this may not be feasible I'd be more interested in a guided search automate the sort of thing you've been doing by hand here. For instance in the case of division you might first search for roughly the right answer then add in the fudge-factor separately.
Note that virtually all generated sequences are completely wrong, so validating them is easy with a couple of well-picked counterexamples. A full-blown theorem prover is only necessary for confirming the final result, if at all.

Realistically with, say, 25 opcodes, a bit of pruning and a fast search function you might be able to reach around 12 instructions on a small cluster running overnight.

Omegamatrix
Posts: 35
Joined: Tue Jun 10, 2014 8:15 pm

### Re: Unsigned Integer Division Routines

I glanced at Massalin's paper and that looks like an interesting solution they found for BCD. I will have to read it more later when I have time.

I thought more about the computation today. It would be nice to also include ORA and AND, both zeropage and immediate. It's all the immediates that explode the number of states. ADC, SBC, and EOR immediate take 256 states. ORA and AND only need 255 states as you can skip ORA #0 and AND #\$FF.

Code: Select all

``````;opcode     states
; rol         1
; lsr         1
; asl         1
; ror         1
; clc         1
; sec         1
; sbc temp    1
; eor temp    1
; and temp    1
; ora temp    1
; sbc #imm   256
; eor #imm   256
; and #imm   255
; ora #imm   255``````
That totals 1289 states. To do 12 instructions would be 1289^12 = 2.1 x 10^37 routines... so pretty heavy. If we skipped all of the immediates then we have a small pool of 11 different states. 11^12 = 2.85 x 10^11 routines, which is doable. I'd much rather run through all the routines and resolve all possible fudge factors at once, but that is some serious computation power needed.

For a good initial test value, 255, 238, and 239 are all very good to use. I did a quick excel sheet for calculating the number of unique results for each input (0-255) when dividing by 3 to 31 (skipping divide by 2, 4, 8, and 16). The rightmost column displays the number of unique values for that input.
238 and 239 give the same results for division. More importantly 238 and 255 give unique results for division 3 to 19. This makes it easy to test for initial correctness of the routine with just two checks.

tepples
Posts: 22017
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

### Re: Unsigned Integer Division Routines

I found another fudge if you're allowed to change the encoder. Have the encoder multiply each 16-bit word by 65536/64000 = 128/125 = 1.024, so that the decoder works by multiplying by 40 (shift, shift, add, shift, shift) instead of dividing by 40. The principle is that of arithmetic coding.

chitselb
Posts: 6
Joined: Sat Jun 14, 2014 2:14 pm

### Re: Unsigned Integer Division Routines

tepples wrote:I found another fudge if you're allowed to change the encoder. Have the encoder multiply each 16-bit word by 65536/64000 = 128/125 = 1.024, so that the decoder works by multiplying by 40 (shift, shift, add, shift, shift) instead of dividing by 40. The principle is that of arithmetic coding.
The encoder will run on a non-6502 computer, so a faster decoder would be wonderful. I stared at this for a bit and couldn't figure out how it works. Given "DOG" which encodes to 4, 15, 7, or 4*1600+15*40+7 = 7007, I multiply that *1.024 and wind up with 7175.168. I don't have the luxury of storing the fractional portion, so iteratively multiplying 7175*40 and dividing by 65536 results in 4, 15, 6.8359375

I see where you're going with it, but how does it translate into an algorithm that yields decodable output?

tepples
Posts: 22017
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

### Re: Unsigned Integer Division Routines

Try rounding the multiplication by 1.024 up using ceil().

chitselb
Posts: 6
Joined: Sat Jun 14, 2014 2:14 pm

### Re: Unsigned Integer Division Routines

tepples wrote:Try rounding the multiplication by 1.024 up using ceil().
I played around with it in a spreadsheet and there are inaccuracies. It would probaby work if more bits were available for the fractions, but in Rad50 they just aren't there.

lidnariq
Posts: 9506
Joined: Sun Apr 13, 2008 11:12 am
Location: Seattle

### Re: Unsigned Integer Division Routines

Worked for me:

Code: Select all

``````#!/usr/bin/perl
use POSIX 'ceil';
for (\$x = 0; \$x < 40; \$x++) {
for (\$y = 0; \$y < 40; \$y++) {
for (\$z = 0; \$z < 40; \$z++) {
my \$sum = \$x*1600 + \$y*40 + \$z;
\$sum = ceil(\$sum * 1.024);

\$sum *= 40; \$a = \$sum >> 16; \$sum &= 0xFFFF;
\$sum *= 40; \$b = \$sum >> 16; \$sum &= 0xFFFF;
\$c = (\$sum * 40) >> 16;
if (\$x != \$a or \$y != \$b or \$z != \$c) {
printf "%2d=%2d %2d=%2d %2d=%2d\n",\$x,\$a,\$y,\$b,\$z,\$c;
}
}
}
}``````