It is currently Wed Oct 01, 2014 1:16 am

All times are UTC - 7 hours




Post new topic Reply to topic  [ 93 posts ]  Go to page Previous  1 ... 3, 4, 5, 6, 7  Next
Author Message
 Post subject:
PostPosted: Tue Sep 12, 2006 7:34 pm 
Offline
User avatar

Joined: Mon Sep 27, 2004 8:33 am
Posts: 3711
Location: Central Texas, USA
I'll probably attempt to implement yxa_to_5_digits on up to yxa_to_8_digits. The main problem is that these exceed the 6502's registers, so will require lots of swapping with memory. The comparisons will also be more involved with three bytes. It'd be nice to have a whole family of routines through 24 bits.


Top
 Profile  
 
 Post subject:
PostPosted: Tue Sep 12, 2006 8:35 pm 
Offline
User avatar

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 6708
Location: Rio de Janeiro - Brazil
Yeah, and I liked your idea of entry points, where you can use the same routine for a range of different precisions as needed. And a big routine could always be cut down depending on how large your numbers are.


Top
 Profile  
 
 Post subject:
PostPosted: Tue Sep 12, 2006 11:36 pm 
Offline
User avatar

Joined: Mon Sep 27, 2004 8:33 am
Posts: 3711
Location: Central Texas, USA
Extending to 24-bit input and 8 decimal digits out didn't turn out as bad as I thought. Here are the new routines I added and their timings (and the previous routines, for completeness):
Code:
yxa_to_8_digits: Best: 186, Average: 344, Worst: 475
yxa_to_7_digits: Best: 170, Average: 322, Worst: 440
yxa_to_6_digits: Best: 139, Average: 254, Worst: 348
yxa_to_5_digits: Best: 104, Average: 179, Worst: 236
 xa_to_5_digits: Best:  99, Average: 170, Worst: 231
 xa_to_4_digits: Best:  79, Average: 128, Worst: 173
 xa_to_3_digits: Best:  54, Average:  75, Worst:  94
  a_to_3_digits: Best:  43, Average:  52, Worst:  60
  a_to_2_digits: Best:  28, Average:  34, Worst:  40

Support for 6 or more digits is in yxa_to_decimal.asm, and the original source has been modified as well:
xa_to_decimal.asm
yxa_to_decimal.asm
Unfortunately the test program had to become more extensive in order to try every possible 24-bit input value. I ended up adding my NES CPU emulator and having it run the routine for every value. This allowed the test program to conveniently generate the timing information automatically. For the 8 digit output, it took 2.5 minutes for the program to run, which translates to about 53 minutes of emulated NES time! (average of 344 clocks * $FFFFFF / 1789773 = 3224 seconds)
Quote:
I liked your idea of entry points, where you can use the same routine for a range of different precisions as needed. And a big routine could always be cut down depending on how large your numbers are.

Yeah, the multiple entry points was a lucky side-effect of the algorithm. As you say, any of the upper digit code can be removed if you don't need it in a particular program.


Top
 Profile  
 
 Post subject:
PostPosted: Wed Sep 13, 2006 7:23 am 
Offline
User avatar

Joined: Sun Sep 19, 2004 10:59 pm
Posts: 1190
I liked the method you used, but didn't like the high code size (613 bytes) and the forcing of using all of the available registers to hold the values (which can get messy and confusing at times), so I threw together a looped routine that worked on the same principle but took up less space. Code size came up at 280 bytes, and while it is significantly slower than your version, it's still way faster than the one I posted earlier. It currently doesn't support multiple entry points for variable output digits, but that's easy enough to add by just replacing the initial LDY in each routine with a different value corresponding to the proper offsets in the data tables.

This one assembles with CA65:

Code:
.define   HEX_BYTE2(val) .LOBYTE(.HIWORD(val))
.define   HEX_BYTE1(val) .HIBYTE(.LOWORD(val))
.define   HEX_BYTE0(val) .LOBYTE(.LOWORD(val))

hex2bcd_output:   .res 8
hex2bcd_input:   .res 3

.proc   hex2bcd_init
   LDY #7
:   STA hex2bcd_output,Y
   DEY
   BPL :-   
   RTS
.endproc

.proc   hex2bcd_24bit
   LDY #0
loop:   LDA hex2bcd_otable,Y
   BMI done
   TAX
   SEC
   LDA hex2bcd_input+0
   SBC hex2bcd_b0table,Y
   PHA
   LDA hex2bcd_input+1
   SBC hex2bcd_b1table,Y
   PHA
   LDA hex2bcd_input+2
   SBC hex2bcd_b2table,Y
   BCS okay
   PLA
   PLA
   BCC end
okay:   STA hex2bcd_input+2
   PLA
   STA hex2bcd_input+1
   PLA
   STA hex2bcd_input+0

   LDA hex2bcd_output,X
   CLC
   ADC hex2bcd_vtable,Y
   STA hex2bcd_output,X
end:   INY
   BNE loop

done:   LDA hex2bcd_input+0
   CLC
   ADC hex2bcd_output+7
   STA hex2bcd_output+7
   RTS
.endproc

.proc   hex2bcd_16bit
   LDY #9
loop:   LDA hex2bcd_otable,Y
   BMI done
   TAX
   SEC
   LDA hex2bcd_input+0
   SBC hex2bcd_b0table,Y
   PHA
   LDA hex2bcd_input+1
   SBC hex2bcd_b1table,Y
   BCS okay
   PLA
   BCC end
okay:   STA hex2bcd_input+1
   PLA
   STA hex2bcd_input+0

   LDA hex2bcd_output,X
   CLC
   ADC hex2bcd_vtable,Y
   STA hex2bcd_output,X
end:   INY
   BNE loop

done:   LDA hex2bcd_input+0
   CLC
   ADC hex2bcd_output+7
   STA hex2bcd_output+7
   RTS
.endproc

.proc   hex2bcd_8bit
   LDY #19
loop:   LDA hex2bcd_otable,Y
   BMI done
   TAX
   SEC
   LDA hex2bcd_input+0
   SBC hex2bcd_b0table,Y
   BCC end
okay:   STA hex2bcd_input+0

   LDA hex2bcd_output,X
   CLC
   ADC hex2bcd_vtable,Y
   STA hex2bcd_output,X
end:   INY
   BNE loop

done:   LDA hex2bcd_input+0
   CLC
   ADC hex2bcd_output+7
   STA hex2bcd_output+7
   RTS
.endproc

hex2bcd_b2table:
.byte                                                             HEX_BYTE2(10000000)
.byte HEX_BYTE2( 6000000),HEX_BYTE2( 3000000),HEX_BYTE2( 2000000),HEX_BYTE2( 1000000)
.byte HEX_BYTE2(  600000),HEX_BYTE2(  300000),HEX_BYTE2(  200000),HEX_BYTE2(  100000)
.byte HEX_BYTE2(   60000),HEX_BYTE2(   30000),HEX_BYTE2(   20000),HEX_BYTE2(   10000)
.byte HEX_BYTE2(    6000),HEX_BYTE2(    3000),HEX_BYTE2(    2000),HEX_BYTE2(    1000)
.byte HEX_BYTE2(     600),HEX_BYTE2(     300),HEX_BYTE2(     200),HEX_BYTE2(     100)
.byte HEX_BYTE2(      60),HEX_BYTE2(      30),HEX_BYTE2(      20),HEX_BYTE2(      10)

hex2bcd_b1table:
.byte                                                             HEX_BYTE1(10000000)
.byte HEX_BYTE1( 6000000),HEX_BYTE1( 3000000),HEX_BYTE1( 2000000),HEX_BYTE1( 1000000)
.byte HEX_BYTE1(  600000),HEX_BYTE1(  300000),HEX_BYTE1(  200000),HEX_BYTE1(  100000)
.byte HEX_BYTE1(   60000),HEX_BYTE1(   30000),HEX_BYTE1(   20000),HEX_BYTE1(   10000)
.byte HEX_BYTE1(    6000),HEX_BYTE1(    3000),HEX_BYTE1(    2000),HEX_BYTE1(    1000)
.byte HEX_BYTE1(     600),HEX_BYTE1(     300),HEX_BYTE1(     200),HEX_BYTE1(     100)
.byte HEX_BYTE1(      60),HEX_BYTE1(      30),HEX_BYTE1(      20),HEX_BYTE1(      10)

hex2bcd_b0table:
.byte                                                             HEX_BYTE0(10000000)
.byte HEX_BYTE0( 6000000),HEX_BYTE0( 3000000),HEX_BYTE0( 2000000),HEX_BYTE0( 1000000)
.byte HEX_BYTE0(  600000),HEX_BYTE0(  300000),HEX_BYTE0(  200000),HEX_BYTE0(  100000)
.byte HEX_BYTE0(   60000),HEX_BYTE0(   30000),HEX_BYTE0(   20000),HEX_BYTE0(   10000)
.byte HEX_BYTE0(    6000),HEX_BYTE0(    3000),HEX_BYTE0(    2000),HEX_BYTE0(    1000)
.byte HEX_BYTE0(     600),HEX_BYTE0(     300),HEX_BYTE0(     200),HEX_BYTE0(     100)
.byte HEX_BYTE0(      60),HEX_BYTE0(      30),HEX_BYTE0(      20),HEX_BYTE0(      10)

hex2bcd_otable:
   .byte 0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,$FF
hex2bcd_vtable:
   .byte 1,6,3,2,1,6,3,2,1,6,3,2,1,6,3,2,1,6,3,2,1,6,3,2,1


How hard do you think it would be to update your test program to test and profile these functions? When I timed the 24-bit version I got around 1500 cycles for a 5-digit number, but then I noticed that b2table was crossing a page boundary.

_________________
Quietust, QMT Productions
P.S. If you don't get this note, let me know and I'll write you another.


Top
 Profile  
 
 Post subject:
PostPosted: Wed Sep 13, 2006 8:35 am 
Offline
User avatar

Joined: Fri Nov 12, 2004 2:49 pm
Posts: 6042
Location: Jongny, VD, Switzerland
Oh, I did follow nothing ! Are you guys still doing the method who substract constant numbers and compares them in order to find the main digit (unsure I understand this perfectly).
Using A, X and Y register as the input number isn't optimal in my opinion.
The best way is to have only a pointer in z-page that points where the actual binary number in memory to be converted is, and to call the routine to a different label in function of the number of digit you want.
Just compare the two folowing piece of code :
Code:
lda #<Number
sta PointerL
lda #>Number
sta PointerH
jsr ConvertBinNmr

or
Code:
lda Number+2
tay
lda Number+1
tax
lda Number
jsr ConvertBinNmr

Well, both aren't so much different, but I just found the first one look more professional. Total subjective behaviour, though.

_________________
Life is complex: it has both real and imaginary components.


Top
 Profile  
 
 Post subject:
PostPosted: Wed Sep 13, 2006 9:04 am 
Offline
Site Admin
User avatar

Joined: Mon Sep 20, 2004 6:04 am
Posts: 2875
Location: Indianapolis
Bregalad wrote:
Well, both aren't so much different, but I just found the first one look more professional. Total subjective behaviour, though.


I'd say it's a good opportunity for writing a macro for calling the routine, you can have it look as professional as you want while still being the fastest and smallest method (and only take up one line of source, which is great if you do it a lot, less chance for typos also).

Why does your 2nd example do LDA number+2 / TAY btw? Just do LDY number+2. :P

Nice work everyone, and thanks. Maybe some day I'll have a program now that displays actual decimal numbers from hex (I can think of plenty of uses, XMODEM transfer status for one).


Top
 Profile  
 
 Post subject:
PostPosted: Wed Sep 13, 2006 12:47 pm 
Offline
User avatar

Joined: Mon Sep 27, 2004 8:33 am
Posts: 3711
Location: Central Texas, USA
Quietust's table-based version passed the exhaustive test. I then optimized it to run about twice as fast and fit in one 256-byte page. Here is the original source and three progressively optimized versions, with comments at the top about all the changes:

hex2bcd_opt.zip

This code could also have multiple entry points added for the desired number of output digits. Maybe I'll do that tomorrow.

UPDATE ... or today.

I think I've pretty much beat this to death here. I added entry points for arbitrary numbers of digits, and an intermediate optimized version with just the 24-bit code, if size is your biggest concern (it comes in at 192 bytes). The other fast code from yesterday is also included.

to_decimal_asm.zip

Summary so far:
Code:
xa_to_decimal.asm:
yxa_to_8_digits: Best: 186, Average: 344, Worst: 475
yxa_to_7_digits: Best: 170, Average: 322, Worst: 440
yxa_to_6_digits: Best: 139, Average: 254, Worst: 348
yxa_to_5_digits: Best: 104, Average: 179, Worst: 236
 xa_to_5_digits: Best:  99, Average: 170, Worst: 231
 xa_to_4_digits: Best:  79, Average: 128, Worst: 173
 xa_to_3_digits: Best:  54, Average:  75, Worst:  94
  a_to_3_digits: Best:  43, Average:  52, Worst:  60
  a_to_2_digits: Best:  28, Average:  34, Worst:  40
610 bytes (279 if you leave out 6-8 digit support)

hex2bcd.asm
hex2bcd_24bit_8dig Best: 600, Average: 815, Worst: 940
hex2bcd_24bit_7dig Best: 572, Average: 777, Worst: 877
hex2bcd_24bit_6dig Best: 452, Average: 608, Worst: 687
hex2bcd_24bit_5dig Best: 332, Average: 439, Worst: 497
hex2bcd_16bit_5dig Best: 288, Average: 383, Worst: 441
hex2bcd_16bit_4dig Best: 222, Average: 289, Worst: 329
hex2bcd_16bit_3dig Best: 130, Average: 165, Worst: 191
 hex2bcd_8bit_3dig Best:  55, Average:  67, Worst:  78
 hex2bcd_8bit_2dig Best:  48, Average:  55, Worst:  60
277 bytes

hex2bcd_small.asm
hex2bcd_24bit_8dig: Best: 768, Average: 1073, Worst: 1226
hex2bcd_24bit_7dig: Best: 741, Average: 1036, Worst: 1164
hex2bcd_24bit_6dig: Best: 621, Average:  867, Worst:  974
hex2bcd_24bit_5dig: Best: 501, Average:  698, Worst:  784
hex2bcd_16bit_5dig: Best: 471, Average:  661, Worst:  754
hex2bcd_16bit_4dig: Best: 381, Average:  529, Worst:  594
hex2bcd_16bit_3dig: Best: 261, Average:  360, Worst:  404
 hex2bcd_8bit_3dig: Best: 201, Average:  270, Worst:  309
 hex2bcd_8bit_2dig: Best: 141, Average:  191, Worst:  214
192 bytes


Top
 Profile  
 
 Post subject:
PostPosted: Sat Sep 16, 2006 10:50 am 
Offline

Joined: Mon Mar 06, 2006 3:42 pm
Posts: 94
Location: Montreal, canada
See? Get a few smart people interested in solving a problem, and look what happens!

Blargg and Quietust, thank you for following through on this. Now everybody who actually writes their own NES games can benefit. This is exactly what I hoped would happen.

Edit: @Bregalad and Celius: the algorithm I proposed works on a similar idea to binary search (see wikipedia article).
For each digit, it essentially does a binary search for the value of the digit. With 10 possible values for the digit, you need to do at most ceil(log2(10)) == 4 tests to figure out which one it is.

One subtle part is this: Once you know what the value of the digit is, you have to subtract that number from N anyways so that you can get on with figuring out the next digit. So rather than handle the "high" case and the "low" case differently for each comparison, I keep things simple by just subtracting as I go! That effectively makes both cases the same.

So if you have 4 digits worth of number to figure out, and you start out comparing the number to 6000, then either it's less than 6000 and we do nothing (the low case), or its >= 6000 (the high case) in which case we subtract that 6000 right away and add 6 to temporary where we store the value of the digit we will output. Because of the subtraction, the remaining part of the number is now less than 4000, whether we subtracted 6000 from it or not, which makes it simple to do the next step (comparing it to 3000).

The other subtle part, which Disch elaborated on earlier, is how I chose the 4 points to compare the digit to such that at most two of the comparisons will succeed. 3 comparisons is unfortunately not enough in some cases (that could only distinguish 2^3 == 8 different values). If we had 16 possible digits to look for, we would need the four points to be 8,4,2,1 but since we only have 10 possible digits that the result might be, 6,3,2,1 are slightly faster when the digit happens to be a 7 (because 6+1 = 7, where the other case is 4+2+1 = 7)


Top
 Profile  
 
 Post subject:
PostPosted: Sat Sep 16, 2006 1:10 pm 
Offline
User avatar

Joined: Mon Sep 27, 2004 8:33 am
Posts: 3711
Location: Central Texas, USA
mozz wrote:
the algorithm I proposed works on a similar idea to binary search (see wikipedia article). For each digit, it essentially does a binary search for the value of the digit. With 10 possible values for the digit, you need to do at most ceil(log2(10)) == 4 tests to figure out which one it is.


I actually started out by trying to implement a binary search, but it was quite unwieldy.

Quote:
Once you know what the value of the digit is, you have to subtract that number from N anyways so that you can get on with figuring out the next digit. So rather than handle the "high" case and the "low" case differently for each comparison, I keep things simple by just subtracting as I go! That effectively makes both cases the same.

That didn't work in your code since the subtraction of the low byte is bypassed when the high byte is greater than the comparison value. Below, when X > 23, it jumps to gt60000. In your code, the SBC would only have been done in place of the CMP #112, if X = 23. Also the way I coded it, I didn't keep re-loading the lowest byte, so even if I found a way to use the SBC in place of a comparison, I'd have to undo the subtract if the number was found lower than the digit value.
Code:
    cpx #23
    bcc lt6000
    bne gt6000
    cmp #112
    bcc lt6000
gt6000:
    sbc #112
    tay
    txa
    sbc #23
    tax
    tya

On the other hand, Quietust's hex2bcd (and my further optimizations of it) do as you describe: subtract the digit value from a temporary copy of the value, and if the result isn't negative, writes this new value back. I enjoyed the simplicity of this when working with hex2bcd.
Quote:
since we only have 10 possible digits that the result might be, 6,3,2,1 are slightly faster when the digit happens to be a 7 (because 6+1 = 7, where the other case is 4+2+1 = 7)

I delved deeper into this in some ways. I first noted that only 3 out of 10 digits end in a 1, therefore 70% of the time the last comparison will not match, so the code can be optimized for the loop ending with a non-match. Also, supporting multiple entry points, I used the digits 7, 4, 2, 1 for the ten thousands place (in both versions), allowing the 16-bit case to check digits 4, 2, 1, and the 24-bit case to check for digits 7, 4, 2, 1. This allowed the 24-bit code to jump to the 16-bit code after checking for 7. The 7, 4, 2, 1 set is almost as good as 6, 3, 2, 1, and doesn't ever use more than two for a given digit.

The code still needs to be documented a bit better and perhaps hex2bcd have the multiple entry points moved to another file (or added as macros as I did to hex2bcd_small), since they bloat it a bit.


Top
 Profile  
 
 Post subject:
PostPosted: Sat Sep 16, 2006 3:25 pm 
Offline

Joined: Mon Mar 06, 2006 3:42 pm
Posts: 94
Location: Montreal, canada
blargg wrote:
That didn't work in your code since the subtraction of the low byte is bypassed when the high byte is greater than the comparison value. Below, when X > 23, it jumps to gt60000. In your code, the SBC would only have been done in place of the CMP #112, if X = 23.

Whoops! That's what happens when I write code at three in the morning and never test it.

blargg wrote:
I delved deeper into this in some ways. I first noted that only 3 out of 10 digits end in a 1, therefore 70% of the time the last comparison will not match, so the code can be optimized for the loop ending with a non-match. Also, supporting multiple entry points, I used the digits 7, 4, 2, 1 for the ten thousands place (in both versions), allowing the 16-bit case to check digits 4, 2, 1, and the 24-bit case to check for digits 7, 4, 2, 1. This allowed the 24-bit code to jump to the 16-bit code after checking for 7. The 7, 4, 2, 1 set is almost as good as 6, 3, 2, 1, and doesn't ever use more than two for a given digit.

That is very clever. I can see how 24-bit support might be useful, e.g. for experience points in an RPG. Having entry points for known number of digits is a great idea too. Note that if you aren't sure how large the number is but its usually small, you could do one short-circuit test at the start of the routine (e.g. if the high byte is zero, jump over a bunch of code).


Top
 Profile  
 
 Post subject:
PostPosted: Sat Sep 16, 2006 7:19 pm 
Offline
User avatar

Joined: Sun Sep 19, 2004 10:59 pm
Posts: 1190
blargg wrote:
Quietust's table-based version passed the exhaustive test. I then optimized it to run about twice as fast and fit in one 256-byte page. Here is the original source and three progressively optimized versions, with comments at the top about all the changes:

hex2bcd_opt.zip


Out of curiosity, why do you handle the 40000 case within the 24-bit code and then skip it when falling through to 16-bit? Is there something gained by not handling it straight within the 16-bit case? And why do you not do the same thing when going from 16-bit to 8-bit (you end up doing the 200 case twice)?

For reference, the following code appears to work properly and is only 233 bytes long. When I ran it on the 24-bit number 0, it took only 581 cycles to complete, though I don't know if larger numbers were made slower in the process.

Code:
hex2bcd_output:   .res 8
hex2bcd_input:   .res 3



.proc   hex2bcd_init
   LDY #7
:   STA hex2bcd_output,Y
   DEY
   BPL :-   
   RTS
.endproc

.scope hex2bcd
conv_24bit:
   LDY #9
   CLC
loop1:   LDA hex2bcd_input+0
   SBC b0table+9,Y
   LDA hex2bcd_input+1
   SBC b1table+9,Y
   TAX
   LDA hex2bcd_input+2
   SBC b2table,Y
   BCS good1
   DEY
   BPL loop1
   BMI conv_16bit
   
good1:   STA hex2bcd_input+2
   STX hex2bcd_input+1
   
   LDX otable+9,Y
   LDA hex2bcd_output,X
   ADC vtable,Y
   STA hex2bcd_output,X
   
   LDA hex2bcd_input+0
   SBC b0table+9,Y
   STA hex2bcd_input+0
   CLC
   DEY
   BPL loop1

conv_16bit:
   LDY #8
   CLC
loop2:   LDA hex2bcd_input+0
   SBC b0table,Y
   TAX
   LDA hex2bcd_input+1
   SBC b1table,Y
   BCS good2
   DEY
   BPL loop2
   BMI conv_8bit
   
good2:   STX hex2bcd_input+0
   STA hex2bcd_input+1
   LDX otable,Y
   LDA hex2bcd_output,X
   ADC vtable+3,Y
   STA hex2bcd_output,X
   DEY
   BPL loop2

conv_8bit:
   LDA hex2bcd_input+0
   CMP #200
   BCC :+
   SBC #200
   INC hex2bcd_output+5
   INC hex2bcd_output+5
:   CMP #100
   BCC :+
   SBC #100
   INC hex2bcd_output+5
:   LDX #0
   CMP #60
   BCC :+
   SBC #60
   LDX #6
:   CMP #30
   BCC :+
   SBC #30
   INX
   INX
   INX
:   CMP #20
   BCC :+
   SBC #20
   INX
   INX
:   CMP #10
   BCC :+
   SBC #10
   INX
   CLC
:   ADC hex2bcd_output+7
   STA hex2bcd_output+7
   TXA
   ADC hex2bcd_output+6
   STA hex2bcd_output+6
   RTS

vtable:   .byte 6,0,1,2,5,0,1,2,5,0,1,3
otable:   .byte 5,5,4,4,4,4,3,3,3,3,2,2,2,2,1,1,1,1,0

.define    HEX_BYTE2(val) .LOBYTE(.HIWORD(val))
.define    HEX_BYTE1(val) .HIBYTE(.LOWORD(val))
.define    HEX_BYTE0(val) (.LOBYTE(.LOWORD(val))-1)

b0table:
.byte                                         HEX_BYTE0(     300),HEX_BYTE0(     600)
.byte HEX_BYTE0(    1000),HEX_BYTE0(    2000),HEX_BYTE0(    3000),HEX_BYTE0(    6000)
.byte HEX_BYTE0(   10000),HEX_BYTE0(   20000),HEX_BYTE0(   40000),HEX_BYTE0(   70000)
.byte HEX_BYTE0(  100000),HEX_BYTE0(  200000),HEX_BYTE0(  300000),HEX_BYTE0(  600000)
.byte HEX_BYTE0( 1000000),HEX_BYTE0( 2000000),HEX_BYTE0( 3000000),HEX_BYTE0( 6000000)
.byte HEX_BYTE0(10000000)

b1table:
.byte                                         HEX_BYTE1(     300),HEX_BYTE1(     600)
.byte HEX_BYTE1(    1000),HEX_BYTE1(    2000),HEX_BYTE1(    3000),HEX_BYTE1(    6000)
.byte HEX_BYTE1(   10000),HEX_BYTE1(   20000),HEX_BYTE1(   40000),HEX_BYTE1(   70000)
.byte HEX_BYTE1(  100000),HEX_BYTE1(  200000),HEX_BYTE1(  300000),HEX_BYTE1(  600000)
.byte HEX_BYTE1( 1000000),HEX_BYTE1( 2000000),HEX_BYTE1( 3000000),HEX_BYTE1( 6000000)
.byte HEX_BYTE1(10000000)

b2table:
.byte                                                             HEX_BYTE2(   70000)
.byte HEX_BYTE2(  100000),HEX_BYTE2(  200000),HEX_BYTE2(  300000),HEX_BYTE2(  600000)
.byte HEX_BYTE2( 1000000),HEX_BYTE2( 2000000),HEX_BYTE2( 3000000),HEX_BYTE2( 6000000)
.byte HEX_BYTE2(10000000)
.endscope

hex2bcd_24bit = hex2bcd::conv_24bit
hex2bcd_16bit = hex2bcd::conv_16bit
hex2bcd_8bit = hex2bcd::conv_8bit

_________________
Quietust, QMT Productions
P.S. If you don't get this note, let me know and I'll write you another.


Top
 Profile  
 
 Post subject:
PostPosted: Sun Sep 17, 2006 12:14 am 
Offline
User avatar

Joined: Mon Sep 27, 2004 8:33 am
Posts: 3711
Location: Central Texas, USA
Quote:
Out of curiosity, why do you handle the 40000 case within the 24-bit code and then skip it when falling through to 16-bit?

The ten thousand values subtracted are 70000, 40000, 20000, and 10000. In the 24-bit case, once it's eliminated 70000, the number could still be 69999 ($01116F). If it jumped to the 16-bit code immediately, the still non-zero most significant byte would be ignored, making the value look like 4463. Of course I had the benefit of the validator, so I could just adjust the overlap until it worked, then figure out why. :)

Quote:
When I ran it on the 24-bit number 0, it took only 581 cycles to complete,

In the most recent posted code I list the worst-case values so you can easily invoke it when testing them in a program. I tested your changes and they broke the 24-bit and 16-bit routines for the reason described above. I went ahead spent a while (ugh, 3 hours now that I check) optimizing the routine more.

I was able to eliminate the entry for 200 from the table and have the 16-bit case do an optimized check for this at the end, simply seeing if the high byte is non-zero, in which case it subtracts 200 without any further checks. If the high byte is zero, then it uses the normal hex2bcd_8bit code.

I found another optimizatin using the 7/4/2/1 group: if the 7 test passes, then the 4 test can be skipped. I incorporated this into the ends of the 16-bit and 24-bit match loops, having them skip the last entry if the next-to-last entry matched.

I was able to reduce the vtable size further by changing the millions to the 1/2/4/7 pattern, allowing the 16-bit and 24-bit routines to use the same vtable from the beginning. I had to fix a bug in the HEXBYTE macros: rather than subtract 1 from the low byte, I needed to subtract 1 from val in all of them.

I rearranged the match handlers to fall through to the loop, and for the non-match loop case to fall through to the next smaller handler (i.e. 24 falls into 16, 16 falls into 8), eliminating more branches.

Finally, I added short-circuiting that immediately jumps to the 16- and 8- bit handlers once the upper bits of the value become zero. This is just two branches that can be removed if desired, as they slightly increase the worst-case performance.

I removed the init routine from the source since it was cluttering it up. I also wasn't able to use the local symbol names as my ca65 gave a diagnostic (same for using :+, which sucks), and I don't feel like checking for a new version, getting it to compile, and seeing if these are fixed.

hex2bcd5.asm

Best case is for value 0. Worst case values listed.
hex2bcd_24bit Best: 106, Average: 760, Worst: 962 (n=16768641)
hex2bcd_16bit Best: 77, Average: 357, Worst: 439 (n=34640)
hex2bcd_8bit Best: 55, Average: 67, Worst: 78 (n=240)
252 bytes

If the two lines marked "optimization only" are removed: (note improved worst-case)
hex2bcd_24bit Best: 589, Average: 788, Worst: 924 (n=15964654)
hex2bcd_16bit Best: 278, Average: 361, Worst: 421 (n=34640)
248 bytes


Top
 Profile  
 
 Post subject:
PostPosted: Sun Sep 17, 2006 12:48 am 
Offline
User avatar

Joined: Sun Sep 19, 2004 10:59 pm
Posts: 1190
blargg wrote:
Quote:
Out of curiosity, why do you handle the 40000 case within the 24-bit code and then skip it when falling through to 16-bit?

The ten thousand values subtracted are 70000, 40000, 20000, and 10000. In the 24-bit case, once it's eliminated 70000, the number could still be 69999 ($01116F). If it jumped to the 16-bit code immediately, the still non-zero most significant byte would be ignored, making the value look like 4463. Of course I had the benefit of the validator, so I could just adjust the overlap until it worked, then figure out why. :)


Oh yeah, forgot about that. :)
To be fair, I never would have actually seen the 256-299 case, since my score and timer all go by multiples of 50; as for the 65536-69999 case, I just wasn't patient enough to try large enough values.
Come to think of it, it would probably be worth it to just divide the score and bonus timer by 10 internally and just pad them with an extra 0.

_________________
Quietust, QMT Productions
P.S. If you don't get this note, let me know and I'll write you another.


Top
 Profile  
 
 Post subject:
PostPosted: Fri Feb 13, 2009 4:44 pm 
Offline
User avatar

Joined: Wed Nov 10, 2004 6:47 pm
Posts: 1694
** BUMP **

Was looking for this lib, but the links are all dead. Specifically I'm looking for the one previously at this link:

it's now http://h1.ripway.com/blargg/temp/to_decimal_asm.zip

which was the 2nd link in blargg's post here:

http://nesdev.com/bbs/viewtopi ... 7351#17351


If someone has this, could they re-link it somewhere? Or if it exists somewhere else, could someone give a link?

Thanks!


Top
 Profile  
 
 Post subject:
PostPosted: Fri Feb 13, 2009 6:24 pm 
Offline
User avatar

Joined: Mon Sep 27, 2004 8:33 am
Posts: 3711
Location: Central Texas, USA
For any old links of mine, try replacing "www.io.com/~greens" with "h1.ripway.com/blargg". So this one becomes http://h1.ripway.com/blargg/temp/to_decimal_asm.zip


Top
 Profile  
 
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 93 posts ]  Go to page Previous  1 ... 3, 4, 5, 6, 7  Next

All times are UTC - 7 hours


Who is online

Users browsing this forum: No registered users and 0 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
Powered by phpBB® Forum Software © phpBB Group