PostPosted: Sat Feb 11, 2012 6:06 pm 

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 5724
Location: Canada
I've been playing a lot of Battletoads lately, and I was surprised to realize that every level introduction has several different texts, chosen at random. I was impressed by how often I was seeing new text.

So, I thought it would be fun to dump the text out of the ROM. At first I tried a straightforward search for known words with various offsets, and I found that some of the text is actually just uncompressed ASCII-ish strings right in the ROM, but not the level introductions. After digging in with a debugger/trace I eventually discovered that most of the text of the game is compressed with a huffman scheme.

I wrote a python script that will dump all the text of the ROM for you, and it's very closely based on the assembly code that actually does the decompression.

http://rainwarrior.ca/projects/nes/battletoads_text.py

I've disassembled and annotated the heart of the decompressor in the original ROM:

Code:
; Battletoads text decompression code
;
; taken from 32k ROM bank 6 at memory $E60F
; disassembled with DISASM6
; edited and annotated by Brad Smith 2/11/2012
; http://rainwarrior.ca

; The code at $E60F represents the main text decompression loop.
; It is a subroutine that will read bits from a bitstream
; and return a decoded byte in register A.

; $1B stores the current byte, each bit read shifts it left
; $1C/$1D is a pointer to the next byte of the stream (with a -7 byte offset)
; $1E counts how many bits remain in the current loaded byte

__E60F:     LDX #$00           ; X stores current tree node index
__E611:     DEC $1E            ; $1E counts bits left in the current byte at $1B
            BPL __E623         ; branch ahead if bits remain, else load a new byte
            LDY #$07           ;
            STY $1E            ; reset $1E to tell us there are 8 available bits
            LDA ($1C),Y        ; load byte from indirect $1C/$1D address + 7
            STA $1B            ; store byte in $1B
            INC $1C            ; increment stream byte position ($1C/$1D)
            BNE __E623         ;
            INC $1D            ;
__E623:     ASL $1B            ; shift the high bit out of $1B and...
            BCS __E631         ;
            LDA __E63B,X       ; if bit is 0 load next node from $E63B table
            TAX                ;
            BPL __E611         ; loop to next bit if node < 128
            LDA __E611,X       ;
            RTS                ; return byte from $E691 + (node - 128)
__E631:     LDA __E666,X       ; if bit is 1 load next node from $E666 table
            TAX                ;
            BPL __E611         ; loop to next bit if node < 128
            LDA __E611,X       ;
            RTS                ; return byte from $E691 + (node - 128)

__E63B:     ... ; 43 byte table, tree node index if bit is 0
__E666:     ... ; 43 byte table, tree node index if bit is 1
__E691:     ... ; 44 byte table, output symbol lookup

; note that the $E691 table is looked up from $E611 with an index that has 128 added


The bytes returned from the lookup table at $E691 are more or less just ASCII characters with a few control characters (see the python code for more details). All the compressed text data in the ROM is stored as one contiguous bitstream, which made dumping all of it relatively straightforward once I understood this algorithm.
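For anyone who'd rather read it in Python than 6502, the decode loop translates roughly like this. Just a sketch: the two node tables and the symbol table are parameters here, since the real 43/43/44-byte tables live in the ROM.

```python
def bitstream(data):
    """Yield the bits of each byte, high bit first (the ASL $1B order)."""
    for byte in data:
        for i in range(7, -1, -1):
            yield (byte >> i) & 1

def decode_symbol(bits, zero_table, one_table, symbols):
    """Walk the tree one bit at a time; node values >= 128 are leaves,
    and (node - 128) indexes the symbol table (the $E691/$E611 trick)."""
    node = 0
    while True:
        node = one_table[node] if next(bits) else zero_table[node]
        if node >= 128:
            return symbols[node - 128]
```

With a two-symbol toy tree (bit 0 = 'A', bit 1 = 'B'), feeding it the byte $40 decodes 'A' then 'B'.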

Overall this compressed about 10k of text down to about 6k (plus ~300 bytes of decompressor code and tables). Since this is an AxROM mapper, the whole 32k bank gets switched at once, so you need to keep your code and data together; this bank seems to be dedicated to all the cutscenes and title screen. I can imagine that saving that extra 4k of space was useful, especially since you need space for your graphics because it's a CHR-RAM mapper.

There are a few interesting optimization hacks in there; there seems to be some need to keep this inner loop efficient, since the code usually decompresses and discards several strings before it gets to the one it wants to use. One hack is that the indirect index address being used is off by 7: Y is used to reset the $1E bit counter, I think specifically so that it will hold a known value (7) before it is reused for the indirect index lookup. Another hack is that the table at $E691 is actually addressed as $E611, and the high bit that indicates to use that table is just used as part of the index. Saves an AND, I guess. I had to wonder why it was looking up a symbol from the code itself before I realized X was always >= 128.

Anyhow, I thought you guys might enjoy seeing a real-world huffman coding implementation for the NES.

Image of Battletoads huffman tree


Last edited by rainwarrior on Mon May 16, 2016 1:01 pm, edited 2 times in total.

PostPosted: Sat Feb 11, 2012 9:36 pm 

Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
I'm surprised they used Huffman. Huffman really doesn't provide very many benefits for English (ASCII) character sets. I forget what the rule of thumb is, but it's something like the letters R, S, T, and the vowels being the most commonly used characters in English words. Surely there isn't enough text in the game using these letters to justify Huffman compression...? (This is a real question: I hate Battletoads, so I have no idea)
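It's easy to put a rough number on this question: build a Huffman code from a sample's character frequencies and look at the average bits per character versus the 8 bits of raw storage. The sample string below is just a throwaway stand-in, not actual game text.

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a {symbol: count} dict."""
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freqs.items())]
    if len(heap) == 1:
        return {sym: 1 for sym in freqs}
    heapq.heapify(heap)
    depths = {sym: 0 for sym in freqs}
    while len(heap) > 1:
        count1, _, syms1 = heapq.heappop(heap)
        count2, tie, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            depths[sym] += 1  # everything under the merge gets one bit deeper
        heapq.heappush(heap, (count1 + count2, tie, syms1 + syms2))
    return depths

# Average code length over a throwaway sample string:
text = "SURELY THERE IS ENOUGH TEXT IN THE GAME TO JUSTIFY HUFFMAN"
lengths = huffman_code_lengths(Counter(text))
avg_bits = sum(lengths[c] for c in text) / len(text)
```

Even on a short all-caps sample the average comes out well under 8 bits per character; whether the savings justify the decoder is the real question.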

That said: did this really save them 4KBytes of space?

Here's an alternate theory, which I think is much more likely: did this game come out in Japan? If so, that would probably explain their choice of Huffman, which compresses Japanese amazingly well (and is used regularly in many text-heavy games in Japan, including SFC/Super Nintendo games where you would think "bigger ROM, who cares" (who cares = the company, mask ROMs were expensive as hell back then)). For example, Otogirisou's text when decompressed from Huffman comes out to be approximately 1.4MBytes of Japanese. The game ROM itself is 2.0MBytes (all graphics, sound, code, text). The benefits of using Huffman here are astounding.

Also, just noting this here: there are probably multitudes of Battletoads ROMs (including corrupted ones) floating around the Internet, so you would be best to provide the MD5/SHA1/SHA256 hashes of the ROM itself.

The Python code makes assumptions -- and it has to, it's not your fault -- regarding file offsets of where texts are. These offsets will probably be different depending upon region and dump of the game (there are bad dumps of pretty much everything on the planet).

Cheers!

EDIT: I see the filename in the Python code is hard-coded to "Battletoads (U).nes". MD5/SHA1/SHA256 would still be good to have. You'll thank me later when a year from now someone shows up on this forum stating your code doesn't work and it turns out they have a corrupted file. :-)


PostPosted: Sat Feb 11, 2012 9:56 pm 

Joined: Fri Nov 19, 2004 7:35 pm
Posts: 3943
Japanese Battletoads used Hiragana mixed with English for all names and other proper nouns. It's really weird. Normally you'd expect Katakana instead.

_________________
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!


PostPosted: Sat Feb 11, 2012 10:15 pm 

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 5724
Location: Canada
Yes, there was a Japanese version, but it was developed in Britain. I doubt the decision to use huffman had anything to do with it being theoretically good for Japanese text. They had one 32k bank dedicated to cutscenes, and they managed to squeeze an extra 4k out of it, so... why not? There's very little fill space left over in this page of the ROM. Possibly they wrote so much random variant text to fill up the space!

The ROM filename is specified in the python file. ("Battletoads (U).nes") If you're trying it out and you don't seem to have the right ROM you could just e-mail me I guess. I don't really care if it's not robust across different ROM versions, it's a one-use program anyway. I could just send you the output even. I posted it because it's interesting how it works; and it could trivially be adapted for the other ROM versions.

The specific huffman table was obviously generated from the set of text they had. I believe the symbol table is stored in order of frequency, and it looks like:
space, E, T, O, A, S, R, N, H, '\n', I, U, D, L, Y, G, '!', ''', C, M, B, ',', W, P, F, '\n', end, K, '-', '.', V, '\n' + 2 spaces, '\n' + 1, '\n' + 3, '\n' + 5, '\n' + 4, '\n' + 6, '\n' + 7, '?', Z, Q, J, X, '\n' + 8, '?' (don't ask me why there's two)

I haven't bothered to actually reconstruct the decoding tree, but you can do it if you're interested.


Last edited by rainwarrior on Sat May 05, 2012 6:20 pm, edited 2 times in total.

PostPosted: Sun Feb 12, 2012 2:51 am 

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
koitsu wrote:
I'm surprised they used Huffman. Huffman really doesn't provide very many benefits for English (ASCII) character sets.


Pretty much the main reason to use it is because you can start decompressing right in the middle of a block of text. You try that with LZ and you'll need lots of RAM to hold your window, and more time to process all of the text up to that point.

Would be interesting to compare it to a 6-bit (upper+lower) or 5-bit (upper only) bit-pack, see how much space they actually saved. Rough guess from past experiences, I'd say 10-20% more.
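For reference, the 6-bit bit-pack being compared against is only a few lines of code. A sketch: the mapping from characters to 6-bit codes (an alphabet of at most 64 symbols) is assumed to happen elsewhere.

```python
def pack6(codes):
    """Pack a list of 6-bit codes into bytes, MSB first."""
    acc = bits = 0
    out = bytearray()
    for c in codes:
        acc = (acc << 6) | (c & 0x3F)
        bits += 6
        while bits >= 8:
            bits -= 8
            out.append((acc >> bits) & 0xFF)
    if bits:
        out.append((acc << (8 - bits)) & 0xFF)  # pad the last byte with zeros
    return bytes(out)
```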


PostPosted: Sun Feb 12, 2012 6:57 am 

Joined: Fri Nov 12, 2004 2:49 pm
Posts: 7231
Location: Chexbres, VD, Switzerland
I think huffman works well on any text data, no matter the language (as long as the tree is adapted to the language, of course), since in all languages some letters (or symbols, as not all languages use letters) come more frequently than others.

_________________
Life is complex: it has both real and imaginary components.


PostPosted: Sun Feb 12, 2012 2:06 pm 

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 5724
Location: Canada
So I got bored and made a graph of the Huffman tree:
http://rainwarrior.ca/projects/nes/battletoads_huffman.png


Last edited by rainwarrior on Sat May 05, 2012 6:02 pm, edited 1 time in total.

PostPosted: Sun Feb 12, 2012 6:20 pm 

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19107
Location: NE Indiana, USA (NTSC)
Huffman coding is optimal where the probability of a symbol doesn't depend on the symbols surrounding it. English, on the other hand, is very context-sensitive. Once I considered Huffman with the optimization of finding the most common character following each given character and then swapping the code for that character with the code for a space. It saved a little but not very much.

Digram coding (aka BPE or DTE) is a context-sensitive technique used in many Super NES RPGs. Huffword is another context-sensitive technique that works well on really long texts, reducing a novel by 60% or more. But the amount of text in an NES game's cut scenes might not be enough to take full advantage of this context; Huffword found too many hapax legomena (words occurring once that would need to be encoded with character-based Huffman coding) when I ran it on the script of Thwaite.
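Digram coding is simple enough to sketch in a few lines of Python. A toy version: it assumes codes 128-255 are free for substitutions, which is true for plain ASCII text but is an assumption here.

```python
from collections import Counter

def bpe_compress(data, first_code=128, max_codes=128):
    """Repeatedly replace the most common adjacent pair with a fresh code."""
    table = {}                        # code -> (byte_or_code, byte_or_code)
    data = list(data)
    code = first_code
    while code < first_code + max_codes:
        pairs = Counter(zip(data, data[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break                     # nothing worth replacing
        out, i = [], 0
        while i < len(data):
            if i + 1 < len(data) and data[i] == a and data[i + 1] == b:
                out.append(code)
                i += 2
            else:
                out.append(data[i])
                i += 1
        table[code] = (a, b)
        data = out
        code += 1
    return data, table

def bpe_expand(data, table):
    """Expand codes back to pairs, recursively (codes can contain codes)."""
    out, stack = [], list(reversed(data))
    while stack:
        c = stack.pop()
        if c in table:
            a, b = table[c]
            stack += [b, a]           # a comes out first
        else:
            out.append(c)
    return out
```

Expansion just replaces codes with their pairs, recursively, so it stays cheap at decode time even though compression takes many passes.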


PostPosted: Sun Feb 12, 2012 7:43 pm 

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 5724
Location: Canada
byuu, if all of your symbols had equal probability (worst case for huffman) the resulting huffman tree would be the same as fixed-bit-length (though storing the tree itself will put you over by the size of the tree). Best case for huffman I guess would be all the same symbol, which could be represented by a single bit, giving you 8:1 compression. Obviously you won't get near that, in general, but it's a lower bound on the possible compressed size. For Battletoads it works out to maybe 4.7 bits per byte (you'd need 6 for a fixed-bit encoding, or about 5.5 if you built a tree that matches the number of symbols, but at that point you should just go with huffman).

tepples mentioned its primary weakness; it starts with the premise of only looking at individual bytes and trying to compress them. You can try to do better with longer strings, words, or other things that seem appropriate to your data; the spaces after a newline in Battletoads are RLEd as a minor adaptation (it's not entirely successful; the tree produced a bitcode for \n+2 that is longer than \n with 2 individual spaces, for instance).

Digram coding sounds interesting; very elegant! I'd never heard of it. Though I suspect it could be very slow, since it's liable to have many passes (even worse if you need to do it in place). Might be acceptable for shorter texts. What Super NES RPGs used it? (I guess being invented in 1994 it would have been limited to later ones...)


PostPosted: Sun Feb 12, 2012 7:54 pm 

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19107
Location: NE Indiana, USA (NTSC)
rainwarrior wrote:
though storing the tree itself will put you over by the size of the tree

And even that can be minimized. Instead of storing each element of the tree, store only the number of symbols with each bit length. Such a "canonical Huffman" code has a very compact representation, as I described in my article about Huffword.
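The reconstruction from a canonical description works out to very little code. A sketch (the length counts and symbols in the example are made-up values, not the Battletoads table):

```python
def canonical_codes(length_counts, symbols):
    """length_counts[i]: how many codes have length i+1 bits.
    symbols: the symbols sorted by increasing code length
    (i.e. decreasing frequency). Returns {symbol: bit string}."""
    codes = {}
    code = 0
    syms = iter(symbols)
    for length, count in enumerate(length_counts, start=1):
        for _ in range(count):
            codes[next(syms)] = format(code, '0{}b'.format(length))
            code += 1
        code <<= 1   # moving to the next length appends a 0 bit
    return codes
```

For instance, one 2-bit code and three 3-bit codes over the symbols 'ABCD' regenerates the prefix-free set 00, 010, 011, 100.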

Quote:
Digram coding [...] could be very slow, since it's liable to have many passes

Which is probably tolerable for three reasons:
  • PCs are real fast nowadays, and they can afford to make numerous passes over a text.
  • If you're doing the compression on a PC, you can build the digram tree on one core and compile the rest of the program on another core.
  • The statistics for an entire corpus aren't likely to change much over the course of one day. You could rebuild the digram tree only once a day and then just reuse the same tree, even if it has become slightly suboptimal, when reencoding your corpus.
Quote:
What Super NES RPGs used it?

Romhacking.net would be able to answer that one best. (They know it by the name DTE rather than the academic name "digram coding".)

Quote:
(I guess being invented in 1994 it would have been limited to later ones...)

The 1994 date has to be taken in the context of Wikipedia's verifiability policy. Wikipedia considers only "reliable" sources available to the public (whether free or paid). So even if games used it before then, unless it was explained in a published article, it doesn't count.


PostPosted: Sun Feb 12, 2012 8:18 pm 

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 5724
Location: Canada
Well, 1994 is the earliest cited example; why would I assume there's something earlier unless someone tells me there is?

I tried a quick search of Romhacking.NET but all I'm finding is that people are commonly hacking an implementation of DTE into old ROMs to make themselves more room to translate into English, which is a pretty cool application of it. I'd love to know if/when it first appeared in an original game though.

I was wondering if decompression was slow, not the compression part, but I hadn't realized it could decompress as a tree; I was thinking it would scan the string and rebuild it with each substitution one at a time, but realizing you could apply them recursively, I guess the decompression time wouldn't be that bad at all. (This is a pretty neat method. Thanks for mentioning it!)

Also, the size of the huffman table isn't necessarily an issue; I was just mentioning it because it is there. For text of any size worth compressing, it's probably negligible. In this case it was 84 bytes, not a big deal. Looking at that article, I guess this case it could be gotten down to 22 bytes (44 4-bit bitcode length values), and I'm not sure this wouldn't add another 62 to the decompressor. If you had lots of tables though it could be worthwhile.

Another thing I'm wondering about byte-pair encoding is that you need extra symbols for your substitutions. For text stored as bytes it's probably great, since you've got tons of unused symbols at the ready, but for general purpose... the algorithm can probably be adapted to get around this problem, but it might get a bit more complicated?


Last edited by rainwarrior on Sun Feb 12, 2012 8:40 pm, edited 1 time in total.

PostPosted: Sun Feb 12, 2012 10:17 pm 

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19107
Location: NE Indiana, USA (NTSC)
rainwarrior wrote:
Looking at that [section about canonical Huffman codes in the Huffword] article, I guess this case it could be gotten down to 22 bytes (44 4-bit bitcode length values)

Actually it's one byte for each bit length, not for each distinct symbol. For example, if codes are up to 13 bits long as in the Battletoads tree, the description would be 13 bytes, followed by the 43 or so bytes for the actual values sorted by increasing bit length (i.e. decreasing frequency).

Quote:
and I'm not sure this wouldn't add another 62 to the decompressor.

Which means the only way for me to settle it would be to write a demo of CH decompression.

Quote:
If you had lots of tables though it could be worthwhile.

Or if you had a huge table, like the table used for Huffword.

Quote:
Another thing I'm wondering about byte-pair encoding is that you need extra symbols for your substitutions. For text stored as bytes it's probably great, since you've got tons of unused symbols at the ready, but for general purpose...

What are you trying to compress that can be modeled by byte pairs but doesn't have any built-in internal fragmentation for digram coding to exploit? CHR data?


Last edited by tepples on Mon Feb 13, 2012 9:23 am, edited 1 time in total.

PostPosted: Sun Feb 12, 2012 11:54 pm 

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 5724
Location: Canada
I said "general purpose". I'm not trying to compress anything, I'm just wondering what kind of data (if any) this algorithm can't apply to. If your data doesn't have unused symbols, is there a viable variant of the algorithm? (Or would it become complex enough that there are better compression methods for the same level of complexity?)

I can see how it can work nicely for text. A common word would eventually be compressed down to a single character, each pass of the algorithm packing in another letter, but also other words with shared substrings would get to reuse some of the same symbols. Pretty nifty. Hmm, this seems a lot like LZ78, actually.

What does internal fragmentation have to do with it? I don't follow why you mentioned it. Free symbols seem to be the primary requirement (for a simple implementation, at least). A lot of data has repeated multi-byte strings in it, though, which is why I'm wondering if the free symbol requirement can be relaxed.


PostPosted: Mon Feb 13, 2012 4:55 am 

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10064
Location: Rio de Janeiro - Brazil
rainwarrior wrote:
What does internal fragmentation have to do with it?

I think internal fragmentation means the same as free symbols/codes in this case. Recently I read about a DTE variation that doesn't require any free codes, but I didn't really get how it works yet.


PostPosted: Mon Feb 13, 2012 8:52 am 

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10064
Location: Rio de Janeiro - Brazil
Here's the DTE variant I was talking about: http://en.wikibooks.org/wiki/Data_Compression/Dictionary_compression#SCZ_byte_pair_encoding

Apparently it takes a rarely used byte and uses it as an escape code, which is used to indicate single bytes. If the most rarely used bytes are all preceded by the escape code, their codes become available for representing the most common pairs. What happens is that the least common bytes are expanded to 2 bytes, but the most common pairs are compressed to 1 byte. Also, as pairs are encoded, more characters might not appear by themselves anymore, also freeing their codes for representing compressed pairs.

That's what I got so far, but I'm sure there's more to it.
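If I understand it right, the escape-code trick can be sketched like this. A toy version handling a single freed code; ESC and the choice of rare byte and pair are assumptions for illustration, not SCZ's actual format.

```python
ESC = 0x7F  # assumption: a byte that never appears in the text

def escape_encode(data, rare, pair):
    """Free up the code `rare` by escaping its literal occurrences,
    then use it to stand for the common `pair`."""
    out, i = [], 0
    while i < len(data):
        if data[i:i + 2] == list(pair):
            out.append(rare)          # freed code now means the pair
            i += 2
        elif data[i] == rare:
            out += [ESC, rare]        # literal rare byte costs 2 bytes
            i += 1
        else:
            out.append(data[i])
            i += 1
    return out

def escape_decode(data, rare, pair):
    out, i = [], 0
    while i < len(data):
        if data[i] == ESC:
            out.append(data[i + 1])   # escaped literal
            i += 2
        elif data[i] == rare:
            out += list(pair)         # freed code expands to the pair
            i += 1
        else:
            out.append(data[i])
            i += 1
    return out
```

As long as the pair occurs more often than the rare byte, the pairs saved outweigh the escapes spent, which matches the trade-off described above.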

