Generic compression on NES - *done*

Discuss technical or other issues relating to programming the Nintendo Entertainment System, Famicom, or compatible systems. See the NESdev wiki for more information.


Bregalad
Posts: 8056
Joined: Fri Nov 12, 2004 2:49 pm
Location: Divonne-les-bains, France

Generic compression on NES - *done*

Post by Bregalad »

*** EDIT ***
This project has now been completed. The RHDN page for the project is here: http://www.romhacking.net/utilities/882/

*** Original post ***
I'd like to create tools to help compress generic data on the NES (or any other platform, for that matter).
ROM space is limited on the NES, which makes compression especially important. Data that could be compressed includes (but is not restricted to):
- Game level maps (contain some repetitiveness)
- Tilesets (I'm pretty sure tokumaru already released a great algorithm specific to tilesets, but any generic algorithm should be applicable too)
- Text data (as there are only 26 letters and predictable entropy, it can compress very well)
- Meta-sprite data (with 4 bytes per hardware sprite it takes a lot of space)
- Music sequence data (contains some repetitiveness / is predictable)
- 6502 code (predictable entropy)

Then there are 2 ways I can think of data being decompressed:
1) The obvious one: a whole "file" is compressed as a whole and decompressed as a whole. Of course, you need to use WRAM to decompress the file unless it's really small (and if it's really small, then why compress it?). So this method is useful only if you want to compress large files.
2) The useful one: a whole "file" is compressed, but you only decompress a small section of it at a time, typically less than 256 bytes.
For example, if you have a compressed script for the game, you only want to decompress a sentence at a time; if you compress a map, you only need one row or column at a time, etc., and therefore no WRAM is needed.

This makes it an important point. Compression algorithms used MUST allow breaks in the compressed sequence and multiple entry points, so that you can decompress only a small section of the "file" when it's needed.
Most algorithms I've seen don't take this into account. More specifically, LZ77 requires references to previously decoded data to work. This typically makes it unusable.

So, the tool I'd like to build would have to be practical, general purpose and flexible. I'd like it to try to compress a file with multiple algorithms, and let the user choose the best one (the one that compresses best). Also, it should be easy for anyone to add their own algorithms to the list, provided they write the compression and decompression code themselves.

To be practical the tool should work for binary data but also for assembly code so that it could turn this :

Code: Select all

Label:
   .db $ab, $7f, $1b, $7a, $02, $99, $00    ; Uncompressed data
To this :

Code: Select all

Label:
    .db $3f, $d4, $eb         ; Compressed data
(I just wrote down random data).

This will be crucial if there are many labels inside a file, so that the program can access (and decompress) a small chunk of data within the file (instead of decompressing the whole file).

Any compression algorithm could be used if it follows these 3 conditions:
- Pseudo-randomly accessible data (with labels in the middle of the file) that doesn't require references to previous data
- Doesn't require too much RAM to decompress
- Doesn't require too much CPU time to decompress

The size of the decompression code should also be taken into account when comparing different algorithms to determine the most efficient one for a particular piece of data.
Last edited by Bregalad on Mon Jul 22, 2013 12:12 am, edited 2 times in total.
Useless, lumbering half-wits don't scare us.
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Post by tepples »

A few years ago, I played around with "Huffword" compression, which is like metatiles for text. A compressed text would be broken into "pages", such as lines of dialogue in an RPG or screens of text in the game's built-in operations guide. It would consist of the following:
  • Huffman-compressed stream of word IDs
  • Pointers to start of page in the word stream
  • Huffman decoding table for word IDs
A dictionary would consist of the following:
  • Huffman-compressed character streams
  • Pointers to each word ID in the character stream (I have some ideas on how to store these pointers efficiently)
  • Huffman decoding table for characters
Thus there would be two Huffman decoding tables, one for words and one for characters. I know of an efficient way to store "canonical Huffman" decoding tables if you're curious.
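To make that concrete, here is a minimal C sketch of how the pieces could be declared; the names and field widths are assumptions for illustration, not a fixed format:

Code: Select all

/* Illustrative layout only: sizes and names are assumptions, not a spec. */
struct huffword_text {
    const unsigned char  *word_stream;   /* Huffman-compressed word IDs */
    const unsigned short *page_starts;   /* bit offset where each page begins */
    const unsigned char  *word_lengths;  /* canonical Huffman code lengths for word IDs */
};

struct huffword_dict {
    const unsigned char  *char_stream;   /* Huffman-compressed characters */
    const unsigned short *word_starts;   /* bit offset of each word ID's spelling */
    const unsigned char  *char_lengths;  /* canonical Huffman code lengths for characters */
};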

With some kinds of data, the only kind of compression you can do is to rearrange data with little internal fragmentation. My music engine stores music sequence data in what I believe is a fairly efficient format: 5-bit interval from a pattern's base note and 3-bit duration (1, 2, 3, 4, 6, 8, 12, or 16 rows). In-between durations are stored as two bytes, representing two notes tied together. I'm not so sure as to the most efficient way to store sound effect sequence data or instrument envelope data though.
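As a concrete illustration of that note format, a short C sketch of packing and unpacking one note byte (placing the duration index in the top 3 bits is an assumption for the example; the real bit order may differ):

Code: Select all

#include <stdio.h>

/* Row counts selectable by the 3-bit duration field. */
static const unsigned char duration_rows[8] = {1, 2, 3, 4, 6, 8, 12, 16};

/* Pack a 5-bit interval above the pattern's base note with a 3-bit
   duration index (bit placement is an assumption for this example). */
static unsigned char pack_note(unsigned interval, unsigned dur_index) {
    return (unsigned char)((dur_index << 5) | (interval & 0x1F));
}

int main(void) {
    unsigned char note = pack_note(7, 3);   /* +7 semitones, 4 rows */
    printf("interval %u, %u rows\n", note & 0x1F, duration_rows[note >> 5]);
    return 0;
}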

What kind of predictable entropy is in 6502 code? I've read that code for 8-bit architectures is already a lot denser than code for 32-bit RISC architectures such as MIPS or ARM, apart from things like Thumb.

The kind of compression you'd use for game level maps depends on how the maps are structured. For something like SMB1 (inf by 13), SMB2 (inf by 15), SMB3 (inf by 27), or Super Mario World (inf by 32), the level could be treated as a list of objects roughly sorted from left to right, and scrolling into a new column would involve finding which objects intersect a given column.
slobu
Posts: 276
Joined: Tue Jul 12, 2011 10:58 am

Post by slobu »

I have no idea what I'm talking about. That being said, here is a solution used by Genesis homebrew developers:
https://github.com/sikthehedgehog/mdtools

I heard it wasn't a terrible leap from C64 assembly to Amiga. Maybe some of the C or assembly code will give you ideas. I think the author mentioned it can decompress a single tile from a set just fine.
doynax
Posts: 162
Joined: Mon Nov 22, 2004 3:24 pm
Location: Sweden
Contact:

Post by doynax »

Bregalad wrote:This makes it an important point. Compression algorithms used MUST allow breaks in the compressed sequence and multiple entry points, so that you can decompress only a small section of the "file" when it's needed.
Most algorithms I've seen don't take this into account. More specifically, LZ77 requires references to previously decoded data to work. This typically makes it unusable.
Only if you need random access. For something like a one-way scroller or a music track an LZ-77 coder with a small sliding-window can work quite nicely. At any rate small block sizes and contexts tend to really hurt most compression algorithms, so be careful.
tepples wrote:What kind of predictable entropy is in 6502 code? I've read code for 8-bit architectures is already a lot more dense than code for 32-bit RISC architectures such as MIPS or ARM, apart from things like Thumb.
There's still a fair bit of redundancy. My LZ coder typically saves 30-40% on C64 "code", though YMMV. Aside from unrolled loops and little tables there's usually a lot of repeated variable addresses and opcodes, assuming that encoding nearby one or two-byte matches is a win.

Also even when you do have a "general-purpose" compressor working there's usually much to be gained by transforming your content to better fit it. So think hard about how you interleave your data streams and experiment with simple delta-coding and the like to expose more redundancy.
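For instance, a minimal delta-coding sketch in C (just one possible transform; run it in place over a buffer before handing it to the real compressor, and invert it after decompression):

Code: Select all

/* Replace each byte with its difference from the previous byte, so
   slowly-changing data turns into runs of small, repetitive values. */
void delta_encode(unsigned char *buf, int len) {
    unsigned char prev = 0;
    for (int i = 0; i < len; i++) {
        unsigned char cur = buf[i];
        buf[i] = (unsigned char)(cur - prev);   /* wraps modulo 256 */
        prev = cur;
    }
}

/* Exact inverse, run after decompression. */
void delta_decode(unsigned char *buf, int len) {
    unsigned char prev = 0;
    for (int i = 0; i < len; i++) {
        prev = (unsigned char)(prev + buf[i]);
        buf[i] = prev;
    }
}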
Bregalad
Posts: 8056
Joined: Fri Nov 12, 2004 2:49 pm
Location: Divonne-les-bains, France

Post by Bregalad »

Oh I forgot to mention that I wanted to base my work on this library :

http://www.romhacking.net/docs/134/
I'm pretty sure it contains quite a few different algorithms, although each algorithm can have its variants.
The problem will be coding all the parsing of data in C, but I guess I should first come up with something that works with purely binary files, and then do something that parses text files and detects labels and .db, .dw and .incbin statements, so it can compress the data while keeping the rest intact.
Also, I'd have to compile decompression routines for all the algorithms, compute how many bytes they take, and print a comparative table on the screen to tell the user how efficient all the implemented algorithms are.
This might not be easy, but it will be SOOO worth it!

tepples, of course I'm interested in your efficient way to store the binary tree!! The only time I tried to implement Huffman, I used 16-bit words for the binary tree, which obviously is NOT efficient. Each node contained two pointers to its child nodes, the first followed when a '0' is met and the second when a '1' is met.
Then, when you eventually reach a leaf, the pointer's high byte was $00, which indicates the low byte is the literal to insert into the decompressed stream.
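In C it would look roughly like this minimal sketch of that walk (read_bit() stands in for the bit reader; treating the entries as offsets into the table and keeping internal-node offsets at $0100 or above is an assumption, made so that a zero high byte can mark a leaf):

Code: Select all

extern int read_bit(void);   /* placeholder bit reader */

/* tree[] holds pairs of 16-bit entries; the entry at offset n is taken
   on a '0' bit, the entry at n+1 on a '1' bit.  An entry with a nonzero
   high byte is the offset of the next pair (assumed >= $0100); a zero
   high byte marks a leaf whose low byte is the literal to emit. */
unsigned char huff_decode_symbol(const unsigned short *tree, unsigned root)
{
    unsigned short node = tree[root + read_bit()];   /* start at the root pair */
    while (node & 0xFF00)                            /* internal node */
        node = tree[node + read_bit()];
    return (unsigned char)node;                      /* leaf: literal byte */
}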

Unfortunately, this is efficient only for data such as text, where only a small part of all 256 byte values is actually used. If all 256 byte values were used at least once, it would result in a very large tree.
I guess it would be more efficient to only use 8-bit words, but then how do you know whether it's a literal or a pointer?
doynax wrote:Only if you need random access. For something like a one-way scroller or a music track an LZ-77 coder with a small sliding-window can work quite nicely. At any rate small block sizes and contexts tend to really hurt most compression algorithms, so be careful.
Interesting. I bet a lot of stuff would not compress well with a small window, though.
For example, if your window is 64 bytes, you'd need to keep a circular buffer of 64 bytes constantly "open" in your compressed stream, and you can only "advance" in it, never go back or anything.
For example, in the case of a song that loops, you'd have to tell the compressor to reset everything at the start of the loop and never use references to previous data, otherwise it would work on the first play of the song and then screw up after the first loop!

In other words, all "sub-sections" that are between labels (every sentence in the case of text, every metasprite in the case of metasprites, every routine in the case of 6502 code) would have to be compressed with a blank sliding window in mind, which would annihilate any efficiency of the compression.

Not that I have anything against LZxx algorithms. If anyone has a variant of them that would work on a very small amount of data at a time (using a common dictionary for the whole file), then I'm all for it too.


tepples wrote:What kind of predictable entropy is in 6502 code? I've read that code for 8-bit architectures is already a lot denser than code for 32-bit RISC architectures such as MIPS or ARM, apart from things like Thumb.
I'm pretty sure common opcodes and common arguments will be more frequent.
$00, $ff, $a9, $a5, $85, $4a and $0a will be very frequent bytes and could be encoded with fewer bits, while a byte like, say, $de will be rare and be encoded with more bits.
slobu wrote:I have no idea what I'm talking about. That being said, here is a solution used by Genesis homebrew developers:
https://github.com/sikthehedgehog/mdtools
Nice, but unfortunately it looks like it's specifically for graphics. I'm no expert, but a codec made for graphics might not work well with other kinds of data. I might be wrong though.
Anyway, my goal would be to make a tool that is completely free and open source, so that anyone could write their own compression and decompression routines in C to be added to the list my tool tries, and then it's up to the user to know which ones to actually use in their games/demos.

@tepples: about efficient ways to encode music, I don't know. I use a format which is 1 byte per note (4-bit pitch and 4-bit length; some values act as commands instead), but then I have to use a command byte for each octave change. I can also repeat sections and call "subroutines" (only one level; I don't have a stack, although a stack could be implemented).
Also, I wonder: if you are going to compress the data anyway, maybe it's sometimes better to store data in an inefficient way that compresses efficiently than in an efficient way that compresses inefficiently. The latter could end up like trying to compress already-compressed data, which often doesn't work well (although I might be wrong).
Last edited by Bregalad on Tue Aug 16, 2011 12:06 pm, edited 1 time in total.
Useless, lumbering half-wits don't scare us.
Dwedit
Posts: 4924
Joined: Fri Nov 19, 2004 7:35 pm

Post by Dwedit »

SMB3 maps are not infinitely wide. The game loads the entire map into RAM. There are 0x1950 bytes reserved for the map in RAM. At a height of 27, that's up to 240 tiles wide.

As for the main point of the thread, how to compress stuff, never overlook explicit dictionary compression (LZ78). I saw Zelda Oracles use it, and was amazed by how well it worked. You get random-access, and a low RAM requirement.
But if you have enough RAM to hold the entire decompressed thing, an LZ77-based algorithm always beats LZ78.
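As a generic illustration of the idea (not the actual Oracle format), a minimal C sketch of decoding against an explicit, pre-built dictionary; since nothing refers back to previously decoded output, any entry point in the stream can be decoded with essentially no RAM:

Code: Select all

#include <string.h>

/* Codes below 0x80 are literal bytes; 0x80 and up index a table of
   dictionary strings chosen offline by the compressor (the entries here
   are made up for the example). */
static const char *dictionary[] = { "the ", "compress", "ing ", "data" };

int dict_decompress(const unsigned char *src, int src_len, char *dst)
{
    int out = 0;
    for (int i = 0; i < src_len; i++) {
        if (src[i] < 0x80) {
            dst[out++] = (char)src[i];              /* literal */
        } else {
            const char *entry = dictionary[src[i] - 0x80];
            size_t n = strlen(entry);
            memcpy(dst + out, entry, n);            /* expand dictionary entry */
            out += (int)n;
        }
    }
    return out;                                     /* bytes written to dst */
}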
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!
doynax
Posts: 162
Joined: Mon Nov 22, 2004 3:24 pm
Location: Sweden

Post by doynax »

Bregalad wrote:
doynax wrote:Only if you need random access. For something like a one-way scroller or a music track an LZ-77 coder with a small sliding-window can work quite nicely. At any rate small block sizes and contexts tend to really hurt most compression algorithms, so be careful.
Interesting. I bet a lot of stuff would not compress well with a small window, though.
For example, if your window is 64 bytes, you'd need to keep a circular buffer of 64 bytes constantly "open" in your compressed stream, and you can only "advance" in it, never go back or anything.

For example, in the case of a song that loops, you'd have to tell the compressor to reset everything at the start of the loop and never use references to previous data, otherwise it would work on the first play of the song and then screw up after the first loop!

In other words, all "sub-sections" that are between labels (every sentence in the case of text, every metasprite in the case of metasprites, every routine in the case of 6502 code) would have to be compressed with a blank sliding window in mind, which would annihilate any efficiency of the compression.
By a "small" window I was thinking more along the lines of a kilobyte or two stored in SRAM. And, yes, you'd only do it with one-way streaming data. If your average block is smaller than a couple of kilobytes then it's the wrong way to go.

This approach is particularly useful if for streaming off of an external medium which isn't random-access, like a floppy, tape or network. Imagine an FDS shoot'em up continually streaming in maps, objects and music.

Of course random-access to compressed data is often amortized by caching larger blocks, and hoping that the next access will tend to fall into the same block. Think packing text in RPGs into kilobyte-sized chunks.
Bregalad wrote:Not that I have anything against LZxx algorithms. If anyone has a variant of them that would work on a very small amount of data at a time (using a common dictionary for the whole file), then I'm all for it too.
I too would be interested in any such methods and I wouldn't be surprised if it's been studied extensively, but I've never seen such a method. Generating an optimal "phrasebook" for a file is bound to be computationally infeasible but there's got to be decent heuristics out there.

How did you intend to do this for huffword, Tepples?
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Post by tepples »

Bregalad wrote:tepples, of course I'm interested in your efficient way to store the binary tree !!
In a canonical Huffman code, the codes are always lexicographically ordered by their bit strings, which increase in length. Consider this code for text:

Code: Select all

00    a
010   t
011   i
100   n
1010   
1011  u
1100  m
1101  p
1110  l
11110 k
11111 EOF
It turns out you can reconstruct the bit patterns from the number of codes of each length (0, 1, 3, 5, 2) fairly efficiently, and then mapping code indices to symbols is as easy as the string "atin umplk\0".
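A minimal C sketch of that reconstruction at decode time, using the counts and symbol order from the table above (the algorithm is the standard canonical-Huffman walk; the bit reader and the hard-coded test stream are just placeholders for the example):

Code: Select all

#include <stdio.h>

/* Number of codes of each length 1..5 and the symbols in canonical
   order, both taken from the table above; index 10 stands for EOF. */
static const int  counts[6] = {0, 0, 1, 3, 5, 2};
static const char symbols[] = "atin umplk";

/* Placeholder bit reader over a hard-coded test stream, MSB first:
   00 100 010 11111 = "a", "n", "t", then EOF. */
static const unsigned char stream[] = {0x22, 0xF8};
static int bitpos = 0;
static int read_bit(void)
{
    int b = (stream[bitpos >> 3] >> (7 - (bitpos & 7))) & 1;
    bitpos++;
    return b;
}

/* Canonical Huffman decode: at each length, compare the bits read so far
   against the first canonical code of that length. */
static int decode_symbol(void)
{
    int code = 0, first = 0, index = 0;
    for (int len = 1; len <= 5; len++) {
        code |= read_bit();
        int count = counts[len];
        if (code - first < count)
            return index + (code - first);   /* index into symbols[] */
        index += count;
        first = (first + count) << 1;
        code <<= 1;
    }
    return -1;                               /* corrupt stream */
}

int main(void)
{
    for (;;) {
        int s = decode_symbol();
        if (s < 0 || s == 10)                /* error or EOF */
            break;
        putchar(symbols[s]);
    }
    putchar('\n');                           /* prints "ant" */
    return 0;
}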
Bregalad wrote:Also, I wonder: if you are going to compress the data anyway, maybe it's sometimes better to store data in an inefficient way that compresses efficiently than in an efficient way that compresses inefficiently. The latter could end up like trying to compress already-compressed data, which often doesn't work well (although I might be wrong).
In a few oddball cases, a round of RLE before a round of LZ can help, especially if the LZ's sliding window is short. I'd look it up in more detail, but I have pressing housework.
doynax
Posts: 162
Joined: Mon Nov 22, 2004 3:24 pm
Location: Sweden

Post by doynax »

Dwedit wrote: As for the main point of the thread, how to compress stuff, never overlook explicit dictionary compression (LZ78). I saw Zelda Oracles use it, and was amazed by how well it worked. You get random-access, and a low RAM requirement.
That still leaves the question of what to add to the dictionary and what to leave out. Traditional "streaming" LZ-78 builds it incrementally and adds every non-match to it (at least until it gets full), which would be rather inefficient in this case.
Also, are their dictionaries recursively structured (like LZW), unordered blobs of memory (like LZ-77), or built some other way?
tepples wrote:
Bregalad wrote:Also, I wonder: if you are going to compress the data anyway, maybe it's sometimes better to store data in an inefficient way that compresses efficiently than in an efficient way that compresses inefficiently. The latter could end up like trying to compress already-compressed data, which often doesn't work well (although I might be wrong).
In a few oddball cases, a round of RLE before a round of LZ can help, especially if the LZ's sliding window is short. I'd look it up in more detail, but I have pressing housework.
RLE is just a special case of LZ with a single-byte offset, so it's going to be a loss unless encoding an RLE run uses significantly fewer bits than the equivalent LZ match.
Anyway, Bregalad makes a good point. It's often worth enlarging data to make it compress better, so e.g. it's probably a bad idea to bit-stuff your data before feeding it through a general-purpose compressor. This is also why EXE packers like to transform short relative branches into absolute jump addresses.
Last edited by doynax on Tue Aug 16, 2011 12:58 pm, edited 1 time in total.
Bregalad
Posts: 8056
Joined: Fri Nov 12, 2004 2:49 pm
Location: Divonne-les-bains, France

Post by Bregalad »

Dwedit wrote:As for the main point of the thread, how to compress stuff, never overlook explicit dictionary compression (LZ78). I saw Zelda Oracles use it, and was amazed by how well it worked. You get random-access, and a low RAM requirement.
But if you have enough RAM to hold the entire decompressed thing, an LZ77-based algorithm always beats LZ78.
I've currently no idea how '78 works but I'll look into it.

The problem is that most info pages I can find with Google insist on how the data is encoded, when I'm more interested in how it's decoded, but oh well...
tepples wrote:It turns out you can reconstruct the bit patterns from the number of codes of each length (0, 1, 3, 5, 2) fairly efficiently, and then mapping code indices to symbols is as easy as the string "atin umplk\0".
I don't really understand what you mean but that library on RHDN that I linked also does something like that so I'll look into it.

@Dwedit: SMB3 uses WRAM, which I don't want to always rely on, as I said in my first post. Of course I have nothing against methods that would require decompressing a large file to WRAM, but I think there is a total lack of info about compressing really small chunks of data (< 256 bytes), which would end up more useful in games.
Useless, lumbering half-wits don't scare us.
ccovell
Posts: 1045
Joined: Sun Mar 19, 2006 9:44 pm
Location: Japan

Post by ccovell »

Look into pucrunch tools for 6502?
Dwedit
Posts: 4924
Joined: Fri Nov 19, 2004 7:35 pm

Post by Dwedit »

Pucrunch and Aplib are both LZ77-based. So are zip, 7-zip, etc. When I say that, I mean any system that looks at previous data with a sliding window. (Sometimes the sliding window can extend to the entire block.)
They need the entire block of RAM available to do decompression, and require sequential access if you don't want to use the entire block of RAM.

LZ78-style systems have random access, as long as you have a pointer table that gets you to the correct starting location.
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!
tomaitheous
Posts: 592
Joined: Thu Aug 28, 2008 1:17 am

Post by tomaitheous »

Dwedit wrote:Pucrunch and Aplib are both LZ77-based. So is zip, 7-zip, etc. When I say that, I mean any system that looks at previous data with a sliding window. (sometimes the sliding window can extend to the entire block)
They need the entire block of ram available to do decompression, and require sequential access if you don't want to use the entire block of ram.
The block of RAM only needs to be the size of the sliding window, if you use a circular buffer. But then again, that limits you to turning off the display for large updates to the PPU, or only copying data from the ring buffer to port accesses (like the PPU) in small amounts during the vblank period.

Just nitpicking, but LZ77 is the original method and uses run lengths of raw data. LZSS uses a bitmask to select whether the byte read is a raw/literal byte or part of a command event for operating the LZ window access. I know Pucrunch uses LZSS. I can't really see any LZ-based compression algos using the original LZ77 spec like I previously mentioned.
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Post by tepples »

tomaitheous wrote: The block of RAM only needs to be the size of the sliding window, if you use a circular buffer. But then again, that limits you to turning off the display for large updates to the PPU, or only copying data from the ring buffer to port accesses (like the PPU) in small amounts during the vblank period.
And it also limits the sliding window to the dinky little RAM in the NES unless you're paying extra for PRG RAM on the cart.
tomaitheous wrote:Just nitpicking, but LZ77 is the original method and uses run lengths of raw data. LZSS uses a bitmask
I guess the meaning of "LZ77" has been generalized from the original version that strictly alternated one-codeword literals with history matches to new variants that indicate literal vs. match in some other way. Nintendo since at least the GBA has used "LZ77" to refer to an implementation of LZSS almost identical to that in the Allegro library except for a few bitfield ordering differences. LZSS has an 8-bit control word whose bits indicate one match (with fixed-width distance and length) or one literal codeword. Pucrunch doesn't use that exact format either; it uses a lot of gamma codes for lengths and such.
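For reference, a minimal C sketch of that general LZSS scheme; the field widths here (12-bit distance, 4-bit length) and the bit meanings are arbitrary choices for the example, not the GBA, Allegro, or Pucrunch layout:

Code: Select all

/* Each control byte provides 8 flags, consumed LSB first: a set bit means
   "copy one literal byte", a clear bit means "copy a match" described by
   two bytes: 12-bit distance back into the output, 4-bit length (3-18). */
int lzss_decompress(const unsigned char *src, unsigned char *dst, int dst_len)
{
    int in = 0, out = 0;
    unsigned flags = 0, nbits = 0;
    while (out < dst_len) {
        if (nbits == 0) { flags = src[in++]; nbits = 8; }
        if (flags & 1) {
            dst[out++] = src[in++];                       /* literal */
        } else {
            unsigned b0 = src[in++], b1 = src[in++];
            unsigned dist = ((b0 << 4) | (b1 >> 4)) + 1;  /* 1..4096 bytes back */
            unsigned len  = (b1 & 0x0F) + 3;              /* 3..18 bytes */
            while (len-- && out < dst_len) {
                dst[out] = dst[out - dist];               /* copy from history */
                out++;
            }
        }
        flags >>= 1;
        nbits--;
    }
    return out;
}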
Dwedit
Posts: 4924
Joined: Fri Nov 19, 2004 7:35 pm

Post by Dwedit »

Pucrunch has a design flaw that Apack corrected: byte values are not guaranteed to be byte-aligned in Pucrunch, but they are in Apack.
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!