NTC: compression for short strings

Discuss technical or other issues relating to programming the Nintendo Entertainment System, Famicom, or compatible systems.

calima

NTC: compression for short strings

Post by calima » Mon Nov 25, 2019 11:22 am

I invented a method for compressing short strings. It's quite useful for text-heavy NES projects: RPG dialog, etc.

https://github.com/clbr/NTC

Code: Select all

NTC - NES text compression
==========================

This is a compression format intended for English text stored as short strings.
It uses second-order probability combined with canonical Huffman coding.

It creates a small index (1-2 kB) that would be kept in a common bank; then
each string is compressed separately and may be stored wherever convenient.

Compressing this program's license text (COPYING, 35147 bytes), treating each
non-empty line as a string, creates an index of 1034 bytes. The index
is included in the following table's compressed size.

Zlib and LZ4 compress the entire file as a blob, not per-string; they
are included for scale. Per-string they would likely increase the total
size.

|======================================
|Algo   | Compressed size       | %
|NTC    | 17925                 | 52
|LZ4HC  | 15521                 | 44
|Zlib   | 12144                 | 35
|======================================

There are two versions of the NES decompressor, generic and direct.
The generic one takes the index by pointer, but is slower and uses
more RAM. It's useful e.g. if you have multiple languages. The
direct one uses absolute addressing to a single index.

|============================================================
|Version        | Code size     | RAM   | Avg cycles per byte
|Generic        | 426           | 8     | 785
|Direct         | 296           | 0     | 735
|============================================================

This method won't do well with binary data; for that consider other
algorithms.

If you're interested in using this in commercial setups, please
contact Mega Cat Studios.
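As a minimal sketch (not the repository's actual code), the canonical-Huffman half of the scheme can be illustrated in Python, assuming an index made of per-length leaf counts plus the symbols in canonical order; the real index layout is described in internal.h and may differ in detail.

```python
def canonical_decode(bits, leaf_counts, symbols):
    """Decode one symbol from an iterator of bits (0/1).

    leaf_counts[i] is the number of codes of length i+1;
    symbols lists the leaf bytes in canonical order.
    """
    code = 0
    first = 0   # first canonical code of the current length
    index = 0   # symbols consumed at shorter lengths
    for count in leaf_counts:
        code = (code << 1) | next(bits)
        if code - first < count:        # code falls in this length's range
            return symbols[index + (code - first)]
        index += count
        first = (first + count) << 1
    raise ValueError("invalid bitstream")
```

For example, with one 1-bit code and two 2-bit codes (leaf_counts = [1, 2], symbols = ['a', 'b', 'c']), the canonical codes come out as 0, 10 and 11, so the bit sequence 10 decodes to 'b'.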

JRoatch

Re: NTC: compression for short strings

Post by JRoatch » Mon Nov 25, 2019 3:56 pm

For comparison, here are two other text compression formats made specifically with the NES in mind: Brad Smith's huffmunch and Recursive DTE.

Code: Select all

|===========================================
|Algo          | Compressed size       | %
|huffmunch     | 15377                 | 44
|Recursive DTE | 18787                 | 53
|===========================================

tepples

Re: NTC: compression for short strings

Post by tepples » Mon Nov 25, 2019 4:04 pm

Compression ratio for English language text appears comparable to DTE, as used in robotfindskitten, 240p Test Suite for NES and GB, Thwaite 0.04, and apparently a bunch of English translations of JRPGs:

Code: Select all

$ wc -c /usr/share/automake-1.15/COPYING
35147 /usr/share/automake-1.15/COPYING
$ common/tools/dte /usr/share/automake-1.15/COPYING COPYING.dte COPYING.dtbl
$ wc -c COPYING.dte COPYING.dtbl
16867 COPYING.dte
  512 COPYING.dtbl
17379 total

It looks like the source code uses canonical Huffman coding at some layer. Is there a document describing the format of the index?

EDIT: It appears I got ninja'd by the programmer of the DTE compressor that I use.

JRoatch

Re: NTC: compression for short strings

Post by JRoatch » Mon Nov 25, 2019 9:16 pm

I'll just note that the difference between the DTE stats (from tepples and me) is due to different default parameters in the versions we have (using all available code points vs. only the top 128).

rainwarrior

Re: NTC: compression for short strings

Post by rainwarrior » Tue Nov 26, 2019 12:06 am

I'm looking at the decompressor, trying to get an idea of the compression method.

It seems like a canonical Huffman tree for coding bytes. Is that the whole technique, or is it modified in some way?

The "index" seems like a list of leaf counts (in two separate tables for 8 and 16 bit levels of the tree), followed by a list of <=256 leaf characters that can be output? Then any string's data starts with a length byte, then a raw byte for the first character (?), then a huffman coded bitstream? Have I got that right?

Edit: Hmm, I missed the "second order probability" part of the description. What does that mean? Is each subsequent output relative to the previous in some way? Is that why the string data starts with 1 raw byte?

rainwarrior

Re: NTC: compression for short strings

Post by rainwarrior » Tue Nov 26, 2019 1:26 am

So looking at it a little more...

The index contains, after a list of leaf counts describing a canonical huffman tree, up to 256 tables of output bytes, each a probability-sorted list of possible bytes to follow each previous output byte?

That part seems to make sense to me, but it looks like the same huffman tree is always used, isn't it? Wouldn't each following-byte set have a different probability distribution, and require its own tree? Or is the huffman tree not really optimal, but instead based on some sort of average of their probabilities so that the distribution of the tree used is close enough that it still benefits them all in aggregate?

calima

Re: NTC: compression for short strings

Post by calima » Tue Nov 26, 2019 1:30 am

DTE seems to do rather well. I guess it doesn't get much weaker if used per-string?
tepples wrote:
Mon Nov 25, 2019 4:04 pm
It looks like the source code uses canonical Huffman coding at some layer. Is there a document describing the format of the index?
internal.h contains a description of the index. Not very detailed though.
rainwarrior wrote:
Tue Nov 26, 2019 12:06 am
any string's data starts with a length byte, then a raw byte for the first character (?), then a huffman coded bitstream? Have I got that right?
Yes, that's the string format.
Edit: Hmm, I missed the "second order probability" part of the description. What does that mean? Is each subsequent output relative to the previous in some way? Is that why the string data starts with 1 raw byte?
Exactly: each byte is encoded by its probability given the previous byte. For example, 'h' often follows 't', so it gets a low rank (high probability), which is then encoded with canonical Huffman.

On the ninja post: straight byte-wise Huffman only saves 20-30% (70-80% compressed size), while this approach gets ~50%.
Yes, the same tree is used, built from the aggregate probabilities of each position. Using a separate tree for each byte's probability array might or might not save total space; it could be worth trying, though.
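The second-order idea described above can be sketched in Python. This is a hedged illustration with hypothetical helper names, not NTC's actual code: each context byte gets a follower list sorted by frequency, and each byte is replaced by its rank in the previous byte's list (in real NTC those ranks would then be canonical-Huffman coded rather than stored directly).

```python
from collections import Counter, defaultdict

def build_rank_tables(corpus):
    """For each context byte, list its followers sorted by frequency."""
    freq = defaultdict(Counter)
    for prev, cur in zip(corpus, corpus[1:]):
        freq[prev][cur] += 1
    return {prev: [b for b, _ in c.most_common()] for prev, c in freq.items()}

def to_ranks(s, tables):
    """First byte stays raw; each later byte becomes its rank in the
    previous byte's follower table. Frequent pairs get small ranks."""
    out = [s[0]]
    for prev, cur in zip(s, s[1:]):
        out.append(tables[prev].index(cur))
    return out
```

Because common followers get rank 0 or 1, the rank stream is heavily skewed toward small values, which is exactly the kind of distribution a single shared Huffman tree can exploit.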

calima

Re: NTC: compression for short strings

Post by calima » Tue Nov 26, 2019 2:18 am

A quick calculation puts separate Huffman trees at increasing the index by ~900 bytes, which may be too much in the precious common bank. OTOH it would also drop the total bitstream size by up to half in the best case.

I also tried chaining DTE with this for kicks. It actually shrank a tiny bit:

Code: Select all

$ dte -r 1-255 -e 10 COPYING COPYING.dte COPYING.dtb
$ nestextcomp COPYING.dte
Index 4752 bytes
Compressed 16498 to 15107 bytes, 91.57%

calima

Re: NTC: compression for short strings

Post by calima » Tue Nov 26, 2019 2:38 am

Looking at the .dte file's byte histogram, it's highly uneven, so it'd benefit even from single-byte Huffman.
edit: Hm, if the DTE decoder were implemented by recursion, it'd blow the stack on the NES. Worst-case recursion for this file would be about 162 levels deep; the entire NES stack can hold only 128 return addresses, and the stack under neslib defaults is only 32 bytes (16 levels).

Tepples, is your GB/NES decoder recursive or iterative?

tepples

Re: NTC: compression for short strings

Post by tepples » Tue Nov 26, 2019 7:52 am

calima wrote:
Tue Nov 26, 2019 1:30 am
DTE seems to do rather well. I guess it doesn't get much weaker if used per-string?
Not too much. There's an option in the encoder to mark ranges of bytes as incompressible. If you mark newline as incompressible, you lose context across lines of text, but I don't see that as a big ratio loss. There's also an option to start the pair codes at a value greater than $80, in case you have reserved (say) $80-$87 for accented characters and symbols, though as expected this slightly reduces the compression ratio. The other option is to reserve $10-$1F for accented characters.
calima wrote:
Tue Nov 26, 2019 2:38 am
Tepples, is your GB/NES decoder recursive or iterative?
It's iterative.

The 6502 decompressor was first, made for my rfk port to fit as many NKIs as possible into 12 KiB. It uses a fixed-size output buffer sized for the longest line of text that the rfk RFC permits, treating its end as a software-defined stack of bytes. It starts by pushing the entire string to the stack starting at its end. When the value popped from the stack is a pair code, it pushes the second byte then the first byte and continues. When the value is a literal character code, it gets written at the start of the output buffer.
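The buffer-end-as-stack scheme described above can be sketched in Python. Hedged: the pair-table layout here, with pair codes starting at 128 indexing a list of byte pairs, is an assumption for illustration, not the exact rfk format.

```python
def dte_decode(data, pairs, bufsize=64):
    """Iterative DTE decode using the tail of the output buffer as a stack.

    pairs[c - 128] gives the two bytes that pair code c expands to;
    bytes below 128 are literals.
    """
    buf = bytearray(bufsize)
    sp = bufsize - len(data)
    buf[sp:] = data                   # push the whole string at the buffer's end
    out = 0
    while sp < bufsize:
        c = buf[sp]                   # pop one code
        sp += 1
        if c >= 128:                  # pair code: push second byte, then first
            a, b = pairs[c - 128]
            sp -= 2
            buf[sp] = a
            buf[sp + 1] = b
        else:                         # literal: write at the start of the buffer
            buf[out] = c
            out += 1
    return bytes(buf[:out])
```

With pairs = [(ord('t'), ord('h')), (128, ord('e'))], code 129 expands through code 128 to "the", so dte_decode(bytes([129]), pairs) returns b'the'. Note that the stack and the output share one buffer: each literal advances both pointers in lockstep, while each pair expansion shrinks the gap between them by one byte, which is where pathological input could make them collide.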

The Game Boy decompressor I made when it became apparent that without compression of the help pages, the test suite wouldn't fit in Catskull's 32 KiB flash cartridges. (Those appear to be the only mass-produced rewritable GB cartridge sold on a western site, as if Infinite NES Lives made only NROM.) The output buffer is again of fixed size, this time sized for the longest line of text that I estimated would fit in a 112-pixel-wide VWF window. With the 8080-derived CPU's three register pairs usable for addresses, I used a slightly different approach. The stack is a separate area of RAM, limited to eight levels. I could have overlapped it with the output buffer the way I did on 6502, but I was inexperienced with RGBDS at the time, and the GB's 8 KiB work RAM looked positively spacious.

calima

Re: NTC: compression for short strings

Post by calima » Tue Nov 26, 2019 11:38 am

Using the other end of the buffer as a stack is smart. The straightforward iterative approach would have been ping-pong buffers. Though if the buffer-as-stack is too small, a pathological input could cause corruption.

tepples

Re: NTC: compression for short strings

Post by tepples » Tue Nov 26, 2019 12:47 pm

Pathological input in DTE means one of two things:
  1. Table contains a cycle
    A table entry refers to itself or a later entry, representing a graph with cycles rather than a forest of trees.
  2. Text overflows the buffer
    With a finite-sized output buffer, one or more lines (ranges between delimiter code units) exceed its length. Or if the stack is in a separate buffer, as would be used with a streaming decoder, the table includes one or more code units whose resulting tree has a left depth exceeding the size of the stack buffer.
In normal operation, the compressor emits a tree in which all entries refer to an earlier entry, which cannot trigger situation 1. The front end also catches lines that are too long, producing a diagnostic. I admit my decoders aren't currently hardened against attacker-controlled input.
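Both failure modes can be checked ahead of time. A sketch, assuming the same hypothetical pair-table layout with pair codes numbered from 128 upward: it rejects forward or self references (ruling out cycles, situation 1) and reports the worst-case stack depth needed to expand any single code unit (bounding situation 2).

```python
def validate_dte_table(pairs, base=128):
    """Reject cyclic DTE tables and return the worst-case stack depth."""
    depth = {}                        # pair code -> stack slots to expand it

    def d(x):
        return depth.get(x, 1)        # a literal occupies one slot

    worst = 1
    for i, (a, b) in enumerate(pairs):
        code = base + i
        if a >= code or b >= code:
            raise ValueError(f"entry {i} references code {max(a, b)}: cycle")
        # expanding `code` pops it and pushes b then a; while a is being
        # expanded, b still sits on the stack, hence the +1 on the left side
        depth[code] = max(d(a) + 1, d(b))
        worst = max(worst, depth[code])
    return worst
```

For example, a table where code 129 expands to (128, 'e') and code 128 to ('t', 'h') needs three stack slots in the worst case; a streaming decoder's stack buffer must be at least that deep.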

pubby

Re: NTC: compression for short strings

Post by pubby » Wed Nov 27, 2019 6:22 pm

DTE is cool because of how simple/small the decompressor is (like 30 lines of asm). You can also design it to be backwards-compatible with uncompressed ASCII strings, which is convenient if you need that sort of thing.

Anyway, this looks cool too. I'll keep it in mind the next time I need to compress text. Thanks calima.
