### Re: Compression benchmarks

Posted:

**Sat Jan 02, 2016 10:51 am**

> Sik wrote:
> > tepples wrote:
> > I like to use a unary code, in which each bit of a control word can be 0 for a single literal symbol or 1 for part of previous run, to be fairly efficient on real data and easy to decode to a multiple of 8 bytes. LZSS in the Allegro 4 library and Game Boy Advance and Nintendo DS BIOS uses a variant of this unary code, where each bit of the control word is one way for a literal and the other way for a reference to previous data.
>
> I don't think that qualifies under the definition of unary code you linked, though... (the tokens are always 0 or 1, not a string of bits)

That's because unary coding and run-length encoding are inverses of each other. From an RLE point of view, PB8's repeat coding is a unary code.

0: Run of 1 byte, then change byte

10: Run of 2 bytes, then change byte

110: Run of 3 bytes, then change byte

1110: Run of 4 bytes, then change byte

etc.
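To make the "inverses" point concrete, here is a minimal Python sketch. It is not PB8's actual bitstream layout; the function names and the one-bit-per-byte convention (1 = same byte as before, 0 = the run ends here) are assumptions chosen to match the list above. Read bit by bit it is a flag stream; read up to each 0 it is a sequence of unary-coded run lengths.

```python
def runs_to_unary_bits(data: bytes) -> str:
    """Encode the run lengths of `data` as unary codes: (n - 1) ones, then a zero."""
    if not data:
        return ""
    bits = []
    for prev, cur in zip(data, data[1:]):
        # 1 = same byte as before (run continues), 0 = run ends and the byte changes
        bits.append("1" if cur == prev else "0")
    bits.append("0")  # terminate the final run
    return "".join(bits)


def unary_bits_to_run_lengths(bits: str) -> list:
    """Recover run lengths: each code is a string of ones ended by a zero."""
    lengths, current = [], 1
    for bit in bits:
        if bit == "1":
            current += 1
        else:
            lengths.append(current)
            current = 1
    return lengths
```

For example, `b"AAABCC"` has runs of 3, 1 and 2 bytes, so the bit string is `110` + `0` + `10`.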

> Sik wrote: Also what you've described applies to pretty much nearly every LZ77-based scheme out there

It models the probability of a back-reference as 50%, with the other 50% for a single literal byte, with no dependence on the previous runs. This is equivalent to runs of back-references and runs of literals each having a geometric distribution to their lengths. Unary codes and other Golomb codes are ideal for geometric distributions, with the Golomb code's *M* parameter depending on the distribution's *p* parameter. Optimally, *M* should be close to -log(2)/log(1 - *p*), with the unary code corresponding to *M* = 1 or *p* = .5. Most schemes branded as "RLE", on the other hand, treat all run lengths up to 32 or 128 or whatever as equally likely.
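As a quick check of that formula, a small Python sketch follows. The helper name `golomb_m` is hypothetical, and rounding to the nearest integer (with a floor of 1) is my assumption for what "close to" means; *p* is taken to be the per-symbol probability that the run ends.

```python
import math


def golomb_m(p: float) -> int:
    """Nearest integer to -log(2)/log(1 - p), but never less than 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    # Rounding is an assumption; the post only says M should be "close to" this value.
    return max(1, round(-math.log(2) / math.log(1.0 - p)))
```

With *p* = .5 this gives *M* = 1, the unary code; a smaller *p* (longer expected runs) gives a larger *M*, for example *M* = 7 at *p* = .1.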

Another option for coding run lengths is the exp-Golomb code. It replaces the unary code for the quotient with the Elias gamma code, resulting in longer codes for values 1 and 3 (runs of length 2 or 4) in exchange for shorter codes for longer runs. Exp-Golomb is ideal for power law (Zipf) distributions.
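The length trade-off can be checked with a short Python sketch. It assumes the order-0 exp-Golomb code (the Elias gamma code of value + 1) and the same unary convention as the run list earlier in the post (ones, then a terminating zero); the function names are illustrative.

```python
def unary_code(value: int) -> str:
    """Unary code: `value` ones followed by a terminating zero."""
    return "1" * value + "0"


def exp_golomb_code(value: int) -> str:
    """Order-0 exp-Golomb: Elias gamma code of value + 1."""
    binary = bin(value + 1)[2:]              # binary digits of value + 1
    return "0" * (len(binary) - 1) + binary  # zero prefix announces the length
```

For values 0 through 5 the unary lengths are 1, 2, 3, 4, 5, 6 bits and the exp-Golomb lengths are 1, 3, 3, 5, 5, 5 bits: one bit worse at values 1 and 3, equal at 0, 2 and 4, and better from 5 upward.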

> Sik wrote: (LZSS's particularity is the use of a ring buffer instead of the already decoded data, allowing for more room to reuse strings as repeated strings don't go into the buffer).

Repeated strings do go into the ring buffer, at least in Allegro 4 LZSS (which is equivalent to GBA LZSS with different bit ordering). The use of a ring buffer is to allow encoding or decoding without keeping the entire uncompressed stream in RAM at once. The difference between LZSS and the format described in Ziv and Lempel's 1977 paper is that original LZ77 assumed that all back references would be followed by exactly one literal and potentially have zero length, while LZSS allows back-references to follow one another directly.
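A toy decoder illustrates both points. This is a sketch under loud assumptions: the token list (an `int` for a literal, a `(distance, length)` tuple for a back-reference) is a hypothetical stand-in, not the Allegro 4 or GBA bitstream, and `lzss_decode` is an illustrative name. What it shows is that copied bytes are written into the ring just like literals, and that one back-reference may directly follow another.

```python
def lzss_decode(tokens, window_size=4096):
    """Decode a toy token stream; NOT the Allegro 4 or GBA bitstream layout."""
    ring = bytearray(window_size)  # only the last window_size bytes are kept
    pos = 0                        # next write position in the ring
    out = bytearray()              # collected only so the demo can return it

    def emit(byte):
        nonlocal pos
        out.append(byte)
        ring[pos] = byte           # literals AND copied bytes enter the ring
        pos = (pos + 1) % window_size

    for token in tokens:
        if isinstance(token, int):         # literal byte
            emit(token)
        else:                              # (distance, length) back-reference
            distance, length = token
            for _ in range(length):
                emit(ring[(pos - distance) % window_size])
    return bytes(out)
```

Decoding `[ord("a"), ord("b"), (2, 4), (6, 3)]` yields `b"ababababa"`: the second back-reference immediately follows the first and reads bytes that the first one copied into the ring. A real decoder would write each byte to its destination as it is produced rather than accumulating `out`.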