nesdev.com — https://forums.nesdev.com/
Is there a known algorithm for this? — https://forums.nesdev.com/viewtopic.php?f=5&t=14370
Page 1 of 2

 Author: tokumaru [ Tue Jun 07, 2016 11:22 am ] Post subject: Is there a known algorithm for this? I'm toying with a new type of level map compression (I'll share the details later if the results are good), and to make the most of it, at one point I need to combine several strips of numbers (all strips are the same length) into one long strip. The resulting strip should be the shortest possible combination of all the strips, which must be overlaid in a way that makes optimal use of any redundancy that exists.

For example, if I need to combine the strips "AABBBBBAAAAA" and "AAAAACCAAAAA", the result would be "AABBBBBAAAAACCAAAAA". This is simple when you only have a handful of strips, but with several hundred of them it's easy to make the wrong choices. My current idea is the following:

1 - Find the longest overlap for each strip and save this information (length of the overlap, index of the other strip);
2 - Find the longest overlap among all strips, using the previously calculated values (if the longest overlap is 0, go to step 7);
3 - Join the two strips involved in the longest overlap and mark both original slots "empty";
4 - Add the new strip to the end of the list and find the longest overlap for it;
5 - Find the longest overlap again for any strips whose best match was one of the removed strips;
6 - Go to step 2;
7 - There are no more overlaps, so concatenate all the strips, ignoring the empty slots.

That seemed OK to me at first, but now I have the impression that by combining the best possible pair at each step, I might be spoiling future combinations that would result in larger gains, and I can't think of a better way to do this. That's why I'm asking here whether anyone knows of an algorithm (or can come up with one on the spot!) that solves this problem optimally.
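The greedy procedure above can be sketched in Python. This is a naive version that rescans all pairs after every merge instead of doing the incremental bookkeeping of steps 4-5; the function names are illustrative, not from the thread:

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(strips):
    """Repeatedly merge the pair of strips with the largest overlap."""
    strips = list(strips)
    while len(strips) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i in range(len(strips)):
            for j in range(len(strips)):
                if i != j:
                    k = overlap(strips[i], strips[j])
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left; just concatenate what remains
        merged = strips[best_i] + strips[best_j][best_k:]
        strips = [s for n, s in enumerate(strips) if n not in (best_i, best_j)]
        strips.append(merged)
    return "".join(strips)
```

On the example from the post, `greedy_superstring(["AABBBBBAAAAA", "AAAAACCAAAAA"])` merges at the 5-character overlap and yields the 19-character "AABBBBBAAAAACCAAAAA".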

 Author: 3gengames [ Tue Jun 07, 2016 11:37 am ] Post subject: Re: Is there a known algorithm for this? Sounds like you want a dictionary-based compression method. It might waste a few bytes on references into the dictionary, but I believe that as the data set grows it'd beat anything else, since runs that occur multiple times get reduced to single-byte references, which would make the overhead acceptable.

 Author: pubby [ Tue Jun 07, 2016 11:50 am ] Post subject: Re: Is there a known algorithm for this? At a quick glance it sounds like the traveling salesman problem. You have a weighted directed graph where the strips are the nodes and the amount of overlap between strips forms the edge weights. Finding the optimal path that visits each node once sounds very much like TSP, but then again I haven't thought about this very hard.

 Author: tepples [ Tue Jun 07, 2016 12:10 pm ] Post subject: Re: Is there a known algorithm for this? This is the shortest common superstring problem, and it indeed appears related to the traveling salesman problem. For a general set of strings, it is NP-complete.

 Author: calima [ Tue Jun 07, 2016 12:29 pm ] Post subject: Re: Is there a known algorithm for this? Surely you can brute-force it with NES-sized data? It's an n! problem, but if you did it in sets of, say, 12 strings, it would still compute fast, and likely give a better result than working with pairs alone.
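A brute-force pass over a small set could look like the sketch below (helper names are illustrative). Note that 12! ≈ 4.8 × 10⁸ orderings is already a lot for a scripting language; roughly 10 strings or fewer stays comfortable:

```python
from itertools import permutations

def overlap(a, b):
    """Longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge_in_order(strips):
    """Superstring obtained by merging the strips left to right at max overlap."""
    out = strips[0]
    for s in strips[1:]:
        out += s[overlap(out, s):]
    return out

def brute_force(strips):
    """Try every ordering and keep the shortest result; small sets only."""
    return min((merge_in_order(p) for p in permutations(strips)), key=len)
```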

 Author: tokumaru [ Tue Jun 07, 2016 1:32 pm ] Post subject: Re: Is there a known algorithm for this? 3gengames wrote:Sounds like you want a dictionary based compression method.

Yes, and I want to make the dictionary as compact as possible.

pubby wrote:At a quick glance it sounds like the traveling salesman problem.

I've never been good at solving that!

Quote:You have a weighted directed graph where the strips are the nodes and the amount of overlap between strips form the edge weights. Finding the optimal path that visits each node once sounds very much like TSP, but then again I haven't thought about this very hard.

Makes sense.

tepples wrote:This is the shortest common superstring problem

Thanks for finding the name of the problem! Now I can see whether there are any interesting ways of solving it.

calima wrote:Surely you can brute-force it with NES-sized data?

I don't know... Even with tens of strips, trying all permutations isn't feasible, let alone with 500!

Quote:if you did it in sets of say 12 strings

Which 12 strings, though? Those involved in the longest overlaps? What kind of selection would make me lose the fewest opportunities?

 Author: tokumaru [ Tue Jun 07, 2016 1:58 pm ] Post subject: Re: Is there a known algorithm for this? I found an interesting presentation on this subject. The greedy version described there is exactly what I was planning to do:

Quote:At each step, the greedy algorithm "greedily" chooses the longest remaining overlap, merges its source and sink

Then it says this:

Quote:But greedy algorithm is a good approximation; i.e. the superstring yielded by the greedy algorithm won't be more than ~2.5 times longer than true SCS (see Gusfield 16.17.1)

How is 2.5 times longer a good approximation?!?!

 Author: pubby [ Tue Jun 07, 2016 2:03 pm ] Post subject: Re: Is there a known algorithm for this? Well, you can always attempt to improve an imperfect solution using an optimization technique like hill climbing or simulated annealing.

So let's say that you knew the order of the strips: "AABBBBBAAAAA" comes first, "AAAAACCAAAAA" comes second, and so on. Can you figure out the minimum length required when all of these strips are combined in that given order? It shouldn't be too hard or computationally intensive -- just a little bit hard. Some type of dynamic programming is probably the answer.

Once you have that figured out, the optimization techniques are easy. Just modify the order of the strips and compare lengths. Keep the orderings that give shorter lengths and discard the rest.
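The fixed-order length can be computed with a single left-to-right pass, a sketch of which follows (this greedy per-pair merge should suffice as long as no strip is a substring of another; the names are illustrative, not from the thread):

```python
def overlap(a, b):
    """Longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def length_in_order(strips):
    """Combined length when the strips must appear in exactly this order,
    merging each strip into the accumulated result at maximum overlap."""
    out = strips[0]
    for s in strips[1:]:
        out += s[overlap(out, s):]
    return len(out)
```

For the two example strips, the given order costs 19 characters while the reverse order costs 22, which is exactly the quantity an optimizer would compare.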

 Author: tokumaru [ Tue Jun 07, 2016 2:12 pm ] Post subject: Re: Is there a known algorithm for this? pubby wrote:Can you figure out the minimum length required when all of these strips are combined in their given order?

Yes, this is the easy part. Figuring out the optimal order is the tough task.

Quote:Just modify the order of the strips and compare lengths.

But I can't possibly test every possible order! Even if I had only 20 strips total, there'd be 20! = 2,432,902,008,176,640,000 permutations! With 500 strips the number of permutations is practically infinite...

 Author: pubby [ Tue Jun 07, 2016 2:24 pm ] Post subject: Re: Is there a known algorithm for this? What I was getting at was that you can turn an imperfect solution into a much-better-but-not-perfect solution in only a few hundred iterations of improvement. Start with a greedy algorithm and then apply an optimization technique until it's "good enough".

Read up on simulated annealing to see what I mean. It's probabilistic: you check a certain number of random permutations and then stop. Diminishing returns, etc.

(And this is a common way of solving the salesman problem. Start with a simple nearest-neighbor approach, similar to your greedy algorithm, and then improve it for several iterations with hill climbing.)
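A minimal simulated-annealing sketch over strip orderings might look like this; the swap move, linear cooling schedule, and all parameter values are illustrative choices, not anything prescribed in the thread:

```python
import math
import random

def overlap(a, b):
    """Longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merged_length(order, strips):
    """Superstring length when the strips are merged in the given index order."""
    out = strips[order[0]]
    for i in order[1:]:
        out += strips[i][overlap(out, strips[i]):]
    return len(out)

def anneal(strips, iters=5000, t0=2.0, seed=1):
    """Random-swap annealing: always accept improvements, and accept
    regressions with probability exp(-delta/t) while the temperature is high."""
    rng = random.Random(seed)
    order = list(range(len(strips)))
    cur = best = merged_length(order, strips)
    best_order = list(order)
    for step in range(iters):
        t = t0 * (1.0 - step / iters) + 1e-9  # linear cooling
        i, j = rng.randrange(len(order)), rng.randrange(len(order))
        order[i], order[j] = order[j], order[i]
        cand = merged_length(order, strips)
        if cand <= cur or rng.random() < math.exp((cur - cand) / t):
            cur = cand
            if cand < best:
                best, best_order = cand, list(order)
        else:
            order[i], order[j] = order[j], order[i]  # revert the swap
    return best, best_order
```

Setting `t0` to zero degenerates this into plain hill climbing, the other technique mentioned above.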

 Author: Kasumi [ Tue Jun 07, 2016 4:47 pm ] Post subject: Re: Is there a known algorithm for this? I wrote a program that does this (except all blocks are not a fixed size), but it doesn't really try for absolutely optimal. Here's some of its log file:Code:Chunks to Export: 81 Found block \$00C. Looking for a good merge match. Attempting to merge Block 00C and Block 005... Block 00C: \$00\$0E\$01\$04\$00\$00\$03\$03\$14\$17\$09\$03\$00\$1D\$04\$09\$20\$00\$10\$24\$09\$28\$00\$1D\$2D\$2E\$3 0\$04\$09\$03\$00\$37\$1D\$04\$0E\$09\$48\$72\$02\$03\$01\$4F\$7A\$53\$10\$1D\$1D\$17\$1D\$03\$04\$56\$42\$ 00\$34\$28\$16\$3C\$4F\$00\$1C\$53\$48\$3E\$03\$09\$46\$09\$09\$5A\$03\$5F\$63\$85\$82\$93\$A2\$82\$80\$94 \$8E\$96\$00\$98\$93\$94\$02\$02\$9B\$9F\$9D\$00\$8F\$02\$98\$A1\$8A\$D5\$D3\$A8\$A6\$D7\$D1\$B5\$B3\$D1\$D 3\$A4\$02\$B1\$02\$02\$AA\$AC\$B0\$A8\$7D\$03\$7F\$F2\$59\$03\$03\$03\$68\$03\$68\$03\$03\$03\$03\$FF\$FD\$ DF\$DF\$FE\$03\$AD\$AA\$AF\$03\$E1\$AC\$AF\$DF\$DF\$DF\$03\$C6\$FD\$D9\$DB\$C7\$C9\$C6\$C0\$C7\$4F\$C9\$C6 \$C0\$53\$D0\$48\$D0\$C6\$C0\$00\$BB\$00\$CF\$00\$BB\$CF\$00\$BD\$BD\$C0\$4D\$64\$00\$F4\$F4\$05\$03\$03\$0 3\$00\$00\$00\$05\$05\$03\$05\$03\$00\$00\$00\$00\$00\$00\$00\$F0\$00\$00 Block 005: \$00\$00\$00\$00\$00\$00\$00\$00\$00\$00\$20\$20\$20\$20\$30\$30\$30\$30\$30\$30\$30\$30\$10\$10\$10\$10\$1 0\$10\$10\$10\$00\$00\$00\$00\$00\$00\$00\$00\$00\$00\$00\$00\$00\$00\$00\$00\$00\$40\$40\$40\$40\$18\$18\$ 18\$20\$20\$40\$40\$40\$40\$40\$40\$40\$40\$40\$40\$40\$40\$40\$40\$40\$40\$40\$40 Beginning merge. Best offset is 203. Merged size: 277. mergeddata initialized. 
\$00\$0E\$01\$04\$00\$00\$03\$03\$14\$17\$09\$03\$00\$1D\$04\$09\$20\$00\$10\$24\$09\$28\$00\$1D\$2D\$2E\$3 0\$04\$09\$03\$00\$37\$1D\$04\$0E\$09\$48\$72\$02\$03\$01\$4F\$7A\$53\$10\$1D\$1D\$17\$1D\$03\$04\$56\$42\$ 00\$34\$28\$16\$3C\$4F\$00\$1C\$53\$48\$3E\$03\$09\$46\$09\$09\$5A\$03\$5F\$63\$85\$82\$93\$A2\$82\$80\$94 \$8E\$96\$00\$98\$93\$94\$02\$02\$9B\$9F\$9D\$00\$8F\$02\$98\$A1\$8A\$D5\$D3\$A8\$A6\$D7\$D1\$B5\$B3\$D1\$D 3\$A4\$02\$B1\$02\$02\$AA\$AC\$B0\$A8\$7D\$03\$7F\$F2\$59\$03\$03\$03\$68\$03\$68\$03\$03\$03\$03\$FF\$FD\$ DF\$DF\$FE\$03\$AD\$AA\$AF\$03\$E1\$AC\$AF\$DF\$DF\$DF\$03\$C6\$FD\$D9\$DB\$C7\$C9\$C6\$C0\$C7\$4F\$C9\$C6 \$C0\$53\$D0\$48\$D0\$C6\$C0\$00\$BB\$00\$CF\$00\$BB\$CF\$00\$BD\$BD\$C0\$4D\$64\$00\$F4\$F4\$05\$03\$03\$0 3\$00\$00\$00\$05\$05\$03\$05\$03\$00\$00\$00\$00\$00\$00\$00\$F0\$00\$00\$00\$00\$00\$00\$00\$00\$00\$00\$ 20\$20\$20\$20\$30\$30\$30\$30\$30\$30\$30\$30\$10\$10\$10\$10\$10\$10\$10\$10\$00\$00\$00\$00\$00\$00\$00 \$00\$00\$00\$00\$00\$00\$00\$00\$00\$00\$40\$40\$40\$40\$18\$18\$18\$20\$20\$40\$40\$40\$40\$40\$40\$40\$4 0\$40\$40\$40\$40\$40\$40\$40\$40\$40\$40 Blocks merged. Found block \$00B. Looking for a good merge match. Attempting to merge Block 00B and Block 00C... 
Block 00B: \$00\$0E\$01\$04\$00\$00\$03\$03\$14\$17\$09\$03\$00\$1D\$04\$09\$20\$00\$10\$24\$09\$28\$00\$1D\$2D\$2E\$3 0\$04\$09\$03\$00\$37\$1D\$04\$0E\$09\$48\$72\$02\$03\$01\$4F\$7A\$53\$10\$1D\$1D\$17\$1D\$03\$04\$56\$42\$ 00\$34\$28\$16\$3C\$4F\$00\$1C\$53\$48\$3E\$03\$09\$46\$09\$09\$5A\$03\$5F\$63\$85\$82\$93\$A2\$82\$80\$94 \$8E\$96\$00\$98\$93\$94\$02\$02\$9B\$9F\$9D\$00\$8F\$02\$98\$A1\$8A\$D5\$D3\$A8\$A6\$D7\$D1\$B5\$B3\$D1\$D 3\$A4\$02\$B1\$02\$02\$AA\$AC\$B0\$A8\$7D\$03\$7F\$F2\$59\$03\$03\$03\$68\$03\$68\$03\$03\$03\$03\$FF\$FD\$ DF\$DF\$FE\$03\$AD\$AA\$AF\$03\$E1\$AC\$AF\$DF\$DF\$DF\$03\$C6\$FD\$D9\$DB\$C7\$C9\$C6\$C0\$C7\$4F\$C9\$C6 \$C0\$53\$D0\$48\$D0\$C6\$C0\$00\$BB\$00\$CF\$00\$BB\$CF\$00\$BD\$BD\$C0\$4D\$64\$00\$F4\$F4\$05\$03\$03\$0 3\$00\$00\$00\$05\$05\$03\$05\$03\$00\$00\$00\$00\$00\$00\$00\$F0\$00\$00\$00\$00\$00\$00\$00\$00\$00\$00\$ 20\$20\$20\$20\$30\$30\$30\$30\$30\$30\$30\$30\$10\$10\$10\$10\$10\$10\$10\$10\$00\$00\$00\$00\$00\$00\$00 \$00\$00\$00\$00\$00\$00\$00\$00\$00\$00\$40\$40\$40\$40\$18\$18\$18\$20\$20\$40\$40\$40\$40\$40\$40\$40\$4 0\$40\$40\$40\$40\$40\$40\$40\$40\$40\$40 Block 00C: \$00\$0F\$0F\$04\$00\$00\$03\$03\$00\$18\$04\$0A\$1A\$1E\$0A\$03\$23\$00\$25\$26\$0A\$27\$00\$2C\$2D\$00\$3 1\$34\$04\$0A\$35\$1D\$04\$0A\$0F\$03\$49\$73\$4C\$76\$0F\$50\$7B\$54\$25\$2C\$1E\$3A\$04\$03\$30\$57\$23\$ 00\$25\$41\$00\$34\$50\$1B\$1F\$54\$4C\$47\$0A\$0A\$40\$03\$0A\$5B\$5E\$60\$03\$86\$83\$92\$89\$91\$81\$95 \$A3\$97\$00\$99\$92\$95\$9A\$02\$9C\$02\$9E\$00\$90\$02\$A0\$9E\$8B\$D6\$D4\$A9\$A7\$D8\$D2\$B6\$B4\$D2\$D 4\$A5\$02\$B2\$02\$02\$AB\$AD\$AF\$AF\$64\$03\$64\$F2\$03\$74\$03\$03\$66\$03\$69\$62\$03\$03\$03\$DD\$DD\$ 03\$FE\$DD\$DD\$E2\$AB\$DD\$B0\$AA\$AD\$03\$A8\$03\$FF\$03\$BC\$DD\$DA\$DC\$C8\$CA\$BC\$C5\$C8\$C5\$CA\$49 \$C5\$64\$64\$4C\$49\$BC\$C5\$B7\$BC\$D0\$00\$B7\$52\$00\$48\$00\$00\$C1\$4E\$00\$D0\$03\$04\$04\$03\$F8\$F 8\$00\$00\$00\$06\$04\$03\$06\$03\$00\$00\$00\$00\$00\$00\$00\$F1\$00\$00 Beginning merge. 
Best offset is -204. Merged size: 481. mergeddata initialized.
Code:No compressable blocks found
Blocks Merged: 457
Blocks Tagged: 289
Bytes Saved: 725

I guess I shouldn't be surprised it's been thought of before, and now I have the problem's actual name rather than what I've been calling it. It's one of a few parts of how I store some very large maps in 16KB. This merges RLE-compressed blocks with each other (and a few other things.)

Even without doing it optimally, you'll get decent savings. (But whether this is okay depends on your whole goal, I guess.) I actually think the steps you described are better than the search this does. I didn't need the best result; I only did it because it was "free" compression. (No extra cycles spent decompressing at run time.) The block lengths are known, so it doesn't matter that one block ends where another starts.

I might share the code, but I'm not convinced that ripping it out of what it's currently part of to use generically will be easy enough for me to bother. I'll certainly watch this topic with interest; I might even try implementing your steps in mine to see if I get better savings.

 Author: tokumaru [ Tue Jun 07, 2016 5:26 pm ] Post subject: Re: Is there a known algorithm for this? pubby wrote:Read up on simulated annealing to see what I mean. It's probabilistic. You check a certain number of random permutations and then stop. Diminishing returns, etc.

I will check it out. Thanks for the tip.

Kasumi wrote:(except all blocks are not a fixed size)

This doesn't matter at all in my algorithm either; I only mentioned it in case it made any difference for someone else coming up with another idea.

Quote:Here's some of its log file:

Interesting... You're apparently attempting to merge blocks with previous blocks as soon as they are found, is that right? I can see why that wouldn't be optimal, since a premature merge prevents better merges from happening with future blocks. I was originally gonna do it like this, but opted for getting all blocks ready first to avoid making crappy merges.

Quote:I guess I shouldn't be surprised it's been thought of before, and now I have the actual name rather than what I've been calling it I guess.

I was sure this had a proper name, since it seemed like a common programming problem. One of the reasons I started the thread was my conviction that tepples would know the name!

Quote:It's one of a few parts to how I store some very large maps in 16KB. This merges RLE compressed blocks with each other (and a few other things.)

I'm also planning on using this for large maps, while still keeping them accessible directly from ROM (which is why the strips are all the same length -- they're not compressed).

Quote:Even without doing it optimally, you'll get decent savings. (But whether this is okay depends on your whole goal I guess.)

I'll just try the greedy algorithm I described and see how much space I can save with that. It was good to find that presentation validating my approach as an acceptable solution to the problem.

Quote:I only did it because it was "free" compression. (No extra cycles spent decompressing at run time.) The block lengths are known, it doesn't matter if the blocks overlap if they start and end the same.

Yes, for the decompressor it doesn't make any difference; there's no performance hit whatsoever. The amount of compression I get from this will be decisive though... I can't possibly go with this scheme if the compression isn't significant.

Quote:I might share the code, but I'm not convinced ripping it out of what it's currently part of to use generically will be easy enough for me to bother.

You could describe it, though.

 Author: Myask [ Tue Jun 07, 2016 5:36 pm ] Post subject: Re: Is there a known algorithm for this? Even if you can't check all permutations (because N!), checking all pairs is "reasonably" quick -- O(Ln²) -- for your greedy algorithm (remember to check the BA merge as well as AB). After each merge, the new string can then just inherit the longest matches from each of its ends, unless the previous best match was the entire string (since that match now stops inside the new, longer string).
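A naive version of that all-pairs check, covering both the AB and BA orientations, is sketched below (note the character-by-character comparison inside `overlap` makes this worse than the O(Ln²) bound; a KMP-style matcher would be needed to actually reach it; names are illustrative):

```python
def overlap(a, b):
    """Longest suffix of a that is also a prefix of b (naive scan)."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def overlap_table(strips):
    """table[(i, j)] = overlap obtained when strip i is placed before strip j.
    Computing both (i, j) and (j, i) covers the AB and BA merge directions."""
    n = len(strips)
    return {(i, j): overlap(strips[i], strips[j])
            for i in range(n) for j in range(n) if i != j}
```

On the thread's example pair, the table is asymmetric: placing strip 0 first overlaps by 5 characters, while the reverse orientation overlaps by only 2, which is exactly why both directions must be checked.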

 Author: tokumaru [ Tue Jun 07, 2016 5:47 pm ] Post subject: Re: Is there a known algorithm for this? Myask wrote:checking all pairs is "reasonably" quick O(Ln²) to do your greedy algorithm (remember to check BA as well as AB merges). You then can just inherit the longest string-matches from each endYes, that's my plan. Test all pairs, and merge the pair with the greatest overlap. Then I test only the blocks that were affected by the merge (i.e. the newly formed block and any blocks that wanted to pair with the ones that were merged), to keep the information updated. Then I keep merging and testing affected blocks until there's nothing left to merge, at which point I concatenate all remaining blocks to form the final block.