Page 1 of 3

Minimizing expansion of CHR converted for real-time updates

Posted: Wed Aug 22, 2012 7:55 am
by tokumaru
In order to maximize the number of patterns I can update during VBlank, I thought of converting the graphics to a series of LDA #$XX; STA $2007 followed by an RTS, which would send tiles to VRAM the fastest way possible (6 cycles per byte). Of course this has the inconvenient side effect of expanding the data by a factor of 5. In order to minimize that, I implemented a simple compression scheme that uses a 3-byte dictionary (the dictionary is stored in A, X and Y), which usually reduces the expansion to about 4 times rather than 5. Still incredibly huge though. Can anyone think of better ways to minimize that expansion?

I imagine that I could reuse longer strings if I made them separate subroutines that could be called as many times as necessary, or maybe make a more advanced analysis of the data and generate algorithms that would produce the desired output (this sounds too complex!). One more thing to worry about is that the generated code can't take longer than a certain threshold to execute, since it has to fit in VBlank along with other tasks, so the converter would have to count cycles and use that information to make decisions.

Obviously, I wouldn't use this for all the graphics, just for the ones that need to be animated more frequently, such as the main character. Other graphics can just be loaded with indexed absolute addressing or even the slow LDA ($XX), Y; STA $2007; INY way.

I'd like to point out that I'm aware of other methods of uploading data to VRAM, such as pulling data from the stack (8 cycles per byte) or loading from ZP (7 cycles per byte), but these are still too slow and require a good chunk of RAM. The only way to achieve 6 cycles per byte is with immediate addressing, and without WRAM, the code that writes the data does have to be in the ROM.

Re: Minimizing expansion of CHR converted for real-time upda

Posted: Wed Aug 22, 2012 11:30 am
by Bregalad
Why not generate the long lda #$xx / sta $2007 chain in RAM or WRAM in real time ? Of course if you plan to do this while an actual game is playing WRAM will be pretty much required, but for an intro plain RAM could be just enough - there is just enough space in $300-$7ff to store 256 "transfers".

If I remember well already done it at a time just to try, and if I remember well transfering 256 bytes (16 tiles) in a single VBlank + doing sprite DMA was no problem (on NTSC - on PAL that would be no problem even using a fully rolled loop).

Re: Minimizing expansion of CHR converted for real-time upda

Posted: Wed Aug 22, 2012 12:59 pm
by rainwarrior
Battletoads does a lot of its CHR-RAM updates this way with code copied to RAM. Obviously you can't practically max out your usage without WRAM, but it could be worth devoting some RAM to it.

Re: Minimizing expansion of CHR converted for real-time upda

Posted: Wed Aug 22, 2012 1:09 pm
by tokumaru
Yeah, the obvious solution would be to use WRAM, but that's not always available. I never use WRAM, I'm not sure why. Maybe it's to keep the cost down if I decide to make carts, or even just to see how far the NES can go without such extensions. Without WRAM, there would be no memory left for the actual game.
Bregalad wrote:if I remember well transfering 256 bytes (16 tiles) in a single VBlank + doing sprite DMA was no problem
Exactly. My game is supposed to execute 2 VRAM updates + sprite DMA every frame, and if both slots are occupied by pattern updates, 256 bytes of CHR (16 tiles) will be transferred. That won't happen so often, since other updates (rows and columns of tiles, palettes, etc.) also have to use those slots.

I know that this is a crazy solution, and I know that the amount of ROM required is a big price to pay. I'm just trying to reduce that cost a bit, I'm not looking for anything miraculous that will make the quick CHR code occupy the same space as regular CHR data.

Re: Minimizing expansion of CHR converted for real-time upda

Posted: Wed Aug 22, 2012 1:12 pm
by 3gengames
32KB of WRAM is IMO the best answer, 8KB or even 16KB can hold a ton of unrolled code! :) And then your game has a ton more RAM to work with in general.

Re: Minimizing expansion of CHR converted for real-time upda

Posted: Wed Aug 22, 2012 1:20 pm
by tokumaru
rainwarrior wrote:but it could be worth devoting some RAM to it.
With just 2KB of RAM, maybe dedicating 128 or 256 bytes to CHR data would be realistic (although in my case I really can't spare that much), but that will only allow you to copy as fast as 7 cycles per byte (if you put the data in ZP, otherwise it would take 8 cycles per byte and there would be no advantage at all). To go as fast as 6 cycles per byte you need 5 times the space, which would be prohibitive with such little RAM.

I'm just looking for a way to make the expanded code a little smaller. If I can't find a way, I'll still use the simple 3-byte dictionary compression I already implemented, and keep the amount of tiles stored this way to a minimum (i.e. just the main character and some animated level objects).

EDIT: Just in case this isn't clear: for this particular game, I'm not using WRAM. I will take the data expanded 5 times rather than using WRAM.

Re: Minimizing expansion of CHR converted for real-time upda

Posted: Wed Aug 22, 2012 1:46 pm
by thefox
Well, here's a pipe dream: wait for somebody to come up with a super mapper that allows one to upload PPU updates to the mapper (FPGA blockram mayhaps), and then have the mapper generate the LDA #imm STA pairs on the fly. :)

Re: Minimizing expansion of CHR converted for real-time upda

Posted: Wed Aug 22, 2012 2:01 pm
by tepples
A proper "super mapper" would take full control over the CHR bus and implement what kevtris calls a "stuffer": a FIFO of (address, data) pairs to execute on cycles when the PPU isn't doing anything that matters, such as the fetches for the 34th tile on a row or the nametable fetches between the sprite pattern fetches.

But it'd almost be easier to use mapper 119, which puts CHR ROM in banks 0-63 for the main character and CHR RAM in banks 64-71 for everything else.

Re: Minimizing expansion of CHR converted for real-time upda

Posted: Wed Aug 22, 2012 2:35 pm
by tokumaru
tepples wrote:But it'd almost be easier to use mapper 119, which puts CHR ROM in banks 0-63 for the main character and CHR RAM in banks 64-71 for everything else.
I could very well use the MMC3 (which is common as hell) with CHR-ROM and not have to worry about updating patterns at all. But I still want to see how well the NES can perform when using the same style of pattern updates that's common on other platforms which don't have the possibility of using ROM for storing tiles.

The Master System is an 8-bit console that does extremely well in this area. Since it doesn't have sprite flipping on hardware, it's pretty much mandatory that the patterns are constantly rewritten. Most games animate the main character this way, and a good number of them animate other objects as well.

On the NES this is less common. Sprite flipping makes it possible to keep all animation frames loaded at all times on simpler games, and more complex games usually went with CHR-ROM and bankswitching. The few games that made heavy use of CHR-RAM were either PAL-only or resorted to forced blanking in order to be able to transfer the amount of data necessary for the animations.

According to my calculations, the NES can handle quite a lot of pattern animation without the need for forced blanking, but the speed of 6 cycles per byte would really help.

Re: Minimizing expansion of CHR converted for real-time upda

Posted: Thu Aug 23, 2012 1:09 am
by Bregalad
You're saying you have 2 slots for VRAM updates, and that both can point to palettes, NT/AT or Patterns ?
It might sound like a silly question, but...
Would your game still work if only one of those slots could update patterns ? That way the "worst case" you're thinking about will never be happening.

Why would you update 16 tiles in a frame to update 0 on the next when you could update 8 tiles on both frames ?

Nevertheless if you have to do it the way you said, even though it wastes ridiculous amount of ROM I'd say it's an interesting idea.

I think it could be optimized by doing the following :
- Load the most 2 commonly used bytes in X and Y at the start of the code, and never touch them again. This will save some lda #$xx instructions.
- When there is "runs" of multiple identical bytes, do only have a single lda #$xx

However this won't reduce much the size of your code, just a little.

Re: Minimizing expansion of CHR converted for real-time upda

Posted: Thu Aug 23, 2012 7:04 am
by tokumaru
Bregalad wrote:You're saying you have 2 slots for VRAM updates, and that both can point to palettes, NT/AT or Patterns ?
Yes, there can be 2 pattern updates or 2 NT/AT updates in the same frame, but this is not necessary for the palette, since it's completely overwritten (i.e. once a slot has been allocated for a palette update it doesn't matter how many objects modify the palette, it will all be updated at once).
It might sound like a silly question, but...
Would your game still work if only one of those slots could update patterns ?
It would work, but since the main character is practically always changing tiles it would be kinda hard to animate anything else. I'm afraid that some animations would become noticeably laggy.
Why would you update 16 tiles in a frame to update 0 on the next when you could update 8 tiles on both frames ?
The idea is indeed to update 8 tiles per frame most of the time, while the other slot is used for other things. If the other slot is free however, using it for pattern updates will help keep the animations flowing smoothly, because there might be frames where none of the slots are used for patterns (such as when scrolling diagonally, when one slot is used for updating rows and the other for updating columns - scrolling updates really can't be delayed, because the visual glitches are too noticeable).
Nevertheless if you have to do it the way you said, even though it wastes ridiculous amount of ROM I'd say it's an interesting idea.
Yeah, that's what I think. Every unconventional programming solution has its disadvantages, and in this case the downside is the huge amount of ROM it costs to store the graphics. But I can live with that if only some of the graphics use this technique.
I think it could be optimized by doing the following :
- Load the most 2 commonly used bytes in X and Y at the start of the code, and never touch them again. This will save some lda #$xx instructions.
- When there is "runs" of multiple identical bytes, do only have a single lda #$xx
I'm doing something similar, but instead of keeping two values loaded at all times I'm throwing away the value that will take the longest to be used again when a new value has to be loaded. I might try your method and see which one is better.

Re: Minimizing expansion of CHR converted for real-time upda

Posted: Thu Aug 23, 2012 7:26 am
by tepples
Consoles where this was widespread had either DMA to VRAM during vblank or dual-ported VRAM or both. The SMS and Genesis in particular have pseudo-dual-ported VRAM implemented using what kevtris calls a "stuffer". It has two different VRAM addresses instead of the NES PPU's single loopy_v: one used for rendering and one used by the host. Games write to VRAM through the host port, and the VDP holds on to the address and data until the next idle cycle in the scanline.

Now consider how this could be simulated on the NES. There are 341 dots, and it takes about 24 dots to add one item to the FIFO. Thus the FIFO would need to be about 14 units deep. A mapper simulating this would have plenty of time to empty the FIFO. Out of 170 video memory reads per scanline, the data sent to the PPU doesn't matter for 22 reads that occur in or near horizontal blanking: four for the thirty-fourth background tile, two for the nametable fetch before each sprite pattern fetch, and two at the end of the scanline. So a mapper that controls all 13 CHR RAM address lines could sit between the PPU address bus and the CHR RAM, watch the nametable access pattern, and execute writes from the FIFO at those times.
tokumaru wrote:throwing away the value that will take the longest to be used again when a new value has to be loaded
The problem of optimizing the use of three registers is equivalent to cache algorithms. What you describe is the clairvoyant algorithm, which can't run in real time but is optimal when run offline: "When a [value] needs to be swapped in, the [compiler] swaps out the [value] whose next use will occur farthest in the future." I wonder to what extent you can save bytes by planning out which values can be calculated with ASL/LSR/ROL/ROR (and thus kept in A) or with DEX/INX/DEY/INY (and thus kept in X or Y).

Re: Minimizing expansion of CHR converted for real-time upda

Posted: Thu Aug 23, 2012 8:48 am
by Bregalad
It would work, but since the main character is practically always changing tiles it would be kinda hard to animate anything else. I'm afraid that some animations would become noticeably laggy.
Then give the main character's animation update higher priority.

Does the player change frame every frame ? No, very unlikely, even if you have very detailed graphics frames of animation will last at least 4 hardware frames. I think it's affordable to have 1/4 of probability delay other updates of a single frame, and it won't be that noticeable. Unless you use re-writable patterns for all enemies/whathever, but that would not be a good idea on the NES anyway.

Re: Minimizing expansion of CHR converted for real-time upda

Posted: Thu Aug 23, 2012 8:51 am
by cpow
tepples wrote:I wonder to what extent you can save bytes by planning out which values can be calculated with ASL/LSR/ROL/ROR (and thus kept in A) or with DEX/INX/DEY/INY (and thus kept in X or Y).
Good god this sounds like an extremely interesting technical challenge! So much more so than the fire-drill remote debugging drudgery I'm currently enslaved in at work.

Given any set of bytes (tiles or NT/AT), what is the shortest NES 6502 code segment (bytes *and* cycles) that can achieve copying said set to the PPU? Hah!

I can imagine the 'compiler' would take several phases:

1. Analyze for RLEable groups.
2. Analyze for "nearest neighbor groupings" in the (-2,+2) domain since more than two INX/DEX mightaswell just use LDX.
3. Analyze for "nearest neighbor groupings" in the ASL/LSR and ROL/ROR domain.
4. Analyze for "repeat patterns (chains of bytes longer than 7 bytes perhaps?) that don't fit anything already found" that can be written using a loop and a table of $2006 values for starting place.
5. Emit optimized code.
6. $?

Obviously, though, since it has to be lossless there's always the tileset that's going to result in 0% compression over the original suggestion.

Re: Minimizing expansion of CHR converted for real-time upda

Posted: Thu Aug 23, 2012 9:16 am
by tepples
cpow wrote:2. Analyze for "nearest neighbor groupings" in the (-2,+2) domain since more than two INX/DEX mightaswell just use LDX.
As I understand the problem that tokumaru stated, we're trying to minimize cycles first and then break ties by minimizing bytes. Naive application of the clairvoyant method will already minimize cycles and given an upper bound for bytes. A single INX, DEX, INY, DEY, ASL, LSR, ROL, or ROR can save one byte and zero cycles over LDX #, LDY #, or LDA #; more than one will waste cycles.