I'd love to see what you have on sprite scaling.
Sorry for the silence. It took a while longer than I predicted to get on top of this.
I do however think I've got a pretty nifty variation on Tepples's original scaling method. But on the other hand, given its main drawback I'm not so sure it will be useful to your application...
For the cutscene composer I'm working on, I was using the Ninja Gaiden artwork for testing. And I never before appreciated just how elaborate its palette usage actually is. The face in the first cutscene for example, uses 3 palette variations just to allow a 4th color to be used within the 8x8 tile, and then a sprite overlay on top of that for the eye, maxing out the sprite palette.
Anyway, I was thinking I really wanted a version of sprite scaling that could accommodate this sort of multi-palette sprite. Additionally, I wanted to see if Tepples's sprite scaling could be optimised to run a bit more efficiently, and if there was an opportunity to skip unnecessary updates to CHR-RAM.
The horizontal scaling is done using a lookup table, just as in Tepples's version. For the vertical part, I've chosen to keep the same system of overlapping sprites, but using unrolled loops to do the vertical scaling efficiently. I chose to have the unrolled loops write directly to CHR-RAM, mainly because my use-case is nearly-full-screen images with scanline effects applied, so the trade-off of sacrificing generic frame time for a more efficient vblank period didn't feel like the right one... Also, I figured brute-force scaling updates of big sprites could potentially require more than the ~240 bytes a transfer buffer in the stack can provide. Finally, using the auto-increment feature of $2007 allowed for more efficient register usage, and pha was not an option anyway, given I was using the stack for efficient hand-over between the vertical scalers.
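To make the horizontal lookup table concrete, here's a rough Python sketch of how one such table could be generated offline. The sampling choice (nearest-neighbour, output pixels left-aligned) and the function name are my assumptions; the exact tables in Tepples's version may differ. Since the two bitplanes of a tile scale independently, one 256-entry table per output width is enough:

```python
def make_hscale_table(out_width):
    """Map each possible 8-pixel bitplane byte to its horizontally
    scaled version. Output pixels are left-aligned in the byte and the
    remaining pixels are left blank. Nearest-neighbour sampling."""
    table = []
    for byte in range(256):
        out = 0
        for j in range(out_width):
            src = j * 8 // out_width       # which source column feeds output pixel j
            bit = (byte >> (7 - src)) & 1  # pixel 0 is the MSB in NES CHR
            out |= bit << (7 - j)
        table.append(out)
    return table
```

At scale 8 -> 8 the table is just the identity mapping, so only the seven downscale tables actually need storing.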
Here's one of the unrolled loops:
; Scale 8 -> 5
; Entered via RTS from the call chain; X = source tile index.
; HScalePtr is a page-aligned zero-page pointer to the current
; 256-byte horizontal-scale lookup table.
.repeat 2, I
    ; Use sliver 0
    ldy SourceTilesS0 + 256*I,x
    lda (HScalePtr),y
    sta $2007
    ; -> 4+5+4 = 13 cycles (3+2+3 = 8 bytes)
    ; Skip sliver 1
    ; Use sliver 2
    ldy SourceTilesS2 + 256*I,x
    lda (HScalePtr),y
    sta $2007
    ; Skip sliver 3
    ; Use sliver 4-5
    ldy SourceTilesS4 + 256*I,x
    lda (HScalePtr),y
    sta $2007
    ldy SourceTilesS5 + 256*I,x
    lda (HScalePtr),y
    sta $2007
    ; Skip sliver 6
    ; Use sliver 7
    ldy SourceTilesS7 + 256*I,x
    lda (HScalePtr),y
    sta $2007
    ; 3 empty lines
    lda #0      ; 2 cycles (2 bytes)
    sta $2007   ; 4 cycles (3 bytes) each
    sta $2007
    sta $2007
.endrepeat
    ; Return code: hand over to the next scaler in the call chain
    inx         ; 2 cycles (1 byte)
    rts         ; 6 cycles (1 byte) -> 8 cycles
; Total cycles: 2*(5 * 13 + 2 + 3*4) + 8 = 2*79 + 8 = 166
; Total bytes: 2*(5 * 8 + 2 + 3*3) + 2 = 2*51 + 2 = 104
The way these loops are executed in vblank is also worth mentioning: to keep logic minimal, I employ what I call a "call chain" on the stack, consisting of 2*MAX_ROWS bytes denoting the RTS return addresses of the unrolled loops, for an almost-instant handover from one to the next. The actual addresses are set depending on the vertical scale factor of each row. To accommodate longer/shorter columns and/or partially updated columns, the stack pointer can be set to start at a different point in the call chain, and an address in the call chain can be overwritten with the final return address to end the column updating prematurely.
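In case it helps to visualise, here's a rough Python sketch of how such a call chain could be laid out (the function and parameter names are mine). Each entry is a routine's address minus one, since RTS increments the address it pulls, stored low byte first in the order the bytes sit in stack memory:

```python
def build_call_chain(scaler_addrs, end_addr, start_row=0):
    """Byte layout of the RTS 'call chain': one 2-byte entry per row,
    each being the target scaler's address minus one (RTS adds one to
    the address it pulls), low byte at the lower stack address.
    start_row lets a partially updated column skip ahead; end_addr is
    the final return address that terminates the column."""
    chain = []
    for addr in scaler_addrs[start_row:]:
        entry = addr - 1
        chain += [entry & 0xFF, entry >> 8]
    entry = end_addr - 1   # final return address ends the column update
    chain += [entry & 0xFF, entry >> 8]
    return chain
```

Ending a column early then just means writing end_addr - 1 over an earlier slot instead of appending it at the end.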
But even with the unrolled loops and selective updates of only the columns and rows that actually changed in the frame, the brute-force method was still taking almost half the frame time for ~60 sprites, which made it not very useful in practical scenarios. So I realised that to get this thing off the ground, I would have to really embrace the NES philosophy of just doing small incremental updates to VRAM...
The core problem here is that your default integer-rounding for deciding horizontal / vertical scale factors for each tile will end up with tile sizes oscillating between the current / next integer scale factor. This pattern can be observed in the left version of the following gif animation, which illustrates both the column / row scale factors, as well as overlaying red transparency on the "dirty" columns and rows.
The left-hand version thus means invalidating the heavy work we just did, just to re-do it again soon. Not ideal.
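To put numbers on the oscillation, here's a toy Python model of the naive approach (the exact rounding scheme is my assumption, for illustration only). Growing the sprite by a single pixel can dirty every row at once:

```python
def naive_row_heights(total_px, n_rows):
    """Pixel height of each tile row when total_px is split across
    n_rows tiles by plain integer rounding of the row boundaries."""
    bounds = [(i * total_px + n_rows // 2) // n_rows for i in range(n_rows + 1)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

def dirty_rows(h_from, h_to, n_rows):
    """How many rows change size (and so need a CHR-RAM rewrite)
    when the sprite height goes from h_from to h_to pixels."""
    return sum(a != b for a, b in zip(naive_row_heights(h_from, n_rows),
                                      naive_row_heights(h_to, n_rows)))
```

For a 5-row sprite going from 17 to 18 pixels tall, the heights flip from 3,4,3,4,3 to 4,3,4,3,4 - all five rows dirty for a one-pixel change.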
The right-hand version OTOH manages to minimise the updates to just one dirty row / column at a time, by using a "scaling progression" table for the columns / rows that increases the size of just one column / row at a time according to a set pattern. Whilst I thought a bit about how best to generate this table, I just ended up writing it up manually to be visually appealing. Basically, it just picks which columns / rows to increase to the next tile size, depending on the column width / row height of the sprite. Here are the hard-coded patterns I chose for a column or row count of 2-10:
2: 1, 0,
3: 1, 0, 2,
4: 2, 0, 3, 1,
5: 2, 0, 4, 1, 3,
6: 3, 0, 5, 2, 4, 1,
7: 3, 0, 6, 2, 5, 1, 4,
8: 4, 0, 2, 6, 5, 7, 1, 3,
9: 4, 0, 8, 2, 6, 1, 7, 3, 5,
10: 5, 0, 2, 8, 6, 3, 9, 1, 7, 4
Basically, I started by incrementing the middle column first, and then successively tried to increment the columns / rows with the furthest distance to already-incremented neighbours, to manually "smooth out" the scaling.
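For illustration, here's how the hard-coded patterns above could be applied in Python to get per-row (or per-column) sizes; the function name and representation are mine. Growing the sprite by one pixel then changes exactly one row:

```python
# Progression order per row/column count, from the table above.
PROGRESSION = {
    2: [1, 0],
    3: [1, 0, 2],
    4: [2, 0, 3, 1],
    5: [2, 0, 4, 1, 3],
    6: [3, 0, 5, 2, 4, 1],
    7: [3, 0, 6, 2, 5, 1, 4],
    8: [4, 0, 2, 6, 5, 7, 1, 3],
    9: [4, 0, 8, 2, 6, 1, 7, 3, 5],
    10: [5, 0, 2, 8, 6, 3, 9, 1, 7, 4],
}

def progressive_row_heights(total_px, n_rows):
    """Every row gets the base size; the first `extra` rows in the
    progression order get one extra pixel."""
    base, extra = divmod(total_px, n_rows)
    heights = [base] * n_rows
    for row in PROGRESSION[n_rows][:extra]:
        heights[row] += 1
    return heights
```

Compare with the naive rounding: going from 17 to 18 pixels over 5 rows now only grows row 4, leaving the other four rows untouched.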
This shows the same naive scaling on the left and the scaling progression table on the right, without the debug visualisations:
The right side does end up looking different from the left, but not observably worse AFAICT. Both methods are kind of equally wrong, as the "proper" way would be to increase the pixel sizes, given infinite resolution.
There's also yet another trick I tried to minimise the VRAM bandwidth: if your scaling happens to occasionally dirty both a column and a row in the same frame, you can delay one of them, with the only artifact being that the growing / shrinking of your sprite is delayed by a single frame, which won't be noticeable in practice. Essentially, the desired x/y scale of a sprite becomes a target value to attempt to reach within a set bandwidth limit.
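A minimal Python sketch of that bandwidth-limiting idea (prioritising the x axis first is an arbitrary choice here; alternating between axes would work just as well):

```python
def step_scale(current, target):
    """Advance the current (sx, sy) scale one step toward the target,
    changing at most one axis per frame so that a single frame never
    dirties both a column and a row at once."""
    sx, sy = current
    tx, ty = target

    def toward(v, t):
        return v + (1 if t > v else -1 if t < v else 0)

    nx = toward(sx, tx)
    if nx != sx:
        return (nx, sy)   # x stepped this frame; y waits until the next
    return (sx, toward(sy, ty))
```

So a sprite jumping from scale (4, 4) to (5, 5) spends one intermediate frame at (5, 4) before reaching the target.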
Finally, the big caveat of this method is the blank "zombie" pixels of vertically downscaled tiles, which overlap with the pixels in the tile below and can cause the sprites-per-scanline limit to kick in. Getting rid of this requires sorting the sprites so that zombie pixels always have a higher sprite index than the non-zombie ones. Here's another animation to illustrate it. (yellow parts are the blank zombie pixels that we want to be dropped before any other sprites)
Now in practice, a general sorting routine would be too expensive on the NES. But merging two already-sorted lists is a linear operation. So essentially you just need your sprite drawing code to work on a per-row basis, and select the row from the metasprite whose partial vertically-scaled tiles have the lowest bottom edge (i.e. furthest down the screen).
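Here's a rough Python sketch of that linear merge, under my reading that rows are emitted lowest-on-screen first so that rows carrying zombie pixels end up on higher sprite indices (the row representation is made up for illustration):

```python
def merge_rows(a, b):
    """Merge two per-metasprite row lists into OAM emission order.
    Each list holds (bottom_edge_y, row) pairs, pre-sorted with the
    lowest-on-screen bottom edge (largest y) first. A row's blank
    'zombie' scanlines are overlapped by real pixels from the row
    below it, which is emitted earlier and so gets a lower sprite
    index - meaning the zombies are dropped first when the
    8-sprites-per-scanline limit kicks in."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] >= b[j][0]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]
```

Since each pass through both lists is O(n), this fits the per-frame budget far better than a general sort would.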
But this caveat is also the big drawback of this method. Not only does it add some awkward complexity to your code and put limitations on your OAM cycling schemes, it also prevents you from using the sprite index number to make metasprites appear in front of / behind each other based on Z-order.
And given that the top use case for sprite scaling is to simulate a 3d scenario, that's a pretty tough sacrifice to make to get smooth sprite scaling...
My own use-case of making deterministic cutscenes with a lot of room left for raster effects means that the lower CPU time and VRAM bandwidth win out over depth-ordering features. But unless you can design your ray caster levels to avoid having enemies that can go in front of / behind each other - a very limiting constraint - you might be better off just using prescaled sprites and accepting choppier scaling.
Reverting to Tepples's original scheme is also possible, but then you're back to using a lot of VRAM writes, as a single dirty row will effectively invalidate all other rows below it by default. Though the scaling progression table should still help alleviate *some* of this cost.
Another downside compared to Tepples's method is that it doesn't work so well for simulating rotation with sprite shearing. In Tepples's method you just rewrite rows instead of shifting the sprite grid vertically, so shifting the whole thing horizontally works fine without any artifacts. But trying to shift the grid both vertically and horizontally will introduce ugly holes in your sprite as can be seen in these images.
You can probably get around this problem, though, by introducing a default horizontal overlap of 1 pixel for your sprite rows, at the expense of each row using one more sprite. It won't help the Ninja Gaiden face example, however, as it already maxes out the sprites-per-scanline usage.
Attached is a simple demo rom using it in practice. It actually uses forced blanking at the top of the screen, as the CHR-RAM writes are just slightly too long to fit in vblank due to the unrolled loops writing directly to CHR-RAM. But if you converted the unrolled loops to a version that fills a transfer buffer, it would definitely fit in the vblank period.
(edit: Replaced previously uploaded buggy demo ROM with one which has correct sprite indexing and also avoids the OAM corruption bug on NTSC)
A few other variations I thought of:
- If you have WRAM on your cart, it might make more sense to do your sprite scaling in two stages: The horizontal scaling updating dirty columns in WRAM, and the vertical scaling unrolled-loops then writing the data of the already-horizontally-scaled columns (if dirtied) to VRAM. Removing the horizontal scaling from the critical path effectively makes the VRAM writes as cheap as using a transfer buffer, but you also gain the benefit of re-using the work done for horizontal scaling, which should be a nice performance gain.
- I'd like to do an 8x16 sprite version of these unrolled loops at some point, not least because it's a bit cumbersome to have to track dirty rows / tile sizes in two separate bytes. It'd also be nice to display a scaling 128-tile metasprite with this method, though it would require most of the vblank time and blanking the top / bottom 8 pixels to achieve.
- I think an unrolled version of Tepples's original method would be worth exploring as well at some point, due to the benefits it has over sprite overlapping. But the unrolled loop would be a bit more complicated, because in the NES CHR format the bitplanes are interleaved. You essentially need to have your unrolled loop work on one bitplane of a tile at a time, and use one of the index registers to fetch source CHR data from a variable point. If you want to avoid using indirect indexing for this source CHR data and use a fixed address instead, this also limits the tiles a single unrolled 1bpp loop can address to 256/8 = 32 tiles in height, which could be limiting. Though if your memory organisation allows, you could pack that source CHR data into the start of each bank and employ bank switching to have your fixed $8000 fetches read data from different PRG-ROM locations.