Dealing with only X, Y, and Direct Page.

Drew Sebastino
Formerly Espozo
Posts: 3496
Joined: Mon Sep 15, 2014 4:35 pm
Location: Richmond, Virginia

Post by Drew Sebastino »

I know I've gone on in the past about how big a pain in the ass this is, and I've asked some of you how you've worked around it, but it has been a big enough issue for me that I think it deserves its own thread. Basically, two index registers are not enough for a lot of things. Two examples where I ran into problems are my metasprite routine and my VRAM finding routine. My metasprite routine needs offsets for my object table, sprite buffer, and metasprite data, and my VRAM finding routine needs offsets for my object table, VRAM space table, and animation frame data.

You actually have just enough registers if you include Direct Page, but aside from only saving you one cycle when it is a multiple of 256, it's got a major problem: it can only be in bank 0. If the SNES were designed to have more than 8KB of RAM in each bank (How the hell did anyone program for the PCE/Turbografx 16?) it wouldn't be a problem, but as it currently stands, it's a major pain in the ass. I really don't want to worry about cramming all my metasprite data, along with everything else, into one bank (as I had been doing until I realized the limits of Direct Page), but I also don't want to worry about running out of space for an object table.

How many bytes make up each object in your code? I guess I could comfortably see using 48 bytes per slot and having 128 object slots, which should be 6KB. It's a bit ridiculous how I've gotten slower as time has gone on due to me revising my code one thousand times. :|
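As a quick check of the numbers in this post, the object table budget can be sketched in Python. The constant and function names here are made up for illustration; only the 48 bytes per slot, 128 slots, and 8KB figures come from the post.

```python
# Rough RAM budget for the object table described in the post.
# All names are illustrative; only the three numbers come from the thread.
LOW_RAM_BYTES = 8 * 1024   # RAM reachable through bank 0
BYTES_PER_SLOT = 48
SLOT_COUNT = 128


def object_table_bytes(bytes_per_slot: int, slot_count: int) -> int:
    """Total size of a flat table of fixed-size object slots."""
    return bytes_per_slot * slot_count


table_bytes = object_table_bytes(BYTES_PER_SLOT, SLOT_COUNT)
remaining_bytes = LOW_RAM_BYTES - table_bytes

print(f"object table: {table_bytes} bytes ({table_bytes // 1024} KB)")
print(f"left in low RAM: {remaining_bytes} bytes")
```

With these figures the table takes three quarters of the 8KB, which is why the choice between Direct Page and a larger object table feels tight.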
TOUKO
Posts: 306
Joined: Mon Mar 30, 2015 10:14 am
Location: FRANCE

Re: Dealing with only X, Y, and Direct Page.

Post by TOUKO »

On PCE you don't need any VRAM finding routine, as you can cram sprites anywhere in all of your 64 KB.
Personally I use a double buffer in VRAM if dynamic sprites are needed (e.g. for BTU), as you can write to it at any time.

For OAM/SAT, no registers are needed, as personally I write directly to VRAM.

I think if you lack index registers, you must use pointers, perhaps for your OAM buffer.
HihiDanni
Posts: 186
Joined: Tue Apr 05, 2016 5:25 pm

Re: Dealing with only X, Y, and Direct Page.

Post by HihiDanni »

I ran into the same issue while refactoring my sprite routine, actually. I had the direct page pointing to the spritedef with X and Y pointing to the current object and the OAM destination, respectively. I realized that this would only let you put spritedefs into bank 0, so I reworked the function.

Now, X is the object as always, Y is now the spritedef, and D points to the OAM destination. This theoretically means that I can now have the spritedef in any LoROM style bank, since I can switch the data bank before calling into the function.
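The bank restriction behind this rework can be modeled in a few lines of Python. This is only a sketch of how the 65816 forms effective addresses in native mode, not HihiDanni's actual routine; the function names and the example addresses are assumptions chosen for illustration.

```python
def direct_page_address(d: int, offset: int) -> int:
    """Direct page access: the bank is always $00 and the sum wraps at 64 KB."""
    return (d + offset) & 0xFFFF


def absolute_indexed_address(dbr: int, base: int, index: int) -> int:
    """Absolute indexed access: the bank comes from the data bank register."""
    return ((dbr << 16) + base + index) & 0xFFFFFF


# Hypothetical layout: an OAM buffer in low RAM reached through D,
# and spritedef data in ROM bank $82 reached through Y.
oam_entry = direct_page_address(d=0x0300, offset=0x04)
spritedef_byte = absolute_indexed_address(dbr=0x82, base=0x8000, index=0x0010)

print(f"${oam_entry:06X}")       # stays in bank $00
print(f"${spritedef_byte:06X}")  # lands in bank $82
```

Since the OAM buffer lives in bank 0 RAM anyway, pointing D at it costs nothing, while the data that benefits from bank switching goes through an index register.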

Currently I'm finding 8kB plenty to work with, although space might get tighter once I start working on the optimized collision routines where I'll need some data structures for indexing.

Edit: A VRAM slot finder probably doesn't need the current object index to be able to look for a new slot, so you should be able to just store it temporarily in a scratch variable (or the stack). I don't suspect such a function would be called super often each frame so you can easily afford it.
SNES NTSC 2/1/3 1CHIP | serial number UN318588627
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Dealing with only X, Y, and Direct Page.

Post by tepples »

Espozo wrote:My metasprite routine needs offsets for my object table, sprite buffer, and metasprite data
My NES programs use 16 bytes of zero page for local variables. The metasprite routine in The Curse of Possum Hollow uses 13 of them.

The "draw individual actor" and "draw bullet" subroutines have to copy the position and identity of each object from the respective object table into zero page. But once all these are read out, it no longer has to access the object table for that sprite when drawing the metasprite proper.

It's called with the sprite sheet ID and frame number in X and A. Additional arguments are passed in the local variable area, taking 6 bytes:
2 bytes: X coordinate
2 bytes: Y coordinate
1 byte: Base tile number in video memory
1 byte: Attributes

Once it starts running, it can proceed to use Y to index into the metasprite data, X to index into the sprite buffer, and 7 more bytes of zero page for the current horizontal strip's state.
2 bytes: X coordinate
1 byte: Remaining width in sprites
1 byte: Y coordinate
1 byte: Attributes of current strip (for split-palette or layered sprites)
2 bytes: Pointer to start of metasprite data
Espozo wrote:How the hell did anyone program for the PCE/Turbografx 16?
The TG16 has 8 KiB of RAM, which you observed is the same as the Super NES's low memory. The NES has 2 KiB of RAM, and the majority of games didn't expand that with extra RAM on the cartridge. Yet people programmed for it.
Espozo wrote:How many bytes make up each object in your code?
Actors in Curse are 16 bytes, and there are 6 slots. There are 8 additional "entry queue" slots for actors, each 4 bytes in size. Bullets are 6 bytes, and there are 12 slots.
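The byte counts in this post can be tallied with a short Python sketch. The dictionary keys are labels invented here; the sizes and slot counts are the ones listed above.

```python
# Zero-page locals used by the metasprite routine (sizes in bytes).
ARGUMENT_BYTES = {"x": 2, "y": 2, "base_tile": 1, "attributes": 1}
STRIP_STATE_BYTES = {
    "x": 2,
    "remaining_width": 1,
    "y": 1,
    "strip_attributes": 1,
    "metasprite_pointer": 2,
}
local_bytes = sum(ARGUMENT_BYTES.values()) + sum(STRIP_STATE_BYTES.values())

# Object pools as (bytes per slot, slot count).
OBJECT_POOLS = {"actors": (16, 6), "entry_queue": (4, 8), "bullets": (6, 12)}
pool_bytes = {name: size * count for name, (size, count) in OBJECT_POOLS.items()}
total_pool_bytes = sum(pool_bytes.values())

print(local_bytes)        # zero-page bytes used by the routine
print(pool_bytes)         # per-pool RAM use
print(total_pool_bytes)   # all object tables together
```

All three pools together stay well under a single 256-byte page, which puts the 6KB object table discussed earlier in perspective.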
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: Dealing with only X, Y, and Direct Page.

Post by psycopathicteen »

tepples wrote:
Espozo wrote:How the hell did anyone program for the PCE/Turbografx 16?
The TG16 has 8 KiB of RAM, which you observed is the same as the Super NES's low memory. The NES has 2 KiB of RAM, and the majority of games didn't expand that with extra RAM on the cartridge. Yet people programmed for it.
Espozo wrote:How many bytes make up each object in your code?
Actors in Curse are 16 bytes, and there are 6 slots. There are 8 additional "entry queue" slots for actors, each 4 bytes in size. Bullets are 6 bytes, and there are 12 slots.
Pretty much it. SNES homebrewers want more stuff onscreen than NES homebrewers.

Now that I think about it, I wonder how much of a placebo effect memory size has on how you use it. Maybe if I try stuffing more stuff in 8kB, 8kB wouldn't seem as tight.

Something that gets on my nerves more and more is how you can do long indexing with X but not Y, but you can do long indirect indexing with Y but not X. I don't think you can directly load X or Y from long addresses.
Drew Sebastino
Formerly Espozo
Posts: 3496
Joined: Mon Sep 15, 2014 4:35 pm
Location: Richmond, Virginia

Re: Dealing with only X, Y, and Direct Page.

Post by Drew Sebastino »

HihiDanni wrote:Now, X is the object as always, Y is now the spritedef, and D points to the OAM destination.
How are you able to use Direct Page for this with the SNES's dumbass HiOAM table?
HihiDanni wrote:Currently I'm finding 8kB plenty to work with
I mean, it's plenty unless you plan on squeezing your object table into it. However, I have noticed that I have the majority of the data I need set aside for an object in each routine, and it's 20 bytes, so I should be able to have over twice that for each object and still have 128 objects.
psycopathicteen wrote:Something that gets on my nerves more and more is how you can do long indexing with X but not Y, but you can do long indirect indexing with Y but not X. I don't think you can directly load X or Y from long addresses.
It's honestly not as absurd as not having an add without carry. Having under 256 different instructions must have been a bitch when designing this thing.
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: Dealing with only X, Y, and Direct Page.

Post by psycopathicteen »

That makes no sense either. Don't think it even needs an "ADC" and "SBC" in the first place. Just have an "ICS" increment if carry set, and "DCC" decrement if carry clear instructions and it would make more sense.

I'd like to know if there were any other cheap CPUs that actually fixed the problems of the 65xx architecture. Everything else seemed to be just a battle of who can make the most expensive CPU possible.
HihiDanni
Posts: 186
Joined: Tue Apr 05, 2016 5:25 pm

Re: Dealing with only X, Y, and Direct Page.

Post by HihiDanni »

Espozo wrote:
HihiDanni wrote:Now, X is the object as always, Y is now the spritedef, and D points to the OAM destination.
How are you able to use Direct Page for this with the SNES's dumbass HiOAM table?
I don't. I reuse the X index register to do it. This requires preserving the current value of X, and it's something I could optimize in the future maybe, but I personally don't think it's a big deal.

And most of the high OAM processing isn't done in AddSprite either.
rainwarrior
Posts: 8733
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada

Re: Dealing with only X, Y, and Direct Page.

Post by rainwarrior »

psycopathicteen wrote:That makes no sense either. Don't think it even needs an "ADC" and "SBC" in the first place. Just have an "ICS" increment if carry set, and "DCC" decrement if carry clear instructions and it would make more sense.
This is only true if the only result you care about is what's left in the accumulator. The result of the other flags after adding or subtracting are dependent on that carry, and are essential for multi-byte/word operations. You need ADC/SBC for that.
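For readers following the carry argument, here is a minimal Python model of chaining an 8-bit add-with-carry across two bytes. It only illustrates what ADC does with the carry in a multi-byte add; it does not model the other status flags and does not try to settle whether a hypothetical ICS/DCC pair could replace it. The function names are made up for this sketch.

```python
def adc8(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """8-bit add with carry: returns (result byte, carry out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def add16_with_adc(x: int, y: int) -> tuple[int, int]:
    """16-bit add built from two chained 8-bit ADC steps."""
    lo, carry = adc8(x & 0xFF, y & 0xFF, 0)    # CLC, then ADC on the low bytes
    hi, carry = adc8(x >> 8, y >> 8, carry)    # ADC on the high bytes
    return (hi << 8) | lo, carry


print(add16_with_adc(0x12FF, 0x0001))  # low-byte carry ripples into the high byte
print(add16_with_adc(0xFFFF, 0x0001))  # wraps to zero with carry out set
```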
HihiDanni
Posts: 186
Joined: Tue Apr 05, 2016 5:25 pm

Re: Dealing with only X, Y, and Direct Page.

Post by HihiDanni »

rainwarrior wrote:This is only true if the only result you care about is what's left in the accumulator.
On the SNES, that is most of the time. It has a 16-bit CPU after all.
93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: Dealing with only X, Y, and Direct Page.

Post by 93143 »

Espozo wrote:Having under 256 different instructions must have been a bitch when designing this thing.
Try coding for the Super FX. 16 registers, 8-bit instruction size. And lots of instructions need source, operand and destination registers. Something as simple as XOR requires a prefix instruction just to specify the operation because they ran out of opcodes.

But yeah, it's a tad limiting. Just upgrading the 65xx concept to 16-bit without worrying about backward compatibility would have resulted in a massively more powerful processor. I kinda like the idea of a Z register too...
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: Dealing with only X, Y, and Direct Page.

Post by psycopathicteen »

rainwarrior wrote:
psycopathicteen wrote:That makes no sense either. Don't think it even needs an "ADC" and "SBC" in the first place. Just have an "ICS" increment if carry set, and "DCC" decrement if carry clear instructions and it would make more sense.
This is only true if the only result you care about is what's left in the accumulator. The result of the other flags after adding or subtracting are dependent on that carry, and are essential for multi-byte/word operations. You need ADC/SBC for that.
You need the carry bit, but that doesn't mean you need an ADC/SBC for that.
Drew Sebastino
Formerly Espozo
Posts: 3496
Joined: Mon Sep 15, 2014 4:35 pm
Location: Richmond, Virginia

Re: Dealing with only X, Y, and Direct Page.

Post by Drew Sebastino »

93143 wrote:
Espozo wrote:Having under 256 different instructions must have been a bitch when designing this thing.
Try coding for the Super FX. 16 registers, 8-bit instruction size. And lots of instructions need source, operand and destination registers. Something as simple as XOR requires a prefix instruction just to specify the operation because they ran out of opcodes.
Sounds like a real POS. :lol: Correct me if I'm wrong, but it seems the only advantage the Super FX has over the SA-1 is converting packed pixel to the SNES graphics format in hardware.
93143 wrote:But yeah, it's a tad limiting. Just upgrading the 65xx concept to 16-bit without worrying about backward compatibility would have resulted in a massively more powerful processor. I kinda like the idea of a Z register too...
The lack of a Z register wouldn't be too bad if there were a way to quickly switch values in and out of X and Y. For example, if there was an instruction for swapping the value of X or Y with an area of memory, (if that's even possible) that would be fine, but as it stands, it's too slow for what you're doing. I don't know how much better they could have made it, but I will give it to the 65816 for stomping other processors from the period at the same clock speed despite having only an 8 bit data bus. Of course, most all of them ran faster than 3MHz. :? I really don't understand, how did they get the SA-1 to run at 10MHz? Was it built using a smaller manufacturing process? (It doesn't have a fan, or even a heat sink.) The 5A22 in the SNES can't be that underclocked.
93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: Dealing with only X, Y, and Direct Page.

Post by 93143 »

Espozo wrote:Correct me if I'm wrong, but it seems the only advantage the Super FX has over the SA-1 is converting packed pixel to the SNES graphics format in hardware.
EDIT: I misread that as a comparison with the S-CPU. The SA-1 is indeed much less weak.

The Super FX still had a number of advantages. For one, the clock speed was higher, at least for the later revisions. For another, it had a faster multiplier - it could do 8x8 in two master cycles, or 16x16 in eight master cycles, or nine if you wanted the full 32 bits of output. (On the other hand, it had no real division functionality as far as I can tell.) Also, the RISC idea wasn't entirely bogus - lots of stuff could be done in a single cycle, and lots more could be done in two, versus an average of about 4 for the 65816. And of course it had 16 general-purpose registers (including the program counter) and you could operate between any of them with no I/O penalty.

Instructions often consisted of a 4-bit opcode and a 4-bit operand register number, with source and destination set by FROM, TO, or WITH. If the source and destination registers were unset, they defaulted to R0, and most instructions reset them; this meant you could get better speed by using R0 as an accumulator of sorts.
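The nibble split described here can be sketched in Python as follows. The function name is hypothetical, and the sketch deliberately assigns no meaning to any particular opcode value, since the post does not list them.

```python
def split_instruction_byte(byte: int) -> tuple[int, int]:
    """Split one instruction byte into (4-bit opcode, 4-bit register number)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("instruction byte must fit in 8 bits")
    return byte >> 4, byte & 0x0F


opcode, register = split_instruction_byte(0x5A)
print(opcode, register)  # upper nibble selects the operation, lower nibble the register
```

With only 16 possible values in the upper nibble, it is easy to see why less common operations needed a prefix byte.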

The PLOT functionality was more than just a packed-to-planar converter (which the SA-1 had in the form of a couple of special DMA modes). It had features like automatic checkerboard dither and palettized 8bpp drawing with 4-bit input. It used two of the registers as screen coordinates, removing the need to calculate addresses, and it auto-incremented the x coordinate. All in one cycle, so you could continue with the rest of the algorithm while the pixel caching system did its work.

The Super FX actually had hardware texture mapping capability, kinda. The MERGE opcode takes the top bytes of R7 and R8 and concatenates them into the destination register. If R7 and R8 contain 8.8 fixed-point texture coordinates, and you prefix MERGE with TO R14, you can then read a texel from ROM.
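A small Python model of the MERGE step, assuming R7's top byte lands in the high byte of the result and R8's top byte in the low byte (the post only says the two bytes are concatenated, so the order is an assumption here). The function name and the sample coordinates are illustrative.

```python
def merge(r7: int, r8: int) -> int:
    """Concatenate the top bytes of two 16-bit registers into one 16-bit value.

    Assumption: R7 supplies the high byte and R8 the low byte.
    """
    return (r7 & 0xFF00) | ((r8 >> 8) & 0x00FF)


# 8.8 fixed-point texture coordinates: integer part in the top byte,
# fraction in the bottom byte.
r7 = 0x1280   # 0x12 + 0.5
r8 = 0x34C0   # 0x34 + 0.75
texel_offset = merge(r7, r8)
print(f"${texel_offset:04X}")
```

The fractional bits are simply dropped, so the two integer coordinates form a 16-bit offset into a 256x256 texture without any shifting or masking in software.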

It may be interesting to note that the default ADD doesn't take carry into account. You need to prefix it with a flag instruction to get ADC. Kinda the opposite of the 65816, where you must prefix ADC with CLC to get ADD...
Espozo wrote:I don't know how much better they could have made it
I'm guessing, but a 16-bit chip would obviously have a much larger potential opcode selection, and of course double the bus width at a given memory speed is double the bandwidth. You could do 16-bit addressing in two words, or 32-bit addressing in three, and reading or writing a word would be one cycle. If you were willing to constrain the opcode space a bit, you could do 8-bit addressing in one word, or 24-bit in two, but I'm not sure the internal architecture would be up to the former... Basically everything would be either twice as big or twice as fast, and opcode count would no longer be a significant constraint. You'd get no bonus for 8-bit data, but that was always a bit of a booby prize anyway...

...not to mention that if you eliminated the phi1/phi2 nonsense like Hudson did, you'd double performance for free (assuming the process was up to it)...

As long as we're making wish lists, how about some reasonably quick multiply and divide instructions? The 5A22 has an external multiplier and divider, though they aren't very good, and the SPC700 has them as actual instructions - it uses Y for the upper byte of 16-bit values. A hypothetical 6516 with 16x16 and 32/16 could do something similar, and with a Z register there'd still be two index registers free.

Are we getting into the 68000 price range here?
Espozo wrote:I really don't understand, how did they get the SA-1 to run at 10MHz?
I think the speed of the core wasn't the issue with the S-CPU. It was the memory speed. (It also came out five years earlier. Five years was a long time back then. Remember, the SA-1 only came out a year before the N64...)

With the SA-1, I believe they used 16-bit ROM with a memory controller to split the words for the CPU core. This meant you'd get wait states if you accessed data or had to branch; only linear program counter reads and DMA went at full speed. Somebody did the math, and it seems that if the SA-1 used single-master-clock half-cycles, ordinary FastROM would be enough for 10.74 MHz with this setup.

But they also used 2 KB of fast RAM for the I-RAM cache, and you could run at the full 10.74 MHz in that. BW-RAM was bigger but slower; it would take the chip down to 5.37 MHz. And of course if the S-CPU accessed a particular memory at the same time you'd get extra wait states on the SA-1 side...

The later models of Super FX could run at 21.4 MHz. Unfortunately, this was only possible inside the 512-byte instruction cache. There was no data cache, just the internal registers. A cache miss or any sort of data access was five master clocks per byte, or 4.3 MHz. The dual busing and buffering saved it somewhat; if you were careful, you could generally keep working while data access was happening - unless you were reading from RAM, since there was no preload functionality for RAM (I suspect it would have been complicated) and out-of-order execution wasn't really a thing back then. Try to avoid reading from RAM a lot when using the Super FX.
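The clock figures quoted in the last few paragraphs all follow from dividing the master clock. A quick Python check, assuming the NTSC master clock of about 21.477 MHz (the post rounds it to 21.4); the function name is made up for this sketch.

```python
MASTER_CLOCK_HZ = 21_477_272  # NTSC SNES master clock, approximate (assumption)


def effective_mhz(master_clocks_per_access: int) -> float:
    """Access rate in MHz when each access takes the given number of master clocks."""
    return MASTER_CLOCK_HZ / master_clocks_per_access / 1_000_000


sa1_full_speed = effective_mhz(2)     # SA-1 running from I-RAM
sa1_bwram_speed = effective_mhz(4)    # SA-1 slowed down by BW-RAM
superfx_uncached = effective_mhz(5)   # Super FX cache miss or data access

print(round(sa1_full_speed, 2), round(sa1_bwram_speed, 2), round(superfx_uncached, 1))
```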

...it kinda burns me that nobody thought to use 120ns ROM with the Super FX. That would be 3 master clocks per byte, and would dramatically speed up any sort of memory access including PLOT. I wonder if you could simply overclock it to 43 MHz and leave it in Slow mode - that gets you 3-cycle memory accesses, but your period authenticity is shot...
Last edited by 93143 on Thu Oct 26, 2017 3:28 pm, edited 1 time in total.
creaothceann
Posts: 611
Joined: Mon Jan 23, 2006 7:47 am
Location: Germany

Re: Dealing with only X, Y, and Direct Page.

Post by creaothceann »

93143 wrote:...not to mention that if you eliminated the phi1/phi2 nonsense like Hudson did, you'd double performance for free
What did they do?
93143 wrote:Five years was a long time back then.
Still is. ;)
My current setup:
Super Famicom ("2/1/3" SNS-CPU-GPM-02) → SCART → OSSC → StarTech USB3HDCAP → AmaRecTV 3.10