I didn't even know that the SNES port started out as an unofficial one.
I have read a little about the SuperFX, and its architecture is a very interesting one (at least for me).
It has 16 general registers, R0..R15 (there are also a lot of control registers, but they're not important here), where R15 is the instruction pointer, R11 is the link register, and some others have specialized roles.
R0 is the accumulator by default. But in fact there are two internal registers, Sreg and Dreg, which hold the numbers of the source register and the destination register.
After reset, and at the end of every instruction that uses the source/destination registers, they are zeroed, that is, they point to R0 (the accumulator).
But there are three instructions to change them:
Code:
FROM Rn ; opcode 0xBn - sets Sreg to n
TO Rn   ; opcode 0x1n - sets Dreg to n
WITH Rn ; opcode 0x2n - sets Sreg and Dreg to n
Code:
ADD Rn ; opcode 0x5n
That is, ADD Rn adds Rn to R0 by default.
But the following sequence of instructions computes R4 = R3 + R5 instead:
Code:
FROM R3
TO R4
ADD R5
Well, this is interesting.
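To make the prefix behaviour concrete, here is a toy model in Python. It is only a sketch of what is described above: the class name and structure are made up for illustration, it knows just the four opcodes mentioned (FROM, TO, WITH, ADD), and it ignores flags, the other prefixes and the pipeline.

```python
# Toy model of the Sreg/Dreg prefix mechanism described above.
# Hypothetical sketch: only FROM, TO, WITH and ADD are modelled.

class GSU:
    def __init__(self):
        self.r = [0] * 16   # R0..R15
        self.sreg = 0       # number of the source register (0 = R0)
        self.dreg = 0       # number of the destination register (0 = R0)

    def step(self, opcode):
        hi, n = opcode >> 4, opcode & 0x0F
        if hi == 0xB:       # FROM Rn: select the source register
            self.sreg = n
        elif hi == 0x1:     # TO Rn: select the destination register
            self.dreg = n
        elif hi == 0x2:     # WITH Rn: select both at once
            self.sreg = self.dreg = n
        elif hi == 0x5:     # ADD Rn: Rdest = Rsrc + Rn, 16-bit wraparound
            self.r[self.dreg] = (self.r[self.sreg] + self.r[n]) & 0xFFFF
            self.sreg = self.dreg = 0   # prefixes last for one instruction only
        else:
            raise ValueError(f"opcode {opcode:#04x} is not modelled")


gsu = GSU()
gsu.r[0], gsu.r[3], gsu.r[5] = 100, 7, 5

gsu.step(0x55)                  # ADD R5 alone: R0 = R0 + R5
for op in (0xB3, 0x14, 0x55):   # FROM R3; TO R4; ADD R5: R4 = R3 + R5
    gsu.step(op)

print(gsu.r[0], gsu.r[4])       # 105 12
```

The same one-byte ADD opcode ends up writing a different register depending on which prefix bytes came before it, and the selection falls back to R0 afterwards.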
Moreover, there is no dedicated "MOVE Rn, Rm" opcode. Instead, the prefix instruction "TO Rn" works as "MOVE Rm, Rn" if it is preceded by a "WITH Rm" instruction. That is, "MOVE" is a two-byte instruction that reuses the one-byte WITH/TO prefixes.
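A rough sketch of how that double duty could be modelled, in Python. The `b_flag` variable (a "WITH came just before" marker), the function name and the reset behaviour are my assumptions for illustration, not something taken from documentation; the copy direction follows my reading that the TO operand receives the value.

```python
def run(opcodes, regs):
    """Run WITH/TO bytes against a 16-entry register list (hypothetical model).

    Returns the final (sreg, dreg) selection.
    """
    sreg = dreg = 0
    b_flag = False                  # assumed: set by WITH, consumed by the next opcode
    for op in opcodes:
        hi, n = op >> 4, op & 0x0F
        if hi == 0x2:               # WITH Rn
            sreg = dreg = n
            b_flag = True
        elif hi == 0x1:             # TO Rn
            if b_flag:              # WITH Rm; TO Rn  ->  Rn = Rm (acts as a move)
                regs[n] = regs[sreg]
                sreg = dreg = 0
                b_flag = False
            else:                   # plain prefix: just select the destination
                dreg = n
        else:
            raise ValueError(f"opcode {op:#04x} is not modelled")
    return sreg, dreg


regs = [0] * 16
regs[3] = 0x1234
run([0x23, 0x14], regs)             # WITH R3; TO R4
print(hex(regs[4]))                 # 0x1234
```

Without the preceding WITH, the same 0x14 byte only changes the destination selection and touches no register contents.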
It's an interesting instruction set, and I think it's not RISC at all. It's full of prefixes, immediate data, and different instruction formats.
However, some things do resemble RISC. For example, there is no CALL instruction. Instead it has four instructions, LINK 1...LINK 4 (opcodes 0x91-0x94), which copy R15+imm to R11 (the link register). Usually this instruction is followed by an immediate word transfer to R15 (PC), "IWT R15, #proc_addr16", and R11 is later used as the return address. The different LINK offsets are there for the different ways of loading the R15 instruction pointer.
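Here is the return-address arithmetic as a small Python sketch. The addresses are made up, and two things are assumptions on my part: that R15 already points at the byte after LINK when LINK executes, and that the byte following the 3-byte IWT is a one-byte instruction that still runs before the jump takes effect.

```python
IWT_SIZE = 3    # IWT Rn, #imm16: one opcode byte plus a 16-bit immediate


def link(r15, n):
    """LINK n as described above: returns the new R11 = R15 + n (n in 1..4)."""
    if not 1 <= n <= 4:
        raise ValueError("LINK only exists for offsets 1..4")
    return (r15 + n) & 0xFFFF


# Hypothetical layout of a call sequence:
#   0x8000  LINK 4
#   0x8001  IWT R15, #proc_addr16   (3 bytes: 0x8001..0x8003)
#   0x8004  one-byte instruction (assumed to execute before the jump lands)
#   0x8005  <- execution should resume here after the subroutine returns
link_addr = 0x8000
r15_at_link = link_addr + 1          # assumption: R15 is already past LINK
r11 = link(r15_at_link, 4)

print(hex(r11))                      # 0x8005
```

Under the same assumptions, LINK 3 would return to the byte directly after the IWT, which is one way the different offsets could match different ways of loading R15.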
So there are too many prefixes, extended opcodes, immediate data, specialised registers, and asymmetries between operations for me to honestly claim that the SuperFX architecture is RISC.
(There's also the fact that it was somewhat specialized for drawing graphics. The merge opcode doesn't make much sense until you realize it's ideal for texture mapping...)
I wonder how much more expensive a Super FX with full 16-bit external data busing would have been. Not only would memory access have been faster, but the instruction set would have had room to breathe. I imagine the instruction cache might have had to be a KB instead of 512 bytes... Memory compatibility with the S-CPU wouldn't have been an issue because it's easy to use the bottom address bit as a half-word selector...
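The half-word selector idea, sketched in Python. This is purely illustrative: the function name is made up, and I'm assuming little-endian ordering (low byte at the even address) to match how the S-CPU lays out 16-bit values.

```python
def read_byte(words, addr):
    """Byte-wide read from a 16-bit-wide memory (hypothetical sketch).

    The bottom address bit picks the half of the word; the remaining
    bits pick the word itself.
    """
    word = words[addr >> 1]
    if addr & 1:
        return (word >> 8) & 0xFF    # odd address: high byte
    return word & 0xFF               # even address: low byte


memory = [0x3412, 0x7856]            # two 16-bit words
print([hex(read_byte(memory, a)) for a in range(4)])
# ['0x12', '0x34', '0x56', '0x78']
```

A byte-wide device sees the same contents as a plain byte stream, while a 16-bit device can fetch each word in one access.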
(There's also some getb; color; plot; from R4; color; loop; plot, which by how the color sources are handled looks like it might be dither code (you can't autodither at 8bpp unless I'm greatly mistaken). And some plot; plot; plot; plot; plot; plot; plot; plot, which was amusing but I'm not sure what it could be for. Maybe it's for clearing the map screen, but why does it show up more than once? None of these methods appears to be using a secondary buffer or any way to run multiple columns at a time.)
Looks like Randy didn't try to get fancy with the renderer. Unless I missed something, or the debugger misinterpreted code as data somewhere, most things seem to be drawn directly with plot in individual columns of double pixels. Based on my understanding of the Super FX, I believe it is possible (at least with some CPU-ROM added; this game is pretty squished I'm told) to speed things up significantly.
Take with a grain of salt; I haven't done a thorough code review or anything...
Once the source code drops, it will hopefully be possible to figure out how he did what he did in detail, and from that whether it's possible to materially improve it in certain ways. (It's eye-opening reading the Doom Black Book; apparently just the visplane array alone in the PC version was larger than the RAM available to the Super FX in the SNES version.) Some potential improvements, however, are apparent even now, as the port was somewhat limited by incidental technical and schedule constraints.
We know it's possible to add up to 6 MB of CPU-only ROM in parallel with the Super FX, although apparently Nintendo didn't offer this early enough for Doom to use it. This alone would have noticeably improved Randy's port, by allowing him to remove a bunch of stuff like splash screens, gun and status bar graphics, and audio data from the Super FX's ROM, thus making room for some of the stuff he had to cut (in addition to allowing for higher quality in the moved items - e.g. the title screen in the released game is multipalette 4bpp, and looks somewhat hideous).
I've thought of a way to further compress most of the graphics - walls, objects, maybe even skies. I haven't tested it, but if it works well, it could cut the size of the assets almost in half compared to RLE while being almost as fast. I've also come up with what I think is a good way to compress textures for flats without slowing down texturing much. This could allow more of the original game's assets to fit in the Super FX's ROM. Randy mentioned he wanted to look into improved compression but ran out of time.
Although it needs further development, I'm just about convinced that my intermediate buffering idea constitutes a practical way to improve the rendering speed. (Also, if done fullscreen rather than just for walls, it would enable framebuffer effects like Spectres, although this would eat into the speed advantage.) Depending on the situation, using this method might save as much as a frame or two per viewport update, even while ditching low-detail mode for full resolution at the same time.
Speaking of low-detail mode, with the mosaic trick it's pretty much a straight-up 50% savings on pixels drawn and a 50% savings on DMA time, and with textured flats you wouldn't miss the dither. It might be worth it to include it as an option even if full resolution works well, because low-detail mode might work better. Especially at frame sizes that don't work at all in full resolution due to VRAM limits...
Circle-strafing should be easy to enable. Figuring out what causes Doomguy to get stuck on walls might be more complicated...
My big question right now is whether the S-CPU, running in a large pool of FastROM with tons of lookup tables and provided with a fast SRAM scratch pad, could handle the bulk of the non-rendering part of the engine fast enough to keep pace with an optimized GSU renderer. If so, a substantial performance increase could be on the cards.
If the CPU can handle enough of the pre-render setup, the level data could even be moved out of Super FX ROM to free up more space. And if enough space can be saved in Super FX ROM (and my compression techniques perform as hoped), it might be possible to include all of the PC version's graphical assets.
But Randy Linden still hasn't published the code, and this is sad.
I definitely want to see it and do some analysis (though I don't plan to do any serious work or research).
Moreover, I think the Doom community is strong (I am part of it as a gamer; I play through the top 'Cacowards' picks every year and enjoy it every time), and I think it will become part of this culture/society.
Because it is a special thing in many respects.
Maybe some 'collective letter' to Linden? Just to ask why the sources haven't been published yet? Maybe there are problems with the publisher or something?
This could take a while. Lotta files. Unfortunately I can't just dive into it now, for the same reason my shmup port isn't done yet.
Looks like rlmain.a is the main RAM code on the SNES side.
It does appear that the Super FX is doing the bulk of the engine work, like Randy said in the interview. BSP traversal, enemy target acquisition, all that good stuff. IMO this means that it should be possible to run some of it on the S-CPU, given a large whack of FastROM to play in. How much of it could the S-CPU handle? Not an easy question...
wat
; >>> TRANSFER VRAM COPY OF WEAPON DEF STACK TO EXECUTABLE WRAM LOCATION <<<
Ooh. If this is using blockmap, perhaps it could be sped up via the belated Naylor/Carmack idea of using the BSP...
; * * * * * * * DETERMINE ENDPOINT BLOCKMAP LOCATIONS * * * * * * *
...and now I'm looking at the wall drawing code. It appears to be what I saw in the Mesen-S debugger (more debuggers should allow you to search the disassembly for text strings). But quite apart from the inefficiency of using plot in columns, I can't figure out why it's doing some of this other stuff. EDIT: That's not a slam; I actually can't make sense of the code after one reading and I'm going to have to stare at it for a bit...
The sky drawing inner loop appears to be maximally efficient, at least in low-detail mode. Ten cycles to draw two pixels, including a lag cycle on getc because R14 was set too recently, and the loop doesn't touch R1 (meaning he's drawing in horizontal rows). Unless everybody is wrong about the RAM buffer speed, you can't do better than five cycles per pixel in 21 MHz mode, so this loop is as tight as it needs to be.
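For a sense of scale, here is the back-of-envelope arithmetic in Python. The clock value is the NTSC master clock as I understand it, the 60 Hz frame rate is a round-number assumption, and the result is a ceiling for this one loop in isolation, ignoring setup, DMA and everything else the chip has to do.

```python
MASTER_CLOCK_HZ = 21_477_272    # assumed GSU clock in "21 MHz mode" (NTSC master clock)
FRAME_RATE_HZ = 60              # nominal display rate, rounded (assumption)


def pixels_per_frame(cycles_per_pixel):
    """Upper bound on pixels plotted per display frame at a given cost per pixel."""
    return MASTER_CLOCK_HZ / (cycles_per_pixel * FRAME_RATE_HZ)


loop_cycles, loop_pixels = 10, 2                # the sky loop: ten cycles, two pixels
cycles_per_pixel = loop_cycles / loop_pixels

print(cycles_per_pixel)                         # 5.0
print(round(pixels_per_frame(cycles_per_pixel)))  # 71591
```

So at five cycles per pixel, roughly 71,600 pixels per sixtieth of a second is the most a loop like this could ever plot.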
I bet my simulation is finishing up; I should check on it...
bank00.a contains the initialization (RESET0 vector), which goes to the RESET proc in init.a.
Looks like the GSU handles all interrupts (???).