Merging in my reply to the post in this thread
viewtopic.php?f=5&t=17169&start=30#p216192
93143 wrote:I'm just saying that there was a surprising amount of overlap given the power difference, which may be attributable to the N64 being hard to program, compounded by the inefficient official microcode holding the system back
I would say that for the vast majority of developers, any inefficiencies in the official microcode would not have been a problem, because they would have been memory bandwidth bound well before they were at the stage of pushing RSP hard. Rather, the issue was that the official microcode was built around the use of a z-buffer, which guzzled precious memory bandwidth (as noted in my previous post, the N64 did not have significantly higher "peak" memory bandwidth than its console competition anyway).
The more technically advanced N64 games, with memory bandwidth under a certain level of control, could certainly have put pressure on RSP. For example, Conker with its multiple light sources, or the Factor 5 games with their relatively complex terrain generation from heightmaps. I believe the consensus among developers was that SGI did not get the balance right in the official microcode between vertex/lighting quality and performance (probably owing to their workstation background), which is why the top-level developers had to tweak the provided microcodes or write their own. SGI did actually provide updated versions of the official microcode throughout the N64's life, which would have made life a fair bit easier. As for the official audio microcode, it seems to have been unnecessarily resource hungry, maybe because SGI did not have the experience of an audio hardware company. Developers like Factor 5, who needed to squeeze more performance out of RSP, wrote their own audio 'driver'.
93143 wrote:Comparing bad N64 games with good PlayStation games can make it look like the PlayStation was more powerful, which is not something you tend to get with (say) the NES vs. the SNES
I don't think I quite understand your point here. It has always been true that the best-programmed games on weaker hardware in a console generation look better than the worst-programmed games on stronger hardware in the same generation. For example, the Xbox was unequivocally stronger than the PS2 (blending aside, where the PS2's eDRAM gave it the edge), but the Xbox game Bruce Lee: Quest of the Dragon looked worse than the older PS2 game The Bouncer.
The comparison between NES and SNES is not legitimate because the technical difference between the two consoles was several times that of PS1 and N64, or PS2 and Xbox. A sufficiently large generational difference in technology and tools will usually rescue even poorly programmed games from looking less graphically advanced than older titles.
93143 wrote:Speaking of RAM, what was the latency like on the RCP side? I've heard figures as high as 600 ns for a CPU cache miss, but I can't imagine that's representative of RCP random access
600ns for a CPU cache miss sounds right. Bear in mind that figure also includes the overhead of the memory controller being built into RCP, and therefore being physically external to the CPU. That means the latency for RCP's own random accesses will definitely be lower, but I'm not sure exactly by how much.
93143 wrote:Also, I believe the N64's RAM was divided into four banks; did this have any relevance to latency?
It is divided into two banks, with color intended to go into one and z into the other (though this is not mandatory), but the banks are not simultaneously accessible. Yes, it's supposed to greatly decrease latency through address pipelining. Because RDRAM does row and column addressing separately, you should theoretically be able to change banks without incurring significant latency even though the access is going outside the current page (which should be useful when switching between the color and z buffers). Unfortunately, like most things related to RDRAM, in practice it didn't work nearly that well, so you still got much of the latency. I'm unsure whether the fault was with the RDRAM design itself or with the memory controller in RCP.
EDIT: Oops. Checked into it. The RDRAM really is divided into four banks.
Texture cache is divided into four simultaneously accessible banks, though. That's how RDP achieves one-cycle bilinear filtering.
93143 wrote:And yet, it seems that certain late games, such as World Driver Championship, managed to match or exceed the PlayStation's advertised capabilities (180,000 textured tris per second) while maintaining the additional features that made N64 polygons so power-intensive...
World Driver Championship didn't use a z-buffer, which made it far less memory bound than most other N64 games. As RSP is both wider and clocked faster than the GTE, and RDP has almost double the pixel fill rate of the PS1's GPU (even with most features turned on), it's unsurprising that the N64 can put out considerably more and better polygons than the PlayStation once memory bandwidth is out of the way. Not having a z-buffer isn't workable in every situation though, which is why SGI would have considered z-buffering the sensible "default option". EDIT: More choice for developers would have been even better...
93143 wrote:I'm sorry; I can't accept that without more detail. What's wrong with the methods I proposed? I could guess, but that's not a path to edification
My apologies, I was simply describing what would have been SGI's intended use case for RCP's additive blending. I have the following points to make about the theoretical workaround described in your earlier post.
1) Yes, the color combiner does clamp its output.
2) Copy mode won't increase the speed of copying the framebuffer into TMEM (that would continue to be memory bus bound anyway). All copy mode does is make pixels go through RDP's pipeline faster by skipping almost all processing. So rather than making framebuffer -> TMEM go more quickly, it would make TMEM -> framebuffer go faster. The purpose is to make 2D tiling really fast.
3) I don't think you can have reliable pixel-level control over primitives already dispatched to RDP (not without insane cycle counting across the whole console, anyway). So you can't change RDP attributes or modes on individual primitive pixels and get expected results every time. Reliable per-primitive changes are possible through the synchronization function though.
But I think what you are describing is definitely possible with the color combiner in two-cycle mode. The color combiner maths is newcolor = (A - B) x C + D. So you could write your own color combiner function like this: newcolor = (1 - 0) x TEXEL0 + TEXEL1, with TEXEL0 being the texture for blending and TEXEL1 being the framebuffer chunk, both loaded into TMEM. You could still use the result of this in a second color combiner operation. The N64's little 'pixel shader' may yet save the day.
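To make the maths concrete, here's a rough per-channel model of that combiner setup in C. The names are mine and the arithmetic is simplified (the real combiner works on 9-bit signed intermediates), but the clamped (A - B) x C + D behaviour is the idea:

```c
#include <stdint.h>

/* Simplified model of one color combiner channel:
   newcolor = (A - B) * C + D.
   Inputs are 0..255; C acts as a 0..1 fraction (scaled by 255).
   The output clamps to the valid 0..255 range, as point 1 notes. */
static uint8_t combine_channel(int a, int b, int c, int d)
{
    int result = ((a - b) * c) / 255 + d;
    if (result < 0)   result = 0;
    if (result > 255) result = 255;
    return (uint8_t)result;
}

/* The additive setup described above: (1 - 0) x TEXEL0 + TEXEL1.
   A = 255 ("one"), B = 0, C = the blend texture, D = the framebuffer
   chunk in TMEM, so the output is simply their clamped sum. */
static uint8_t add_texels(uint8_t texel0, uint8_t texel1)
{
    return combine_channel(255, 0, texel0, texel1);
}
```

The nice thing about routing the addition through the combiner rather than the blender is exactly that clamp: overshoot saturates instead of wrapping.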
I haven't given sufficient thought as to what this would mean for the blender (including anti-aliasing or fog) so I may get back to you on that. But as you can imagine, this would be hell on TMEM. I still think it may all be madness in practice, but maybe it is not.
There's no way to safely pre-clamp the pixels if you want to use the blender, except, as I stated before, to be really conservative and cautious with the hardware additive blending feature as SGI intended. I suppose you could do the additive blending on the CPU if you wanted, as a form of software rendering. Since the N64 has a unified memory architecture, the CPU can read and write straight to the framebuffer. Hardly an efficient use of resources, but it's there.
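For what it's worth, that CPU fallback is trivial to sketch. Assuming a 16-bit RGBA5551 framebuffer (the common N64 color format; the function name is mine), per-pixel saturating addition looks like this:

```c
#include <stdint.h>

/* Software additive blend for one RGBA5551 pixel.
   Each 5-bit channel is added with saturation at 31, which is
   exactly the clamping the hardware blender can't guarantee. */
static uint16_t add_blend_5551(uint16_t dst, uint16_t src)
{
    int r = ((dst >> 11) & 0x1F) + ((src >> 11) & 0x1F);
    int g = ((dst >> 6)  & 0x1F) + ((src >> 6)  & 0x1F);
    int b = ((dst >> 1)  & 0x1F) + ((src >> 1)  & 0x1F);
    if (r > 31) r = 31;
    if (g > 31) g = 31;
    if (b > 31) b = 31;
    /* keep the destination's coverage/alpha bit untouched */
    return (uint16_t)((r << 11) | (g << 6) | (b << 1) | (dst & 1));
}
```

With unified memory the CPU could loop this over a framebuffer region directly, at the cost of cache pollution and tying up the CPU while RDP sits idle.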