pipeline clarification

Discuss emulation of the Nintendo Entertainment System and Famicom.


laughy
Posts: 41
Joined: Wed Nov 17, 2004 12:34 pm

pipeline clarification

Post by laughy »

Not as much activity here as before :( So I'll ask some questions :D

Did someone figure out exactly how the pixels are pipelined? I imagine this could affect the output results greatly.

I know Hyde had this problem but there wasn't really a clear answer.

In other words, if sprite 0 is at ppu cycle say 25, will the bit for a sprite 0 hit not happen at this time, but at:

25 + (16 - CURRENT_HORIZONTAL_SCROLL_VALUE)

??

Is this true for ANY PPU register? (E.g. once the PPU hits VBlank, it doesn't set the VBlank flag until 16 - X cycles have passed?)

A 16-pixel delay is no laughing matter (although in reality that's only about one CPU instruction =]). This would also mean that the CPU is always roughly 16 PPU cycles behind the PPU.

Also, I'm using a stack system where writes to registers etc. are pushed on a stack. Does this mean I should add 16 - HORIZONTAL_SCROLL to every write that affects drawing? I.e. the writes to a register take effect immediately, but not on the graphics you'd expect to be displayed.

So the big general question of the day:

:?: I'm at PPU cycle 15. Am I drawing the 15th pixel on the screen :?:

Thanks :)
Disch
Posts: 1848
Joined: Wed Nov 10, 2004 6:47 pm

Re: pipeline clarification

Post by Disch »

All info here is what I gathered from BT's ppu doc.
laughy wrote:In other words, if sprite 0 is at ppu cycle say 25, will the bit for a sprite 0 hit not happen at this time, but at:
25 + (16 - CURRENT_HORIZONTAL_SCROLL_VALUE)
If Sprite 0 hits on the 16th pixel into the screen... then the sprite zero hit flag is set on the 16th cycle of the screen. No further math is needed. 16th pixel = 16th cycle.

Is this true for ANY PPU register? (E.g. once the PPU hits VBlank, it doesn't set the VBlank flag until 16 - X cycles have passed?)
As far as I know... all flags are adjusted immediately without any delay whatsoever. This goes for register writes, too (for example, in a game like Final Fantasy, the game switches between monochrome and color mode mid-scanline... the change shows up immediately without any delay).


The word "delay" is very, very misleading... since there is no real delay. What happens is that tiles are loaded a bit before they're displayed. Tile graphics data is loaded 16 cycles ahead of time and stored (presumably in some sort of internal PPU buffer) until it's actually output to the screen.

The only time I can see this having an effect is if the game swaps CHR banks (or changes CHR in some other manner) mid-scanline. For example, the graphics for the first 2 tiles are already loaded by scanline cycle 0... so if the game swaps CHR on scanline cycle 0, the old data will still be output for 16 pixels (since it's already loaded). The actual pixels are still rendered on the proper cycle without delay, though (pixel 0 on cycle 0).
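A toy Python model of the fetch-ahead described above, as a sketch only: it assumes each tile's pattern data is latched exactly 16 cycles (two tiles) before its first pixel is shown, and that pixel x is output on cycle x. `render_scanline` and the `chr_bank_at` callback are hypothetical names, not taken from any real emulator.

```python
# Toy model of the 16-cycle fetch-ahead. Assumptions: every tile is latched
# exactly two tiles (16 cycles) before it is shown, and pixel x is output on
# cycle x. Negative cycles mean "on the previous scanline".
FETCH_AHEAD = 16  # cycles between loading a tile and displaying it


def render_scanline(num_tiles, chr_bank_at):
    """Return, for every pixel, the CHR bank its pattern data came from.

    chr_bank_at(cycle) gives the bank the game has mapped in at that cycle.
    """
    pixels = []
    for x in range(num_tiles * 8):
        tile = x // 8
        fetch_cycle = tile * 8 - FETCH_AHEAD  # when this tile was loaded
        pixels.append(chr_bank_at(fetch_cycle))  # shown on cycle x, no delay
    return pixels


def swap_at_zero(cycle):
    """The game switches from CHR bank 0 to bank 1 on scanline cycle 0."""
    return 1 if cycle >= 0 else 0


line = render_scanline(32, swap_at_zero)
print(line[:16].count(0), line[16:].count(1))  # 16 240
```

With a CHR swap on cycle 0, the first two tiles still come from the old bank (the 16 pixels of old data), while every pixel is still output on its own cycle.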
:?: I'm at PPU cycle 15. Am I drawing the 15th pixel on the screen :?:
Yes


I tried going with the queue method of emulation a while back. It's easy to log all register writes and draw the full frame at once... but getting the register -reads- is what's hard. Remember that the game can read from $2007 (and other PPU regs) mid-frame / mid-scanline, and it could have an effect on PPU drawing (likewise, PPU drawing could affect what's returned back from the register read).
Hyde
Posts: 101
Joined: Mon Sep 27, 2004 11:51 pm

Re: pipeline clarification

Post by Hyde »

Oh yes, the pipelining effect... Basically it's due to the fetching of nametable / attribute data associated with the first two tiles of a given scanline, which takes place on the scanline prior to the one in question. So, at the beginning of a scanline, the first tile (fetched on the previous scanline) is rendered (i.e. displayed) while nametable / attribute / pattern data for the *third* tile is fetched. I lost a lot of time trying to figure this thing out by myself, but finally got it. Forget the formula given by Brad Taylor on one of his docs (I'd like to see it work for some tricky games). The way you figure out the exact *CPU* cycle at which a collision takes place is (PPU Pixel #) / 3. So, if you had some stupid solid background picture with a solid retarded sprite #0 pixel directly over said background at x = 240, bit 6 of $2002 ought to be set at CPUCC = 240 / 3.
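The (PPU pixel #) / 3 rule written out as a small Python sketch. It assumes NTSC timing (3 PPU cycles per CPU cycle), counts CPU cycles from the start of the scanline, and assumes the scanline begins on a CPU cycle boundary; since 341 is not a multiple of 3, the real alignment drifts from line to line, so treat the result as approximate to within a cycle. The function names are hypothetical.

```python
# Assumptions: NTSC (3 PPU cycles per CPU cycle), the scanline starts on a
# CPU cycle boundary, and a hit on pixel x is flagged on PPU cycle x.
PPU_CYCLES_PER_CPU_CYCLE = 3


def sprite0_hit_cpu_cycle(pixel_x):
    """CPU cycle, counted from the start of the scanline, during which
    bit 6 of $2002 becomes set for a sprite 0 hit at pixel_x."""
    return pixel_x // PPU_CYCLES_PER_CPU_CYCLE


def status_bit6_set(cpu_cycle_in_line, hit_pixel_x):
    """Would a $2002 read on this CPU cycle of the hit scanline see bit 6?"""
    return cpu_cycle_in_line >= sprite0_hit_cpu_cycle(hit_pixel_x)


print(sprite0_hit_cpu_cycle(240))  # 80
print(status_bit6_set(79, 240), status_bit6_set(80, 240))  # False True
```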
laughy
Posts: 41
Joined: Wed Nov 17, 2004 12:34 pm

:)

Post by laughy »

"So, if you had some stupid solid background picture with a solid retarded sprite #0"

classic

Thanks for the clarification guys :)

As for the queue thing, luckily there are only a few things that the PPU needs to actually calculate on a read, such as whether the current PPU clock is in VBlank, whether the current PPU clock is past sprite 0, or what tile the PPU should currently be drawing (2007). No doubt the reads from 2007 will be the most costly and complicated, but luckily games shouldn't be reading 2007 all the time, and if one does, more extreme measures can be taken to make it faster. :)
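One way to answer those reads lazily is to derive them from a free-running PPU clock instead of running the PPU. This Python sketch assumes NTSC timing (341 cycles per line, 262 lines, ignoring the skipped cycle on odd frames), treats lines 241 to 260 as VBlank and line 261 as the pre-render line, and ignores the exact dot on which each flag changes. `in_vblank` reports the VBlank period, not the $2002 flag itself (a $2002 read also clears that flag), and the sprite 0 hit position is assumed to have been precomputed for the frame. All names are hypothetical.

```python
# Assumptions: NTSC, 341 PPU cycles x 262 lines, no odd-frame skipped cycle;
# lines 241-260 are VBlank and line 261 is the pre-render line, on which
# the sprite 0 hit flag is cleared.
CYCLES_PER_LINE = 341
LINES_PER_FRAME = 262
VBLANK_FIRST_LINE = 241
PRERENDER_LINE = 261


def frame_position(ppu_clock):
    """Map a free-running PPU clock to (scanline, cycle) within the frame."""
    frame_clock = ppu_clock % (CYCLES_PER_LINE * LINES_PER_FRAME)
    return divmod(frame_clock, CYCLES_PER_LINE)


def in_vblank(ppu_clock):
    """True while the clock is inside the VBlank period."""
    line, _ = frame_position(ppu_clock)
    return VBLANK_FIRST_LINE <= line < PRERENDER_LINE


def past_sprite0(ppu_clock, hit_line, hit_x):
    """True once the frame has reached the precomputed hit position."""
    line, cycle = frame_position(ppu_clock)
    return line < PRERENDER_LINE and (line, cycle) >= (hit_line, hit_x)


print(in_vblank(241 * 341), in_vblank(261 * 341))  # True False
```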

The real complication is writing to 2007. If a game writes to an address through 2007 and then reads that address back through 2007 in the same frame, that's gonna be a problem. I think a small, easily indexed cache of written addresses will need to be in play, so that on a 2007 read one can check whether a 'newer' value was written earlier in the frame.
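A minimal Python sketch of that cache, paired with the FIFO of writes a renderer would replay at the end of the frame. `PendingVramWrites` is a hypothetical helper; it ignores nametable and palette mirroring as well as the one-byte read buffer that real $2007 reads go through. Because the CPU executes in order, every write in the cache already happened before the current read, so keeping only the newest value per address is enough.

```python
class PendingVramWrites:
    """Hypothetical helper: $2007 writes queued during a frame, plus a
    per-address cache so later $2007 reads in the same frame see them.
    Mirroring and the $2007 read buffer are deliberately ignored."""

    def __init__(self, vram):
        self.vram = vram   # VRAM as of the last rendered frame (16 KiB)
        self.pending = {}  # address -> newest value written this frame
        self.queue = []    # FIFO of (ppu_clock, address, value) for the renderer

    def write(self, ppu_clock, address, value):
        address &= 0x3FFF
        value &= 0xFF
        self.pending[address] = value
        self.queue.append((ppu_clock, address, value))

    def read(self, address):
        address &= 0x3FFF
        # The CPU runs in order, so any pending write happened before this read.
        return self.pending.get(address, self.vram[address])

    def commit_frame(self):
        # Replay the writes oldest first (FIFO), then start a fresh frame.
        for _clock, address, value in self.queue:
            self.vram[address] = value
        self.queue.clear()
        self.pending.clear()


writes = PendingVramWrites(bytearray(0x4000))
writes.write(100, 0x2000, 0x24)
print(hex(writes.read(0x2000)), writes.vram[0x2000])  # 0x24 0
```

The dict gives constant-time lookups on reads, while the list keeps write order (and timestamps) for the renderer.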

However I think 2007 is really the only read I have to worry about. 2004 can be indexed by a temporary port 2003, and the other registers will also just reflect the last write to them (if the write was allowed).
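For the $2003/$2004 pair, a hypothetical Python sketch of the behaviour outside rendering: writes to $2004 store at OAMADDR and then increment it, while reads return the byte at OAMADDR without incrementing. Reads during rendering behave differently and are not modelled here.

```python
class OamPort:
    """Hypothetical sketch of $2003/$2004 outside of rendering."""

    def __init__(self):
        self.oam = bytearray(256)  # sprite attribute memory
        self.addr = 0              # OAMADDR, set through $2003

    def write_2003(self, value):
        self.addr = value & 0xFF

    def write_2004(self, value):
        self.oam[self.addr] = value & 0xFF
        self.addr = (self.addr + 1) & 0xFF  # writes advance OAMADDR

    def read_2004(self):
        return self.oam[self.addr]  # reads leave OAMADDR alone


port = OamPort()
port.write_2003(0x10)
port.write_2004(0x40)
port.write_2003(0x10)
print(port.read_2004(), port.addr)  # 64 16
```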

However feel free to tell me if I'm off my rocker :)
Last edited by laughy on Thu Nov 18, 2004 10:02 am, edited 1 time in total.
baisoku
Posts: 121
Joined: Thu Nov 11, 2004 5:30 am
Location: San Francisco, CA

Re: :)

Post by baisoku »

laughy wrote:As for the stack thing (it's a stack since it's a LIFO thing instead of a FIFO), luckily there are only a few things that the ppu needs to actually calculate on a read, such as if the ppu current clock is in vblank, if the current ppu clock is past sprite 0, or what tile the ppu should currently be drawing (2007).
er, i think you want a fifo to process things in the correct order..

the way to accurately and efficiently emulate the PPU/CPU/APU is simple, yet i haven't seen it mentioned at all on these boards:

1) run the CPU until state-dependent event occurs
2) mark
3) run the affected unit from its last mark until current mark
4) goto 1.

a 'state-dependent event' in the case of the PPU is a) register read, b) register write, c) time to render a frame on host.
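The loop above as a Python sketch. The instruction trace format `(cpu_cycles, touches_ppu)` and the `CatchUpPpu` class are hypothetical, NTSC's 3:1 clock ratio is assumed, and, to keep it short, each PPU access is treated as happening at the end of its instruction rather than on the exact cycle of the bus access.

```python
PPU_PER_CPU = 3  # NTSC assumption


class CatchUpPpu:
    """Hypothetical PPU that is only run when its state is needed."""

    def __init__(self):
        self.clock = 0  # PPU cycle reached so far (the "last mark")
        self.runs = 0   # how many times the PPU was actually run

    def run_until(self, mark):
        if mark > self.clock:
            # Real code would emulate PPU cycles self.clock .. mark - 1 here.
            self.runs += 1
            self.clock = mark


def run_frame(instructions, ppu):
    cpu_clock = 0
    for cpu_cycles, touches_ppu in instructions:
        cpu_clock += cpu_cycles                        # 1) run the CPU
        if touches_ppu:                                # register read/write
            ppu.run_until(cpu_clock * PPU_PER_CPU)     # 2) mark, 3) catch up
    ppu.run_until(cpu_clock * PPU_PER_CPU)             # time to render the frame
    return ppu


ppu = run_frame([(2, False), (4, True), (3, False), (4, True)], CatchUpPpu())
print(ppu.clock, ppu.runs)  # 39 2
```

Here the PPU only runs at the two register accesses; the frame-end catch-up finds nothing left to do.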
...patience...
Disch
Posts: 1848
Joined: Wed Nov 10, 2004 6:47 pm

Re: :)

Post by Disch »

baisoku wrote:yet i haven't seen it mentioned at all on these boards:
I went into detail on how my emu did things (similar to this method) on the old boards.

Relevant linkage:
http://nesdev.com/cgi-bin/wwwthreads/sh ... 5#Post1445

But I see what he's going for. This queue system could be faster if done properly. And yeah I'd think it'd have to be FIFO too.
laughy
Posts: 41
Joined: Wed Nov 17, 2004 12:34 pm

:)

Post by laughy »

Wow I posted this at like 12:30AM, came back at 9AM to fix it cause I realized I was a fool for putting LIFO, and already two replies! hehe. yay for edit.

baisoku, what you said is something I think everyone considers before the queue thing, and the whole point is speed, not simplicity. The overhead of starting the PPU process is much too great to do it on every read or write, especially when there are very few registers involved to keep track of.
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Post by tepples »

What you're encountering is a "data hazard", and computer architecture classes teach two ways of avoiding them: interlocks to catch up with the "architectural" state of the machine, or data forwarding. For something like VRAM sync, I'd suggest using an interlock: run the PPU before a $2007 read if the last $2007 access was a write.
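The interlock as a hypothetical Python sketch: $2007 accesses are tracked, and the PPU is caught up (through a `catch_up` callback standing in for whatever runs the PPU to the current cycle) only when a $2007 read follows a $2007 write. Everything else can keep being queued.

```python
class VramInterlock:
    """Hypothetical sketch: catch the PPU up before a $2007 read only when
    the previous $2007 access was a write (the data hazard)."""

    def __init__(self, catch_up):
        self.catch_up = catch_up  # callable that runs the PPU up to "now"
        self.last_was_write = False

    def on_2007_write(self):
        self.last_was_write = True  # queued as usual; creates a hazard

    def on_2007_read(self):
        if self.last_was_write:
            self.catch_up()  # interlock: settle the architectural state first
        self.last_was_write = False


stalls = []
lock = VramInterlock(lambda: stalls.append("catch up"))
lock.on_2007_read()   # no pending write: no catch-up
lock.on_2007_write()
lock.on_2007_read()   # read after write: catch up once
lock.on_2007_read()
print(len(stalls))  # 1
```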
Zepper
Formerly Fx3
Posts: 3262
Joined: Fri Nov 12, 2004 4:59 pm
Location: Brazil

Re: pipeline clarification

Post by Zepper »

Hyde wrote:(...)I lost a lot of time trying to figure this thing out by myself, but finally got it. Forget the formula given by Brad Taylor on one of his docs
Heh, that was very funny! :) One more hit on BT's docs. This means there's no real pipeline: at pixel X, collision at pixel X. ;)
Hyde wrote:So, if you had some stupid solid background picture with a solid retarded sprite #0 pixel directly over said background at x = 240, bit 6 of $2002 ought to be set at CPUCC = 240 / 3.
And is this bit really clear at VBlank end? [line 261]
Disch
Posts: 1848
Joined: Wed Nov 10, 2004 6:47 pm

Re: pipeline clarification

Post by Disch »

Fx3 wrote:Heh, that was very funny! :) One more hit on BT's docs. This means there's no real pipeline: at pixel X, collision at pixel X. ;)
There is a pipeline. Graphics are loaded ~16 pixels/cycles before they're displayed. But sprite 0 hit takes effect on display, not on load.

It's just the pipeline seems to be confused/misunderstood.
laughy
Posts: 41
Joined: Wed Nov 17, 2004 12:34 pm

:)

Post by laughy »

tepples wrote:What you're encountering is a "data hazard", and computer architecture classes teach two ways of avoiding them: interlocks to catch up with the "architectural" state of the machine, or data forwarding. For something like VRAM sync, I'd suggest using an interlock: run the PPU before a $2007 read if the last $2007 access was a write.
The ideas may be similar, but the environment and problem are quite different from a CPU pipeline! We have a lot more options open to us. I'm going to try the frame queue idea and see how it works out; if I find it too slow to keep in sync, I'll see how the catching-up idea works. Thanks for your ideas, guys!
Disch wrote:There is a pipeline. Graphics are loaded ~16 pixels/cycles before they're displayed. But sprite 0 hit takes effect on display, not on load.
Do you think emulating the pipeline is worth it..? IIII don't!!!! hehe.
Zepper
Formerly Fx3
Posts: 3262
Joined: Fri Nov 12, 2004 4:59 pm
Location: Brazil

Re: :)

Post by Zepper »

laughy wrote:Do you think emulating the pipeline is worth it..? IIII don't!!!! hehe.
I have the same opinion... I might consider it as a 'grain of salt' ;)
Disch
Posts: 1848
Joined: Wed Nov 10, 2004 6:47 pm

Re: :)

Post by Disch »

laughy wrote:Do you think emulating the pipeline is worth it..? IIII don't!!!! hehe.
Well... I find it easier to implement it than not to. I mean... you're going to have to do some sort of "saving" of tiles regardless (since your PPU emulation can be interrupted mid-tile if you're doing pixel-based rendering). So you might as well "save" them at the proper PPU cycle and then render them later, instead of having to deal with knowing which part of the tile is already drawn, possibly having to re-load the same tile multiple times, and yadda yadda.
Hyde
Posts: 101
Joined: Mon Sep 27, 2004 11:51 pm

Re: :)

Post by Hyde »

laughy wrote: Do you think emulating the pipeline is worth it..? IIII don't!!!! hehe.
Depends what you are going for. My emulator has two different cores, one is scanline-based and the other cycle-based. The first does not take care of the pipeline, but the second does. If you are going for speed then I would not recommend emulating it (very few games depend on its effect).
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: :)

Post by tepples »

Hyde wrote:If you are going for speed then I would not recommend emulating [the fetch pipeline] (very few games depend on its effect).
Can your scanline engine detect when a game would need the effect and automatically switch to the cycle engine if necessary? (Caution: Detecting which games need a particular hack turned off by CRCing the ROM may be patented, and in any case, it's dangerous for testing homebrews.) Or can it run the scanline engine for those scanlines that don't contain any PPU writes?
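A hypothetical Python sketch of the second option: pick an engine per scanline from the frame's register-write log. It assumes the log holds `(ppu_clock, register, value)` tuples with the clock counted from the start of line 0, and it conservatively marks the following line too, because that line's first two tiles are fetched at the end of the current one. Register reads, mapper CHR switches and sprite 0 hits are not considered.

```python
CYCLES_PER_LINE = 341
VISIBLE_LINES = 240


def choose_engines(write_log):
    """write_log: (ppu_clock, register, value) tuples for one frame
    (hypothetical format). Returns "cycle" or "scanline" per visible line."""
    needs_cycle_engine = set()
    for ppu_clock, _register, _value in write_log:
        written_line = ppu_clock // CYCLES_PER_LINE
        # Conservatively flag the next line too: its first two tiles are
        # fetched at the end of this one.
        needs_cycle_engine.update((written_line, written_line + 1))
    return ["cycle" if line in needs_cycle_engine else "scanline"
            for line in range(VISIBLE_LINES)]


engines = choose_engines([(5 * 341 + 10, 0x2001, 0x1E)])
print(engines[4:8])  # ['scanline', 'cycle', 'cycle', 'scanline']
```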