DMA operation in APU

Posted: Sat Feb 27, 2010 3:10 pm
by jwdonal
Still working on the APU - things are coming along quite well actually. I need to go pick up a 3.5mm-to-3.5mm cable from RadioShack or somewhere today so I can post some audio samples that you can listen to. The sound isn't quite right yet, but hey, I'm just happy to hear something!

Anyway, I am about to start implementing the DMC and was reading up on the Wiki documentation. It all seemed to make sense to me but something caught my attention.

From Wiki (
The 6502 cannot be pulled off of the bus normally. The 2A03 DMC gets around this by pulling RDY low internally. This causes the CPU to pause during the next read cycle, until RDY goes high again. The DMC unit holds RDY low for 4 cycles. The first three cycles it idles, as the CPU could have just started an interrupt cycle, and thus be writing for 3 consecutive cycles (and thus ignoring RDY). On the fourth cycle, the DMC unit drives the next sample address onto the address lines, and reads that byte from memory. It then drives RDY high again, and the CPU picks up where it left off.

This matters, because it can interfere with the expected operation of the controller registers, reads of the PPU status register, and CPU VRAM or SPR reads if they happen to occur in the same cycle that the DMC unit pulls RDY low.

I have two questions regarding this:

1) If a sprite DMA transfer is already in progress (and therefore already in control of the bus and already deasserting the RDY signal on the CPU), does a DMC DMA operation override (interrupt) the sprite DMA process or does the DMC wait for the entire sprite RAM transfer to finish before taking control of the bus?

2) About the mention of the DMC waiting 4 CPU cycles before taking control of the bus...Does the sprite DMA transfer module do the same thing (i.e. wait 4 cycles to ensure that the CPU has finished with its last operation)? I ask because...umm...I currently don't wait for the CPU to finish its current operation before starting the sprite DMA xfer. I just pull RDY low, take control of the bus, and start the transfer. LOL, I'm thinking now that could be a bad thing. :-P

Thanks!! :)

Posted: Sat Feb 27, 2010 4:26 pm
by tepples
2) DMC DMA has to wait for up to three writes in a row before pulling the 6502 off the bus, which might happen on a BRK instruction, IRQ, or NMI. Sprite DMA doesn't have to wait as long because it always occurs immediately after STA/STX/STY $4014, instructions that produce only one write. (Games are expected not to use read-write-write instructions like INC when accessing $4014.)
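
To make the "up to three writes in a row" point concrete, here is a minimal sketch that counts the longest run of consecutive write cycles in a few instruction patterns. The per-cycle read/write patterns are the standard 6502 breakdowns written out by hand for illustration; the names `PATTERNS` and `longest_write_run` are hypothetical and not from any emulator.

```python
# Bus access per CPU cycle: "R" = read, "W" = write.
# Patterns are hand-written standard 6502 cycle breakdowns (illustrative only).
PATTERNS = {
    "STA abs": "RRRW",         # opcode, addr lo, addr hi, write
    "INC abs": "RRRRWW",       # opcode, addr lo, addr hi, read, write old, write new
    "BRK/IRQ/NMI": "RRWWWRR",  # two fetches, push PCH, PCL, P, read vector lo/hi
}


def longest_write_run(pattern):
    """Return the longest run of consecutive write cycles in a pattern."""
    best = run = 0
    for access in pattern:
        run = run + 1 if access == "W" else 0
        best = max(best, run)
    return best


for name, pattern in PATTERNS.items():
    print(name, longest_write_run(pattern))
```

Since RDY is only honored on read cycles, the longest write run is how long a halt request can go unnoticed: one cycle for a plain store, three for an interrupt sequence.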


Posted: Sat Feb 27, 2010 8:02 pm
by jwdonal
Awesome, thanks tepples! I think my current implementation is OK then.

Anybody for #1?


Posted: Sun Feb 28, 2010 2:36 am
by Bregalad
Does this mean the CPU could crash if an IRQ/NMI were to interrupt a sta/stx/sty $4014 instruction?

Posted: Sun Feb 28, 2010 5:03 am
by tepples
IRQ and NMI have the same read pattern as the BRK instruction, which means they start with two reads to fetch (and discard) the opcode.

Posted: Mon Mar 01, 2010 4:09 am
by jwdonal
Nobody knows the answer to my first question? There must be somebody...


Posted: Mon Mar 01, 2010 4:30 am
by tepples
I understand your frustration, but I don't own a logic analyzer with which to watch the NES address bus while it executes a test program.

Posted: Mon Mar 01, 2010 7:26 am
by blargg
No need for a logic analyzer, since it should be CPU-observable behavior. You just need to have an IRQ or NMI interrupt said instruction at various positions and see what happens.

Posted: Tue Jun 08, 2010 11:22 pm
by blargg
Based on the following tests, DMC DMA adds 4 cycles normally, 3 if it lands on a CPU write, 2 if it lands on the $4014 write or during OAM DMA, 1 if on the next-to-next-to-last DMA cycle, 3 if on the last DMA cycle. The test ROMs here print the below outputs, and verify that they match what's expected.

They also verify that the bytes copied to OAM match what is expected, so DMC DMA isn't corrupting the data. Further, the DMC sample playing during the test is all $55 bytes, so if the DMC DMA read were corrupted, it'd be audible. I recorded output and don't see any corruption.

This test has DMC DMA occur at each cycle in a piece of code, and prints how many cycles the code took, including any extra cycles the DMA added. For example, this code generates the output after it:

sta $100    ; 4
lda $100    ; 4
sta $100    ; 4
sta $100    ; 4

T+ Clocks (decimal)
00 20
01 20
02 20
03 19
04 20
05 20
06 20
07 20
08 20
09 20
0A 20
0B 19
0C 20
0D 20
0E 20
0F 19

The code should take 16 cycles, but DMA adds four. However, when it lands on the write cycle of one of the three STA instructions, it adds only three. You can clearly see the pattern of the STA-LDA-STA-STA in the result, confirming that it's really measuring something useful.
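
The table above can be reproduced with a small model. This is only a sketch under two assumptions: that the T+ column is the zero-based CPU cycle the DMA lands on, and that each absolute STA writes on its fourth cycle while LDA only reads. The names `CODE`, `READ_STALL`, `WRITE_STALL` and `predicted_totals` are made up for the example.

```python
READ_STALL = 4   # cycles added when DMC DMA lands on a CPU read cycle
WRITE_STALL = 3  # cycles added when it lands on a CPU write cycle

# (instruction, per-cycle bus access: "R" = read, "W" = write)
CODE = [
    ("sta $100", "RRRW"),
    ("lda $100", "RRRR"),
    ("sta $100", "RRRW"),
    ("sta $100", "RRRW"),
]


def predicted_totals(code):
    """Total clocks for the sequence, for a DMA landing on each cycle in turn."""
    cycles = "".join(pattern for _, pattern in code)
    base = len(cycles)  # 16 cycles without any DMA
    return [base + (WRITE_STALL if access == "W" else READ_STALL)
            for access in cycles]


totals = predicted_totals(CODE)
for t, clocks in enumerate(totals):
    print(f"{t:02X} {clocks}")
```

Under those assumptions the model gives 19 at T+03, 0B and 0F and 20 everywhere else, which is the pattern in the measured output.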

Now, the code that tests sprite DMA:

lda #$07    ; 2
sta $4014   ; 4 + 513/514
sta $100    ; 4

T+ Clocks (decimal)
00 527      +4  LDA #$07    ; 2
01 528      +4
02 527      +4  STA $4014   ; 4 + 513/514
03 528      +4
04 527      +4
05 526      +2
06 525      +2
07 526      +2
08 525      +2
09 526      +2
0A 525      +2
0B 526      +2
0C 525      +2
0D 526      +2
0E 525      +2
0F 526      +2
200 525     +2
201 526     +2
202 525     +2
203 526     +2
204 524     +1  DMA next-to-next-to-last cycle
205 525     +1  DMA next-to-next-to-last cycle
206 526     +3  DMA last cycle
207 527     +3  DMA last cycle
208 527     +4  STA $100 second cycle
209 528     +4  STA $100 second cycle
20A 526     +3  STA $100 fourth cycle
20B 527     +3  STA $100 fourth cycle

I've manually listed the number of DMA cycles added (clocks minus 523 or 524), and what instruction is executing. The main snag is that sprite DMA takes 513 OR 514 cycles, depending on whether it's started on an even or odd 2A03 cycle. I'm assuming this is very similar to $4017 writes being delayed a cycle if on an odd 2A03 cycle.
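
As a worked check of that subtraction: the baseline is 2 + 4 + 513 + 4 = 523 clocks, or 524 when sprite DMA takes 514. The sketch below assumes, purely from reading the table, that even T+ rows use the 523 baseline and odd rows use 524; `added_cycles` and `ROWS` are hypothetical names, and only a few rows are copied in.

```python
BASE_EVEN = 2 + 4 + 513 + 4  # lda #imm + sta $4014 + OAM DMA + sta $100 = 523
BASE_ODD = BASE_EVEN + 1     # OAM DMA takes 514 instead of 513


def added_cycles(t, clocks):
    """DMC DMA overhead for a table row, assuming row parity picks the baseline."""
    baseline = BASE_ODD if t & 1 else BASE_EVEN
    return clocks - baseline


# (T+, measured clocks) for a selection of rows from the table above
ROWS = [(0x00, 527), (0x01, 528), (0x05, 526), (0x06, 525),
        (0x204, 524), (0x205, 525), (0x206, 526), (0x207, 527),
        (0x208, 527), (0x209, 528), (0x20A, 526), (0x20B, 527)]

for t, clocks in ROWS:
    print(f"{t:03X} {clocks} +{added_cycles(t, clocks)}")
```

With that parity assumption, the computed overheads agree with the hand-annotated +4, +2, +1 and +3 values for these rows.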

The way this test works, the test code begins on even/odd 2A03 cycles based on the time it has arranged the DMC DMA to occur. This complicates things. At the end of OAM DMA and after, it means that DMC DMA is only hitting every other cycle of the test code. You can see this in the STA $100 after OAM DMA, where DMC DMA takes three cycles for two different times. This is because both times it's landing on the fourth cycle of STA $100 (I tried other instruction sequences to be sure of this, and it checks out).

Maybe someone with a logic analyzer can see what's really going on. The above is about as much as you're going to get with a CPU test alone. :)

Posted: Wed Jun 09, 2010 1:01 am
by ReaperSMS
Fascinating stuff.

This suggests a shared DMA unit, running on a 2-cycle period, writes happening in the first period, reads happening in the last. There's probably a start flag that suppresses the address bus switchover for the first 1.5 cycles.

Random brainfart guess as to what the sequence and priorities look like:

(all sequences happen in parallel for each cycle, all inputs are read before updating outputs)

bus_request is asserted by 4014 writes and DMC intermediate buffer empty & DMC active
4014 writes set spr_page to data, spr_flag to 0, and spr_byte to 0

  if (bus_request) bus_grant <= 1;
  if (bus_grant)
    if (spr_flag)
       addr <= {spr_page, spr_byte}
       data <= spr_data
       spr_byte <= spr_byte + 1
  bus_request <= (4014 written | spr_flag) | (~bus_grant & DMC enabled & buffer empty)
  if (bus_grant)
    if (DMC)
      addr <= DMC_addr
      DMC_data <= data
    else if (SPR)
      addr <= {spr_page, spr_byte}
      spr_data <= data
      spr_flag <= 1

The above is likely completely incorrect when interrupts factor in, and the cycles may be flipped. The important bit would be that there's a bus request DMA cycle, followed by the working DMA cycle(s). DMC takes priority over the read, and the SDMA can tell when its data isn't valid. Request cycles can be flagged as late as the end of C0.

4014 writes that land on C0 will assert bus_request for C1, and bus_grant will kick in at the next C0. The CPU will be stalled out by RDY in C1 if it's a read, or continue on its merry way if the 4014 access was RMW (it looks at RDY late in the cycle).

Once granted, DMC will read its data. If this is during a SDMA, the grant will already be flagged, so the request is hidden, and there's only a 2-cycle overhead due to stalling the SDMA one DMA cycle. If the DMC pops up near the last-ish SDMA cycle, specifically when SDMA doesn't have to read any more, it will only delay things one cycle, through extending RDY past the end of SDMA.

Posted: Wed Jun 09, 2010 9:28 am
by jwdonal
Holy cow Blargg! Thank you so much! What have I done to deserve such generosity??? I need to look over your post in much greater detail to really understand everything you're saying - I'm heading into work at the moment. I actually just got a nice Tektronix logic analyzer so maybe with your test software we can finally put an end to this mystery once and for all. First, I need to understand the clock cycling as well as you apparently do. Then I could even provide the analyzer traces here on NesDev (or my site).

In fact, now that I think about it, I'm wondering if there are any other long standing questions that could only be answered by a logic analyzer that I might be able to answer while I have the whole thing hooked up? Does anyone know of any? Maybe this would be an opportunity for me to give something back to the NesDev community when they have helped me so much...

Thanks again! :-)


Posted: Wed Jun 09, 2010 6:33 pm
by cpow
blargg wrote: Based on the following tests, DMC DMA adds 4 cycles normally, 3 if it lands on a CPU write, 2 if it lands on the $4014 write or during OAM DMA, 1 if on the next-to-next-to-last DMA cycle, 3 if on the last DMA cycle. The test ROMs here print the below outputs, and verify that they match what's expected.
Okay this one leaves me with a big WTF...


First of all my output doesn't match anything in Blargg's description.

Blargg can you provide the source? Is the description you provided in this thread what should be seen on the screen?

Posted: Wed Jun 09, 2010 6:58 pm
by blargg
It relies on DMC timing and operation being correct. I'll have to finish updating my APU tests and releasing them. This DMC DMA during sprite DMA is one of the hardest-core tests I've written, depending on many other things being perfect.

Posted: Wed Jun 09, 2010 7:09 pm
by cpow
blargg wrote: It relies on DMC timing and operation being correct. I'll have to finish updating my APU tests and releasing them. This DMC DMA during sprite DMA is one of the hardest-core tests I've written, depending on many other things being perfect.
Ok so the fact that I don't [yet] have a completely cycle-accurate APU implementation is what is causing those ridiculously large cycle counts?

After I found your APU test suite a couple days ago I rearchitected my APU from a PPU-frame based one to one that is at least running in its own 'frame' field [independent of CPU or PPU 'frames']. However, I still do one 240Hz frame of the APU at once (i.e., generate 735/4 samples of sound based on the current state running forward in time).

I think I'll go see if I can get it down to the cycle...

Posted: Wed Jun 09, 2010 7:26 pm
by ReaperSMS
If your APU isn't running on a per cycle basis, how could it possibly be doing the DMA cycle stealing correctly?