All times are UTC - 7 hours



PostPosted: Thu Aug 01, 2019 4:59 pm 
Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1518
Currently, my CPU core is littered with this nasty hack:

Code:
#define L lastCycle();

void WDC65816::instructionJumpIndirect() {
  V.l = fetch();
  V.h = fetch();
  W.l = read(uint16(V.w + 0));
L W.h = read(uint16(V.w + 1));
  PC.w = W.w;
}


The idea of L (lastCycle()) is to test interrupts.

This becomes important for eg CLI:

Code:
void WDC65816::CLI() {
L idleIRQ();
  P.i = 0;
}


Eg: https://wiki.nesdev.com/w/index.php/CPU ... 2C_and_PLP

The understanding I've heard in the past is this is a side-effect of the 65xx's two-stage pipeline. While the CPU is doing work cycle N, it's performing bus cycle N+1. So the last work cycle is really the first bus cycle of the next instruction, and that's where it's testing for interrupts (IRQ/NMI.) The test ends up happening before P.i can be set to zero.

So breaking down cycle timings ... presume V=scanline#, H=scanline clock counter, and each opcode cycle takes 6 clocks.

Code:
  V  H bus              work
  0,12 fetch 0x58 cli
L 0,18 idle             I = 0
  0,24 fetch 0xea nop


So the bus fetches 0x58 (CLI) during clocks 12-17. Then there's an idle bus cycle during clocks 18-23. Then the CPU fetches the next opcode during clocks 24-29 ... but we are testing for interrupts at clock 18, not 24!!
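
A minimal sketch of that ordering, assuming a toy core that has nothing but the I flag and one IRQ line. `ToyCpu` and its members are made-up names for illustration, not bsnes code. The point is only that the poll runs before the final cycle's work, so CLI's own poll still sees I = 1:

```cpp
// Hypothetical toy model (not the bsnes core): one flag, one IRQ line.
struct ToyCpu {
  bool flagI = true;              // true = IRQs masked
  bool irqLine = false;           // level of the IRQ input
  bool interruptPending = false;  // decided by the last-cycle poll

  // The poll that L stands for: uses the I flag as it is right now.
  void lastCycle() { interruptPending = irqLine && !flagI; }

  // CLI: poll first, then the cycle's work clears the flag.
  void cli() { lastCycle(); flagI = false; }

  // NOP: poll only; this is where a CLI-enabled IRQ is finally seen.
  void nop() { lastCycle(); }
};
```

With the IRQ line already asserted, `cli()` leaves `interruptPending` false and the following `nop()` sets it, which is the one-instruction delay described above.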

If we re-imagine the bus/work breakdown, we could say that the opcode fetch was already done from the previous instruction.

Code:
  V  H bus              work
  0,12 idle
L 0,18 fetch 0xea       I = 0
  0,24 idle


In code:

Code:
void WDC65816::CLI() {
L prefetch();  //load the next instruction byte
  P.i = 0;
}


But we aren't eliminating any complexity: now every single instruction has to end with the opcode byte fetch, so that our final work cycle can come after it.

Then there's this weird thing I found while chasing cycle-perfect interrupt timings:

Code:
//immediate, 2-cycle opcodes with idle cycle will become bus read
//when an IRQ is to be triggered immediately after opcode completion.
//this affects the following opcodes:
//  clc, cld, cli, clv, sec, sed, sei,
//  tax, tay, txa, txy, tya, tyx,
//  tcd, tcs, tdc, tsc, tsx, txs,
//  inc, inx, iny, dec, dex, dey,
//  asl, lsr, rol, ror, nop, xce.
auto WDC65816::idleIRQ() -> void {
  if(interruptPending()) {
    //modify I/O cycle to bus read cycle, do not increment PC
    read(PC.d);
  } else {
    idle();
  }
}


I spent weeks testing this extensively to confirm this is exactly what was happening ... it was pretty hellish to exhaustively test every possibility. It matters because a read from a slow ROM region takes 8 clocks instead of the 6 clocks of an idle cycle, so the substitution affects timing.
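
To make the timing consequence concrete, here is a small sketch using only the two clock counts mentioned in this post. The constant and function names are hypothetical, and `regionClocks` stands in for whatever the bus says a read at PC costs:

```cpp
// Clock costs as quoted in this post (names are made up for the sketch).
constexpr int kIdleClocks = 6;     // idle (I/O) cycle
constexpr int kSlowRomClocks = 8;  // read from a slow ROM region

// Clocks spent in the final cycle of a 2-cycle implied opcode.
// If an interrupt is pending, the idle cycle becomes a read at PC,
// so it costs whatever that region's read costs.
constexpr int finalCycleClocks(bool interruptPending, int regionClocks) {
  return interruptPending ? regionClocks : kIdleClocks;
}
```

So `finalCycleClocks(true, kSlowRomClocks)` is 8 while `finalCycleClocks(false, kSlowRomClocks)` is 6, and in a region where reads already cost 6 clocks the difference is invisible.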

If we fetch opcode bytes and then execute their instructions, the interrupt() function looks like this:

Code:
void WDC65816::interrupt() {
  read(PC.d);
  idle();
N push(PC.b);
  push(PC.h);
  push(PC.l);
  push(EF ? P & ~0x10 : P);
  IF = 1;
  DF = 0;
  PC.l = read(r.vector + 0);
  PC.h = read(r.vector + 1);
  PC.b = 0x00;
}


But if we fetch opcodes at the end of each instruction (thus the next opcode has already been fetched before executing instructions), it looks like this:

Code:
void WDC65816::interrupt() {
  idle();
  PC.w--;  //undo the last instruction prefetch increment
N push(PC.b);
  push(PC.h);
  push(PC.l);
  push(EF ? P & ~0x10 : P);
  IF = 1;
  DF = 0;
  PC.l = read(r.vector + 0);
  PC.h = read(r.vector + 1);
  PC.b = 0x00;
  prefetch();  //since PC changed, the old opcode prefetch was invalidated
}


If an interrupt fires after, say, NOP, and each instruction ends with an opcode prefetch, then the last cycle of NOP was an opcode fetch, and from a slow ROM region that takes 8 clocks instead of 6.

But that only explains the exception case, so all we did was invert idleIRQ(). An instruction like XBA (fetch + idle + idle) doesn't have the effect. Which would mean that with opcode fetching coming at the end of the instruction, XBA is (idle + idle + opfetch). But my testing showed there was no opcode fetch if an interrupt fires after XBA, so that would mean we'd have to make the last cycle of XBA call idleInvertedIRQ() that turns an opcode fetch into an idle cycle (along with a PC.w increment) if an interrupt is pending.

So I don't really know which method is better. The only nice thing about the opcode fetch appearing at the end is it becomes a consistent final cycle. Branch conditions are currently pretty annoying otherwise:

Code:
void WDC65816::instructionBranch(bool take) {
//prefetch() implied right before executing this by WDC65816::interrupt() dispatch call
  if(!take) {
L   fetch();
  } else {
    U.l = fetch();
    V.w = PC.d + (int8)U.l;
    idle6(V.w);
L   idle();
    PC.w = V.w;
    idleBranch();
  }
}


But with the opcode fetch at the end:

Code:
void WDC65816::instructionBranch(bool take) {
  if(!take) {
    fetch();
  } else {
    U.l = fetch();
    V.w = PC.d + (int8)U.l;
    idle6(V.w);
    idle();
    PC.w = V.w;
    idleBranch();
  }
L prefetch();
}


And that means we can combine all those annoying 8/16-bit CPU instructions:

Code:
void WDC65816::instructionImmediateRead8(alu8 op) {
L W.l = fetch();
  alu(W.l);
}

void WDC65816::instructionImmediateRead16(alu16 op) {
  W.l = fetch();
L W.h = fetch();
  alu(W.w);
}


To eg:

Code:
void WDC65816::instructionImmediateRead(alu16 op, bool word) {
  W.w = fetch();
  if(word) W.h = fetch();
  alu(W.w, word);
L prefetch();
}


...

Ultimately what I want to do here is get rid of the need for L. My thought for this was, what if we do the lastCycle() test on every single cycle, and use the second-to-last result to determine if we should fire an interrupt after an instruction?
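
A rough sketch of what "use the second-to-last result" could look like, assuming the per-cycle sample has no side effects. `PollHistory` is a hypothetical helper, not something from bsnes:

```cpp
// Hypothetical helper: keeps the two most recent per-cycle poll results.
struct PollHistory {
  bool previous = false;  // sample taken one cycle before the latest
  bool current = false;   // latest sample

  // Call once at the end of every cycle with a side-effect-free sample.
  void cycle(bool sample) {
    previous = current;
    current = sample;
  }

  // At an instruction boundary, decide from the second-to-last sample,
  // i.e. the state as it was before the final cycle did its work.
  bool shouldInterrupt() const { return previous; }
};
```

An interrupt that only shows up in the final cycle's sample is not acted on until the next instruction has finished, which is the behavior L is approximating by hand.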

But with my current lastCycle() test, it's not possible to call it on every cycle because it has side effects.

Code:
bool CPU::nmiTest() {
  if(!status.nmiTransition) return false;
  status.nmiTransition = false;
  r.wai = false;
  return true;
}

bool CPU::irqTest() {
  if(!status.irqTransition && !r.irq) return false;
  status.irqTransition = false;
  r.wai = false;
  return !r.p.i;
}

void CPU::lastCycle() {
  if(!status.irqLock) {
    if(nmiTest()) status.nmiPending = true;
    if(irqTest()) status.irqPending = true;
    status.interruptPending = status.nmiPending || status.irqPending;
  }
}


The (irq,nmi)Transition flags are tested for every clock cycle, and set the instant an interrupt triggers. If we call lastCycle() on every cycle, then we could end up clearing the transition flags early. The same goes for the WAI instruction flag, though that's probably less important since WAI just sits in a spinloop testing the interrupt flag constantly.
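
One way to picture the hazard, as a sketch only: reduce irqTest() to a single flag and compare a consuming test with a read-only one. `peek` and `acknowledge` are hypothetical names, and whether such a split reproduces hardware timing is untested here:

```cpp
// Hypothetical reduction of irqTest(): one transition flag plus the I flag.
struct IrqState {
  bool transition = false;  // set the instant the IRQ triggers
  bool flagI = false;       // true = IRQs masked

  // Consuming test: clears the transition as a side effect, so only the
  // first call after a trigger can ever report it.
  bool testAndClear() {
    if(!transition) return false;
    transition = false;
    return !flagI;
  }

  // Read-only variant: safe to call on every cycle.
  bool peek() const { return transition && !flagI; }

  // Explicit acknowledge, meant to run only when the interrupt is taken.
  void acknowledge() { transition = false; }
};
```

Calling `testAndClear()` twice in a row returns true and then false, so a per-cycle caller would only see the IRQ in whichever cycle happened to consume it. `peek()` keeps returning true until `acknowledge()` is called.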

...

The reason I bring this up is because Sour appears to be doing exactly what I've always wanted to do in testing interrupts every cycle and then using the second-to-last result for triggering interrupts. If he can actually pass my test_nmi and test_irq test ROMs then that is really, really impressive. Because right now, I don't see how that's possible >_<


PostPosted: Thu Aug 01, 2019 6:59 pm 
Joined: Sun Feb 07, 2016 6:16 pm
Posts: 724
byuu wrote:
If he can actually pass my test_nmi and test_irq test ROMs then that is really, really impressive. Because right now, I don't see how that's possible >_<
I feel like I should probably mention that I don't pass those tests still (I was mostly busy fixing more critical stuff up till now :p).

That being said, in my case my code is just checking the state of the IRQ signal (e.g. if it's low or high) and storing that result (while keeping the I flag in consideration). The IRQ signal itself is controlled externally by the HV counters & coprocessors; the CPU core has no knowledge of these. The HV counters' IRQ signal is updated every 4 master clocks, based on the PPU's position, etc. I haven't looked into the test_nmi/irq tests enough to know what causes them to fail, unfortunately, so it's quite possible that my current design won't work for them.


PostPosted: Thu Aug 01, 2019 7:04 pm 
Joined: Sun Sep 19, 2004 9:28 pm
Posts: 4208
Location: A world gone mad
byuu wrote:
...
This becomes important for eg CLI:
...
Eg: https://wiki.nesdev.com/w/index.php/CPU ... 2C_and_PLP

The understanding I've heard in the past is this is a side-effect of the 65xx's two-stage pipeline. While the CPU is doing work cycle N, it's performing bus cycle N+1. So the last work cycle is really the first bus cycle of the next instruction, and that's where it's testing for interrupts (IRQ/NMI.) The test ends up happening before P.i can be set to zero.

Correct, and it's "unofficially documented" across some places (6502.org forum probably has the deep details), and I doubt the 65816 is any different in this regard.

* http://visual6502.org/wiki/index.php?ti ... t_Handling

There's also this fun one (not trying to get off topic here at all):

* https://www.youtube.com/watch?v=fWqBmmPQP40 -- 42:00 to 44:15 -- and why BRK immediately after an IRQ can cause the BRK to be skipped. That may be more specific to http://visual6502.org/wiki/index.php?ti ... _and_B_bit and https://www.pagetable.com/?p=410 but there are the details.

These are pretty extreme (IMO) edge cases, and I'm not sure it's truly worth revamping anything to address them. If a large number of games relied on these quirks, that would be a different matter.

If I'm incorrect in the sense that the above things I've said are not related to what's being discussed/described, call me out on it so that future readers know to ignore what I've said here.


PostPosted: Thu Aug 01, 2019 7:06 pm 
Joined: Thu Aug 20, 2015 3:09 am
Posts: 462
When I was tearing my hair out trying to get my FM chip emulators cycle-accurate, I got a lot of mileage out of basically vectorizing everything. Instead of having a variable for each piece of state, I'd have an array. All the wonky off-by-one timings could just look up a different cycle in the array, edge-triggered things could just read pairs of slots in sequence, and I could blast batches of a few hundred or thousand cycles out together for a nice performance boost.

It obviously won't work as well for a CPU, but I've been meaning to try doing the same for the interrupt handling in my 6502 emulator. Run everything that can assert interrupts ahead, store the results in an array, then run the CPU, looking up whatever cycle is needed (and making sure to refresh the array if something invalidates the prediction).
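
Roughly what I have in mind, as an untested sketch: `IrqTimeline` and its members are hypothetical names, and it assumes something else has already run the interrupt sources ahead and filled in one predicted line level per cycle.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical cache of predicted IRQ line levels, one entry per cycle.
struct IrqTimeline {
  std::size_t start = 0;     // absolute cycle number of levels[0]
  std::vector<bool> levels;  // predicted line level for each cycle

  // True if a prediction exists for this absolute cycle.
  bool covers(std::size_t cycle) const {
    return cycle >= start && cycle - start < levels.size();
  }

  // Level at an absolute cycle; the caller must check covers() first.
  bool at(std::size_t cycle) const { return levels[cycle - start]; }

  // Edge-triggered lookup: compares a pair of adjacent slots.
  bool risingEdge(std::size_t cycle) const {
    return covers(cycle) && cycle > start && at(cycle) && !at(cycle - 1);
  }

  // Drop predictions from `cycle` onward, e.g. after a register write
  // that could change when the interrupt fires.
  void invalidateFrom(std::size_t cycle) {
    if(cycle <= start) levels.clear();
    else if(covers(cycle)) levels.resize(cycle - start);
  }
};
```

The CPU side would then ask for whatever cycle it needs (this one, the previous one, two back) instead of carrying separate delayed flags, and refill the array whenever `covers()` says the prediction is gone.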

Dunno how useful that'd be for the SNES because I know literally nothing about it, but it might give someone some ideas...?


PostPosted: Thu Aug 01, 2019 7:36 pm 
Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1518
I improved things as much as I could.

https://github.com/byuu/bsnes/blob/mast ... pu/irq.cpp

status.interruptPending is just a bit-mask combination of (nmi|irq|resetPending), since the WDC65816 core calls it quite often. Omitting that here.

Code:
auto CPU::lastCycle() -> void {
  if(!status.irqLock) {
    if(nmiTest()) status.nmiPending = 1;
    if(irqTest()) status.irqPending = 1;
  }
}


We don't need to clear the WAI flag at this point; it can be cleared immediately during a transition while polling IRQs or writing to NMITIMEN. I also can't confirm whether irqTest() should poll r.irq (the external IRQ pin) or not, so for now I'm instead clearing WAI whenever the external IRQ pin is driven high (by eg the SuperFX, SA-1, or HG51BS169/Cx4.)

Code:
auto CPU::nmiTest() -> bool {
  if(!status.nmiTransition) return 0;
  status.nmiTransition = 0;
  return 1;
}

auto CPU::irqTest() -> bool {
  if(!status.irqTransition) return 0;
  status.irqTransition = 0;
  return !r.p.i;
}


Well, at least it should be easier now to figure out a solution for this problem.


PostPosted: Thu Aug 01, 2019 7:44 pm 
Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1518
Quote:
I feel like I should probably mention that I don't pass those tests still (I was mostly busy fixing more critical stuff up till now :p).


... oh ^-^;;

Perhaps we can work out a solution at the same time then. Your CPU core design isn't going to work if you have to poll interrupts the way I do ...

Sorry, those test ROMs weren't ever cleaned up for public release and I lost the source to flash drive bit rot ._.
But until someone makes a better test, it's pretty much the best test there is for NMI/IRQ validation :/

Quote:
The IRQ signal itself is controlled externally by the HV counters & coprocessors, the CPU core has no knowledge of these.


Yeah, I duplicate the PPU H/Vcounters into my CPU core so that I can predict when the PPU H/Vblank lines will change. Otherwise I would not be able to run the CPU ahead of the PPU pretty much ever, because I wouldn't know in advance if the PPU were about to change the H/Vblank pins.

Of course a truly low-level emulation (eg Verilog) would want to have the CPU keep its own H/Vcounters that simply increment and wrap based on the PPU H/Vblank pins, which neatly explains why CPU IRQs aren't affected by PPU long dots.

Quote:
and why BRK immediately after an IRQ can cause the BRK to be skipped


I should probably go ahead and support this on the 6502. It may apply to the 65816, and will likely give me some insights. Thanks!

Quote:
Instead of having a variable for each piece of state, I'd have an array. All the wonky off-by-one timings could just look up a different cycle in the array


I have a time-shifter for the H/Vcounters exactly like that, yeah. It's pretty great ^-^.

I tried it with this lastCycle() case, but unfortunately I haven't been able to make it work.


PostPosted: Fri Aug 02, 2019 8:42 am 
Joined: Mon Jan 23, 2006 7:47 am
Posts: 201
Location: Germany
byuu wrote:
Quote:
and why BRK immediately after an IRQ can cause the BRK to be skipped

I should probably go ahead and support this on the 6502. It may apply to the 65816, and will likely give me some insights. Thanks!

Not necessary for the 65c816:
http://en.wikipedia.org/wiki/Interrupts ... _anomalies
http://wilsonminesco.com/NMOS-CMOSdif/


EDIT:
byuu wrote:
Code:
//immediate, 2-cycle opcodes with idle cycle will become bus read
//when an IRQ is to be triggered immediately after opcode completion.

AFAIK any idle cycle is already a bus read (or a bus write in certain situations); it's just that the value waiting on the data bus is ignored.


Last edited by creaothceann on Mon Aug 05, 2019 12:59 am, edited 1 time in total.

PostPosted: Fri Aug 02, 2019 12:44 pm 
Joined: Fri Feb 24, 2012 12:09 pm
Posts: 972
For delaying IRQ enable/disable transitions, memorize the clock cycle counter at which the transition occurred.
If an IRQ occurs, check whether the cycle counter is still the same, and if so, treat the IRQ enable flag as if it still had the opposite of its current state.
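
As a sketch (hypothetical names, and the exact granularity of "cycle counter" is left to the emulator):

```cpp
#include <cstdint>

// Hypothetical I flag that remembers when it last changed.
struct IFlag {
  bool value = true;                     // true = IRQs masked
  std::uint64_t changedAt = UINT64_MAX;  // cycle counter at last transition

  // Write the flag; record the cycle only on an actual transition.
  void set(bool v, std::uint64_t now) {
    if(v != value) {
      value = v;
      changedAt = now;
    }
  }

  // Flag as seen by an IRQ check at cycle `now`: a change made in this
  // same cycle is not visible yet, so report the opposite (old) state.
  bool effective(std::uint64_t now) const {
    return now == changedAt ? !value : value;
  }
};
```

After `set(false, 10)` (a CLI at cycle 10), `effective(10)` still reports masked and `effective(11)` reports unmasked, without needing a separate last-cycle hook in each opcode.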

PostPosted: Sun Aug 04, 2019 9:12 am 
Joined: Mon Jan 23, 2006 7:47 am
Posts: 201
Location: Germany
I've been looking at datasheets, and I'm still not 100% sure how exactly interrupts are processed.
It doesn't matter that much for /ABORT (doesn't exist on the SNES), BRK and /RESET, but /IRQ, /NMI and WAI ought to be relevant...

WAI actually stops the 65C816's internal clock in the PHI2 high state, putting the microprocessor into a sort of catatonia, reducing its power consumption to micro-amperes and halting all processing (hardware note: executing WAI also causes the 65C816's bi-directional RDY pin to go low—knowing that is a clue to what is going on inside while the 65C816 is WAIting). The system will appear to have gone completely dead.

However, as soon as any hardware interrupt other than a reset occurs the microprocessor will restart and exactly one PHI2 cycle after the interrupt was received, the next instruction will be executed. In other words, interrupt latency in this scenario will always equal exactly one PHI2 cycle: 70ns at the 65C816's maximum officially-rated PHI2 frequency of 14 MHz. Unlike the usual behavior when a hardware interrupt input is asserted, there is no delay while the current instruction finishes execution (there is no "current instruction" while WAIting) and the 65C816 performs no stack operations upon awakening.

[GTE] wrote:
Code:
6c Wait for Interrupt               
  (WAI)
  (1 Op Code)
  (1 byte)
  (3 cycles)

             CYCLE  /VP  /ML  VDA  VPA  RDY  ADDRESS BUS  DATA BUS  R/W  Notes
               1     1    1    1    1    1   PB:PC        Op Code    1
               2     1    1    0    0    1   PB:PC+1      IO         1   Wait at cycle 2 for 2 cycles after /NMI or /IRQ active input.
               3     1    1    0    0    0   PB:PC+1      IO         1
     IRQ,NMI   1     1    1    1    1    1   PB:PC+1      IRQ(BRK)   1


6d Stop-The-Clock
  (STP)
  (1 Op Code)
  (1 byte)
  (3 cycles)

             CYCLE  /VP  /ML  VDA  VPA  RDY  ADDRESS BUS  DATA BUS  R/W  Notes
               1     1    1    1    1    1   PB:PC        Op Code    1
               2     1    1    0    0    1   PB:PC+1      IO         1
       RES=1   3     1    1    0    0    1   PB:PC+1      IO         1
       RES=0   1c    1    1    0    0    1   PB:PC+1      RES(BRK)   1
       RES=0   1b    1    1    0    0    1   PB:PC+1      RES(BRK)   1
       RES=1   1a    1    1    0    0    1   PB:PC+1      RES(BRK)   1
               1     1    1    1    1    1   PB:PC+1      BEGIN      1


21a Stack (Hardware Interrupts) -- s
 (IRQ,NMI,ABORT,RES)
 (4 hardware Interrupts)
 (0 bytes)
 (7 and 8 cycles)

             CYCLE  /VP  /ML  VDA  VPA  RDY  ADDRESS BUS  DATA BUS  R/W  Notes
               1     1    1    1    1        PB:PC        IO         1
               2     1    1    0    0        PB:PC        IO         1   This is the last cycle which may be aborted, or the P, PB or DB registers will be updated.
               3     1    1    1    0        00:S         PB         0   Subtract 1 cycle for emulation mode (P.e = 1).
               4     1    1    1    0        00:S-1       PC.H       0   R/W remains high during Reset.
               5     1    1    1    0        00:S-2       PC.L       0   R/W remains high during Reset.
               6     1    1    1    0        00:S-3       P          0   R/W remains high during Reset. BRK bit 4 equals "0" in emulation mode.
               7     0    1    1    0        00:VA        AAVL       1
               8     0    1    1    0        00:VA+1      AAVH       1
               1     1    1    1    1        00:AAV       Op Code    1


6.9 STP WAI

SToP the clock
WAit for Interrupt

Code:
OP LEN CYCLES      MODE      nvmxdizc e SYNTAX
-- --- ----------- --------- ---------- ------
DB 1   3           imp       ........ . STP
CB 1   3           imp       ........ . WAI

STP stops the clock input of the 65C816, effectively shutting down the 65C816 until a hardware reset (interrupt) occurs. This puts the 65C816 into a low power state. This is useful for applications (circuits) that require low power consumption, but STP is rarely seen otherwise.

WAI puts the 65C816 into a low power sleep state until a hardware interrupt occurs. In addition to reducing power consumption, using WAI also ensures that the interrupt will be recognized immediately. In other words, if an interrupt (e.g. an NMI) occurs in the middle of an instruction (e.g. ADC), the instruction must finish before the interrupt will be recognized (i.e. before jumping to the interrupt vector). When WAI is used, once its third cycle is complete, the 65C816 will wait for the interrupt and can respond to it without any additional delay whenever it occurs.

When the i flag is 1, interrupts are disabled, and normally an IRQ would be ignored. However, WAI when the i flag is 1 is a special case; specifically, when an IRQ occurs (after the WAI instruction), the 65C816 will continue with the next instruction rather than jumping to the interrupt vector. This means an IRQ can be responded to within one cycle. The interrupt handler is effectively inline code, rather than a separate routine, and thus it does not end with an RTI, resulting in fewer cycles needed to handle the interrupt.


So the first two IO cycles in the "Hardware Interrupts" section are the same two IO cycles in the WAI and STP sections?

Cycle 1 of WAI (opcode fetch) is actually also the last cycle of the previous instruction, which is finishing its work (unless this involves a write cycle). This is followed by 2 IO cycles if P.i = 1, and at the end of the second one the CPU sleeps. Otherwise (if P.i = 0), if at the end of cycle 2 there's a pending interrupt the CPU executes one more IO cycle and then continues with cycle 3 of the "Stack (Hardware Interrupts)" section. Correct?

But when P.i = 0, WAI wouldn't make sense because the code after WAI would be executed immediately?


EDIT: Wait, I think it works like this instead: WAI always puts the CPU to sleep until a hardware interrupt occurs, but P.i=0 means that the CPU uses the interrupt vector and P.i=1 means that the CPU continues with the program code, which would then handle the interrupt.
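
If that reading is right, the wake-up decision would look something like this sketch. The enum and function names are made up, and the only thing added beyond the text above is that NMI cannot be masked by P.i:

```cpp
// Hypothetical outcome of checking the interrupt inputs while WAIting.
enum class WakeAction { KeepSleeping, TakeVector, ContinueInline };

// Decide what a sleeping core does, given the interrupt inputs and P.i.
inline WakeAction wakeFromWai(bool nmi, bool irq, bool flagI) {
  if(nmi) return WakeAction::TakeVector;  // NMI ignores the I flag
  if(irq) {
    // P.i = 1: wake up and fall through to the code after WAI.
    // P.i = 0: wake up and go through the interrupt vector.
    return flagI ? WakeAction::ContinueInline : WakeAction::TakeVector;
  }
  return WakeAction::KeepSleeping;        // nothing asserted yet
}
```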
