All times are UTC - 7 hours

Posted: Fri Oct 28, 2016 9:24 am

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1339
How in the world is this possible?

Image

The opcode fetch (0xed) should consume 4 cycles.
The suffix fetch (0xb0) should consume 4 cycles.
The read from (HL) should consume 3 cycles.
Incrementing HL and DE; plus decrementing BC; should take 3 cycles? Yet we end up consuming 5.
Decrementing PC when BC != 0 should take 1 cycle? But ends up consuming 5.

This one's even worse:

Image

The opcode fetch (0xed) is once again 4 cycles.
But now the suffix fetch (0xa2) is taking FIVE cycles??!!
And now our read from in(C) should take 4 cycles, but is now THREE cycles?! >_<
The increment of HL and decrement of B should take 2 cycles, but instead takes four.


Posted: Fri Oct 28, 2016 5:06 pm

Joined: Sun Mar 19, 2006 9:44 pm
Posts: 923
Location: Japan
It's microcoded, right? So those extra cycles are taken up by some hidden opcode sequencer figuring out what instruction to do next to perform LDIR, I'd imagine.

_________________
http://www.chrismcovell.com


Posted: Sat Oct 29, 2016 6:02 am

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1339
ccovell wrote:
It's microcoded, right? So those extra cycles are taken up by some hidden opcode sequencer figuring out what instruction to do next to perform LDIR, I'd imagine.


Probably. I know the 68K definitely is, at least.

The issue is I'm really not sure how to emulate such variable timings. Maybe I could make a template parameter on the read/write/in/out functions to override the amount of cycles they consume. Sometimes add one, sometimes remove one.

But when there are extra M cycles (presumably wait states), it's important to know where to place them to get a proper cycle timing between components.

There's also so many mistakes in the manual. I'd hate to go to the extra trouble and end up with broken timings for it.


Posted: Sat Oct 29, 2016 11:10 pm

Joined: Fri Nov 19, 2004 7:35 pm
Posts: 3968
You end up using unrolled LDIs because it's faster than LDIR.

_________________
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!


Posted: Sun Oct 30, 2016 6:49 am

Joined: Mon Jul 01, 2013 11:25 am
Posts: 228
HL, BC and DE are 16-bit registers, so updating them requires a 16-bit inc/dec unit. I'm not sure how many of those the Z80 has, but that may explain why the LDI instruction is so slow.


Posted: Sun Oct 30, 2016 4:06 pm

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 431
byuu wrote:
How in the world is this possible?


You can look at the bus traces I linked a while back to see what's going on. Remember, internal operations (i.e. T-states in excess of what that type of bus access should normally take) occur at the end of an M-cycle.

Quote:
Image

The opcode fetch (0xed) should consume 4 cycles.
The suffix fetch (0xb0) should consume 4 cycles.
The read from (HL) should consume 3 cycles.
Incrementing HL and DE; plus decrementing BC; should take 3 cycles? Yet we end up consuming 5.
Decrementing PC when BC != 0 should take 1 cycle? But ends up consuming 5.


Let's look at the bus trace for LDIR:

Code:
Opcode: ED B0 => LDIR

-----------------------------------------------------------+
#001H T1  AB:000 DB:--  M1                                 |
#002H T2  AB:000 DB:ED  M1      MREQ RD                    | Opcode read from 000 -> ED
#003H T3  AB:000 DB:--     RFSH                            |
#004H T4  AB:000 DB:--     RFSH MREQ                       | Refresh address  000
-----------------------------------------------------------+
#005H T1  AB:001 DB:--  M1                                 |
#006H T2  AB:001 DB:B0  M1      MREQ RD                    | Opcode read from 001 -> B0
#007H T3  AB:001 DB:--     RFSH                            |
#008H T4  AB:001 DB:--     RFSH MREQ                       | Refresh address  001
#009H T5  AB:06C DB:--                                     |
#010H T6  AB:06C DB:00          MREQ RD                    | Memory read from 06C -> 00
#011H T7  AB:06C DB:00          MREQ RD                    | Memory read from 06C -> 00
#012H T8  AB:05B DB:--                                     |
#013H T9  AB:05B DB:00          MREQ                       |
#014H T10 AB:05B DB:00          MREQ    WR                 | Memory write to  05B <- 00
#015H T11 AB:05B DB:00                                     |
#016H T12 AB:05B DB:00                                     |
#017H T13 AB:05B DB:--                                     |
#018H T14 AB:05B DB:--                                     |
#019H T15 AB:05B DB:--                                     |
#020H T16 AB:05B DB:--                                     |
#021H T17 AB:05B DB:--                                     |
-----------------------------------------------------------+


The fourth M-cycle is the write to (DE) (#012-#014) plus two internal T-states (#015-#016). The Z80 has to increment DE, decrement BC, and test the decremented BC for zero.
The fifth M-cycle when the loop is taken (#017-#021) is completely internal (no bus access). It's probably doing the same stuff as a JR instruction (which also takes 5 T-states at the end)
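
To sanity-check the trace, it helps to add up the M-cycle lengths. This is only a minimal sketch; the names `LDIR_M_CYCLES`, `LDIR_REPEAT` and `ldir_t_states` are made up for illustration, and the T-state counts are read straight off the trace above.

```python
# T-states per M-cycle for LDIR, taken from the bus trace in this post.
LDIR_M_CYCLES = [
    ("fetch ED", 4),               # #001-#004
    ("fetch B0", 4),               # #005-#008
    ("read (HL)", 3),              # #009-#011
    ("write (DE) + internal", 5),  # #012-#016: 3 bus + 2 internal
]
# Extra all-internal M-cycle, only present when the loop is taken.
LDIR_REPEAT = ("internal, rewind PC", 5)  # #017-#021


def ldir_t_states(bc_after_decrement: int) -> int:
    """Total T-states for one LDIR iteration."""
    total = sum(t for _, t in LDIR_M_CYCLES)
    if bc_after_decrement != 0:
        total += LDIR_REPEAT[1]
    return total


print(ldir_t_states(1))  # loop taken: 21
print(ldir_t_states(0))  # final iteration: 16
```

The 21 matches the last line of the trace (#021), and the 16 is the same total as a plain LDI.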

Quote:
This one's even worse:

Image

The opcode fetch (0xed) is once again 4 cycles.
But now the suffix fetch (0xa2) is taking FIVE cycles??!!
And now our read from in(C) should take 4 cycles, but is now THREE cycles?! >_<
The increment of HL and decrement of B should take 2 cycles, but instead takes four.


4, 5, 3, 4 is obviously incorrect (the manual writer probably used the counts from OUTI by mistake). It's actually 4, 5, 4, 3:

Code:
Opcode: ED A2 => INI

-----------------------------------------------------------+
#001H T1  AB:000 DB:--  M1                                 |
#002H T2  AB:000 DB:ED  M1      MREQ RD                    | Opcode read from 000 -> ED
#003H T3  AB:000 DB:--     RFSH                            |
#004H T4  AB:000 DB:--     RFSH MREQ                       | Refresh address  000
-----------------------------------------------------------+
#005H T1  AB:001 DB:--  M1                                 |
#006H T2  AB:001 DB:A2  M1      MREQ RD                    | Opcode read from 001 -> A2
#007H T3  AB:001 DB:--     RFSH                            |
#008H T4  AB:001 DB:--     RFSH MREQ                       | Refresh address  001
#009H T5  AB:001 DB:--                                     |
#010H T6  AB:049 DB:--                                     |
#011H T7  AB:049 DB:--               RD    IORQ            | I/O read from 049
#012H T8  AB:049 DB:--               RD    IORQ            | I/O read from 049
#013H T9  AB:049 DB:--               RD    IORQ            | I/O read from 049
#014H T10 AB:06E DB:--                                     |
#015H T11 AB:06E DB:A3          MREQ                       |
#016H T12 AB:06E DB:A3          MREQ    WR                 | Memory write to  06E <- A3
-----------------------------------------------------------+


Fetch, fetch + one internal operation (#009), port in, write.
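
The two breakdowns add up to the same total, which is probably why the mistake went unnoticed. What changes is where the memory write starts, and that is the part that matters for cycle-accurate emulation. A small sketch (the helper name `start_offsets` is hypothetical):

```python
MANUAL_INI = (4, 5, 3, 4)  # breakdown printed in the manual
TRACED_INI = (4, 5, 4, 3)  # breakdown seen in the bus trace


def start_offsets(m_cycles):
    """Return the 0-based T-state offset at which each M-cycle begins."""
    offsets, t = [], 0
    for length in m_cycles:
        offsets.append(t)
        t += length
    return offsets


print(sum(MANUAL_INI), sum(TRACED_INI))  # 16 16
print(start_offsets(MANUAL_INI))         # [0, 4, 9, 12]
print(start_offsets(TRACED_INI))         # [0, 4, 9, 13]
```

Offset 13 corresponds to line #014 of the trace, where the address bus switches to 06E for the write.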

I believe the Z80 has only one 16-bit inc/dec unit which is shared between the PC, the SP, and the register pairs, and it takes more than one T-state to do its job (i.e. increment or decrement one 16-bit register). Any M-cycle that performs an increment/decrement that the next cycle needs the value of seems to end up taking extra T-states. Notice that PUSH qq has an internal T-state after the fetch and before the first memory write (5, 3, 3) but POP qq doesn't have any (4, 3, 3). PUSH needs the decremented value of SP, so the second M-cycle can't begin until the decrement is complete. But for POP the effective address is the current value of SP, so the memory access can happen right away.

(Notice that the 6502 is the exact opposite of the Z80: PHA uses the current value of SP, PLA uses the incremented value of SP, and PLA takes one more cycle than PHA does. Stacks that work like the Z80's are called "full" stacks and stacks that work like the 6502's are called "empty" stacks)
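
As a rough illustration of the PUSH/POP asymmetry, here is a toy model of the Z80 stack with a T-state counter. The class and method names are invented for this sketch, and the per-cycle costs are the (5, 3, 3) and (4, 3, 3) figures quoted above, not measurements of my own.

```python
class Z80Stack:
    """Toy model: 64 KiB of memory, SP, and a T-state counter."""

    def __init__(self, sp=0xFFFE):
        self.mem = bytearray(0x10000)
        self.sp = sp
        self.t = 0

    def push16(self, value):
        self.t += 5  # fetch (4) + 1 internal: the SP decrement must finish first
        self.sp = (self.sp - 1) & 0xFFFF
        self.mem[self.sp] = (value >> 8) & 0xFF  # high byte goes to SP-1
        self.t += 3
        self.sp = (self.sp - 1) & 0xFFFF
        self.mem[self.sp] = value & 0xFF         # low byte goes to SP-2
        self.t += 3

    def pop16(self):
        self.t += 4  # fetch only: the current SP is already the address
        lo = self.mem[self.sp]
        self.sp = (self.sp + 1) & 0xFFFF
        self.t += 3
        hi = self.mem[self.sp]
        self.sp = (self.sp + 1) & 0xFFFF
        self.t += 3
        return (hi << 8) | lo


s = Z80Stack()
s.push16(0x1234)
print(s.t, hex(s.sp))         # 11 0xfffc
print(hex(s.pop16()), s.t)    # 0x1234 21
```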

Don't think of each T-state as one discrete operation (this cycle is incrementing HL, this next one is incrementing DE), think of several things going on at once inside the chip that take varying amounts of time, and the next M-cycle can't begin until all of them are done. Internal T-states happen when the register-level operations can't all be completed in the time that the memory operation takes.

From an emulation perspective I suggest having an "op_internal(unsigned count)" method that can efficiently add any number of internal T-states, and calling that from each opcode handler at the appropriate times (just like you call op_io() or whatever it's called now in the 65816/6502). You pretty much have to handle the internal cycles separately from the memory-accessing cycles, because some (many) instructions have an internal T-state or two on their first cycle, and you can't tell that until you've decoded the instruction!
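
A minimal sketch of that structure, in Python for brevity. Everything here (the class `Z80Sketch`, the method names, modelling an opcode fetch as a 3T read plus 1T of wait) is an assumption for illustration, not byuu's actual code; flags are left out entirely.

```python
class Z80Sketch:
    """Toy core: every bus helper advances the clock; wait() adds internal T-states."""

    def __init__(self):
        self.mem = bytearray(0x10000)
        self.clock = 0  # elapsed T-states
        self.pc = self.hl = self.de = self.bc = 0

    def wait(self, count):
        self.clock += count  # a real core would also sync other chips here

    def read(self, addr):
        self.wait(3)
        return self.mem[addr & 0xFFFF]

    def write(self, addr, data):
        self.wait(3)
        self.mem[addr & 0xFFFF] = data & 0xFF

    def opcode(self):
        data = self.read(self.pc)
        self.wait(1)  # refresh part of the M1 cycle
        self.pc = (self.pc + 1) & 0xFFFF
        return data

    def instruction_ldir(self):
        # Called after ED and B0 have been fetched with opcode().
        data = self.read(self.hl)
        self.write(self.de, data)
        self.wait(2)  # internal T-states at the end of the write M-cycle
        self.hl = (self.hl + 1) & 0xFFFF
        self.de = (self.de + 1) & 0xFFFF
        self.bc = (self.bc - 1) & 0xFFFF
        if self.bc != 0:
            self.wait(5)  # all-internal M-cycle when the loop is taken
            self.pc = (self.pc - 2) & 0xFFFF


cpu = Z80Sketch()
cpu.mem[0:2] = bytes([0xED, 0xB0])
cpu.mem[0x100:0x102] = bytes([0xAA, 0xBB])
cpu.hl, cpu.de, cpu.bc = 0x100, 0x200, 2
while True:
    assert (cpu.opcode(), cpu.opcode()) == (0xED, 0xB0)
    cpu.instruction_ldir()
    if cpu.bc == 0:
        break
print(cpu.clock)  # 21 + 16 = 37
```

Keeping the internal T-states as explicit `wait()` calls makes it easy to compare each handler against a bus trace line by line.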


Posted: Mon Oct 31, 2016 5:31 am

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1339
Thanks as always for the info!

> Let's look at the bus trace for LDIR:

Would you happen to have a bus trace for every instruction? That would be really super helpful.

If you're making these by hand, then I don't want to bug you to do all that work. But if it already exists, please share :D

> It's actually 4, 5, 4, 3:

Oh, whew. I can accept that the increment wait-state is earlier in the instruction this time, but not that a read/write/in/out call takes less time.

> From an emulation perspective I suggest having a "op_internal(unsigned count)" method that can efficiently add any number of internal T-states

Yeah, right now I have: read, write, in, out, wait. And I also have some convenience wrappers like opcode() [8-bit read from PC + one wait state], operand() [8-bit read from PC], operands() [16-bit read from PC], push, pop, etc that are all built off the first five.

The downside is those internal operations consume time, so it becomes harder to count the T-states to make sure there are no errors. But it's not impossibly difficult.

What would really help would be if the manual broke down the T-states. Or at the very least, if the official manual weren't chock full of errors >_<

> You pretty much have to handle the internal cycles separately from the memory-accessing cycles, because some (many) instructions have an internal T-state or two on their first cycle, and you can't tell that until you've decoded the instruction!

Given that internally, all read/write/in/out/wait calls advance the CPU time, and can thus cause a context switch, it may be a good micro-optimization to have something like read<5>(addr) instead of read(addr), wait(2). But the cycle timing could end up wrong if the wait(2) isn't in the correct location. I don't know if all T-states do the bus operation at the start, and the I/O stuff at the end or not.

Probably not a good idea to micro-optimize. The Z80 shouldn't be a bottleneck in the Mega Drive or the Master System.

...

EDIT: I assume OUTI, OUTD are correct at 4,5,3,4 then. Since it's read+out at the end.


Posted: Tue Nov 01, 2016 3:12 am

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1339
Okay, implemented every instruction on the Z80. Have a few more questions, of course.

RETI, RETN

I understand that RETI is a specialized version of RET that somehow signals to external hardware that it occurred. For the sake of the Master System and/or Mega Drive, do I need to do any kind of special handling for this, or can I just treat it like RET?

And on that note, RETN is just RET + IFF1=IFF2, right?

ED 77, ED 7F

Are these instructions NOP (as per z80-documented.pdf), or LD I,I and LD R,R (as per http://www.z80.info/z80oplist.txt) ? This matters as there's (presumably) an extra cycle penalty on the latter.

LD (I,A;A,I;A,R;R,A)

The manual states the T-cycles for these are T(4,5). I really would have expected this one to be T(4,4).

Is there really an extra cycle here?

Also, the manual says that "LD A,I"; "LD A,R"; set flags. I could understand that adding an extra cycle penalty. But "LD I,A"; "LD R,A" do not set flags and still (supposedly) have the extra cycle penalty?

DAA

These algorithms are always brutally difficult to get correct. I remember VBA using a lookup table for the Game Boy DAA instruction, and still getting the wrong results! (the table itself had bad values in it.)

Wasn't able to use my LR35902 implementation (known to be correct per blargg), because it's missing a lot of the flag values and said CPU doesn't have the N flag that affects the computation.

So my implementation was based off CZ80 ... is this algorithm known to be correct or incorrect?

Specifically, this is what I came up with in adapting said code:
Code:
auto Z80::instructionDAA() -> void {
  uint8 lo = A.bits(0,3);
  uint8 hi = A.bits(4,7);
  uint8 diff;

  if(CF) {
    diff = lo <= 9 && !HF ? 0x60 : 0x66;
  } else if(lo >= 10) {
    diff = hi <= 8 ? 0x06 : 0x66;
  } else if(hi >= 10) {
    diff = HF ? 0x66 : 0x60;
  } else {
    diff = HF ? 0x06 : 0x00;
  }

  if(NF == 0) A += diff;
  if(NF == 1) A -= diff;

  CF = CF || (lo <= 9 ? hi >= 10 : hi >= 9);
  PF = parity(A);
  XF = A.bit(3);
  HF = NF ? (HF && lo <= 5) : (lo >= 10);
  YF = A.bit(5);
  ZF = A == 0;
  SF = A.bit(7);
}


If that works, it's a very clever way to implement it. But it's highly unusual looking to me.

16-bit arithmetic

I'd rather be a bit lazy and reuse the existing 8-bit ADD/SUB functions if possible.
99% sure this is fine, but just to be certain, it's okay if I implement these like so, correct?

Code:
auto Z80::instructionADC_hl_rr(uint16& x) -> void {
  wait(4);
  auto lo = ADD(HL >> 0, x >> 0, CF);
  wait(3);
  auto hi = ADD(HL >> 8, x >> 8, CF);
  HL = hi << 8 | lo << 0;
  ZF = HL == 0;
}

auto Z80::instructionADD_hl_rr(uint16& x) -> void {
  wait(4);
  auto lo = ADD(HL >> 0, x >> 0);
  wait(3);
  auto hi = ADD(HL >> 8, x >> 8, CF);
  HL = hi << 8 | lo << 0;
  ZF = HL == 0;
}

auto Z80::instructionSBC_hl_rr(uint16& x) -> void {
  wait(4);
  auto lo = SUB(HL >> 0, x >> 0, CF);
  wait(3);
  auto hi = SUB(HL >> 8, x >> 8, CF);
  HL = hi << 8 | lo << 0;
  ZF = HL == 0;
}


RLD / RRD

It was easy enough to see what these were doing via CZ80, but ... what the hell are these useful for? >_>

Testing

Last question, are there any good stress-testing modules for a Z80 core that don't require a working Master System VDP core? If I have to go my usual route (compare my trace logs to Mednafen's), I can do that, but I'd rather save time on the inevitable weeks of debugging a new CPU core.


Posted: Tue Nov 01, 2016 4:22 am

Joined: Thu Oct 05, 2006 6:29 am
Posts: 911
>what the hell are these useful for? >_>

I found some use for RLD when I was writing some code to read a sector of data from an SD card one nybble at a time:

Code:
; Read to RAM/SRAM
;
; 46.625 cycles/byte
;
; In:
;   HL = buf
neo2_recv_sd:
        ld      b,#64
        ld      de,#MYTH_NEO2_RD_DAT4
        ; Read one sector (512 bytes)
1$:
        ld      a,(de)   ; 7
        ld      (hl),a   ; 7
        ld      a,(de)   ; 7
        rld              ; 18.  (hl) = (hl)<<4 + a&0x0F
        inc     hl       ; 6
; 2nd byte
;       same as for byte 1. Unrolled 8 times.
       ....


Posted: Tue Nov 01, 2016 7:22 am

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1339
Wow, the card reads nybbles into RAM, with the top four bits unused in each byte?

Yeah, that would be a perfect use case for RLD. But, guessing they weren't anticipating SD cards back then :P

It actually would've been a pretty amazing opcode for serial if it were (hl)=(hl)<<1|a&1 instead. RRD as well in case the bits were in the other order.


Posted: Tue Nov 01, 2016 2:51 pm

Joined: Mon Jul 01, 2013 11:25 am
Posts: 228
byuu wrote:
RETI, RETN

I understand that RETI is a specialized version of RET that somehow signals to external hardware that it occurred. For the sake of the Master System and/or Mega Drive, do I need to do any kind of special handling for this, or can I just treat it like RET?

And on that note, RETN is just RET + IFF1=IFF2, right?


If you don't plan to emulate the signal RETI sends back to external hardware (which is indeed not needed for the Sega Mega Drive), then RETI = RETN.
They both do IFF1 = IFF2, while RET doesn't.
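
In code, that boils down to something like the following sketch. The class `ReturnSketch` and its fields are invented for illustration; the only behaviour it claims is what is stated above (RETN and RETI copy IFF2 into IFF1, RET does not).

```python
class ReturnSketch:
    """Toy model of RET / RETN / RETI, ignoring timing and external signalling."""

    def __init__(self):
        self.mem = bytearray(0x10000)
        self.sp = 0xFFFC
        self.pc = 0
        self.iff1 = False
        self.iff2 = True

    def pop16(self):
        lo = self.mem[self.sp]
        hi = self.mem[(self.sp + 1) & 0xFFFF]
        self.sp = (self.sp + 2) & 0xFFFF
        return (hi << 8) | lo

    def ret(self):
        self.pc = self.pop16()  # IFF1 and IFF2 are left alone

    def retn(self):
        self.ret()
        self.iff1 = self.iff2   # restore the pre-NMI interrupt enable state

    reti = retn  # same behaviour when the RETI bus signalling is not emulated


cpu = ReturnSketch()
cpu.mem[0xFFFC:0xFFFE] = bytes([0x34, 0x12])
cpu.reti()
print(hex(cpu.pc), cpu.iff1)  # 0x1234 True
```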

Quote:
DAA

These algorithms are always brutally difficult to get correct. I remember VBA using a lookup table for the Game Boy DAA instruction, and still getting the wrong results! (the table itself had bad values in it.)

Wasn't able to use my LR35902 implementation (known to be correct per blargg), because it's missing a lot of the flag values and said CPU doesn't have the N flag that affects the computation.

So my implementation was based off CZ80 ... is this algorithm known to be correct or incorrect?

...

If that works, it's a very clever way to implement it. But it's highly unusual looking to me.


I wrote CZ80 very quickly and only made one bugfix update to it (the first version was 0.90, and this one is 0.91). I'm almost certain it still has some bugs, so I wouldn't trust it too much :-/
To be honest, I don't even remember where I got that implementation from. I guess it came from reading and analyzing what the DAA instruction does, or maybe from the small z80-documented.pdf file. The reason I used that code was to reduce code size: I mainly wrote CZ80 to provide a fast and (hopefully :p) accurate Z80 C core that I could use on the Dreamcast (where code size matters a lot for better cache use).

Quote:
16-bit arithmetic

I'd rather be a bit lazy and reuse the existing 8-bit ADD/SUB functions if possible.
99% sure this is fine, but just to be certain, it's okay if I implement these like so, correct?

Code:
auto Z80::instructionADC_hl_rr(uint16& x) -> void {
  wait(4);
  auto lo = ADD(HL >> 0, x >> 0, CF);
  wait(3);
  auto hi = ADD(HL >> 8, x >> 8, CF);
  HL = hi << 8 | lo << 0;
  ZF = HL == 0;
}

auto Z80::instructionADD_hl_rr(uint16& x) -> void {
  wait(4);
  auto lo = ADD(HL >> 0, x >> 0);
  wait(3);
  auto hi = ADD(HL >> 8, x >> 8, CF);
  HL = hi << 8 | lo << 0;
  ZF = HL == 0;
}

auto Z80::instructionSBC_hl_rr(uint16& x) -> void {
  wait(4);
  auto lo = SUB(HL >> 0, x >> 0, CF);
  wait(3);
  auto hi = SUB(HL >> 8, x >> 8, CF);
  HL = hi << 8 | lo << 0;
  ZF = HL == 0;
}



I would say the arithmetic is correct, but be careful about the flag calculation. Strangely enough, Z is not affected by 16-bit ADD, nor are S and P; they are only affected by 16-bit ADC/SBC. Also, N is cleared by ADD and ADC and set by SBC, the same as in the 8-bit versions.
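
Here is a sketch of the documented flag behaviour for the three 16-bit operations, written directly on 16-bit values rather than by chaining 8-bit adds. The function names and the flag dictionary are assumptions for illustration, and the undocumented X/Y flags are ignored. Note that the Zilog user manual lists N as set for SBC HL,rr and reset for ADD/ADC HL,rr.

```python
def add_hl(hl, rr, flags):
    """ADD HL,rr: only H, N and C change; S, Z and P/V are preserved."""
    out = dict(flags)
    result = hl + rr
    out["H"] = ((hl & 0x0FFF) + (rr & 0x0FFF)) > 0x0FFF  # carry out of bit 11
    out["N"] = False
    out["C"] = result > 0xFFFF
    return result & 0xFFFF, out


def adc_hl(hl, rr, flags):
    """ADC HL,rr: all documented flags reflect the full 16-bit result."""
    out = dict(flags)
    c = int(flags["C"])
    result = hl + rr + c
    out["H"] = ((hl & 0x0FFF) + (rr & 0x0FFF) + c) > 0x0FFF
    out["N"] = False
    out["C"] = result > 0xFFFF
    out["Z"] = (result & 0xFFFF) == 0
    out["S"] = bool(result & 0x8000)
    out["P"] = bool((hl ^ result) & (rr ^ result) & 0x8000)  # signed overflow
    return result & 0xFFFF, out


def sbc_hl(hl, rr, flags):
    """SBC HL,rr: like ADC, but subtracting, and N ends up set."""
    out = dict(flags)
    c = int(flags["C"])
    result = hl - rr - c
    out["H"] = ((hl & 0x0FFF) - (rr & 0x0FFF) - c) < 0  # borrow from bit 12
    out["N"] = True
    out["C"] = result < 0
    out["Z"] = (result & 0xFFFF) == 0
    out["S"] = bool(result & 0x8000)
    out["P"] = bool((hl ^ rr) & (hl ^ result) & 0x8000)  # signed overflow
    return result & 0xFFFF, out


clear = {"S": False, "Z": False, "H": False, "P": False, "N": False, "C": False}
print(add_hl(0xFFFF, 0x0001, clear))  # result 0, but Z stays False
print(sbc_hl(0x0000, 0x0001, clear))  # result 0xFFFF with C, S and N set
```

If the 8-bit ADD/SUB helpers are reused instead, the same rules apply: Z has to be recomputed from the full 16-bit result for ADC/SBC, and ADD HL,rr has to restore S, Z and P/V afterwards.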


Posted: Tue Nov 01, 2016 4:38 pm

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 431
The bus traces I've been quoting are here:

http://baltazarstudios.com/zilog-z80-un ... -behavior/

byuu wrote:
LD (I,A;A,I;A,R;R,A)

The manual states the T-cycles for these are T(4,5). I really would have expected this one to be T(4,4).

Is there really an extra cycle here?


Yes, there really is an extra T-state for all I/R moves. The pipeline stall probably has nothing to do with flags; it's more likely because R has to be incremented on every M1 cycle (opcode fetch), and therefore the next opcode fetch can't happen until the register move is complete (unlike normal register-register moves, which can be overlapped with the next opcode fetch)

The reason moves to/from I incur the stall (and not just moves to/from R) is that internally IR is a single 16-bit register, with I in the upper half and R in the lower half.
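
One way to picture the combined register is the sketch below. The class `IRPair` is hypothetical. The detail that only the low 7 bits of R count up, with bit 7 keeping whatever LD R,A last wrote, is not from this thread; it is the commonly documented behaviour, so treat it as an assumption here.

```python
class IRPair:
    """I and R modelled as the two halves of one 16-bit register."""

    def __init__(self, value=0):
        self.ir = value & 0xFFFF

    @property
    def i(self):
        return self.ir >> 8

    @property
    def r(self):
        return self.ir & 0xFF

    def m1_refresh(self):
        """Called once per M1 cycle (opcode fetch)."""
        r = self.ir & 0xFF
        r = (r & 0x80) | ((r + 1) & 0x7F)  # assumption: bit 7 is preserved
        self.ir = (self.ir & 0xFF00) | r


LD_A_I_M_CYCLES = (4, 5)  # fetch ED, then fetch 57 plus one internal T-state

pair = IRPair(0x12FF)
pair.m1_refresh()
print(hex(pair.i), hex(pair.r))  # 0x12 0x80
print(sum(LD_A_I_M_CYCLES))      # 9
```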

Quote:
DAA

These algorithms are always brutally difficult to get correct. I remember VBA using a lookup table for the Game Boy DAA instruction, and still getting the wrong results! (the table itself had bad values in it.)

Wasn't able to use my LR35902 implementation (known to be correct per blargg), because it's missing a lot of the flag values and said CPU doesn't have the N flag that affects the computation.


DAA is different between the LR35902 and the Z80 even apart from the Z80's additional flags. One difference (not the only one) is that on the LR35902 the upper and lower nybbles of A only affect the result of an adjust-after-add (NF == 0), but on the Z80 they affect both adjust-after-add and adjust-after-subtract.

In fact, I believe the problem with the table VBA used was that it was correct for the Z80 rather than the LR35902.

Here's my algorithm for Z80 DAA, which should be equivalent to the one you quoted, and a fair bit simpler and easier to understand:

(Edit: Simplified by doing the adjusts directly on A, taking advantage of the fact that the upper nybble adjust doesn't affect the test for the lower nybble adjust. Note that the converse is not true; the lower nybble adjust would mess up the upper nybble test if you did it first, so don't change the order of the first two lines!)

Code:
uint oldA = A; // save the previous value of A to calculate HF with

if (CF || (A > 0x99)) { A += (NF ? -0x60 : 0x60); CF = 1; } // if carry set or A > BCD 99, adjust upper nybble and set carry
if (HF || (A.bits(0,3) > 9)) { A += (NF ? -6 : 6); }        // if half-carry set or lower nybble > 9, adjust lower nybble

HF = (A ^ oldA).bit(4); // half-carry is set if bit 4 changed, otherwise cleared

// the rest of the flags are set the usual way for an ALU operation (except that PF is parity, rather than overflow like you'd expect...)
// note that unlike the LR35902, NF is preserved
PF = parity(A);
XF = A.bit(3);
YF = A.bit(5);
ZF = A == 0;
SF = A.bit(7);


Here's a side-by-side tester I whipped up in Python, omitting flag calculations that depend solely on the resulting value of A. It passes, but someone should double-check my translation of byuu's algorithm (and mine) from nalled C++ to Python:

Code:
#!/usr/bin/python3

def mydaa(A, CF, HF, NF):
    oldA = A
    if CF or A > 0x99:
        A += -0x60 if NF else 0x60
        CF = True
    if HF or A & 0xf > 9:
        A += -6 if NF else 6

    HF = bool((A ^ oldA) & 0x10)

    return A, CF, HF


def byuudaa(A, CF, HF, NF):
    lo, hi = A & 0xf, A >> 4

    if CF:
        diff = 0x66 if HF or lo > 9 else 0x60
    elif lo >= 10:
        diff = 0x66 if hi > 8 else 0x06
    elif hi >= 10:
        diff = 0x66 if HF else 0x60
    else:
        diff = 0x06 if HF else 0

    A = A - diff if NF else A + diff

    CF = CF or (hi >= (10 if lo <= 9 else 9))
    HF = (HF and lo <= 5) if NF else (lo >= 10)

    return A, CF, HF

myresult = [(A, CF, HF, NF, mydaa(A, CF, HF, NF))
            for A in range(256)
            for CF in (False, True)
            for HF in (False, True)
            for NF in (False, True)]

byuuresult = [(A, CF, HF, NF, byuudaa(A, CF, HF, NF))
              for A in range(256)
              for CF in (False, True)
              for HF in (False, True)
              for NF in (False, True)]

passed = True
for mine, byuu in zip(myresult, byuuresult):
    if mine != byuu:
        print("mine: " + repr(mine))
        print("byuu: " + repr(byuu))
        passed = False

if passed:
    print("Passed")
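
As a spot check on top of the side-by-side comparison, the simplified algorithm can be run on a few BCD sums whose answers are obvious by hand. This sketch re-states `mydaa` with A masked to 8 bits; the `add8`/`sub8` helpers are made up here just to produce the carry and half-carry inputs.

```python
def daa(a, cf, hf, nf):
    """Simplified Z80 DAA from this post, with the result masked to 8 bits."""
    old = a
    if cf or a > 0x99:
        a += -0x60 if nf else 0x60
        cf = True
    if hf or (a & 0x0F) > 9:
        a += -6 if nf else 6
    a &= 0xFF
    hf = bool((a ^ old) & 0x10)
    return a, cf, hf


def add8(a, b):
    """Binary add returning (result, carry, half-carry)."""
    r = a + b
    return r & 0xFF, r > 0xFF, ((a & 0x0F) + (b & 0x0F)) > 0x0F


def sub8(a, b):
    """Binary subtract returning (result, borrow, half-borrow)."""
    r = a - b
    return r & 0xFF, r < 0, (a & 0x0F) < (b & 0x0F)


a, cf, hf = add8(0x15, 0x27)        # binary sum is 0x3C
print(hex(daa(a, cf, hf, False)[0]))  # 0x42, i.e. 15 + 27 = 42

a, cf, hf = sub8(0x42, 0x15)        # binary difference is 0x2D
print(hex(daa(a, cf, hf, True)[0]))   # 0x27, i.e. 42 - 15 = 27

a, cf, hf = add8(0x99, 0x01)        # binary sum is 0x9A
print(daa(a, cf, hf, False))          # (0, True, True): 99 + 1 = 100
```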


Quote:
RLD / RRD

It was easy enough to see what these were doing via CZ80, but ... what the hell are these useful for? >_>


Just like the LR35902 SWAP instruction, they're useful for working with nybble-sized data (e.g. BCD decimal numbers).
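
For reference, the nybble shuffling itself can be sketched as two pure functions. The names `rld` and `rrd` and the tuple return are just for this example, `m` stands for the byte at (HL), and flags are not modelled.

```python
def rld(a, m):
    """RLD: (HL).lo -> (HL).hi, (HL).hi -> A.lo, A.lo -> (HL).lo."""
    new_m = ((m << 4) | (a & 0x0F)) & 0xFF
    new_a = (a & 0xF0) | (m >> 4)
    return new_a, new_m


def rrd(a, m):
    """RRD: A.lo -> (HL).hi, (HL).hi -> (HL).lo, (HL).lo -> A.lo."""
    new_m = ((a & 0x0F) << 4) | (m >> 4)
    new_a = (a & 0xF0) | (m & 0x0F)
    return new_a, new_m


print([hex(v) for v in rld(0x12, 0x34)])  # ['0x13', '0x42']
print([hex(v) for v in rrd(0x12, 0x34)])  # ['0x14', '0x23']

# Shifting BCD digits into a byte one at a time, e.g. keypad entry of "4", "2":
m = 0x00
for digit in (4, 2):
    _, m = rld(digit, m)
print(hex(m))  # 0x42
```

The upper nybble of A is never touched, and RRD undoes RLD.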


Posted: Thu Jan 05, 2017 10:29 pm

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1339
Okay, I think this is the most evil one yet.

Image

So in this case the M cycles / T states break down like so:

[1] 4 => read 0xDD prefix (or 0xFD prefix)
[2] 4 => read 0xCB prefix
[3] 3 => read int8 displacement byte
[4] 5 => compute (IX,IY)+displacement
[5] 4 => read opcode identifier
[6] 4 => read from (IX,IY)+displacement and RLC it (extra cycle penalty for the RLC operation)
[7] 3 => write RLC'd value to (IX,IY)+displacement

That's seven cycles, not six.

Every instance ever of (IX,IY)+displacement consumes the two 3,5 T states.

It looks like the documentation is acting as though you can read the opcode identifier at the same time as doing the wait(5) computation for (IX,IY)+displacement.

Or is the documentation wrong here?


Posted: Thu Jan 12, 2017 7:07 pm

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 431
byuu wrote:
Okay, I think this is the most evil one yet.

Image

So in this case the M cycles / T states break down like so:

[1] 4 => read 0xDD prefix (or 0xFD prefix)
[2] 4 => read 0xCB prefix
[3] 3 => read int8 displacement byte
[4] 5 => compute (IX,IY)+displacement
[5] 4 => read opcode identifier
[6] 4 => read from (IX,IY)+displacement and RLC it (extra cycle penalty for the RLC operation)
[7] 3 => write RLC'd value to (IX,IY)+displacement

That's seven cycles, not six.

Every instance ever of (IX,IY)+displacement consumes the two 3,5 T states.

It looks like the documentation is acting as though you can read the opcode identifier at the same time as doing the wait(5) computation for (IX,IY)+displacement.


Yes, that's exactly what the Z80 does. The displacement calculation occurs in parallel with the sub-opcode fetch. That's why instructions with both a DD/FD prefix and a CB prefix are encoded "out of order", with the displacement coming before the sub-opcode. Index+displacement instructions that only have the DD/FD prefix have no useful work that can be done while the ALU is calculating the effective address, but instructions with both prefixes are a bit more efficient thanks to the funny encoding (though they're still among the slowest instructions in the Z80 instruction set--perhaps "closer to memory-bound" is a better term than "efficient")

Note that for the double-prefixed instructions, the bus signals and timing for the sub-opcode fetch cycle are the same as an operand fetch with two dead T-states after, not an opcode fetch (unlike CB-prefix-only instructions, where the sub-opcode fetch is a real opcode fetch). That also means that the R register isn't incremented for the sub-opcode.
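
Put together, a decoder for the double-prefixed form might look like this sketch. The function `decode_ddcb` and its return shape are invented for illustration; the M-cycle tuple is the six-cycle breakdown described above for a read-modify-write operation such as RLC (IX+d).

```python
# DD/FD, CB, displacement, sub-opcode (3 + 2 dead), read + 1 internal, write
DDCB_RMW_M_CYCLES = (4, 4, 3, 5, 4, 3)


def decode_ddcb(code, index_reg):
    """Decode a 4-byte DD/FD CB d op sequence.

    Returns (sub_opcode, effective_address, r_increments).
    """
    assert code[0] in (0xDD, 0xFD) and code[1] == 0xCB
    raw = code[2]
    displacement = raw - 0x100 if raw & 0x80 else raw  # signed 8-bit
    sub_opcode = code[3]  # fetched like an operand, so R is not bumped for it
    effective_address = (index_reg + displacement) & 0xFFFF
    r_increments = 2      # only the two prefix fetches are real M1 cycles
    return sub_opcode, effective_address, r_increments


op, ea, r_inc = decode_ddcb(bytes([0xDD, 0xCB, 0xFE, 0x06]), 0x1000)
print(hex(op), hex(ea), r_inc)  # 0x6 0xffe 2
print(sum(DDCB_RMW_M_CYCLES))   # 23
```

Because the displacement is consumed before the sub-opcode is known, a separate DD/FD CB dispatch table that receives the already computed effective address is a natural fit.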


Posted: Fri Jan 13, 2017 3:35 am

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1339
> That's why instructions with both a DD/FD prefix and a CB prefix are encoded "out of order", with the displacement coming before the sub-opcode.

Oh man, thank you so much for explaining that. I was really racking my brain trying to understand why they went with that design. It completely wrecked my attempts to merge the DD/FD prefixes into the existing opcode tables. I ended up needing an entirely separate DD/FD CB instruction table, and as such had to do the displacement before invoking the individual opcodes due to needing the CB opcode number before I could invoke the appropriate instruction.

I'll probably end up doing the same for the regular non-CB/ED instructions since why not at this point? Avoids the messy specialization for (IX,IY+n) conversion that's not needed for regular HL instructions.

> That also means that the R register isn't incremented for the sub-opcode.

If you don't mind, I do actually have one more interesting question.

Turns out the source of a LOT of game bugs was that I was following the Z80 documented undocumented document where it was saying that DD/FD are treated as "separate instructions that set an internal flag", hence you can easily stack multiple prefixes like DD DD FD DD FD CB ... and it'll "ignore" all but the final FD.

But if I do this, and allow interrupts to fire between the prefix and CB opcode, obviously very bad things happen.

So ... what happens in a sequence like the above? What if I fill the bus with nothing but 0xDD? (All ROM and all RAM.) Will IRQs and NMIs never fire as a result?

