All times are UTC - 7 hours

PostPosted: Tue Mar 29, 2016 11:12 am 
Joined: Tue Mar 29, 2016 8:56 am
Posts: 9
Location: Kitchener, Ontario, Canadia
So I've been writing a GB emulator like so many others, mostly to learn about the innards. I'm trying to make it as accurate as possible. One thing I've been confused about is the "4 clock cycles/1 machine cycle" thing.

All the sources I find list the instructions as taking a multiple of 4 clock cycles. I've noticed that the multiple is the number of memory accesses (plus one if the instruction is a branch). So "ld a,b" takes 4 ticks, "ld a,1" takes 8 (4 to fetch the instruction, 4 to fetch the operand), etc. My understanding is that each instruction has 4 "T-states", which explains the 4-to-1 ratio. What I don't understand is what those T-states are doing.

If I assume it's just "fetch, decode, execute", then what's the 4th? If they're the "sub-stages" of the instruction (doing math, shuffling registers around, asserting the address bus), then why is it always a multiple of 4? Is there some limitation that memory accesses can only happen on every 4th cycle, or take a minimum of 3 cycles? Are the docs that list every instruction as a multiple of 4 clock cycles just wrong?
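
As a rough sketch of that rule of thumb (the function name and the small instruction table are made up for illustration; the access counts are just the opcode fetch plus operand bytes):

```python
# Rule of thumb from the post: 4 clocks per memory access,
# plus 4 more if the instruction is a branch.
CLOCKS_PER_ACCESS = 4


def estimated_clocks(memory_accesses, is_branch=False):
    """Estimate clock cycles from the number of memory accesses."""
    return CLOCKS_PER_ACCESS * (memory_accesses + (1 if is_branch else 0))


# (mnemonic, memory accesses, branch?) -- a hand-picked, illustrative table
EXAMPLES = [
    ("ld a,b", 1, False),  # opcode fetch only
    ("ld a,1", 2, False),  # opcode fetch + immediate operand
    ("jp nn", 3, True),    # opcode fetch + two address bytes + branch
]

for mnemonic, accesses, branch in EXAMPLES:
    print(f"{mnemonic:8} {estimated_clocks(accesses, branch)} clocks")
```

This only restates the observed pattern; it says nothing about what happens inside each group of 4 clocks.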

_________________
Sent from my Game Boy.


PostPosted: Tue Mar 29, 2016 11:15 am
Joined: Thu Aug 12, 2010 3:43 am
Posts: 1589
No idea exactly how it works, although I do know that every time the Z80 fetches an opcode it spends an extra cycle for DRAM refresh (no idea if the GBZ80 keeps this behavior though).


PostPosted: Tue Mar 29, 2016 11:25 am
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 18681
Location: NE Indiana, USA (NTSC)
My guess is that it's analogous to MC68000 memory cycle timing. That CPU takes eight half-cycles to access memory.


PostPosted: Tue Mar 29, 2016 7:54 pm
Joined: Tue Mar 29, 2016 8:56 am
Posts: 9
Location: Kitchener, Ontario, Canadia
I'd be surprised if it used DRAM, since the need for constant refreshing would hurt battery life. In particular, Nintendo advises using HALT to save battery power, but the Z80 manual says HALT just runs endless NOPs until interrupted, which would imply it's not really saving any power. So I would expect some changes there.

I made a spreadsheet with my information and assumptions. I don't know how close these are to the actual stages of each instruction, but they make sense to me and fit the cycle counts for each. The only exception is PUSH, which requires an extra cycle I can't account for. If I assume that, like the original Z80, cycles after the first are only 3 stages, and that decrementing SP takes two steps, then the cycle counts for PUSH are correct, but I didn't look too much into how that affects the other instructions. Anyway, that would definitely disprove the "every instruction takes a multiple of 4 clocks" idea.

I'm also not entirely sure what limitations there are on memory access timing. Again my assumptions fit the pattern without much fudging (a few instructions have a "wait" step or two) but I really can't verify.

PostPosted: Tue Mar 29, 2016 8:40 pm
Joined: Sat Aug 28, 2010 9:01 am
Posts: 169
Repeat after me. Gameboy is not Z80. Gameboy is not Z80. T state in this context would be Z80 terminology. The Gameboy CPU is a Sharp CPU core whose binary code is kinda-sorta similar to the i8080 and Z80, but it probably doesn't share much with those two internally. In other words, you cannot assume that it will behave like a Z80 in any given situation. The 4 cycle alignment is correct. The mental model I have for this is that the 4 steps are read, decode, execute, write, which sit on a "grid". I.e., in every slot, the CPU has an opportunity to do one read and one write, but the time is consumed regardless of whether the read/write "slots" are used or not.

Because all instructions take multiples of 4 cycles, GB instructions are often counted in "instruction cycles", divided by 4. So a nop would take 1 such cycle for example.
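
A minimal sketch of that "grid" idea, assuming an emulator that only counts time in whole 4-clock slots. The class and attribute names are hypothetical, not taken from any real emulator:

```python
T_PER_M = 4  # clock cycles per machine ("instruction") cycle


class Grid:
    """Counts time in 4-clock slots, whether or not the bus is used."""

    def __init__(self):
        self.m_cycles = 0
        self.bus_slots_used = 0

    def slot(self, bus_access):
        """Consume one slot; the time passes even if the bus stays idle."""
        self.m_cycles += 1
        if bus_access:
            self.bus_slots_used += 1

    @property
    def clocks(self):
        return self.m_cycles * T_PER_M


grid = Grid()
grid.slot(bus_access=True)   # nop: opcode fetch
nop_clocks = grid.clocks     # one instruction cycle so far
grid.slot(bus_access=True)   # ld a,1: opcode fetch
grid.slot(bus_access=True)   # ld a,1: operand fetch
print(nop_clocks, grid.clocks, grid.m_cycles)
```

An emulator built this way never needs to track individual clocks inside a slot, which is why instruction timings are so often quoted in instruction cycles.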

Regarding your spreadsheet, there is a similar one available here:

http://www.pastraiser.com/cpu/gameboy/g ... codes.html

Also, what you call rlc a, rl a etc. are normally written as a single word: rlca, rla etc. The reason is that there are CB-prefixed opcodes for rlc r8, rl r8 etc. which work on arbitrary registers, including a, but take 8 cycles instead because they are two bytes long, and also affect flags differently. The "native" rlca, rla etc. opcodes set z to 0, whereas the extended ones set z depending on whether the result is zero.
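
To make the flag difference concrete, here is a small sketch of the two rotate variants. The function names are made up, and only the Z and C flags are modelled (N and H are left out):

```python
def rotate_left_circular(value):
    """Rotate an 8-bit value left; bit 7 wraps to bit 0 and is copied to carry."""
    carry = (value >> 7) & 1
    result = ((value << 1) | carry) & 0xFF
    return result, carry


def rlca(a):
    """One-byte opcode: 4 clocks, Z is always cleared."""
    result, carry = rotate_left_circular(a)
    return result, {"z": 0, "c": carry}, 4


def rlc_r8(value):
    """CB-prefixed opcode: 8 clocks, Z reflects the result."""
    result, carry = rotate_left_circular(value)
    return result, {"z": int(result == 0), "c": carry}, 8


print(rlca(0x00))
print(rlc_r8(0x00))
```

With an input of zero the two differ only in the Z flag and the clock count.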


PostPosted: Wed Mar 30, 2016 9:48 am
Joined: Tue Mar 29, 2016 8:56 am
Posts: 9
Location: Kitchener, Ontario, Canadia
Thanks, I wasn't sure how similar the two are internally.

Your model seems to make sense, though I haven't examined it in detail. If you don't mind my asking, where is this information from?

PostPosted: Wed Mar 30, 2016 10:56 am
Joined: Sat Aug 28, 2010 9:01 am
Posts: 169
Various documents, my own experience of both programming and looking at oscilloscope traces of the address bus. Perhaps not all that rigorous, but probably correct. You may want to check out Blargg's test ROMs, for example instr_timing and cpu_instrs.

http://gbdev.gg8.se/files/roms/blargg-gb-tests/


PostPosted: Wed Mar 30, 2016 11:07 am
Joined: Tue Mar 29, 2016 8:56 am
Posts: 9
Location: Kitchener, Ontario, Canadia
Those test ROMs are super helpful. I just wish they gave more information about what, exactly, is wrong. (Trying to read their code isn't much help either...)

I added a new sheet to the document ("breakdown2") that tries to break down the instruction stages based on the "read, decode, execute, write" model, but there are a lot I can't reconcile this way. I'm really not sure what the decode stage would be doing after the first cycle, and some instructions take longer than this model suggests they should. Perhaps there are more limitations, such as not being able to directly manipulate some registers (like putting BC on the address bus, or directly manipulating SP) and/or not being able to perform both a read and a write in the same cycle? Or could the stages be in a different order?

I don't really understand why decode and execute are two separate tasks, either.

PostPosted: Wed Mar 30, 2016 10:05 pm
Joined: Sun Jan 26, 2014 9:31 am
Posts: 244
Gekkio has done extensive testing and verification on instruction timings. Have a look at the Mooneye GB emulator for documentation -> https://github.com/Gekkio/mooneye-gb/ Specifically, try looking through the tests if you have a certain area you want to look at.


PostPosted: Thu Mar 31, 2016 11:16 am
Joined: Tue Mar 29, 2016 8:56 am
Posts: 9
Location: Kitchener, Ontario, Canadia
Nice, that is very helpful. (Now how do I read Rust... :p)

The main thing I'm after right now is understanding just what, exactly, is going on at each clock cycle. So for example https://github.com/Gekkio/mooneye-gb/bl ... of-push-rr we can see many "internal delay" stages; my goal is to understand what's going on during those delays. Maybe it ultimately doesn't matter for emulation, but it definitely matters for my curiosity. ;)
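
For reference, the PUSH rr timing can be written out as data. This is a sketch based on my reading of the Mooneye notes (fetch, one internal delay, then two stack writes); the stage labels and helper names are my own:

```python
# (M-cycle, bus activity, description) for push rr
PUSH_RR_STAGES = [
    ("M1", "read", "opcode fetch"),
    ("M2", "idle", "internal delay"),
    ("M3", "write", "high byte to stack"),
    ("M4", "write", "low byte to stack"),
]


def total_clocks(stages, clocks_per_m=4):
    """Every stage costs a full machine cycle, idle or not."""
    return len(stages) * clocks_per_m


def bus_accesses(stages):
    """Count only the stages that actually touch the bus."""
    return sum(1 for _, kind, _ in stages if kind != "idle")


print(total_clocks(PUSH_RR_STAGES), bus_accesses(PUSH_RR_STAGES))
```

Three bus accesses would predict 12 clocks under the "4 clocks per memory access" rule, so the idle M2 is exactly the extra cycle that rule cannot account for.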

PostPosted: Fri Apr 01, 2016 10:19 am
Joined: Tue Mar 29, 2016 8:56 am
Posts: 9
Location: Kitchener, Ontario, Canadia
So I've done more analysis in "decode" sheet based on a better understanding of how processors actually work (this was very helpful) and the information from Mooneye. The decoding logic I've come up with probably isn't entirely the same as the real thing but it matches the known timings for almost all instructions and seems to make sense. (I don't know how understandable it is, though!)

The few instructions that continue to confuse me are:

  • LD (i16),SP: the only instruction which has, in the "load" stage, a load into W
  • LD SP, SP+i8 and LD HL, SP+i8: why it takes longer to write the result into SP than into HL, and why it doesn't do "SP = SP+W" in the load stage
  • RET cc: I have no idea what it does during M2. Mooneye states that the PC is popped in M3 and M4. M2 happens even when the condition isn't met, and I can't figure out what it could be doing.

All of those involve SP, which makes me think there's some extra logic required to do certain operations on it?

Also, I left some stages out of some M-states when they aren't used, just for clarity. The way I imagine it, the instruction decode logic takes the current instruction, the current M-state, and some other inputs, and outputs a lot of control signals like "do we modify a register? which one? what source? what modification?" (the "write reg" stage) and "do we copy/shift the low byte of W into the high byte?", but for the M-states where the answer is always "no/none", I left those columns out.
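
That kind of decode logic can be pictured as a lookup from (opcode, M-state) to control signals. The table below is purely illustrative: the signal names are hypothetical and only two opcodes are filled in, so it says nothing about the real chip:

```python
from typing import NamedTuple, Optional


class Control(NamedTuple):
    bus: str                  # "read", "write", or "none"
    write_reg: Optional[str]  # register updated in this M-state, if any
    last: bool                # True when the instruction ends after this M-state


# (opcode, M-state) -> control signals; entries are illustrative only
DECODE = {
    (0x00, 1): Control("read", None, True),   # nop: fetch, nothing else
    (0x3E, 1): Control("read", None, False),  # ld a,d8: fetch opcode
    (0x3E, 2): Control("read", "A", True),    # ld a,d8: read operand into A
}


def m_states(opcode):
    """Walk the table until an M-state is marked as the last one."""
    count = 1
    while not DECODE[(opcode, count)].last:
        count += 1
    return count


print(m_states(0x00), m_states(0x3E))
```

Columns that are always "no/none" for a given M-state simply never appear in such a table, which matches leaving them out of the sheet.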

Perhaps it's impossible to know without decapping the chip, what exactly is going on under the hood when it's not generating any external signals we can analyze. On the plus side, at least we don't need to know to emulate it...

PostPosted: Fri Apr 01, 2016 11:08 am
Joined: Fri Oct 16, 2015 6:18 am
Posts: 32
You're very quickly delving deep into the "not measurable nor known" areas of Game Boy hardware. I share your curiosity, but there is no documentation about these things and if things are not measurable, it's just conjecture.
Of course, if a model such as your decoding model accurately predicts things, it is still useful even if it is not completely based on proof. :)

Did you notice the logic analysis directory under tests in the mooneye-gb repository? I've done some logic analysis on the Game Boy hardware, and you might be interested in things like the write and read timings in the external bus. The git repository has tex files, but I've got them rendered here:

http://gekkio.fi/files/mooneye-gb/night ... timing.png
http://gekkio.fi/files/mooneye-gb/night ... timing.png

I still need to verify when the data is sampled during a read cycle, but my current guess is the third falling or rising edge.
Also, I have not yet verified the exact durations of how long signals are valid, because the logic analyser cannot detect hi-z states.

Quote:
based on a better understanding of how processors actually work (this was very helpful)


If you're interested in more material, I can recommend the book Digital Design and Computer Architecture

Quote:
Perhaps it's impossible to know without decapping the chip, what exactly is going on under the hood when it's not generating any external signals we can analyze. On the plus side, at least we don't need to know to emulate it...


If you've got a couple thousand hours of spare time, take a look at the DMG CPU photos here :)

http://siliconpr0n.org/archive/doku.php ... :dmg-cpu_b

I've already done some initial tracing of signals, but the lack of layers other than the top layer makes it (probably) impossible to identify the transistor structures.


PostPosted: Fri Apr 01, 2016 11:37 am
Joined: Tue Mar 29, 2016 8:56 am
Posts: 9
Location: Kitchener, Ontario, Canadia
Yeah, there's a lot of guesswork involved here. It's mainly just trying to come up with the most logical design that fits what we know and is nice and simple (=cheap). Unfortunately I wouldn't know where to begin actually looking at the decapped circuits.

I did see those directories but I didn't know there were PNG versions. That might be very handy.

My model does predict a few interesting things:

  • If the CB prefix is an instruction itself, which sets a "use alternate decoding logic for next instruction" flag, then it might be possible for an interrupt to fire before that next instruction is fetched. This would be pretty bad, so probably it can temporarily block interrupts (or isn't a separate instruction).
  • Instructions D3 and DB might be "JP ?,i16" and "JP N?,i16" where ? is some unknown condition flag. (C3 is JP always, so CB could be JP never, but it's special-cased for the ALU instead.)
  • Instruction DD might function as "CALLI i16", i.e. call and enable (or disable?) interrupts.
  • Instructions E4 and F4 might do "LD (?+i8),A" and "LD A,(?+i8)".
  • Instructions EC and FC would be "LD SP+i8,SP" and "LD SP+i8,HL", which makes no sense, so who knows what they'd actually do...
  • Instructions ED and FD might be "LD HL,PC" and "LD HL,SP"?
  • No idea what E3 and EB would do. Possibly clear/set some control flag (like F3 and FB clear/set IME).
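
The opcodes in this list are the eleven with no documented instruction. A small sketch of how one might look up their defined neighbours in the same column of the opcode matrix, which is the kind of pattern these guesses rely on (the helper names are made up):

```python
# The eleven opcodes with no documented instruction
UNDEFINED_OPCODES = frozenset(
    [0xD3, 0xDB, 0xDD, 0xE3, 0xE4, 0xEB, 0xEC, 0xED, 0xF4, 0xFC, 0xFD]
)


def is_undefined(opcode):
    return opcode in UNDEFINED_OPCODES


def neighbours(opcode):
    """Defined opcodes in rows C-F that share the same low nibble (column)."""
    return [
        op
        for op in range(0xC0, 0x100)
        if op & 0x0F == opcode & 0x0F and not is_undefined(op)
    ]


print([hex(op) for op in neighbours(0xD3)])
```

For D3 this returns C3 (JP) and F3 (DI), the two defined opcodes the guesses for D3 and E3 are extrapolated from.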

I really need a hardware rig to test this on. Probably everyone has already tried these and they just lock up?

BTW, corrected the matrix sheet. Anonymous users should be able to add notes/comments.

PostPosted: Fri Apr 01, 2016 11:46 am
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 18681
Location: NE Indiana, USA (NTSC)
Rena wrote:
If the CB prefix is an instruction itself, which sets a "use alternate decoding logic for next instruction" flag, then it might be possible for an interrupt to fire before that next instruction is fetched. This would be pretty bad, so probably it can temporarily block interrupts (or isn't a separate instruction).

Unless the CB bit is in the status flags, as is the case for the T flag that the HuC6280's SET prefix sets. (SET-prefixed ALU ops use ZP[X] instead of the accumulator.)


PostPosted: Fri Apr 01, 2016 12:46 pm
Joined: Tue Mar 29, 2016 8:56 am
Posts: 9
Location: Kitchener, Ontario, Canadia
The Game Boy doesn't automatically push or pop anything when servicing an interrupt, except for pushing PC (it basically forces a CALL nn). So if CB were one of the four "always 0" flags, we'd expect to see one of these outcomes in that situation:

  • The CB prefix applies to the first instruction of the ISR (usually a push), and everything breaks
  • The CB flag is automatically cleared, and the interrupted instruction gets decoded incorrectly on resume, and everything breaks
  • The CB flag somehow doesn't affect the ISR, and we'd see it in the stack (hopefully your ISR does a PUSH AF toward the beginning)

Only the third case would not lead to incorrect instruction decoding, and I don't know how it would happen.
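
For context, interrupt dispatch as a "forced CALL" can be sketched like this. The vector addresses and the clearing of IME come from general Game Boy documentation rather than from this thread, and the dictionary-based CPU state is just a stand-in:

```python
# Interrupt bit -> handler address (VBlank, LCD STAT, Timer, Serial, Joypad)
VECTORS = {0: 0x0040, 1: 0x0048, 2: 0x0050, 3: 0x0058, 4: 0x0060}


def service_interrupt(cpu, memory, bit):
    """Behaves like a forced CALL: push PC, jump to the vector. F is not saved."""
    cpu["ime"] = False
    cpu["sp"] = (cpu["sp"] - 1) & 0xFFFF
    memory[cpu["sp"]] = cpu["pc"] >> 8      # high byte of return address
    cpu["sp"] = (cpu["sp"] - 1) & 0xFFFF
    memory[cpu["sp"]] = cpu["pc"] & 0xFF    # low byte of return address
    cpu["pc"] = VECTORS[bit]


cpu = {"pc": 0x1234, "sp": 0xFFFE, "f": 0xB0, "ime": True}
memory = {}
service_interrupt(cpu, memory, 0)
print(hex(cpu["pc"]), hex(cpu["sp"]), hex(cpu["f"]))
```

Since F is untouched here, any hidden bit living in F would still be set when the first ISR instruction is decoded, unless the hardware treats it specially.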

If someone cared to test (and had a flash cart to do it with), it should be simple enough to set up a test that would have an interrupt trigger just after the CB prefix is read, and then watch the address bus to see if the next memory access is to the stack (to push PC) or to fetch the next instruction, and have your program look at the pushed F register to check for unused bits being set.

Another test would be:
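ld bc, $01FF   ; B = $01, C = $FF
push bc
pop af         ; A = $01, F = $FF (if the low four bits of F are writable)
nop            ; opcode $00; under a CB prefix this would decode as RLC B

If the CB flag is in the status flags (and POP itself doesn't clear it), the NOP should be interpreted as RLC B, so B will be $02 afterward (RLC rotates bit 7 into bit 0 without going through carry, and bit 7 of $01 is clear).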
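
The "look at the pushed F register" check could be modelled like this. It is a sketch assuming the commonly documented behaviour that the low four bits of F always read back as 0; the hidden_bits_writable switch is a hypothetical knob for the alternative being tested:

```python
F_MASK = 0xF0  # assumption: only Z, N, H, C are stored in F


def pop_af(word, hidden_bits_writable=False):
    """Split a 16-bit stack word into A and F, masking F unless told otherwise."""
    a = word >> 8
    f = word & 0xFF
    if not hidden_bits_writable:
        f &= F_MASK
    return a, f


def push_af(a, f):
    """Recombine A and F into the word that would land on the stack."""
    return (a << 8) | f


documented = push_af(*pop_af(0x01FF))
hypothetical = push_af(*pop_af(0x01FF, hidden_bits_writable=True))
print(hex(documented), hex(hypothetical))
```

A test ROM that pops $01FF into AF, pushes AF again, and reads the stack would distinguish the two cases by the low nibble of the stored F byte.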
