cycle for cycle stuff

Discuss emulation of the Nintendo Entertainment System and Famicom.


User avatar
blargg
Posts: 3715
Joined: Mon Sep 27, 2004 8:33 am
Location: Central Texas, USA
Contact:

Post by blargg » Sun May 07, 2006 3:35 pm

When you write in C, you aren't specifying the machine code that comes out; you're specifying the external side-effects that must occur when executing the code. Since the two versions have the same side-effects, and since such patterns occur commonly and are easy for a compiler to optimize, compilers generate the same code for both.

User avatar
baisoku
Posts: 121
Joined: Thu Nov 11, 2004 5:30 am
Location: San Francisco, CA
Contact:

Post by baisoku » Sun May 07, 2006 5:51 pm

WedNESday wrote:The return would cause the function to exit straight away, whilst the break would cause the program to exit the switch and then return from the function. Surely the version without the break is faster?
No, it is not.
...patience...

mozz
Posts: 94
Joined: Mon Mar 06, 2006 3:42 pm
Location: Montreal, canada

Post by mozz » Sun May 07, 2006 7:59 pm

WedNESday wrote:The return would cause the function to exit straight away, whilst the break would cause the program to exit the switch and then return from the function. Surely the version without the break is faster?
They should have exactly the same performance on any decent compiler, because of the optimizer. Those two programs are exactly equivalent if you convert them into a DAG representing their control flow (which the optimizer does, probably even at -O1 or -O0).
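For illustration, a minimal sketch of the two variants (the opcodes and cycle counts are just placeholders), which any optimizing compiler will reduce to the same control flow:

Code: Select all

int cycles_with_return(int opcode) {
    switch (opcode) {
        case 0xA9: return 2;   // leave the function immediately
        case 0xAD: return 4;
        default:   return 0;
    }
}

int cycles_with_break(int opcode) {
    int cycles = 0;
    switch (opcode) {
        case 0xA9: cycles = 2; break;   // leave the switch first...
        case 0xAD: cycles = 4; break;
        default:   cycles = 0; break;
    }
    return cycles;                      // ...then return from the function
}
Compiling both with optimizations and comparing the assembly output (e.g. gcc -S) will typically show identical code.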

WedNESday
Posts: 1236
Joined: Thu Sep 15, 2005 9:23 am
Location: Berlin, Germany
Contact:

Post by WedNESday » Mon May 08, 2006 5:33 am

Fx3 wrote:Compile both versions and compare the binaries using a hex editor. ^_^;;..
Never!

I believe you guys. However, if you had a really crap compiler then I would be right :lol: .

WedNESday
Posts: 1236
Joined: Thu Sep 15, 2005 9:23 am
Location: Berlin, Germany
Contact:

Post by WedNESday » Tue May 09, 2006 9:55 am

Observe the following:

Code: Select all

Read-Modify-Write instructions (ASL, LSR, ROL, ROR, INC, DEC,
                                     SLO, SRE, RLA, RRA, ISB, DCP)

        #   address  R/W description
       --- --------- --- ------------------------------------------
        1    PC       R  fetch opcode, increment PC
        2    PC       R  fetch low byte of address, increment PC
        3    PC       R  fetch high byte of address,
                         add index register X to low address byte,
                         increment PC
        4  address+X* R  read from effective address,
                         fix the high byte of effective address
        5  address+X  R  re-read from effective address
        6  address+X  W  write the value back to effective address,
                         and do the operation on it
        7  address+X  W  write the new value to effective address
When it says 'read from effective address', would that affect the VRAM register if the address were $2007? Also, is it necessary to read, then re-read, or can that be done just the once? (My concern is that some kind of mapper or external device may cause a bank switch or something.)

mattmatteh
Posts: 345
Joined: Fri Jul 29, 2005 3:40 pm
Location: near chicago
Contact:

Post by mattmatteh » Tue May 09, 2006 12:55 pm

blargg tested this and it does affect the PPU registers, so if there were two reads or whatever, the PPU would see both reads. I looked at one game and it used the absolute addressing mode for the PPU registers; perhaps that's why? Maybe check which addressing modes games use to read/write the PPU registers and to write to the MMC registers.

matt

mozz
Posts: 94
Joined: Mon Mar 06, 2006 3:42 pm
Location: Montreal, canada

Post by mozz » Tue May 09, 2006 6:15 pm

mattmatteh wrote:blargg tested this and it does affect the PPU registers, so if there were two reads or whatever, the PPU would see both reads. I looked at one game and it used the absolute addressing mode for the PPU registers; perhaps that's why? Maybe check which addressing modes games use to read/write the PPU registers and to write to the MMC registers.

matt
The CPU is either reading or writing at some address in every cycle. That table lists the addresses, too. If you want the most accurate emulation, you should simulate every read and every write (even the "dummy" ones that the CPU doesn't use the value of). I don't know if there are any games that rely on that behaviour, but that's what the 6502 in the NES did.
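For illustration, here is a rough sketch of the quoted table as code, where every call to the (hypothetical) bus helpers cpu_read()/cpu_write() is one CPU cycle on the bus, so mappers and PPU/APU registers see the dummy accesses too:

Code: Select all

#include <cstdint>

extern uint16_t pc;   // hypothetical CPU state
extern uint8_t  x;
uint8_t cpu_read(uint16_t addr);
void    cpu_write(uint16_t addr, uint8_t value);

// INC abs,X following the quoted table (cycle 1, the opcode fetch,
// has already happened).
void op_inc_absx() {
    uint8_t  lo   = cpu_read(pc++);                  // cycle 2
    uint8_t  hi   = cpu_read(pc++);                  // cycle 3
    uint16_t base = (uint16_t)(lo | (hi << 8));
    uint16_t addr = (uint16_t)(base + x);

    cpu_read((base & 0xFF00) | (addr & 0x00FF));     // cycle 4: dummy read, high byte not yet fixed
    uint8_t value = cpu_read(addr);                  // cycle 5: re-read from the fixed address
    cpu_write(addr, value);                          // cycle 6: dummy write of the old value
    value++;                                         // do the operation
    cpu_write(addr, value);                          // cycle 7: write the new value
    // (N/Z flag updates omitted)
}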

tepples
Posts: 22092
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Post by tepples » Tue May 09, 2006 6:33 pm

WedNESday wrote:Observe the following:

Code: Select all

[...]
        4  address+X* R  read from effective address,
                         fix the high byte of effective address
        5  address+X  R  re-read from effective address
        6  address+X  W  write the value back to effective address,
                         and do the operation on it
        7  address+X  W  write the new value to effective address
When it says 'read from effective address', would that affect the VRAM register if the address were $2007? Also, is it necessary to read, then re-read, or can that be done just the once? (My concern is that some kind of mapper or external device may cause a bank switch or something.)
Yes. Every read or write of $2007 advances the VRAM pointer, so here it is advanced by 3 or 4 depending on whether index addition crosses a page boundary. And yes, if you want to emulate the bus accurately, you have to make both reads (the first one at the unfixed address when index addition crosses a page boundary), and you have to make both writes.
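A minimal sketch of the $2007 read side (the names vram_addr, vram_increment and read_buffer are illustrative, and palette reads are ignored); the point is only that the address advances on every access, dummy or not, and writes go through the same increment:

Code: Select all

#include <cstdint>

extern uint16_t vram_addr;        // PPU's current VRAM address
extern uint16_t vram_increment;   // 1 or 32, from bit 2 of $2000
extern uint8_t  vram[0x4000];
extern uint8_t  read_buffer;      // $2007 reads are buffered

uint8_t read_2007() {
    uint8_t result = read_buffer;
    read_buffer = vram[vram_addr & 0x3FFF];
    vram_addr += vram_increment;  // advances whether or not the CPU uses the value
    return result;
}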

Near
Founder of higan project
Posts: 1550
Joined: Mon Mar 27, 2006 5:23 pm

Post by Near » Thu May 11, 2006 11:00 am

Alright, figured I'd post about this here as well since we were discussing cooperative multithreading, etc previously in this thread.

Basically, what I've found out about cooperative multithreading is that there is an implementation in Win95/NT3.51+ called Windows Fibers. On my Athlon 3500+, it is generally 10-15x slower to call SwitchToFiber(my_proc) than it is to call my_proc() directly. And it's also Win32-specific.

However, I was able to write my own implementation of the library that works on any x86 target including Win32, Linux and FreeBSD (excepting maybe OSx86), and get that number down to 6-7x slower.

My cothreading library creates separate stacks for each cothread, supports direct jumping or infinite call/return recursion of threads, and saves/restores everything the C/C++ ABI for win+lin+bsd specifies for each context switch (basically ebx+ebp+esi+edi, ignoring eax+ecx+edx+ST(0)-ST(7)+mm0-mm15). And context switches occur in roughly 14 opcodes, requiring no user<>kernel transitions.

The source to the library is here : http://byuu.cinnamonpirate.com/temp/libco_x86.asm

And the entire package is here :
http://byuu.cinnamonpirate.com/temp/libco_v03.zip

This basically allows for exactly the kind of implementation I was talking about before, and should hopefully be very fast as well. Even though you have the overhead of switching contexts instead of just calling a subroutine, you should gain it back by not having to use a state machine (switch/case) every time you call your CPU core.

So for example, the following code :
main() -> cpu_run() -> switch(state) case OPCODE_EXEC: -> exec_opcode() -> switch(opcode) case 0xa9: -> op_a9() -> switch(opcode_cycle) case 2:
Will become :
main() -> co_call(cpu_context) -> /* we are now within case 2: */

You also no longer *have* to break out of the CPU core after every opcode cycle; you can just test the memory address, breaking only if it affects another unit (e.g. it is a PPU/APU read/write operation), or if a significant amount of time has passed. It also allows for more accuracy, as you can trivially simulate bus hold delays now.

The actual CPU core can be implemented exactly like an opcode-based core, putting the co_return() calls inside the op_read() and op_write() functions alone.

So you gain more speed, more accuracy, and much cleaner code. But there are a few drawbacks to this approach that will probably mean nobody ever uses this besides myself:

1) Platform dependence. The library is trivial to implement on any platform, but it would take a good programmer with knowledge of assembler to do it right.
2) Context switching requires flushing the pipeline/L1 cache or whatever, meaning performance will be worse on more deeply pipelined processors (such as the P4) than on more moderate designs (Athlons, older processors...). Basically, a lower clock multiplier will probably result in faster performance. Processors with more than 8 volatile registers could require significantly more time to save/restore the processor context.
3) And this is the big one... you can't save and restore the state of threads in a platform-independent way, meaning you can't have savestates. It may be possible to save the context buffer + stack into a savestate and restore it, but the format would change with even a minimal code change, breaking older savestates in the process. Power/reset events are still possible; one need only destroy and recreate the CPU thread.

Always a tradeoff, right? I'm personally going to write a CPU+APU core, and then write the code to run them so that I have both an opcode-based, less accurate core supporting savestates, and a bus-accurate core that will not support savestates.

So, ideas, comments? Would anyone else consider using something like this?

-----

An example opcode using cothreading :

Code: Select all

void op_lda_addr() {
  aa.l = op_read(regs.pc++);
  aa.h = op_read(regs.pc++);
  cpu_io();
  regs.a.l = op_read(aa.w);
  regs.p.n = bool(regs.a.l & 0x80);
  regs.p.z = regs.a.l == 0;
}

uint op_read(uint addr) {
  add_clocks(4);
#ifdef USE_COTHREADS
  co_return(); //synchronization magic!
#endif
  uint r = mem_read(addr);
  add_clocks(4);
  return r;
}
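
To show the calling side of the same scheme, here is a rough sketch of a host loop driving such a core; co_call() and cpu_context follow the names used above, while cpu_clock, ppu_run_to() and CLOCKS_PER_FRAME are hypothetical:

Code: Select all

#include <cstdint>

extern void*    cpu_context;            // cothread created for the CPU core
extern uint64_t cpu_clock;              // advanced by add_clocks()
extern const uint64_t CLOCKS_PER_FRAME;
void co_call(void* context);
void ppu_run_to(uint64_t clock);

void run_frame() {
    while (cpu_clock < CLOCKS_PER_FRAME) {
        co_call(cpu_context);   // resume the CPU; it runs until op_read()/op_write() hits co_return()
        ppu_run_to(cpu_clock);  // let the PPU catch up before the pending bus access takes effect
    }
}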

WedNESday
Posts: 1236
Joined: Thu Sep 15, 2005 9:23 am
Location: Berlin, Germany
Contact:

Post by WedNESday » Sat May 20, 2006 3:07 pm

Can anybody enlighten me on this:

Code: Select all

Relative addressing (BCC, BCS, BNE, BEQ, BPL, BMI, BVC, BVS)

        #   address  R/W description
       --- --------- --- ---------------------------------------------
        1     PC      R  fetch opcode, increment PC
        2     PC      R  fetch operand, increment PC
        3     PC      R  Fetch opcode of next instruction,
                         If branch is taken, add operand to PCL.
                         Otherwise increment PC.
        4+    PC*     R  Fetch opcode of next instruction.
                         Fix PCH. If it did not change, increment PC.
        5!    PC      R  Fetch opcode of next instruction,
                         increment PC.

       Notes: The opcode fetch of the next instruction is included to
              this diagram for illustration purposes. When determining
              real execution times, remember to subtract the last
              cycle.

              * The high byte of Program Counter (PCH) may be invalid
                at this time, i.e. it may be smaller or bigger by $100.

              + If branch is taken, this cycle will be executed.

              ! If branch occurs to different page, this cycle will be
                executed.
It's the only thing left that I have yet to do. It seems to me that there are way too many 'increment PC's in there. It must check the branch condition on cycle 2 and not cycle 3, otherwise the minimum number of cycles required would be 3, which is not the case.

User avatar
blargg
Posts: 3715
Joined: Mon Sep 27, 2004 8:33 am
Location: Central Texas, USA
Contact:

Post by blargg » Sat May 20, 2006 5:18 pm

The description looks fine to me (the following is based on what you quoted). If you read the description, several of the PC increments are conditional. The first step, fetch opcode, is performed on the last cycle of the previous instruction. The second step fetches the branch offset. The third step fetches the next instruction regardless of whether the branch is taken. If it was not taken, PC is incremented and the branch is complete.

If the branch was taken, the fourth step fetches the new opcode. If no page-crossing occurred, the PC is incremented and the branch is complete. If a page-crossing did occur, then the high byte of the PC needs to be fixed and the new opcode fetched one final time in step 5, and the PC incremented.
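
For reference, a hedged sketch of that logic in C++, using a 16-bit pc and a hypothetical cpu_read() that performs one bus access per cycle (the branch's own opcode fetch, cycle 1, has already happened and the condition is already evaluated):

Code: Select all

#include <cstdint>

extern uint16_t pc;
uint8_t cpu_read(uint16_t addr);

void op_branch(bool taken) {
    int8_t offset = (int8_t)cpu_read(pc++);    // cycle 2: fetch operand

    if (!taken)
        return;                                // 2 cycles; the next fetch is the next opcode

    cpu_read(pc);                              // cycle 3: dummy fetch at the old PC
    uint16_t target = (uint16_t)(pc + offset);
    pc = (pc & 0xFF00) | (target & 0x00FF);    // add operand to PCL only

    if (pc == target)
        return;                                // 3 cycles; no page crossing

    cpu_read(pc);                              // cycle 4: dummy fetch with unfixed PCH
    pc = target;                               // fix PCH; cycle 5 is the real opcode fetch
}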

augnober
Posts: 23
Joined: Sun Jan 08, 2006 12:22 pm

Post by augnober » Sun May 21, 2006 8:04 pm

byuu wrote:3) And this is the big one... you can't save and restore the state of threads in a platform-independant way, meaning you can't have savestates... it may be possible to save the context-buffer + stack into a savestate and restore it, but the format would change with even a minimal code change, breaking older savestates in the process. Power/reset events are still possible, one need only destroy and recreate the CPU thread.
Confucius says.. Platforms and emulators may come and go, but the NES will never change 8)

Near
Founder of higan project
Posts: 1550
Joined: Mon Mar 27, 2006 5:23 pm

Post by Near » Sun May 21, 2006 10:15 pm

augnober wrote:Confucius says.. Platforms and emulators may come and go, but the NES will never change 8)
Uhh... what?

augnober
Posts: 23
Joined: Sun Jan 08, 2006 12:22 pm

Post by augnober » Mon May 22, 2006 12:01 am

byuu wrote:
augnober wrote:Confucius says.. Platforms and emulators may come and go, but the NES will never change 8)
Uhh... what?
Sorry about that. You'd made it sound like it wouldn't be possible to avoid savestate versioning issues... but since the NES is static, it's more a matter of how practical it is. My thought was that to avoid version dependence, you could be careful to only write out data which is a snapshot of the emulated NES internals (only the essentials). All emulators would need this and only this data. If you could find a way for that to be sufficient without making the export and import too unwieldy, there would be no platform or emulator dependency (the only platform it would depend on is the NES, which any future version will emulate anyway).

Edit: To put this in perspective, it would have been possible to save the state of a real NES back in the '80s (well, in theory anyway) and load it up in an emulator today, provided the data saved were sufficient. Presumably this data would have been in bits and pieces with no obvious original organization, and the person who made the record would have needed to organize the data in a presentable fashion. This could just as easily (or with just as much difficulty) be done from an emulator today.

Near
Founder of higan project
Posts: 1550
Joined: Mon Mar 27, 2006 5:23 pm

Post by Near » Mon May 22, 2006 8:29 am

augnober wrote:You'd made it sound like it wouldn't be possible to avoid savestate versioning issues... but since the NES is static, it's more a matter of how practical it is.
This isn't related to the NES. The reason you can't use savestates with cooperative multithreading is that you can't save the stack + context registers of each thread into a savefile with any reliability that it will ever work again, let alone on a different OS or processor.

You can save the current state of the NES, but you wouldn't be able to load it. You'd have no way to set program execution to the correct point in each of the threads. I'm sure it's possible to create a semi-hybrid design that would allow savestates + partial multithreading, but it would lose the main advantages of using threads in the first place: code simplicity and possibly accuracy.
