faster 'P' emulation

Posts: 1236
Joined: Thu Sep 15, 2005 9:23 am
Location: Berlin, Germany

Post by WedNESday » Tue Feb 07, 2006 7:15 am

I have read somewhere that it is possible to use the host's 'P' status flags as we would use the NES's. This would mean less code within our CPU emulators. We could then use assembler to access these flags. However, I fear that after a register transfer has been made, changing the PC and CC could overwrite our work. For example:

Code: Select all

inline void OpticCode98()
	^^ Would set the necessary flags
	CPU.CC += 2;
	^^ Would reset the necessary flags?
Can anyone shed more light on this?

Posts: 48
Joined: Mon Sep 20, 2004 6:47 am

Post by RoboNes » Tue Feb 07, 2006 11:10 am

I believe CPU flags are saved and restored when the OS switches to another running program, if that's what you wanted to know.

Posts: 3715
Joined: Mon Sep 27, 2004 8:33 am
Location: Central Texas, USA

Post by blargg » Tue Feb 07, 2006 1:16 pm

WedNESday is probably referring to legacy processor architectures with only one set of status flags, which are set as if the result of the last operation was compared with zero (like the 6502: for example LDA, ORA, ADC, etc.). On those, any intervening operations between the flag setting and the branch must be severely limited (STA doesn't affect the flags). It's unlikely that using the host's status flags would give a speed benefit, because accessing them probably stalls the pipeline, as it's not a common operation to need.

WedNESday, if you do a profile of how often each instruction is used, and look at what 6502 status flags they modify, you might find some opportunities for optimization without using non-portable techniques like this.

Posts: 1236
Joined: Thu Sep 15, 2005 9:23 am
Location: Berlin, Germany

Post by WedNESday » Tue Feb 07, 2006 2:41 pm

Here is what I mean exactly. I want to use the x86 processor's status flags while the emulator is running. I know that this method of implementation is possible.

Code: Select all

inline void OpticCode98()
{
   CPU.A = CPU.Y;

      // if CPU.Y == 0 then the x86's zero flag would be set, no problem

   CPU.CC += 2;

      // but incrementing these two would modify the x86 status flags, therefore losing the data
}
If we could retain the status flags register in the way that we wanted then we could omit data like the following...

Code: Select all

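CPU.P &= 0x7D;              // clear N (bit 7) and Z (bit 1)
if( !CPU.A )
    CPU.P |= 0x02;          // Z is set when the result is zero
CPU.P |= (CPU.A & 0x80);    // N copies bit 7 of the result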
...from just about every instruction. That would be an obvious speed increase.
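
The N/Z update quoted above is identical for most load, transfer, and ALU instructions, so one option that stays in portable C is to factor it into a single helper. This is only a minimal sketch: the `Cpu6502` struct, its field names, and the function names are assumptions made for illustration, not code from the post.

```c
#include <stdint.h>

#define FLAG_Z 0x02u  /* zero flag, bit 1 of P */
#define FLAG_N 0x80u  /* negative flag, bit 7 of P */

/* Hypothetical register file; the field names are assumptions. */
typedef struct {
    uint8_t a, x, y, p;
} Cpu6502;

/* Recompute N and Z from an 8-bit result, leaving the other flags alone. */
static inline void set_nz(Cpu6502 *cpu, uint8_t value)
{
    cpu->p &= (uint8_t)~(FLAG_N | FLAG_Z);   /* same as P &= 0x7D */
    if (value == 0)
        cpu->p |= FLAG_Z;
    cpu->p |= (uint8_t)(value & FLAG_N);     /* bit 7 of the result is N */
}

/* TYA (opcode 0x98): copy Y to A and update N/Z. */
static inline void op_tya(Cpu6502 *cpu)
{
    cpu->a = cpu->y;
    set_nz(cpu, cpu->a);
}
```

Whether this is fast enough is something a profile would have to show; it only removes the duplication, not the work.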

Posts: 21983
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Post by tepples » Tue Feb 07, 2006 4:59 pm

WedNESday wrote: Here is what I mean exactly. I want to use the x86 processor's status flags while the emulator is running.
For that, you probably have to use assembly language. C definitely won't work, but C-- (C minus minus) might.

Posts: 134
Joined: Mon Sep 20, 2004 11:13 am
Location: Sweden

Post by Nessie » Tue Feb 07, 2006 6:45 pm

Instead of keeping all flags in a single byte, you should use one boolean for each flag. That way, you won't have to mask out any bits whenever you want to access a flag. When you push the status register to the stack you just convert those eight booleans to a single byte (and the other way around when you pull the status from the stack).
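
The boolean-per-flag layout and the conversion on push/pull could look roughly like this in C. The `Flags` struct and the function names are assumptions for illustration. The sketch keeps six booleans rather than eight, because bits 4 (B) and 5 of the 6502 status byte are not stored flags: they only exist in the byte that gets pushed.

```c
#include <stdbool.h>
#include <stdint.h>

/* One boolean per 6502 flag; struct and field names are assumptions. */
typedef struct {
    bool c, z, i, d, v, n;
} Flags;

/* Build the byte that gets pushed. Bit 5 always reads as 1; bit 4 (B)
   is set for PHP/BRK pushes and clear for IRQ/NMI pushes. */
static uint8_t pack_flags(const Flags *f, bool brk)
{
    return (uint8_t)((f->c ? 0x01 : 0) | (f->z ? 0x02 : 0) |
                     (f->i ? 0x04 : 0) | (f->d ? 0x08 : 0) |
                     (brk  ? 0x10 : 0) | 0x20 |
                     (f->v ? 0x40 : 0) | (f->n ? 0x80 : 0));
}

/* Split a byte pulled by PLP/RTI back into booleans; bits 4 and 5 are ignored. */
static void unpack_flags(Flags *f, uint8_t p)
{
    f->c = (p & 0x01) != 0;
    f->z = (p & 0x02) != 0;
    f->i = (p & 0x04) != 0;
    f->d = (p & 0x08) != 0;
    f->v = (p & 0x40) != 0;
    f->n = (p & 0x80) != 0;
}
```

The conversion only runs on PHP, PLP, BRK, RTI, and interrupts, which are rare compared with the instructions that set or test a single flag.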

You don't want to keep your flags in the x86 flag register (too much overhead), but you can use the x86 flags after an arithmetic operation to set your own boolean flag vars.
Here's an ADC example in assembly. I'm not sure exactly how to make your C compiler understand asm.

Code: Select all

// ADC, operand is in al

shr flagC, 1  // Put C flag into x86 register
adc a, al     // Do an adc and let the x86 set all flags

sets flagS    // Yay, here we
setz flagZ    // use the x86 flags
setc flagC    // to set our booleans
seto flagV    // to the proper values
I believe this is how Q did it in Nintendulator, you should check his source.
There's also blargg's approach where you don't evaluate any flags until you need to. It's described somewhere in the wiki.
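
A rough sketch of the deferred-evaluation idea, assuming it takes the common form of remembering the last result and deriving N and Z only when something actually reads them. This is an illustration in C with made-up names (`LazyCpu`, `nz`, `get_p`, `set_p`); it is not a transcription of blargg's code or of the wiki page.

```c
#include <stdint.h>

/* Hypothetical CPU state; all names are assumptions for illustration. */
typedef struct {
    uint8_t  a, y;
    uint16_t nz;   /* low byte: last result (Z if zero); bits 7-8: N */
    uint8_t  p;    /* the remaining status bits, kept eagerly */
} LazyCpu;

/* Instructions just record the result: one store, no masking or branching. */
static inline void op_tya(LazyCpu *cpu)
{
    cpu->a  = cpu->y;
    cpu->nz = cpu->a;
}

/* Flags are derived only when a branch or a status push needs them. */
static inline int flag_z(const LazyCpu *cpu) { return (cpu->nz & 0xFF) == 0; }
static inline int flag_n(const LazyCpu *cpu) { return (cpu->nz & 0x180) != 0; }

/* Fold the lazy N/Z back into a status byte (bit 4/5 handling omitted). */
static uint8_t get_p(const LazyCpu *cpu)
{
    uint8_t p = (uint8_t)(cpu->p & 0x7D);
    if (flag_z(cpu)) p |= 0x02;
    if (flag_n(cpu)) p |= 0x80;
    return p;
}

/* Load a status byte. Bit 8 of nz lets N and Z both be set, which no
   single 8-bit result could represent. */
static void set_p(LazyCpu *cpu, uint8_t p)
{
    cpu->p  = (uint8_t)(p & 0x7D);
    cpu->nz = (uint16_t)(((p & 0x80) << 1) | ((p & 0x02) ? 0 : 1));
}
```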


Posts: 94
Joined: Mon Mar 06, 2006 3:42 pm
Location: Montreal, canada

Post by mozz » Tue Mar 07, 2006 12:09 am

Nessie wrote: Instead of keeping all flags in a single byte, you should use one boolean for each flag.
For a while I was working on x86 assembly code for an SPC700 core (the sound chip for SNES, which is 6502-like). Using the x86 flags is a nice trick you can do if your core is written in assembly. In nearly all cases, the x86 instructions incidentally compute into x86 flags the values you need for the 6502 flags. Here's some example code to save away the x86 flags. This is common tail code I would stick right before my dispatch loop and jump into for non-RMW instructions (warning: UNTESTED):

Code: Select all

vhcnz_tail:     seto    [ebp+FLAG_V]                ; 4 *3    0000000V
                lahf                                ; 1  1
                mov     [ebp+FLAG_H],ah             ; 3 *2    ???H????
cnz_tail:       setc    [ebp+FLAG_C]                ; 4 *3    0000000C
nz_tail:        lahf                                ; 1  1
                mov     [ebp+FLAG_NZ],ah            ; 3 *2    NZ??????
Keep in mind that after a subtract, the carry flag in 6502 has the opposite value from the x86 carry flag. So use SETNC for that case.
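
The carry inversion can be checked in plain C. The sketch below models a binary-mode 6502-style SBC (decimal mode ignored) and marks where the x86 borrow and the 6502 carry differ; the function name `sbc6502` is an assumption for illustration.

```c
#include <stdint.h>

/* 6502-style SBC in binary mode.
   *carry is the 6502 C flag on entry and exit: 1 means "no borrow". */
static uint8_t sbc6502(uint8_t a, uint8_t m, uint8_t *carry)
{
    unsigned borrow_in  = *carry ? 0u : 1u;          /* x86 CF going into SBB */
    unsigned diff       = (unsigned)a - m - borrow_in; /* wraps when a borrow occurs */
    unsigned borrow_out = (diff >> 8) & 1u;          /* x86 CF after SBB */
    *carry = (uint8_t)(borrow_out ^ 1u);             /* SETNC: 6502 C = !CF */
    return (uint8_t)diff;
}
```

For example, 0x50 - 0x10 with C set gives 0x40 and leaves C set, while 0x10 - 0x20 with C set gives 0xF0 and clears C, the opposite of what the x86 carry flag would hold.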

The NZ flags are (almost?) always set together, so it's convenient to combine them into one byte. Note that N and Z are stored in bits 7 and 6 of the LAHF result, and half-carry is stored in bit 4, so those are the meaningful bits of my FLAG_NZ and FLAG_H bytes. In the FLAG_C and FLAG_V bytes, by contrast, I use bit 0.

The SETcc instructions are available and efficient on all modern x86 processors. The LAHF looks pretty efficient on paper but I haven't really tried this stuff so I'm not 100% sure. On paper at least, on a Pentium II or III it's a 1-uop instruction with a 3-cycle latency and on an AMD chip it's a direct-path instruction with a 3-cycle latency. So it's no more costly than a cache-hit load. LAHF is a nice way to get at the x86 Half-carry flag too, which (as far as I'm aware) works exactly the same as the Carry flag (where you have to flip it for SBC). Here's a snippet of code for a read-modify-write SBC instruction:

Code: Select all

op2_sbc_t_t2_W8:mov     dl,[ebp+FLAG_C]             ; 3 *1
                sub     dl,1                        ; 3  1 CF=(!C)
                sbb     al,cl                       ; 2 *2
sbc_tail_w8:    lahf                                ; 1  1
                seto    [ebp+FLAG_V]                ; 4 *3
                setnc   [ebp+FLAG_C]                ; 4 *3 C=(!CF)
                xor     ah,0x10                     ; 3  1
                mov     [ebp+FLAG_H],ah             ; 3 *2 H=(!AF)
                jmp     short nz_tail_w8.1          ; 2  1
The only tricky part about having this non-uniform representation of the flags is what to do when you need to merge them into a 6502 flag byte, or split a byte of 6502 flags back into your internal representation. Here's some more code (again, UNTESTED):

Code: Select all

%macro MERGE_FLAGS 0          ; what we need:           NVP0H0ZC
                mov     cl,[ebp+FLAG_H]         ; 3  1  ???H????
                and     cl,0x10                 ; 3  1  000H0000
                shr     cl,3                    ; 3  1  000000H0
                mov     al,[ebp+FLAG_NZ]        ; 3  1  NZ??????
                shr     al,7                    ; 3  1  0000000N C=Z
                adc     cl,cl                   ; 2  1  00000H0Z
                mov     bl,[ebp+FLAG_C]         ; 3  1  0000000C
                add     al,al                   ; 2  1  000000N0
                add     cl,cl                   ; 2  1  0000H0Z0
                add     al,[ebp+FLAG_V]         ; 3 *2  000000NV
                add     cl,bl                   ; 2  1  0000H0ZC
                add     al,al                   ; 2  1  00000NV0
                add     al,[ebp+FLAG_P32+1]     ; 3 *2  00000NVP
                shl     al,5                    ; 3  1  NVP00000
                add     al,cl                   ; 2  1  NVP0H0ZC
%endmacro

%macro SPLIT_FLAGS 0          ; start with:             NVP?H?ZC
                test    al,0x20                 ; 2  1  NZ=(P)
                setnz   [ebp+FLAG_P32+1]        ; 4 *3           ---> 0000000P
                mov     bl,al                   ; 2  1  NVP?H?ZC
                test    al,0x01                 ; 2  1  NZ=(C)
                setnz   [ebp+FLAG_C]            ; 4 *3           ---> 0000000C
                add     al,al                   ; 2  1  VP?H?ZC0 C=N
                and     bl,0x80                 ; 3  1  N0000000
                mov     [ebp+FLAG_H],al         ; 3 *2           ---> xxxHxxxx
                rol     al,4                    ; 3  1  ?ZC0VP?H
                test    al,0x08                 ; 2  1  NZ=(V)
                setnz   [ebp+FLAG_V]            ; 4 *3           ---> 0000000V
                and     al,0x40                 ; 2  1  0Z000000
                add     bl,al                   ; 2  1  NZ000000
                mov     [ebp+FLAG_NZ],bl        ; 3 *2           ---> NZxxxxxx
%endmacro
The numbers in the ; comments are the instruction size in bytes and the number of uops on a P2/P3. An asterisk marks instructions that have to pass through the first decoder on a P2/P3 (remember, they used the 4-1-1 decoder template). That doesn't matter for P4s, but it probably does for modern Pentium Ms (I've never bothered to look into that).
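
For readers who don't follow the assembly, here is a C sketch of what the two macros compute: converting between the non-uniform internal layout and an NVP0H0ZC byte. The `SplitFlags` struct and the function names are assumptions for illustration, and the sketch makes no attempt to match the instruction scheduling of the macros.

```c
#include <stdint.h>

/* Mirrors the non-uniform layout described above; struct and field
   names are assumptions. Unused bits of nz and h may hold garbage. */
typedef struct {
    uint8_t nz;  /* bit 7 = N, bit 6 = Z (LAHF layout) */
    uint8_t h;   /* bit 4 = H (LAHF layout) */
    uint8_t c;   /* 0 or 1 */
    uint8_t v;   /* 0 or 1 */
    uint8_t p;   /* 0 or 1 */
} SplitFlags;

/* Produce NVP0H0ZC, the layout the MERGE_FLAGS macro targets. */
static uint8_t merge_flags(const SplitFlags *f)
{
    return (uint8_t)((f->nz & 0x80)          /* N stays in bit 7 */
                   | (f->v << 6)             /* V -> bit 6 */
                   | (f->p << 5)             /* P -> bit 5 */
                   | ((f->h & 0x10) >> 1)    /* H: bit 4 -> bit 3 */
                   | ((f->nz & 0x40) >> 5)   /* Z: bit 6 -> bit 1 */
                   | f->c);                  /* C -> bit 0 */
}

/* Inverse of merge_flags; bits 4 and 2 of the input are dropped. */
static void split_flags(SplitFlags *f, uint8_t psw)
{
    f->nz = (uint8_t)((psw & 0x80) | ((psw & 0x02) << 5));
    f->h  = (uint8_t)((psw & 0x08) << 1);
    f->c  = psw & 0x01;
    f->v  = (psw >> 6) & 0x01;
    f->p  = (psw >> 5) & 0x01;
}
```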

Hopefully reading the above will give people some clever ideas.

Here's another (unrelated) trick I came up with, to cheaply support executing code out of I/O port memory: the SPC700 has only a small range of I/O ports in its address space, the rest is basically RAM. I use handler functions for those addresses which store the port state somewhere else; then I fill those bytes within the address space with 0xFF, the opcode for a rarely-used instruction (STOP). Instruction fetch then ignores the possibility that I fetched an opcode from a port address. The port check is actually done in the handler for the STOP instruction, and if it turns out we fetched the 0xFF from a port, it fixes up the cycle counter, does a "real" port fetch using the port handler and then dispatches again to the new opcode. In the SPC700 case my port check is so fast it might not matter (two alu insns and one highly-predictable branch).
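
The port trick can be sketched in C as follows. Everything named here (`Core`, `fetch_opcode`, `resolve_stop`, the port range constants) is an assumption for illustration, and the cycle-counter fix-up mentioned above is left out. The point is only that the fast path never checks for ports, while the STOP handler does.

```c
#include <stdint.h>
#include <string.h>

#define OP_STOP    0xFFu
#define PORT_FIRST 0x00F0u   /* port range assumed for illustration */
#define PORT_LAST  0x00FFu

typedef struct {
    uint8_t  ram[0x10000];   /* port addresses are pre-filled with OP_STOP */
    uint8_t  port[16];       /* the real port state lives elsewhere */
    uint16_t pc;
    int      stopped;
} Core;

static void core_init(Core *c)
{
    memset(c, 0, sizeof *c);
    memset(&c->ram[PORT_FIRST], OP_STOP, PORT_LAST - PORT_FIRST + 1);
}

/* Stand-in for the port read handler. */
static uint8_t read_port(Core *c, uint16_t addr)
{
    return c->port[addr - PORT_FIRST];
}

/* Fast path: no port check on opcode fetch. */
static uint8_t fetch_opcode(Core *c)
{
    return c->ram[c->pc++];
}

/* Called when the fetched opcode is OP_STOP. Returns the opcode that
   should really be dispatched. */
static uint8_t resolve_stop(Core *c)
{
    uint16_t addr = (uint16_t)(c->pc - 1);   /* where the 0xFF came from */
    uint8_t  op   = OP_STOP;
    if ((uint16_t)(addr - PORT_FIRST) <= PORT_LAST - PORT_FIRST)
        op = read_port(c, addr);             /* the "real" fetch via the handler */
    if (op == OP_STOP)
        c->stopped = 1;                      /* a genuine STOP */
    return op;
}
```

The dispatcher would call `resolve_stop` from its STOP case and dispatch again on the returned opcode unless `stopped` was set.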

[Edit: does the 6502 even have a half-carry flag? Maybe you get off easy not having to worry about that one. Hmm.:lol:]

[Edit: I forgot to mention, part of my rationale for combining NZ into one byte was to reduce the number of writes. But since most flag-writing instructions only set two or three flags, and since modern processors have lots of store buffers...maybe it's simpler to just use separate SETcc instructions. Premature optimization is a favorite pastime of mine...]

[Edit: hmm, this got me thinking---if you use SETcc for all flags and lay the flag bytes out in your context structure the way the bits are laid out in the 6502 flags register...then you might be able to merge flags with the MMX instruction PMOVMSKB. I think it makes my head hurt too much. :roll:]

Posts: 121
Joined: Thu Nov 11, 2004 5:30 am
Location: San Francisco, CA

Post by baisoku » Tue Mar 07, 2006 8:01 am

mozz wrote: Here's another (unrelated) trick I came up with, to cheaply support executing code out of I/O port memory: the SPC700 has only a small range of I/O ports in its address space, the rest is basically RAM.
Hmm... In a chat I had with Burger Bill once, he told me he stored code in some SNES I/O port registers in one of his titles, Wolfenstein 3-D I believe. Was this a common technique? Neat emulation trick, though.
