All times are UTC - 7 hours

 Post subject: faster 'P' emulation
PostPosted: Tue Feb 07, 2006 7:15 am 

Joined: Thu Sep 15, 2005 9:23 am
Posts: 1236
Location: Berlin, Germany
I have read somewhere that it is possible to use the host's own status flags in place of the NES's 'P' register. This would mean less code within our CPU emulators. We could then use assembler to access these flags. However, I fear that after a register transfer has been made, changing the PC and CC could overwrite our work. For example:

Code:
inline void OpticCode98()
{
   CPU.A = CPU.Y;
      // ^^ would set the necessary flags
   CPU.PC++;
   CPU.CC += 2;
      // ^^ would reset the necessary flags?
}


Can anyone shed more light on this?


Post subject:
PostPosted: Tue Feb 07, 2006 11:10 am

Joined: Mon Sep 20, 2004 6:47 am
Posts: 48
I believe the CPU flags are saved and restored when the OS switches to another running program, if that's what you wanted to know.


Post subject:
PostPosted: Tue Feb 07, 2006 1:16 pm

Joined: Mon Sep 27, 2004 8:33 am
Posts: 3715
Location: Central Texas, USA
WedNESday is probably referring to legacy processor architectures with only one set of status flags, which are set as if the result of the last operation had been compared with zero (like the 6502 after LDA, ORA, ADC, etc.). On those, the operations allowed between setting the flags and branching on them are severely limited (STA, for example, doesn't affect the flags). It's unlikely that using the host's status flags would give a speed benefit, because accessing them probably stalls the pipeline; reading them directly is not something programs commonly need to do.

WedNESday, if you do a profile of how often each instruction is used, and look at what 6502 status flags they modify, you might find some opportunities for optimization without using non-portable techniques like this.


Post subject:
PostPosted: Tue Feb 07, 2006 2:41 pm

Joined: Thu Sep 15, 2005 9:23 am
Posts: 1236
Location: Berlin, Germany
Here is what I mean exactly. I want to use the x86 processor's status flags while the emulator is running. I know that this method of implementation is possible.

Code:
inline void OpticCode98()
{
   CPU.A = CPU.Y;

      // if CPU.Y == 0 then the x86's zero flag would be set, no problem

   CPU.PC++;
   CPU.CC += 2;

      // but incrementing these two would modify the x86 status flags, therefore losing the data
}


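If we could retain the status flags register in the way that we wanted, then we could omit code like the following...

Code:
CPU.P &= 0x7D;            // clear N (bit 7) and Z (bit 1)
if( !CPU.A )
    CPU.P += 0x02;        // set Z when the result is zero
CPU.P += (CPU.A & 0x80);  // copy bit 7 of the result into N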


...from just about every instruction. That would be an obvious speed increase.


Post subject:
PostPosted: Tue Feb 07, 2006 4:59 pm

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 21510
Location: NE Indiana, USA (NTSC)
WedNESday wrote:
Here is what I mean exactly. I want to use the x86 processor's status flags while the emulator is running.

For that, you probably have to use assembly language. C definitely won't work, but C-- (C minus minus) might.



Post subject:
PostPosted: Tue Feb 07, 2006 6:45 pm

Joined: Mon Sep 20, 2004 11:13 am
Posts: 134
Location: Sweden
Instead of keeping all flags in a single byte, you should use one boolean for each flag. That way, you won't have to mask out any bits whenever you want to access a flag. When you push the status register to the stack, you just convert those eight booleans to a single byte (and the other way around when you pull the status from the stack).
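
As a rough illustration of the one-boolean-per-flag idea, here is a minimal C sketch. The names Flags, pack_status, unpack_status and set_nz are made up for this example and do not come from any particular emulator. It assumes the usual 6502 status layout (NV-BDIZC) with bit 5 always reading as 1 when the byte is pushed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: one bool per 6502 status flag. */
typedef struct {
    bool c, z, i, d, b, v, n;
} Flags;

/* Instructions that update N and Z just write two bools; no masking needed. */
static inline void set_nz(Flags *f, uint8_t value)
{
    f->z = (value == 0);
    f->n = (value & 0x80) != 0;
}

/* Build the status byte (NV1BDIZC) only when it is pushed to the stack. */
static inline uint8_t pack_status(const Flags *f)
{
    return (uint8_t)((f->n << 7) | (f->v << 6) | (1u << 5) | (f->b << 4) |
                     (f->d << 3) | (f->i << 2) | (f->z << 1) | (f->c << 0));
}

/* Split a status byte back into bools when it is pulled from the stack. */
static inline void unpack_status(Flags *f, uint8_t p)
{
    f->n = (p & 0x80) != 0;
    f->v = (p & 0x40) != 0;
    f->b = (p & 0x10) != 0;
    f->d = (p & 0x08) != 0;
    f->i = (p & 0x04) != 0;
    f->z = (p & 0x02) != 0;
    f->c = (p & 0x01) != 0;
}
```

The packing cost is only paid on PHP, PLP, BRK, RTI and interrupts, which are rare compared with the instructions that update N and Z.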

You don't want to keep your flags in the x86 flag register (too much overhead), but you can use the x86 flags after an arithmetic operation to set your own boolean flag vars.
Here's an ADC example in assembly. I'm not sure exactly how to make your C compiler understand asm.

Code:
// ADC, operand is in al

shr flagC, 1  // Shift our C flag into the x86 carry flag
adc a, al     // Do an adc and let the x86 set all flags

sets flagS    // Yay, here we
setz flagZ    // use the x86 flags
setc flagC    // to set our booleans
seto flagV    // to the proper values


I believe this is how Q did it in Nintendulator; you should check his source.
There's also blargg's approach where you don't evaluate any flags until you need to. It's described somewhere in the wiki.
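
For anyone who hasn't seen deferred flag evaluation, here is a minimal C sketch of the general idea. It is not taken from blargg's code; the LazyCpu struct, the 16-bit nz encoding and all function names are assumptions made for illustration. Instructions only remember the last result, and N and Z are derived from it when something actually asks for them.

```c
#include <stdint.h>

/* Hypothetical CPU state: nz holds the last result that affects N and Z.
   Assumed encoding: Z is set when the low byte is zero; N is set when
   bit 7 or bit 15 is set. Bit 15 lets a pulled status byte have N and Z
   set at the same time. */
typedef struct {
    uint8_t  a, y;
    uint16_t nz;
} LazyCpu;

/* TYA: one extra store instead of masking and branching on P. */
static inline void op_tya(LazyCpu *cpu)
{
    cpu->a = cpu->y;
    cpu->nz = cpu->a;
}

/* Flags are computed only when a branch or a push needs them. */
static inline int flag_z(const LazyCpu *cpu) { return (cpu->nz & 0x00FF) == 0; }
static inline int flag_n(const LazyCpu *cpu) { return (cpu->nz & 0x8080) != 0; }

/* PLP/RTI path: rebuild nz from a real 6502 status byte. */
static inline void set_nz_from_status(LazyCpu *cpu, uint8_t p)
{
    cpu->nz = (uint16_t)(((p & 0x80) ? 0x8000 : 0) | ((p & 0x02) ? 0 : 1));
}
```

The win comes from the fact that most N/Z results are overwritten by the next instruction before anything reads them.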

--Martin


Post subject:
PostPosted: Tue Mar 07, 2006 12:09 am

Joined: Mon Mar 06, 2006 3:42 pm
Posts: 94
Location: Montreal, canada
Nessie wrote:
Instead of keeping all flags in a single byte, you should use one boolean for each flag.


For a while I was working on x86 assembly code for an SPC700 core (the sound chip for SNES, which is 6502-based). Using the x86 flags is a nice trick you can do if your core is written in assembly. In nearly all cases, the x86 instructions incidentally compute into x86 flags the values you need for the 6502 flags. Here's some example code to save away the x86 flags. This is common tail code I would stick right before my dispatch loop and jump into for non-RMW instructions (warning: UNTESTED):

Code:
vhcnz_tail:     seto    [ebp+FLAG_V]                ; 4 *3    0000000V
                lahf                                ; 1  1
                mov     [ebp+FLAG_H],ah             ; 3 *2    ???H????
cnz_tail:       setc    [ebp+FLAG_C]                ; 4 *3    0000000C
nz_tail:        lahf                                ; 1  1
                mov     [ebp+FLAG_NZ],ah            ; 3 *2    NZ??????

Keep in mind that after a subtract, the carry flag in 6502 has the opposite value from the x86 carry flag. So use SETNC for that case.
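
To make the inverted carry concrete, here is a small C sketch of a binary-mode 6502-style SBC. The Cpu struct and the sbc function are hypothetical names for this example. The borrow variable plays the role of the x86 carry flag after SBB, and the 6502 carry is its complement.

```c
#include <stdint.h>

/* Hypothetical minimal state: accumulator and carry (0 or 1). */
typedef struct {
    uint8_t a;
    uint8_t c;
} Cpu;

/* 6502-style SBC, binary mode: A = A - M - (1 - C). */
static inline void sbc(Cpu *cpu, uint8_t m)
{
    unsigned diff   = (unsigned)cpu->a - m - (1u - cpu->c);
    unsigned borrow = (diff >> 8) & 1u;  /* what x86 CF would hold after SBB */

    cpu->a = (uint8_t)diff;
    cpu->c = (uint8_t)(borrow ^ 1u);     /* 6502 C = !CF, i.e. SETNC */
}
```

For example, 0x50 - 0x10 with carry set leaves the carry set (no borrow), while 0x10 - 0x20 clears it.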

The NZ flags are (almost?) always set together, so it's convenient to combine them into one byte. Note that N and Z are stored in bits 7 and 6 of the LAHF result. Half-carry is stored in bit 4. So those are the meaningful bits of my FLAG_NZ and FLAG_H bytes. Whereas I use bit 0 in the FLAG_C and FLAG_V bytes.

The SETcc instructions are available and efficient on all modern x86 processors. LAHF looks pretty efficient on paper, but I haven't really tried this stuff so I'm not 100% sure. On paper at least, on a Pentium II or III it's a 1-uop instruction with a 3-cycle latency, and on an AMD chip it's a direct-path instruction with a 3-cycle latency. So it's no more costly than a cache-hit load. LAHF is a nice way to get at the x86 half-carry flag too, which (as far as I'm aware) works exactly the same as the carry flag (where you have to flip it for SBC). Here's a snippet of code for a read-modify-write SBC instruction:

Code:
op2_sbc_t_t2_W8:mov     dl,[ebp+FLAG_C]             ; 3 *1
                sub     dl,1                        ; 3  1 CF=(!C)
                sbb     al,cl                       ; 2 *2
sbc_tail_w8:    lahf                                ; 1  1
                seto    [ebp+FLAG_V]                ; 4 *3
                setnc   [ebp+FLAG_C]                ; 4 *3 C=(!CF)
                xor     ah,0x10                     ; 3  1
                mov     [ebp+FLAG_H],ah             ; 3 *2 H=(!AF)
                jmp     short nz_tail_w8.1          ; 2  1


The only tricky part of this non-uniform flag representation is what to do when you need to merge the flags into a 6502 flag byte, or split a byte of 6502 flags back into your internal representation. Here's some more code (again, UNTESTED):
Code:
%macro MERGE_FLAGS 0          ; what we need:           NVP0H0ZC
                mov     cl,[ebp+FLAG_H]         ; 3  1  ???H????
                and     cl,0x10                 ; 3  1  000H0000
                shr     cl,3                    ; 3  1  000000H0
                mov     al,[ebp+FLAG_NZ]        ; 3  1  NZ??????
                shr     al,7                    ; 3  1  0000000N C=Z
                adc     cl,cl                   ; 2  1  00000H0Z
                mov     bl,[ebp+FLAG_C]         ; 3  1  0000000C
                add     al,al                   ; 2  1  000000N0
                add     cl,cl                   ; 2  1  0000H0Z0
                add     al,[ebp+FLAG_V]         ; 3 *2  000000NV
                add     cl,bl                   ; 2  1  0000H0ZC
                add     al,al                   ; 2  1  00000NV0
                add     al,[ebp+FLAG_P32+1]     ; 3 *2  00000NVP
                shl     al,5                    ; 3  1  NVP00000
                add     al,cl                   ; 2  1  NVP0H0ZC
%endmacro

%macro SPLIT_FLAGS 0          ; start with:             NVP?H?ZC
                test    al,0x20                 ; 2  1  NZ=(P)
                setnz   [ebp+FLAG_P32+1]        ; 4 *3           ---> 0000000P
                mov     bl,al                   ; 2  1  NVP?H?ZC
                test    al,0x01                 ; 2  1  NZ=(C)
                setnz   [ebp+FLAG_C]            ; 4 *3           ---> 0000000C
                add     al,al                   ; 2  1  VP?H?ZC0 C=N
                and     bl,0x80                 ; 3  1  N0000000
                mov     [ebp+FLAG_H],al         ; 3 *2           ---> xxxHxxxx
                rol     al,4                    ; 3  1  ?ZC0VP?H
                test    al,0x08                 ; 2  1  NZ=(V)
                setnz   [ebp+FLAG_V]            ; 4 *3           ---> 0000000V
                and     al,0x40                 ; 2  1  0Z000000
                add     bl,al                   ; 2  1  NZ000000
                mov     [ebp+FLAG_NZ],bl        ; 3 *2           ---> NZxxxxxx
%endmacro


The numbers in the ; comments are the instruction size in bytes and the number of uops on a P2/P3. A * marks instructions that have to pass through the first decoder on a P2/P3 (remember, they use the 4-1-1 decoder template). That doesn't matter for the P4, but it probably does for modern Pentium Ms (I've never bothered to look into that).
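
For readers who don't want to trace the MERGE_FLAGS and SPLIT_FLAGS macros bit by bit, here is a rough C equivalent of the same transformation. It is only a sketch: the SplitFlags struct and function names are invented for this example, and it assumes the layout described above (N and Z in bits 7 and 6 of one byte, H in bit 4 of another, C, V and P in bit 0 of their own bytes) with the target byte laid out as NVP0H0ZC.

```c
#include <stdint.h>

/* Hypothetical mirror of the non-uniform flag storage described above. */
typedef struct {
    uint8_t nz;       /* N in bit 7, Z in bit 6 (LAHF layout); rest undefined */
    uint8_t h;        /* H in bit 4; rest undefined */
    uint8_t c, v, p;  /* flag in bit 0 */
} SplitFlags;

/* Gather the scattered bits into one NVP0H0ZC status byte. */
static inline uint8_t merge_flags(const SplitFlags *f)
{
    return (uint8_t)((f->nz & 0x80)           /* N: bit 7 stays in bit 7 */
                   | ((f->v & 1) << 6)        /* V: bit 0 -> bit 6 */
                   | ((f->p & 1) << 5)        /* P: bit 0 -> bit 5 */
                   | ((f->h & 0x10) >> 1)     /* H: bit 4 -> bit 3 */
                   | ((f->nz & 0x40) >> 5)    /* Z: bit 6 -> bit 1 */
                   | (f->c & 1));             /* C: bit 0 stays in bit 0 */
}

/* Scatter a status byte back into the internal representation. */
static inline void split_flags(SplitFlags *f, uint8_t psw)
{
    f->nz = (uint8_t)((psw & 0x80) | ((psw & 0x02) << 5));
    f->h  = (uint8_t)((psw & 0x08) << 1);
    f->c  = psw & 1;
    f->v  = (psw >> 6) & 1;
    f->p  = (psw >> 5) & 1;
}
```

A compiler will not necessarily produce code as tight as the hand-scheduled macros, but the C version is handy as a reference when testing them.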

Hopefully reading the above will give people some clever ideas.

Here's another (unrelated) trick I came up with, to cheaply support executing code out of I/O port memory: the SPC700 has only a small range of I/O ports in its address space, the rest is basically RAM. I use handler functions for those addresses which store the port state somewhere else; then I fill those bytes within the address space with 0xFF, the opcode for a rarely-used instruction (STOP). Instruction fetch then ignores the possibility that I fetched an opcode from a port address. The port check is actually done in the handler for the STOP instruction, and if it turns out we fetched the 0xFF from a port, it fixes up the cycle counter, does a "real" port fetch using the port handler and then dispatches again to the new opcode. In the SPC700 case my port check is so fast it might not matter (two alu insns and one highly-predictable branch).
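
Here is a rough C sketch of that port trick, since it is easier to follow in code. Everything in it is illustrative: the Spc struct, the function names, the port range 0x00F0-0x00FF and the use of 0x00 as NOP are assumptions for this example, and the cycle-counter fix-up is left out.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative constants; the port range and the NOP opcode are assumptions. */
enum { PORT_FIRST = 0x00F0, PORT_LAST = 0x00FF, OP_NOP = 0x00, OP_STOP = 0xFF };

typedef struct {
    uint8_t  ram[0x10000];  /* flat address space; port bytes hold OP_STOP */
    uint8_t  ports[16];     /* real port state lives outside the address space */
    uint16_t pc;
    int      nops;          /* counts executed NOPs, for demonstration only */
    int      stopped;
} Spc;

static inline void spc_init(Spc *s)
{
    memset(s, 0, sizeof *s);
    /* Fill the port addresses with the rarely used STOP opcode. */
    memset(&s->ram[PORT_FIRST], OP_STOP, PORT_LAST - PORT_FIRST + 1);
}

static inline uint8_t port_read(const Spc *s, uint16_t addr)
{
    return s->ports[addr - PORT_FIRST];
}

static inline void dispatch(Spc *s, uint8_t opcode, uint16_t at, int from_port)
{
    switch (opcode) {
    case OP_NOP:
        s->nops++;
        break;
    case OP_STOP:
        /* The port check lives here, not in the fetch path. */
        if (!from_port && at >= PORT_FIRST && at <= PORT_LAST) {
            dispatch(s, port_read(s, at), at, 1);  /* re-dispatch real opcode */
            break;
        }
        s->stopped = 1;  /* a genuine STOP */
        break;
    default:
        break;
    }
}

/* Instruction fetch never checks for ports; it just reads the flat array. */
static inline void spc_step(Spc *s)
{
    uint16_t at = s->pc++;
    dispatch(s, s->ram[at], at, 0);
}
```

The fetch path stays a plain array read, and the extra range check is only paid on the rare occasions a STOP opcode is dispatched.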

[Edit: does the 6502 even have a half-carry flag? Maybe you get off easy not having to worry about that one. Hmm.:lol:]

[Edit: I forgot to mention, part of my rationale for combining NZ into one byte was to reduce the number of writes. But since most flag-writing instructions only set two or three flags, and since modern processors have lots of store buffers...maybe it's simpler to just use separate SETcc instructions. Premature optimization is a favorite pastime of mine...]

[Edit: hmm, this got me thinking: if you use SETcc for all flags and lay the flag bytes out in your context structure the way the bits are laid out in the 6502 flags register, then you might be able to merge flags with the MMX instruction PMOVMSKB. I think it makes my head hurt too much. :roll:]
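
The PMOVMSKB idea could look roughly like the C sketch below. It swaps in the SSE2 form of the instruction (through the _mm_movemask_epi8 intrinsic) rather than the MMX-register form, and it falls back to a plain loop when SSE2 is not available. The function name and the array layout (flag_bytes[i] holds 0 or 1 and maps to bit i of the status byte) are assumptions for this example. Because SETcc writes 0 or 1 while PMOVMSKB collects bit 7 of each byte, the bytes are shifted left by 7 first.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical layout: flag_bytes[i] is 0 or 1 and maps to status bit i. */
static inline uint8_t merge_flag_bytes(const uint8_t flag_bytes[8])
{
#if defined(__SSE2__)
    /* Load the 8 flag bytes into the low half of an XMM register. */
    __m128i v = _mm_loadl_epi64((const __m128i *)flag_bytes);
    /* Move bit 0 of every byte up to bit 7, where PMOVMSKB looks. */
    v = _mm_slli_epi16(v, 7);
    /* Collect the top bit of each byte; the upper 8 lanes are zero. */
    return (uint8_t)_mm_movemask_epi8(v);
#else
    /* Portable fallback: same result, one bit at a time. */
    uint8_t p = 0;
    for (int i = 0; i < 8; i++)
        p |= (uint8_t)((flag_bytes[i] & 1) << i);
    return p;
#endif
}
```

Whether this beats a handful of shifts and adds is an open question; it is only meant to show that the layout trick is workable.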


Post subject:
PostPosted: Tue Mar 07, 2006 8:01 am

Joined: Thu Nov 11, 2004 5:30 am
Posts: 121
Location: San Francisco, CA
mozz wrote:
Here's another (unrelated) trick I came up with, to cheaply support executing code out of I/O port memory: the SPC700 has only a small range of I/O ports in its address space, the rest is basically RAM.

Hmm... In a chat I had with Burger Bill once, he told me he stored code in some SNES I/O port registers in one of his titles, Wolfenstein 3-D I believe. Was this a common technique? Neat emulation trick, though.

_________________
...patience...

