Learning the 65816 cycle penalties
The cycle penalties for 65816 instructions are listed at "65816 Reference" on Super Famicom Development Wiki. For someone moving from 6502 to 65816, that's a fairly big table to memorize. I find it easier to memorize the rules for when a penalty applies.
[^1]: !P.m (16-bit accumulator)
Main ALU ops (adc, and, etc.)
[^2]: low(D)>0 (direct page not 256-byte aligned)
dp; dp,X; dp,Y; (dp); (dp),Y; [dp]; [dp],Y
[^3]: hibyte(a16)!=hibyte(EA)||!P.x (16-bit index or page crossed)
(dp),Y; [dp],Y; a16,Y; a16,X
[^4]: 2*!P.m (2 cycles for 16-bit accumulator)
Read-modify-write (RMW) instructions (asl, dec, etc.)
[^5]: EA!=PC (branch taken)
Branches
[^6]: EA!=PC&&P.e&&hibyte(EA)!=hibyte(PC) (taken, cross page, and emulation mode)
Branches
[^7]: !P.e (native mode)
brk, cop, rti
[^8]: !P.x (16-bit index)
cpx, cpy, etc.
Other rules:
- sr,S and (sr,S),Y have the same timing as dp and (dp),Y with unaligned DP and X=0
- As on 6502, indexed writes (STA, STX, STY, STZ) and RMWs always incur the penalty for 16-bit index or page crossed. This is a penalty compared to LDA/LDX/LDY, but it is not listed because it is incorporated into the instruction's basic count.
- REP and SEP are 3 cycles to let the MX bits propagate to the entire CPU. XBA is also 3 cycles for some reason.
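As a rough illustration of how rules [^1] to [^3] stack up, here is a minimal Python sketch for LDA in three addressing modes. The function name and parameters are hypothetical, and the base counts (3 for dp, 4 for a16,X, 5 for (dp),Y) are assumed from the usual 65816 opcode tables rather than stated in this post.

```python
# Sketch only: cycle count for LDA under penalty rules [^1]..[^3].
# Base counts are an assumption taken from standard 65816 opcode tables.
BASE_CYCLES = {"dp": 3, "a16,X": 4, "(dp),Y": 5}
DP_MODES = {"dp", "(dp),Y"}          # modes affected by rule [^2]
INDEXED_MODES = {"a16,X", "(dp),Y"}  # modes affected by rule [^3]


def lda_cycles(mode, m8, x8, d=0x0000, base=0x0000, index=0x0000):
    """Return CPU cycles for LDA.

    m8/x8: True when the accumulator/index registers are 8 bits wide
           (P.m / P.x set).
    d:     direct page register.
    base:  16-bit address before indexing (a16, or the pointer read from dp).
    index: value of X or Y.
    """
    cycles = BASE_CYCLES[mode]
    if not m8:                              # [^1] 16-bit accumulator
        cycles += 1
    if mode in DP_MODES and (d & 0xFF):     # [^2] low(D) > 0
        cycles += 1
    if mode in INDEXED_MODES:
        ea = base + index
        crossed = (base >> 8) != (ea >> 8)  # high byte changed by indexing
        if crossed or not x8:               # [^3] page crossed or 16-bit index
            cycles += 1
    return cycles
```

For instance, `lda_cycles("dp", m8=False, x8=True, d=0x0001)` adds one cycle for the 16-bit accumulator and one for the unaligned direct page.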
Now whether these penalties are considered internal operation cycles (and thus fast even if in WRAM or slow ROM) or read cycles (and thus slow from WRAM or slow ROM) is another question. In what manual could I find that?
Re: Learning the 65816 cycle penalties
XBA:
A -> temp
B -> A
temp -> B
thus 3 cycles.
The SNES only hits the clock penalty if the address bus is in those ranges. So you would need to look up the addressing mode tables for the 65816 and work out whether the CPU is doing a dummy read from the target address range. See table 5-7 in the data sheet.
For example, for abs,index the address bus is
Code: Select all
PBR,PC
PBR,PC+1
PBR,PC+2
DBR,AAH,AAL+XL ; this only happens with a 16-bit index; it is a dummy read at the target without the high offset
DBR,A+X
DBR,A+X+1
So if you read, say, $0001,x with X = $4001 in a LoROM FastROM config:
Code: Select all
PBR,PC ; 3.58
PBR,PC+1 ; 3.58
PBR,PC+2 ; 3.58
DBR,AAH,AAL+XL ; $000002 2.68
DBR,A+X ; $004002 1.79
DBR,A+X+1 ; $004003 1.79
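The address arithmetic in that abs,X sequence can be sketched in a few lines of Python. This is only an illustration of the `DBR,AAH,AAL+XL` and `DBR,A+X` notation; the function name is hypothetical, and DBR = $00 is assumed in the usage line.

```python
def abs_x_bus_addresses(dbr, base, x):
    """Return (dummy, ea, ea_plus_1) as 24-bit bus addresses for abs,X.

    dummy: DBR, AAH, AAL+XL  (low byte summed, no carry into the high byte)
    ea:    DBR, A+X          (full sum; may carry into the next bank)
    """
    dummy = (dbr << 16) | (base & 0xFF00) | ((base + x) & 0xFF)
    ea = ((dbr << 16) + base + x) & 0xFFFFFF
    return dummy, ea, (ea + 1) & 0xFFFFFF


# Assumed DBR = $00, operand $0001, X = $4001:
dummy, ea, ea_next = abs_x_bus_addresses(0x00, 0x0001, 0x4001)
print(f"${dummy:06X} ${ea:06X} ${ea_next:06X}")
```

With those inputs the intermediate address lands in the WRAM mirror, while the two data addresses land in the $4000-$41FF register range.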
Re: Learning the 65816 cycle penalties
Can you explain this once again, in simpler terms? Say we run across XBA while in "8-mini-cycles" ROM area or while in "6-mini-cycles" ROM area, how much time is it going to take and why?
Re: Learning the 65816 cycle penalties
https://github.com/larsbrinkhoff/awesom ... 65c816.txt
stan423321 wrote: ↑Mon Mar 15, 2021 11:47 am
Say we run across XBA while in "8-mini-cycles" ROM area or while in "6-mini-cycles" ROM area, how much time is it going to take and why?
When encountering an XBA, the CPU takes one cycle to load the opcode. That will be either 8 or 6 master clocks depending on where the ROM is and what the $420D setting is. It then takes two more internal cycles to do its job, because you can't directly transfer data from one register to the other while also transferring from the other to the one; you need at least one intermediate step. Both of those internal operation cycles should be 6 master clocks, according to both anomie's docs and nocash's docs.
Code: Select all
*6b Implied -- i
(XBA)
(1 Op Code)
(1 byte)
(3 cycles)
1 1 1 1 1 PBR,PC Op Code 1
2 1 1 0 0 PBR,PC+1 IO 1
3 1 1 0 0 PBR,PC+1 IO 1
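Put as arithmetic, the XBA timing described above is one opcode fetch plus two internal cycles of 6 master clocks each. A small Python sketch, with a hypothetical function name:

```python
INTERNAL_CYCLE_CLOCKS = 6  # master clocks per internal operation (IO) cycle


def xba_master_clocks(opcode_fetch_clocks):
    """Total master clocks for XBA: one opcode fetch plus two IO cycles.

    opcode_fetch_clocks is 8 for slow ROM or 6 for fast ROM.
    """
    return opcode_fetch_clocks + 2 * INTERNAL_CYCLE_CLOCKS


print(xba_master_clocks(8))  # slow ROM
print(xba_master_clocks(6))  # fast ROM
```

So XBA comes to 20 master clocks when fetched from slow ROM and 18 from fast ROM.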
Last edited by 93143 on Mon Mar 15, 2021 1:20 pm, edited 1 time in total.
Re: Learning the 65816 cycle penalties
The 21MHz clock is usually called the master clock; "mini clock" sounds a bit unfamiliar ; )
This document lists timings in detail https://github.com/larsbrinkhoff/awesom ... 65c816.txt
No, that step should be an Internal Operation "IO" thus 3.58MHz.
It might output something on the address bus, but doesn't actually transfer anything on the data bus.
As far as I remember, the 65816 doesn't do dummy memory transfers (unlike 6502), and does instead use those internal operation cycles.
So the general rule is "total number of clocks - number of bytes = number of internal clocks".
For example NOP has 2 cycles, and transfers only one byte (the opcode byte, without further parameter/address/data bytes), so the second cycle is an internal one.
Re: Learning the 65816 cycle penalties
Table 5-7 in the 65c816 data sheet has the "VDA" (valid data address) and "VPA" (valid program address) columns, corresponding to the pins shown on page 12. Every Internal Operation cycle has zeroes in both of these columns. The 5A22 then knows to add no extra waitstate cycles to the 6 minimum master clock cycles per 1 CPU core cycle, even if the core's address bus points to a "slow" or "xslow" region.
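The gating described here can be sketched in Python as follows. The function names and the cycle-list representation are hypothetical; the only behaviour modelled is that a cycle with VDA=VPA=0 always costs 6 master clocks regardless of the region on the address bus.

```python
MIN_CLOCKS = 6  # master clocks per CPU core cycle with no waitstates


def clocks_for_cycle(vda, vpa, region_clocks):
    """Master clocks for one CPU cycle.

    region_clocks is the cost of the region the address bus points at
    (6 fast, 8 slow, 12 xslow); it only applies to valid accesses.
    """
    if not (vda or vpa):
        return MIN_CLOCKS  # internal operation: no extra waitstates
    return region_clocks


def instruction_clocks(cycles):
    """Sum master clocks over a list of (vda, vpa, region_clocks) tuples."""
    return sum(clocks_for_cycle(vda, vpa, region) for vda, vpa, region in cycles)


# NOP fetched from slow ROM: opcode fetch (VDA=VPA=1), then one IO cycle
# whose address bus still points into the slow region.
nop_slow = [(1, 1, 8), (0, 0, 8)]
print(instruction_clocks(nop_slow))
```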
To add more names to the mix: it seems that on most SNES versions, the main clock crystal is labeled "X1" and the audio crystal is labeled "X2".
My current setup:
Super Famicom ("2/1/3" SNS-CPU-GPM-02) → SCART → OSSC → StarTech USB3HDCAP → AmaRecTV 3.10
Re: Learning the 65816 cycle penalties
creaothceann wrote: ↑Mon Mar 15, 2021 2:59 pm
Table 5-7 in the 65c816 data sheet has the "VDA" (valid data address) and "VPA" (valid program address) columns, corresponding to the pins shown on page 12. Every Internal Operation cycle has zeroes in both of these columns. The 5A22 then knows to add no extra waitstate cycles to the 6 minimum master clock cycles per 1 CPU core cycle, even if the core's address bus points to a "slow" or "xslow" region.
In particular, decoding is complete enough and fast enough that the second cycle of an implied, accumulator, or push/pull instruction, or of RTS/RTL/RTI, is already VDA=VPA=0, and therefore gets no /RD. BRK and COP have VPA=1 for the signature byte.
Re: Learning the 65816 cycle penalties
nocash wrote: ↑Mon Mar 15, 2021 1:16 pm
The 21MHz clock is usually called master clock, mini clock sounds a bit unfamilar ; )
This document lists timings in detail https://github.com/larsbrinkhoff/awesom ... 65c816.txt
No, that step should be an Internal Operation "IO" thus 3.58MHz.
It might output something on the address bus, but doesn't actually transfer anything on the data bus.
As far as I remember, the 65816 doesn't do dummy memory transfers (unlike 6502), and does instead use those internal operation cycles.
So the general rule is "total number of clocks - number of bytes = number of internal clocks".
For example NOP has 2 cycles, and transfers only one byte (the opcode byte, without further parameter/address/data bytes), so the second cycle is an internal one.
So the SNES is smart enough to work out and gate the IO operations?
Re: Learning the 65816 cycle penalties
It just takes a few transistors.
In fact, the entire address map decoding logic is relatively simple, all you need to do is combine some bits.
Code: Select all
// banks   | addresses   | speed | mapping
// --------+-------------+-------+-----------------------------
// $00-$3F | $0000-$1FFF | slow  | address bus A + /WRAM mirror
//         | $2000-$20FF | fast  | address bus A
//         | $2100-$21FF | fast  | address bus B
//         | $2200-$3FFF | fast  | address bus A
//         | $4000-$41FF | xslow | internal CPU registers
//         | $4200-$43FF | fast  | internal CPU registers
//         | $4400-$5FFF | fast  | address bus A
//         | $6000-$7FFF | slow  | address bus A
//         | $8000-$FFFF | slow  | address bus A + /CART
// --------+-------------+-------+-----------------------------
// $40-$7D | $0000-$FFFF | slow  | address bus A + /CART
// $7E-$7F | $0000-$FFFF | slow  | address bus A + /WRAM
// --------+-------------+-------+-----------------------------
//
// banks   | addresses   | speed | mapping
// --------+-------------+-------+-----------------------------
// $80-$BF | $0000-$1FFF | slow  | address bus A + /WRAM mirror
//         | $2000-$20FF | fast  | address bus A
//         | $2100-$21FF | fast  | address bus B
//         | $2200-$3FFF | fast  | address bus A
//         | $4000-$41FF | xslow | internal CPU registers
//         | $4200-$43FF | fast  | internal CPU registers
//         | $4400-$5FFF | fast  | address bus A
//         | $6000-$7FFF | slow  | address bus A
//         | $8000-$FFFF | note2 | address bus A + /CART
// --------+-------------+-------+-----------------------------
// $C0-$FD | $0000-$FFFF | note2 | address bus A + /CART
// $FE-$FF | $0000-$FFFF | note2 | address bus A + /CART
// --------+-------------+-------+-----------------------------
// 65c816 VDA ( 1 bit ): is valid data address
// 65c816 VPA ( 1 bit ): is valid program address
// 65c816 RWB ( 1 bit ): is read operation
// 65c816 MAR (24 bits): Memory Address Register (address bus value)
// as per jwdonal's schematics:
// /CPURD and /CPUWR = A address bus indicator
// /PARD and /PAWR = B address bus indicator
is_valid_address := VDA or VPA;
is_reading := is_valid_address and ( RWB);
is_writing := is_valid_address and (not RWB);
is_lower_bank := ((MAR.BankByte AND %11000000) = %00000000); // banks 00..3F
is_WRAM_bank := ((MAR.BankByte AND %11111110) = %01111110); // banks 7E..7F
is_CART_bank := (not is_WRAM_bank) and ((MAR.BankByte AND %01000000) = %01000000); // banks 40..7F and C0..FF, except for 7E..7F
is_WRAM_mirror := is_lower_bank and ((MAR.HighByte AND %11100000) = %00000000); // offsets 0000..1FFF
is_B_bus := is_lower_bank and ((MAR.HighByte ) = $21 ); // offsets 2100..21FF
is_internal_register := is_lower_bank and ((MAR.HighByte AND %11111100) = %01000000); // offsets 4000..43FF
is_upper_offset := ((MAR.HighByte AND %10000000) = %10000000); // offsets 8000..FFFF
is_A_bus := not (is_internal_register or is_B_bus);
/CPURD := not (is_A_Bus and is_reading);
/CPUWR := not (is_A_Bus and is_writing);
/PARD := not (is_B_Bus and is_reading);
/PAWR := not (is_B_Bus and is_writing);
/CART := not ((is_upper_offset and not is_WRAM_bank) or is_CART_bank); // offsets 8000..FFFF outside banks 7E..7F, plus the CART banks
/WRAM := not (is_WRAM_mirror or is_WRAM_bank);
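The speed column of the table can be turned into a lookup in the same spirit. The Python sketch below is hypothetical and makes two assumptions not spelled out in the table: that fast/slow/xslow mean 6/8/12 master clocks per cycle, and that "note2" means fast when FastROM is enabled through $420D and slow otherwise.

```python
FAST, SLOW, XSLOW = 6, 8, 12  # assumed master clocks per CPU cycle


def region_clocks(bank, offset, fastrom=False):
    """Master clocks per valid access for a 24-bit address (bank, offset).

    fastrom models the $420D setting; treating "note2" this way is an
    assumption, not something the table itself states.
    """
    upper_half = bank >= 0x80            # banks $80-$FF
    rom_speed = FAST if (upper_half and fastrom) else SLOW
    if (bank & 0x7F) >= 0x40:            # banks $40-$7F and $C0-$FF
        return rom_speed                 # $7E-$7F are never upper_half: slow
    # banks $00-$3F and $80-$BF
    if offset < 0x2000:
        return SLOW                      # WRAM mirror
    if offset < 0x4000:
        return FAST                      # bus A / bus B
    if offset < 0x4200:
        return XSLOW                     # internal CPU registers
    if offset < 0x6000:
        return FAST
    if offset < 0x8000:
        return SLOW
    return rom_speed                     # $8000-$FFFF


print(region_clocks(0x00, 0x4002))                # xslow register range
print(region_clocks(0x80, 0x8000, fastrom=True))  # FastROM
```

Only cycles with VDA or VPA set pay this cost; internal cycles stay at 6 master clocks regardless of the address.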