Re: Adding features to discrete mapper with multipurposed CI
Posted: Wed Sep 27, 2017 10:24 pm
Made a little more progress!
Mostly have the pinout nailed down for my first board version. Came to the realization that I really just need two STM8 footprints on the board for now. Reason being that for my standard discrete 'non-CICOp' boards the goal is to have the STM8 increase user friendliness with fewer jumpers for things like PRG-ROM and CHR-RAM/ROM size trimming, whereas the CICOp demands pins like CPU D0-3 and the signals useful for all the hoped-for features. So it looks like I've got enough room for the two separate footprints with their varying pinouts. One perk is that it creates a bit of a backup plan if things go south with implementing async CIC timing using TIM4 alone: could actually have two STM8s on board, one acting as a CIC and the other as a dedicated co-processor. That certainly doesn't meet the goal of minimal hardware if it actually comes down to that last resort. But considering the STM8 is one of the lowest cost MCUs on the market, having a second one isn't that crazy if it's actually getting well utilized, and it's much cheaper than a CPLD.
So here's the planned pinout:
PA1 & PA2: CIC Din, Dout, and wire ORed CIC Reset
PA3: CPU R/W
PB5: PRG-ROM /WE pin to support flash writes without '139 logic gate on UxROM (jumper makes optional for alt func)
PB4 & PB5 alternate function: I2C bus available pinned out to female header. Could support RTC, etc.
PC3 TLI: this is an external NMI pin for the STM8, using this for the mapper bit so any other PORTC pins can trigger interrupts with a separate isr. Using TLI gives a little extra insurance that CICOp register read/writes have highest priority.
PC4 & PC5: TIM1 & TIM2 output channels. Came up with nifty way to arrange four SMT resistor pads in a square with the mcu pins in opposite corners. IRQ and PWM DAC signals are in other two corners. So placing the resistors horizontally will map TIM1 to 6502 /IRQ (support scanline/CPU cycle counting), and TIM2 to PWM DAC (edge aligned PWM). Mounting resistors vertically maps TIM1 to PWM DAC (center aligned), and TIM2 to /IRQ (async timer only). I was about to give up on option for center aligned PWM DAC till I realized this trick and 0ohm resistor for /IRQ kept routing simple.
PC6: TIM1_CH2 6502 m2 clock for cycle counting with TIM1
PC7: TIM1_CH1 PPU clock, jumper selectable between (default) PPU A12 and (alt) PPU /RD
PD1-4: CPU D0-3
PD5: H/V mirroring control with small MUX gate
PD6: Debug output trying to fit in an LED if I can
PD5 & PD6: can be dual purposed as they're routed to female header for UART to support low cost BT/WiFi modules to be added on.
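The PC4/PC5 resistor trick boils down to a two-way routing choice. Here's a rough sketch of it in C, just to make the trade-off explicit. All the type and function names are made up for illustration and don't come from any real firmware.

```c
/* Hypothetical names; only the routing itself comes from the pinout notes. */
typedef enum { RES_HORIZONTAL, RES_VERTICAL } resistor_orientation;
typedef enum { SRC_TIM1, SRC_TIM2 } timer_source;
typedef enum { PWM_EDGE_ALIGNED, PWM_CENTER_ALIGNED } pwm_mode;

typedef struct {
    timer_source irq_source;      /* timer output wired to 6502 /IRQ */
    timer_source dac_source;      /* timer output wired to the PWM DAC */
    pwm_mode     dac_pwm;         /* PWM alignment available to the DAC */
    int          irq_can_count;   /* 1 if /IRQ timer can count scanlines/CPU cycles */
} timer_routing;

/* Map the resistor mounting direction to the resulting signal routing. */
static timer_routing routing_for(resistor_orientation o)
{
    timer_routing r;
    if (o == RES_HORIZONTAL) {
        r.irq_source    = SRC_TIM1;            /* TIM1 sees m2 / PPU clock inputs */
        r.dac_source    = SRC_TIM2;
        r.dac_pwm       = PWM_EDGE_ALIGNED;
        r.irq_can_count = 1;
    } else {
        r.irq_source    = SRC_TIM2;            /* async timer only */
        r.dac_source    = SRC_TIM1;
        r.dac_pwm       = PWM_CENTER_ALIGNED;
        r.irq_can_count = 0;
    }
    return r;
}
```

So the board option is really a choice between counted IRQs with an edge aligned DAC, or a center aligned DAC with a free-running IRQ timer.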
So in the end giving up on the SPI bus in favor of scanline/CPU cycle counters gave some breathing room for the pinout. I'm not even sure I need CPU R/W, but there's not much benefit to leaving it out as there are already pins to spare.
Current layout supports CICOp on a decent variety of discrete mappers. Really the only thing that's needed is a spare mapper bit to interrupt the STM8 for CICOp register access. Planning support for UxROM w/512KB PRG-ROM or less, BxROM with 512KB PRG-ROM or less, CNROM with 64KB CHR-ROM or less, and Colordreams with 256KB PRG-ROM or less (or 64KB CHR-ROM or less).
Haven't even started laying traces for the board yet, but the current rat's nest and component density look manageable.
As with most things like this I typically realize dumb i/o assignment choices when I actually start writing firmware for the design. So I whipped up a little prototype board with the STM8 on a breakout board. Wrote a little NES test rom and the CICOp register access isr and have successfully transferred data between the 6502 and CICOprocessor!
Realizing some flaws and limitations to my original proposal:
Firstly, the STY $5x0x and STX $5x0x opcodes aren't the best choice because they don't make the CICOp register being accessed variable unless the routine is self modifying.
Secondly it's limiting because we're running out of registers in this routine. I'm trying to keep it as short as possible, and using A to maintain the current bank is a bit of a waste of instruction time and 6502 registers.
Lastly I had assumed the mapper wasn't subject to bus conflicts. But ensuring that may require extra hardware on board, so coming up with a bus conflict compatible routine is helpful. I got around this by simply requiring that a specific bank is always active during this routine, so these values can now be hard coded and align with a bank table.
Remember the CICOp doesn't decode CPU addresses, it's merely snooping the CPU bus during opcode fetching. So really for the 6502 to give data to the CICOp we only need an instruction that presents info on the CPU data bus at some point. Something like STA $5000, X with the register offset in X doesn't work as X is never present on the CPU data bus. Looking at other options I landed on indirect addressing through a ZP pointer with STA (ZP), Y. This works because the 6502 fetches the ZP pointer bytes from sram, so we can sniff their values to glean which CICOp register is being accessed. And while it appears annoying that Y gets consumed/zeroed, its value is actually moot: Y can be any value as the CICOp can't even see it.
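To make the sniffing idea concrete, here is a small C model of what shows up on the data bus during the six cycles of STA (zp),Y, and what a device that only sees D0-3 could pull out of it. This is only a sketch: the struct and function names are invented, and it assumes the pointer high byte carries the register's upper nibble and the pointer low byte the lower nibble ($5x0x read left to right).

```c
#include <stdint.h>

/* Data-bus value on each of the six cycles of the 6502's STA (zp),Y. */
typedef struct {
    uint8_t data[6];
} sta_indy_trace;

/* zp_ram: 256-byte zero page, zp: operand byte, a: accumulator,
   open_bus: whatever floats on the bus during the dummy read. */
static sta_indy_trace trace_sta_indy(const uint8_t *zp_ram, uint8_t zp,
                                     uint8_t a, uint8_t open_bus)
{
    sta_indy_trace t;
    t.data[0] = 0x91;                       /* opcode fetch */
    t.data[1] = zp;                         /* operand: ZP address of pointer */
    t.data[2] = zp_ram[zp];                 /* pointer low byte read from ZP */
    t.data[3] = zp_ram[(uint8_t)(zp + 1)];  /* pointer high byte read from ZP */
    t.data[4] = open_bus;                   /* dummy read of target address */
    t.data[5] = a;                          /* the actual write of A */
    return t;
}

/* What a D0-3 snooper can recover (assumed nibble order, see above). */
typedef struct {
    uint8_t reg;      /* register number, H:L */
    uint8_t data_lo;  /* low nibble of the byte being written */
} cicop_sniff;

static cicop_sniff cicop_decode(const sta_indy_trace *t)
{
    cicop_sniff s;
    uint8_t reg_lo = t->data[2] & 0x0F;     /* from pointer low byte  $0x */
    uint8_t reg_hi = t->data[3] & 0x0F;     /* from pointer high byte $5x */
    s.reg     = (uint8_t)((reg_hi << 4) | reg_lo);
    s.data_lo = t->data[5] & 0x0F;          /* from A on the write cycle */
    return s;
}
```

Note that Y never appears in the trace, which is why its value doesn't matter to the CICOp.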
Having learned a bit more about STM8 interrupts, I also realized that the TLI interrupt is edge sensitive, not level sensitive. So the priority I had to clear $8000.7 mapper bit ASAP is of no value. We can simply clear, then set the TLI mapper bit to create the needed rising/falling edge for the interrupt.
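A quick way to convince yourself that clear-then-set always produces exactly one trigger, whatever state the mapper bit was left in, is to count edges over the write sequence. The sketch below assumes the TLI is configured for the rising edge of mapper bit 7. The helper name is made up.

```c
#include <stddef.h>
#include <stdint.h>

#define MAPPER_TLI_BIT 0x80u  /* $8000.7, the bit wired to the TLI pin */

/* Count rising edges on the TLI bit over a sequence of mapper writes,
   starting from the given previous mapper value. */
static unsigned count_rising_edges(const uint8_t *writes, size_t n,
                                   uint8_t initial)
{
    unsigned edges = 0;
    uint8_t prev = initial & MAPPER_TLI_BIT;
    for (size_t i = 0; i < n; i++) {
        uint8_t cur = writes[i] & MAPPER_TLI_BIT;
        if (!prev && cur)
            edges++;          /* low -> high transition fires the interrupt */
        prev = cur;
    }
    return edges;
}
```

Writing the bit high when it's already high gives zero edges, which is why the bit never needs to be cleared at the end of a transfer, only before the next one.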
So with all that, this is what I came up with:
So by getting a little tricky with choice of instructions I was able to reduce the number of timing sensitive cycles by 1 cycle, while also transferring 1 more nibble of 6502 read data! Granted that cycle count doesn't include all the preparation to get all the data loaded into ZP, A, & X, and also verify the transfer was successful. But those portions can be tailored/optimized by the user if desired. Would be reasonable to have separate read/write routines on the NES. Also possible some CICOp registers could be defined to have fewer read nibbles, and the STM8 would simply bail from the isr once write data was received depending on the register number. Or perhaps some registers are read/write but the actual value doesn't matter, as with something like an IRQ acknowledge/clear register for example.
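The 26 vs 25 cycle figures can be double checked by tallying standard 6502 cycle counts for the instructions in the two routines from this post. A small sketch in C, where the table layout and names are just for illustration:

```c
#include <stddef.h>

/* One instruction and its standard NMOS 6502 cycle count. */
typedef struct {
    const char *insn;
    unsigned cycles;
} timed_insn;

/* Original proposal: absolute stores/loads plus the AND to clear bit 7. */
static const timed_insn OLD_ROUTINE[] = {
    {"STA abs", 4}, {"STY abs", 4}, {"STX abs", 4},
    {"LDY abs", 4}, {"LDX abs", 4}, {"AND #imm", 2}, {"STA abs", 4},
};

/* Revised routine, counted from the store that triggers the CICOp. */
static const timed_insn NEW_ROUTINE[] = {
    {"STY abs", 4}, {"STA (zp),Y", 6}, {"STX zp", 3},
    {"LDX abs", 4}, {"LDY abs", 4}, {"LDA abs", 4},
};

#define INSN_COUNT(r) (sizeof(r) / sizeof((r)[0]))

/* Sum the cycle counts of a routine. */
static unsigned total_cycles(const timed_insn *r, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += r[i].cycles;
    return sum;
}
```

The revised routine swaps the AND and final STA for a third load, so it moves three read nibbles instead of two in one cycle less.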
I feel a lot more confident about the robustness of this routine now. I'm also a bigger fan of always performing both read and write to CICOp. If the register is defined as a write only register, then the last 3 LoaD instructions can actually be used for verification on the 6502 side to ensure that the transaction was successful. The CICOp can repeat back the exact data byte that it heard from the 6502. My thought for now is the 3rd LoaD could be an xor of the two nibbles of the register number. That would help verify that it also sniffed the register number correctly. But in the end this data could be defined as whatever we'd like later on. Appear to have a pretty rock solid transfer routine between the 6502 and CICOp for both reads and writes which is good enough for now!
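Here's a sketch of how that echo-plus-xor verification could look for a write only register, modeled in C from both sides. The reply layout (data L, data H, then the xor nibble) follows the idea above, but the names are invented and, as noted, the real definition may well change later. Only the low nibble of each read is meaningful since the CICOp only drives D0-3.

```c
#include <stdbool.h>
#include <stdint.h>

/* The three nibbles the CICOp presents on the three loads. */
typedef struct {
    uint8_t lo;   /* 1st load: low nibble of the data it heard  */
    uint8_t hi;   /* 2nd load: high nibble of the data it heard */
    uint8_t err;  /* 3rd load: xor of the two register-number nibbles */
} cicop_reply;

/* CICOp side: build the reply for a write-only register. */
static cicop_reply cicop_echo_reply(uint8_t reg, uint8_t data)
{
    cicop_reply r;
    r.lo  = (uint8_t)(data & 0x0F);
    r.hi  = (uint8_t)((data >> 4) & 0x0F);
    r.err = (uint8_t)(((reg >> 4) ^ reg) & 0x0F);
    return r;
}

/* 6502 side: compare the loaded values, ignoring the upper nibbles
   (those bits are not driven by the CICOp). */
static bool reply_matches(cicop_reply r, uint8_t reg, uint8_t data)
{
    cicop_reply want = cicop_echo_reply(reg, data);
    return (r.lo  & 0x0F) == want.lo &&
           (r.hi  & 0x0F) == want.hi &&
           (r.err & 0x0F) == want.err;
}
```

One limitation of a plain xor is that swapping the two register nibbles gives the same check value, so it catches a single mis-sniffed nibble but not a swap.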
Implementing things on the STM8 side ended up working pretty well. In the end I don't even poll the CPU R/W level to try and remove the STM8's 5 cycle isr latency variation. It didn't actually help, as polling creates its own jitter. And the STM8 isr latency variation isn't that bad: in practice it's only 3 cycles (187nsec).
Reason for that is the 1-6 cycles advertised in the STM8 programming manual for the time needed to complete the current instruction assumes worst case. It assumes the STM8 may be executing FAR instructions, and this core doesn't even have any FAR address space. So that cuts us down to 1-5 cycles. However there are really only 3 instructions that are 5 cycles, and we can easily avoid them. They are CALL subroutine with indirect pointer, and LoaD/STore Word with indirect pointer. These 3 instructions are pretty easily avoided as immediate addressing is typically all that's needed for them. So that reduces the number of cycles needed to complete the current STM8 instruction to 1-4 cycles, which is only 3 cycles of jitter. In practice I verified this with dozens of logic analyzer captures, so it all checks out.
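The jitter numbers are just the spread of the completion window divided by the clock. A tiny sketch, assuming a 16 MHz STM8 core clock (which is what 3 cycles = 187nsec implies); the function name is made up:

```c
/* Assumed STM8 core clock: 16 MHz, i.e. 62.5 ns per cycle. */
#define STM8_HZ 16000000.0

/* Latency jitter in nanoseconds when the interrupted instruction can
   take anywhere from min_cycles to max_cycles to complete. */
static double jitter_ns(unsigned min_cycles, unsigned max_cycles)
{
    return (double)(max_cycles - min_cycles) * 1e9 / STM8_HZ;
}
```

With the manual's worst case window of 1-6 cycles this gives 312.5 ns, and with the 1-4 cycle window it gives 187.5 ns.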
One idea I had to remove jitter would be to set an STM8 GPIO interrupt on the CPU R/W pin, and the first step of the TLI isr would be to enable the CPU R/W interrupt and then WFI wait for interrupt. This would remove most of the jitter, but I fear this would suffer from incompatibility with various console versions. The edge timing of CPU R/W in relation to M2 could easily vary on clones, AVS, etc. And this current setup bases all its timing off the mapper register bit setting, which should be pretty reliable. My only real concern for timing is drift of the STM8 internal oscillator; may have to trim the oscillator if temperature variation becomes an issue.
For the 6502 being as slow as it is, 187nsec of isr jitter is manageable. With properly timed and cycle counted STM8 code I was able to reliably capture and present all necessary data for the CICOp data transfer routine I just outlined. My current tests are a simple one-time check at boot/reset that's printed to the screen. Planning on running some automated tests to really exercise the routines. But early tests look good on an original NTSC front loader and a portable clone I keep handy. A separate routine will be needed for PAL support, but there is enough free time early in the isr to branch between separate PAL/NTSC routines with their own fixed cycle counted timing. NTSC is worst case (faster), so I'm not too concerned about PAL.
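To put the NTSC vs PAL difference in STM8 terms, here's a rough budget calculation. It assumes a 16 MHz STM8 clock and the usual CPU clocks of about 1.789773 MHz (NTSC) and 1.662607 MHz (PAL). The names are made up for the example.

```c
/* Assumed clocks; the STM8 figure is an assumption, the CPU figures are
   the commonly quoted NTSC and PAL NES CPU frequencies. */
#define STM8_HZ     16000000.0
#define NTSC_CPU_HZ 1789773.0
#define PAL_CPU_HZ  1662607.0

/* STM8 cycles that elapse during one 6502 cycle. */
static double stm8_cycles_per_cpu_cycle(double cpu_hz)
{
    return STM8_HZ / cpu_hz;
}

/* STM8 cycles that elapse during a stretch of 6502 code. */
static double stm8_cycles_for(double cpu_hz, unsigned cpu_cycles)
{
    return stm8_cycles_per_cpu_cycle(cpu_hz) * (double)cpu_cycles;
}
```

That works out to roughly 8.9 STM8 cycles per 6502 cycle on NTSC and about 9.6 on PAL, so 3 cycles of isr jitter is around a third of a 6502 cycle. It also shows why one fixed cycle counted routine can't serve both: over the 25 cycle transfer the two systems drift apart by about 17 STM8 cycles.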
One idea I had was for the STM8 to verify that CPU R/W was low when it should be for added robustness. One concern I realized when writing this is that extreme caution needs to be exercised to ensure the NES doesn't get interrupted in the middle of this routine. If it did, the STM8 would blindly output data onto the bus when it comes time for the last load instructions. If the 6502 isn't executing this code because it was interrupted, that could easily cause a CPU crash. The only real way to guarantee that is for NMIs to be turned off during this routine along with disabling interrupts. That's probably the best call for early NES development using the CICOp until one is certain that an NMI won't occur mid transfer.
I still need to draft up an async NES CIC implementation using TIM4 alone for CIC timing so we can dedicate TIM1 to scanline/CPU cycle counting or center aligned PWM DAC. But there's quite a bit of dead/nop time in my current isr to allow for CIC transfers in the middle of the isr. So I've got a clearer outlook on my timing constraints that the async CIC will have to cater to. That part is still a decent challenge and will require lots of testing, but with successful 6502-CICOp read & write data transfers under my belt I'm pretty confident.
Now that I've got a good means to communicate between the 6502 and CICOp it's time for the fun stuff. Next up is trying out some audio synthesis and PWM DAC experiments, along with scanline/CPU cycle counting tests!
Code (original proposal):
;now that everything's prepared, perform the mapper write:
STA $8000 ;write to discrete mapper with bit 7 set
STY $5x0x ;write lower nibble to mcu
STX $5x0x ;write upper nibble to mcu
;write complete, now read back the old value that was in the mcu register
LDY $5x0x ;read old value from mcu register (lower nibble)
LDX $5x0x ;read old value from mcu register (upper nibble)
AND #$7F ;clear bit 7 so we can disable mcu's interrupt
STA $8000 ;write to discrete mapper with bit 7 clear, CIC mcu interrupt complete
26 cycles of timing sensitive code total
Code (revised routine):
;cicop_reg is a ZP pointer that needs to be initialized to $5x0x
; where the 'x' values denote the CICOp register number being accessed
; the '5' and '0' in $5x0x are actually "don't care" as the CICOp can't see their values.
; The address in cicop_reg just needs to be an empty address space that won't conflict with anything else.
;Lower nibbles of A and X must contain the value (byte) that is desired to be written to the CICOp register
ldy #CICOP_BANK_DIS
sty CICOP_ADDR_DIS ;8C 00 C0
ldy #CICOP_BANK_EN ;A0 80
;trigger CICOp to start transfer operation:
sty CICOP_ADDR_EN ;8C 80 C0
;first allow CICOp to sniff reg number, and write low nibble of data
sta (cicop_reg), y ;91 04, y doesn't actually matter as CICOp can't see it
;now the CICOp has register H:L from ZP sniffing, and data L from A lower nibble
;write upper nibble, contained in lower nibble of X
stx cicop_reg ;86 04, doesn't actually matter what ZP byte is written to
;cicop_reg gets stomped, could use a different ZP byte to avoid this
;now the CICOp has data H from sniffing store of X to ZP
;now time to read data from CICOp
;the data is returned in specific order,
;the CICOp doesn't know what register (X,Y,A) the 6502 is loading into
ldx CICOP_PORT ;AE 00 50, data L
ldy CICOP_PORT ;AC 00 50, data H
lda CICOP_PORT ;AD 00 50, data E (error/verification)
;END timing sensitive code. The CICOp has now freed itself to go back to whatever it was doing
;don't actually have to clear mapper enable bit. CICOp won't trigger until next rising edge.
25 cycles total of timing sensitive code.