I'm having a hard time putting down this whole idea of a discrete mapper with CIC mapper expansion. The infinite number of possibilities that could be unlocked without increasing the BOM cost by a single cent is hard to keep myself from daydreaming about. I've come up with what I think is a fairly clean way to handle mcu mapper register reads and writes. But until someone comes to me and wants to write software targeting this idea, I have a hard time motivating myself to devote the time to fully developing it. On top of that, the idea of implementing this in an emulator on anything but a highly abstracted level sounds like living hell. So for now, I'll just document my idea here in public as best I can. Good chance I'll use it as a reference in the future if there is outside interest and this idea does become a reality. Or perhaps someone else would like to take my idea and run with it, which I'm perfectly fine with.
infiniteneslives wrote:
My proposed pin assignments would allow for 4bit nibble wide read/writes at a minimum. If one wasn't looking to utilize the UART then the entirety of PORT D could be used for 6bit wide accesses.
There is a problem though as we can't be certain the mcu is always able to listen to writes to $6000. The mcu could be currently interrupted by CIC comms which must have a higher priority. I can't think of a very clean way to get around this without adding dedicated logic.
Now that I'm more familiar with the STM8 and the CIC's requirements, I've got a better idea of how to handle R/W accesses from the NES CPU. The key comes from making the mcu register r/w interrupt higher priority than the CIC comm timer. This is possible because of the relatively large 6.7usec window we have to output a CIC stream bit; let's just call it a 5usec window to be conservative. With that large of a window, there is time to service a potential CIC transfer inside the mcu mapper register r/w ISR *if* there's an explicit definition of how to r/w to the mcu mapper reg.
My hardware proposal is still similar to my original idea of dedicating one of the discrete mapper flipflop bits to interrupt the mcu. Let's say $8000.7 for discussion's sake. The NES CPU must set this bit, then r/w from the mcu register, and clear $8000.7 in rapid succession. If we explicitly state how this instruction sequence is to be performed, it also provides the benefit of simplifying address decoding. We can actually create a large number of mcu registers effectively decoded by NES CPU address, while only utilizing CPU R/W and CPU D0-3 as mcu inputs. But how??
This is my NES CPU instruction sequence proposal on how to write a byte to the mcu:
Code:
pseudo code as preparation for write is not timing sensitive, just trying to illustrate idea:
-load A with byte that would like to write to mcu
-transfer A to Y register (Y will contain the lower nibble to write to mcu reg)
-shift A register to the right 4 times (places upper nibble of mcu reg value in bits 3-0)
-transfer A to X register (X will contain the upper nibble that'll be written to mcu reg, but it's placed in bits 3-0 as that's all the mcu sees)
-load A with current bank of discrete mapper register
-set bit 7 of A (this bit being set @ $8000.7 will interrupt the CIC mcu for mapper r/w)
;now that everything's prepared, perform the mapper write:
STA $8000 ;write to discrete mapper with bit 7 set
STY $5000 ;write lower nibble to mcu
STX $5000 ;write upper nibble to mcu
AND #$7F ;clear bit 7 so we can disable mcu's interrupt
STA $8000 ;write to discrete mapper with bit 7 clear, CIC mcu interrupt complete
So we're defining the exact sequence of what must be done whenever $8000.7 is set. We know a STY and STX instruction will follow immediately after $8000.7 is set. And the NES CPU won't waste any time clearing $8000.7 once the write is complete. This creates a very specific set of timing constraints from the CIC mcu's perspective.
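To make the nibble handling concrete, here's a minimal Python model of the preparation steps above. It's only a sketch: the function names are made up for illustration, and it assumes the mcu is wired to CPU D0-3 only, so whatever is in the upper bits of Y and X never reaches it.

```python
def split_for_mcu(value):
    """Model the prep steps: return (y, x) as the 6502 registers would hold them."""
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in one byte")
    y = value        # TAY: Y holds the full byte, the mcu only sees bits 3-0
    x = value >> 4   # four right shifts then TAX: upper nibble now in bits 3-0
    return y, x


def mcu_view(bus_byte):
    """Assumption: the mcu is connected to CPU D0-3 only, so it latches the low nibble."""
    return bus_byte & 0x0F


def reassemble(low_nibble, high_nibble):
    """What the mcu would rebuild from the STY and STX writes."""
    return (high_nibble << 4) | low_nibble


y, x = split_for_mcu(0xA7)
print(hex(reassemble(mcu_view(y), mcu_view(x))))  # prints 0xa7
```

Y never needs masking on the 6502 side, since the unconnected upper data lines take care of that for free.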
And since we know exactly what NES CPU instructions are being used, we can simplify address decoding by sniffing the CPU data bus alone. The mcu doesn't need to decode any CPU address lines with this trick, but it has visibility of whichever CPU data pins it's connected to. For this implementation, I've chosen to only connect the STM8 to the lower nibble CPU D0-3. The tssop-20 STM8 doesn't have a full 8bit wide GPIO port pinned out, the most it gives is 6 pins with PORT D1-6.
By sniffing D0-3 during the STX/STY instructions, we can glean CPU A11-8, and CPU A3-0 for the upcoming write cycle. We can afford to cut out CPU A13/14 for mcu decoding purposes, as the mcu is no longer listening for $6000-7FFF as in my original idea. Here the mcu can only decode CPU A11-8 & A3-0. But that's pretty legit as it gives us a potential for 256 mcu mapper registers to work with. To be clear, CPU A15-A12 aren't actually being decoded with this implementation. Selection of $5000-5FFF for the location of the mcu register is arbitrary; that's simply a convenient address space which doesn't conflict with anything else in the NES CPU memory map.
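Here's a small Python sketch of that decode, assuming the mcu latches D3-0 while the 6502 fetches the low and high operand bytes of the absolute-addressed store. The function names are hypothetical; the point is just that the two sniffed nibbles form an 8 bit register index, and that any address bits the mcu can't see alias onto the same register.

```python
def sniffed_nibbles(address):
    """Return (A3-0, A11-8) as seen on CPU D3-0 during the two operand fetches."""
    adl = address & 0xFF          # low operand byte, fetched first
    adh = (address >> 8) & 0xFF   # high operand byte, fetched second
    return adl & 0x0F, adh & 0x0F


def register_index(address):
    """Combine the two sniffed nibbles into one of 256 mcu register numbers."""
    low, high = sniffed_nibbles(address)
    return (high << 4) | low


print(hex(register_index(0x5A03)))  # prints 0xa3
# A7-4 and A15-12 are invisible to the mcu, so these alias to the same register:
print(register_index(0x5A03) == register_index(0x5A13))  # prints True
print(len({register_index(a) for a in range(0x5000, 0x6000)}))  # prints 256
```

Sticking to the $5x0x form on the 6502 side keeps the aliasing from ever mattering.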
Since the mcu can sniff D3-0 during the STX/STY opcode fetch, it can differentiate between STY/STX by the lower nibble of their opcodes, $8C/$8E. So we don't have to require that both the upper and lower nibble always be written, nor in a specific order. We just have to pick a convention of X/Y being Hi/Lo nibbles and stick with it.
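A quick Python sketch of what the mcu could tell from the opcode's low nibble alone. The opcode values are the standard 6502 absolute-addressing encodings; the classify helper is hypothetical. One thing it shows is that LDY/LDX share the same low nibbles as STY/STX, so the mcu would have to lean on CPU R/W to tell a load from a store.

```python
# Standard 6502 opcodes for absolute addressing
OPCODES = {"STY": 0x8C, "STX": 0x8E, "LDY": 0xAC, "LDX": 0xAE}


def low_nibble(opcode):
    """All the mcu sees of an opcode fetch when wired to CPU D3-0 only."""
    return opcode & 0x0F


def classify(d3_0):
    """Hypothetical decode: which index register does this opcode nibble refer to?"""
    if d3_0 == 0xC:
        return "Y"
    if d3_0 == 0xE:
        return "X"
    return None


for name, opcode in OPCODES.items():
    print(name, hex(low_nibble(opcode)), classify(low_nibble(opcode)))
```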
The requirement to clear $8000.7 asap comes from the fact we can't tie up too much of the mcu's time, as the $8000.7 interrupt is getting set as the mcu's highest priority interrupt. So we have to free it so it can get back to CIC mangle calculations and such in its main thread. While there's an abundant amount of time that could allow for more than 1 byte to be written at once, things get complex quick trying to define a larger r/w routine with explicitly defined timing to provide the mcu.
If one had an application where larger transfers were desired, my idea about requesting the CIC mcu to interrupt the NES CPU when there's a sufficiently large period of time that CIC comms can be ignored is the better solution and would be relatively easy to implement. We need a means to transfer a single byte before we can solve the KByte transfer solution.
Before we get too far, I want to come up with a definition of our convention for mapper register reads. I'm expecting that this can be pulled off somehow. Although details on the best way to do this didn't start to come to me until I started thinking about how the mcu ISR would work. The CIC mcu ISR for register r/w gets tricky quick. There's a lot of things it needs to ensure and they're all timing sensitive. One of the biggest issues becomes accounting for the 5 cycle jitter for when the ISR starts executing. Putting more burden on the ISR with tasks like determining if the 6502 is reading or writing really starts to become a challenge. We don't have much option with this hardware definition to use a separate ISR for both reads and writes. The only good way to have separate R/W ISRs would be to devote another discrete mapper flipflop bit, one for reads, one for writes. I don't much like that idea though, we may not have bits to spare.
Here's my KISS solution that combines the NES CPU mcu register reads into the same routine with writes:
Code:
pseudo code as preparation for write is not timing sensitive, just trying to illustrate idea:
1) load A with byte that would like to write to mcu (can skip to step 5 if only care about reading from mcu)
2) transfer A to Y register (Y will contain the lower nibble to write to mcu reg)
3) shift A register to the right 4 times (places upper nibble of mcu reg value in bits 3-0)
4) transfer A to X register (X will contain the upper nibble that'll be written to mcu reg, but it's placed in bits 3-0 as that's all the mcu sees)
5) load A with current bank of discrete mapper register
6) set bit 7 of A (this bit being set @ $8000.7 will interrupt the CIC mcu for mapper r/w)
;now that everything's prepared, perform the mapper write:
STA $8000 ;write to discrete mapper with bit 7 set
STY $5x0x ;write lower nibble to mcu
STX $5x0x ;write upper nibble to mcu
;write complete, now read back the old value that was in the mcu register
LDY $5x0x ;read old value from mcu register (lower nibble)
LDX $5x0x ;read old value from mcu register (upper nibble)
AND #$7F ;clear bit 7 so we can disable mcu's interrupt
STA $8000 ;write to discrete mapper with bit 7 clear, CIC mcu interrupt complete
;At this point we've effectively completed a SWAP operation between X/Y registers lower nibbles and mcu mapper register $5x0x
This may seem a little confusing as to why we're writing, and then reading. And what if you didn't want to overwrite the value of a register, and only wanted to read it? My thought is that the mcu register definitions would overcome this issue. We've got up to 256 registers to work with, so just define them as read only, or write only as needed. So the NES CPU code you're writing probably only cares about read or write, but by using a swap operation, we can kill two birds (read & write) with one stone (mcu ISR).
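Here's a rough Python model of the swap from the mcu's side, just to show how read only registers fall out of the register definitions. The class and its read_only set are hypothetical, and it assumes the 6502 only gets meaningful data on D3-0 during the loads, since that's all the mcu drives.

```python
class McuRegisterFile:
    """Hypothetical model of the 256 mcu mapper registers."""

    def __init__(self, read_only=()):
        self.regs = [0] * 256
        self.read_only = set(read_only)

    def swap(self, index, y, x):
        """One STY-STX-LDY-LDX pass: store the new nibbles, return the old ones.

        Returns (low_nibble, high_nibble) as LDY and LDX would see them on D3-0.
        """
        old = self.regs[index]
        if index not in self.read_only:
            self.regs[index] = ((x & 0x0F) << 4) | (y & 0x0F)
        return old & 0x0F, old >> 4


mcu = McuRegisterFile(read_only={0x01})
mcu.regs[0x01] = 0x5C
print(mcu.swap(0x01, 0xA7, 0x0A))  # prints (12, 5); register keeps 0x5C
mcu.regs[0x12] = 0x34
print(mcu.swap(0x12, 0xA7, 0x0A))  # prints (4, 3); register now holds 0xA7
```

A write only register would be the mirror image: the store is honored and whatever comes back from the loads is simply ignored by the 6502 code.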
Additionally I'm going to discard my earlier idea that the mcu will decode STY/STX by sniffing the opcode. As I get into the details of the ISR, the more we can simplify by fixing the convention of the 6502's r/w routine, the easier life is for the STM8. So for discussion's sake we'll require the sequence of STY-STX-LDY-LDX as laid out by the routine above. Additionally, we'll effectively require that routine to be copy-pasted into 6502 assembly code, with the only possible change being the mcu register address. The x's in $5x0x denote address nibbles that can be modified. But the addresses of all four loads/stores must match. The mcu ISR isn't going to have time to decode each and every one and adapt on the fly. If the 6502's read/write routine is running in ROM, this definition would require a separate routine for each register. That may not be an issue if only using a few registers. A more versatile way would be to execute the routine from SRAM and use self modifying code to change the absolute address of the STY-STX-LDY-LDX instructions prior to executing the read/write routine.
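To illustrate the self modifying code idea, here's a Python sketch that builds the routine's machine code and patches the four operand addresses. The opcode bytes are the standard 6502 encodings; the trailing RTS, the template layout and the function name are my own assumptions for illustration. On the real 6502 the patch would just be a handful of stores into the SRAM copy before jumping to it.

```python
ROUTINE_TEMPLATE = bytes([
    0x8D, 0x00, 0x80,  # STA $8000  (bit 7 set, interrupts the mcu)
    0x8C, 0x00, 0x50,  # STY $5000  (lower nibble to mcu)
    0x8E, 0x00, 0x50,  # STX $5000  (upper nibble to mcu)
    0xAC, 0x00, 0x50,  # LDY $5000  (old lower nibble from mcu)
    0xAE, 0x00, 0x50,  # LDX $5000  (old upper nibble from mcu)
    0x29, 0x7F,        # AND #$7F   (clear bit 7)
    0x8D, 0x00, 0x80,  # STA $8000  (mcu interrupt complete)
    0x60,              # RTS        (assumption: routine is called as a subroutine)
])

# Offsets of the low operand byte of STY, STX, LDY and LDX within the template
OPERAND_OFFSETS = (4, 7, 10, 13)


def patch_routine(register_address):
    """Return a copy of the routine with all four accesses aimed at one $5x0x address."""
    if register_address & 0xF0F0 != 0x5000:
        raise ValueError("address must be of the form $5x0x")
    routine = bytearray(ROUTINE_TEMPLATE)
    for offset in OPERAND_OFFSETS:
        routine[offset] = register_address & 0xFF
        routine[offset + 1] = register_address >> 8
    return bytes(routine)


print(patch_routine(0x5A03).hex(" "))
```

Patching all four operands from one place is what guarantees the addresses match, which is the one thing the mcu ISR has no time to verify.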
Now to try and explain how all this would work from the CIC mcu's perspective... Now that we've got explicitly defined timing of bus operations from the time that the mcu receives its $8000.7 interrupt, we can utilize cycle counting within the mcu mapper r/w ISR to latch address and data from the NES CPU. But since this ISR is designed to be of higher priority than the dedicated CIC comm ISR, the mapper r/w ISR must also handle necessary CIC comms should they be needed while it's running.
I've gotten into the details of how the STM8 CIC KEY would run asynchronous from the console's LOCK in previous posts in this thread. The basic idea is that there's an mcu timer which is used for counting down to when the next CIC transfer needs to occur. My plan is to use TIM2 for this purpose which in reality can only count up, but math can turn that around. The timer's ISR will account for drift of the clocks by polling LOCK's Dout when expected to be high. That ISR will also set/clear KEY Dout as necessary, but it's a lower priority routine than this mcu register r/w ISR I'm about to discuss.
The CIC mcu is running at 16MHz with a 62.5nsec period, and the NES is running at 1.79MHz with a period of 559nsec (assuming worst case NTSC). So there are ~8.9 STM8 cycles per 6502 cycle. And we've got a window of 5usec in which a CIC bit must be output when needed. That CIC window equates to ~8.9 cycles on the 6502 (5usec / 559nsec), and 80 cycles on the STM8 (5usec / 62.5nsec). So it looks as though we've got plenty of time to get everything done if our ISR is smart enough.
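The arithmetic is easy to double check with a few lines of Python, using the same periods and the conservative 5usec window quoted above. The variable names are just for illustration.

```python
STM8_PERIOD_NS = 62.5    # 16MHz STM8
CPU_PERIOD_NS = 559.0    # NTSC 6502 figure used above
CIC_WINDOW_NS = 5000.0   # conservative 5usec window

stm8_per_6502 = CPU_PERIOD_NS / STM8_PERIOD_NS
window_in_6502_cycles = CIC_WINDOW_NS / CPU_PERIOD_NS
window_in_stm8_cycles = CIC_WINDOW_NS / STM8_PERIOD_NS

print(f"{stm8_per_6502:.1f} STM8 cycles per 6502 cycle")   # 8.9
print(f"{window_in_6502_cycles:.1f} 6502 cycles per window")  # 8.9
print(f"{window_in_stm8_cycles:.0f} STM8 cycles per window")  # 80
```

The two 8.9 figures matching is a coincidence of the numbers: 559nsec / 62.5nsec and 5000nsec / 559nsec both happen to land near 8.94.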
Here's some pseudo code and STM8 assembly to give a timeline of how I picture the ISR working; cycle numbers on the left are STM8 cycles. I'm sure there are some errors in the exact timing of everything, but this gets the idea across.
Code:
0: NES CPU sets $8000.7 to trigger ISR (6502 end of STA $8000 cycle T3)
1-6: complete instruction in execute cycle (1-6 cycles) -Ooof! we'll have to account for that potential jitter...
2/7-11/16: push registers to stack (9 cycles)
8-17: jump to ISR (docs not explicit on # of cycles, assuming it's 1 cycle like the JMP instruction)
9-18: start executing ISR
Oops! The 6502 has executed ~1-2 cycles by this point..
We don't have a good way to ensure we can sniff T0 & T1 of the first STX/STY, which means we don't know if it's STX/STY
One possible solution would be to define that a NOP is required between STA $8000 and STX/STY.
-No one likes wasting time! And this still doesn't solve the jitter issue.
Another would be to just make it convention that STY is first, however ADL in T1 (our ability to sniff CPU A3-0) may have passed us by.
-This is the reason I made the decision to nix the ability to handle different orders of STX/STY, and require all addresses to match.
We also have to account for the 5 cycle ISR latency jitter, and get aligned with the 6502.
We could let the ISR spin polling CPU R/W and align itself when it goes low.
-This is half of the reason why STore is first, and LoaD is second.
-Other half of reason is logically this is only way X/Y registers can be preserved during a SWAP.
Perhaps it's for the best that STY T0 & T1 have passed us by as we didn't yet have a way to account for ISR jitter anyway
So at this point we know PRG R/W will go low around cycle 27, but we're somewhere between cycle 9-18 and don't know where..
Additionally every ~80 STM8 (or ~8 6502) cycles we need to check the CIC comm timer and output a bit if necessary.
;spin until R/W low for STY T3
rw_still_high:
BTJT rw_port, #rw_bit, rw_still_high ;2/3 cyc
STY cycle T3 starts around STM8 cycle 27
Now we've accounted for jitter and we're ~29 STM8 cycles since 6502 set $8000.7
;Delay a few cycles until CPU D3-0 should be valid for STY T3
30: NOP, NOP...
;Latch CPU D0-3 for STY T3
33: MOV low_wr_data, data_port
6502 is about to go from STY T3 to STX T0 (occurs at STM8 cycle ~36) , this is a good time to handle a CIC comm if needed.
;Rough STM8 assembly idea of how to check if it's time to output a CIC comm (~6 STM8 cycles on either path, per the counts below)
LDW X, TIM2_CNTR ;2cyc
SUBW X, #$FFF0 ;2cyc
JRMI no_comm_needed ;1/2cyc
MOV Dout_port, out_val ;1cyc
no_comm_needed:
"reset" count for CIC comm window.
We're in the middle of STX T0 currently. Delay until can sniff ADL from STX T1
NOP, NOP...
;Latch CPU D0-3 for STX T1 to sniff CPU A3-0
50: MOV low_addr, data_port
;Delay till STX T2 to sniff CPU A11-8
NOP, NOP...
59: MOV high_addr, data_port
;Delay till STX T3 to latch upper nibble of mcu register write
NOP, NOP...
68: MOV high_wr_data, data_port
69-99:
All data has been latched for mcu register write, we also know CPU A11-8 & A3-0 for upcoming register read.
We'll assume that the register address can be mapped to a fixed block of STM8 SRAM.
During this time we'll consume a few STM8 cycles to piece together latched high_addr:low_addr
and map that to an STM8 address we can set the X register to point to.
Copy, shift, and mask that the lower nibble of data into data_output_port for upcoming LDY T3
Copy, shift, and mask that the upper nibble of data into SRAM for quick access when time for LDX T3.
Perhaps 30 cycles isn't enough time to handle all that, but it should be for simple tasks.
Worst case require a NOP inserted between STX-LDY if needed.
Even better idea: move AND #$7F instruction between STores and LoaD instructions!
100:
register read lower nibble already stored in data port output register.
Set port register DDR to enable register data to drive 6502 data bus D3-0
Delay while 6502 is latching read for LDY T3
107:
disable data port output drivers with mcu DDR
108:
It's been ~72 cycles since we checked if a CIC comm was needed. Perfect time to check again.
Copy prepared SRAM byte back in cycles 69-99 from SRAM to data port output register
Delay till LDX T3
136:
enable data port output DDR
Delay while 6502 is latching read for LDX T3
142:
disable data port output drivers with mcu DDR
Need to wait for NES CPU to clear $8000.7
This will take 6 cycles on 6502, STM8 can't return from interrupt until complete to prevent re-entry.
Should perform some more CIC comm timer checks during this time.
STM8 IRET takes a whopping 11 cycles, worst case a CIC comm timer interrupt occurs during that IRET.
Need to ensure adequate time for CIC comm timer interrupt to handle a comm that's needed as this ISR returns.
Additionally this routine left KEY Data high if a comm was needed.
Need to ensure the CIC comm timer ISR will clean up after this routine and clear Dout when no longer needed.
;return back to main thread where CIC mangle operations can continue.
;or whatever request made by the 6502 via this routine can be performed.
IRET
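The STM8 cycle numbers in the timeline above can be cross checked with a short Python sketch. It assumes cycle 0 is the end of STA $8000 as in the timeline, ~8.944 STM8 cycles per 6502 cycle (559nsec / 62.5nsec), and the standard 6502 cycle counts for these instructions. The names are hypothetical, and rounding to whole STM8 cycles is only approximate since the two clocks are asynchronous.

```python
STM8_PER_6502 = 559.0 / 62.5  # ~8.944 STM8 cycles per 6502 cycle

# (mnemonic, 6502 cycles) for the required sequence following STA $8000
SEQUENCE = [("STY", 4), ("STX", 4), ("LDY", 4), ("LDX", 4), ("AND", 2), ("STA", 4)]


def timeline():
    """Return [(label, approximate STM8 cycle at which that 6502 cycle starts)]."""
    events = []
    cpu_cycle = 0
    for mnemonic, length in SEQUENCE:
        for t in range(length):
            events.append((f"{mnemonic} T{t}", round(cpu_cycle * STM8_PER_6502)))
            cpu_cycle += 1
    events.append(("done", round(cpu_cycle * STM8_PER_6502)))
    return events


for label, stm8_cycle in timeline():
    print(f"{stm8_cycle:4d}: {label}")
```

This puts STY T3 at STM8 cycle 27 and STX T0 at 36, in line with the timeline, and the final STA $8000 completes just under 200 STM8 cycles after the interrupt was triggered.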
Phew, there you have it! So this mcu register r/w routine would hold the highest interrupt priority on the STM8, with the CIC comm timer having second priority. Any other interrupts would have to have a lower priority than these two, and the STM8 must be set to nested interrupt management mode. That way higher priority interrupts are able to interrupt lower priority ones, ensuring mcu register r/w are always serviced, and no CIC comms are missed. Beyond all this one just needs to ensure the mcu isn't overworked and that it has adequate time to complete CIC mangle calculations.
The biggest risk for this would be if the NES programmer were to perform multiple mcu register r/w operations back to back. Would have to do some worst case analysis on the time required for mangle calculations. This entire ISR is ~200 STM8 cycles, which is only ~13usec, that's a relatively small amount of time on the scale of the CIC timing and calculations.
My current mangle table routine is ~100 STM8 instructions, isn't well optimized, takes ~42usec to perform the mangle, and spins for ~30usec waiting for the console CIC to perform its calculations. So that's somewhere around 60% cpu utilization during the most intensive calculations. During bit transfers, the CIC timer xfr ISR should only need ~5-10% cpu utilization tops. A conservative estimate would be that 70% of CIC time is mangle calc, and 30% is bit transfers. That weighted average STM8 cpu utilization comes out to ~50% which I consider a rather conservative estimate.
A practical rule that would keep from over utilization would be to require ~20usec (~35 NES CPU cycles) between mcu register accesses.
EDIT: My original estimate was flawed in that it neglected to account for the fact that an asynchronous STM8 CIC would be running at 16MHz, not 4MHz. Here's a better cpu utilization estimate:
Code:
Mangle calculation:
STM8: 100 instructions = 10.5usec
CIC average mangle 80usec = 13% CPU utilization during mangle calculations
Bit transfers:
Estimate timer ISR to run for ~5usec average maximum (time that pulse is high plus drift trimming)
CIC period of bit transfers 79usec = 6% utilization during bit transfers
Average number of bit transfers = 8 * 79usec = 632usec
Average number of mangle calcs = 16 * 80usec = 1280usec
% time bit transfers = 632 / 1912 = 33%
% time mangle calc = 1280 / 1912 = 67%
Weighted utilization:
bit transfers 6% * 33% = 1.98%
mangle calc 13% * 67% = 8.71%
total utilization 1.98% + 8.71% = 10.7%
So in reality the CIC operations only utilize ~10% of the STM8's processing time. The register read/write ISR is ~12.5usec, and with the time the 6502 is going to have to spend processing data coming in and out, it's going to have a hard time overloading the STM8 with r/w accesses alone. In practice one might set a rule to not let that 12.5usec exceed 75% of the STM8's utilization. That would equate to providing the STM8 with ~4usec (~7 NES CPU cycles) between mcu register accesses. That's only a couple of instructions, which isn't really enough to do anything worthwhile between accesses. In practice I wouldn't expect the STM8 cpu utilization to become an issue until it started being tasked with compute intensive tasks such as sound synthesis, or large UART data transfers perhaps? Those tasks would be made a lower priority than register accesses and CIC comms, so at least they wouldn't risk locking up the console/CIC.
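For anyone wanting to rerun the estimate with their own numbers, here's a small Python sketch of the same calculation using the figures from the estimate above. The function names are made up for illustration. It computes the utilization without rounding the intermediate percentages, so it lands slightly higher than the hand calculation.

```python
XFER_BUSY_US, XFER_PERIOD_US, XFERS = 5.0, 79.0, 8          # bit transfers
MANGLE_BUSY_US, MANGLE_PERIOD_US, MANGLES = 10.5, 80.0, 16  # mangle calcs
ISR_US = 12.5            # register r/w ISR duration
CPU_PERIOD_US = 0.559    # one NES CPU cycle


def cic_utilization():
    """Fraction of STM8 time spent on CIC work over one transfer + mangle span."""
    busy = XFERS * XFER_BUSY_US + MANGLES * MANGLE_BUSY_US
    total = XFERS * XFER_PERIOD_US + MANGLES * MANGLE_PERIOD_US
    return busy / total


def min_gap_us(isr_us, max_share):
    """Idle time needed after each ISR so the ISR takes at most max_share of the time."""
    return isr_us * (1.0 / max_share - 1.0)


gap = min_gap_us(ISR_US, 0.75)
print(f"CIC utilization: {cic_utilization() * 100:.1f}%")
print(f"gap for 75% rule: {gap:.2f}usec = {gap / CPU_PERIOD_US:.1f} NES CPU cycles")
```

Unrounded, the CIC work comes out just under 11%, and the 75% rule works out to a gap of about 4.2usec, or between 7 and 8 NES CPU cycles.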