I'm a little late to the party, but..
I don't think this has been mentioned yet, but in case anyone didn't know, a cheap trick to fit more tiles into the same memory is to store two 1bpp tiles combined into one NES tile. You display them individually by using different palettes. The trick is to use the 'extra' color for pixels that appear in both tiles, so the four color indices of the NES tile are used like this: transparent, tile1color, tile2color, tile1&2color. The actual palettes you send to the PPU would then look like:
set 1 - BG, color1, BG, color1
set 2 - BG, BG, color2, color2.
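Here's a rough sketch of the idea in Python, in case it helps to see it spelled out. It assumes the usual NES pattern table layout (8 bytes of bitplane 0 followed by 8 bytes of bitplane 1 per tile), with tile 1 stored in plane 0 and tile 2 in plane 1. The function names and the three color values are made up for illustration only.

```python
# Illustrative palette entries (arbitrary NES color numbers, not from the post).
BG, COLOR1, COLOR2 = 0x0F, 0x30, 0x16

# The two palettes from the post: each one hides one of the packed tiles.
PALETTE_SHOW_TILE1 = (BG, COLOR1, BG, COLOR1)
PALETTE_SHOW_TILE2 = (BG, BG, COLOR2, COLOR2)


def combine_tiles(tile1, tile2):
    """Pack two 8x8 1bpp tiles (8 bytes each, one byte per row) into one
    16-byte NES tile: tile1 becomes bitplane 0, tile2 becomes bitplane 1."""
    if len(tile1) != 8 or len(tile2) != 8:
        raise ValueError("each 1bpp tile must be exactly 8 bytes")
    return bytes(tile1) + bytes(tile2)


def color_indices(nes_tile):
    """Decode a 16-byte NES tile into 8 rows of 2-bit color indices."""
    rows = []
    for y in range(8):
        low, high = nes_tile[y], nes_tile[y + 8]
        rows.append([
            ((low >> (7 - x)) & 1) | (((high >> (7 - x)) & 1) << 1)
            for x in range(8)
        ])
    return rows


def render(nes_tile, palette):
    """Map each pixel's color index through a 4-entry palette."""
    return [[palette[i] for i in row] for row in color_indices(nes_tile)]


# Example: two tiles that overlap in the middle two columns.
tile1 = bytes([0b11110000] * 8)
tile2 = bytes([0b00111100] * 8)
packed = combine_tiles(tile1, tile2)
```

Rendering `packed` with `PALETTE_SHOW_TILE1` gives back only the tile 1 pixels, and `PALETTE_SHOW_TILE2` gives back only the tile 2 pixels, since index 3 (the overlap) maps to the visible color in both palettes.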
Shiru wrote:Just a random thought. Isn't an ATMega MCU fast enough and has enough pins to emulate a ROM for NES, i.e. put requested data from internal ROM on data bus pins as fast as needed? It probably can also act as CIC.
ATMega64 could be enough for this application (32K for PRG, 32K for AVR code), and its price is under $10, which is a bit less than thousands.
I actually thought about doing that a while back with the Propeller MCU (with eight 32-bit cores, it's an odd one). I coded a small part of it (never built the hardware), and it would have been fast enough for PRG, but unfortunately it was a little too slow for CHR. CIC was not possible because that MCU must always bootload (it's RAM-based), so you have to wait 500 ms on power-up. I mostly wanted it for CHR, and PRG alone wasn't interesting enough, so I ditched it. But if the Prop2 is ever produced (and is faster), it could potentially become a very powerful CHR-RAM. Those CPUs can have zero latency for this because, instead of polling or taking an interrupt, you just have a dedicated CPU core run a "wait until input equals" instruction. But imagine your CHR-RAM having multiple (emulated) 6502 cores inside it. Why not move metatile rendering into it, or keep the entire level data in it and have the NES write scroll values to it? Or perhaps another core could handle hit detection if you sent it object coordinates. Instead of dual-port, it's like octo-port memory. I'd better leave something for the NES to do, hehe.
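To put rough numbers on the "fast enough for PRG, too slow for CHR" part, here's a back-of-the-envelope budget in Python. The NTSC clock figures are the standard ones (CPU = master / 12, PPU dot = master / 4, one PPU memory access = 2 dots). The Propeller figures (80 MHz clock, 4 clocks per instruction) are my assumptions for a typical setup, and the helper name is made up. The real usable window is smaller than a full cycle once address and data setup times are counted, so treat these as upper bounds.

```python
NTSC_MASTER_HZ = 21_477_272

# One 6502 bus cycle is 12 master clocks; one PPU fetch is 2 dots of 4 clocks.
CPU_CYCLE_NS = 12 / NTSC_MASTER_HZ * 1e9
PPU_FETCH_NS = 2 * 4 / NTSC_MASTER_HZ * 1e9

# Assumed Propeller (P1) figures: 80 MHz, 4 clocks for most instructions.
PROP_CLOCK_HZ = 80_000_000
PROP_CLOCKS_PER_INSTR = 4
PROP_INSTR_NS = PROP_CLOCKS_PER_INSTR / PROP_CLOCK_HZ * 1e9


def instruction_budget(window_ns, instr_ns=PROP_INSTR_NS):
    """Upper bound on whole MCU instructions that fit in one bus window."""
    return int(window_ns // instr_ns)


if __name__ == "__main__":
    print("PRG window: %.0f ns, budget %d instructions"
          % (CPU_CYCLE_NS, instruction_budget(CPU_CYCLE_NS)))
    print("CHR window: %.0f ns, budget %d instructions"
          % (PPU_FETCH_NS, instruction_budget(PPU_FETCH_NS)))
```

Under those assumptions that's about 11 instructions per CPU cycle versus about 7 per PPU fetch, before subtracting any setup time, which fits with CHR being the tight one.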
I'm not familiar with the ATMega at all, but with the PIC32 I believe that if you're willing to write branch-less code (just a string of LDA #xx / STA $2007 can be useful), it should be possible for the PIC32 to feed code/data directly to the NES via DMA transfers to its parallel port, which would free up the PIC32 CPU for other tasks. Perhaps that could work on chykyn's ENIO CPU board.
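For what it's worth, building that kind of instruction stream is trivial. Here's a minimal Python sketch that turns a data buffer into the raw 6502 bytes for a run of LDA #xx / STA $2007 pairs, which is what the MCU would hand out on sequential fetches. The opcodes are the standard ones (LDA immediate = $A9, STA absolute = $8D); the function name is made up, and how the stream actually gets DMA'd to the bus is left out entirely.

```python
LDA_IMM = 0xA9     # LDA #imm, 2 bytes, 2 cycles
STA_ABS = 0x8D     # STA abs,  3 bytes, 4 cycles
PPUDATA = 0x2007   # PPU data port

BYTES_PER_WRITE = 5    # size of one LDA/STA pair
CYCLES_PER_WRITE = 6   # CPU cycles spent per data byte written


def unrolled_ppu_writes(data):
    """Return branch-less 6502 machine code that writes each byte of
    `data` to $2007 in order (address operand is little-endian)."""
    out = bytearray()
    for value in data:
        out += bytes([LDA_IMM, value & 0xFF,
                      STA_ABS, PPUDATA & 0xFF, PPUDATA >> 8])
    return bytes(out)


# Example: two data bytes become ten bytes of code.
stream = unrolled_ppu_writes(b"\x12\x34")
```

The cost is 5 bytes of "code" and 6 CPU cycles per data byte, and since there are no branches the 6502 just fetches straight through it, so the MCU never has to react to an unexpected address.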