I'm pretty excited about this. Let me see if I understand the question properly.
My understanding:
The cart has mass-storage which holds all PRG and CHR data.
Data copying to RAM can occur behind the scenes with only a mapper register write.
Copying a routine that will be executed will require loading the data into RAM before the code is executed.
Accessing data from random places in a bank will require preloading, as you never know where in the bank the first location to access will be positioned.
Data accessed in a linear fashion at a rate >= 8 cycles per byte can be streamed without significant impact on CPU memory space.
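To put rough numbers on that 8-cycles-per-byte figure, here is a back-of-the-envelope sketch in Python. The NTSC clock rate and cycles-per-frame values are the usual approximate figures; the function names are my own, and the result ignores everything else the CPU has to do in a frame, so treat it as an upper bound rather than a spec for the mapper.

```python
# Rough throughput for streaming at a fixed cycles-per-byte rate.
# Assumed constants: approximate NTSC figures, rounded.
NTSC_CPU_HZ = 1_789_773      # approximate NTSC CPU clock in Hz
CYCLES_PER_FRAME = 29_780    # approximate CPU cycles per NTSC frame


def bytes_per_frame(cycles_per_byte: int) -> int:
    """Upper bound on bytes moved per frame if the CPU did nothing else."""
    return CYCLES_PER_FRAME // cycles_per_byte


def seconds_to_stream(num_bytes: int, cycles_per_byte: int) -> float:
    """Time to move num_bytes at a steady cycles_per_byte rate."""
    return num_bytes * cycles_per_byte / NTSC_CPU_HZ


if __name__ == "__main__":
    # 8 KB is used here only as an example size (a full set of pattern tables).
    print(bytes_per_frame(8))                       # 3722
    print(round(seconds_to_stream(8 * 1024, 8), 3))  # 0.037
```

So at 8 cycles per byte the ceiling is a little under 4 KB per frame, and an 8 KB block takes a bit over two frames if nothing else is competing for the CPU.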
I'd like to try and consider this from as many angles as I can given the information provided.
My first impulse is that this is a per-level load, since the way I currently use bankswapping on an MMC3 would require instant access to the map bank or the object bank. From there, I need instant access to my game code, level bank, object bank, DPCM, and any supplementary code.
The way I can structure the data changes greatly. Fixed bank sizes are a thing of the past, so I could create an engine to populate my CPU RAM with the best spacing for the data exclusive to my level. I won't have to store additional levels in the same level bank, so I can cram the data for each level into the smallest space possible. I could write a routine to rewrite all of my immediate address calls to their new addresses. With the size of the mass storage, if somebody wanted to get their game done more quickly, they could just store each level linearly, with a new copy of their game engine along with each level, or perhaps unique code for different levels.
I don't know much about the NES audio hardware yet, but I'm guessing that DPCM data could be streamed in this fashion. If that's so, then it should only need a very small space in which to load, constantly rewriting over itself. The amount of space consumed by DPCM during the entire frame is a major contributing factor to running out of PRG space. Other banks you can swap out when unused but not DPCM.
I tend to want to go low on this so that I'd see the benefits in other areas:
Memblers wrote:assume smaller CPU RAM greatly allows other mapper features, finer sized CHR pages, lowers hardware cost, etc.
The CHR and how that works will be more important to me than PRG size. Code size, you can almost always find ways to reduce, especially given the large array of options this mapper would present. However, graphics, like DPCM, they always have to be there during the frame. I almost feel like I have to know for certain how the CHR will work and how much of that will have to pass through CPU RAM. If I'd like to store a hypothetical maximum of 128KB CHR on a per-load basis, how does that affect my CPU RAM requirements?
All of this considered, I wonder how substantial a difference I would see in other areas from a total lack of banking for CPU RAM, i.e. 32 KB. If I can load my per-level program data without extraneous information, 32 KB may be workable for me for PRG. I'm still a little fuzzy on how this would affect CHR though, so I'm not casting my vote yet.
8 CPU cycles per byte access is fast enough to load CHR-RAM even with an unrolled loop (LDA absolute, STA $2007), some data like pattern tables and possibly raw nametables wouldn't even need to be in CPU memory ever.
If I'm understanding this properly, I'd need to start a background copy to a certain location, allocate probably 256 bytes for buffer (so it will wrap easily), and begin loading my tiles in CPU RAM while writing them to VRAM. This contradicts what you said though, so I believe I am missing something in my interpretation.
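To make my interpretation concrete, here is a toy Python model of the 256-byte wrapping buffer I have in mind. This is only a sketch of my own reading, not how the mapper actually works: the class and function names are hypothetical, the "producer" stands in for the background copy, and the "consumer" stands in for the CPU loop that would write each byte on to VRAM.

```python
class RingBuffer256:
    """Toy model of one 256-byte page used as a wrapping stream buffer."""

    def __init__(self) -> None:
        self.data = bytearray(256)
        self.head = 0    # next slot the (hypothetical) background copy fills
        self.tail = 0    # next slot the CPU reads from
        self.count = 0   # bytes currently waiting in the buffer

    def produce(self, byte: int) -> bool:
        """Store one byte; return False if the buffer is full."""
        if self.count == 256:
            return False
        self.data[self.head] = byte
        self.head = (self.head + 1) & 0xFF   # 8-bit index wraps for free
        self.count += 1
        return True

    def consume(self):
        """Return the oldest byte, or None if the buffer is empty."""
        if self.count == 0:
            return None
        byte = self.data[self.tail]
        self.tail = (self.tail + 1) & 0xFF
        self.count -= 1
        return byte


def stream(source: bytes, burst: int = 64) -> bytes:
    """Push source through the ring buffer and collect what comes out."""
    buf = RingBuffer256()
    out = bytearray()
    i = 0
    while len(out) < len(source):
        # Background copy fills the buffer until it is full or out of data.
        while i < len(source) and buf.produce(source[i]):
            i += 1
        # CPU drains up to `burst` bytes (stand-in for the writes to VRAM).
        for _ in range(burst):
            byte = buf.consume()
            if byte is None:
                break
            out.append(byte)
    return bytes(out)
```

The reason for picking 256 bytes is that an 8-bit index wraps on its own, so neither side needs an explicit bounds check on the pointer, only on whether the buffer is full or empty.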