65816 really 16 bit or just a 6502 with 16 bit registers

Discussion of hardware and software development for Super NES and Super Famicom. See the SNESdev wiki for more information.

93143
Posts: 1715
Joined: Fri Jul 04, 2014 9:31 pm

Re: 65816 really 16 bit or just a 6502 with 16 bit registers

Post by 93143 »

AWJ wrote:The S-CPU's external bus does demux the address and data lines
There must be something I don't get here. Why are the memory speeds (200/120 ns) apparently specified under the assumption that they can't start responding until the beginning of phi2?
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Re: 65816 really 16 bit or just a 6502 with 16 bit registers

Post by Near »

Damn, four pages in one day. Slow down, people :P

The 65816 has a 24-bit address bus and an 8-bit data bus. And probably by necessity, is designed to do 16-bit operations split over two cycles each, with a few exceptions (like XBA.)
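To make the "split over two cycles" point concrete, here is a minimal sketch of how a 16-bit load looks when the data bus is only 8 bits wide. The `Bus` struct and `readWord` helper are made-up names for illustration (not higan or bsnes code), the 64 KiB array is a stand-in for real memory mapping, and bank-wrapping rules are ignored.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical 8-bit-wide bus: every access moves one byte and costs one cycle.
struct Bus {
  std::array<uint8_t, 0x10000> ram{};  // toy backing store, not a real memory map
  unsigned cycles = 0;

  uint8_t read(uint32_t address) {
    assert(address <= 0xffffff);  // addresses are 24 bits wide
    cycles++;
    return ram[address & 0xffff];
  }
};

// A 16-bit load is two byte-wide accesses: low byte first, then high byte.
inline uint16_t readWord(Bus& bus, uint32_t address) {
  uint16_t lo = bus.read(address);
  uint16_t hi = bus.read(address + 1);
  return static_cast<uint16_t>(lo | (hi << 8));
}
```

Reading a word this way always bumps `cycles` by two, which is the whole cost being argued about here.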

I would still call it a 16-bit CPU, but the designation has always been meaningless, driven by marketing for bullshit like the "64-bit Jaguar" and "128-bit Dreamcast."

Not worth the time to think about, let alone debate for one view or another.
Nicole wrote:I think any debates about which console is more powerful have ceased to be relevant decades ago.
Exactly. The only thing that matters is the games.

The reason I worked on the SNES first is because I love the games. I'm a huge JRPG fan, and the SNES has: Lufia 1&2, Breath of Fire 1&2, Final Fantasy IV-VI, Chrono Trigger, Bahamut Lagoon, Rudra's Secret Treasure, Star Ocean, Tales of Phantasia, Tengai Makyou Zero, Dragon Quest I&IIR, IIIR, V, VI, Aretha 1&2, Dai Kaijuu Monogatari I&II, and on and on.

The Genesis has........... Phantasy Star IV. Which is also amazing.

But that we're still waging the "Genesis does what Nintendon't" war in 2016 is just beyond the pale ridiculous.

And indeed, I'm the #1 SNES fan out there (bought over 2100 games, know more about it than probably!! anyone else alive), and yet I'm working on a Genesis emulator now.
AWJ wrote:No SNES emulator to my knowledge even attempts to emulate SA-1 memory contention. bsnes always runs it at the full 10MHz and lets DMA occur instantaneously.
I suspect we could fix the latter without much of a performance penalty. v100 shipped with broken SA-1 IRQs, still gotta do a patch release for that, but I've been too busy.
AWJ
Posts: 433
Joined: Mon Nov 10, 2008 3:09 pm

Re: 65816 really 16 bit or just a 6502 with 16 bit registers

Post by AWJ »

byuu wrote: I suspect we could fix the latter without much of a performance penalty. v100 shipped with broken SA-1 IRQs, still gotta do a patch release for that, but I've been too busy.
I'm not worried about performance so much as about actually breaking games by changing the behavior without adequate research. It seems most likely that SA-1 DMA can either stall the SA-1 or run in parallel with it, depending on the priority register setting and on the memory regions used by the DMA, the SA-1 and the S-CPU. If you make DMA stall when it shouldn't, games will run too slowly and in the worst case communication between the CPUs might break down. If you let DMA run in the background when it should actually stall, games might try to use data that hasn't finished transferring yet (because on real hardware execution wouldn't resume until the transfer was finished)

The current behavior is definitely wrong but at least it lets all games work (though I am a bit suspicious of the unmapped accesses SD Gundam does)

Emulating the Genesis should give you some grounding in emulating bus contention in a heterogeneous multi-processor system (AFAIK the relationship between the 68K and the Z80 is somewhat close to that between the S-CPU and the SA-1)
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Re: 65816 really 16 bit or just a 6502 with 16 bit registers

Post by Near »

It seems most likely that SA-1 DMA can either stall the SA-1 or run in parallel with it, depending on the priority register setting and on the memory regions used by the DMA, the SA-1 and the S-CPU.
Oh, gotcha. That explains why I went with the choice I did, then. Thought it was weird I'd leave out some simple step(cycles) calls.
Emulating the Genesis should give you some grounding in emulating bus contention in a heterogeneous multi-processor system (AFAIK the relationship between the 68K and the Z80 is somewhat close to that between the S-CPU and the SA-1)
Yeah, I'm "looking forward to it."

It's not that I don't understand the problem, it's that libco is an awful choice to emulate such a thing.

I may have to force the synchronization to be once-per-opcode for the 68K/Z80. We'll see how it goes. I'm currently averaging ~1800fps with only the 68K core running alone. Still need to add the Z80, YM2612, PSG, VDP.

Another possibility is to consider C++17 stackless coroutines. I might be able to write a scheduler wrapper around libco and it.

EDIT: you're gonna flip when you see the 68K core, too. For the sake of performance, I had to templatize the size parameter of instructions (8-bit, 16-bit, 32-bit modes ... so triplicates of most instructions and the effective address decoding routines.) I've only implemented about 50% of the instructions, and already the core is 72KiB of source code (the largest of all 15 processors in higan) and generates a 700KiB object file (the next largest is ~200KiB.)
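For anyone wondering what "templatize the size parameter" looks like in practice, here is a rough sketch. It is not the actual higan 68K core; `Byte`/`Word`/`Long`, `clip`, `sign`, `Flags`, `add` and `addBySizeField` are all illustrative names, the X flag is left out, and the dispatch assumes the common 00 = byte, 01 = word, 10 = long size field.

```cpp
#include <cassert>
#include <cstdint>

enum : unsigned { Byte = 1, Word = 2, Long = 4 };  // operand size in bytes

// Mask covering the low Size bytes; computed in 64 bits so Long does not overflow.
template<unsigned Size> constexpr uint32_t mask() {
  return static_cast<uint32_t>((uint64_t{1} << (Size * 8)) - 1);
}
template<unsigned Size> constexpr uint32_t clip(uint32_t data) { return data & mask<Size>(); }
template<unsigned Size> constexpr bool sign(uint32_t data) { return (data >> (Size * 8 - 1)) & 1; }

struct Flags { bool c = false, v = false, z = false, n = false; };

// One source-level ADD; the compiler emits a separate copy for each Size.
template<unsigned Size> uint32_t add(uint32_t source, uint32_t target, Flags& flags) {
  source = clip<Size>(source);
  target = clip<Size>(target);
  uint64_t wide = uint64_t{source} + target;
  uint32_t result = clip<Size>(static_cast<uint32_t>(wide));
  flags.c = (wide >> (Size * 8)) & 1;                            // carry out of the top bit
  flags.v = sign<Size>((source ^ result) & (target ^ result));   // signed overflow
  flags.z = result == 0;
  flags.n = sign<Size>(result);
  return result;
}

// The size is only decoded once, at dispatch time.
inline uint32_t addBySizeField(unsigned sizeField, uint32_t source, uint32_t target, Flags& flags) {
  switch (sizeField) {
    case 0: return add<Byte>(source, target, flags);
    case 1: return add<Word>(source, target, flags);
    case 2: return add<Long>(source, target, flags);
  }
  assert(false && "size field 3 is not a plain operand size");
  return 0;
}
```

The trade-off is the one described above: no per-instruction size checks at run time, at the price of three instantiations of nearly everything.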
AWJ
Posts: 433
Joined: Mon Nov 10, 2008 3:09 pm

Re: 65816 really 16 bit or just a 6502 with 16 bit registers

Post by AWJ »

byuu wrote:It's not that I don't understand the problem, it's that libco is an awful choice to emulate such a thing.
I don't think the problem is with libco per se, but with your scheduler design that stores each device's time base in terms of clocks relative to a parent device. That works beautifully with the SNES where the devices form a natural tree (the PPU and cartridge coprocessors only communicate with the S-CPU, and the DSP only communicates with the SMP), but it falls apart when both CPUs can communicate with the video and audio hardware.

You probably hate how much I bring up MAME, but I think you should take a look at its scheduler design. In MAME each device's time base is stored in a common unit (attoseconds) relative to power-on. In MAME this requires using 128-bit integers to avoid dealing with overflow, but you could get away with using master clocks as the common unit of time since the MD only has one crystal (I think... Someone correct me if I'm wrong) Once per emulated frame or so, check if any of the time bases is close to overflowing and if any of them is, subtract the lowest time base from all of them (or just subtract the lowest time base unconditionally--probably simpler and faster actually)

I'm not surprised in the least that your 68K dwarfs any of the other CPU cores you've written. The 68K really is the Cadillac of 16 bit microprocessors. 68K bigots who don't consider the 65816 a "real" 16 bit CPU because of its Spartanness should realize that their own favorite CPU is just as much an outlier in the opposite direction.
tepples
Posts: 22705
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: 65816 really 16 bit or just a 6502 with 16 bit registers

Post by tepples »

Doesn't the Genesis still derive all clocks from the same 53.6931 MHz crystal? Z80 is f/15, dot clock is f/8 or f/10, and 68000 is f/7.
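The divider arithmetic can be written out as a small sketch. It assumes the commonly quoted NTSC master clock of 53.693175 MHz (the "53.6931 MHz" above, with more digits); the constant and function names are made up for illustration.

```cpp
#include <cassert>

// Assumed NTSC master crystal frequency in Hz.
constexpr double masterClockHz = 53693175.0;

// Dividers as listed in the post above.
constexpr unsigned m68kDivider = 7;
constexpr unsigned z80Divider = 15;

// Frequency of a clock derived by dividing the master clock.
inline double dividedClockHz(unsigned divider) {
  assert(divider != 0);
  return masterClockHz / divider;
}
```

With those numbers the Z80 lands on exactly 3,579,545 Hz and the 68000 on roughly 7.67 MHz.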
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Re: 65816 really 16 bit or just a 6502 with 16 bit registers

Post by Near »

> I don't think the problem is with libco per se, but with your scheduler design that stores each device's time base in terms of clocks relative to a parent device.

You're correct in that the design to have one signed integer represent the time difference between two components (neither is the parent) won't work for the Genesis. It did work for the five emulators I've written previously, but here ... both the 68K and Z80 can talk with the PSG.

The solution isn't that complicated, though. We just need three counters instead of two:
68K <> Z80
Z80 <> PSG
68K <> PSG
The tricky part will be where to put them and how to name them.
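A rough sketch of what the three-counter idea could look like, purely as an illustration and not as higan's actual scheduler code: `RelativeClock`, `GenesisClocks` and the member names are hypothetical, and all steps are expressed in master clocks so the PSG's divider does not have to be assumed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pairwise clock: one signed value per pair of devices.
// Positive means device A is ahead of device B; negative means B is ahead.
struct RelativeClock {
  int64_t skew = 0;

  void stepA(unsigned masterClocks) { skew += masterClocks; }
  void stepB(unsigned masterClocks) { skew -= masterClocks; }

  bool aIsAhead() const { return skew > 0; }
  bool bIsAhead() const { return skew < 0; }
};

struct GenesisClocks {
  RelativeClock m68kZ80;  // 68K <> Z80
  RelativeClock z80Psg;   // Z80 <> PSG
  RelativeClock m68kPsg;  // 68K <> PSG

  // (68K - Z80) + (Z80 - PSG) must always equal (68K - PSG).
  bool consistent() const { return m68kZ80.skew + z80Psg.skew == m68kPsg.skew; }

  // Every step has to update each pair the device takes part in.
  void stepM68k(unsigned masterClocks) {
    m68kZ80.stepA(masterClocks);
    m68kPsg.stepA(masterClocks);
    assert(consistent());
  }
  void stepZ80(unsigned masterClocks) {
    m68kZ80.stepB(masterClocks);
    z80Psg.stepA(masterClocks);
    assert(consistent());
  }
  void stepPsg(unsigned masterClocks) {
    z80Psg.stepB(masterClocks);
    m68kPsg.stepB(masterClocks);
    assert(consistent());
  }
};
```

The `consistent()` check is there because the three values are not independent: any two of them determine the third.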

But libco is indeed an issue here. The thing is, whenever the 68K accesses shared memory with the Z80, or vice versa, it has to switch to the other to catch up on time. This is always going to be immensely painful. The only magic trick is rollbacks, which Nemesis has already tried with the Genesis.

We know from experience that, with the DSP being as simple a device as it is, a state machine results in a speedup over using libco. You and I disagree on how much ... I observed maybe 3-5%, I think you said 8-10%? But given that's against the total emulator framerate, that's pretty significant for a minor change.

We'll probably be doing around one million of these switches a second with the 68K<>Z80, just like with the SMP<>DSP. The smarter move here would be to try and write the Z80 as a state machine, and enslave it to the 68K. But, you know how stubborn I can be with consistency.

> You probably hate how much I bring up MAME

I get it, at least. You're a MAME dev. Ostensibly you work on it because you believe in what it's doing and how it's designed. It's like me bringing up cooperative threading all the time. I respect that.

> I think you should take a look at its scheduler design

I don't think I'd be too successful tearing at its source code in isolation. But I'd be up for hearing more about it from you, if you were up for it. If not, no big deal.

> the MD only has one crystal

That is correct. Every chip is powered off clock dividers against it.

> The 68K really is the Cadillac of 16 bit microprocessors.

Agreed. From my perspective, the 68K is atrocious.

I understand that as a developer writing code for the system, the 68K is indeed like a Cadillac and the 65816 like a Ford Pinto.

But I admire simplicity more than anything else. I think that's evident in my extreme NIH and efforts to minimize code. I'm extremely proud of having an 8KiB ZIP decoder, an 8KiB PNG decoder (requires the ZIP decoder), a 20KiB web server, etc.

The 68K, from a backend hardware design perspective, is an absolute mess. Many instructions are missing certain effective addressing modes, for absolutely no discernible reason (sometimes you can see why; but often it just feels completely arbitrary whether you get to use the PC with index/displacement modes. Newer models start adding them back, so it's clearly possible.) The instruction encoding is just completely off the wall ... sometimes 00 = byte, 01 = word, 10 = long ... sometimes you get these "opmode" 3-bit prefixes to encode what's effectively only three values. Sometimes it's one bit. Sometimes it's two bits, but the bits are in a different ordering ... 10 = byte, 11 = word, 01 = long. For byte/word modes, address register ops usually sign extend, data register ops usually leave the upper bits alone. But there's always exceptions. MOVEM sign extends to data registers, too. MOVEQ doesn't sign extend, but fills all the bits of the data registers. Many instructions ignore the size prefix completely when registers are used as destination addresses. Shifting by zero in a register does weird things with flags, whereas shifting by zero with an immediate turns into shift-by-eight. Sometimes an immediate is limited to 3 bits, sometimes 8 bits, sometimes you can load 16 bits or 32 bits from the opcode extensions. Lots of instructions do spurious read cycles for no reason at all. I don't even want to get into 68020 and above ... those just make things a dozen times worse. This is by far the nastiest chip I've worked with, and I've emulated ARM7 and 80186. I feel like this chip had five or more designers creating instructions, and none of them worked with each other to ensure any sort of logical consistency.
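As one small example of the quirks listed above, here is a sketch of decoding the immediate count of a register-form shift/rotate opcode, where a count field of zero means eight. The helper names are made up, and the sketch only handles the register form (size field not 11), not the memory-shift form.

```cpp
#include <cassert>
#include <cstdint>

// Register-form shifts/rotates: bits 11-9 hold the count (or a register number),
// and bit 5 selects an immediate count (0) or a count taken from a register (1).
inline bool countIsImmediate(uint16_t opcode) { return ((opcode >> 5) & 1) == 0; }

// Immediate counts cover 1..8; the encoding stores 8 as 0.
inline unsigned immediateShiftCount(uint16_t opcode) {
  assert(countIsImmediate(opcode));
  unsigned field = (opcode >> 9) & 7;
  return field ? field : 8;
}
```

For instance, `0xE148` (LSL.W #8,D0) has a zero count field but decodes to a shift of eight.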

But again ... before the 68K defenders come in ... I get it! It's a dream to program for it! The 65816 only has one accumulator!! Madness! I'm right there with you on the user-end perspective.
AWJ
Posts: 433
Joined: Mon Nov 10, 2008 3:09 pm

Re: 65816 really 16 bit or just a 6502 with 16 bit registers

Post by AWJ »

byuu wrote:The solution isn't that complicated, though. We just need three counters instead of two:
68K <> Z80
Z80 <> PSG
68K <> PSG
The tricky part will be where to put them and how to name them.
That seems really likely to get you into a paradoxical situation where device A is ahead of device B, device B is ahead of device C and device C is ahead of device A. And the number of counters explodes as you add devices (every device needs a counter for every other device that can possibly interact with it)
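Both worries can be put into a few lines. This is only a sketch with made-up names: `pairwiseCounters` gives the worst-case number of counters when every device can interact with every other one, and `isParadox` flags the A-ahead-of-B, B-ahead-of-C, C-ahead-of-A situation (or its mirror image), using the convention that a positive value means the first-named device is ahead.

```cpp
#include <cassert>
#include <cstdint>

// Worst case: one counter per unordered pair of devices, n * (n - 1) / 2.
inline unsigned pairwiseCounters(unsigned devices) {
  assert(devices >= 1);
  return devices * (devices - 1) / 2;
}

// True when three independently stored counters describe an impossible cycle.
inline bool isParadox(int64_t aVsB, int64_t bVsC, int64_t cVsA) {
  bool allAhead = aVsB > 0 && bVsC > 0 && cVsA > 0;
  bool allBehind = aVsB < 0 && bVsC < 0 && cVsA < 0;
  return allAhead || allBehind;
}
```

Three devices need three counters, but the five MD devices mentioned earlier in the thread (68K, Z80, YM2612, PSG, VDP) would need up to ten.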

Here's how I'd design a scheduler for the MD: each device has one, unsigned counter. Each time it executes a cycle, it adds n to its counter where n is its divider (so the 68K adds 7 per cycle, the Z80 adds 15, etc.) To determine whether device A is ahead of or behind device B, you can just compare their counters, since they're all in the same unit (53MHz master clocks). To keep the counters from ever overflowing, once per frame go through all the counters, find the smallest value, and subtract that from all of the counters (so the smallest counter becomes zero and all the others are rebased relative to it)
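A minimal sketch of that design, assuming nothing beyond what the paragraph above describes. `Device`, `Scheduler`, `next`, `rebase` and `isAhead` are illustrative names, not MAME or higan APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Device {
  const char* name;
  uint64_t divider;    // master clocks per device cycle (68K = 7, Z80 = 15)
  uint64_t clock = 0;  // time in master clocks since the last rebase

  Device(const char* name, uint64_t divider) : name(name), divider(divider) {}

  void step(uint64_t cycles) { clock += cycles * divider; }
};

// All counters share one unit, so a plain comparison is enough.
inline bool isAhead(const Device& a, const Device& b) { return a.clock > b.clock; }

struct Scheduler {
  std::vector<Device> devices;

  // Index of the device that is furthest behind, i.e. the one to run next.
  std::size_t next() const {
    assert(!devices.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < devices.size(); i++) {
      if (devices[i].clock < devices[best].clock) best = i;
    }
    return best;
  }

  // Once per frame: subtract the smallest counter from all of them,
  // so the slowest device sits at zero and nothing ever overflows.
  void rebase() {
    assert(!devices.empty());
    uint64_t minimum = devices[next()].clock;
    for (auto& device : devices) device.clock -= minimum;
  }
};
```

Rebasing preserves every difference between counters, so ordering decisions made before and after it agree.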

If you're worried about performance, I think you'll have to bite the bullet and make a scheduler capable of handling both cothreaded and state-machine devices. Here's one way you could do it: make Processor a class with all the bits needed to interact with the scheduler except for a cothread pointer, and with a pure virtual method enter(). Make a subclass, CothreadedProcessor, that has the cothread pointer and overrides enter() with an implementation that switches to it. All cothreaded devices (mainly CPUs) will inherit from CothreadedProcessor. State machine devices, on the other hand, will inherit directly from Processor, and override enter() with the main loop of their state machine.

Performance should be pretty good because the compiler should be able to devirtualize most or all calls to enter(). By using virtual methods, none of the devices needs to know which of the other devices is cothreaded and which is a state machine (which I believe was the problem you had with the old, heavily hardcoded bsnes scheduler)
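Roughly, the class layout could look like the sketch below. To keep it self-contained, the cothread pointer is replaced by a `std::function` stand-in; with libco it would be a `cothread_t` and `enter()` would call `co_switch()`. `ToyPsg` and `runOnce` are invented for the example, and the 15-clock cost per state is an arbitrary placeholder.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Everything the scheduler needs, minus any knowledge of how the device runs.
struct Processor {
  uint64_t clock = 0;
  virtual ~Processor() = default;
  virtual void enter() = 0;  // run the device until it yields back
};

// Cothreaded devices: enter() just switches to the device's own context.
struct CothreadedProcessor : Processor {
  std::function<void()> resume;  // stand-in for a cothread handle (assumption)
  void enter() override {
    assert(resume != nullptr);
    resume();
  }
};

// State-machine devices: enter() resumes from an explicitly stored state.
struct ToyPsg final : Processor {
  unsigned phase = 0;
  void enter() override {
    phase = (phase + 1) % 4;  // continue from where the last call stopped
    clock += 15;              // placeholder cost per state, in master clocks
  }
};

// The scheduler only ever sees the base class.
inline void runOnce(Processor& processor) { processor.enter(); }
```

Callers go through `Processor::enter()` either way, so swapping a device between the two styles does not touch any other device.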
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Re: 65816 really 16 bit or just a 6502 with 16 bit registers

Post by Near »

Here's how I'd design a scheduler for the MD: each device has one, unsigned counter. Each time it executes a cycle, it adds n to its counter where n is its divider (so the 68K adds 7 per cycle, the Z80 adds 15, etc.) To determine whether device A is ahead of or behind device B, you can just compare their counters, since they're all in the same unit (53MHz master clocks). To keep the counters from ever overflowing, once per frame go through all the counters, find the smallest value, and subtract that from all of the counters (so the smallest counter becomes zero and all the others are rebased relative to it)
Thank you. I implemented this in my Wonderswan Color core, and the results are very impressive:
* Riviera: 177fps -> 195fps
* Final Fantasy: 158fps -> 165fps
Very impressive for such a simple change. And it perfectly eliminates the problem I was having with the 68K<>PSG<>Z80 scenario. And now I can build up the scheduler class to do a lot more heavy lifting.

It's trickier though with the SNES where there are multiple independent clock rates.
If you're worried about performance, I think you'll have to bite the bullet and make a scheduler capable of handling both cothreaded and state-machine devices.
Indeed, that's exactly what I'm thinking of doing. I want to wait to see how well C++17 works out, because there's not an easy way to make a state machine right now without all the red tape.
By using virtual methods, none of the devices needs to know which of the other devices is cothreaded and which is a state machine (which I believe was the problem you had with the old, heavily hardcoded bsnes scheduler)
That's exactly correct. It was an even bigger mess when I had the performance/balanced profiles, so it could change out from under you based on compilation flags.
Stef
Posts: 263
Joined: Mon Jul 01, 2013 11:25 am

Re: 65816 really 16 bit or just a 6502 with 16 bit registers

Post by Stef »

rainwarrior wrote:
Stef wrote:For instance it has some nice sound capabilities but severely limited by the small amount of dedicated memory...
It was plenty good enough to produce some of the best game soundtracks of all time.

Whenever memory is limited, it is a problem for anyone working with it. This was always a major part of game development. If it had twice as much RAM you'd still be complaining. ;) It's like that myth about a goldfish that grows to meet the size of its bowl. Unless your goals are drastically smaller than the memory you have, it's going to be one of your biggest problems, and solving that problem is what a good developer does.
...
I agree with that, and in fact my next sentence said that I understand there are always cost constraints that drive the amount of memory you can have. The problem for me is more that it is very complicated to stream any data to that memory because of the design. 64 KB can be enough for a single piece of music but is really short when you add SFX and digitized voices on top of that... having something similar to the MD design (having the SPC use the A-Bus in cycle-steal mode) would have made the whole sound system much more powerful overall.


byuu> MOVEQ *does* sign extend...
And I agree the M68000 instruction decoding part is a bit fudgy; having a good hexadecimal opcode table is important to avoid mistakes, but I guess you have a bunch of good documentation for that.
Still, looking back at my old C68K 68000 generator table, I don't think it's that much of a mess.
At least the EA (Effective Address) fields for both source and destination are always located at the same position, as is the size field (with the single exception of SUBA). Also, the size field can be encoded in 1 bit (instead of 2) when the only possible size is word or long (for Ax register operations), and that makes sense. I do agree there is some weirdness in instruction decoding sometimes... but for me it's not really worse than other CPUs I've played around with :p
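On the MOVEQ point, a small sketch of the decode shows the sign extension. The helper names are made up; the only encoding assumed is the documented one, `0111 rrr0 dddddddd`, with the 8-bit data field sign-extended to 32 bits.

```cpp
#include <cassert>
#include <cstdint>

// MOVEQ: 0111 rrr0 dddd dddd (bit 8 must be zero).
inline bool isMoveq(uint16_t opcode) { return (opcode & 0xf100) == 0x7000; }

// Destination data register number, bits 11-9.
inline unsigned moveqRegister(uint16_t opcode) {
  assert(isMoveq(opcode));
  return (opcode >> 9) & 7;
}

// The 8-bit immediate is sign-extended to fill the whole 32-bit register.
inline uint32_t moveqValue(uint16_t opcode) {
  assert(isMoveq(opcode));
  uint32_t data = opcode & 0xff;
  return (data ^ 0x80) - 0x80;  // portable 8-to-32-bit sign extension
}
```

So `MOVEQ #-1,D0` (opcode `0x70FF`) loads `0xFFFFFFFF`, which matches the observation that all bits of the data register get filled.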

Glad to hear you're working on a MD emulator by the way :) I guess you already know about Exodus (written by Nemesis), which is the equivalent of Bsnes but for the MD. I have an almost finished Gens 2 emulator (a rewrite from scratch) sitting somewhere on my hard drive (I wrote it a long time ago but never had the motivation to finish / release it), which was built on the same idea of using the master clock as the main synchronizer for all components. As you described, we run into trouble when the Z80 and 68000 can modify the same region; this is true for the PSG / VDP, but in fact it is also true for 68000 RAM (strangely, the Z80 can actually write it but not read it) and SRAM. But to be honest, I think that not a single game relies on it, as no game tries to write VDP, 68000 RAM or SRAM from the Z80. And when they access the PSG from the Z80 they don't do it from the 68000, and vice versa. So basically you can choose to drive synchronization using the 68000 CPU as the master reference (that was my approach). As soon as the 68000 accesses or modifies Z80 context or VDP context, you need to synchronize.
TmEE
Posts: 960
Joined: Wed Feb 13, 2008 9:10 am
Location: Norway (50 and 60Hz compatible :P)

Re: 65816 really 16 bit or just a 6502 with 16 bit registers

Post by TmEE »

tepples wrote:As for the "bit" designation for an entire console, I choose the widest data bus in the system, measured in word width times words per clock.
  • For NES, Game Boy, and Master System, it's 8.
  • For TG16 and Super NES, it's 16 (VDC/PPU data bus).
  • For Genesis, it's 16 (CPU data bus) or 16 (VDP data bus, 8 bits times 2 transfers per clock).
  • For Jaguar, it's 64.
  • And for Nintendo 64, it's also 64 (9-bit RAM with parity, so really 8 bits, times 8 transfers per clock).
Widest bus in SMS is VRAM bus which is 16 bits.
calima
Posts: 1745
Joined: Tue Oct 06, 2015 10:16 am

Re: 65816 really 16 bit or just a 6502 with 16 bit registers

Post by calima »

byuu wrote:The Genesis has........... Phantasy Star IV. Which is also amazing.
When starting on Gen dev, I played through PS IV because it was a first-party title and said to be one of the best Gen games. Boy was I underwhelmed: it did almost nothing that pushed the hw, the battle char disappearing was completely unnecessary, 4/10 story clearly written by a seventh-grader, unmemorable music and graphics, bad UI, bad dialogue, and I even found a bug (in a certain castle, going to a certain corner always hung the game).

That... Kind of made me wonder why there are so few good Gen games, when the console would be fully capable of Pokemon Gold or FF 4. That was a highly marketed first-party title too.
yet I'm working on a Genesis emulator now.
Excellent! You're one of the few people who care about portability, as it is, there are practically no good Gen emulators for Linux, especially non-32-bit x86.
Drew Sebastino
Formerly Espozo
Posts: 3496
Joined: Mon Sep 15, 2014 4:35 pm
Location: Richmond, Virginia

Re: 65816 really 16 bit or just a 6502 with 16 bit registers

Post by Drew Sebastino »

calima wrote:there are so few good Gen games
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: 65816 really 16 bit or just a 6502 with 16 bit registers

Post by psycopathicteen »

I'm surprised nobody made a dirt cheap 16-bit RISC cpu. I'm thinking maybe having 8 24-bit registers, with most ALU instructions being 16-bit, but some being 24-bit.
tepples
Posts: 22705
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: 65816 really 16 bit or just a 6502 with 16 bit registers

Post by tepples »

Such CPUs exist and are called DSPs and MCUs. They just don't become popular as a general-purpose personal computer's primary CPU because personal computer users tend to demand binary compatibility with their existing proprietary applications. This is why Mac OS 7.5 through 9 for PowerPC include a 68LC040 emulator, why Mac OS X 10.5 and 10.6 for x86 include a PowerPC emulator, and why every version of Windows since the 386 has included a virtual machine for running applications for MS-DOS and/or previous versions of Windows.

As for devices that run graphical applications but not those made for desktop PCs, mobile devices have tended to implement Thumb, the 16-bit instruction set of ARM processors.