possible ways to speed up 6502 core?

GradualGames
Posts: 1106
Joined: Sun Nov 09, 2008 9:18 pm
Location: Pennsylvania, USA

possible ways to speed up 6502 core?

Post by GradualGames »

Is it possible to parallelize (i.e. prefetch instructions and execute them on separate threads before encountering one that has some kind of stateful dependency) a 6502 core? If so, is it done in any popular emulators and does it achieve noticeable performance gains?
Last edited by GradualGames on Tue Feb 21, 2017 9:51 am, edited 1 time in total.
rainwarrior
Posts: 8731
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada

Re: Is it possible to parallelize a 6502 core?

Post by rainwarrior »

No, I don't believe so. Everything about the CPU is a serial operation. What do you propose to split into threads?

You could maybe run 2A03 and 2C02 in parallel simulations, but the frequent synchronization between the two (think of ~60 $2007 syncs per frame) might make it difficult to improve performance this way.

It's sometimes suggested to alternate between CPU/PPU simulation only at sync points (e.g. just run the CPU by itself until you hit a BIT $2002 or NMI or something, then run the PPU to catch up to that point) instead of simulating them together in lock-step. There is a potential advantage already here from not jumping back and forth on each instruction, but if you wanted to try parallelization you'd have to implement this anyway as a stepping stone.
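
For illustration, the catch-up idea might look roughly like this in Java. This is only a sketch under assumptions: the names CatchUpPpu, CatchUpBus and catchUpTo are made up for the example, the PPU "rendering" is reduced to a dot counter, and the 3 dots per CPU cycle ratio is the NTSC one.

```java
// Hypothetical sketch of catch-up scheduling: the CPU runs ahead and the
// PPU is only advanced when the CPU touches a PPU register.
final class CatchUpPpu {
    static final int DOTS_PER_CPU_CYCLE = 3; // NTSC ratio (assumed target)

    long dots;        // total PPU dots simulated so far
    int catchUpCalls; // how many times a synchronization was requested

    // Advance the PPU until it has caught up with the given CPU cycle count.
    void catchUpTo(long cpuCycles) {
        long target = cpuCycles * DOTS_PER_CPU_CYCLE;
        if (target > dots) {
            dots = target; // a real PPU would render dot by dot here
        }
        catchUpCalls++;
    }
}

final class CatchUpBus {
    final CatchUpPpu ppu = new CatchUpPpu();
    long cpuCycles;

    // Called by the CPU core after every instruction; no PPU work happens here.
    void tick(int cycles) {
        cpuCycles += cycles;
    }

    // Called by the CPU core for every memory read.
    int read(int address) {
        if (address >= 0x2000 && address <= 0x3FFF) {
            ppu.catchUpTo(cpuCycles); // sync point: PPU register access
            return 0;                 // placeholder register value
        }
        return 0;                     // placeholder for RAM/ROM
    }
}
```

A real core would also have to stop the CPU at the cycle where the PPU is due to raise NMI (and handle writes the same way as reads), which this sketch leaves out.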
GradualGames
Posts: 1106
Joined: Sun Nov 09, 2008 9:18 pm
Location: Pennsylvania, USA

Re: Is it possible to parallelize a 6502 core?

Post by GradualGames »

I suppose I was thinking: are there enough situations where you have several instructions not dependent on each other like:

;just random set of instructions that don't depend on each other
lda $00
inx
inc $0a

where you could execute all three in parallel and once you find a dependency wait for the results, or something?

I'm grasping at straws lately to speed up my 6502 core on Android devices (it runs startlingly well in pure Java, but I need to push it over the top to make it truly viable for distribution). The battery-saving features of such devices seem to make threads run at different speeds, seemingly arbitrarily, no matter what I do to request higher priority. I think for Android I may have to write the CPU portion in C.
rainwarrior
Posts: 8731
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada

Re: Is it possible to parallelize a 6502 core?

Post by rainwarrior »

GradualGames wrote:;just random set of instructions that don't depend on each other
lda $00
inx
inc $0a

where you could execute all three in parallel and once you find a dependency wait for the results, or something?
That's just way too granular for multithreading. Think about all the extra work you've got to do just to identify separable instructions, and then merge the common results, etc... the time it takes to synchronize data between threads greatly outweighs the time taken by any single instruction. You need to be able to separate into batches where the threads don't have to talk to each other for longer periods.

That kind of granularity actually is done by the hardware of modern CPUs internally, though. Not as a threading thing, but on a lower level, as the CPU's pipelining hardware. You could potentially take advantage of that hardware by doing JIT compilation, or static recompilation, where you rebuild the whole program as native instructions instead of the usual byte-by-byte dispatch. Recompilation has some caveats that might make it difficult too (self modifying code, code copied to RAM, overlapping code, etc.).

I've not heard of recompilation being done for an NES emulator, but it's definitely used in several emulators for more powerful systems (not sure if it would really pay off on NES). There's a small list here: https://en.wikipedia.org/wiki/Dynamic_r ... ion#Gaming
GradualGames
Posts: 1106
Joined: Sun Nov 09, 2008 9:18 pm
Location: Pennsylvania, USA

Re: Is it possible to parallelize a 6502 core?

Post by GradualGames »

Makes sense. What about other tricks... lazy evaluation of CPU flags, for example? A lot of instructions affect many flags, but the code that follows may or may not use the flags calculated from the previous result. Just trying to learn if there's anything clever one can do in a 6502 core to reduce the amount of work that needs to be done.
adam_smasher
Posts: 271
Joined: Sun Mar 27, 2011 10:49 am
Location: Victoria, BC

Re: Is it possible to parallelize a 6502 core?

Post by adam_smasher »

There are only four ALU flags, and computing them should be basically free on a modern CPU (ironically, since the flags don't depend on each other, the CPU *will* probably be able to compute them in parallel). The overhead involved in trying to compute them lazily will almost certainly be higher than the cost of just computing them.
GradualGames
Posts: 1106
Joined: Sun Nov 09, 2008 9:18 pm
Location: Pennsylvania, USA

Re: Is it possible to parallelize a 6502 core?

Post by GradualGames »

Why is that? I'm thinking, for example, of evaluating status_zero only when beq/bne is used. There are probably many hundreds of instructions executed in any given game that affect the zero flag but are not immediately followed by beq/bne, so when you sum that all up it seems like it could save at least a small portion of time overall. Note my core is written in Java, so it is unclear to me how much parallelization (in the real physical CPU) is actually taken advantage of at the lowest level, once the bytecode has been interpreted or JIT-compiled.
tepples
Posts: 22705
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Is it possible to parallelize a 6502 core?

Post by tepples »

How would the emulator remember from which value the flags shall be calculated, particularly when PLP can load inconsistent flags (e.g. N and Z both set)?
GradualGames
Posts: 1106
Joined: Sun Nov 09, 2008 9:18 pm
Location: Pennsylvania, USA

Re: possible ways to speed up 6502 core?

Post by GradualGames »

My hope was I'd just store off the result of the last instruction and evaluate flags from that result variable when I need them. What's this about inconsistent flags? Is that a hardware glitch or is that something else?

One thing that makes Java super awkward for writing a 6502 core is I constantly have to do & 0xff to get an unsigned byte value (always expands to an int, since Java only allows bytes to be signed). Perhaps one optimization I could make would be, when I load the cartridge, to expand it to twice the size so each byte is just stored in an int (or short), doing the & 0xff calculation once at load time and eliminating the need to use & 0xff everywhere. Should shave off some time anyway...
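
To make the two options concrete, here is a small Java sketch. The class and method names (UnsignedRom, readMasked, expand, inc8) are invented for the example; Byte.toUnsignedInt is a standard library method (Java 8 and later). Whether option B is actually faster is not something this shows; it would need measuring on the target device.

```java
// Hypothetical helper comparing two ways to get unsigned bytes in Java.
final class UnsignedRom {
    // Option A: keep the ROM as byte[] and mask on every read.
    static int readMasked(byte[] rom, int address) {
        return rom[address] & 0xFF;
    }

    // Option B: widen once at load time so later reads need no mask.
    static int[] expand(byte[] rom) {
        int[] wide = new int[rom.length];
        for (int i = 0; i < rom.length; i++) {
            wide[i] = Byte.toUnsignedInt(rom[i]); // same value as rom[i] & 0xFF
        }
        return wide;
    }

    // Register arithmetic still needs a mask to wrap at 8 bits either way.
    static int inc8(int value) {
        return (value + 1) & 0xFF;
    }
}
```

Note that an int[] takes four times the memory of the byte[] (a short[] takes twice), and the masks in the ALU operations themselves do not go away.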
rainwarrior
Posts: 8731
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada

Re: possible ways to speed up 6502 core?

Post by rainwarrior »

GradualGames wrote:One thing that makes Java super awkward for writing a 6502 core is I constantly have to do & 0xff to get an unsigned byte value (always expands to an int, since Java only allows bytes to be signed). Perhaps one optimization I could make would be, when I load the cartridge, to expand it to twice the size so each byte is just stored in an int (or short), doing the & 0xff calculation once at load time and eliminating the need to use & 0xff everywhere. Should shave off some time anyway...
I would actually expect Java to handle simple byte masking like "& 0xFF" in a practical/efficient way. Most compilers handle this kind of thing very well, and attempts to "optimize" by manually packing bits in less intuitive ways could actually lower performance. (I could be wrong about this, but as always: measure your optimizations after implementing them to make sure.)
GradualGames wrote:My hope was I'd just store off the result of the last instruction and evaluate flags from that result variable when I need them. What's this about inconsistent flags? Is that a hardware glitch or is that something else?
He means that the flags aren't really set by a single result. The Zero flag may have been set by a different instruction than the Carry flag, etc. so you'd actually need to store a result for each flag, I suppose... but I'm not sure how that would be significantly different than just storing the flags. (Dunno if you're doing this, but you don't need to store flags as a packed 8-bit value- that's only needed for things that use flags on the stack, PHP/RTI/etc. Perfectly fine to just use separate booleans for the internal representation.)
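
A rough Java sketch of the "separate booleans, pack only when needed" idea follows. The class and method names (StatusFlags, setNZ, pack, unpack) are made up for illustration. The bit positions match the FLAGS_ constants used elsewhere in this thread; the handling of bits 4 and 5 on push reflects my understanding of the 6502 (bit 5 always pushed as 1, bit 4 as 1 for PHP/BRK and 0 for IRQ/NMI) and should be checked against a reference.

```java
// Hypothetical flag storage: one boolean per flag, packed only for the stack.
final class StatusFlags {
    boolean n, v, d, i, z, c;

    // Update N and Z from an 8-bit result (as LDA, INX, INC etc. do).
    void setNZ(int result) {
        n = (result & 0x80) != 0;
        z = (result & 0xFF) == 0;
    }

    // Build the byte that gets pushed to the stack.
    // breakFlag is true for PHP/BRK and false for IRQ/NMI (assumption stated above).
    int pack(boolean breakFlag) {
        return (n ? 0x80 : 0) | (v ? 0x40 : 0) | 0x20 | (breakFlag ? 0x10 : 0)
             | (d ? 0x08 : 0) | (i ? 0x04 : 0) | (z ? 0x02 : 0) | (c ? 0x01 : 0);
    }

    // Restore the flags from a byte pulled off the stack (PLP/RTI).
    // Bits 4 and 5 have no storage in the CPU and are ignored.
    void unpack(int p) {
        n = (p & 0x80) != 0;
        v = (p & 0x40) != 0;
        d = (p & 0x08) != 0;
        i = (p & 0x04) != 0;
        z = (p & 0x02) != 0;
        c = (p & 0x01) != 0;
    }
}
```

With this layout, a pulled value like $82 simply leaves both n and z true, so the "inconsistent" combination needs no special case.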

Going back to my earlier thing about recompilation, lazy evaluation is something that compilers are really good at doing, so it actually is a technique that would work very well for JIT / dynamic recompilation. However, trying to do it at runtime, now you're probably spending a lot of CPU time computing the laziness, which could easily negate the work saved.
tepples
Posts: 22705
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: possible ways to speed up 6502 core?

Post by tepples »

The "store the last value that the ALU calculated and calculate N and Z therefrom" model won't always work. Normally it's not possible for N and Z to get set at once:
  • ALU result $00: N false, Z true
  • ALU result $01-$7F: N false, Z false
  • ALU result $80-$FF: N true, Z false
But there are two ways for N and Z to get set at once. One is an instruction that pops flags from the stack, namely PLP or RTI:

FLAGS_N = $80
FLAGS_V = $40
FLAGS_D = $08
FLAGS_I = $04
FLAGS_Z = $02
FLAGS_C = $01

  lda #FLAGS_N|FLAGS_Z
  pha
  plp
The other is BIT:


  lda #$80
  sta $00
  lda #$01
  bit $00
This sets Z because there are no 1 bits in common between the value in A and the value read from $00. But it sets N based only on the value read from $00, disregarding the value in A.
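
One possible way around both cases, sketched in Java, is to keep two lazily evaluated source values instead of one: N is read from one and Z from the other. The class and method names (LazyNZ, fromResult, fromBit, fromStatus) are invented for this example, and it is not claimed that any particular emulator does it exactly this way, or that it is faster than plain booleans.

```java
// Hypothetical lazy N/Z storage that can represent N and Z both set.
final class LazyNZ {
    private int nValue; // N is bit 7 of this value
    private int zValue; // Z is set when this value is zero

    // Ordinary ALU result: both flags derive from the same byte.
    void fromResult(int result) {
        nValue = result;
        zValue = result & 0xFF;
    }

    // BIT: Z comes from A AND memory, N comes from memory alone.
    void fromBit(int a, int m) {
        nValue = m;
        zValue = a & m & 0xFF;
    }

    // PLP/RTI: load N and Z independently from the pulled status byte.
    void fromStatus(int p) {
        nValue = p & 0x80;
        zValue = (p & 0x02) != 0 ? 0 : 1;
    }

    boolean n() {
        return (nValue & 0x80) != 0;
    }

    boolean z() {
        return zValue == 0;
    }
}
```

For the BIT example above, fromBit(0x01, 0x80) leaves both n() and z() true, and fromStatus(0x82) does the same for the PLP case.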
hackfresh
Posts: 101
Joined: Sun May 03, 2015 8:19 pm

Re: possible ways to speed up 6502 core?

Post by hackfresh »

From Stack Overflow: leaving things as ints might actually (slightly) improve performance, or if not, at least make it easier to code.


http://stackoverflow.com/questions/1453 ... float-inst
calima
Posts: 1745
Joined: Tue Oct 06, 2015 10:16 am

Re: possible ways to speed up 6502 core?

Post by calima »

I do think doing it in C will speed it up vs Java. Then there are the old tricks like using computed gotos instead of switch.
GradualGames
Posts: 1106
Joined: Sun Nov 09, 2008 9:18 pm
Location: Pennsylvania, USA

Re: possible ways to speed up 6502 core?

Post by GradualGames »

Well, the odd thing is, it's actually performing great most of the time, more than enough for what I'm using it for (see GGVm thread), even on a 3-year-old phone. Problem is, Android seems to put the thread at different priorities or on different cores outside of my control, and it sometimes runs at 1/4 the speed, at which point, depending on the phone, I might see some dropped frames (the speed winds up being slightly slower than an actual NES would execute the ROM). If I can just get it to not throttle down to 1/4 speed I'd be golden. However, I suspect this may be something I can't control and I'll just have to bite the bullet and write it in C for this particular platform.
rainwarrior
Posts: 8731
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada

Re: possible ways to speed up 6502 core?

Post by rainwarrior »

calima wrote:Then there are the old tricks like using computed gotos instead of switch.
Switches often compile to jump tables anyway, especially for in-order contiguous values.
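
As a small Java illustration of switch-based dispatch, here is a toy core that handles only the three instructions from the example earlier in the thread (LDA zp = $A5, INX = $E8, INC zp = $E6). TinyCore and its fields are hypothetical names. With only three sparse cases the compiler may not emit a table here; the jump-table argument applies more to a full switch over all 256 opcodes, and what the JIT or ART finally does with it is something to measure, not assume.

```java
// Hypothetical toy 6502 core showing opcode dispatch through a switch.
final class TinyCore {
    final int[] mem = new int[0x10000]; // one int per byte, values 0..255
    int a, x, pc;
    boolean n, z;

    private void setNZ(int value) {
        n = (value & 0x80) != 0;
        z = value == 0;
    }

    // Execute one instruction and return the number of cycles it took.
    int step() {
        int opcode = mem[pc];
        pc = (pc + 1) & 0xFFFF;
        switch (opcode) {
            case 0xA5: { // LDA zero page
                a = mem[mem[pc]];
                pc = (pc + 1) & 0xFFFF;
                setNZ(a);
                return 3;
            }
            case 0xE6: { // INC zero page
                int addr = mem[pc];
                pc = (pc + 1) & 0xFFFF;
                mem[addr] = (mem[addr] + 1) & 0xFF;
                setNZ(mem[addr]);
                return 5;
            }
            case 0xE8: // INX
                x = (x + 1) & 0xFF;
                setNZ(x);
                return 2;
            default:
                throw new IllegalStateException("opcode not implemented: " + opcode);
        }
    }
}
```

Loading the bytes A5 00 E8 E6 0A at some address, pointing pc at it and calling step() three times runs the lda/inx/inc sequence.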