All times are UTC - 7 hours





PostPosted: Mon Feb 13, 2017 11:25 pm 

Joined: Sat Apr 18, 2009 4:36 am
Posts: 255
Location: Russia
Hi, nesdev.
I think this info will be interesting.

Q:
Me wrote:
Hi, Marty.

Just want to say thank you again
for the great Nestopia emulation core.

I ran a test comparing the performance of modern cycle-accurate
emulators (written in C and C++) against Nestopia
on an old Intel Atom N550 1.50 GHz machine.

The results are amazing.

- puNES 0.100
- nintendulator 0.975b
- mesen 0.7.0
- bizHawk 1.11.9
- rockNES 5.41

All of them eat 100% of a CPU core and cannot
run at full speed on the old low-powered netbook CPU. They give only 30-40 FPS without frameskipping.
(The real performance of the Atom N550 is about that of a good Pentium III at ~1000 MHz.)

Nestopia's result is only 40-45% CPU load, and it runs at a full-speed 60 FPS!
FCEUX with its old inaccurate scanline-based PPU renderer and low sound quality has the same performance.

For now, the nestopia-libretro core (in fact it's your core with minimal modifications by Rdanbrook)
works perfectly on the Raspberry Pi 3.

I wonder how you managed such _heavy_ optimization of your cycle-accurate emulator!

A:
Marty wrote:
Thanks Eugene. Nice to hear from you again, hope you are well.
Doing code optimizations without sacrificing accuracy can be
real fun and I'm happy to see it paid off.

As for the various optimizations I did to Nestopia at the time,
I heavily used the Intel VTune and AMD CodeAnalyst profilers to
find hotspots in the code and also let the compiled IA-32 assembly
code guide me through it.

I also made heavy use of (or abused, if you will) C++ template-style
programming, or concept-oriented programming as I'd like to call it,
to let the compiler do as much work for me as possible and to spare
me from repeating myself in code.
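
A minimal sketch of what "letting the compiler do the work" with templates can look like. This is not Nestopia's code: renderLine, renderLineDispatch and the two flags are hypothetical names chosen for illustration. The idea is that options which stay fixed for a whole scanline become template parameters, so the per-pixel loop is specialised at compile time.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical example: ShowBackground and ClipLeft are compile-time
// constants, so each instantiation gets its own loop and the compiler can
// drop the checks that are known to be false in that instantiation.
template <bool ShowBackground, bool ClipLeft>
void renderLine(const std::array<std::uint8_t, 256>& background,
                std::array<std::uint8_t, 256>& out)
{
    for (std::size_t x = 0; x < out.size(); ++x)
    {
        const bool visible = ShowBackground && !(ClipLeft && x < 8);
        out[x] = visible ? background[x] : 0;
    }
}

// The run-time flags are tested once per line instead of once per pixel.
inline void renderLineDispatch(bool showBackground, bool clipLeft,
                               const std::array<std::uint8_t, 256>& background,
                               std::array<std::uint8_t, 256>& out)
{
    if (!showBackground)
        renderLine<false, false>(background, out);
    else if (clipLeft)
        renderLine<true, true>(background, out);
    else
        renderLine<true, false>(background, out);
}
```

The loop body is written once, which is the "not repeating myself" part; whether the specialised loops are actually faster is something only a profiler can tell.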

Using the Intel C++ Compiler and Microsoft Visual Studio at the time, I
also fine-tuned many parts of the code through compiler directives to give
hints to the compiler on what to optimize for speed and what to optimize for
size.

As a programmer, having knowledge of low-level stuff such as branch prediction, cache lines
and other things helped a lot during development. Even if you're developing something in a high-level
language such as Java, C# or Python, I believe you can still influence performance a great deal through the way
you structure and arrange your code.

For reference, and maybe not surprisingly, the most performance-critical method in the whole Nestopia code
base that I remember was Ppu::renderPixel(). I remember making that one ~20 FPS faster just by re-arranging
some statements. It was surely a branch-condition killer, but allowing the CPU to do other work in parallel instead of stalling made it almost free.
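
To illustrate the kind of re-arrangement that can help with branch-heavy per-pixel code, here is a sketch under my own assumptions. It is not the actual Ppu::renderPixel(); the function names and the simplified priority rule (2-bit pixel values, 0 = transparent, sprite colours tagged with 0x10) are hypothetical.

```cpp
#include <cstdint>

// Straightforward version: up to three data-dependent branches per pixel.
inline std::uint8_t selectPixelBranchy(std::uint8_t bg, std::uint8_t sprite,
                                       bool spriteBehind)
{
    if (sprite == 0) return bg;                                  // no sprite here
    if (bg == 0) return static_cast<std::uint8_t>(sprite | 0x10u); // bg transparent
    if (spriteBehind) return bg;                                 // sprite hidden by bg
    return static_cast<std::uint8_t>(sprite | 0x10u);
}

// Same result, but the decision is turned into a mask so the CPU has no
// conditional jump to mispredict.
inline std::uint8_t selectPixelBranchless(std::uint8_t bg, std::uint8_t sprite,
                                          bool spriteBehind)
{
    const unsigned useSprite =
        unsigned(sprite != 0) & (unsigned(bg == 0) | unsigned(!spriteBehind));
    const unsigned mask = 0u - useSprite;  // all zeros or all ones
    return static_cast<std::uint8_t>(((sprite | 0x10u) & mask) | (bg & ~mask));
}
```

Both functions return the same value for every input. Which one is faster depends on the compiler and the CPU, so it is worth measuring rather than assuming.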


21.01.2017


 
PostPosted: Tue Feb 14, 2017 7:06 pm 

Joined: Sun Feb 07, 2016 6:16 pm
Posts: 198
Just to give my 2 cents on the Mesen part of things:
Mesen is more or less optimized to run on at least 3 different threads (emulation, frame decoding/filtering, rendering), so running it on any dual-core CPU will result in sub-par performance, especially since I abuse spin locks for their low latency. Spin locks only work well as long as you actually have free cores to run them on without slowing down the other thread you are waiting on.
On the upside, this design means Mesen can run HDNes' HD packs with very little FPS drop on a quad-core machine (e.g. Super Mario Bros. goes from ~250 FPS to maybe ~190 FPS on my machine).
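
For readers unfamiliar with spin locks, a minimal sketch using only the C++ standard library is shown here. It is not Mesen's implementation; the class name SpinLock is hypothetical.

```cpp
#include <atomic>

// A waiting thread never sleeps: it keeps retrying in a tight loop. That
// gives very low wake-up latency, but the waiter occupies a whole core, which
// is why this only pays off when there are spare cores.
class SpinLock
{
public:
    void lock()
    {
        // test_and_set returns the previous value; loop until it was clear.
        while (flag_.test_and_set(std::memory_order_acquire))
        {
            // busy-wait
        }
    }

    bool try_lock()
    {
        return !flag_.test_and_set(std::memory_order_acquire);
    }

    void unlock()
    {
        flag_.clear(std::memory_order_release);
    }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};
```

Because it provides lock() and unlock(), the class can be used with std::lock_guard<SpinLock>. On a CPU with fewer free cores than spinning threads, the waiter competes with the very thread it is waiting for, which matches the dual-core behaviour described above.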

Also, a lot of features result in small performance losses, e.g. the debugger, cheats, unlimited sprites, support for HDNes' HD pack format, etc. I try to optimize where I can (mostly using VS' profiler), but I'm not going to start trying to optimize cache misses in an era where most low-end computers are already able to run Mesen at 2-3x normal speed. That made a lot of sense in 2005, but not so much in 2017 (stuff like Raspberry Pis aside).

And, this is a matter of taste of course (I'm sure some people might say the same about Mesen's code), but Nestopia's code can be very hard to process. In particular, stuff like this drives me insane:
https://github.com/rdanbrook/nestopia/b ... .cpp#L1435
It might result in slightly faster code, but in my opinion makes the code so much harder to read.
This kind of thing also leads to Nestopia's PPU code being 3.4k lines, against Mesen's 1k lines.

P.S.: I'm not trying to hate on Nestopia or anything. It's a great emulator, and I've used it as a reference countless times!


 
PostPosted: Tue Feb 14, 2017 11:06 pm 

Joined: Sat Apr 18, 2009 4:36 am
Posts: 255
Location: Russia
You're right, the Nestopia core code is very hard to maintain.
FHorse needed 2 days of debugging to understand and fix a bug in NstPpu,
and it was very difficult!
Quote:
Looking at the routine Ppu::Run, you can easily see that the VBLANK flag and the NMI are handled at cycles.hClock 681 (HCLOCK_VBLANK_0), 682 (HCLOCK_VBLANK_1) and 684 (HCLOCK_VBLANK_2), that is, effectively one scanline after the VACTIVE (240) scanlines. This is fine for PPU_RP2C02 (NTSC) and PPU_RP2C07 (PAL), but not for PPU_DENDY, which needs another 50 sleep scanlines. All I did was add these 50 scanlines before HCLOCK_VBLANK_0; they are executed only when (ssleep >= 0), which is true only for PPU_DENDY. This way I left the routine's logic intact for NTSC and PAL and intervened only for Dendy mode, because ssleep is always -1 for PPU_RP2C02 and PPU_RP2C07.
I hope I was able to explain it well.
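
The idea in the quote can be sketched roughly as follows. This is not the code from NstPpu: the names are hypothetical, and only the 240 active scanlines, the 50 extra Dendy scanlines and the "ssleep is -1 unless Dendy" convention are taken from the quote. Scanlines are numbered from 0, which is an assumption of this sketch.

```cpp
// Hypothetical names; see the lead-in for what comes from the quote.
enum class PpuModel { RP2C02, RP2C07, Dendy };

constexpr int kActiveScanlines = 240;     // VACTIVE
constexpr int kDendySleepScanlines = 50;  // extra idle scanlines on Dendy

// ssleep is -1 for NTSC/PAL, so the extra wait is skipped for them.
inline int initialSleep(PpuModel model)
{
    return model == PpuModel::Dendy ? kDendySleepScanlines : -1;
}

// First scanline on which the VBlank flag / NMI sequence would start.
inline int vblankScanline(PpuModel model)
{
    int line = kActiveScanlines + 1;  // one idle scanline after the active area
    const int ssleep = initialSleep(model);
    if (ssleep >= 0)                  // true only for Dendy
        line += ssleep;
    return line;
}
```

With these assumptions the NTSC and PAL paths are untouched (scanline 241), and only Dendy takes the extra branch (scanline 291).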

---
By the way, even "easy" things, like minor improvements to the NSF player, FDS, and region selector, are difficult too.
Feos tried to fix some other minor bugs, but could only solve the FDS one.
The NSF player and region selector still need fixing.
The current patches by feos are broken.


 
PostPosted: Thu Apr 20, 2017 12:02 am 

Joined: Sat Apr 18, 2009 4:36 am
Posts: 255
Location: Russia
Sour wrote:
I just released Mesen 0.8.1 which contains a fair amount of speed optimizations.
I know that it's still nowhere near as fast as Nestopia (and most likely never will be), but I'm curious how much of an impact the changes had on your 1.5 GHz Atom CPU.

Is there any chance you could compare 0.8.0 and 0.8.1 with a few games and let me know how much of a speed improvement you get?

On my i5 750 I get +22%, on an i3 I get +23%, and on a very old AMD Opteron (dual-core 2.0 GHz) I get +26% performance. With a bit of luck, the Atom might get a +25-30% performance boost, too.

https://github.com/SourMesen/Mesen/comm ... af24f10b84
https://github.com/SourMesen/Mesen/comm ... 9a99ce31e2

Here are the test results,
Intel Atom N550 1.50 GHz, 2 cores / 4 threads:

Super Mario Bros:
0.8.0 - 38~45 FPS
0.8.1 - 49~56 FPS

Rockman\Megaman 6:
0.8.0 - 36~42 FPS
0.8.1 - 46~55 FPS

Akumajou Densetsu (CV3 japan with VRC6):
0.8.0 - 23~31 FPS
0.8.1 - 35~42 FPS
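
To compare these ranges with the +25-30% estimate quoted above, one rough way is to take the midpoint of each FPS range and compute the relative change. Using midpoints is my own assumption, not how either poster measured it, and the names below are hypothetical.

```cpp
// An FPS range as reported above, e.g. "38~45 FPS".
struct FpsRange
{
    double low;
    double high;
};

inline double midpoint(FpsRange r)
{
    return (r.low + r.high) / 2.0;
}

// Relative improvement of 'after' over 'before', in percent.
inline double speedupPercent(FpsRange before, FpsRange after)
{
    return (midpoint(after) / midpoint(before) - 1.0) * 100.0;
}
```

For example, speedupPercent({38, 45}, {49, 56}) gives about +26.5% for Super Mario Bros., {36, 42} to {46, 55} gives about +29.5% for Rockman 6, and {23, 31} to {35, 42} gives about +42.6% for Akumajou Densetsu.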

Note the per-core utilisation shown in Task Manager.


Attachments:
performance_test_mesen.rar [863.41 KiB]