All times are UTC - 7 hours



Post new topic Reply to topic  [ 138 posts ]  Go to page Previous  1 ... 5, 6, 7, 8, 9, 10  Next
Author Message
PostPosted: Fri Jun 10, 2016 5:31 am 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
> The BSDs and Solaris, as comparative models, don't behave this way.

I have really tried to be as objective and unbiased as possible, but every time I compare differences in design between Linux and the BSDs, I can't recall a single instance where I felt that Linux had the better design choice. And that includes overcommit. And kqueue vs epoll, and /dev/random behavior, and SO_NOSIGPIPE vs MSG_NOSIGNAL, and OSS fork vs ALSA, and on and on.

> https://github.com/awjackson/bsnes-clas ... f3fcf75908

Damn, very impressive end results. I didn't think it'd be possible to get 100% perfect timings, given the separate oscillators (unless this test was run on a Mario Chip board, of course.)

So, does the Yoshi's Island intro still desync, or do the sprites in-game in Winter Gold ever flicker? Those are the two toughest cases.


PostPosted: Sun Jun 12, 2016 2:47 am 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
Well AWJ, I'm sure this post is going to make you very happy, but ...

Code:
auto PPU::updateVideoMode() -> void {
  switch(regs.bgmode) {
  case 0:
    bg1.regs.mode = Background::Mode::BPP2; bg1.regs.priority0 = 8; bg1.regs.priority1 = 11;
    bg2.regs.mode = Background::Mode::BPP2; bg2.regs.priority0 = 7; bg2.regs.priority1 = 10;
    bg3.regs.mode = Background::Mode::BPP2; bg3.regs.priority0 = 2; bg3.regs.priority1 =  5;
    bg4.regs.mode = Background::Mode::BPP2; bg4.regs.priority0 = 1; bg4.regs.priority1 =  4;
    sprite.regs.priority0 = 3; sprite.regs.priority1 = 6; sprite.regs.priority2 = 9; sprite.regs.priority3 = 12;
    break;


Code:
auto PPU::Background::get_tile() -> void {
  ...
  priority = (tile & 0x2000 ? regs.priority1 : regs.priority0);


Code:
auto PPU::Sprite::run() -> void {
  ...
  uint priority_table[] = {regs.priority0, regs.priority1, regs.priority2, regs.priority3};
  ...
        output.main.priority = priority_table[tile.priority];


I have no idea why I decided to store the settings as regular variables, and then turn them into arrays/conditionals later on for every rendered pixel. Unfortunately, fixing that didn't affect performance at all, but still ... it was more code for no reason.
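
For anyone following along, here is a minimal sketch of the array-based layout. The struct and function names (`BgRegs`, `SpriteRegs`, `bgPriority`, `spritePriority`) are made up for illustration and are not the actual bsnes types; only the mode 0 priority values come from the snippet above.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the per-layer register blocks.
struct BgRegs     { uint8_t priority[2]; };  // indexed by the tilemap priority bit
struct SpriteRegs { uint8_t priority[4]; };  // indexed by the 2-bit sprite priority

// Mode 0 values copied from updateVideoMode() above.
constexpr BgRegs     bg1Mode0    = {{8, 11}};
constexpr SpriteRegs spriteMode0 = {{3, 6, 9, 12}};

// Bit 13 (0x2000) of a tilemap entry is the priority bit; use it as the index.
constexpr uint8_t bgPriority(const BgRegs& regs, uint16_t tile) {
  return regs.priority[(tile >> 13) & 1];
}

// The sprite's own priority field selects the entry directly.
constexpr uint8_t spritePriority(const SpriteRegs& regs, unsigned tilePriority) {
  return regs.priority[tilePriority & 3];
}
```

This way the per-pixel code is a plain index, and the temporary `priority_table[]` in `Sprite::run()` goes away.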

Now granted, the code's not all bad, but there are definitely questionable things in here.

The hottest function in the emulator is the PPU::Background::run routine, followed by CPU::addClocks and then PPU::Window::run.

Here's the best I can do for Window::run: http://hastebin.com/xusorutava.cpp
Bumps the speed from ~89.5fps to ~94.5fps (but you know how fickle compiler gains are ...)
Any further improvements would be welcome. I don't see an easy way to merge one/two with one_enable/two_enable.

We also really need to figure out how to emulate EXTBG already. We're doing that entirely wrong right now, running mode7 computations twice, just to get around the weird EXTBG mosaic behavior. And it affects more than just mode7.

How interested are you in the accuracy PPU core, anyway? Is it something you're looking to accelerate as well, or are you primarily interested in the other PPU cores?


PostPosted: Sun Jun 12, 2016 9:04 am 
Offline

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 429
byuu wrote:
The hottest function in the emulator is the PPU::Background::run routine, followed by CPU::addClocks and then PPU::Window::run.

Here's the best I can do for Window::run: http://hastebin.com/xusorutava.cpp
Bumps the speed from ~89.5fps to ~94.5fps (but you know how fickle compiler gains are ...)
Any further improvements would be welcome. I don't see an easy way to merge one/two with one_enable/two_enable.

We also really need to figure out how to emulate EXTBG already. We're doing that entirely wrong right now, running mode7 computations twice, just to get around the weird EXTBG mosaic behavior. And it affects more than just mode7.

How interested are you in the accuracy PPU core, anyway? Is it something you're looking to accelerate as well, or are you primarily interested in the other PPU cores?


I'm very interested in the accuracy PPU. The other two implementations are frankly unsalvageable, but unlike you, my approach to unsalvageable old code is to write a fully-functional replacement first and then tear up the old code :mrgreen:

You seriously had member variables called priority0, priority1, priority2 and priority3 instead of an array? That's, like, elementary programming :mrgreen:

As I think I've said before, I'm pretty sure the BG implementation will have to be rewritten from scratch to take into account the real VRAM fetch patterns for each mode--that will give us correct EXTBG for free, and hopefully some performance if we write the new implementation smartly. Bottom line, I wouldn't put too much (any) effort into microoptimizing the current BG implementation if I were you. Focus on the sprites and windows.

You should renumber the priority codes so that sprites always use the same numbers (i.e. leave some gaps in the modes with fewer BGs). The Sprite unit really shouldn't need to know about the BG mode at all.
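
Roughly what I mean, as a sketch. The mode 0 numbers are the ones from the existing table; the two-BG table is only an illustration that assumes the front-to-back order OBJ3, BG1 high, OBJ2, BG2 high, OBJ1, BG1 low, OBJ0, BG2 low, and the names are hypothetical.

```cpp
#include <cstdint>

// Sprite priority codes stay the same in every mode (values from the mode 0 table).
constexpr uint8_t spritePriority[4] = {3, 6, 9, 12};

struct BgPriority { uint8_t low, high; };  // tile priority bit clear / set

// Mode 0, exactly as in the existing updateVideoMode(): BG1..BG4.
constexpr BgPriority mode0[4] = {{8, 11}, {7, 10}, {2, 5}, {1, 4}};

// A two-BG mode renumbered with gaps so the sprite codes do not move
// (assumed layer order, see the note above): BG1, BG2.
constexpr BgPriority twoBgMode[2] = {{5, 11}, {2, 8}};
```

With tables like these, only the Background units read per-mode data; the Sprite unit always uses 3/6/9/12.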

It's been two years and I never did get around to writing that mosaic and EXTBG test ROM. I really need to do that, but I'm currently busy with the SuperFX. I'm afraid I really can't spare any mental bandwidth on the PPU at the moment.

Something I didn't mention in the previous post is that my new icache code gives a noticeable speedup. Doom/WG/YI gain 1-2 FPS in accuracy, 10 FPS or more in balanced. I think it must be because the old code was syncing to the S-CPU (calling add_clocks()) twice per instruction when executing out of cacheable ROM. Of course I still can't merge it until that test ROM works without hacking it :)


PostPosted: Sun Jun 12, 2016 10:38 am 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
I think this will be slower, but it drops 11 lines to 3. It may even be faster; switch statements with small numbers of cases seem to generate pretty awful code, even when all the cases are linear.

Code:
bool array[] = {true, result, !result, false};
output.main.color_enable = array[regs.col_main_mask];
output.sub.color_enable = array[regs.col_sub_mask];
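
To make the comparison concrete, here is a self-contained sketch of both forms. The switch version is my reconstruction of what the table replaces, inferred from the array contents rather than copied from the hastebin paste, and the function names are hypothetical.

```cpp
// Switch form: one case per 2-bit mask value.
inline bool colorEnableSwitch(unsigned mask, bool result) {
  switch(mask & 3) {
  case 0:  return true;
  case 1:  return result;
  case 2:  return !result;
  default: return false;
  }
}

// Table form: build the four possible answers once, then index by the mask.
inline bool colorEnableTable(unsigned mask, bool result) {
  const bool table[] = {true, result, !result, false};
  return table[mask & 3];
}
```

Both give the same answer for every mask/result combination; which one is faster would have to be measured per compiler.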


> I'm very interested in the accuracy PPU. The other two implementations are frankly unsalvageable

Awesome! I completely agree. Balanced was an absolute write-off. That was (awful) C code masquerading as C++. The performance version was a failed experiment that wasn't even faster than balanced.

I still want a fast scanline-based PPU core (maybe we can even multi-thread it a bit somehow), and then in the future offer only two bsnes profiles. I don't think any amount of miracles is going to get the accuracy core to run well on anything but high-end Intel CPUs.

> As I think I've said before, I'm pretty sure the BG implementation will have to be rewritten from scratch to take into account the real VRAM fetch patterns for each mode

I agree the implementation is flawed. What I think is more likely is that the tile data lines are 16-bits each, and data gets shifted into them on fixed intervals rather than just-in-time like I do now.

We are really, really crippled on our potential research here because VRAM is 100% unreadable, unwritable during active display. We can't predict where the PPU is fetching data.

Gonna need someone else's help to design a more hardware-accurate fetching model. And it's really not going to be pretty, since you can do things like change the BG mode mid-scanline (e.g. Goodbye, Anthrox.)

I also don't think we're going to be able to split the PPU into separate PPU1, PPU2 logic blocks. We are probably going to want to utilize state machines, even with libco, because there's likely to be a lot of parallel logic going on between each of the units (BG1-4, OAM, Window, Math)

> Bottom line, I wouldn't put too much (any) effort into microoptimizing the current BG implementation if I were you. Focus on the sprites and windows.

The wonderful thing is, it's all nice and separated. All we have to do is drop ppu/background, and maybe change the main loop of ppu.cpp a bit.

I don't really follow what you mean by optimizing the sprites, so I guess we'll wait until after the SFX work to focus on those.

> I'm currently busy with the SuperFX. I'm afraid I really can't spare any mental bandwidth on the PPU at the moment.

That's fine. Everything's important. Looking forward to better SFX timing.

> I think it must be because the old code was syncing to the S-CPU (calling add_clocks()) twice per instruction when executing out of cacheable ROM.

Wait until we emulate the program RAM cache for the Cx4. A hackish mock-up was giving a ~30% speed boost in the most demanding areas. Also doesn't help that I sincerely doubt it's one clock cycle per instruction ... we're probably running that chip twice as fast as real hardware.


PostPosted: Sun Jun 12, 2016 10:54 am 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19084
Location: NE Indiana, USA (NTSC)
byuu wrote:
Gonna need someone else's help to design a more hardware-accurate fetching model.

Would a trace from a logic analyzer on the VRAM address bus help? I can't provide one myself, but if we all agree that one is needed, I might be able to let people know.


PostPosted: Sun Jun 12, 2016 11:00 am 
Offline

Joined: Sun Apr 13, 2008 11:12 am
Posts: 6273
Location: Seattle
byuu wrote:
We are really, really crippled on our potential research here because VRAM is 100% unreadable, unwritable during active display. We can't predict where the PPU is fetching data.
I could trivially get a logic analyzer log of any eight signals inside my 1-1-1 SNES, if that would help.

(In a couple weeks I'll have 8 additional test clips and could trace 16 signals instead)


PostPosted: Sun Jun 12, 2016 11:13 am 
Offline

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 429
lidnariq wrote:
byuu wrote:
We are really, really crippled on our potential research here because VRAM is 100% unreadable, unwritable during active display. We can't predict where the PPU is fetching data.
I could trivially get a logic analyzer log of any eight signals inside my 1-1-1 SNES, if that would help.

(In a couple weeks I'll have 8 additional test clips and could trace 16 signals instead)


I'm hoping that we can spy on the PPU VRAM fetches by abusing EXTBG, and not need additional hardware. Though I'd still love to see address bus traces of the S-PPU in action.


PostPosted: Sun Jun 12, 2016 1:17 pm 
Offline

Joined: Sun Apr 13, 2008 11:12 am
Posts: 6273
Location: Seattle
AWJ wrote:
Though I'd still love to see address bus traces of the S-PPU in action.
Any particular thing it'd be nice to have on-screen at the time? (Also, RAM B/U4 or RAM A/U5? I guess they're supposed to be the same except in mode7...)


PostPosted: Sun Jun 12, 2016 1:28 pm 
Offline

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 429
lidnariq wrote:
AWJ wrote:
Though I'd still love to see address bus traces of the S-PPU in action.
Any particular thing it'd be nice to have on-screen at the time? (Also, RAM B/U4 or RAM A/U5? I guess they're supposed to be the same except in mode7...)


One trace for each mode, with BG*SC and BG**NBA all set to different values for each layer so we can distinguish what's being fetched. BG1 tilemap, BG1 pattern, BG2 tilemap, BG2 pattern, etc., should all come from non-overlapping address ranges.


PostPosted: Mon Jun 13, 2016 1:10 pm 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
So, good news and bad news.

Went through the entire PPU core and revised a lot of the code. Some of the things that ended up being the fastest were very unintuitive and differed between the BG and OAM renderers. But I went with it anyway, even at the cost of some consistency.

The end result was a ~9% speedup, which applies even to the Yoshi's Island title screen (I figured it would be quite a bit less there, but it was right at ~9% as well.)

The bad news here is that this is probably all the low-hanging fruit. And it's not nearly enough, of course. We lost a good bit of performance by always rendering at 512x480, and I really don't want to revert that. It complicates everything.

More bad news for you is that this is likely to be a very tough merge to your fork. You'll probably just want to take the key gains and ignore the rest.

I think the only major gain we have left at this point is going to be creating a libco-alternative for the PPU/DSP. These cores clearly don't need full stack frames, and the PPU especially is invoked constantly.

What I'd like to try and do is wrap libco plus a state machine engine (maybe just longjmp alone?), preferably not handled through any macro abuse. C++17 cothreads would work great here, but we don't have those yet.

I don't just want to make them dumb state machines with totally inconsistent scheduling code.
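
For reference, this is the kind of plain state machine I mean as the baseline, sketched with made-up names (`FetchUnit`, `run`); it is not code from the emulator. The saved `state` stands in for the saved stack of a cothread, and "yielding" is just returning from `step()`.

```cpp
#include <cstdint>

// Hypothetical unit that would otherwise live in its own cothread.
struct FetchUnit {
  enum class State : uint8_t { Tilemap, Pattern, Output };
  State state = State::Tilemap;
  unsigned tilesOutput = 0;

  // Advance by exactly one clock, then return to the caller.
  void step() {
    switch(state) {
    case State::Tilemap: state = State::Pattern; break;
    case State::Pattern: state = State::Output;  break;
    case State::Output:  tilesOutput++; state = State::Tilemap; break;
    }
  }
};

// Scheduler side: run a unit for some clocks without any context switch.
inline void run(FetchUnit& unit, unsigned clocks) {
  while(clocks--) unit.step();
}
```

The goal would be to get this cost profile while keeping scheduling code that looks the same in every core.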


PostPosted: Mon Jun 13, 2016 6:25 pm 
Offline

Joined: Sun Apr 13, 2008 11:12 am
Posts: 6273
Location: Seattle
AWJ wrote:
One trace for each mode, with BG*SC and BG**NBA all set to different values for each layer so we can distinguish what's being fetched. BG1 tilemap, BG1 pattern, BG2 tilemap, BG2 pattern, etc., should all come from non-overlapping address ranges.
All seven traces were done with the following register settings:
BG1SC = 0
BG2SC = $10
BG3SC = $20
BG4SC = $30
BG12NBA = $45
BG34NBA = $67
TM = $0F
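
For reference, a small sketch of how those values translate into VRAM word addresses, assuming the usual register layout (BGnSC bits 7-2 give the tilemap base in $400-word units; each BGnNBA nibble gives a pattern base in $1000-word units). The function names are only illustrative.

```cpp
#include <cstdint>

// BGnSC: bits 7-2 select the tilemap base, in units of $400 words.
constexpr uint16_t tilemapBase(uint8_t bgsc) {
  return uint16_t((bgsc >> 2) << 10);
}

// BG12NBA / BG34NBA: the low nibble is the first layer (BG1 or BG3),
// the high nibble the second (BG2 or BG4), each in units of $1000 words.
constexpr uint16_t patternBaseFirst(uint8_t nba) {
  return uint16_t((nba & 0x0f) << 12);
}
constexpr uint16_t patternBaseSecond(uint8_t nba) {
  return uint16_t((nba >> 4) << 12);
}
```

So the tilemaps sit at $0000/$1000/$2000/$3000 for BG1-4, and the patterns at $5000 (BG1), $4000 (BG2), $7000 (BG3) and $6000 (BG4).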

I didn't get a trace for mode7.

I put clips on the RAMs on VA14, VAA13, VAB13, VAA12, VAB12, /VRD; on the BA6592F on /CSYNC; and on the S-CPU on REFRESH. Sample rate was 24MHz, the highest I can get with a Saleae.

(Getting a clip onto S-CPU pin 40 was hard. The trace might be suspect)


Attachments:
File comment: CSV files converted for ease of converting into other formats
SNESPPUtraces.7z [1.25 MiB]
Downloaded 51 times
File comment: logicdata files should be openable with Saleae's Logic; generated using v1.18
SNESPPU_logicdata.7z [1007.19 KiB]
Downloaded 45 times
PostPosted: Mon Jun 13, 2016 9:24 pm 
Offline

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 429
lidnariq wrote:
All seven traces were done with the following register settings:
BG1SC = 0
BG2SC = $10
BG3SC = $20
BG4SC = $30
BG12NBA = $45
BG34NBA = $67
TM = $0F

I didn't get a trace for mode7.

I put clips on the RAMs on VA14, VAA13, VAB13, VAA12, VAB12, /VRD; on the BA6592F on /CSYNC; and on the S-CPU on REFRESH. Sample rate was 24MHz, the highest I can get with a Saleae.

(Getting a clip onto S-CPU pin 40 was hard. The trace might be suspect)


Great work! Thanks a lot!

It's slightly unfortunate that your analyzer's sample rate maximizes aliasing (24MHz is almost exactly 4.5 times the PPU pixel/bus frequency), and that you set BG12NBA and BG34NBA to $45 and $67 rather than $54 and $76. Nevertheless, it's quite easy to see what's going on in each mode:

Code:
0.000001791666667, 0, 1, 1, 1, 1, 0, 0, 1 ; BG4 nametable  .1666 us
0.000001958333333, 0, 1, 0, 1, 0, 0, 0, 1 ; BG3 nametable  .2083 us
0.000002166666667, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable  .1666 us
0.000002333333333, 0, 0, 0, 0, 0, 0, 0, 1 ; BG1 nametable  .1666 us
0.000002500000000, 1, 1, 0, 1, 0, 0, 0, 1 ; BG4 2bpp       .2083 us
0.000002708333333, 1, 1, 1, 1, 1, 0, 0, 1 ; BG3 2bpp       .1666 us
0.000002875000000, 1, 0, 0, 0, 0, 0, 0, 1 ; BG2 2bpp       .2083 us
0.000003083333333, 1, 0, 1, 0, 1, 0, 0, 1 ; BG1 2bpp       .1666 us
0.000003250000000, 0, 1, 1, 1, 1, 0, 0, 1 ; BG4 nametable  .2083 us
0.000003458333333, 0, 1, 0, 1, 0, 0, 0, 1 ; BG3 nametable  .1666 us
0.000003625000000, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable  .2083 us
0.000003833333333, 0, 0, 0, 0, 0, 0, 0, 1 ; BG1 nametable  .1666 us
0.000004000000000, 1, 1, 0, 1, 0, 0, 0, 1 ; BG4 2bpp       .2083 us
0.000004208333333, 1, 1, 1, 1, 1, 0, 0, 1 ; BG3 2bpp       .1666 us
0.000004375000000, 1, 0, 0, 0, 0, 0, 0, 1 ; BG2 2bpp       .2083 us
0.000004583333333, 1, 0, 1, 0, 1, 0, 0, 1 ; BG1 2bpp       .1666 us
0.000004750000000, 0, 1, 1, 1, 1, 0, 0, 1 ; BG4 nametable

Mode 0 fetches four nametable words in descending order, then four pattern slivers in descending order.
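
The labels in these listings can be reproduced mechanically. A sketch, assuming the first three data columns of each row are A14, A13 and A12 (which is what the labels imply) and lidnariq's register settings; `decodeFetch` is a hypothetical helper, not part of any tool.

```cpp
#include <string>

// Decode the top three VRAM word-address bits of one trace row.
// A14 = 0 is a tilemap region, A14 = 1 a pattern region; A13:A12 pick the layer.
inline std::string decodeFetch(bool a14, bool a13, bool a12) {
  const unsigned sel = (a13 ? 2u : 0u) | (a12 ? 1u : 0u);
  // Tilemaps: BG1 $0000, BG2 $1000, BG3 $2000, BG4 $3000.
  static const char* const tilemap[4] = {"BG1", "BG2", "BG3", "BG4"};
  // Patterns: BG2 $4000, BG1 $5000, BG4 $6000, BG3 $7000 (the swapped nibbles).
  static const char* const pattern[4] = {"BG2", "BG1", "BG4", "BG3"};
  return a14 ? std::string(pattern[sel]) + " pattern"
             : std::string(tilemap[sel]) + " nametable";
}
```

Note that an offset-per-tile fetch reads from the BG3 tilemap region, so it decodes as "BG3 nametable" here; it has to be told apart by the mode.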

Code:
0.000000791666667, 0, 1, 0, 1, 0, 0, 0, 1 ; BG3 nametable  .2083 us
0.000001000000000, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable  .1666 us
0.000001166666667, 0, 0, 0, 0, 0, 0, 0, 1 ; BG1 nametable  .2083 us
0.000001375000000, 1, 1, 1, 1, 1, 0, 0, 1 ; BG3 2bpp       .1666 us
0.000001541666667, 1, 0, 0, 0, 0, 0, 0, 1 ; BG2 4bpp       .3750 us
0.000001916666667, 1, 0, 1, 0, 1, 0, 0, 1 ; BG1 4bpp       .3750 us
0.000002291666667, 0, 1, 0, 1, 0, 0, 0, 1 ; BG3 nametable  .2083 us
0.000002500000000, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable  .1666 us
0.000002666666667, 0, 0, 0, 0, 0, 0, 0, 1 ; BG1 nametable  .1666 us
0.000002875000000, 1, 1, 1, 1, 1, 0, 0, 1 ; BG3 2bpp       .1666 us
0.000003041666667, 1, 0, 0, 0, 0, 0, 0, 1 ; BG2 4bpp       .3750 us
0.000003416666667, 1, 0, 1, 0, 1, 0, 0, 1 ; BG1 4bpp       .3750 us
0.000003791666667, 0, 1, 0, 1, 0, 0, 0, 1 ; BG3 nametable

Mode 1 fetches three nametable words in descending order, then three pattern slivers in descending order. With only the upper address lines, we can't distinguish what order the bitplanes of BG2 and BG1 are fetched in (but we can see that they take twice as long).
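
The interval column can be turned into a fetch length with a little arithmetic. A sketch, assuming the NTSC master clock of about 21.477 MHz and 4 master clocks per dot; the constant and function names are mine.

```cpp
#include <cmath>

// One PPU dot is 4 master clocks of the ~21.477 MHz NTSC master clock.
constexpr double kDotHz    = 21477272.0 / 4.0;  // about 5.369 MHz
constexpr double kSampleHz = 24000000.0;        // analyzer sample rate

// About 4.47 samples per dot, so a one-dot fetch shows up as 4 or 5 samples
// (.1666 us or .2083 us).
constexpr double kSamplesPerDot = kSampleHz / kDotHz;

// Round a measured interval (in seconds) to a whole number of dots.
inline long dotsForInterval(double seconds) {
  return std::lround(seconds * kDotHz);
}
```

That gives one dot for the .1666/.2083 us rows (one word: a nametable entry or a 2bpp sliver), two dots for .3750 us (a 4bpp sliver) and four dots for .7500 us (an 8bpp sliver).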

Code:
0.000004208333333, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable  .1666 us
0.000004375000000, 0, 0, 0, 0, 0, 0, 0, 1 ; BG1 nametable  .1666 us
0.000004541666667, 0, 1, 0, 1, 0, 0, 0, 1 ; BG3 OPT        .3750 us
0.000004916666667, 1, 0, 0, 0, 0, 0, 0, 1 ; BG2 4bpp       .3750 us
0.000005291666667, 1, 0, 1, 0, 1, 0, 0, 1 ; BG1 4bpp       .3750 us
0.000005666666667, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable  .2083 us
0.000005875000000, 0, 0, 0, 0, 0, 0, 0, 1 ; BG1 nametable  .1666 us
0.000006041666667, 0, 1, 0, 1, 0, 0, 0, 1 ; BG3 OPT        .3750 us
0.000006416666667, 1, 0, 0, 0, 0, 0, 0, 1 ; BG2 4bpp       .3750 us
0.000006791666667, 1, 0, 1, 0, 1, 0, 0, 1 ; BG1 4bpp       .3750 us
0.000007166666667, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable

Mode 2 fetches the nametables, then two words of offset-per-tile data (again, we would need the lower address lines to distinguish them), then the patterns. Since the offset-per-tile is fetched after the nametables, each offset-per-tile fetch must apply to the next set of nametable fetches. This explains why offset-per-tile never applies to the first visible tile in a scanline.

Code:
0.000000500000000, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable  .2083 us
0.000000708333333, 0, 0, 0, 0, 0, 0, 0, 1 ; BG1 nametable  .1666 us
0.000000875000000, 1, 0, 0, 0, 0, 0, 0, 1 ; BG2 4bpp       .3750 us
0.000001250000000, 1, 0, 1, 0, 1, 0, 0, 1 ; BG1 8bpp       .7500 us
0.000002000000000, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable  .2083 us
0.000002208333333, 0, 0, 0, 0, 0, 0, 0, 1 ; BG1 nametable  .1666 us
0.000002375000000, 1, 0, 0, 0, 0, 0, 0, 1 ; BG2 4bpp       .3750 us
0.000002750000000, 1, 0, 1, 0, 1, 0, 0, 1 ; BG1 8bpp       .7500 us
0.000003500000000, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable

Mode 3, no surprises here.

Code:
0.000001208333333, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable  .1666 us
0.000001375000000, 0, 0, 0, 0, 0, 0, 0, 1 ; BG1 nametable  .2083 us
0.000001583333333, 0, 1, 0, 1, 0, 0, 0, 1 ; BG3 OPT        .1666 us
0.000001750000000, 1, 0, 0, 0, 0, 0, 0, 1 ; BG2 2bpp       .2083 us
0.000001958333333, 1, 0, 1, 0, 1, 0, 0, 1 ; BG1 8bpp       .7500 us
0.000002708333333, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable  .1666 us
0.000002875000000, 0, 0, 0, 0, 0, 0, 0, 1 ; BG1 nametable  .1666 us
0.000003041666667, 0, 1, 0, 1, 0, 0, 0, 1 ; BG3 OPT        .2083 us
0.000003250000000, 1, 0, 0, 0, 0, 0, 0, 1 ; BG2 2bpp       .1666 us
0.000003416666667, 1, 0, 1, 0, 1, 0, 0, 1 ; BG1 8bpp       .7500 us
0.000004166666667, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable

Mode 4 only fetches one word of offset-per-tile data, rather than two like mode 2. As expected.

Code:
0.001594750000000, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable  .2083 us
0.001594958333333, 0, 0, 0, 0, 0, 0, 0, 1 ; BG1 nametable  .1666 us
0.001595125000000, 1, 0, 0, 0, 0, 0, 0, 1 ; BG2 2bpp hires .3750 us
0.001595500000000, 1, 0, 1, 0, 1, 0, 0, 1 ; BG1 4bpp hires .7500 us
0.001596250000000, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable  .2083 us
0.001596458333333, 0, 0, 0, 0, 0, 0, 0, 1 ; BG1 nametable  .1666 us
0.001596625000000, 1, 0, 0, 0, 0, 0, 0, 1 ; BG2 2bpp hires .3750 us
0.001597000000000, 1, 0, 1, 0, 1, 0, 0, 1 ; BG1 4bpp hires .7500 us
0.001597750000000, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable

Mode 5 is almost exactly like mode 3, except instead of 4bpp and 8bpp slivers it's fetching double 2bpp and 4bpp slivers. We would need the lower address lines to distinguish the left and right slivers, as well as the bitplanes.

Code:
0.000015291666667, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable  .2083 us
0.000015500000000, 0, 0, 0, 0, 0, 0, 0, 1 ; BG1 nametable  .1666 us
0.000015666666667, 0, 1, 0, 1, 0, 0, 0, 1 ; BG3 OPT        .3750 us
0.000016041666667, 1, 0, 1, 0, 1, 0, 0, 1 ; BG1 4bpp hires .7500 us
0.000016791666667, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable  .2083 us
0.000017000000000, 0, 0, 0, 0, 0, 0, 0, 1 ; BG1 nametable  .1666 us
0.000017166666667, 0, 1, 0, 1, 0, 0, 0, 1 ; BG3 OPT        .3750 us
0.000017541666667, 1, 0, 1, 0, 1, 0, 0, 1 ; BG1 4bpp hires .7500 us
0.000018291666667, 0, 0, 1, 0, 1, 0, 0, 1 ; BG2 nametable

Finally, mode 6. It has a wasted cycle where it does a BG2 nametable fetch even though there is no BG2 in mode 6. Since it's a hires mode (and therefore any pattern fetch needs an even number of cycles), there isn't anything useful to do with an odd leftover cycle anyway. Like mode 2, two words of offset-per-tile data are fetched.
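
Putting the seven traces together, the per-tile fetch pattern could be written down as a table. This is only a sketch of one possible representation (the `Slot` type and `fetchSchedule` are hypothetical), with the dot counts taken from the intervals in the listings above; the order of bitplanes inside a multi-dot pattern fetch is not known yet.

```cpp
#include <vector>

enum class Fetch { Nametable, OffsetPerTile, Pattern };
struct Slot { Fetch kind; int bg; int dots; };  // bg is 1-based

// Fetch slots for one 8-dot tile period, in the order seen in the traces.
inline std::vector<Slot> fetchSchedule(int mode) {
  constexpr Fetch NT = Fetch::Nametable, OPT = Fetch::OffsetPerTile, PT = Fetch::Pattern;
  switch(mode) {
  case 0: return {{NT,4,1}, {NT,3,1}, {NT,2,1}, {NT,1,1},
                  {PT,4,1}, {PT,3,1}, {PT,2,1}, {PT,1,1}};
  case 1: return {{NT,3,1}, {NT,2,1}, {NT,1,1}, {PT,3,1}, {PT,2,2}, {PT,1,2}};
  case 2: return {{NT,2,1}, {NT,1,1}, {OPT,3,2}, {PT,2,2}, {PT,1,2}};
  case 3: return {{NT,2,1}, {NT,1,1}, {PT,2,2}, {PT,1,4}};
  case 4: return {{NT,2,1}, {NT,1,1}, {OPT,3,1}, {PT,2,1}, {PT,1,4}};
  case 5: return {{NT,2,1}, {NT,1,1}, {PT,2,2}, {PT,1,4}};
  case 6: return {{NT,2,1}, {NT,1,1}, {OPT,3,2}, {PT,1,4}};  // BG2 fetch is the wasted one
  default: return {};
  }
}

// Total dots used per tile period; every traced mode should come to 8.
inline int totalDots(int mode) {
  int sum = 0;
  for(const Slot& slot : fetchSchedule(mode)) sum += slot.dots;
  return sum;
}
```

Every mode fills exactly eight dots per tile column, which matches the 1.5 us repeat period in the listings.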

When you get your hands on more test clips, I'd like to see traces of the VRAM address lines from A14 to A3 (from either one of the RAMs, no need for both), plus /CSYNC (to see where the HBlanks are). No need for REFRESH; it's the DRAM refresh, not video related at all. Put some sprites on the screen this time. Oh, and for mode 5 and mode 6, can you put a few horizontally-flipped tiles into the nametables? I'd like to see if that affects the order the half-patterns are fetched in. Let's say, make one out of every four columns contain horizontally-flipped tiles.

Now, I really do have to get back to writing those SuperFX test programs...


PostPosted: Sun Jun 26, 2016 5:35 pm 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
Two things.

First, AWJ, I was curious if you've made any progress on the SuperFX stuff? If not or you're busy, that's cool. But I'd be very interested in committing your fixes if/when you do figure it out, so please keep me in the loop :D

Second, I wanted to share how great BitField is on other emulation cores.

Here's how I implemented scrollX and scrollY increment on the Famicom before:
Code:
  //scrollx increment
  r.vaddr = (r.vaddr & 0x7fe0) | ((r.vaddr + 0x0001) & 0x001f);
  if((r.vaddr & 0x001f) == 0x0000) {
    r.vaddr ^= 0x0400;
  }

  //scrolly increment
  r.vaddr = (r.vaddr & 0x0fff) | ((r.vaddr + 0x1000) & 0x7000);
  if((r.vaddr & 0x7000) == 0x0000) {
    r.vaddr = (r.vaddr & 0x7c1f) | ((r.vaddr + 0x0020) & 0x03e0);
    if((r.vaddr & 0x03e0) == 0x03c0) {  //0x03c0 == 30 << 5; 30 * 8 = 240
      r.vaddr &= 0x7c1f;
      r.vaddr ^= 0x0800;
    }
  }


And here's how we can do it now:
Code:
  //scrollx increment
  if(!++r.vaddr.x) r.vaddr.h++;

  //scrolly increment
  if(!++r.vaddr.fineY && ++r.vaddr.y == 30) r.vaddr.y=0, r.vaddr.v++;


(It might have a logic bug; I need to test it over the full range of values first, but that's the gist of it.)
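
A sketch of how that full-range check could look. The `Field` template below is a minimal stand-in written for this example, not the actual BitField class, and the `scrollX*`/`scrollY*` wrappers are hypothetical; the old and new bodies are the two snippets above.

```cpp
#include <cstdint>

// Minimal bit-range proxy over a 16-bit word (illustrative only).
template<int Shift, int Bits> struct Field {
  uint16_t& data;
  static constexpr unsigned mask = ((1u << Bits) - 1) << Shift;
  explicit Field(uint16_t& d) : data(d) {}
  operator unsigned() const { return (data & mask) >> Shift; }
  Field& operator=(unsigned v) {
    data = uint16_t((data & ~mask) | ((v << Shift) & mask));
    return *this;
  }
  unsigned operator++() { return *this = unsigned(*this) + 1; }  // wraps within the field
  unsigned operator++(int) { unsigned old = *this; ++*this; return old; }
};

struct VAddr {
  uint16_t data = 0;
  Field< 0, 5> x{data};      // coarse X
  Field< 5, 5> y{data};      // coarse Y
  Field<10, 1> h{data};      // horizontal nametable select
  Field<11, 1> v{data};      // vertical nametable select
  Field<12, 3> fineY{data};  // fine Y
  explicit VAddr(uint16_t value) : data(value) {}
};

inline uint16_t scrollXOld(uint16_t vaddr) {
  vaddr = uint16_t((vaddr & 0x7fe0) | ((vaddr + 0x0001) & 0x001f));
  if((vaddr & 0x001f) == 0x0000) vaddr ^= 0x0400;
  return vaddr;
}

inline uint16_t scrollXNew(uint16_t value) {
  VAddr r{value};
  if(!++r.x) r.h++;
  return r.data;
}

inline uint16_t scrollYOld(uint16_t vaddr) {
  vaddr = uint16_t((vaddr & 0x0fff) | ((vaddr + 0x1000) & 0x7000));
  if((vaddr & 0x7000) == 0x0000) {
    vaddr = uint16_t((vaddr & 0x7c1f) | ((vaddr + 0x0020) & 0x03e0));
    if((vaddr & 0x03e0) == 0x03c0) {  // coarse Y reached 30
      vaddr &= 0x7c1f;
      vaddr ^= 0x0800;
    }
  }
  return vaddr;
}

inline uint16_t scrollYNew(uint16_t value) {
  VAddr r{value};
  if(!++r.fineY && ++r.y == 30) r.y = 0u, r.v++;
  return r.data;
}
```

Looping over all 15-bit values and comparing `scrollXOld` with `scrollXNew` (and likewise for Y) is cheap enough to run as a unit test.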

I love C++ over C :D


PostPosted: Sun Jun 26, 2016 6:43 pm 
Offline

Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
Oh, by the way. Dolphin beat us to template-bitfields. They also beat Evan Teran by several months:

https://github.com/dolphin-emu/dolphin/ ... BitField.h

We should probably spend some time digging through Dolphin for good ideas to steal, er, peruse ;)


PostPosted: Sun Jun 26, 2016 8:01 pm 
Offline

Joined: Mon Nov 10, 2008 3:09 pm
Posts: 429
Yeah, the NES PPU's internal scroll/VRAM address registers are pure bit-twiddling hell and ideal cases for overlapping bitfields.

re SuperFX: I'm still working on the test program. I want to cram as many tests into one ROM as I can to minimize the amount of soldering and EPROM burning some poor sap has to do.

The following one-line change to CPU::irq_test() greatly improves the timing accuracy of coprocessor-generated IRQs:

Code:
bool CPU::irq_test() {
+ if(!regs.p.i) synchronize_coprocessor();
  if(!status.irq_transition && !regs.irq) return false;
  status.irq_transition = false;
  regs.wai = false;
  return !regs.p.i;
}


Unfortunately it comes at a cost of about 5 FPS in all games with coprocessors. That's actually a much smaller impact than I expected (irq_test() is called from last_cycle(), which is called around once every 20-30 master clocks depending on the mix of instructions being executed), but it hits all games with any coprocessor, including ones that don't have an IRQ output... like the Cx4, which is already the slowest thing in bsnes.

An optimization that would at least confine the damage to SuperFX and SA-1 games would be to stash a pointer to the coprocessor that's connected to the IRQ line, and only synchronize that one in irq_test(). Drawbacks are (1) it's more complex, which you hate, and requires adding a "synchronize_single_coprocessor()" method which bsnes currently lacks; and (2) it won't work if you somehow have two coprocessors that both have IRQ outputs. On the other hand, (2a) bsnes already fails to handle multiple coprocessors with IRQ outputs correctly: each coprocessor directly sets the IRQ pin as if it had exclusive control over it, and there's no attempt to handle "I'm no longer asserting /IRQ but the other coprocessor still is"; and (2b) the idea of a cartridge with both SuperFX and SA-1 coprocessors is so utterly impossible and ridiculous on a hardware level that I consider attempting to support such a thing in an emulator to be actively harmful to the SNES scene (it's like those NSFs that use one of every cartridge sound chip at once; that ain't NES music any more, buddy).
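
For what it's worth, handling (2a) properly would not take much. A sketch of a shared /IRQ line with a hypothetical `IrqLine` type (nothing like this exists in bsnes today): each source owns one bit, and the line reads as asserted while any bit is set.

```cpp
#include <cstdint>

// Hypothetical wired-OR /IRQ line shared by several sources.
struct IrqLine {
  uint32_t asserting = 0;  // one bit per source, set while that source asserts /IRQ

  // A source raises or releases only its own request.
  void set(unsigned source, bool level) {
    if(level) asserting |=  (1u << source);
    else      asserting &= ~(1u << source);
  }

  // The CPU sees the line asserted while any source is still asserting it.
  bool asserted() const { return asserting != 0; }
};
```

A coprocessor releasing its request then cannot clear another one's, which is the "other coprocessor still is" case described above.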

Anyway, ranting aside, I'm probably going to add some IRQ timing dependencies to my SuperFX test ROM, so you might want to reconsider those r14/r15 changes you rejected. Particularly since accurate emulation of the icache and other things will probably require differentiating between r15-modified-by-MMIO and r15-modified-by-instruction anyway.


Last edited by AWJ on Sun Jun 26, 2016 9:47 pm, edited 1 time in total.
