All times are UTC - 7 hours

PostPosted: Fri Jun 30, 2017 11:49 pm
Joined: Sat Apr 25, 2015 1:47 pm
Posts: 327
Location: FL
byuu wrote:
Is Revenant going to be up for writing all of those tests?

Hell no :P

Instead of compiling a different test ROM for every instruction, I wonder if it'd make more sense to just write some code to allow selecting one instruction at runtime, writing it into SMP RAM a good number of times (which would be better for this purpose than using a loop, if I understand the branch instructions' timing correctly) and then executing that and timing the results.
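
A rough host-side sketch of that idea, purely as an illustration: `buildUnrolledTest` and its parameters are hypothetical names, and the epilogue bytes (whatever stops the timer and reports the result) are assumed to be supplied by the caller rather than taken from any existing test ROM.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: repeat one encoded instruction `count` times and
// append a caller-supplied epilogue. The result is the byte image that
// would be uploaded to SMP RAM and then executed and timed.
std::vector<uint8_t> buildUnrolledTest(const std::vector<uint8_t>& instruction,
                                       std::size_t count,
                                       const std::vector<uint8_t>& epilogue) {
  std::vector<uint8_t> program;
  program.reserve(instruction.size() * count + epilogue.size());
  for(std::size_t n = 0; n < count; n++) {
    program.insert(program.end(), instruction.begin(), instruction.end());
  }
  program.insert(program.end(), epilogue.begin(), epilogue.end());
  return program;
}
```

Because the instruction under test is just a byte vector, the same uploader could serve every opcode selected at runtime; only the repeat count has to shrink for multi-byte instructions so the image still fits in RAM.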


PostPosted: Sat Jul 01, 2017 6:14 am
Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
Although with enough loops we can compute the answer either way -- you could use JMP instead of BRA.

A simple framework will work for most tests, but will not work for the stack manipulating ones. PUSH and POP may be okay if we let them overflow and wrap around the stack repeatedly.


PostPosted: Sat Jul 01, 2017 9:09 am
Joined: Mon Nov 10, 2008 3:09 pm
Posts: 429
https://github.com/awjackson/bsnes-clas ... 02d56f78fb

Note that none of the changes based on Overload's findings are in bsnes-classic yet--with this commit all "idle" cycles, including [pc+1] dummy operands, are treated as IO/ROM cycles (though we already know that's not quite correct, because two of the idle cycles in that mul/inc/bne loop are definitely RAM cycles).

If I run the blargg timer speed tests in this branch, some of the numbers change by 1 one way or the other (but never more than 1) but all the tests still show "passed". If I change the wait_states[] or timer_ticks[] lookup tables at all, the numbers change much more, and sometimes the tests even show "failed".

Revenant wrote:
Instead of compiling a different test ROM for every instruction, I wonder if it'd make more sense to just write some code to allow selecting one instruction at runtime, writing it into SMP RAM a good number of times (which would be better for this purpose than using a loop, if I understand the branch instructions' timing correctly) and then executing that and timing the results.


In order to get big enough numbers to minimize rounding error, we need to run each instruction a couple hundred times. There's no problem using loops if we verify the behaviour of the branch instructions first.

The call instructions are no harder to verify than any other instruction, they just need a bit of setup ahead of time (i.e. plunking suitable vectors in high RAM).

Still, the idea of an interactive test sounds good. Also, that way we can start with educated-guess emulation (e.g. my hypothesis that [pc+1] dummy operands are real reads and every other "idle" is an IO/ROM cycle) and zero in on the instructions that appear to diverge from that after emu-vs-hardware testing.


PostPosted: Sat Jul 01, 2017 10:22 am
Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
> https://github.com/awjackson/bsnes-clas ... 02d56f78fb

So according to your code ...

If the current cycle wait state is 0, you get 3 ticks of the timer stage 0.
If it's 1, 6 ticks.
If it's 2, 12 ticks.
If it's 3, 24 ticks.

So the real ratio is like:
24 clocks to 3 ticks.
48 clocks to 6 ticks.
120 clocks to 12 ticks.
240 clocks to 24 ticks.

So it changes from a 1/8th ratio to a 1/10th ratio on the upper two. That's peculiar, but if it works, it works.
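
The arithmetic can be checked mechanically. This is only a sketch built from the numbers quoted in this post; the array and function names are made up for the illustration.

```cpp
// Clocks consumed and stage-0 timer ticks produced for wait-state
// settings 0..3, as listed in the post.
constexpr unsigned clocks[4] = {24, 48, 120, 240};
constexpr unsigned ticks[4]  = { 3,  6,  12,  24};

// Clocks consumed per timer tick for a given wait-state setting.
constexpr unsigned clocksPerTick(unsigned waitState) {
  return clocks[waitState] / ticks[waitState];
}

// Settings 0 and 1 give a 1/8 ratio, settings 2 and 3 give 1/10.
static_assert(clocksPerTick(0) == 8 && clocksPerTick(1) == 8, "1/8 ratio");
static_assert(clocksPerTick(2) == 10 && clocksPerTick(3) == 10, "1/10 ratio");
```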


PostPosted: Sat Jul 01, 2017 10:32 am
Joined: Mon Nov 10, 2008 3:09 pm
Posts: 429
byuu wrote:
> https://github.com/awjackson/bsnes-clas ... 02d56f78fb

So according to your code ...

If the current cycle wait state is 0, you get 3 ticks of the timer stage 0.
If it's 1, 6 ticks.
If it's 2, 12 ticks.
If it's 3, 24 ticks.

So the real ratio is like:
24 clocks to 3 ticks.
48 clocks to 6 ticks.
120 clocks to 12 ticks.
240 clocks to 24 ticks.

So it changes from a 1/8th ratio to a 1/10th ratio on the upper two. That's peculiar, but if it works, it works.


I think speed values of 2 and 3 are supposed to be clock dividers of 4 and 8, but because of some interaction with the S-DSP (which actually generates the S-SMP's clock signal) they end up actually taking at least 5 or 10 cycles respectively, sometimes much longer (see Revenant's bizarre result with the mul ya test), and sometimes wedging the clock generator permanently.

The timer_ticks[] values being multiples of 3 is a relic of the old timer_step formula. You can divide them all by 3, also divide the per-timer template arguments by 3 (so 64/64/8 instead of 192/192/24), and everything works out exactly the same:

https://github.com/awjackson/bsnes-clas ... 44d23c6dec

(tested with blargg's tests, Revenant's tests, and Tales of Phantasia's intro)
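
To see why the division by 3 is harmless, here is a simplified stand-in for a timer prescaler. It is not the bsnes-classic code; `Prescaler` and `scalingsAgree` are names invented for this sketch, and the mix of speed settings is arbitrary.

```cpp
// Accumulate `step` per call and count how often `period` is reached.
struct Prescaler {
  unsigned period;
  unsigned counter;
  unsigned overflows;
  explicit Prescaler(unsigned p) : period(p), counter(0), overflows(0) {}
  void tick(unsigned step) {
    counter += step;
    while(counter >= period) { counter -= period; overflows++; }
  }
};

// Feed the same sequence of speed settings to the old (multiples of 3)
// and new scaling, and report whether the overflow counts ever diverge.
// `oldPeriod` is expected to be a multiple of 3 (192 or 24 here).
bool scalingsAgree(unsigned oldPeriod, unsigned calls) {
  const unsigned oldTicks[4] = {3, 6, 12, 24};
  const unsigned newTicks[4] = {1, 2, 4, 8};
  Prescaler before(oldPeriod);
  Prescaler after(oldPeriod / 3);
  for(unsigned n = 0; n < calls; n++) {
    unsigned speed = (n * 7 + n / 5) % 4;  // arbitrary mix of settings
    before.tick(oldTicks[speed]);
    after.tick(newTicks[speed]);
    if(before.overflows != after.overflows) return false;
  }
  return true;
}
```

The old counter is always exactly three times the new one, so both cross their thresholds on the same call.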


PostPosted: Sat Jul 01, 2017 1:28 pm
Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
> I think speed values of 2 and 3 are supposed to be clock dividers of 4 and 8, but because of some interaction with the S-DSP (which actually generates the S-SMP's clock signal) they end up actually taking at least 5 or 10 cycles respectively, sometimes much longer (see Revenant's bizarre result with the mul ya test), and sometimes wedging the clock generator permanently.

Interesting. And yeah, I don't really see us emulating the lock-ups. That's getting too pedantic even for me. Would rather put that effort into the CPU<>DMA crash on R1 CPUs that some homebrew actually hits by accident.

> The timer_ticks[] values being multiples of 3 is a relic of the old timer_step formula. You can divide them all by 3, also divide the per-timer template arguments by 3 (so 64/64/8 instead of 192/192/24), and everything works out exactly the same:

True, that's a nice simplification. And not to nitpick, but at this point I'd suggest dropping timer_ticks and just use:

Code:
unsigned ticks = 1 << speed;
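
A quick check that the shift really reproduces the divided-by-3 table (a sketch; `timerTicks` and `ticksFor` are illustrative names, with the table values taken as 3/6/12/24 divided by 3):

```cpp
// The table after dividing by 3, next to the shift that replaces it.
constexpr unsigned timerTicks[4] = {1, 2, 4, 8};

constexpr unsigned ticksFor(unsigned speed) { return 1u << speed; }

static_assert(ticksFor(0) == timerTicks[0], "speed 0");
static_assert(ticksFor(1) == timerTicks[1], "speed 1");
static_assert(ticksFor(2) == timerTicks[2], "speed 2");
static_assert(ticksFor(3) == timerTicks[3], "speed 3");
```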


PostPosted: Sun Jul 02, 2017 12:04 am
Joined: Sat Apr 25, 2015 1:47 pm
Posts: 327
Location: FL
Just for the sake of satisfying my own curiosity, here is a capture of smptesttest.sfc on BMF54123's SNS-101, in which CA/DA/EA/FA are actually usable. Meanwhile, his SNS-CPU-GPM-02 does the same thing as my SHVC-CPU-01, so I think my hunch about it being an issue with pre-1CHIP units might have been correct.

If that's really the case, then doing something akin to smpidletest (where CA etc. makes the SMP run slowly but still eventually recover) could be a way for software to tell 1CHIP/mini consoles apart from previous revisions, if one ever wanted/needed to do that for some reason.


PostPosted: Sun Jul 02, 2017 11:47 am
Joined: Mon Nov 10, 2008 3:09 pm
Posts: 429
I've just received a PM from Overload. He's done additional testing with a logic analyzer, updated his document, and confirmed a number of my intuitions:

The second cycle of one-byte instructions (and one two-byte instruction) is indeed a kind of dummy operand fetch which uses the external clock divider (TEST bits 4-5) if executing from RAM. Other internal operation cycles always use the internal clock divider and don't trigger read side effects from internal SMP registers, regardless of the address they put on the external bus. The oddball is dbnz y,rr, which has both a dummy operand fetch on cycle 2 and a real operand fetch on cycle 4. It's probably because that instruction shares microcode with the instructions that have a direct-page operand and a relative operand.

blargg was right about mov a,(x)+: the third cycle is the read and the fourth cycle is an internal operation. Whereas for mov (x)+,a the third cycle is an internal operation and the fourth cycle is the write. If I were to guess why this addressing mode works differently from all the other register/memory addressing modes (e.g. (x)), it's probably because the other modes share microcode with adc et al, but the (x)+ mode only exists for mov so it's microcoded specially.

TEST bits 4-7 are clock dividers of 2/4/8/16 applied to the clock coming from the S-DSP, which is already divided by 12 (so a final divider of 24/48/96/192). Dividers of 8 or 16 cause the S-DSP output clock to become "not stable" (seen from the software side by us as a 25% slowdown in the best case and a total loss of responsiveness in the worst case).
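
As a sketch of how those fields decode, assuming bits 4-5 select the external (RAM) divider and bits 6-7 the internal one as described in this thread. The struct and function names are invented for the illustration, and only the nominal dividers are modeled, not the instability.

```cpp
#include <cstdint>

struct TestDividers {
  unsigned external;  // selected by TEST bits 4-5, used for RAM cycles
  unsigned internal;  // selected by TEST bits 6-7, used for internal cycles
};

// Nominal divider (2, 4, 8 or 16) for a two-bit selector.
constexpr unsigned nominalDivider(unsigned selector) { return 2u << selector; }

TestDividers decodeTest(uint8_t test) {
  return {nominalDivider((test >> 4) & 3), nominalDivider((test >> 6) & 3)};
}

// The S-DSP clock is already divided by 12 before these dividers apply.
constexpr unsigned finalDivider(unsigned divider) { return 12 * divider; }

static_assert(finalDivider(nominalDivider(0)) == 24, "fastest setting");
static_assert(finalDivider(nominalDivider(3)) == 192, "slowest setting");
```

For example, TEST=$0A decodes to 2/2 and TEST=$FA to 16/16.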

Pin 16 (CPUK on the schematic) is the 2.048 MHz clock input from the S-DSP. Pin 15 is R/'W (low on writes, high on reads and internal operations, same as a 6502). Pin 14 is clock output (roughly equivalent to phi2 on a 6502, but its duty cycle is 25% low/75% high rather than 50%/50%). It looks to me like the SPC700 runs on a 4-phase clock internally like a 6809, rather than 2-phase like a 6502--maybe that's how it's able to do RMW ops without an idle cycle between the read and the write.

For emulation purposes, the address on the bus during real internal operations (not dummy operand fetches) seems pretty much irrelevant.


PostPosted: Sun Jul 02, 2017 4:21 pm
Joined: Sat Apr 25, 2015 1:47 pm
Posts: 327
Location: FL
AWJ wrote:
Dividers of 8 or 16 cause the S-DSP output clock to become "not stable" (seen from the software side by us as a 25% slowdown in the best case and a total loss of responsiveness in the worst case)


That only applies to the internal divider (bits 6-7), right?


PostPosted: Sun Jul 02, 2017 4:52 pm
Joined: Mon Nov 10, 2008 3:09 pm
Posts: 429
Revenant wrote:
AWJ wrote:
Dividers of 8 or 16 cause the S-DSP output clock to become "not stable" (seen from the software side by us as a 25% slowdown in the best case and a total loss of responsiveness in the worst case)


That only applies to the internal divider (bits 6-7), right?


An external divider setting of 2 or 3 seems less likely to lock up (at least with the mixtures of instructions and internal/external cycles we've been doing) but it still seems to cause a 25% slowdown on RAM cycles. Compare the results of all our tests with TEST=$FA to TEST=$0A. $FA makes all the tests take exactly 10 times as long as seen from the S-CPU.
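
The factor of 10 falls out if the two slow settings are modeled as their nominal divider plus the best-case 25% slowdown. This is a sketch under that assumption only; `effectiveDivider` is an illustrative name and lock-ups are not modeled.

```cpp
// Effective divider per two-bit selector: nominal 2/4/8/16, with the
// 8 and 16 settings stretched by 25% (best case) to 10 and 20.
constexpr unsigned effectiveDivider(unsigned selector) {
  return selector < 2 ? (2u << selector) : (2u << selector) * 5 / 4;
}

static_assert(effectiveDivider(0) == 2, "");
static_assert(effectiveDivider(1) == 4, "");
static_assert(effectiveDivider(2) == 10, "");
static_assert(effectiveDivider(3) == 20, "");

// TEST=$FA selects setting 3 for both dividers, TEST=$0A selects setting 0.
static_assert(effectiveDivider(3) / effectiveDivider(0) == 10, "10x slower");
```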


PostPosted: Mon Jul 03, 2017 12:00 am
Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
Good news and bad news.

The good news is that I've implemented all of Overload's new findings on IO cycles, plus I have the (x)+ case right for both reads and writes now.

Further, I've implemented the SMP as running at DSP/12. For the CPU cycles, I consume {2,4,10,20} cycles to simulate the glitchy behavior where 8,16 are not evenly divisible by 12. Yet I still run the timers by {2,4,8,16}. As a result of this, I've reduced the timer stage 0 counters to {128, 128, 16}.

Note that I could run the SMP at DSP/24 and use {1,2,5,10}, {1,2,4,8}, and {64,64,8}, but I figured I'd be more self-documenting and put a lot of notes about this behavior and its glitchiness into the smp/timing.cpp file.
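
A small sanity check that the two parameter sets describe the same timing: every DSP/12 value should be exactly twice its DSP/24 counterpart. This is a sketch; the array names are made up here and are not the identifiers used in higan.

```cpp
// DSP/12 base clock (the variant described as implemented).
const unsigned cycleWait12[4] = {2, 4, 10, 20};
const unsigned timerWait12[4] = {2, 4, 8, 16};
const unsigned stage0_12[3]   = {128, 128, 16};

// DSP/24 base clock (the alternative).
const unsigned cycleWait24[4] = {1, 2, 5, 10};
const unsigned timerWait24[4] = {1, 2, 4, 8};
const unsigned stage0_24[3]   = {64, 64, 8};

// True when halving the base clock rate is matched by halving every count.
bool tablesMatch() {
  for(unsigned n = 0; n < 4; n++) {
    if(cycleWait12[n] != 2 * cycleWait24[n]) return false;
    if(timerWait12[n] != 2 * timerWait24[n]) return false;
  }
  for(unsigned n = 0; n < 3; n++) {
    if(stage0_12[n] != 2 * stage0_24[n]) return false;
  }
  return true;
}
```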

We now closely match every test by Revenant, and still pass test_speed by blargg.

The bad news is that we fail test_timer_speed now on 1A (and most certainly on the others as well.) Since blargg doesn't print failed values, I traced the ROM and determined higan is getting 1561 for a timer value of 1A, whereas it wants ~1639 to pass. My suspicion is we have converted some cycles that really do read from RAM into idle cycles erroneously.

It's possible that I made a mistake somewhere, but I was super cautious this time, and since all of Revenant's stuff passes ... I think we may still have more stuff to discover here.

All the same, I'll link to the Git repo for the new code once it's been pushed.


PostPosted: Mon Jul 03, 2017 5:35 am
Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
Okay, relevant files are uploaded:

https://gitlab.com/higan/higan/blob/mas ... ctions.cpp
https://gitlab.com/higan/higan/blob/mas ... uction.cpp
https://gitlab.com/higan/higan/blob/mas ... timing.cpp
https://gitlab.com/higan/higan/blob/mas ... memory.cpp

AWJ, did you disassemble test_timer_speed as well already? Do you know what instructions it's executing during its timer tick counting loop?


PostPosted: Mon Jul 03, 2017 8:16 am
Joined: Mon Nov 10, 2008 3:09 pm
Posts: 429
byuu wrote:
Okay, relevant files are uploaded:

https://gitlab.com/higan/higan/blob/mas ... ctions.cpp
https://gitlab.com/higan/higan/blob/mas ... uction.cpp
https://gitlab.com/higan/higan/blob/mas ... timing.cpp
https://gitlab.com/higan/higan/blob/mas ... memory.cpp

AWJ, did you disassemble test_timer_speed as well already? Do you know what instructions it's executing during its timer tick counting loop?


Code:
auto SMP::wait(maybe<uint16> addr) -> void {
  static const uint cycleWaitStates[4] = {2, 4, 10, 20};
  static const uint timerWaitStates[4] = {2, 4,  8, 16};

  uint waitStates = io.externalWaitStates;
  if(!addr) waitStates = io.internalWaitStates;
//snip rest


Excessive C++ cleverness has bitten you in the back. This code is failing to distinguish between an argument of 0 and no argument, and turning accesses to address 0 (which is RAM) into internal accesses. I haven't bothered to disassemble the timer tests (since they worked for me on the first try) but I can tell from the debugger that they do use address 0.

Also, you've changed the order things happen in read() and write(). Before you were doing the read/write and then advancing the timers, now you're advancing the timers and then doing the read/write. I don't think this is the cause of the failure or even that it's necessarily wrong, just pointing it out because you do have to pay close attention to these things (for the S-CPU in particular, it makes a big difference to many edge cases exactly what order things are done in CPU::read() and CPU::write()).

Aside, I don't think the glitchiness with dividers of 8 or 16 has anything to do with "being divisible by 12". Dividing by n and then dividing by m is arithmetically equivalent to dividing by (m * n). Whether m is divisible by n or n is divisible by m is irrelevant. I think the S-DSP just isn't happy when the S-SMP's clock output is too slow. Remember that the S-DSP outputs a clock to the S-SMP and the S-SMP divides that clock and outputs it back to the S-DSP--it's a mutual interaction.
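
For what it's worth, the arithmetic claim is easy to brute-force for integer (truncating) division. This is only a sketch; `nestedDivisionMatches` is an illustrative name.

```cpp
// Checks that (x / n) / m == x / (n * m) for every x below `limit`,
// using unsigned integer division throughout.
bool nestedDivisionMatches(unsigned limit, unsigned n, unsigned m) {
  for(unsigned x = 0; x < limit; x++) {
    if((x / n) / m != x / (n * m)) return false;
  }
  return true;
}
```

It holds for 12 followed by 8 or 16 just as it does for 12 followed by 2 or 4, so divisibility between the two stages plays no role.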

ETA:

Changing the subject, I just noticed that in higan you're initializing the S-DSP ENDX to random(0), which means that Magical Drop will never work if randomization is disabled. Surely it should be random(0xff) instead (that's what I've done in bsnes-classic).

Here's my hypothesis for what is going on with the S-DSP initial state on real hardware. The initial state of each voice is completely random, which means that each voice is playing (from a random sample address and with random parameters) when the chip is powered on. If software doesn't touch any of the registers for a voice, eventually it will finish playing (it'll read a BRR header byte that has the END bit set and the LOOP bit clear) and set its corresponding bit in ENDX. There are two cases where this can fail to happen: if the random chunk of RAM that a voice is playing from happens to parse as a looping sample (and never gets overwritten by software to something that doesn't parse as a looping sample), or if the voice has a frequency of 0. Thus, by the time the IPL ROM passes control to an uploaded program, ENDX is usually 0xFF but occasionally one or two bits are clear, and those bits may or may not eventually get set depending on random chance and RAM contents.

This would explain why Magical Drop occasionally fails on certain real consoles, but certainly doesn't fail 25% of the time.
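
The stop condition in that hypothesis can be written down as follows. This is a sketch assuming the usual BRR header layout (bit 0 = END, bit 1 = LOOP); the names are illustrative and it says nothing about how either emulator implements ENDX.

```cpp
#include <cstdint>

// BRR block header flags (low two bits of the header byte).
constexpr uint8_t BrrEnd  = 0x01;
constexpr uint8_t BrrLoop = 0x02;

// A voice stops after a block whose header has END set and LOOP clear.
// With both bits set it jumps back to its loop point and keeps playing.
constexpr bool voiceStopsAfter(uint8_t header) {
  return (header & BrrEnd) != 0 && (header & BrrLoop) == 0;
}

static_assert(voiceStopsAfter(0x01), "END only: voice stops");
static_assert(!voiceStopsAfter(0x03), "END+LOOP: voice keeps looping");
static_assert(!voiceStopsAfter(0x00), "no END: sample continues");
```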


Last edited by AWJ on Tue Jul 04, 2017 9:26 am, edited 1 time in total.

PostPosted: Mon Jul 03, 2017 11:04 am
Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1338
Cydrak disassembled the test.

Code:
8f4af0        mov $f0, #$4a   ; set timings (modified per test)
8f81f1        mov $f1, #$81   ; enable IPL, timer 0
8f00fa        mov $fa, #$00   ; set timer 0
8f0100        mov $00, #$01
8f0001        mov $01, #$00   ; $0000 = 1
 
e4fd          lda $fd         ; reset ticks
e800          lda #$00
8d00          ldy #$00
f8fdf0fc      -; ldx $fd; beq -           ; wait on timer tick
7a00f8fdf0fa  -; adw $00; ldx $fd; beq -  ; count loops to next tick
 
8f0af0        mov $f0, #$0a   ; restore default timings
daf6          stw $f6         ; post loop results and sync S-CPU
8f55f4        mov $f4, #$55
e8dd64f4d0fc  lda #$dd; -; cmp $f4; bne -
 
5fc0ff        jmp $ffc0       ; return to IPL


Here are the expected ranges.

Code:
  ; TEST  loops
  ;  $0a   $0a8f <= X < $0ac6
  ;  $1a   $0656 <= X < $0677
  ;  $2a   $0384 <= X < $0397
  ;  $3a   $01dd <= X < $01e6
  ;  $4a   $07eb <= X < $0814
  ;  $5a   $0548 <= X < $0563
  ;  $6a   $032a <= X < $033b
  ;  $7a   $01c2 <= X < $01cb
  ;  $ca   $032b <= X < $033c
  ;  $da   $02a4 <= X < $02b1
  ;  $ea   $01fa <= X < $0205
  ;  $fa   $0151 <= X < $0158
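
For anyone wanting to compare emulator output against these ranges directly, here is a sketch of a lookup. The ranges are copied from the listing above; `Range`, `expected` and `passes` are names made up for the illustration.

```cpp
#include <cstdint>

struct Range { uint8_t test; uint16_t low; uint16_t high; };  // low <= X < high

// Expected loop counts per TEST value, from the listing above.
const Range expected[] = {
  {0x0a, 0x0a8f, 0x0ac6}, {0x1a, 0x0656, 0x0677}, {0x2a, 0x0384, 0x0397},
  {0x3a, 0x01dd, 0x01e6}, {0x4a, 0x07eb, 0x0814}, {0x5a, 0x0548, 0x0563},
  {0x6a, 0x032a, 0x033b}, {0x7a, 0x01c2, 0x01cb}, {0xca, 0x032b, 0x033c},
  {0xda, 0x02a4, 0x02b1}, {0xea, 0x01fa, 0x0205}, {0xfa, 0x0151, 0x0158},
};

// Returns true if `loops` is inside the expected range for `test`.
// TEST values that are not in the table are reported as failures.
bool passes(uint8_t test, unsigned loops) {
  for(const Range& r : expected) {
    if(r.test == test) return loops >= r.low && loops < r.high;
  }
  return false;
}
```

With this table, the 1561 that higan produced for $1a falls below the lower bound of $0656 (1622), while 1639 is inside the range.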


The problem turned out to be that I missed the (8) footnote on 6d. I probably missed the (9) footnote on 11 as well. It's a little tricky reading this PDF. Here is a crude fix:

Code:
auto SPC700::instructionDirectReadWord(fpw op) -> void {
  uint8 address = fetch();
  uint16 data = load(address + 0);
  if(op == &SPC700::algorithmLDW) load(address + 0);
  else idle();
  data |= load(address + 1) << 8;
  YA = alu(YA, data);
}


(the other MOVW is in DirectWriteWord.)

But anyway, all the tests pass now, hooray!

> Excessive C++ cleverness has bitten you in the back. This code is failing to distinguish between an argument of 0 and no argument, and turning accesses to address 0 (which is RAM) into internal accesses.

Unfortunately, that's not correct.

if(!addr) is testing the explicit operator bool() const of maybe<uint16>, which returns true if the maybe has a value in it, false if it's nothing. It won't see an address of zero until executing *addr to get the underlying value.
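
The same distinction can be shown with std::optional from C++17, used here purely as an analogy for maybe<>. This sketch is not nall's implementation, and `isInternalCycle` is a made-up name.

```cpp
#include <cstdint>
#include <optional>

// std::optional stands in for maybe<uint16> in this sketch.
// The negation tests for "no value present", not for a value of zero.
bool isInternalCycle(std::optional<uint16_t> addr) {
  return !addr;
}
```

Under std::optional's rules, `isInternalCycle(std::nullopt)` is true, while `isInternalCycle(uint16_t{0})` is false because an address of zero is still a value.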

> Also, you've changed the order things happen in read() and write(). Before you were doing the read/write and then advancing the timers, now you're advancing the timers and then doing the read/write.

Yeah, we had $2137/$4201 to confirm that difference on the CPU side. It probably exists on the SMP side too, but this new emulation code makes this very difficult. And I'm not even sure when the reads happen when the divider is not set to 0 (or effectively 2 cycles.)

> Remember that the S-DSP outputs a clock to the S-SMP and the S-SMP divides that clock and outputs it back to the S-DSP--it's a mutual interaction.

Ah well. It's not like we're gonna be emulating the chance of crashing with this register anyway :/

> Changing the subject, I just noticed that in higan you're initializing the S-DSP ENDX to random(0), which means that Magical Drop will never work if randomization is disabled. Surely it should be random(0xff) instead (that's what I've done in bsnes-classic).

There's no option to disable randomization currently. I'll keep that in mind though.

It seems you know about the oddities with that title's game over screen. We can make a separate topic to work through that if you'd like. I'm very interested in what's going on there. But again, we'll need to confirm things before I'll make changes, and this one's probably not gonna have an "easy mode" like the SMP courtesy of Overload, heheh.


Last edited by byuu on Mon Jul 03, 2017 11:23 am, edited 1 time in total.

PostPosted: Mon Jul 03, 2017 11:22 am
Joined: Mon Nov 10, 2008 3:09 pm
Posts: 429
byuu wrote:
Unfortunately, that's not correct.

if(!addr) is testing the explicit operator bool() const of maybe<uint16>, which returns true if the maybe has a value in it, false if it's nothing. It won't see an address of zero until executing *addr to get the underlying value.


Are you absolutely sure about that? If I apply the following change in bsnes-classic so that address 0 is treated as internal:

Code:
diff --git a/bsnes/snes/smp/memory/memory.cpp b/bsnes/snes/smp/memory/memory.cpp
index b577ca6..8fdd4cd 100644
--- a/bsnes/snes/smp/memory/memory.cpp
+++ b/bsnes/snes/smp/memory/memory.cpp
@@ -175,6 +175,7 @@ alwaysinline void SMP::op_buswrite(uint16 addr, uint8 data) {
 }
 
 unsigned SMP::speed(uint16 addr) const {
+  if(addr == 0) return status.clock_speed;
   if((addr & 0xfff0) == 0x00f0) return status.clock_speed;
   if(addr >= 0xffc0 && status.iplrom_enabled) return status.clock_speed;
   return status.ram_speed;


then blargg's timer tests fail exactly the same way as they do for you.
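
For reference, the unmodified address test from that diff boils down to the following self-contained sketch. `CycleKind` and `classify` are illustrative names, and the IPL ROM enable flag is passed in as a plain parameter instead of being read from SMP state.

```cpp
#include <cstdint>

enum class CycleKind { Internal, Ram };

// Mirrors the address checks in the diff above, without the
// experimental addr == 0 line.
CycleKind classify(uint16_t addr, bool iplromEnabled) {
  if((addr & 0xfff0) == 0x00f0) return CycleKind::Internal;        // $00F0-$00FF registers
  if(addr >= 0xffc0 && iplromEnabled) return CycleKind::Internal;  // IPL ROM region
  return CycleKind::Ram;                                           // everything else, including $0000
}
```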

