Re: Opcode decoding for disassembler

Posted: Wed Apr 17, 2019 7:45 pm
by rainwarrior
Using a branch to transfer execution from $80XX to $7FXX is valid, though I doubt you'd ever see it in a game with SRAM. I think the most likely time to see this is on FDS, but there RAM is contiguous from $6000-$DFFF.

Branches aren't commonly used to jump across very distinct memory regions like that (PRG to RAM, or between PRG banks.) Not impossible that someone would want to, but extremely unlikely to do it with a branch. JMP or JSR is probably what you'd see doing that.

However, to get back to your question: branches and jumps may target RAM, or other places. Regardless of that, your disassembly should show what the code does whether or not it "makes sense". Often the point of doing a disassembly is to find a bug that was caused by just such a lapse of sensibility. The disassembler's function should be plain, not subject to complex interpretation of the code's intention.

In most disassembly environments, it's important that the user can either mark bytes as code (to be disassembled into opcodes) or as data (left alone), or, failing that, reposition their current view of the disassembly to align on something they know is an instruction. Something like FCEUX just does a naive disassembly from the top of what's in view, and if there's something "weird" there, it can usually be fixed by moving the view up or down a byte or two until the code comes into alignment. Because of all the 1-byte instructions, naively disassembled 6502 code tends to self-align after a few lines anyway. If this is an offline tool, just give the user ways to specify which parts of the file are code and which are data.
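
To make the code/data marking concrete, here is a minimal sketch in Python (the thread doesn't name an implementation language, so Python is an assumption). `LENGTHS`, `split_code_data` and the `data_ranges` format are hypothetical names invented for illustration, and the length table is deliberately incomplete:

```python
# Deliberately tiny opcode-length table (a real tool needs every opcode).
LENGTHS = {0xA5: 2, 0xA9: 2, 0xAD: 3, 0x8D: 3, 0xC9: 2, 0xD0: 2, 0xEA: 1, 0x4C: 3}


def split_code_data(rom, data_ranges=()):
    """Return (offset, kind, chunk) tuples for a linear sweep over rom.

    data_ranges holds user-marked (start, end) offset pairs, end exclusive.
    """

    def is_data(i):
        return any(start <= i < end for start, end in data_ranges)

    out = []
    i = 0
    while i < len(rom):
        n = LENGTHS.get(rom[i], 0)
        fits = n > 0 and i + n <= len(rom)
        # Only decode when every byte of the instruction is unmarked code.
        if fits and not any(is_data(j) for j in range(i, i + n)):
            out.append((i, "code", rom[i:i + n]))
            i += n
        else:
            # Unknown opcode, truncated instruction, or user-marked data:
            # emit exactly one byte so nothing is hidden.
            out.append((i, "data", rom[i:i + 1]))
            i += 1
    return out
```

With the bytes `a9 ad 34 12 ea` and no marking, the sweep decodes `a9 ad` as one instruction and then falls out of alignment; marking offset 0 as data (`data_ranges=[(0, 1)]`) lets `ad 34 12` and `ea` decode as instructions.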


I think conceptually it's valid to think of a branch operand as analogous to an immediate one: in the way that ADC can take an immediate and add it to A, a branch takes this "immediate" and adds it to PC. If you want correct terminology, though, "immediate" is not the name to use (the addressing mode is called relative), and # is not the notation to use either.

For a disassembly, the standard thing is to use either a label or just the address as the operand, i.e. show where the branch goes, not the actual value of its operand. "BEQ label" or "BEQ $7F05"
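
As a rough sketch of turning the stored operand into the displayed address (Python is an assumption, and `branch_target` / `format_branch` are hypothetical helper names, not from any existing tool):

```python
def branch_target(opcode_addr: int, operand: int) -> int:
    """Return the 16-bit address a 6502 relative branch lands on."""
    # The operand is a signed 8-bit offset: $80-$FF mean -128..-1.
    offset = operand - 0x100 if operand & 0x80 else operand
    # The offset is applied to the address of the next instruction
    # (branch opcode address + 2), and the result wraps within 16 bits.
    return (opcode_addr + 2 + offset) & 0xFFFF


def format_branch(mnemonic: str, opcode_addr: int, operand: int) -> str:
    """Show where the branch goes, not the raw operand byte."""
    return f"{mnemonic} ${branch_target(opcode_addr, operand):04X}"
```

For example, a BEQ at $8000 with operand $80 would be shown as `BEQ $7F82`, which is exactly the $80XX to $7FXX case discussed above.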

Re: Opcode decoding for disassembler

Posted: Wed Apr 17, 2019 8:26 pm
by rainwarrior
CrowleyBluegrass wrote:Should I just interpret all BRK instructions two bytes, with the second as a .db into the disassembly? I assume "being off" means that every BRK instruction is followed by what my assembler thinks is another "opcode" (UNDEFINED above).
The meaning of the byte after a $00 is entirely dependent on how the interrupt is handled. Different programs require different things here. Some will skip the byte, some will adjust the return address, some won't return at all from the interrupt.

There is no solution here except: let the user specify what to do in some way (like I suggested above manually marking bytes as an opcode, manually aligning the current auto-disassembly view, etc.). This is not necessarily a program-wide behaviour either. Case by case resolution will be needed sometimes.

The suggestion to disassemble BRK as a data byte is mostly a concession for if you want to disassemble and then re-assemble. The treatment of BRK varies between assemblers, sometimes as 1 byte, sometimes as 2, but a data byte is unambiguous. If your goal isn't to make output that can be re-assembled, then you're allowed to notate BRK however you want, I suppose, but aside from allowing some user interaction to specify, all I can suggest is: don't do something that hides the following byte from the user.
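
One possible shape for this, as a sketch only (Python is an assumption; `emit_brk`, `default_len` and the per-address `overrides` map are hypothetical names for illustration). It emits BRK as data bytes and lets the user decide case by case whether the following byte travels with it:

```python
from typing import Optional


def emit_brk(rom: bytes, offset: int, base: int, default_len: int = 1,
             overrides: Optional[dict] = None) -> tuple:
    """Return (text, bytes consumed) for the $00 at rom[offset].

    overrides maps a CPU address to 1 or 2 for case-by-case resolution.
    """
    addr = base + offset
    length = (overrides or {}).get(addr, default_len)
    if length == 2 and offset + 1 < len(rom):
        # Keep the following byte visible on the same line.
        return f".db $00, ${rom[offset + 1]:02X} ; brk + following byte", 2
    # 1-byte treatment: the next byte is decoded normally afterwards.
    return ".db $00 ; brk", 1
```

Either way the byte after the $00 stays visible in the output, and `.db` lines re-assemble the same regardless of how a given assembler sizes `brk`.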

Re: Opcode decoding for disassembler

Posted: Thu Apr 18, 2019 4:28 am
by CrowleyBluegrass
koitsu wrote:In all my years I've never seen someone use branch instructions and while using them "have to worry about if the ROM has enough room left for the branch". This doesn't really make any sense. Can you explain what exactly you're talking about here?
Sorry if I'm causing confusion, I'm obviously missing something :oops: I made an error regarding how the disassembly would be displayed. I wrote #$ to mean the offset to be applied, which is the byte stored in the ROM but not what the disassembly would display. It would work out the address and print that, obviously. Apologies.

Having this quote in my mind:
koitsu wrote:Code:
8000: lda $1234 ; 8000: ad 34 12
8003: cmp #$10 ; 8003: c9 10
8005: bne $8000 ; 8005: d0 f9
8007: nop ; 8007: ea

As stated, f9 is a signed byte, which is -7. If you count 7 bytes backwards from $8007, you'll get $8000...
I meant that whenever the disassembler comes upon a branch instruction, the target address is calculated by taking the address of the next instruction after the branch and applying the offset (which is the operand of the branch) to that address. What I was trying to get across (and failing, I'm afraid) is the case where a branch instruction is encountered, but either:
  • a) there isn't even a "next instruction" to apply the offset to
  • b) the next instruction address, with the offset applied to it, would end up branching either before the PRG segment, after it, or some other "dubious" place (the definition of "dubious" being what I was unsure about also).
Again, sorry for the confusion. I was trying to assess if, when disassembling a branch, a check should be made to see if either a) or b) would occur.
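
For what it's worth, cases a) and b) could be handled along these lines. This is a sketch under assumptions: Python as the language, and `decode_branch` plus its output format are invented for illustration. The address after the branch is plain arithmetic (opcode address + 2), so it exists whether or not an instruction is stored there; the only hard stop is a missing operand byte.

```python
def decode_branch(mnemonic: str, rom: bytes, offset: int, base: int) -> str:
    """Disassemble the relative branch whose opcode is at rom[offset]."""
    addr = base + offset
    if offset + 1 >= len(rom):
        # The operand byte itself is missing: show the raw opcode byte.
        return f"{addr:04X}: .db ${rom[offset]:02X} ; truncated {mnemonic}"
    operand = rom[offset + 1]
    rel = operand - 0x100 if operand & 0x80 else operand
    # addr + 2 is pure arithmetic; no instruction has to exist there.
    target = (addr + 2 + rel) & 0xFFFF
    inside = base <= target < base + len(rom)
    # A target outside the loaded bytes is reported, never rejected.
    note = "" if inside else " ; target outside loaded range"
    return f"{addr:04X}: {mnemonic} ${target:04X}{note}"
```

Fed the bytes from the quoted listing (`ad 34 12 c9 10 d0 f9 ea` loaded at $8000), the branch at offset 5 comes out as `8005: bne $8000`.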

I think rainwarrior answered my overall question above. Just work out the address, and put it in the disassembly. Whether it's "valid" or not is not the disassembler's concern. Still, I'd appreciate it if you could point out any errors in my reasoning, I fear there's quite a few lurking still :oops: However, I suppose this project is having the intended effect: (slowly and) steadily causing me to become more aware of how everything actually works.

Thanks :)

Re: Opcode decoding for disassembler

Posted: Thu Apr 18, 2019 3:13 pm
by koitsu
Oh, I see what you meant now. Yes, rainwarrior's answer is the proper solution for a disassembler. The basic premise in this case would be: when handling branch instructions, it should be very easy for you to create a reference to an address that is "outside the ROM range" since branching is based entirely on the current PC/offset. For example, if that bne in your previous code actually ended up branching to $7fbc, your disassembler can either turn that line into bne $7fbc (this is what mine did) or bne L_7FBC (if you keep an internal list of addresses/labels/things) combined with emitting L_7FBC = $7FBC at the top of the disassembly. It's your call.
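
A small sketch of the label variant (Python is an assumption; `label_for` and `external_equates` are hypothetical names, and this is not how TRaCER or any other existing tool is known to do it):

```python
def label_for(target: int) -> str:
    """Name a branch/jump target in the L_XXXX style used above."""
    return f"L_{target:04X}"


def external_equates(targets, base: int, size: int) -> list:
    """Build 'L_XXXX = $XXXX' lines for targets outside the disassembled range.

    targets is any iterable of addresses collected while disassembling;
    base and size describe the range the ROM image occupies.
    """
    outside = sorted({t for t in targets if not base <= t < base + size})
    return [f"{label_for(t)} = ${t:04X}" for t in outside]
```

Targets inside the range would get an `L_XXXX:` label at the matching line instead; only the ones outside it need an equate at the top of the output.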

As for brk: I've always advocated strongly that it be interpreted as 2 bytes for one good reason: the CPU actually increases PC by 2 when handling brk. Quoting Lichty/Eyes (and now WDC since they own the book):
Although BRK is a one-byte instruction, the program counter (which is pushed onto the stack by the instruction) is incremented by two; this lets you follow the break instruction with a one-byte signature byte indicating which break caused the interrupt. Even if a signature byte is not needed, either the byte following the BRK instruction must be padded with some value or the break-handling routine must decrement the return address on the stack to let an RTI (return from interrupt) instruction execute correctly.
However, if you read other 6502 books/documents, or even look at old (circa 80s) disassemblers, you'll see that many of them insist it's a 1-byte instruction as a whole and that the PC+2 aspect is just a runtime nicety. As such, it is actually somewhat common -- at least on the Apple II series, which is not the platform you're focused on -- to find code where the programmer simply did brk without a signature byte, followed by code that was used normally (not just in the BRK handler!) except the first byte would be effectively skipped. This is a crappy example, but you'll see what I mean:

Code:

L_8000:
  lda $ee
  cmp #$16
  bne Somewhere
  brk
L_8007:
  lda #$ea
  sta $20fe
; ...
; Other legit code from this point on, blah blah blah
; ...
  jmp L_8007
In this case, the optional signature byte would be $a9 (the lda of lda #$ea at $8007). In situations like these, IRQ/BRK handler code tends to end up manipulating the return address on the stack. (Side note, as a point of info, not about brk: you'll find many NES games -- like the Final Fantasy series -- that modify the stack like this, especially in a jsr/rts scenario.) But if it didn't, and simply returned from the BRK handler, it would end up executing an $ea (nop) and continuing on pleasantly.

BRK usage is pretty rare, even on the Apple II (don't know about other home computers). Most programmers would use it as a kind of "welp, everything is screwed up" situation and thus really don't intend to recover well from the situation, if they bothered implementing a BRK handler at all. In my experience, most didn't/don't -- buggy programs would literally crash the system and the results would vary depending on any number of factors.

If you **really** want to do something unique, offer a flag/switch (e.g. --brk1byte) to treat brk as 1-byte and just emit brk when $00 is encountered. There may be cases where people might want that (probably more common if your tool was used on a non-NES platform).
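
Wiring up such a switch could look roughly like this, using Python's standard `argparse` module (Python is an assumption; `build_parser` and `brk_length` are hypothetical names, and the 2-byte default simply mirrors the preference stated above):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line options for a hypothetical disassembler front end."""
    parser = argparse.ArgumentParser(description="6502 disassembler (sketch)")
    parser.add_argument(
        "--brk1byte",
        action="store_true",
        help="treat brk ($00) as a 1-byte instruction and emit a bare brk",
    )
    return parser


def brk_length(args: argparse.Namespace) -> int:
    """How many bytes the decoder should consume when it meets $00."""
    return 1 if args.brk1byte else 2
```

The decoder would then advance by `brk_length(args)` whenever it meets a $00, so the choice stays in one place.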

I'll end on this note: it is very common for romhackers/etc. to disassemble a game, end up with a 1-byte-misalignment (due to code vs. data), split that part of the ROM file known to be code into its own BIN file, then re-disassemble that so that they get correct assembly output. What rainwarrior said is spot on: your tool should be "smart" but should not operate under the pretense of "trying to do everything" -- disassemblers CAN'T do everything because they're disassemblers, not emulators. But offer a good set of features and you'll find people will use your software with joy; that's what I found with TRaCER, anyway. I can't tell you how many times I've had to hand-modify disassembled output to fix that alignment. I'm not complaining, but it's *very VERY* common.