All times are UTC - 7 hours
PostPosted: Sat Mar 09, 2019 3:25 am 
Joined: Sun Jun 30, 2013 7:59 am
Posts: 21
Hi there :)
I've recently started writing an NES disassembler. Recently as in... today!
I have a decent (I hope) surface-level knowledge of the 6502, but not enough to do any serious game development. I was hoping you could give me some help regarding the decoding/processing of opcodes.

I'm working through Dougeff's Python code, which utilises a large case-statement (a long if/elif chain):
Code:
def ToASM(byte1,byte2,byte3):
   global count
   global currentBank
   count2 = 0
      
   if byte1 == "00":
      return ("\tbrk\t\t\t\t; 00") # none

I'm assuming byte1 is the opcode/operator itself, as it's being used as the switch value. Could someone please let me know what bytes 2 and 3 represent? I assume byte 2 is the operand. The parts I'm most unsure about are the cases which use the count2 variable:
Code:
   elif byte1 == "10":
      y = int(byte2, 16)
      if y > 127:
         y -= 256
      count2 = count + y + 2
      z = str(hex(count2))
      z = z[2:]
      z = z.zfill(4)
      
      count += 1
      return ("\tbpl B" +currentBank+"_"+ z + " ; 10 " + byte2) # Relative

The common denominator here is that all of these (count2 cases) are branch instructions. Is byte2 the offset for the branch? I'm assuming (again) that the range-checking of y is to do with page boundaries? I'm still unsure about the rest of the code in this excerpt though. There's some array slicing and zero padding going on, but it's fairly impenetrable with my current level of knowledge. I get the feeling I'm going to kick myself though, because I know it's a slice of something from the second index onwards, padded with zeros, and converted to hex...

Thanks in advance. Any wisdom bestowed is greatly appreciated! :)


PostPosted: Sat Mar 09, 2019 6:28 am
Joined: Fri May 08, 2015 7:17 pm
Posts: 2465
Location: DIGDUG
Some opcodes are 1 byte
TAX
PLA
CLI

Some are 2 bytes

LDA #$01
BEQ +32

Some are 3 bytes

LDA $0235

I always sent the 3 next bytes to the function, even if they weren't all used.

Quote:
the count2 variable


Branches (BEQ, BPL, etc.) are followed by one byte that is a relative jump measured from the current address + 2. Values 0-127 jump forward; values 128-255 jump backward (read them as -128 to -1).

count2 is used for calculating that jump address.
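
As a rough sketch of that calculation in Python (the function name and its arguments are made up for illustration and are not taken from the program; `address` is assumed to be where the branch opcode sits and `operand` the raw byte after it):

```python
def branch_target(address, operand):
    """Return the destination of a 6502 branch located at `address`."""
    # Operand bytes 128-255 stand for the negative offsets -128 to -1.
    offset = operand - 256 if operand > 127 else operand
    # The offset is applied after the 2-byte branch instruction itself.
    # The mask keeps the result inside the 16-bit address space.
    return (address + 2 + offset) & 0xFFFF

print(hex(branch_target(0x8005, 0xF9)))  # 0x8000
```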

_________________
nesdoug.com -- blog/tutorial on programming for the NES


Last edited by dougeff on Sat Mar 09, 2019 7:08 am, edited 1 time in total.

PostPosted: Sat Mar 09, 2019 6:29 am
Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3958
Location: A world gone mad
I had a wonderful response written then lost it because I closed the bloody tab. Sigh.

dougeff I'm sure can answer questions about his code, but I'm also the author of an old DOS-based 65816/65c02/6502 disassembler. I think generally anyone can write a basic one, but the complexity comes with all the features people want and need. Anyway...

You need to learn about 6502 instructions and addressing modes ASAP.

6502 instructions consist of an opcode and a variable number of bytes for the operand -- either 0, 1, or 2. The operand length is based on the addressing mode used by the instruction itself (they're effectively tied together). For example, here are some random instructions and what their bytecode is:

Code:
nop           --> ea          implied
lda #$bb      --> a9 bb       immediate
lda $bb       --> a5 bb       zero page
lda $00bb     --> ad bb 00    absolute
lda $00bb,x   --> bd bb 00    absolute indexed with X
lda $00bb,y   --> b9 bb 00    absolute indexed with Y
lda ($bb,x)   --> a1 bb       indexed indirect
lda ($bb),y   --> b1 bb       indirect indexed
jmp ($00bb)   --> 6c bb 00    indirect
beq $xx       --> f0 xx       relative (branching); see below
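
One way to capture the link between addressing mode and operand length is a pair of lookup tables. This is only a sketch covering the opcodes in the list above; the table names and mode labels are arbitrary choices, and a usable table would need every opcode:

```python
# Operand size in bytes for each addressing mode in the list above.
OPERAND_BYTES = {
    "implied": 0, "immediate": 1, "zero page": 1, "absolute": 2,
    "absolute,x": 2, "absolute,y": 2, "(indirect,x)": 1,
    "(indirect),y": 1, "indirect": 2, "relative": 1,
}

# Only the opcodes from the list; a real table covers all of them.
OPCODES = {
    0xEA: ("nop", "implied"),
    0xA9: ("lda", "immediate"),
    0xA5: ("lda", "zero page"),
    0xAD: ("lda", "absolute"),
    0xBD: ("lda", "absolute,x"),
    0xB9: ("lda", "absolute,y"),
    0xA1: ("lda", "(indirect,x)"),
    0xB1: ("lda", "(indirect),y"),
    0x6C: ("jmp", "indirect"),
    0xF0: ("beq", "relative"),
}

def instruction_length(opcode):
    """Total size in bytes (opcode plus operand) of a known opcode."""
    mnemonic, mode = OPCODES[opcode]
    return 1 + OPERAND_BYTES[mode]

print(instruction_length(0xAD))  # 3
```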

The code you pasted is handling the bpl instruction (opcode $10), which uses relative addressing. It is not as simple as the beq line in my list above suggests.

Branch instructions, and thus relative addressing, consist of an opcode and a 1-byte operand. The operand value is signed, i.e. it ranges from -128 to +127. If you don't understand the difference between an unsigned and signed number, or don't understand how it works, learn about it ASAP -- every programmer should learn this, it's universally important.
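
For anyone who wants to see the signed/unsigned relationship spelled out, here is a small Python sketch (the helper names are invented for this example). Both functions read the same byte as a two's complement number; the second lets `int.from_bytes` do the work:

```python
def to_signed(byte):
    """Interpret an unsigned byte (0-255) as two's complement (-128..127)."""
    return byte - 256 if byte > 127 else byte

def to_signed_builtin(byte):
    """Same result, using int.from_bytes with signed=True."""
    return int.from_bytes(bytes([byte]), "big", signed=True)

# Print each byte next to both signed interpretations.
for value in (0x00, 0x7F, 0x80, 0xF9, 0xFF):
    print(hex(value), to_signed(value), to_signed_builtin(value))
```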

The signed value is relative to the current PC (program counter), kind of (keep reading). Good disassemblers keep track of the PC during disassembly, that way relative instructions can be disassembled depicting actual addresses (e.g. bne $8000) and not their raw operand byte (e.g. bne $xx). Several 80s-era disassemblers do the latter and it is not particularly helpful.

In actuality (because of how the 6502 works internally), the PC to which the signed byte gets applied is the address of the next instruction, not the address of the branch opcode or its operand (which means the effective range, measured from the branch opcode, is more like +129 to -126). Most disassemblers handle this by simply adding 2 to the final effective address. To understand this, code with actual addresses is needed:

Code:
8000: lda $1234     ; 8000: ad 34 12
8003: cmp #$10      ; 8003: c9 10
8005: bne $8000     ; 8005: d0 f9
8007: nop           ; 8007: ea

As stated, f9 is a signed byte, which is -7. If you count 7 bytes backwards from $8007, you'll get $8000. Here's another example, this time branching forward:

Code:
8000: lda $abcd     ; 8000: ad cd ab
8003: beq $8062     ; 8003: f0 5d
...
8060: ldx #$00      ; 8060: a2 00
8062: sta $0350,x   ; 8062: 9d 50 03

5d is a signed byte, which is +93. If you count 93 bytes forwards from $8005, you'll get $8062.
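
The same arithmetic can be run in the other direction, which is a handy sanity check on both examples. This Python sketch (the function name is an assumption, it does not come from any of the tools discussed) computes the operand byte an assembler would have to emit, and rejects targets outside the +129 to -126 window:

```python
def encode_branch(branch_address, target):
    """Return the operand byte for a branch at branch_address to target."""
    # The offset is measured from the instruction after the 2-byte branch.
    offset = target - (branch_address + 2)
    if not -128 <= offset <= 127:
        raise ValueError("target out of branch range: offset %d" % offset)
    # Two's complement: negative offsets map onto 128-255.
    return offset & 0xFF

print(hex(encode_branch(0x8005, 0x8000)))  # 0xf9
print(hex(encode_branch(0x8003, 0x8062)))  # 0x5d
```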

I haven't looked at dougeff's program in full, but from the code you pasted it looks like his PC is made up of variables count and count2, with currentBank thrown in (probably to try and track/deal with PRG-ROM banks; that's a more NES-esque complication). y seems to be used for the operand byte itself, turned into a signed number.

As for the rest of the code: everything else there code-wise is Python-specific crap that makes it way harder to understand than it really should be, IMO. I did a write-up on that but intentionally removed it at the last minute because it's not related to understanding relative addressing on the 6502. But -- that whole routine could really become something like this:

Code:
elif byte1 == "10":
  y = int(byte2, 16)
  if y > 127:
    y -= 256
  count2 = count + y + 2
  count += 1
  return "\tbpl B{:02x}_{:04x} ; 10 {:02x}".format(currentBank, count2, byte2)

Example:
Code:
>>> currentBank = 2
>>> count2 = 0x8062
>>> byte2 = 0x5d
>>> "\tbpl B{:02x}_{:04x} ; 10 {:02x}".format(currentBank, count2, byte2)
'\tbpl B02_8062 ; 10 5d'

Note #1: I suspect one could still use unpack to turn byte2 into a proper signed 8-bit number thus avoiding the whole int() / >127 / -256 thing (though the comparison code is 100% OK)
Note #2: I'm not sure the printing of byte2 would work quite right because it's apparently a string... again with the bloody strings!
Note #3: I'm making some assumptions about what currentBank gets displayed as, but the formatting could be easily changed to match what it currently outputs.
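
Putting notes #1 and #2 together, here is a hedged sketch of a variant that keeps byte2 and the bank as strings, as the original routine appears to. The function name and parameters are made up, the global count bookkeeping is left out, and the bank is assumed to be a ready-to-print string:

```python
import struct

def format_bpl(count, current_bank, byte2):
    """Format a bpl line; byte2 is a two-digit hex string such as "5d"."""
    # 'b' unpacks one byte as a signed 8-bit integer.
    (offset,) = struct.unpack("b", bytes.fromhex(byte2))
    target = count + offset + 2
    # current_bank and byte2 are strings, so they pass through {} as-is.
    return "\tbpl B{}_{:04x} ; 10 {}".format(current_bank, target, byte2)

print(repr(format_bpl(0x8003, "02", "5d")))  # '\tbpl B02_8062 ; 10 5d'
```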

Let's not get hung up on all of that though, as it's fairly subjective. Anyway, HTH.


PostPosted: Sat Mar 09, 2019 8:28 am
Joined: Fri Nov 24, 2017 2:40 pm
Posts: 144
I made a debugger that ran on the NES a couple years ago as a quick project to remind myself how 6502 assembly worked. The easiest way I could think to implement it was using a bunch of tables. Not terribly efficient or compact, but it was really easy.

https://gist.github.com/slembcke/4b746c ... e2de3201f3

Given an instruction, one table would give you the addressing mode, one would give you the mnemonic, the number of bytes for the instruction, etc. Then it was all fed into a sprintf() statement where the format string was looked up using the addressing mode, and all the other params passed in.
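
The same table-driven idea can be sketched in Python, with str.format standing in for sprintf(). The table contents and names below are invented for illustration and cover only four opcodes; they are not copied from the gist:

```python
# Per-opcode tables: mnemonic and addressing mode (tiny subset only).
MNEMONIC = {0xEA: "nop", 0xA9: "lda", 0xAD: "lda", 0xBD: "lda"}
MODE = {0xEA: "imp", 0xA9: "imm", 0xAD: "abs", 0xBD: "abx"}

# Per-mode tables: instruction size and output format string.
SIZE = {"imp": 1, "imm": 2, "abs": 3, "abx": 3}
FORMAT = {
    "imp": "{m}",
    "imm": "{m} #${v:02x}",
    "abs": "{m} ${v:04x}",
    "abx": "{m} ${v:04x},x",
}

def disassemble_one(data, pos):
    """Return (text, size) for the instruction starting at data[pos]."""
    opcode = data[pos]
    mode = MODE[opcode]
    size = SIZE[mode]
    # Operands are little-endian; an empty slice yields 0 for implied mode.
    value = int.from_bytes(data[pos + 1:pos + size], "little")
    return FORMAT[mode].format(m=MNEMONIC[opcode], v=value), size

print(disassemble_one(bytes([0xAD, 0x34, 0x12]), 0))  # ('lda $1234', 3)
```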


PostPosted: Sun Mar 10, 2019 5:12 am
Joined: Sun Jun 30, 2013 7:59 am
Posts: 21
I got it working! The output is identical to Dougeff's program... except the very last opcode of mario shows up as a BEQ instead of a BRK :?
Strange, they're identical everywhere else!
Oh well, I'm sure I'll bump into the reason as I continue tweaking and adding things. The opcode for BEQ is F0, and Dougeff's program has a .db for f0 there, which seems like too much of a coincidence. Still, strange in that the outputs are identical for every other instruction.

Couldn't have done it without your help everyone, much appreciated :)


PostPosted: Sun Mar 10, 2019 8:08 am
Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3958
Location: A world gone mad
His disassembler is "smart" enough to know that if the last byte of the file/region is an opcode that has operands, but there are none (because the opcode itself is the final byte), then to do a .db {raw-opcode-byte} instead, because just saying beq by itself would be syntactically incorrect/fail to assemble.
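
A minimal Python sketch of that kind of fallback, assuming a size table like the ones discussed earlier in the thread (the names, the three-opcode subset, and the raw-hex operand output are all simplifications made for this example, not how his program does it):

```python
# Total instruction sizes for a few opcodes (illustrative subset).
SIZES = {0xEA: 1, 0xF0: 2, 0xAD: 3}
NAMES = {0xEA: "nop", 0xF0: "beq", 0xAD: "lda"}

def decode_or_db(data, pos):
    """Return (text, bytes_consumed) for the byte at data[pos]."""
    opcode = data[pos]
    size = SIZES.get(opcode)
    # Unknown opcode, or operand bytes run past the end: emit raw data.
    if size is None or pos + size > len(data):
        return ".db ${:02x}".format(opcode), 1
    # Operand bytes are shown as raw hex here to keep the sketch short.
    operand = " ".join("{:02x}".format(b) for b in data[pos + 1:pos + size])
    return (NAMES[opcode] + " " + operand).strip(), size

print(decode_or_db(bytes([0xEA, 0xF0]), 1))  # ('.db $f0', 1)
```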

There are always "edge cases" like that with disassemblers that have to be handled -- also because not everything is code. The "smarter" you make your disassembler to try and figure that out the better. That's partially why Bisqwit made clever-disasm, where you inform the disassembler via an INI file what's code vs. data. There's also tools like disasm6 which support (non-bank-aware) CDL files natively, which are from FCEUX and other emulators. They're essentially a "ROM map" of what is code vs. data; see the link for details. Anyone doing reverse-engineering work in this day and age *cough*me*cough* appreciates such features (though to be frank, it would be wonderful to have a disassembler supporting the modified bank-aware CDL format that Mesen uses... at least I think that's what it offers? Sour can confirm), since with those, you can get some pretty good output.

The real test is making your disassembler output code that can be reassembled by whatever assembler you intend for it to be used with -- and then doing a binary compare (e.g. fc /b) against the ROM or portion of the ROM (e.g. minus header) to see if it reassembled correctly.
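
If fc /b isn't available, the comparison is easy to sketch in Python. The function below is hypothetical; it assumes the original ROM carries a 16-byte iNES header that the reassembled output lacks, and it works on bytes objects (in practice you would read both files in binary mode first):

```python
def first_difference(original, rebuilt, header_size=16):
    """Return the offset of the first differing byte, or None if identical.

    header_size bytes are skipped in `original` only, on the assumption
    that the reassembled output has no iNES header.
    """
    body = original[header_size:]
    for offset, (a, b) in enumerate(zip(body, rebuilt)):
        if a != b:
            return offset
    # One input being a prefix of the other also counts as a mismatch.
    if len(body) != len(rebuilt):
        return min(len(body), len(rebuilt))
    return None

rom = bytes(16) + bytes([0xA9, 0x01, 0xEA])
print(first_difference(rom, bytes([0xA9, 0x01, 0xEA])))  # None
print(first_difference(rom, bytes([0xA9, 0x02, 0xEA])))  # 1
```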


PostPosted: Sun Mar 10, 2019 12:52 pm
Joined: Sun Jan 22, 2012 12:03 pm
Posts: 7328
Location: Canada
CrowleyBluegrass wrote:
I got it working! The output is identical to Dougeff's program... except the very last opcode of mario shows up as a BEQ instead of a BRK :?

BRK can be considered a "two byte" instruction, even though it has no operands. When you RTI from the IRQ response that BRK generates, the return address is 2 bytes past the BRK rather than 1.

It's a little bit difficult to disassemble since this is ambiguous, and if an assembler generates $00 $00 or maybe $00 $EA for BRK (don't know what's common off the top of my head, and I think some assemblers will emit just $00 too), disassembling $00 $0F as BRK would throw away that second byte of information. The safest thing might be to disassemble BRK as two data bytes instead of an instruction mnemonic.

Use of BRK isn't really a generic thing; if a game is actually doing it, it's likely to also be doing weird stack manipulation tricks that will make effective disassembly difficult. You can even roll back that extra byte by decrementing the return address on the stack in your IRQ handler... which some games do!


PostPosted: Sun Mar 10, 2019 2:11 pm
Joined: Sun Feb 07, 2016 6:16 pm
Posts: 629
koitsu wrote:
the modified bank-aware CDL format that Mesen uses... at least I think that's what it offers?
It doesn't have any bank-specific information, on the other hand FCEUX files actually have some bank-related info:
Code:
AA = Into which ROM bank it was mapped when last accessed:
         00 = $8000-$9FFF        01 = $A000-$BFFF
         10 = $C000-$DFFF        11 = $E000-$FFFF
This is probably enough to know the in-memory address of a given byte in PRG (but that's not really important in terms of disassembly, I think?) Mesen's CDL files don't have this info, though (I've used the bits for other flags that I needed for debugger)
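
To make the AA table concrete, a tiny Python sketch. It takes the 2-bit AA value as already extracted, since where those bits sit inside the CDL byte isn't shown in the snippet above; the function name is made up:

```python
def bank_window(aa):
    """Map the 2-bit AA field to the CPU address range it names."""
    if not 0 <= aa <= 3:
        raise ValueError("AA is a 2-bit field")
    # Each window is 8 KB ($2000 bytes), starting at $8000.
    start = 0x8000 + aa * 0x2000
    return start, start + 0x1FFF

print([hex(x) for x in bank_window(0b10)])  # ['0xc000', '0xdfff']
```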

What Mesen's CDL files add (as of the latest dev builds anyway) is the ability to know if a specific byte in memory is the target of a jump/branch operation, or the target of a JSR operation (or both). This is useful to improve presentation of the info in the debugger, but I'm not sure it would be of much use for disassembly purposes.

rainwarrior wrote:
The safest thing might be to disassemble BRK as two data bytes instead of an instruction mnemonic.
That's exactly what I ended up doing for the SNES. COP & BRK both disassemble as 2-byte instructions, which is a lot easier to manage.


PostPosted: Sun Mar 10, 2019 6:52 pm
Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3958
Location: A world gone mad
Older assemblers and disassemblers both always treated BRK/COP as instructions with an opcode and a 1-byte operand, even though the operand is just a "signature byte" (ex. brk $47 or cop $02). So if you're doing it that way, thumbs up.

Even older disassemblers and debuggers, on the other hand (for 6502/65c02) used to just show brk as a single instruction, followed by an utter mess due to everything "being off" by 1 byte. Not something I'd care to re-live.
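
The two conventions can be put side by side in a short Python sketch. The function and its two_byte flag are hypothetical, and whether a given assembler accepts the brk $47 form on the way back in is something to check for your own toolchain:

```python
def decode_brk(data, pos, two_byte=True):
    """Decode the BRK ($00) at data[pos]; return (text, bytes_consumed)."""
    if data[pos] != 0x00:
        raise ValueError("not a BRK opcode")
    # 2-byte style: treat the following byte as BRK's signature byte.
    if two_byte and pos + 1 < len(data):
        return "brk ${:02x}".format(data[pos + 1]), 2
    # 1-byte style: the next byte is left to be decoded as an opcode.
    return "brk", 1

print(decode_brk(bytes([0x00, 0x47, 0xEA]), 0))         # ('brk $47', 2)
print(decode_brk(bytes([0x00, 0x47, 0xEA]), 0, False))  # ('brk', 1)
```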


PostPosted: Sun Mar 10, 2019 9:27 pm
Joined: Sun Jun 30, 2013 7:59 am
Posts: 21
koitsu wrote:
His disassembler is "smart" enough to know that if the last byte of the file/region is an opcode that has operands, but there are none (because the opcode itself is the final byte), then to do a .db {raw-opcode-byte} instead, because just saying beq by itself would be syntactically incorrect/fail to assemble.

EDIT: I thought I understood, but now I'm not so sure.. Dougeff's disassembler outputs the following:
Code:
B0_7ffc:      brk            ; 00
B0_7ffd: .db $80
         .db $f0
         .db $ff

As opposed to (please forgive the fact that my program only prints the raw decoded instruction at the moment):
Code:
32764#(0x00 BRK IMP 1)
32765#( UNDEFINED  1)
32766#(0xf0 BEQ REL 2)

I'm probably going to feel very silly again (like missing the signed integer thing above..) but I'm struggling to understand. Wouldn't $ff be the operand to the BEQ instruction? So that would be a -1 jump.

EDIT DEUX: Sorry for the repeated edits. I studied Koitsu's post some more:
Quote:
In actuality (because of how the 6502 works internally), the PC to which the signed byte gets applied is the address of the next instruction, not the address of the branch opcode or its operand

So the issue here is: The branch would be applied to the address of the NEXT instruction, which doesn't exist. So in this case, it's not so much that BEQ is the last byte of the entire program (and therefore would have no operand describing the offset) it's because although BEQ is a two-byte instruction (the instruction itself, followed by the offset to potentially apply) it requires there to be a third byte, as that is the address the offset is actually applied to? Does that sound about right?

EDIT #3: I believe I'm also interpreting the BRK instructions in the way you guys weren't so thrilled about. Treating it as a 1 byte instruction:
koitsu wrote:
Even older disassemblers and debuggers, on the other hand (for 6502/65c02) used to just show brk as a single instruction, followed by an utter mess due to everything "being off" by 1 byte. Not something I'd care to re-live.
Should I just interpret all BRK instructions as two bytes, with the second emitted as a .db in the disassembly? I assume "being off" means that every BRK instruction is followed by what my disassembler thinks is another "opcode" (UNDEFINED above).

Thanks :)

