
All times are UTC - 7 hours





Posted: Sat Mar 09, 2019 3:25 am

Joined: Sun Jun 30, 2013 7:59 am
Posts: 41
Hi there :)
I've recently started writing an NES disassembler. Recently as in... today!
I have a decent (I hope) surface-level knowledge of the 6502, but not enough to do any serious game development. I was hoping you could give me some help regarding the decoding/processing of opcodes.

I'm working through Dougeff's Python code, which uses a long if/elif chain in place of a case statement:
Code:
def ToASM(byte1,byte2,byte3):
   global count
   global currentBank
   count2 = 0
      
   if byte1 == "00":
      return ("\tbrk\t\t\t\t; 00") # none

I'm assuming byte1 is the opcode/operator itself, as it's being used as the switch value. Could someone please let me know what bytes 2 and 3 represent? I assume byte 2 is the operand. The parts I'm most unsure about are the cases which use the count2 variable:
Code:
   elif byte1 == "10":
      y = int(byte2, 16)
      if y > 127:
         y -= 256
      count2 = count + y + 2
      z = str(hex(count2))
      z = z[2:]
      z = z.zfill(4)
      
      count += 1
      return ("\tbpl B" +currentBank+"_"+ z + " ; 10 " + byte2) # Relative

The common denominator here is that all of these (count2 cases) are branch instructions. Is byte2 the offset for the branch? I'm assuming (again) that the range-checking of y is to do with page boundaries? I'm still unsure about the rest of the code in this excerpt though. There's some array slicing and zero padding going on, but it's fairly impenetrable with my current level of knowledge. I get the feeling I'm going to kick myself though, because I know it's a slice of something from the second index onwards, padded with zeros, and converted to hex...

Thanks in advance. Any wisdom bestowed is greatly appreciated! :)


Posted: Sat Mar 09, 2019 6:28 am
Joined: Fri May 08, 2015 7:17 pm
Posts: 2560
Location: DIGDUG
Some opcodes are 1 byte
TAX
PLA
CLI

Some are 2 bytes

LDA #$01
BEQ +32

Some are 3 bytes

LDA $0235

I always sent the 3 next bytes to the function, even if they weren't all used.

Quote:
the count2 variable


Branches (BEQ, BPL, etc.) are followed by one byte that is a jump relative to the current address + 2. Values 0-127 jump forward; values 128-255 jump backward.

count2 is used for calculating that jump address.
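
For the slicing and zero-padding part of the question, here is a minimal sketch of what those three lines do to count2. The function name label_hex is made up for illustration and is not part of the original program.

```python
def label_hex(address):
    """Mirror the hex() / slice / zfill steps from the quoted excerpt."""
    z = str(hex(address))  # hex() already returns a str, e.g. '0x8062'
    z = z[2:]              # drop the leading '0x', leaving '8062'
    z = z.zfill(4)         # left-pad with zeros to four digits
    return z
```

With this, label_hex(0x62) gives '0062', which is the same text that format(0x62, "04x") produces in one step.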

_________________
nesdoug.com -- blog/tutorial on programming for the NES


Last edited by dougeff on Sat Mar 09, 2019 7:08 am, edited 1 time in total.

Posted: Sat Mar 09, 2019 6:29 am
Joined: Sun Sep 19, 2004 9:28 pm
Posts: 4208
Location: A world gone mad
I had a wonderful response written then lost it because I closed the bloody tab. Sigh.

dougeff I'm sure can answer questions about his code, but I'm also the author of an old DOS-based 65816/65c02/6502 disassembler. I think generally anyone can write a basic one, but the complexity comes with all the features people want and need. Anyway...

You need to learn about 6502 instructions and addressing modes ASAP.

6502 instructions consist of an opcode and a variable number of bytes for the operand -- either 0, 1, or 2. The operand length is based on the addressing mode used by the instruction itself (they're effectively tied together). For example, here are some random instructions and what their bytecode is:

Code:
nop           --> ea          implied
lda #$bb      --> a9 bb       immediate
lda $bb       --> a5 bb       zero page
lda $00bb     --> ad bb 00    absolute
lda $00bb,x   --> bd bb 00    absolute indexed with X
lda $00bb,y   --> b9 bb 00    absolute indexed with Y
lda ($bb,x)   --> a1 bb       indexed indirect
lda ($bb),y   --> b1 bb       indirect indexed
jmp ($00bb)   --> 6c bb 00    indirect
beq $xx       --> f0 xx       relative (branching); see below
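
The link between addressing mode and operand length shown in the table above could be sketched like this. The dictionary names and the mode strings are assumptions for illustration, and only the opcodes listed above are included, not the full 6502 set.

```python
# Operand byte counts per addressing mode (mode names are hypothetical keys).
OPERAND_LENGTH = {
    "implied": 0,
    "immediate": 1,
    "zero page": 1,
    "absolute": 2,
    "absolute,x": 2,
    "absolute,y": 2,
    "(indirect,x)": 1,
    "(indirect),y": 1,
    "indirect": 2,
    "relative": 1,
}

# Only the opcodes from the table above.
OPCODES = {
    0xEA: ("nop", "implied"),
    0xA9: ("lda", "immediate"),
    0xA5: ("lda", "zero page"),
    0xAD: ("lda", "absolute"),
    0xBD: ("lda", "absolute,x"),
    0xB9: ("lda", "absolute,y"),
    0xA1: ("lda", "(indirect,x)"),
    0xB1: ("lda", "(indirect),y"),
    0x6C: ("jmp", "indirect"),
    0xF0: ("beq", "relative"),
}

def instruction_length(opcode):
    """Total size in bytes: one opcode byte plus its operand bytes."""
    mnemonic, mode = OPCODES[opcode]
    return 1 + OPERAND_LENGTH[mode]
```

A disassembler loop would advance its position by instruction_length(opcode) after each decoded instruction.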

The code you linked is handling the bpl instruction (opcode $10), which uses relative addressing. It is not as simple as the last line of my table above suggests.

Branch instructions, and thus relative addressing, consist of an opcode and a 1-byte operand. The operand value is signed, i.e. ranges from +127 to -128. If you don't understand the difference between an unsigned and signed number, or don't understand how it works, learn about it ASAP -- every programmer should learn this, it's universally important.

The signed value is relative to the current PC (program counter), kind of (keep reading). Good disassemblers keep track of the PC during disassembly, that way relative instructions can be disassembled depicting actual addresses (e.g. bne $8000) and not their raw operand byte (e.g. bne $xx). Several 80s-era disassemblers do the latter and it is not particularly helpful.

In actuality (because of how the 6502 works internally), the PC to which the signed byte gets applied is the address of the next instruction, not the address of the branch opcode or its operand (which means the effective range, measured from the branch opcode, is more like +129 to -126). Most disassemblers handle this by simply adding 2 to the final effective address. To understand this, code with actual addresses is needed:

Code:
8000: lda $1234     ; 8000: ad 34 12
8003: cmp #$10      ; 8003: c9 10
8005: bne $8000     ; 8005: d0 f9
8007: nop           ; 8007: ea

As stated, f9 is a signed byte, which is -7. If you count 7 bytes backwards from $8007, you'll get $8000. Here's another example, this time branching forward:

Code:
8000: lda $abcd     ; 8000: ad cd ab
8003: beq $8062     ; 8003: f0 5d
...
8060: ldx #$00      ; 8060: a2 00
8062: sta $0350,x   ; 8062: 9d 50 03

5d is a signed byte, which is +93. If you count 93 bytes forwards from $8005, you'll get $8062.
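
Both worked examples can be checked with a short sketch. The function name branch_target is hypothetical; it takes the address of the branch opcode and the raw operand byte as integers.

```python
def branch_target(opcode_address, operand):
    """Effective address of a 6502 relative branch.

    opcode_address: address of the branch opcode itself
    operand: raw operand byte, 0..255
    """
    # Reinterpret the unsigned byte as a signed 8-bit offset.
    offset = operand - 256 if operand > 127 else operand
    # The offset applies to the address of the next instruction (opcode + 2).
    # The mask keeps the result inside the 16-bit address space.
    return (opcode_address + 2 + offset) & 0xFFFF
```

branch_target(0x8005, 0xF9) gives 0x8000 and branch_target(0x8003, 0x5D) gives 0x8062, matching the two listings.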

I haven't looked at dougeff's program in full, but from the code you pasted it looks like his PC is made up of variables count and count2, with currentBank thrown in (probably to try and track/deal with PRG-ROM banks; that's a more NES-esque complication). y seems to be used for the operand byte itself, turned into a signed number.

As for the rest of the code: everything else there code-wise is Python-specific crap that makes it way harder to understand than it really should be, IMO. I did a write-up on that but intentionally removed it at the last minute because it's not related to understanding relative addressing on the 6502. But -- that whole routine could really become something like this:

Code:
elif byte1 == "10":
  y = int(byte2, 16)
  if y > 127:
    y -= 256
  count2 = count + y + 2
  count += 1
  return "\tbpl B{:02x}_{:04x} ; 10 {:02x}".format(currentBank, count2, byte2)

Example:
Code:
>>> currentBank = 2
>>> count2 = 0x8062
>>> byte2 = 0x5d
>>> "\tbpl B{:02x}_{:04x} ; 10 {:02x}".format(currentBank, count2, byte2)
'\tbpl B02_8062 ; 10 5d'

Note #1: I suspect one could still use unpack to turn byte2 into a proper signed 8-bit number thus avoiding the whole int() / >127 / -256 thing (though the comparison code is 100% OK)
Note #2: I'm not sure the printing of byte2 would work quite right because it's apparently a string... again with the bloody strings!
Note #3: I'm making some assumptions about what currentBank gets displayed as, but the formatting could be easily changed to match what it currently outputs.
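
Taking notes #1 and #2 together, a self-contained sketch might look like the following. It assumes currentBank is an integer and byte2 is the two-character hex string the original code passes around; the function name format_bpl is made up, and the global count += 1 bookkeeping is left out.

```python
import struct

def format_bpl(current_bank, count, byte2):
    """Format a bpl line; byte2 is a hex string such as '5d'."""
    raw = int(byte2, 16)
    # struct format "b" reads one signed 8-bit value.
    (offset,) = struct.unpack("b", bytes([raw]))
    target = count + offset + 2
    # Format the integer, not the string, so {:02x} works.
    return "\tbpl B{:02x}_{:04x} ; 10 {:02x}".format(current_bank, target, raw)
```

Called as format_bpl(2, 0x8003, "5d"), this returns the same text as the interactive example above.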

Let's not get hung up on all of that though, as it's fairly subjective. Anyway, HTH.


Posted: Sat Mar 09, 2019 8:28 am
Joined: Fri Nov 24, 2017 2:40 pm
Posts: 170
I made a debugger that ran on the NES a couple years ago as a quick project to remind myself how 6502 assembly worked. The easiest way I could think to implement it was using a bunch of tables. Not terribly efficient or compact, but it was really easy.

https://gist.github.com/slembcke/4b746c ... e2de3201f3

Given an instruction, one table would give you the addressing mode, one would give you the mnemonic, the number of bytes for the instruction, etc. Then it was all fed into a sprintf() statement where the format string was looked up using the addressing mode, and all the other params passed in.
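
The same table-driven idea, sketched in Python rather than C. This is not the code from the gist: all table names are hypothetical and only four opcodes are filled in.

```python
# One table per property, all keyed by opcode or by addressing mode.
MNEMONIC = {0xEA: "nop", 0xA9: "lda", 0xA5: "lda", 0xAD: "lda"}
MODE = {0xEA: "imp", 0xA9: "imm", 0xA5: "zp", 0xAD: "abs"}
LENGTH = {"imp": 1, "imm": 2, "zp": 2, "abs": 3}
FORMAT = {
    "imp": "{m}",
    "imm": "{m} #${lo:02x}",
    "zp": "{m} ${lo:02x}",
    "abs": "{m} ${hi:02x}{lo:02x}",
}

def disassemble_one(data, pos):
    """Return (text, length) for the instruction starting at data[pos]."""
    opcode = data[pos]
    mode = MODE[opcode]
    length = LENGTH[mode]
    lo = data[pos + 1] if length > 1 else 0
    hi = data[pos + 2] if length > 2 else 0
    # The format string is chosen by addressing mode; unused fields are ignored.
    return FORMAT[mode].format(m=MNEMONIC[opcode], lo=lo, hi=hi), length
```

For example, disassemble_one(bytes([0xAD, 0x34, 0x12]), 0) returns ("lda $1234", 3).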


Posted: Sun Mar 10, 2019 5:12 am
Joined: Sun Jun 30, 2013 7:59 am
Posts: 41
I got it working! The output is identical to Dougeff's program... except the very last opcode of Mario shows up as a BEQ instead of a BRK :?
Strange, they're identical everywhere else!
Oh well, I'm sure I'll bump into the reason as I continue tweaking and adding things. The opcode for BEQ is F0, and Dougeff's program has a .db for f0 there, which seems like too much of a coincidence. Still, strange in that the outputs are identical for every other instruction.

Couldn't have done it without your help everyone, much appreciated :)


Posted: Sun Mar 10, 2019 8:08 am
Joined: Sun Sep 19, 2004 9:28 pm
Posts: 4208
Location: A world gone mad
His disassembler is "smart" enough to know that if the last byte of the file/region is an opcode that has operands, but there are none (because the opcode itself is the final byte), then to do a .db {raw-opcode-byte} instead, because just saying beq by itself would be syntactically incorrect/fail to assemble.
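
A minimal sketch of that kind of end-of-data check, not taken from dougeff's program. The LENGTHS table and the function name decode_or_db are assumptions, and the "decoded" text is a placeholder rather than real mnemonic output.

```python
# Hypothetical length table covering only the opcodes used in this example.
LENGTHS = {0xEA: 1, 0xF0: 2, 0xAD: 3}

def decode_or_db(data, pos):
    """Return (text, bytes_consumed) for the byte at data[pos]."""
    opcode = data[pos]
    length = LENGTHS.get(opcode)
    if length is None or pos + length > len(data):
        # Unknown opcode, or its operand would run past the end of the data:
        # fall back to a raw data byte so the output still assembles.
        return ".db ${:02x}".format(opcode), 1
    operand = data[pos + 1:pos + length]
    return "op ${:02x} operand={}".format(opcode, operand.hex()), length
```

With data ending in a lone f0, the final byte comes out as ".db $f0" instead of a beq with no operand.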

There are always "edge cases" like that with disassemblers that have to be handled -- also because not everything is code. The "smarter" you make your disassembler to try and figure that out the better. That's partially why Bisqwit made clever-disasm, where you inform the disassembler via an INI file what's code vs. data. There are also tools like disasm6 which natively support (non-bank-aware) CDL files, which come from FCEUX and other emulators. They're essentially a "ROM map" of what is code vs. data; see the link for details. Anyone doing reverse-engineering work in this day and age *cough*me*cough* appreciates such features (though to be frank, it would be wonderful to have a disassembler supporting the modified bank-aware CDL format that Mesen uses... at least I think that's what it offers? Sour can confirm), since with those, you can get some pretty good output.

The real test is making your disassembler output code that can be reassembled by whatever assembler you intend for it to be used with -- and then doing a binary compare (e.g. fc /b) against the ROM or portion of the ROM (e.g. minus header) to see if it reassembled correctly.
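
The binary compare could also be done in Python instead of fc /b. This sketch assumes the ROM is an iNES file with the standard 16-byte header and that the reassembled output covers everything after that header; the function name first_mismatch is made up.

```python
INES_HEADER_SIZE = 16  # standard iNES header length, skipped before comparing

def first_mismatch(rom, reassembled, header_size=INES_HEADER_SIZE):
    """Return the offset of the first differing byte, or None if identical.

    rom: bytes of the original file, header included
    reassembled: bytes produced by assembling the disassembly
    """
    original = rom[header_size:]
    shared = min(len(original), len(reassembled))
    for i in range(shared):
        if original[i] != reassembled[i]:
            return i
    # Same content so far; a length difference still counts as a mismatch.
    if len(original) != len(reassembled):
        return shared
    return None
```

The returned offset is relative to the start of the data after the header, so add the header size to find the position in the ROM file.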


Posted: Sun Mar 10, 2019 12:52 pm
Joined: Sun Jan 22, 2012 12:03 pm
Posts: 7582
Location: Canada
CrowleyBluegrass wrote:
I got it working! The output is identicle to Dougeff's program... except the very last opcode of mario shows up as a BEQ instead of a BRK :?

BRK can be considered a "two byte" instruction, even though it has no operands. When you RTI from the IRQ response that BRK generates, the return address is 2 bytes past the BRK rather than 1.

It's a little bit difficult to disassemble since this is ambiguous, and if an assembler generates $00 $00 or maybe $00 $EA for BRK (don't know what's common off the top of my head, and I think some assemblers will emit just $00 too), disassembling $00 $0F as BRK would throw away that second byte of information. The safest thing might be to disassemble BRK as two data bytes instead of an instruction mnemonic.

Use of BRK isn't really a generic thing. If a game is actually doing it, it's likely to also be doing weird stack manipulation tricks that will make effective disassembly difficult. You can even roll back that extra byte by decrementing the return value on the stack in your IRQ handler... which some games do!


Posted: Sun Mar 10, 2019 2:11 pm
Joined: Sun Feb 07, 2016 6:16 pm
Posts: 725
koitsu wrote:
the modified bank-aware CDL format that Mesen uses... at least I think that's what it offers?
It doesn't have any bank-specific information, on the other hand FCEUX files actually have some bank-related info:
Code:
AA = Into which ROM bank it was mapped when last accessed:
         00 = $8000-$9FFF        01 = $A000-$BFFF
         10 = $C000-$DFFF        11 = $E000-$FFFF
This is probably enough to know the in-memory address of a given byte in PRG (but that's not really important in terms of disassembly, I think?) Mesen's CDL files don't have this info, though (I've used the bits for other flags that I needed for debugger)

What Mesen's CDL files add (as of the latest dev builds anyway) is the ability to know if a specific byte in memory is the target of a jump/branch operation, or the target of a JSR operation (or both). This is useful to improve presentation of the info in the debugger, but I'm not sure it would be of much use for disassembly purposes.

rainwarrior wrote:
The safest thing might be to disassemble BRK as two data bytes instead of an instruction mnemonic.
That's exactly what I ended up doing for the SNES. COP & BRK both disassemble as 2-byte instructions, which is a lot easier to manage.


Posted: Sun Mar 10, 2019 6:52 pm
Joined: Sun Sep 19, 2004 9:28 pm
Posts: 4208
Location: A world gone mad
Older assemblers and disassemblers both always treated BRK/COP as instructions with an opcode and a 1-byte operand, even though the operand is just a "signature byte" (ex. brk $47 or cop $02). So if you're doing it that way, thumbs up.
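
A sketch of the opcode-plus-signature-byte treatment. The function name decode_brk is hypothetical, and whether the target assembler accepts the "brk $xx" syntax is an assumption to check against its manual.

```python
def decode_brk(data, pos):
    """Treat BRK ($00) as a 2-byte instruction: opcode plus signature byte."""
    if data[pos] != 0x00:
        raise ValueError("not a BRK opcode")
    if pos + 1 < len(data):
        # Consume the following byte as the signature so decoding stays aligned.
        return "brk ${:02x}".format(data[pos + 1]), 2
    # BRK is the very last byte: nothing follows, so emit it as raw data.
    return ".db $00", 1
```

Consuming two bytes keeps the following instructions from being decoded one byte off.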

Even older disassemblers and debuggers, on the other hand (for 6502/65c02), used to just show brk as a single-byte instruction, followed by an utter mess due to everything "being off" by 1 byte. Not something I'd care to re-live.


Posted: Sun Mar 10, 2019 9:27 pm
Joined: Sun Jun 30, 2013 7:59 am
Posts: 41
koitsu wrote:
His disassembler is "smart" enough to know that if the last byte of the file/region is an opcode that has operands, but there are none (because the opcode itself is the final byte), then to do a .db {raw-opcode-byte} instead, because just saying beq by itself would be syntactically incorrect/fail to assemble.

EDIT: I thought I understood, but now I'm not so sure.. Dougeff's disassembler outputs the following:
Code:
B0_7ffc:      brk            ; 00
B0_7ffd: .db $80
         .db $f0
         .db $ff

As opposed to (please forgive the fact that my program rawly decodes the instruction at the moment)
Code:
32764#(0x00 BRK IMP 1)
32765#( UNDEFINED  1)
32766#(0xf0 BEQ REL 2)

I'm probably going to feel very silly again (like missing the signed integer thing above..) but I'm struggling to understand. Wouldn't $ff be the operand to the BEQ instruction? So that would be a -1 jump.

EDIT DEUX: Sorry for the repeated edits. I studied Koitsu's post some more:
Quote:
In actuality (because of how the 6502 works internally), the PC in to which the signed byte gets applied is actually the address of the next instruction, not the address of the branch opcode or its operand

So the issue here is: The branch would be applied to the address of the NEXT instruction, which doesn't exist. So in this case, it's not so much that BEQ is the last byte of the entire program (and therefore would have no operand describing the offset) it's because although BEQ is a two-byte instruction (the instruction itself, followed by the offset to potentially apply) it requires there to be a third byte, as that is the address the offset is actually applied to? Does that sound about right?

EDIT #3: I believe I'm also interpreting the BRK instructions in the way you guys weren't so thrilled about. Treating it as a 1 byte instruction:
koitsu wrote:
Even older disassemblers and debuggers, on the other hand (for 6502/65c02) used to just show brk as a single instruction, followed by an utter mess due everything "being off" by 1 byte. Not something I'd care to re-live.
Should I just interpret all BRK instructions as two bytes, with the second output as a .db in the disassembly? I assume "being off" means that every BRK instruction is followed by what my disassembler thinks is another "opcode" (UNDEFINED above).

Thanks :)


Posted: Wed Apr 17, 2019 4:54 pm
Joined: Sun Jun 30, 2013 7:59 am
Posts: 41
(Apologies for the bump, but I didn't feel this was worth starting an entirely new thread.)

In the event that an opcode is disassembled as a branch, but the location it is trying to branch to is outside a reasonable range, for example:
Code:
;;start of PRG
LDA #$00
BEQ #$FB
LDA #$23

If my two's-complement is correct, that would branch outside (before) the PRG-rom which the code is in. Is that a thing that would ever be expected to turn up intentionally? I'm not sure whether to just produce an error upon encountering that. Seems that would be the only thing for it, but I wanted to ask, as I might not even be interpreting this properly.

I originally started thinking about this due to the previous problem I was having: a branch juuuust before the end of rom, followed only by its offset. Of course, the branch was invalid because there was no valid address to apply the offset to. So now with every branch instruction, I'm also checking to see if the rom has enough bytes left to allow for the branch. Of course, performing this check every time a branch is encountered is not so good, as the majority of branches are likely to be within a valid range, but that's what I've got for now.

Thanks :)


Posted: Wed Apr 17, 2019 5:17 pm
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 21591
Location: NE Indiana, USA (NTSC)
If you have a BEQ #$FB at $8002, the user would expect it to disassemble as BEQ $7FFF, which clearly jumps into SRAM. If there's no SRAM on the cart, the actual executed instruction will come from data bus capacitance (and ultimately from $80FF if I remember correctly).
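
Rather than raising an error, a disassembler could compute the target and merely note whether it lands in PRG space. A minimal sketch, assuming PRG-ROM is mapped at $8000-$FFFF; the constant and function names are made up.

```python
PRG_START, PRG_END = 0x8000, 0xFFFF  # assumed PRG-ROM window

def describe_branch(opcode_address, operand):
    """Return (target, in_prg) for a relative branch."""
    offset = operand - 256 if operand > 127 else operand
    target = (opcode_address + 2 + offset) & 0xFFFF
    # A target outside PRG space is still a legal branch; just flag it.
    in_prg = PRG_START <= target <= PRG_END
    return target, in_prg
```

describe_branch(0x8002, 0xFB) returns (0x7FFF, False), so the line can still be printed as BEQ $7FFF, perhaps with a comment.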

_________________
Pin Eight | Twitter | GitHub | Patreon


Posted: Wed Apr 17, 2019 5:23 pm
Joined: Sun Sep 19, 2004 9:28 pm
Posts: 4208
Location: A world gone mad
Tepples' answer is correct. If you think 6502 code "can only run within 'ROM space' (e.g. $8000-FFFF on NES)" then that is incorrect.

In all my years I've never seen someone use branch instructions and while using them "have to worry about if the ROM has enough room left for the branch". This doesn't really make any sense. Can you explain what exactly you're talking about here?

P.S. Branch instructions do not use immediate addressing (re: BEQ #$FB); this should be BEQ $FB. If the assembler is allowing you to write the former, then that's a bug and it should be reported.


Posted: Wed Apr 17, 2019 5:37 pm
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 21591
Location: NE Indiana, USA (NTSC)
BEQ $FB is valid only in $007A (halfway into zero page) through $0179 (halfway into stack), not in ROM. I understood the "BEQ #rr" to mean "BEQ *+2+(signed char)rr".
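
The $007A-$0179 range follows from the signed offset limits and can be checked with a few lines. The function name branch_source_range is hypothetical.

```python
def branch_source_range(target):
    """Lowest and highest opcode addresses from which a branch can reach target.

    target = opcode_address + 2 + offset, with offset in -128..+127,
    so opcode_address = target - 2 - offset.
    """
    lowest = target - 2 - 127   # branch using the largest forward offset
    highest = target - 2 + 128  # branch using the largest backward offset
    return lowest, highest
```

branch_source_range(0xFB) gives (0x7A, 0x179).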

_________________
Pin Eight | Twitter | GitHub | Patreon


Posted: Wed Apr 17, 2019 5:59 pm
Joined: Sun Sep 19, 2004 9:28 pm
Posts: 4208
Location: A world gone mad
It depends on the assembler. Ones like Merlin clearly cover this case (Merlin 8/16 will decide the legality of the addressing mode for any given opcode); others make the (very good and highly justified) assumption that you'll use a label, which relieves all of this pain. Others permit things like beq $-5 and bne $+5 (really), which is based on what the effective PC is at the time of assembly. It all varies per assembler.

The point I'm making: branch instructions do not use immediate addressing, they use what's called relative addressing.

If you want to write branch instructions with their literal operand value and don't want to use labels, for whatever reason, then it's best to resort to using .db (or equivalent -- again, see your assembler manual) statements, e.g. .db $f0,$fb for the aforementioned instruction example.

I will also point out that some disassemblers also disassemble branch instructions into an opcode $operand syntax, e.g. BEQ $FB. This doesn't mean "branch if equal to address $00FB", but rather exactly the way CrowleyBluegrass denoted in their previous post: branch-if-equal backwards 5 bytes. See my previous post if confused about relative addressing.

