On Disassembly

I have spent a lot of time disassembling other people's code. Much of it was hand-written 6502 or 65816 assembly, some of it was ARM assembly, some was Dalvik bytecode. I've done it by hand on printouts of code listings, I've used fancy tools, I've written tools to make my life easier. I figured I'd write down a few things I've learned along the way.

The Process

The basic idea is this: you grab a binary blob and feed it into a disassembler. The disassembler chews it up and spits out a list of assembly instructions. If the disassembler has a moderate level of sophistication, it does its level best to separate code from data. (Games tend to be about 50/50 code/data.) You then write meaningful names for functions and variables, add comments explaining what the code does and how the data is used, and in the end you have a reasonable substitute for the original source code.
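
As a rough illustration of that first step, here is a minimal sketch of a table-driven disassembler for a handful of 6502 opcodes. It is not how any particular tool is implemented; the `OPCODES` subset, the `disassemble` function, and the sample bytes are made up for this example. Anything the table doesn't recognize is emitted as a data byte, which is the crudest possible form of separating code from data.

```python
# Tiny subset of the 6502 opcode map: opcode -> (mnemonic, mode, length).
# A real table covers every opcode; this one is only for illustration.
OPCODES = {
    0xA9: ("lda", "imm", 2),
    0x8D: ("sta", "abs", 3),
    0x20: ("jsr", "abs", 3),
    0x60: ("rts", "imp", 1),
}

def disassemble(blob, origin):
    """Return (address, text) pairs; unrecognized bytes are emitted as data."""
    out = []
    pc = 0
    while pc < len(blob):
        addr = origin + pc
        entry = OPCODES.get(blob[pc])
        if entry is None or pc + entry[2] > len(blob):
            out.append((addr, ".byte  $%02x" % blob[pc]))
            pc += 1
            continue
        mnemonic, mode, length = entry
        if mode == "imm":
            text = "%s  #$%02x" % (mnemonic, blob[pc + 1])
        elif mode == "abs":
            operand = blob[pc + 1] | (blob[pc + 2] << 8)  # little-endian
            text = "%s  $%04x" % (mnemonic, operand)
        else:
            text = mnemonic
        out.append((addr, text))
        pc += length
    return out

# Hypothetical blob, assumed to load at $0800.
blob = bytes([0xA9, 0x00, 0x8D, 0x00, 0x20, 0x20, 0x34, 0x12, 0x60, 0xFF])
listing = disassemble(blob, 0x0800)
for addr, text in listing:
    print("%04x  %s" % (addr, text))
```

The hard part is everything this sketch leaves out: deciding where execution can actually reach, and giving the results meaningful names.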

Simple.

The Journey

It's not simple.

Here are some things to keep in mind.

You must understand the program

This is essential. You will have a difficult time figuring out how a program works if you don't understand what it does.

For a game this means playing it exhaustively, through every level and mode. Pay close attention to every detail. Know what actions trigger behaviors, when animations change, when sounds are played. Cheats that make you invincible can make it easier to look at parts of the screen away from the action. For some games, the ability to jump directly from one level to another can be very helpful.

It helps to understand the host system

Your life will be easier if you're working on a system you're familiar with. Space Eggs was much easier for me than Battlezone because I've spent a lot of time with the Apple II, but I had zero familiarity with coin-op arcade machines. On the other hand, if you want to learn how a system works, this is a great way to do it. Just accept up front that you may be puzzled by certain things and will need to do some reading or ask for help.

It's a jigsaw puzzle, not a book

If you're expecting to start at the start and read to the end, you will be disappointed. For example, suppose you see some code that loads a byte, adds 12, and stores it somewhere else. You can't understand the purpose of that until you know something about where the data is coming from and where it is going. Sometimes it's obvious, e.g. initialization code that zeroes out a region of memory is pretty easy to figure out, but it's rarely that simple.

When assembling a rectangular jigsaw puzzle, it's easiest to start by finding the pieces with straight sides and constructing the frame. Most programs have something similar: inputs and outputs. For example, Apple II games will read the keyboard and joystick, output text or graphics, and click the speaker. Once you find where the inputs are read, you can examine the code that reacts to those inputs. If joystick button 1 fires guns and nothing else, then you know that the code that cares about that button is responsible for firing the guns.
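
One way to find the "straight edges" on an Apple II is to search for instructions that touch the well-known I/O locations. The sketch below is an assumption-laden illustration: `find_io_refs` and the sample fragment are hypothetical, and it does a raw byte-pattern scan rather than a real code trace, so it can report false hits inside data.

```python
# Well-known Apple II I/O locations (soft switches).
IO_LOCATIONS = {
    0xC000: "keyboard data",
    0xC010: "keyboard strobe",
    0xC030: "speaker toggle",
    0xC061: "button 0",
    0xC062: "button 1",
}

# Absolute-mode opcodes commonly used to hit soft switches.
ABS_OPCODES = {0xAD: "lda", 0xAE: "ldx", 0xAC: "ldy", 0x2C: "bit", 0x8D: "sta"}

def find_io_refs(blob, origin):
    """Report (address, mnemonic, target) for each access to a known I/O location."""
    hits = []
    for i in range(len(blob) - 2):
        op = ABS_OPCODES.get(blob[i])
        if op is None:
            continue
        target = blob[i + 1] | (blob[i + 2] << 8)  # little-endian operand
        if target in IO_LOCATIONS:
            hits.append((origin + i, op, target))
    return hits

# Hypothetical fragment: wait for a key, clear the strobe, click the speaker.
blob = bytes([
    0xAD, 0x00, 0xC0,   # lda $c000
    0x10, 0xFB,         # bpl back to the lda
    0x8D, 0x10, 0xC0,   # sta $c010
    0xAD, 0x30, 0xC0,   # lda $c030
    0x60,               # rts
])
hits = find_io_refs(blob, 0x6000)
for addr, op, target in hits:
    print("%04x  %s $%04x  ; %s" % (addr, op, target, IO_LOCATIONS[target]))
```

Each hit is a place to start reading: the code around a keyboard read is probably input handling, and the code around a speaker access is probably a sound routine.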

You don't have to move in a straight line

I've lost count of the number of times I've sat down with the intention of exploring feature A, then spent the rest of the day chasing features B and C because something caught my attention. If your goal is to fully disassemble something, then it doesn't matter which section you explore first. And if you get a bit of inspiration about something, it's best to chase after it while the trail is hot.

Alternatively, if something catches your eye but you don't want to chase it down right away, leave notes to yourself so you can find it easily later.

Work incrementally upward

Expect to take little nibbles. Figure out what a chunk of code does, and give it a label. Figure out who calls it, and see if the caller is doing something obvious with the inputs or outputs. If the function is accessing global variables, give those labels, and see what other code accesses them.

Don't beat your head against a chunk of code that just isn't making sense. Find bits you understand and explore around them. Learning more about the rest of the program will help you decode the obscure parts.

When exploring source code it's often easiest to work from the top down: get the big picture, then trace your way down into the interesting bits. When disassembling code it's easier to work in the other direction, seeking the "leaf nodes" in the call graph first, and figuring out how they fit together later.
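
If you have even a partial list of who calls whom, the bottom-up order can be worked out mechanically. This is a minimal sketch, assuming the call relationships have already been collected (for example from JSR targets); the `bottom_up_order` function and all of the labels are hypothetical.

```python
def bottom_up_order(calls):
    """Order functions so that each one appears after everything it calls.

    `calls` maps a function label to the set of labels it calls.
    Functions stuck in a call cycle are appended at the end, sorted by name.
    """
    remaining = {f: set(c) for f, c in calls.items()}
    for callees in calls.values():
        for callee in callees:
            remaining.setdefault(callee, set())
    order = []
    while remaining:
        # Leaves: functions with no unexplored callees left.
        leaves = sorted(f for f, c in remaining.items()
                        if not any(x in remaining for x in c))
        if not leaves:
            order.extend(sorted(remaining))  # only cycles are left
            break
        order.extend(leaves)
        for f in leaves:
            del remaining[f]
    return order

# Hypothetical call graph for a small game.
calls = {
    "main_loop": {"read_input", "update_enemies", "draw_frame"},
    "update_enemies": {"random", "move_object"},
    "draw_frame": {"draw_line"},
    "read_input": set(),
}
order = bottom_up_order(calls)
print(order)
```

In practice the order is only a suggestion. The point is that the functions at the front of the list can be understood without knowing anything about the rest.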

Don't be afraid to guess...

Don Lancaster's classic Tearing Into Machine Language Code (originally part of Enhancing Your Apple II, Vol. 1) makes the point that "the method relies heavily on your subconscious putting together the big picture and sewing up the loose ends." Sometimes you'll look at something and think, "I bet that's doing X", but it's buried in so much other stuff that you can't be sure. Write your guess down while it's fresh in your head. You can update it later on if it's wrong.

...but distinguish fact from guesswork

Suppose you dig into some code and find that the thing you labeled "player_position" is being used in a strange way. Either the variable doesn't actually hold the player position, or they're using the value for a novel purpose. If you're 100% sure that the location holds the player's position then you should dive deeper into the current code, but if the label was just a wild guess then you need to go back and revisit your assumptions. Separating fact from speculation will save time.

SourceGen allows you to mark any label or symbol as uncertain by adding a '?' to it. This feature wasn't part of the original design; it didn't occur to me to add it until, after disassembling a couple of large projects, I realized that it was very useful to be able to tell the difference between the things I was sure about and the things I was just guessing at. "high_score?" is more succinct than "high_score_maybe", and it's easier to spot in symbol lists.

I also use it to indicate partial information. For example, if a variable only holds $00 or $ff, but I haven't figured out what it means, I'll give it a label like "bool_1234?" (where $1234 is the address). The '?' is a reminder that I'm not done with it yet.
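
The same convention is easy to apply in your own notes or scripts. The sketch below is not SourceGen's API; it simply treats the '?' as a suffix on the name, which is a simplification, and the `placeholder_label` helper and the symbol table are hypothetical.

```python
def placeholder_label(kind, addr):
    """Build a provisional name such as 'bool_1234?' from a 16-bit address."""
    return "%s_%04x?" % (kind, addr)

def is_uncertain(label):
    """A trailing '?' marks a label as a guess or as partial information."""
    return label.endswith("?")

# Hypothetical symbol table: address -> label.
symbols = {
    0x1234: placeholder_label("bool", 0x1234),
    0x0300: "high_score?",
    0x0310: "player_position",
}

# The to-do list: everything that still needs to be confirmed.
todo = sorted(label for label in symbols.values() if is_uncertain(label))
print(todo)
```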

Become friends with the debugger

There will be times when you look at a piece of code and have no idea what it does, or you have some idea and want to test a theory. A quick way to sort things out is to change variables or modify the program, and the easiest way to do that is with a debugger. Emulators like MAME and AppleWin have built-in debuggers that can make your life much easier.

Two important features they usually provide are breakpoints and watchpoints, which halt the program or emulator when execution reaches a certain point, or when a specific memory location is accessed. If you think a bit of code is only run when a ship explodes or when the score goes from 999 to 1000, you can verify your guess quickly.
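
To make the watchpoint idea concrete, here is a toy model in which memory writes go through a wrapper that notes who touched a watched address. It is only a sketch: a real debugger halts the emulator instead of logging, and the `WatchedMemory` class, the addresses, and the program counter values are all invented for this example.

```python
class WatchedMemory:
    """64KB of RAM that records writes to watched addresses."""

    def __init__(self):
        self.ram = bytearray(0x10000)
        self.watch = set()
        self.log = []  # (pc, address, old value, new value)

    def write(self, addr, value, pc):
        value &= 0xFF
        if addr in self.watch:
            self.log.append((pc, addr, self.ram[addr], value))
        self.ram[addr] = value

mem = WatchedMemory()
mem.watch.add(0x0050)               # hypothetical location of the score

mem.write(0x0050, 0x09, pc=0x6120)  # watched: logged
mem.write(0x0051, 0x01, pc=0x6123)  # not watched: ignored
mem.write(0x0050, 0x10, pc=0x6120)  # watched: logged

for pc, addr, old, new in mem.log:
    print("pc=$%04x wrote $%04x: $%02x -> $%02x" % (pc, addr, old, new))
```

The useful part is the `pc` column. If every write to the location comes from one routine, you have probably found the code that owns that variable.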

Another useful feature is the ability to alter the executing code. While exploring Battlezone I found some code that was called very early in the main update loop. Based on the sort of calculations it was doing, I guessed that it was responsible for updating the particles that fly out of the volcano. So I did a simple test: start the game, rotate to face the volcano, then use the debugger to disable the call to the function. Sure enough, the particles stopped moving.
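
One common way to disable a call on a 6502 is to overwrite the three-byte JSR instruction with NOPs. The sketch below shows that patch on a byte buffer; it is an assumption about how such a test could be done, not a record of the exact patch used, and the addresses and the `disable_call` helper are hypothetical.

```python
JSR, NOP = 0x20, 0xEA  # 6502 opcodes

def disable_call(mem, call_addr):
    """Overwrite the 3-byte JSR at call_addr with NOPs; return the old bytes."""
    if mem[call_addr] != JSR:
        raise ValueError("no JSR at $%04x" % call_addr)
    saved = bytes(mem[call_addr:call_addr + 3])
    mem[call_addr:call_addr + 3] = bytes([NOP] * 3)
    return saved

def restore_call(mem, call_addr, saved):
    """Put the original instruction back once the experiment is done."""
    mem[call_addr:call_addr + 3] = saved

mem = bytearray(0x10000)
mem[0x5000:0x5003] = bytes([0x20, 0x00, 0x68])  # hypothetical: jsr $6800

saved = disable_call(mem, 0x5000)
patched = bytes(mem[0x5000:0x5003])
restore_call(mem, 0x5000, saved)
print(patched.hex(), saved.hex())
```

Saving the original bytes matters. You want to be able to turn the feature back on and confirm that the behavior returns.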

Don't narrate the instruction stream

When adding comments, you want it to be possible to understand the essence of what the code does by just reading your commentary. Most lines should have a comment, and really complicated or interesting sections should have a block comment that explains the details. However, it's safe to assume that the reader has some understanding of the instruction set, so you don't need to explain how individual assembly language instructions work.

So this is not very helpful:

loop:
        lda  (ptr),y        ;load value from pointer
        ora  #$80           ;OR the high bit
        sta  (ptr),y        ;store value to pointer
        cmp  #$aa           ;compare to $AA
        bne  loop           ;branch if not

But this is better:

set_hi_loop:
        lda  (ptr),y        ;get character
        ora  #$80           ;convert to high ASCII
        sta  (ptr),y        ;write it back
        cmp  #$aa           ;was it '*'?
        bne  set_hi_loop    ;no, loop

Strive to explain what the instruction accomplishes. While it's true that converting to high ASCII involves setting the high bit, it's not the case that setting the high bit always converts something to high ASCII. Use the more-specific description, as it helps the reader understand why something is being done, not just what.
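
The arithmetic behind those comments is easy to check. This short sketch applies the same operation as "ora #$80" to a few bytes; the `to_high_ascii` name is just for this example.

```python
def to_high_ascii(data):
    """Set bit 7 of every byte, as 'ora #$80' does for a single byte."""
    return bytes(b | 0x80 for b in data)

text = to_high_ascii(b"HI*")
print(text.hex())

# '*' is $2a in plain ASCII, so the high-ASCII form is $aa,
# which is the value the loop compares against.
star = ord("*") | 0x80
print("$%02x" % star)
```

Without the comment "was it '*'?", the reader has to do this conversion in their head every time they see the constant.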

An Example

While disassembling Stellar 7, I came across tables that looked like this:

L8591    .bulk   $15,$0f,$10,$12,$12,$13,$10,$14,$11,$14,$10,$12,$0f,$00,$00
L85A0    .bulk   $00,$01,$02,$01,$02,$03,$02,$05,$02,$04,$01,$01,$01,$01,$01
L85AF    .bulk   $2d,$32,$20,$30,$28,$1e,$28,$32,$00,$00,$00,$3c,$50,$00,$20

These were being read with one index variable and stored elsewhere with a different index variable, sometimes with additional computation. It was in a chunk of code that looked like it was doing some initialization for the enemy units. The numbers didn't mean anything at first glance, so I ignored it for a while.

Later on, I returned to the initialization function, and looked at the table again. There are some patterns in the data -- repeated values -- but nothing that makes much sense... unless you happen to watch the mission briefing, which lists the weapon type, armor thickness, maximum speed, and shots/round for each enemy unit. When I ran the briefing side-by-side with the tables, I realized that the values matched.

This revelation paid off in several ways:

There were several other tables nearby, all indexed the same way. Knowing that they were some sort of type-specific unit characteristic helped me to understand the code that was using them.
The value read from the "armor thickness" table is used as the hit points for the enemy unit, which meant the place it was written to held the hit points for that object. So the code that decremented the value was handling a collision with a player projectile, and code that checked it for zero was handling enemy death.
I'd created visualizations for the 3D mesh list, and noticed that there were multiple entries for projectiles, e.g. 3 kinds of laser that used the same shape data. The table at $8591 was composed entirely of numbers that matched projectile object indices. This told me that objects $0f/$10/$11, which all use the laser projectile mesh, corresponded to three different levels of laser fire. Checking the mission briefing, the projectile types matched up with low/medium/high lasers. This suggested that the object type for a projectile determined not only its appearance, but might also be used to determine things like impact damage. This helped me make sense of projectile initialization and collision handling.
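
Tables like these are parallel arrays: entry N of each table describes unit type N. Transposing them into one record per unit makes the patterns easier to see. In this sketch only the "projectile" column name comes from the discussion above; the other two names are placeholders, since nothing here says which table holds which briefing value, and the `unit_records` helper is hypothetical.

```python
TABLE_8591 = [0x15, 0x0f, 0x10, 0x12, 0x12, 0x13, 0x10, 0x14,
              0x11, 0x14, 0x10, 0x12, 0x0f, 0x00, 0x00]
TABLE_85A0 = [0x00, 0x01, 0x02, 0x01, 0x02, 0x03, 0x02, 0x05,
              0x02, 0x04, 0x01, 0x01, 0x01, 0x01, 0x01]
TABLE_85AF = [0x2d, 0x32, 0x20, 0x30, 0x28, 0x1e, 0x28, 0x32,
              0x00, 0x00, 0x00, 0x3c, 0x50, 0x00, 0x20]

def unit_records(**tables):
    """Transpose parallel tables into one dict per unit type index."""
    lengths = {len(t) for t in tables.values()}
    if len(lengths) != 1:
        raise ValueError("tables must all be the same length")
    names = list(tables)
    return [dict(zip(names, column)) for column in zip(*tables.values())]

records = unit_records(
    projectile=TABLE_8591,   # projectile object index (from the text)
    table_85a0=TABLE_85A0,   # placeholder name: meaning not assumed
    table_85af=TABLE_85AF,   # placeholder name: meaning not assumed
)

# Which unit types fire one of the three laser objects ($0f/$10/$11)?
laser_units = [i for i, r in enumerate(records)
               if r["projectile"] in (0x0f, 0x10, 0x11)]
print(laser_units)
```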

This sort of thing, where a discovery opens up additional avenues of investigation, happens often. This particular one wouldn't have happened if I'd skipped the mission briefing, and figuring things out would have taken longer.

What I didn't do is read the object initialization function from top to bottom, over and over, until it all made sense or my head exploded. Instead, I figured out bits and pieces from various places. When I eventually returned to it, most of the addresses it was reading from and writing to had nice labels, and the function of the code was obvious.

Closing Notes

The most important ingredients are patience and persistence. Complex programs will not give up their secrets quickly or easily. You may get to a point where you feel like the code is impenetrable, or be overwhelmed by the sense that there's just so much of it. But so long as you are adding meaningful labels and comments, you are making forward progress, and will eventually succeed.

The toughest nut to crack in videogames is the "AI" code. The functions look at a bunch of game state and update some more game state, and there are often a lot of complicated heuristics. You need to figure out what all that state means before you can start to make sense of the code, and then you have to try to map what the code is doing to what you see on screen. Some of this goes back to needing to understand the program thoroughly. For example, I was really confused by the movement code for homing mines in Stellar 7, because it was checking the object's altitude. I didn't know that mines could hop over objects. (By that point I had enough of the program deciphered that I could use the debugger to set up a custom level with obstacles and mines in a straight line in front of me, and confirm that the game was actually doing that.)

Compiled code is a bit different from hand-written assembly. On the plus side, code will tend to be organized into self-contained functions with recognizable calling conventions. Sometimes you get the massive goto-enabled spaghetti code, but compilers tend to have a smaller bag of tricks than humans attempting to be clever. Learning what code generated by specific compilers looks like can help you visualize what the source code looked like.

Personal note: Don Lancaster's "Tearing Into Machine Language Code" article, mentioned earlier, is how I learned to disassemble a large body of 6502 code back in the 1980s. I still have a printout from one of my efforts back in 1987 (scan 1, scan 2). Thanks, Don.