Decompiler Design - Basic Blocks |
|
Prev: Code vs. Data
Now that we are starting to make sense of the unstructured byte sequence in the binary file, the next problem we need to solve before we can start decompiling the code is to be sure that a sequence of instructions is actually executed in the sequence we see it in the file.
Consider the following sequence:
L1: MOV EAX, $2 L2: MUL EAX, ECX L3: MOV DWORD [0x402000], EAXSince we know from the processor's user's manual that the MUL instruction requires one of the operands as well as the result to be in EAX, we might be tempted to translate the above sequence into the following statement:
var_402000 = ecx * 2;That would be wrong for two reasons:
L0: MOV EAX, $3 CMP EBX, $0 JNE L2 L1: MOV EAX, $2 L2: MUL EAX, ECX L3: MOV DWORD [0x402000], EAX MOV DWORD [0x402010], EAX ...
The correct way to translate the above sequence would be:
eax = (ebx == 0) ? 2 : 3; var_402010 = var_402000 = eax * ecx;
But how do we know that execution could skip L1 and go directly to L2? And more importantly, how do we know that there are no other ways that execution can go to L2, maybe from some other part of the code?
These problems can be solved by using a well known data structure called basic block. This data structure is one of the most important one in the entire decompiler, since a lot of code will rely on the information stored in it for many analyses.
A lot of information exists in the literature on what basic blocks are. We've already
given an informal definition. Here's a more precise one:
A consequence of this definition is that every jump destination starts a new basic block, and every jump instruction (including return instructions) ends a basic block.
For a decompiler it's not clear whether a call instruction should end a basic
block. Depending on the sophistication of the algorithms, one can consider a call
instruction to end a basic block, or one can just ignore call instructions
for the time being.
It's however clear that the
Given that we already have computed the list of labels and jumps when we scanned for code and data, and that we have the entry point of most functions in our procedure table, we can build the basic blocks for a procedure by using the following algorithm:
current_address = current_procedure->entry_address workList.push = current_address while workList not empty current_address = workList.pop next: block = new Block at current_address blockList += block label = find label at address higher than current_address branch = find branch at address higher than current_address if label->address < branch->address_after_branch block->length = label->address - block->address current_address = label->address goto next end if block->length = branch->address_after_branch - block->address if branch is return // returns have no known destination continue workList.push = branch->destination if branch is conditional current_address = branch->address_after_branch goto next end if // else will resume from branch destination end while
This algorithm will record in the blockList set which sequences of instructions are executed in sequence. But while we are at, why don't we also record where execution goes from block to block? After all, it may be useful to know that in our example after we've set EAX to 3 we're going to use it in the MUL instruction, instead than on some other instruction. To do this, we record 2 additional pieces of information for each block:
Block *get_block_at(Address start_address) { Block *block = find in BlockList a block for start_address if block == not found block = new Block at start_address blockList += block return block } CreateBasicBlocks(current_procedure) { current_address = current_procedure->entry_address block = get_block_at(current_address) current_procedure->entry_block = block workList.push = current_address while workList not empty current_address = workList.pop block = get_block_at(current_address) next: label = find label at address higher than current_address branch = find branch at address higher than current_address if label->address < branch->address_after_branch block->length = label->address - block->address current_address = label->address next_block = get_block_at(current_address) next_block->preds += block block->succs += next_block block = next_block goto next end if block->length = branch->address_after_branch - block->address if branch is return // returns have no known destination continue end if next_block = get_block_at(branch->destination) next_block->preds += block block->succs += next_block workList.push = branch->destination if branch is not conditional continue // will resume from branch destination end if next_block = get_block_at(branch->address_after_branch) next_block->preds += block block->succs += next_block block = next_block current_address = block->address goto next end while sort blockList by block's start address }
Notes:
Indirect jumps, on the other hand, can have many destinations. We'll extend the algorithm in the advanced section to add all the known destinations to the list of successors for a block that ends with an indirect jump.
While this algorithm is more complicated, we'll see in the next section that the successors and predecessors information will become very useful to find out more about our procedure.
Next: Control Flow Graph