RecStudio Decompiler Design

Now that we are starting to make sense of the unstructured byte sequence in the binary file, the next problem we need to solve before we can start decompiling the code is to be sure that a sequence of instructions is actually executed in the sequence we see it in the file.

Consider the following sequence:

L1:      MOV  EAX, $2
L2:      MUL  EAX, ECX
L3:      MOV  DWORD [0x402000], EAX

Since we know from the processor's user's manual that the MUL instruction requires one of the operands as well as the result to be in EAX, we might be tempted to translate the above sequence into the following statement:

     var_402000 = ecx * 2;

That would be wrong for two reasons:

there may be a jump from some other part of the program to L2. If that happens, we cannot be sure that EAX will have the value 2 in it!
The EAX register may be used again after we have saved the value of the MUL instruction into memory.

For example, a bigger sequence could be the following one:

L0:     MOV   EAX, $3
        CMP   EBX, $0
        JNE   L2
L1:     MOV   EAX, $2
L2:     MUL   EAX, ECX
L3:     MOV   DWORD [0x402000], EAX
        MOV   DWORD [0x402010], EAX
        ...

The correct way to translate the above sequence would be:

       eax = (ebx == 0) ? 2 : 3;
       var_402010 = var_402000 = eax * ecx;

But how do we know that execution could skip L1 and go directly to L2? And more importantly, how do we know that there are no other ways that execution can go to L2, maybe from some other part of the code?

These problems can be solved by using a well known data structure called basic block. This data structure is one of the most important one in the entire decompiler, since a lot of code will rely on the information stored in it for many analyses.

A lot of information exists in the literature on what basic blocks are. We've already given an informal definition. Here's a more precise one: "a basic block is a sequence of instructions; the first instruction is the only way for execution to enter a basic block; the last instruction is the only way for execution to exit the basic block."

A consequence of this definition is that every jump destination starts a new basic block, and every jump instruction (including return instructions) ends a basic block.

For a decompiler it's not clear whether a call instruction should end a basic block. Depending on the sophistication of the algorithms, one can consider a call instruction to end a basic block, or one can just ignore call instructions for the time being.
It's however clear that the destination of a call instruction always starts a basic block.

Given that we already have computed the list of labels and jumps when we scanned for code and data, and that we have the entry point of most functions in our procedure table, we can build the basic blocks for a procedure by using the following algorithm:

    current_address = current_procedure->entry_address
    workList.push = current_address
    while workList not empty
        current_address = workList.pop
next:
        block = new Block at current_address
        blockList += block
        label = find label at address higher than current_address
        branch = find branch at address higher than current_address
        if label->address < branch->address_after_branch
            block->length = label->address - block->address
            current_address = label->address
	    goto next
	end if
	block->length = branch->address_after_branch - block->address
        if branch is return	// returns have no known destination
	    continue
        workList.push = branch->destination
        if branch is conditional
            current_address = branch->address_after_branch
            goto next
        end if
        // else will resume from branch destination
    end while

This algorithm will record in the blockList set which sequences of instructions are executed in sequence. But while we are at, why don't we also record where execution goes from block to block? After all, it may be useful to know that in our example after we've set EAX to 3 we're going to use it in the MUL instruction, instead than on some other instruction. To do this, we record 2 additional pieces of information for each block:

The list of blocks that jump to the entry instruction of the current block: that is the list of predecessor blocks (pred for short);
and the list of blocks where we jump to when we're finished executing this block: that is the list of successor blocks (succ for short).

Here's a modified version of the basic block construction algorithm that records this additional information:

Block *get_block_at(Address start_address)
{
    Block *block = find in BlockList a block for start_address
    if block == not found
	block = new Block at start_address
        blockList += block
    return block
}

CreateBasicBlocks(current_procedure)
{
    current_address = current_procedure->entry_address
    block = get_block_at(current_address)
    current_procedure->entry_block = block
    workList.push = current_address
    while workList not empty
        current_address = workList.pop
        block = get_block_at(current_address)
next:
        label = find label at address higher than current_address
        branch = find branch at address higher than current_address

        if label->address < branch->address_after_branch
            block->length = label->address - block->address
            current_address = label->address
	    next_block = get_block_at(current_address)
            next_block->preds += block
            block->succs += next_block
            block = next_block
	    goto next
	end if

	block->length = branch->address_after_branch - block->address
        if branch is return	// returns have no known destination
	    continue
        end if

	next_block = get_block_at(branch->destination)
        next_block->preds += block
        block->succs += next_block
        workList.push = branch->destination
        if branch is not conditional
            continue           // will resume from branch destination
        end if
	next_block = get_block_at(branch->address_after_branch)
        next_block->preds += block
        block->succs += next_block
        block = next_block
        current_address = block->address
        goto next
    end while

    sort blockList by block's start address
}

Notes:

this algorithm is still incomplete: it does not take into account indirect jumps, as it only assumes that a jump can have 2 destinations: the true case and the false case.
Indirect jumps, on the other hand, can have many destinations. We'll extend the algorithm in the advanced section to add all the known destinations to the list of successors for a block that ends with an indirect jump.
the algorithm only uses jumps and returns. It should ignore function calls (that is, if call instructions were recorded in the list of branches, they have to be skipped when looking for the next branch).

While this algorithm is more complicated, we'll see in the next section that the successors and predecessors information will become very useful to find out more about our procedure.

Next: Control Flow Graph

Decompiler Design - Basic Blocks

Basic Blocks