Prev: Object File Formats
Identifying Code and Data
Since non-code can be present in an executable section, we need some way to identify
which sequences of bytes are actually code and which are not.
The decompiler knows where the program starts executing from the object file format
(the program's entry point).
A simple and safe algorithm for identifying code regions is to analyze which
instructions can be reached from the program's entry point. The following pseudo-code
would produce a list of function entry points (as well as some additional useful
information):
callList = entry_address
while callList not empty
address = pop from callList
functionList += new function starting at address
while true
if instr at address is jump
workList += jump destination
if instr at address is call
callList += call destination
if instr at address is ret
if workList empty
break
if instr is not conditional
address = pop from workList
continue
address += instr length
end while
end while
The same algorithm can be applied to any function entry point. If the binary
file has a symbol table (whether a low-level list of labels or high-level
debugging information), we can call the algorithm for all known function entry
points.
Unfortunately the previous algorithm has a few shortcomings:
- it does not recognize functions accessed via function pointers;
- it does not follow indirect jumps, such as those used in switch
statements;
- it does not identify functions that are present in the binary file
but never called.
Functions accessed via function pointers are found in the following cases:
- straight function pointers stored in memory. In this case the decompiler
should report that the call is through a user variable. It is not possible
for the decompiler to know which functions can be called from that
instructions.
- virtual functions accessed through a C++ virtual table.
- functions accessed through dynamic library trampoline code. Since the
trampoline code has to be exported to the dynamic linker for the dynamic
linker to be able to patch it with the address of the entry point to the
dynamic linbrary function, these can usually be safely converted into direct
calls to the trampoline code, and mark the trampoline code as the entry
point of the library function.
- functions that are accessed through a register which has been
initialized with the destination address of a global function. This is a
difficult case, which requires data flow analysis to compute which value is
contained in the register at the call point. But since we don't know which
bytes are code and which are data, yet, we cannot perform such analysis
until much later.
Since we have established that the previous algorithm will not find all
functions, we need to apply further analyses to identify the missing functions.
The following heuristics can be applied:
- use a library signature module to identify sequences of bytes that might
be the entry point of a function. Using a database of function byte
sequences also allows the identification of the function name, of any global
variables that may be used by the function, and the types of both the
function and its input parameters. If the function's behavior is known, we
can mark so that we don't decompile it, thus reducing the output only to
user functions and not to library functions.
- ask the processor module to provide a short sequence of bytes that is
typically used by a compiler to set up the stack frame at the beginning of a
function. Instruction sequences such as PUSH EBP + MOV EBP, ESP + SUB ESP,#
are typically found at the beginning of functions in x86 programs.
- assume that the first non-noop instructions after a RET instruction that
is not the destination of a jump is the entry point of a function. This is a
risky assumption to make, and it may consider as code bytes that are instead
data. One possibility is to give the user a list of such entry point and let
the user decide whether to use such addresses as function entry points.
The output of the algorithm is a list of function entry points, and a list of
jump and ret instructions, which will be used in constructing the basic blocks
of each function.
For each function, we also record the function that it calls.
Next : Basic Blocks