Decompiler Design - Identifying Code and Data

 

Prev: Object File Formats


Identifying Code and Data

Since non-code can be present in an executable section, we need some way to identify which sequences of bytes are actually code and which are not. The decompiler knows where the program starts executing from the object file format (the program's entry point). A simple and safe algorithm for identifying code regions is to analyze which instructions can be reached from the program's entry point. The following pseudo-code would produce a list of function entry points (as well as some additional useful information):

callList = entry_address
while callList not empty
   address = pop from callList
   functionList += new function starting at address
   while true
       if instr at address is jump
           workList += jump destination
       if instr at address is call
           callList += call destination
       if instr at address is ret
           if workList empty
               break
       if instr is not conditional
           address = pop from workList
           continue
       address += instr length
   end while
end while

The same algorithm can be applied to any function entry point. If the binary file has a symbol table (whether a low-level list of labels or high-level debugging information), we can call the algorithm for all known function entry points.

Unfortunately the previous algorithm has a few shortcomings:

  • it does not recognize functions accessed via function pointers;
  • it does not follow indirect jumps, such as those used in switch statements;
  • it does not identify functions that are present in the binary file but never called.

Functions accessed via function pointers are found in the following cases:

  • straight function pointers stored in memory. In this case the decompiler should report that the call is through a user variable. It is not possible for the decompiler to know which functions can be called from that instructions.
  • virtual functions accessed through a C++ virtual table.
  • functions accessed through dynamic library trampoline code. Since the trampoline code has to be exported to the dynamic linker for the dynamic linker to be able to patch it with the address of the entry point to the dynamic linbrary function, these can usually be safely converted into direct calls to the trampoline code, and mark the trampoline code as the entry point of the library function.
  • functions that are accessed through a register which has been initialized with the destination address of a global function. This is a difficult case, which requires data flow analysis to compute which value is contained in the register at the call point. But since we don't know which bytes are code and which are data, yet, we cannot perform such analysis until much later.

Since we have established that the previous algorithm will not find all functions, we need to apply further analyses to identify the missing functions. The following heuristics can be applied:

  • use a library signature module to identify sequences of bytes that might be the entry point of a function. Using a database of function byte sequences also allows the identification of the function name, of any global variables that may be used by the function, and the types of both the function and its input parameters. If the function's behavior is known, we can mark so that we don't decompile it, thus reducing the output only to user functions and not to library functions.
  • ask the processor module to provide a short sequence of bytes that is typically used by a compiler to set up the stack frame at the beginning of a function. Instruction sequences such as PUSH EBP + MOV EBP, ESP + SUB ESP,# are typically found at the beginning of functions in x86 programs.
  • assume that the first non-noop instructions after a RET instruction that is not the destination of a jump is the entry point of a function. This is a risky assumption to make, and it may consider as code bytes that are instead data. One possibility is to give the user a list of such entry point and let the user decide whether to use such addresses as function entry points.

The output of the algorithm is a list of function entry points, and a list of jump and ret instructions, which will be used in constructing the basic blocks of each function.
For each function, we also record the function that it calls.


Next : Basic Blocks