Decompiler Design - Code vs. Data 2
Prev: Switch Statements
Recall that one of the decompiler's fundamental tasks is to separate code from data.
As we have seen in the previous page, switch statements pose a serious problem to the task of separating the code regions from the data regions.
The switch problem is a particular case of the more generic problem of how to handle code that is referenced through pointers. While in the switch case the pointer was automatically generated by the compiler and had a fairly limited scope (that of the switch statement basic block, and possibly that of the block's predecessors), there are other cases involving pointers to code that have a broader scope.
In this page we show some of these cases.
Languages such as C and C++ allow the use of a pointer to a function. For example:
C code            Asm code
---------------   --------------------
int (*ptr)();
ptr = addr1;      L1: R1 = addr1
                      [ptr] = R1
...               ...
(*ptr)();         L2: R2 = [ptr]
                      CALL [R2]
How does the decompiler know which function is called at L2?
Moreover, how do we know that ptr is a pointer to code?
From the instructions at L1, there is no indication that addr1
is a function. Sure, we could assume that if addr1 is in
the text section it is the entry point of a function, but there's
no guarantee of that. Only if some other part of the code
contains an explicit call to addr1 can we be sure
that it is a function.
If both L1 and L2 are in the same function, we could apply
type analysis to ptr and discover that it is used as a function
pointer by the CALL instruction, and thus infer that addr1
is a function. However, if L1 and L2 are in different functions,
it becomes very difficult to know that ptr was assigned
an address and then used in a CALL instruction.
Moreover, remember that we are doing the code walking during the
early phases of decompilation: at this stage we have not performed
any data flow analysis, much less any type analysis.
We could apply some heuristics to addr1 to see if it starts with the typical instruction sequence that saves the frame pointer and allocates the local stack. Still, if the heuristic fails, we won't be able to identify addr1 as a function, which means that its code, along with the code of any other function it calls, will remain unreachable to the decompiler.
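A minimal sketch of such a heuristic follows; the two byte patterns are illustrative examples for 32-bit and 64-bit x86, and a real decompiler would need many more patterns per target and per compiler:

```c
#include <stddef.h>
#include <stdbool.h>

/* Heuristic sketch: does the candidate address start with a typical
 * prologue that saves the frame pointer and sets up the new frame?
 *   55 8B EC      push ebp; mov ebp, esp   (32-bit x86)
 *   55 48 89 E5   push rbp; mov rbp, rsp   (64-bit x86)
 * Note that frameless (e.g. leaf or heavily optimized) functions will
 * not match, which is exactly the failure mode discussed above. */
bool looks_like_prologue(const unsigned char *bytes, size_t len)
{
    if (len >= 3 && bytes[0] == 0x55 && bytes[1] == 0x8B && bytes[2] == 0xEC)
        return true;
    if (len >= 4 && bytes[0] == 0x55 && bytes[1] == 0x48 &&
        bytes[2] == 0x89 && bytes[3] == 0xE5)
        return true;
    return false;
}
```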
One special case of the above is a form of constant folding
applied to CALL instructions, typically performed by the Microsoft
and Green Hills compilers.
When this optimization is performed, for example to reduce the
size of the generated code, a register is assigned the
address of the function and later used in the CALL instruction,
without the register ever being saved to a memory location.
This case is a bit easier to deal with, since a simple
constant propagation algorithm should be able to relink the
CALL instruction with its destination address. For example:
C code            Asm code            Decompiled code
---------------   -----------------   ----------------
func1();          L1: R1 = addr1
                      CALL [R1]       addr1();
...               ...
func1();              CALL [R1]       addr1();
...               ...
func2();              CALL func2      func2();
...               ...
func1();              CALL [R1]       addr1();  // ?
------------------------------------------------------------
SymbolTable.defineFunction(addr1)
In this case we can see that there are no other assignments to R1, and that all statements are reachable from the same function, so we can convert the CALL [R1] instructions into CALL addr1 (as long as we know that func2 does not change the value of R1!).
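The propagation step can be sketched on a hypothetical three-opcode IR (ASSIGN_CONST, CALL_INDIRECT and CALL_DIRECT are placeholder names, not any real decompiler's representation):

```c
#include <stddef.h>

enum op { ASSIGN_CONST, CALL_INDIRECT, CALL_DIRECT };

struct insn {
    enum op op;
    int reg;          /* register, for ASSIGN_CONST / CALL_INDIRECT */
    unsigned target;  /* constant address, or resolved call target  */
};

#define NREGS 8

/* Walk the instruction list in order, remember the last constant
 * assigned to each register, and rewrite indirect calls whose register
 * value is known.  This sketch assumes direct calls (like CALL func2)
 * do not clobber the tracked registers; a real analysis must check the
 * callee's effect on R1 before trusting the propagated value. */
void propagate_call_targets(struct insn *code, int n)
{
    unsigned value[NREGS];
    int known[NREGS] = {0};

    for (int i = 0; i < n; i++) {
        switch (code[i].op) {
        case ASSIGN_CONST:
            value[code[i].reg] = code[i].target;
            known[code[i].reg] = 1;
            break;
        case CALL_INDIRECT:
            if (known[code[i].reg]) {
                code[i].op = CALL_DIRECT;   /* CALL [R1] -> CALL addr1 */
                code[i].target = value[code[i].reg];
            }
            break;
        case CALL_DIRECT:
            break;
        }
    }
}
```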
In case there are multiple possible values for the destination, we can still detect that each value is a function address, even though we will not always be able to replace the CALL instructions. In the following example var is a local variable and all statements are in the same function:
C code            Asm code            Decompiled code
---------------   -----------------   ----------------
void (*var)();
var = addr1;      L1: R1 = addr1      var = addr1;
(*var)();             CALL [R1]       addr1();
...               ...
(*var)();             CALL [R1]       addr1();
...               ...
if (expr)                             if (expr)
  var = addr2;        R1 = addr2        var = addr2;
...               ...
(*var)();         L2: CALL [R1]       (*var)();
------------------------------------------------------------
SymbolTable.defineFunction(addr1)
SymbolTable.defineFunction(addr2)
In this case, we can use use-def chains to identify all possible values of R1 at L2, and define a function for each of the addresses (in this case addr1 and addr2). However, we won't be able to convert the CALL [R1] instruction into a direct call, and we won't be able to eliminate the local variable var, since its value is not constant at L2 (it depends on whether the if() statement was true).
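The bookkeeping at the end of that step can be sketched as follows; define_function() is a hypothetical stand-in for the SymbolTable.defineFunction() call shown in the listing above, and the reaching set is assumed to have been produced already by the use-def chains:

```c
#include <stddef.h>

#define MAX_FUNCS 64
static unsigned func_table[MAX_FUNCS];
static size_t n_funcs;

/* Record addr as a function entry point, once. */
static void define_function(unsigned addr)
{
    for (size_t i = 0; i < n_funcs; i++)
        if (func_table[i] == addr)
            return;                     /* already defined */
    if (n_funcs < MAX_FUNCS)
        func_table[n_funcs++] = addr;
}

/* reaching[] holds every constant value the call register (R1 in the
 * example) may have at the CALL site; define a function at each one.
 * Returns the total number of functions defined so far. */
size_t define_call_targets(const unsigned *reaching, size_t n)
{
    for (size_t i = 0; i < n; i++)
        define_function(reaching[i]);
    return n_funcs;
}
```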
Languages such as C++ and Java have the concept of a "virtual" method. Virtual methods are nothing more than function pointers associated with the class or struct that defines them. When a method is marked as "virtual", the compiler will always use an indirect call to access the method, taking the address from a special table (the "virtual table") associated with the instance of that class (the this pointer).
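To make the pattern concrete, here is a sketch in C of how a compiler might lay out such a call; the layout is illustrative only, not any particular compiler's ABI:

```c
struct shape;                            /* the "class" */

struct shape_vtable {
    int (*area)(const struct shape *);   /* one virtual method slot */
};

struct shape {
    const struct shape_vtable *vptr;     /* hidden vtable pointer */
    int w, h;
};

static int rect_area(const struct shape *s) { return s->w * s->h; }

static const struct shape_vtable rect_vtable = { rect_area };

int call_area(const struct shape *s)
{
    /* The virtual call obj->area() is lowered roughly as:
     *   R1 = [s]           ; load the vtable pointer
     *   R2 = [R1 + slot]   ; load the method pointer
     *   CALL [R2]          ; indirect call
     * which is exactly the pointer-to-code pattern discussed above. */
    return s->vptr->area(s);
}
```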
Different compilers use different methods to retrieve the method pointer from the virtual table of each class. These methods are not officially documented, since they are not part of the Application Binary Interface of an architecture, which implies that it is not generally possible to mix libraries compiled with one C++ compiler with code compiled with another C++ compiler.
This is a major problem for a decompiler: not only must it infer that the code originally came from a C++ source rather than a C source, it must also infer which compiler was used and apply different access algorithms to identify where the virtual table for each class is and where the virtual method pointers are.
Next: Call Optimizations