Decompiler Design - Call Optimizations

Prev: Code vs. Data 2
The theme emerging from the previous two pages is that analyses limited to a single basic block may not be sufficient to recover the information about a statement in that block.
Here we show some other cases that can hamper, even more severely, the decompiler's ability to analyze the instruction stream.
These cases typically occur in optimized code, especially when the optimization spans a whole function or is extended to a whole source module (compilation unit).
This optimization, often called cross-jumping or tail merging, is performed at the join point of two or more basic blocks when the code at the end of each predecessor is identical. In this case, a compiler may choose to merge the identical code and move the join point back to the first non-identical instruction.
Consider the following example:
C code             Non-optimized     Optimized
----------------   ---------------   ---------------
if(cond)           CMP R1, cond      CMP R1, cond
                   JNE Lelse         JNE Lelse
  x[i] = func1(1); PUSH 1            PUSH 1
                   CALL func1        CALL func1
                   SP += 4           JMP Ljoin
                   R1 = i
                   x[R1*4] = R0
                   JMP L1
else               Lelse:            Lelse:
  x[i] = func2(2); PUSH 2            PUSH 2
                   CALL func2        CALL func2
                   SP += 4           Ljoin:
                   R1 = i            SP += 4
                   x[R1*4] = R0      R1 = i
                                     x[R1*4] = R0
                   L1:
This optimization has clear advantages, such as smaller code and a higher cache hit ratio. Obviously it is not always possible for the compiler to perform it, since several conditions have to be met. Still, this optimization is frequently seen in assembly code.
Unfortunately this affects the decompiler's algorithm that detects parameter passing, because now the decompiler has to cross a block boundary to see if the instruction after the CALL is one that deallocates parameters (the SP += 4 instruction). The following optimization is another one that affects the computation of parameter passing and stack frame allocation.
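The cross-block search can be sketched roughly as follows. The instruction representation here (the Insn and Block types and the opcode names) is invented for the example; a real decompiler's IR would differ, but the idea of following the successor edge past the end of the block is the same:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical IR: opcodes and block layout invented for this sketch. */
typedef enum { I_PUSH, I_CALL, I_SP_ADD, I_JMP, I_OTHER } Op;
typedef struct { Op op; int imm; } Insn;

typedef struct Block {
    const Insn *insns;
    size_t n;
    struct Block *succ;   /* single successor: fall-through or JMP target */
} Block;

/* Bytes deallocated after the CALL at index ci, searching across
 * block boundaries until a cleanup or another CALL is found. */
int param_bytes_after_call(const Block *b, size_t ci) {
    size_t i = ci + 1;
    while (b) {
        for (; i < b->n; i++) {
            if (b->insns[i].op == I_SP_ADD)
                return b->insns[i].imm;   /* SP += imm found */
            if (b->insns[i].op == I_CALL)
                return 0;                 /* intervening call: give up */
        }
        b = b->succ;                      /* cross the block boundary */
        i = 0;
    }
    return 0;                             /* no cleanup seen: assume 0 bytes */
}
```

Applied to the optimized example above, the search starts after CALL func1, runs off the end of the then-branch, follows the edge to the Ljoin block, and finds the shared SP += 4 there.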
When there are several function calls in a basic block, the compiler may choose to leave the parameters on the stack between the call instructions, and only deallocate them at the end of the sequence.
Consider the following example:
C code             Non-optimized     Optimized
----------------   ---------------   ---------------
a = func1(1);      PUSH 1            PUSH 1
                   CALL func1        CALL func1
                   SP += 4
                   a = R0            a = R0
b = func2(2,3);    PUSH 3            PUSH 3
                   PUSH 2            PUSH 2
                   CALL func2        CALL func2
                   SP += 8
                   b = R0            b = R0
func3(func4(4));   PUSH 4            PUSH 4
                   CALL func4        CALL func4
                   SP += 4
                   PUSH R0           PUSH R0
                   CALL func3        CALL func3
                   SP += 4           SP += 20
As you can see, the optimized sequence has fewer instructions, at the price of using more stack space. This is normally not a big problem on Unix or Windows systems; however, it may be a problem on embedded systems, where stack space can be limited.
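One way a decompiler can cope with deferred cleanup is to track the bytes pushed since the previous CALL, attribute them to each call in turn, and check that the single deferred SP adjustment accounts for the total. A minimal sketch, assuming 4-byte stack slots and the same hypothetical Insn encoding as before (not any real decompiler's IR):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical IR, invented for this sketch. */
typedef enum { I_PUSH, I_CALL, I_SP_ADD, I_OTHER } Op;
typedef struct { Op op; int imm; } Insn;

/* Record, for each CALL, the bytes pushed since the previous CALL.
 * The deferred SP += at the end must equal the total bytes pushed. */
size_t attribute_params(const Insn *ins, size_t n, int out[], size_t max_calls) {
    size_t ncalls = 0;
    int pushed = 0, total = 0;
    for (size_t i = 0; i < n; i++) {
        switch (ins[i].op) {
        case I_PUSH:
            pushed += 4;                  /* 4-byte stack slots assumed */
            total  += 4;
            break;
        case I_CALL:
            if (ncalls < max_calls)
                out[ncalls] = pushed;     /* this call's parameter bytes */
            ncalls++;
            pushed = 0;
            break;
        case I_SP_ADD:
            assert(ins[i].imm == total);  /* deferred cleanup matches sum */
            total = 0;
            break;
        default:
            break;
        }
    }
    return ncalls;
}
```

On the optimized sequence above this attributes 4, 8, 4, and 4 bytes to func1, func2, func4, and func3 respectively, and the final SP += 20 matches the 20 bytes pushed in total.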
A variation of the above scheme, typically used by GCC, is to pre-allocate the stack space and use only indirect move instructions to pass the parameters, as in the following example:
C code             Asm code
----------------   -------------------
func() {           SP -= 8+locals_size
  a = func1(1);    SP[0] = 1
                   CALL func1
                   a = R0
  b = func2(2,3);  SP[4] = 3
                   SP[0] = 2
                   CALL func2
                   b = R0
  func3(func4(4)); SP[0] = 4
                   CALL func4
                   SP[0] = R0
                   CALL func3
Note that the code uses the smallest stack size needed (8 bytes) and the smallest code size, since there is no need to deallocate the parameters after each call. The restriction is that none of the called functions can use a variadic (...) prototype; that is, the compiler must know exactly how many parameters each function expects.
When code like this is generated, it is very difficult for the decompiler to compute how many parameters each function expects, because the instruction stream at the call point gives no indication of how many bytes are passed. In fact, each SP[x] = y instruction could just be assigning a value to a local variable allocated on the stack. The decompiler must therefore rely on the prototype of the called function (if available), or on the data accesses gathered while analyzing each called function, to know whether a particular location SP[x] is a local variable or a passed parameter.
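In other words, the classification has to come from outside the call site. As a trivial sketch (the function name is invented for the example; the parameter-size value would come from a known prototype or from prior analysis of the callee):

```c
/* A store SP[x] = y just before a CALL is parameter set-up only if x
 * falls inside the callee's outgoing-argument area; otherwise, as far
 * as this call is concerned, it is a local-variable assignment. */
int is_param_store(int sp_offset, int callee_param_bytes) {
    return sp_offset >= 0 && sp_offset < callee_param_bytes;
}
```

For the func2(2,3) call above, whose arguments occupy 8 bytes, the stores to SP[0] and SP[4] classify as parameter moves, while a store to SP[8] or beyond would be a candidate local variable.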
Note that the same limitation is present when passing parameters in registers.
When a function call is the last statement in a function, a compiler may choose to generate a jump instead of a call followed by a return. For example:
C code             Non-optimized     Optimized
----------------   ---------------   ---------------
func1() {          PUSH FP           PUSH FP
                   FP = SP           FP = SP
  ...
  func2();         CALL func2        SP = FP
  return;                            POP FP
}                  SP = FP           JMP func2
                   POP FP
                   RET
This optimization allows the compiler to generate more efficient code: a CALL immediately followed by a RET is equivalent to a single JMP instruction. The JMP does not push a return address on the stack, and the processor executes only one RET instead of two, further speeding up execution.
This optimization may confuse a decompiler, because now we effectively have a function with two entry points: one is func1, the other is func2. Both entry points lead to a single exit point, since they share the final RET instruction.
Luckily, this case is easier to handle than the previous ones: once the frame set-up instructions are eliminated by the standard stack-processing algorithm, the control flow graph will still be legal, at the price of duplicating the code of func2() in the body of func1(). Even the algorithm that discovers function entry points will still work, since func2() will probably be called from other parts of the program and thus be recognized as a function entry point and entered as such in the symbol table.
A particularly smart decompiler can check whether a JMP instruction targets a known function entry point, and treat the stack-frame deallocation instructions preceding it as a special case.
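Such a check amounts to a lookup in the symbol table of discovered function entry points; if the JMP target is one of them, the decompiler can translate the jump as "return func2();" instead of duplicating the callee's body. A sketch, with types and names invented for the example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical symbol-table entry for a discovered function. */
typedef struct { unsigned addr; } FuncEntry;

/* A JMP whose target is a known function entry point is a tail call:
 * equivalent to CALL target followed by RET. */
bool jmp_is_tail_call(const FuncEntry *table, size_t n, unsigned target) {
    for (size_t i = 0; i < n; i++)
        if (table[i].addr == target)
            return true;
    return false;
}
```

A real implementation would use a hash table or sorted array rather than a linear scan, and would also confirm that the frame-teardown instructions (SP = FP; POP FP) immediately precede the jump.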
Next: Passing Parameters in Registers