Prev: Simple Decompiler
Problems with the Simplest Decompiler
Let's consider the problems that are introduced by the simple
decompiler described before, and try to find a possible solution.
We start with the initial instruction we decompiled:
0x004011E5: or byte [ebp-0x3], 0x8 -> L4011E5: ebp[-0x3] |= 0x8;
- EBP is a processor register, so it is not known to a generic compiler;
therefore, to have our program compiled, we need to declare EBP.
In this case, we can declare EBP to be a pointer to some byte, that is:
unsigned char *ebp;
ebp[-0x3] |= 0x8;
- EBP is mostly used to point to the frame of the local procedure,
so the above statement is probably trying to access a local variable
that was declared of byte size. However, there may be other local
variables that are of different sizes, and they will all be accessed
through EBP (at different offsets from EBP), so declaring EBP as an
unsigned char pointer won't work for those other variables.
We can solve this problem by forcing the size of each access through
the register, based on the size of the original instruction:
char *bp;
*(char *)(ebp-0x3) |= 0x8;
Accessing a different variable will use a different cast:
*(short *)(ebp-0x8) += 10;
-
Even though usually EBP contains an address, there are cases
where the compiler might use it to hold a numeric constant
(in frame-less functions EBP is used just like any other register).
If EBP is used in an arithmetic operation other than + or -,
the compiler will complain if we declare it char *.
So we need to declare it as an int. This is not a problem
when using the cast approach above, because the adding and
subtracting from an integer is essentially the same as
adding and subtracting from a char *.
-
One instruction type that is not easily translated using
the above method is for those instructions that set a register
with an address for later use. These instructions may be
indistinguishable from instructions that load a numeric
constant for some later calculation:
8D 3C 85 DC 68 40 00 lea edi, [eax*4+0x4068DC] // A
8D 3C 85 04 00 00 00 lea edi, [eax*4+4] // B
The translation for these two instructions is quite different:
edi = &var_4068DC[eax*4]; // A
edi = 4 + eax * 4; // B
Why the difference? Because in one case some memory area will be accessed,
thus we need to take the address of that memory area and index it
(in fact, we should probably remove the multiplication, too,
since that's how array of integers are typically accessed - so the
instruction should become: edi = &var_4068DC[eax];).
In the second case, some numeric computation is performed and the
compiler chose to use one lea instruction instead of the
lengthier sequence mul and add. Solving this problem is not easy.
These are just some example. As you have seen, control flow instructions (jumps, calls) don't translate very well, and using the mechanical translation method described above quickly creates unreadable generated C code that may become bloated when recompiled to machine code after the translation.
We clearly need a better solution.
Next: Standard Decompiler Architecture