Decompiler Design - Simple Decompiler |
|
Prev: Reverse Engineering Tools
As we hinted when talking about disassemblers, the simplest way to decompile a program is to present as much high-level information as possible for each instruction. Let's say that the disassembler converts a sequence of bytes into the following instruction:
0x004011E5: 80 4D FD 08 or byte [ebp-0x3], 0x8
We can translate this instruction into C by simply considering the effect that the instruction has on the program. A possible translation of this instruction is:
ebp[-0x3] |= 0x8;A simple filter on the output of the disassembler can do the job:
Objdump -d file.exe | translate > file.cWhere translate would do something like this:
while(gets(instr)) { p = &instr[FIRST_OPCODE_COL]; // skip address, bytes p = isolate_word(p, opcode); // get operation p = isolate_argument(p, dest); // get destination operand if(*p == ',') p = isolate_argument(p + 1, source);// get source operand if(!strcmp(opcode, "or")) printf("%s |= %s\n", dest, source); . . . // other operators }
Most of the arithmetic instructions can be translated this way. Some effort can be dedicated to pretty-printing the arguments, so that [ebp-0x3] can become ebp[-3]. If a memory location is involved in the instruction, we can simply record its use in a table and emit the variable declaration in a .h file. Jump instructions are special, because you need to define the label that corresponds to the target of the jump, but this is easy to do: just emit the address of each instruction as a label as you translate them:
0x004011D6: EB 0D jmp L4011E5 . . . 0x004011E5: 80 4D FD 08 or byte [ebp-0x3], 0x8 L4011D6: goto L4011E5 . . . L4011E5: ebp[-0x3] |= 0x8;
You can experiment with other cases, and most of them can be translated this way. Here's some example:
0x00401383: 80 FB 49 cmp bl, 0x49 0x00401386: 74 2E jz L4013B6 L401383: flags = bl - 0x49; L401386: if ( flags == 0 ) goto L4013B6; 0x00401357: E8 B6 06 00 00 call L401A12 0x0040135C: . . . L401357: esp -= 4; // make space for return location *esp = L40135C; // save return location goto L401A12; // go to ("call") the function L40135C: . . .
Now we have a slight problem. C does not allow computed returns, so how are we going to translate RET and expect to go back to L40135C? One way to do it is by recording the addresses of all instructions that follow a call instruction, and use a switch() statement to go to that label:
// Translating RET _dest = *esp; esp += 4; switch(_dest) { case 0x40135C: goto L40135C; . . . }
This approach is very simple and shows the basics of decompilation: that you don't need to recreate the original program's source; you only need to recreate the original program's behavior.
However, this approach has several shortcomings, as you may imagine. We consider them in the next chapter.
Next: Problems with the Simplest Decompiler