Reference Debugger Module Description

Independent Components

The following components can be considered independent of the host and target environment, and can be thought of as the core of the debugger. They won't need to be ported when re-targeting the debugger for a new processor or a new host.

Low-level Symbol Table

The symbol table is the foundation of any debugger. Whether the symbol table only handles low-level assembly labels or high-level C++ constructs, the rest of the debugger will rely on the information provided by it to correctly locate entities in memory.

Clearly, support for low-level assembly debugging only requires the following from a symbol table:

Address to name translation
Name to address translation
Address range to section translation
Memory map information

This is fairly simple to implement for linear address spaces. It becomes more complex for segmented or overlaid address spaces, since the information about which address space is desired must be associated with every request.
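As an illustration only, the four low-level services listed above can be sketched for a single linear address space. The class and method names (LowLevelSymbolTable, add_symbol, name_of and so on) are hypothetical, not taken from any existing debugger, and the memory map is reduced to a list of named sections.

```python
import bisect


class LowLevelSymbolTable:
    """Toy symbol table for one linear address space (hypothetical API)."""

    def __init__(self):
        self._by_name = {}    # name -> address
        self._addrs = []      # symbol addresses, kept sorted
        self._names = []      # symbol names, parallel to _addrs
        self._sections = []   # (start, end_exclusive, name): the memory map

    def add_symbol(self, name, address):
        self._by_name[name] = address
        i = bisect.bisect_right(self._addrs, address)
        self._addrs.insert(i, address)
        self._names.insert(i, name)

    def add_section(self, name, start, size):
        self._sections.append((start, start + size, name))

    def address_of(self, name):
        """Name to address translation; None if the name is unknown."""
        return self._by_name.get(name)

    def name_of(self, address):
        """Address to name translation: closest symbol at or below address."""
        i = bisect.bisect_right(self._addrs, address) - 1
        if i < 0:
            return None
        return self._names[i], address - self._addrs[i]

    def section_of(self, address):
        """Address to section translation; None if the address is unmapped."""
        for start, end, name in self._sections:
            if start <= address < end:
                return name
        return None


table = LowLevelSymbolTable()
table.add_section(".text", 0x1000, 0x100)
table.add_symbol("_start", 0x1000)
table.add_symbol("main", 0x1040)
```

With this table, name_of(0x1044) yields the pair ("main", 4), which is the form a disassembler would print as main+4. For a segmented or overlaid target, every method would need an extra argument identifying the address space.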

High-level Symbol Table

This type of symbol table is in addition to the low-level symbol table, and is considerably more difficult to implement. It has to support the following:

Address to function translation
Address to source file+line number translation
Source file+line number to address translation
Location (register, SP+offset, address) to variable translation
Name to symbol translation (name resolution)
Symbol to location translation
Stack frame information for each address range
Type system, to map symbols to their high-level definitions. For functions, the return type and the list of formal parameters; for variables, the size and structure of the memory allocated to the variable.

Most of this information comes from the object file loader, and there is almost a one-to-one mapping between a full-featured file format such as DWARF 3 and the in-memory representation of the symbolic information.
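The two line-number translations from the list above can be sketched with a sorted table of address ranges. The table contents, the TABLE_END constant and the function names are made up for illustration; a real table would be filled in by the object file reader.

```python
import bisect

# Hypothetical line table: (start address, source file, line), sorted by address.
# Each row covers the addresses up to the start of the next row.
LINE_TABLE = [
    (0x1000, "main.c", 10),
    (0x1008, "main.c", 11),
    (0x1014, "main.c", 13),
    (0x1020, "util.c", 4),
]
TABLE_END = 0x1030  # first address past the last row (assumed)
_STARTS = [row[0] for row in LINE_TABLE]


def address_to_line(address):
    """Address to source file + line number; None outside the table."""
    if not LINE_TABLE or address < _STARTS[0] or address >= TABLE_END:
        return None
    i = bisect.bisect_right(_STARTS, address) - 1
    _, path, line = LINE_TABLE[i]
    return path, line


def line_to_address(path, line):
    """Source file + line number to the lowest matching address, or None."""
    matches = [a for a, p, n in LINE_TABLE if p == path and n == line]
    return min(matches) if matches else None
```

Note that line 12 of main.c has no entry, so line_to_address("main.c", 12) returns None. A debugger would typically have to decide whether to reject a breakpoint on such a line or move it to the next line that generated code.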

The precision with which the high-level symbol table describes the language entities will directly affect the end-user debugging experience.

Since high-level symbolic information can take up a lot of space, some mechanism to only partially load the symbolic information in memory should be implemented if large files are to be debugged.

An extension mechanism should be considered to allow the main symbol table to be extended to describe special types such as the __vector and __packed types found in DSP processors.

Object File Format Reader

The Object File Format Reader (or just Reader for brevity) is the code that reads the binary file produced by the compiler or the linker and fills in the symbol tables with the decoded information.
A low-level symbol table will still require the creation of the name-to-address table and also the creation of the section table to support memory mapping information.

Expression Evaluator

The expression evaluator is one of the main consumers of the symbolic information stored in the Symbol Table.
A low-level-only expression evaluator should still be able to perform arithmetic and logical operations on addresses and/or numeric values. This allows the user to compute hexadecimal or binary values while looking at memory.

Standard syntax directed parsing techniques can be used to construct the expression tree. The evaluation is a simple traversal of the tree, evaluating each node to return a final numeric value.
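A minimal sketch of such a low-level evaluator follows, assuming a C-like operator precedence and integer-only arithmetic. It uses a small hand-written recursive descent parser rather than a generated one, and all names (tokenize, parse, evaluate, calc) are illustrative.

```python
import re

TOKEN = re.compile(r"\s*(0[xX][0-9a-fA-F]+|0[bB][01]+|\d+|<<|>>|[-+*/%&|^~()])")

# Binary operators grouped by precedence, lowest first (C-like ordering).
LEVELS = [
    {"|": lambda a, b: a | b},
    {"^": lambda a, b: a ^ b},
    {"&": lambda a, b: a & b},
    {"<<": lambda a, b: a << b, ">>": lambda a, b: a >> b},
    {"+": lambda a, b: a + b, "-": lambda a, b: a - b},
    {"*": lambda a, b: a * b, "/": lambda a, b: a // b, "%": lambda a, b: a % b},
]


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected input at column {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def to_int(tok):
    low = tok.lower()
    if low.startswith("0x"):
        return int(low, 16)
    if low.startswith("0b"):
        return int(low, 2)
    return int(low, 10)


def parse(tokens, level=0):
    """Build a tree of tuples: ('num', v), ('un', op, node), ('bin', op, l, r)."""
    if level == len(LEVELS):
        return parse_unary(tokens)
    node = parse(tokens, level + 1)
    while tokens and tokens[0] in LEVELS[level]:   # left-associative
        op = tokens.pop(0)
        node = ("bin", op, node, parse(tokens, level + 1))
    return node


def parse_unary(tokens):
    if not tokens:
        raise SyntaxError("unexpected end of expression")
    tok = tokens.pop(0)
    if tok in ("-", "~"):
        return ("un", tok, parse_unary(tokens))
    if tok == "(":
        node = parse(tokens)
        if not tokens or tokens.pop(0) != ")":
            raise SyntaxError("missing ')'")
        return node
    if tok[0].isdigit():
        return ("num", to_int(tok))
    raise SyntaxError(f"unexpected token {tok!r}")


def evaluate(node):
    """Traverse the tree, evaluating each node to a numeric value."""
    if node[0] == "num":
        return node[1]
    if node[0] == "un":
        value = evaluate(node[2])
        return -value if node[1] == "-" else ~value
    _, op, left, right = node
    for ops in LEVELS:
        if op in ops:
            return ops[op](evaluate(left), evaluate(right))
    raise ValueError(f"unknown operator {op!r}")


def calc(text):
    tokens = tokenize(text)
    tree = parse(tokens)
    if tokens:
        raise SyntaxError(f"unexpected token {tokens[0]!r}")
    return evaluate(tree)
```

For example, calc("0x1000 + 4 * 8") evaluates to 0x1020. Adding symbol names would mean one more leaf kind whose evaluation asks the symbol table for an address.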

A high-level expression evaluator must cope with all the high-level language constructs. While the same operations used for low-level evaluation can be re-used, additional operations must be implemented to allow type-based memory access. Moreover, while low-level expression evaluation can be limited to compute constant values only, symbolic evaluation requires access to the target to fetch live values contained in registers or memory.
Each node will therefore have an associated operation type (e.g. a -> operator requires a pointer as its left-hand operand, and a member name as its right-hand operand). Leaf nodes will also carry the name, scope, location and type of the object.

An extension mechanism (similar to the one implemented for the symbol table) should be implemented to take into account non-standard types, especially if arithmetic or logical operations have to be performed on such types. Most debuggers only allow printing and assigning constant values to such types.

High-level expression evaluators should also allow calling functions on the target. This requires some interaction with the execution engine. It also requires knowledge of the ABI so that actual parameters can be passed in the correct location to the called target function.

Advanced expression evaluators allow interpretation of some form of high-level language, typically as close as possible to the language the end-user is using. A C interpreter integrated with the expression evaluator allows easy customization of both the debugger and the execution; for example, it allows patching the target code without the need to recompile and re-download the target program.

Execution Engine

The execution engine is responsible for the controlled execution of the target program. It primarily deals with the target program counter, and makes the decisions on how to perform stepping of the target program. It also decides how to react to breakpoints and exceptions.

Low-level assembly execution can still allow for complex execution patterns. Very useful execution operations are:

step one assembly instruction following function calls
step one assembly instruction over function calls
step any number of assembly instructions
step until a branch or call is encountered
step until a return instruction is encountered
setting, clearing and hitting of breakpoints
all of the above, conditioned to the value of some expression

The only complexity here is in the execution of conditional stepping. This involves calling the expression evaluator, which may ask for memory or registers from the target. Still, the code can be easily implemented.
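The low-level operations above can be sketched against a toy simulated target. Everything here is an assumption made for illustration: the four-opcode instruction set, the ToyTarget class, and the use of a Python callable in place of a call into the expression evaluator.

```python
class ToyTarget:
    """Hypothetical target: a list of (opcode, operand) pairs, one accumulator."""

    def __init__(self, program):
        self.program = program
        self.pc = 0
        self.acc = 0
        self.stack = []   # return addresses

    def step_instruction(self):
        op, arg = self.program[self.pc]
        if op == "halt":
            return                      # stay on the halt instruction
        self.pc += 1
        if op == "add":
            self.acc += arg
        elif op == "call":
            self.stack.append(self.pc)  # push the return address
            self.pc = arg
        elif op == "ret":
            self.pc = self.stack.pop()


def step_into(target):
    """Step one instruction, following function calls."""
    target.step_instruction()


def step_over(target, limit=10_000):
    """Step one instruction, running any called function to completion.

    The toy target exposes its call depth directly; a real engine would
    typically plant a temporary breakpoint at the return address instead.
    """
    depth = len(target.stack)
    target.step_instruction()
    for _ in range(limit):
        if len(target.stack) <= depth:
            break
        target.step_instruction()


def step_until(target, condition, limit=10_000):
    """Conditional stepping: single-step until condition(target) holds."""
    for _ in range(limit):
        if condition(target):
            return True
        target.step_instruction()
    return False


def run(target, breakpoints, condition=None, limit=10_000):
    """Run until a breakpoint address is hit and its condition (if any) holds."""
    for _ in range(limit):
        target.step_instruction()
        if target.pc in breakpoints and (condition is None or condition(target)):
            return True
    return False


PROGRAM = [
    ("add", 1),      # 0
    ("call", 4),     # 1
    ("add", 10),     # 2
    ("halt", None),  # 3
    ("add", 100),    # 4: called function
    ("ret", None),   # 5
]
```

The limit arguments only keep the sketch from looping forever on the halt instruction; they are not part of the technique.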

High-level execution is much more complex. It uses information stored in the symbol table to map program counter values to line numbers and vice versa, and it must be able to use line number information to know when it is appropriate to stop execution after a high-level step command.

A high-level execution engine is typically implemented as a state machine, which implements the following initial states (in addition to the states required for low-level assembly execution):

step one source statement following function calls
step one source statement over function calls
step until a specific statement is reached
step until the current function returns
setting, clearing and hitting of breakpoints
all of the above, conditioned to the value of some expression
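The first, second and fourth operations in this list reduce to repeated instruction steps checked against line number information. The sketch below is written as plain loops rather than as a state machine, and rests on invented data: the PROGRAM and LINE_OF tables, the Target class and the function names are all hypothetical.

```python
# Hypothetical program: one opcode per address; address 5 starts a called function.
PROGRAM = [
    ("nop",), ("call", 5), ("nop",), ("nop",), ("halt",),
    ("nop",), ("nop",), ("ret",),
]
# Hypothetical line table: address -> source line.
LINE_OF = {0: 10, 1: 10, 2: 11, 3: 11, 4: 12, 5: 20, 6: 20, 7: 21}


class Target:
    def __init__(self):
        self.pc = 0
        self.stack = []   # return addresses

    def step(self):
        op = PROGRAM[self.pc]
        if op[0] == "call":
            self.stack.append(self.pc + 1)
            self.pc = op[1]
        elif op[0] == "ret":
            self.pc = self.stack.pop()
        elif op[0] != "halt":
            self.pc += 1


def step_statement(target, over_calls, limit=1000):
    """Step until the source line changes; return the new line, or None."""
    start_line = LINE_OF[target.pc]
    depth = len(target.stack)
    for _ in range(limit):
        target.step()
        if over_calls and len(target.stack) > depth:
            continue   # still inside a called function: keep going
        if LINE_OF[target.pc] != start_line:
            return LINE_OF[target.pc]
    return None


def step_out(target, limit=1000):
    """Step until the current function returns; return the caller's line."""
    depth = len(target.stack)
    for _ in range(limit):
        target.step()
        if len(target.stack) < depth:
            return LINE_OF[target.pc]
    return None
```

Starting at line 10, stepping into calls stops at line 20 inside the called function, while stepping over calls stops at line 11 in the caller.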

Command processor

The command processor should be the only interface the debugger provides to the user.
It is very convenient to implement all features of the debugger as string commands. This is because a user can operate on the debugger from a restricted access terminal (text-only, remote terminal), and it makes it very easy to add scripting capabilities to the debugger. Scripting allows repetitive tasks such as resetting and initializing of the target to be isolated and automatically executed.

Fancy command processors (as in gdb) allow automatic context-sensitive completion of arguments, allowing for quick access to symbol names.
Also, different commands may operate on different entities. Each command should be smart enough to automatically select the appropriate entity. For example, a breakpoint should only operate on source files, line numbers and code symbols, but not on local variables and types.

Each command can be implemented incrementally as other parts of the debugger are implemented, allowing simultaneous testing of each feature, as well as the creation of regression tests.
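A string-based command processor with incremental registration, prefix completion and scripting might look like the following sketch. The CommandProcessor class, the make_debugger helper and the two sample commands are hypothetical; only shlex from the Python standard library is real.

```python
import shlex


class CommandProcessor:
    """Every debugger feature is a named string command (hypothetical API)."""

    def __init__(self):
        self._commands = {}   # command name -> handler(list of str) -> str

    def register(self, name, handler):
        self._commands[name] = handler

    def complete(self, prefix):
        """Return all command names starting with prefix, sorted."""
        return sorted(n for n in self._commands if n.startswith(prefix))

    def execute(self, line):
        words = shlex.split(line, comments=True)   # '#' starts a comment
        if not words:
            return ""
        name = words[0]
        matches = [name] if name in self._commands else self.complete(name)
        if len(matches) != 1:
            return f"error: unknown or ambiguous command '{name}'"
        return self._commands[matches[0]](words[1:])

    def run_script(self, text):
        """Scripting is just feeding lines to execute()."""
        return [self.execute(line) for line in text.splitlines()]


def make_debugger():
    processor = CommandProcessor()
    breakpoints = []

    def cmd_break(args):
        if len(args) != 1:
            return "usage: break LOCATION"
        breakpoints.append(args[0])
        return f"breakpoint {len(breakpoints)} at {args[0]}"

    def cmd_info(args):
        return ", ".join(breakpoints) or "no breakpoints"

    processor.register("break", cmd_break)
    processor.register("info", cmd_info)
    return processor
```

Because every handler takes and returns strings, the same entry point serves an interactive terminal, a script file and a regression test that compares the returned text against an expected transcript.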

User interface

The user interface is typically built on top of the command processor. Even if the debugger only supports a text-based command-line interface, integration with other, more advanced tools can be accomplished by the clever addition of extra information provided to the other tool by the user interface.

gdb uses markers to indicate whether its output refers to an execution entity such as a source file and line number, or if it's the result of the evaluation of an expression. The "ddd" debugger front-end uses these markers to direct each piece of information to the appropriate window (code, memory, data, registers, stack etc.). Similarly, Emacs can use the same type of markers to show the full source in one window pane while sending commands to the debugger in another pane.

Dependent Components

The following components depend on the target processor architecture and/or on the ABI specified for the high-level language of choice:

Disassembler

Clearly, low-level assembly debugging relies on showing the user which instruction the program is about to execute.
Address + opcode + mnemonic are typically displayed when disassembling code. More advanced disassembly provides labels for addresses, and possibly names for locations where global or local variables are stored.
If available, it should be possible to show mixed source code + disassembly. This can be implemented in a target-independent way.

Code analyzer

The code analyzer is a reduced disassembler that only provides some of the semantics of each instruction, and is used when unwinding the stack to show which function called the current function.
A code analyzer should recognize instructions that modify the stack pointer, and possibly instructions that save registers in memory, so that the debugger can unwind the stack in (library or assembly) functions that don't have stack frame symbolic information from the compiler.

The analysis of the code analyzer output can be implemented in a target-independent way.
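The split between the target-dependent analyzer and the target-independent analysis can be sketched as follows. The instruction set is invented, and the sketch assumes 4-byte stack slots, a call instruction that pushes the return address, and straight-line code between the function entry and the current program counter; none of this comes from a real processor.

```python
def sp_delta(insn):
    """Target-dependent part: change to SP caused by one instruction."""
    op, arg = insn
    if op == "push":
        return -4
    if op == "pop":
        return 4
    if op == "sub_sp":
        return -arg
    if op == "add_sp":
        return arg
    return 0   # instruction does not touch the stack pointer


def unwind_one_frame(code, func_start, pc, sp, read_word):
    """Target-independent part: return (return address, caller's SP).

    Sums the SP changes of the instructions already executed in this
    function to recover the SP value at function entry, where the
    return address is assumed to be stored.
    """
    adjustment = sum(sp_delta(code[a]) for a in range(func_start, pc))
    entry_sp = sp - adjustment
    return read_word(entry_sp), entry_sp + 4


# Hypothetical function body, indexed by instruction number.
CODE = [
    ("push", "r4"),
    ("sub_sp", 8),
    ("mov", None),
    ("add_sp", 8),
    ("pop", "r4"),
    ("ret", None),
]
MEMORY = {0x0FFC: 0x2004}   # return address saved by the caller's call
```

Stopped at instruction 2 with SP equal to 0x0FF0, the function has moved SP down by 12 bytes, so the return address is read from 0x0FFC. Code with branches between the entry point and the program counter would need a real control-flow walk instead of a simple sum.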

Register description

This can be mostly data-driven, and associates names and expressions with target locations. Most processor registers will just have a name and a size association. Some bit-oriented processor registers (e.g. status/flags registers) may have rules to extract/insert information from specific register bits, or to describe the data stored in the register in a special way (e.g. floating-point registers, or packed / vector registers).
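A data-driven description might be as small as the following sketch. The BitField and RegisterDesc classes and the 8-bit status register layout are hypothetical and do not describe any particular processor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BitField:
    name: str
    lsb: int     # position of the least significant bit
    width: int   # number of bits

    def extract(self, value):
        return (value >> self.lsb) & ((1 << self.width) - 1)

    def insert(self, value, field_value):
        mask = ((1 << self.width) - 1) << self.lsb
        return (value & ~mask) | ((field_value << self.lsb) & mask)


@dataclass(frozen=True)
class RegisterDesc:
    name: str
    bits: int
    fields: tuple = ()

    def describe(self, value):
        text = f"{self.name}=0x{value:0{self.bits // 4}X}"
        flags = " ".join(f"{f.name}={f.extract(value)}" for f in self.fields)
        return f"{text} [{flags}]" if flags else text


# Hypothetical register set: a plain 32-bit register and a bit-oriented one.
R0 = RegisterDesc("R0", 32)
STATUS = RegisterDesc(
    "SR", 8,
    (BitField("C", 0, 1), BitField("Z", 1, 1), BitField("IPL", 4, 3)),
)
```

Porting to a new processor then means supplying new tables rather than new code; only registers with unusual display rules, such as floating-point or vector registers, would need extra logic.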

Target access

Obviously, all debuggers must be able to communicate with the target system, the system where the target program will be executed. This typically involves the use of some form of protocol, which is specified by the physical interface to the target system. OEM vendors specify the protocol their probes (JTAG, BDM, Monitor etc.) use.
An instruction set simulator can also be provided to simulate a physical target.
Host systems may allow direct control of an application through some form of API (ptrace() and /proc on Unix systems, the Debug API on Windows systems).
On systems that allow multiple process execution, there should be a way to specify which process the debugger wants to debug, allowing for multi-process and/or multi-core debugging.
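One way to keep the rest of the debugger unaware of the transport is a small abstract interface with one implementation per probe, simulator or host API. The sketch below shows such an interface with a trivial memory-only simulator behind it; the Target and SimulatedTarget names, the four methods and the little-endian helper are assumptions, not an existing API.

```python
from abc import ABC, abstractmethod


class Target(ABC):
    """Operations the debugger core needs, whatever the transport is."""

    @abstractmethod
    def read_memory(self, address, length): ...

    @abstractmethod
    def write_memory(self, address, data): ...

    @abstractmethod
    def read_register(self, name): ...

    @abstractmethod
    def write_register(self, name, value): ...


class SimulatedTarget(Target):
    """Stand-in for a physical target: flat memory and two registers."""

    def __init__(self, memory_size=0x1000):
        self._memory = bytearray(memory_size)
        self._registers = {"pc": 0, "sp": memory_size}

    def _check(self, address, length):
        if address < 0 or length < 0 or address + length > len(self._memory):
            raise ValueError(f"access outside memory: 0x{address:X}+{length}")

    def read_memory(self, address, length):
        self._check(address, length)
        return bytes(self._memory[address:address + length])

    def write_memory(self, address, data):
        self._check(address, len(data))
        self._memory[address:address + len(data)] = data

    def read_register(self, name):
        return self._registers[name]

    def write_register(self, name, value):
        self._registers[name] = value


def read_u32(target, address):
    """Helper built only on the interface; assumes a little-endian target."""
    return int.from_bytes(target.read_memory(address, 4), "little")
```

A JTAG probe driver or a ptrace-based backend would implement the same four methods, and a process or core identifier could be passed to the constructor to select what is being debugged.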

ABI description

Support for the display of actual parameters, extended types and execution of target functions requires the description of the ABI used by the compiler when it created the target program.
Advanced debugging of regular applications can benefit from the knowledge of how the target system handles threads (so as to show the context of different threads).
Advanced debugging of C++ applications may require the description of how exceptions (try-catch blocks) are handled.