A debugger for a generic processor is made up of several
components, some of them completely independent of the target
processor and the target environment, others very specific.
It is useful to separate the description of the two types so that
the effort to develop or port each one can become clearer.
Independent Components
The following components can be considered independent of the
host and target environment, and can be thought of as the core of
the debugger. They won't need to be ported when re-targeting the
debugger for a new processor or a new host.
- Low-level Symbol Table
The symbol table is the foundation of any debugger. Whether the
symbol table only handles low-level assembly labels or
high-level C++ constructs, the rest of the debugger will rely on
the information provided by it to correctly locate entities in
memory.
Clearly, support for low-level assembly debugging only requires
the following from a symbol table:
- Address to name translation
- Name to address translation
- Address range to section translation
- Memory map information
This is fairly simple to implement for linear address
spaces. It becomes more complex for segmented or overlaid
address spaces, since the information about which address
space is desired must be associated with every request.
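The four services listed above can be sketched for a single linear address space as follows. This is only an illustration: the class name, the method names and the sample addresses are invented for the example, and a segmented or overlaid target would need an address-space identifier added to every lookup.

```python
import bisect


class LowLevelSymbolTable:
    """Toy symbol table for one linear address space (illustrative only)."""

    def __init__(self, symbols, sections):
        # symbols: {name: address}; sections: [(start, end_exclusive, name)]
        self._by_name = dict(symbols)
        self._by_addr = sorted((addr, name) for name, addr in symbols.items())
        self._addrs = [addr for addr, _ in self._by_addr]
        self._sections = sorted(sections)

    def name_to_address(self, name):
        return self._by_name.get(name)

    def address_to_name(self, addr):
        """Return (label, offset) for the closest label at or below addr."""
        i = bisect.bisect_right(self._addrs, addr) - 1
        if i < 0:
            return None
        base, name = self._by_addr[i]
        return name, addr - base

    def address_to_section(self, addr):
        for start, end, name in self._sections:
            if start <= addr < end:
                return name
        return None

    def memory_map(self):
        return list(self._sections)


# Hypothetical program layout, used only to exercise the table.
table = LowLevelSymbolTable(
    symbols={"_start": 0x1000, "main": 0x1040, "counter": 0x8000},
    sections=[(0x1000, 0x2000, ".text"), (0x8000, 0x8100, ".data")],
)
print(table.address_to_name(0x1044))
```

Keeping the labels sorted by address makes the address-to-name lookup a binary search, which matters because the disassembler asks for it on every displayed line.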
- High-level Symbol Table
This type of symbol table is in addition to the low-level symbol
table, and is considerably more difficult to implement. It has
to support the following:
- Address to function translation
- Address to source file+line number translation
- Source file+line number to address translation
- Location (register, SP+offset, address) to variable
translation
- Name to symbol translation (name resolution)
- Symbol to location translation
- Stack frame information for each address range
- Type system, to map symbols to their high-level
definitions. For functions, return type and list of
formal parameters; for variables, size and structure of
the memory allocated to the variable.
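As an illustration of the two line-number translations in the list above, here is a minimal sketch built on a sorted line table. The table contents and function names are assumptions made for the example; a real table would come from the object file and would also record where each range ends.

```python
import bisect

# Hypothetical line table: (start address, source file, line), sorted by address.
LINE_TABLE = [
    (0x1000, "main.c", 10),
    (0x1008, "main.c", 11),
    (0x1014, "main.c", 13),
    (0x1020, "util.c", 4),
]
ADDRS = [row[0] for row in LINE_TABLE]


def address_to_line(addr):
    """Return (file, line) for the row that starts at or below addr."""
    i = bisect.bisect_right(ADDRS, addr) - 1
    if i < 0:
        return None
    _, source_file, line = LINE_TABLE[i]
    return source_file, line


def line_to_addresses(source_file, line):
    """Return every address at which the given source line starts."""
    return [a for a, f, n in LINE_TABLE if f == source_file and n == line]


print(address_to_line(0x1010))
print(line_to_addresses("main.c", 13))
```

The reverse lookup returns a list because an optimizing compiler may emit code for the same source line at several addresses; a line with no code (line 12 here) yields an empty list.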
Most of this information comes from the object file loader, and
there is almost a one-to-one mapping between a full-featured
file format such as DWARF 3 and the in-memory representation of
the symbolic information.
The precision with which the high-level symbol table describes the
language entities will directly affect the end-user debugging
experience.
Since high-level symbolic information can take up a lot of
space, some mechanism to only partially load the symbolic
information in memory should be implemented if large files are
to be debugged.
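One possible partial-loading scheme, sketched under the assumption that the file provides a small index of address ranges per compilation unit: the index is read up front, and the symbols of a unit are decoded only the first time an address inside it is requested. The class and the loader callback are hypothetical.

```python
class LazySymbolTable:
    """Decodes the symbols of a compilation unit only when first needed."""

    def __init__(self, unit_index, load_unit):
        # unit_index: [(start, end_exclusive, unit_name)], cheap to read.
        # load_unit: callable(unit_name) -> {symbol: address}, expensive.
        self._index = unit_index
        self._load_unit = load_unit
        self._loaded = {}

    def symbols_for_address(self, addr):
        for start, end, unit in self._index:
            if start <= addr < end:
                if unit not in self._loaded:
                    self._loaded[unit] = self._load_unit(unit)
                return self._loaded[unit]
        return {}

    def loaded_units(self):
        return sorted(self._loaded)


calls = []


def fake_loader(unit):
    calls.append(unit)  # stands in for decoding one unit's debug information
    return {"main": 0x1000} if unit == "main.c" else {"helper": 0x2000}


lazy = LazySymbolTable(
    [(0x1000, 0x2000, "main.c"), (0x2000, 0x3000, "util.c")], fake_loader
)
lazy.symbols_for_address(0x1004)
lazy.symbols_for_address(0x1008)  # second lookup reuses the cached unit
print(calls)
```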
An extension mechanism should be considered to allow the main
symbol table to be extended to describe special types such as
the __vector and __packed types found in DSP processors.
- Object File Format Reader
The Object File Format Reader (or just Reader for brevity) is
the code that reads the binary file produced by the compiler or
the linker and fills in the symbol tables with the decoded
information.
A low-level symbol table will still require the creation of the
name-to-address table and also the creation of the section table
to support memory mapping information.
- Expression Evaluator
The expression evaluator is one of the main consumers of the
symbolic information stored in the Symbol Table.
A low-level-only expression evaluator should still be able to
perform arithmetic and logical operations on addresses and/or
numeric values. This allows the user to compute hexadecimal or
binary values while looking at memory.
Standard syntax directed parsing techniques can be used to
construct the expression tree. The evaluation is a simple
traversal of the tree, evaluating each node to return a final
numeric value.
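A minimal sketch of such a low-level evaluator, assuming a C-like operator precedence and integer-only operands: a recursive-descent parser builds a tree of tuples and a second function walks it. The function names are illustrative, and division uses Python's floor semantics, which differ from C for negative operands.

```python
import operator
import re

TOKEN = re.compile(r"\s*(0[xX][0-9a-fA-F]+|0[bB][01]+|\d+|<<|>>|[-+*/&|^~()])")

# Binary operators grouped by precedence, lowest first (C-like ordering).
LEVELS = [("|",), ("^",), ("&",), ("<<", ">>"), ("+", "-"), ("*", "/")]
OPS = {"|": operator.or_, "^": operator.xor, "&": operator.and_,
       "<<": operator.lshift, ">>": operator.rshift,
       "+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.floordiv}


def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected input: {text[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def to_int(token):
    if not token[0].isdigit():
        raise SyntaxError(f"expected a number, got {token!r}")
    low = token.lower()
    base = 16 if low.startswith("0x") else 2 if low.startswith("0b") else 10
    return int(low, base)


def parse(tokens):
    """Return a tree of tuples: ("num", n), ("neg", x), ("not", x), (op, l, r)."""
    def binary(level):
        if level == len(LEVELS):
            return unary()
        node = binary(level + 1)
        while tokens and tokens[0] in LEVELS[level]:
            op = tokens.pop(0)
            node = (op, node, binary(level + 1))  # left-associative
        return node

    def unary():
        if not tokens:
            raise SyntaxError("unexpected end of expression")
        token = tokens.pop(0)
        if token == "-":
            return ("neg", unary())
        if token == "~":
            return ("not", unary())
        if token == "(":
            node = binary(0)
            if not tokens or tokens.pop(0) != ")":
                raise SyntaxError("missing ')'")
            return node
        return ("num", to_int(token))

    tree = binary(0)
    if tokens:
        raise SyntaxError(f"unexpected token {tokens[0]!r}")
    return tree


def evaluate(node):
    kind = node[0]
    if kind == "num":
        return node[1]
    if kind == "neg":
        return -evaluate(node[1])
    if kind == "not":
        return ~evaluate(node[1])
    return OPS[kind](evaluate(node[1]), evaluate(node[2]))


def calc(text):
    return evaluate(parse(tokenize(text)))


print(hex(calc("0x1000 + 4 * 2")))
```

A symbolic evaluator could keep the same tree shape and add leaf kinds for names, with the evaluation of those leaves going through the symbol table and the target.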
A high-level expression evaluator must cope with all the
high-level language constructs. While the same operations used
for low-level evaluation can be re-used, additional operations
must be implemented to allow type-based memory access. Moreover,
while low-level expression evaluation can be limited to compute
constant values only, symbolic evaluation requires access to the
target to fetch live values contained in registers or memory.
Each node therefore carries the type of its operation (e.g. a ->
operator requires a pointer as its left-hand operand and a member
name as its right-hand operand). Leaf nodes also carry the name,
scope, location and type of the object.
An extension mechanism (similar to the one implemented for the
symbol table) should be implemented to take into account
non-standard types, especially if arithmetic or logical
operations have to be performed on such types. Most debuggers
only allow printing and assigning constant values to such types.
High-level expression evaluators should also allow calling
functions on the target. This requires some interaction with the
execution engine. It also requires knowledge of the ABI so that
actual parameters can be passed in the correct location to the
called target function.
Advanced expression evaluators allow interpretation of some form
of high-level language, typically as close as possible to the
language the end-user is using. A C interpreter integrated with
the expression evaluator allows easy customization of both the
debugger and of the execution; for example it allows patching
the target code without the need to recompile and re-download
the target program.
- Execution Engine
The execution engine is responsible for the controlled execution
of the target program. It primarily deals with the target
program counter, and makes the decisions on how to perform
stepping of the target program. It also decides how to react to
breakpoints and exceptions.
Low-level assembly execution can still allow for complex
execution patterns. Very useful execution operations are:
- step one assembly instruction following function
calls
- step one assembly instruction over function calls
- step any number of assembly instructions
- step until a branch or call is encountered
- step until a return instruction is encountered
- setting, clearing and hitting of breakpoints
- all of the above, conditional on the value of some
expression
The only complexity here is in the execution of conditional
stepping. This involves calling the expression evaluator,
which may ask for memory or registers from the target.
Still, the code can be easily implemented.
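The low-level operations above can be illustrated on a toy target. Everything in this sketch is hypothetical: the instruction tuples, the `ToyTarget` class and the use of a call-depth counter for stepping over calls (a real engine would more likely plant a temporary breakpoint at the return address). The condition is passed as a callable standing in for a call to the expression evaluator.

```python
class ToyTarget:
    """Hypothetical target: pc indexes a list of (opcode, *operands) tuples."""

    def __init__(self, program):
        self.program = program
        self.pc = 0
        self.regs = {"r0": 0}
        self.calls = []       # pending return addresses (stands in for the stack)
        self.halted = False

    def step_into(self):
        """Execute exactly one instruction, following calls."""
        op, *args = self.program[self.pc]
        next_pc = self.pc + 1
        if op == "add":
            self.regs[args[0]] += args[1]
        elif op == "call":
            self.calls.append(next_pc)
            next_pc = args[0]
        elif op == "ret":
            next_pc = self.calls.pop()
        elif op == "halt":
            self.halted, next_pc = True, self.pc
        self.pc = next_pc


def step_over(target):
    """Execute one instruction; if it is a call, run the callee to completion."""
    depth = len(target.calls)
    target.step_into()
    while len(target.calls) > depth and not target.halted:
        target.step_into()


def step_until(target, condition, breakpoints=(), limit=10_000):
    """Single-step until condition(target) holds or a breakpoint is hit."""
    for _ in range(limit):
        if target.halted:
            return "halted"
        target.step_into()
        if target.pc in breakpoints:
            return "breakpoint"
        if condition(target):
            return "condition"
    return "limit"


PROGRAM = [
    ("add", "r0", 1),    # 0
    ("call", 4),         # 1
    ("add", "r0", 10),   # 2
    ("halt",),           # 3
    ("add", "r0", 100),  # 4: start of the called function
    ("ret",),            # 5
]

target = ToyTarget(PROGRAM)
target.step_into()
step_over(target)
print(target.pc, target.regs["r0"])
```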
High-level execution is much more complex. It uses
information stored in the symbol table to map program
counter values to line numbers and vice versa, and it must
use line number information to decide when it is
appropriate to stop execution after a high-level step
command.
A high-level execution engine is typically implemented as a
state machine, which implements the following initial states
(in addition to the states required for low-level assembly
execution):
- step one source statement following function
calls
- step one source statement over function calls
- step until a specific statement is reached
- step until the current function returns
- setting, clearing and hitting of breakpoints
- all of the above, conditional on the value of
some expression
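A sketch of the first two operations as a small state machine, under simplifying assumptions: the target is represented by a `single_step` callable that returns the program counter and call depth after each instruction, and `pc_to_line` only contains the addresses at which a source statement begins. All names and the sample trace are invented for the example.

```python
from enum import Enum, auto


class State(Enum):
    STEPPING = auto()    # still inside the statement the step started on
    IN_CALLEE = auto()   # stepping over: waiting for a call to return
    STOPPED = auto()     # a new statement has been reached


def step_statement(single_step, start, pc_to_line, over=False):
    """Single-step until the start of a different source statement.

    single_step() executes one instruction and returns (pc, call_depth).
    start is the (pc, call_depth) pair at the time of the step command.
    """
    start_pc, start_depth = start
    start_line = pc_to_line.get(start_pc)
    state, pc = State.STEPPING, start_pc
    while state is not State.STOPPED:
        pc, depth = single_step()
        if over and depth > start_depth:
            state = State.IN_CALLEE      # ignore line changes inside callees
            continue
        state = State.STEPPING
        line = pc_to_line.get(pc)
        if line is not None and line != start_line:
            state = State.STOPPED
    return pc


# Hypothetical statement starts: lines 10-12 in the caller, 30-31 in a callee.
PC_TO_LINE = {0: 10, 2: 11, 5: 12, 20: 30, 22: 31}

# States seen after each single step, starting from pc=2 (line 11, which calls).
TRACE = [(3, 0), (20, 1), (21, 1), (22, 1), (23, 1), (4, 0), (5, 0)]

print(step_statement(iter(TRACE).__next__, (2, 0), PC_TO_LINE))
print(step_statement(iter(TRACE).__next__, (2, 0), PC_TO_LINE, over=True))
```

Stepping into the call stops at the first statement of the callee, while stepping over it keeps running until the caller reaches line 12. A loop that jumps back to the start of the same line is not detected by this sketch.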
- Command processor
The command processor should be the only interface the debugger
provides to the user.
It is very convenient to implement all features of the debugger
as string commands. This is because a user can operate on the
debugger from a restricted access terminal (text-only, remote
terminal), and it makes it very easy to add scripting
capabilities to the debugger. Scripting allows repetitive tasks
such as resetting and initializing of the target to be isolated
and automatically executed.
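A minimal sketch of a string-based command processor with scripting, assuming a simple "name followed by arguments" syntax and `#` comments. The command names, the `state` dictionary and the script are hypothetical, and error handling for wrong argument counts is left out.

```python
import shlex


class CommandProcessor:
    """Dispatches text commands to registered handlers."""

    def __init__(self):
        self._commands = {}
        self.output = []

    def register(self, name, handler):
        self._commands[name] = handler

    def execute(self, line):
        words = shlex.split(line, comments=True)  # '#' starts a comment
        if not words:
            return None
        name, args = words[0], words[1:]
        handler = self._commands.get(name)
        if handler is None:
            result = f"unknown command: {name}"
        else:
            result = handler(*args)
        self.output.append(result)
        return result

    def run_script(self, text):
        for line in text.splitlines():
            self.execute(line)


state = {"pc": 0, "sp": 0}  # stands in for the target registers


def cmd_set(reg, value):
    state[reg] = int(value, 0)
    return f"{reg} = {state[reg]:#x}"


def cmd_print(reg):
    return f"{reg} = {state[reg]:#x}"


RESET_SCRIPT = """
# reset and initialize the target
set pc 0x1000
set sp 0x8000
print pc
"""

cp = CommandProcessor()
cp.register("set", cmd_set)
cp.register("print", cmd_print)
cp.run_script(RESET_SCRIPT)
print(cp.output)
```

Because a script is just a sequence of the same strings a user would type, the regression tests mentioned later in this section can be written as scripts with expected output.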
Fancy command processors (as in gdb) allow automatic
context-sensitive completion of arguments, allowing for quick
access to symbol names.
Also, different commands may operate on different entities. Each
command should be smart enough to automatically select the
appropriate entity. For example, a breakpoint should only
operate on source files, line numbers and code symbols, but not
on local variables and types.
Each command can be implemented incrementally as other parts of
the debugger are implemented, allowing simultaneous testing of
each feature, as well as the creation of regression tests.
- User interface
The user interface is typically built on top of the command
processor. Even if the debugger only supports a text-based
command-line interface, integration with other, more advanced
tools can be accomplished by the clever addition of extra
information provided to the other tool by the user interface.
gdb uses markers to indicate whether its output refers to an
execution entity such as a source file and line number, or if
it's the result of the evaluation of an expression. The "ddd"
debugger front-end uses these markers to direct each piece of
information to the appropriate window (code, memory, data,
registers, stack etc.). Similarly, Emacs can use the same type
of markers to show the full source in one window pane while
sending commands to the debugger in another pane.
Dependent Components
The following components depend on the target processor
architecture and/or on the ABI specified for the high-level language
of choice:
- Disassembler
Clearly low-level assembly debugging relies on showing the user
which instruction the program is about to execute.
Address + opcode + mnemonic are typically displayed when
disassembling code. More advanced disassembly provides labels
for addresses, and possibly names for locations where global or
local variables are stored.
If available, it should be possible to show mixed source code +
disassembly. This can be implemented in a target-independent
way.
- Code analyzer
The code analyzer is a reduced disassembler that only extracts
part of the semantics of each instruction, and is used when unwinding
the stack to show which function called the current function.
A code analyzer should recognize instructions that modify the
stack pointer, and possibly instructions that save registers in
memory, so that the stack can be unwound in (library or assembly)
functions that don't have stack frame symbolic information from
the compiler.
The analysis of the code analyzer output can be implemented in a
target-independent way.
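To illustrate the split, the sketch below assumes the target-specific analyzer has already reduced each instruction to a small tuple such as `("push", reg)` or `("sub_sp", n)`; the analysis that follows is then target-independent. The tuple vocabulary, the one-address-per-instruction layout and the calling convention (stack grows down, SP points at the return address on function entry, 4-byte words) are all assumptions made for the example.

```python
WORD = 4  # assumed word size in bytes


def analyze_prologue(instructions, start, pc):
    """Scan from the function start up to (not including) pc.

    Returns (frame_size, saved): how far SP has moved down since function
    entry, and for each saved register its offset from the entry SP.
    """
    frame_size = 0
    saved = {}
    for addr in range(start, pc):
        op, *args = instructions[addr]
        if op == "push":
            frame_size += WORD
            saved[args[0]] = -frame_size
        elif op == "sub_sp":
            frame_size += args[0]
        elif op == "add_sp":
            frame_size -= args[0]
        elif op in ("call", "jump", "ret"):
            break  # stop at the first control transfer
    return frame_size, saved


def caller_state(sp, memory, frame_size, saved):
    """Recover the return address, caller SP and saved registers."""
    entry_sp = sp + frame_size
    return_address = memory[entry_sp]
    restored = {reg: memory[entry_sp + off] for reg, off in saved.items()}
    return return_address, entry_sp + WORD, restored


# Hypothetical function at addresses 100..103 with no compiler frame info.
CODE = {
    100: ("push", "fp"),
    101: ("sub_sp", 8),
    102: ("other",),
    103: ("other",),
}
# Stack contents: return address at the entry SP, saved fp just below it.
MEMORY = {0x7FF0: 0x1234, 0x7FEC: 0x7FFC}

frame_size, saved = analyze_prologue(CODE, start=100, pc=103)
print(frame_size, saved)
print(caller_state(0x7FE4, MEMORY, frame_size, saved))
```

Scanning only up to the current program counter matters: if execution stopped after the `push` but before the `sub_sp`, the frame is only 4 bytes deep and the same analysis still yields the right return address.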
- Register description
This can be mostly data driven, and associates names and
expressions with target locations. Most processor registers will
just have a name and a size association. Some bit-oriented
processor registers (e.g. status/flags registers) may have rules
to extract/insert information from specific register bits, or to
describe the data stored in the register in a special way (e.g.
floating-point registers, or packed / vector registers).
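A data-driven description could look like the sketch below. The register names, sizes and the bit layout of the status register are invented for the example and do not describe any particular processor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BitField:
    name: str
    lsb: int          # position of the least significant bit
    width: int = 1

    def extract(self, value):
        return (value >> self.lsb) & ((1 << self.width) - 1)

    def insert(self, value, field_value):
        mask = ((1 << self.width) - 1) << self.lsb
        return (value & ~mask) | ((field_value << self.lsb) & mask)


@dataclass(frozen=True)
class Register:
    name: str
    size_bits: int
    fields: tuple = ()

    def describe(self, value):
        digits = self.size_bits // 4 + 2   # hex digits plus the "0x" prefix
        text = f"{self.name} = {value:#0{digits}x}"
        if self.fields:
            bits = " ".join(f"{f.name}={f.extract(value)}" for f in self.fields)
            text += f" [{bits}]"
        return text


# Hypothetical register set: one plain register and one status register.
MODE = BitField("MODE", 4, 2)
R0 = Register("r0", 32)
SR = Register("sr", 16, (BitField("Z", 0), BitField("N", 1), BitField("C", 2), MODE))

print(R0.describe(5))
print(SR.describe(0x25))
```

Porting to a new processor then means writing a new table of `Register` entries rather than new display code.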
- Target access
Obviously all debuggers must be able to communicate with the
target system, the system where the target program will be
executed. This typically involves the use of some form of
protocol, which is specified by the physical interface to the
target system. OEM vendors specify the protocol their probes
(JTAG, BDM, Monitor etc.) use.
An instruction set simulator can also be provided to simulate a
physical target.
Host systems may allow direct control of an application through
some form of API (ptrace() and /proc on Unix systems, the Debug
API on Windows systems).
On systems that allow multiple process execution, there should
be a way to specify which process the debugger wants to debug,
allowing for multi-process and/or multi-core debugging.
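One way to keep the rest of the debugger unaware of how the target is reached is to hide every back end behind the same small interface. The sketch below is an assumption about how that could be organized, not a description of any existing debugger: `TargetAccess`, its method names and the in-memory `SimulatedTarget` are all hypothetical, and a probe or ptrace back end would implement the same four methods.

```python
from abc import ABC, abstractmethod


class TargetAccess(ABC):
    """Minimal interface the core debugger would program against."""

    @abstractmethod
    def read_memory(self, address, length): ...

    @abstractmethod
    def write_memory(self, address, data): ...

    @abstractmethod
    def read_register(self, name): ...

    @abstractmethod
    def write_register(self, name, value): ...


class SimulatedTarget(TargetAccess):
    """Stand-in for a physical target: memory and registers held in RAM."""

    def __init__(self, memory_size=0x10000):
        self._memory = bytearray(memory_size)
        self._registers = {"pc": 0, "sp": 0}

    def _check(self, address, length):
        if address < 0 or address + length > len(self._memory):
            raise IndexError(f"access outside target memory: {address:#x}")

    def read_memory(self, address, length):
        self._check(address, length)
        return bytes(self._memory[address:address + length])

    def write_memory(self, address, data):
        self._check(address, len(data))
        self._memory[address:address + len(data)] = data

    def read_register(self, name):
        return self._registers[name]

    def write_register(self, name, value):
        self._registers[name] = value


def read_u32(target, address):
    """Core-debugger helper that works with any TargetAccess back end."""
    return int.from_bytes(target.read_memory(address, 4), "little")


sim = SimulatedTarget()
sim.write_memory(0x200, (0x12345678).to_bytes(4, "little"))
sim.write_register("pc", 0x1000)
print(hex(read_u32(sim, 0x200)), hex(sim.read_register("pc")))
```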
- ABI description
Support for the display of actual parameters, extended types and
execution of target functions requires the description of the
ABI used by the compiler when it created the target program.
Advanced debugging of regular applications can benefit from the
knowledge of how the target system handles threads (so as to
show the context of different threads).
Advanced debugging of C++ applications may require the
description of how exceptions (try-catch blocks) are handled.
© 2007 Giampiero Caprino, Backer Street Software