ARM Symbolic Debug Table Format
===============================


Acknowledgement
---------------

This design is based on work originally done for Acorn Computers Ltd. by 
Topexpress Ltd.


Introduction
------------

This document specifies the format of symbolic debugging data generated by ARM 
compilers, which is used by the ARM Symbolic Debugger (<armsd>) to support high 
level language oriented, interactive debugging.

For each separate compilation unit (called a <section>) the compiler produces 
debugging data, and a special <area> in the object code (see "<ARM Object 
Format>" for an explanation of areas and their attributes). Debugging data are 
position independent, containing only relative references to other debugging 
data within the same section, and relocatable references to other 
compiler-generated areas.

Debugging data areas are combined by the ARM linker into a single contiguous 
section of a program image. For details of the ARM linker's capabilities see 
"<The ARM Linker (armlink)>" in the User Manual. For a description of the 
linker's principal output format see "<ARM Image Format>".

Since the debugging section is position-independent, the debugger can move it 
to a safe location before the image starts executing. If the image is not 
executed under debugger control, the debugging data are simply overwritten.

The format of debugging data allows for a variable amount of detail. This 
potentially allows the user to trade off among memory used, disc space used, 
execution time, and debugging detail.

Assembly-language level debugging is also supported, though in this case the 
debugging tables are generated by the linker. If required, the assembler can 
generate debugging table entries relating code addresses to source lines. 
Low-level debugging tables appear in an extra section item, as if generated by 
an independent compilation (see "<Debugging Data Items in Detail>"). Low-level 
and high-level debugging are orthogonal facilities, though 
<armsd> allows the user to move smoothly between levels if both sets of 
debugging data are present in an image.


Terminology
-----------

A <byte> is 8 bits, usually considered unsigned. A <word> is 32 bits (4 bytes), 
often considered signed. A <half word>, also called a <short>, is 16 bits (2 
bytes). Half words are unused, except in the long form of <LineInfo> items.


Order of Debugging Data
-----------------------

A debug data area consists of a series of <items>. The arrangement of these 
items mimics the structure of the high-level language program itself.

For each debug area, the first item is a <section> item, giving global 
information about the compilation, including a code identifying the language, 
and flags indicating the amount of detail included in the debugging tables.

Each datum, function, procedure, etc., definition in the source program has a 
corresponding debug data item; these items appear in an order corresponding to 
the order of definitions in the source. This means that any nested structure in 
the source program is preserved in the debugging data, and the debugger can use 
this structure to make deductions about the scope of various source-level 
objects. Of course, for procedure definitions, two debug items are needed: a 
<procedure> item to mark the definition itself, and an <endproc> item to mark 
the end of the procedure's body and the end of any nested definitions. If 
procedure definitions are nested then the <procedure> <endproc> brackets are 
nested too. Variable and type definitions made at the outermost level, of 
course, appear outside of all procedure/endproc items.

Information about the relationship between the executable code and source files 
is collected together and appears as a <fileinfo> item, which is always the 
final item in a debugging area. Because of the C language's #include facility, 
the executable code produced from an outer-level source file may be separated 
into disjoint pieces interspersed with that produced from the included files. 
Therefore, source files are considered to be collections of <fragments>, each 
corresponding to a contiguous area of executable code, and the <fileinfo> item 
is a list with an entry for each file, each in turn containing a list with an 
entry for each fragment. The fileinfo field in the <section> item addresses the 
<fileinfo> item itself. In each <procedure> item there is a <fileentry> field, 
which refers to the file-list entry for the source file containing the 
procedure's start; there is a separate one in the <endproc> item because it may 
possibly not be in the same source file.


Endian-ness and the Encoding of Debugging Data
----------------------------------------------

The ARM can be configured to use either a little-endian memory system (the 
least significant byte of each 4-byte word has the lowest address), or a 
big-endian memory system (the most significant byte of each 4-byte word has the 
lowest address).

In general, the code to be generated varies according to the byte-sex (or 
endian-ness) of the target, and the linker has insufficient information to 
change the byte sex of an object file. Therefore, object files are encoded 
using the byte order of the intended target, independently of the byte order of 
the host system on which the compiler or assembler runs. The ARM linker accepts 
inputs having either byte order, but rejects mixed sex inputs, and generates 
its output using the same byte order.

This means that producers of debugging tables must be prepared to generate them 
in either byte order, as required. In turn, this requires definitions to be 
very clear about when a 4-byte word is being used (which will require reversal 
on output or input when cross-sex compiling or debugging), and when a sequence 
of bytes is being used (which requires no special treatment provided it is 
written and read as a sequence of bytes in address order).
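
The distinction between words and byte sequences can be illustrated with a 
minimal sketch. It assumes Python's standard <struct> module; the helper name 
read_word is hypothetical and not part of this format.

```python
import struct

def read_word(data, offset, big_endian):
    """Read one 32-bit word at `offset`, honouring the target byte order."""
    fmt = ">I" if big_endian else "<I"
    return struct.unpack_from(fmt, data, offset)[0]

# The same four bytes give different word values in each byte order,
# whereas a byte sequence is simply read in address order.
raw = bytes([0x12, 0x34, 0x56, 0x78])
little = read_word(raw, 0, big_endian=False)
big = read_word(raw, 0, big_endian=True)
first_byte = raw[0]
```

A reader that is told the byte order of the image once, and passes it to every 
word access, needs no other special handling for cross-sex debugging.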


Representation of Data Types
----------------------------

Several of the debugging data items (e.g. procedure and variable) have a <type> 
word field to identify their data type. This field contains, in the most 
significant 24 bits, a code to identify a base type, and in the least 
significant 8 bits, a pointer count: 0 to denote the type itself; 1 to denote a 
pointer to the type; 2 to denote a pointer to a pointer to...; etc.

For simple types the code is a non-negative integer as follows (all codes are 
decimal):

            void              0

    signed integers
            single byte       10
            half-word         11
            word              12

    unsigned integers
            single byte       20
            half-word         21
            word              22

    floating point
            float             30
            double            31
            long double       32

    complex
            single complex    41
            double complex    42

    functions
            function          100

For compound types (arrays, structures, etc.) there is a special kind of debug 
data item (array, struct, etc.) to give details such as array bounds and field 
types. The type code for compound types is negative, the negation of the (byte) 
offset of the debug item from the start of the debugging area.
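
The following sketch shows one way to split a type word into its base type 
code and pointer count. It assumes the 24-bit code is held in two's-complement 
form (the text above says only that compound codes are negative), and the 
function names are hypothetical.

```python
def decode_type_word(word):
    """Split a 32-bit type word into (base_code, pointer_count)."""
    word &= 0xFFFFFFFF
    pointer_count = word & 0xFF        # least significant 8 bits
    base_code = word >> 8              # most significant 24 bits
    if base_code & 0x800000:           # assumed: two's-complement sign bit
        base_code -= 0x1000000
    return base_code, pointer_count

def describe_type_word(word):
    """Render a type word as text, for illustration only."""
    base, pointers = decode_type_word(word)
    if base < 0:
        kind = "compound item at offset %d" % -base
    else:
        kind = "simple type code %d" % base
    return "%s, pointer depth %d" % (kind, pointers)

# A pointer to a signed word (simple code 12, one level of indirection).
pointer_to_word = (12 << 8) | 1
# A compound type whose debug item lies 64 bytes into the debugging area.
compound_at_64 = 0xFFFFC000
```

Here decode_type_word(pointer_to_word) yields (12, 1) and 
decode_type_word(compound_at_64) yields (-64, 0).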

If a type has been given a name in a source program, it will give rise to a 
<type> debugging data item which contains the name and a type word as defined 
above. If necessary, there will also be a debugging data item, such as an <array> 
or <struct> item, to define the type itself. In that case, the type word will 
refer to this item.

Set types in Pascal are not treated in detail: the only information recorded 
for them is the total size occupied by the object in bytes. Neither are Pascal 
<file> variables supported by the debugger, since their behaviour under 
debugger control is unlikely to be helpful to the user.

Fortran character types are supported by special kinds of debugging data item, 
the format of which is specific to each Fortran compiler.


Representation of Source File Positions
---------------------------------------

Several of the debugging data items have a <sourcepos> field to identify a 
position in the source file. This field contains a line number and character 
position within the line packed into a single word. The most significant 10 
bits encode the character offset (0-based) from the start of the line and the 
least-significant 22 bits give the line number.
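
As a minimal sketch of this packing (the function names are hypothetical), a 
<sourcepos> word can be built and taken apart as follows:

```python
def pack_sourcepos(line, column_offset):
    """Pack a line number (22 bits) and 0-based column offset (10 bits)."""
    if not 0 <= line < (1 << 22):
        raise ValueError("line number does not fit in 22 bits")
    if not 0 <= column_offset < (1 << 10):
        raise ValueError("character offset does not fit in 10 bits")
    return (column_offset << 22) | line

def unpack_sourcepos(word):
    """Return (line, column_offset) from a sourcepos word."""
    return word & 0x3FFFFF, (word >> 22) & 0x3FF

# Line 100, fifth character on the line (0-based offset 4).
word = pack_sourcepos(100, 4)
```

Positions beyond line 4194303 or character offset 1023 cannot be represented, 
so the sketch rejects them rather than silently truncating.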


Debugging Data Items in Detail
------------------------------


The Code and Length Field
.........................

The first word of each debugging data item contains the byte length of the item 
(encoded in the most significant 16 bits), and a code identifying the kind of 
item (in the least significant 16 bits). The defined codes are:

    1       section

    2       procedure/function definition

    3       endproc

    4       variable

    5       type

    6       struct

    7       array

    8       subrange

    9       set

    10      fileinfo

    11      contiguous enumeration

    12      discontiguous enumeration

    13      procedure/function declaration

    14      begin naming scope

    15      end naming scope

The meaning of the second and subsequent words of each item is defined below.

If a debugger encounters a code it does not recognise, it should use the length 
field to skip the item entirely. This discipline allows the debugging tables to 
be extended without invalidating existing debuggers.
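
The skipping discipline can be sketched as a simple walk over one 
compiler-generated debug area. This is an illustration under stated 
assumptions, not a definitive reader: the names walk_items, make_item and 
ITEM_NAMES are hypothetical, and the sample data is synthetic.

```python
import struct

ITEM_NAMES = {
    1: "section", 2: "procedure", 3: "endproc", 4: "variable", 5: "type",
    6: "struct", 7: "array", 8: "subrange", 9: "set", 10: "fileinfo",
    11: "enum (contiguous)", 12: "enum (discontiguous)",
    13: "declaration", 14: "begin scope", 15: "end scope",
}

def walk_items(area, big_endian=False):
    """Yield (offset, kind, length) for each item in one debug area."""
    fmt = ">I" if big_endian else "<I"
    offset = 0
    while offset + 4 <= len(area):
        (first,) = struct.unpack_from(fmt, area, offset)
        length = first >> 16           # most significant 16 bits
        code = first & 0xFFFF          # least significant 16 bits
        yield offset, ITEM_NAMES.get(code, "unknown"), length
        if length == 0:
            break                      # cannot advance past a 0 length
        offset += length               # unknown codes are skipped this way

def make_item(code, length):
    """Build a synthetic little-endian item with a zero-filled body."""
    return struct.pack("<I", (length << 16) | code) + bytes(length - 4)

# Code 99 is not defined above, yet the walk continues past it.
sample_area = make_item(1, 8) + make_item(99, 12) + make_item(4, 4)
items = list(walk_items(sample_area))
```

The walk stops when it meets a length of 0, since it has no way to find the 
next item in that case.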


Text Names in Items
...................

Where items include a string field, the string is packed into successive bytes 
beginning with a length byte, and padded at the end to a word boundary with 0 
bytes. The length of a string is in the range [0..255] bytes.
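
A minimal sketch of this string encoding follows. The function names are 
hypothetical, and the use of ASCII for the characters is an assumption, as the 
character encoding is not stated here.

```python
def pack_name(text):
    """Encode a name as a length byte, the characters, and 0 padding."""
    raw = text.encode("ascii")         # assumed character encoding
    if len(raw) > 255:
        raise ValueError("name longer than 255 bytes")
    body = bytes([len(raw)]) + raw
    return body + bytes(-len(body) % 4)    # pad to a word boundary

def unpack_name(data, offset):
    """Return (name, offset of the next word-aligned field)."""
    length = data[offset]
    raw = data[offset + 1:offset + 1 + length]
    consumed = (1 + length + 3) & ~3       # round up to a multiple of 4
    return raw.decode("ascii"), offset + consumed

packed_main = pack_name("main")    # 1 + 4 bytes, padded to 8
packed_empty = pack_name("")       # a single zero word
```

Note that an empty name still occupies one whole word, because the length byte 
itself must be padded.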


Offsets in File and Addresses in Memory
.......................................

Where an item contains a field giving an offset in the debugging data area 
(usually to address another item), this means a byte offset from the start of 
the debugging data for the whole section (in other words, from the start of the 
<section> item).

When the same structure is used to map debugging data in memory, an offset 
field may be used to hold a pointer to another debug item in memory, rather 
than the offset of it in the debug area.


Section Items
.............

A section item is the first item of each section of the debugging data. After 
its code and length word it contains the fields listed below. First there are 4 
flag bytes:

    lang                a byte identifying the source language

    flags               a byte describing the level of detail

    unused

    asdversion          a byte version number of the debugging data

The following language byte codes are defined:

    LANG_NONE           0     Low-level debugging data only

    LANG_C              1     C source level debugging data

    LANG_PASCAL         2     Pascal source level debugging data

    LANG_FORTRAN        3     Fortran-77 source level debugging data

    LANG_ASM            4     ARM Assembler line number data

All other codes are reserved to ARM.

The <flags> byte uses the following mask values:

    1       debugging data contains line-number information

    2       debugging data contains information about top-level variables

    3       both of the above

The <asdversion> byte should be set to 2, the version of this definition.

The flag bytes are followed by the following word-sized fields:

    codestart           address of first instruction in this section

    datastart           address of start of static data for this section

    codesize            byte size of executable code in this section

    datasize            byte size of the static data in this section

    fileinfo            offset in the debugging area of the fileinfo item for
                        this section (0 if no fileinfo item present)

    debugsize           total byte length of debug data for this section

    name or nsyms       string or integer

<codestart> and <datastart> are addresses, relocated by the linker. The 
<fileinfo> field, nominally an offset, is also used as a pointer when this 
structure is mapped in memory. The <fileinfo> field is 0 if no source file 
information is present.

The <name> field contains the program name for Pascal and Fortran programs. For 
C programs it contains a name derived by the compiler from the root file name 
(notionally a module name). In each case, the name is similar to a variable 
name in the source language. For a low-level debugging section (language = 0), 
the field is treated as a 4 byte integer giving the number of symbols 
following.
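
The layout of a section item can be sketched as below. This is a hedged 
illustration: the function name and dictionary keys are hypothetical, the four 
flag bytes are assumed to be stored in address order in the order listed above, 
and names are assumed to be ASCII.

```python
import struct

def parse_section_item(data, big_endian=False):
    """Decode the fixed part of a section item into a dictionary."""
    order = ">" if big_endian else "<"
    (first,) = struct.unpack_from(order + "I", data, 0)
    if first & 0xFFFF != 1:
        raise ValueError("not a section item")
    # Assumed: flag bytes appear in address order, independent of byte sex.
    lang, flags, _unused, asdversion = struct.unpack_from("4B", data, 4)
    names = ("codestart", "datastart", "codesize",
             "datasize", "fileinfo", "debugsize")
    header = dict(zip(names, struct.unpack_from(order + "6I", data, 8)))
    header.update(lang=lang, asdversion=asdversion,
                  has_lines=bool(flags & 1), has_vars=bool(flags & 2))
    if lang == 0:
        # Low-level section: the final field is a symbol count.
        (header["nsyms"],) = struct.unpack_from(order + "I", data, 32)
    else:
        count = data[32]
        header["name"] = data[33:33 + count].decode("ascii")
    return header

# Synthetic little-endian section item for a C module named "demo".
sample = (struct.pack("<I", (40 << 16) | 1)
          + bytes([1, 3, 0, 2])
          + struct.pack("<6I", 0x8000, 0x9000, 0x100, 0x20, 0, 40)
          + b"\x04demo\x00\x00\x00")
section = parse_section_item(sample)
```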

For linker-generated debugging data, the fields have the following values:

    language            0

    codestart           Image$$RO$$Base

    datastart           Image$$RW$$Base

    codesize            Image$$RO$$Limit - Image$$RO$$Base

    datasize            Image$$RW$$Limit - Image$$RW$$Base

    fileinfo            0

    nsyms               number of symbols in the following debugging data

    debugsize           total size of the low-level debugging data including
                        the size of this section item

For linker-generated debugging data, the section item is followed by nsyms 
<symbol> items, each consisting of 2 words:

    sym                 flags + byte offset in string table of symbol name

    value               the value of the symbol

<sym> encodes an index into the string table in the 24 least significant bits, 
and the following flag values in the 8 most significant bits: 

    ASD_ABSSYM          0                 if the symbol is absolute

    ASD_GLOBSYM         0x01000000L       if the symbol is global

    ASD_TEXTSYM         0x02000000L       if the symbol names code

    ASD_DATASYM         0x04000000L       if the symbol names data

    ASD_ZINITSYM        0x06000000L       if the symbol names 0-initialised
                                          data

Note that the linker reduces all symbol values to absolute values, so that the 
flag values record the history, or origin, of the symbol in the image.

Immediately following the symbol table is the string table, in standard AOF 
format. It consists of:

 *  a length word

 *  the strings themselves, each terminated by a NUL (0)

The length word includes the size of the length word, so no offset into the 
string table is less than 4. The end of the string table is padded with NULs to 
the next word boundary, (so the length is a multiple of 4).
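
As a sketch of how a <sym> word and the string table fit together (the helper 
names are hypothetical, the sample table is synthetic, and ASCII names are an 
assumption):

```python
import struct

ASD_TEXTSYM = 0x02000000
ASD_DATASYM = 0x04000000

def split_sym(sym):
    """Return (flags, string table offset) from a sym word."""
    return sym & 0xFF000000, sym & 0x00FFFFFF

def symbol_name(string_table, index):
    """Look up the NUL-terminated name at byte offset `index`."""
    if index < 4:
        raise ValueError("offsets below 4 fall inside the length word")
    end = string_table.index(b"\x00", index)
    return string_table[index:end].decode("ascii")

# Length word (16, which counts itself), two names, then NUL padding.
sample_table = struct.pack("<I", 16) + b"main\x00buf\x00" + bytes(3)
flags, index = split_sym(ASD_TEXTSYM | 4)
name = symbol_name(sample_table, index)
```

Offsets are measured from the start of the table, including the length word, 
which is why the first name in the sample sits at offset 4.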


Procedure Items
...............

A procedure item appears once for each procedure or function definition in the 
source program. Any definitions within the procedure have their related 
debugging data items between the procedure item and its matching endproc item. 
After its code and length field, a procedure item contains the following 
word-sized fields:

    type          the return type if this is a function, else 0
                  (see "<Representation of Data Types>")

    args          the number of arguments

    sourcepos     the source position of the procedure's start  (see
                  "<Representation of Source File Positions>")

    startaddr     address of 1st instruction of procedure prologue

    entry         address of 1st instruction of the procedure body
                  (see note below)

    endproc       offset of the related endproc item (in file) or pointer
                  to related endproc item (in memory)

    fileentry     offset of the file list entry for the source file (in file)
                  or a pointer to it (in memory)

    name          string

The <entry> field addresses the first instruction following the procedure 
prologue. That is, the first address at which a high-level breakpoint could 
sensibly be set. The <startaddr> field addresses the start of the prologue. 
That is, the instruction at which control arrives when the procedure is called.


Label Items
...........

A label in a source program is represented by a special procedure item with no 
matching endproc, (the endproc field is 0 to denote this). Pascal and Fortran 
numerical labels are converted by their respective compilers into strings 
prefixed by "$n".

For Fortran-77, multiple entry points to the same procedure each give rise to a 
separate procedure item, all of which have the same endproc offset referring to 
the unique, matching endproc item.


Endproc Items
.............

An endproc item marks the end of the debugging data items belonging to a 
particular procedure. It also contains information relating to the procedure's 
return. After its code and length field, an endproc item contains the following 
word-sized fields:

    sourcepos     position in the source file of the procedure's end (see
                  "<Representation of Source File Positions>")

    endpoint      address of the code byte AFTER the compiled code for the
                  procedure

    fileentry     offset of the file-list entry for the procedure's end (in
                  file) or a pointer to it (in memory)

    nreturns      number of procedure return points (may be 0)

    retaddrs      array of addresses of procedure return code

If the procedure body is an infinite loop, there will be no return point, so 
nreturns will be 0. Otherwise each member of retaddrs should point to a 
suitable location at which a breakpoint may be set "at the exit of the 
procedure". When execution reaches this point, the current stack frame should 
still be for this procedure.


Variable Items
..............

A variable item contains debugging data relating to a source program variable, 
or a formal argument to a procedure (the first variable items in a procedure 
always describe its arguments). After its code and length field, a variable 
item contains the following word-sized fields:

    type                type of this variable
                        (see "<Representation of Data Types>")

    sourcepos           the source position of the variable (see
                        "<Representation of Source File Positions>")

    storageclass        a word encoding the variable's storage class

    location            see explanation below

    name                string

The following codes define the storage classes of variables: 

    1             external variables (or Fortran common)

    2             static variables private to one section

    3             automatic variables

    4             register variables

    5             Pascal 'var' arguments

    6             Fortran arguments

    7             Fortran character arguments

The meaning of the location field of a variable item depends on the storage 
class: it contains an absolute address for static and external variables 
(relocated by the linker); a stack offset (an offset from the frame pointer) 
for automatic and var-type arguments; an offset into the argument list for 
Fortran arguments; and a register number for register variables, (the 8 
floating point registers are numbered 16..23).

No account is taken of variables which ought to be addressed by +ve offsets 
from the stack-pointer rather than -ve offsets from the frame-pointer.

The sourcepos field is used by the debugger to distinguish between different 
definitions having the same name (e.g. identically named variables in disjoint 
source-level naming scopes such as nested blocks in C).


Type Items 
...........

A type item is used to describe a named type in the source language (e.g. a 
typedef in C). After its code and length field, a type item contains two 
word-sized fields:

    type          a type word (described in
                  "<Representation of Data Types>")

    name          string


Struct Items
............

A struct item is used to describe a structured data type (e.g. a struct in C or 
a record in Pascal). After its code and length field, a struct item contains 
the following word-sized fields: 

    fields              the number of fields in the structure

    size                total byte size of the structure

    fieldtable...       an array of <fields> struct field items

Each struct field item has the following word-sized fields: 

    offset              byte offset of this field within the structure

    type                a type word (described in
                        "<Representation of Data Types>")

    name                string

Union types are described by struct items in which all fields have 0 offsets.

C bit fields are not treated in full detail: a bit field is simply represented 
by an integer starting on the appropriate word boundary (so that the word 
contains the whole field).


Array Items
...........

An array item is used to describe a one-dimensional array. Multi-dimensional 
arrays are described as "arrays of arrays". Which dimension comes first is 
dependent on the source language (which is different for C and Fortran). After 
its code and length field, an array item contains the following word-sized 
fields:

    size                total byte size of the array

    flags               (see below) 

    basetype            a type word (described in
                        "<Representation of Data Types>") 

    lowerbound          constant value or location of variable     

    upperbound          constant value or location of variable

If the size field is zero, debugger operations affecting the whole array, 
rather than individual elements of it, are forbidden.

The following mask values are defined for the flags field:

    ARRAY_UNDEF_LBOUND              1           lower bound is undefined

    ARRAY_CONST_LBOUND              2           lower bound is a constant

    ARRAY_UNDEF_UBOUND              4           upper bound is undefined

    ARRAY_CONST_UBOUND              8           upper bound is a constant

    ARRAY_VAR_LBOUND                16          lower bound is a variable

    ARRAY_VAR_UBOUND                32          upper bound is a variable

A bound is described as undefined when no information about it is available.

A bound is described as constant when its value is known at compile time. In 
this case, the corresponding bound field gives its value.

If a bound is described as variable, the corresponding bound field acts as an 
offset field that identifies a variable debug item describing the location 
containing the bound. In a debug area in an 
object file, the offset field contains the offset from the start of the debug 
area to the variable item; in memory it contains a pointer to the corresponding 
variable item. Note that a variable item may be used to describe a location 
known to the compiler, which need not correspond to a source language variable.
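
The interpretation of the flags and bound fields can be sketched as follows. 
The function name is hypothetical, the bound fields are taken as already 
decoded integers, and the sample values are invented for illustration.

```python
ARRAY_UNDEF_LBOUND = 1
ARRAY_CONST_LBOUND = 2
ARRAY_UNDEF_UBOUND = 4
ARRAY_CONST_UBOUND = 8
ARRAY_VAR_LBOUND = 16
ARRAY_VAR_UBOUND = 32

def describe_bounds(flags, lowerbound, upperbound):
    """Return text descriptions of the (lower, upper) bounds."""
    def one(undef, const, var, field):
        if flags & undef:
            return "undefined"
        if flags & const:
            return "constant %d" % field
        if flags & var:
            return "variable item at offset %d" % field
        return "unspecified"
    return (one(ARRAY_UNDEF_LBOUND, ARRAY_CONST_LBOUND,
                ARRAY_VAR_LBOUND, lowerbound),
            one(ARRAY_UNDEF_UBOUND, ARRAY_CONST_UBOUND,
                ARRAY_VAR_UBOUND, upperbound))

# A fixed-size array with bounds 0..9.
fixed = describe_bounds(ARRAY_CONST_LBOUND | ARRAY_CONST_UBOUND, 0, 9)
# Lower bound 1; upper bound held in the variable item at offset 288.
adjustable = describe_bounds(ARRAY_CONST_LBOUND | ARRAY_VAR_UBOUND, 1, 288)
```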


Subrange Items
..............

A subrange item is used to describe a subrange type in Pascal. It also serves 
to describe enumerated types in C, and scalars in Pascal (in which case the 
base type is understood to be an unsigned integer of appropriate size). After 
its code and length field, a subrange item contains the following word-sized 
fields:

    sizeandtype         see below

    lb                  low bound of subrange

    hb                  high bound of subrange

The <sizeandtype> field encodes the byte size of the container for the subrange 
(1, 2 or 4) in its least significant 16 bits, and a simple type code (see 
"<Representation of Data Types>") in its most significant 16 bits. The type 
code refers to the base type of the subrange.

(For example, a subrange 256..511 of unsigned short might be held in 1 byte).
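
A small sketch of the <sizeandtype> split, using a hypothetical function name 
and the example above (base type unsigned half-word, code 21, held in 1 byte):

```python
def unpack_sizeandtype(word):
    """Return (container size in bytes, simple type code)."""
    size = word & 0xFFFF               # least significant 16 bits
    type_code = (word >> 16) & 0xFFFF  # most significant 16 bits
    return size, type_code

# Subrange of unsigned half-word (code 21) stored in a 1-byte container.
example = (21 << 16) | 1
size, type_code = unpack_sizeandtype(example)
```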


Set Items
.........

A set item is used to describe a Pascal set type. Currently, the description is 
only partial. After its code and length field, a set item consists of a single 
word: 

    size                byte size of the object


Enumeration Items
.................

An enumeration item describes a Pascal or C enumerated type. After its code and 
length word, the description of a <contiguous enumeration> contains the 
following word-sized fields:

    type                a type word describing the type
                        of the container for the enumeration
                        (see "<Representation of Data Types>")

    count               the cardinality of the enumeration

    base                the first (lowest) value (may be -ve)

    nametable           a character array containing <count> names (see
                        "<Text Names in Items>")

The description of a discontiguous enumeration (such as the C enumeration enum 
bits {bit0=1, bit1=2, bit2=4, bit3=8, bit4=16}) contains the following fields 
after its code and length word: 

    type        as above

    count       as above

    nametable   a table of <count> (value, name) pairs

Each nametable entry has the following format (which is variable in length):

    val         the enumerated value (1/2/4/8/16 in the example)

    name        the name of the enumerated element (may be several words long)


Function Declaration Items
..........................

After its code and length word, a function declaration item contains the 
following fields:

    type        a type word (described in "<Representation of Data Types>")
                describing the return type of the function or procedure

    argcount    the number of arguments to the function

    args        a sequence of <argcount> argument description items

Each argument description item contains the following:

    type        a type word (described in "<Representation of Data Types>")
                describing the type of the argument

    name        the name of the argument (may be several words)

An argument descriptor need not be named; in this case the length of the name 
is zero, and the name field is a single zero word.


Begin and End Naming Scope Items 
.................................

These debug items are used to mark the beginning and end of a naming scope. 
They must be properly nested in the debug area.

In each case, after the code and length word, there is one word-sized field:

    codeaddress         address of the start/end of scope (which is
                        determined by the code word)


Fileinfo Items
..............

A fileinfo item appears once per section, after all other debugging data items.  
If the fileinfo item is too large for its length to be encoded in 16 bits, its 
length field must be written as 0 (since this is the last item in a section and 
the section header contains the length of the whole section, the length field 
is strictly redundant).

Each source file is described by a sequence of <fragments>. Each <fragment> 
describes a contiguous region of the file, within which the addresses of 
compiled code increase monotonically with source file position. The order in 
which fragments appear in the sequence is not necessarily related to the source 
file positions to which they refer.

Note that for compilations which make no use of the #include facility, the list 
of fragments may have only one entry, and all line-number information can be 
contiguous.

After its code and length word, the fileinfo item is a sequence of file entry 
items with the following format:

    len                 length of this entry in bytes (including the length
                        of the following fragments)

    date                date and time when the file was last modified
                        (may be 0, indicating not available, or unused)

    filename            string (or "" if the name is not known)

    fragment data       see below

If present, the date field contains the number of seconds since the beginning 
of 1970 (the Unix date origin).

Following the final file entry item is a single 0 word marking the end of the 
sequence.

The fragment data is a word giving the number of following fragments followed 
by a sequence of fragment items:

    n                   number of fragments following

    fragments...        n fragment items

Each fragment item consists of 5 words, followed by a sequence of byte pairs 
and half word pairs, formatted as follows:

    size                length of this fragment in bytes (including
                        length of following lineinfo items)

    firstline           linenumber

    lastline            linenumber

    codestart           pointer to the start of the fragment's executable
                        code

    codesize            byte size of the code in the fragment

    lineinfo...         a variable number of bytes matching line numbers
                        to code addresses

Each lineinfo item describes a source statement and consists of a pair of 
(unsigned) bytes, possibly followed by two or three (unsigned) half words 
(each half word has the byte ordering appropriate to the target memory system's 
endian-ness or byte sex).

The short form (pair of bytes) lineinfo item is as follows:

    codeinc             # bytes of code generated by this statement

    lineinc             # source space occupied by this statement

<lineinc> describes how to calculate the source position (line, column) of the 
next statement from the source position of this one.  If <lineinc> is in the 
range 0 <= <lineinc> < 64, the new position is (line+<lineinc>,1).  If <lineinc> 
>= 64, the new position is (line,column+<lineinc>-64).

The number of bytes of code generated for a statement may be zero, provided the 
line increment is non-zero (such an item may describe a block end or block 
start, for example).

It is not possible to describe a statement which generates no code and no line 
number increment, as that encoding is used as an escape to the long form 
lineinfo items described below. 

If <codeinc> is greater than 255, or <lineinc> is required to describe a line 
number change greater than 63 or a column change greater than 191, then both 
bytes are written to describe 0 increments, and the real values are given in 
the following two or three (unsigned) half words.  (Note that there are two 
ways to describe 0 increments: 0 lines and 0 columns, which serves to 
discriminate between the two half word and three half word forms).  If the 
starting column for the next statement is 1, the two half word form is used, 
which in effect is a triple of half words as follows:

    zero                2 zero bytes

    lineinc             # source lines occupied by this statement

    codeinc             # bytes of code generated by this statement

Note that the order of the <lineinc> and <codeinc> half words is the reverse of 
the corresponding bytes.

If the starting column for the next statement is not 1, the three half word 
form is used, which in effect is a quadruple of half words, as follows:

                        codeinc = 0, lineinc = 64

    lineinc             # source lines occupied by this statement

    codeinc             # bytes of code generated by this statement

    newcol              starting column for the next statement

Note as above that the order of the <lineinc> and <codeinc> half words is the 
reverse of the corresponding bytes.  Note also that the column item here is the 
absolute column number for the next statement, and not an increment as in the 
two byte form.
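
The short and long lineinfo forms can be decoded with a sketch such as the 
following. It is an illustration under stated assumptions rather than a 
definitive reader: the function name is hypothetical, the first statement of a 
fragment is assumed to start in column 1 of <firstline>, and the caller is 
assumed to pass exactly the lineinfo bytes of one fragment.

```python
import struct

def decode_lineinfo(data, first_line, code_start, big_endian=False):
    """Return a list of (address, line, column), one per statement."""
    half = ">H" if big_endian else "<H"
    addr, line, col = code_start, first_line, 1   # assumed start column
    statements = []
    pos = 0
    while pos + 2 <= len(data):
        codeinc, lineinc = data[pos], data[pos + 1]
        pos += 2
        if codeinc == 0 and lineinc in (0, 64):
            # Long form: half words give lineinc first, then codeinc.
            has_column = lineinc == 64
            (lines,) = struct.unpack_from(half, data, pos)
            (codeinc,) = struct.unpack_from(half, data, pos + 2)
            pos += 4
            new_col = 1
            if has_column:
                # Three half word form: absolute column of next statement.
                (new_col,) = struct.unpack_from(half, data, pos)
                pos += 2
            next_line, next_col = line + lines, new_col
        elif lineinc < 64:
            next_line, next_col = line + lineinc, 1
        else:
            next_line, next_col = line, col + lineinc - 64
        statements.append((addr, line, col))
        addr += codeinc
        line, col = next_line, next_col
    return statements

# Synthetic little-endian data: two short items, one of each long form,
# then a final short item.
sample = (bytes([8, 1, 4, 64 + 5, 0, 0]) + struct.pack("<2H", 100, 300)
          + bytes([0, 64]) + struct.pack("<3H", 2, 16, 9)
          + bytes([4, 1]))
table = decode_lineinfo(sample, first_line=10, code_start=0x8000)
```

Each entry of the result gives the code address and source position at which 
a statement starts; the increments in an item describe where the following 
statement begins.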

(This encoding of lineinfo items is an incompatible change from the previous 
format (version 2): in that format, <lineinc> in a two byte lineinfo item 
always describes a line increment, and accordingly, there is no four half word 
form.  Programs interpreting asd tables should interpret lineinfo items 
differently according to the table format in the section item.)

