Programming in C
================


A Very Simple C Program
-----------------------


About this Recipe
.................

This recipe gives you a simple exercise in using the ARM Software Development 
Toolkit (the toolkit) to write a program in C. By following it, you will learn 
how to:

 *  use the ARM C compiler <armcc> to create a runnable program;

 *  use the ARM source level debugger <armsd> to run your program on a 
    (simulated) ARM system;

 *  use <armcc> to compile a C program to an object file;

 *  use the ARM linker <armlink> to create a runnable program from an object 
    file and the ARM C library.


Prerequisites
.............

Before you can try this recipe, the toolkit must be properly installed on your 
computer. Instructions for installation are given in the installation notes 
distributed with every toolkit. If you experience any difficulties, please 
refer to these notes.


Making a Simple Runnable Program
................................

The "Hello World" program shown below is included in the on-line examples as 
file <hellow.c> in the directory <examples>:

    #include <stdio.h>
    
    int main( int argc, char **argv )
    { printf("Hello World\n");
      return 0;
    }

If you set your working directory to be the <examples> directory you can 
compile this program to runnable form in a single step using:

    armcc hellow.c -li -apcs 3/32bit


Explanation

The argument -li says that the target is little endian and -apcs 3/32bit says 
that the 32 bit ARM procedure call standard should be used. If the compiler has 
been <configured> to use these options by default then these arguments need not 
be given (see "<The ARM Tool Reconfiguration Utility (reconfig)>" starting on 
page 45 of the User Manual for details). The executable program is left in a 
file called <hellow>.


Running the Program
...................

You can run the program (technically an AIF Image) using <armsd>. You should 
follow the sample dialog below:

    host-prompt> armsd -li hellow
    A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
    ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
    Object program file hellow
    armsd: go
    Hello World
    Program terminated normally at PC = 0x000082a0
          0x000082a0: 0xef000011 .... : >  swi     0x11
    armsd: quit
    Quitting
    host-prompt>


Explanation

The -li argument to <armsd> tells it to emulate a little endian ARM. If armsd 
has been configured to be little endian by default then -li can be omitted (see 
"<The ARM Tool Reconfiguration Utility (reconfig)>" of the 
User Manual for how to configure the ARM development tools).

When armsd comes up with its "armsd:" prompt and waits for your command, you 
should type "go<CR>". At the next prompt type "quit<CR>" to exit <armsd>.


Separate Compiling
..................

You can invoke the compiler and the linker separately. You can use:

    armcc -c hellow.c -li -apcs 3/32bit

to make an object file (in this example called <hellow.o>, by default).


Explanation

The -c flag tells the compiler to make an object file but not to link it with 
the C library.


Separate Linking
................

When you have finished compiling, you can link your object file with the C 
library to make a runnable program using:

    armlink -o hellow hellow.o <somewhere>/armlib.32l

Where we have written <somewhere>, above, you must type the name of the 
directory containing the ARM C libraries.


Notes

You now have to be very explicit; you must specify:

 *  the name of the file which will contain the runnable program (here, 
    <hellow>);

 *  the name of the object file (here, <hellow.o>);

 *  the location and name of the C library you wish to use.

In simple cases, <armcc> can reduce the need to be so explicit.


Related Topics
..............

Please refer to the index to find topics of particular interest.


Writing Efficient C for the ARM
-------------------------------


About This Recipe
.................

The ARM C compiler can generate very good machine code if you present it 
with the right sort of input. From this note, you will learn:

 *  what the C compiler compiles well and why;

 *  how to help the C compiler to generate excellent machine code.

Some of the rules of thumb presented are quite general; some are quite specific 
to the ARM or the ARM C compiler. It should be quite clear from context which 
rules are portable.

The first subsection below is concerned with how to design collections of C 
functions to maximise low-level efficiency. The following subsection is 
concerned with the efficiency of larger and more complicated functions.


Function Design Considerations
..............................

Unlike on many earlier CISC processor architectures, function call overhead on 
the ARM is small and often in proportion to the work done by the called 
function. Several features contribute to this:

 *  the minimal ARM call-return sequence is BL... MOV pc, lr, which is 
    extremely economical;

 *  STM and LDM reduce the cost of entry to and exit from functions which must 
    create a stack frame and/or save registers;

 *  the ARM Procedure Call Standard has been carefully designed to allow two 
    very important types of function call to be optimised so that the entry and 
    exit overheads are minimal.

Good general advice is to keep functions small, because function calling 
overheads are low. In the remainder of this subsection you will learn precisely 
when function call overhead is very low. In following subsections you will 
learn how small functions help the ARM C compiler; you will also learn how to 
assist the C compiler when functions cannot be kept small.


Leaf Functions

In 'typical' programs, about half of all function calls made are to leaf 
functions (a leaf function is one which makes no calls from within its body).

Often, a leaf function is rather simple. On the ARM, if it is simple enough to 
compile using just 5 registers (a1-a4 and ip), it will carry no function entry 
or exit overhead. A surprising proportion of useful leaf functions can be 
compiled within this constraint.

Once registers have to be saved, it is efficient to save them using STM. In 
fact the more you can save at one go, the better. In a leaf function, all and 
only the registers which need to be saved will be saved by a single STMFD 
sp!,{regs,lr} on entry and a matching LDMFD sp!,{regs,pc} on exit.

In general, the cost of pushing some registers on entry and popping them on 
exit is very small compared to the cost of the useful work done by a leaf 
function which is complicated enough to need more than 5 registers.

Overall, you should expect a leaf function to carry virtually no function entry 
and exit overhead; and at worst, a small overhead, most likely in proportion to 
the useful work done by it.
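A plausible example of such a leaf function (invented for illustration; whether it really compiles with zero entry and exit overhead depends on the compiler version and options used):

```c
#include <stddef.h>

/* A leaf function: it calls nothing and needs only a few temporaries,
   so it can plausibly be compiled using just a1-a4 and ip, with the
   minimal BL ... MOV pc,lr call-return sequence and no stacked state. */
int sum_bytes(const unsigned char *p, size_t n)
{
    int total = 0;
    while (n--)
        total += *p++;
    return total;
}
```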


Veneer Functions (Simple Tail Continued Functions)

Historically, abstraction veneers have been relatively expensive. The kind of 
veneer function which merely changes the types of its arguments, or which calls 
a low-level implementation with an extra argument (say), has often cost much 
more in entry and exit overhead than it was worth in useful work.

On the ARM, if a function ends with a call to another function, that call can 
be converted to a <tail continuation>. In functions which need to save no 
registers, the effect can be dramatic. Consider, for example:

    extern void *__sys_alloc(unsigned type, unsigned n_words);
    #define  NOTGCable   0x80000000
    #define  NOTMovable  0x40000000
    
    void *malloc(unsigned n_bytes)
    {   return __sys_alloc(NOTGCable+NOTMovable, n_bytes/4);
    }

Here, <armcc> generates:

    malloc
        MOV     a2,a1,LSR #2
        MOV     a1,#&c0000000
        B       |__sys_alloc|

There is no function entry or exit overhead - just useful work massaging 
arguments - and the function return has disappeared entirely - return is direct 
from __sys_alloc to malloc's caller. In this case, the basic call-return cost 
for the function pair has been reduced from:

     BL + BL + MOV pc,lr + MOV pc,lr

to:

     BL + B  +             MOV pc,lr

a saving of 25%.

More complicated functions in which the only function calls are immediately 
before a return collapse equally well. An artificial example is:

    extern int f1(int), f2(int, int);
    
    int f(int a, int b)
    {   if (b == 0)
            return a;
        else if (b < 0)
            return f2(a, -b);
        else
            return f2(b, a);  /* argument order swapped */
    }

<armcc> generates the following, wonderfully efficient code:

    f   CMP     a2,#0
        MOVEQS  pc,lr
        RSBLT   a2,a2,#0
        BLT     f2
        MOV     a3,a1
        MOV     a1,a2
        MOV     a2,a3
        B       f2


Fast Paths and Slow Paths - A Useful Transformation

Inevitably, not all functions can be leaves or small abstraction functions. 
And, inevitably, non-leaf functions must carry the cost of establishing a call 
frame on entry and removing it on exit, perhaps also the cost of saving and 
restoring some registers. How does this hurt performance? Consider the 
following example:

    int f(Buffer *b)
    {    if (b->n > 0)
         {   /* The usual path through the function... */
             /*     95% of all calls.*/
             /* Simple calculation involving b->buf, b->n, etc.*/
             return ...;
         }
         /* Exceptional path through the function... */
         /*     5% of all calls.  */
         /* Complicated calculation involving calls */
         /*     to other functions. */
         return ...;
    }

In this case, the entry and register-save overhead caused by the infrequent 
heavyweight path through the function applies to the much more frequent 
lightweight path through it. To fix this, turn the heavyweight path into a tail 
call. Yes, introducing another layer of function call yields much more 
efficient code!

    int f2(Buffer *b)
    {    /* Exceptional path through the function... */
         /*     5% of all calls.  */
         /* Complicated calculation involving calls */
         /*     to other functions.*/
         return ...;
    }
    
    int f(Buffer *b)
    {    if (b->n > 0)
         {   /* The usual path through the function... */
             /*     95% of all calls.*/
             /* Simple calculation involving b->buf, b->n, etc.*/
             return ...;
         }
         return f2(b);
    }

If you are lucky, f() will now compile using only a1-a4 and ip and so incur no 
entry overhead whatsoever. 95% of the time, the overhead on the original f() 
will be reduced to zero.

This is quite a general source transformation technique and you should look for 
opportunities to use it and analogous transformations. It works for any 
processor to some extent; it works particularly well for the ARM because of the 
careful optimisation of tail continuation in lightweight functions.

Repeated application of this technique to the chain of six or so functions 
called for every character processed by the preprocessing phase of the ARM C 
compiler, improved the performance of the preprocessor (running on the ARM) by 
about 30%.


Function Arguments and Argument Passing

The final aspect of function design which influences low-level efficiency is 
argument passing.

Under the ARM Procedure Call Standard, up to four argument words can be passed 
to a function in registers. Functions of up to four integral (not floating 
point) arguments are particularly efficient and incur very little overhead 
beyond that required to compute the argument expressions themselves (there may 
be a little register juggling in the called function, depending on its 
complexity).

If more arguments are needed, then the 5th, 6th, etc., words will be passed on 
the stack. This incurs the cost of an STR in the calling function and an LDR in 
the called function for each argument word beyond four.
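To illustrate (an invented example), in the function below a..d travel in registers a1-a4, while e and f must go via the stack:

```c
/* Under the APCS, a..d are passed in registers a1-a4; e and f are
   passed on the stack, each costing an STR in the caller and an LDR
   in the callee. */
int sum6(int a, int b, int c, int d, int e, int f)
{
    return a + b + c + d + e + f;
}
```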

How can argument passing overhead be minimised?

 *  Try to ensure that small functions take four or fewer arguments. These will 
    compile particularly well.

 *  If a function needs many arguments, try to ensure that it does a 
    significant amount of work on every call, so that the cost of passing 
    arguments is amortised.

 *  Factor out read-mostly global control state and make this static. If it has 
    to be passed as an argument (e.g. to support multiple clients) then wrap it 
    up in a struct and pass a pointer to it. The characteristics of such 
    control state are:

     *  it's logically global to the compilation unit or program

     *  it's read-mostly, often read-only except in response to user input, 
        and, for almost all functions, cannot be changed by them or by any 
        function they call;

     *  references to it are ubiquitous, but in any function, references are 
        relatively rare (frequent references should be replaced by references 
        to a local, non-static copy).

    Don't confuse such control state with computational arguments, the values of 
    which differ on every call.

 *  Collect related data into structs. Decide whether to pass pointers or 
    struct values based on the use of each struct in the called function:

     *  If few fields are read or written then passing a pointer is best.

     *  The cost of passing a struct via the stack is typically a share in an 
        LDM-STM pair for each word of the struct. This can be better than 
        passing a pointer if (i) on average, each field is used at least once 
        and (ii) the register pressure in the function is high enough to force 
        a pointer to be repeatedly re-loaded. 

    As a rule of thumb, you can't lose much efficiency if you pass pointers to 
    structs rather than struct values. To gain efficiency by passing struct 
    values rather than pointers usually requires careful study of a function's 
    machine code.
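The control-state factoring described above might look like this (all names invented for illustration):

```c
/* Read-mostly control state collected into a struct: passing one
   pointer costs a single register argument, however many fields
   the state grows to hold. */
typedef struct Options {
    int verbose;      /* set once from user input, then read-only */
    int tab_width;    /* read by many functions, written by none  */
} Options;

int indent_width(const Options *opt, int depth)
{
    /* Only one field is read, so passing a pointer is the right choice. */
    return depth * opt->tab_width;
}
```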


Register Allocation and How To Help It
......................................

It is well known that register allocation is critical to the efficiency of code 
compiled for RISC processors. It is particularly critical for the ARM, which 
has only 16 registers rather than the 'traditional' 32.

The ARM C compiler has a highly efficient register allocator which operates on 
complete functions and which tries to allocate the most frequently used 
variables to registers (taking loop nesting into account). It produces very 
good results unless the demand for registers seriously outstrips supply. And it 
has one shortcoming, namely that it allocates whole variables to registers, not 
separate live ranges.

As code generation proceeds for a function, new variables are created for 
expression temporaries. These are never reused in later expressions and cannot 
be spilled to memory. Usually, this causes no problems. However, a particularly 
pathological expression could, in principle, occupy most of the allocatable 
registers, forcing almost all program variables to be spilled to memory. 
Because the number of registers required to evaluate an expression is a 
logarithmic function of the number of terms in it, it takes an expression of more 
than 32 terms to threaten the use of any variable register.

As a rule of thumb, avoid very large expressions (more than 30 terms).
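For instance (an invented example), a polynomial is better accumulated in short steps than written out as one long expression; each step needs only a couple of expression temporaries:

```c
/* Horner's rule: the polynomial is computed two terms at a time, so
   evaluation never holds more than a couple of expression temporaries
   in registers at once. */
int poly8(const int c[8], int x)
{
    int acc = c[7];
    int i;
    for (i = 6; i >= 0; --i)
        acc = acc * x + c[i];
    return acc;
}
```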

The more serious problem is with long scope program variables. Our allocator 
either allocates a variable to a chosen register everywhere the variable is 
live, or it leaves the variable in memory. To help visualise the problem - and 
to see how to help the allocator - consider the following two program schemata:

    int f()                            int f()
    {   int i, j, ...;                 {   int j, ...;
                                         { int i;
        for (i = 0;  i < lim;  ++i)        for (i = 0;  i < lim;  ++i)
        {                                  {
            ...                               ...
        }                                  }
                                         }
                                         { int i;
        for (i = 0;  i < lim;  ++i)        for (i = 0;  i < lim;  ++i)
        {  /* register pressure in this    {
           loop forces 'i' to memory */
        }                                  }
                                         }
                                         { int i;
        for (i = 0;  i < lim;  ++i)        for (i = 0;  i < lim;  ++i)
        {                                  {
            ...                                ...
        }                                  }
                                         }
    }                                  }

In the left hand case, because the scope of 'i' is the whole function, if 'i' 
cannot be allocated to a register everywhere then all three loops will suffer 
their loop index being in memory. On the other hand, in the right hand case 
there are three separate variables called 'i', each of which will be allocated 
separately by the register allocator.

As a rule of thumb, keep variable declarations local, especially in large 
functions. Use additional block structure as illustrated here (right hand 
example), if necessary.

On the other hand, if this transformation is carried to excess, there may be 
bad results. When a local variable is spilled to memory, there is a stack 
adjustment on each entry to and exit from its containing scope. The ARM C 
compiler does this to minimise the space used by local variables. Suppose, for 
example, that in the right hand case above, each block declared a 1KB buffer as 
well as 'i'. Then adjusting the stack at every scope leads to stack usage of 
just over 1KB whereas adjusting it only at function entry leads to usage of 
more than 3KB.

In principle, the compiler could be more intelligent about adjusting the stack 
locally for large variables and only at function entry for small variables. For 
the moment, the programmer must be aware of these issues.

So, a modified rule of thumb is to cluster variable declarations into 
reasonable sub-scopes within large functions and to avoid doing so within the 
most deeply nested loops. This will most likely help the allocator without 
introducing unwanted costs associated with local stack adjustment.
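The buffer scenario can be sketched as follows (invented example; the exact stack behaviour depends on the compiler):

```c
/* Each block declares its own 1KB buffer.  Because the ARM C compiler
   adjusts sp on entry to and exit from each scope, peak stack use
   here is just over 1KB; adjusting only at function entry would cost
   over 2KB - at the price of a stack adjustment per scope. */
long checksum_phases(void)
{
    long sum = 0;
    {   unsigned char buf[1024];
        int i;
        for (i = 0; i < 1024; ++i) buf[i] = (unsigned char)i;
        for (i = 0; i < 1024; ++i) sum += buf[i];
    }
    {   unsigned char buf[1024];
        int i;
        for (i = 0; i < 1024; ++i) buf[i] = (unsigned char)(i ^ 1);
        for (i = 0; i < 1024; ++i) sum += buf[i];
    }
    return sum;
}
```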


Static and Extern Variables - Minimising Access Costs
.....................................................

A variable in a register costs nothing to access: it is just there, waiting to 
be used. A local (auto) variable is addressed via the sp register, which is 
always available for the purpose.

A static variable, on the other hand, can only be accessed after the static 
base for the compilation unit has been loaded. So, the first such use in a 
function always costs 2 LDRs or an LDR and an STR. However, if there are many 
uses of static variables within a function then there is a good chance that the 
static base will become a global common subexpression (CSE) and that, overall, 
access to static variables will be no more expensive than to auto variables on 
the stack.

Extern variables are fundamentally more expensive: each has its own base 
pointer. Thus each access to an extern is likely to cost 2 LDRs or an LDR and 
an STR. It is much less likely that a pointer to an extern will become a global 
CSE - and almost certain that there cannot be several such CSEs - so if a 
function accesses lots of extern variables, it is bound to incur significant 
access costs.

A further cost occurs when a function is called: the compiler has to assume - 
in the absence of inter-procedural data flow analysis - that <any> non-const 
static or extern variable <could> be side-effected by the call. This severely 
limits the scope across which the value of a static or extern variable can be 
held in a register.

Sometimes a programmer can do better than a compiler could do, even a compiler 
that did interprocedural data flow analysis. An example in C is given by the 
standard streams: stdin, stdout and stderr. These are not pointers to const 
objects (the underlying FILE structs are modified by I/O operations), nor are 
they necessarily const pointers (they may be assignable in some 
implementations). Nonetheless, a function can almost always safely slave a 
reference to a stream in a local FILE * variable.

It is a common programming paradigm to mimic the standard streams in 
applications. Consider, for example, the shape of a typical non-leaf printing 
function:

    extern FILE *out;                  extern FILE *out;
        /* the output stream */            /* the output stream */
    
    void print_it(Thing *t)            void print_it(Thing *t)
    {                                  {   FILE *f = out;
        fprintf(out, ...);                 fprintf(f, ...);
        print_1(t->first);                 print_1(t->first);
        fprintf(out, ...);                 fprintf(f, ...);
        print_2(t->second);                print_2(t->second);
        fprintf(out, ...);                 fprintf(f, ...);
        ...                                ...
    }                                  }

In the left hand case, out has to be re-computed or re-loaded after each call 
to print_... (and after each fprintf...). In the right hand case, 'f' can be 
held in a register throughout the function (and probably will be).

Uniform application of this transformation to the disassembly module of the ARM 
C compiler saved more than 5% of its code space.

In general, it is difficult and potentially dangerous to assert that no 
function you call (or any functions they in turn call) can affect the value of 
any static or extern variables of which you currently have local copies. 
However, the rewards can be considerable so it is usually worthwhile to work 
out at the program design stage which global variables are slavable locally and 
which are not. Trying to retrofit this improvement to existing code is usually 
hazardous, except in very simple cases like the above.
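A compilable sketch of the right-hand idiom above (the global stream and message text are invented for illustration):

```c
#include <stdio.h>

FILE *out;   /* the application's output stream, assigned once at start-up */

/* The global is slaved into a local on entry, so 'f' can be held in
   a register across both fprintf calls instead of being re-loaded
   after each one. */
void print_header(const char *title)
{
    FILE *f = out;
    fprintf(f, "== %s ==\n", title);
    fprintf(f, "(report follows)\n");
}
```

At start-up the application would assign, say, out = stdout; thereafter each printing function slaves <out> on entry exactly once.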


The switch() Statement
......................

The switch() statement can be used to transfer control to one of several 
destinations - conceptually an indexed transfer of control - or to generate a 
value related to the controlling expression (in effect computing an in-line 
function of the controlling expression).

In the first role, switch() is hard to improve upon: the ARM C compiler does a 
good job of deciding when to compile jump tables and when to compile trees of 
if-then-elses. It is rare for a programmer to be able to improve upon this by 
writing if-then-else trees explicitly in the source.

In the second role, however, use of switch() is often mistaken. You can 
probably do better by being more aware of what is being computed and how.

In the example below, which is abstracted from an early version of the 
disassembly module of the ARM C Compiler, you will learn:

 *  the cost of implementing an in-line function using switch();

 *  how to implement the same function more economically.

The function below, used for illustrative purposes, maps a 4-bit field of an ARM 
instruction to a 2-character condition code mnemonic. The real case was more 
complicated, decoding two 4-bit fields to a 3-char mnemonic, but for 
illustration the simple example serves just as well. The real case was also 
embedded in a larger function, but this is irrelevant to the discussion.

    char *cond_of_instr(unsigned instr)
    {   char *s;
        switch (instr & 0xf0000000)
        {
    case 0x00000000:  s = "EQ";  break;
    case 0x10000000:  s = "NE";  break;
         ...          ...        ...
    case 0xF0000000:  s = "NV";  break;
        }
        return s;
    }

The compiler handles this code fragment well, generating 276 bytes of code and 
string literals. But we could do better. If performance were not critical (as 
it never is in disassembly) then we could look up the code in a table of codes, 
in something like:

    char *cond_of_instr(unsigned instr)
    {
        static struct {char name[3];  unsigned code;}
            conds[] = {
                {"EQ", 0x00000000},
                {"NE", 0x10000000},
                ....
                {"NV", 0xf0000000},
            };
        int j;
        for (j = 0;  j < sizeof(conds)/sizeof(conds[0]);  ++j)
            if ((instr & 0xf0000000) == conds[j].code)
                return conds[j].name;
        return "";
    }

This fragment compiles to 68 bytes of code and 128 bytes of table data. Already 
this is a 30% improvement on the switch() case, but this schema has other 
advantages: it copes well with a random code to string mapping and if the 
mapping is not random admits further optimisation. For example, if the code is 
stored in a byte (char) instead of an unsigned and the comparison is with 
(instr >> 28) rather than (instr & 0xF0000000) then only 60 bytes of code and 
64 bytes of data are generated for a total of 124 bytes.
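That improved version might read as follows (a sketch; the table is filled out here with the standard ARM condition-code encoding):

```c
/* Byte-sized codes compared against (instr >> 28): each entry shrinks
   from 8 bytes to 4, and the comparison needs no 32-bit literal. */
char *cond_of_instr(unsigned instr)
{
    static const struct { char name[3]; unsigned char code; } conds[] = {
        {"EQ", 0}, {"NE", 1}, {"CS", 2}, {"CC", 3},
        {"MI", 4}, {"PL", 5}, {"VS", 6}, {"VC", 7},
        {"HI", 8}, {"LS", 9}, {"GE", 10}, {"LT", 11},
        {"GT", 12}, {"LE", 13}, {"AL", 14}, {"NV", 15},
    };
    int j;
    for (j = 0; j < (int)(sizeof conds / sizeof conds[0]); ++j)
        if ((instr >> 28) == conds[j].code)
            return (char *)conds[j].name;
    return "";
}
```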

Another advantage we have heard of for table lookup is that it is possible to 
share the same table between a disassembler and an assembler - the assembler 
looks up the mnemonic to obtain the code value, rather than the code value to 
obtain the mnemonic. Where performance is not critical, the symmetric property 
of lookup tables can sometimes be exploited to yield significant space savings.

Finally, by exploiting the denseness of the indexing and the uniformity of the 
returned value it is possible to do better again, both in size and performance, 
by direct indexing:

    char *cond_of_instr(unsigned instr)
    {
        return "\
    EQ\0\0NE\0\0CS\0\0CC\0\0MI\0\0PL\0\0VS\0\0VC\0\0\
    HI\0\0LS\0\0GE\0\0LT\0\0GT\0\0LE\0\0AL\0\0NV" + (instr >> 28)*4;
    }

This expression of the problem causes a miserly 16 bytes of code and 64 bytes 
of string literal to be generated and is probably close to what an experienced 
assembly language programmer would naturally write if asked to code this 
function. It is the solution finally adopted in the ARM C compiler's 
disassembler module.

The uniform application of this transformation to the disassembler module of 
the ARM C compiler saved between 5% and 10% of its code space.

The moral of this tale is to think before using switch() to compute an in-line 
function, especially if code size is an important consideration. switch() 
compiles to high-performance code, but table lookup will often be smaller; where 
the function's domain is dense, or piecewise dense, direct indexing into a 
table will often be both faster and smaller.


Related Topics
..............

 *  "<ARM Assembly Programming Performance Issues>".

 *  "<Register Usage under the ARM Procedure Call Standard>".

 *  "<Passing and Returning structs>".



C Programming for Deeply Embedded Applications
----------------------------------------------


About this Recipe
.................

In this recipe you will learn about the standalone runtime support system for C 
programming in deeply embedded applications.  In particular you will discover:

 *  what <rtstand.s> supports;

 *  how to make use of it by looking at example programs;

 *  how to extend it by adding extra functionality from the C library;

 *  the size of the standalone run time library;


Introduction
............

The semi hosted ANSI C library provides all the standard C library facilities 
(and thus is quite large).  This is acceptable when running under emulation 
with plenty of memory available, or maybe even when running on development 
hardware with access to a real debugging channel and plenty of memory. However, 
in a deeply embedded application many of the facilities of the C library may no 
longer be relevant, e.g. file access functions, time and date functions, and the 
size of the semi hosted ANSI C library may be prohibitive if the memory 
available is severely limited.

For deeply embedded applications a minimal C runtime system is needed which 
takes up as little memory as possible, is easily portable to the target 
hardware, and only supports those functions required for such an application.

The ARM Software Development Toolkit comes with a minimal runtime system in 
source form.  The 'behind the scenes' jobs which it performs are:

 *  setting up the initial stack and heap, and calling main;

 *  program termination - either automatic (returning from main()) or forced 
    (explicitly calling __rt_exit);

 *  simple heap allocation (__rt_alloc);

 *  stack limit checking;

 *  setjmp and longjmp support;

 *  divide and remainder functions (calls to which can be generated by 
    <armcc>);

 *  high level error handler support (__err_handler);

 *  optional floating point support, and a means to detect whether floating 
    point support is available or not (__rt_fpavailable).

The source code <rtstand.s> documents the options which you may want to change 
for your target.  These are not covered in this recipe.  The header file 
<rtstand.h> documents the functions which <rtstand.s> provides to the C 
programmer.

Note that no support is provided for outputting data down the debugging 
channel.  This can be done, but is specific to the target application.  The 
example C programs described below use the ARM Debug Monitor available under 
<armsd> to output messages using in-line SWIs.  See "<ARM Debug Monitor>" 
starting on page 104 of the Technical Specifications for full details of the 
facilities which the ARM Debug Monitor provides, and see "<In-Line SWIs>" 
starting on page 72 for more information about in-line SWIs.


Using the Standalone Runtime System
...................................

In this section the main features of the standalone runtime system are 
demonstrated by example programs.

Before attempting any of the demonstrations below create a working directory, 
and set this up as your current directory.  Copy the contents of the <clstand> 
directory into your working directory, and also copy the files <fpe*.o> from 
the <fpe340> directory of the <cl> directory into your working directory.  You 
are now ready to experiment with the C standalone runtime system.

In the examples below, the following options are passed to <armcc>, <armasm>, 
and in the first case <armsd>:

    -li                 This specifies that the target is a little endian 
                        ARM.

    -apcs 3/32bit       This specifies that the 32 bit variant of APCS 3 should 
                        be used.  For <armasm> this is used to set the built in 
                        variable {CONFIG} to 32.

These arguments can be changed if the target hardware differs from this 
configuration.  If the ARM Software Tools have been configured as desired then 
these options may be omitted, as the tools will default to the configuration 
time values.  See "<The ARM Tool Reconfiguration Utility (reconfig)>" starting 
on page 45 of the User Manual for how to configure the ARM Software Tools.

These demonstrations are likely to be most useful if the sources <rtstand.s>, 
<errtest.c> and <memtest.c> are studied in conjunction with this recipe.


A Simple Program
................

Let us compile the example program <errtest.c>, and assemble the standalone 
runtime system.  These can then be linked together to provide an executable 
image, <errtest>:

    armcc -c errtest.c -li -apcs 3/32bit
    armasm rtstand.s -o rtstand.o -li -apcs 3/32bit
    armlink -o errtest errtest.o rtstand.o

We can then execute this image under the <armsd> as follows:

    > armsd -li errtest
    A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
    ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
    Object program file errtest
    armsd: go
    (the floating point instruction-set is not available)
    Using integer arithmetic ...
    10000 / 0X0000000A = 0X000003E8
    10000 / 0X00000009 = 0X00000457
    10000 / 0X00000008 = 0X000004E2
    10000 / 0X00000007 = 0X00000594
    10000 / 0X00000006 = 0X00000682
    10000 / 0X00000005 = 0X000007D0
    10000 / 0X00000004 = 0X000009C4
    10000 / 0X00000003 = 0X00000D05
    10000 / 0X00000002 = 0X00001388
    10000 / 0X00000001 = 0X00002710
    Program terminated normally at PC = 0x00008550
          0x00008550: 0xef000011 .... : >  swi     0x11
    armsd: quit
    Quitting
    > 

The '>' prompt is the Operating System prompt, and the 'armsd:' prompt is 
output by <armsd> to indicate that user input is required.

Already several of the standalone runtime system's facilities have been 
demonstrated:

 *  the C stack and heap have been set up;

 *  <main> has clearly been called;

 *  the fact that floating point support is not available has been detected;

 *  the integer division functions have been used by the compiler;

 *  program termination was successful.


Error Handling
..............

The same program, <errtest>, can also be used to demonstrate error handling, by 
recompiling <errtest.c> and predefining the DIVIDE_ERROR macro:

    armcc -c errtest.c -li -apcs 3/32bit -DDIVIDE_ERROR
    armlink -o errtest errtest.o rtstand.o

Again, we can now execute this image under the <armsd> as follows:

    > armsd -li errtest
    A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
    ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
    Object program file errtest
    armsd: go
    (the floating point instruction-set is not available)
    Using integer arithmetic ...
    10000 / 0X0000000A = 0X000003E8
    10000 / 0X00000009 = 0X00000457
    10000 / 0X00000008 = 0X000004E2
    10000 / 0X00000007 = 0X00000594
    10000 / 0X00000006 = 0X00000682
    10000 / 0X00000005 = 0X000007D0
    10000 / 0X00000004 = 0X000009C4
    10000 / 0X00000003 = 0X00000D05
    10000 / 0X00000002 = 0X00001388
    10000 / 0X00000001 = 0X00002710
    10000 / 0X00000000 = errhandler called: code = 0X00000001: divide by 0
    caller's pc = 0X00008304
    returning...
    
    run time error: divide by 0
    program terminated
    
    Program terminated normally at PC = 0x0000854c
          0x0000854c: 0xef000011 .... : >  swi     0x11
    armsd: quit
    Quitting
    > 

This time an integer division by zero has been detected by the standalone 
runtime system, which called <__err_handler>.  <__err_handler> output the 
first set of error messages in the above output.  Control was then returned to 
the runtime system, which output the second set of error messages and 
terminated execution.


longjmp and setjmp
..................

A further demonstration can be made using <errtest> by predefining the macro 
LONGJMP to perform a <longjmp> out of <__err_handler> back into the user 
program, thus catching and dealing with the error.  First recompile and link 
<errtest>:

    armcc -c errtest.c -li -apcs 3/32bit -DDIVIDE_ERROR -DLONGJMP
    armlink -o errtest errtest.o rtstand.o

Then rerun <errtest> under <armsd>.  We expect the integer divide by zero to 
occur once again:

    > armsd -li errtest
    A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
    ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
    Object program file errtest
    armsd: go
    (the floating point instruction-set is not available)
    Using integer arithmetic ...
    10000 / 0X0000000A = 0X000003E8
    10000 / 0X00000009 = 0X00000457
    10000 / 0X00000008 = 0X000004E2
    10000 / 0X00000007 = 0X00000594
    10000 / 0X00000006 = 0X00000682
    10000 / 0X00000005 = 0X000007D0
    10000 / 0X00000004 = 0X000009C4
    10000 / 0X00000003 = 0X00000D05
    10000 / 0X00000002 = 0X00001388
    10000 / 0X00000001 = 0X00002710
    10000 / 0X00000000 = errhandler called: code = 0X00000001: divide by 0
    caller's pc = 0X00008310
    returning...
    
    Returning from __err_handler() with errnum = 0X00000001
    
    Program terminated normally at PC = 0x00008558
          0x00008558: 0xef000011 .... : >  swi     0x11
    armsd: quit
    Quitting
    > 

The runtime system detected the integer divide by zero and, as before, 
<__err_handler> was called, producing the error messages.  However, this time 
<__err_handler> used <longjmp> to return control to the program, rather than 
to the runtime system.


Floating Point Support
......................

Using <errtest> we can also demonstrate floating point support.  You should 
already have copied the appropriate floating point emulator object code into 
your working directory.  For the configuration used in this example <fpe_32l.o> 
is the correct object file.

In addition, it is necessary to link with an fpe <stub>, which we must first 
assemble from the source provided (<fpestub.s>):

    armasm fpestub.s -o fpestub.o -li -apcs 3/32bit
    armlink -o errtest errtest.o rtstand.o fpestub.o fpe_32l.o -d

The resulting executable, <errtest>, can be run under <armsd> as before:

    > armsd -li errtest
    A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
    ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
    Object program file errtest
    armsd: go
    (the floating point instruction-set is available)
    Using Floating point, but casting to int ...
    10000 / 0X0000000A = 0X000003E8
    10000 / 0X00000009 = 0X00000457
    10000 / 0X00000008 = 0X000004E2
    10000 / 0X00000007 = 0X00000594
    10000 / 0X00000006 = 0X00000682
    10000 / 0X00000005 = 0X000007D0
    10000 / 0X00000004 = 0X000009C4
    10000 / 0X00000003 = 0X00000D05
    10000 / 0X00000002 = 0X00001388
    10000 / 0X00000001 = 0X00002710
    10000 / 0X00000000 = errhandler called: code = 0X80000202: Floating Point
    Exception : Divide By Zero
    
    caller's pc = 0XE92DE000
    returning...
    
    Returning from __err_handler() with errnum = 0X80000202
    
    Program terminated normally at PC = 0x00008558 (__rt_exit + 0x10)
    +0010 0x00008558: 0xef000011 .... : >  swi     0x11
    armsd: quit
    Quitting
    > 

This time the floating point instruction set is found to be available, and when 
a floating point division by zero is attempted, <__err_handler> is called with 
the details of the floating point divide by zero exception.

Note that if you have compiled <errtest.c> other than as described in 
"<longjmp and setjmp>", you will not see precisely this dialogue with <armsd>.


Running Out of Heap
...................

A second example program, <memtest.c>, demonstrates how the standalone runtime 
system copes with allocating stack space, and also demonstrates the simple 
memory allocation function <__rt_alloc>.  Let us first compile this program so 
that it repeatedly requests more memory until there is none left:

    armcc -li -apcs 3/32bit memtest.c -c
    armlink -o memtest memtest.o rtstand.o

This can be run under <armsd> in the usual way:

    > armsd -li memtest
    A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
    ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
    Object program file memtest
    armsd: go
    kernel memory management test
    force stack to 4KB
    request 0 words of heap - allocate 256 words at 0X000085A0
    force stack to 8KB
    ..
    force stack to 60KB
    request 33211 words of heap - allocate 33211 words at 0X00049388
    force stack to 64KB
    request 49816 words of heap - allocate 5739 words at 0X00069A74
    memory exhausted, 105376 words of heap, 64KB of stack
    Program terminated normally at PC = 0x0000847c
          0x0000847c: 0xef000011 .... : >  swi     0x11
    armsd: quit
    Quitting
    > 

This demonstrates that allocating space on the stack is working correctly, and 
also that the <__rt_alloc> routine is working as expected.  The program 
terminated because in the end <__rt_alloc> could not allocate the requested 
amount of memory.


Stack Overflow Checking
.......................

<memtest> can also be used to demonstrate stack overflow checking by 
recompiling with the macro STACK_OVERFLOW defined.  In this case the amount of 
stack required is increased until there is not enough stack available, and 
stack overflow detection causes the program to be aborted.

To recompile and link <memtest.c> issue the following commands:

    armcc -li -apcs 3/32bit memtest.c -c -DSTACK_OVERFLOW
    armlink -o memtest memtest.o rtstand.o

Running this program under <armsd> produces the following output:

    > armsd -li memtest
    A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
    ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
    Object program file memtest
    armsd: go
    kernel memory management test
    force stack to 4KB
    ...
    force stack to 256KB
    request 1296 words of heap - allocate 1296 words at 0X0000AE20
    force stack to 512KB
    
    run time error: stack overflow
    program terminated
    
    Program terminated normally at PC = 0x0000847c
          0x0000847c: 0xef000011 .... : >  swi     0x11
    armsd: quit
    Quitting
    > 

Clearly stack overflow checking did indeed catch the case where too much stack 
was required, and caused the runtime system to terminate the program after 
giving an appropriate diagnostic.


Extending the Standalone Runtime System
.......................................

For many applications it may be desirable to have access to more of the 
standard C library than just the minimal runtime system provides.  This 
section demonstrates how to extract a part of the standard C library and plug 
it into the standalone runtime system.

The function which we will add to <rtstand> is <memmove>.  Although this is 
small, and easily extracted from the C library source, the same methodology 
can be applied to larger sections of the C library, e.g. the dynamic memory 
allocation system (<malloc>, <free>, etc.).

The source of the C library can be found in the <cl> directory.  The source for 
the <memmove> function is in <string.c>.  The extracted source for <memmove> 
has been put into <memmove.c>, and the compile time option <_copywords> has 
been removed.  The function declaration for <memmove> and a typedef for <size_t> 
(extracted from <include/stddef.h>) have been put into <memmove.h>.

Our <memmove> module can be compiled as follows:

    armcc -c memmove.c -li -apcs 3/32bit

The output, <memmove.o>, can be linked with the user's other object modules 
together with <rtstand.o> in the normal way (see the previous examples in this 
section).


The Size of the Standalone Runtime Library
..........................................

<rtstand.s> has been separated into several code Areas.  The advantage of this 
is that <armlink> can detect if any Areas are unreferenced, and then eliminate 
them from the output image.

The table below shows the typical size of the Areas in <rtstand.o>:

    Area                          Size (bytes)  Functions

    C$$data                         4

    C$$code$$__main                 96    __main, __rt_exit

    C$$code$$__rt_fpavailable       8     __rt_fpavailable

    C$$code$$__rt_trap              128   __rt_trap

    C$$code$$__rt_alloc             68    __rt_alloc

    C$$code$$__rt_stkovf            76    __rt_stkovf_split_*

    C$$code$$__jmp                  100   longjmp, setjmp

    C$$code$$__divide               256   __rt_sdiv, __rt_udiv, __rt_udiv10,
                                          __rt_sdiv10, __rt_divtest

    All Areas                       736

If floating point support is definitely not required, then the 
EnsureNoFPSupport variable can be set to {TRUE}, and some extra space will be 
saved.  After making any modifications to <rtstand.s>, the size of the various 
areas can be found by using the command:

    decaof -b rtstand.o

From the above table it is clear that for many applications the standalone 
runtime library will be roughly 0.5KB.


Related Topics
..............

 *  "<Register Usage under the ARM Procedure Call Standard>";

 *  "<In-Line SWIs>".


ARM Shared Libraries
--------------------


About This Recipe
.................

In this recipe you will learn:

 *  what an ARM shared library is;

 *  how the shared library mechanism works;

 *  how to instruct the ARM linker to make a shared library;

 *  how to make a toy shared library from the string section of the ANSI C 
    library.


About ARM Shared Libraries
..........................

ARM <shared libraries> support the sharing of utility, service or library 
functions between several concurrently executing <client> applications in a 
single address space. Such shared code is necessarily <reentrant>.

If a function is reentrant, each of its concurrently active clients must have 
a separate copy of the data it manipulates on their behalf. The data cannot be 
associated with the code itself unless the data is read-only. In the ARM shared 
library architecture, a dedicated register (called <sb>) is used to address 
(indirectly) the static data associated with a client.

An ARM shared library is read only, reentrant and usually position independent. 
A shared library made exclusively from object code compiled by the ARM C 
compiler will have all three of these attributes. Library components 
implemented in ARM Assembly Language need not be reentrant and position 
independent, but in practice, only position independence is inessential.

A library with all three of these attributes is an ideal candidate for packing 
into a system ROM.

Some shared library mechanisms associate a shared library's data with the 
library itself and put only a place holder in the stub. At run time, a copy of 
the library's initialised static data is copied into the client's place holder 
by the dynamic linker or by library initialisation code.

The ARM shared library mechanism supports these ways of working provided the 
data is free of values which require link-time (or run time) relocation. In 
other words, it can be supported provided the input data areas are free of 
relocation directives.


How ARM Shared Libraries Work
.............................


Stubs and Proxy Functions

When a client application is linked with a shared library, it is linked not 
with the library itself but with a <stub object> containing:

 *  an <entry vector>;

 *  a copy of the library's static data or a place holder for it.

Each member of the entry vector is a <proxy> for a function in the matching 
shared library.

When a client <first> calls a <proxy> function, the call is directed to a 
<dynamic linker>. This is a small function (typically about 50-60 ARM 
instructions) which:

 *  locates the matching shared library;

 *  if required, copies an initial image of the library's static data from the 
    library to the place holding area in the stub;

 *  patches the entry vector so each proxy function points at the corresponding 
    library function;

 *  resumes the call.

Once an entry vector has been patched, all future proxy calls proceed directly 
to the target library function with only minimal indirection delay and no 
intervention by the dynamic linker.

Of course, making an <inter-link-unit> call like this <is> more expensive than 
making a straightforward local procedure call, but not by much. It is also the 
only supported way to call a function more than 32MBytes away.


Locating a Library Which Matches the Stub
.........................................

Locating a matching shared library is specific to a target system and you must 
provide code to do the location, but the remainder of the dynamic linking 
process is generic to all target systems. Consequently, in order to use ARM 
shared libraries, you have to design and implement a library location mechanism 
and adapt the dynamic linker to it. In practice, this is quite straightforward:

 *  the ARM Linker provides support for parameterising a location mechanism;

 *  a basic dynamic linker with neither location nor failure reporting 
    mechanisms is a mere 42 ARM instructions.

Please refer to "<ARM Shared Library Format>" of the 
Reference Manual for a full explanation of parameter blocks.


How the Dynamic Linker Works

The dynamic linker is entered via a proxy call with r0 pointing at the dynamic 
linker's 16-byte entry stub. Following this stub code is a copy of the 
parameter block for the shared library.

Stored in the parameter block is the identity of the library - perhaps a  
32-bit unique identifier or perhaps a string name. Either way, it can be passed 
to the library location mechanism. You have to decide how to identify your 
shared libraries and, hence, what to put in their parameter blocks.

The library location function is required to return the address of the start of 
the library's offset table.

A primitive location mechanism might be to search a ROM for a matching string. 
This would identify the start of the parameter block of the matching shared 
library. Immediately preceding it will be negative offsets to library entry 
points and a non-negative count word containing the number of entry points. By 
working backwards through memory and counting, you can be sure you have found 
the entry vector and can return the address of its count word to the dynamic 
linker.

More sophisticated location schemes are possible, for example:

 *  You might include in your library a header containing code to execute when 
    the library is first loaded (into RAM) or initialised (in ROM) which 
    registers the library's name with a library manager. Obviously, the library 
    manager has to be locatable without using the library manager, so either 
    its address has to be known or its function has to be supported by an 
    underlying system call.

 *  Acorn's RISC OS operating system supports a <module> mechanism which is 
    sometimes used to implement shared libraries. A RISC OS module may, by 
    declaring so in its module header, be called when software interrupts 
    (SWIs) in a declared range occur. When such a module is loaded, it extends 
    the range of SWIs interpreted by RISC OS. We can use this mechanism to 
    locate a shared library by storing the identity of a library location SWI 
    in the library's parameter block and by implementing this SWI in the 
    library module's header.


Instructing the Linker to Make a Shared Library
...............................................


Prerequisites

A shared library can be made from any number of object files, including 
<reentrant stubs> of other shared libraries, but two simple rules must be 
followed:

 *  each object file must conform to a reentrant version of the ARM Procedure 
    Call Standard and each code area must have the REENTRANT attribute;

 *  there may be no unresolved references resulting from linking together the 
    component objects.

An immediate consequence of the second rule is that it is impossible to make 
two shared libraries which refer to one another: to make the second library and 
its stub would require the stub of the first, but to make the first and its 
stub would require the stub of the second.

The first rule is not strictly necessary and is difficult to enforce. The ARM 
Linker warns you if it finds a non-reentrant code area in the list of objects 
to be linked into a shared library but it will build the library and its 
matching stub anyway. You have to decide whether the warning is real, or merely 
a formality.


Linker Outputs

The ARM linker generates a shared library as two files:

 *  a plain binary file containing the read-only, reentrant, usually position 
    independent, shared code;

 *  an AOF format <stub> file with which client applications can be linked.

The linker can also generate a reentrant stub suitable for inclusion in another 
shared library.

The library image file contains, in order:

 *  read only code sections from your input objects;

 *  if so requested, a read only copy of the initialised static data from the 
    input objects;

 *  a table of (negative) offsets from the end of the library to its entry 
    points;

 *  if so requested, the size and offset of the static data image;

 *  a copy of the library's <parameter block>.

You request a copy of the initialised static data to be included in a library 
when you describe to the linker how to make a shared library. If you request 
this, the linker writes the length and offset of the data image immediately 
after the entry vector. During linking, <armlink> defines symbols 
SHL$$data$$Size and SHL$$data$$Base to have these values; components of your 
library may refer to these symbols. Instead of including the static data in the 
stub, <armlink> includes a zero initialised place holding area of the same size. 
It also writes the length and (relocatable) address of this place holding, zero 
initialised stub data area immediately after the dynamic linker's entry veneer, 
giving the dynamic linker sufficient information to initialise the place holder 
at run time. During linking, the linker symbols SHL$$data$$Size and $$0$$Base 
describe this length and relocatable address.

Obviously, any data included in your shared library must be free of relocation 
directives. Please refer to "<ARM Shared Library Format>" of 
the Reference Manual for a full explanation of what kind of data can be 
included in a shared library.

You specify a parameter block when you describe to the linker how to make a 
shared library. You might, for example, include the name of the library in its 
parameter block, to aid its location. An identical copy of the parameter block 
is included in the library's entry vector in the stub file.


Describing a Shared Library to the Linker

To describe a shared library to the linker you have to prepare a file which 
describes:

 *  the name of the library;

 *  the library parameter block;

 *  what data areas to include;

 *  what entry points to export.

For precise details of how to do this, please refer to "<ARM Shared Library 
Format>" of the Reference Manual. Below is an intuitive 
example you can work with and adapt:

    ; First, give the name of the file to contain the library -
    ; strlib - and its parameter block - the single word 0x40000...
    > strlib \
      0x40000
    ; ...then include all suitable data areas...
    + ()
    ; ... finally export all the entry points...
    ; ... mostly omitted here for brevity of exposition.
    memcpy
    ...
    strtok

The name of this file is passed to <armlink> as the argument to the -SHL 
command line option (please refer to "<The ARM Linker (armlink)>" starting on 
page 19 of the User Manual for further details).


Making a Toy String Library
...........................

This section refers to the files collected in the <strlib> subdirectory of the 
<examples> directory of the release.

The header files <config.h> and <interns.h> let you compile cl/string.c 
locally. Little-endian code is assumed. If you want to make a big-endian string 
library you should edit config.h. Similarly, if you want to alter which 
functions are included or whether static data is initialised by copying from 
the library, then you should edit config.h. You do not need to edit interns.h. 
If you use config.h unchanged you will build a little-endian library which 
includes a data image and which exports all of its functions.


Compiling the String Library

To compile string.c, use the following command:

    armcc -li -apcs /reent -zps1 -c -I. ../../cl/string.c

The <-li> flag tells <armcc> to compile for a little-endian ARM.

The <-apcs /reent> flag tells <armcc> to compile reentrant code.

The <-zps1> flag turns off software stack limit checking and allows the string 
library to be independent of all other objects and libraries. With software 
stack limit checking turned on, the library would depend on the stack limit 
checking functions which, in turn, depend on other sections of the C run time 
library. While such dependencies do not much obstruct the construction of full 
scale, production quality shared libraries, they are major impediments to a 
simple demonstration of the underlying mechanisms.

The <-I.> flag tells <armcc> to look for needed header files in the current 
directory.


Linking the String Library

To make a shared library and matching stub from string.o, use the following 
linker command:

    armlink -o strstub.o -shl strshl -s syms string.o

<strlib>'s stub will be put in <strstub.o> as directed by the -o option.

The file <strshl> contains instructions for making a shared library called 
<strlib>. A shortened version of it was shown in the earlier section "
<Describing a Shared Library to the Linker>".

The option <-s syms> asks for a listing of symbol values in a file called 
<syms>. You may later need to look up the value of EFT$$Offset (it will be 
0xA38 if you have changed nothing). As supplied, the dynamic linker expects a 
library's external function table (EFT) to be at the address 0x40000. So, 
unless you extend the dynamic linker with a library location mechanism (please 
refer to the discussion in the earlier section "<How the Dynamic Linker Works>" 
starting on page 96), you will have to load <strlib> at the address 
0x40000-EFT$$Offset.


Making the Test Program and Dynamic Linker

Now you should assemble the dynamic linker and compile the test code:

    armasm -li dynlink.s -o dynlink.o
    armcc -li -c strtest.c

You can extend the test code to probe lots of string functions, but this is 
left as an exercise to help you understand what is going on.

To make the test program you must link together the test code, the dynamic 
linker, the string library stub and the appropriate ARM C library (so that 
references to library members other than the string functions can be resolved):

    armlink -d -o strtest strtest.o dynlink.o strstub.o ../../lib/armlib.32l


Running the Test Program with the Shared String Library

Now you are ready to try everything under the control of command-line armsd:

    <host-prompt> armsd strtest
    A.R.M. Source-level Debugger version ...
    ARMulator V1.30, 4 Gb memory, MMU present, Demon 1.1,...
    Object program file strtest
    armsd: getfile strlib 0x40000-0xa38
    armsd: go
    
    strerror(42) returns unknown shared string-library error 0x0000002A
    
    Program terminated normally at PC = 0x00008354 (__rt_exit + 0x24)
    +0024 0x00008354: 0xef000011 .... :    swi      0x11
    armsd: q
    Quitting
    <host-prompt>

Before starting <strtest> you must load the shared string library by using:

    getfile strlib 0x40000-0xa38

<strlib> is the name of the file containing the library; 0x40000 is the hard 
wired address at which the dynamic linker expects to find the external function 
table; and 0xa38 is the value of EFT$$Offset, the offset of the external 
function table from the start of the library.

When <strtest> runs, it calls <strerror(42)>, which causes the dynamic linker 
to be entered, the static data to be copied, the stub vector to be patched and 
the call to be resumed. You can watch this in more detail by setting a 
breakpoint on __rt_dynlink and single stepping.


Suggested Further Exercises
...........................


Library Location Mechanisms

Locating a library's EFT at 0x40000 is not very satisfactory, so an obvious 
exercise is to extend the dynamic linker to locate a library by looking for it. 
Try, for example, adding a header to the start of the library which contains:

 *  offset to the next loaded library or 0

 *  the total length of the library

 *  the offset to the external function table

 *  the string name of the library

Hint: when you link this area with the other library contents you have to 
ensure that it will precede all other areas in the library. Please refer to 
"<Area Placement and Sorting Rules>" of the Reference Manual for further 
details.

Your dynamic linker could now search a list of libraries loaded at 0x40000 
onwards.


Self-Loading Libraries

You could extend the header mechanism described in the previous subsection so 
that a library could copy itself to the next free location above 0x40000. This 
would allow libraries to be loaded at 0x8000 and 'executed' there. Of course, 
you would want your header to begin with a branch to the code which will copy 
the library from 0x8000 to its destination above 0x40000.


Multiple Shared Libraries

Once you have built location and loading mechanisms, you can build more than 
one shared library. Try making one of your own and linking a test program with 
the stubs of two or more libraries.


Inter-Library Calls

Once you have multiple libraries working, you can try making one library call 
functions in another (but remember that if library A refers to library B then 
library B <may not> refer to library A). To do this you will have to make a 
reentrant stub for the library you wish to refer to and link this into the 
library making the reference.


Related Topics
..............

 *  "<Register Usage under the ARM Procedure Call Standard>".
