
                           CAROBASE format reference


    For further information contact the author: Anthony Naggs
        Email;          amn@ubik.demon.co.uk
        Paper mail;     PO Box 1080
                        PEACEHAVEN
                        East Sussex
                        BN10 8PZ
                        Great Britain
        Telephone, UK;  0273 589701
            overseas;   +44 273 589701


    Document history:

        Prior to July 1993, created and maintained by Vesselin Bontchev,
        VTC, University of Hamburg.


                       ** PROVISIONAL ** revised edition,
                           produced by Anthony Naggs
                               on 15 August 1993.


The CaroBase FILE will include some special records at the beginning and
end:

CAROBASE_BANNER:
        Normally at the start of the file.  This will be a multi-line
        text field, including copyright claims and distribution details.
CAROBASE_LINGUA:
        Keyword indicating the language of text entries:
        English     - British/International English
        Others, if/when produced;
        Deutsch     - German
        Francais    - French
        ...
CAROBASE_START:
        Marks the end of the banner, and immediately precedes the first
        Carobase entry.

CAROBASE_END:
        Immediately follows the last Carobase entry.
        Authenticity information will be placed here, (eg a PGP signature).


CaroBase entry specification:

NAME:
        Full standard CARO name
ALIASES:
        Comma delimited list of all known aliases, (blank if none).

        Names may be placed in double quotes ("s), as other people's  PROPOSAL
        names might include punctuation.
NAME_HISTORY:
        Comma separated list of previous CARO names, most recent
        first, blank if no changes.
LAST_NAME_CHANGE:
        Date of the most recent name change.
TARGETS:
        Comma delimited list of keywords. Possible keywords:
        COM   - infects COM files. The file type is determined from the
                first two bytes of the file.
        EXE   - infects EXE files. The file type is determined from the
                first two bytes of the file.
        .COM  - infects COM files. The file type is determined from the
                file extension.
        .EXE  - infects EXE files. The file type is determined from the
                file extension.
        ZM    - when determining the file type, a check for ZM is also
                made.
        SYS   - infects device drivers.
        OVR   - infects overlays (by mistake).
        NE_W  - supports the NewEXE file format and can infect Windows
                applications correctly.
        NE_OS2- supports the NewEXE file format and can infect OS/2
                applications correctly.
        NewEXE- supports the NewEXE file format, but does not check
                the host OS type.
        DIR   - infects file systems (e.g., Dir_II).
        MBR   - infects the MBR of the hard disk.
        FBR   - infects the DBR of the floppy disks.
        HBR   - infects the DBR of the hard disk.
        BAT   - infects BAT files.
        OBJ   - infects OBJ files.
        LIB   - infects libraries.
        COMP  - companion virus.
        OTHER - infects other kinds of objects, not listed above.

Viruses like Frodo (that infect files for which the extension satisfies a
broader condition) are described as "COM, EXE, OTHER".

For the companion viruses we could have a list of the spoofed extensions
in parentheses. For instance, the currently existing ones could be
described as "COMP (EXE, BAT)".
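
The keyword lists above, with their optional qualifiers in parentheses,
could be read along the following lines.  This is only a sketch in
Python, not part of any CaroBase tool; the function names
split_top_level and parse_keyword_list are hypothetical, and it assumes
that commas inside parentheses separate qualifiers rather than keywords.

```python
def split_top_level(text, sep=","):
    """Split text on sep, ignoring separators inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        if ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    # Drop empty items, so a blank field yields an empty list.
    return [p for p in parts if p]


def parse_keyword_list(value):
    """Return a list of (keyword, qualifiers) pairs for one field value."""
    result = []
    for item in split_top_level(value):
        keyword, _, rest = item.partition("(")
        qualifiers = []
        if rest:
            inner = rest.rsplit(")", 1)[0]
            qualifiers = [q.strip() for q in inner.split(",") if q.strip()]
        result.append((keyword.strip(), qualifiers))
    return result
```

With this sketch, parse_keyword_list("COMP (EXE, BAT)") yields a single
pair, ("COMP", ["EXE", "BAT"]), while "COM, EXE, OTHER" yields three
keywords with no qualifiers.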

RESIDENT:
        Whether the virus is resident and where in memory it
        resides. Possible keywords:
        NONE      - The virus is not memory resident.
        PAYLOAD   - The virus is not memory resident, but installs a
                    resident payload.
        IVT       - The virus resides in the interrupt vector table.
        FIRSTMCB  - The virus resides in the MCB of DOS (e.g., Dir_II).
        BUFFER    - The virus resides in the DOS buffers (e.g., Int13).
        LOW       - The virus resides as a TSR (e.g., Jerusalem).
 * OBSOLETE:
 *      TWIXT     - The virus resides above the last MCB (e.g.
 *                  Dark_Avenger).
 * Replaced by:
        TwixtAny  - Shrinks the current MCB, and becomes resident in
                    the stolen space.  (Can leave a mess in memory).
        TwixtZ    - only shrinks the current MCB if it is a 'Z' block.
                    (Only leaves a mess if there is another chain of
                    MCBs, eg for UMBs).
        NewMCB    - shrinks the current MCB, and creates a new MCB in
                    the released space - copying the original MCB
                    marker.
        NewEndMCB - shrinks the current MCB setting it to 'M', and
                    makes a new 'Z' MCB in the released space.
 *
        UPPER     - The virus looks for free memory above the 640 Kb
                    limit and installs itself there (e.g., EDV).
        HIGH      - The virus resides in the high memory (i.e., between
                    1 Mb and (1 Mb + 64 Kb - 16)).
        TOP       - The virus resides at the top of memory, reducing the
                    BIOS memory size at 0000:0413 (like most boot
                    sector infectors).
        UNMARKED  - The virus installs itself at the end of the available
                    memory (regardless where exactly this end is), without
                    marking the occupied memory as allocated.
        VIDEO     - The virus resides in the video RAM (e.g., VComm).
        EXT       - The virus resides in the Extended memory.
        EXP       - The virus resides in the Expanded memory.
        AT addr   - The virus resides at a particular memory address
                    (either without marking it as allocated, like
                    Stupid, or in a memory hole in DOS, like The_Rat).
        OTHER     - The virus uses some other technique to install itself
                    in memory, not described above.

        Below     - adjusts MCBs so that the resident virus is        PROPOSAL
                    included in the memory block below where it
                    loaded in memory.
MEMORY_SIZE:
        How much memory the virus occupies. The number is in bytes by
        default, but can be followed by K (kilobytes) or P
        (paragraphs).
STORAGE_SIZE:
        Size of the virus on the disk. The default is in bytes, but
        you can append K (kilobytes), C (clusters), or S (sectors).
        In the cases when the virus pads the infected file to some
        multiple of N bytes, the size is expressed as virus_size+N.
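
A minimal sketch in Python of how MEMORY_SIZE and STORAGE_SIZE values
might be normalised to bytes.  It assumes that K means 1024 bytes and
that P means a 16-byte paragraph.  S and C depend on the disk, so the
sector size defaults to an assumed 512 bytes and the cluster size must
be supplied by the caller.  The name parse_size and its parameters are
hypothetical, not part of the format.

```python
import re

# number, optional unit letter, optional "+N" padding multiple
_SIZE_RE = re.compile(r"^\s*(\d+)\s*([KPCS]?)\s*(?:\+\s*(\d+))?\s*$",
                      re.IGNORECASE)


def parse_size(value, sector_bytes=512, cluster_bytes=None):
    """Return (size_in_bytes, padding_multiple_or_None)."""
    m = _SIZE_RE.match(value)
    if not m:
        raise ValueError("unrecognised size: %r" % value)
    number, unit, pad = int(m.group(1)), m.group(2).upper(), m.group(3)
    factors = {"": 1, "K": 1024, "P": 16,
               "S": sector_bytes, "C": cluster_bytes}
    factor = factors[unit]
    if factor is None:
        # Cluster size varies between disks; refuse to guess.
        raise ValueError("cluster size must be given for unit C")
    return number * factor, (int(pad) if pad else None)
```

For example, parse_size("2K") and parse_size("128P") both give
(2048, None), and a padded size written as "1701+16" gives (1701, 16).
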
WHERE:
        Where the virus resides. Comma delimited list of keywords.
        Possible keywords for file infectors:
        OVERWRITES - The virus overwrites part of the file, destroying
                     it (e.g., Burger).
        PREPENDING - The virus prepends itself to the files (e.g.,
                     Jerusalem with COM files).
        MOVE       - The virus overwrites the beginning of the file,
                     appending the overwritten part after the end of
                     the file (e.g., Anti-Pascal).
 * OBSOLETED:
 *      APPENDING  - The virus appends itself to the files (e.g.,
 *                   Jerusalem with EXE files).
 * Replaced by:
        EOIMAGE    - Uses length from EXE header to position virus.
        EOFILE     - Uses DOS file length to position virus.
 *
        HEADER     - The virus installs itself in the EXE header
                     (e.g., The_Rat or Phoenix.2000).
        SPLIT      - The virus inserts itself between the EXE header
                     and the rest of the file (e.g., Suriv 2.01).
        DATA       - The virus overwrites a constant data area of the
                     file (e.g., Lehigh, Squisher).
        RANDOM     - The virus inserts itself at a random place in the
                     file (e.g., Bomber).
        SLACK      - The virus preserves the original file length by
                     using the slack cluster space after the end of
                     the file (e.g., Int13).  Liable to make a mess on
                     network drives or with SuperStor, Stacker ...
        COMPANION  - The virus is of companion type.
        OTHER      - The virus uses a new technique, not listed above.
        When a virus uses more than one technique for different files,
        this is indicated in parentheses.
        Example: PREPENDING (COM), APPENDING (EXE), SPLIT (NewEXE),
                 DATA (COMMAND.COM), OVERWRITES (SYS).
        For boot sector viruses, this field describes where the rest
        of the virus (and possibly the original boot sector) resides.
        Possible keywords are:
        AT ttt/hh/ss - at track ttt, head hh, sector ss.
        AT_LSN nn    - at logical sector number nn.
        AT_CN nn     - at cluster number nn.
        TRACK nn     - on an additional track nn.
        BAD          - in a bad cluster.
        When a virus uses more than one technique for different disks,
        the kind of disk is indicated in parentheses.
        When the number depends on the size of the disk, this size
        could be used in an arithmetical expression. There are some
        reserved words, like LAST_C, LAST_S, LAST_T, and LAST_R, to
        indicate the last cluster of a partition, the last sector of
        a track, the last track of a disk, or the last sector of the
        root directory of a disk(ette).
        Example: AT 0/0/7 (HARD), AT 1/0/3 (360), AT 1/0/14 (1.44).
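
The "AT ttt/hh/ss" form in the example above could be extracted with a
sketch like the following, written in Python.  It handles only purely
numeric track/head/sector triples with an optional disk kind in
parentheses; arithmetic expressions and the LAST_* reserved words are
deliberately not handled.  The name parse_chs_locations is hypothetical.

```python
import re

# "AT" followed by track/head/sector and an optional "(disk kind)"
_CHS_RE = re.compile(r"\bAT\s+(\d+)/(\d+)/(\d+)\s*(?:\(([^)]*)\))?")


def parse_chs_locations(value):
    """Return a list of (track, head, sector, disk_kind_or_None)."""
    return [
        (int(t), int(h), int(s), disk.strip() if disk else None)
        for t, h, s, disk in _CHS_RE.findall(value)
    ]
```

Applied to the example line, this gives (0, 0, 7, "HARD"),
(1, 0, 3, "360") and (1, 0, 14, "1.44").
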
STEALTH:
        List of the interrupts and functions, subverted by the virus.
        Special keywords are NONE for no stealth and DRIVER for device
        driver requests.
POLYMORPHIC:
        How polymorphic the virus is.  Comma delimited list of
        keywords. Possible keywords:
        NONE    - the virus is not even encrypted.
        CONST   - the virus uses encryption with a constant key, in
                  order to garble some fields, possibly containing
                  text messages or the original parts of the infected
                  file.
        VAR     - the virus uses variable encryption with a constant
                  decryptor.
        WILDCARD- the virus uses variable encryption with a variable
                  decryptor, but the decryptor can be detected with
                  a wildcard string (e.g., Phoenix).
        POLY-nn - the virus uses variable encryption with a variable
                  decryptor. The number nn specifies the number of
                  constant bytes (at any place) in the decryptor.
                  Therefore, nn==0 is the strongest polymorphism.
        ENTRY   - The virus hides the entry point, like Bomber.
        SWAP    - The virus permutes parts of its body, like BadBoy.
        OTHER   - The virus uses some other polymorphic technique, not
                  listed above.
ARMOURING:
        Armouring tricks used by the virus. A list of keywords.
        Possible keywords are:
        NONE      - No armouring tricks are used.
        CODE      - The virus uses a special coding style to fool most
                    disassemblers (particularly Sourcer).
        CRYPT     - The virus uses multiple level encryption.
        TRACE     - The virus disables INT 1 and INT 3.
        KBD       - The virus disables the keyboard.
        SELFTRACE - The virus uses INT 1 and INT 3 to decrypt itself.
        INT1      - The virus uses INT 1 for some of its functions
                    (e.g., instead of INT 21h).
        INT3      - The virus uses INT 3 for some of its functions
                    (e.g., instead of INT 21h).
        PREFETCH  - The virus uses the prefetch queue of the CPU to
                    determine if it is being traced.
        OTHER     - The virus uses some other armouring technique,
                    not listed above.
TUNNELLING:
        Level at which the virus tunnels. Possible keywords:
        NONE     - No tunnelling is used.
        NEXT     - The virus can bypass the last loaded TSR program.
        HAND21   - The virus can find the original INT 21h handler and
                   call it directly.
        DRIVER   - The virus uses device driver requests.
        SECTOR   - The virus uses INT 13h to access the disk.
        HAND13   - The virus can find the original handler (in DOS) of
                   INT 13h and calls it directly.
        BIOS     - The virus uses direct calls to the ROM BIOS, in
                   order to access the disk.
        HARDWARE - The virus controls the hardware directly.
        OTHER    - The virus uses another tunnelling technique, not
                   listed above.
        In the case of HAND21, HAND13, and BIOS, a keyword can be
        supplied in parentheses, indicating the method used to obtain
        the address of the particular interrupt handler. These keywords
        can be:
        TRACE - The virus uses interrupt tracing.
        2F    - The virus uses INT 2Fh/AH=13h.
        TABLE - The virus has a lookup table of known addresses. The
                table could consist of only one entry, e.g. the virus
                could obtain the address of the DOS segment and assume
                that the handler is at a particular offset in this
                segment.
        SCAN  - The virus scans the memory for a string of bytes and
                if it is found, uses the address at a particular offset
                from the address at which the string has been found.
        OTHER - The virus uses some other method, not listed above.
INFECTIVITY:
        We decided to keep this field anyway, due to popular demand.
        It contains two keywords - WITHIN and BETWEEN, indicating how
        infective the virus is within the different infectable objects
        in a single machine (or on a LAN) and between unconnected
        machines (i.e., when it can be transferred only via floppies).
        After each keyword, in parentheses, there is a number,
        indicating how infective the virus is for this particular
        category. The possible numbers and their meanings are:
        0 - Not a virus.
        1 - Needs "spoonfeeding".
        2 - As infective as an overwriting virus could be.
            Example: Burger.
        3 - As infective as a non-resident virus that infects only one
            file when an infected file is executed could be.
            Example: DataCrime.
        4 - As infective as a simple resident virus, which infects only
            when a file is executed. Example: Eddie-2.
        5 - As infective as a fast infector. Example: Dark Avenger.
        6 - As infective as an MBR infector. Example: Stoned.
        7 - A superfast infector. Example: Dir_II.
OBVIOUSNESS:
        This field replaces the two fields (VISIBILITY and AUDIBILITY)
        which were present in the initial draft. The opinion of the
        majority is that a field like this must be present.
        A keyword, showing how obvious for the user this virus is. Of
        course, the value is subjective. Possible keywords are:
        EXTREMELY - The virus is extremely obvious - it has a payload
                    or a side effect that just cannot remain unnoticed
                    (e.g., Diamond).
        QUITE     - The virus is quite obvious - it is very probable
                    that its payload or side effects will be noticed
                    (e.g., Yankee Doodle).
        SLIGHTLY  - The virus is only slightly obvious - its payload
                    or side effects can only rarely be noticed (e.g.,
                    Stoned).
        NONE      - The virus is not obvious at all - it can be noticed
                    only by chance or by using an anti-virus program.
COMMONNESS:
        A number, indicating how widespread the virus is. Optionally,
        a geographical region can be indicated in parentheses. The
        possible numbers are:
        0 - not in the wild and not likely to be (too obvious,
            too buggy, overwriting, etc.)
        1 - unreliable reports (a few unconfirmed reports about this
            virus being in the wild).
        2 - reliable reports (several but not many reports about this
            virus being in the wild).
        3 - confirmed reports (at least one report about this virus
            being in the wild, from a reliable anti-virus researcher).
        4 - common (the virus is very common in the wild, e.g. Cascade).
        5 - extremely common (the virus is extremely widespread, e.g.
            Stoned).
        6 - extinct (the virus has been in the wild in the past, maybe it
            has even been a rather common one, but now it is not reported
            any more).
        It is possible to include a list of these numbers, delimited
        with commas, to indicate different commonness of this virus in
        different parts of the world, indicated in parentheses. For
        example, Dir_II can be described as 3, 5 (Bulgaria).
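
As an illustration, a COMMONNESS value such as the one in the example
above might be decoded as follows.  This is a sketch in Python; the
names COMMONNESS_LEVELS and parse_commonness are hypothetical, the
labels are abbreviated from the list above, and region names are
assumed not to contain commas.

```python
import re

# Short labels taken from the numbered list in this field's description.
COMMONNESS_LEVELS = {
    0: "not in the wild",
    1: "unreliable reports",
    2: "reliable reports",
    3: "confirmed reports",
    4: "common",
    5: "extremely common",
    6: "extinct",
}

# A number, optionally followed by "(region)".
_ITEM_RE = re.compile(r"\s*(\d+)\s*(?:\(([^)]*)\))?\s*")


def parse_commonness(value):
    """Return a list of (level, region_or_None) pairs."""
    result = []
    for item in value.split(","):
        m = _ITEM_RE.fullmatch(item)
        if not m:
            raise ValueError("unrecognised item: %r" % item)
        level = int(m.group(1))
        if level not in COMMONNESS_LEVELS:
            raise ValueError("unknown commonness level: %d" % level)
        region = m.group(2).strip() if m.group(2) else None
        result.append((level, region))
    return result
```

Here parse_commonness("3, 5 (Bulgaria)") gives
[(3, None), (5, "Bulgaria")].
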
COMMONNESS_DATE:
        A date, indicating the last time the above field was modified.
        The date is entered in the following format: yyyy-mm-dd.
TRANSIENT_DAMAGE:
        Non-destructive effects of the virus. Don't see any way to
        tokenize this, so we decided to leave it a free-text field.
T_DAMAGE_TRIGGER:
        Condition, describing when the transient damage occurs. In
        general, it is a boolean expression, which can be expressed
        rather compactly, but I see no way to standardize it
        completely. That is why we decided to leave it a free-text
        field. However, when describing it, try to use some kind of
        restricted syntax. We might tokenize this field in the future,
        if it turns out to be possible.
        Example: (DayOfWeek = Sunday) and (VirusInMemorySince = 30 min).
PERMANENT_DAMAGE:
        What permanent destruction the virus causes. Same problems as
        above, so we decided to leave it a free-text field.
P_DAMAGE_TRIGGER:
        Similar to the above.
        Example: (DayOfWeek = Friday) and (DayOfMonth = 13) and Exec.

When there is more than one damage effect (transient or permanent),
they are listed in multiple TRANSIENT_DAMAGE and PERMANENT_DAMAGE
fields, with the appropriate TRIGGER field following each of them.

SIDE_EFFECTS:
        Known side effects, caused by the virus. Yet another thing
        that I don't know how to tokenize, so we decided to leave it
        in free-text format.
INFECTION_TRIGGER:
        When the virus decides to infect. It seems very difficult to
        tokenize this field, so we decided to leave it in free-text
        form. However, when describing it, try to use some kind of
        restricted syntax - we might try to tokenize it in the future,
        if it turns out to be possible.
        Example: (Exec or Copy) and (LengthCOM > 10) and (LengthEXE
        > 512) and (LengthCOM < 64000) and (Random mod 8 <> 0).
MSG_DISPLAYED:
        Strings that the virus displays, in quotes. If the string is
        encrypted, "; Encrypted" is appended after the closing quote.
        Text using graphics characters > 127 or < 32 should
        be qualified, eg stating the required Code Page to correctly
        view them:
            "FGHT" DAh EBh CAh "CAS" 0Dh 0Ah; Code Page 852
        (Needs extending, to handle (Slavic) alphabets supported by TSRs
        instead of Code Pages).
MSG_NOT_DISPLAYED:
        Other strings that the virus contains, in quotes. If the string
        is encrypted, "; Encrypted" is appended after the closing quote.
INTERRUPTS_HOOKED:
        Comma delimited list of the interrupts and functions
        intercepted by the virus. All numbers are in hex. The
        functions are described as INTERRUPT/FUNCTION.
        Example: 21/4B00, 24, 21/3D, 13/2, 13/3.
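
A short sketch, in Python, of reading an INTERRUPTS_HOOKED list into
numbers.  The name parse_interrupts is hypothetical.  Note that turning
the function part into an integer discards its digit count, so a reader
that needs to tell a two-digit function from a four-digit one (as in
21/3D and 21/4B00) would have to keep the original text as well.

```python
def parse_interrupts(value):
    """Return a list of (interrupt, function_or_None), parsed as hex."""
    hooks = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        intr, _, func = item.partition("/")
        hooks.append((int(intr, 16), int(func, 16) if func else None))
    return hooks
```

For the example list above (without the closing full stop) the result
is [(0x21, 0x4B00), (0x24, None), (0x21, 0x3D), (0x13, 2), (0x13, 3)].
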
SELFREC_IN_MEMORY:
        How the virus detects itself in memory. If it is an "Are you
        there?" call, this entry contains boolean expressions like
        INT_21;AX=4BFF -> AX=1234. The check for a value in memory
        can be tokenized as "[address] = value" or
        "[address]-[address] = comma_delimited_list_of_values".
        When a virus compares the memory image with its whole body,
        the keyword COMPARES is used.
        The above are just suggestions. In general, this is a text
        field, but please try to use the above restricted syntax
        whenever possible - we might try to completely tokenize this
        field in the future.
SELFREC_ON_DISK:
        How the virus recognizes the infected objects. Some tokenizable
        things are expressions, involving FileTime.Seconds (or Minutes,
        Hours), FileDate.Day (or Month, or Year), File[Position],
        Disk[LogicalDiskAddress], PDisk[PhysicalDiskAddress], etc.
        Unfortunately, this does not work in all cases, so we decided
        to leave this field in free-text format. However, try to use
        the restricted syntax suggested above, it might be possible to
        tokenize this field completely in the future.
LIMITATIONS:
        A boolean expression, describing the limitations of the virus,
        if it requires some special software or hardware to run.
        Example: (CPU >= 286) and (DOS = "PC-DOS 3.30") or (DOS =
        "Windows 3.1").
        Special keyword: NONE, if there are no limitations.
        Unfortunately, the above cannot cover all possible cases, so
        we decided to leave this as free-text field. Just try to use
        the restricted syntax suggested above in all cases whenever
        possible - we might try to completely tokenize this field in
        the future.
COMMENTS:
        Natural language comments; everything that you want to say
        about the virus, which does not fit in the above format.
ANALYSIS_BY:
        Who has analysed the virus.
DOCUMENTATION_BY:
        Who has written the database entry.

When more than one person is listed in the above fields, they should
be listed one per line. Each person should be listed with his affiliation.
When a person just introduces some corrections in the database entry,
s/he should be listed in the DOCUMENTATION_BY field.

ENTRY_DATE:
        Date when the database entry has been created.
        The date is entered in the following format: yyyy-mm-dd.
LAST_MODIFIED:
        Date when the database entry has been last updated.
        The date is entered in the following format: yyyy-mm-dd.
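
The yyyy-mm-dd format used by the date fields can be checked with a few
lines of Python.  This is a sketch only; the name parse_entry_date is
hypothetical, and it assumes that both month and day are always written
with two digits.

```python
import re
from datetime import date

_DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def parse_entry_date(value):
    """Return a datetime.date, or raise ValueError if malformed."""
    m = _DATE_RE.fullmatch(value.strip())
    if not m:
        raise ValueError("expected yyyy-mm-dd, got %r" % value)
    year, month, day = map(int, m.groups())
    # date() itself rejects impossible dates such as 1993-02-30.
    return date(year, month, day)
```
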
SEE_ALSO:
        List of the contents of the NAME fields of other viruses, which
        are related to the virus described here.
END:
        Indicates the end of the entry for this virus description.
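
This reference does not fix the physical layout of an entry, so the
following Python sketch rests on an assumption: each field name starts
in column one and ends with a colon, and any indented line continues
the previous field.  It expects the text between CAROBASE_START and
CAROBASE_END.  Fields are kept as an ordered list of pairs, because
TRANSIENT_DAMAGE and PERMANENT_DAMAGE may be repeated.  The name
parse_entries and the virus names in the usage note are hypothetical.

```python
import re

# A field name in column one, a colon, then an optional value.
_FIELD_RE = re.compile(r"^([A-Z][A-Z0-9_]*):\s*(.*)$")


def parse_entries(text):
    """Return a list of entries, each a list of (field, value) pairs."""
    entries, current = [], []
    for line in text.splitlines():
        m = _FIELD_RE.match(line)
        if m:
            name, value = m.group(1), m.group(2).strip()
            if name == "END":
                # END closes the entry collected so far.
                entries.append([(n, v) for n, v in current])
                current = []
            else:
                current.append([name, value])
        elif line.strip() and current:
            # Indented, non-blank line: continuation of the last field.
            joined = current[-1][1] + "\n" + line.strip()
            current[-1][1] = joined.strip()
    return entries
```

Given two entries named Example.A and Example.B, each closed by END:,
the function returns two lists whose first pair is ("NAME", ...).
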


Notes:

1.  When a field in the description has the same value as the same
    field in another entry, then instead of repeating the entry, one
    can use the keyword LIKE, followed by the contents of the NAME
    field of the relevant description, in which this particular field
    has the same contents.

2.  Keyword fields may have additional comments.  They should be
    denoted using Pascal syntax, with matching '{' & '}', or '(*' &
    '*)'.

3.  A special style of comment is recommended to mark entries which
    you assume are correct but have not verified, with a '?'
    immediately after the open comment character(s), eg '{?' or '(*?'.

4.  Entries submitted by non-CARO members must be accompanied by a
    sample of the virus.  Where the virus infects different types of
    object, (eg Master Boot Sector, .EXE files), an example of each
    is preferred.
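
The comment conventions of notes 2 and 3 could be handled as in the
Python sketch below, which removes Pascal-style comments from a keyword
field and records whether each one carries the '?' marker.  The name
split_comments is hypothetical.  The sketch assumes comments do not
nest, and it should not be applied to quoted message strings, where
braces may be literal text.

```python
import re

# Either {...} or (*...*); non-greedy, so adjacent comments stay apart.
_COMMENT_RE = re.compile(r"\{(.*?)\}|\(\*(.*?)\*\)", re.DOTALL)


def split_comments(value):
    """Return (value_without_comments, [(comment_text, unverified)])."""
    comments = []

    def collect(match):
        body = match.group(1)
        if body is None:
            body = match.group(2)
        unverified = body.startswith("?")
        comments.append((body.lstrip("?").strip(), unverified))
        return ""

    cleaned = _COMMENT_RE.sub(collect, value)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Remove the space a deleted comment leaves before a comma.
    cleaned = re.sub(r"\s+,", ",", cleaned)
    return cleaned, comments
```

For instance, "COM {? sample not checked}, EXE (* header only *)"
becomes "COM, EXE", with the first comment flagged as unverified and
the second not.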

