
Pentium Optimization Cross-Reference by Instruction

The following is a list of optimizations that may come in handy.
The instructions are listed alphabetically (more or less) in the first
column.

The second column lists the CPU or CPUs that the optimization applies
to; alternatively, it may be marked as applicable to 16-bit code or
32-bit code.
The third column contains one or more replacement code sequences that
are either faster or smaller (sometimes both) than the instruction in
the first column. For some obscure optimizations, the action of the
first-column instruction is explained instead.
The fourth column contains a description and/or examples.

                           replacement
instruction     CPUs       or action            description/notes
---------------------------------------------------------------------------

aad (imm8)      all        AL = AL+(AH*imm8)    If imm8 is omitted, 10 is used.
                          AH = 0               AAD is almost always slower,
                                               but only 2 bytes long.

aam (imm8)      all        AH = AL/imm8         Same as AAD.
                          AL = AL MOD imm8

add             16-bit     lea reg, [reg+reg+disp]

                                               Use LEA to add
                                               base + index + displacement.
                                               LEA also preserves flags;
                                               for example:

                                                 add bx, 4

                                               can be replaced by:

                                                 lea  bx, [bx+4]

                                               when the flags must not
                                               be changed.

add             32-bit     lea reg, [reg+reg*scale+disp]

                                               Use LEA to add
                                               base + scaled index + disp.
                                               LEA also preserves flags
                                               (see previous example).
                                               The 32-bit form of LEA
                                               is much more powerful
                                               than the 16-bit version
                                               because of the scaling
                                               and the fact that almost
                                               all of the 8 general-purpose
                                               registers can be used
                                               as base and index registers.

and reg, reg    Pent       test reg, reg        Use TEST instead of AND
                                               on the Pentium because
                                               fewer register conflicts
                                               will result in better pairing.

bswap           Pent       ror eax, 16          Pairs in U pipe; BSWAP
                                               doesn't pair.
                                               Disadvantage: modifies flags.
                                               (Not a direct replacement:
                                               ROR EAX, 16 swaps the two
                                               16-bit halves, not all
                                               four bytes.)

call dest1      286+       push offset dest2    When CALL is followed by
jmp  dest2                 jmp  dest1           a JMP, change the return
                                               address to the JMP destination.

call dest1      all        jmp  dest1           When a CALL is followed by a
ret                                             RET, the CALL can be replaced
                                               by a JMP.

cbw             386+       mov ah, 0            When you know AL < 128
                                               use MOV AH, 0 for speed.
                                               But use CBW for smaller
                                               code size.

cdq             486+       xor edx, edx         When you know EAX is positive.
                                               Faster, better pairing.

                                               Disadvantage: modifies flags.

               Pent       mov edx, eax         When EAX value could be
                          sar edx, 31          positive or negative,
                                               because of better pairing.

cmp mem, reg    286        cmp reg, mem         reg, mem is 1 cycle faster

cmp reg, mem    386        cmp mem, reg         mem, reg is 1 cycle faster

dec reg16                  lea reg16, [reg16 - 1]   Use to preserve flags
                                                    for BX, BP, DI, SI

dec reg32                  lea reg32, [reg32 - 1]   Use to preserve flags
                                                    for EAX, EBX, ECX, EDX,
                                                    EDI, ESI, EBP

div <op>        8088       shr accum, 1         When <op> resolves to 2, use
                                               shift for division.
                                               (use CL for 4, 8, etc.)

div <op>        186+       shr accum, n         When <op> resolves to a power
                                               of 2 use shifts for division.

enter imm16, 0  286+       push bp              ENTER is always slower
                          mov  bp, sp          and 4 bytes in length.
                          sub  sp, imm16       If imm16 = 0 then push/mov
                                               is smaller.

               386+       push ebp
               32-bit     mov  ebp, esp
                          sub  esp, imm16

inc reg16                  lea reg16, [reg16 + 1]   Use to preserve flags
                                                    for BX, BP, DI, SI

inc reg32                  lea reg32, [reg32 + 1]   Use to preserve flags
                                                    for EAX, EBX, ECX, EDX,
                                                    EDI, ESI, EBP

jcxz <dest>:    486+       test cx, cx          JCXZ is faster and
                          je   <dest>:          smaller on 8088-286.
                                               On the 386 it is
                                               about the same speed.

jecxz <dest>:   486+       test ecx, ecx        Never use JCXZ/JECXZ on
                          je   <dest>:          486 or Pentium except for
                                               compactness.

lea reg, mem   8088-286    mov reg, OFFSET mem  MOV reg, imm is faster
                                               on 8088-286. On 386+
                                               they are the same.

       Note: There are many uses for LEA; see: add, inc, dec, mov, mul

leave           486+       mov sp, bp           LEAVE is only 1 byte
                          pop bp               long and is faster
                                               on the 186-386. The
                          mov esp, ebp         MOV/POP is much faster
                          pop ebp              on 486 and Pentium

lodsb           486+       mov al, [si]         LODS is only 1 byte long
                          inc si               and is faster on 8088-386,
                                               much slower on the 486.
                                               On the Pentium the MOV/INC
                                               or MOV/ADD instructions
                                               pair, taking only 1 cycle.

lodsw           486+       mov ax, [si]         see lodsb
                          add si, 2

lodsd           486+       mov eax, [esi]       see lodsb
                          add esi, 4

loop <dest>:    386+       dec cx               LOOP is faster and
                          jnz <dest>:           smaller on 8088-286.
                                               On 386+ DEC/JNZ is
loopd <dest>:              dec ecx              much faster. On the Pentium
                          jnz <dest>:           the DEC/JNZ instructions
                                               pair, taking only 1 cycle.

loopXX <dest>:  486+       je  $+5              The 3 replacement instructions
( XX = e,ne,z or nz)       dec cx               are much faster on the 486+.
                          jnz <dest>:           LOOPxx is smaller and
                                               faster on 8088-286.
loopdXX <dest>: 486+       je  $+5              The speed is about the
                          dec ecx              same on the 386.
                          jnz <dest>:           JE $+5 as shown matches
                                               LOOPNE/LOOPNZ; use JNE $+5
                                               for LOOPE/LOOPZ.

mov reg2, reg1  286+       lea reg2, [reg1+n]   LEA is faster, smaller and
followed by:                                   preserves flags. This is a
inc/dec/add/sub reg2                           way to do a MOV and ADD/SUB
                                               of a constant, n.

mov acc, reg    all        xchg acc, reg        Use XCHG for smaller code
                                               when the final value of one
                                               register can be ignored.
                                               Note that acc = AL, AX or EAX.

mov mem, 1      Pent       lea bx, mem          Displacement/immediate does
                          mov [bx], 1          not pair. LEA/MOV can be used
                                               if other code can be placed
                                               in between to prevent AGI's.
                          mov ax, 1            MOV/MOV may be easier to pair.
                          mov mem, ax

mov [bx+2], 1   Pent       mov ax, 1            Better pairing because
                          mov [bx+2], ax       displacement/immediate
                                               instructions do not pair.

                          lea bx, [bx+2]
                          mov [bx], 1

movsb           486+       mov al, [si]         MOVS is faster and
                          inc si               smaller to move a single
                          mov [di], al         byte, word or dword
                          inc di               on the 8088-386.
                                               On the 486+ the MOV/INC
                                               method is faster.

                                               NOTE: REP MOVS is always
                                               faster to move a large block.

movsw           486+       mov ax, [si]         see MOVSB
                          add si, 2
                          mov [di], ax
                          add di, 2

movsd           486+       mov eax, [esi]       see MOVSB
                          add esi, 4
                          mov [edi], eax
                          add edi, 4

movzx r16, rm8  486+       xor bx, bx           MOVZX is faster and
                          mov bl, al           smaller on the 386.
                                               On the 486+ XOR/MOV
movzx r32, rm8  486+       xor ebx, ebx         is faster. Possible
                          mov bl, al           pairing on the Pentium.
                                               (source can be reg or mem)
movzx r32, rm16 486+       xor ebx, ebx         Disadvantage: modifies flags.
                          mov bx, ax

mul n           8088+      shl ax, cl           Use shifts or ADDs instead of
                                               multiply when n is a power of 2.

mul n           Pent       add ax, ax           ADD is better than a single
                                               shift because it pairs better.

mul             32-bit     lea                  Use LEA to multiply by
                                               2, 3, 4, 5, 8, 9
                                               (7 needs an extra SUB).

                          lea eax, [eax+eax*4] (ex: multiply EAX * 5)

                                               LEA is better than SHL on the
                                               Pentium because it pairs in
                                               both pipes; SHL pairs only in
                                               the U pipe.

or reg, reg     Pent       test reg, reg        Better pairing because
                                               OR writes to register.
                                               (This is for src = dest.)

pop mem         486+       pop reg              Faster on 486+
                          mov mem, reg         Better pairing on Pentium

push mem        486+       mov  reg, mem        Faster on 486
                          push reg             Better pairing on Pentium

pushf           486+       rcr reg, 1           To save only the carry flag
                                               use a rotate (RCR or RCL)
                             or                into a register. RCR and RCL
                                               are pairable (U pipe only)
                          rcl reg, 1           and take 1 cycle. PUSHF is
                                               slow and not pairable.

popf            486+       rcl reg, 1           To restore only the carry flag.
                                               See PUSHF.
                             or

                          rcr reg, 1

rep scasb       Pent       loop1:               REP SCAS is faster and
                            mov al, [di]       smaller on 8088-486.
                            inc di             Expanded code is faster
                            cmp al, reg2       on Pentium due to pairing.
                            je  exit
                            dec cx
                            jnz loop1
                          exit:

shl reg, 1      Pent       add reg, reg         ADD pairs better. SHL
                                               only pairs in the U pipe.

stosb           486+       mov [di], al         STOS is faster and smaller
                          inc di               on the 8088-286, and the same
                                               speed on the 386. On the 486+
stosw           486+       mov [di], ax         the MOV/INC is slightly
                          add di, 2            faster.

stosd           486+       mov [edi], eax       REP STOS is faster on 8088-386.
                          add edi, 4           MOV/INC or MOV/ADD is faster
                                               on the 486+.

                                               Note: use LEA SI, [SI+n]
                                               to advance SI without
                                               changing the flags.

xchg            all                             Use xchg acc, reg to do a
                                               1 byte MOV when one register
                                               can be ignored.

xchg reg1, reg2 Pent       push reg1            Pushes and pops are 1 cycle
                          push reg2            faster on Pentium due to
                          pop  reg1            pairing.
                          pop  reg2

                                               Disadvantage: uses stack.

               Pent       mov  reg3, reg1      Faster and better pairing
                          mov  reg1, reg2      if reg3 is available.
                          mov  reg2, reg3

xlatb           486+       mov bh, 0            XLAT is faster and smaller
                          mov bl, al           on 8088-386. MOV's are faster
                          mov al, [bx]         on 486+. Best to rearrange
                                               instructions to prevent AGI's
xlatb           486+       xor ebx, ebx         and get pairing on Pentium.
                          mov bl, al           Force high part of BX/EBX
                          mov al, [ebx]        to zero outside of loop.

                                               Disadvantage: modifies flags.
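
Several replacements in the table rest on simple arithmetic identities
(LEA as a multiply, ADD as a one-bit shift, MOV/SAR as a sign fill, and
the difference between ROR EAX, 16 and BSWAP). The following is a
minimal C sketch of those identities, not assembly and not a timing
test: it models 32-bit registers as uint32_t, and every function name
in it (lea_mul5, add_self, sign_fill, ror16, bswap32, check_identities)
is a hypothetical name chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* lea eax, [eax+eax*4]  ->  EAX * 5 (base + index*4, wraps at 32 bits) */
static uint32_t lea_mul5(uint32_t eax) { return eax + eax * 4u; }

/* add reg, reg  ->  same result as shl reg, 1 */
static uint32_t add_self(uint32_t reg) { return reg + reg; }

/* mov edx, eax / sar edx, 31  ->  EDX is filled with the sign bit of EAX,
   which is what CDQ leaves in EDX. Written without a signed shift so the
   C result does not depend on the compiler. */
static uint32_t sign_fill(uint32_t eax) { return 0u - (eax >> 31); }

/* ror eax, 16  ->  swaps the two 16-bit halves */
static uint32_t ror16(uint32_t eax) { return (eax >> 16) | (eax << 16); }

/* bswap eax  ->  reverses all four bytes */
static uint32_t bswap32(uint32_t eax) {
    return (eax >> 24) | ((eax >> 8) & 0x0000FF00u)
         | ((eax << 8) & 0x00FF0000u) | (eax << 24);
}

/* Hypothetical self-check; returns 1 if every identity holds. */
static int check_identities(void) {
    assert(lea_mul5(7u) == 35u);
    assert(add_self(0x80000001u) == (0x80000001u << 1));
    assert(sign_fill(5u) == 0u);                   /* same as xor edx, edx */
    assert(sign_fill(0xFFFFFFFBu) == 0xFFFFFFFFu); /* EAX = -5 */
    assert(ror16(0x11223344u) == 0x33441122u);
    assert(bswap32(0x11223344u) == 0x44332211u);
    return 1;
}
```

The last two checks show why the bswap entry is marked "not a direct
replacement": the rotate and the byte swap give different results for
the same input. The sketch says nothing about cycle counts or pairing,
which depend on the CPU.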


  

