MDOS 100% BUG PROBLEM DESCRIPTION        Rev. 2.0 / March 9th, 1997

CONTACT PERSON:

    Tobias Ernst             e-mail:  tobi@bland.fido.de
    Werderstr. 70            fidonet: 2:2476/418
    D-76137 Karlsruhe        os2net:  81:449/7835
    Germany                  phone:   +49 721 9374497


CONTENTS OF THIS DOCUMENT

    This document contains

    - A typical phenomenological problem description
    - A technical problem description
    - A kernel patch to solve the problem (!)
    - Annotations


ONE-LINE DESCRIPTION OF THE PROBLEM:

    APAR JR-10024:

    MDOS APPS CAUSE 100% LOAD DUE TO BROKEN TIME SLICE API IN WARP4


A TYPICAL PHENOMENOLOGICAL PROBLEM DESCRIPTION

    A lot of different DOS software, especially DOS data
    communications software, that ran smoothly under Warp 3 has
    caused 100% system load since the upgrade from Warp 3 to Warp 4.
    Most of these programs claim to be OS/2 aware, i.e. to release
    time slices to OS/2.  The difference between Warp 3 and Warp 4
    can be visualized using a system process monitor.  It reveals
    the following:

                       WARP 3->              WARP 4->
                       Priority  State       Priority  State
    Program active     0x0201    Ready       0x0201    Ready
    Program idle       0x0200    Blocked     0x0200    Ready

    Warp 4 simply "forgets" to block the program when it is idle.
    The following table shows an interesting effect which
    illustrates that there must be a serious bug in the time slice
    API.  It was taken with the DOS fidonet mailer software
    "McMail", a typical OS/2-aware DOS program:

                                     Resulting system load
                                     measured using PULSE.EXE

    Program is told to ..            Warp 3            Warp 4

       Release OS/2 time slices      3%                100%
       Not release any time slices   50%               50%

    Actually, giving  a  time  slice  back  to  OS/2  V4  makes the
    situation WORSE than not giving a time slice at all!

    There seem to be further problems with the idle time detection
    of the DOS emulation.  For example, on my system the TSR
    program DOSKEY, which is included in OS/2 MDOS, *SOMETIMES*
    causes 100% system load.  This is not reproducible at will, but
    happens sporadically and is very worrying.


A TECHNICAL PROBLEM DESCRIPTION

    There are four methods of releasing time slices under OS/2:

       Method                Works with Warp 3    Works with Warp 4

    a) "Generic DOS pause"   yes                  yes (?)
       INT 28h
    b) "DPMI time slice"     yes                  no
       MOV AX,1680h
       INT 2Fh
    c) "BIOS delay"          yes                  yes
       CX:DX=microseconds
       MOV AH,86h
       INT 15h
    d) "CPU halt"            yes                  no
       DX:AX=msecs
       STI
       HLT
       DB 035h, 0CAh

    Of these four methods, b) and d) are the most frequently used,
    so that about 80% of all OS/2-aware DOS software is nearly
    inoperable under Warp 4.

    We have done some debugging of the OS/2 Warp 4 kernel and have
    found the reason why INT 2Fh does not work any more:

    OS/2 versions prior to OS/2 Warp 4 handled INT 2Fh / AX=1680h
    via the 16 bit doskrnl.  The 16 bit doskrnl contains a routine
    that traps INT 2Fh / AX=1680h and releases the DOS task's
    processor time slice.  In Warp 4, the 16 bit doskrnl still
    contains this code and it is still operable - but the INT 2Fh /
    AX=1680h call never reaches the doskrnl at all.

    In Warp 4, INT 2Fh / AX=1680h is trapped by the MVDM before it
    ever reaches doskrnl, and MVDM seems to be unable to handle it
    correctly.  There are two possible solutions to the problem:
    either make MVDM process the time slice call correctly, or stop
    MVDM from handling the time slice call at all, so that it can
    reach the routines in doskrnl, which are still operable.

    As a first workaround, we have chosen the latter method:


A KERNEL PATCH TO SOLVE (WORK AROUND) THE PROBLEM

    The following kernel patch is valid for the OS/2 Warp XR4000
    and XRG4000 service levels.  We have not debugged the Fixpack
    #1 kernel yet, so I cannot say whether it is valid for that
    kernel as well (I only know that the Fixpack #1 kernel still
    has the problem ...).

    Service level XR4000 or XRG4000, file OS2KRNL (located in the
    root directory of the installation drive):

    - Revision 9.023 (Warp 4 w/o fixes):     offset 67C2Eh
    - Revision 9.025 (Warp 4 w/ Fixpack #1): offset 67D73h

    At that offset, change the following six bytes:

       66 25 80 00 74 45   (old)

    as follows:

       66 3D 80 00 7E 45   (new)
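
    The byte change above can be sketched as a small Python helper.
    This is only an illustration, not a tested patch tool: the
    function name apply_patch and the OFFSETS table are made up
    for this example, and only the byte values and offsets come
    from the text.  It works on an in-memory copy and refuses to
    touch anything if the expected old bytes are not found.

```python
# Sketch only: apply the six-byte change to an in-memory copy of OS2KRNL.
# apply_patch and OFFSETS are illustrative names, not part of any OS/2 tool.

OLD = bytes.fromhex("662580007445")   # 66 25 80 00 74 45  (old)
NEW = bytes.fromhex("663D80007E45")   # 66 3D 80 00 7E 45  (new)

# File offsets quoted in the text, keyed by kernel revision.
OFFSETS = {"9.023": 0x67C2E, "9.025": 0x67D73}


def apply_patch(image: bytes, offset: int) -> bytes:
    """Return a patched copy of image; raise if the old bytes are missing."""
    found = image[offset:offset + len(OLD)]
    if found == NEW:
        return image                  # already patched, nothing to do
    if found != OLD:
        raise ValueError(
            "unexpected bytes at offset %Xh: %s" % (offset, found.hex())
        )
    return image[:offset] + NEW + image[offset + len(NEW):]
```

    A cautious way to use it would be to read a backup copy of
    OS2KRNL, call apply_patch with the offset for the installed
    revision, and write the result to a new file, so that the
    original kernel stays available if the check fails.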

    This patch stops MVDM from processing INT 2Fh / AX=1680h, and
    voila, time slices work again.  What the patch does is also
    illustrated by the following disassembly of OS2KRNL:

    F8C6B4      push    ebp
    F8C6B5      mov     ebp, esp
    F8C6B7      push    ebx
    F8C6B8      push    esi
    F8C6B9      mov     ebx, [ebp+8]
    F8C6BC      cmp     byte ptr [ebx+1Dh], 16h
    F8C6C0      jnz     loc_FFF8C6EA
    F8C6C2      movzx   esi, byte ptr [ebx+1Ch]
    F8C6C6      mov     eax, esi
    F8C6C8      mov     ecx, eax
    F8C6CA      and     ax, 80h
    -->CHANGED: cmp     ax, 80h

    F8C6CE      jz      loc_FFF8C715
    -->CHANGED: jle     loc_FFF8C715

    F8C6D0      cmp     ecx, 8Ah
    F8C6D6      ja      loc_FFF8C715
    F8C6D8      and     ecx, 0FFFFFF7Fh
    F8C6DE      mov     esi, ecx
    F8C6E0      push    ebx
    F8C6E1      call    dword_FFF14C50[ecx*4]
    F8C6E8      jmp short loc_FFF8C717
    F8C6EA
    F8C6EA
    F8C6EA loc_FFF8C6EA:
    F8C6EA      cmp     word ptr [ebx+1Ch], 4010h
    F8C6F0      jnz     loc_FFF8C705
    F8C6F2      mov     word ptr [ebx+1Ch], 0
    F8C6F8      mov     word ptr [ebx+10h], 1428h
    F8C6FE      mov     eax, 1
    F8C703      jmp short loc_FFF8C717
    F8C705
    F8C705 loc_FFF8C705:
    F8C705      cmp     word ptr [ebx+1Ch], 4011h
    F8C70B      jnz     loc_FFF8C715
    F8C70D      push    ebx
    F8C70E      call    loc_FFF94691
    F8C713      jmp short loc_FFF8C717
    F8C715
    F8C715 loc_FFF8C715:
    F8C715      sub     eax, eax
    F8C717
    F8C717 loc_FFF8C717:
    F8C717      pop     esi
    F8C718      pop     ebx
    F8C719      leave
    F8C71A      retn    4
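
    The effect of the two changed instructions can be checked with
    a small Python model of the test at F8C6CA..F8C6D6.  This is a
    sketch under an assumption: that the byte at [ebx+1Ch] is the
    DOS program's AL register (the compare of [ebx+1Dh] with 16h
    suggests AH=16h, but the listing itself does not say so).  The
    function name mvdm_claims_call is made up for this example.

```python
# Sketch only: model the AL check in the routine above, before and after
# the patch.  Assumes [ebx+1Ch] holds the caller's AL (an interpretation).

def mvdm_claims_call(al: int, patched: bool) -> bool:
    """True if the routine dispatches through its handler table,
    False if it jumps to loc_FFF8C715 and returns 0 (not handled)."""
    ax = al & 0xFF                    # movzx esi, byte ptr [ebx+1Ch]
    if patched:
        skip = ax <= 0x80             # cmp ax, 80h / jle loc_FFF8C715
    else:
        skip = (ax & 0x80) == 0       # and ax, 80h / jz  loc_FFF8C715
    if skip or ax > 0x8A:             # cmp ecx, 8Ah / ja  loc_FFF8C715
        return False
    return True


# Which AL values are treated differently by the patched kernel?
changed = [al for al in range(256)
           if mvdm_claims_call(al, False) != mvdm_claims_call(al, True)]
print([hex(al) for al in changed])

# The jump keeps its displacement byte (45h), so the target is unchanged:
# address of the jump + 2 bytes instruction length + 45h.
assert 0xF8C6CE + 2 + 0x45 == 0xF8C715
```

    In this model the only value whose treatment changes is AL=80h,
    i.e. the time slice call; AL=81h..8Ah is still dispatched by
    the routine as before.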


HOW TO REPRODUCE THE PROBLEM

    The problem can be, and has been, reproduced on all existing
    versions of Warp 4 (Beta, Gamma, German GA, US American GA),
    virtually independent of the installed hardware or software
    components.  Using the following method, I was able to
    reproduce it on *any* OS/2 Warp 4 system I have worked with.

    Enter the following lines at the MDOS prompt and do not omit
    the empty lines:

    ==begin==
    debug
    a100
    mov ax,1680
    int 2f
    jmp 100

    rcx
    7
    nloop.com
    w
    q
    ==end==

    This creates a little program LOOP.COM.  This program is just
    an endless loop which does nothing more than continuously
    release all of its processor time back to OS/2 via the INT 2Fh
    call.  Consequently, this program should not cause any visible
    system load.
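
    The seven bytes that the DEBUG session writes can be rebuilt
    in Python to show why CX is set to 7.  This is a sketch: it
    assumes that DEBUG assembles "jmp 100" as a two-byte short
    jump, which matches the length of 7 given above.  The function
    name assemble_loop is made up for this example.

```python
# Sketch only: rebuild the bytes of LOOP.COM as entered in DEBUG above.
import struct

ORG = 0x100  # .COM programs are loaded at offset 100h


def assemble_loop() -> bytes:
    code = b"\xB8" + struct.pack("<H", 0x1680)   # mov ax,1680h
    code += b"\xCD\x2F"                          # int 2Fh
    # Short jump: the displacement is relative to the address of the
    # instruction that follows the two-byte jump.
    next_ip = ORG + len(code) + 2
    code += b"\xEB" + struct.pack("b", ORG - next_ip)   # jmp 100h
    return code


loop_com = assemble_loop()
print(loop_com.hex(), len(loop_com))
```

    Writing loop_com to a file would give the same program as the
    DEBUG session, provided the assumption about the short jump
    holds.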

    Start PULSE.EXE.  Then run LOOP.COM in a windowed or
    full-screen DOS session with standard settings
    (IDLE_SENSITIVITY=70, IDLE_SECONDS=0) and watch the Pulse
    display.  You will see that LOOP.COM does not produce any
    system load under Warp 3, while under Warp 4 it produces,
    depending on the system, between 40% and 100%.  (Note that you
    will not see anything when running the program except for the
    change in the Pulse display.  You have to terminate it by
    closing the DOS window.)


ANNOTATIONS

    Note:  While LOOP.COM drives the system load display to 100%,
    you will not perceive any relevant impact on overall system
    performance.  LOOP.COM is just a DEMO program which does
    nothing more than trigger the bug.  But imagine a full-grown
    program with a main loop like this:

    - Poll I/O-Port
    - Poll Keyboard
    - Check hard disk for file semaphores
    - If OS/2 then  give up timeslice
              else  do nothing for 1/10 sec
    - Redo from start

    With the Warp 4 problem, this program will poll the hardware 10
    times  as often as with Warp 3.  This causes highly perceptible
    effects on overall system load.

    Note 2:  Of course you can pull the system load of LOOP.COM
    down by using very low values of IDLE_SENSITIVITY.  But you
    cannot do this with a full-grown BBS software.  Or rather, you
    can, but then the BBS will no longer respond to user input
    within a reasonable time.

[EOF]
