IDE HDD S.M.A.R.T. monitor for OS/2


1. Introduction 

   Many recent IDE hard disk drives come with an integrated health
   monitoring subsystem that is capable of reporting several disk parameters,
   such as the error rate, number of reallocated sectors, drive temperature,
   etc. The parameters in general are vendor-specific, but the most common of
   them, as well as the interface for retrieving them, are set forth in a
   specification known as S.M.A.R.T. (Self-Monitoring, Analysis and Reporting
   Technology). The S.M.A.R.T. values are crucial for estimating the
   remaining lifetime of the hard disk and detecting any instabilities before
   they lead to data loss or drive malfunction.

   Since version 4.00, OS/2 comes with integrated S.M.A.R.T. monitoring
   functionality, with IBM1S506.ADD serving as the back-end and HDMON.EXE
   (part of the DMI/Problem Determination package) as the front-end. In this
   scheme, SMARTMON.EXE is just another front-end that displays more detailed
   information rather than just reporting the S.M.A.R.T. status as
   "normal/error", and somewhat resembles the well-known SMARTMON for DOS
   from the HDDSPEED package, originally developed by Mikhail Radchenko.

   This program requires a post-1996  IBM1S506.ADD, or any compatible driver,
   such as DANIS506.ADD. A S.M.A.R.T.-capable IDE drive is also a must.


2. Operation

   The HDD reports its S.M.A.R.T. status in a series of attribute structures:

      * Attribute ID = a number that identifies the kind of parameter being
        returned. The mapping of IDs to their actual meanings is a de facto
        standard (with some exceptions though).
      * Value = current value of the parameter. It comes in two
        representations: user-readable (a normalized value seen as a decimal
        number: the lower, the worse) and a "raw" representation which
        describes the actual state of the parameter in a series of hex digits.
                user ->   97        000000005ccc  <- raw
      * Threshold = the lowest acceptable value. Should the user-readable
        value drop below the threshold, the drive is no longer guaranteed to
        operate properly. The HDDSPEED authors have named this condition
        "T.E.C." (threshold exceeded condition), and so do we.
      * Worst value = the lowest value encountered during the operation of
        the HDD. For example, if the current temperature is deemed adequate,
        but some time ago the drive suffered overheating, the worst
        temperature is still kept in the "worst value" field.
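
   The attribute fields listed above can be pictured as a small record. The
   following is only an illustrative sketch: the class and field names are
   made up for this example and do not reflect the on-disk layout or the
   structures used inside SMARTMON. The sample numbers are taken from the
   "Spin Up Time" rows shown in section 4, and the T.E.C. test follows the
   "drops below the threshold" wording used above.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    """Hypothetical in-memory view of one S.M.A.R.T. attribute."""
    attr_id: int    # kind of parameter (3 is "Spin Up Time" in section 4)
    value: int      # normalized, user-readable value: the lower, the worse
    threshold: int  # lowest acceptable normalized value
    worst: int      # lowest normalized value seen so far
    raw: str        # raw representation as a string of hex digits

    def tec(self) -> bool:
        """True once the user-readable value has dropped below the threshold."""
        return self.value < self.threshold


spin_up = SmartAttribute(attr_id=3, value=96, threshold=21, worst=91,
                         raw="000000000a06")
print(spin_up.tec())  # False: 96 is still above the threshold of 21
```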

   As can be seen, S.M.A.R.T. has no facilities for tracking the values over
   time. As the drive reliability decreases, several values will decrease
   too, but S.M.A.R.T. does not provide any information on the exact rate of
   their change. That is, one cannot predict the remaining service life of a
   drive by looking at its current parameters alone.

   To  work around  this,  SMARTMON.EXE  keeps a copy   of  the drive's first
   S.M.A.R.T.  status - captured when you ran the program for the first time.
   Now, if any parameter begins to decline, its current value may be compared
   against its former value, and, based on the time passed, the decrease rate
   may be easily computed.  Consequently, it is  possible to roughly estimate
   the date when the threshold may be crossed.

   The file retaining the S.M.A.R.T. history  is called SMART.DAT and resides
   in  the same  directory  as SMARTMON.EXE.   The   format of SMART.DAT   is
   compatible with  HDDSPEED v  2.1. It keeps   the data  for any  number  of
   drives, based on their model and serial numbers.   To reset the S.M.A.R.T.
   history, just delete this file.

   Keeping the history introduces two "artificial" parameters:

      * 1/Month = estimate of how fast the value decreases. As an example,
        losing 5 points (compared to the original value) in 2 months yields
        2.5 points/month.
      * T.E.C. = approximate date of crossing the threshold value. If there
        is no degradation yet, the field will read "Unknown". If the value
        has already dropped below the threshold (the drive will not
        necessarily fail at that point), the field will read "Yes".
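
   The arithmetic behind these two fields can be sketched as a plain linear
   extrapolation. This is an assumption made for illustration, not
   necessarily the exact method used by SMARTMON: the function names, the
   average month length of 30.44 days and the handling of corner cases are
   all hypothetical.

```python
from datetime import datetime, timedelta

DAYS_PER_MONTH = 30.44  # assumption: average month length


def monthly_rate(first_value, current_value, first_seen, now):
    """Points lost per month since the first recorded status."""
    months = (now - first_seen).total_seconds() / 86400 / DAYS_PER_MONTH
    lost = first_value - current_value
    if months <= 0 or lost <= 0:
        return 0.0
    return lost / months


def tec_field(first_value, current_value, threshold, first_seen, now):
    """Text for the T.E.C. column: "Yes", "Unknown" or a projected MM/YYYY."""
    if current_value < threshold:
        return "Yes"
    rate = monthly_rate(first_value, current_value, first_seen, now)
    if rate == 0.0:
        return "Unknown"
    days_left = (current_value - threshold) / rate * DAYS_PER_MONTH
    try:
        return (now + timedelta(days=days_left)).strftime("%m/%Y")
    except OverflowError:
        return "Unknown"  # projection lies beyond what datetime can hold


start = datetime(2005, 1, 1)
two_months_later = start + timedelta(days=2 * DAYS_PER_MONTH)
# Losing 5 points in 2 months, as in the example above.
print(round(monthly_rate(100, 95, start, two_months_later), 2))
```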


3. Running SMARTMON.EXE

   SMARTMON may run in two modes:
      * User mode: display the drive status and exit.
      * Daemon mode: run continuously, monitor a single drive and report its
                     status to some other software as the time passes.
   Invoke "SMARTMON /?" to display a help screen with a list of options.

   By default, SMARTMON queries the primary master drive, although it may be
   directed to any other IDE unit with the command-line parameter /1, /2,
   /3, etc. (the default, primary master, is /0).

   With the /F parameter, SMARTMON will switch to the Fahrenheit scale in its
   temperature readings.  This does not  compromise the T.E.C. predictions in
   any  way,  even if a previously   recorded  SMART.DAT contained centigrade
   values.

   In user mode, which is the default, SMARTMON  displays either a simplified
   graphic representation  of the health status (if   no other parameters are
   given),  or  a more detailed  report  comprising "raw" and  "worst" values
   (/RAW).

   The  daemon  mode  is  triggered   with at  least   one  of the  following
   suboptions, and an optional /DELAY parameter to  adjust the delays between
   S.M.A.R.T.  inquiries (this  procedure is time-consuming  for some drives,
   so the default value is 900 seconds):

   1. /TM = Temperature monitoring into a named pipe. The temperature value
   is written into a pipe called \PIPE\IBMHDD. This mode of operation is
   intended for use with SysBar/2 Pipe Monitor v 0.xx or any similar utility.
   To use it with SysBar/2, right-click on the SysBar/2 panel, add a new cell
   of the "custom pipe listener" class and type in "\PIPE\IBMHDD" (without
   quotes) in the "Pipe" field. Fill in a prefix and a description of your
   choice. Be sure to enable the pipe, then choose "Close" to save.

   2. /DETAIL = Similar to /TM but includes identification data along with
   temperature readings so the pipe listener can distinguish between data
   coming from multiple monitors.

   3. /SYSLOG = Report temperature  and number of  reallocated sectors to the
   SYSLOG daemon. With this   parameter, SMARTMON makes requests to   SYSLOGD
   (port 514) and leaves marks such as the following:

   [asc] Dec 19 23:12:08 SMARTMON: IBM-DJNA-352030 #GQ0GQ0F0116G: 35C, 0 RB
   [asc] Dec 19 23:27:08 SMARTMON: IBM-DJNA-352030 #GQ0GQ0F0116G: 35C, 0 RB
   [asc] Dec 19 23:42:08 SMARTMON: IBM-DJNA-352030 #GQ0GQ0F0116G: 35C, 0 RB
   [asc] Dec 19 23:57:08 SMARTMON: IBM-DJNA-352030 #GQ0GQ0F0116G: 35C, 0 RB

   In this example, the text immediately following "SMARTMON:" comprises the
   drive model and serial number (to tell apart identical drives in a single
   system), current temperature (C) and number of Reallocated Blocks (RB).

   Provided that the  SYSLOG spool is  large  enough, some statistics may  be
   accumulated to estimate the  daily/yearly temperature fluctuations, or  to
   keep  track  of progressing reallocations.     More than  one instance  of
   SMARTMON may access   the  SYSLOG (detach   multiple  instances specifying
   different unit numbers).
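
   As an illustration of gathering such statistics, the sketch below parses
   marks in the format shown above and summarizes them per drive. It is not
   part of SMARTMON; the function names and the regular expression are
   assumptions derived solely from the four sample lines, so other message
   variants (if any exist) would not be matched.

```python
import re

# Pattern inferred from the sample marks: "<model> #<serial>: <temp>C, <n> RB"
MARK_RE = re.compile(
    r"SMARTMON: (?P<model>.+?) #(?P<serial>[^:\s]+): "
    r"(?P<temp>-?\d+)C, (?P<rb>\d+) RB"
)


def parse_mark(line):
    """Return the fields of one SMARTMON mark, or None for unrelated lines."""
    m = MARK_RE.search(line)
    if m is None:
        return None
    return {
        "model": m["model"],
        "serial": m["serial"],
        "temp_c": int(m["temp"]),
        "realloc": int(m["rb"]),
    }


def summarize(lines):
    """Per-drive temperature range and latest reallocated block count."""
    summary = {}
    for mark in filter(None, map(parse_mark, lines)):
        key = (mark["model"], mark["serial"])
        entry = summary.setdefault(
            key, {"min_c": mark["temp_c"], "max_c": mark["temp_c"], "realloc": 0}
        )
        entry["min_c"] = min(entry["min_c"], mark["temp_c"])
        entry["max_c"] = max(entry["max_c"], mark["temp_c"])
        entry["realloc"] = mark["realloc"]  # keep the most recent count
    return summary


log = [
    "[asc] Dec 19 23:12:08 SMARTMON: IBM-DJNA-352030 #GQ0GQ0F0116G: 35C, 0 RB",
    "[asc] Dec 19 23:27:08 SMARTMON: IBM-DJNA-352030 #GQ0GQ0F0116G: 35C, 0 RB",
]
print(summarize(log))
```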

   The /ONCE parameter, when used in conjunction with the daemon mode, limits
   the monitoring loop to a single check, after which SMARTMON exits. It is
   particularly useful for leaving a mark in SYSLOG suggesting that the drive
   is healthy as a part of some system-wide event, rather than polling the
   unit continuously.
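
   The interplay of the polling delay and /ONCE can be pictured with the
   following loop. It is a conceptual sketch only: the function and
   parameter names are hypothetical, and the callbacks stand in for whatever
   SMARTMON actually does to query the drive and deliver the report. The
   default of 900 seconds is the one stated above.

```python
import time


def monitor(read_status, report, delay=900, once=False, sleep=time.sleep):
    """Query the drive and report; repeat every `delay` seconds unless `once`."""
    while True:
        report(read_status())
        if once:
            break  # single check, then exit (the /ONCE behaviour)
        sleep(delay)


# One-shot run with stand-in callbacks: a fixed reading, printed to the screen.
monitor(lambda: "35C, 0 RB", print, once=True)
```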


4. Interpreting the T.E.C. predictions

   Although there are no exact rules for interpreting the information
   returned by SMARTMON.EXE, the following recommendations may help to
   estimate the stability of the HDD more precisely and to avoid false
   alarms. It should be stressed that the reliability of a T.E.C. prediction
   is quite low, and depends on the specific drive, its manufacturer and
   operating conditions.

   1. Factory presets: during the first couple of weeks in operation, a
   degradation of some HDD S.M.A.R.T. values may be noticed. Most often this
   applies to newly installed drives:

     Attribute                 ID Threshold Value Indicator  1/Month   T.E.C.
   
   * Spin Up Time              3       21     96      2.5    02/2005
   * Seek Error Rate           7       51    200     0.0    Unknown
   * Spin Retry Count          10      51    100     0.0    Unknown
   [...]
   Attribute               Value  Threshold  Worst  Raw           Flags
   
   Spin Up Time               96         21     91  000000000a06  OC PR
   Seek Error Rate           200         51    200  000000000000  OC ER
   Spin Retry Count          100         51    253  000000000000  OC EC

   The difficulty is that degradation of spin-up time is normal and common,
   whereas for some other attributes it may mean a forthcoming malfunction.
   The example below corresponds to a now-defunct unit which differs from the
   "normal" example above only by a non-zero seek error rate and spin
   retries. Moreover, the seek error rate at this stage can only be
   discovered in the raw display mode.

     Attribute                 ID Threshold Value Indicator  1/Month   T.E.C.
   
   * Spin Up Time              3       21     93     10.0    09/2002
   * Seek Error Rate           7       51    200     0.0    Unknown
   * Spin Retry Count          10      51     99     0.0    10/2002
   [...]
   Attribute               Value  Threshold  Worst  Raw           Flags
   
   Spin Up Time               91         21     89  000000000bb9  OC PR
   Seek Error Rate           200         51    200  00000000000b  OC ER
   Spin Retry Count           99         51     99  000000000118  OC EC

   Also illustrated is how distracting the "spin-up time" parameter is: its
   T.E.C. will actually be deferred for a long time, and the real cause of
   the drive crash is the spin retry count, which initially held 2nd place.

   2. Non-conforming values: the predictions are based on "user-readable"
   values, with the assumption that any change of value reflects an actual
   change of drive fitness/performance characteristics. However, some values
   are likely to change without reflecting the actual state of the drive
   (such as temperature, and, to some extent, the spin-up time), and they
   should not be relied upon if SMARTMON yields completely different
   predictions for them at different times.

   3. Long-term operation: in the long term, the lifetime estimates given by
   SMARTMON may not reflect the actual rate of drive degradation. It is the
   large time span between the SMART.DAT creation time and the query time
   that prevents SMARTMON from making accurate predictions. The following is
   an imaginary example of the change history for a specific attribute and
   the SMARTMON assumptions about its life cycle (threshold=20):

        Value reported by HDD in Jul-1999:        100
        Value written to SMART.DAT (Jul-1999):    100
        Value reported by HDD in Jan-2000:        100
        Value reported by HDD in Jan-2001:        100
        Value reported by HDD in Mar-2001:         99  T.E.C. = 10/2132
        Value reported by HDD in Jun-2001:         98  T.E.C. = 02/2076
        Value reported by HDD in Jul-2001:         97  T.E.C. = 10/2052
        Value reported by HDD in Aug-2001:         95  T.E.C. = 11/2032
        Value reported by HDD in Sep-2001:         89  T.E.C. = 09/2015
        Value reported by HDD in Oct-2001:         63  T.E.C. = 10/2004
        Value reported by HDD in Nov-2001:         39  T.E.C. = 07/2002
        Value reported by HDD in Dec-2001:         24  T.E.C. = 01/2002

   The point of interest is how fast the value decreases (in many cases, the
   value does not decrease linearly), and how the T.E.C. estimates are
   corrected by the program. Should T.E.C. monitoring restart in Aug-2001, it
   would yield a more realistic prediction from the very beginning:

        Value reported by HDD in Aug-2001:         95
        Value written to SMART.DAT (Aug-2001):     95
        Value reported by HDD in Sep-2001:         89  T.E.C. = 08/2002
        Value reported by HDD in Oct-2001:         63  T.E.C. = 12/2001
        Value reported by HDD in Nov-2001:         39  T.E.C. = 12/2001
        Value reported by HDD in Dec-2001:         24  T.E.C. = 12/2001

   Conclusion: after several months in operation, it may be wise to reset
   SMART.DAT in order to be warned of possible failures in time. Keep in mind
   that the HDD does not record dates, so the job of calculating T.E.C. is
   entirely up to SMARTMON.
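
   The effect of an aged baseline can be reproduced with a few lines of
   arithmetic. The sketch below assumes a plain linear extrapolation over
   whole months; the function names are hypothetical, and because the table
   above is an imaginary illustration, the dates computed here need not match
   it month for month. What matters is the gap between the two predictions
   for the same Sep-2001 reading.

```python
def months_between(start, end):
    """Whole months between two (year, month) pairs."""
    return (end[0] - start[0]) * 12 + (end[1] - start[1])


def predict_tec(baseline, sample, threshold):
    """Linearly extrapolate the month in which `threshold` is reached.

    `baseline` and `sample` are ((year, month), value) pairs.
    Returns (year, month), or None if no decline has been observed yet.
    """
    (base_date, base_value), (now_date, now_value) = baseline, sample
    elapsed = months_between(base_date, now_date)
    lost = base_value - now_value
    if elapsed <= 0 or lost <= 0:
        return None
    months_left = int((now_value - threshold) * elapsed / lost)
    total = now_date[0] * 12 + (now_date[1] - 1) + months_left
    return (total // 12, total % 12 + 1)


sample = ((2001, 9), 89)  # value reported by the HDD in Sep-2001
old = predict_tec(((1999, 7), 100), sample, threshold=20)   # Jul-1999 baseline
fresh = predict_tec(((2001, 8), 95), sample, threshold=20)  # Aug-2001 baseline
print("old baseline:", old, "fresh baseline:", fresh)
```

   With the Jul-1999 baseline, the two years of stable readings dilute the
   recent decline, so the projected date lands more than a decade later than
   the one obtained from the Aug-2001 baseline.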

   4. Worst values. As mentioned earlier, the drive keeps track of the worst
   value of each parameter detected during operation. These values can be
   viewed in raw mode:

   T.E.C. prediction monitoring started on 31/01/2000, 19:05:12
     Attribute                 ID Threshold Value Indicator  1/Month   T.E.C.
   
   * Raw Read Error Rate       1       60    100     0.0    Unknown
   [...]
     Power On Time Count       9        0     99      0.0    03/2265

   Attribute               Value  Threshold  Worst  Raw           Flags
   
   Raw Read Error Rate       100         60     97  000000010000  OC ER
   [...]
   Power On Time Count        99          0     99  0000000032f1  OC EC

   Without raw mode, SMARTMON would only report that the nearest T.E.C. is to
   occur in year 2265 (see above for the side effects of an outdated
   SMART.DAT). Raw mode highlights a problem with "raw read error rate":
   there was a point in the drive's history when its value dropped to 97.
   Also, for this particular model it can be deduced from the low 8 digits of
   the raw value that the last read error occurred not that long ago (in IBM
   DJNA and some other drives, the history is kept in two frames, the
   "previous" and "present" state, which are shifted after several million
   commands: "0001" is the previous state, and the low 4 digits, "0000", are
   the current state).
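
   Splitting the raw value into the two frames described above is
   straightforward. The helper below is a hypothetical illustration that
   applies only to drives using this vendor-specific layout (such as the IBM
   DJNA in the example); other models encode their raw values differently.

```python
def split_history_frames(raw_hex):
    """Split the low 8 hex digits of a raw value into (previous, current)."""
    low8 = raw_hex[-8:]
    previous = int(low8[:4], 16)  # upper 4 of the low 8 digits
    current = int(low8[4:], 16)   # low 4 digits
    return previous, current


# Raw Read Error Rate from the report above.
print(split_history_frames("000000010000"))  # (1, 0)
```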

   5. Sensitivity. Different HDD manufacturers apply different factors when
   it comes to calculating a particular S.M.A.R.T. value. For some HDDs, a
   Raw Read Error Rate of 99 is an alarming degradation; for others, this
   value can become as low as 75 in a perfectly healthy system. In some
   cases, a manufacturer's tool (such as Data LifeGuard Diagnostics from WD
   or Drive Fitness Test from IBM) may be used to carry out a more extensive
   test and clear up any uncertainties of the usual S.M.A.R.T. report.


5. FAQ

   Q: How critical are those "*"-marked attributes?
   A: The S.M.A.R.T. specification introduces a notion of "life-critical"
      attributes, leaving it up to individual HDD manufacturers to decide
      which attributes are to be marked as critical in their S.M.A.R.T.
      records. A common convention is to judge by the T.E.C. behavior for a
      particular attribute:
      - if crossing the threshold signals imminent failure (e.g. abnormally
        high read error rate), the attribute is classified "life-critical".
      - if the threshold represents some design limit (e.g. power-on time
        count) but no abrupt changes are expected after T.E.C., the attribute
        is not critical (a.k.a. "advisory").

   Q: Is there any relation between S.M.A.R.T. and manufacturer's warranty
      obligations?
   A: Some manufacturers, e.g. Western Digital as of early 2005, specifically
      stipulate that a T.E.C. for a life-critical attribute (see above)
      constitutes a valid ground for warranty return.
      On the other hand, several years ago there was a case of "warranty
      void" attributes with the IBM TravelStar 4GN, when units subjected to
      over 3000 power-off retract cycles were declared to be excluded from
      warranty.

   Q: What do "ATTv16" and "THRv16" stand for?
   A: These numbers correspond to the revisions of the attribute and
      threshold tables. Not much information can be extracted from the table
      revisions.

   Q: Is there a feature to modify S.M.A.R.T. values returned by HDD?
   A: No, S.M.A.R.T. data is read-only. There are various methods to affect
      the S.M.A.R.T. tables of a particular HDD model, but there is no
      unified method.

   Q: Does this program support SCSI HDDs?
   A: No. There is a S.M.A.R.T. interface for SCSI but SMARTMON presently
      covers IDE only.

   Q: My Kalok/XEBEC drive reports wrong S.M.A.R.T. data!
   A: There are some older drives that do not support S.M.A.R.T. but respond
      to a S.M.A.R.T. inquiry with arbitrary data. If the S.M.A.R.T. values
      reported by the drive are evidently invalid (e.g. duplicate entries,
      totally unknown attributes, etc.), then the drive merely happens to
      respond to the S.M.A.R.T. request with wrong data and does not
      actually support S.M.A.R.T.
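
      A rough plausibility check along these lines might look as follows.
      This is a hypothetical heuristic for illustration, not a test that
      SMARTMON is known to perform; the set of "known" IDs contains only the
      attributes that appear in the examples of this document and would have
      to be replaced with a fuller list in practice.

```python
from collections import Counter

# IDs seen in this document's examples: 1 = Raw Read Error Rate,
# 3 = Spin Up Time, 7 = Seek Error Rate, 9 = Power On Time Count,
# 10 = Spin Retry Count.
KNOWN_IDS = {1, 3, 7, 9, 10}


def check_attribute_ids(attr_ids, known_ids=KNOWN_IDS):
    """Return (duplicate IDs, unknown IDs) found in a reported attribute list."""
    counts = Counter(attr_ids)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    unknown = sorted(i for i in counts if i not in known_ids)
    return duplicates, unknown


print(check_attribute_ids([1, 3, 3, 7]))  # ([3], [])
```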

   Q: What tools do I need to compile SMARTMON from its source code?
   A: OS/2 Toolkit and any of the supported compilers (OpenWatcom, IBM or
      MetaWare). To obtain helper libraries enabling the system LIBC for
      non-IBM compilers, contact the author.


6. Credits

   This product incorporates information from various sources:

       - HDDSPEED v 2.1 by Mikhail Radchenko (SMART.DAT structures)
       - DANIS506.ADD by Daniela Engert (updated IDEDATA.H)
       - SIGuardian by PalickSoft (S.M.A.R.T. attribute list)

   This product includes software developed by the University of California,
   Berkeley and its contributors.

   SMARTMON.EXE is maintained by Andrew Belov <andrew_belov@newmail.ru>.
   See the ChangeLog for a complete list of contributors.

   The Cast (in alphabetical order):

       - IBM Deskstar 25GP DJNA-352030
       - IBM Deskstar 34GXP DPTA-372050
       - IBM Travelstar 14GS DCYA-214000
       - IBM Travelstar 20GN DJSA-210
       - IBM Travelstar 4GT DTCA-23240
       - IBM Travelstar 80GN IC25N040ATMR04
       - Hitachi Travelstar 5K80 HTS548020M9AT00
       - Quantum Fireball EX6.4A
       - Western Digital Protege 100EB
       - Western Digital Protege 400EB
       - Western Digital Scorpio 800VE

   No hard disk drive was tortured during the development of this software.

   $Id: readme.txt,v 1.12 2005/04/10 09:03:46 root Exp $
