Thursday, February 6, 2014

Reading smartctl for Dummies

This post is partly a reminder to myself. Hopefully it is helpful to others too. Any corrections or contributions of information are welcome.

Smartmontools can be helpful, but its output can be kind of cryptic. To begin with, simply use the "-a" option, which prints all the SMART information about the device. To read my USB-connected drive, I also need the option "-d sat".

smartctl -a -d sat /dev/sdd
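If you want to call this from a script, here is a minimal Python sketch that only builds the same command line. The helper name smartctl_command is my own invention, not part of smartmontools; the flags are just the "-a" and "-d" options used above.

```python
import shlex

def smartctl_command(device, device_type=None):
    """Build the argument list for `smartctl -a`, adding `-d TYPE` if given."""
    args = ["smartctl", "-a"]
    if device_type:
        args += ["-d", device_type]
    args.append(device)
    return args

cmd = smartctl_command("/dev/sdd", device_type="sat")
print(shlex.join(cmd))  # smartctl -a -d sat /dev/sdd
```

The resulting list can be handed to subprocess.run(cmd, capture_output=True, text=True). Reading SMART data usually needs root privileges.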

On one of my old hard drives, I see this:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   049   046   006    Pre-fail  Always       -       176467267
  3 Spin_Up_Time            0x0003   099   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1495
  5 Reallocated_Sector_Ct   0x0033   095   095   036    Pre-fail  Always       -       223
  7 Seek_Error_Rate         0x000f   063   060   030    Pre-fail  Always       -       2487909
  9 Power_On_Hours          0x0032   077   077   000    Old_age   Always       -       20485
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always       -       1523
194 Temperature_Celsius     0x0022   046   059   000    Old_age   Always       -       46
195 Hardware_ECC_Recovered  0x001a   049   046   000    Old_age   Always       -       176467267
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   198   000    Old_age   Always       -       5
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

There are two types of attributes: Pre-fail and Old_age.

On the rows with Pre-fail, you want the VALUE and WORST numbers to be higher than the THRESH (threshold) number. If they are higher, then you should be good. If VALUE is at or below THRESH, then your device may be dying.

(The WORST is generally a record of the lowest that VALUE has ever reached.)

I think the same rule applies to the rows with Old_age, if the THRESH is not 000. The Old_age rows can also give other information in the RAW_VALUE column. For example, on the row "9 Power_On_Hours" the RAW_VALUE is 20485. That indicates that the device has been powered on for approximately 20,485 hours over its lifetime.
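To put a number like that in perspective, here is a small sketch that converts power-on hours to days. The function name hours_to_days is made up for illustration. The split into days and hours is the same one smartctl prints in its error log (for example, 20185 hours = 841 days + 1 hours).

```python
def hours_to_days(hours):
    """Split a power-on hour count into whole days and leftover hours."""
    return divmod(hours, 24)

power_on_hours = 20485  # RAW_VALUE of attribute 9 in the table above
days, rem = hours_to_days(power_on_hours)
years = power_on_hours / (24 * 365)
print(f"{days} days + {rem} hours, about {years:.1f} years of running time")
# 853 days + 13 hours, about 2.3 years of running time
```

Keep in mind this counts only the time the drive was powered on, not calendar time.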

The column WHEN_FAILED shows "-" if the attribute has never failed, FAILING_NOW if VALUE is currently at or below THRESH, and In_the_past if only WORST is at or below THRESH.
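The comparison of VALUE and WORST against THRESH can be sketched in a few lines of Python. This is only my reading of the rules above, not smartctl's actual code. The function names parse_attribute and status are invented, and the parser assumes the ten-column layout shown in the table.

```python
def parse_attribute(line):
    """Split one attribute row into named fields (raw value kept as text)."""
    parts = line.split(None, 9)
    return {
        "id": int(parts[0]),
        "name": parts[1],
        "value": int(parts[3]),
        "worst": int(parts[4]),
        "thresh": int(parts[5]),
        "type": parts[6],
        "when_failed": parts[8],
        "raw": parts[9],
    }

def status(attr):
    """Compare VALUE and WORST with THRESH, roughly like WHEN_FAILED does."""
    if attr["thresh"] == 0:
        return "no threshold"
    if attr["value"] <= attr["thresh"]:
        return "FAILING_NOW"
    if attr["worst"] <= attr["thresh"]:
        return "In_the_past"
    return "ok"

rows = [
    "  1 Raw_Read_Error_Rate     0x000f   049   046   006    Pre-fail  Always       -       176467267",
    "  5 Reallocated_Sector_Ct   0x0033   095   095   036    Pre-fail  Always       -       223",
    "  9 Power_On_Hours          0x0032   077   077   000    Old_age   Always       -       20485",
]
for row in rows:
    attr = parse_attribute(row)
    print(attr["name"], attr["type"], status(attr))
```

For the three sample rows, the two Pre-fail attributes are still above their thresholds, and Power_On_Hours has no threshold to compare against.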

And that's all I know about that table.

My device also reports these errors:

SMART Error Log Version: 1
ATA Error Count: 4
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 4 occurred at disk power-on lifetime: 20185 hours (841 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  84 51 01 e7 7b 8b e1  Error: ICRC, ABRT 1 sectors at LBA = 0x018b7be7 = 25918439

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b8 30 7b 8b e1 00      23:41:31.936  READ DMA
  c8 00 08 58 e7 09 e9 00      23:41:31.915  READ DMA
  c8 00 08 f8 f7 08 e9 00      23:41:31.906  READ DMA
  c8 00 10 e8 c5 07 e9 00      23:41:31.880  READ DMA
  c8 00 08 28 53 c2 e0 00      23:41:31.857  READ DMA

Error 3 occurred at disk power-on lifetime: 16357 hours (681 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  84 51 01 00 00 00 f0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 01 00 00 00 f0 00      00:00:17.249  READ DMA
  c6 00 10 00 00 00 f0 00      00:00:17.249  SET MULTIPLE MODE
  91 00 3f 00 00 00 ff 00      00:00:17.248  INITIALIZE DEVICE PARAMETERS [OBS-6]
  10 00 00 00 00 00 f0 00      00:00:17.248  RECALIBRATE [OBS-4]
  00 00 01 00 00 00 00 06      00:00:21.676  NOP [Abort queued commands]

Error 2 occurred at disk power-on lifetime: 16357 hours (681 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  84 51 01 00 00 00 f0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 01 00 00 00 f0 00      00:00:21.435  READ DMA EXT
  91 00 3f 00 00 00 ff 00      00:00:21.435  INITIALIZE DEVICE PARAMETERS [OBS-6]
  10 00 00 00 00 00 f0 00      00:00:21.319  RECALIBRATE [OBS-4]
  00 00 01 00 00 00 00 06      00:00:20.554  NOP [Abort queued commands]
  20 00 01 00 00 00 f0 00      00:00:20.540  READ SECTOR(S)

Error 1 occurred at disk power-on lifetime: 16357 hours (681 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  84 51 01 8f 08 80 f0  Error: ICRC, ABRT 1 sectors at LBA = 0x0080088f = 8390799

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 88 08 80 f0 00      00:01:21.352  READ DMA
  e7 00 00 0f 08 80 b0 00      00:01:21.352  FLUSH CACHE
  c8 00 08 08 08 80 f0 00      00:01:21.352  READ DMA
  ca 00 08 00 e0 dc f1 00      00:01:21.425  WRITE DMA
  c8 00 08 00 f0 dc f1 00      00:01:21.375  READ DMA
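The "LBA = ..." value on each error line can be rebuilt from the register bytes printed in front of it. Here is a sketch for Error 4, assuming the seven bytes are in the order ER ST SC SN CL CH DH and that this is a 28-bit LBA (the low four bits of DH are the top bits of the address). The function name decode_lba28 is mine.

```python
def decode_lba28(sn, cl, ch, dh):
    """Combine the sector, cylinder, and device/head registers into a 28-bit LBA."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

# Register bytes from the "Error 4" line, assumed order: ER ST SC SN CL CH DH
er, st, sc, sn, cl, ch, dh = (int(b, 16) for b in "84 51 01 e7 7b 8b e1".split())

lba = decode_lba28(sn, cl, ch, dh)
print(f"LBA = 0x{lba:08x} = {lba}")  # LBA = 0x018b7be7 = 25918439
```

The result matches what smartctl printed, which suggests the assumed register order is right for this log.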

Error #4 is reported at 20,185 hours, and errors #3, #2, and #1 are all reported at 16,357 hours.
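To see how long ago that was, the error timestamps can be compared with the current Power_On_Hours value (20485). A quick sketch, with the numbers copied by hand from the output above and a helper name (hours_since) of my own:

```python
from collections import defaultdict

def hours_since(current_hours, error_hours):
    """Power-on hours elapsed between an error and now."""
    return current_hours - error_hours

power_on_hours = 20485  # RAW_VALUE of attribute 9
errors = {4: 20185, 3: 16357, 2: 16357, 1: 16357}  # error number -> lifetime hours

# Group the errors that share the same power-on hour
by_hour = defaultdict(list)
for number, hour in errors.items():
    by_hour[hour].append(number)

for hour, numbers in by_hour.items():
    ago = hours_since(power_on_hours, hour)
    days, rem = divmod(ago, 24)
    print(f"errors {numbers} at {hour} h: {ago} power-on hours ago ({days} days + {rem} hours)")
```

These are hours of running time, so the calendar gap is longer if the drive spends time switched off.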

Since none of the errors happened recently, and there is a large gap between the two groups, I think they may have been related to a forced computer shutdown or a power failure.
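One more clue is the error register itself. Every entry shows ER = 84, which smartctl prints as "ICRC, ABRT". ICRC is an interface CRC error on a DMA transfer, which is often a sign of a cable or connection problem, so that may be worth checking too. Here is a sketch of how the byte breaks down into flags. The bit table is deliberately partial (only the bits I am sure of, as they apply to DMA commands), and the function name is my own.

```python
# Partial map of ATA error register bits (for DMA commands); other bits left unnamed
ERROR_BITS = {7: "ICRC", 6: "UNC", 4: "IDNF", 2: "ABRT"}

def decode_error_register(er):
    """List the names of the bits set in the error register, highest bit first."""
    names = []
    for bit in range(7, -1, -1):
        if er & (1 << bit):
            names.append(ERROR_BITS.get(bit, f"bit{bit}"))
    return names

print(", ".join(decode_error_register(0x84)))  # ICRC, ABRT
```

This also lines up with the UDMA_CRC_Error_Count attribute in the table, which has a raw value of 5.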