This post is partly a set of notes to myself. Hopefully it is helpful to others too. Corrections and additional information are welcome.
Smartmontools can be helpful, but its output can be cryptic. To begin with, simply use the "-a" option, which prints all SMART information about the device. To read my USB-connected drive, I also need the option "-d sat".
smartctl -a -d sat /dev/sdd
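If you want to collect this output from a script, a minimal Python sketch might look like the one below. The function names build_smartctl_command and read_smart_report are my own (hypothetical), and I am assuming smartctl is on the PATH and that the script runs with enough privileges to read the device. smartctl's exit status is a bitmask, so the sketch does not treat a nonzero status as a failure.

```python
import shutil
import subprocess


def build_smartctl_command(device, device_type=None):
    """Build the argument list for 'smartctl -a [-d TYPE] DEVICE'."""
    command = ["smartctl", "-a"]
    if device_type:
        # For example "sat" for my USB-connected drive.
        command += ["-d", device_type]
    command.append(device)
    return command


def read_smart_report(device, device_type=None):
    """Run smartctl and return its text output, or None if it is not installed."""
    if shutil.which("smartctl") is None:
        return None
    result = subprocess.run(
        build_smartctl_command(device, device_type),
        capture_output=True,
        text=True,
        check=False,  # the exit status is a bitmask; nonzero is not always fatal
    )
    return result.stdout


# Only builds and prints the command; nothing is executed here.
print(build_smartctl_command("/dev/sdd", "sat"))
```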
On one of my old hard drives, I see this:
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   049   046   006    Pre-fail  Always       -       176467267
  3 Spin_Up_Time            0x0003   099   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1495
  5 Reallocated_Sector_Ct   0x0033   095   095   036    Pre-fail  Always       -       223
  7 Seek_Error_Rate         0x000f   063   060   030    Pre-fail  Always       -       2487909
  9 Power_On_Hours          0x0032   077   077   000    Old_age   Always       -       20485
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always       -       1523
194 Temperature_Celsius     0x0022   046   059   000    Old_age   Always       -       46
195 Hardware_ECC_Recovered  0x001a   049   046   000    Old_age   Always       -       176467267
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   198   000    Old_age   Always       -       5
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0
The TYPE column shows two kinds of attributes: Pre-fail and Old_age.
On the rows with Pre-fail, you want the VALUE and WORST numbers to be higher than the THRESH (threshold) number. If they are higher, then you should be good. If VALUE is at or below THRESH, your device may be dying.
(The WORST is generally a record of the lowest that VALUE has ever reached.)
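To make the comparison concrete, here is a small Python sketch that splits a few of the rows above into fields and prints how far VALUE sits above THRESH. The dictionary keys and the function names parse_attribute_row and margin are my own (hypothetical), and the sketch assumes the ten whitespace-separated columns shown in the table header.

```python
SAMPLE_ROWS = """\
  1 Raw_Read_Error_Rate     0x000f   049   046   006    Pre-fail  Always       -       176467267
  5 Reallocated_Sector_Ct   0x0033   095   095   036    Pre-fail  Always       -       223
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
"""


def parse_attribute_row(line):
    """Split one attribute row into a dict of the fields used below."""
    # Ten columns; the last one (RAW_VALUE) keeps any embedded spaces.
    fields = line.strip().split(None, 9)
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "type": fields[6],
        "raw": fields[9],
    }


def margin(attr):
    """Distance of the normalized VALUE above THRESH (0 or less means failed)."""
    return attr["value"] - attr["thresh"]


for line in SAMPLE_ROWS.splitlines():
    attr = parse_attribute_row(line)
    print(f'{attr["name"]:<24} margin={margin(attr)}')
```

On my drive, Spin_Retry_Count has the smallest margin of these three, but only because its threshold is set so high.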
As far as I can tell, the same comparison applies to the rows with Old_age when the THRESH is not 000, except that reaching the threshold there points to normal wear-out rather than imminent failure. The Old_age rows can also give other information in the RAW_VALUE column. For example, on the row "9 Power_On_Hours" the RAW_VALUE is 20485. That indicates that the device has been powered on for approximately 20,485 hours in total.
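A raw hour count is easier to judge in days. The sketch below does the conversion in Python; the function name hours_to_days is my own (hypothetical), and it assumes the raw value really is a plain count of hours, which is how this drive appears to report it.

```python
def hours_to_days(hours):
    """Convert a power-on hour count to (whole days, leftover hours)."""
    return divmod(hours, 24)


days, leftover = hours_to_days(20485)
print(f"{days} days + {leftover} hours")   # 853 days + 13 hours
print(f"about {20485 / (24 * 365):.1f} years of powered-on time")
```

The same conversion reproduces the figures smartctl prints in its error log, for example 20185 hours as 841 days + 1 hours.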
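The column WHEN_FAILED does not show a time. It shows FAILING_NOW if VALUE is currently at or below THRESH, In_the_past if only WORST is at or below THRESH, and "-" if neither has happened.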
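My reading of the smartctl man page is that the WHEN_FAILED column is derived from VALUE, WORST and THRESH roughly as in this Python sketch. The function name when_failed is my own (hypothetical), and this is an illustration of the rule, not smartctl's actual code.

```python
def when_failed(value, worst, thresh):
    """Return the WHEN_FAILED text for one attribute row."""
    if value <= thresh:
        return "FAILING_NOW"   # the current normalized value has hit the threshold
    if worst <= thresh:
        return "In_the_past"   # it hit the threshold at some point, but recovered
    return "-"                 # never reached the threshold


# Raw_Read_Error_Rate from the table above: VALUE 049, WORST 046, THRESH 006
print(when_failed(49, 46, 6))
```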
And that's all I know about that table.
My device also reports these errors:
SMART Error Log Version: 1
ATA Error Count: 4
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 4 occurred at disk power-on lifetime: 20185 hours (841 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 01 e7 7b 8b e1 Error: ICRC, ABRT 1 sectors at LBA = 0x018b7be7 = 25918439
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 b8 30 7b 8b e1 00 23:41:31.936 READ DMA
c8 00 08 58 e7 09 e9 00 23:41:31.915 READ DMA
c8 00 08 f8 f7 08 e9 00 23:41:31.906 READ DMA
c8 00 10 e8 c5 07 e9 00 23:41:31.880 READ DMA
c8 00 08 28 53 c2 e0 00 23:41:31.857 READ DMA
Error 3 occurred at disk power-on lifetime: 16357 hours (681 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 01 00 00 00 f0 Error: ICRC, ABRT 1 sectors at LBA = 0x00000000 = 0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 01 00 00 00 f0 00 00:00:17.249 READ DMA
c6 00 10 00 00 00 f0 00 00:00:17.249 SET MULTIPLE MODE
91 00 3f 00 00 00 ff 00 00:00:17.248 INITIALIZE DEVICE PARAMETERS [OBS-6]
10 00 00 00 00 00 f0 00 00:00:17.248 RECALIBRATE [OBS-4]
00 00 01 00 00 00 00 06 00:00:21.676 NOP [Abort queued commands]
Error 2 occurred at disk power-on lifetime: 16357 hours (681 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 01 00 00 00 f0 Error: ICRC, ABRT 1 sectors at LBA = 0x00000000 = 0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 01 00 00 00 f0 00 00:00:21.435 READ DMA EXT
91 00 3f 00 00 00 ff 00 00:00:21.435 INITIALIZE DEVICE PARAMETERS [OBS-6]
10 00 00 00 00 00 f0 00 00:00:21.319 RECALIBRATE [OBS-4]
00 00 01 00 00 00 00 06 00:00:20.554 NOP [Abort queued commands]
20 00 01 00 00 00 f0 00 00:00:20.540 READ SECTOR(S)
Error 1 occurred at disk power-on lifetime: 16357 hours (681 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 01 8f 08 80 f0 Error: ICRC, ABRT 1 sectors at LBA = 0x0080088f = 8390799
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 88 08 80 f0 00 00:01:21.352 READ DMA
e7 00 00 0f 08 80 b0 00 00:01:21.352 FLUSH CACHE
c8 00 08 08 08 80 f0 00 00:01:21.352 READ DMA
ca 00 08 00 e0 dc f1 00 00:01:21.425 WRITE DMA
c8 00 08 00 f0 dc f1 00 00:01:21.375 READ DMA
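The LBA that smartctl prints on each "Error:" line can be rebuilt from the SN, CL, CH and DH registers. The Python sketch below assumes 28-bit LBA addressing, where the low four bits of DH are the top of the address; the function name lba28_from_registers is my own (hypothetical).

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine the registers into a 28-bit LBA (DH low nibble, CH, CL, SN)."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Error 4 registers: SN=e7 CL=7b CH=8b DH=e1
lba = lba28_from_registers(0xE7, 0x7B, 0x8B, 0xE1)
print(f"0x{lba:08x} = {lba}")   # 0x018b7be7 = 25918439
```

This matches the LBA values in the log above for errors 4 and 1; errors 3 and 2 simply have all address bits at zero.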
Error #4 reports it was at 20,185 hours, and errors #3, #2 and #1 report they were at 16,357 hours.
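To double-check which error belongs to which hour, the "Error N occurred at ..." lines from the log can be grouped in Python. This is only a sketch: the function name errors_by_hour is my own (hypothetical), and the regular expression assumes the exact wording shown in my log.

```python
import re
from collections import defaultdict

ERROR_HEADERS = """\
Error 4 occurred at disk power-on lifetime: 20185 hours (841 days + 1 hours)
Error 3 occurred at disk power-on lifetime: 16357 hours (681 days + 13 hours)
Error 2 occurred at disk power-on lifetime: 16357 hours (681 days + 13 hours)
Error 1 occurred at disk power-on lifetime: 16357 hours (681 days + 13 hours)
"""

HEADER_RE = re.compile(r"Error (\d+) occurred at disk power-on lifetime: (\d+) hours")


def errors_by_hour(text):
    """Map each power-on hour to the list of error numbers logged at that hour."""
    grouped = defaultdict(list)
    for number, hours in HEADER_RE.findall(text):
        grouped[int(hours)].append(int(number))
    return dict(grouped)


grouped = errors_by_hour(ERROR_HEADERS)
print(grouped)   # {20185: [4], 16357: [3, 2, 1]}

CURRENT_POWER_ON_HOURS = 20485   # RAW_VALUE of Power_On_Hours in the table above
print("hours since last error:", CURRENT_POWER_ON_HOURS - max(grouped))   # 300
print("hours between the two groups:", max(grouped) - min(grouped))       # 3828
```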
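The most recent error happened about 300 power-on hours before this reading (20,185 versus 20,485 hours), and there is a gap of several thousand hours between the two groups of errors, so I think they may have been related to forced computer shutdown or power failure. ICRC stands for an interface CRC error, so a bad cable or connection is another possibility.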
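A side note on the log header: it says Powered_Up_Time "wraps" after 49.710 days. That figure is what a 32-bit counter of milliseconds would give. The 32-bit counter is my assumption, not something the output states. A quick Python check, with a hypothetical helper name wrapped_ms:

```python
MS_PER_DAY = 24 * 60 * 60 * 1000   # milliseconds in one day
COUNTER_RANGE = 2 ** 32            # assumed size of the millisecond counter

wrap_days = COUNTER_RANGE / MS_PER_DAY
print(f"{wrap_days:.3f} days")     # 49.710 days


def wrapped_ms(elapsed_ms):
    """Value such a counter would show after elapsed_ms milliseconds."""
    return elapsed_ms % COUNTER_RANGE


# 50 days of uptime would read as a little over 0.28 days after wrapping.
print(wrapped_ms(50 * MS_PER_DAY) / MS_PER_DAY)
```

So a Powered_Up_Time in the log is only unambiguous if the drive had been powered up for less than about 49.7 days at that point.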