...yes, really...although it wasn't directly caused by the Linux kernel itself. It has more to do with the fact that the Linux distribution is responsible for configuring ACPI power management settings, and many Linux distributions don't change the defaults. [...and the software packages that can adjust the default settings were not installed by default on my machine.]
If you run Linux on a laptop or otherwise use Linux with a laptop hard drive (maybe in an external enclosure?) you really should read this post instead of just skimming it.
[This is a follow-up to my earlier post about my laptop's hard drive troubles.]
I'm never really satisfied until I know the cause of a hardware failure so I spent quite a bit of time Sunday night and Monday afternoon trying to figure out exactly what caused the drive to suddenly start having problems...
I've had Linux on this laptop since I put it into service a few years back but I found out Sunday that the smartmontools package wasn't installed by default. After I finished copying all my files over to the old P166 I set about installing it.
Here is the output from 'smartctl -A /dev/hda':
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 062 Pre-fail Always - 0
2 Throughput_Performance 0x0005 100 100 040 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 063 063 033 Pre-fail Always - 1
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 879
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 040 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 053 053 000 Old_age Always - 20827
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 171
191 G-Sense_Error_Rate 0x000a 099 099 000 Old_age Always - 131077
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 17
193 Load_Cycle_Count 0x0012 001 001 000 Old_age Always - 2131639
194 Temperature_Celsius 0x0002 114 114 000 Old_age Always - 47 (Lifetime Min/Max 17/62)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 61
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
Most of these numbers are well within what would be expected for this drive. This particular laptop is used as a surrogate desktop system and stays powered up pretty much all the time, so the high Power_On_Hours value is to be expected.
I disabled the power management for the drive (or so I thought) via the laptop's BIOS ages ago to keep the drive from constantly spinning down. It turns out that didn't completely disable all the power management functions and the hard drive's power management has been operating in "Low Power Idle" mode and not "Active Idle" or "Disabled".
This line from smartctl's output tells the full story:
193 Load_Cycle_Count 0x0012 001 001 000 Old_age Always - 2131639
Yes, that really is over 2.1 million head load/unload cycles...
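To put that number in perspective, the raw values can be pulled out of the smartctl output and divided by the power-on hours. This is only a rough Python sketch and not part of smartmontools: the helper name raw_values is made up for illustration, and it assumes the raw value is the tenth whitespace-separated column, as in the output above.

```python
def raw_values(smartctl_output):
    """Map ATTRIBUTE_NAME -> integer RAW_VALUE for each attribute row.

    Assumes the column layout shown above: ID#, name, flag, value, worst,
    thresh, type, updated, when_failed, raw value.
    """
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


# Three rows copied from the smartctl output above.
sample = """\
  9 Power_On_Hours          0x0012 053 053 000 Old_age Always - 20827
193 Load_Cycle_Count        0x0012 001 001 000 Old_age Always - 2131639
194 Temperature_Celsius     0x0002 114 114 000 Old_age Always - 47 (Lifetime Min/Max 17/62)
"""

attrs = raw_values(sample)
cycles_per_hour = attrs["Load_Cycle_Count"] / attrs["Power_On_Hours"]
seconds_per_cycle = 3600 / cycles_per_hour

print(f"{cycles_per_hour:.1f} load cycles per power-on hour")
print(f"one head unload roughly every {seconds_per_cycle:.0f} seconds")
```

For my drive that works out to roughly 102 load cycles per power-on hour, or about one head unload every 35 seconds, averaged over the whole life of the drive.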
Google turned up something about this problem that was submitted to Slashdot last October:
Ubuntu May Be Killing Your Laptop's Hard Drive
Note that I'm using Debian on this laptop, not Ubuntu, but this seems to be a common problem across many if not all Linux distributions.
Google also turned up these two lengthy discussions about the problem:
High frequency of load/unload cycles on some hard disks may shorten lifetime
laptop harddrive Load_Cycle_Count issue
Digging into the internals of the system, I find via 'hdparm -I /dev/hda' that even though I've disabled power management in the BIOS, the drive is still running in "Low Power Idle" mode:
Capabilities:
LBA, IORDY(can be disabled)
Standby timer values: spec'd by Vendor, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = 0
Advanced power management level: 128
Recommended acoustic management value: 128, current value: 254
DMA: mdma0 mdma1 mdma2 udma0 udma1 *udma2 udma3 udma4 udma5
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=240ns IORDY flow control=120ns
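If you want to check your own drive without reading through the whole listing, the level can be picked out of the 'hdparm -I' text with a regular expression. This is a minimal sketch: the function name apm_level is hypothetical, the sample text is trimmed from my output above, and the "below 254" warning is just my rule of thumb for this drive, not something hdparm reports.

```python
import re


def apm_level(hdparm_identify_output):
    """Return the reported APM level as an int, or None if no numeric level is shown."""
    match = re.search(r"Advanced power management level:\s*(\d+)",
                      hdparm_identify_output)
    return int(match.group(1)) if match else None


# Trimmed copy of the 'hdparm -I /dev/hda' output shown above.
sample = """\
Capabilities:
        LBA, IORDY(can be disabled)
        Advanced power management level: 128
        Recommended acoustic management value: 128, current value: 254
"""

level = apm_level(sample)
if level is not None and level < 254:
    print(f"APM level is {level}: the drive may still unload its heads when idle")
```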
I managed to find a copy of the Travelstar 80GN OEM Specification v2.0 (986k PDF) datasheet after following a link from this page.
Section 6.3.6 Load/unload:
"The product supports a minimum of 300,000 normal load/unloads."
Section 11.7.2 Active Idle Mode:
"In this mode, power consumption is 45-55% less than that of Performance Idle mode. Additional electronics are powered off and the head is parked near the mid-diameter of the disk without servoing. Recovery time to Active mode is about 20 ms."
Section 11.7.3 Low Power Idle Mode:
"Power consumption is 60-65% less than that of Performance Idle mode. The heads are unloaded on the ramp but the spindle is still rotated at the full speed. Recovery time to Active mode is about 300ms."
Section 13.33 Set Features (EFh), Note 3 (numbered page 154):
"Note 3. When the Feature register is 85h (=Disable Advanced Power Management), the deepest Power Saving mode becomes Active Idle."
Section 11.7.3 certainly explains a few things. Section 6.3.6 still doesn't tell me what the maximum rated number of load/unload cycles is for my 80GN series drive but later models of Hitachi drives tend to be rated at 600,000 cycles. It's probably safe to assume my drive is similarly designed and rated.
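A quick back-of-the-envelope comparison shows how far past those figures the drive is. This is a sketch using the raw SMART values from the smartctl output; keep in mind the 600,000 figure is my assumption based on later Hitachi models, not a documented rating for the 80GN.

```python
LOAD_CYCLES = 2_131_639      # raw Load_Cycle_Count from smartctl
POWER_ON_HOURS = 20_827      # raw Power_On_Hours from smartctl

# 300,000 is the minimum quoted in section 6.3.6; 600,000 is the figure
# assumed here from later Hitachi models, not a documented 80GN rating.
RATINGS = {
    "80GN spec minimum": 300_000,
    "later Hitachi models (assumed)": 600_000,
}

cycles_per_hour = LOAD_CYCLES / POWER_ON_HOURS

for label, rated in RATINGS.items():
    times_over = LOAD_CYCLES / rated
    hours_to_reach = rated / cycles_per_hour
    print(f"{label}: {times_over:.1f}x the rating, "
          f"reached after about {hours_to_reach / 24:.0f} days powered on")
```

Either way the drive is several times past the figure, and at the average rate it would have passed the 300,000 minimum after roughly four months of being powered on.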
The drive really should have been running in Active Idle mode (0xFE / '-B 254') while on AC power, but nothing in the system changed the default from Low Power Idle mode (0x80 / '-B 128'). Unlike Windows, Linux is completely modular, and the utilities that can change the default setting are separate from the kernel and base system installation.
Section 13.33 also explains why some of the people involved in the Ubuntu discussions linked above had luck using '-B 255' instead of '-B 254'. The '-B 255' hdparm option causes hdparm to issue a Set Features command with the Feature register set to 0x85 (Disable Advanced Power Management), which in the case of the 80GN series drives has the same effect as setting the level to 0xFE ('-B 254'): the deepest power saving mode becomes Active Idle.
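Here is how I understand the mapping between the '-B' value and what gets sent to the drive, written out as a small Python sketch. The opcode and feature values are from the ATA Set Features command (EFh, with 05h to enable APM and 85h to disable it); the function names are made up for illustration, the descriptions follow the ranges in the hdparm man page, and this is not hdparm's actual source code.

```python
SET_FEATURES = 0xEF        # ATA SET FEATURES command opcode
ENABLE_APM = 0x05          # Feature register: enable APM, level in Sector Count
DISABLE_APM = 0x85         # Feature register: disable APM


def apm_registers(level):
    """Return (command, feature, sector_count) for an 'hdparm -B <level>' value."""
    if not 1 <= level <= 255:
        raise ValueError("APM level must be between 1 and 255")
    if level == 255:
        # -B 255 asks the drive to turn APM off; no level is sent, so the
        # Sector Count value is not meaningful (0 is used as a placeholder).
        return (SET_FEATURES, DISABLE_APM, 0)
    return (SET_FEATURES, ENABLE_APM, level)


def describe(level):
    """Rough description following the hdparm man page ranges."""
    if level == 255:
        return "APM disabled"
    if level >= 128:
        return "APM enabled, spin-down not permitted"
    return "APM enabled, spin-down permitted"


for level in (128, 254, 255):
    cmd, feature, count = apm_registers(level)
    print(f"-B {level}: command 0x{cmd:02X}, feature 0x{feature:02X}, "
          f"count 0x{count:02X} ({describe(level)})")
```

So 0x80 and 0xFE are levels passed along with the "enable" feature, while 255 is a different feature code altogether, which is why a drive can treat 254 and 255 the same or differently depending on its firmware.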
A quick-fix command is: 'hdparm -B 254 /dev/hda' [for 0xFE (Active Idle)]
...of course that's a temporary fix and it's a little late now, but at least I know why the drive is starting to have problems.
Further research Tuesday morning showed that neither the acpi-support nor the laptop-mode-tools software package was installed on my laptop by default.
Debian's acpi-support package as of version 0.103-5 has support for changing the default power management mode for a laptop hard drive via 'hdparm -B'. The 90-hdparm.sh script found in /etc/acpi under the ac.d, battery.d, resume.d, and start.d directories contains this comment:
# This script adjusts hard drive APM settings using hdparm. The hardware
# defaults (usually hdparm -B 128) cause excessive head load/unload cycles
# on many modern hard drives. We therefore set hdparm -B 254 while on AC
# power. On battery we set hdparm -B 128, because the head parking is
# very useful for shock protection.
The laptop-mode-tools package can also control the hard drive's power management if CONTROL_HD_POWERMGMT is enabled in the /etc/laptop-mode/laptop-mode.conf config file. It has 3 other settings that allow fine tuning of the power management levels. These are the defaults:
BATT_HD_POWERMGMT=1
LM_AC_HD_POWERMGMT=254
NOLM_AC_HD_POWERMGMT=254
So...if the acpi-support package is installed and/or the laptop-mode-tools package is installed with CONTROL_HD_POWERMGMT enabled, the power management level for the hard drive will be adjusted.
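The decision both packages make boils down to "which power source am I on?" followed by an 'hdparm -B' call. The Python sketch below only models that choice using the default values quoted above; the function names are hypothetical, it builds the command without running it, and the real packages do this with shell scripts, not Python.

```python
def target_apm_level(on_ac, tool="acpi-support"):
    """Pick the 'hdparm -B' level each package would apply by default."""
    defaults = {
        # From the 90-hdparm.sh comment: 254 on AC, 128 on battery.
        "acpi-support": {"ac": 254, "battery": 128},
        # From laptop-mode.conf: *_AC_HD_POWERMGMT=254, BATT_HD_POWERMGMT=1.
        "laptop-mode-tools": {"ac": 254, "battery": 1},
    }
    return defaults[tool]["ac" if on_ac else "battery"]


def hdparm_command(device, on_ac, tool="acpi-support"):
    """Build (but do not run) the hdparm invocation for the current power source."""
    return ["hdparm", "-B", str(target_apm_level(on_ac, tool)), device]


print(hdparm_command("/dev/hda", on_ac=True))
print(hdparm_command("/dev/hda", on_ac=False, tool="laptop-mode-tools"))
```

Note the difference on battery: acpi-support keeps the hardware default of 128, while laptop-mode-tools goes all the way down to 1, the most aggressive power saving level.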
So who exactly is at fault for this problem?
The answer seems to be both everybody (Dell, Hitachi, and Debian) and at the same time, nobody.
It makes sense that the drive manufacturers default the drive to very aggressive power management settings that unload the heads and spin down the platters. This helps the drive survive minor shocks and bumps if used in a mobile system. It will help prevent head impacts on the platters and the resulting platter/head damage, data loss, and warranty returns. Of course, this is at the expense of a shorter drive life since spinning the drive down and unloading the heads both contribute to extra wear and tear, especially if the drive is going to be accessed regularly or spun right back up again.
From what I've read, Dell has changed the BIOS screens in newer laptops so that this setting can be better adjusted. Supposedly enabling "Performance Mode" in their newer BIOS will take care of it. That still doesn't help those of us who are still using older laptops or people who don't know about that BIOS setting. It would be nice if Dell were to issue a BIOS update for these older laptops and make this problem better known, but they probably see doing so as admitting at least some sort of liability.
Debian and other Linux distributions really should be better at detecting the power management settings by default and changing the defaults as necessary. Windows does this to varying degrees but Linux offers much more control over these settings if the required utilities are installed. I haven't checked but hopefully Debian has since changed the priority level of the acpi-support package so it will be installed by default on ALL systems. The acpi-support package's default settings (at least in the version in the debian-unstable branch) would have saved a lot of wear and tear on my laptop's hard drive.
I guess it's a little late for my laptop's drive but maybe this information will at least save someone else's hard drive...