Nobody expects a double disk failure - Alex Belits
Yesterday I was doing some remote maintenance on the company's servers while sitting in a coffee shop. A combination of OpenVPN and ssh lets me access everything from outside, and serial consoles keep me from losing the connection to the servers in case of a severe networking misconfiguration or a bad kernel install, so I didn't have to be in the lab just to run updates and change configuration. I still had an updated kernel waiting for a reboot, so I planned to reboot the server from the lab today just to be safe, but everything else was perfectly usable remotely.

Until I looked at the log and saw the dreaded hard drive error:

hda: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=5096644, high=0, low=5096644, sector=5096642
ide: failed opcode was: unknown
end_request: I/O error, dev hda, sector 5096642
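Kernel errors like these usually warrant a look at the drive's own SMART data before deciding what to do next. A quick check (using the device name from the log) might be:

```shell
# Print the drive's overall SMART health verdict and attribute table.
# Reallocated_Sector_Ct and Current_Pending_Sector are the attributes
# to watch for a drive that is starting to lose sectors.
smartctl -H -A /dev/hda
```

A failing `-H` verdict, or a growing pending-sector count, is a strong hint that the drive should be replaced rather than trusted.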

That meant my remote maintenance had just become a local one, with a trip to the store for a new hard drive along the way.

Everything on that drive except for recent email had been copied to a backup the night before -- I have rdiff-backup running every night for all servers -- but this server doesn't have RAID, so I couldn't just replace a drive and tell the array to resync. I had to copy everything to the new drive, bring the server down and replace the drive.
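A nightly rdiff-backup job of the kind described can be as simple as a cron entry; the host name and paths here are hypothetical examples, not the author's actual setup:

```shell
# Hypothetical nightly backup job.  rdiff-backup keeps a current mirror
# plus reversible increments, so older versions of files can still be
# restored after the fact.
rdiff-backup --print-statistics \
    root@mail-server::/ /backup/mail-server

# Expire increments older than four weeks to keep the backup drive
# from filling up.
rdiff-backup --remove-older-than 4W --force /backup/mail-server
```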

Since the failing drive had some email newer than the backup, and was likely to keep running for days, I decided to keep the box running, copy everything read-only from it, shut down all services, copy everything that was supposed to be read-write, then replace the drive. If at any point I encountered more errors, I would still have the backup; if I didn't, I would lose no email and only have 1-2 hours of downtime at night while vast amounts of email were copied. If it had been something more critical, I could have copied from the backup first and then rsynced the changes -- but then, if it had been more critical, it would have had RAID, and the whole copying procedure would have been unnecessary in the first place.
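The fallback mentioned above -- restore from backup, then catch up with rsync from the still-readable drive -- might look like this (all paths are hypothetical):

```shell
# Restore the most recent backed-up state of the mail spool from the
# rdiff-backup repository onto the new drive (hypothetical paths).
rdiff-backup -r now /backup/mail-server/var/mail /mnt/newdrive/var/mail

# Then pull only the changes made since last night's backup from the
# failing (but still readable) drive; -a preserves ownership,
# permissions and timestamps.
rsync -a --delete /var/mail/ /mnt/newdrive/var/mail/
```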

I bought a new 300G Seagate drive, brought it to work, shut down my desktop box and connected the new drive there. Not wanting to reinstall the bootloader (GRUB), I copied the first sectors from the old drive using dd over ssh and then edited the partition table -- the original drive was 160G, and the partitions had to be re-read after that kind of rewriting.
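Copying the boot sector this way and then fixing up the partition table can be sketched as follows. The device names are examples -- on a real system, verify them before running anything, since dd is unforgiving:

```shell
# Pull the MBR (bootloader code plus partition table) from the old
# drive on the server and write it to the new drive attached to the
# desktop.  Device names are examples -- double-check them first.
ssh root@server 'dd if=/dev/hda bs=512 count=1' | dd of=/dev/hdc bs=512

# The copied partition table still describes a 160G drive; grow the
# last partition in fdisk, then make the kernel re-read the table.
fdisk /dev/hdc
blockdev --rereadpt /dev/hdc
```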

Other than the procedure being slow (a 450MHz PIII is perfectly good as an X terminal with OpenGL but isn't that great at tar over ssh), everything went as planned. At midnight I shut down all external services and continued copying /var. After almost two hours, tens of gigabytes of email, databases and a small amount of logs had been copied to the new drive. I shut down the servers and the desktop, replaced the drive, turned everything on, monitored the boot-up, and everything was up and running. Except for the second server.
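A tar-over-ssh copy of this kind, preserving ownership and permissions across the two machines, might look like this (host and mount point are hypothetical):

```shell
# Stream /var from the failing drive on the server straight into the
# new drive mounted on the desktop.  -p keeps permissions,
# --numeric-owner avoids uid/gid remapping between the two machines.
ssh root@server 'tar -C / -cpf - --numeric-owner var' | \
    tar -C /mnt/newdrive -xpf - --numeric-owner
```

Piping through ssh like this is what makes the 450MHz CPU the bottleneck: every byte is encrypted on one end and decrypted on the other.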

The company's "IT infrastructure" has three servers. The first one is what a server is really supposed to be -- email, web, DNS, DHCP, TFTP (for SunRay terminals and installation of new computers over PXE), the phone system, Bittorrent (as a matter of fact, it serves my custom Linux ISO files) and internal IRC. This is where the drive failed. The second server is a "remote desktop". It's configured just like a regular Linux desktop box, except without the assumption that there is only one user who is likely to be on the local console -- users have normal permissions and no crazy administrative stuff is allowed through sudo. To use this "desktop" the user needs either a remote X session (the aforementioned 450MHz PIII is more than sufficient to run 3D CADs remotely) or a SunRay terminal (everything except 3D applications). Obviously, everything on this server is also backed up, which is a whole lot better than trying to back up all kinds of the company's critical data from a bunch of desktops. The third server is backup/storage: it exports a rather small RAID5 array over NFS to the other servers, and every night takes backups from them onto a USB2 drive (three drives in rotation).

The first two servers are in a small improvised rack, installed without proper rails, so to take one out it's safer to turn them both off. That is not a problem, because no one cares about the "remote desktop" being down for a few minutes around 2am. After turning the power back on I heard a clicking sound from the rack, which could mean only one thing -- another disk failure, and a much more, umm... immediate one. I thought the replacement drive was defective, but a look at the consoles from the desktop showed me that I was wrong: the "desktop server" was producing the errors. The drives in both servers were pretty old, so it wasn't unusual for them to fail at that point. It just happened that a spin down/spin up cycle was enough to trigger the error earlier than it would have happened if the platters had kept spinning.

I turned the "remote desktop" server and the desktop off, connected the drive to the desktop, and booted it again. This time the drive worked, and testing showed no errors, so just like the first drive, it was an intermittent failure -- and just like with the first drive, I had to replace it. I had one spare drive, previously used for testing prototypes, with only 300 hours on it, so I repeated the procedure with both drives on my desktop, changing the partitions to a better layout along the way. Again, I didn't have to use backups; the drive remained readable through the whole procedure. I installed the new drive into the "desktop server", and this time both servers showed no signs of problems.
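Read-only testing of a suspect drive, as described above, can be done without touching the data on it (the device name is an example):

```shell
# Non-destructive read test of the whole drive; any unreadable block
# numbers are printed to stdout.  -s shows progress, -v is verbose.
badblocks -sv /dev/hdc

# Also ask the drive to run its own short SMART self-test, then read
# back the results a couple of minutes later.
smartctl -t short /dev/hdc
smartctl -l selftest /dev/hdc
```

A clean pass here is consistent with an intermittent failure: it doesn't prove the drive is healthy, only that it is readable right now -- which is exactly what a full copy to a replacement drive needs.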

The whole procedure wasn't what I would call difficult or intellectually challenging, and it didn't involve the possibility of any significant data loss; the only real problem was that the time spent on the copying alone turned it into an unexpected night shift. In a way it demonstrated something I already knew: without RAID or redundant servers you WILL GET DOWNTIME, and running servers on single drives instead of RAID1 or RAID5 isn't worth the saved money. However, one unusual thing about it was that two drives failed almost simultaneously. Granted, the second drive probably would have survived longer as part of a hot-swap RAID array, since it wouldn't have spun down, but nevertheless I had two drives fail within the same day. If I had RAID in both servers and two drives had failed in the same server, commonly used RAID configurations wouldn't have saved the data. This is obviously not a good excuse to keep the servers without RAID, and I am still supposed to fix that, but it demonstrates that worst-case or nearly-worst-case scenarios do happen, and RAID in both servers would not guarantee that there won't be a "night shift" like this.

There are other reasons why RAID is not a replacement for reliable backups -- rolling back user errors, intrusions, filesystem corruption and broken controllers are more likely scenarios than simultaneous drive failures. RAID does nothing to prevent data loss in those situations, and this is why I chose to spend resources on "backup everything" rather than on "RAID everywhere" in the first place. But again, in the end the resources were spent anyway.

Storage configuration is always a balance of reliability/availability against cost and speed, and it's possible to build an array that survives multiple failures. For a small company, where a very low probability of a few hours of downtime and a small amount of possible data loss (changes since the last backup) are acceptable, massively redundant arrays may be pointless, and RAID1/RAID5 will be sufficient. But there is always a possibility that the only way to fix the problem is to restore from backups.
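For reference, the RAID1 setup advocated throughout this post is only a couple of commands with Linux software RAID; the device and partition names here are examples:

```shell
# Build a two-disk mirror out of two equally sized partitions (device
# names are examples).  mdadm resyncs the mirror in the background.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hda1 /dev/hdc1

# After a failed disk is replaced, add the new partition back and let
# the array resync -- the step a single-drive server can't do.
mdadm /dev/md0 --add /dev/hdc1
cat /proc/mdstat
```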


From: mackys, March 21st, 2007 04:07 am (UTC)
But there is always a possibility that the only way to fix the problem is to restore from backups.

I think this is one of the lessons that Evi hammered into me, along with "always check your return values from system calls." There's always some way for the system to fail such that backups are the only way to get your data back. The moral? Always have backups.

Uh... what was my point? Oh yeah: <AOL>ME TOO!</AOL> ;]
From: j_b, March 21st, 2007 04:46 am (UTC)

backups ftw!

Glad you were able to copy data off R/O instead of a drive completely smoking.
From: dagbrown, March 21st, 2007 08:09 am (UTC)
Did you buy the drives as a batch? Nice and close serial numbers, manufactured at the same plant on the same day, like?

When building a RAID, always mix and match brands. At least then the failures won't cluster together as much.
From: abelits, March 21st, 2007 10:09 am (UTC)
I have invented a physical object replication technology, made drives using it and placed those drives into a RAID1 array for closest conditions possible, specifically to see how many of them will end up failing simultaneously.

(Of course, I know. Even identical firmware bugs can defeat redundancy.)