Hard drive failure, recovery, and a wasted Sunday
This is the email that I sent to all users on my server yesterday. Yes, it sucks; yes, it could be worse, and I am lucky that nothing was lost; and yes, I am guilty. But no, an overworked engineer who also does system administration is not likely to have "what if this drive fails less than four years after being produced?" high on his list of priorities. Even if it's an IBM SCSI drive.

OK, I will put everything on a nice RAID array when the new server is ready.

Date: Mon, 29 Nov 2004 04:37:24 -0700 (MST)
Subject: Server outage -- hard drive crash and upgrade
  Yesterday morning, November 28 2004, around 6am, a hard drive (old 18G
IBM Ultrastar DDYS-T18350N) on the mail, web, database and shell server
mars.illtel.denver.co.us produced access errors, and the system hung
until 9:42, when I arrived at the lab and restarted the server.

  Since the drive was failing, I bought a replacement 250G Maxtor
6Y250P0 and installed it in the server at 13:20, after another failure
of the old disk at 12:33, while I was at the store buying the drive.
I discovered that the old IDE controller does not support the full 250G
size, and reduced the usable disk space to 128G -- the rest of the space
will become available when I move this drive into a new 1U server.

  The box was disconnected from the network, and all data from the
old drive was copied to the new one, which, along with installation,
rearrangement and verification, took the whole day. The old drive was
removed, and the computer was reconnected to the network and booted from
the new drive on November 29 at 03:47.

  There was no data loss and no configuration changes, and the layout of
the filesystems is approximately the same as on the old hard drive.
Incoming mail was delayed, and should arrive within a few hours. Both the
root and /home filesystems were increased in size (to 59G and 65G), and
despite the switch to a slower IDE interface, the data transfer rate
remained approximately the same as with the old drive. When the new
server is completed, the new drive will be moved to it, with the
remaining half of its capacity and full ATA133 speed enabled, increasing
both available space and performance.

  Please accept my apologies for this outage, and for the delays and
inconvenience that it caused.

Alex Belits
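
A note on the 128G figure in the email: the message does not say why the old IDE controller stopped there. One likely explanation, and it is only an assumption here, is the 28-bit LBA addressing limit of older ATA controllers, which caps a drive at 2^28 sectors of 512 bytes each. The short Python sketch below works out that arithmetic; the names `SECTOR_BYTES`, `LBA_BITS` and `max_capacity_bytes` are made up for illustration.

```python
# Sketch: capacity addressable with 28-bit LBA.
# Assumption: the 128G cap in the email comes from this limit;
# the email itself does not state the cause.

SECTOR_BYTES = 512   # traditional ATA sector size
LBA_BITS = 28        # sector address width on older ATA controllers


def max_capacity_bytes(lba_bits: int = LBA_BITS,
                       sector_bytes: int = SECTOR_BYTES) -> int:
    """Return the largest capacity reachable with the given address width."""
    return (2 ** lba_bits) * sector_bytes


limit = max_capacity_bytes()
print(limit)             # 137438953472 bytes
print(limit // 2 ** 30)  # 128 (binary gigabytes)
```

If that assumption holds, the cap is 128G in binary units, or about 137 billion bytes, which matches the usable space reported in the email and leaves roughly half of a nominal 250G drive unreachable until it sits behind a controller with wider addressing.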
