Data recovery from Hell - Alex Belits
This happened some time ago and I didn't post it then, but I think it's interesting enough to post here now.

Usually people don't ask me to fix their computers unless/until something really bad or incomprehensible is going on, and this time was no exception -- a Toshiba laptop running Windows XP became unstable for no apparent reason, and occasionally it produced a strange beeping/whining noise that seemed to come from the right speaker.

A virus/spyware check found nothing but inactive remnants of old malware that had been removed manually some time ago, cooling seemed to be sufficient, and running "CPU Burn-In" did not cause anything unusual other than the fans constantly running at full speed. "memtest86+" showed some bad RAM, however after I removed the offending SODIMM, things only got worse -- the sound seemed to appear randomly, the laptop occasionally hung (seemingly with no relationship to the sound), and the event logs did not contain anything unusual, certainly nothing about IDE/ATA. At one point when the laptop seemed hung, I put my ear against it, and finally recognized the sound usually masked by the fan noise -- the hard drive's head was repeatedly seeking and failing to find the track. The drive was dying, and the sound that I thought was produced by the speaker actually came from the drive. The drive had a shitload of valuable data, and the backups were incomplete and half a year out of date.

I had two possible choices -- copy the hard drive's directory tree starting from the "important" directories, or copy the drive as an image. I still don't know which is really the right choice for this kind of situation. In either case I have to transfer gigabytes of data. Copying the subtree obviously moves less data in total, but walking the tree requires re-reading some directories, and dealing with physical errors in the middle of this procedure is a huge pain in the ass. Copying the drive image restores everything in the easiest way possible, and I will always know where to restart after an error, however the total amount of data is almost twice as large. In addition, if the drive dies in the middle of the copying, I will lose an unpredictable part of the data, and many "copied" files will contain uninitialized garbage inside -- the disk had not been defragmented in quite a while. The last time I had to deal with something this massive, I chose the "copy the subtree" method, and ended up with little more than the contents of my home directory when the drive locked up, despite a relatively error-free copying procedure. So this time I decided to tempt fate in a different way -- copy the drive image. The new drive was larger, 80G instead of 60G, so after copying I would have to extend the partition, however if everything copied correctly, extending the NTFS partition and filesystem would be easy. I placed the new drive into a USB enclosure, disabled booting from everything but the CD, booted Knoppix from a CD, and ran "dd".

About half of the drive copied without any problems, then the kernel complained about errors, and "dd" exited. I synced the drive and continued, this time giving "dd" the "skip" and "seek" arguments, so it started from a point slightly before the error. Copying continued for a while, only to result in more errors down the road. This went on, and at some point only rebooting could restart the process. Then a reboot resulted in the hard drive not being recognized by the BIOS or the Linux kernel, so there was nothing to copy from. I opened the laptop and swapped the drives -- the new drive took the place of the old one inside the laptop, and the old one moved into the USB enclosure. I booted Knoppix, cooled the old drive down in a fridge, connected the power, attached the USB cable, and continued copying.
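Restarting "dd" slightly before the point of failure is just arithmetic on block counts. Here is a minimal sketch in Python of how such a command line could be put together. The 1M block size matches the "bs=1024k" mentioned later in this post; the device name /dev/sda for the USB-attached drive, the function name and the two-block back-off are assumptions made for illustration, not details from the original procedure.

```python
def resume_args(blocks_done, backoff=2, src="/dev/sda", dst="/dev/hda", bs="1024k"):
    """Build a dd command line that restarts slightly before the failure point.

    blocks_done: number of full blocks copied so far, counted from the start of the drive.
    backoff:     how many blocks to step back and copy again (arbitrary safety margin).
    src, dst:    hypothetical device names; check which drive is which before running dd.
    """
    start = max(blocks_done - backoff, 0)
    # skip= moves the read position on the source, seek= moves the write
    # position on the target; both must be equal to keep the image aligned.
    return ["dd", f"if={src}", f"of={dst}", f"bs={bs}",
            f"skip={start}", f"seek={start}"]


if __name__ == "__main__":
    print(" ".join(resume_args(30000)))
```

Using the same value for "skip" and "seek" is the whole point: the source and the target are always addressed at the same offset, so re-copying a few blocks is harmless.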

With the drive out of the laptop, I could easily hear its noises, and the repeated seeking sound quickly alerted me that the copying process was stuck again. Thinking that the heads' bearings were going bad, and that they might move better in a tilted position, I tried to slowly tilt the drive while running "strace" on the "dd" process, to see when it would be able to read something. One 1M block passed, and then nothing at all. I moved it again, and another block was copied, however once I stopped moving the drive, it went back into repeatedly-seeking mode. I started rocking the drive in various directions, and it continued copying as long as it moved. I suspected that the data was corrupt, so I looked at the copied blocks with "dd" piped into "od -t x1z", and everything looked reasonable. Re-reading blocks also returned the same data (I could use a ramdisk to store files for comparison), so I just continued this until the process got stuck again.

There were no errors in the kernel log yet, however since I knew that it was stuck, I simply disconnected USB -- the kernel immediately returned an error, and the "dd" process exited. I found that if "dd" was being "strace'd", "strace" exited but "dd" got suspended and had to be killed, however I didn't really need "strace" at that point -- if the copying went well, the USB enclosure's LED was blinking as the blocks were copied, and I could just use "dd if=/dev/hda bs=1024k skip=<whatever> count=1 | od -t x1z" to determine whether some megabyte had been copied. The drive was new, so initially it was all filled with 0xff, and "od" just showed a line of "ff" and "*" after it. Also, "dd" always reported the number of blocks copied before exiting. This continued for a while: I had to keep rocking the drive while it was copying, and when it got stuck I disconnected, power-cycled, reconnected and continued. Plus, a few times I had to make a trip to the fridge between stretches of copying to cool the drive.
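The "dd ... count=1 | od -t x1z" check can also be sketched as a small Python function. This is only an illustration: the function name is made up, and it relies on the observation above that the untouched part of the new drive read back as 0xff. A block that legitimately contains nothing but 0xff in the source data would be reported as unwritten too, so this is a hint, not proof.

```python
def block_is_unwritten(path, block_index, block_size=1024 * 1024):
    """Return True if the given block of the target still contains only 0xff.

    path:        target device or image file (e.g. /dev/hda in the post).
    block_index: block number, the same value that would be passed to dd as skip=.
    """
    with open(path, "rb") as f:
        f.seek(block_index * block_size)
        data = f.read(block_size)
    # An empty read means the offset is past the end of the device or file.
    return bool(data) and data == b"\xff" * len(data)
```

Called with the block number where "dd" stopped, it answers the same question as eyeballing the hex dump: did this megabyte make it to the new drive or not.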

I am not entirely sure how rocking the hard drive helped -- I have thought up two possible explanations, but I am not sure if they have anything to do with reality because I didn't have any means to test them. One explanation is that the dying head movement mechanism could get stuck at some position, and then, as its coil continued applying force, movement started again too fast and the head overshot the track it was seeking. Another one, which I think is more plausible, is that the motor's bearings were so worn out that the platters vibrated, and head positioning couldn't keep up with a constantly moving track. When I was rocking the drive, precession applied enough side force to the bearings to keep the whole thing from vibrating for long enough to let the head position itself and read the track. Placing the drive on its side, which should also have produced side forces in the bearings, had no effect, but it's quite possible that the weight of the rotor and platters was insufficient to stabilize the bearings, and it took precession to stop the vibration.

Then the motor got stuck. When I turned the enclosure on, it made some whining noise and stopped. Turning it on and off, shaking or spinning the drive had no effect, so I switched into "counting losses" mode.

I mounted the partition read-only and ran "find", trying to determine if the directory tree was usable. It was -- mostly. The kernel logged a bunch of directory errors, but seemed to be content with handling everything readable as directory entries. The files didn't fare too well -- I had no idea that NTFS keeps things so fragmented. A lot of files had pieces filled with 0xff and pieces with normal data, so either the files were very fragmented, or it seemed so because the allocation pointers, whatever they are in NTFS, were screwed up. I wrote a script that counted 16-byte-aligned blocks of 16 0xff bytes within the files and started making a list of broken stuff, simultaneously trying to make a backup to another USB hard drive. My hands were tired, and my mood was shitty -- apparently it had been a bad choice to copy a drive image instead of the directory tree. Thinking that there was nothing urgent left, I started copying the disk image over the network to a server using "dd" and "ssh", and went to sleep.
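The original script is not shown here, so what follows is only a sketch in Python of what such a scan could look like. The function names and the threshold parameter are made up for illustration. A file that legitimately contains runs of 0xff will show up as a false positive, so the output is a list of suspects, not a verdict.

```python
import os


def count_ff_blocks(path, block=16, chunk_blocks=4096):
    """Count 16-byte-aligned blocks that consist entirely of 0xff bytes."""
    filler = b"\xff" * block
    count = 0
    with open(path, "rb") as f:
        while True:
            # Chunk size is a multiple of the block size, so alignment is kept.
            chunk = f.read(block * chunk_blocks)
            if not chunk:
                break
            # A trailing partial block at the end of the file is ignored.
            for off in range(0, len(chunk) - block + 1, block):
                if chunk[off:off + block] == filler:
                    count += 1
    return count


def suspect_files(root, threshold=1):
    """Yield (path, count) for files under root with at least `threshold` 0xff blocks."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                n = count_ff_blocks(path)
            except OSError:
                continue  # unreadable file; a real run would probably log it
            if n >= threshold:
                yield path, n
```

Running suspect_files() over the read-only mount point and sorting by count gives a rough "most damaged first" list to check against the incomplete image.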

After waking up I continued messing with the incomplete copy, but then decided to make one last attempt to revive the drive. I attached it to an old AT (yes, AT, not ATX) power supply, thinking that if it was stuck and drew more current than the USB enclosure could provide, a more powerful power supply would either fry its coils, or provide enough current to restart it. I am not sure if there really was a problem with current, but after some shaking the drive spun up again. The rest of the copying went pretty much the same as the part before the motor got stuck, and finally I saw "dd" exit without an error -- the whole image was now on the new drive.

Being afraid that something was still wrong, and that after booting the whole thing would self-destruct, I decided to hold off on booting and copy the rest of the image to the server -- which "dd" and "ssh" did without a problem. At the same time I mounted the partition read-only again, and made a complete copy of everything "important" to the USB drive that I used for backups.

After both the "tree" and "image" backups were ready and verified, I finally shut down Knoppix, re-enabled booting from the hard drive, and restarted everything. It booted, and seemed to behave exactly like the original drive, sans lockups and mysterious sounds. I booted Knoppix again, and used "qtparted" to resize the partition and filesystem to fill the rest of the drive. Having an image backup, I felt that I didn't have much to fear, and after rebooting I was greeted with "chkdsk" re-checking the resized filesystem and returning no errors. The two-days-and-nights-long data recovery procedure from hell was over.

I still had to fix some minor stuff in the system configuration that I had not gotten to before I found the hard drive problem, and after that I ran Windows Update because the laptop hadn't recently been on a fast/permanent enough connection to do that automatically, however those are minor things that have nothing to do with the rest of this horribly exhausting experience. In the end, I have to repeat the obvious:



From: raider3 Date: September 29th, 2005 07:47 am (UTC)
Amen to that. I've been lazy and let gigs of stuff accumulate on my external hard drive. Thank goodness it was my 3-year old CD burner that's developed read/write problems, and not the DVD burner I installed last summer, so I can still back up to CD/DVD.

Now, it's either replace the dying CD burner with another CD burner, or put in the DVD reader that was replaced by the DVD burner. I've been a firm believer in backing up periodically since I bought my last PC back in late 2002.

(Verified CD burner was having problems when CD burns stopped around 40-45%, and noted disc access problems in a Genesis/Megadrive/CD emulator when running a game directly off CD. Also checked the drive with the drive tools on my CD burning software of choice. I could run a lens-cleaner CD on it and hope, or just spend the $$ to replace the drive.)
From: nickhalfasleep Date: September 29th, 2005 04:00 pm (UTC)
"inactive remnants of the old malware", you mean windows, right?

Wow. Thanks for sharing this Alex, I had no idea about what to do when a drive gets this bad. Really neat about the shaking part. I wonder if you stuck the drive in the USB enclosure in the freezer, with the wires protruding, would that help?
From: abelits Date: September 30th, 2005 01:01 am (UTC)
"inactive remnants of the old malware", you mean windows, right?

Some incarnation of coolwebsearch, and similar crap.

Though I consider Windows to be a kind of malware, it was not really inactive, just had trouble reading corrupt NTFS.

I wonder if you stuck the drive in the USB enclosure in the freezer, with the wires protruding, would that help?

I tried to do that, however the drive quickly produced an error, and I had to open the fridge to start moving it again. Maybe if I had some mechanism, like a platform on a servo that continued working inside the freezer, it would work better.
From: (Anonymous) Date: March 24th, 2007 01:56 pm (UTC)


Why aren't you using GNU ddrescue? Are you a masochist?!
From: abelits Date: March 24th, 2007 02:43 pm (UTC)

Re: dd

Because usually after an error the drive became unresponsive, and I had to power-cycle it to be able to continue. ddrescue would work well if particular sectors were producing errors without affecting the rest of the drive, so it would copy the drive in a few passes, filling the gaps as reads of formerly unreadable blocks succeeded. With that drive I would never get the "gaps" that ddrescue is supposed to fill, because data always came in contiguous stretches that ended with an error fixable only by power cycling. So I wanted the copying to visibly fail when this happened, so that I could power-cycle the drive while it was not being accessed, get it re-recognized and restarted, and only then continue copying.

I had to manually check the number of blocks copied or use a hex dump to see where the copied data ended, but the alternative would have been to use some intermediate drive (that I didn't have locally) to store the disk image, and then the ddrescue log or the file length would have been of the same use.
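The "fail visibly at the first error, then resume from a known block" approach described in this reply can be sketched in Python as follows. This is not the procedure that was actually used (that was plain "dd" with "skip" and "seek"); the function name and the return value are made up for illustration, and the target is assumed to exist already, as a block device does.

```python
import os


def copy_until_error(src, dst, start_block=0, block_size=1024 * 1024):
    """Copy block by block from start_block and stop at the first read error.

    Returns (blocks_copied, finished): finished is True only if the end of
    the source was reached without an error. The next call can then start at
    start_block + blocks_copied, or slightly before it.
    """
    copied = 0
    finished = False
    with open(src, "rb") as fin, open(dst, "r+b") as fout:
        fin.seek(start_block * block_size)
        fout.seek(start_block * block_size)
        while True:
            try:
                data = fin.read(block_size)
            except OSError:
                break  # stop right here so the drive can be power-cycled
            if not data:
                finished = True
                break
            fout.write(data)
            copied += 1  # a short final block still counts as one block
        # Make sure everything counted as copied has really reached the target.
        fout.flush()
        os.fsync(fout.fileno())
    return copied, finished
```

The design choice is the same as in the reply: no retries and no skipping ahead, because with this drive any error meant that nothing else would be readable until it was power-cycled.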