Recovering data from a RAID 5 array with two disks dead

Has anyone ever needed to try this? We have a satellite branch in Brownsville that has a research database on a RAID 5 array. Earlier today the tech there saw the database was down and discovered two of the five drives were 'dead'. The kicker? The tech doesn't have a recent good backup. The most recent full backup of that server is from March!!! Anyway, I don't know if he means physically dead or what, but the drives are getting shipped to me tomorrow so I can try to recover the bad disks using File Scavenger. One of the support guys from QueTek (makers of File Scavenger) said to pull the drives, put them in another machine, and try to pull the disk image bit by bit.

Anyone have any better/more expedient suggestions?

If I can't get anything rebuilt, we'll have to send it to a cleanroom facility and pay a minimum of $8K, and the researchers really don't want to spend that money if they don't have to.

First: shoot the guy who didn't take backups. RAID is NOT a backup. RAID just prevents downtime from drive failure. That should be a job-loss error for most admins.

Second: mathematically, with two dead drives, you can't recover the data. But, it's possible that the RAID controller marked the drives bad when most of the data is still intact. Most RAID controllers will let you force a drive into active state even though it's supposedly failed. But you will probably have to do it with the exact same card or software stack.
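
If it's the Linux md ('software stack') case, the forcing step is roughly the sketch below; with a hardware card you're stuck with whatever the vendor's utility allows. This is just a sketch: the device names are made up, and as with everything else here, only ever run it against copies, never the original disks.

    #!/usr/bin/env python3
    # Rough sketch, Linux md (software RAID) only -- a hardware controller
    # needs the vendor's own utility for this step.  Device names are
    # hypothetical; point it at the cloned drives, not the originals.
    import subprocess

    MEMBERS = ["/dev/sdb1", "/dev/sdc1", "/dev/sdd1", "/dev/sde1", "/dev/sdf1"]

    # --force accepts members whose event counters say they were kicked out;
    # --run starts the array even if it comes up degraded.
    subprocess.run(
        ["mdadm", "--assemble", "--force", "--run", "/dev/md0", *MEMBERS],
        check=True,
    )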

To minimize wear on the failing drives, try installing them in a non-RAID system first and taking a full sector-by-sector drive image. Image each drive into a file on another storage system, then push that file back onto another, exactly identical drive. The idea is to absolutely minimize the amount of run time on the disks. Then force the replacement drives to active mode on the controller, presuming your controller has such a function.
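
To make "sector-by-sector image" concrete, the logic is just the loop below. In practice you'd reach for dd or ddrescue, which also handle retries and logging of bad regions, but this shows the whole idea. It assumes Linux and root; the device and file paths are made up.

    #!/usr/bin/env python3
    # Minimal sketch of imaging a failing drive sector by sector.
    # Paths are hypothetical.  The source is opened read-only, every block is
    # read once, and unreadable regions are padded with zeros so the image
    # stays the same size as the disk.
    import os

    SOURCE = "/dev/sdb"               # the failing drive (hypothetical)
    DEST = "/mnt/scratch/sdb.img"     # image file on other storage (hypothetical)
    BLOCK = 1024 * 1024               # 1 MiB per read

    src = os.open(SOURCE, os.O_RDONLY)
    size = os.lseek(src, 0, os.SEEK_END)

    with open(DEST, "wb") as dst:
        pos = 0
        while pos < size:
            want = min(BLOCK, size - pos)
            os.lseek(src, pos, os.SEEK_SET)
            try:
                chunk = os.read(src, want)
            except OSError:
                chunk = b""           # drive wouldn't answer for this region
            # Pad short or failed reads so offsets stay aligned with the disk.
            chunk = chunk.ljust(want, b"\x00")
            dst.write(chunk)
            pos += want

    os.close(src)

Pushing the image back onto an identical replacement drive is the same copy in the other direction.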

For maximum safety, take the time to image ALL the drives, not just the failing ones, because that gives you a safety net if you mess something up when forcing the drives back to active. You don't have to re-image the good drives to new hardware, but you want the backup just in case. Note that this is a very long process, and imaging drives on several separate systems at once will let you do it in parallel.

If you keep the failing drives as cold as you possibly can while extracting the data, that can keep a dying bearing or motor working well enough to get the data off. Just be careful of condensation: you want the drives as cold as you can get them without moisture forming on them. Doing the cooling in a heavily air-conditioned room will keep the ambient moisture to a minimum, and will let you chill the drives down further.

If those things don't work, the data recovery facility is the next step.

This is, by the way, a lot of work without a high probability of success, but it's better than no chance at all.

Thanks, Malor. This is pretty much what we were planning to try. I was hoping maybe there was another solution with a better chance of success, but I guess not. Still waiting on the courier to show up with the drives, so we'll see what happens.

As for the backups, I got some more info this morning. Apparently the guy 'inadvertently' overwrote his most recent incremental backup on Friday, and his last full backup was the one from March. Regardless, if he wasn't 400 miles away I certainly would be strangling him.

Thanks again.

bighoppa wrote:

Regardless, if he wasn't 400 miles away I certainly would be strangling him.

I think that calls for a road trip.

You should mail him a rope and an instruction sheet.

Oh, if you're using Linux software RAID, you might be able to figure out how to bring the volumes up through loopback. Just remember to operate on copies of the files you create, not the originals.
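
Off the top of my head that would look something like this. The image file names are made up, it only applies to Linux md, and you should work from copies of the images, not the only set you have.

    #!/usr/bin/env python3
    # Sketch of assembling an md array read-only from drive images over loopback.
    # Image paths are hypothetical and should be COPIES of the originals.
    import subprocess

    IMAGES = [
        "/recovery/copy-sdb.img",
        "/recovery/copy-sdc.img",
        "/recovery/copy-sdd.img",
        "/recovery/copy-sde.img",
        "/recovery/copy-sdf.img",
    ]

    loops = []
    for img in IMAGES:
        # Attach each image to the next free loop device, read-only.
        out = subprocess.run(
            ["losetup", "--find", "--show", "--read-only", img],
            check=True, capture_output=True, text=True,
        )
        loops.append(out.stdout.strip())

    # Assemble the array read-only from the loop devices, then mount it
    # read-only (e.g. mount -o ro /dev/md0 /mnt/recovered).
    subprocess.run(
        ["mdadm", "--assemble", "--readonly", "/dev/md0", *loops],
        check=True,
    )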

This is all Windows Server 2003/2008 stuff. RAID is via HP RAID controllers in the server.

Since they're Serial Attached SCSI drives, we had to go hit the shops and find a controller board. These are large form-factor (3.5") 15k SAS drives. Got the controller board back here and installed it, only to realize the cables from the controller to the drives need Molex power connectors. All my connectors are 15-pin SATA power connectors. Back to the store.

I'm starting to see why data recovery companies make so much money...:P

Rope and instruction sheet. You'll feel better.

Malor wrote:

First: shoot the guy who didn't take backups. RAID is NOT a backup. RAID just prevents downtime from drive failure. That should be a job-loss error for most admins.

Agreed.

bighoppa wrote:

This is all Windows Server 2003/2008 stuff. RAID is via HP RAID controllers in the server.

Did both drives fail at the same time? That's pretty unlikely. I'm betting one failed a while back, and wasn't noticed. Also shoot the guy that didn't install the HP system management utilities that would have warned you when that first drive died.

Actually, we had a similar situation, where two out of three of the drives in our RAID array failed at the same time. If you look it up on Wikipedia you'll find some articles on why this is a lot more common than people usually think.

Generally the reasoning goes: people buy a server with the intent of setting up a RAID, so they order several identical drives in one order. The supplier picks the next couple of drives off the stack, and they'll most likely be ones that were manufactured right beside each other, which massively increases the chance that if one drive fails mechanically, the second drive will have the same problem. The chances are somewhere in the 70% range, I believe.

Yeah, that happens a lot more than you'd think. Big shops will often require that all RAIDs be constructed from separate manufacturing runs of drives, sometimes from separate vendors, to help avoid the domino theory of drive failure.

I was a PFE at IBM for the DS4000 series of SAN controllers for several years. We had to deal with bad manufacturing runs of drives. Even then, drives don't fail at exactly the same time. Usually what brings an array down is that during the rebuild onto a hotspare, the system can't read a sector on one of the remaining 'good' disks, and since it can no longer reconstruct that stripe, it fails the array to prevent any further corruption. 99% of the time that unreadable sector is on an unused part of the disk (which is why SAN controllers all have a data scrubbing feature these days), and you can just zero out the stripe, bring the array back online, and finish the rebuild.

This lets me bring the thread back on track with the following question to bighoppa:

Was there a hotspare in the system, and was it trying to rebuild onto it? Server RAID systems tend to lack the data scrubbing features found in SAN controllers.

Don't rule out the possibility that the RAID controller is the problem. We had a similar situation: once we replaced the drives and restored the image, the same error occurred, and then we realized it was the RAID controller.

Deftly:
There was no hot spare (Lord knows why they set it up that way).

Just as an update: once I received the drives, I plugged them into a separate controller and two of the drives failed to register at the BIOS level. We sent the drives out to AI Networks of Irvine, CA, where they confirmed the drives were not functioning, and they are going into the clean room for rebuild on Monday.

Rope and instruction sheet.

Malor wrote:

Rope and instruction sheet.

Not to put too fine a point on it, but once all this is cleared up, those may not be necessary.