Recovering RAID 5 Arrays With Multiple Failed Devices
Failed Linux MD RAID devices. That’s what I got to deal with yesterday. The ext3 file system produced scary errors and was remounted read-only. A quick look at /proc/mdstat showed:
# cat /proc/mdstat
Personalities : [raid1] [raid5]
md1 : active raid5 sdi1[8] sdh1[7] sdg1[6] sdf1[5] sde1[4] sdd1[3] sdc1[2] sda1[0]
645136128 blocks level 5, 256k chunk, algorithm 2 [10/8] [U_UUUUUUU_]
That’s bad. A second drive had failed in a RAID 5 array with no spares. Our mission: get the data back as best we can.
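For a closer look at which slots have actually dropped out, mdadm can report the per-device state of the array. This is only a suggested check; its output varies by system, so it is omitted here.
# mdadm --detail /dev/md1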
This was a fileserver that was in use. I pulled emergency maintenance and rebooted the server into single user mode for safety. After the reboot /dev/md1 was listed as inactive. There were not enough working devices to bring up the array. Obviously, it wasn’t mounted either. This was exactly where I wanted to be. (You could also unmount the broken file system and use mdadm --stop /dev/md1 to stop the array, as sketched below.)
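If a full reboot into single user mode isn’t an option, the manual route looks roughly like this. The mount point /srv/data is a placeholder for this sketch; substitute wherever the array is actually mounted.
# umount /srv/data
# mdadm --stop /dev/md1
The stop only succeeds once nothing is holding the array open, which is why the unmount has to come first.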
Next, use mdadm to force-assemble the array.
# mdadm --assemble --force /dev/md1
mdadm: forcing event count in /dev/sdb1 from X to Y
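Written that way, the assemble relies on the array being described in /etc/mdadm.conf. If yours is not, you can name the member partitions explicitly; the glob below happens to match the devices in this example and is only illustrative.
# mdadm --assemble --force /dev/md1 /dev/sd[a-i]1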
Now that should bring the array online. The event counter for /dev/sdb1 was the least out of date and mdadm just fudged it. This means we have introduced some corruption: the forced-in disk missed whatever writes happened after it was kicked out of the array. The saving grace is that once a second device fails, the array goes down almost immediately. So, provided that you bring back the second failed disk (not the first, which failed 3 years ago, right?), you should introduce minimal corruption. Either way, you now have a working array.
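If you’re curious how stale the fudged disk really was, the event counts live in each member’s superblock and can be compared with --examine. A sketch, again using the member names from this example:
# mdadm --examine /dev/sd[a-i]1 | grep -E 'dev|Events'
The member whose count trails the others by the smallest margin is the one you want forced back in.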
Next, we back up the array. Mount /dev/md1 as a read-only file system. Use rsync or another tool to copy off the data. Don’t try to add more disks and rebuild the array before you back it up.
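A minimal sketch of that backup, assuming a mount point of /mnt/recovery and a destination of /backup/md1 (both placeholders):
# mount -o ro /dev/md1 /mnt/recovery
# rsync -aH /mnt/recovery/ /backup/md1/
The read-only mount keeps you from piling new writes onto an array that is already carrying some corruption, and rsync’s -a and -H flags preserve permissions, timestamps, and hard links on the copy.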
Our mission is accomplished. We have data.