Failed Linux MD RAID devices. That’s what I got to deal with yesterday. The ext3 file system produced scary errors and remounted the file system as read-only. A quick look at /proc/mdstat showed

# cat /proc/mdstat
Personalities : [raid1] [raid5]
md1 : active raid5 sdi1[8] sdh1[7] sdg1[6] sdf1[5] sde1[4] sdd1[3] sdc1[2] sda1[0]
645136128 blocks level 5, 256k chunk, algorithm 2 [10/8] [U_UUUUUUU_]


That’s bad.  The [10/8] means the array has ten member devices but only eight of them are active; each `_` in [U_UUUUUUU_] marks a failed slot.  A second hard drive had failed in a RAID 5 array with no spares, which is one more failure than RAID 5 can tolerate.  Our mission: get the data back as best we can.

This was a fileserver in active use.  I declared emergency maintenance and rebooted the server into single-user mode for safety.  After the reboot, /dev/md1 was listed as inactive: there were not enough working devices to bring up the array, and obviously it wasn’t mounted either.  This was exactly where I wanted to be.  (You could also unmount the broken file system and run mdadm --stop /dev/md1 to stop the array.)
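For reference, the stop-the-array path mentioned in the parentheses looks roughly like this; the mount point /srv is made up for the example:

# umount /srv
# mdadm --stop /dev/md1
mdadm: stopped /dev/md1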

Next, use mdadm to force-assemble the array.

# mdadm --assemble --force /dev/md1
mdadm: forcing event count in /dev/sdb1 from X to Y


That should bring the array back online.  The event counter on /dev/sdb1 was the least out of date of the failed disks, and mdadm just fudged it forward.  This does introduce corruption: any writes that happened after /dev/sdb1 dropped out are missing from it.  But once a second device fails, the array stops working almost immediately, so that window is small.  Provided you bring back the most recently failed disk (not the one that failed three years ago, right?), you should introduce minimal corruption.  Either way, you now have a working array.
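Before going further, it’s worth confirming that the array really did come up.  Something like this shows the array state and member list; on a forced assembly you’d expect the State line to read clean or active, with degraded alongside it:

# mdadm --detail /dev/md1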

Next, we back up the array.  Mount /dev/md1 read-only, then use rsync or another tool to copy off the data.  Don’t try to add more disks and rebuild the array before you have a backup; a rebuild hammers the surviving drives, and one more failure finishes the job.
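A minimal version of that backup might look like this; the mount point and destination paths are made up for the example:

# mount -o ro /dev/md1 /mnt/recovery
# rsync -aHAX /mnt/recovery/ /backup/md1/

The -a flag preserves ownership, permissions, and timestamps; H, A, and X add hard links, ACLs, and extended attributes, which matter on a fileserver.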

Our mission is accomplished.  We have data.