Recovering from a crashed (soft) RAID

My previous post was about one of the disks in my mirrored (RAID1) setup dying. This post is about how it was fixed. And fixed it was, without any real pain, just a bit of confusion.

Right, it all started at dusk in the castle, the wolves were howling in the distance, and you could see, but not hear, the owls fly over the fields, hunting their prey... No, scratch that.
It started with a message like this:
This is an automatically generated mail message from mdadm
running on xyz

A DegradedArray event had been detected on md device /dev/md0.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdd1[1]
312568576 blocks [2/1] [_U]
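That means that one of the disks has gone to HDD heaven (you know, where the beer volcano is). Notice the [2/1] and the [_U]: the array expects two devices but only one is active, and the underscore marks the slot of the missing disk. In short, one disk of two is gone.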
Take this message seriously!
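If you'd rather have a script read those brackets for you, here is a minimal Python sketch of the idea. It only parses text in the format shown above; the function name degraded_arrays and the SAMPLE_MDSTAT constant are made up for this example, and on a real machine you would feed it the contents of /proc/mdstat instead.

```python
import re

# Sample text copied from the mail above (not read from a live system).
SAMPLE_MDSTAT = """\
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdd1[1]
      312568576 blocks [2/1] [_U]
"""


def degraded_arrays(mdstat_text):
    """Return {array: (expected, active, slots)} for every degraded array.

    Hypothetical helper: 'slots' is the [_U]-style string, where '_' marks
    a missing member and 'U' marks one that is up.
    """
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected = int(status.group(1))
            active = int(status.group(2))
            slots = status.group(3)
            if active < expected or "_" in slots:
                result[current] = (expected, active, slots)
    return result


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

An empty result means no array in the text looks degraded. This is just a convenience on top of the mail from mdadm, not a replacement for it.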
Now the next job for you is to locate (read: freak out and stress whilst buying) a replacement disk. This disk should be of equal or larger size. I chose equal. I'm cool like that.
Then you need to figure out which of the disks it is that's gone. This caused me some head-scratching. Issue mdadm --detail /dev/md0 (where md0 is your multi-disk device, a.k.a. the RAID that's failed) and you get an output that finishes in something like this:
Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 49 1 active sync /dev/sdd1

That tells you which disk is active. And as I knew that both of my disks were of identical size, I could check the partitions with cat /proc/partitions, which gave me an output with this in it (full output not included):
8 32 312571224 sdc
8 33 312568641 sdc1
8 48 312571224 sdd
8 49 312568641 sdd1
9 0 312568576 md0
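
The size comparison can also be done by a few lines of Python. This is only a sketch under the same assumption I made by hand, namely that the two mirror members are the same size and nothing else on the machine is. The names parse_partitions and same_size_candidates are invented for the example, and the sample text is the listing above.

```python
# Sample text copied from the listing above (major, minor, blocks, name).
SAMPLE_PARTITIONS = """\
   8    32  312571224 sdc
   8    33  312568641 sdc1
   8    48  312571224 sdd
   8    49  312568641 sdd1
   9     0  312568576 md0
"""


def parse_partitions(text):
    """Return {name: blocks} from /proc/partitions-style text."""
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header line and blank lines; data rows start with a number.
        if len(fields) == 4 and fields[0].isdigit():
            sizes[fields[3]] = int(fields[2])
    return sizes


def same_size_candidates(text, active_member):
    """List partitions with the same block count as the active RAID member."""
    sizes = parse_partitions(text)
    target = sizes[active_member]
    return sorted(
        name for name, blocks in sizes.items()
        if blocks == target and name != active_member
    )


if __name__ == "__main__":
    # sdd1 is the member that mdadm --detail reported as "active sync".
    print(same_size_candidates(SAMPLE_PARTITIONS, "sdd1"))
```

If the list comes back with more than one name, the size trick is not enough on its own and you'll have to identify the disk some other way.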

As you can see, /dev/sdc and /dev/sdd are exactly the same size. Some high-end maths applied means that, since /dev/sdd1 was the active one, it was /dev/sdc that had taken a leave of absence.
Now all I had to do was to figure out which disk that was on the motherboard. I concluded that it would be the one with the lower number; I had SATA something 1 and SATA something 0... I unplugged number 0, rebooted the machine and checked if the data was still present. It was, which meant I had unplugged the correct disk.
Shut down again and replace the broken disk with a brand spanking new one. Boot the machine back up and everything is tickety-boo. Just like before. Data still present, but now we have a "partitionless" disk in the mix. If you issue fdisk -l (for list) you'll get a list of what you've got. One entry should say something like:
Disk /dev/sdc doesn't contain a valid partition table

And that's the one you now need to partition. I had never used fdisk before, and I absolutely hate toiling with disks and partitions. It's so easy to make a mistake. Start fdisk with fdisk /dev/sdc (obviously where the device is the one you're working on). Then you want to create a new primary partition. Press n for new, then select primary and then partition number 1. My disk used the full space, so I just accepted the defaults when it came to size. Then you need to change the partition type (not the file system) to "Linux raid autodetect". I did it this way: press t for type, then enter fd (which is the hex code for "Linux raid autodetect"), or press L for a list of options and pick the code from the list. If you now press p to print the partition table you should see something like this:
/dev/sdc1 1 38913 312568641 fd Linux raid autodetect

If the table looks right, press w to write it to the disk and quit fdisk; nothing is saved until you do. This means you're ready for the sweetest of the sweet spots when it comes to all your previous efforts with this RAID unit. All you have to do is to add the newly partitioned disk to the RAID array.
Issue this simple command: mdadm /dev/md0 --add /dev/sdc1 (naturally the devices should match what you're working with) and, as Gordon Ramsay says, "Done!".
Now you can issue mdadm --detail /dev/md0 to check the status of your array. You should see something like this:

State : clean, degraded, recovering
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1

Rebuild Status : 0% complete

UUID : e6583ff6:48d02cb1:4e3ee66a:d08da4cd
Events : 0.17658

Number Major Minor RaidDevice State
2 8 33 0 spare rebuilding /dev/sdc1
1 8 49 1 active sync /dev/sdd1
The keywords here are "recovering", "Rebuild Status" and "rebuilding". If you issue cat /proc/mdstat you get an output something like this, and there you can see the nice progress going on:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdc1[2] sdd1[1]
312568576 blocks [2/1] [_U]
[>....................] recovery = 1.6% (5299456/312568576) finish=108.9min speed=46994K/sec
I can tell you that it's a very very sweet feeling seeing the recovery percentage creep upwards. Disaster averted! YAY!
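
For the curious, the finish estimate on that recovery line can be reproduced from the other numbers on it. The Python sketch below pulls the figures out of the sample line and does the arithmetic. It assumes the block counts are in the same "K" unit as the speed, which the numbers here seem to bear out, and the name recovery_progress is made up for the example.

```python
import re

# The recovery line copied from the /proc/mdstat output above.
SAMPLE_LINE = (
    "[>....................] recovery = 1.6% (5299456/312568576) "
    "finish=108.9min speed=46994K/sec"
)


def recovery_progress(line):
    """Parse a recovery line; return None if the line is not one."""
    match = re.search(
        r"recovery\s*=\s*([\d.]+)%\s*\((\d+)/(\d+)\)\s*"
        r"finish=([\d.]+)min\s*speed=(\d+)K/sec",
        line,
    )
    if match is None:
        return None
    done = int(match.group(2))
    total = int(match.group(3))
    speed = int(match.group(5))
    return {
        "percent": float(match.group(1)),
        "done": done,
        "total": total,
        "reported_finish_min": float(match.group(4)),
        # Remaining blocks divided by speed gives seconds; convert to minutes.
        "estimated_finish_min": (total - done) / speed / 60,
    }


if __name__ == "__main__":
    info = recovery_progress(SAMPLE_LINE)
    print(f"{info['estimated_finish_min']:.1f} min left "
          f"(mdstat says {info['reported_finish_min']})")
```

With the sample line, the estimate comes out at roughly 109 minutes, in line with the finish=108.9min that mdstat reports. The speed varies during a rebuild, so treat the figure as a rough guide.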

A few notes.
This page http://linux-raid.osdl.org/index.php/Reconstruction says that you should "Use raidhotadd /dev/mdX /dev/sdX to re-insert the disk in the array". That's deprecated; you should now just use the --add flag to mdadm, nothing more.
I also thoroughly recommend writing on the physical disk which device it corresponds to, i.e. write /dev/sdXX on the disk you stick into the machine. That way you don't have to waste time figuring out exactly which /dev/sdXX is which disk.
My disk setup in the machine is a single IDE disk that runs the system, and then the soft RAID1 mounted at /home/, which also holds all my music and photos, mainly linked in with soft links and so forth. The system disk is only backed up, not mirrored. I also have a "scratch disk" there that's normally not mounted, but that's another story.
A few resources:
Big thanks to the guys whom I've bothered with this. Dempa (being calm, tips, links, and generally being a "Good Guy"(tm)), Aaaandrew (again, being calm, having insight, and good support, and also being a "Good Guy"(tm)), and Tooony (for the loan of the panic-backup-disk and also being a "Good Guy"(tm)). Thanks!!

I can't express how convenient this has been. From now on RAIDed disks will always be part of my machines. I'm currently scheming on how I can make the server even more "indestructible". Muahahaha!
