Replacing A Failed Hard Drive In A Software RAID1 Array

Submitted by falko (Contact Author) (Forums) on Sun, 2007-01-28 19:21. :: Linux | Storage | Other

Version 1.0
Author: Falko Timme <ft [at] falkotimme [dot] com>
Last edited 01/21/2007

This guide shows how to remove a failed hard drive from a Linux RAID1 array (software RAID), and how to add a new hard disk to the RAID1 array without losing data.

I do not issue any guarantee that this will work for you!

 

1 Preliminary Note

In this example I have two hard drives, /dev/sda and /dev/sdb, with the partitions /dev/sda1 and /dev/sda2 as well as /dev/sdb1 and /dev/sdb2.

/dev/sda1 and /dev/sdb1 make up the RAID1 array /dev/md0.

/dev/sda2 and /dev/sdb2 make up the RAID1 array /dev/md1.

/dev/sda1 + /dev/sdb1 = /dev/md0

/dev/sda2 + /dev/sdb2 = /dev/md1

/dev/sdb has failed, and we want to replace it.

 

2 How Do I Tell If A Hard Disk Has Failed?

If a disk has failed, you will probably find a lot of error messages in the log files, e.g. /var/log/messages or /var/log/syslog.

You can also run

cat /proc/mdstat

and instead of the string [UU] you will see [U_] (or [_U], depending on which member is missing) if you have a degraded RAID1 array.
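If you do not want to read the output by eye, the check can be sketched as a small shell function. This is only a sketch: degraded_arrays is a name I made up for illustration, and it assumes the /proc/mdstat layout shown in this guide (an "mdN :" line followed by a "blocks" line that carries the status string).

```shell
#!/bin/sh
# Sketch: list md arrays whose status string contains "_" (a missing member).
# degraded_arrays is a hypothetical helper name. It reads /proc/mdstat by
# default, or a saved copy passed as the first argument.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                            # remember the current array
        /blocks/ && /\[[U_]*_[U_]*\]/ { print "/dev/" name }   # status such as [U_] or [_U]
    ' "${1:-/proc/mdstat}"
}
```

Calling degraded_arrays without arguments on the live system prints one line per degraded array and prints nothing if all arrays are healthy.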

 

3 Removing The Failed Disk

To remove /dev/sdb, we will mark /dev/sdb1 and /dev/sdb2 as failed and remove them from their respective RAID arrays (/dev/md0 and /dev/md1).

First we mark /dev/sdb1 as failed:

mdadm --manage /dev/md0 --fail /dev/sdb1

The output of

cat /proc/mdstat

should look like this:

server1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0] sdb1[2](F)
      24418688 blocks [2/1] [U_]

md1 : active raid1 sda2[0] sdb2[1]
      24418688 blocks [2/2] [UU]

unused devices: <none>

Then we remove /dev/sdb1 from /dev/md0:

mdadm --manage /dev/md0 --remove /dev/sdb1

The output should be like this:

server1:~# mdadm --manage /dev/md0 --remove /dev/sdb1
mdadm: hot removed /dev/sdb1

And

cat /proc/mdstat

should show this:

server1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0]
      24418688 blocks [2/1] [U_]

md1 : active raid1 sda2[0] sdb2[1]
      24418688 blocks [2/2] [UU]

unused devices: <none>

Now we do the same steps again for /dev/sdb2 (which is part of /dev/md1):

mdadm --manage /dev/md1 --fail /dev/sdb2

cat /proc/mdstat

server1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0]
      24418688 blocks [2/1] [U_]

md1 : active raid1 sda2[0] sdb2[2](F)
      24418688 blocks [2/1] [U_]

unused devices: <none>

mdadm --manage /dev/md1 --remove /dev/sdb2

server1:~# mdadm --manage /dev/md1 --remove /dev/sdb2
mdadm: hot removed /dev/sdb2

cat /proc/mdstat

server1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0]
      24418688 blocks [2/1] [U_]

md1 : active raid1 sda2[0]
      24418688 blocks [2/1] [U_]

unused devices: <none>
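If a disk carries more than two partitions, typing the fail/remove pair for each one gets tedious. The following sketch only prints the mdadm commands used above for every partition of one disk, so you can review them before running anything. removal_commands is a hypothetical helper name, and it assumes the /proc/mdstat layout shown in this guide.

```shell
#!/bin/sh
# Sketch: print (not run) the mdadm commands that fail and remove every
# partition of one disk from the arrays listed in an mdstat file.
# removal_commands is a hypothetical helper; review its output before use.
removal_commands() {
    disk=$1                      # disk name without /dev/, e.g. sdb
    mdstat=${2:-/proc/mdstat}    # live file, or a saved copy
    awk -v disk="$disk" '
        /^md[0-9]+ :/ {
            for (i = 4; i <= NF; i++) {
                part = $i
                sub(/\[.*/, "", part)            # sdb1[2](F) -> sdb1
                if (index(part, disk) == 1) {    # name starts with the disk name
                    print "mdadm --manage /dev/" $1 " --fail /dev/" part
                    print "mdadm --manage /dev/" $1 " --remove /dev/" part
                }
            }
        }
    ' "$mdstat"
}
```

On the example system, removal_commands sdb prints the four commands from this chapter. The prefix match is deliberately simple and assumes no other disk name starts with the same letters (for example sdb and sdba).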

Then power down the system:

shutdown -h now

and replace the old /dev/sdb hard drive with a new one. The new drive must be at least as large as the old one; if it is even a few MB smaller, rebuilding the arrays will fail.
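The size requirement can be checked before you touch the partition table. This is a minimal sketch: big_enough is a hypothetical helper that just compares two byte counts, and the usage line assumes that the surviving disk /dev/sda has the same size as the disk you removed.

```shell
#!/bin/sh
# Sketch: decide whether a replacement disk is large enough.
# big_enough is a hypothetical helper; both arguments are sizes in bytes.
big_enough() {
    old_bytes=$1
    new_bytes=$2
    [ "$new_bytes" -ge "$old_bytes" ]
}

# Typical use on the live system (as root) once the new disk is installed.
# /dev/sda stands in for the old disk here, which is an assumption:
#   big_enough "$(blockdev --getsize64 /dev/sda)" "$(blockdev --getsize64 /dev/sdb)" \
#       && echo "new disk is large enough" || echo "new disk is too small"
```

blockdev --getsize64 prints the device size in bytes, so the comparison is exact rather than based on the rounded sizes printed on the drive label.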

 

4 Adding The New Hard Disk

After you have changed the hard disk /dev/sdb, boot the system.

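The first thing we must do now is to create the exact same partitioning as on /dev/sda. We can do this with one simple command (mind the direction: it dumps the partition table of /dev/sda and writes it to /dev/sdb, overwriting whatever is there):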

sfdisk -d /dev/sda | sfdisk /dev/sdb

You can run

fdisk -l

to check if both hard drives have the same partitioning now.
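Instead of comparing the fdisk listings by eye, you can compare the sfdisk dumps of both disks. The sketch below assumes MBR-style dumps in which every partition line starts with the device name, as on the example system. same_layout and normalize_dump are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: compare two "sfdisk -d" dumps while ignoring the device names.
# normalize_dump and same_layout are hypothetical helpers.
normalize_dump() {
    # $1 = dump file, $2 = disk name such as sda
    # Keep only the partition lines and replace the disk name with a placeholder.
    grep "^/dev/$2" "$1" | sed "s|^/dev/$2|DISK|"
}

same_layout() {
    # $1 = dump of the first disk, $2 = its name,
    # $3 = dump of the second disk, $4 = its name
    a=$(normalize_dump "$1" "$2")
    b=$(normalize_dump "$3" "$4")
    [ -n "$a" ] && [ "$a" = "$b" ]    # an empty dump never counts as a match
}

# Usage on the live system (as root):
#   sfdisk -d /dev/sda > sda.dump    # doubles as a backup of the partition table
#   sfdisk -d /dev/sdb > sdb.dump
#   same_layout sda.dump sda sdb.dump sdb && echo "layouts match"
```

Keeping sda.dump somewhere safe is useful in its own right, because it lets you recreate the partitioning later even if /dev/sda is the disk that fails.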

Next we add /dev/sdb1 to /dev/md0 and /dev/sdb2 to /dev/md1:

mdadm --manage /dev/md0 --add /dev/sdb1

server1:~# mdadm --manage /dev/md0 --add /dev/sdb1
mdadm: re-added /dev/sdb1

mdadm --manage /dev/md1 --add /dev/sdb2

server1:~# mdadm --manage /dev/md1 --add /dev/sdb2
mdadm: re-added /dev/sdb2

Now both arrays (/dev/md0 and /dev/md1) will be synchronized. Run

cat /proc/mdstat

to see when it's finished.

During the synchronization the output will look like this:

server1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0] sdb1[1]
      24418688 blocks [2/1] [U_]
      [=>...................]  recovery =  9.9% (2423168/24418688) finish=2.8min speed=127535K/sec

md1 : active raid1 sda2[0] sdb2[1]
      24418688 blocks [2/1] [U_]
      [=>...................]  recovery =  6.4% (1572096/24418688) finish=1.9min speed=196512K/sec

unused devices: <none>

When the synchronization is finished, the output will look like this:

server1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0] sdb1[1]
      24418688 blocks [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      24418688 blocks [2/2] [UU]

unused devices: <none>
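If you would rather not re-run cat /proc/mdstat by hand, the wait can be sketched as a small polling loop. sync_in_progress is a hypothetical helper name. It only looks for the "recovery" and "resync" progress lines shown above, so treat it as a sketch, not a complete status parser.

```shell
#!/bin/sh
# Sketch: report whether any array in an mdstat file is still rebuilding.
# sync_in_progress is a hypothetical helper. It succeeds (exit status 0)
# while a "recovery =" or "resync =" line is present.
sync_in_progress() {
    grep -Eq '(recovery|resync) *=' "${1:-/proc/mdstat}"
}

# Poll until the rebuild is done (run on the live system):
#   while sync_in_progress; do sleep 30; done; echo "all arrays are in sync"
```

mdadm also has a --wait option (for example mdadm --wait /dev/md0 /dev/md1) that blocks until resync or recovery on the given arrays has finished, which may be simpler if your mdadm version supports it.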

That's it, you have successfully replaced /dev/sdb!


Submitted by ObiDrunk (not registered) on Thu, 2010-07-08 10:58.

First, ty, this is a very complete tutorial; it takes a lot of effort to find info like this on the web.

I have a question: I have a software RAID 1 with the same config as yours, where md0 is the swap partition and md1 is /.

When I first started the system after the installation, I ran this in a shell:

watch -n1 cat /proc/mdstat
 

and md1 appears to be in sync status. Is this normal? Can I reboot while the sync is running? ty

Submitted by scsi hot swap (not registered) on Wed, 2010-07-07 13:16.

Wow.

 This was just the article I needed after one of my disks failed and I had to get the array back up and running.  Linux is an amazing OS, but when you start to run mission critical services on there and don't employ or train people to support it properly, it is pages like this that are a big BIG help.

 Thanks again.

Submitted by Anonymous (not registered) on Tue, 2010-04-20 17:57.

Great tutorial.

I was wondering if reboot step is necessary? If my motherboard supports hotswapping, would the reboot still be necessary?

Submitted by dino (not registered) on Sat, 2010-03-13 05:41.

Very helpful, thanks.

Any advice on a /dev/sda master mirror disk failure?  I'm having some difficulty tracking anything down about this on the Internet.  All information seems to refer to a slave disk failure /dev/sdb.

Cheers and thanks.

Submitted by pupu (not registered) on Mon, 2010-03-29 19:59.

I can add the procedure I've just used to replace failed /dev/sda on my Fedora system. I'm assuming you have your bootloader in MBR; if not, adjust arguments at point 7 and 8.

1. After you have finished the procedure described in the article, boot from rescue cd/dvd/usb stick/whatever
2. Let the rescue procedure process to the point you are offered shell
3. Check for the location of your '/boot' directory on physical disks. Mine was on /dev/sda3 or /dev/sdb3; it means (hd0,2) or (hd1,2) in grub syntax (check grub docs if you are not sure)
4. run 'chroot /mnt/sysimage'
5. run 'grub'
6. At grub prompt, type 'root (hd0,2)' when the argument is the path you've found at the point 3
7. type 'install (hd0)'
8. type 'install (hd1)'
9. leave grub shell, leave chroot, leave rescue shell and reboot

Submitted by mpy (not registered) on Fri, 2010-01-15 11:25.

Thank you very much for this tutorial... especially the sfdisk trick is really clever!

I only have one comment: Perhaps it'll be smarter to wait with the re-addition of /dev/sdb2 until sdb1 is sync'd completely. Then the load of the HDD (writing to two partitions simultaneously) will be reduced.

Submitted by ttr (not registered) on Tue, 2010-01-19 15:24.

Nope, if there are multiple arrays on one drive to be synced, they will be queued and synced one by one, so there is no need to wait before adding the other partitions.

 


Submitted by Anonymous (not registered) on Thu, 2010-03-11 10:01.
Interesting... thanks for clarifying this. It was just a thought, as in the example above it looks like the sync'ing is done simultaneously (md0 at 9.9% and md1 at 6.4%).
Submitted by Paul Bruner (not registered) on Fri, 2009-12-11 23:07.

I think the author needs to put in how to find the physical drive though.  Every time my server reboots it seems to put the drives in different dev nodes.  (ex, sdb1 is now sda1, and so on)

 Not everyone can dig through the commands for that:P

Submitted by bobbyjimmy (not registered) on Sat, 2009-11-21 17:26.
Thanks - This worked perfectly against my raid5 as well.
Submitted by Kris (not registered) on Fri, 2009-07-17 12:07.

Thanks for the step-by-step guide to replacing a failed disk, this went much smoother than I was expecting - Now I just have to sit and wait 2.5 hours for the array to rebuild itself...

 

Thanks again!

Submitted by bbt5001 (registered user) on Thu, 2009-04-16 13:11.

This type of tutorial is invaluable. The man page for 'mdadm' is over 1200 lines long and it can be easy for the uninitiated to get lost. My only question when working through the tutorial was is it necessary to --fail all of the remaining partitions on a disk in order to remove them from the array (in preparation to replace the disk)?  The answer is 'yes', easily found in the man page once I knew the option existed.

One of the follow-up comments included a link to a post from the Linux-PowerEdge mailing list entitled 'Sofware Raid and Grub HOW-TO' (yes, 'software' is misspelled in the post's title).  Although this paper is dated 2003 and the author refers to 'raidtools' instead of 'mdadm', there are two very useful sections. The most useful is on using grub to install the master boot record to the second drive in the array. The other useful section is on saving the partition table, and using this to build a new drive. (In my own notes I add saving the drive's serial number so I have an unambiguous confirmation of what device maps to what physical drive.)

Merging these tips with Falko's instructions gave me a system bootable from either drive, and easily rebuilt when I replaced a 'failed' drive with a brand-new unpartitioned hard drive.

Thanks to Falko and the other helpful posters.

Submitted by Stephen Jones (not registered) on Tue, 2009-02-03 16:08.
Class tutorial - just repaired a failed drive remotely (with a colleague's assistance at the location) flawlessly - hope it's as easy if sda falls over . . . . . 
Submitted by som-a (registered user) on Thu, 2007-03-08 14:43.
Hello there,

i m missing the part for the bootloader (lilo/grub).
maybe you can add it?

a part for replacing the first disk (as said by the previous poster) would be good, also a part if the bootloader was not added to the bootloader (rescue-disc, chrooting, ...)

regards,
som-a
Submitted by burke3gd (registered user) on Tue, 2008-09-02 21:26.

This is something that should be added to the howto. On debian it is simply a matter of running "grub-install /dev/sdb".

I'm sure this was just an oversight on the part of the author, as otherwise Falko Timme's RAID howtos have been very correct and a godsend.
Keep up the good work!

Submitted by Joe (not registered) on Wed, 2010-07-21 14:34.
Thank you for noting the need to run grub-install! I wasted a lot of time following another incomplete guide, only to find out my new array was unbootable. It's frustrating few authors seem aware of this minor, but critical detail, since without it their guide is useless.
Submitted by riiiik (registered user) on Sat, 2007-05-19 20:44.

Hi,

This link worked well for me: http://lists.us.dell.com/pipermail/linux-poweredge/2003-July/008898.html

Regards

Rikard 

Submitted by c600g (registered user) on Wed, 2007-01-31 17:30.

Thanks for the great article. This seems to be the best case scenario for a drive failure in a mirrored RAID array (i.e. drive 2 failing in a 2 drive mirror).

Perhaps a useful addition to the article would be to detail how to recover when the first drive (e.g. /dev/sda in this article) fails. Physically removing /dev/sda would allow the system to run from /dev/sdb (so long as the boot loader was installed on /dev/sdb!), but if you put a new HD in /dev/sda, I don't think you would be able to reboot...

You would probably need to remove /dev/sda, then move /dev/sdb to /dev/sda, and then install a new /dev/sdb.
