
I have an LVM setup on Debian 7.8 (kernel 3.2.65-1+deb7u1) running OpenMediaVault.

The LV is made up of 4 disks:

Disk /dev/sdb: 4000.8 GB, 4000787030016 bytes
Disk /dev/sdc: 2000.4 GB, 2000398934016 bytes
Disk /dev/sdd: 2000.4 GB, 2000398934016 bytes
Disk /dev/sde: 1500.3 GB, 1500301910016 bytes

Starting last night, access to the shares located on the LV started getting slow, until finally the shares became totally unresponsive.

Syslog is showing the following message repeatedly:

ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata3.00: BMDMA stat 0x45
ata3.00: failed command: READ DMA
ata3.00: cmd c8/00:80:80:01:00/00:00:00:00:00/e0 tag 0 dma 65536 in
         res 51/40:6f:85:01:00/00:00:4b:00:00/e0 Emask 0x9 (media error)
ata3.00: status: { DRDY ERR }
ata3.00: error: { UNC }
ata3.00: configured for UDMA/133
ata3.01: configured for UDMA/133
ata3: EH complete 

smartd is also reporting:

Device: /dev/disk/by-id/wwn-0x50014ee2af284bdd [SAT], SMART Usage Attribute: 193 Load_Cycle_Count changed from 23 to 22
Device: /dev/disk/by-id/wwn-0x50014ee2af284bdd [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
Device: /dev/disk/by-id/wwn-0x50014ee2af284bdd [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
Device: /dev/disk/by-id/wwn-0x50014ee2af284bdd [SAT], SMART Usage Attribute: 193 Load_Cycle_Count changed from 22 to 21
Device: /dev/disk/by-id/wwn-0x50014ee2af284bdd [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
Device: /dev/disk/by-id/wwn-0x50014ee2af284bdd [SAT], SMART Usage Attribute: 193 Load_Cycle_Count changed from 21 to 20
Device: /dev/disk/by-id/wwn-0x50014ee2af284bdd [SAT], 1 Currently unreadable (pending) sectors
Device: /dev/disk/by-id/wwn-0x50014ee2af284bdd [SAT], 689 Currently unreadable (pending) sectors (changed +688)
Device: /dev/disk/by-id/wwn-0x50014ee2af284bdd [SAT], SMART Usage Attribute: 197 Current_Pending_Sector changed from 200 to 198
Device: /dev/disk/by-id/wwn-0x50014ee2af284bdd [SAT], 1416 Currently unreadable (pending) sectors (changed +727)
Device: /dev/disk/by-id/wwn-0x50014ee2af284bdd [SAT], SMART Usage Attribute: 197 Current_Pending_Sector changed from 198 to 195
Device: /dev/disk/by-id/wwn-0x50014ee2af284bdd [SAT], 1465 Currently unreadable (pending) sectors (changed +49)
Device: /dev/disk/by-id/wwn-0x50014ee2af284bdd [SAT], 1465 Currently unreadable (pending) sectors
Device: /dev/disk/by-id/wwn-0x50014ee2af284bdd [SAT], 1465 Currently unreadable (pending) sectors
Device: /dev/disk/by-id/wwn-0x50014ee2af284bdd [SAT], 1465 Currently unreadable (pending) sectors
Device: /dev/disk/by-id/wwn-0x50014ee2af284bdd [SAT], ATA error count increased from 0 to 84

I have tracked the problem down to /dev/sde, and I am no longer able to get LVM running; it just hangs.
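For reference, the failing ATA port can be fished out of the kernel log with a bit of sed. A minimal sketch; the sample log lines below are made up to mirror the ones above (on a live system you would feed in dmesg instead):

```shell
# Hypothetical helper: pull the failing ata port out of kernel log lines.
# Sample log embedded for illustration only.
log='ata3.00: failed command: READ DMA
ata3.00: error: { UNC }
ata1.00: configured for UDMA/133'
failing_port=$(printf '%s\n' "$log" \
    | sed -n 's/^\(ata[0-9]*\)\.[0-9]*: error: { UNC }.*/\1/p' \
    | sort -u)
echo "$failing_port"   # prints: ata3
```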

I should have enough free space on sdb, sdc and sdd to remove sde, but commands like pvmove just hang when attempting to read sde.

Any suggestions or is my volume toast?

Thanks!

# pvs
  PV         VG      Fmt  Attr PSize PFree
  /dev/sdb   storage lvm2 a--  3.64t    0
  /dev/sdc   storage lvm2 a--  1.82t    0
  /dev/sdd   storage lvm2 a--  1.82t    0
  /dev/sde   storage lvm2 a--  1.36t    0

# vgs
  VG      #PV #LV #SN Attr   VSize VFree
  storage   4   1   0 wz--n- 8.64t    0

# lvs
  LV      VG      Attr     LSize Pool Origin Data%  Move Log Copy%  Convert
  storage storage -wi----- 8.64t
    It's dead, Jim. Go to your backups...and put together a more sane array of disks. – Michael Hampton Feb 21 '15 at 00:45
  • We need the output of pvs, vgs, lvs. – Avery Payne Feb 21 '15 at 00:57
  • Haha, yeah I was guessing it's dead. The data is not critical, just movies etc. I have collected; I was concentrating more on space than redundancy. pvs, vgs and lvs added to question. – Feb 21 '15 at 01:06
  • @avery-payne pvs, vgs and lvs added to question – CJSewell Feb 21 '15 at 01:11
  • You will possibly lose data anyway, but "just because" I might try to get another disk, then run ddrescue on /dev/sde - then, after however long that takes, replace the old with the new disk. It's likely the volume is unresponsive because reads are extremely slow due to retries - of course, depending on how badly your disk is dying this might or might not give you back most of your data. – davidgo Feb 21 '15 at 01:15
  • I do have another disk handy. I'll see how ddrescue goes. Thanks – CJSewell Feb 21 '15 at 01:17

2 Answers


So after a week of ddrescue and a day or so of e2fsck I have it all somewhat restored. It seems like most of the data is there and uncorrupted; although a large section of it remains in lost+found, it is readable.

Here's a breakdown of how I did it.
An important note: my system disks were not part of the LVM. Doing this when your system disks are failing may require you to boot from a live CD/USB drive.

Get the system booted

My system would not boot and hung while trying to bring LVM up. To get around this, I unplugged the problem disk sde, then started the machine and waited until I was able to log in. I then plugged sde back in and ran

echo '0 0 0' > /sys/class/scsi_host/host3/scan

after which sde was picked up. (host3 was the port sde was on, and may not be the same for your disk.)
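If you're not sure which hostN your disk sits on, the /sys/block/<dev>/device symlink encodes it. A hedged sketch: a fake sysfs tree is built here so the parsing is visible without real hardware; on the actual box use sysfs=/sys and your real device name:

```shell
# Build a throwaway tree mimicking the sysfs layout (illustrative paths only).
sysfs=$(mktemp -d)
mkdir -p "$sysfs/devices/pci0000:00/0000:00:1f.2/ata3/host3/target3:0:0/3:0:0:0"
mkdir -p "$sysfs/block/sde"
ln -s "$sysfs/devices/pci0000:00/0000:00:1f.2/ata3/host3/target3:0:0/3:0:0:0" \
      "$sysfs/block/sde/device"
# The symlink target path contains the .../hostN/... segment we want.
host=$(readlink "$sysfs/block/sde/device" | sed 's|.*/\(host[0-9]*\)/.*|\1|')
echo "$host"   # prints: host3
# On the real system you would then rescan that port, e.g.:
#   echo '0 0 0' > /sys/class/scsi_host/$host/scan
```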

Install ddrescue (for Debian):

apt-get install gddrescue

Clone the dying disk with ddrescue. (First pass: skip errors to quickly read as much good data as possible. This takes a long time, depending on the number of errors and the size of the disk.)

ddrescue -f -n /dev/sde /dev/sdf /root/sde.rescue.log


GNU ddrescue 1.16
Press Ctrl-C to interrupt
rescued:   644394 MB,  errsize:    372 kB,  current rate:    4390 kB/s
rescued:     1500 GB,  errsize:  22036 kB,  current rate:       66 B/s
   ipos:    200704 B,   errors:      77,    average rate:    4942 kB/s
   opos:    200704 B,     time since last successful read:       0 s
Finished

Attempt another pass. (Skip data already copied, and retry bad areas 3 times before giving up. For me this took even longer than the first pass.)

ddrescue -d -f -r3 /dev/sde /dev/sdf /root/sde.rescue.log


GNU ddrescue 1.16
Press Ctrl-C to interrupt
Initial status (read from logfile)
rescued:     1500 GB,  errsize:  22036 kB,  errors:      77
Current status
rescued:     1500 GB,  errsize:  12014 kB,  current rate:      512 B/s
   ipos:    199680 B,   errors:     972,    average rate:      768 B/s
   opos:    199680 B,     time since last successful read:       0 s
Splitting failed blocks...
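The logfile ddrescue keeps (/root/sde.rescue.log above) also tells you what is still unreadable: each non-comment line is "pos size status", where '+' means rescued and '-' means a bad sector. A hedged sketch; the sample logfile below is invented for illustration, so point the same awk at the real log to get real numbers:

```shell
# Count the still-bad areas in a ddrescue 1.16-style logfile (sample inline).
mapfile='# Rescue Logfile. Created by GNU ddrescue version 1.16
# current_pos  current_status
0x00030E00     +
#      pos        size  status
0x00000000  0x00030C00  +
0x00030C00  0x00000200  -
0x00030E00  0x15D50A200  +'
bad_blocks=$(printf '%s\n' "$mapfile" | awk '$3 == "-" { n++ } END { print n+0 }')
echo "bad areas remaining: $bad_blocks"   # prints: bad areas remaining: 1
```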

I then shut the machine down, removed sde, plugged what was sdf into the same SATA port sde had been on, and booted back up.
Upon boot LVM came up, but there were a lot of errors while trying to look through the files.

Fix the filesystem (answer yes to all questions, be verbose, and force the filesystem check):

e2fsck -y -v -f /dev/mapper/storage-storage

I was then able to mount the filesystem and start looking at the damage. As stated, a large chunk of data ended up in lost+found. So far, it's only folder names that have been lost; checking the contents of the folders, I am able to piece together where it all belongs.
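Taking stock of lost+found can be scripted. A hedged sketch: a throwaway directory stands in here for the real mountpoint's lost+found, and the file names are made up; e2fsck names recovered entries after their inode number (#NNNN):

```shell
# Inventory what landed in lost+found so recovered #inode directories
# can be matched back to where they belong.
LF=$(mktemp -d)                       # stand-in for /your/mountpoint/lost+found
mkdir "$LF/#131072"                   # recovered directory, named after its inode
printf 'fake data' > "$LF/#131072/movie.mkv"
listing=$(find "$LF" -type f -printf '%p %s bytes\n')
echo "$listing"
```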



It's a long shot, but you could attempt a data migration using LVM's mirror feature. The downside is that you will need at least as much storage in your new volume as in your old volume. There is also no guarantee that you will recover all your data, due to persistent disk errors, but any data that is still readable might survive the trip. It's worth a try; the worst that happens is you lose the data you were about to lose anyway.
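A sketch of the migration sequence this describes, assuming a hypothetical replacement PV /dev/sdf at least as large as the LV. Printed dry-run style via the run() wrapper so the plan is visible without touching any disks; drop the wrapper to execute for real (as root, at your own risk):

```shell
# Dry-run wrapper: echoes each command instead of executing it.
run() { echo "+ $*"; }
run pvcreate /dev/sdf                        # prepare the new disk as a PV
run vgextend storage /dev/sdf                # add it to the volume group
run lvconvert -m 1 storage/storage /dev/sdf  # mirror the LV onto the new PV
run lvconvert -m 0 storage/storage /dev/sde  # once synced, drop the dying leg
run vgreduce storage /dev/sde                # remove the bad disk from the VG
```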

  • Thanks for that. ddrescue has finished one pass on the dying disk and has managed to recover all but 30MB. I'm running it again to see if it can grab any more. Once it's finished I will replace the dying disk with the clone. If that fails I will give this method a shot. – CJSewell Feb 24 '15 at 23:49