Josh-D. S. Davis

Xaminmo / Omnimax / Max Omni / Mad Scientist / Midnight Shadow / Radiation Master

Sunday - Disks
So, replication from the split copy worked, and then I rebuilt the array from one other known-working disk plus one suspect disk.

Checked this morning and all was well, so I added the new fifth disk; the restripe is running now.

I checked the SMART logs on one of the good disks, and out of the sectors I've written, an amazing THREE PERCENT have required ECC recovery. Counting only by sectors read, it's still a whole one percent.

For the drive that went offline in its first day of use, Offline_Uncorrectable and Pending_Reallocation are both 202.
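
For reference, the attributes I'm watching can be pulled per drive with smartctl (the device name here is just an example):
smartctl -A /dev/sda | egrep 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable|ECC_Recovered'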

That seems VERY excessive to me. I know that current disks have such high densities that bit rot is rapid. I think I heard it's 40% signal loss by 400 seconds after write, and then it sort of levels off around there.

Based on the study done at Google, "the critical threshold for scan errors is one. After the first scan error, drives are 39 times more likely to fail within 60 days than drives without scan errors. ... After their first reallocation, drives are over 14 times more likely to fail within 60 days than drives without reallocation counts, making the critical threshold for this parameter also one. ... After the first offline reallocation, drives have over 21 times higher chances of failure within 60 days than drives without offline reallocations; an effect that is again more drastic than total reallocations. ... The critical threshold for [pending] counts is also one: after the first event, drives are 16 times more likely to fail within 60 days than drives with zero probational counts."

Note that pending, offline, and seek errors are primarily seen/reported with one brand. Seek and CRC errors don't trend with failure rates. The above only accounts for stats on 44% of failed drives; the other 56% don't report any of these stats before requiring replacement.

OK, fine, but still. The error recovery used on modern disks is effectively RAID-2 anyway. Each head is a "disc" and they use something like Hamming code for error recovery. That's 4 bits of data and 3 bits of parity. It can correct single bit errors but not double bit errors. I know, "extremely rare", but when the BER is so high, it makes me think that "extremely rare" isn't as unlikely as anyone would expect.
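
As a toy illustration of the single-bit correction, here's a quick awk sketch of Hamming(7,4): encode a nibble, flip one bit to fake a media error, and locate it from the parity syndrome. Purely illustrative; real drive firmware doesn't lay things out this way.
awk 'BEGIN {
  split("1 0 1 1", d, " ")                      # the 4 data bits
  p1 = (d[1]+d[2]+d[4]) % 2                     # parity covering codeword positions 1,3,5,7
  p2 = (d[1]+d[3]+d[4]) % 2                     # parity covering positions 2,3,6,7
  p3 = (d[2]+d[3]+d[4]) % 2                     # parity covering positions 4,5,6,7
  c[1]=p1; c[2]=p2; c[3]=d[1]; c[4]=p3; c[5]=d[2]; c[6]=d[3]; c[7]=d[4]
  c[6] = 1 - c[6]                               # flip one bit to simulate a media error
  s1 = (c[1]+c[3]+c[5]+c[7]) % 2                # each syndrome bit is a failed parity check
  s2 = (c[2]+c[3]+c[6]+c[7]) % 2
  s3 = (c[4]+c[5]+c[6]+c[7]) % 2
  printf "bad bit is at codeword position %d\n", s1 + 2*s2 + 4*s3
}'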

Looking into the drive specs: "Nonrecoverable read errors: 1 per 10^14 bits read", i.e. one per 100,000,000,000,000 bits. With the Hamming code, that's 14 bits per byte, which works out to one error per 7,142,857,142,857 bytes read, or one unrecoverable read error per 6.5TiB read. The disk is guaranteed at 3,907,029,168 sectors of 512 bytes, which is 1.82TiB, so on average, across these 5 spindles, I should expect an unrecoverable read error every 3.57 times I read through an entire disk?
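
Sanity-checking that arithmetic (taking the 14-bits-per-encoded-byte figure above at face value):
awk 'BEGIN {
  bytes = 10^14 / 14                  # spec: one nonrecoverable error per 10^14 bits read
  disk  = 3907029168 * 512            # guaranteed sectors times 512 bytes
  printf "one URE per %.1f TiB read; %.2f full-disk reads per URE\n", bytes/2^40, bytes/disk
}'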

The usage the manufacturer expects for these drives is desktop-class: 2400 power-on hours per year, one spin-up per minute, and 25C ambient temperature. We all know everyone leaves their computer off 16 hours per day. With that workload, across a large pile of drives, 0.32% are expected to fail per year, or one failure for every 0.75 million hours. It doesn't say "power-on hours", so I think they're counting linear time. Max ambient is 60C and max internal is 69C. Anything above the tested conditions is grounds for increased error rates.

There are 8760 hours in a year. I have 5 drives. My ambient temp in the case is 32C (I'm not a datacenter, I'm a house), but see Ref 4, which indicates that temperature doesn't notably increase drive failure rates. Based on this, I have a 5.84% chance of a drive failure per year.
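
That 5.84% is just the per-drive failure rate times five drives (close enough to the exact 1-(1-p)^5 at these rates):
awk 'BEGIN {
  mtbf = 2400 / 0.0032                # 0.32% AFR on a 2400 power-on-hour/year basis = 750,000 hours
  printf "%.2f%% chance of a failure among 5 drives per year\n", 5 * 8760 / mtbf * 100
}'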

Consider also that Leenooks' RAID code isn't guaranteed to pick a "good" copy on error detection. If it gets a read error, then it rebuilds the "failing" block.

If you lose 2 disks due to a controller failure, re-adding them is no longer possible once the array continues on, because they're stale. If there's then an error in the remaining data, you no longer have a way to recover it.

If it gets back bad data without a read error, it assumes the parity is wrong, without doing anything smart to figure out where the error actually lies. Since there are more data disks than parity disks, the bad block is more likely on a data disk, so that assumption is more often wrong than right.
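
The closest thing to a mitigation I know of on the md side is the scrub interface, which at least counts these silent mismatches (writing "repair" instead of "check" rewrites parity from data, i.e. exactly the assumption described above):
echo check > /sys/block/md2/md/sync_action
cat /sys/block/md2/md/mismatch_cnt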

How could that ever happen!?!?!? Well, if you have a failing disk and you for whatever reason need to use OS tools to copy the remaining good data off to a new disk, then you substitute the new disk for the bad disk. Maybe not the best choice, but it could happen.

It's all very disconcerting, though the array is going through a complete rewrite today, which should fix any transient issues and remap anything truly failed. Hopefully. But still, media errors raise the AFR from 0.32% to 30%.

Anyway, I'm just paranoid that roughly 66% overhead still doesn't seem like enough of a guarantee against data loss. :/

It looks like the best bet is:
1. One disk per controller per array, for hardware reliability
2. RAID-6, for reliability during recovery
3. Modern disks, for ECC
4. Continue backups.

Unfortunately, I'm not able to do #1 right now. I'd need at least 4 controllers and I have three. Based on my current arrays, I'd need 5 controllers to avoid splitting into 2 arrays and doubling the parity cost.

I guess I could get 2 more controllers. I'll have to weigh my concerns with that cost.


Ref 1: http://neil.brown.name/blog/20100211050355
Ref 2: http://en.wikipedia.org/wiki/Standard_RAID_levels
Ref 3: http://www.seagate.com/staticfiles/support/disc/manuals/desktop/Barracuda%20LP/100564361b.pdf
Ref 4: http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en/us/papers/disk_failures.pdf


#dmesg | grep ata
ata1.00: UDMA/33  TSScorpCD/DVD
sda ata3.00: UDMA/133 ST32000542AS      5XW1CB20
sdb ata3.01: UDMA/133 ST32000542AS      
sdc ata4.00: UDMA/133 WD1002FBYS-02A6B0 
sdd ata4.01: UDMA/133 ST3500320AS       
ata5.15: Port Multiplier 1.1  UDMA/100  
sde ata5.00: ST3500320AS                
sdf ata5.01: ST3500320AS                
sdg ata5.02: ST3500320AS                
sdh ata5.03: ST3500320AS                
sdi ata5.04: ST3500320AS                
sdj ata6.00: UDMA/100 ST32000542AS      

ST32000542AS sustains no more than 95MB/sec so UDMA/100 is fine.
#cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid0] [raid1] [raid10]
md3 : active raid6 sde[0] sdi[4] sdh[3] sdg[2] sdf[1]
      1465159488 blocks level 6, 64k chunk, algorithm 2 [5/5] [UUUUU]

md2 : active raid6 sdc3[1] sdd3[3]
      905696256 blocks level 6, 512k chunk, algorithm 2 [4/2] [_U_U]

md0 : active raid1 sdd1[2] sdc1[3]
      264960 blocks [4/2] [__UU]

md1 : active raid6 sdd2[0] sdc2[1]
      70540288 blocks level 6, 512k chunk, algorithm 2 [4/2] [UU__]

unused devices: <none>

sda and sdb are our missing disks
sdc is source for fdisk
sdj is a new disk
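The actual partition copy onto the new disk isn't logged here; something like sfdisk's dump/restore is the general idea (target sdj as an example):
sfdisk -d /dev/sdc | sfdisk /dev/sdj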
mdadm --add /dev/md0 /dev/sda1 /dev/sdb1
mdadm --add /dev/md1 /dev/sda2 /dev/sdb2
mdadm --add /dev/md2 /dev/sda3 /dev/sdb3

mdadm --fail /dev/md0 /dev/sdd1
mdadm --add /dev/md0 /dev/sdj1
mdadm --fail /dev/md1 /dev/sdd2
mdadm --add /dev/md1 /dev/sdj2
mdadm --fail /dev/md2 /dev/sdd3
mdadm --add /dev/md2 /dev/sdj3
######################  Wait here for the resync to complete
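
Rather than eyeballing /proc/mdstat, mdadm's --wait (-W) should block until the recovery finishes; something like:
mdadm --wait /dev/md0
mdadm --wait /dev/md1
mdadm --wait /dev/md2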

mdadm --fail /dev/md0 /dev/sdc1
mdadm --fail /dev/md1 /dev/sdc2
mdadm --fail /dev/md2 /dev/sdc3
shutdown -fh now

######################
mdadm --add /dev/md0 /dev/sdc1 /dev/sdd1
mdadm --add /dev/md1 /dev/sdc2 /dev/sdd2
mdadm --add /dev/md2 /dev/sdc3 /dev/sdd3
mdadm --grow /dev/md0 -n 5
mdadm --grow /dev/md1 -n5 --backup-file=/boot/md1temp
mdadm --grow /dev/md1 -n 5
mdadm --grow /dev/md2 -n 5

Overnight, I lost sda and sdb, though not at the same time.
md0 and md1 are OK.
md2 is offline with 3 missing and 1 spare.
#rescan
for i in /sys/bus/scsi/drivers/sd*/block/device/rescan ; do echo 1 > $i ; done
#find new
for i in /sys/class/scsi_host/host*/scan ; do echo "- - -" > $i ; done

mdadm --add /dev/md0 /dev/sdk1 /dev/sdl1
mdadm --add /dev/md1 /dev/sdk2 /dev/sdl2
mdadm -C --verbose --assume-clean -l6 -n4 -c512 --metadata=0.90 --uuid=55e53748:5d573e23:bb6ac296:576dbda4 /dev/md2 /dev/sda3 /dev/sdb3 missing missing
mdadm --readwrite /dev/md2
mdadm --readwrite /dev/md3
vgscan


Moar
mdadm --assemble /dev/md2 /dev/sdk3 /dev/sdl3 missing missing
pvscan # it's there
mdadm --add /dev/md2 /dev/sdj3
mdadm --add /dev/md2 /dev/sdd3
dd if=/dev/sda of=/dev/null bs=256k
dd if=/dev/sdb of=/dev/null bs=256k
mdadm --stop /dev/md2
remove disks


mdadm --assemble -f -R /dev/md2 /dev/sdd3 /dev/sdj3
mdadm --add /dev/md2 /dev/sdc3
mdadm --add /dev/md2 /dev/sdb3
while true ; do unset TMP ; cat /proc/mdstat | while read TMP ; do echo `date` $TMP | tee -a /var/log/syslog ; done ; sleep 120 ; done
mdadm --add /dev/md2 /dev/sda3
pvresize /dev/md1
mdadm --grow -n5 -l6 /dev/md2

/bin/bash# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid0] [raid1] [raid10]
md2 : active raid6 sda3[4] sdb3[2] sdc3[0] sdj3[1] sdd3[3]
      905696256 blocks super 0.91 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [>....................]  reshape =  0.0% (106600/452848128) finish=566.2min speed=13325K/sec

md3 : active raid6 sde[0] sdi[4] sdh[3] sdg[2] sdf[1]
      1465159488 blocks level 6, 64k chunk, algorithm 2 [5/5] [UUUUU]

md0 : active raid1 sda1[0] sdd1[4] sdc1[3] sdj1[2] sdb1[1]
      264960 blocks [5/5] [UUUUU]

md1 : active raid6 sdj2[0] sdb2[4] sda2[3] sdd2[2] sdc2[1]
      105810432 blocks level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]

unused devices: <none>


Here's a simple script I wrote. It makes some assumptions that are not universally true, but they're fine for my system.
#!/bin/sh
###########
#
# lsdisk - by Josh-Daniel S. Davis
# depends on smartmontools being installed
# lists basic health in a compact format
#
#################

echo disk,model,serial,family,temperature,reallocated,uncorrectable
for i in /dev/sd[a-z] ; do
        disk=`basename $i`
        tmp=/dev/shm/jdsd.lsdisk.$disk.$$
        smartctl -x $i > $tmp &         # query all disks in parallel
done
wait

for i in /dev/sd[a-z] ; do
        disk=`basename $i`
        tmp=/dev/shm/jdsd.lsdisk.$disk.$$
        reallocated=`grep Reallocated_Sector_Ct $tmp | rev | cut -sf 1 -d \ | rev`
        uncorrectable=`grep Offline_Uncorrectable $tmp | rev | cut -sf 1 -d \ | rev`
        temperature=`grep Temperature_Celsius $tmp| tr -s \  | cut -sf 4 -d \ `
        model=`grep 'Device Model' $tmp | cut -sf 2 -d : | tr -s \  | cut -f 2-99 -d \ `
        family=`grep 'Model Family' $tmp | cut -sf 2 -d : | tr -s \  | cut -f 2-99 -d \ `
        serial=`grep 'Serial Number' $tmp | cut -sf 2 -d : | tr -s \  | cut -f 2-99 -d \ `

        echo $i,$model,$serial,\"$family\",$temperature,$reallocated,$uncorrectable
done

rm /dev/shm/jdsd.lsdisk.*

Here's the output
/bin/bash# lsdisk
disk,model,serial,family,temperature,reallocated,uncorrectable
/dev/sda,ST32000542AS,5XW1CB20,"Seagate Barracuda LP",027,0,202
/dev/sdb,ST32000542AS,5XW18B0R,"Seagate Barracuda LP",031,0,0
/dev/sdc,ST32000542AS,5XW1MLVP,"Seagate Barracuda LP",024,0,0
/dev/sdd,ST32000542AS,5XW1MJKS,"Seagate Barracuda LP",027,0,0
/dev/sde,ST3500320AS,9QMATG2Z,"Seagate Barracuda 7200.11 family",029,0,0
/dev/sdf,ST3500320AS,9QMAQENK,"Seagate Barracuda 7200.11 family",029,1,0
/dev/sdg,ST3500320AS,9QMABT7E,"Seagate Barracuda 7200.11 family",030,0,0
/dev/sdh,ST3500320AS,9QM6Y1TE,"Seagate Barracuda 7200.11 family",029,0,0
/dev/sdi,ST3500320AS,9QM9PL2Y,"Seagate Barracuda 7200.11 family",028,0,0
/dev/sdj,ST32000542AS,5XW1MN2K,"Seagate Barracuda LP",035,0,0



Pending
pvresize /dev/md2 # 905696256
Migrate in filesystems from external array
put 1TB into DT for TSM
after migrate, wipe all 500GB disks

craigslist the 500GB drives. Cheapest retail for any brand is $35, so price accordingly.

I'm using the PRE tag here, and I have a manual stylesheet buried in my LJ configs.. *looks*

http://www.livejournal.com/customize/
My basic style is "Aquatic Moon" which is a blue theme for the Minimalism style as part of SS2. I gave up on SS1 last year I think.

I did "Customize Theme" and then "Custom CSS". The code in the box is:
PRE {
   background-color: rgb(230,250,250); 
   color: black; 
   border-style: solid; 
   border-color:black; 
   border-width: 1px;
}


PRE won't line wrap, so I've thought about doing the same thing for BLOCKQUOTE and using that plus TT. Ok, and I added my font-size: 85%; into both of them.

Testing...
This is PREformatted text

This is a block quote.
This is TeleType mode.



Eye Talikz are ur friends.

There's a "heart" button at the top of this entry, and if you click that, it will save the post in your "memories" section. You can add your own tags there. :)

And really, 2 clicks, and a cut/paste, and it would be there. :)

Continuation of info for me

Sync completed OK. Started the grow commands here:
mdadm --grow /dev/md1 --size=max
pvresize /dev/md1   # Was 70540288
mdadm --grow /dev/md2 --size=max
pvresize /dev/md2   # Was 905696256


Updated my loop to show more info:
DELAY=1800
while true ; do 
echo ------------------
unset TMP ; lsdisk | while read TMP ; do echo `date` $TMP | tee -a /var/log/syslog ; done
echo ------------------
unset TMP ; cat /proc/mdstat | grep \\\[ | while read TMP ; do echo `date` $TMP | tee -a /var/log/syslog ; done
echo ------------------
unset TMP ; pvscan | while read TMP ; do echo `date` $TMP | tee -a /var/log/syslog ; done
sleep $DELAY ; done

Sun Dec 5 19:15:23 CST 2010 disk,model,serial,family,temperature,reallocated,uncorrectable
Sun Dec 5 19:15:25 CST 2010 /dev/sda,ST32000542AS,5XW1CB20,"Seagate Barracuda LP",028,0,202
Sun Dec 5 19:15:25 CST 2010 /dev/sdb,ST32000542AS,5XW18B0R,"Seagate Barracuda LP",031,0,0
Sun Dec 5 19:15:25 CST 2010 /dev/sdc,ST32000542AS,5XW1MLVP,"Seagate Barracuda LP",025,0,0
Sun Dec 5 19:15:25 CST 2010 /dev/sdd,ST32000542AS,5XW1MJKS,"Seagate Barracuda LP",027,0,0
Sun Dec 5 19:15:25 CST 2010 /dev/sde,ST3500320AS,9QMATG2Z,"Seagate Barracuda 7200.11 family",031,0,0
Sun Dec 5 19:15:25 CST 2010 /dev/sdf,ST3500320AS,9QMAQENK,"Seagate Barracuda 7200.11 family",031,1,0
Sun Dec 5 19:15:25 CST 2010 /dev/sdg,ST3500320AS,9QMABT7E,"Seagate Barracuda 7200.11 family",031,0,0
Sun Dec 5 19:15:25 CST 2010 /dev/sdh,ST3500320AS,9QM6Y1TE,"Seagate Barracuda 7200.11 family",031,0,0
Sun Dec 5 19:15:25 CST 2010 /dev/sdi,ST3500320AS,9QM9PL2Y,"Seagate Barracuda 7200.11 family",030,0,0
Sun Dec 5 19:15:25 CST 2010 /dev/sdj,ST32000542AS,5XW1MN2K,"Seagate Barracuda LP",034,0,0
------------------
Sun Dec 5 19:15:25 CST 2010 Personalities : [raid6] [raid5] [raid4] [raid0] [raid1] [raid10]
Sun Dec 5 19:15:25 CST 2010 md2 : active raid6 sda3[4] sdb3[2] sdc3[0] sdj3[1] sdd3[3]
Sun Dec 5 19:15:25 CST 2010 5753928192 blocks level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
Sun Dec 5 19:15:25 CST 2010 [=====>...............] resync = 26.0% (499208768/1917976064) finish=285.6min speed=82778K/sec
Sun Dec 5 19:15:25 CST 2010 md3 : active raid6 sde[0] sdi[4] sdh[3] sdg[2] sdf[1]
Sun Dec 5 19:15:25 CST 2010 1465159488 blocks level 6, 64k chunk, algorithm 2 [5/5] [UUUUU]
Sun Dec 5 19:15:25 CST 2010 md0 : active raid1 sda1[0] sdd1[4] sdc1[3] sdj1[2] sdb1[1]
Sun Dec 5 19:15:25 CST 2010 264960 blocks [5/5] [UUUUU]
Sun Dec 5 19:15:25 CST 2010 md1 : active raid6 sdj2[0] sdb2[4] sda2[3] sdd2[2] sdc2[1]
Sun Dec 5 19:15:25 CST 2010 105810432 blocks level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
------------------
Sun Dec 5 19:15:30 CST 2010 PV /dev/md3 VG mediavg lvm2 [1.36 TiB / 119.00 GiB free]
Sun Dec 5 19:15:30 CST 2010 PV /dev/md2 VG datavg lvm2 [5.36 TiB / 4.59 TiB free]
Sun Dec 5 19:15:30 CST 2010 PV /dev/md1 VG rootvg lvm2 [100.88 GiB / 81.38 GiB free]
Sun Dec 5 19:15:30 CST 2010 Total: 3 [6.82 TiB] / in use: 3 [6.82 TiB] / in no VG: 0 [0 ]


SDA had no more errors, but SDF had one pending.

Pending
Migrate in filesystems from external array
put 1TB into DT for TSM
after migrate, wipe all 500GB disks
craigslist the 500GB drives. Cheapest retail for any brand is $35, so price accordingly.

lvextend -rL32G /dev/rootvg/hd3
lvextend -rL700G /dev/datavg/storagelv
lvextend -rL256G /dev/datavg/bklv
mkfs -t ext4 -E stride=1024,stripe-width=3072,lazy_itable_init=1 -m0 /dev/datavg/medialv
mkfs -t ext4 -E stride=1024,stripe-width=3072,lazy_itable_init=1 -m0 /dev/datavg/uploadlv
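
For what it's worth, mke2fs takes stride and stripe-width in filesystem blocks; with 4KiB blocks, a 512KiB chunk, and 3 data disks (5-disk RAID-6), the numbers come out smaller than the ones above, which look like they were figured in 512-byte sectors:
awk 'BEGIN {
  stride = 512 / 4                              # chunk size in KiB / ext4 block size in KiB
  printf "stride=%d,stripe-width=%d\n", stride, stride * 3   # 3 data disks in a 5-disk RAID-6
}'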

Edited /etc/fstab and remounted the old filesystems read-only.
rsync -aHvxyS --stats /storage/uploads/uploads/* /storage/uploads
rsync -aHvxyS --stats /storage/media/media/* /storage/media


After the migrate, and verifying TSM worked fine, a bootable backup:

##### Install Bare Metal Recovery software
These lines were added to /etc/apt/sources.list
   deb ftp://ftp.mondorescue.org/debian 5.0 contrib
   deb-src ftp://ftp.mondorescue.org/debian 5.0 contrib
apt-get update ; apt-get install mondo mindi

##### Make a bootable CD/DVD backup   
# outfile is  /var/log/mondoarchive.log.
   mondoarchive -Oi9N -l LILO -s 4330m -f /dev/md0 -S /usr/tmp/ISO -T /usr/tmp \
    -E "/mnt /var/tmp /tmp /var/cache/apt/archives /storage" -p `hostname`.`date +%Y-%m-%d` \
    -d /storage/backup/

