[nmglug] SMART error message

Sun Apr 17 09:46:07 PDT 2011

On Apr 17, 2011, at 9:17 AM, Chris Brotherton wrote:

> I am starting to see the following error in my logs:
> 
> Apr 16 08:25:30 darkstar smartd[2159]: Device: /dev/sdd, 1 Currently unreadable (pending) sectors
> Apr 16 08:55:30 darkstar smartd[2159]: Device: /dev/sdd, 1 Currently unreadable (pending) sectors
> 
> Based on these error messages, I was going to follow the advice on this webpage:
> 
> http://smartmontools.sourceforge.net/badblockhowto.html
> 
> However, the SMART selftests show the following:
> 
> === START OF READ SMART DATA SECTION ===
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error       00%      4381         -
> # 2  Extended offline    Completed without error       00%      4374         -
> 
> So, do I need to concern myself with this error or not?

The first thing I'd do, if you are not already doing so, is make a full backup of the data on the drive (dd or ddrescue to an image or another drive).  Or, if you don't care about all of the data, you can cp -r or rsync off the portions of the filesystem that you do care about.
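
A rough sketch of the imaging route, assuming GNU ddrescue is installed. The helper name ddrescue_plan and the image file name are made up for illustration. It only prints the two passes so they can be reviewed before anything touches the disk:

```shell
# Hypothetical helper: print (do not run) a two-pass GNU ddrescue plan.
#   $1 = source device, $2 = image file; the mapfile is "<image>.map".
ddrescue_plan() {
    # Pass 1: copy everything that reads cleanly, skipping the scraping phase.
    printf 'ddrescue -n %s %s %s\n' "$1" "$2" "$2.map"
    # Pass 2: go back to the bad areas with direct access, up to 3 retries.
    printf 'ddrescue -d -r3 %s %s %s\n' "$1" "$2" "$2.map"
}

# Example: ddrescue_plan /dev/sdd sdd.img
```

Both passes share the same mapfile, so the second one only revisits the areas the first could not read. The image should be written to a different, healthy disk.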

What's the output of:

smartctl -H /dev/sdd

In particular, what is the "SMART overall-health self-assessment test result"?  I would further recommend running another offline test:

smartctl -t offline /dev/sdd

and see what the result is.  The badblockhowto.html that you've cited should allow you to identify which files are affected by bad sectors and, as described in that howto, you can use 'dd' to force the disk to reallocate the bad block(s) by writing zeros to them.  You can restore the affected files from backup if desired, or take other measures once you've identified which files the bad sectors fall in.
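
A minimal sketch of the arithmetic that procedure depends on: turning the LBA of a bad sector into a filesystem block number within its partition. It assumes 512-byte sectors. The function name and the sample numbers are illustrative only and are not taken from this drive (the self-test log above reports no LBA_of_first_error):

```shell
# Hypothetical helper: convert a disk LBA to a filesystem block number.
#   $1 = LBA of the bad sector (512-byte sectors assumed)
#   $2 = starting sector of the partition (e.g. from 'fdisk -lu')
#   $3 = filesystem block size in bytes (e.g. from 'tune2fs -l' on ext2/3)
lba_to_fs_block() {
    # Shell integer arithmetic truncates, which is the rounding wanted here.
    echo $(( ($1 - $2) * 512 / $3 ))
}

# Example with made-up values:
#   lba_to_fs_block 23421417 63 4096
#
# The zero-writing step would then look like the line below. It is
# DESTRUCTIVE and is shown only as a comment; $partition and $block are
# placeholders, not values from this thread:
#   dd if=/dev/zero of="$partition" bs=4096 count=1 seek="$block"
```

Double-check the partition start and block size before writing anything, since an off-by-one here zeroes the wrong block.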

A friend of mine and I recently had a similar situation with an enterprise-class drive that started sprouting bad sectors.  The procedure he followed was pretty much *exactly* as you've mentioned and as described in badblockhowto.html.  We eventually replaced the failing 3.5" Samsung disk in a 1U chassis with two commodity WD 2.5" drives (SUPERMICRO MCP-220-00044-0N Hard Drive Bracket) in a software RAID-1 configuration.

I'd say if the SMART test fails, or if you see additional bad sectors after writing zeros to the existing bad sectors, replace the drive.  You could (if it's not a server) take the drive offline, boot from a manufacturer's diagnostic CD (SeaTools or the Western Digital Data Lifeguard bootable CD), and do a full surface scan.  If it fails with bad sectors (likely), you could RMA the drive (if in warranty) based upon a result code other than 0x00.  If it were me, and it was not a production system and I had backups, I'd probably skip straight to pulling the drive model and serial number:

hdparm -i /dev/sdd

and if the drive was in warranty (the manufacturer's website will usually tell you), I'd do the full surface scan (which I'm wagering will fail) and RMA the drive.
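
To keep an eye on the two signals mentioned above (the overall-health result and whether more pending sectors turn up), here is a small sketch that filters smartctl output. The function names are made up, and it assumes the usual ATA output layout in which the raw value is the tenth column of the attribute table:

```shell
# Hypothetical filters; both read smartctl output on stdin.

# Prints the overall-health verdict, e.g. PASSED.
#   smartctl -H /dev/sdd | health_result
health_result() {
    sed -n 's/^SMART overall-health self-assessment test result: //p'
}

# Prints the raw Current_Pending_Sector count (attribute 197).
#   smartctl -A /dev/sdd | pending_sectors
pending_sectors() {
    awk '$2 == "Current_Pending_Sector" { print $10 }'
}
```

If the pending count keeps climbing after the known bad sectors have been rewritten, that is the "additional bad sectors" case where I'd replace the drive.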

-Nick

---------------------------------------
Nicholas S. Frost
7 Avenida Vista Grande #325
Santa Fe, NM  87508
nickf at frostitute.com
----------------------------------------
