How to calculate LBA offsets on a 3ware 9650SE-16PML

Post date: Dec 31, 2011 2:00:38 PM

Info

GOAL: To document how to find a bad hard drive that is responsible for controller resets. When a controller reset occurs, the controller may not tell you which disk is bad, so you have to track it down manually.

NOTES: Many of these steps were taken from LSI's (formerly 3ware) excellent technical support in a case that I opened with them.

REQUIREMENTS: A shell, tw_cli.

Specifications

Distribution: Debian Testing

Architecture: 64-bit

Outline

1. How to find the root cause of a controller reset, in this specific case.

Setup

1. 3ware 9650SE-16PML RAID controller

2. 16x1TB disks (15x1TB in a RAID-6, the 16th disk is a hot spare)

The Problem

I saw this in the logs:

Mar 21 05:40:03 p34 kernel: [521953.433965] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting card.
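If you are not sure when or how often the resets have happened, the kernel log can be searched for the 3ware timeout message (a quick sketch; log locations vary with your syslog configuration):

# grep 'timed out, resetting card' /var/log/syslog /var/log/kern.log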

Investigation

After a preliminary investigation, I opened a case with 3ware for a root cause analysis. They stated that Drive02 was bad. However, tw_cli was showing all disks as OK, and the SMART data was all good too.

In Google's "Failure Trends in a Large Drive Population" [1], they state "Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives."

Check the RAID health: Looks good.

# tw_cli info c0

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy

------------------------------------------------------------------------------

u0 RAID-6 OK - - 64K 12107.1 RiW ON

u1 SPARE OK - - - 931.505 - ON

VPort Status Unit Size Type Phy Encl-Slot Model

------------------------------------------------------------------------------

p0 OK u0 931.51 GB SATA 0 - WDC WD1002FBYS-01A6

p1 OK u0 931.51 GB SATA 1 - WDC WD1002FBYS-01A6

p2 OK u0 931.51 GB SATA 2 - WDC WD1002FBYS-01A6

p3 OK u0 931.51 GB SATA 3 - WDC WD1002FBYS-01A6

p4 OK u0 931.51 GB SATA 4 - WDC WD1002FBYS-01A6

p5 OK u0 931.51 GB SATA 5 - WDC WD1002FBYS-01A6

p6 OK u0 931.51 GB SATA 6 - WDC WD1002FBYS-01A6

p7 OK u0 931.51 GB SATA 7 - WDC WD1002FBYS-01A6

p8 OK u0 931.51 GB SATA 8 - WDC WD1002FBYS-01A6

p9 OK u0 931.51 GB SATA 9 - WDC WD1002FBYS-01A6

p10 OK u0 931.51 GB SATA 10 - WDC WD1002FBYS-01A6

p11 OK u0 931.51 GB SATA 11 - WDC WD1002FBYS-01A6

p12 OK u0 931.51 GB SATA 12 - WDC WD1002FBYS-01A6

p13 OK u0 931.51 GB SATA 13 - WDC WD1002FBYS-01A6

p14 OK u0 931.51 GB SATA 14 - WDC WD1002FBYS-01A6

p15 OK u1 931.51 GB SATA 15 - WDC WD1002FBYS-01A6

Check the disk's SMART data (how the bad disk was identified is covered further below): also looks good.
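SMART data for an individual disk behind the controller can be pulled with smartctl (from smartmontools, not listed in the requirements above); the disks are addressed through the 3ware device rather than /dev/sdX, for example (assuming the card shows up as /dev/twa0 and we want port 2):

# smartctl -a -d 3ware,2 /dev/twa0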

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE

1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0

3 Spin_Up_Time 0x0027 253 253 021 Pre-fail Always - 1175

4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 74

5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0

7 Seek_Error_Rate 0x002e 100 253 000 Old_age Always - 0

9 Power_On_Hours 0x0032 100 095 000 Old_age Always - 601

10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0

11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0

12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 74

192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 73

193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 74

194 Temperature_Celsius 0x0022 115 112 000 Old_age Always - 35

196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0

197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0

198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0

199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0

200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

SMART Error Log Version: 1

No Errors Logged

SMART Self-test log structure revision number 1

Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

# 1 Extended offline Completed without error 00% 598 -

# 2 Conveyance offline Completed without error 00% 595 -

# 3 Short offline Completed without error 00% 583 -

# 4 Extended offline Completed without error 00% 563 -

# 5 Short offline Completed without error 00% 537 -

# 6 Short offline Completed without error 00% 512 -

# 7 Short offline Completed without error 00% 488 -

# 8 Short offline Completed without error 00% 464 -

# 9 Short offline Completed without error 00% 440 -

#10 Short offline Completed without error 00% 416 -

1. Look for the following messages after the controller reset and find the bad LBAs

Update (03/24/2010)

The engineer had overlooked the error= field; if it had read error=0x01, it would have pointed to a drive issue. Currently we are looking at the BBU module on the controller itself. It has been removed and I am re-testing to see if I can re-create the problem. However, I am keeping the rest of this post online because it does show how to find a bad disk in a 3ware RAID array; just make sure the log lines do not say error=0x0 before blaming a drive.
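A quick way to check whether any WriteSegment entries carry a non-zero error code (a sketch, assuming the diagnostic dump has been saved as 3ware/Controller_C0.txt as described below):

# grep 'WriteSegment' 3ware/Controller_C0.txt | grep -v 'error=0x0)'

If nothing comes back, every entry reported error=0x0 and the drives are probably not the culprit.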

Update (03/25/2010)

In this specific instance, the BBU (which attaches to the 3ware controller) failed. After disconnecting the BBU and continuing to pound on the RAID for 24 hours, there have been no further issues. When I asked 3ware if it was common for the BBU to fail, they responded: "BBU is a consumable product that does wear out over time and eventual failure is inevitable."

The command you need to get these logs is shown below:

# tw_cli /c0 show diag

You can also use the lsigetlunix.sh script provided by LSI to gather all of the logs for you.
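If you run the diag command by hand instead of using the lsiget script, redirect the output to a file so it can be grepped later (the path below simply mirrors the layout the lsiget bundle produced in my case):

# mkdir -p 3ware
# tw_cli /c0 show diag > 3ware/Controller_C0.txt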

This is what we are looking for after the controller reset occurred:

DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x8, events=30, error=0x0)

# grep -A10000 'Soft Reset Handler Started ...' 3ware/Controller_C0.txt |grep map=

DcbMgr::WriteSegment(map=0x4BCA04, segID=0x8, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4BCA04, segID=0x1, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x8, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x1, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4BCA2C, segID=0x8, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4BCA2C, segID=0x1, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x8, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x1, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4BCA2C, segID=0x8, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4BCA2C, segID=0x1, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x8, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x1, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4BCA04, segID=0x8, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4BCA04, segID=0x1, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x8, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x1, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4BCA2C, segID=0x8, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4BCA2C, segID=0x1, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x8, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x1, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4BCA2C, segID=0x8, events=30, error=0x0)

DcbMgr::WriteSegment(map=0x4BCA2C, segID=0x1, events=30, error=0x0)

# grep 0x 3ware/Controller_C0.txt |grep map= | awk '{print $1}'|cut -f2- -d'('| cut -f1 -d','|sort|uniq

map=0x4BCA04 (events=30)

map=0x4BCA2C (events=30)

map=0x4C4F4C (events=30)

map=0x4CC32C (events=2)

The LSI engineer stated that one of the drives is having problems with write requests and that these are the bad offsets. They are hexadecimal and must be converted to decimal first.

2. Convert the offsets to decimal

Apple's website [2] shows a quick way to convert from hexadecimal to decimal using printf:

printf "%d\n" 0x4BCA2C

4966956

So let's get them all:

for lba in 0x4BCA04 0x4BCA2C 0x4C4F4C 0x4CC32C; do printf "%d=$lba\n" $lba; done

4966916=0x4BCA04

4966956=0x4BCA2C

5001036=0x4C4F4C

5030700=0x4CC32C

3. Divide by the chunk size of the array

I use 64KiB as my stripe size; the LSI engineer stated that each bad LBA (now in decimal) needs to be divided by 64000.
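For example, with bc (a minimal sketch; the original command was not recorded, and scale=20 simply matches the precision shown below):

for lba in 4966916 4966956 5001036 5030700; do echo "scale=20; $lba/64000" | bc; done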

This results in the following output:

77.60806250000000000000

77.60868750000000000000

78.14118750000000000000

78.60468750000000000000

4. Now you need to loop over the drives and find the affected drive

It took me a little while to understand what the engineer was saying; once I wrote it down, it made sense. What he explained to me was, "we count from 1-15 and subtract the offset(s) later."

Details:

There are 16 drives total, but only 15 in the array; the 16th is a hot spare.

Subtract 1 for the offset: the array starts at 0 and we are counting from 1.

Subtract 1 because of the hot spare.

5. Loop the drives.


Now we will "loop through the drives"

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 ->1  LOOP 1
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 ->2  LOOP 2
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 ->3  LOOP 3
46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ->4  LOOP 4
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 ->5  LOOP 5
76 77 78 79 80 81 82 83 84 85 86 87 88 89 ..
   77 <- Here is the problem (offsets=77.6/77.60)
   78 <- Here is the problem (offsets=78.14/78.60)

You would continue if the LBA offsets were higher.

6. Subtract the offsets from earlier

So from above we take "LOOP 5" and subtract the offsets:

-1 since we count from 1 and not 0

-1 since we have a hot spare

So: 5-2 = 3

Counting from 0 we get Drive02:

Drive00 <- Drive1 in system

Drive01 <- Drive2 in system

Drive02 <- Drive3 in system (or p2) as shown below:

# tw_cli info c0

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy

------------------------------------------------------------------------------

u0 RAID-6 OK - - 64K 12107.1 RiW ON

u1 SPARE OK - - - 931.505 - ON

VPort Status Unit Size Type Phy Encl-Slot Model

------------------------------------------------------------------------------

p0 OK u0 931.51 GB SATA 0 - WDC WD1002FBYS-01A6

p1 OK u0 931.51 GB SATA 1 - WDC WD1002FBYS-01A6

p2 OK u0 931.51 GB SATA 2 - WDC WD1002FBYS-01A6

p3 OK u0 931.51 GB SATA 3 - WDC WD1002FBYS-01A6

p4 OK u0 931.51 GB SATA 4 - WDC WD1002FBYS-01A6

p5 OK u0 931.51 GB SATA 5 - WDC WD1002FBYS-01A6

p6 OK u0 931.51 GB SATA 6 - WDC WD1002FBYS-01A6

p7 OK u0 931.51 GB SATA 7 - WDC WD1002FBYS-01A6

p8 OK u0 931.51 GB SATA 8 - WDC WD1002FBYS-01A6

p9 OK u0 931.51 GB SATA 9 - WDC WD1002FBYS-01A6

p10 OK u0 931.51 GB SATA 10 - WDC WD1002FBYS-01A6

p11 OK u0 931.51 GB SATA 11 - WDC WD1002FBYS-01A6

p12 OK u0 931.51 GB SATA 12 - WDC WD1002FBYS-01A6

p13 OK u0 931.51 GB SATA 13 - WDC WD1002FBYS-01A6

p14 OK u0 931.51 GB SATA 14 - WDC WD1002FBYS-01A6

p15 OK u1 931.51 GB SATA 15 - WDC WD1002FBYS-01A6

Drive02 is the disk that needs to be replaced.

7. Export the disk and allow the array to rebuild to the spare

# tw_cli /c0/p2 export

Removing /c0/p2 will take the disk offline.

Do you want to continue ? Y|N [N]: Y

Removing port /c0/p2 ... Done.

8. Allow the array to rebuild and then replace the bad disk

# tw_cli info c0

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy

------------------------------------------------------------------------------

u0 RAID-6 REBUILDING 45%(A) - 64K 12107.1 RiW ON

VPort Status Unit Size Type Phy Encl-Slot Model

------------------------------------------------------------------------------

p0 OK u0 931.51 GB SATA 0 - WDC WD1002FBYS-01A6

p1 OK u0 931.51 GB SATA 1 - WDC WD1002FBYS-01A6

p3 OK u0 931.51 GB SATA 3 - WDC WD1002FBYS-01A6

p4 OK u0 931.51 GB SATA 4 - WDC WD1002FBYS-01A6

p5 OK u0 931.51 GB SATA 5 - WDC WD1002FBYS-01A6

p6 OK u0 931.51 GB SATA 6 - WDC WD1002FBYS-01A6

p7 OK u0 931.51 GB SATA 7 - WDC WD1002FBYS-01A6

p8 OK u0 931.51 GB SATA 8 - WDC WD1002FBYS-01A6

p9 OK u0 931.51 GB SATA 9 - WDC WD1002FBYS-01A6

p10 OK u0 931.51 GB SATA 10 - WDC WD1002FBYS-01A6

p11 OK u0 931.51 GB SATA 11 - WDC WD1002FBYS-01A6

p12 OK u0 931.51 GB SATA 12 - WDC WD1002FBYS-01A6

p13 OK u0 931.51 GB SATA 13 - WDC WD1002FBYS-01A6

p14 OK u0 931.51 GB SATA 14 - WDC WD1002FBYS-01A6

p15 DEGRADED u0 931.51 GB SATA 15 - WDC WD1002FBYS-01A6

Name OnlineState BBUReady Status Volt Temp Hours LastCapTest

---------------------------------------------------------------------------

bbu On Yes OK OK OK 255 12-Dec-2009
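To keep an eye on the rebuild without re-running the command by hand, and to make the controller pick up the replacement disk once the bad one has been physically swapped, something like the following should work (from memory of the tw_cli guide; verify against your firmware's documentation):

# watch -n 60 tw_cli /c0/u0 show
# tw_cli /c0 rescan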

URLs

[1] http://labs.google.com/papers/disk_failures.pdf

[2] http://support.apple.com/kb/HT3392?viewlocale=en_US