
How to calculate LBA offsets on a 3ware 9650SE-16PML

posted Dec 31, 2011, 6:00 AM by Unknown user
Info
GOAL: To document how to find a bad hard drive that is responsible for controller resets.  When a controller reset occurs, the controller may not tell you which disk is bad, so you have to find out manually.
NOTES: Many of these steps were taken from LSI's (formerly 3ware) excellent technical support in a case that I opened with them.
REQUIREMENTS: A shell, tw_cli.

Specifications
Distribution: Debian Testing
Architecture: 64-bit

Outline
1. How to find the root cause of a controller reset, in this specific case.

Setup
1. 3ware 9650SE-16PML RAID controller
2. 16x1TB disks (15x1TB in a RAID-6, the 16th disk is a hot spare)

The Problem
I saw this in the logs:  
Mar 21 05:40:03 p34 kernel: [521953.433965] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting card.  

Investigation
After a preliminary investigation, I opened a case with 3ware for a root cause analysis.  They stated that Drive02 was bad.   However, tw_cli was showing all disks as OK, and the SMART data was all good too.  

In Google's "Failure Trends in a Large Drive Population" [1], they state: "Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives."

Check the RAID health: Looks good.

# tw_cli  info c0

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    OK             -       -       64K     12107.1   RiW    ON
u1    SPARE     OK             -       -       -       931.505   -      ON

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u0   931.51 GB SATA  0   -            WDC WD1002FBYS-01A6
p1    OK             u0   931.51 GB SATA  1   -            WDC WD1002FBYS-01A6
p2    OK             u0   931.51 GB SATA  2   -            WDC WD1002FBYS-01A6
p3    OK             u0   931.51 GB SATA  3   -            WDC WD1002FBYS-01A6
p4    OK             u0   931.51 GB SATA  4   -            WDC WD1002FBYS-01A6
p5    OK             u0   931.51 GB SATA  5   -            WDC WD1002FBYS-01A6
p6    OK             u0   931.51 GB SATA  6   -            WDC WD1002FBYS-01A6
p7    OK             u0   931.51 GB SATA  7   -            WDC WD1002FBYS-01A6
p8    OK             u0   931.51 GB SATA  8   -            WDC WD1002FBYS-01A6
p9    OK             u0   931.51 GB SATA  9   -            WDC WD1002FBYS-01A6
p10   OK             u0   931.51 GB SATA  10  -            WDC WD1002FBYS-01A6
p11   OK             u0   931.51 GB SATA  11  -            WDC WD1002FBYS-01A6
p12   OK             u0   931.51 GB SATA  12  -            WDC WD1002FBYS-01A6
p13   OK             u0   931.51 GB SATA  13  -            WDC WD1002FBYS-01A6
p14   OK             u0   931.51 GB SATA  14  -            WDC WD1002FBYS-01A6
p15   OK             u1   931.51 GB SATA  15  -            WDC WD1002FBYS-01A6

Check the disk (continue reading to find out how to identify the bad disk)

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       1175
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       74
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   095   000    Old_age   Always       -       601
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       74
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       73
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       74
194 Temperature_Celsius     0x0022   115   112   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       598         -
# 2  Conveyance offline  Completed without error       00%       595         -
# 3  Short offline       Completed without error       00%       583         -
# 4  Extended offline    Completed without error       00%       563         -
# 5  Short offline       Completed without error       00%       537         -
# 6  Short offline       Completed without error       00%       512         -
# 7  Short offline       Completed without error       00%       488         -
# 8  Short offline       Completed without error       00%       464         -
# 9  Short offline       Completed without error       00%       440         -
#10  Short offline       Completed without error       00%       416         -
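
The SMART report above is in the format printed by smartctl (from smartmontools); the exact command used is not recorded in this post. The following is a minimal sketch of how such a report could be requested for each port, assuming smartmontools is installed and the controller is exposed as /dev/twa0. Both are assumptions, and smart_cmds is a hypothetical helper name. It only prints the commands so they can be reviewed before being run as root.

```shell
# Print (do not run) one smartctl command per port behind the controller.
# Assumptions: smartmontools is installed; the controller node is /dev/twa0.
smart_cmds() {
    dev=$1      # controller device node (assumed, e.g. /dev/twa0)
    ports=$2    # number of ports to query
    p=0
    while [ "$p" -lt "$ports" ]; do
        # -d 3ware,N selects port N behind a 3ware controller
        echo "smartctl -a -d 3ware,$p $dev"
        p=$((p + 1))
    done
}

smart_cmds /dev/twa0 16
```

The third line printed corresponds to port 2 (p2 in the tw_cli listing).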


1. Look for the following messages after the controller reset and find the bad LBAs
Update (03/24/2010)
The engineer had overlooked the error= field: had it read error=0x01, it would have been a drive issue.  Currently, we are looking at the BBU module on the controller itself.  It has been removed and I am re-testing to see if I can re-create the problem.  However, I am keeping the rest of this post online because it does show how to find a bad disk in a 3ware RAID array; just make sure the log lines do not say error=0x0.
Update (03/25/2010)
In this specific instance, the BBU module (which attaches to the 3ware controller) failed.  After disconnecting the BBU and continuing to pound on the RAID for 24 hours, there have been no further issues.  When I asked 3ware if it was common for the BBU to fail, they responded: "BBU is a consumable product that does wear out over time and eventual failure is inevitable."


The command you need to get these logs is shown below:
# tw_cli /c0 show diag

You can also use the lsigetlunix.sh script provided by LSI to gather all of the logs for you.

This is what we are looking for after the controller reset occurred:
DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x8, events=30, error=0x0)

# grep -A10000 'Soft Reset Handler Started ...' 3ware/Controller_C0.txt  |grep map=
DcbMgr::WriteSegment(map=0x4BCA04, segID=0x8, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4BCA04, segID=0x1, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x8, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x1, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4BCA2C, segID=0x8, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4BCA2C, segID=0x1, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x8, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x1, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4BCA2C, segID=0x8, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4BCA2C, segID=0x1, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x8, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x1, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4BCA04, segID=0x8, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4BCA04, segID=0x1, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x8, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x1, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4BCA2C, segID=0x8, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4BCA2C, segID=0x1, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x8, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x1, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4BCA2C, segID=0x8, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4BCA2C, segID=0x1, events=30, error=0x0)

# grep 0x 3ware/Controller_C0.txt |grep map= | awk '{print $1}'|cut -f2- -d'('| cut -f1 -d','|sort|uniq
map=0x4BCA04 (events=30)
map=0x4BCA2C (events=30)
map=0x4C4F4C (events=30)
map=0x4CC32C (events=2)
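
As the updates above note, the error= field matters as much as the map= offset. The following is a minimal sketch that keeps both when summarising the log, assuming the WriteSegment lines have the format shown above. The helper name summarise_maps is hypothetical, and the sample input is three lines copied from the output above.

```shell
# Read diagnostic log lines on stdin and print one line per distinct
# (map, error) pair, followed by how many times that pair occurred.
summarise_maps() {
    sed -n 's/.*WriteSegment(\(map=0x[0-9A-Fa-f]*\),.*\(error=0x[0-9A-Fa-f]*\)).*/\1 \2/p' |
        sort | uniq -c | awk '{ print $2, $3, "count=" $1 }'
}

# Sample input taken from the log excerpt above.
summarise_maps <<'EOF'
DcbMgr::WriteSegment(map=0x4BCA04, segID=0x8, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4BCA04, segID=0x1, events=30, error=0x0)
DcbMgr::WriteSegment(map=0x4C4F4C, segID=0x8, events=30, error=0x0)
EOF
```

Against the real log this would be run as summarise_maps < 3ware/Controller_C0.txt. Any line that does not show error=0x0 is the one worth a closer look.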

The LSI engineer stated the drive is having problems with write requests and that these are the bad offsets.  They are hexadecimal, so they must be converted to decimal first.

2. Convert the offsets to decimal
On Apple's website [2], they show a quick way to convert from hexadecimal to decimal using printf:
printf "%d\n" 0x4BCA2C
4966956

So let's get them all:
for lba in 0x4BCA04 0x4BCA2C 0x4C4F4C 0x4CC32C; do printf "%d=%s\n" "$lba" "$lba"; done
4966916=0x4BCA04
4966956=0x4BCA2C
5001036=0x4C4F4C
5030700=0x4CC32C

3. Divide by the chunk size of the array
I use 64KiB as my stripe size; the LSI engineer stated that you need to divide each bad LBA by 64000.

This results in the following output:
77.60806250000000000000
77.60868750000000000000
78.14118750000000000000
78.60468750000000000000
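
The command that produced these numbers is not recorded here (the 20 decimal places look like the default output of bc -l). The following is a minimal sketch of the same division using awk, with the divisor of 64000 taken from the engineer's instruction above. The helper name lba_to_chunks is hypothetical.

```shell
# Divide a decimal LBA by the divisor given by the engineer (default 64000)
# and print the result to 8 decimal places.
lba_to_chunks() {
    awk -v lba="$1" -v div="${2:-64000}" 'BEGIN { printf "%.8f\n", lba / div }'
}

for lba in 4966916 4966956 5001036 5030700; do
    lba_to_chunks "$lba"
done
```

The first value printed is 77.60806250, which agrees with the first line of the output above.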

4. Now you need to loop over the drives and find the affected drive
It took me a little while to understand what the engineer was saying; once I wrote it down, it made sense.  What he explained to me was, "we count from 1-15 and subtract the offset(s) later."

Details:
There are 16 drives total, but only 15 are in the array; one is a hot spare.
Subtract 1 for the offset: the array starts at 0 and we are counting from 1.
Subtract 1 for the hot spare.

5. Loop the drives.


Now we will "loop through the drives"
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15->1 LOOP 1
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30->2 LOOP 2
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45->3 LOOP 3
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60->4 LOOP 4
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75->5 LOOP 5
76
77 <- Here is the problem (offsets=77.60/77.60)
78 <- Here is the problem (offsets=78.14/78.60)
79
80
81
82
83
84
85
86
87
88
89
..
You would continue if the LBA offsets were higher.
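
Counting out the list by hand amounts to an integer division: the number of completed loops is the whole part of the offset divided by the 15 drives in the array. The following is a minimal sketch of that reading of the engineer's procedure. The helper name loop_number is hypothetical.

```shell
# Given an offset such as 77.60806250, drop the fractional part and count
# how many complete passes over the drives (default 15) fit into it.
loop_number() {
    chunks=$1
    drives=${2:-15}
    whole=${chunks%%.*}        # 77.60806250 -> 77
    echo $((whole / drives))   # shell integer division
}

loop_number 77.60806250
loop_number 78.60468750
```

Both calls print 5, matching "LOOP 5" in the list above.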

6. Subtract the offsets from earlier
From above, we take "LOOP 5" and subtract the offsets:
-1 since we count from 1 and not 0
-1 since we have a hot spare
So: 5-2 = 3

Counting from 0 we get Drive02:
Drive00 <- Drive1 in system
Drive01 <- Drive2 in system
Drive02 <- Drive3 in system (or p2) as shown below:

# tw_cli  info c0

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    OK             -       -       64K     12107.1   RiW    ON
u1    SPARE     OK             -       -       -       931.505   -      ON

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u0   931.51 GB SATA  0   -            WDC WD1002FBYS-01A6
p1    OK             u0   931.51 GB SATA  1   -            WDC WD1002FBYS-01A6
p2    OK             u0   931.51 GB SATA  2   -            WDC WD1002FBYS-01A6
p3    OK             u0   931.51 GB SATA  3   -            WDC WD1002FBYS-01A6
p4    OK             u0   931.51 GB SATA  4   -            WDC WD1002FBYS-01A6
p5    OK             u0   931.51 GB SATA  5   -            WDC WD1002FBYS-01A6
p6    OK             u0   931.51 GB SATA  6   -            WDC WD1002FBYS-01A6
p7    OK             u0   931.51 GB SATA  7   -            WDC WD1002FBYS-01A6
p8    OK             u0   931.51 GB SATA  8   -            WDC WD1002FBYS-01A6
p9    OK             u0   931.51 GB SATA  9   -            WDC WD1002FBYS-01A6
p10   OK             u0   931.51 GB SATA  10  -            WDC WD1002FBYS-01A6
p11   OK             u0   931.51 GB SATA  11  -            WDC WD1002FBYS-01A6
p12   OK             u0   931.51 GB SATA  12  -            WDC WD1002FBYS-01A6
p13   OK             u0   931.51 GB SATA  13  -            WDC WD1002FBYS-01A6
p14   OK             u0   931.51 GB SATA  14  -            WDC WD1002FBYS-01A6
p15   OK             u1   931.51 GB SATA  15  -            WDC WD1002FBYS-01A6

Drive02 is the disk that needs to be replaced.
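
The subtraction in step 6 and the switch to zero-based drive names can be written down as a short sketch. It only restates the arithmetic of this post (loop number, minus 1 for counting from 1, minus 1 for the hot spare, then counted from 0). The helper name drive_label is hypothetical.

```shell
# Turn a loop number into the zero-based drive name and tw_cli port.
drive_label() {
    loop=$1
    count=$((loop - 2))     # -1 for counting from 1, -1 for the hot spare
    index=$((count - 1))    # the 3rd drive, counted from 0, is index 2
    printf 'Drive%02d (p%d)\n' "$index" "$index"
}

drive_label 5
```

For loop 5 this prints "Drive02 (p2)".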

7. Export the disk and allow the array to rebuild to the spare
# tw_cli  /c0/p2 export
Removing /c0/p2 will take the disk offline.
Do you want to continue ? Y|N [N]: Y
Removing port /c0/p2 ... Done.

8. Allow the array to rebuild and then replace the bad disk
# tw_cli  info c0

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    REBUILDING     45%(A)  -       64K     12107.1   RiW    ON

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u0   931.51 GB SATA  0   -            WDC WD1002FBYS-01A6
p1    OK             u0   931.51 GB SATA  1   -            WDC WD1002FBYS-01A6
p3    OK             u0   931.51 GB SATA  3   -            WDC WD1002FBYS-01A6
p4    OK             u0   931.51 GB SATA  4   -            WDC WD1002FBYS-01A6
p5    OK             u0   931.51 GB SATA  5   -            WDC WD1002FBYS-01A6
p6    OK             u0   931.51 GB SATA  6   -            WDC WD1002FBYS-01A6
p7    OK             u0   931.51 GB SATA  7   -            WDC WD1002FBYS-01A6
p8    OK             u0   931.51 GB SATA  8   -            WDC WD1002FBYS-01A6
p9    OK             u0   931.51 GB SATA  9   -            WDC WD1002FBYS-01A6
p10   OK             u0   931.51 GB SATA  10  -            WDC WD1002FBYS-01A6
p11   OK             u0   931.51 GB SATA  11  -            WDC WD1002FBYS-01A6
p12   OK             u0   931.51 GB SATA  12  -            WDC WD1002FBYS-01A6
p13   OK             u0   931.51 GB SATA  13  -            WDC WD1002FBYS-01A6
p14   OK             u0   931.51 GB SATA  14  -            WDC WD1002FBYS-01A6
p15   DEGRADED       u0   931.51 GB SATA  15  -            WDC WD1002FBYS-01A6

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       255    12-Dec-2009
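
To follow the rebuild without reading the whole table each time, the unit line can be picked out of the tw_cli output. The following is a minimal sketch, assuming the column layout shown above (unit name, type, status, %RCmpl). The helper name unit_status is hypothetical, and the sample input is the unit line from the output above.

```shell
# Read "tw_cli info c0" output on stdin and print, for each unit line,
# the unit name, its status and the rebuild completion column.
unit_status() {
    awk '$1 ~ /^u[0-9]+$/ { print $1, $3, $4 }'
}

# Sample input copied from the listing above.
unit_status <<'EOF'
u0    RAID-6    REBUILDING     45%(A)  -       64K     12107.1   RiW    ON
EOF
```

On the live system this would be run as tw_cli info c0 | unit_status. For the sample line it prints "u0 REBUILDING 45%(A)".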


URLs


