finding faulted drive

Added by Caleb Baker 11 months ago

I have a fairly new Nexenta box set up and we lost a drive. How do I find the bad drive in this 40-drive system? Here is what Nexenta is showing me:

volume : nexentasanvol
state : ONLINE
scan : none requested
config : NAME                       STATE     READ WRITE CKSUM
nexentasanvol              ONLINE       0     0     0
raidz2-0                 ONLINE       0     0     0
c0t5000C5004164DC63d0  ONLINE       0     0     0
c0t5000C500416503DFd0  ONLINE       0     0     0
c0t5000C50041653CFFd0  ONLINE       0     0     0
c0t5000C50041654C7Bd0  ONLINE       0     0     0
c0t5000C500416559B3d0  ONLINE       0     0     0
c0t5000C50041655A27d0  ONLINE       0     0     0
c0t5000C50041655FEFd0  ONLINE       0     0     0
raidz2-1                 ONLINE       0     0     0
c0t5000C50041656233d0  ONLINE       0     0     0
c0t5000C500416569EBd0  ONLINE       0     0     0
c0t5000C50041657443d0  ONLINE       0     0     0
c0t5000C50041657513d0  ONLINE       0     0     0
c0t5000C50041657A27d0  ONLINE       0     0     0
c0t5000C50041657E47d0  ONLINE       0     0     0
c0t5000C500416585EFd0  ONLINE       0     0     0
raidz2-2                 ONLINE       0     0     0
c0t5000C500416592B7d0  ONLINE       0     0     0
c0t5000C5004165A697d0  ONLINE       0     0     0
c0t5000C5004165A87Fd0  ONLINE       0     0     0
c0t5000C5004165A917d0  ONLINE       0     0     0
c0t5000C5004165AC6Bd0  ONLINE       0     0     0
c0t5000C5004165AC77d0  ONLINE       0     0     0
c0t5000C50041717E2Fd0  ONLINE       0     0     0
raidz2-3                 ONLINE       0     0     0
c0t5000C50041717E67d0  ONLINE       0     0     0
c0t5000C50041717F0Fd0  ONLINE       0     0     0
c0t5000C5004171D5CBd0  ONLINE       0     0     0
c0t5000C5004171DD4Bd0  ONLINE       0     0     0
c0t5000C5004171E90Bd0  ONLINE       0     0     0
c0t5000C50041723183d0  ONLINE       0     0     0
c0t5000C5004172338Fd0  ONLINE       0     0     0
logs
c0t5E83A97EC1F466E5d0    ONLINE       0     0     0
c0t5E83A97ED82275FFd0    ONLINE       0     0     0
cache
c0t5E83A97E31A954F9d0    ONLINE       0     0     0
c0t5E83A97E3364B9B9d0    ONLINE       0     0     0
c0t5E83A97E3ACE7AACd0    FAULTED      0     4     0  too many errors
c0t5E83A97E3AD14DB8d0    ONLINE       0     0     0
c0t5E83A97E6A76955Ed0    ONLINE       0     0     0
c0t5E83A97EB1270C0Ad0    ONLINE       0     0     0
spares
c0t5000C500417236D3d0    AVAIL
c0t5000C50041724AFBd0    AVAIL
errors : No known data errors

I can't seem to find anything that correlates with the serial numbers I have and the "blink" function doesn't seem to work either.


Replies

RE: finding faulted drive - Added by Linda Kateley 11 months ago

Blink is the usual way to find the drive. We also have several commands that may help in identifying the drive. They are described on page 49 of the user guide.

but..

show lun slotmap

may help

or if you have a supported jbod

To view JBODs' states open Settings → Disks → JBODs:

RE: finding faulted drive - Added by Tommy Scherer 10 months ago

Before proceeding with replacing the drive, let's first talk about how OpenSolaris identifies a SAS disk. "c?" is the virtual scsi_vhci controller. "t..." is the unique target device name. The "d0" at the end means LUN 0; scsi_vhci always uses LUN 0.

So the lengthy scsi_vhci device name we are going to replace in our example below is "c0t5000C5003C42B7E3d0".
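Since the device name is just the controller, the target WWN and the LUN glued together, the WWN can be pulled out mechanically, which is handy when matching against entries under /devices/scsi_vhci. A minimal sketch, assuming 16 hex digit WWNs like the ones in this thread; the function name wwn_from_device is made up for illustration:

```shell
# Extract the WWN portion of a scsi_vhci device name such as
# c0t5000C5003C42B7E3d0 and lower-case it, giving the form used
# in /devices/scsi_vhci entries (disk@g5000c5003c42b7e3).
# Assumption: the target part is exactly 16 hex digits.
wwn_from_device() {
    # Strip the leading "c<N>t", the trailing "d<N>" and an optional slice "s<N>".
    printf '%s\n' "$1" |
        sed -n 's/^c[0-9][0-9]*t\([0-9A-Fa-f]\{16\}\)d[0-9][0-9]*\(s[0-9][0-9]*\)\{0,1\}$/\1/p' |
        tr 'A-F' 'a-f'
}
```

Calling wwn_from_device c0t5000C5003C42B7E3d0 prints 5000c5003c42b7e3; a name that does not fit the pattern prints nothing.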

First, log into the CLI of Nexenta (NMC) onto the node that has the volume pool that is degraded. Then run the command "zpool status vol0" (vol0 is the volume pool name) and look for the degraded vdev:

mirror-0                    DEGRADED     0     0     0
  c0t5000C5003C413C53d0     ONLINE       0     0     0
  spare-1                   DEGRADED     0     0     0
    10982770408670368315    UNAVAIL      0     0     0  was /dev/dsk/c0t5000C5003C42B7E3d0s0
    c0t5000C5003B677C47d0   ONLINE       0     0     0

So the bad disk is c0t5000C5003C42B7E3d0s0, which was replaced by the spare disk that is now online in this mirror group.
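On a pool with dozens of disks it can be easier to filter the status output than to read it by eye. A rough sketch, assuming the status text is piped in on stdin and that the state keyword sits in the second column as in the listings in this thread; the function name unhealthy_devices is hypothetical:

```shell
# Print "name state" for every vdev or device line in `zpool status`
# output (read from stdin) whose state is not healthy. If the line
# carries a "was <path>" note, that is appended too.
unhealthy_devices() {
    awk '
        # Skip header lines such as "state: DEGRADED" (first field ends in a colon).
        $1 !~ /:$/ && $2 ~ /^(DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ {
            was = ""
            for (i = 3; i < NF; i++)
                if ($i == "was") was = " was " $(i + 1)
            print $1 " " $2 was
        }
    '
}
```

For example, zpool status vol0 | unhealthy_devices would list the degraded vdevs and the unavailable disk from the output above, and nothing at all on a healthy pool.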

Next is to get the serial number of the bad disk:

Get the MPxIO name first:

root@stor01-a-ha:/devices/scsi_vhci# ls | grep -i 5000C5003C42B7E3 | head
disk@g5000c5003c42b7e3 <-- This is the one to use!!
disk@g5000c5003c42b7e3:a
disk@g5000c5003c42b7e3:a,raw
disk@g5000c5003c42b7e3:b
disk@g5000c5003c42b7e3:b,raw
disk@g5000c5003c42b7e3:c
disk@g5000c5003c42b7e3:c,raw
disk@g5000c5003c42b7e3:d
disk@g5000c5003c42b7e3:d,raw
disk@g5000c5003c42b7e3:e

The serial number is shown below in the "value" field (first 8 characters):

root@stor01-a-ha:/devices/scsi_vhci# prtconf -v /devices/scsi_vhci/disk@g5000c5003c42b7e3 | egrep -A1 serial
name='inquiry-serial-no' type=string items=1 dev=none
value='6SK046F50000N202KZ4P'

Now that we have the serial number, we will use the "sas2ircu" command to get the location of the bad disk and the type of disk:

Display Type of Drive and Location of Disk

root@stor01-a-ha:/devices/scsi_vhci# /lsi/sas2ircu 1 display | grep -B7 -A2 6SK046F5
Enclosure # : 2
Slot # : 22
State : Ready (RDY)
Size (in MB)/(in sectors) : 429247/879097967
Manufacturer : SEAGATE
Model Number : ST3450857SS
Firmware Revision : 0008
Serial No : 6SK046F5
Protocol : SAS
Drive Type : SAS_HDD

Note: "sas2ircu 0" displays drives on the internal LSI SAS adapter, while "sas2ircu 1" displays drives on the external LSI SAS adapter.
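The serial lookup and the slot lookup can be chained so you do not have to copy the serial by hand. This is only a sketch that assumes the prtconf and sas2ircu output looks like the samples above; the function names serial_prefix and slot_for_serial are invented for this example:

```shell
# Read `prtconf -v <disk> | egrep -A1 serial` output on stdin and print
# the first 8 characters of the inquiry-serial-no value.
serial_prefix() {
    sed -n "s/.*value='\([^']\{8\}\).*/\1/p" | head -n 1
}

# Read `sas2ircu <n> display` output on stdin and print "enclosure:slot"
# for the drive whose "Serial No" equals the serial given as $1.
# Assumption: Enclosure # and Slot # come before Serial No in each block.
slot_for_serial() {
    awk -v want="$1" -F' *: *' '
        {
            sub(/[ \t\r]+$/, "")       # drop trailing whitespace or CR
            key = $1
            sub(/^[ \t]+/, "", key)    # drop leading indentation
        }
        key == "Enclosure #" { enc = $2 }
        key == "Slot #"      { slot = $2 }
        key == "Serial No" && $2 == want { print enc ":" slot }
    '
}
```

Usage would be along the lines of: sn=$(prtconf -v /devices/scsi_vhci/disk@g5000c5003c42b7e3 | egrep -A1 serial | serial_prefix), then /lsi/sas2ircu 1 display | slot_for_serial "$sn", which for the disk above should give 2:22.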

Locate Bad Disk

root@stor01-a-ha:~# /lsi/sas2ircu 1 LOCATE 2:22 ON

Now go to the enclosure and locate the bad disk, which should have a red flashing light on it. Wait for the replacement drive, and once you have it, turn off the flashing light (and remember where the disk was located) by running the following command:

root@stor01-a-ha:~# /lsi/sas2ircu 1 LOCATE 2:22 OFF
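Because it is easy to mistype the enclosure:slot pair, a tiny helper that checks the arguments and only prints the command can save a trip to the wrong drive bay. A sketch only: it assumes sas2ircu lives at /lsi/sas2ircu as in the commands above, and the function name locate_cmd is made up:

```shell
# Print (do not run) the sas2ircu LOCATE command for the given
# controller, enclosure, slot and ON/OFF state.
# Returns non-zero if any argument looks wrong.
locate_cmd() {
    for n in "$1" "$2" "$3"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "controller, enclosure and slot must be numeric" >&2
                return 1 ;;
        esac
    done
    case "$4" in
        ON|OFF) ;;
        *)
            echo "state must be ON or OFF" >&2
            return 1 ;;
    esac
    echo "/lsi/sas2ircu $1 LOCATE $2:$3 $4"
}
```

locate_cmd 1 2 22 ON prints the same command used above, so you can eyeball it before pasting it into the root shell.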

Another option for blinking the bad drive is the "show lun c0t5000C5003C42B7E3d0 blink" command. Note that this blinks the activity LED, while the sas2ircu method above blinks the red LED instead.

nmc@stor01-a-ha:/$ show lun c0t5000C5003C42B7E3d0 blink
The disk 'c0t5000C5003B677C47d0' is currently part of the volume: 'vol0'
Warning! Please make sure there is no IO activity on this volumes before continuing ...
Proceed to blink? Yes
Enabled blinking LED activity for disk 'c0t5000C5003C42B7E3d0' (press Ctrl-C to interrupt)...

So once you get the replacement disk, remove the bad disk and then insert the new disk. Once you have done that, log onto both HA nodes (e.g. stor01-a-mgmt and stor01-b-mgmt) and run the following commands:

nmc>$ !bash

cfgadm -al (again do this on both nodes)

nmc (this will take you out of OpenSolaris and back into Nexenta NMC)

Then run the following command to replace the bad disk in the volume pool called vol0 within the NMC:

nmc>$ setup volume vol0 replace-lun -r -y
LUN to replace : c0t5000C5003C42B7E3d0
Please confirm using the same 'c0t5000C5001D5DD7CFd0' as a replacement? Yes

If a drive is being difficult when trying to replace it, try the commands below:

nmc>$ !bash (this will take you into OpenSolaris)

zpool replace vol0 c0t5000C5003C42B7E3d0 c0t5000C5001D5DD7CFd0 <-- first WWPN is bad disk followed by WWPN of replacement disk

zpool status vol0 (should see resilvering of disk)

If the commands above worked with no errors, run the "show volume volume-pool-name status" command. Example:

nmc>$ show volume data status

sas2ircu (143.1 KB)