Dyk 5 27


Did You Know LSI FC Disk Arrays, SYMplicity, and Fibre Channel Troubleshooting

Fibre Channel Troubleshooting DYK Agenda


LSI FC Array Overview
Fibre Channel Arbitrated Loops
Multi-Path Proxy Driver MPPD
Commands and Interface to the Arrays
SYMplicity 8.37 Storage Management
Differences Between SYMplicity 8.37 and 7.15
SYMplicity Command Line SMcli
Controller Shell
Fibre Channel Fault Reporting
Host Side Path Failures
Path Failures in Switch Configurations
Recovering From Downed Paths During Reboot
Drive Side Path Failures
Debugging Drive Side Problems Using RLS Data
Debugging Host Side Problems Using RLS Data from lsiUtil
Parity Checking and Predictive Failures
Reference Links
wes5tools

1/30/2012

NCR Proprietary & Confidential

LSI Fibre Channel Arrays Hardware Overview


NCR Enterprise Storage LSI FC Storage Release History


Marketing Name       WES 5.0               WES 5.4                  NS 5.5               NS 6.0                 NS 6.1
NCR Class Model      6840-1440, 6840-1456  6840-1440, 6840-1456     6841-2456            6841-6456              6841-7456
LSI Controller       4884                  4884                     4884                 5884                   5885
Operating System     MPRAS                 MPRAS                    MPRAS, W2K           MPRAS, W2K             MPRAS
Release Date         April-02              May-03                   May-03               May-03                 2H04
Controller AP code   4.02.01.08            5.37.05.00               5.37.05.00           5.37.05.00             5.43.xx.xx
Filename             Mojave 2              Sonoran 3.7              Sonoran 3.7          Sonoran 3.7            Sonoran 4.3
RAID Mgr SW          SYMplicity 7.15       SYMplicity 8.37          SYMplicity 8.37      SYMplicity 8.37        SYMplicity 8.43
Redundant Path SW    MPPD 7.15             MPPD 8.37                MPRAS - MPPD 8.37    MPRAS - MPPD 8.37      MPRAS - MPPD 8.43
                                                                    W2K - RDAC 8.37      W2K - RDAC 8.37
Release Highlights   Initial Release       SYMplicity 8.37 Upgrade  2Gbit FC backend     New controller (5884)  DAP controller (5885)


NCR Enterprise Storage Hardware


LSI Fibre Channel Arrays
[Figure: array models with their drive modules and controller modules]

Model                   Drive Module                              Controller
6840-1440               1 Gbit SYM 2200 (up to 40 drives)         4884
6840-1456               1 Gbit FC 2500 (up to 56 drives)          4884
6841-2456               2 Gbit FC 2600 (up to 56 drives)          4884
6841-6456, 6841-7456    2 Gbit FC 2600 (up to 56 drives)          588X

Controller Module 2 Gbit FC 1250
Controller Module FC 1275: 5884 - 6841-6456, 5885 (DAP) - 6841-7456


Storage Upgrades
6840 (WES 5.x) may be upgraded from SYM 7.15 to SYM 8.37 (WES 5.4)
    A software and firmware upgrade

6840 (WES 5.4) is not field upgradeable to 6841 (NS 5.5 or 6.0)
    Would require installing a new 2 Gbit drive tray chassis

6841-2456 (NS 5.5) is not field upgradeable to 6841-6456 (NS 6.0)
    Would require installing a new 588x compatible controller chassis

6841-6456 (NS 6.0) is field upgradeable to 6841-7456 (NS 6.1) (when released)
    Software and firmware upgrade
    New controller board


Storage Co-existence
The following arrays may coexist in a Teradata system, but not within a clique:

WES 3.x (SCSI Quad-array)
    Requires RAID Manager 5 on nodes.

WES 5.x (FC w/1Gbit backend) must upgrade to NS 5.4
    Must be upgraded to SYMplicity 8.37 (mandatory FRO).

NS 5.5 (FC w/2Gbit end-to-end)
    Requires SYMplicity 8.37.

NS 6.0 (High performance FC w/2Gbit end-to-end)
    Requires SYMplicity 8.37.

NS 6.1 (High performance DAP FC w/2Gbit end-to-end)
    Requires SYMplicity 8.43.

Note: SYMplicity AWS SW must be at the same level as the latest version of SYMplicity SW in the system.


Fibre Channel Arbitrated Loops


Configurations where FC Switches are NOT used


Configuration - Drive Side FC Cabling


4 Fibre Channel loops connect the 4 drive trays to the controllers. Each drive tray is connected to 2 loops, and each loop controls half of the drives.

[Figure: drive side cabling. Each drive tray has two ESMs (ESM-A and ESM-B) with FC-AL ports and X10/X1 tray number switches, two fan FRUs, and two power supply FRUs. The controller module has IN and OUT ports for drive loops 4, 3, 2, and 1.]

Drive Tray 4 (loops 4 and 3)
Drive Tray 3 (loops 2 and 3)
Drive Tray 2 (loops 3 and 1)
Drive Tray 1 (loops 1 and 4)

There are 8 mini-hubs (4 host side, 4 drive side). The mini-hub number matches the drive loop (channel) number.

Configuration - Drive Side Loops

[Figure: drive side loops. Drive trays 4 to 1 each contain an ESM A and an ESM B with port bypass circuits (PBC) and SFP connectors. The loops run through mini-hubs CH 4, CH 3, CH 2, and CH 1, each with IN and OUT ports (SFPs or GBICs). In both Controller A and Controller B, TachyonTL chips 0, 1, 2, and 3 serve loops CH1, CH2, CH3, and CH4; TachyonTL chips 4 and 5 are also shown in each controller.]

Configuration - Host Side FC Cabling


There are 4 host side loops; each loop connects 2 nodes to a controller. Each node has separate loop connections to controllers A and B.

[Figure: host side cabling. The controller module host ports A1, B1, A2, and B2 each have IN and OUT connectors. Nodes 1 to 4 each have two host adapters, HA1 and HA2. A label marks the connections to arrays 2, 3, and 4.]

Configuration - Host Side Loops


[Figure: host side loops. Controller A and Controller B each contain TachyonTL chips; a comm board provides RS232 and Ethernet (ETH) connections. Mini-hubs A1, A2, B1, and B2, each with GBIC IN and GBIC OUT ports, form Controller A Fibre Channel Host Loop 1, Controller A Fibre Channel Host Loop 2, Controller B Fibre Channel Host Loop 1, and Controller B Fibre Channel Host Loop 2. Nodes 1 to 4 each have two FC HBAs.]

Fibre Channel Arbitrated Loop (FC-AL)


Up to 127 devices can be attached to and participate on the same Fibre Channel Arbitrated Loop (FC_AL). The FC_AL is not a token-passing scheme but a serial data channel that provides logical point-to-point service to two communicating devices or ports. Only one point-to-point circuit can exist at any one time, so only two devices or ports communicate at once. All other devices on the loop see the data transferred across the loop and retransmit it so that it reaches its intended destination.

Fibre Channel Serial Transmission


FC serial transmission delivers 10-bit characters which represent encoded data. Of the 1,024 characters possible with the 10-bit space, 256 8-bit byte data characters are mapped, along with 1 control character. This mapping process is called 8B/10B encoding. This encoding method involves selecting encoded 10-bit characters to maintain an equal number of 1s and 0s in a serial stream of bits. To prevent too many ones or zeros on the serial interface from causing a DC electrical shift of the serial media, the encoder monitors the number of ones in the encoded character and selects the option of the 10-bit encode character that will shift to balance the total number of zeros and ones. This balancing is called running disparity.

Arbitrated Loop Physical Address (AL_PA)


Each device communicating on an arbitrated loop must have a unique Arbitrated Loop Physical Address (AL_PA). Valid AL_PA values lie in the range 0x00 - 0xEF. The lower a device's AL_PA, the higher its priority on the loop. AL_PA 0x00 is reserved for a fabric switch port. The AL_PA is an 8-bit (1-byte) value whose 8B/10B encoding must have an equal number of 1s and 0s to maintain neutral running disparity. This leaves 127 valid addresses. The Loop_ID is a sequential decimal value assigned to each hex AL_PA value. The Loop_ID values range from 0 to 125 (AL_PA 0x00 is not assigned a Loop_ID).

Some OSs use the Loop_ID to access devices (NCR does not).
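The neutral disparity rule can be checked with a short Python sketch. It is only an illustration: the two sets of balanced sub-block values are taken from the standard 8B/10B 5B/6B and 3B/4B code tables rather than from this presentation, and the function names are hypothetical.

```python
# Sub-block values whose 8B/10B code has equal numbers of 1s and 0s.
# Assumption: taken from the standard 5B/6B and 3B/4B code tables,
# not from this presentation.
NEUTRAL_5B = {3, 5, 6, 7, 9, 10, 11, 12, 13, 14,
              17, 18, 19, 20, 21, 22, 25, 26, 28}
NEUTRAL_3B = {1, 2, 3, 5, 6}


def has_neutral_disparity(byte):
    """True when the 10-bit encoding of `byte` has five 1s and five 0s."""
    low5_neutral = (byte & 0x1F) in NEUTRAL_5B
    high3_neutral = (byte >> 5) in NEUTRAL_3B
    # Either both sub-blocks are balanced, or their imbalances cancel out.
    return low5_neutral == high3_neutral


def valid_al_pas():
    """Neutral disparity values in the AL_PA range 0x00-0xEF, ascending."""
    return [b for b in range(0x00, 0xF0) if has_neutral_disparity(b)]


def loop_id_table():
    """Map Loop_ID -> AL_PA, highest AL_PA first; 0x00 gets no Loop_ID."""
    ordered = sorted((a for a in valid_al_pas() if a != 0x00), reverse=True)
    return dict(enumerate(ordered))


if __name__ == "__main__":
    table = loop_id_table()
    print(len(valid_al_pas()), "valid AL_PAs")
    for loop_id in (0, 1, 2, 3, 4, 122, 123, 124, 125):
        print(loop_id, hex(table[loop_id]))
```

Under these assumptions the sketch yields 127 values including 0x00, with Loop_ID 0 at AL_PA 0xef and Loop_ID 125 at AL_PA 0x01.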



Fibre Channel Arbitrated Loop (FC-AL)


Arbitrated Loop Physical Address (AL_PA) and Loop ID

AL_PA    Loop ID
0xef     0          Lowest Priority
0xe8     1
0xe4     2
0xe2     3
0xe1     4
.        .
.        .
0x08     122
0x04     123
0x02     124
0x01     125        Highest Priority

[Figure: example AL_PA assignments on a drive side FC-AL loop (can be up to 32 devices), showing Controller A, Controller B, the drive tray ESMs, and the drives, and on a host side FC-AL loop, showing the controllers and the nodes.]

Loop Initialization
Loop Initialization Primitive (LIP) is the process used to initialize the loop and assign an AL_PA to each device. Loop initialization occurs at power-up, when any device detects a loop failure (loss of signal synchronization at its receiver), or when a device is inserted or removed. A LIP signal can be sent by one or many devices, depending upon the cause. The LIP will propagate around the loop, triggering all other devices to transmit LIP as well. At this point the loop is not usable (for milliseconds).

A second series of signals selects a loop master to control the AL_PA assignments. If a fabric port is on the loop it will become the master; otherwise the device with the numerically lowest port name (WWN) will be selected.

Loop Initialization (cont.)


After a master is selected, each device selects an AL_PA based on the following sequence:
    1. Assigned by Fabric
    2. Previously assigned AL_PA
    3. Hard assigned (preferred AL_PA)
    4. Soft assigned, first available unused AL_PA

The last step builds a list of all devices and assigned AL_PAs; the complete list is sent to all devices.
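The master selection rule and the four-step AL_PA selection order can be sketched in Python as follows. This is a simplified model for illustration only, not the actual loop initialization frames; the Port class, its field names, and the "lowest unused value" soft assignment are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Port:
    """Hypothetical model of one loop port."""
    wwn: int                              # 64-bit port name
    is_fabric: bool = False               # True for a fabric (switch) port
    fabric_al_pa: Optional[int] = None    # step 1: assigned by fabric
    previous_al_pa: Optional[int] = None  # step 2: previously assigned
    hard_al_pa: Optional[int] = None      # step 3: hard assigned (preferred)


def select_master(ports):
    """A fabric port becomes master; otherwise the lowest WWN wins."""
    fabric = [p for p in ports if p.is_fabric]
    return fabric[0] if fabric else min(ports, key=lambda p: p.wwn)


def assign_al_pas(ports, free_al_pas):
    """Run the four selection steps in order; returns {wwn: al_pa}."""
    free = sorted(free_al_pas)
    assigned = {}
    for attr in ("fabric_al_pa", "previous_al_pa", "hard_al_pa"):
        for port in ports:
            wanted = getattr(port, attr)
            if port.wwn not in assigned and wanted in free:
                free.remove(wanted)
                assigned[port.wwn] = wanted
    for port in ports:
        # Step 4: soft assignment takes the first unused AL_PA.
        if port.wwn not in assigned and free:
            assigned[port.wwn] = free.pop(0)
    return assigned
```

When two ports ask for the same hard address, the first one in the list keeps it and the other falls through to soft assignment, which models the duplicate hard address case.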


Arbitrated Loop Physical Address (AL_PA)


Devices can have a hard assigned or preferred AL_PA, which will not be changed by the LIP process unless duplicate hard assigned physical addresses are found.
    Host Bus Adapters - Preferred AL_PA defined in HBA firmware (cannot be changed).
    Disk Array Controllers - NVSRAM setting. The default value can be changed by SYMplicity (do not change).
    Drives - Obtained from the drive tray backplane based on drive slot and tray ID (cannot be changed).

Loop Arbitration
When a device is ready to transmit data, it must first arbitrate and gain control of the loop. It does this by transmitting the Arbitrate (ARBx) signal, where x = the AL_PA of the device. If the device receives its own ARBx signal, it gains control of the loop. If, however, a higher priority (lower AL_PA) node wishes to use the loop, it discards the lower priority ARBx and replaces it with its own. Since the original node does not see its own signal returning, it cannot win arbitration; instead it passes on the higher priority ARBx signal.

After a device wins arbitration, it transmits an Open (OPN) signal to a destination device, thus opening a point-to-point communication. Once a device relinquishes control of the loop, the other devices again have a chance to arbitrate. An Access Fairness Algorithm prohibits a device from arbitrating again until all other devices have had a chance to arbitrate.
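The outcome of arbitration with access fairness can be modelled in a few lines of Python. This is a sketch of the result only (lowest AL_PA wins, and a winner waits until the others have had a turn), not of the ARBx signalling itself; the function and parameter names are hypothetical.

```python
def arbitrate(requesting, already_served):
    """Return the AL_PA that wins the loop, or None if nobody is arbitrating.

    requesting:     set of AL_PAs that currently want the loop.
    already_served: set of AL_PAs that have won in this fairness window;
                    it is updated in place.
    """
    eligible = requesting - already_served
    if not eligible:
        # Everyone who wanted the loop has had a turn: open a new window.
        already_served.clear()
        eligible = set(requesting)
    if not eligible:
        return None
    winner = min(eligible)        # lowest AL_PA = highest priority
    already_served.add(winner)
    return winner


if __name__ == "__main__":
    served = set()
    wanting = {0xEF, 0xE8, 0x01}
    print([hex(arbitrate(wanting, served)) for _ in range(4)])
```

With three devices that always want the loop, the device at AL_PA 0x01 wins first but does not win again until 0xe8 and 0xef have each been served.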

Fibre Channel Addressing


Fibre Channel uses a 24-bit address identifier called the Port_ID, which is dynamically assigned during initialization. For Arbitrated Loop devices the upper two bytes will be '0000' and the lower byte will be the AL_PA, for example 0x0000e8.

If the devices are attached to a Fibre Channel switch or Fabric, the devices will attempt a Fabric Login, and the switch will assign the upper two bytes of the device's port address identifier and usually allow the low byte to be the device's AL_PA.

The Low Level Fibre driver knows about devices based upon their Port_ID. It does not know about devices based on tray or slot identifier.
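As an illustration of the address layout, the Python sketch below packs and unpacks a 24-bit Port_ID. The names "domain" and "area" for the two upper bytes are the usual Fibre Channel terms and are an assumption here, since the slide only calls them the upper two bytes; the function names are hypothetical.

```python
def port_id(al_pa, domain=0x00, area=0x00):
    """Build a 24-bit Port_ID; the upper two bytes stay 0 on a private loop."""
    for name, value in (("domain", domain), ("area", area), ("al_pa", al_pa)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} must fit in one byte")
    return (domain << 16) | (area << 8) | al_pa


def split_port_id(pid):
    """Return (domain, area, al_pa) for a 24-bit Port_ID."""
    return (pid >> 16) & 0xFF, (pid >> 8) & 0xFF, pid & 0xFF


if __name__ == "__main__":
    # Loop device with AL_PA 0xe8, as in the example above.
    print(f"0x{port_id(0xE8):06x}")
    # Same AL_PA after a switch assigned hypothetical upper bytes 0x01, 0x04.
    print(f"0x{port_id(0xE8, domain=0x01, area=0x04):06x}")
```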


World Wide Names and Fibre Channel Addressing


Even with a 24-bit Port_ID there still needs to be a way of uniquely identifying a device or port. This is accomplished using World Wide Names (WWN), each of which is a fixed 64-bit value. World Wide Names are assigned to:
    Nodes (WW Node_Name) - a node is any device on the FC
    Ports (WW Port_Name) - a device may have multiple ports
    Fabrics (WW Fabric_Name)

WWNs are used in mapping to upper layer protocols. The name server table within the switch maps Port_ID to WWN, thus enabling public devices on the FC_AL to communicate with nodes outside the loop.

World Wide Names


Drives and HBAs - The WWN is assigned by the manufacturer and is hard coded on the device.
Controllers - 12 bit change number + 48 bit MAC address
Note: If a controller is replaced the WWN does not change; the new controller keeps the previously assigned WWN for that controller slot.
Example WWNs for controller FC ports: 200100a0-b80f675c and 200200a0-b80f675c
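The two example controller WWNs can be split into their fields with a short Python sketch. It assumes the 12-bit change number sits directly above the 48-bit MAC address and reports the remaining top four bits separately as a leading nibble; that layout is an assumption made for illustration, and the function name is hypothetical.

```python
def split_controller_wwn(wwn_text):
    """Split a 64-bit WWN such as '200100a0-b80f675c' into its fields."""
    digits = wwn_text.replace("-", "").replace(":", "").lower()
    if len(digits) != 16:
        raise ValueError("a WWN is 64 bits, i.e. 16 hex digits")
    value = int(digits, 16)
    mac = value & 0xFFFFFFFFFFFF                     # low 48 bits
    return {
        "leading_nibble": value >> 60,               # top 4 bits
        "change_number": (value >> 48) & 0xFFF,      # next 12 bits
        "mac": ":".join(f"{(mac >> shift) & 0xFF:02x}"
                        for shift in range(40, -8, -8)),
    }


if __name__ == "__main__":
    for example in ("200100a0-b80f675c", "200200a0-b80f675c"):
        print(example, split_controller_wwn(example))
```

Both examples share the same 48-bit MAC field and differ only in the 12-bit field, which is consistent with two FC ports on the same controller pair.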


Multi-Path Proxy Driver MPPD


MP-RAS only
Windows systems use SMrdac software from LSI.

Multi-Path Proxy Driver - MPPD


The MPPD software provides the classic RDAC functions for the LSI Fibre Channel arrays. Mppd creates a virtual HBA and a virtual array, limiting visibility of the physical data paths to the MPPD driver so that the OS is unaware of them.

The virtual RAID controller has a target ID, which is the addressing mechanism for identifying the array.

[Figure: the virtual HBA (c700tf) sits above the physical HBA. Virtual RAID controller ID 0 (c700t0) represents controllers A and B of DAMC 2 and presents LUNs c700t0d0 to c700t0d5. Virtual RAID controller ID 1 (c700t1) represents controllers A and B of DAMC 3 and presents LUNs c700t1d0 to c700t1d4.]

Volume (LUN) Ownership


A pair of active controllers is located in each disk array and each LUN (or volume under SYMplicity) in the array is owned by one controller. The controller that owns the volume controls the I/O between the logical drive and the application host along the I/O path.

Use the Controller > View Associated Components selection to display current ownership.

How MPPD Does I/Os


Mppd round-robins all I/Os to a controller through all available paths marked as "up" to that controller. A system without FC switches uses an FC_AL loop configuration and has only 1 path from a node to each controller. A system with FC switches can have up to 4 paths from a node to each controller.

[Figure: two configurations. In the loop configuration, a node connects through GBICs to mini-hubs A1, A2, B1, and B2 of Controller A and Controller B in Disk Array 1. In the switched configuration, Nodes 1 to 4 connect through two switches to controllers A and B of Disk Arrays 1 to 4.]

Failover
If mppd has a problem with a path, it marks that path as "down". If all paths from a node to a controller are bad and marked as "down", mppd fails that controller and reassigns volume ownership (moves all the LUNs over) to the surviving controller. The mppd driver does not hold a failed controller in reset:
    SYMplicity 8 - the controller will be Offline.
    SYMplicity 7 - the controller will be Online/Passive.
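The round-robin and failover behaviour described for mppd can be modelled with a small Python sketch. It is a toy model of the described behaviour, not the driver's implementation; the class, function, and path names are hypothetical.

```python
from itertools import cycle


class ControllerPaths:
    """Paths from one node to one controller, each marked up or down."""

    def __init__(self, name, paths):
        self.name = name
        self.state = {path: "up" for path in paths}
        self._ring = cycle(list(paths))

    def up_paths(self):
        return [p for p, s in self.state.items() if s == "up"]

    def next_path(self):
        """Round-robin over the paths that are currently marked up."""
        if not self.up_paths():
            return None
        while True:
            path = next(self._ring)
            if self.state[path] == "up":
                return path


def mark_down(ctrl, path, owner_of, survivor):
    """Mark a path down; if none remain, move every LUN to the survivor."""
    ctrl.state[path] = "down"
    if not ctrl.up_paths():
        for lun, owner in owner_of.items():
            if owner == ctrl.name:
                owner_of[lun] = survivor
```

With two paths to controller A, losing one path only narrows the round-robin; losing the second moves every LUN owned by A to controller B.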


Node Utilities for MPPD and HBA


There are four node based MP-RAS utilities/scripts for the MPPD driver:
    mppUtil - General purpose driver utility and information.
    mppUpdate - Updates driver configuration file (space.c).
    mppProbe - Scans for new arrays.
    mppCheck - Path state change monitor.

In addition there is a utility for the LSI Host Bus Adapter (HBA):
    lsiUtil - General purpose HBA utility and information.

The node utilities do not have visibility to the drive side fibre paths in the array; use SYMplicity to debug drive side problems.

mppUtil
Use mppUtil -g [target_ID] to view the driver's internal information about an array.

# /opt/mpp/bin/mppUtil -g1          (c700t1 = DAMC101-3)
MPP Information:
----------------
      ModuleName: DAMC101-3                           SingleController: N
 VirtualTargetID: 0x001                                  ScanTriggered: N
     ObjectCount: 0x000                                     AVTEnabled: N
             WWN: 600a0b80000f675c000000004095f6eb          RestoreCfg: N
    ModuleHandle: none                                       Quiescent: N

Controller 'A' Status:
-----------------------
ControllerHandle: none                               ControllerPresent: Y
    UTMLunExists: Y (031)                                       Failed: N
   NumberOfPaths: 1                                     FailoverInProg: N
                                                              NotReady: N
                                                                  Busy: N
   Path #1
   ---------
 DirectoryVertex: none                                         Present: Y
                                                                Failed: N

Controller 'B' Status:
-----------------------
ControllerHandle: none                               ControllerPresent: Y
    UTMLunExists: Y (031)                                       Failed: N
   NumberOfPaths: 1                                     FailoverInProg: N
                                                              NotReady: N
                                                                  Busy: N
   Path #1
   ---------
 DirectoryVertex: none                                         Present: Y
                                                                Failed: N

Lun Information

mppUpdate
This utility updates the MPP driver configuration file /etc/conf/pack.d/mppd/space.c

mppUpdate [-v] [-c] [-d module_name]

Options
    -v                Set verbose output.
    -c                Clear all array devices from the file.
    -d module_name    Remove the specified array from the file.

Virtual array target IDs (t# in c700t_) are arbitrarily selected by the MPP driver and made persistent by the mppUpdate utility. This utility is automatically run on the next reboot following the installation of the MPPD package. This utility must be manually run whenever arrays are added or removed. New arrays are added to the end of the list and assigned the next available t#.

MPPD Configuration space.c file


If you want more control over target ID assignments you can edit the space.c file.
/* Each entry is an array name as defined by SYMplicity. */
struct mpp_persist SA_Persist[MAX_ARRAY_MODULES] = {
    "DAMC101-2",    /* ID 0 (c700t0) */
    "DAMC101-3",    /* ID 1 (c700t1) */
    "DAMC102-2",    /* ID 2 (c700t2) */
    "DAMC102-3",    /* ID 3 (c700t3) */
    "",             /* ID 4 (c700t4) */
    "",             /* ID 5 (c700t5) */
    "",             /* . */
    "",             /* . */
    "",             /* . */
    "",
    "",
    "",
    "",
    "",
};

A kernel rebuild and reboot is required whenever changes (manual or by mppUpdate) are made to the space.c file. The list and order of array names in the space.c file must be identical on all nodes within a clique!

Array Name
The Array Name is defined through the SYMplicity AMW on each array: Storage Array > Rename
(CLI - set storageArray userLabel=name)

The name is stored on the array and read by MPPD from the array. MPPD does not use the emwdata.bin file to determine array names. The name is cleared by a sysWipe.


MPPD and Array Names


Never rename an array without careful planning, because it will change the target ID of the array the next time the node(s) reboot. For example, using the space.c file from the previous slide:

Through SYMplicity you change DAMC102-2 to MDA102-2. The next time a node reboots, mppd will re-discover the arrays:
    - Virtual array c700t2 will be missing, since the MPP driver could not find an array with the name DAMC102-2.
    - A new virtual array will be added as c700t4, because the driver sees a new array called MDA102-2.

All AMP vprocs that owned pdisks on c700t2 will be fatal on all nodes that rebooted. The space.c file did not change, since mppUpdate was not run.

To correct the problem, rename the array to its original name and reboot all nodes that configured MDA102-2 as c700t4. If, however, you wish to keep the new name, you must correct the space.c file (mppUpdate) and rebuild the kernel.
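The rename scenario can be reproduced with a small Python model of the persistent name list. This is only a sketch of the behaviour described on the slide, assuming that an unknown array name takes the first empty slot; the function names are hypothetical and the real driver logic is not shown in this presentation.

```python
def attach_arrays(persist, discovered):
    """Return {array_name: target_id} for the arrays a node finds at boot.

    persist:    ordered names from space.c, "" marks an unused slot.
    discovered: array names as reported by the arrays themselves.
    """
    targets = {}
    free_slots = [i for i, name in enumerate(persist) if name == ""]
    for name in discovered:
        if name in persist:
            targets[name] = persist.index(name)    # persistent target ID
        elif free_slots:
            targets[name] = free_slots.pop(0)      # next available t#
    return targets


def missing_targets(persist, discovered):
    """Target IDs whose persisted array name was not found at boot."""
    return [i for i, name in enumerate(persist)
            if name and name not in discovered]


if __name__ == "__main__":
    space_c = ["DAMC101-2", "DAMC101-3", "DAMC102-2", "DAMC102-3", "", ""]
    after_rename = ["DAMC101-2", "DAMC101-3", "MDA102-2", "DAMC102-3"]
    print(attach_arrays(space_c, after_rename))
    print(missing_targets(space_c, after_rename))
```

In this model the renamed array lands on target ID 4 while target ID 2 is reported missing, matching the c700t4 and c700t2 outcome described above.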

mppProbe
This utility runs during node bootup as part of the /etc/rc1.d/S14rdacProbe startup script. This utility probes the physical device addresses for array devices. When it finds an array or data volumes that are not attached to the MPP driver, it will attach them. This utility also creates UTM nodes.
mppProbe [-a] [-u] [-k]

Options
    -a    Attach newly discovered arrays and volumes.
    -u    Create UTM nodes; existing UTMs are removed.
    -k    Keep existing UTM node entries.

UTM LUNs
For each controller there is a UTM LUN (LUN 31). UTM LUNs are used by the SMAgent software to talk to the controller across the fibre path for array management purposes. The UTM LUN device name follows the physical SCSI HBA name, example - /dev/utm/c220t0d1fs0.
    c100 = port 0 on the first HBA
    c101 = port 1 on the first HBA
    c220 = port 0 on the second HBA
    c221 = port 1 on the second HBA

The physical name is also seen in the output of many of the utilities to identify a physical port connection to the array.


mppCheck and Auto Failback


This utility checks whether any previously failed paths have been recovered and restores them to service. This function is referred to as Failback.

mppCheck runs as a cron job every 60 seconds. There are no options to the command.

Run mppCheck manually after repairing a failed path; this forces the MPP driver to bring the path back into service and allows you to redistribute volumes back onto their preferred path. Alternatively, wait 60 seconds to allow mppCheck to run from cron.

mppCheck and Auto-Failback


In a switched configuration with multiple paths, mppd will begin using the repaired paths as soon as they are marked back "up" by mppCheck without further interaction by the user. In a single path or loop configuration, the controller must be manually brought back online after the path is repaired and the volumes must be manually redistributed back to their preferred controller. This is also true in a switched configuration if ALL paths to a controller fail. The Automatic Volume Transfer (AVT) is Disabled for Teradata, therefore manual volume redistribution is always required.


lsiUtil
The LSI Fibre Channel adapter has a utility that will display various status and configuration information about the adapter.
/opt/lsiUtil/lsiUtil options [device]

Some more common options:
    -a         Show All Info
    -f         Show FW Info
    -l         List Attached Devices
    -m         Show Manufacturing Info
    -r         Reset FC Link Stats
    -s         Show FC Link Stats
    -u         Show IO Unit Info
    -R         Issue A Hard Chip Reset
    -P         Show ALPA Loop Position Map (summary of l)
    -V File    Show Version Of Firmware File
    device     Device to query (c100tfd0s0); if omitted, all LSI ports are queried

The lsiUtil utility is available only on MP-RAS.

lsiUtil Example
/opt/lsiUtil/lsiUtil -f -m -P -u -s c220
c220:
Running Firmware Version: 2.00.06                            <- FW Info (-f)
Flashed Firmware Version: LSIFC929-2.00.06 (2003.06.02)
BIOS Version: BLANK
Chip Name: LSIFC929
Chip Revision: B.0
Board Name: 7004G2-LC                                        <- Manufacturing Info (-m)
Board Assembly: 03-00010-01A
Board Tracer Number: 4337174302
Active ALPAs: 3                                              <- ALPA positions (-P)
0:0xef 1:0xe4 2:0xe8                                         <- Mapped paths (-u)
Mapped Paths: ONE
TimeSinceReset: 608807393314 Microseconds
Tx Frame Count: 0x7704
Rx Frame Counts: 0xdb8a
Tx Word Counts: 0xfd41a
Rx Word Counts: 0x3b5e80
LIP Count: 0x1
NOS Count: 0x0                                               <- FC Link Status (-s)
Error Frame Counts: 0x0
Dumped Frame Counts: 0x0
Link Failure Count: 0x0
Loss of Sync Count: 0x1a
Loss of Signal Count: 0x1a
Primative Seq Err Count: 0x0
Invalid Tx Word Count: 0x0
Invalid Crc Count: 0x0
Initiator IO Count: 0x5758
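When comparing link statistics between ports, it can help to convert the hexadecimal counters to decimal and list only those above zero. The Python sketch below does this for text in the "Name: 0x.." layout of the example above; the sample is copied from that example, the function names are hypothetical, and other lsiUtil versions may print a different layout.

```python
import re

# Counter lines copied from the lsiUtil example on this slide.
SAMPLE = """\
LIP Count: 0x1
NOS Count: 0x0
Error Frame Counts: 0x0
Dumped Frame Counts: 0x0
Link Failure Count: 0x0
Loss of Sync Count: 0x1a
Loss of Signal Count: 0x1a
Primative Seq Err Count: 0x0
Invalid Tx Word Count: 0x0
Invalid Crc Count: 0x0
"""

COUNTER = re.compile(r"^\s*(?P<name>[A-Za-z ]+?):\s*(?P<value>0x[0-9a-fA-F]+)\s*$")


def parse_counters(text):
    """Return {counter name: int} for every 'Name: 0x..' line."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER.match(line)
        if match:
            counters[match["name"]] = int(match["value"], 16)
    return counters


def nonzero(counters):
    """Keep only the counters that are above zero."""
    return {name: value for name, value in counters.items() if value}


if __name__ == "__main__":
    for name, value in nonzero(parse_counters(SAMPLE)).items():
        print(f"{name}: {value}")
```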

Commands and Interface to the Arrays


Interfacing the Arrays


The various methods to manage the LSI Fibre arrays:
    Node Based Utilities - Different for MP-RAS and Windows
    SYMplicity GUI - Same on MP-RAS and Windows
    SMcli command line - Same on MP-RAS and Windows
    Controller shell - Independent of the host OS
    wes5tools - MP-RAS only - Limited lifetime

SYMplicity 8.37 Storage Management


SYMplicity 8 Package Installation


[Table: packages installed on each system type. Rows: MPPD, SMruntime, SMutil, SMagent, SMclient, SMmonitor. Columns: SMP Node (MPRAS), MPP Node (MPRAS), AWS (MPRAS), MPP Node (Windows), AWS (Windows). One cell reads rdac in place of an X.]

* SMclient for W2K includes SYMplicity event monitor.


SYMplicity Storage Manager GUI


The SYMplicity GUI is run from the AWS; it has visibility to each array's drive side configuration for the entire system.

To run SYMplicity on a UNIX AWS: on the AWS, open an xterm window to the AWS, then enter SMclient.
    # SMclient
Hint: use DISPLAY=loopback:0, it is faster than DISPLAY=aws1:0 (suggestion from EMEA).

To run SYMplicity on a Windows AWS: Start > Programs > SYMplicity Storage Manager Client

Enterprise Management Window

Selections you may need to use:


Edit:
    Add Device
    Remove Device
    Alert Destinations
View:
    By Status
Tools:
    Automatic Discovery
    Rescan
    Update Monitor
    Manage Device
Help:
    Contents
    About

SYMplicity Direct and Host Managed


SYMplicity GUI chooses the best path (direct managed).

[Figure: management paths. The AWS runs the SM Client. Each node runs the SM Client, SM Agent, and MPPD. Host Agent Managed access goes from the AWS through the SM Agent and MPPD on a node to the array controllers. Directly Managed access goes from the AWS over the Private LAN to the controller firmware on each array controller.]

SYMplicity Configuration Files


When arrays are added or removed, an Auto Discovery or an Add Device / Remove Device function must be performed to update the SYMplicity configuration. If the configuration is not current:
    The Enterprise Management Window GUI will not display the correct configuration tree.
    SMcli commands will fail with "Could not retrieve profile from array" or "Unable to communicate with device".

The configuration is contained in the two data files, emwdata.bin and emwback.bin.
    For MP-RAS they are in the /var/opt/SM directory.
    For Windows they are in the \Program Files\SM8\client\data folder.

The AWS and ALL nodes must have matching files.

EMW Tips
Use Tools > Rescan on a host that has unresponsive arrays.

Tools > Update Monitor is available only when the Event Monitor is NOT synchronized with the management software.

Disconnect the customer LAN before running Auto Discovery; this prevents a lengthy scan and keeps CLAN node names from appearing in the tree. Alternatively, use Add Device to specify the arrays manually.

Array Management Window

Selections you may need to use (some more than others):


Storage Array:
    Locate - array or channels
    Configuration - save
    Download - FW or NVSRAM
    Change - host type, tray order, media scan
    Set Controller Clocks
    Run Read Link Status Diagnostics
View:
    Event Log
    Storage Array Profile
Controller:
    Place - online, offline
    Properties
Volume:
    Change - modification priority, media scan settings
    Properties
Drive:
    Locate
    Hot Spare - assign, unassign
    Fail
    Reconstruct
    Properties
Advanced:
    Download - ESM or Drive FW
    Capture State Information
    Reset Controller
Help:
    Contents
    About

SMdevices
SMdevices (UNIX only, part of the SMutil pkg) displays the association between mppd physical device (c700) and SYMplicity volume name.
# SMdevices
SYMplicity Storage Manager for NCR Devices, Version 08.37.53.01
Built Wed Mar 12 08:12:39 CST 2003
Copyright (C) LSI Logic Corp 2002. All rights reserved.
/dev/rdsk/c700t0d0s0 [Storage Array DAMC101-2, Volume 0, LUN 0, Volume WWN <600]
/dev/rdsk/c700t0d1s0 [Storage Array DAMC101-2, Volume 1, LUN 1, Volume WWN <600]
/dev/rdsk/c700t0d2s0 [Storage Array DAMC101-2, Volume 2, LUN 2, Volume WWN <600]
/dev/rdsk/c700t0d3s0 [Storage Array DAMC101-2, Volume 3, LUN 3, Volume WWN <600]
/dev/rdsk/c700t0d4s0 [Storage Array DAMC101-2, Volume 4, LUN 4, Volume WWN <600]
/dev/rdsk/c700t0d5s0 [Storage Array DAMC101-2, Volume 5, LUN 5, Volume WWN <600]
/dev/rdsk/c700t0d6s0 [Storage Array DAMC101-2, Volume 6, LUN 6, Volume WWN <600]
.
.
/dev/rdsk/c700t1d0s0 [Storage Array DAMC101-3, Volume 0, LUN 0, Volume WWN <600]
/dev/rdsk/c700t1d1s0 [Storage Array DAMC101-3, Volume 1, LUN 1, Volume WWN <600]
/dev/rdsk/c700t1d2s0 [Storage Array DAMC101-3, Volume 2, LUN 2, Volume WWN <600]
/dev/rdsk/c700t1d3s0 [Storage Array DAMC101-3, Volume 3, LUN 3, Volume WWN <600]
/dev/rdsk/c700t1d4s0 [Storage Array DAMC101-3, Volume 4, LUN 4, Volume WWN <600]
/dev/rdsk/c700t1d5s0 [Storage Array DAMC101-3, Volume 5, LUN 5, Volume WWN <600]
/dev/rdsk/c700t1d6s0 [Storage Array DAMC101-3, Volume 6, LUN 6, Volume WWN <600]
/dev/rdsk/c700t1d7s0 [Storage Array DAMC101-3, Volume 7, LUN 7, Volume WWN <600]
.
.
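To look up which array and volume sit behind a given mppd device, the SMdevices listing can be parsed with a short Python sketch. The line layout assumed here is "/dev/rdsk/cXtYdZsN [Storage Array NAME, Volume V, LUN L, ...", as in the listing on this slide; the function name is hypothetical, and a different SMdevices version may print a different layout.

```python
import re

LINE = re.compile(
    r"^(?P<device>/dev/rdsk/c\d+t(?P<target>\d+)d\d+s\d+)\s+"
    r"\[Storage Array (?P<array>[^,]+), Volume (?P<volume>[^,]+), "
    r"LUN (?P<lun>\d+),"
)


def parse_smdevices(text):
    """Return one dict per device line; banner and '.' lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            rows.append({
                "device": match["device"],
                "target": int(match["target"]),   # t# of the virtual array
                "array": match["array"],
                "volume": match["volume"],
                "lun": int(match["lun"]),
            })
    return rows


if __name__ == "__main__":
    sample = (
        "/dev/rdsk/c700t0d1s0 [Storage Array DAMC101-2, Volume 1, LUN 1, Volume WWN <600]\n"
        "/dev/rdsk/c700t1d7s0 [Storage Array DAMC101-3, Volume 7, LUN 7, Volume WWN <600]\n"
    )
    for row in parse_smdevices(sample):
        print(row)
```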

Differences Between SYMplicity 8.37 and 7.15


Differences Between SYMplicity 8 and SYMplicity 7


Array Management Window

Some minor changes were made to the GUI:
    Volume Mappings is now a selectable tab.
    Menu changes: some selections were added and some were moved.
        - View and Mappings are new.
        - The Advanced menu is the same as the Tools menu in SYM7.
    Premium features status bar at the bottom of the window (these features are not used for Teradata).

Array Management Window


[Figure: Array Management Window in SYMplicity 7 and in SYMplicity 8]

SYM 8.37 New Features


The following is a list of the main functional changes made for SYMplicity 8.37:
    Passive state is no longer valid for a controller. The valid controller states are Online (active) and Offline.
    Volumes cannot be redistributed to a controller if the path to the controller is down.
    Run Read Link Error Status (RLS) diagnostics and retrieve RLS counts (used to troubleshoot Fibre Channel problems).
    Some additional functions added to the SMcli command line:
        Retrieve Event Log (MEL)
        Redistribute volumes back to preferred paths
        Retrieve and set RLS error counts
        Add devices or perform automatic discovery of arrays

Controller States
There is NO Passive state with SYMplicity 8.xx. The valid states for a controller are Online (active) or Offline. The selections for changing the state of the controller are under the Controller menu.

SMcli -n DAMC101-3 -c set controller [a|b] availability=online|offline;


Status Tray at Bottom of Window


Located at the bottom of the Array Management Window is the status area for the Premium Features and Storage Partitioning. Teradata systems do NOT use any of these features so the status should be disabled or not used.

    Remote Volume Mirroring - Disabled
    Snapshot Volume - Disabled
    Storage Partitioning - Disabled & 0/0 Allowed/Used

Volume to LUN Mappings


Changing the default volume to LUN mappings can cause data unavailability and/or corruption. This feature is used only if the mappings are lost and need to be rebuilt.
Potential mapping corruption can be avoided if you power on the drive trays before the controller.


SYMplicity Command Line SMcli


SMcli
SMcli is a command line interface utility that provides access to the SYMplicity Script Engine commands. SMcli functions the same on MP-RAS and Windows. Commands are sent to the desired array by specifying either both controllers' IP addresses or the name of the array.
    SMcli <ctrl_A IP> <ctrl_B IP> -c <command>;
    SMcli -n <array_name> -c <command>;

The arrays, AWS and all nodes are connected to the PLAN. Therefore, SMcli can be executed from the AWS or from any node.


SMcli Bugs
There are some bugs with the JRE 1.2.2 (within the SMruntime package) that can cause the SMcli command to hang or fail with error code 12.
Bug 1 - SMcli command may fail if only one IP address is specified to an array.
Example: SMcli <ctrl_A IP> -c <command>;
Bug 2 - SMcli command may fail if access is directed through a host agent.
Example: SMcli <host_agent IP> -n <array_name> -c <command>;

To avoid these problems: Use both IP addresses.


SMcli <ctrl_A IP> <ctrl_B IP> -c <command>;

Or, specify just the name of the array (direct access)


SMcli -n <array_name> -c <command>;
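
The two safe addressing forms above can be enforced when SMcli calls are scripted. The following is a minimal sketch only: `build_smcli_args` is a hypothetical helper (not part of SYMplicity) that builds the argument list and rejects the single-IP form described under Bug 1. It assumes nothing about SMcli beyond the two command forms shown on this slide.

```python
from typing import List, Optional


def build_smcli_args(command: str,
                     array_name: Optional[str] = None,
                     ctrl_a_ip: Optional[str] = None,
                     ctrl_b_ip: Optional[str] = None) -> List[str]:
    """Build an SMcli argument list using one of the two recommended forms.

    Hypothetical helper: accepts either the array name alone or both
    controller IP addresses, and refuses the single-IP form (Bug 1).
    """
    # Script Engine commands on these slides all end with a semicolon.
    script = command.strip()
    if not script.endswith(";"):
        script += ";"

    if array_name and not ctrl_a_ip and not ctrl_b_ip:
        # Form 2: SMcli -n <array_name> -c <command>;
        return ["SMcli", "-n", array_name, "-c", script]
    if ctrl_a_ip and ctrl_b_ip and not array_name:
        # Form 1: SMcli <ctrl_A IP> <ctrl_B IP> -c <command>;
        return ["SMcli", ctrl_a_ip, ctrl_b_ip, "-c", script]
    raise ValueError("give the array name alone, or both controller IP addresses")
```

Passing the returned list to a process launcher without a shell also avoids having to escape the semicolon and brackets in the command.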

Display Array Configuration


Use SMcli -d to display the array configuration (emwdata.bin).
SMcli -d
Warning: JIT compiler "sunwjit" not found. Will use interpreter.
DAMC101-3 DAMC101-32 DAMC101-31
DAMC101-3 SMP001-4
DAMC101-3 SMP001-7
DAMC101-3 SMP001-6
DAMC101-3 SMP001-5
DAMC101-2 DAMC101-21 DAMC101-22
DAMC101-2 SMP001-4
DAMC101-2 SMP001-7
DAMC101-2 SMP001-6
DAMC101-2 SMP001-5
SMcli completed successfully.

The -d -i option will output the IP address instead of name.


SMcli -d -i
Warning: JIT compiler "sunwjit" not found. Will use interpreter.
DAMC101-3 10.5.101.32 10.5.101.31
DAMC101-3 10.5.1.14
DAMC101-3 10.5.1.17
DAMC101-3 10.5.1.16
. . .

Add Array or Auto Discovery


Use SMcli -A to add an array to the configuration. To add a new array to the Direct Network Attached path (PLAN) specify the IP address of both controllers:
SMcli -A 10.5.104.21 10.5.104.22

To add a new array to a Host Agent Attached path (FC) specify the node name:
SMcli -A SMP004-7

Omit the host name or IP address to auto-discover all arrays.


Display Array Profile


Use show storageArray profile to display the profile information, same as View > Storage Array Profile from GUI.
SMcli -n DAMC101-2 -c "show storageArray profile;"
Warning: JIT compiler "sunwjit" not found. Will use interpreter.
Performing syntax check...
Syntax check complete.
Executing script...
Storage array profile
PROFILE FOR STORAGE ARRAY: DAMC101-2
SUMMARY-----------------------------
   Number of controllers: 2
   Number of volume groups: 16
   Number of standard volumes: 16
   Number of drives: 32
   Access volume: LUN 31 (see Mappings section for details)
   Default host type: NCRMPRAS (Host type index 0)
   Firmware version: 05.37.03.00
   NVSRAM version: NV4884NCR833007
   NVSRAM configured for batteries?: No
   Start cache flushing at (in percentage): 80
   Stop cache flushing at (in percentage): 60
   Cache block size (in KB): 4
   Media scan duration (in days): 7
. . .

Rename an Array
Use Caution
Use set storageArray userLabel to change the name of the array, same as Storage Array > Rename from GUI.
SMcli -n DAMC102-2 -c "set storageArray userLabel=\"MDA102-2\";"
Note: If you change the array name using SMcli the emwdata.bin file will not be updated. You must also use SMcli -A to add the array to the configuration.
If you are using SMcli to repair a problem and rename the array back to its original name, use the IP addresses to bypass the configuration file.
SMcli 10.5.102.21 10.5.102.22 -c "set storageArray userLabel=\"DAMC102-2\";"
Note: the backslash and quotes are required - \"array_name\"


Display Health Status


Use show storageArray healthStatus to display array status. Same as the summary information in the Recovery GURU.
SMcli -n DAMC101-2 -c "show storageArray healthStatus;"
Warning: JIT compiler "sunwjit" not found. Will use interpreter.
Performing syntax check...
Syntax check complete.
Executing script...
The following failures have been found on the Storage Array:
Offline Controller
Volume Not On Preferred Path
Script execution complete.
SMcli completed successfully.

Normal status for an array should be:


Storage array health status = optimal.
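
When the health check is run from a script, the failure list can be pulled out of the captured text. This is a sketch under stated assumptions: `parse_health_status` is a hypothetical helper, and it assumes each failure is printed on its own line between the "The following failures have been found" header and the "Script execution complete." trailer, as in the example above.

```python
from typing import List


def parse_health_status(output: str) -> List[str]:
    """Return the failures named in healthStatus output.

    An empty list means no failure section was present (for example the
    optimal status line). The line layout is an assumption based on the
    single example shown on the slide.
    """
    failures: List[str] = []
    in_failures = False
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("The following failures have been found"):
            in_failures = True
        elif line.startswith("Script execution complete"):
            in_failures = False
        elif in_failures and line:
            failures.append(line)
    return failures


SAMPLE_FAILED = """Executing script...
The following failures have been found on the Storage Array:
Offline Controller
Volume Not On Preferred Path
Script execution complete.
SMcli completed successfully."""

SAMPLE_OPTIMAL = """Executing script...
Storage array health status = optimal.
Script execution complete.
SMcli completed successfully."""
```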

Display Controller States


Syntax:
show controller [(a|b)] mode
SMcli -n DAMC101-2 -c "show controller [a] mode;"
Warning: JIT compiler "sunwjit" not found. Will use interpreter.
Performing syntax check...
Syntax check complete.
Executing script...
Mode for controller "a" = active.
Script execution complete.
SMcli completed successfully.


Display Battery Age


Syntax:
show storageArray batteryAge
SMcli -n DAMC101-2 -c "show storageArray batteryAge;"
Warning: JIT compiler "sunwjit" not found. Will use interpreter.
Performing syntax check...
Syntax check complete.
Executing script...
Battery status: Optimal
Age: 42 day(s)
Days until replacement: 677 day(s)
Script execution complete.
SMcli completed successfully.


Display Event Log (MEL)


Syntax:
upload storageArray file=filename content=(allEvents | criticalEvents)
Example: Retrieve and save all critical events to file criticalMel.txt. The file will be created in your current working directory.
SMcli -n DAMC101-2 -c "upload storageArray file=\"criticalMel.txt\" content=criticalEvents;"
The double quotes are required around the file name, and the backslash is required to hide the double quotes from Unix.
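
The backslashes are only needed because a Unix shell would otherwise consume the inner double quotes. As an illustration, the sketch below builds the same MEL upload as an argument list, where no shell escaping is involved, and then uses the standard `shlex` module to render an equivalent shell-safe line. `mel_upload_script` is a hypothetical helper name; the command text itself is taken from the example above.

```python
import shlex


def mel_upload_script(filename: str, critical_only: bool = True) -> str:
    """Return the Script Engine command that saves the MEL to a file.

    The file name is wrapped in the double quotes the command requires.
    """
    content = "criticalEvents" if critical_only else "allEvents"
    return f'upload storageArray file="{filename}" content={content};'


# Argument-list form: the inner double quotes travel untouched, so no
# backslashes are needed when no shell is involved.
argv = ["SMcli", "-n", "DAMC101-2", "-c", mel_upload_script("criticalMel.txt")]

# Shell form: shlex.join adds whatever quoting a POSIX shell needs.
shell_line = shlex.join(argv)
```

Splitting `shell_line` again with `shlex.split` gives back the original argument list, which is a quick way to confirm that the quoting survives the shell.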

Clear Event Log:


set storageArray clearEventLog = True


Change Drive States


Syntax:
set drive[tray,slot] operationalState=(optimal | failed)
start drive[tray,slot] reconstruct

Fail drive 3 in tray 4 of DAMC101-2:
SMcli -n DAMC101-2 -c "set drive[4,3] operationalState=failed;"

*Bring online and begin reconstruction:
SMcli -n DAMC101-2 -c "start drive[4,3] reconstruct;"

*Important: The start drive reconstruct command will be available with AP 5.43.xx.xx (Sonoran 4.3) NS 6.1 release in 2H04. Do NOT use set drive operationalState=optimal because it will REVIVE the drive. Current release (AP 5.37) requires you to use the GUI to properly replace a drive and begin reconstruction (normally it should automatically start after drive replacement).

Change Controller States


Syntax:
set controller [(a|b)] availability=(online | offline)
reset controller [(a|b)]

Offline - controller A in DAMC101-2:
SMcli -n DAMC101-2 -c "set controller [a] availability=offline;"

Online - controller A in DAMC101-2:
SMcli -n DAMC101-2 -c "set controller [a] availability=online;"

Reset - controller A in DAMC101-2:
SMcli -n DAMC101-2 -c "reset controller [a];"


Redistribute Volumes
Syntax:
reset storageArray volumeDistribution

Example:
SMcli -n DAMC101-2 -c "reset storageArray volumeDistribution;"


wes5tools
(MP-RAS only)


wes5tools (limited lifetime)


Initially supplied as a stop-gap method for core support functionality missing in the first generation FC Arrays from LSI.
Relies on key Mode-Page command support in the controller which is being removed in stages.
Full support in      AP 4.02.xx.xx  (Mojave 2)     WES 5
Full support in      AP 5.37.xx.xx  (Sonoran 3.7)  NS 6.0
Partial support in   AP 5.43.xx.xx  (Sonoran 4.3)  NS 6.1 - July 04
No support in        AP 6.11.xx.xx  (Yuma) - 2005

GSC, Engineering and LSI have worked together to enhance the SMcli command set to encompass the wes5tools functionality. Thus, for the next release (NS 6.1), some wes5tools scripts will still work and others will not.


wes5tools Support For:

acf dcu ripb ripc ripm rshfa

The readme file for wes5tools is located under: /opt/wes5tools/wes5tools.README


Preview - wes5tools Alternatives


Supported on NS6.1 with SYM8.43 - 2H04
wes5tools            Sonoran4.3 (shorthand syntax)
acf i                SMcli -c "show allControllers summary;"
acf -N -v all        SMcli -c "show controller[] nvsram;"
acf -N -o v d        SMcli -c "show controller[] nvsram[];"
acf -N -o v <val>    SMcli -c "set controller[] nvsram[]=<value>;"
acf -d c d           SMcli -c "show allDrives summary;"
acf -d c add         SMcli -c "revive drive[];"
acf -d c fail        SMcli -c "set drive[] operationalState=failed;"
acf -d c replace     SMcli -c "start drive[] reconstruct;"
acf -d g assign      SMcli -c "set drive[] hotspare=true;"
acf -d g delete      SMcli -c "set drive[] hotspare=false;"
dcu ia               SMcli -c "show allDrives summary;"
ripb b               SMcli -c "reset storageArray volumeDistribution;"
ripc                 /opt/mpp/bin/mppUtil -S   Controller and path state info.
ripm -c|-p           /opt/mpp/bin/mppUtil -S
ripm d               SMcli -c "set controller[] availability=<on|off>line;"
ripm s               SMcli -c "show storageArray lunMappings;"
ripm v               SMcli -c "show storageArray lunMappings;"
ripm l               SMcli -c "set volume[] owner=<A|B>;"


Preview - New Commands


Supported with next release - 2H04
New SMcli commands for debug:
SMcli -c "revive drive[];"
SMcli -c "start drive[] reconstruct;"
SMcli -c "show (allControllers|controller[]) [summary];"
SMcli -c "show (allDrives|drive[]|drives[]) [summary];"
SMcli -c "show (allVolumes|volume[]|volumes[]) [summary];"
SMcli -c "show storageArray lunMappings;"

New SMcli commands for staging or system reconfigurations:
SMcli -c "set controller[] bootpEnabled=(true|false);"
SMcli -c "set controller[] rloginEnabled=(true|false);"
SMcli -c "set controller[] ipAddress=<ip_address>;"
SMcli -c "set controller[] subnetMask=<ip_address>;"
SMcli -c "set controller[] gatewayIPAddress=<ip_address>;"


Marginal PLAN Problems


A marginal PLAN may cause SYM7/SYM8, wes5tools, and other applications (PUT, ASF/REEL) to hang or fail. See Knowledge Article S11000DDD9A.
The following setup/configuration problems most frequently lead to marginally functioning PLANs in the field:
Mix of Half_duplex and Full_duplex operation on the nodes and AWS on the PLAN
Different transmission speeds on nodes and AWS (may be correctly handled by the hubs, however)
Over-/Under-termination on coax networks


Controller Shell


Controller Shell Access


Via the SLAN > CMIC > RS-232 connection:

AWS GUI - use the Connect function.
Telnet - From the AWS, telnet to the CMIC using the port # for the controller.
Example: telnet CMIC001-1 12021
Where:
DAMC 21 = 12021
DAMC 22 = 12022
DAMC 31 = 12031
DAMC 32 = 12032
This works only if Server Management is cabled properly. Use the portShow command from the CMIC to verify.
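
The port lookup above can be scripted. This is a minimal sketch: `cmic_telnet_command` is a hypothetical helper, and the table holds only the four controller-to-port pairs listed on this slide. Any other controller suffix is rejected rather than guessed.

```python
from typing import List

# Controller suffix -> CMIC telnet port, exactly as listed on the slide.
CMIC_PORTS = {"21": 12021, "22": 12022, "31": 12031, "32": 12032}


def cmic_telnet_command(cmic: str, controller: str) -> List[str]:
    """Return the telnet argument list for a controller's serial port.

    `controller` may be a full name such as "DAMC101-21" or the short
    form "DAMC 21"; only the last two characters are used for the lookup.
    """
    suffix = controller.strip()[-2:]
    if suffix not in CMIC_PORTS:
        raise KeyError(f"no CMIC port listed for controller suffix {suffix!r}")
    return ["telnet", cmic, str(CMIC_PORTS[suffix])]
```

For example, `cmic_telnet_command("CMIC001-1", "DAMC101-21")` yields the same `telnet CMIC001-1 12021` invocation shown in the example.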

Via the PLAN from the AWS or node:


rlogin [IP address or name of controller]
From an MP-RAS node using rshfa, example:
# rshfa DAMC101-21 fcDevs 4

Note: rlogin or rshfa will cause any previously open sessions via SLAN/RS-232 to close. Always exit shell by typing ~

rshfa Syntax Summary


rshfa [-v] [-p <passwd>] [-f <outFile>] <controller> <cmdFile>
rshfa [-v] [-p <passwd>] [-f <outFile>] <controller> <cmd> [<arg>]...
rshfa -h
OPTIONS
-v             Prints the logon and logoff handshake also.
-h             Prints this man page (refer to it for additional syntax descriptions).
-p <passwd>    Password to use to log into each controller. If no password is specified the default password will be used.
-f <outFile>   The output is saved to the file name specified.
<controller>   DAMC*            All controllers in the system
               DAMCxxx*         All controllers in a cabinet
               DAMCxxx-y*       Both controllers in an array
               DAMCxxx-yy       A single controller
               DAMC*1 (or *2)   All A (or B) controllers
<cmd>          The command can be a simple command with no arguments like "fcAll" or it can have multiple arguments like "fcDevs 4".
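
To preview which controllers a `<controller>` pattern would select before running a command against them, the patterns can be tried against a list of names. This sketch assumes that rshfa's `*` behaves like ordinary shell-style globbing, which the slide does not state explicitly. `select_controllers` and the controller list are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List


def select_controllers(pattern: str, controllers: List[str]) -> List[str]:
    """Return the controller names matched by a shell-style pattern."""
    return [name for name in controllers if fnmatchcase(name, pattern)]


# Example inventory: two arrays in cabinet 101, one array in cabinet 102.
CONTROLLERS = [
    "DAMC101-21", "DAMC101-22",
    "DAMC101-31", "DAMC101-32",
    "DAMC102-21", "DAMC102-22",
]
```

With this inventory, `DAMC101-2*` selects both controllers of one array, while `DAMC*1` selects the A controller of every array.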

rshfa Syntax (continued)


Output File -f <outFile>
If you specify an output file and also multiple controllers, e.g. DAMC* or DAMC101-*, a file for each controller will be generated and the file name will have the controller's name appended to it.
Restricted commands:
sysWipe
rdacMgrAltCtlReset
Future Support
Since rshfa does not require mode page support, it will continue to be packaged in wes5tools, which will be part of adpxspt for future releases.

Summary Command - arrayPrintSummary


The arrayPrintSummary command displays status of the controllers and volume ownership.

-> arrayPrintSummary
05/12/04-16:16:14 (GMT) (tShell): NOTE: Controllers synchronized. RDAC Mode is Dual-Active.
05/12/04-16:16:14 (GMT) (tShell): NOTE: Controller Mode Active. SCSI IDs (host 60, drive 7).
05/12/04-16:16:14 (GMT) (tShell): NOTE: 8 Volumes Owned = {0, 2, 4, 6, 8, 10, 12, 14}
05/12/04-16:16:14 (GMT) (tShell): NOTE: Alt. Ctrl. Mode Active (Present/Not Failed). SCSI IDs (host 60, drive 6).
05/12/04-16:16:14 (GMT) (tShell): NOTE: 8 Volumes Owned = {1, 3, 5, 7, 9, 11, 13, 15}

The example above executed from a node using rshfa: # rshfa DAMC101-21 arrayPrintSummary


Firmware Versions - moduleList


The moduleList command displays the firmware revisions for the controller.
-> moduleList
BootWare Package - Version 05.30.00.00 (Built 09/05/02 10:17:39)
RAID Controller Build Package - Version 05.37.03.00 (Built 05/22/03 11:28:31)
## SNAME COMPONENT NAME           VERSION     DATE     TIME
== ===== ======================== =========== ======== =======
 1 .BW   BootWare                 05.30.00.00 09/05/02 10:17:39
 2 .AP   RAID Controller Build    05.37.03.00 05/22/03 11:28:31
->

The fc 12 command will display the output of the moduleList and arrayPrintSummary together.


Overall Status Command - fcAll


The fcAll command displays the overall status information for the six Fibre Channel interfaces (chips) on the controller: 2 host side (Src) and 4 drive side (Dst).
-> fcAll
fcAll (Tick 0025561080) ==> 05/12/04-14:44:01 (GMT)
4884-A
                 Our   Num  ::...Exchange Counts...::   Num  ..Link Up..
                 Port  Port ::                      ::  Link  Bad   Bad
Chip   LinkStat  ID    Logi ::Open   Total    Errors::  Down  Char  Frame
0-Dst  Up-Loop     1    20  ::   1   928630        8::     2     0     0
1-Dst  Up-Loop     1    20  ::   1   916945        8::     2     0     0
2-Dst  Up-Loop     1    20  ::   0   832819        8::     2     0     0
3-Dst  Up-Loop     1    20  ::   0   837783        8::     2     0     0
4-Src  Up-Loop     1     3  ::   0    17402        7::     5     0     4
5-Src  Up-Loop    e8     3  ::   0    17081        7::     5     0     4
value = 2 = 0x2
->


fcAll Output Description

4884-A: Controller A or B.
Chip: Dst = channels to drive trays; Src = channels to nodes.
LinkStat: Loop = private loop; Fab = FC switch connect; NoHub = no mini-hub installed.
Our Port ID: AL_PA.
Num Port Logi: number of other devices on this channel.
Exchange Counts: an exchange is typically one SCSI command.
Exchange Errors: number of abnormally terminated exchanges.
Num Link Down: number of times the link had to be initialized (always a minimum of 1).
Link Up Bad Char / Bad Frame: errors while the link was up. If a port's count is 100s, or multiples, more than the other ports, there is a problem.

fc 90 will clear all accumulated counts.
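
The "100s more than the other ports" rule for the error counters can be applied mechanically to captured fcAll output. The sketch below is illustrative only: `parse_fcall` and `suspect_ports` are hypothetical helpers, the row layout is assumed from the one example shown for fcAll, and the cutoff of 100 above the median of the other ports on the same side (Dst or Src) is one interpretation of the rule, not a documented threshold.

```python
import re
from statistics import median
from typing import Dict, List

ROW = re.compile(
    r"^\s*(\d+)-(Dst|Src)\s+(\S+)\s+([0-9a-fA-F]+)\s+(\d+)\s*::"
    r"\s*(\d+)\s+(\d+)\s+(\d+)\s*::\s*(\d+)\s+(\d+)\s+(\d+)\s*$"
)


def parse_fcall(output: str) -> List[Dict]:
    """Parse the per-chip rows of fcAll output into dictionaries."""
    rows = []
    for line in output.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, banner, and prompt lines are skipped
        rows.append({
            "chip": int(m.group(1)), "side": m.group(2),
            "link_stat": m.group(3), "port_id": m.group(4),
            "logins": int(m.group(5)), "open": int(m.group(6)),
            "total": int(m.group(7)), "errors": int(m.group(8)),
            "link_down": int(m.group(9)), "bad_char": int(m.group(10)),
            "bad_frame": int(m.group(11)),
        })
    return rows


def suspect_ports(rows: List[Dict], field: str = "bad_frame",
                  threshold: int = 100) -> List[int]:
    """Return chips whose counter exceeds the median of their peers
    (same side) by at least `threshold`. The threshold is an assumption."""
    suspects = []
    for row in rows:
        peers = [r[field] for r in rows
                 if r["side"] == row["side"] and r is not row]
        if peers and row[field] - median(peers) >= threshold:
            suspects.append(row["chip"])
    return suspects


SAMPLE = """fcAll (Tick 0025561080) ==> 05/12/04-14:44:01 (GMT)
4884-A
Chip   LinkStat  ID    Logi ::Open   Total    Errors::  Down  Char  Frame
0-Dst  Up-Loop     1    20  ::   1   928630        8::     2     0     0
1-Dst  Up-Loop     1    20  ::   1   916945        8::     2     0     0
2-Dst  Up-Loop     1    20  ::   0   832819        8::     2     0     0
3-Dst  Up-Loop     1    20  ::   0   837783        8::     2     0     0
4-Src  Up-Loop     1     3  ::   0    17402        7::     5     0     4
5-Src  Up-Loop    e8     3  ::   0    17081        7::     5     0     4
value = 2 = 0x2
"""
```

Comparing each port only with ports on the same side keeps host-side and drive-side traffic levels from skewing each other. Counters should be cleared with fc 90 first if old errors would otherwise dominate.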

High Level Destination Driver Debug Menu - fcDevs


The fcDevs command provides debug information gathered by the high level destination driver (HDD).
-> fcDevs
fcDevs <view>, <devNum (0=all)>
 1 = All views by view type (active)
 2 = Inquiry view
 3 = Names view
 4 = Path view
 5 = Common names view
 6 = Buf view
 7 = Detail view (All luns)
 8 = Detail view (Active luns only)
 9 = All views by lun device
10 = All views by view type (detailed)
11 = Rls view
12 = Devices with any errors
13 = Devices with Rw errors returned to VDD
14 = Devices with major errors
->

fcDevs 4 - Path View


-> fcDevs 4
Path View
DevNum          Tray Slot  Cur Alpa  Path Channels
Disk 00000001   1,1        d9        Cur:0 Alt:3
Disk 00000002   1,2        d6        Cur:3 Alt:0
Disk 00000003   1,3        d5        Cur:0 Alt:3
Disk 00000004   1,4        d4        Cur:3 Alt:0
. . .

The Current and Alternate paths for the drives must alternate; if they do not, there is a path failure.
Current = active port
Encl = ESM
Lmir = local mirror, used for cache mirroring with alternate controller
This = This controller

fcDevs 5 - Common Names View


-> fcDevs 5
Common Names View
DevNum          Tray Slot  Cur Alpa  Inquiry Name         Common Name
Disk 00000001   1,1        d9        SEAGATE ST336753FC   Unknown
Disk 00000002   1,2        d6        SEAGATE ST336753FC   Unknown
Disk 00000003   1,3        d5        SEAGATE ST336753FC   Unknown
Disk 00000004   1,4        d4        SEAGATE ST336753FC   Unknown
Disk 00000005   1,5        d3        SEAGATE ST336752FC   Cheetah-X15-2 36GB
Disk 00000006   1,6        d2        SEAGATE ST336752FC   Cheetah-X15-2 36GB
Disk 00000007   1,7        d1        SEAGATE ST336752FC   Cheetah-X15-2 36GB
Disk 00000008   1,8        ce        SEAGATE ST336752FC   Cheetah-X15-2 36GB
Disk 0000000f   2,1        cd        SEAGATE ST336753FC   Unknown
Disk 00100000   2,2        cc        SEAGATE ST336753FC   Unknown
Disk 00100001   2,3        cb        SEAGATE ST336752FC   Cheetah-X15-2 36GB
. . .
Disk 00300000   4,6        ab        SEAGATE ST336752FC   Cheetah-X15-2 36GB
Disk 00300001   4,7        aa        SEAGATE ST336752FC   Cheetah-X15-2 36GB
Disk 00300002   4,8        a9        SEAGATE ST336752FC   Cheetah-X15-2 36GB

Use fcDevs 2 to display the drive firmware version.


Display Network Configuration - netCfgShow

-> netCfgShow
==== NETWORK CONFIGURATION: ALL INTERFACES ====
Network Init Flags   : 0x01
Network Mgmt Timeout : 30
Startup Script       :
Shell Password       :
==== NETWORK CONFIGURATION: dse0 ====
Interface Name       : dse0
My MAC Address       : 00:a0:b8:0f:69:42
My Host Name         : DAMC101-21
My IP Address        : 10.5.101.21
Server Host Name     : host
Server IP Address    : 10.5.1.1
Gateway IP Address   : 0.0.0.0
Subnet Mask          : 255.255.0.0
User Name            : guest
User Password        :
NFS Root Path        :
NFS Group ID Number  : 0
NFS User ID Number   : 0

Change Network Configuration - netCfgSet

-> netCfgSet
'.' = clear field; '-' = to previous field; '+' = next interface; ^D = quit (keep changes)
==== NETWORK CONFIGURATION: ALL INTERFACES ====
Network Init Flags   : 0x01 0
Network Mgmt Timeout : 30

Press enter to keep current value and continue to next field.
Reboot the controller to apply changes.
The names under the network settings are NOT the same as the array name under SYMplicity (not associated with emwdata.bin).

Network settings are maintained after a sysWipe.



Other Controller Shell Commands


Reboot the controller: sysReboot
Wipe out array configuration, DANGER!!: sysWipe
Find commands using key words or string: lkup string
Others will also be covered in the Troubleshooting information and examples.


Fibre Channel Fault Reporting


General Debugging
Windows AWS Logs
  Fault Viewer
  AWS Console Application ICONs

UNIX AWS Logs
  CSF Alerts
  XCON ICONs

UNIX Host Logs
  /etc/.osm
  /var/adm/streams

Windows Host Logs
  Event Viewer - Application
  Event Viewer - System

Physical Inspection
  LSI HBA LEDs
  Mini-Hub LEDs
  ESM LEDs
  Controller LEDs
  Drive LEDs

SYMplicity GUI
  SMclient Recovery Guru and ICONs
  Storage Array Profile
  Capture State Information (perform only on a quiesced array)
  Major Event Log (MEL) (stored on the array in DACstore)
  Run Link Status (RLS)

Controller Internal Logs
  FA Log (Serial port command d 0xc1250,0x2b0,1)

Forwarding of Faults
The LSI Fibre Channel arrays report additional events that are not received by the UMB, and thus are not reported to the AWS through the Service Subsystem. For these types of events to be reported to the AWS / CSF, special configuration is required depending upon the type of AWS:
Windows AWS - The NCR MEL package is installed on the AWS.
UNIX AWS - Forwarding of SNMP traps is enabled and the traps are sent to the AWS.


Forwarding of Faults - Windows AWS


NCR MEL is a Windows software package that installs a service named "NCR MEL" on the AWS. The service monitors the LSI storage arrays' Major Event Logs (MEL) for critical events. The faults can be viewed via the AWS Fault Viewer (Source NCR MEL).
After NCR MEL installation, the NCR MEL service automatically starts and begins monitoring the Major Event Logs of the LSI storage arrays. The storage arrays that are being monitored are listed in a file named "monitored_arrays.txt", located at "\Program Files\NCR\NCR MEL".


Forwarding of Faults - UNIX AWS


Critical events from the SYMplicity Event Log (MEL) are forwarded to the UNIX AWS via the SNMP - Alert Destinations function.

When configured a green check mark will appear next to the AWS in the EMW

Click Validate to send a test fault to the AWS Fault Window


Misleading Component Names


Drive event descriptions reported to the AWS through the Service Subsystem do NOT contain valid component IDs that can be used to easily locate the drive.
Always use the SYMplicity GUI or MEL to determine a [Tray, drive] FRU location.
Example, Drive 6 on Drive Tray 2 failed:
(Screenshots: AWS Tree View; AWS Fault from Service Subsystem to AWS; Fault from NCR MEL; SYMplicity GUI)

Host Side Path Failures


Clique Configurations
In an FC_AL loop configuration, a node will have only 1 path to each controller. A system with FC Switches can have up to 4 paths from a node to each controller.
(Diagram: FC_AL Configuration - Nodes cabled through GBICs to the mini-hubs A1, A2, B1, B2 of Controller A and Controller B on Disk Array 1. Switched Configuration - Node 1 through Node 4 connected through two Switches to controllers A and B of Disk Array 1 through Disk Array 4.)

Host-Side Path Failures


A path failure from a node to a controller in a loop (or ALL paths in a switch configuration) will cause mppd to fail that controller and reassign volume ownership to the surviving controller.
Indications of a host side path failure are:
SYMplicity - "Volumes not on preferred path" message.
HBA and/or Mini-hub LED (Bypass / Loop fail LEDs)
/etc/.osm and /var/adm/streams - mpp_FailPath and mpp_FailController messages on the node on the failed path.
Use the mppUtil -gX, ripm -p and *mppUtil -S commands from each node in the clique to determine the failed path.
* NS 6.1 release (AP 5.43)

Bring Controller Back Online to Troubleshoot


Because the controller is placed offline it will cause ALL minihub ports on the controller to indicate a fault.
In addition, the Controller fault LED on the front of the array will be on and the Ethernet port LED will be off.

Place the controller back Online which should cause the bypass LED of the failed port to come on. The only condition that should be reported by the Recovery GURU or a Health Status should be Volume not on Preferred Path. With SYMplicity 8 you will not be able to redistribute volumes back to the preferred controller until the path is fixed.


Host Side FC_AL Path Failures (cont.)


Visually observe LEDs for problems with the FC loop.

2Gb Link Speed LED should be illuminated.
Every Mini-hub with a cable connected should have its Bypass LED off.
Every Mini-hub should have its Loop Good LED illuminated.
The LSI HBA LEDs should be green if a cable is connected.


Host Side FC_AL Path Failures (cont.)


/etc/.osm messages from node on failed path
WARNING: MPP_Sysdep.c: [RAIDarray.mpp].mpp_AnalyseIoError: from DAMC101-2:0:0:0 Selection Retry count exhausted
xcmn_err: Message Date 05/17 - Time 17:01(mm/dd hh:mm)
WARNING: MPP_Sysdep.c: [RAIDarray.mpp].mpp_FailPath: DAMC101-2:0:0 Failed
xcmn_err: Message Date 05/17 - Time 17:01(mm/dd hh:mm)
WARNING: MPP_Sysdep.c: [RAIDarray.mpp].mpp_FailController: DAMC101-2:0 Failed
xcmn_err: Message Date 05/17 - Time 17:01(mm/dd hh:mm)
WARNING: MPP_Sysdep.c: [RAIDarray.mpp].mpp_AnalyseSyncError: Selection Retry count exhausted
xcmn_err: Message Date 05/17 - Time 17:02(mm/dd hh:mm)
WARNING: MPP_Sysdep.c: [RAIDarray.mpp].mpp_AnalyseSyncError: Selection Retry count exhausted


Host Side Path Failures (cont.)


mppUtil from node on failed path (FC_AL loop example):
/opt/mpp/bin/mppUtil -g0
MPP Information:
----------------
      ModuleName: DAMC101-2                          SingleController: N
 VirtualTargetID: 0x000                                 ScanTriggered: N
     ObjectCount: 0x000                                    AVTEnabled: N
             WWN: 600a0b80000f6942000000004095f24a         RestoreCfg: N
    ModuleHandle: none                                      Quiescent: N

Controller 'A' Status:
-----------------------
ControllerHandle: none                              ControllerPresent: Y
    UTMLunExists: Y (031)                                      Failed: Y
   NumberOfPaths: 1                                    FailoverInProg: N
                                                             NotReady: N
                                                                 Busy: N
   Path #1
   ---------
 DirectoryVertex: none                                        Present: Y
                                                               Failed: Y

Controller 'B' Status:
-----------------------
ControllerHandle: none                              ControllerPresent: Y
    UTMLunExists: Y (031)                                      Failed: N
   NumberOfPaths: 1                                    FailoverInProg: N
                                                             NotReady: N
                                                                 Busy: N

NumberOfPaths: With switches this will be 2 or 4, and each path will display its own Present & Failed status.
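
When this output is collected from many nodes, the controller-level Failed flag can be extracted programmatically instead of read by eye. This is a sketch under stated assumptions: `controller_failed_flags` is a hypothetical helper, and it assumes that the first `Failed:` value after each `Controller 'X' Status:` header is the controller-level flag, with any later `Failed:` values in that section belonging to individual paths.

```python
import re
from typing import Dict


def controller_failed_flags(output: str) -> Dict[str, bool]:
    """Map controller letter -> True if mppUtil reports it as Failed."""
    flags: Dict[str, bool] = {}
    current = None
    for line in output.splitlines():
        header = re.search(r"Controller '([AB])' Status", line)
        if header:
            current = header.group(1)
            continue
        flag = re.search(r"\bFailed:\s*([YN])", line)
        # Only the first Failed: after a header is the controller flag;
        # later ones in the same section are per-path flags.
        if flag and current is not None and current not in flags:
            flags[current] = flag.group(1) == "Y"
    return flags


SAMPLE = """Controller 'A' Status:
-----------------------
ControllerHandle: none          ControllerPresent: Y
    UTMLunExists: Y (031)                  Failed: Y
   NumberOfPaths: 1                FailoverInProg: N
   Path #1
   ---------
 DirectoryVertex: none                    Present: Y
                                           Failed: Y
Controller 'B' Status:
-----------------------
ControllerHandle: none          ControllerPresent: Y
    UTMLunExists: Y (031)                  Failed: N
   NumberOfPaths: 1                FailoverInProg: N
"""
```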

Host Side Path Failures (cont.)


ripm -p from node on failed path: (FC_AL loop example)
ripm -p
          Controller A              Controller B
c700t0    passive-OOS c100t0d0s0    active c101t0d0s0
          down                      up
c700t1    active c220t0d0s0         active c221t0d0s0
          up                        up

Device name c100t0: PCI Bus #, Controller #, Port # on HBA.
LSI HBA in PCI Slot 1 LEDs: Green Yellow; Down: Yellow Yellow.

Other Tips for Host Path Troubleshooting


Use rallsh (or psh) with mppUtil and grep for Failed messages:
rallsh -sv /opt/mpp/bin/mppUtil -g0 | grep Failed     (repeat -g1,2,3 for each array)
Getting hosts for network: byn0
Using hosts: byn001-4 byn001-5 byn001-6 byn001-7
==== /usr/bin/rsh byn001-4 /opt/mpp/bin/mppUtil -g0 |grep Failed ====
UTMLunExists: Y (031)    Failed: N     Controller A
                         Failed: N
UTMLunExists: Y (031)    Failed: N     Controller B
                         Failed: N
==== /usr/bin/rsh byn001-5 /opt/mpp/bin/mppUtil -g0 |grep Failed ====
UTMLunExists: Y (031)    Failed: N
                         Failed: N
UTMLunExists: Y (031)    Failed: N
                         Failed: N
==== /usr/bin/rsh byn001-6 /opt/mpp/bin/mppUtil -g0 |grep Failed ====
UTMLunExists: Y (031)    Failed: N
                         Failed: N
UTMLunExists: Y (031)    Failed: N
                         Failed: Y     Path failed to Controller B from SMP001-6
==== /usr/bin/rsh byn001-7 /opt/mpp/bin/mppUtil -g0 |grep Failed ====
UTMLunExists: Y (031)    Failed: N
                         Failed: N
UTMLunExists: Y (031)    Failed: N
                         Failed: N

Additional Details for Host Side Path Failures in Switch Configurations



Host-Side FC-Switch Path Failures


Failure on Array Side of Switch
(Diagram: Disk Array 1 through Disk Array 4, each with controllers A and B, cabled through two Switches to Node 1 through Node 4; the failed link, marked X, is on the array side of a switch.)

2 to 4 physical paths from every host to each controller.
All single path and most multiple path failures will not result in a controller failover.
mppd uses the 4 paths in a round-robin fashion.
mppd will mark a bad path as down.
mppCheck will mark a fixed path as up.
ripm -p shows mppd's perspective of path states.

# ripm -p (lsiUtil -S)
c700t0   active   c100t2d0s0  c101t2d0s0  c220t2d0s0  c221t2d0s0
                  down        up          down        up
         active   c100t0d0s0  c101t0d0s0  c220t0d0s0  c221t0d0s0
                  up          up          up          up
(same down indication to array reported on each node)
2 paths to ctrl A are still up to all Nodes.

FC Switch shell command show port:
Port    Admin   Operational  Login        Config  Running  Link      Link
Number  State   State        Status       Type    Type     State     Speed
------  ------  -----------  -----------  ------  -------  --------  -----
0       Online  Online       LoggedIn     G       F        Active    2Gb/s
1       Online  Offline      NotLoggedIn  G       Unknown  InActive  Auto
. .
8       Online  Online       LoggedIn     G       F        Active    2Gb/s
9       Online  Online       LoggedIn     G       F        Active    2Gb/s
. .

Host-Side FC-Switch Path Failures


Failure on Host Side of Switch
(Diagram: Disk Array 1 through Disk Array 4, each with controllers A and B, cabled through two Switches to Node 1 through Node 4; the failed link is on the host side of a switch.)

# ripm -p (lsiUtil -S)
c700t0   active   c100t2d0s0  c101t2d0s0  c220t2d0s0  c221t2d0s0
                  down        up          up          up
         active   c100t0d0s0  c101t0d0s0  c220t0d0s0  c221t0d0s0
                  down        up          up          up
c700t1   active   c100t0d0s0  c101t0d0s0  c220t0d0s0  c221t0d0s0
                  down        up          up          up
         active   c100t4d0s0  c101t4d0s0  c220t4d0s0  c221t4d0s0
                  down        up          up          up
(same indication for c700t2 and t3 on this node only)
3 paths still up on Node 1 to all arrays.

FC Switch shell command show port:
Port    Admin   Operational  Login        Config  Running  Link      Link
Number  State   State        Status       Type    Type     State     Speed
------  ------  -----------  -----------  ------  -------  --------  -----
0       Online  Offline      NotLoggedIn  G       Unknown  InActive  Auto
1       Online  Online       LoggedIn     G       F        Active    2Gb/s
. .
8       Online  Online       LoggedIn     G       F        Active    2Gb/s
9       Online  Online       LoggedIn     G       F        Active    2Gb/s
. .

Host-Side FC-Switch Path Failures


LEDs on Mini-Hubs, switches, and HBAs can assist in tracking down FC loop problems.
It is possible to isolate faulty components while Teradata is up and I/Os are active by moving an FC cable to a free port.
Use Caution:
Within the same switch or mini-hub, the host Low-Level Driver will maintain the same physical address.
Within the same HBA, between HBAs, or between switches, the physical address of the path(s) will change, but mppProbe (manually executed) will correlate the path(s) to the same virtual address. (NOTE: Not recommended since this will add another path for mppd to manage and report on, which can only be removed by rebooting the node.)
Do not move a cable from one mini-hub to another. This will cause an arbitrated-loop configuration that the switches are not set up to handle.
To maintain consistency and redundancy, after the faulty component is replaced, FC cables should be returned to their original connections.

(Diagram: Disk Array 1 through Disk Array 4, each with controllers A and B, cabled through two Switches to Node 1 through Node 4.)

Recovering from Downed Paths During Node Reboot


Configuring a Path into MPPD


If a path is down while a node reboots, the MPP driver will not configure that path.
ripm -p c700t0 c700t1 active missing active c101t0d0s0 N/A up active c220t0d0s0 active c221t0d0s0 up up

/opt/mpp/bin/mppUtil -g0 |pg
MPP Information:
----------------
      ModuleName: DAMC101-2                          SingleController: Y
 VirtualTargetID: 0x000                                 ScanTriggered: N
     ObjectCount: 0x000                                    AVTEnabled: N
             WWN: 600a0b80000f69420000000040b1b251         RestoreCfg: N
    ModuleHandle: none                                      Quiescent: N

(Controller A information missing)

Controller 'B' Status:
-----------------------
ControllerHandle: none                              ControllerPresent: Y
    UTMLunExists: Y (031)                                      Failed: N
   NumberOfPaths: 1                                    FailoverInProg: N
                                                             NotReady: N
                                                                 Busy: N
   Path #1
   ---------
 DirectoryVertex: none                                        Present: Y
                                                               Failed: N
Lun Information

Configuring a Path into MPPD


The steps to reconfigure a recovered path and allow MPPD to begin using it are as follows:

1. Run mppProbe -auk on all nodes connected to the unconfigured path. UTM LUNs will be created.
2. Run ripm -p (mppUtil -S) and verify the path is up on all nodes.
3. Run mppCheck to force mppd to use the restored paths.
4. Redistribute volumes back to preferred controllers.


Drive Side Path Failures


Drive-side FC-AL Path Failures

Visually observe LEDs for problems with the FC loop.


- Drive-module 2Gb Link Speed LED should be illuminated.
- Every mini-hub should have its Loop Good LED illuminated.
- Every mini-hub and ESM SFP with a cable connected (items E & J below) should have its Bypass LED off.

ESM LEDs:
E  Port Bypass
F  ESM Power
G  ESM Fault
H  ESM Overtemp
J  Port Bypass
N  2Gb Link Speed


Drive-side FC-AL Path Failures (cont.)


The indication of a drive side path failure is a Loss of Drive Tray Redundancy message.
Check for failed loops reported by each controller using either the SMcli command or the controller's shell command.

SMcli -c show allDrives summary;

DRIVE CHANNELS:
TRAY, SLOT   PREFERRED CHANNEL   REDUNDANT CHANNEL
1, 1         1                   4
1, 2         4                   1
1, 3         1                   4
1, 4         4                   1
1, 5         1                   4
1, 6         4                   1
1, 7         1                   4
1, 8         4                   1
1, 9         1                   4

rshfa DAMCxxx-xx fcDevs 4

Path View
DevNum          Tray,Slot   Cur Alpa   Path Channels
Disk 00000001   2,1         cd         Pref:0 Cur:0 Alt:2
Disk 00000002   2,2         cc         Pref:2 Cur:2 Alt:0
Disk 00000003   2,3         cb         Pref:0 Cur:0 Alt:2
Disk 00000004   2,4         ca         Pref:2 Cur:2 Alt:0
Disk 00000005   2,5         c9         Pref:0 Cur:0 Alt:2
Disk 00000006   2,6         c7         Pref:2 Cur:2 Alt:0
Disk 00000007   2,7         c6         Pref:0 Cur:0 Alt:2
Disk 00000008   2,8         c5         Pref:2 Cur:2 Alt:0

There should be only two loops listed for each drive. None of the paths should be listed as failed.

NOTE: Typically the preferred loops alternate between drives in the tray, starting with the lowest loop, but this assignment is only set up on a per-drive basis when that controller accesses a particular drive.
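When the fcDevs Path View is long, a short script can help spot drives that are not running on their preferred channel. The sketch below assumes each drive appears on one line in the form `Disk <DevNum> <tray>,<slot> <alpa> Pref:n Cur:n Alt:n`; that line layout is a reading of the slide, and the function name is hypothetical. Treating "current differs from preferred" as worth a closer look is also an assumption, not a documented rule.

```python
import re

# One drive per line, e.g. "Disk 00000001 2,1 cd Pref:0 Cur:0 Alt:2".
ROW = re.compile(
    r"Disk\s+(\w+)\s+(\d+),(\d+)\s+(\w+)\s+Pref:(\d+)\s+Cur:(\d+)\s+Alt:(\d+)"
)


def drives_off_preferred(path_view):
    """Return (tray, slot, preferred, current) for drives where Cur != Pref."""
    flagged = []
    for match in ROW.finditer(path_view):
        _dev, tray, slot, _alpa, pref, cur, _alt = match.groups()
        if cur != pref:
            flagged.append((int(tray), int(slot), int(pref), int(cur)))
    return flagged


if __name__ == "__main__":
    # Illustrative input only; the second line is invented to show a mismatch.
    sample = """
    Disk 00000001 2,1 cd Pref:0 Cur:0 Alt:2
    Disk 00000002 2,2 cc Pref:2 Cur:0 Alt:0
    """
    for tray, slot, pref, cur in drives_off_preferred(sample):
        print(f"tray {tray} slot {slot}: preferred {pref}, current {cur}")
```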


Debugging Drive Side Problems Using RLS Data


(Read Link Status)


Debug Using RLS Counts


Gather RLS counts using the SYMplicity GUI.

RLS using SMcli:
set storageArray RLSBaseline = currentTime
upload storageArray file=rls.txt content=RLSCounts

Debug Using RLS Counts (continued)


Set baseline and run RLS diagnostics. Click run to update display.
Link error counters:
ITW   Invalid Transmission Word
LF    Link Failure
LOS   Loss of Synchronization
LOSG  Loss of Signal
PSP   Primitive Sequence Protocol Error
ICRC  Invalid CRC

Order of severity: LF (#1), LOS (#2), ITW (#3).

The display lists devices in the order of data flow on the loop.


Array drive-side FC-AL debugging


(continued) Analyze RLS Counts

- Look for an order-of-magnitude step or spike in error counts.
- Identify the first device (in Loop Map Order) that detects a high number of Link Errors. Link Error Severity Order: LF > LOS > ITW
- Get the location of the first device.
- Get the location of its upstream device.
- Use the Possible Candidate Table to find the possible candidates for the bad component.

Using the example on the previous page, the faulty component could be:
- ESM in tray 2
- Transmitter of drive 2,5
- Receiver of drive 2,6
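The analysis steps above can be sketched in a few lines. This is an illustration under stated assumptions: the input is a list of devices already in Loop Map Order, a "spike" is taken to mean at least ten times the typical (median) count for that counter, and the counters are checked in the severity order LF, LOS, ITW. The function name, the threshold, and the sample counts are invented for the example.

```python
SEVERITY = ("LF", "LOS", "ITW")  # highest severity first


def first_suspect(loop, factor=10):
    """Find the first device in loop order whose link errors spike.

    loop is a list of (location, counts) pairs in Loop Map Order, where
    counts maps a counter name such as "ITW" to an integer.
    Returns (counter, device_a, device_b) or None; device_b is the
    upstream neighbour of device_a on the loop.
    """
    for counter in SEVERITY:
        values = sorted(counts.get(counter, 0) for _, counts in loop)
        typical = values[len(values) // 2]        # rough median
        threshold = factor * max(typical, 1)
        for index, (location, counts) in enumerate(loop):
            if counts.get(counter, 0) >= threshold:
                # Index -1 wraps to the last device, which is upstream
                # of the first device on a loop.
                return counter, location, loop[index - 1][0]
    return None


if __name__ == "__main__":
    # Invented counts, shaped like the slide's example around drives 2,5 and 2,6.
    loop = [
        ("drive 2,4", {"ITW": 3}),
        ("drive 2,5", {"ITW": 2}),
        ("drive 2,6", {"ITW": 5000}),
        ("drive 2,7", {"ITW": 4}),
    ]
    print(first_suspect(loop))
```

The returned pair (Device A, Device B) is what the Possible Candidate Table takes as input.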

Array drive-side FC-AL debugging


(continued)

Possible Candidate Table


Device A is the first device that detects the link error. Device B is the first upstream device of A.

Device A location: Tray X
Device B location: Tray X
Possible candidates: ESM of Tray X, Device B (drive), Device A (drive)*

Device A location: Tray X
Device B location: Tray Y
Possible candidates: Cable or SFPs between Tray X and Y, ESM of Tray Y, Device B (drive), ESM of Tray X*, Device A (drive)*

Device A location: Controller Module
Device B location: Tray X
Possible candidates: Any cable or SFP in the channel, Minihub, ESM of Tray X, Device B (drive), Device A (controller)*, Controller Chassis*

Device A location: Tray X
Device B location: Controller Module
Possible candidates: Cable or SFPs between Tray X and Controller Module, Minihub, Device B (controller), ESM of Tray X*, Device A (drive)*, Controller Chassis*

Device A location: Controller Module
Device B location: Controller Module
Possible candidates: MiniHub, Device B (controller), Device A (controller)*, Controller Chassis*

* Component that is less likely to be a candidate


Array drive-side FC-AL debugging


(continued)

[Figure: drive-side FC loops. Drive modules (drives, ESMs A and B with IN/OUT SFPs) are cabled as Loops 1-4 to the mini-hubs of the controller module; controllers A and B are shown with channels CH0-CH5.]

Identify the suspect components on the faulty segment of the FC loop, then either replace one at a time or fan out to further isolate the faulty component.

Debugging Host Side Problems Using RLS Data Gathered From lsiUtil



Host-Side FC-AL Debugging


Manually reset and gather the FC loop statistics from each device on the loop to identify which segment of the FC loop is faulty.
[Figure: FC loop connecting Node 1 and Node 3 (statistics gathered with lsiUtil) and the controller (statistics gathered with fcAll).]

- Use fcAll on the controller and lsiUtil -a on the nodes.
- The device that has the high error counts is the receiving device on the loop.
- The faulty segment of the FC loop is from the transmitting device to the receiving device.
- The order of the devices on the loop must be determined from the lsiUtil -a command.

Host-Side FC-AL Debugging (cont.)


Gather RLS counts from the HBA
/opt/lsiUtil/lsiUtil -r cXXX   (resets the HBA FC Link statistics)
/opt/lsiUtil/lsiUtil -a cXXX   (displays the HBA FC Link statistics)

Link Statistics
TimeSinceReset: 150238850260 Microseconds
Tx Frame Count: 0x25d52604
Rx Frame Counts: 0x1644e08ce
Tx Word Counts: 0x199e6d2961
Rx Word Counts: 0x6b42b254f6
LIP Count: 0x9b
NOS Count: 0x0
Error Frame Counts: 0x40
Dumped Frame Counts: 0x44
Link Failure Count: 0x1
Loss of Sync Count: 0xaba
Loss of Signal Count: 0xaba
Primative Seq Err Count: 0x0
Invalid Tx Word Count: 0x3cb9fef
Invalid Crc Count: 0x1
Initiator IO Count: 0x13b0b19c
Active ALPAs: 3   0:0xef   1:0xe8   2:0x01        (order of the devices on the bus)
Port Identifier: 0x1 TARGET                       (TARGET is the Disk Array)
Bus/TargetId: 0/0
WWNN / WWPN: 0x200200a0b80cf4cd / 0x200200a0b80cf4ce
Port Identifier: 0xe8 INITIATOR                   (INITIATOR is the other Node)
WWNN / WWPN: 0x200000062b062884 / 0x100000062b062884
Port Identifier: 0xef                             (blank is this Node)
WWNN / WWPN: 0x200000062b062154 / 0x100000062b062154
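The counters in the lsiUtil output are hexadecimal, which makes them awkward to compare between nodes by eye. The sketch below converts them to integers and picks out the link error counters. It assumes one `Name: 0x...` pair per line, as in the listing above; the function names are hypothetical, and the spelling "Primative" is kept because that is how the tool prints it on the slide.

```python
import re

# Matches lines such as "Loss of Sync Count: 0xaba"; only names ending in
# "Count" or "Counts" are treated as counters, so port identifiers are skipped.
COUNTER = re.compile(r"^\s*(.+?\bCounts?):\s*(0x[0-9a-fA-F]+)\s*$")

LINK_ERROR_COUNTERS = (
    "Link Failure Count",
    "Loss of Sync Count",
    "Loss of Signal Count",
    "Primative Seq Err Count",
    "Invalid Tx Word Count",
    "Invalid Crc Count",
)


def parse_link_stats(text):
    """Return {counter name: integer value} for every counter line."""
    stats = {}
    for line in text.splitlines():
        match = COUNTER.match(line)
        if match:
            stats[match.group(1).strip()] = int(match.group(2), 16)
    return stats


def nonzero_link_errors(stats):
    """Return only the link error counters that are above zero."""
    return {name: stats[name] for name in LINK_ERROR_COUNTERS
            if stats.get(name, 0) > 0}


if __name__ == "__main__":
    sample = """
    LIP Count: 0x9b
    Link Failure Count: 0x1
    Loss of Sync Count: 0xaba
    Primative Seq Err Count: 0x0
    Invalid Tx Word Count: 0x3cb9fef
    Port Identifier: 0xef
    """
    for name, value in nonzero_link_errors(parse_link_stats(sample)).items():
        print(f"{name}: {value}")
```

Running the same parser on the output from each node on the loop gives comparable numbers for finding the receiving device with the high counts.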


Host-Side FC-AL Debugging (cont.)


/opt/wes5tools/rshfa DAMCxxx-yz fcAll 90
Resets the FC Link Statistics for the I/F chips on the Disk Array Controller.

/opt/wes5tools/rshfa DAMCxxx-yz fcAll
Shows the FC Link Statistics for the I/F chips on the Disk Array Controller.

5884-B            Our   Num   ::...Exchange Counts...::   Num     ..Link Up..
Chip   LinkStat   Port  Port  ::                      ::   Link    Bad    Bad
                  ID    Logi  ::Open   Total    Errors::   Down    Char   Frame
0-Dst  Up-Loop    2     31    ::   0   751530       26::   5       0      0
1-Dst  Up-Loop    2     31    ::   0   176940       14::   3       0      0
2-Dst  Up-Loop    2     31    ::   0   264905       16::   3       0      0
3-Dst  Up-Loop    2     31    ::   0   273344       15::   3       0      0
4-Src  Up-Loop    e4    3     ::   0   19282     10077::   1       0      1
5-Src  Up-Loop    e2    3     ::   0   1146391   24924::   284313  0      21873

Chips 0-Dst through 3-Dst are the drive side. Chips 4-Src and 5-Src are the host side.


Host-Side FC-AL Debugging (cont.)

[Figure: Controllers A and B (TachyonTL interface chips), each with Fibre Channel Host Loops 1 and 2, connected through Mini-Hubs A1, A2, B1, and B2 (GBIC IN / GBIC OUT) to the QLA2204 HBAs in Nodes 1-4 (SMP).]

Identify the suspect components on the faulty segment of the FC loop, then either replace one at a time or fan out to further isolate the faulty component.
- If the problem is between the controller and the node, the suspect components are: 1 SFP, 1 mini-hub, 1 cable, 1 controller, and 1 HBA.
- If the problem is between the two nodes on the loop, the suspect components are: 2 SFPs, 1 mini-hub, 2 cables, or 2 HBAs.

RLS (new NS6.1)


With the NS 6.1 release, RLS counts will be automatically captured and stored on the AWS.

Windows AWS: \Program Files\SM8\client\data\monitor\RLS.<arrayWWN>.csv
UNIX AWS: /var/opt/SM/monitor/RLS.<arrayWWN>.csv


Parity Checking and Predictive Failures


Array Parity Checking


The LSI Fibre Channel controllers run a background parity check that is integrated into the Media Scan. With FC arrays, apcheckall is not run as a cron job on the nodes. The apc and apr (array parity check / repair) utilities are still used; they are located in the /opt/SM8/util directory.


DSDE vs. ADEPT


DSDE
- Tallies based on MP-RAS streams error logs
- Tallies are calculated daily when the cron script runs on the AWS
- AWS cron script runs a remsh job on each node to get the report
- Script on the AWS aggregates errors based on node reports
- Runs only on MPP systems with an MP-RAS AWS
- No drive serial number reference
- Drive replacement run utility
- Requires MP-RAS AWS

ADEPT
- Tallies based on MP-RAS streams error log and Windows System Log
- Tallies are calculated in real time as errors occur
- Adept is always running on each node, monitoring logs
- Adept runs on the AWS and consolidates Adept events sent from the nodes
- Runs on SMP and MPP systems with MP-RAS or Windows nodes
- Requires Windows AWS
- Provides drive serial number
- Restart Adept: W2K = Adept Service; UNIX = /etc/init.d/adept [start | stop]


ADEPT Reports from UNIX Nodes


Use the showData command to view the ADEPT error counts on a UNIX Node.
/opt/adept/bin/showData

tray/slot sort
cur tot cur tot cur tot cur tot
sk1 sk1 sk2 sk2 sk3 sk3 sk4 sk4   c  t  d FRU tray slot serial#          fnd
=== === === === === === === === === == == === ==== ==== ================ ===
  0   0   0   0   0   0   0   0 700  0  0  11    1    1 3HX0X9LG00007342 yes
  0   0   0   0   0   0   0   0 700  1  0  2d    1    1 3HX06T3200007329 yes
  0   0   0   1   0   0   0   0 700  0  0  12    1    2 3HX0XKJT00007347 yes
  0   0   0   0   0   0   0   0 700  1  0  2e    1    2 3HX0788Q00007329 yes
  0   0   0   0   0   0   0   0 700  0  0  13    1    3 3HX0XKN400007347 yes
  4  15   0  26   9   0   0   0 700  1  0  2f    1    3 3ET0E71R00007240 yes
  0   0   0   0   0   0   0   0 700  0  0  14    1    4 3HX0XKVV00007347 yes
  0   0   0   0   0   0   0   0 700  1  0  30    1    4 3ET0E7P100007240 yes
  0   0   0   0   0   0   0   0 700  0  0  15    1    5 3ET0Q51H00007309 yes
  0   0   0   0   0   0   0   0 700  1  0  31    1    5 3ET0RRXH00007309 yes
  0   0   0   0   0   0   0   0 700  0  0  16    1    6 3ET0MBTC00007301 yes
  0   1   0   0   0   0   0   0 700  1  0  32    1    6 3ET0NBM400007302 yes
  0   0   0   0   0   0   0   0 700  0  0  17    1    7 3ET0NWCR00007304 yes
  0   0   0   0   0   0   0   0 700  1  0  33    1    7 3ET0NCRG00007302 yes
  0   0   0   0   0   0   0   2 700  0  0  18    1    8 3ET0XTZZ00007317 yes
  0   0   0   0   0   0   0   0 700  1  0  34    1    8 3ET0JV6600007251 yes
. .

The suspect drive is in tray 1, slot 3 of the DAMC that has the t1 assignment in this node's clique.

ALWAYS use tray/slot to identify the drive. NEVER use FRU.
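A short filter can pull the drives with non-zero tallies out of a long showData listing and report them by tray/slot, as the slide requires. The sketch below assumes the whitespace-separated column order shown above (eight tally columns, then c, t, d, FRU, tray, slot, serial#, fnd); the function name and the ranking by summed tallies are assumptions made for the example.

```python
def suspect_drives(report):
    """Return drives with any non-zero tally, highest summed tally first.

    Each result carries tray, slot, and serial number. The FRU column is
    deliberately ignored, because FRU must not be used to locate a drive.
    """
    suspects = []
    for line in report.splitlines():
        fields = line.split()
        # Data rows start with eight integer tallies; headers do not.
        if len(fields) < 15 or not all(f.isdigit() for f in fields[:8]):
            continue
        tallies = [int(f) for f in fields[:8]]
        if any(tallies):
            suspects.append({
                "tray": int(fields[12]),
                "slot": int(fields[13]),
                "serial": fields[14],
                "target": int(fields[9]),   # the "t" column (c700t<n>)
                "tallies": tallies,
            })
    suspects.sort(key=lambda drive: sum(drive["tallies"]), reverse=True)
    return suspects


if __name__ == "__main__":
    sample = """
    cur tot cur tot cur tot cur tot
    sk1 sk1 sk2 sk2 sk3 sk3 sk4 sk4   c  t  d FRU tray slot serial# fnd
      0   0   0   0   0   0   0   0 700  0  0  13    1    3 3HX0XKN400007347 yes
      4  15   0  26   9   0   0   0 700  1  0  2f    1    3 3ET0E71R00007240 yes
      0   1   0   0   0   0   0   0 700  1  0  32    1    6 3ET0NBM400007302 yes
    """
    for drive in suspect_drives(sample):
        print(f"tray {drive['tray']} slot {drive['slot']} "
              f"(t{drive['target']}) serial {drive['serial']}")
```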


ADEPT Reports from Windows AWS


C:\Program Files\NCR\ADEPT>viewbin adept.bin

Disk tallies on AWS from adept.bin file (this example is from a Windows system)

                                   cur  tot  cur  tot  cur  tot  cur  tot
Volume Name       Serial Number    SK1  SK1  SK2  SK2  SK3  SK3  SK4  SK4
----------------  ---------------- ---  ---  ---  ---  ---  ---  ---  ---
Scsi0:9:0:0:20    3ET0QRZG000073     0    0    0    0    0    0    0    5
Scsi0:8:0:0:13    3ET0E71R000072     0    0    5    5    0    0    0    0
Scsi0:9:0:0:2d    3HX0X9LG000073     0    0    0    0   13   13    0    0

Note: a similar report for a Windows node can be generated by executing ADEPTReport.bat from the node. A file ADEPTReport.txt is created in the ADEPT directory on that node.

1. Run C:\Program Files\NCR\ADEPT>adeptreport
2. Open AdeptSummary.txt file

=============================================================================
SUMMARY ADEPT REPORT FOR WINDOWS 2000 NODES          DATE:12/19/03 11:27:42
=============================================================================
NODE              LATEST ERROR       DEVICE          FRU  SK1  SK2  SK3  SK4
SMP_HODGES.1.1.4  12/17/03 18:45:08  Scsi0:9:0:0:20  20     0    0    0    5
SMP_HODGES.1.1.6  12/15/03 08:23:17  Scsi0:9:0:0:13  13     0    5    0    0
SMP_HODGES.1.1.4  12/14/03 11:42:12  Scsi0:9:0:0:2d  2d     0    0   13    0


Fibre Channel FRU Numbers


The FRU information on Fibre Channel arrays is a dynamically assigned channel and loop ID, which does not convert to the Tray / Slot identifier needed to locate an FC drive. Therefore:

Look at a fault message with the AWS Fault Viewer. The fault identifies:
- Node reporting the drive error
- Device ID of the drive:
  - Windows - SCSI 0:7:1:4:2d
  - UNIX - c700t1d4s0
- Tray and Slot location of the drive

Search the Array Profile to match the serial number to its Tray/Slot.


Working with Windows Device Names


Use Winobj to find which array a port # is connected to.

Bad drive is Tray1, Slot1 in DAMC101-3

Scsi0:9:0:0:2d P9P0I0
DAMC101-3


DSDE with Fibre Arrays


The tray and slot can be found at byte offsets 26 and 27 in the supplemental status section. The section starts with |04|70 (70 being byte 0). In the example streams data below, these bytes are 8209, or tray 2, drive slot 9. Byte 26 identifies the drive tray type (10 or 14 slot) and the tray number; the MSB being set ("80") identifies a 14-slot tray, as in this example. The device name can be determined from the 8F81010 value. Use the nodes command to translate it into a c700tx value. From the c700tx value you will be able to identify the array (space.c persistence data). The example below (8F81010) indicates c700t1d10s0.
232921 13:53:15 3f626530 ... -32758 0 100000001|SDISK |1|M|D|O|1|0|U|7|0|cs_get_error|0|0|0|0#01|03000202200000324C534920202020204 C53495F5669727475616C5F68626120312E303000|00|0F|02|000000000000000000000000| 8F81010|02|2A000205645A000070000000|03|00000332330040324C5349202020202056495 25455414C202020202020202020303533370000000000000000000000000000000000000000| 04|70000100000000980000000095011900000000000200D0060000820919820900000058000 0010000002A000205645A0000700007000031543233373638323738202020202020053701000 0100000004D53313453543333363735324643202020202020000001190000000000000000000 0000000000000000000000000000000000000000000000008B14DFE3033313630342F3134343 2343000000000000000|06|7CB1DD|53407AC9|20C|0|
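The byte-offset lookup described for the supplemental status section can be checked with a few lines of code. This is a sketch under stated assumptions: the input is the hex text that follows `|04|`, the low seven bits of byte 26 are taken as the tray number (consistent with 82 meaning tray 2), a clear MSB is taken to mean a 10-slot tray, and byte 27 is read directly as the slot number. The function name is hypothetical.

```python
def decode_tray_slot(supplemental_hex):
    """Decode (tray, slot, slots_per_tray) from the supplemental status hex.

    supplemental_hex is the hex text following "|04|" in the streams
    record; embedded whitespace from line wrapping is ignored.
    """
    data = bytes.fromhex("".join(supplemental_hex.split()))
    if len(data) < 28 or data[0] != 0x70:
        raise ValueError("expected at least 28 bytes starting with 70")
    tray_byte, slot = data[26], data[27]
    # MSB set ("80") marks a 14-slot tray; assumed 10-slot otherwise.
    slots_per_tray = 14 if tray_byte & 0x80 else 10
    tray = tray_byte & 0x7F  # assumption: remaining bits hold the tray number
    return tray, slot, slots_per_tray


if __name__ == "__main__":
    # First 28 bytes of the supplemental status section from the example record.
    example = "70000100000000980000000095011900000000000200D00600008209"
    tray, slot, size = decode_tray_slot(example)
    print(f"tray {tray}, slot {slot}, {size}-slot tray")
```

For the example record this yields tray 2, slot 9 in a 14-slot tray, matching the 8209 reading given in the text.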


References


LSI Fibre Array Reference Material


MPPE External Storage Website
http://www2.sandiegoca.ncr.com/sfpm/symbios/index.htm

Technical Publications
http://infocentral.daytonoh.ncr.com/tsd-library/

Technical Training - 6841 and Large Cliques


http://www.ncru.ncr.com/ncru/ Course #25224
