Clustered Data ONTAP 8.2
High-Availability Configuration Guide
NetApp, Inc.
495 East Java Drive
Sunnyvale, CA 94089
U.S.
Contents
Understanding HA pairs .............................................................................. 6
What an HA pair is ...................................................................................................... 6
How HA pairs support nondisruptive operations and fault tolerance ......................... 6
Where to find procedures for nondisruptive operations with HA pairs .......... 7
How the HA pair improves fault tolerance ..................................................... 8
Connections and components of an HA pair ............................................................. 11
How HA pairs relate to the cluster ............................................................................ 12
If you have a two-node switchless cluster ..................................................... 14
Understanding HA pairs
HA pairs provide hardware redundancy that is required for nondisruptive operations and fault
tolerance and give each node in the pair the software functionality to take over its partner's storage
and subsequently give back the storage.
What an HA pair is
An HA pair is two storage systems (nodes) whose controllers are connected to each other directly. In
this configuration, one node can take over its partner's storage to provide continued data service if the
partner goes down.
You can configure the HA pair so that each node in the pair shares access to a common set of
storage, subnets, and tape drives, or each node can own its own distinct set of storage.
The controllers are connected to each other through an HA interconnect. This allows one node to
serve data that resides on the disks of its failed partner node. Each node continually monitors its
partner, mirroring the data for each other's nonvolatile memory (NVRAM or NVMEM). The
interconnect is internal and requires no external cabling if both controllers are in the same chassis.
Takeover is the process in which a node takes over the storage of its partner. Giveback is the process
in which that storage is returned to the partner. Both processes can be initiated manually or
configured for automatic initiation.
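For example, a manual takeover and giveback cycle uses the storage failover commands that are described later in this guide (node names are placeholders):
cluster::> storage failover takeover -ofnode node2
cluster::> storage failover show
cluster::> storage failover giveback -ofnode node2
The storage failover show command lets you confirm that the partner is in the Waiting for giveback state before you return its storage.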
Fault tolerance
When one node fails or becomes impaired and a takeover occurs, the partner node continues to
serve the failed node's data.
Nondisruptive software upgrades or hardware maintenance
During hardware maintenance or upgrades, when you halt one node and a takeover occurs
(automatically, unless you specify otherwise), the partner node continues to serve data for the
halted node while you upgrade or perform maintenance on the node you halted.
During nondisruptive upgrades of Data ONTAP, you manually enter the storage failover
takeover command so that one node takes over its partner and the software upgrade can proceed
on the partner. The takeover node continues to serve data for both nodes during this operation.
For more information about nondisruptive software upgrades, see the Clustered Data ONTAP
Upgrade and Revert/Downgrade Guide.
Nondisruptive aggregate ownership relocation can be performed without a takeover and giveback.
The HA pair provides nondisruptive operation and fault tolerance through the following aspects of its
configuration:
The controllers in the HA pair are connected to each other either through an HA interconnect
consisting of adapters and cables, or, in systems with two controllers in the same chassis, through
an internal interconnect.
The nodes use the interconnect to perform the following tasks:
Continually check whether the other node is functioning
Mirror log data for each other's NVRAM or NVMEM
They use two or more disk shelf loops, or storage arrays, in which the following conditions apply:
They own their spare disks, spare array LUNs, or both, and do not share them with the other
node.
They each have mailbox disks or array LUNs on the root volume that perform the following
tasks:
For more information about disk ownership, see the Clustered Data ONTAP Physical Storage
Management Guide.
Related concepts
See the...
The following list shows which hardware components are a single point of failure in a stand-alone
system and whether they remain one in an HA pair (stand-alone / HA pair):
Controller: Yes / No
NVRAM: Yes / No
CPU fan: Yes / No
Multiple NICs with interface groups: Maybe, if all NICs fail / No
FC-AL or SAS cable (controller-to-shelf, shelf-to-shelf): No, if dual-path cabling is used / No
Disk shelf module: No, if dual-path cabling is used / No
Disk drive: No / No
Power supply: Maybe, if both power supplies fail / No
Fan (disk shelves or controller): Maybe, if both fans fail / No
HA interconnect adapter: Not applicable / No
HA interconnect cable: Not applicable / No
[Figure: a standard HA pair. Node1 and Node2 are connected by the HA interconnect, and each node has primary, redundant primary, standby, and redundant standby connections to Node1's storage and Node2's storage.]
This diagram shows a standard HA pair with native disk shelves and multipath HA.
This diagram shows DS4243 disk shelves. For more information about cabling SAS disk
shelves, see the Universal SAS and ACP Cabling Guide on the NetApp Support Site.
Non-HA (or stand-alone) nodes are not supported in a cluster containing two or more nodes.
Although single node clusters are supported, joining two separate single node clusters to create one
cluster is not supported, unless you wipe clean one of the single node clusters and join it to the other
to create a two-node cluster that consists of an HA pair. For information on single node clusters, see
the Clustered Data ONTAP System Administration Guide for Cluster Administrators.
The following diagram shows two HA pairs. The multipath HA storage connections between the
nodes and their storage are shown for each HA pair. For simplicity, only the primary connections to
the data and cluster networks are shown.
[Figure: two HA pairs in a cluster. Node1 and Node2 form one HA pair and Node3 and Node4 form another; each pair has its own HA interconnect and multipath HA connections to its own storage, and all four nodes connect to the data network and the cluster network.]
If Node1 and Node2 both fail, the storage owned by Node1 and Node2 becomes unavailable to the
data network. Although Node3 and Node4 are clustered with Node1 and Node2, they do not have
direct connections to Node1 and Node2's storage and cannot take over their storage.
Single disk
failure
No
No
Yes
Yes
Double disk
failure (2
disks fail in
same RAID
group)
Maybe. If root volume has double disk failure, or if the mailbox disks are affected, no failover is
possible.
No, unless you are using RAID-DP, then yes.
No, unless you are using RAID-DP, then yes.
Triple disk
failure (3
disks fail in
same RAID
group)
Yes
Maybe. If root volume has triple disk failure, no failover is possible.
No
No
Single HBA
(initiator)
failure, Loop
A
Maybe. If
multipath HA is
in use, then no;
otherwise, yes.
Maybe. If root
volume has
double disk
failure, no
failover is
possible.
Yes, if multipath HA
is being used.
Yes, if multipath HA
is being used, or if
failover succeeds.
Single HBA
(initiator)
failure, Loop
B
No
Yes, if multipath HA
is being used, or if
failover succeeds.
Single HBA
initiator
failure (both
loops at the
same time)
Yes, unless multipath HA is in use, then no takeover needed.
Maybe. If multipath HA is being used and the mailbox disks are not affected, then no; otherwise, yes.
No, unless multipath HA is in use, then yes.
No failover needed if multipath HA is in use.
AT-FCX
failure (Loop
A)
Only if multidisk
volume failure or
open loop
condition occurs,
and multipath
HA is not in use.
Maybe. If root
volume has
double disk
failure, no
failover is
possible.
No
Yes, if failover
succeeds.
AT-FCX
failure (Loop
B)
No
Maybe. If
multipath HA is
in use, then no;
otherwise, yes.
Yes, if multipath HA
is in use.
Yes
IOM failure
(Loop A)
Only if multidisk
volume failure or
open loop
condition occurs,
and multipath
HA is not in use.
Maybe. If root
volume has
double disk
failure, no
failover is
possible.
No
Yes, if failover
succeeds.
IOM failure
(Loop B)
No
Maybe. If
multipath HA is
in use, then no;
otherwise, yes.
Yes, if multipath HA
is in use.
Yes
Shelf
(backplane)
failure
Only if multidisk
volume failure or
open loop
condition occurs.
Maybe. If root
volume has
double disk
failure or if the
mailboxes are
affected, no
failover is
possible.
No
No
Shelf, single
power failure
No
No
Yes
Yes
Shelf, dual
power failure
Only if multidisk
volume failure or
open loop
condition occurs.
Maybe. If root volume has double disk failure, or if the mailbox disks are affected, no failover is
possible.
Maybe. If data is mirrored, then yes; otherwise, no.
No
Controller,
single power
failure
No
No
Yes
Yes
Controller,
dual power
failure
Yes
No
Yes, if failover
succeeds.
HA
interconnect
failure (1
port)
No
No
Not applicable
Yes
HA
interconnect
failure (both
ports)
No
Yes
Not applicable
Yes
Tape
interface
failure
No
No
Yes
Yes
Heat exceeds
permissible
amount
Yes
No
No
No
Fan failures
(disk shelves
or controller)
No
No
Yes
Yes
Reboot
Yes
No
No
Yes, if failover
occurs.
Panic
Yes
No
No
Yes, if failover
occurs.
2. If the takeover is user-initiated, the target node gracefully shuts down, followed by takeover of
the target node's root aggregate and any aggregates which were not relocated in step 1.
3. Data LIFs migrate from the target node to the node doing the takeover, or any other node in the
cluster based on LIF failover rules, before the storage takeover begins.
The LIF migration can be avoided by using the -skip-lif-migration parameter with the
storage failover takeover command (see the example after this list).
For details on LIF configuration and operation, see the Clustered Data ONTAP File Access and
Protocols Management Guide.
5. SMB 3.0 sessions established to shares with the Continuous Availability property set can
reconnect to the disconnected shares after a takeover event.
If your site uses SMB 3.0 connections to Microsoft Hyper-V and the Continuous
Availability property is set on the associated shares, takeover will be nondisruptive for those
sessions.
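As a sketch of the takeover option mentioned in step 3, a takeover that skips the LIF migration might be issued as follows (the parameter spelling follows this guide; verify the exact syntax with the storage failover takeover man page):
cluster::> storage failover takeover -ofnode node2 -skip-lif-migration true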
Related information
If a background disk firmware update is occurring on a disk on either node, manually initiated
takeover operations are delayed until the disk firmware update completes on that disk. If the
background disk firmware update takes longer than 120 seconds, takeover operations are aborted
and must be restarted manually after the disk firmware update completes. If the takeover was
initiated with the -bypass-optimization parameter of the storage failover takeover
command set to true, the background disk firmware update occurring on the destination node
does not affect the takeover.
If a background disk firmware update is occurring on a disk on the source (or takeover) node and
the takeover was initiated manually with the options parameter of the storage failover
takeover command set to immediate, takeover operations are delayed until the disk firmware
update completes on that disk.
If a background disk firmware update is occurring on a disk on a node and it panics, takeover of
the panicked node begins immediately.
If a background disk firmware update is occurring on a disk on either node, giveback of data
aggregates is delayed until the disk firmware update completes on that disk. If the background
disk firmware update takes longer than 120 seconds, giveback operations are aborted and must be
restarted manually after the disk firmware update completes.
If a background disk firmware update is occurring on a disk on either node, aggregate relocation
operations are delayed until the disk firmware update completes on that disk. If the background
disk firmware update takes longer than 120 seconds, aggregate relocation operations are aborted
and must be restarted manually after the disk firmware update completes. If aggregate relocation
was initiated with the -override-destination-checks parameter of the storage aggregate
relocation command set to true, background disk firmware update occurring on the
destination node does not affect aggregate relocation.
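For example, an aggregate relocation that proceeds despite a background disk firmware update on the destination node might be initiated as follows (a sketch; node and aggregate names are placeholders):
cluster::> storage aggregate relocation start -node node1 -destination node3 -aggregate-list aggr_1 -override-destination-checks true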
[Figure: aggregate aggr_1, consisting of 8 disks on shelf sas_1 (shaded grey), is owned by node1 before relocation and by node2 after relocation.]
The aggregate relocation operation can relocate the ownership of one or more SFO aggregates if the
destination node can support the number of volumes in the aggregates. There is only a short
interruption of access to each aggregate. Ownership information is changed one by one for the
aggregates.
During takeover, aggregate relocation happens automatically when the takeover is initiated manually.
Before the target controller is taken over, ownership of the aggregates belonging to that controller is
moved one at a time to the partner controller. When giveback is initiated, the ownership is
automatically moved back to the original node. The -bypass-optimization parameter can be used
with the storage failover takeover command to suppress aggregate relocation during the
takeover.
The aggregate relocation requires additional steps if the aggregate is currently used by an Infinite
Volume with SnapDiff enabled.
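For example, to take over the partner node without the automatic aggregate relocation described above (a sketch; the node name is a placeholder):
cluster::> storage failover takeover -ofnode node2 -bypass-optimization true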
ha.takeoverImpVersion
ha.takeoverImpLowMem
ha.takeoverImpDegraded
ha.takeoverImpUnsync
ha.takeoverImpIC
ha.takeoverImpHotShelf
ha.takeoverImpNotDef
Avoid using the -only-cfo-aggregates parameter with the storage failover giveback
command.
Related tasks
Architecture compatibility
Both nodes must have the same system model and be running the same Data ONTAP software
and system firmware versions. See the Clustered Data ONTAP Release Notes for the list of
supported systems.
Nonvolatile memory (NVRAM or NVMEM) size and version compatibility
The size and version of the system's nonvolatile memory must be identical on both nodes in an
HA pair.
Storage capacity
The number of disks or array LUNs must not exceed the maximum configuration capacity. If
your system uses both native disks and array LUNs, the combined total of disks and array LUNs
cannot exceed the maximum configuration capacity. In addition, the total storage attached to each
node must not exceed the capacity for a single node.
To determine the maximum capacity for a system using disks, array LUNs, or both, see the
Hardware Universe (formerly the System Configuration Guide) at support.netapp.com/
knowledge/docs/hardware/NetApp/syscfg/index.shtml.
Note: After a failover, the takeover node temporarily serves data from all the storage in the HA
pair.
Multipath HA is required on all HA pairs except for some FAS22xx system configurations,
which use single-path HA and lack the redundant standby connections.
Mailbox disks or array LUNs on the root volume
Two disks are required if the root volume is on a disk shelf.
One array LUN is required if the root volume is on a storage array.
HA interconnect adapters and cables must be installed unless the system has two controllers in
the chassis and an internal interconnect.
Nodes must be attached to the same network and the Network Interface Cards (NICs) must be
configured correctly.
The same system software, such as Common Internet File System (CIFS) or Network File System
(NFS), must be licensed and enabled on both nodes.
Note: If a takeover occurs, the takeover node can provide only the functionality for the licenses
installed on it. If the takeover node does not have a license that was being used by the partner
node to serve data, your HA pair loses functionality after a takeover.
For an HA pair using array LUNs, both nodes in the pair must be able to detect the same array
LUNs.
However, only the node that is the configured owner of a LUN has read-and-write access to the
LUN. During takeover operations, the emulated storage system maintains read-and-write access
to the LUN.
Asymmetrical
configurations
In an asymmetrical standard HA pair, one node has more storage than the
other. This is supported as long as neither node exceeds the maximum
capacity limit for the node.
Active/passive
configurations
In this configuration, the passive node has only a root volume, and the
active node has all the remaining storage and services all data requests
during normal operation. The passive node responds to data requests only if
it has taken over the active node.
Shared loops or
stacks
You can share a loop or stack between the two nodes. This is particularly
useful for active/passive configurations, as described in the preceding
bullet.
In a single-chassis HA pair, both controllers are in the same chassis. The HA interconnect is provided
by the internal backplane. No external HA interconnect cabling is required.
The following example shows a dual-chassis HA pair and the HA interconnect cables:
In a dual-chassis HA pair, the controllers are in separate chassis. The HA interconnect is provided by
external cabling.
Interconnect type
(internal InfiniBand,
external InfiniBand,
or external 10-Gb
Ethernet)
6290
Single-chassis or dual-chassis
6280
Single-chassis or dual-chassis
6250
Single-chassis or dual-chassis
6240
Single-chassis or dual-chassis
6220
Single-chassis
Internal InfiniBand
Yes
6210
Single-chassis
Internal InfiniBand
Yes
60xx
Dual-chassis
External InfiniBand using NVRAM adapter
No
3270
Single-chassis or dual-chassis
3250
Dual-chassis
External 10-Gb Ethernet using onboard ports c0a and c0b. These ports are dedicated HA
interconnect ports. Regardless of the system configuration, these ports cannot be used for data or
other purposes.
Yes
3240
Single-chassis or dual-chassis
3220
Single-chassis or dual-chassis
3210
Single-chassis
Internal InfiniBand
Yes
31xx
Single-chassis
Internal InfiniBand
No
FAS22xx
Single-chassis
Internal InfiniBand
Yes
See the Universal SAS and ACP Cabling Guide on the NetApp Support Site at support.netapp.com
for information about cabling. For cabling the HA interconnect between the nodes, use the
procedures in this guide.
Multipath HA is required on all HA pairs except for some FAS22xx system configurations, which
use single-path HA and lack the redundant standby connections.
Required documentation
Installation of an HA pair requires the correct documentation.
The following table lists and briefly describes the documentation you might need to refer to when
preparing a new HA pair, or converting two stand-alone systems into an HA pair:
Manual name
Description
Hardware Universe
Diagnostics Guide
Note: If you are installing a V-Series HA pair with third-party storage, see the V-Series
Installation Requirements and Reference Guide for information about cabling V-Series systems to
storage arrays, and see the V-Series Implementation Guide for Third-Party Storage for information
about configuring storage arrays to work with V-Series systems.
Required tools
Installation of an HA pair requires the correct tools.
The following list specifies the tools you need to install the HA pair:
Required equipment
When you receive your HA pair, you should receive the equipment listed in the following table. See
the Hardware Universe (formerly the System Configuration Guide) at support.netapp.com/
knowledge/docs/hardware/NetApp/syscfg/index.shtml to confirm your storage-system type, storage
capacity, and so on.
Required equipment
Details
Storage system
Storage
support.netapp.com/knowledge/docs/hardware/
NetApp/syscfg/index.shtml
HA interconnect adapter (for controller modules that do not share a chassis)
InfiniBand (IB) HA adapter
(The NVRAM adapter functions as the HA interconnect adapter on FAS900 series and later
storage systems, except the 32xx systems)
Note: When 32xx systems are in a dual-chassis HA pair, the c0a and c0b 10-GbE ports are the
HA interconnect ports. They do not require an HA interconnect adapter. Regardless of
configuration, the 32xx system's c0a and c0b ports cannot be used for data. They are only for
the HA interconnect.
For DS14mk2 disk shelves: FC-AL or FC HBA (FC HBA for Disk) adapters
For SAS disk shelves: SAS HBAs, if applicable
1. Install the nodes in the equipment rack, as described in the guide for your disk shelf, hardware
documentation, or Quick Start guide that came with your equipment.
2. Install the disk shelves in the equipment rack, as described in the appropriate disk shelf guide.
3. Label the interfaces, where appropriate.
4. Connect the nodes to the network, as described in the setup instructions for your system.
Result
The nodes are now in place and connected to the network and power is available.
1. Install the system cabinets, nodes, and disk shelves as described in the System Cabinet Guide.
If you have multiple system cabinets, remove the front and rear doors and any side panels that
need to be removed, and connect the system cabinets together.
2. Connect the nodes to the network, as described in the Installation and Setup Instructions for your
system.
3. Connect the system cabinets to an appropriate power source and apply power to the cabinets.
Result
The nodes are now in place and connected to the network, and power is available.
After you finish
Cabling an HA pair
To cable an HA pair, you identify the ports you need to use on each node, then you cable the ports,
and then you cable the HA interconnect.
About this task
This procedure explains how to cable a configuration using DS14mk2 or DS14mk4 disk shelves.
For cabling SAS disk shelves in an HA pair, see the Universal SAS and ACP Cabling Guide.
Note: If you are installing an HA pair between V-Series systems using array LUNs, see the V-Series Installation Requirements and Reference Guide for information about cabling V-Series
systems to storage arrays. See the V-Series Implementation Guide for Third-Party Storage for
information about configuring storage arrays to work with Data ONTAP.
The sections for cabling the HA interconnect apply to all systems regardless of disk shelf type.
1. Determining which Fibre Channel ports to use for Fibre Channel disk shelf connections on page
40
2. Cabling Node A to DS14mk2 or DS14mk4 disk shelves on page 41
3. Cabling Node B to DS14mk2 or DS14mk4 disk shelves on page 43
4. Cabling the HA interconnect (all systems except 32xx) on page 45
5. Cabling the HA interconnect (32xx systems in separate chassis) on page 46
Determining which Fibre Channel ports to use for Fibre Channel disk shelf
connections
Before cabling your HA pair, you need to identify which Fibre Channel ports to use to connect your
disk shelves to each storage system, and in what order to connect them.
Keep the following guidelines in mind when identifying ports to use:
Every disk shelf loop in the HA pair requires two ports on the node, one for the primary
connection and one for the redundant multipath HA connection.
A standard HA pair with one loop for each node uses four ports on each node.
Onboard Fibre Channel ports should be used before using ports on expansion adapters.
Always use the expansion slots in the order shown in the Hardware Universe (formerly the
System Configuration Guide) at support.netapp.com/knowledge/docs/hardware/NetApp/syscfg/
index.shtml for your platform for an HA pair.
If using Fibre Channel HBAs, insert the adapters in the same slots on both systems.
See the Hardware Universe (formerly the System Configuration Guide) at support.netapp.com/
knowledge/docs/hardware/NetApp/syscfg/index.shtml to obtain all slot assignment information for
the various adapters you use to cable your HA pair.
After identifying the ports, you should have a numbered list of Fibre Channel ports for both nodes,
starting with Port 1.
Cabling guidelines for a quad-port Fibre Channel HBA
If using ports on the quad-port, 4-Gb Fibre Channel HBAs, use the procedures in the following
sections, with the following additional guidelines:
Disk shelf loops using ESH4 modules must be cabled to the quad-port HBA first.
Disk shelf loops using AT-FCX modules must be cabled to dual-port HBA ports or onboard ports
before using ports on the quad-port HBA.
Port A of the HBA must be cabled to the In port of Channel A of the first disk shelf in the loop.
Port A of the partner node's HBA must be cabled to the In port of Channel B of the first disk shelf
in the loop. This ensures that disk names are the same for both nodes.
Additional disk shelf loops must be cabled sequentially with the HBA's ports.
Port A is used for the first loop, port B for the second loop, and so on.
If available, ports C or D must be used for the redundant multipath HA connection after cabling
all remaining disk shelf loops.
All other cabling rules described in the documentation for the HBA and the Hardware Universe
must be observed.
Steps
The circled numbers in the diagram correspond to the step numbers in the procedure.
The location of the Input and Output ports on the disk shelves varies depending on the disk
shelf model.
Make sure that you refer to the labeling on the disk shelf rather than to the location of the port
shown in the diagram.
The location of the Fibre Channel ports on the controllers is not representative of any
particular storage system model; determine the locations of the ports you are using in your
configuration by inspection or by using the Installation and Setup Instructions for your model.
The port numbers refer to the list of Fibre Channel ports you created.
The diagram only shows one loop per node and one disk shelf per loop.
Your installation might have more loops, more disk shelves, or different numbers of disk
shelves between nodes.
2. Cable Fibre Channel port A1 of Node A to the Channel A Input port of the first disk shelf of
Node A loop 1.
3. Cable the Node A disk shelf Channel A Output port to the Channel A Input port of the next disk
shelf in loop 1.
4. Repeat step 3 for any remaining disk shelves in loop 1.
5. Cable the Channel A Output port of the last disk shelf in the loop to Fibre Channel port B4 of
Node B.
This provides the redundant multipath HA connection for Channel A.
6. Cable Fibre Channel port A2 of Node A to the Channel B Input port of the first disk shelf of
Node B loop 1.
7. Cable the Node B disk shelf Channel B Output port to the Channel B Input port of the next disk
shelf in loop 1.
8. Repeat step 7 for any remaining disk shelves in loop 1.
9. Cable the Channel B Output port of the last disk shelf in the loop to Fibre Channel port B3 of
Node B.
This provides the redundant multipath HA connection for Channel B.
10. Repeat steps 2 to 9 for each pair of loops in the HA pair, using ports 3 and 4 for the next loop,
ports 5 and 6 for the next one, and so on.
Result
Steps
The circled numbers in the diagram correspond to the step numbers in the procedure.
The location of the Input and Output ports on the disk shelves varies depending on the disk
shelf model.
Make sure that you refer to the labeling on the disk shelf rather than to the location of the port
shown in the diagram.
The location of the Fibre Channel ports on the controllers is not representative of any
particular storage system model; determine the locations of the ports you are using in your
configuration by inspection or by using the Installation and Setup Instructions for your model.
The port numbers refer to the list of Fibre Channel ports you created.
The diagram only shows one loop per node and one disk shelf per loop.
Your installation might have more loops, more disk shelves, or different numbers of disk
shelves between nodes.
2. Cable Port B1 of Node B to the Channel B Input port of the first disk shelf of Node A loop 1.
Both channels of this disk shelf are connected to the same port on each node. This is not required,
but it makes your HA pair easier to administer because the disks have the same ID on each node.
This is true for Step 5 also.
3. Cable the disk shelf Channel B Output port to the Channel B Input port of the next disk shelf in
loop 1.
4. Repeat step 3 for any remaining disk shelves in loop 1.
5. Cable the Channel B Output port of the last disk shelf in the loop to Fibre Channel port A4 of
Node A.
This provides the redundant multipath HA connection for Channel B.
6. Cable Fibre Channel port B2 of Node B to the Channel A Input port of the first disk shelf of Node
B loop 1.
7. Cable the disk shelf Channel A Output port to the Channel A Input port of the next disk shelf in
loop 1.
8. Repeat step 7 for any remaining disk shelves in loop 1.
9. Cable the Channel A Output port of the last disk shelf in the loop to Fibre Channel port A3 of
Node A.
This provides the redundant multipath HA connection for Channel A.
10. Repeat steps 2 to 9 for each pair of loops in the HA pair, using ports 3 and 4 for the next loop,
ports 5 and 6 for the next one, and so on.
This procedure applies to all dual-chassis HA pairs (HA pairs in which the two controller modules
reside in separate chassis) except 32xx systems, regardless of disk shelf type.
Steps
1. See the Hardware Universe (formerly the System Configuration Guide) at support.netapp.com/
knowledge/docs/hardware/NetApp/syscfg/index.shtml to ensure that your interconnect adapter is
in the correct slot for your system in an HA pair.
For systems that use an NVRAM adapter, the NVRAM adapter functions as the HA interconnect
adapter.
2. Plug one end of the optical cable into one of the local node's HA adapter ports, then plug the
other end into the partner node's corresponding adapter port.
You must not cross-cable the HA interconnect adapter. Cable the local node ports only to the
identical ports on the partner node.
If the system detects a cross-cabled HA interconnect, the following message appears:
HA interconnect port <port> of this appliance seems to be connected to
port <port> on the partner appliance.
This procedure applies to 32xx systems regardless of the type of attached disk shelves.
Steps
1. Plug one end of the 10-GbE cable to the c0a port on one controller module.
2. Plug the other end of the 10-GbE cable to the c0a port on the partner controller module.
3. Repeat the preceding steps to connect the c0b ports.
Do not cross-cable the HA interconnect adapter; cable the local node ports only to the identical
ports on the partner node.
Result
Configuring an HA pair
Bringing up and configuring an HA pair for the first time can require enabling HA mode capability
and failover, setting options, configuring network connections, and testing the configuration.
These tasks apply to all HA pairs regardless of disk shelf type.
In a two-node cluster, cluster HA ensures that the failure of one node does not disable the cluster. If
your cluster contains only two nodes:
Enabling cluster HA requires and automatically enables storage failover and auto-giveback.
Cluster HA is enabled automatically when you enable storage failover.
Note: If the cluster contains or grows to more than two nodes, cluster HA is not required and is
disabled automatically.
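A minimal sketch of checking and enabling cluster HA in a two-node cluster follows; cluster ha show is assumed to be available alongside the cluster ha modify command used elsewhere in this guide:
cluster::> cluster ha show
cluster::> cluster ha modify -configured true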
If you have a two-node switchless configuration that uses direct-cable connections between the nodes
instead of a cluster interconnect switch, you must ensure that the switchless-cluster-network option is
enabled. This ensures proper cluster communication between the nodes.
Steps
If storage failover is not already enabled, you will be prompted to confirm enabling of both
storage failover and auto-giveback.
2. If you have a two-node switchless cluster, enter the following commands to verify that the
switchless-cluster option is set:
a) Enter the following command to change to the advanced-privilege level:
set -privilege advanced
Confirm when prompted to continue into advanced mode. The advanced mode prompt
appears (*>).
b) Enter the following command:
network options switchless-cluster show
If the output shows that the value is false, you must issue the following command:
network options switchless-cluster modify true
Enable takeover
Disable takeover
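In sketch form, and assuming the -enabled parameter of the storage failover modify command, the commands behind these two actions look like the following:
cluster::> storage failover modify -node nodename -enabled true
cluster::> storage failover modify -node nodename -enabled false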
See the man page for each command for more information.
Related references
The HA mode state of the storage controller can vary. You can use the storage failover show
command to determine the current configuration.
About this task
When a storage controller is shipped from the factory or when Data ONTAP is reinstalled using
option four of the Data ONTAP boot menu (Clean configuration and initialize all
disks), HA mode is enabled by default, and the system's nonvolatile memory (NVRAM or
NVMEM) is split. If you plan to use the controller as a single node cluster, you must configure the
node as non-HA. Reconfiguring as non-HA mode enables full use of the system nonvolatile memory.
Note: Configuring the node as a single node cluster removes the availability benefits of the HA pair.
If the storage failover show output displays Non-HA mode in the State Description
column, then the node is configured for non-HA mode and you are finished:
Example
cluster01::> storage failover show
                              Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -----------------
node1          -              false    Non-HA mode
If the storage failover show output directs you to reboot, you must reboot the node to
enable full use of the system's nonvolatile memory:
Example
cluster01::> storage failover show
                              Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -----------------
node1          -              false    Non-HA mode, reboot to use
                                        full NVRAM
a) If you have a two-node cluster, disable cluster HA using the following command:
cluster ha modify -configured false
d) Reboot the node using the following command:
cluster01::> reboot node nodename
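Pieced together, the reconfiguration might look like the following sketch. The storage failover modify -mode non_ha step is an assumption (the corresponding step is not shown above) and should be verified against the storage failover modify man page:
cluster01::> storage failover show
cluster01::> cluster ha modify -configured false
cluster01::> storage failover modify -mode non_ha -node node1
cluster01::> reboot node node1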
Config Advisor is a configuration validation and health check tool for NetApp systems. It can be
deployed at both secure sites and non-secure sites for data collection and analysis.
Note: Support for Config Advisor is limited, and available only online.
Steps
1. Log in to the NetApp Support Site and go to Downloads > Utility ToolChest.
2. Click Config Advisor (WireGauge renamed).
3. Follow the directions on the web page for downloading and running the utility.
Specify the number of times the hardware-assisted takeover alerts are sent
See the man page for each command for more information. For a mapping of the cf options and
commands used in Data ONTAP operating in 7-Mode to the storage failover commands, refer
to the Data ONTAP 7-Mode to Clustered Data ONTAP Command Map. When in clustered Data
ONTAP, you should always use the storage failover commands rather than issuing an
equivalent 7-Mode command via the nodeshell (using the system node run command).
Panics
See the man page for each command for more information. For a mapping of the cf options and
commands used in Data ONTAP operating in 7-Mode to the storage failover commands, refer
to the Data ONTAP 7-Mode to Clustered Data ONTAP Command Map. When in clustered Data
ONTAP, you should always use the storage failover commands rather than issuing an
equivalent 7-Mode command via the nodeshell (using the system node run command).
The node cannot send heartbeat messages to its partner due to events such as loss of power or
watchdog reset.
You halt the node without using the -f or -inhibit-takeover parameter.
The node panics.
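For example, halting a node without triggering a takeover uses the -inhibit-takeover parameter mentioned above (a sketch that assumes the system node halt command; the node name is a placeholder):
cluster::> system node halt -node node2 -inhibit-takeover true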
The following list shows whether takeover is initiated upon receipt of each alert:
power_loss: Yes
l2_watchdog_reset: Yes
power_off_via_rlm: Yes
power_cycle_via_rlm: Yes
reset_via_rlm: Yes
abnormal_reboot: No
loss_of_heartbeat: No
periodic_message: No
test: No
The automatic giveback causes a second unscheduled interruption (after the automatic takeover).
Depending on your client configurations, you might want to initiate the giveback manually to
plan when this second interruption occurs.
The takeover might have been due to a hardware problem that can recur without additional
diagnosis, leading to additional takeovers and givebacks.
Note: Automatic giveback is enabled by default if the cluster contains only a single HA pair.
Automatic giveback is disabled by default during nondisruptive Data ONTAP upgrades.
Before performing the automatic giveback (regardless of what triggered it), the partner node waits for
a fixed amount of time as controlled by the -delay-seconds parameter of the storage failover
modify command. The default delay is 600 seconds. By delaying the giveback, the process results in
two brief outages:
1. One outage during the takeover operation.
2. One outage during the giveback operation.
This process avoids a single prolonged outage that includes:
1. The time for the takeover operation.
2. The time it takes for the taken-over node to boot up to the point at which it is ready for the
giveback.
3. The time for the giveback operation.
If the automatic giveback fails for any of the non-root aggregates, the system automatically makes
two additional attempts to complete the giveback.
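For example, automatic giveback and its delay can be adjusted with the storage failover modify command described above (a sketch; -auto-giveback is assumed to be the enabling parameter, and 600 seconds is the default delay):
cluster::> storage failover modify -node node1 -auto-giveback true -delay-seconds 600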
See the man page for each command for more information.
For a mapping of the cf options and commands used in Data ONTAP operating in 7-Mode to the
storage failover commands, refer to the Data ONTAP 7-Mode to Clustered Data ONTAP
Command Map. When in clustered Data ONTAP, you should always use the storage failover
commands rather than issuing an equivalent 7-Mode command via the nodeshell (using the system
node run command).
1. Check the cabling on the HA interconnect cables to make sure that they are secure.
2. Verify that you can create and retrieve files on both nodes for each licensed protocol.
3. Enter the following command:
storage failover takeover -ofnode partner_node
5. Enter the following command to display all disks belonging to the partner node (node2) that the
takeover node (node1) can detect:
storage disk show -disk node1:* -home node2 -ownership
You can use the wildcard (*) character to display all the disks visible from a node. The following
command displays all disks belonging to node2 that node1 can detect:
cluster::> storage disk show -disk node1:* -home node2 -ownership
Disk        Aggregate Home  Owner DR Home Home ID    Owner ID   DR Home ID Reserver
----------- --------- ----- ----- ------- ---------- ---------- ---------- ----------
node1:0c.3  -         node2 node2 -       4078312453 4078312453 -          4078312452
node1:0d.3  -         node2 node2 -       4078312453 4078312453 -          4078312452
.
.
.
6. Enter the following command to confirm that the takeover node (node1) controls the partner
node's (node2) aggregates:
aggr show -fields home-id,home-name,is-home
cluster::> aggr show -fields home-name,is-home
aggregate home-name is-home
--------- --------- -------
aggr0_1   node1     true
aggr0_2   node2     false
aggr1_1   node1     true
aggr1_2   node2     false
4 entries were displayed.
During takeover, the is-home value of the partner node's aggregates is false.
7. Give back the partner node's data service after it displays the Waiting for giveback message
by entering the following command:
storage failover giveback -ofnode partner_node
8. Enter either of the following commands to observe the progress of the giveback operation:
storage failover show-giveback
storage failover show
9. Proceed depending on whether you saw the message that giveback was completed successfully:
If takeover and giveback...
Then...
Is completed successfully
Fails
Correct the takeover or giveback failure and then repeat this procedure.
Monitoring an HA pair
You can use a variety of commands to monitor the status of the HA pair. If a takeover occurs, you
can also determine what caused the takeover.
The history of hardware-assisted takeover events that have occurred: storage failover hwassist stats show
The progress of a takeover operation as the partner's aggregates are moved to the node doing the takeover: storage failover show-takeover
The progress of a giveback operation in returning aggregates to the partner node: storage failover show-giveback
ha-config show
Note: This is a Maintenance mode command.
See the man page for each command for more information.
Connected to partner_name.
Pending shutdown.
In takeover.
Takeover in progress.
Node unreachable.
Boot failed
Booting
Dumping core
Dumping sparecore and ready to be taken-over
Halted
In power-on self test
In takeover
Initializing
Operator completed
Rebooting
Takeover disabled
Unknown
Up
Waiting
Waiting for cluster applications to come online on the local node
Waiting for giveback
Waiting for operator input
-inhibit-takeover (overrides the -onreboot setting of the partner node to prevent it from initiating
takeover)
Take over the partner node even if there is a disk mismatch: storage failover takeover
-allow-disk-inventory-mismatch
Take over the partner node even if there is a Data ONTAP version mismatch
If giveback is interrupted
If the takeover node experiences a failure or a power outage during the giveback process, that process
stops and the takeover node returns to takeover mode until the failure is repaired or the power is
restored.
However, this depends upon the stage of giveback in which the failure occurred. If the node
encountered failure or a power outage during partial giveback state (after it has given back the root
aggregate), it will not return to takeover mode. Instead, the node returns to partial-giveback mode. If
this occurs, complete the process by repeating the giveback operation.
If giveback is vetoed
If giveback is vetoed, you must check the EMS messages to determine the cause. Depending on the
reason or reasons, you can decide whether you can safely override the vetoes.
The storage failover show-giveback command displays the giveback progress and shows
which subsystem vetoed, if any. Soft vetoes can be overridden, whereas hard vetoes cannot be, even
if forced. The following tables summarize the soft vetoes that should not be overridden, along with
recommended workarounds.
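If you determine that a soft veto can be safely overridden, the giveback can be forced past it (a sketch that assumes the -override-vetoes parameter, which this guide lists for aggregate relocation, is also accepted by storage failover giveback):
cluster::> storage failover giveback -ofnode nodename -override-vetoes true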
Giveback of the root aggregate
These vetoes do not apply to aggregate relocation operations:
Vetoing subsystem
module
Workaround
vfiler_low_level
Terminate the CIFS sessions causing the veto, or shutdown the CIFS
application that established the open sessions.
Overriding this veto may cause the application using CIFS to
disconnect abruptly and lose data.
Workaround
Disk Check
Workaround
Lock Manager
RAID
If the veto is due to nvfile, bring the offline volumes and aggregates
online.
If disk add or disk ownership reassignment operations are in
progress, wait until they complete.
If the veto is due to an aggregate name or UUID conflict,
troubleshoot and resolve the issue.
If the veto is due to mirror resync, mirror verify, or offline disks,
the veto can be overridden and the operation will be restarted after
giveback.
Disk Inventory
SnapMirror
Give back storage even if the partner is not in the waiting for giveback mode: storage failover
giveback -ofnode nodename -require-partner-waiting false
See the man page for each command for more information.
For a mapping of the cf options and commands used in Data ONTAP operating in 7-Mode to the
storage failover commands, refer to the Data ONTAP 7-Mode to Clustered Data ONTAP
Command Map. When in clustered Data ONTAP, you should always use the storage failover
commands rather than issuing an equivalent 7-Mode command via the nodeshell (using the system
node run command).
For SAS disk shelf management, see the Installation and Service Guide for your disk shelf model.
For cabling SAS disk shelves in an HA pair, see the Universal SAS and ACP Cabling Guide.
1. Confirm that there are two paths to every disk by entering the following command:
storage disk show -port
Note: If two paths are not listed for every disk, this procedure could result in a data service
outage. Before proceeding, address any issues so that all paths are redundant. If you do not
have redundant paths to every disk, you can use the nondisruptive upgrade method (failover) to
add your storage.
2. Install the new disk shelf in your cabinet or equipment rack, as described in the DiskShelf 14,
DiskShelf14mk2 FC, and DiskShelf14mk4 FC Hardware and Service Guide or DiskShelf14mk2
AT Hardware Service Guide.
3. Find the last disk shelf in the loop to which you want to add the new disk shelf.
Note: The Channel A Output port of the last disk shelf in the loop is connected back to one of
the controllers.
Note: In Step 4 you disconnect the cable from the disk shelf. When you do this, the system
displays messages about adapter resets and eventually indicates that the loop is down. These messages are expected within the context of this procedure.
4. Disconnect the SFP and cable coming from the Channel A Output port of the last disk shelf.
Note: Leave the other ends of the cable connected to the controller.
5. Using the correct cable for a shelf-to-shelf connection, connect the Channel A Output port of the
last disk shelf to the Channel A Input port of the new disk shelf.
6. Connect the cable and SFP you removed in Step 4 to the Channel A Output port of the new disk
shelf.
7. If you disabled the adapter in Step 3, reenable the adapter by entering the following command:
run node nodename fcadmin config -e adapter
9. Confirm that there are two paths to every disk by entering the following command:
storage disk show -port
For SAS disk shelf management, see the Installation and Service Guide for your disk shelf
model.
For cabling SAS disk shelves in an HA pair, see the Universal SAS and ACP Cabling Guide.
Whenever you remove a module from an HA pair, you need to know whether the path you will
disrupt is redundant.
If it is, you can remove the module without interfering with the storage system's ability to serve
data. However, if that module provides the only path to any disk in your HA pair, you must take
action to ensure that you do not incur system downtime.
When you replace a module, make sure that the replacement module's termination switch is in the
same position as the module it is replacing.
Note: ESH4 modules are self-terminating; this guideline does not apply to ESH4 modules.
If you replace a module with a different type of module, make sure that you also change the
cables, if necessary.
For more information about supported cable types, see the hardware documentation for your disk
shelf.
Always wait 30 seconds after inserting any module before reattaching any cables in that loop.
1. Verify that all disk shelves are functioning properly by entering the following command:
run -node nodename environ shelf
2. Verify that there are no missing disks by entering the following command:
run -node nodename aggr status -r
Local disks displayed on the local node should be displayed as partner disks on the partner node,
and vice-versa.
3. Verify that you can create and retrieve files on both nodes for each licensed protocol.
If the disks have redundant paths, you can remove the module without interfering with the storage
system's ability to serve data. However, if that module provides the only path to any of the disks in
your HA pair, you must take action to ensure that you do not incur system downtime.
Step
1. Use the storage disk show -port command at your system console.
This command displays the following information for every disk in the HA pair:
Primary port
Secondary port
Disk type
Disk shelf
Bay
Clustr::> storage disk show -port
Primary         Port Secondary       Port Type   Shelf Bay
--------------- ---- --------------- ---- ------ ----- ---
Clustr-1:0a.16  A    Clustr-1:0b.16  B    FCAL       1   0
Clustr-1:0a.17  A    Clustr-1:0b.17  B    FCAL       1   1
Clustr-1:0a.18  A    Clustr-1:0b.18  B    FCAL       1   2
Clustr-1:0a.19  A    Clustr-1:0b.19  B    FCAL       1   3
Clustr-1:0a.20  A    Clustr-1:0b.20  B    FCAL       1   4
Clustr-1:0a.21  A    Clustr-1:0b.21  B    FCAL       1   5
Clustr-1:0a.22  A    Clustr-1:0b.22  B    FCAL       1   6
Clustr-1:0a.23  A    Clustr-1:0b.23  B    FCAL       1   7
Clustr-1:0a.24  A    Clustr-1:0b.24  B    FCAL       1   8
Clustr-1:0a.25  A    Clustr-1:0b.25  B    FCAL       1   9
Clustr-1:0a.26  A    Clustr-1:0b.26  B    FCAL       1  10
Clustr-1:0a.27  A    Clustr-1:0b.27  B    FCAL       1  11
Clustr-1:0a.28  A    Clustr-1:0b.28  B    FCAL       1  12
Clustr-1:0a.29  A    Clustr-1:0b.29  B    FCAL       1  13
Notice that every disk (for example, 0a.16/0b.16) has two ports active: one for A and one for
B. The presence of the redundant path means that you do not need to fail over one system
before removing modules from the system.
Attention: Make sure that every disk has two paths. Even in an HA pair configured for
redundant paths, a hardware or configuration problem can cause one or more disks to have
only one path. If any disk in your HA pair has only one path, you must treat that loop as if it
were in a single-path HA pair when removing modules.
The following example shows what the storage disk show -port command output might
look like for an HA pair consisting of FAS systems that do not use redundant paths:
Clustr::> storage disk show -port
Primary         Port Secondary       Port Type   Shelf Bay
--------------- ---- --------------- ---- ------ ----- ---
Clustr-1:0a.16  A    -               -    FCAL       1   0
Clustr-1:0a.17  A    -               -    FCAL       1   1
Clustr-1:0a.18  A    -               -    FCAL       1   2
Clustr-1:0a.19  A    -               -    FCAL       1   3
Clustr-1:0a.20  A    -               -    FCAL       1   4
Clustr-1:0a.21  A    -               -    FCAL       1   5
Clustr-1:0a.22  A    -               -    FCAL       1   6
Clustr-1:0a.23  A    -               -    FCAL       1   7
Clustr-1:0a.24  A    -               -    FCAL       1   8
Clustr-1:0a.25  A    -               -    FCAL       1   9
Clustr-1:0a.26  A    -               -    FCAL       1  10
Clustr-1:0a.27  A    -               -    FCAL       1  11
Clustr-1:0a.28  A    -               -    FCAL       1  12
Clustr-1:0a.29  A    -               -    FCAL       1  13
Hot-swapping a module
You can hot-swap a faulty disk shelf module, removing the faulty module and replacing it without
disrupting data availability.
About this task
When you hot-swap a disk shelf module, you must ensure that you never disable the only path to a
disk, which results in a system outage.
Attention: If there is newer firmware in the /etc/shelf_fw directory than that on the
replacement module, the system automatically runs a firmware update. This firmware update
causes a service interruption on non-multipath HA AT-FCX installations, multipath HA
configurations running versions of Data ONTAP prior to 7.3.1, and systems with non-RoHS
AT-FCX modules.
Steps
1. Verify that your storage system meets the minimum software requirements to support the disk
shelf modules that you are hot-swapping.
See the DiskShelf14, DiskShelf14mk2 FC, or DiskShelf14mk2 AT Hardware Service Guide for
more information.
2. Determine which loop contains the module you are removing, and determine whether any disks
are single-pathed through that loop.
3. Complete the following steps if any disks use this loop as their only path to a controller:
a) Follow the cables from the module you want to replace back to one of the nodes, called
NodeA.
b) Enter the following command at the NodeB console:
storage failover takeover -ofnode NodeA
c) Wait for takeover to be complete and make sure that the partner node, or NodeA, reboots and
is waiting for giveback.
Any module in the loop that is attached to NodeA can now be replaced.
4. Put on the antistatic wrist strap and grounding leash.
5. Disconnect the module that you are removing from the Fibre Channel cabling.
6. Using the thumb and index finger of both hands, press the levers on the CAM mechanism on the
module to release it and pull it out of the disk shelf.
b) Wait for the giveback to be completed before proceeding to the next step.
11. Test the replacement module.
12. Test the configuration.
Related concepts
[Figure: aggregate aggr_1, consisting of 8 disks on shelf sas_1 (shaded grey), is owned by node1 before relocation and by node2 after relocation.]
The aggregate relocation operation can relocate the ownership of one or more SFO aggregates if the
destination node can support the number of volumes in the aggregates. There is only a short
interruption of access to each aggregate. Ownership information is changed one by one for the
aggregates.
Because volume count limits are validated programmatically during aggregate relocation
operations, it is not necessary to check for this manually. If the volume count exceeds the
supported limit, the aggregate relocation operation will fail with a relevant error message.
You should not initiate aggregate relocation when system-level operations are in progress on
either the source or the destination node; likewise, you should not start these operations during
the aggregate relocation. These operations can include:
Takeover
Giveback
Shutdown
Another aggregate relocation operation
Disk ownership changes
Aggregate or volume configuration operations
Storage controller replacement
Data ONTAP upgrade
Data ONTAP revert
You should not initiate aggregate relocation on aggregates that are corrupt or undergoing
maintenance.
If the source node is used by an Infinite Volume with SnapDiff enabled, you must perform
additional steps before initiating the aggregate relocation and then perform the relocation in a
specific manner. You must ensure that the destination node has a namespace mirror constituent
and make decisions about relocating aggregates that include namespace constituents.
For information about Infinite Volumes, see the Clustered Data ONTAP Physical Storage
Management Guide.
Before initiating the aggregate relocation, save any core dumps on the source and destination
nodes.
Steps
1. View the aggregates on the node to confirm which aggregates to move and ensure they are online
and in good condition:
storage aggregate show -node source-node
Example
The following command shows six aggregates on the four nodes in the cluster. All aggregates are
online. Node1 and node3 form an HA pair, and node2 and node4 form an HA pair.
node1::> storage aggregate show
Aggregate     Size Available Used% State   #Vols Nodes  RAID Status
--------- -------- --------- ----- ------- ----- ------ -----------
aggr_0     239.0GB   11.13GB   95% online      1 node1  raid_dp,
                                                        normal
aggr_1     239.0GB   11.13GB   95% online      1 node1  raid_dp,
                                                        normal
aggr_2     239.0GB   11.13GB   95% online      1 node2  raid_dp,
                                                        normal
aggr_3     239.0GB   11.13GB   95% online      1 node2  raid_dp,
                                                        normal
aggr_4     239.0GB   238.9GB    0% online      5 node3  raid_dp,
                                                        normal
aggr_5     239.0GB   239.0GB    0% online      4 node4  raid_dp,
                                                        normal
6 entries were displayed.
2. Start the aggregate relocation.
The following command moves the aggregates aggr_1 and aggr_2 from node1 to node3. Node3 is
node1's HA partner. The aggregates can be moved only within the HA pair.
node1::> storage aggregate relocation start -aggregate-list aggr_1,
aggr_2 -node node1 -destination node3
Run the storage aggregate relocation show command to check relocation
status.
node1::storage aggregate>
3. Monitor the progress of the aggregate relocation with the storage aggregate relocation
show command:
The following command shows the progress of the aggregates that are being moved to node3:
node1::> storage aggregate relocation show -node node1
Source Aggregate   Destination   Relocation Status
------ ----------- ------------- -------------------------
node1  aggr_1      node3         In progress, module: wafl
       aggr_2      node3         Not attempted yet
2 entries were displayed.
node1::storage aggregate>
When the relocation is complete, the output of this command shows each aggregate with a
relocation status of Done.
Related concepts
Background disk firmware update and takeover, giveback, and aggregate relocation on page 22
See the man page for each command for more information.
Meaning
-node nodename
Meaning
-destination nodename
-override-vetoes true|false
-relocate-to-higher-version true|false
-override-destination-checks true|false
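A sketch that combines these parameters (names as listed above; node and aggregate names are placeholders):
cluster::> storage aggregate relocation start -node node1 -destination node3 -aggregate-list aggr_1 -override-vetoes true -relocate-to-higher-version true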
Workaround
Vol Move
Backup
Lock manager
To resolve the issue, gracefully shut down the CIFS applications that
have open files, or move those volumes to a different aggregate.
Overriding this veto will result in loss of CIFS lock state, causing
disruption and data loss.
RAID
Workaround
Disk Inventory
WAFL
RAID
Troubleshooting HA issues
If you encounter issues in the operation of the HA pair, or errors involving the HA state, you can use
different commands to attempt to understand and resolve the issue.
Related concepts
1. Check communication between the local and partner nodes by entering the following command:
storage failover show -instance
Storage failover is
disabled
Both nodes should be able to detect the same disks. This message indicates
that there is a disk mismatch; for some reason, one node is not seeing all the
disks attached to the HA pair.
a. Remove the failed disk and issue the storage failover takeover command again.
b. Issue the following command to force a takeover:
storage failover takeover -ofnode nodename -allow-disk-inventory-mismatch
Interconnect error, leading to unsynchronized NVRAM logs
a. Check the HA interconnect cabling to ensure that the connection is secure.
b. If the issue is still not resolved, contact support.
NVRAM adapter being in the wrong slot number
Check the NVRAM slot number, moving it to the correct slot if needed.
See the Hardware Universe on the NetApp Support Site for slot assignments.
HA adapter error
Check the HA adapter cabling to ensure that it is correct and properly seated at both ends of the
cable.
Networking error
Automatic takeover is disabled
3. If you have not done so already, run the Config Advisor tool, found on the NetApp Support Site
at support.netapp.com/NOW/download/tools/config_advisor/.
Support for Config Advisor is limited, and available only online.
4. Correct any errors or differences displayed in the output.
5. If takeover is still not enabled, contact technical support.
Related references
1. For systems using disks, check for and remove any failed disks, using the process described in the
Clustered Data ONTAP Physical Storage Management Guide.
2. Enter the following command to check for a disk mismatch (see the example after these steps):
storage failover show -fields local-missing-disks,partner-missing-disks
Both nodes should be able to detect the same disks. If there is a disk mismatch, then for some
reason one node is not seeing all the disks attached to the HA pair.
3. Check the HA interconnect and verify that it is correctly connected and operating.
4. Check whether any of the following processes were taking place on the takeover node at the same
time you attempted the giveback:
If any of these processes are taking place, either cancel the processes or wait until they complete,
and then retry the giveback operation.
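The following illustration of the disk mismatch check from step 2 uses hypothetical node names
and illustrative output; empty missing-disk fields indicate that both nodes see the same disks:
cluster01::> storage failover show -fields local-missing-disks,partner-missing-disks
node  local-missing-disks  partner-missing-disks
----  -------------------  ---------------------
A     -                    -
B     -                    -
2 entries were displayed.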
1. Run the storage failover show command and review the output for the node.
If the output shows that the node is in a partial giveback state, it means that the root aggregate has
been given back, but giveback of one or more SFO aggregates is pending or failed.
2. If the node is in a partial giveback state, run the storage failover show-giveback -node
nodename command.
Example
The output shows the progress of the giveback of the aggregates from the specified node back to
the partner node.
3. Review the output of the command and proceed as appropriate.
If the output shows that some aggregates have not been given back yet:
Wait about five minutes, and then attempt the giveback again. If the output of the
storage failover show command indicates that auto-giveback will be initiated, wait
for the node to retry the auto-giveback; it is not necessary to issue a manual giveback in
this case.
If the output of the storage failover show command indicates that auto-giveback is
disabled due to exceeding retry counts, issue a manual giveback (see the example following
this table).
If, after retrying the giveback, the output of the storage failover show command
still reports partial giveback, check the event logs for the sfo.giveback.failed EMS
message. If the EMS log indicates that another config operation is active on the aggregate,
wait for the conflicting operation to complete, and then retry the giveback operation.
If the output shows that a subsystem has vetoed the giveback process:
The output of the storage failover show command displays which subsystem caused the
veto. If more than one subsystem vetoed the giveback, you are directed to check the EMS log
to determine the reason. These module-specific EMS messages indicate the appropriate
corrective action. After the corrective action has been taken, retry the giveback operation,
then use the storage failover show-giveback command to verify that the operation
completed successfully.
Depending on the cause, you can either wait for the subsystem processes to finish and
attempt the giveback again, or decide whether you can safely override the vetoes.
If the output shows that none of the SFO aggregates were given back:
If the output of the storage failover show command shows that lock synchronization
has been in progress on the recovering node for more than 30 minutes, check the event log
to determine why lock synchronization is not complete.
If the output of the storage failover show command indicates that the recovering
node is waiting for cluster applications to come online, run the cluster ring show
command to determine which user space applications are not yet online.
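If a manual giveback is required as described above, it can be issued as in the following
example (the node name is hypothetical):
cluster01::> storage failover giveback -ofnode node1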
If the output shows that the destination did not online the aggregate on time:
In most cases, the aggregate comes online on the destination within five minutes. Run the
storage failover show-giveback command to check whether the aggregate is online on
the destination.
Also use the volume show command to verify that the volumes hosted on this aggregate are
online. This ensures that there is no data outage from a client perspective.
The destination could be slow to place an aggregate and its volumes online because of some
other CPU-consuming activity. Take the appropriate corrective action (for example, reduce
load or activities on the destination node) and then retry giveback of the remaining
aggregates.
If the aggregate is still not online, check the event logs for messages indicating why the
aggregate was not placed online more quickly. Follow the corrective action specified by
these EMS messages, and verify that the aggregate comes online. It might also be necessary
to reduce load on the destination.
If needed, giveback can be forced by setting the -require-partner-waiting parameter to
false (see the example following this table).
Note: Use of this option may result in the giveback proceeding even if the node detects
outstanding issues that make the giveback dangerous or disruptive.
If the output shows that the destination cannot receive the aggregate:
If the error message is accompanied by a module name, check the event log for EMS messages
indicating why the module in question generated a failure. These module-specific EMS
messages should indicate the corrective actions to take to fix the issue.
If the error message is not accompanied by a module name, and the output of the storage
failover show command indicates that communication with the destination failed, wait
five minutes and then retry the giveback operation.
If the giveback operation persistently fails with the Communication with destination
failed error message, this indicates a persistent cluster network problem. Check the event
logs to determine the reason for the persistent cluster network failures, and correct it.
After the issue has been resolved, retry the giveback operation.
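A forced giveback of the kind described above might be issued as in the following sketch (the
node name is hypothetical, the parameter might require advanced privilege on your system, and
overriding the partner wait can be disruptive):
cluster01::*> storage failover giveback -ofnode node1 -require-partner-waiting false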
4. Run the storage failover show -instance command and check the status of the
Interconnect Up field.
If the field's value is false, the problem could be due to an HA interconnect connection or
cabling issue.
5. Resolve the connection and cabling issues and attempt the giveback again.
6. If the node panics during giveback of SFO aggregates (after the root aggregate has already been
given back), the recovering node will initiate takeover if takeover is enabled and takeover on
panic is enabled.
If takeover on panic is disabled, you can enable it with the storage failover modify
-node nodename -onpanic true command.
Changing either of these parameters on one node in an HA pair automatically makes the
same change on the partner node.
c) If multiple modules veto the operation, the error message will be as follows:
Example
cluster01::> storage aggregate relocation show
Source          Aggregate  Destination  Relocation Status
--------------  ---------  -----------  ---------------------------
A               A_aggr3    B            Failed: Operation was
                                        vetoed by multiple modules.
                                        Check the event log.
B
2 entries were displayed.
d) To resolve these veto failures, refer to the event log for EMS messages associated with the
module that vetoed the operation. Module-specific EMS messages will indicate the
appropriate corrective action.
e) After the corrective action has been taken, retry the aggregate relocation operation, then use
the storage aggregate relocation show command to verify that the operation has
completed successfully.
f) It is possible to override these veto checks using the -override-vetoes parameter (see the
example below). Note that in some cases it may not be safe or possible to override vetoes, so
confirm why the vetoes occurred, their root causes, and the consequences of overriding them.
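For example, using the names from the output above, the relocation could be retried with vetoes
overridden once you have confirmed that doing so is safe:
cluster01::> storage aggregate relocation start -node A -destination B
-aggregate-list A_aggr3 -override-vetoes true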
Error: Destination cannot receive the aggregate
a) An aggregate relocation operation may fail due to a destination check failure. The storage
aggregate relocation show command will report the status: Destination cannot
receive the aggregate.
b) In the following example output, an aggregate relocation operation has failed due to a
disk_inventory check failure:
Example
cluster01::> storage aggregate relocation show
Source          Aggregate  Destination  Relocation Status
--------------  ---------  -----------  -----------------------------
A               aggr3      B            Failed: Destination cannot
                                        receive the aggregate.
                                        module: disk_inventory
B
2 entries were displayed.
c) It is possible to override the destination checks using the -override-destination-checks
parameter of the storage aggregate relocation start command (see the example after
these steps). Note that in some cases it may not be safe or possible to override destination
checks, so confirm why they occurred, their root causes, and the consequences of overriding
these checks.
d) If, after the aggregate relocation failure, the storage aggregate relocation show
command reports the status Communication with destination failed, this is
probably due to a transient CSM error. In most cases, the aggregate relocation operation
will succeed when retried.
e) However, if the operation persistently fails with the Communication with destination
failed error, then this is indicative of a persistent cluster network problem. Check the event
logs to determine the reason for the persistent cluster network failures, and correct it.
f) After the issue has been resolved, retry the aggregate relocation operation, and use the
storage aggregate relocation show command to verify that the operation has
succeeded.
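If, after investigating the failure, you decide to override the destination checks, the retry
might look like the following example (names are taken from the illustration above):
cluster01::> storage aggregate relocation start -node A -destination B
-aggregate-list aggr3 -override-destination-checks true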
Error: Destination took too long to place the aggregate online
a) If the destination of an aggregate relocation takes longer than a specified time to place the
aggregate online after relocation, it is reported as an error. The default aggregate online
timeout is 120 seconds. If it takes longer than 120 seconds for the destination node to place an
aggregate online, the source node will report the error: Destination did not online
the aggregate on time. Note that the source has successfully relocated the aggregate but
it will report an error for the current aggregate and abort relocation of pending aggregates.
This is done so you can take appropriate corrective action (for example, reduce load/activities
on destination) and then retry relocation of remaining aggregates.
b) The destination could be taking a long time to place an aggregate and its volumes online
because of some other CPU-consuming activity. You should take the appropriate corrective
action (for example, reduce load/activities on the destination node) and then retry relocating
the remaining aggregates.
c) The storage aggregate relocation show command can be used to verify this status.
For example, if aggregate A_aggr1 was not placed online by node B within the specified time,
its Relocation Status shows Failed: Destination node did not online the
aggregate on time, and any remaining aggregates show Not attempted yet.
d) To determine the aggregate online timeout on a node, run the storage failover show
-fields aggregate-migration-timeout command in advanced privilege mode:
Example
cluster01::*> storage failover show -fields aggregate-migration-timeout
node  aggregate-migration-timeout
----  ----------------------------
A     120
B     120
2 entries were displayed.
e) Even if an aggregate takes longer than 120 seconds to come online on the destination, it will
typically come online within five minutes. No user action is required for an aggregate that
incurs a 120-second timeout unless it fails to come online after five minutes. Run the
storage aggregate show command to check whether the aggregate is online on the
destination.
f) You can also use the volume show command to verify that the volumes hosted on this
aggregate are online (see the example after these steps). This ensures that there is no data
outage from a client perspective.
g) If the aggregate is still not online, check the event logs for messages indicating why the
aggregate was not placed online in a timely manner. Follow the corrective action advised by
the specific EMS message(s), and verify that the aggregate comes online. It might also be
necessary to reduce load on the destination.
h) If needed, relocation can be forced by using the -override-destination-checks
parameter. Note that use of this option may result in the aggregate migration proceeding even
if the node detects outstanding issues that make the aggregate relocation dangerous or
disruptive.
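The checks described in steps e through g might be run as follows (the aggregate name is
hypothetical; output is omitted because it varies by system):
cluster01::> storage aggregate show -aggregate A_aggr1 -fields state
cluster01::> volume show -aggregate A_aggr1 -fields state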
Error: A conflicting volume/aggregate operation is in progress
a) If relocation is initiated on an aggregate that has another aggregate or volume operation
already in progress, the relocation operation will fail, and the
aggr.relocation.failed EMS message will be generated with the message: Another
config operation active on aggregate. Retry later.
b) If this is the case, wait for the conflicting operation to complete and retry the relocation
operation (see the example below for checking the event log).
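To check for this condition, you can search the event log for the EMS message, for example (the
exact event log show parameters can vary by release):
cluster01::> event log show -messagename aggr.relocation.failed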
Error: Aggregate relocation is supported only on online non-root SFO aggregates
Verify that takeover is enabled. You can verify this using the command storage failover
show -node <local_node_name> -fields enabled.
Also check whether the aggregate:
Does not have an SFO policy (storage aggregate show -aggregate <aggr_name>
-fields ha-policy).
Is a root aggregate (storage aggregate show -aggregate <aggr_name> -fields
root).
Is not currently owned by the source node (storage aggregate show -aggregate
<aggr_name> -fields owner-name).
If the aggregate meets any of these conditions, the relocation will fail with the following
error:
Example
cluster01::> storage aggregate relocation start -node A -destination B
-aggregate-list B_aggr1
Error: command failed: Aggregate relocation is supported only on online
non-root SFO aggregates that are owned by the source node.
These preconditions can be checked by running the storage aggregate show command
and checking the values of the following fields: owner-name, ha-policy, and root. An
example run of the command is shown below:
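The following output is illustrative only and uses the hypothetical names from the failed
command above:
cluster01::> storage aggregate show -aggregate B_aggr1 -fields owner-name,ha-policy,root
aggregate  owner-name  ha-policy  root
---------  ----------  ---------  -----
B_aggr1    B           sfo        false
In this illustration, the relocation failed because B_aggr1 is owned by node B rather than by
the source node A.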
Troubleshooting HA state issues
The HA state must match the configuration of the system:
For a stand-alone configuration (not in an HA pair), the HA state on the chassis and on
controller module A must be non-ha.
For a single-chassis HA pair, the HA state on the chassis and on controller modules A and B
must be ha.
For a dual-chassis HA pair, the HA state on chassis A, controller module A, chassis B, and
controller module B must be ha.
5. If necessary, enter the following command to set the HA state of the chassis, where ha-state
is ha or non-ha as appropriate for your configuration (see the example below):
ha-config modify chassis ha-state
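For example, from Maintenance mode the HA state might be verified and set as follows (this is a
sketch; confirm the exact syntax for your platform in the appropriate hardware documentation):
*> ha-config show
*> ha-config modify controller ha
*> ha-config modify chassis ha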
7. Boot the system by entering the following command from the boot loader prompt:
boot_ontap
Copyright information
Copyright © 1994–2013 NetApp, Inc. All rights reserved. Printed in the U.S.
No part of this document covered by copyright may be reproduced in any form or by any means
(graphic, electronic, or mechanical, including photocopying, recording, taping, or storage in an
electronic retrieval system) without prior written permission of the copyright owner.
Software derived from copyrighted NetApp material is subject to the following license and
disclaimer:
THIS SOFTWARE IS PROVIDED BY NETAPP "AS IS" AND WITHOUT ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE,
WHICH ARE HEREBY DISCLAIMED. IN NO EVENT SHALL NETAPP BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
NetApp reserves the right to change any products described herein at any time, and without notice.
NetApp assumes no responsibility or liability arising from the use of products described herein,
except as expressly agreed to in writing by NetApp. The use or purchase of this product does not
convey a license under any patent rights, trademark rights, or any other intellectual property rights of
NetApp.
The product described in this manual may be protected by one or more U.S. patents, foreign patents,
or pending applications.
RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the government is subject to
restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer
Software clause at DFARS 252.277-7103 (October 1988) and FAR 52-227-19 (June 1987).
Trademark information
NetApp, the NetApp logo, Network Appliance, the Network Appliance logo, Akorri,
ApplianceWatch, ASUP, AutoSupport, BalancePoint, BalancePoint Predictor, Bycast, Campaign
Express, ComplianceClock, Cryptainer, CryptoShred, CyberSnap, Data Center Fitness, Data
ONTAP, DataFabric, DataFort, Decru, Decru DataFort, DenseStak, Engenio, Engenio logo, E-Stack,
ExpressPod, FAServer, FastStak, FilerView, Flash Accel, Flash Cache, Flash Pool, FlashRay,
FlexCache, FlexClone, FlexPod, FlexScale, FlexShare, FlexSuite, FlexVol, FPolicy, GetSuccessful,
gFiler, Go further, faster, Imagine Virtually Anything, Lifetime Key Management, LockVault, Mars,
Manage ONTAP, MetroCluster, MultiStore, NearStore, NetCache, NOW (NetApp on the Web),
Onaro, OnCommand, ONTAPI, OpenKey, PerformanceStak, RAID-DP, ReplicatorX, SANscreen,
SANshare, SANtricity, SecureAdmin, SecureShare, Select, Service Builder, Shadow Tape,
Simplicity, Simulate ONTAP, SnapCopy, Snap Creator, SnapDirector, SnapDrive, SnapFilter,
SnapIntegrator, SnapLock, SnapManager, SnapMigrator, SnapMirror, SnapMover, SnapProtect,
SnapRestore, Snapshot, SnapSuite, SnapValidator, SnapVault, StorageGRID, StoreVault, the
StoreVault logo, SyncMirror, Tech OnTap, The evolution of storage, Topio, VelocityStak, vFiler,
VFM, Virtual File Manager, VPolicy, WAFL, Web Filer, and XBB are trademarks or registered
trademarks of NetApp, Inc. in the United States, other countries, or both.
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business
Machines Corporation in the United States, other countries, or both. A complete and current list of
other IBM trademarks is available on the web at www.ibm.com/legal/copytrade.shtml.
Apple is a registered trademark and QuickTime is a trademark of Apple, Inc. in the United States
and/or other countries. Microsoft is a registered trademark and Windows Media is a trademark of
Microsoft Corporation in the United States and/or other countries. RealAudio, RealNetworks,
RealPlayer, RealSystem, RealText, and RealVideo are registered trademarks and RealMedia,
RealProxy, and SureStream are trademarks of RealNetworks, Inc. in the United States and/or other
countries.
All other brands or products are trademarks or registered trademarks of their respective holders and
should be treated as such.
NetApp, Inc. is a licensee of the CompactFlash and CF Logo trademarks.
NetApp, Inc. NetCache is certified RealSystem compatible.
Index
A
active/passive configuration 29
adapters
quad-port Fibre Channel HBA 40
aggregate relocation
commands for 85
giveback 92
monitoring progress of 86
troubleshooting 92
veto 86
aggregates
HA policy of 22
ownership change 21, 83
relocation of 24, 82
root 26
automatic giveback
commands for configuring 55
automatic takeover
triggers for 52
B
background disk firmware update 22
best practices
HA configuration 26
C
cabinets
preparing for cabling 39
cable 37
cabling
Channel A
for standard HA pairs 41
Channel B
for standard HA pairs 43
cross-cabled cluster interconnect 46
cross-cabled HA interconnect 45
error message, cross-cabled cluster interconnect 45,
46
HA interconnect for standard HA pair 45
HA interconnect for standard HA pair, 32xx systems
46
HA pairs 35
preparing equipment racks for 38
D
data network 12
Data ONTAP
upgrading nondisruptively, documentation for 7
disk firmware update 22
disk shelves
about modules for 77
adding to an HA pair with multipath HA 75
hot swapping modules in 80
managing in an HA pair 75
documentation, required 36
dual-chassis HA configurations
diagram of 30
interconnect 31
E
eliminating single point of failure 8
EMS message, takeover impossible 26
equipment racks
installation in 35
preparation of 38
events
table of failover triggering 16
F
failover
benefits of controller 8
failovers
events that trigger 16
failures
table of failover triggering 16
fault tolerance 6
Fibre Channel ports
identifying for HA pair 40
forcing takeover
commands for 70
FRU replacement, nondisruptive
documentation for 7
G
giveback
commands for configuring automatic 55
definition of 15
interrupted 72
manual 74
monitoring progress of 72, 74
partial-giveback 72
performing a 72
testing 55
troubleshooting, SFO aggregates 92
veto 72, 74
what happens during 21
giveback after reboot 54
H
HA
configuring in two-node clusters 47
HA configurations
benefits of 6
definition of 6
differences between supported system 31
single- and dual-chassis 30
HA interconnect
cabling 45
cabling, 32xx dual-chassis HA configurations 46
single-chassis and dual-chassis HA configurations
31
HA issues
troubleshooting general 89
HA mode
disabling 49
enabling 49
HA pairs
cabling 35, 39
events that trigger failover in 16
in a two-node switchless cluster 14
installing 35
managing disk shelves in 75
required connections for using UPSs with 46
setup requirements 27
setup restrictions 27
types of
installed in equipment racks 35
installed in system cabinets 35
HA pairs and clusters 12
HA policy 22
HA state 31
HA state values
troubleshooting issues with 100
ha-config modify command 31
ha-config show command 31
hardware
components described 11
HA components described 11
single point of failure 8
hardware assisted takeover
events that trigger 53
hardware replacement, nondisruptive
documentation for 7
high availability
configuring in two-node clusters 47
I
installation
equipment rack 35
system cabinet 35
installing
HA pairs 35
L
licenses
cf 49
not required 49
LIF configuration, best practice 26
M
mailbox disks in the HA pair 6
manual takeovers
commands for performing 70
MetroCluster configurations
events that trigger failover in 16
mirroring, NVMEM or NVRAM log 6
modules, disk shelf
about 77
best practices for changing types 77
hot-swapping 80
restrictions for changing types 77
testing 78
multipath HA loop
adding disk shelves to 75
N
node states
description of 59
Non-HA mode
enabling 49
Nondisruptive aggregate relocation 6
nondisruptive hardware replacement
documentation for 7
nondisruptive operations 6
nondisruptive storage controller upgrade using
aggregate relocation
documentation for 7
storage controller upgrade using aggregate
relocation, nondisruptive
documentation for 7
nondisruptive upgrades
Data ONTAP, documentation for 7
NVMEM log mirroring 6
NVRAM adapter 37
NVRAM log mirroring 6
O
overriding vetoes
aggregate relocation 86
giveback 72
P
panic, leading to takeover and giveback 54
ports
identifying which ones to use 40
power supply best practice 26
preparing equipment racks 38
R
racking the HA pair
in a system cabinet 35
in telco-style racks 35
reboot, leading to takeover and giveback 54
relocating aggregate ownership 83
relocating aggregates 82
relocation of aggregates 24, 82, 83
requirements
documentation 36
equipment 37
HA pair setup 27
hot-swapping a disk shelf module 80
S
SFO HA policy 22
SFP modules 37
sharing storage loops or stacks 29
shelves
managing in an HA pair 75
single point of failure
analysis 8
definition of 8
eliminating 8
single-chassis HA configurations
diagram of 30
interconnect 31
SMB 3.0 sessions on Microsoft Hyper-V
effect of takeover on 20
SMB sessions
effect of takeover on 20
spare disks in the HA pair 6
standard HA pair
cabling Channel A 41
cabling Channel B 43
cabling HA interconnect for 45
cabling HA interconnect for, 32xx systems 46
variations 29
states
description of node 59
status messages
description of node state 59
storage aggregate relocation start
key parameters 85
storage failover
commands for enabling 48
testing takeover and giveback 55
switchless-cluster
enabling in two-node clusters 47
system cabinets
installation in 35
preparing for cabling 39
system configurations
HA differences between supported 31
T
takeover
configuring when it occurs 52
definition of 15
effect on CIFS sessions 20
effect on SMB 3.0 sessions 20
effect on SMB sessions 20
events that trigger hardware-assisted 53
hardware assisted 19, 28
reasons for 52
testing 55
what happens during 20
takeover impossible EMS message 26
takeovers
commands for forcing 70
when they occur 15
testing
takeover and giveback 55
tools, required 37
troubleshooting
aggregate relocation 95
general HA issues 89
HA state issues 100
two-node switchless cluster 14
U
uninterruptible power supplies
See UPSs
UPSs
required connections with HA pairs 46
utilities
downloading and running Config Advisor 51
V
verifying
takeover and giveback 55
veto
aggregate relocation 86
giveback 72
override 72, 86
VIF configuration, best practice in an HA configuration
26