ONTAP 9 High-Availability Configuration Guide
Contents
Deciding whether to use this guide ............................................................. 5
Planning your HA pair configuration ......................................................... 6
Best practices for HA pairs ......................................................................................... 6
Setup requirements and restrictions for standard HA pairs ......................................... 7
Setup requirements and restrictions for mirrored HA pairs ........................................ 8
Requirements for hardware-assisted takeover ............................................................. 9
If your cluster consists of a single HA pair ................................................................. 9
Storage configuration variations for HA pairs .......................................................... 10
HA pairs and storage system model types ................................................................ 10
Single-chassis and dual-chassis HA pairs ..................................................... 10
Interconnect cabling for systems with variable HA configurations .............. 11
HA configuration and the HA state PROM value ......................................... 11
Requirements for cabling an HA pair ......................................................... 12
System cabinet or equipment rack installation .......................................................... 12
HA pairs in an equipment rack ...................................................................... 12
HA pairs in a system cabinet ......................................................................... 12
Required documentation ........................................................................................... 13
Required tools ........................................................................................................... 13
Required equipment .................................................................................................. 14
Preparing your equipment ......................................................................................... 15
Installing the nodes in equipment racks ........................................................ 15
Installing the nodes in a system cabinet ........................................................ 16
Cabling a standard HA pair ....................................................................................... 16
Cabling the HA interconnect (all systems except 32xx or 80xx in
separate chassis) ...................................................................................... 17
Cabling the HA interconnect (32xx systems in separate chassis) ................. 17
Cabling the HA interconnect (80xx systems in separate chassis) ................. 18
Cabling a mirrored HA pair ...................................................................................... 19
Cabling the HA interconnect (all systems except 32xx or 80xx in
separate chassis) ...................................................................................... 19
Cabling the HA interconnect (32xx systems in separate chassis) ................. 20
Cabling the HA interconnect (80xx systems in separate chassis) ................. 20
Required connections for using uninterruptible power supplies with standard or
mirrored HA pairs ................................................................................................ 21
Configuring an HA pair ............................................................................. 22
Verifying and setting the HA state on the controller modules and chassis ............... 22
Setting the HA mode and enabling storage failover .................................................. 24
Commands for setting the HA mode ............................................................. 24
Commands for enabling and disabling storage failover ................................ 24
Enabling cluster HA and switchless-cluster in a two-node cluster ........................... 24
Checking for common configuration errors using Config Advisor .......................... 25
4 | High-Availability Configuration Guide
• You want to understand the requirements and best practices for configuring HA pairs.
If you want to use ONTAP System Manager to monitor HA pairs, you should choose the following
documentation:
• You must not use the root aggregate for storing data.
Storing user data in the root aggregate adversely affects system stability and increases the storage
failover time between nodes in an HA pair.
• You must verify that each power supply unit in the storage system is on a different power grid so
that a single power outage does not affect all power supply units.
• You must use LIFs (logical interfaces) with defined failover policies to provide redundancy and
improve availability of network communication.
• You must test the failover capability routinely (for example, during planned maintenance) to
verify proper configuration.
• You must verify that each node has sufficient resources to adequately support the workload of
both nodes during takeover mode.
• You must use the Config Advisor tool to help make failovers successful.
• If your system supports remote management (through a Service Processor), you must configure it
properly.
System administration
• You must verify that you follow recommended limits for FlexVol volumes, dense volumes,
Snapshot copies, and LUNs to reduce takeover or giveback time.
When adding FlexVol volumes to an HA pair, you should consider testing the takeover and
giveback times to verify that they fall within your requirements.
• For systems using disks, ensure that you check for failed disks regularly and remove them as soon
as possible.
Failed disks can extend the duration of takeover operations or prevent giveback operations.
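As a minimal sketch, failed disks can be listed from the clustershell with the storage disk show command; the exact output formatting varies by ONTAP release:

```
storage disk show -state broken
```

Any disks reported here should be replaced as soon as possible, following the disk replacement procedure for your platform.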
Disk and aggregate management
• Multipath HA connection is required on all HA pairs except for some FAS22xx, FAS25xx, and
FAS2600 series system configurations, which use single-path HA and lack the redundant standby
connections.
• To receive prompt notification if the takeover capability becomes disabled, you should configure
your system to enable automatic email notification for the takeover impossible EMS
messages:
◦ ha.takeoverImpVersion
◦ ha.takeoverImpLowMem
◦ ha.takeoverImpDegraded
◦ ha.takeoverImpUnsync
◦ ha.takeoverImpIC
◦ ha.takeoverImpHotShelf
◦ ha.takeoverImpNotDef
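One possible way to enable email notification for these messages is sketched below; the mail server, sender address, destination name, and recipient address are placeholders, and the exact event command syntax varies by ONTAP release:

```
event config modify -mail-server smtp.example.com -mail-from admin@example.com
event destination create -name ha-alerts -mail storage-admins@example.com
event route add-destinations -message-name ha.takeoverImp* -destinations ha-alerts
```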
• Avoid using the -only-cfo-aggregates parameter with the storage failover giveback
command.
• Architecture compatibility
Both nodes must have the same system model and be running the same ONTAP software and
system firmware versions. The ONTAP release notes list the supported storage systems.
ONTAP 9 Release Notes
NetApp Hardware Universe
• Nonvolatile memory (NVRAM or NVMEM) size and version compatibility
The size and version of the system's nonvolatile memory must be identical on both nodes in an
HA pair.
• Storage capacity
◦ The number of disks or array LUNs must not exceed the maximum configuration capacity.
◦ The total storage attached to each node must not exceed the capacity for a single node.
◦ If your system uses native disks and array LUNs, the combined total of disks and array LUNs
cannot exceed the maximum configuration capacity.
◦ To determine the maximum capacity for a system using disks, array LUNs, or both, see the
Hardware Universe at hwu.netapp.com.
Note: After a failover, the takeover node temporarily serves data from all of the storage in the
HA pair.
◦ Different types of storage can be used on separate stacks on the same node.
You can also dedicate a node to one type of storage and the partner node to a different type, if
needed.
NetApp Hardware Universe
Disk and aggregate management
◦ Multipath HA connection is required on all HA pairs except for some FAS22xx, FAS25xx,
and FAS2600 series system configurations, which use single-path HA and lack the redundant
standby connections.
• Network connectivity
Both nodes must be attached to the same network and the Network Interface Cards (NICs) or
onboard Ethernet ports must be configured correctly.
• System software
The same system software, such as SyncMirror, Server Message Block (SMB) or Common
Internet File System (CIFS), or Network File System (NFS), must be licensed and enabled on
both nodes.
Note: If a takeover occurs, the takeover node can provide only the functionality for the licenses
installed on it. If the takeover node does not have a license that was being used by the partner
node to serve data, your HA pair loses functionality after a takeover.
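As a quick sketch, the system license show command lists the licensed packages in the cluster; verify that every license used to serve data is present for both nodes (output formatting varies by release):

```
system license show
```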
Related references
Commands for performing and monitoring manual takeovers on page 46
◦ Disks or array LUNs in the same plex must be from the same pool, with those in the opposite
plex from the opposite pool.
◦ There must be sufficient spares in each pool to account for a disk or array LUN failure.
◦ Both plexes of a mirror should not reside on the same disk shelf because it might result in a
single point of failure.
• If you are using array LUNs, paths to an array LUN must be redundant.
Related references
Commands for setting the HA mode on page 24
Related information
System administration
Related tasks
Enabling cluster HA and switchless-cluster in a two-node cluster on page 24
Related references
Halting or rebooting a node without initiating takeover on page 43
Related information
System administration
You can find more information about HA configurations supported by storage system models in the
Hardware Universe.
Related information
NetApp Hardware Universe
In a single-chassis HA pair, both controllers are in the same chassis. The HA interconnect is provided
by the internal backplane. No external HA interconnect cabling is required.
The following example shows a dual-chassis HA pair and the HA interconnect cables:
In a dual-chassis HA pair, the controllers are in separate chassis. The HA interconnect is provided by
external cabling.
Related tasks
Verifying and setting the HA state on the controller modules and chassis on page 22
Required documentation
Installing an HA pair requires that you have the correct documentation.
The following table lists and briefly describes the documentation you might need to refer to when
preparing a new HA pair, or converting two stand-alone systems into an HA pair:
Related information
NetApp Documentation: Product Library A-Z
Required tools
You must have the correct tools to install the HA pair.
You need the following tools to install the HA pair:
• Hand level
• Marker
Required equipment
When you receive your HA pair, you should receive a list of required equipment.
For more information, see the Hardware Universe to confirm your storage system type, storage
capacity, and so on.
hwu.netapp.com
Steps
1. Install the nodes in the equipment rack as described in the guide for your disk shelf, hardware
documentation, or the Installation and Setup Instructions that came with your equipment.
2. Install the disk shelves in the equipment rack as described in the appropriate disk shelf guide.
4. Connect the nodes to the network as described in the setup instructions for your system.
Result
The nodes are now in place and connected to the network; power is available.
Steps
1. Install the system cabinets, nodes, and disk shelves as described in the System Cabinet Guide.
If you have multiple system cabinets, remove the front and rear doors and any side panels that
need to be removed, and connect the system cabinets together.
2. Connect the nodes to the network, as described in the Installation and Setup Instructions for your
system.
3. Connect the system cabinets to an appropriate power source and apply power to the cabinets.
Result
The nodes are now in place and connected to the network, and power is available.
Steps
1. Cabling the HA interconnect (all systems except 32xx or 80xx in separate chassis) on page 17
To cable the HA interconnect between the HA pair nodes, you must make sure that your
interconnect adapter is in the correct slot. You must also connect the adapters on each node with
the optical cable.
2. Cabling the HA interconnect (32xx systems in separate chassis) on page 17
To enable the HA interconnect between 32xx controller modules that reside in separate chassis,
you must cable the onboard 10-GbE ports on one controller module to the onboard 10-GbE ports
on the partner.
3. Cabling the HA interconnect (80xx systems in separate chassis) on page 18
To enable the HA interconnect between 80xx controller modules that reside in separate chassis,
you must cable the QSFP InfiniBand ports on one I/O expansion module to the QSFP InfiniBand
ports on the partner's I/O expansion module.
Related information
NetApp Documentation: Disk Shelves
Steps
1. Verify that your interconnect adapter is in the correct slot for your system in an HA pair.
hwu.netapp.com
For systems that use an NVRAM adapter, the NVRAM adapter functions as the HA interconnect
adapter.
2. Plug one end of the optical cable into one of the local node's HA adapter ports, then plug the
other end into the partner node's corresponding adapter port.
You must not cross-cable the HA interconnect adapter. Cable the local node ports only to the
identical ports on the partner node.
If the system detects a cross-cabled HA interconnect, the following message appears on the
system console and in the event log (accessible using the event log show command):
HA interconnect port <port> of this appliance seems to be connected to
port <port> on the partner appliance.
Result
The nodes are connected to each other.
Steps
1. Plug one end of the 10 GbE cable to the c0a port on one controller module.
2. Plug the other end of the 10 GbE cable to the c0a port on the partner controller module.
If the system detects a cross-cabled HA interconnect, the following message appears on the
system console and in the event log (accessible using the event log show command):
HA interconnect port <port> of this appliance seems to be connected to
port <port> on the partner appliance.
Result
The nodes are connected to each other.
Steps
1. Plug one end of the QSFP InfiniBand cable to the ib0a port on one I/O expansion module.
2. Plug the other end of the QSFP InfiniBand cable to the ib0a port on the partner's I/O expansion
module.
Result
The nodes are connected to each other.
Steps
1. Cabling the HA interconnect (all systems except 32xx or 80xx in separate chassis) on page 19
To cable the HA interconnect between the HA pair nodes, you must make sure that your
interconnect adapter is in the correct slot. You must also connect the adapters on each node with
the optical cable.
2. Cabling the HA interconnect (32xx systems in separate chassis) on page 20
To enable the HA interconnect between 32xx controller modules that reside in separate chassis,
you must cable the onboard 10-GbE ports on one controller module to the onboard 10-GbE ports
on the partner.
3. Cabling the HA interconnect (80xx systems in separate chassis) on page 20
To enable the HA interconnect between 80xx controller modules that reside in separate chassis,
you must cable the QSFP InfiniBand ports on one I/O expansion module to the QSFP InfiniBand
ports on the partner's I/O expansion module.
Related information
NetApp Documentation: Disk Shelves
Steps
1. Verify that your interconnect adapter is in the correct slot for your system in an HA pair.
hwu.netapp.com
For systems that use an NVRAM adapter, the NVRAM adapter functions as the HA interconnect
adapter.
2. Plug one end of the optical cable into one of the local node's HA adapter ports, then plug the
other end into the partner node's corresponding adapter port.
You must not cross-cable the HA interconnect adapter. Cable the local node ports only to the
identical ports on the partner node.
If the system detects a cross-cabled HA interconnect, the following message appears on the
system console and in the event log (accessible using the event log show command):
Result
The nodes are connected to each other.
Steps
1. Plug one end of the 10 GbE cable to the c0a port on one controller module.
2. Plug the other end of the 10 GbE cable to the c0a port on the partner controller module.
Result
The nodes are connected to each other.
Steps
1. Plug one end of the QSFP InfiniBand cable to the ib0a port on one I/O expansion module.
2. Plug the other end of the QSFP InfiniBand cable to the ib0a port on the partner's I/O expansion
module.
Result
The nodes are connected to each other.
Configuring an HA pair
Bringing up and configuring a standard or mirrored HA pair for the first time can require enabling
HA mode capability and failover, setting options, configuring network connections, and testing the
configuration.
These tasks apply to all HA pairs regardless of disk shelf type.
Steps
1. Verifying and setting the HA state on the controller modules and chassis on page 22
2. Setting the HA mode and enabling storage failover on page 24
3. Enabling cluster HA and switchless-cluster in a two-node cluster on page 24
4. Checking for common configuration errors using Config Advisor on page 25
5. Configuring hardware-assisted takeover on page 26
6. Configuring automatic takeover on page 27
7. Configuring automatic giveback on page 28
8. Testing takeover and giveback on page 31
The HA state is recorded in the hardware PROM in the chassis and in the controller module. It must
be consistent across all components of the system, as shown in the following table:
Stand-alone configuration (not in an HA pair)
• The HA state is recorded on: the chassis and controller module A
• The HA state on the components must be: non-ha

A single-chassis HA pair
• The HA state is recorded on: the chassis, controller module A, and controller module B
• The HA state on the components must be: ha

A dual-chassis HA pair
• The HA state is recorded on: chassis A, controller module A, chassis B, and controller module B
• The HA state on the components must be: ha
Use the following steps to verify the HA state is appropriate and, if not, to change it:
Steps
1. Reboot or halt the current controller module and use either of the following two options to boot
into Maintenance mode:
a. If you rebooted the controller, press Ctrl-C when prompted to display the boot menu and then
select the option for Maintenance mode boot.
b. If you halted the controller, enter the following command from the LOADER prompt:
boot_ontap maint
Note: This option boots directly into Maintenance mode; you do not need to press Ctrl-C.
2. After the system boots into Maintenance mode, enter the following command to display the HA
state of the local controller module and chassis:
ha-config show
3. If necessary, enter the following command to set the HA state of the controller:
ha-config modify controller ha-state
4. If necessary, enter the following command to set the HA state of the chassis:
ha-config modify chassis ha-state
5. Exit Maintenance mode to return to the boot loader prompt:
halt
6. Boot the system by entering the following command at the boot loader prompt:
boot_ontap
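The steps above can be sketched as the following Maintenance mode session; ha is shown as the target state for an HA pair (use non-ha for a stand-alone system), and the prompts are illustrative:

```
*> ha-config show
*> ha-config modify controller ha
*> ha-config modify chassis ha
*> halt
LOADER> boot_ontap
```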
Related information
Stretch MetroCluster installation and configuration
Fabric-attached MetroCluster installation and configuration
MetroCluster management and disaster recovery
Related references
Description of node states displayed by storage failover show-type commands on page 34
Cluster HA ensures that the failure of one node does not disable the cluster. If your cluster contains
only two nodes:
• Enabling cluster HA requires and automatically enables storage failover and auto-giveback.
Note: If the cluster contains or grows to more than two nodes, cluster HA is not required and is
disabled automatically.
For ONTAP 9.0 and 9.1, if you have a two-node switchless configuration, the switchless-cluster
network option must be enabled to ensure proper cluster communication between the nodes. In
ONTAP 9.2, the switchless-cluster network option is automatically enabled. When the
detect-switchless-cluster option is set to false, the switchless-cluster option behaves as it
did in previous releases.
Steps
1. Enter the following command to enable cluster HA:
cluster ha modify -configured true
If storage failover is not already enabled, you are prompted to confirm enabling of both storage
failover and auto-giveback.
2. ONTAP 9.0, 9.1: If you have a two-node switchless cluster, enter the following commands to
verify that the switchless-cluster option is set:
set -privilege advanced
Confirm when prompted to continue into advanced mode. The advanced mode prompt appears
(*>).
network options switchless-cluster show
If the output shows that the value is false, you must issue the following command:
network options switchless-cluster modify true
Related concepts
If your cluster consists of a single HA pair on page 9
Related references
Halting or rebooting a node without initiating takeover on page 43
Steps
1. Log in to the NetApp Support Site, and then navigate to Downloads > Software > ToolChest.
3. Download, install, and run Config Advisor by following the directions on the web page.
4. After running Config Advisor, review the tool's output, and follow the recommendations that are
provided to address any issues that are discovered by the tool.
Related information
Command map for 7-Mode administrators
• The node cannot send heartbeat messages to its partner due to events such as loss of power or
watchdog reset.
If the onpanic parameter is set to true, a node panic also causes an automatic takeover. If onpanic
is set to false, a node panic does not cause an automatic takeover.
To disable automatic giveback after takeover on panic (this setting is enabled by default):
storage failover modify -node nodename -auto-giveback-after-panic false

To delay automatic giveback for a specified number of seconds (default is 600):
storage failover modify -node nodename -delay-seconds seconds
This option determines the minimum time that a node will remain in takeover before performing an
automatic giveback.

To change the number of times the automatic giveback is attempted within 60 minutes (default is
two):
storage failover modify -node nodename -attempts integer

To change the time period (in minutes) used by the -attempts parameter (default is 60 minutes):
storage failover modify -node nodename -attempts-time integer

To change the time period (in minutes) to delay the automatic giveback before terminating CIFS
clients that have open files:
storage failover modify -node nodename -auto-giveback-cifs-terminate-minutes integer
During the delay, the system periodically sends notices to the affected workstations. If 0 (zero)
minutes are specified, then CIFS clients are terminated immediately.
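For example, to lengthen the giveback delay and allow an extra retry attempt on a node, you might enter commands like the following; node1 is a placeholder node name, and the field names are assumed to match the parameter names:

```
storage failover modify -node node1 -delay-seconds 900
storage failover modify -node node1 -attempts 3
storage failover show -node node1 -fields delay-seconds, attempts
```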
Related information
Command map for 7-Mode administrators
The following table describes how combinations of the -onreboot and -auto-giveback
parameters affect automatic giveback for takeover events not caused by a panic:
Note: If the -onreboot parameter is set to true and a takeover occurs due to a reboot, then
automatic giveback is always performed, regardless of whether the -auto-giveback parameter is
set to true.
When the -onreboot parameter is set to false, a takeover does not occur in the case of a node
reboot. Therefore, automatic giveback cannot occur, regardless of whether the -auto-giveback
parameter is set to true. A client disruption occurs.
The following list describes how parameter combinations of the storage failover modify
command affect automatic giveback in panic situations:

• -onpanic true, -auto-giveback-after-panic true: automatic giveback occurs after a panic.
• -onpanic false, -auto-giveback-after-panic true: automatic giveback does not occur.
• -onpanic false, -auto-giveback-after-panic false: automatic giveback does not occur.
• -onpanic true, -auto-giveback false, -auto-giveback-after-panic false: automatic
giveback does not occur.
• -onpanic false: automatic giveback does not occur. If -onpanic is set to false,
takeover/giveback does not occur, regardless of the value set for -auto-giveback or
-auto-giveback-after-panic.

Note: If the -onpanic parameter is set to true, automatic giveback is always performed if a
panic occurs.
If the -onpanic parameter is set to false, takeover does not occur. Therefore, automatic
giveback cannot occur, even if the -auto-giveback-after-panic parameter is set to true. A
client disruption occurs.
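To ensure automatic giveback after a panic, both parameters in the first combination above must be set to true; a sketch, with node1 as a placeholder node name:

```
storage failover modify -node node1 -onpanic true -auto-giveback-after-panic true
```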
Steps
1. Check the cabling on the HA interconnect cables to make sure that they are secure.
2. Verify that you can create and retrieve files on both nodes for each licensed protocol.
Example
If you have the storage failover command's -auto-giveback option enabled:
Example
If you have the storage failover command's -auto-giveback option disabled:
5. Enter the following command to display all the disks that belong to the partner node (Node2) that
the takeover node (Node1) can detect:
storage disk show -home node2 -ownership
The following command displays all disks belonging to Node2 that Node1 can detect:
6. Enter the following command to confirm that the takeover node (Node1) controls the partner
node's (Node2) aggregates:
aggr show -fields home-id,home-name,is-home
During takeover, the is-home value of the partner node's aggregates is false.
7. Give back the partner node's data service after it displays the Waiting for giveback message
by entering the following command:
storage failover giveback -ofnode partner_node
8. Enter either of the following commands to observe the progress of the giveback operation:
storage failover show-giveback
storage failover show
9. Proceed depending on whether you saw the message that giveback was completed successfully:
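Taken together, a takeover and giveback test of Node2 initiated from Node1 might look like the following sequence of the commands from the steps above; node names are placeholders:

```
storage failover takeover -ofnode node2
storage failover show
storage disk show -home node2 -ownership
aggr show -fields home-id,home-name,is-home
storage failover giveback -ofnode node2
storage failover show-giveback
```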
Related references
Description of node states displayed by storage failover show-type commands on page 34
Monitoring an HA pair
You can use a variety of commands to monitor the status of the HA pair. If a takeover occurs, you
can also determine what caused the takeover.
Related tasks
Enabling cluster HA and switchless-cluster in a two-node cluster on page 24
34 | High-Availability Configuration Guide
State Meaning
Connected to partner_name. The HA interconnect is active and can transmit
data to the partner node.
Connected to partner_name, Partial The HA interconnect is active and can transmit
giveback. data to the partner node. The previous giveback
to the partner node was a partial giveback, or is
incomplete.
Connected to partner_name, Takeover The HA interconnect is active and can transmit
of partner_name is not possible due data to the partner node, but takeover of the
to reason(s): reason1, reason2,.... partner node is not possible.
A detailed list of reasons explaining why
takeover is not possible is provided in the
section following this table.
Connected to partner_name, Partial The HA interconnect is active and can transmit
giveback, Takeover of partner_name data to the partner node, but takeover of the
is not possible due to reason(s): partner node is not possible. The previous
reason1, reason2,.... giveback to the partner was a partial giveback.
Connected to partner_name, Waiting The HA interconnect is active and can transmit
for cluster applications to come data to the partner node and is waiting for
online on the local node. cluster applications to come online.
This waiting period can last several minutes.
Waiting for partner_name, Takeover The local node cannot exchange information
of partner_name is not possible due with the partner node over the HA interconnect.
to reason(s): reason1, reason2,.... Reasons for takeover not being possible are
displayed under reason1, reason2,…
Waiting for partner_name, Partial The local node cannot exchange information
giveback, Takeover of partner_name with the partner node over the HA interconnect.
is not possible due to reason(s): The previous giveback to the partner was a
reason1, reason2,.... partial giveback. Reasons for takeover not being
possible are displayed under reason1,
reason2,…
Pending shutdown. The local node is shutting down. Takeover and
giveback operations are disabled.
In takeover. The local node is in takeover state and
automatic giveback is disabled.
In takeover, Auto giveback will be The local node is in takeover state and
initiated in number of seconds automatic giveback will begin in number of
seconds. seconds seconds.
Monitoring an HA pair | 35
State Meaning
In takeover, Auto giveback The local node is in takeover state and an
deferred. automatic giveback attempt failed because the
partner node was not in waiting for giveback
state.
Giveback in progress, module module The local node is in the process of giveback to
name. the partner node. Module module name is
being given back.
Normal giveback not possible: The partner node is missing some of its own file
partner missing file system disks. system disks.
Retrieving disk information. Wait a The partner and takeover nodes have not yet
few minutes for the operation to exchanged disk inventory information. This
complete, then try giveback. state clears automatically.
Connected to partner_name, Takeover After a takeover or giveback operation (or in the
is not possible: Local node missing case of MetroCluster, a disaster recovery
partner disks operation including switchover, healing, or
switchback), you might see disk inventory
mismatch messages.
If this is the case, you should wait at least five
minutes for the condition to resolve before
retrying the operation.
If the condition persists, investigate possible
disk or cabling issues.
Connected to partner, Takeover is After a takeover or giveback operation (or in the
not possible: Storage failover case of MetroCluster, a disaster recovery
mailbox disk state is invalid, operation including switchover, healing, or
Local node has encountered errors switchback), you might see disk inventory
while reading the storage failover mismatch messages.
partner's mailbox disks. Local node If this is the case, you should wait at least five
missing partner disks minutes for the condition to resolve before
retrying the operation.
If the condition persists, investigate possible
disk or cabling issues.
Previous giveback failed in module Giveback to the partner node by the local node
module name. failed due to an issue in module name.
Previous giveback failed. Auto Giveback to the partner node by the local node
giveback disabled due to exceeding failed. Automatic giveback is disabled because
retry counts. of excessive retry attempts.
Takeover scheduled in seconds Takeover of the partner node by the local node
seconds. is scheduled due to the partner node shutting
down or an operator-initiated takeover from the
local node. The takeover will be initiated within
the specified number of seconds.
36 | High-Availability Configuration Guide
State Meaning
State: Takeover in progress, module module name.
Meaning: The local node is in the process of taking over the partner node. Module module name is being taken over.

State: Takeover in progress.
Meaning: The local node is in the process of taking over the partner node.

State: firmware-status.
Meaning: The node is not reachable, and the system is trying to determine its status from firmware updates to its partner. A detailed list of possible firmware statuses is provided after this table.

State: Node unreachable.
Meaning: The node is unreachable and its firmware status cannot be determined.

State: Takeover failed, reason: reason.
Meaning: Takeover of the partner node by the local node failed due to reason reason.

State: Previous giveback failed in module: module name. Auto giveback disabled due to exceeding retry counts.
Meaning: Previously attempted giveback failed in module module name. Automatic giveback is disabled. Run the storage failover show-giveback command for more information.

State: Waiting for partner_name, Giveback of SFO aggregates in progress.
Meaning: The local node cannot exchange information with the partner node over the HA interconnect. Giveback of SFO aggregates is in progress.
State: Waiting for partner_name. Node owns aggregates belonging to another node in the cluster.
Meaning: The local node cannot exchange information with the partner node over the HA interconnect, and owns aggregates that belong to the partner node.

State: Connected to partner_name, Giveback of partner spare disks pending.
Meaning: The HA interconnect is active and can transmit data to the partner node. Giveback of SFO aggregates to the partner is done, but partner spare disks are still owned by the local node.

State: Waiting for partner_name. Waiting for partner lock synchronization.
Meaning: The local node cannot exchange information with the partner node over the HA interconnect, and is waiting for partner lock synchronization to occur.

State: Waiting for partner_name. Waiting for cluster applications to come online on the local node.
Meaning: The local node cannot exchange information with the partner node over the HA interconnect, and is waiting for cluster applications to come online.

State: Takeover scheduled. target node relocating its SFO aggregates in preparation of takeover.
Meaning: Takeover processing has started. The target node is relocating ownership of its SFO aggregates in preparation for takeover.

State: Takeover scheduled. target node has relocated its SFO aggregates in preparation of takeover.
Meaning: Takeover processing has started. The target node has relocated ownership of its SFO aggregates in preparation for takeover.

State: Takeover scheduled. Waiting to disable background disk firmware updates on local node. A firmware update is in progress on the node.
Meaning: Takeover processing has started. The system is waiting for background disk firmware update operations on the local node to complete.

State: Relocating SFO aggregates to taking over node in preparation of takeover.
Meaning: The local node is relocating ownership of its SFO aggregates to the taking-over node in preparation for takeover.

State: Relocated SFO aggregates to taking over node. Waiting for taking over node to takeover.
Meaning: Relocation of ownership of SFO aggregates from the local node to the taking-over node has completed. The system is waiting for takeover by the taking-over node.
State: Relocating SFO aggregates to partner_name. Waiting to disable background disk firmware updates on the local node. A firmware update is in progress on the node.
Meaning: Relocation of ownership of SFO aggregates from the local node to the taking-over node is in progress. The system is waiting for background disk firmware update operations on the local node to complete.

State: Relocating SFO aggregates to partner_name. Waiting to disable background disk firmware updates on partner_name. A firmware update is in progress on the node.
Meaning: Relocation of ownership of SFO aggregates from the local node to the taking-over node is in progress. The system is waiting for background disk firmware update operations on the partner node to complete.

State: Connected to partner_name. Previous takeover attempt was aborted because reason. Local node owns some of partner's SFO aggregates. Reissue a takeover of the partner with the "‑bypass-optimization" parameter set to true to takeover remaining aggregates, or issue a giveback of the partner to return the relocated aggregates.
Meaning: The HA interconnect is active and can transmit data to the partner node. The previous takeover attempt was aborted because of the reason displayed under reason. The local node owns some of its partner's SFO aggregates. Either reissue a takeover of the partner node, setting the ‑bypass‑optimization parameter to true to take over the remaining SFO aggregates, or perform a giveback of the partner to return the relocated aggregates.

State: Waiting for partner_name. Previous takeover attempt was aborted because reason. Local node owns some of partner's SFO aggregates. Reissue a takeover of the partner with the "‑bypass-optimization" parameter set to true to takeover remaining aggregates, or issue a giveback of the partner to return the relocated aggregates.
Meaning: The local node cannot exchange information with the partner node over the HA interconnect. The previous takeover attempt was aborted because of the reason displayed under reason. The local node owns some of its partner's SFO aggregates. Either reissue a takeover of the partner node, setting the ‑bypass‑optimization parameter to true to take over the remaining SFO aggregates, or perform a giveback of the partner to return the relocated aggregates.
State: Waiting for partner_name. Previous takeover attempt was aborted. Local node owns some of partner's SFO aggregates. Reissue a takeover of the partner with the "‑bypass-optimization" parameter set to true to takeover remaining aggregates, or issue a giveback of the partner to return the relocated aggregates.
Meaning: The local node cannot exchange information with the partner node over the HA interconnect. The previous takeover attempt was aborted. The local node owns some of its partner's SFO aggregates. Either reissue a takeover of the partner node, setting the ‑bypass‑optimization parameter to true to take over the remaining SFO aggregates, or perform a giveback of the partner to return the relocated aggregates.

State: Node owns partner's aggregates as part of the non-disruptive controller upgrade procedure.
Meaning: The node owns its partner's aggregates due to the non-disruptive controller upgrade procedure currently in progress.

State: Connected to partner_name. Node owns aggregates belonging to another node in the cluster.
Meaning: The HA interconnect is active and can transmit data to the partner node. The node owns aggregates belonging to another node in the cluster.
State: Connected to partner_name. Waiting for partner lock synchronization.
Meaning: The HA interconnect is active and can transmit data to the partner node. The system is waiting for partner lock synchronization to complete.

State: Connected to partner_name. Waiting for cluster applications to come online on the local node.
Meaning: The HA interconnect is active and can transmit data to the partner node. The system is waiting for cluster applications to come online on the local node.

State: Non-HA mode, reboot to use full NVRAM.
Meaning: Storage failover is not possible. The HA mode option is configured as non_ha.

State: Non-HA mode, remove HA interconnect card from HA slot to use full NVRAM.
Meaning: Storage failover is not possible. The HA mode option is configured as non_ha. You must remove the HA interconnect card from the HA slot to use all of the node's NVRAM.

State: Non-HA mode, remove partner system to use full NVRAM.
Meaning: Storage failover is not possible. The HA mode option is configured as non_ha.

State: Non-HA mode. See documentation for procedure to activate HA.
Meaning: Storage failover is not possible. The HA mode option is configured as non_ha.
The firmware-status value can be one of the following:
• Booting
• Dumping core
• Dumping sparecore and ready to be taken-over
• Halted
• In takeover
• Initializing
• Operator completed
• Rebooting
• Takeover disabled
• Unknown
• Up
• Waiting
Related references
Commands for setting the HA mode on page 24
Related tasks
Halting or rebooting a node without initiating takeover in a two-node cluster on page 44
Steps
1. Disable cluster HA:
cluster ha modify -configured false
2. Because disabling cluster HA automatically assigns epsilon to one of the two nodes, you must determine which node holds it and, if necessary, reassign it to the node that you want to remain online.
a. Determine which node currently holds epsilon:
cluster show
b. If the node you want to halt or reboot does not hold epsilon, proceed to step 3.
c. If the node you want to halt or reboot holds epsilon, remove it from the node by using the following command:
cluster modify -node Node1 -epsilon false
d. Assign epsilon to the node that you want to remain online (in this example, Node2) by using the following command:
cluster modify -node Node2 -epsilon true
3. Halt or reboot, and inhibit takeover of, the node that does not hold epsilon (in this example, Node1) by using either of the following commands as appropriate:
system node halt -node Node1 -inhibit-takeover true
system node reboot -node Node1 -inhibit-takeover true
4. After the halted or rebooted node is back online, you must enable cluster HA by using the
following command:
cluster ha modify -configured true
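The decision logic in the steps above (reassign epsilon only when the node being halted holds it) can be sketched as a small planner. This is an illustrative model only; the node names are from the example, and in practice these commands are issued at the ONTAP CLI, not from Python.

```python
# Sketch of the epsilon-reassignment logic from the steps above.
# Hypothetical helper; command strings mirror the guide's examples.

def plan_halt_commands(epsilon_holder, node_to_halt, partner):
    """Return the CLI commands needed to safely halt node_to_halt."""
    commands = ["cluster ha modify -configured false"]
    if epsilon_holder == node_to_halt:
        # Epsilon must be moved off the node being halted (steps 2c-2d).
        commands.append(f"cluster modify -node {node_to_halt} -epsilon false")
        commands.append(f"cluster modify -node {partner} -epsilon true")
    commands.append(f"system node halt -node {node_to_halt} -inhibit-takeover true")
    return commands

# Node1 holds epsilon and is the node to halt, so epsilon moves to Node2:
for cmd in plan_halt_commands("Node1", "Node1", "Node2"):
    print(cmd)
```

If the node to halt does not hold epsilon, the planner skips straight from disabling cluster HA to the halt command, matching step 2b.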
Related tasks
Moving epsilon for certain manually initiated takeovers on page 47
Take over the partner node even if there is a disk mismatch:
storage failover takeover ‑allow‑disk‑inventory‑mismatch

Take over the partner node even if there is an ONTAP version mismatch:
storage failover takeover ‑option allow‑version‑mismatch
Note: This option is only used during the nondisruptive ONTAP upgrade process.

Take over the partner node without performing aggregate relocation:
storage failover takeover ‑bypass‑optimization true

Take over the partner node before the partner has time to close its storage resources gracefully:
storage failover takeover ‑option immediate
Note: Before you issue the storage failover command with the immediate option, you must migrate the data LIFs to another node by using the following command:
network interface migrate-all -node node
• If you specify the storage failover takeover ‑option immediate command without first migrating the data LIFs, data LIF migration from the node is significantly delayed even if the skip‑lif‑migration‑before‑takeover option is not specified.
• Similarly, if you specify the immediate option, negotiated takeover optimization is bypassed even if the bypass‑optimization option is set to false.
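The option combinations above can be captured in a small command-assembly helper. The builder itself is a hypothetical illustration (ONTAP is driven from its own CLI, not Python); only the flag names come from the table above.

```python
# Hypothetical helper assembling a storage failover takeover command
# from the options described above. Illustrative only.

def build_takeover_command(node, allow_disk_mismatch=False,
                           allow_version_mismatch=False,
                           bypass_optimization=False,
                           immediate=False):
    cmd = ["storage failover takeover", f"-ofnode {node}"]
    if allow_disk_mismatch:
        cmd.append("-allow-disk-inventory-mismatch")
    if allow_version_mismatch:
        # Only used during nondisruptive ONTAP upgrades, per the note above.
        cmd.append("-option allow-version-mismatch")
    if bypass_optimization:
        cmd.append("-bypass-optimization true")
    if immediate:
        # Data LIFs must be migrated first, per the note above.
        print(f"network interface migrate-all -node {node}")
        cmd.append("-option immediate")
    return " ".join(cmd)

print(build_takeover_command("Node1", bypass_optimization=True))
```

Printing the migrate-all command before appending the immediate option mirrors the required ordering in the note: LIF migration happens before the takeover is issued.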
Related information
Disk and aggregate management
For further information about cluster administration, quorum and epsilon, see the document library
on the NetApp Support Site.
NetApp Documentation: Product Library A-Z
System administration
Steps
1. Verify the cluster state and confirm that epsilon is held by a healthy node that is not being taken over:
a. Change to the advanced privilege level, confirming that you want to continue when the advanced mode prompt appears (*>):
set -privilege advanced
b. Determine which node holds epsilon:
cluster show
If the node you want to take over does not hold epsilon, proceed to Step 4.
2. Remove epsilon from the node that you want to take over:
cluster modify -node Node1 -epsilon false
3. Assign epsilon to the partner node (in this example, Node2) by using the following command:
cluster modify -node Node2 -epsilon true
4. Initiate takeover of the target node (in this example, Node1):
storage failover takeover -ofnode Node1
Related tasks
Halting or rebooting a node without initiating takeover in a two-node cluster on page 44
Related references
Halting or rebooting a node without initiating takeover on page 43
Related information
Disk and aggregate management
If giveback is interrupted
If the takeover node experiences a failure or a power outage during the giveback process, the process stops and the takeover node returns to takeover mode until the failure is repaired or the power is restored.
However, this behavior depends on the stage of giveback at which the failure occurred. If the failure or power outage occurred during the partial-giveback state (after the root aggregate has been given back), the node does not return to takeover mode; instead, it returns to partial-giveback mode. If this occurs, complete the process by repeating the giveback operation.
If giveback is vetoed
If giveback is vetoed, you must check the EMS messages to determine the cause. Depending on the reason or reasons, you can decide whether you can safely override the vetoes.
The storage failover show-giveback command displays the giveback progress and shows which subsystem vetoed the giveback, if any. Soft vetoes can be overridden, but hard vetoes cannot be overridden, even forcibly. The following table summarizes the soft vetoes that should not be overridden, along with recommended workarounds.
You can review the EMS details for any giveback vetoes by using the following command:
event log show -node * -event gb*
Vetoing subsystem module: Disk Inventory
Workaround: Troubleshoot to identify and resolve the cause of the problem. The destination node might be unable to see disks belonging to an aggregate being migrated. Inaccessible disks can result in inaccessible aggregates or volumes.

Vetoing subsystem module: Volume Move Operation
Workaround: Troubleshoot to identify and resolve the cause of the problem. This veto prevents the volume move operation from aborting during the important cutover phase. If the job is aborted during cutover, the volume might become inaccessible.
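The soft/hard veto distinction above can be modeled as a simple lookup. The two subsystem names come from the table; any other subsystem name used here is purely a placeholder, and the decision text is an illustration, not ONTAP output.

```python
# Illustrative model of giveback veto handling. "Disk Inventory" and
# "Volume Move Operation" are the soft vetoes from the table above that
# should nevertheless not be overridden.

SOFT_VETOES_DO_NOT_OVERRIDE = {"Disk Inventory", "Volume Move Operation"}

def giveback_advice(vetoing_subsystem, is_hard_veto):
    """Return guidance for a vetoed giveback, per the rules above."""
    if is_hard_veto:
        return "cannot override: resolve the condition first"
    if vetoing_subsystem in SOFT_VETOES_DO_NOT_OVERRIDE:
        return "soft veto, but do not override: troubleshoot the cause"
    return "soft veto: may be overridden after reviewing EMS messages"

print(giveback_advice("Volume Move Operation", False))
```

In practice the vetoing subsystem is read from storage failover show-giveback output, and the EMS messages (event log show -node * -event gb*) supply the details behind the decision.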
Related references
Description of node states displayed by storage failover show-type commands on page 34
Give back storage even if the partner is not in the waiting-for-giveback mode:
storage failover giveback ‑ofnode nodename ‑require‑partner‑waiting false
Do not use this option unless a longer client outage is acceptable.

Monitor the progress of giveback after you issue the giveback command:
storage failover show‑giveback
Related information
Command map for 7-Mode administrators
• Because volume count limits are validated programmatically during aggregate relocation
operations, it is not necessary to check for this manually.
If the volume count exceeds the supported limit, the aggregate relocation operation fails with a
relevant error message.
• You should not initiate aggregate relocation when system-level operations are in progress on
either the source or the destination node; likewise, you should not start these operations during
the aggregate relocation.
These operations can include the following:
◦ Takeover
◦ Giveback
◦ Shutdown
◦ ONTAP upgrade
◦ ONTAP revert
• If you have a MetroCluster configuration, you should not initiate aggregate relocation while
disaster recovery operations (switchover, healing, or switchback) are in progress.
• You should not initiate aggregate relocation on aggregates that are corrupt or undergoing
maintenance.
• If the source node is used by an Infinite Volume with SnapDiff enabled, you must perform
additional steps before initiating the aggregate relocation and then perform the relocation in a
specific manner.
You must ensure that the destination node has a namespace mirror constituent and make decisions
about relocating aggregates that include namespace constituents.
Infinite volumes management
• Before initiating the aggregate relocation, you should save any core dumps on the source and
destination nodes.
Steps
1. View the aggregates on the node to confirm which aggregates to move, and ensure that they are online and in good condition:
storage aggregate show -node source-node
Example
The following command shows six aggregates on the four nodes in the cluster. All aggregates are online. Node1 and Node3 form an HA pair, and Node2 and Node4 form an HA pair.
2. Issue the command to start the aggregate relocation:
storage aggregate relocation start -node source-node -destination destination-node -aggregate-list aggregate-name
Example
The following command moves the aggregates aggr_1 and aggr_2 from Node1 to Node3. Node3 is Node1's HA partner. The aggregates can be moved only within the HA pair.
3. Monitor the progress of the aggregate relocation with the storage aggregate relocation show command:
storage aggregate relocation show -node source-node
Example
The following command shows the progress of the aggregates that are being moved to Node3:
When the relocation is complete, the output of this command shows each aggregate with a
relocation status of Done.
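The monitoring loop in step 3 amounts to polling until every aggregate reports a relocation status of Done. In the sketch below, a canned sequence of status snapshots stands in for the output of storage aggregate relocation show; the snapshot contents are illustrative.

```python
# Sketch of the monitoring loop in step 3. The snapshots stand in for
# successive "storage aggregate relocation show" outputs.

def relocation_complete(statuses):
    """True when every aggregate reports a relocation status of Done."""
    return all(state == "Done" for state in statuses.values())

snapshots = [
    {"aggr_1": "In progress", "aggr_2": "Not attempted yet"},
    {"aggr_1": "Done", "aggr_2": "In progress"},
    {"aggr_1": "Done", "aggr_2": "Done"},
]
polls = 0
for snapshot in snapshots:
    polls += 1
    if relocation_complete(snapshot):
        break
print(f"relocation finished after {polls} polls")
```

A real monitor would sleep between polls and re-run the show command; the termination condition (every aggregate Done) is the part that mirrors the guide.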
Related information
ONTAP 9 commands
1. The remote management device monitors the local system for certain types of failures.
2. If a failure is detected, the remote management device immediately sends an alert to the partner
node.
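The hardware-assisted flow in steps 1 and 2 is essentially an event push from the remote management device to the partner node, which avoids waiting for a heartbeat timeout. The sketch below models that push; the event names are illustrative, not an actual list of monitored conditions.

```python
# Sketch of hardware-assisted takeover: the remote management device
# monitors the local system and immediately alerts the partner on
# certain failures. Event names here are illustrative assumptions.

def remote_mgmt_monitor(event, alert_partner):
    """Alert the partner immediately if event is a monitored failure."""
    monitored = {"power_loss", "watchdog_reset", "abnormal_reboot"}
    if event in monitored:
        alert_partner(event)   # step 2: immediate alert to the partner
        return True
    return False

alerts = []
remote_mgmt_monitor("power_loss", alerts.append)
print(alerts)
```

The design point is that the partner reacts to a pushed alert rather than polling, which is what makes the takeover faster than heartbeat-based detection alone.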
• The automatic giveback causes a second unscheduled interruption (after the automatic takeover).
Depending on your client configurations, you might want to initiate the giveback manually to
plan when this second interruption occurs.
• The takeover might have been due to a hardware problem that can recur without additional
diagnosis, leading to additional takeovers and givebacks.
Note: Automatic giveback is enabled by default if the cluster contains only a single HA pair.
Automatic giveback is disabled by default during nondisruptive ONTAP upgrades.
Before performing the automatic giveback (regardless of what triggered it), the partner node waits for
a fixed amount of time as controlled by the -delay-seconds parameter of the storage failover
modify command. The default delay is 600 seconds. By delaying the giveback, the process results in
two brief outages:
2. The time it takes for the taken-over node to boot up to the point at which it is ready for the
giveback
If the automatic giveback fails for any of the non-root aggregates, the system automatically makes
two additional attempts to complete the giveback.
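The delay-then-retry behavior described above can be sketched as a small simulation: a fixed wait controlled by -delay-seconds (default 600), then up to three total giveback attempts (the initial attempt plus two retries). The attempt callback below is a stand-in; real givebacks are driven by ONTAP itself.

```python
# Simulation of the automatic giveback schedule described above.
# attempt_giveback is a stand-in callback returning True on success.

def auto_giveback(attempt_giveback, delay_seconds=600, extra_retries=2):
    waited = delay_seconds               # fixed wait before the first attempt
    for attempt in range(1 + extra_retries):
        if attempt_giveback():
            return waited, attempt + 1   # seconds waited, attempts used
    return waited, 1 + extra_retries     # all attempts exhausted

# A giveback that fails once and then succeeds:
results = iter([False, True])
waited, attempts = auto_giveback(lambda: next(results))
print(waited, attempts)
```

Changing delay_seconds models the effect of the -delay-seconds parameter of the storage failover modify command on when the first attempt fires.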
• You can monitor the progress using the storage failover show‑takeover command.
• The aggregate relocation can be avoided during this takeover instance by using the
‑bypass‑optimization parameter with the storage failover takeover command. To
bypass aggregate relocation during all future planned takeovers, set the
‑bypass‑takeover‑optimization parameter of the storage failover modify
command to true.
Note: Aggregates are relocated serially during planned takeover operations to reduce client
outage. If aggregate relocation is bypassed, longer client outage occurs during planned takeover
events. Setting the ‑bypass‑takeover‑optimization parameter of the storage
failover modify command to true is not recommended in environments that have
stringent outage requirements.
2. If the user-initiated takeover is a negotiated takeover, the target node gracefully shuts down,
followed by takeover of the target node's root aggregate and any aggregates that were not
relocated in Step 1.
3. Before the storage takeover begins, data LIFs migrate from the target node to the node performing
the takeover or to any other node in the cluster based on LIF failover rules.
The LIF migration can be avoided by using the ‑skip‑lif‑migration parameter with the storage failover takeover command.
SMB/CIFS management
NFS management
Network and LIF management
4. Existing SMB (CIFS) sessions are disconnected when takeover occurs.
Attention: Due to the nature of the SMB protocol, all SMB sessions are disrupted, except for SMB 3.0 sessions connected to shares with the Continuous Availability property set. SMB 1.0 and SMB 2.x sessions cannot reconnect after a takeover event; therefore, takeover is disruptive and some data loss could occur.
5. SMB 3.0 sessions established to shares with the Continuous Availability property set can
reconnect to the disconnected shares after a takeover event.
If your site uses SMB 3.0 connections to Microsoft Hyper-V and the Continuous
Availability property is set on the associated shares, takeover will be nondisruptive for those
sessions.
SMB/CIFS management
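The disruption rules in steps 4 and 5 reduce to a single predicate: only SMB 3.0 sessions to shares with the Continuous Availability property survive a takeover. The version labels below follow the text; the function itself is an illustrative model, not an ONTAP API.

```python
# Illustrative predicate for whether an SMB session can reconnect after
# takeover, per the rules above: only SMB 3.0 sessions to shares with
# the Continuous Availability property set are nondisruptive.

def survives_takeover(smb_version, continuously_available):
    return smb_version == "3.0" and continuously_available

sessions = [("1.0", False), ("2.x", True), ("3.0", False), ("3.0", True)]
for version, ca in sessions:
    outcome = "reconnects" if survives_takeover(version, ca) else "disrupted"
    print(f"SMB {version}, CA={ca}: {outcome}")
```

This is why the Hyper-V guidance above hinges on both conditions at once: SMB 3.0 alone is not sufficient without the Continuous Availability share property.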
• After it reboots, the node performs self-recovery operations and is no longer in takeover mode.
• Failover is disabled.
• If the node still owns some of the partner's aggregates, after enabling storage failover, return these
aggregates to the partner using the storage failover giveback command.
5. As soon as Node B reaches the point in the boot process where it can accept the non-root
aggregates, Node A returns ownership of the other aggregates, one at a time, until giveback is
complete.
You can monitor the progress of the giveback with the storage failover show-giveback
command.
Note: The storage failover show-giveback command does not (nor is it intended to)
display information about all operations occurring during the storage failover giveback
operation.
You can use the storage failover show command to display additional details about the
current failover status of the node, such as whether the node is fully functional, whether
takeover is possible, and whether giveback is complete.
I/O resumes for each aggregate once giveback is complete for that aggregate; this reduces the overall
outage window for each aggregate.
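The aggregate-at-a-time giveback in step 5 can be sketched as an event sequence: the root aggregate returns first, the partner boots to the point where it can accept the non-root aggregates, and then each SFO aggregate is returned with its client I/O resuming individually. The aggregate names and event strings below are illustrative.

```python
# Sketch of the serial giveback in step 5: aggregates return one at a
# time, and I/O resumes per aggregate as soon as its giveback completes,
# which is what shrinks the per-aggregate outage window.

def giveback_sequence(root_aggr, sfo_aggrs):
    events = [f"giveback {root_aggr}",
              "partner boots until ready for non-root aggregates"]
    for aggr in sfo_aggrs:
        events.append(f"giveback {aggr}")
        events.append(f"I/O resumes on {aggr}")   # per-aggregate resume
    return events

for event in giveback_sequence("root_b", ["aggr_1", "aggr_2"]):
    print(event)
```

Contrast this with an all-at-once model, in which no aggregate would resume I/O until the last giveback finished; the serial, per-aggregate resume is the design choice the text highlights.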
Copyright
Copyright © 2019 NetApp, Inc. All rights reserved. Printed in the U.S.
No part of this document covered by copyright may be reproduced in any form or by any means—
graphic, electronic, or mechanical, including photocopying, recording, taping, or storage in an
electronic retrieval system—without prior written permission of the copyright owner.
Software derived from copyrighted NetApp material is subject to the following license and
disclaimer:
THIS SOFTWARE IS PROVIDED BY NETAPP "AS IS" AND WITHOUT ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE,
WHICH ARE HEREBY DISCLAIMED. IN NO EVENT SHALL NETAPP BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
NetApp reserves the right to change any products described herein at any time, and without notice.
NetApp assumes no responsibility or liability arising from the use of products described herein,
except as expressly agreed to in writing by NetApp. The use or purchase of this product does not
convey a license under any patent rights, trademark rights, or any other intellectual property rights of
NetApp.
The product described in this manual may be protected by one or more U.S. patents, foreign patents,
or pending applications.
Data contained herein pertains to a commercial item (as defined in FAR 2.101) and is proprietary to
NetApp, Inc. The U.S. Government has a non-exclusive, non-transferrable, non-sublicensable,
worldwide, limited irrevocable license to use the Data only in connection with and in support of the
U.S. Government contract under which the Data was delivered. Except as provided herein, the Data
may not be used, disclosed, reproduced, modified, performed, or displayed without the prior written
approval of NetApp, Inc. United States Government license rights for the Department of Defense are
limited to those rights identified in DFARS clause 252.227-7015(b).
Trademark
NETAPP, the NETAPP logo, and the marks listed on the NetApp Trademarks page are trademarks of
NetApp, Inc. Other company and product names may be trademarks of their respective owners.
http://www.netapp.com/us/legal/netapptmlist.aspx