IBM HPC Cluster Health Check
User Manual
Version 1.1.1.0
Revision 1.1
TABLE OF CONTENTS
CHANGE HISTORY
1.0 OVERVIEW
1.1 REQUIREMENTS
1.2 INSTALLATION
1.3 CONFIGURATION
1.3.1 Tool configuration files
1.3.2 Group configuration files
1.4 TOOL OUTPUT PROCESSING METHODS
2.0 FRAMEWORK
2.1 THE HCRUN COMMAND
2.2 THE COMPARE_RESULTS COMMAND
2.3 USING COMPARE_RESULTS AND HCRUN TO RUN HEALTH CHECKS
3.0 GUIDELINES FOR ADDING NEW TOOLS
4.0 REFERENCE
5.0 PACKAGING
5.1 RPM CONTENTS
6.0 USE CASES
6.1 INITIALIZING YOUR ENVIRONMENT
6.2 VERIFYING CONSISTENCY WITHIN A NODEGROUP
6.3 VERIFYING CONSISTENCY WITHIN A NODEGROUP OVER TIME
6.4 VERIFYING CONFIGURATION AFTER MAINTENANCE
6.5 USING BASELINE HISTORY
6.6 RUNNING NODE TESTS
6.7 RUNNING FABRIC TESTS
ABOUT
To determine the health of a cluster, it is necessary to get the current state of all the components
that make up the cluster. In most installations, these are:
- Compute nodes
- Ethernet network
- InfiniBand network
- Storage
In this document, we describe the usage of the Cluster Health Check (CHC) framework, which is
used to perform checks on these components. By working with the results of these checks, we can
state whether the cluster is healthy or not.
Change History

Version 1.1.1.0

Document changes:
- Add change history
- Add section on Packaging
- Add section on Use cases
- Include link to developerWorks page on CHC
- Updated section on adding new tools
- Altered based on CHC changes listed below

CHC changes:
- Updates to -l option output:
  - Format last column of long listing of tools and groups (-l option)
  - Output config filename for tool or tool group; helpful for determining where to change behavior
- Processing methods updates:
  - Use compare_results instead of config_check
  - compare_results improvements vs. config_check: better baseline handling; baseline history; multiple input types (xcoll results, or pipe into xcoll); use HC_NODERANGE; more error handling; veryverbose; use the term baseline instead of template
  - Add xcoll_diff to processing methods
- Add support for recommendation and comment keywords to config files
- Improve remote command support over xdsh:
  - Generate environment file and pass using xdsh
  - Handle remote command return codes from xdsh rather than getting the return code from xdsh itself
- Display recommendation keyword contents for non-passing tool results
- Added man pages
- Added bpcheck tool for x86 based clusters
1.0 Overview
The Cluster Health Check framework provides an environment for integrating and running a
subset of individual checks which can be performed on a single node or a group of nodes. The
tests are categorized into different tool types and groups, so that checks can be performed for only
a specific part of the cluster. Because the framework and toolset are extensible, users can also
create their own groups and tools for checking different components of the cluster.
The main wrapper tool is hcrun, which provides access to all tools and passes them the
environment.
1.1 Requirements
The cluster health check framework and toolkit should be installed on the xCAT server, because
many of the individual checks and the framework itself use xCAT commands. In addition, some of
the health checks are MPI parallel applications which run using the POE runtime environment.
The requirements are:
- xCAT
- POE environment
- Python 2.6
1.2 Installation
The cluster health check framework, along with some tools, is delivered as an rpm and is installed
as user root with the rpm command:
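(The package file name below is illustrative; use the rpm file delivered with your version of CHC.)
# rpm -ivh ibmchc-1.1.1.0-0.noarch.rpm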
The rpm command installs the framework, which consists of a few commands and health check
tools, under /opt/ibmchc.
To run the health checks directly from the command prompt, there are two possibilities:
# export PATH=$PATH:/opt/ibmchc/bin
# ln -s /opt/ibmchc/bin/hcrun /usr/bin/hcrun
1.3 Configuration
The cluster health check and verification tools are developed to check the health of individual
cluster components or sets of cluster components. The health check tools can be grouped into
different groups based on the cluster components checked by the tools. There are three
different types of configuration files (global, tool, group) to configure the framework, individual
tools, and tool groups. These configuration files help to integrate new health check tools seamlessly
with the framework. The general format of a configuration file is keyword-value pairs defined
under sections.
The master (global) configuration file chc.conf is created in the directory /etc/opt/ibmchc/.
The settings can be changed in the master configuration file /etc/opt/ibmchc/chc.conf as shown in
Example 1-1.
The default values are fine for most cases; however, the following parameters need to be
updated to match each cluster's unique environment:
hcuser
Some of the tests run in the PE environment. These tests must not be submitted as
root, so a dedicated non-root user must be configured. The value is exported
through the environment variable HC_USER. The user must have authority to run
applications in the PE environment.
subnet_managers
ib_switches
Specify the xCAT node groups for the InfiniBand switches as a comma-separated list.
Note that this requires the switches to be in the xCAT database and to have ssh access
enabled. Different groups for different models of switches should be created in the
xCAT database. The value is exported through the environment variable
HC_IB_SWITCHES. Tools that require access to InfiniBand switches will use this
environment variable.
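Example 1-1 did not survive in this copy of the document. The following is a hedged sketch of the master configuration file; the section header, comment syntax, and all values are assumptions, while hcuser, subnet_managers, and ib_switches are the keywords described above:
[MasterConfig]
# Non-root user for tests that run in the PE environment (exported as HC_USER)
hcuser=hcuser1
# xCAT node group(s) for the subnet managers
subnet_managers=sms
# xCAT node group(s) for the InfiniBand switches (exported as HC_IB_SWITCHES)
ib_switches=ibswitches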
1.3.1 Tool configuration files
Example 1-2 shows the configuration file for the fs_usage tool. The executable is copied to all the
specified nodes and runs on all of them; only when it exits with exit code zero (0) is the check
considered successful. The fs_usage check is part of the tool type node. The fs_usage tool
supports both tool output processing methods (xcoll and compare), as described in section
1.4, Tool output processing methods.
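The example content itself was lost in this copy. The following hedged sketch shows the general shape of a tool configuration file, using keywords named elsewhere in this document (name, description, arguments, environment, processmethod); the section header, the type keyword spelling, and all values are assumptions:
[HealthCheckTool]
name=fs_usage
description=Checks the usage of the provided filesystems
type=node
# Default arguments; can be overridden on the hcrun command line or by a group
arguments=/var /tmp
environment=
processmethod=xcoll,compare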
The configuration file makes it easy to implement or integrate new checks into the cluster health
check framework. Even the scripting language can be chosen freely, for example Bash, Python, or
Perl, as long as the interpreter is available on the target node. There are mainly two types
of scripts or checks. One type is simple query scripts, which query values and output them. The
other type is check scripts, which have logic implemented and also provide exit codes for pass, fail,
or warning. Adding a new script involves only a few simple steps.

1.3.2 Group configuration files
Example 1-3 shows a group configuration file example. All tools are executed in the given
order. The section header is called [HealthCheckToolsGroup]. The name and description keywords
are mandatory, and the environment keyword is optional. All these keywords take strings as
their values. The tools keyword is mandatory; its value is a comma-separated list
of tool names in a specific order.
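Only a fragment of Example 1-3 survived in this copy; a reconstructed sketch of its beginning follows (the group name, description, and tool list are illustrative):
[HealthCheckToolsGroup]
name=node_check
description=Health checks for compute nodes
tools=cpu,memory,fs_usage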
# The environment variables and their values. These environment variables
# are exported while running the member tools as part of the group.
Environment=
The environment variables defined in the group configuration file are exported for every tool of the
group before starting the tool. In the case of a conflict, these environment variables overwrite the
tool's own environment variables. To overwrite tool keywords from the tool configuration files (like
arguments or environment), add the following keyword to the group configuration file:
<toolname>.<keywordname>=<new value>
Example 1-4 is the same as Example 1-3 above, but with the fs_usage arguments overwritten:
# The environment variables and their values. These environment variables
# are exported while running the member tools as part of the group.
Environment=
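(The override line itself was lost here; given the <toolname>.<keywordname> syntax above and the /tmp argument used in Example 7 of section 2.1, it likely resembled the following, with an illustrative value:)
fs_usage.arguments=/tmp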
The tool keyword values redefined in the group are used while running that tool under that
group; otherwise, the default values specified for that keyword in the tool configuration file are
used. The group environment variables and redefined tool keyword values can be used to change
the behavior of a tool while it runs as part of that group.
Creating a group configuration file and integrating it with the framework is a simple one-step
process.
1.4 Tool output processing methods
Some tools can't support the compare and xcoll methods, because their output may not be in a
format suitable for those methods. Hence, the tools must define the methods
they support, as shown in section 1.3.1, Tool configuration files. The processing method plain
is the default. The processing methods can be used for health check groups as well. If
any member tool of a group does not support the specified method, then that tool is run in the
default (plain) mode.
1. Plain
The plain mode is the default mode for all the tools. In plain mode tools will be
executed and the output or results are shown without any post processing. This is useful
to check values of individual nodes.
2. xcoll
In xcoll mode the tool is executed and the output or results are piped (forwarded) to
the xCAT command xcoll. The xcoll command summarizes the output of different nodes
and nodegroups whose output matches into one output group. More details about the
xCAT command xcoll can be found at
http://xcat.sourceforge.net/man1/xcoll.1.html.
This method is generally used for the health check tools which query the attributes of the
cluster components without checking results. This method is good for checking
consistency within a nodegroup.
If a tool already does xcoll, this processing method should not be configured in the tool's
configuration file. If that is the case, a request made on the hcrun command line to use
xcoll will be ignored, and the tool will be run.
3. xcoll_diff
The xcoll_diff processing method uses the xCAT command xcoll and the
/opt/ibmchc/tools/config/compare_results utility to summarize the output of different
nodes and nodegroups, and then compares the various results and displays the differences
between each variation and a base group. The base group is either the results group that
has the largest number of nodes in it, or the group that has the node specified in the
-s <seed_node> command line argument to hcrun. More details about the xCAT
command xcoll can be found at
http://xcat.sourceforge.net/man1/xcoll.1.html.
This method is generally used for health check tools which query the attributes of cluster
components without checking results. In addition to checking for consistency within a
nodegroup, it is helpful in determining differences when the tool output has many lines.
Depending on the tool, you may need to use -v to get the full verbose output to do a
proper comparison of results.
4. compare
The compare processing method uses the /opt/ibmchc/tools/config/compare_results
utility to compare the output of the tools with a given baseline, or expected output.
Depending on the tool's configuration, compare_results may call xcoll before processing
the output. If the tool does not have xcoll as a value for the processmethod keyword in its
configuration file, it will not run xcoll; in that case, it is assumed that the output of the tool
is already in xcoll format.
The baseline can be created by initially running the health check tool against a given seed
node. This creates a baseline template with the output of the seed node. Typically, the user
will verify the template when it is first created. On further invocations of the health check,
the results of the different nodes are checked against the baseline. The baseline template can
be regenerated or replaced by running the health check tool again with the same or a different
seed node. Template regeneration is typically done after maintenance is applied that will
change the tool's results.
Note: It is extremely important that the seed node has the expected results so
that subsequent invocations only reveal mismatches to a known good result.
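As an illustration, the processing method is selected with the -m option of hcrun (the nodegroup and node names here are illustrative; see section 2.1 for the full syntax):
# Summarize identical output across the nodegroup
hcrun -n compute -m xcoll -t cpu
# Show only differences between results groups, using node01 as the base
hcrun -n compute -s node01 -m xcoll_diff -t cpu
# Create or compare against a baseline seeded from node01
hcrun -n compute -s node01 -m compare -t cpu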
2.0 Framework
The Cluster Health Check framework consists mainly of the hcrun command and the
utility command compare_results. In this chapter, the framework commands hcrun and
compare_results are described.

2.1 The hcrun command
NAME
hcrun - Runs individual health checks or groups of health checks and processes
their results.
PURPOSE
Runs individual health checks or groups of health checks on a range of nodes or
node groups. The results of the health checks are processed using different
processing methods.
SYNTAX
hcrun [-h] [-v] [-c] { -l | [-f] -n noderange [-s seed_node] [-m process_method]
{ [-p] group[,group,...] | -t tool[,tool,...] [cmd_args] } }
DESCRIPTION
There are various health checking and verification tools available for checking the
cluster health during cluster bring-up, maintenance and problem debug time.
These tools are developed to check the health of individual cluster components or
group of cluster components. At any time, these individual tools or a group of
tools can be executed to check or verify the health of the cluster components. The
health check tools are grouped into different groups based on health check
functionality or components tested by them. There are separate individual
configuration files for each tool customizing the tool to support different cluster
environments.
The hcrun command is used to run cluster health checking tools interactively.
The xCAT command xdsh is used to run health checking tools on the specified
set (node range) of nodes. If nodes are not specified, then the health checking tools
are executed on the local node (the Management Node), based on the configuration of the
tool. If a tool is configured to run on multiple nodes but no nodes are provided, it
is an error. The hcrun command accepts either a single
health check tool with its command line arguments or a list of health check tools
separated by commas. If a list of health check tools is specified, then the
command line arguments specified in the configuration files of the individual tools are
used while executing the respective tools, and the tools are executed in the order
they are specified.
hcrun also accepts a list of health checking tool group names separated by
commas. If group names are specified, all the tools of the individual groups are
executed in the order specified in the group configuration file, until either the first
health check tool fails or all the tools have executed successfully. Even though the
tools of a group are executed in the order they are specified in the group
configuration file, the pre-check tools of individual tools are always executed
before the individual tool is executed. If the -c (continue) option is specified, then all
the health check tools of the groups are executed even if one or more health check
tools fail. By default, the execution of the health check tools of a group stops
once a tool fails. If multiple groups are specified, then the tools of the groups are
executed in the order the groups are specified on the hcrun command line.
The cluster health check tools of the groups are executed on the same nodes
specified on the hcrun command line. The tools are executed using the xdsh command.
If any tool of a group should not be started using the xdsh command, then the
configuration keyword StartTool of that tool should be set to local, in
which case that particular tool is started on the local node (the Management Node) only.
If for any reason a particular tool wants to know the set of nodes (node range), it
can access the environment variable HC_NODERANGE, which is always set by the
hcrun command to the node range specified on the hcrun command line.
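For instance, a tool that manages its own fan-out might be configured and written as sketched below; StartTool and HC_NODERANGE are documented here, while the configuration placement and the command itself are illustrative:
# In the tool's configuration file, so hcrun starts it on the Management Node only:
StartTool=local
# Inside the tool script, fan out to the nodes hcrun was given:
xdsh $HC_NODERANGE 'uname -r'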
The hcrun command saves the cluster health checking tool results in the file
specified by the keyword toolsoutput in the master configuration file. These
results are used for historical analysis. The hcrun command also saves a
summary of the cluster health checking tools in a summary file specified by the
keyword summaryfile. The summary file has brief details of the cluster
health checking tools, like start and end times of the tools along with the final status
(Fail/Pass) of the tools. In addition to storing the tool results in the file specified
by the keyword toolsoutput, hcrun also displays the health checking tool results
on stdout.
The output of the tools can also be processed using different utility tools, like the
xcoll command. By default, the health checking tools are run in plain mode, where
the results of the tools are simply displayed. The hcrun command also supports three
other methods: xcoll, xcoll_diff, and compare. When xcoll is used, the tool output is
piped through the xcoll command of xCAT before display; xcoll
analyzes the tool output and segregates the nodes which have common output.
This helps to identify the set of nodes on which the health check passed
and on which it failed, or to reveal unexpected inconsistencies.
When the compare method is used, the output of the health check tools from each
node is compared against a template (baseline output). If the output from one or more
nodes does not match the baseline template, alternative templates are displayed along
with the nodes matching those alternatives. This method can be
used to find the nodes which deviate from the known baseline. The
health check is considered to have passed if the tool output from all the given
nodes matches the template; otherwise the health check is considered to have
failed.
OPTIONS
-v The hcrun command and cluster health checking tools are executed in
verbose mode. The environment variable HC_VERBOSE is set to 1 and
exported to the health checking tools if hcrun is run with the -v option. By default,
HC_VERBOSE is set to 0. To support verbose mode, each cluster health
checking tool should check the HC_VERBOSE environment variable. If
HC_VERBOSE=1, then the tool should run in verbose mode.
-l The command lists all the configured cluster health checking tools according
to their types. The defined tool groups are also listed, along with their member
tools and other attributes. If the -l option is specified along with one or more tool
names, then all attributes defined for those tools are listed. If the -l option is
specified along with one or more groups, then the tools of the specified groups
and other attributes of the groups are listed.
-n <NODERANGE>
The cluster health checking tool specified, or the cluster health checking tools
of the specified group, are executed on the nodes specified by NODERANGE.
The syntax of the node range is the same as that supported by xCAT. A detailed
explanation of the NODERANGE syntax can be found in the xCAT documentation
at http://xCAT.sourceforge.net/man3/noderange.3.html. The
environment variable HC_NODERANGE is set to the node range specified and
exported to the health checking tools.
-f Generally (by default) the execution of the health checking is stopped if one or
more nodes in the specified node range are not reachable. If the -f (force) option
is specified, then the execution of the health checking continues even if one or
more nodes are not reachable.
group[,group,...]
One or more group names, separated by commas, whose cluster health
checking tools are executed. The tools of the groups are executed in the order
the groups are specified. The tools within a group are executed in the order
the tools are defined in that group.
-p If groups are specified on the command line, the tools of the individual groups are
executed in the order the tools are defined in the respective groups. If the -p (preview)
option is specified along with groups, then the health checking tools are listed
in the order they would be executed, but the tools are not actually
executed.
-t tool[,tool,...]
The tools specified are executed. If only a tool name is specified without
any command line arguments, or a list of tools is
specified, then the default command line arguments configured in the
configuration files of the tools are used while executing the respective tools. If
a tool name is specified along with command line arguments, then the tool is
executed using the specified command line arguments, overriding the default
command line arguments configured in the configuration file.
<cmd_args>
The command line arguments passed to the tool when a single health check tool
name is specified.
-m <process_method>
The method used to post-process the output of the tools. The supported
methods are xcoll, xcoll_diff, and compare. When the xcoll method
is used, the tool output is piped through the xcoll command of xCAT before
display; the xcoll command analyzes the tool output and segregates
the nodes which have common output. If xcoll_diff is used, the tool
output is piped through xcoll and then compare_results, to indicate the
differences in results between various nodes. If the compare method is
used, the output of the health check tools from each node is compared against a
template (baseline output). If the output from one or more nodes does not match
the baseline template, alternative templates are displayed along with the nodes
matching those alternatives. hcrun uses the utility
command compare_results (see section 2.2, The compare_results
command) to compare the tool results with the baseline.
For more information on processing methods, see section 1.4, Tool output
processing methods.
-s <seed_node>
When a seed node is specified along with the processing method compare, the
baseline template is first created by running the tool on the seed node. Then the output
of the health check tools from each node is compared against the generated template
(baseline output). If the output from one or more nodes does not match the
baseline template, alternative templates are displayed along with the nodes matching
those alternatives. hcrun uses the utility command
compare_results (see section 2.2, The compare_results command) to
compare the tool results with the baseline.
When the processing method xcoll_diff is used, the seed node's results are
considered to be the base results for the difference calculation in
compare_results, as opposed to using the results from the largest number of
nodes.
EXAMPLES
1. To list all the health check tools available, run the hcrun command without passing
any arguments, as shown below.
e119f3ems2:~ # hcrun
config_check : Util tool used to compare the query attributes of nodes against a seed node
switch_module : Query the switch modules installed
run_ppping : Runs ppping interconnect test
ibdiagcounters_clear : Util tool to clean IB diag counters
fs_usage : Checks the usage of the provided filesystems
run_nsdperf : Runs nsdperf interconnect test
switch_health : Checks the switch health report
syslogs_clear : Util tool to clean the syslogs
firmware : Checks the BMC and UEFI firmware of the node
run_jlink : Runs jlink interconnect test
hca_basic : Checks basic IB fabric parameters
file_rm : Util tool to remove files copied to the execution node
nfs_mounts : Checks nfs mounts
memory : Checks the Total memory on the node
jlinksuspects : Correlates sub-par jlink BW with BER calculations per Subnet Manager(s)
jlinkpairs : Gets all jlink pairs whose BW is below threshold
hostfile_copy : Util tool to copy hostfile to the execution node
lastclear : Returns timestamp from last clearerrors on the Subnet Manager(s)
clearerrors : Clears the IB error counters on the Subnet Manager(s)
switch_inv : Query the switch inventory
switch_clk : Checks the switch clock consistency
run_daxpy : Runs daxpy test on node(s)
run_dgemm : Runs dgemm test on the node(s)
leds : Checks the leds of the node
ibtools_install : Installs IB Tools on the Subnet Manager(s)
temp : Checks all temperatures
gpfs_state : Checks the gpfs state
ibtoolslog_clean : Archives the ibtools logs on the Management Node and Subnet Manager(s)
switch_code : Checks to see if the switch code is consistent
ipoib : Query IPoIB settings
hostfile_rm : Util tool to remove hostfile to the execution node
ibqber : Runs an IB query and BER calculation on the Subnet Manager(s)
switch_ntp : Checks to see if the switch ntp is enabled
os : Checks the OS and kernel version
cpu : Checks for the CPU status
berlinks : Gets the links from a BER calculation on the Subnet Manager(s)
file_copy : Util tool to copy required files to the remote node
e119f3ems2:~ #
DESCRIPTION
Each tool name is displayed, followed by a colon and a short description of the tool.
2. To list the tools as per their type and the tool groups in which they are members, run
the following command.
e119f3ems2:~ # hcrun -l
The Tool Type : The Tools : Tool Description
==============================================================
node : firmware : Checks the BMC and UEFI firmware of the node
temp : Checks all temperatures
run_daxpy : Runs daxpy test on node(s)
nfs_mounts : Checks nfs mounts
fs_usage : Checks the usage of the provided filesystems
leds : Checks the leds of the node
os : Checks the OS and kernel version
run_dgemm : Runs dgemm test on the node(s)
gpfs_state : Checks the gpfs state
memory : Checks the Total memory on the node
cpu : Checks for the CPU status
tools_util : config_check : Util tool used to compare the query attributes of nodes against a seed node
file_rm : Util tool to remove files copied to the execution node
syslogs_clear : Util tool to clean the syslogs
file_copy : Util tool to copy required files to the remote node
hostfile_copy : Util tool to copy hostfile to the execution node
hostfile_rm : Util tool to remove hostfile to the execution node
ibdiagcounters_clear : Util tool to clean IB diag counters
=============================================================
interconnect_test : Description : These are intrusive interconnect health testing tools
Environment :
Tools : ['run_ppping', 'run_jlink', 'run_nsdperf']
e119f3ems2:~ #
DESCRIPTION
First the tools are listed as per their type. The configured types are node,
tools_util, interconnect, ib, and tools_ib. Then the configured groups
node_test, node_check, ib_check, interconnect_test, and run_ibtools
are displayed along with their member tools.
3. The attributes of a tool can be displayed using the hcrun command as shown below.
e119f3ems2:~ # hcrun -l cpu
DESCRIPTION
The tool cpu along with its various attributes is displayed.
4. The attributes of a configured tool group can be listed using the hcrun
command as shown below.
e119f3ems2:~ # hcrun -l node_test
The Group Name : Group Attributes : Attributes Values
================================================================
==========
node_test : Description : These are intrusive node health testing tools.
Environment :
Tools : ['run_dgemm', 'run_daxpy']
e119f3ems2:~ #
DESCRIPTION
The tool group node_test along with its various attributes is displayed.
5. The health check tool cpu, which queries different CPU attributes and displays
them, can be run using the hcrun command as shown below.
DESCRIPTION
As shown in the health check tool cpu output, various attributes from nodes
e119f4m1n04 and e119f4m1n05 are displayed.
6. To process the health check tool output using the xcoll method, run the hcrun
command as shown below.
DESCRIPTION
The results from the health check tool cpu are processed by the xcoll method and
summarized. Here all the nodes, e119f4m1n06, e119f4m1n03, e119f4m1n05,
e119f4m1n04, and e119f4m1n07, returned the same results from the cpu health
check tool.
7. To overwrite the default arguments defined for a health check tool, run the hcrun
command specifying the command line arguments for the tool, as shown below.
DESCRIPTION
As shown above, the second execution of the health check tool fs_usage uses the
argument (/tmp) passed on the command line. The first execution uses the default
arguments defined in the configuration file.
8. To process the health check tool group results using the xcoll method, run the hcrun
command as shown below.
[root@c933mnx01 node]# hcrun -n c933f01x31,c933f01x33 -c -m xcoll node_check
====================================
c933f01x33
====================================
CPU Model: Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
Turbo HW: Off
Turbo Engaged: No
HyperThreading HW: Enable
Socket(s): Core(s) per socket: 4
Active Cores: 16
scaling_governor: performance
scaling_max_freq: 2793000
scaling_min_freq: 1596000
====================================
c933f01x31
====================================
CPU Model: Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
Turbo HW: Off
Turbo Engaged: No
HyperThreading HW: Enable
Socket(s): Core(s) per socket: 4
Active Cores: 16
scaling_governor: ondemand
scaling_max_freq: 2793000
scaling_min_freq: 1596000
====================================
c933f01x31
====================================
Ambient Temp: 30 C (86 F)
9. To process the health check tool group results using the xcoll_diff method, run the
hcrun command as shown below; here the results from both nodes match.
[root@c931mnp01 node]# hcrun -n c931f07p01,c931f07p02 -v -c -m xcoll_diff node_check
The above results indicate that both nodes passed to hcrun had the same results. Note that
-v was used to assure that verbose output was given by the cpu tool. Otherwise, there
would be nothing to compare.
10. To process the health check tool group results using the xcoll_diff method, run the
hcrun command as shown below; here the results from the two nodes differ.
[root@c931mnp01 node]# hcrun -n c931f07p01,c931f07p02 -v -c -m xcoll_diff -t fs_usage
##############################################################################
Differences between:
c931f07p01 (nodegroup1)
AND
c931f07p02 (nodegroup2)
##############################################################################
-/tmp: 2%
?      ^
+/tmp: 1%
?      ^
Changed from nodegroup1 to nodegroup2
=====================
-/var/tmp: 2%
?          ^
+/var/tmp: 1%
?          ^
Changed from nodegroup1 to nodegroup2
=====================
The health check tool fs_usage [ PASS ]
Note: The fs_usage tool passed because the filesystems being checked weren't at warning
or error levels. However, xcoll_diff reported a difference in the results reported by the
two nodes. xcoll_diff uses the results group with the largest number of nodes as the base,
but because only two nodes were being checked, both results groups had only one node in
them. xcoll_diff breaks the tie by using the first one to report back from xcoll. You can see
that the difference between the two nodes' results is that one has 1% filesystem use and the
other has 2%. Note that the caret (^) is under the 1 and the 2 in the results, which helps
you see the differences quickly.
2.2 The compare_results command
NAME
compare_results - Runs health checks and compares the output against a baseline
PURPOSE
Runs the commands from the command file on a single node or range of nodes,
and node groups. The output from each node is compared with a baseline template
or output collected from a seed node. Differences are highlighted. It is intended as
a utility or wrapper for health check commands/scripts, but can be used standalone
by taking output piped from xcoll.
Note: It is very useful when the results should have the same strings every time. It
does not work with ranges of results or regular expressions.
SYNTAX
compare_results -f|--file <filename> [-s|--seednode <seednode>]
[-b|--baseline <baseline>] [-K|--keepbaseline] -c|--command <command>
[-m|--mode <command mode>] [-l|--log <log directory>]
[-v|--verbose] [--veryverbose]
OPTIONS
-h, --help show this help message and exit
-f FILENAME, --file=FILENAME
A file that contains results from xcoll. Default is to
use STDIN.
-s SEEDNODE, --seednode=SEEDNODE
The node that is considered to be the example against
which the others are compared.
-b BASELINE, --baseline=BASELINE
A file that contains the results against which to
compare.
-c COMMAND, --command=COMMAND
To run against remote nodes, the command must be able
to do so on its own.
-m COMMAND_MODE, --mode=COMMAND_MODE
The mode to use: None| xcoll = pipe through xcoll;
xdshcoll = pipe through xdshcoll (doesn't collapse to
groups); noxcoll = don't pipe through xcoll
-l DIRECTORY, --log=DIRECTORY
Log directory
-v, --verbose Verbose mode. Default is quiet except on error.
The environment variable HC_VERBOSE=1 will also turn it
on. The command line option overrides HC_VERBOSE=0.
--veryverbose Very verbose mode. Shows the progress of the
program. This includes verbose output.
DESCRIPTION
The compare_results utility command is used to run commands from a given
command file or script on the specified node(s) or node groups. The output
generated from the commands of the command file is compared, using the xCAT
command xcoll, against the baseline/template whose name is specified (typically
by hcrun). While comparing the results with the baseline, if the output from one
or more nodes does not match the baseline, then alternate results are generated
which contain those varying results. The alternative results are
temporarily saved in /tmp. They are named by appending a
sequence number, starting from 1, to the basename, which is compare_results.
$PID. The alternate results are removed as soon as the test is completed. The
output log describes the differences from the baseline.
The commands specified in the command file are run on nodes or node groups
using the xCAT command xdsh. If the commands specified in the command file are
remote execution commands like rinv or rvitals, then the commands of the
command file have to be run locally on the Management Node. By default, the
compare_results command displays a pass/fail health check result by comparing the
health check output with the baseline, or against the seed node. It can also be run in
verbose mode to display the details of the health check result. The
veryverbose mode shows the same details as verbose, and it also shows the
command's progress. The details or logs of the health checks are also saved in a log
file in the directory specified by the environment variable HC_LOGDIR.
ENVIRONMENT VARIABLES
The following environment variables are used by compare_results:
HC_LOGDIR = where to log results
HC_NODERANGE = node range to use
HC_VERBOSE = 1 = verbose; 0 = no verbose
HC_VERYVERBOSE = 1 = veryverbose; 0 = no veryverbose; veryverbose
includes verbose output
EXAMPLES
You can pipe the results from xCAT's xcoll into compare_results:
xdsh | xcoll | compare_results
You can point the command to a file that has results from xcoll
xdsh | xcoll > FILE
compare_results -f FILE
You can have compare_results run a command on all nodes in a node
group and compare against the unique results from the largest number of
nodes. The example runs /opt/ibmchc/ppc64/node/cpu against f02:
compare_results -c 'xdsh f02 -v -e /opt/ibmchc/ppc64/node/cpu'
You can have compare_results run a command on all nodes in a
node group and compare against the unique results from a seed node:
compare_results -c 'xdsh f02 -v -e /opt/ibmchc/ppc64/node/cpu' -s f02n05
You can have compare_results run a command on all nodes in a node
group, compare against the unique results from a seed node, and save the
seed node's results as a baseline:
compare_results -c 'xdsh f02 -v -e /opt/ibmchc/ppc64/node/cpu' -b
/var/opt/ibmchc/data/f02.cpu.baseline
You can have compare_results run a command on all nodes in a node
group and compare against a saved baseline.
You can have compare_results run a command that does xdsh and
xcoll against a node group and compare against the unique results from the
largest number of nodes.
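(The command lines for these last two examples were lost in this copy; reconstructed sketches with illustrative paths follow:)
compare_results -b /var/opt/ibmchc/data/f02.cpu.baseline -c 'xdsh f02 -v -e /opt/ibmchc/ppc64/node/cpu'
compare_results -m noxcoll -c 'xdsh f02 -v -e /opt/ibmchc/ppc64/node/cpu | xcoll'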
EXCEPTIONS

Exception: Cannot open <file>. The results input file must exist.
Comment: The results input file given with -f can't be opened. It doesn't exist, or there's an exception problem.

Exception: Cannot open <file> for write. Does the path exist?
Comment: This occurs when an output file, like the baseline or log file, cannot be written. Maybe the filesystem is full? Is there a permissions problem with the path?

Exception: There were no results groups found. Perhaps the nodegroup or seed for the command is not correct. Perhaps the input was not piped through xcoll? If using a command (-c), make sure it returns in xdsh format. If using -m noxcoll with a command, make sure the command returns xcoll format. Exiting.
Comment: With hcrun, the most likely issue is that the output of the command provides xcoll output and 'noxcoll' was used as the processing method, or vice-versa.

Exception: The baseline must exist, unless you supply a seed node. <baseline> Re-run with a seed node.
Comment: You attempted to check against a baseline and none exists.

Exception: Two different nodes in the baseline: <node1> and <node2>. Re-run so that only one seed node is found.
Comment: This typically happens when you are comparing between nodes that are configured differently, either by design or because of a config problem.

Exception: <node> is not in nodegroup '<nodegroup>' of the results group. Try again, making sure that <node> is a member of the results group. Exiting.
Comment: This typically happens when you have provided a seednode that is not in the nodegroup given to hcrun, or in the command passed via -c to compare_results.

Exception: Baseline results can't have more than one results group. Make sure that you get only one node's or nodegroup's results from the command. You should fix the discrepancies with the non-seed nodes or the seed node, first. Exiting.
Comment: You can't save a baseline when there are mismatches in the nodegroup's configuration. Fix those first.

Exception: Cannot open temp file "<file>". - Is /tmp full?
Comment: The temp files are used to store results before they are compared. Typically the problem is that /tmp is full, but it could be a permissions change to /tmp.

Exception: Command returned nothing. Please review: <command string>
Comment: This is usually a problem with the command that was passed via -c, or run as a tool called by hcrun.
The compare_results utility tool is designed to run under the hcrun command,
which exports the HC_TEMPLATESDIR and HC_LOGDIR values specified in the master
configuration file of hcrun. The baseline name is automatically generated from the
health check tool name and the health check tool group name when run under the hcrun
command.
2.3 Using compare_results and hcrun to run health checks
The following steps have to be followed to run a health check using compare_results
under the hcrun command.
Create a script that has consistent output on a given set of nodes. Sometimes
this can be a simple wrapper for a command, and may involve using grep -v to
remove unique parameters, such as IP addresses.
Create the hcrun configuration file for the health check tool and define the
processmethod of the health check tool as compare, and possibly
xcoll_diff, as shown in Example 3-1 below:
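Example 3-1 was lost in this copy; the relevant configuration line likely resembles the following (keyword as documented in section 1.3.1):
processmethod=compare,xcoll_diff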
The first time you run compare_results against a node group via hcrun
-m compare, a seed node has to be specified (-s <seed_node>) to create a baseline based on
the results from the seed node. On the first run, you should
use verbose mode in hcrun (-v). You should check the baseline that is output
(the name is indicated in STDOUT), and be sure that the results are
as expected. Even if you don't do this, you will at least know whether all of the nodes in
the node group get consistent results relative to each other, because
compare_results will indicate if there's an inconsistency, and it will not save
the baseline until all nodes in the nodegroup give the same result. For other
possible error messages, see the table of error messages in the EXCEPTIONS
section above.
In subsequent runs, you may use verbose or not, as long as the tool displays
the data. If there is a failure, you can find the latest output file listed in
STDOUT, which will indicate where there are inconsistencies from the
baseline.
If all nodes mismatch the baseline, and you agree with the new values, you
will want to re-run with the seed node, so that a new baseline template is
generated. This will be used for comparison on all new runs, and the old
baseline will be saved for future reference.
EXAMPLES
1. To run the health check tool cpu using compare_results under the hcrun
command, run the following command:
Note: hcrun calls the compare_results utility when the compare method is
specified.
2. To run the health check tool cpu using compare_results under the hcrun
command in verbose mode, run the following command:
Note: hcrun calls the compare_results utility when the compare method is
specified.
Everything matched
DESCRIPTION
The compare_results tool will indicate whether nodes match or do not match the
baseline template, and the output file that contains the details.
The example indicates that all nodes in the nodegroup (f07) matched the baseline
results.
This output line indicates where the detailed output file is to be found:
Using baseline
Log file in
"/var/opt/ibmchc/log/cpu.node_check.compute.f07/compare_results.log.20140227_144208"
When run under hcrun, the above details are only output in verbose mode. In
either case, an indication of pass/fail is output:
4. To run the health check tool cpu and create a baseline, do something similar to the
following, which uses f07n01 as the seed node:
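(The command line was lost in this copy; following the pattern of the other examples, it likely resembled:)
e119f3ems2:~ # hcrun -n f07 -m compare -s f07n01 -v -t cpu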
DESCRIPTION
The -s f07n01 indicates the seed node, or the node that is considered the example
for the nodegroup.
The output indicates that a baseline was created and where it was created.
The output indicates that the command was successful. If there were nodes in f07
that didn't match the output of f07n01, the command would fail, and indicate
which output lines were mismatching and where they were mismatching.
5. The following indicates a mismatch in the output vs. that in the baseline:
e119f3ems2:~ # /opt/ibmchc/bin/hcrun -n f07 -m compare -v -t cpu
Using baseline file /opt/ibmchc/conf/templates/cpu.f07.template
##############################################################################
Differences between:
f07n01 (nodegroup1)
AND
f07n02 (nodegroup2)
##############################################################################
? ^
? ^
=====================
Log file in
"/var/opt/ibmchc/log/cpu.node_check.compute.f07/compare_results.log.20140227_144334"
The health check tool cpu [ FAIL ]
e119f3ems2:~ #
DESCRIPTION
The baseline file being used is indicated.
There are differences found between f07n01 (the baseline seed node) and
f07n02. The nicknames nodegroup1 and nodegroup2 are used; this makes
more sense when there is a large list of nodes in one or both of the
nodegroups. These are not xCAT nodegroups, just nicknames used
by compare_results.
In this case, the difference is that f07n02 has 62 instead of 63 online CPUs.
hcrun indicates a FAIL.
DESCRIPTION
Each tool in the node_check tools group is run individually, and you see that all of
them PASS. In each case, the baseline is indicated for each individual tool. Notice
that the node_check tools group name is included in the baseline filename.
3.0 Guidelines for adding new tools
13. To be of more help to the users, it is suggested that you use the recommendation
keyword in the tool configuration file as a way to indicate how to resolve
problems found by the tool.
14. The comment keyword can be used to provide a little more information about
the tool, including how you envision it being used.
15. If you are writing a new tool to be used under CHC rather than incorporating an
existing tool, consider the following:
1. Consider if it is best to have a tool that does all of the remote operations
itself or if it would be better to leverage hcrun's capability to run commands
remotely. For example, if hcrun did not exist, would you run the tool using
xdsh with the -e option, or have the tool use xdsh to access the nodegroup?
2. Consider how a user can leverage the various processing methods for the
new tool:
1. If you write the tool to be run on the management node and use xdsh
and xcoll on its own, then you won't find as much use for xcoll,
xcoll_diff, and compare. This would be similar to running the tool by
itself without hcrun.
2. If you write the tool so that it can leverage CHC's capability to do xdsh
and xcoll, the user can use xcoll to view detailed results (when verbose
is given) for each results group, much as they might by simply running
the tool directly. The user can use xcoll_diff to display only the differences
between node group results. The user can use the compare method to store
baselines and compare for consistency over time. These methods are
best used for tools that produce consistent results across a group of
nodes, where the results do not have a range of values or otherwise
match a regular expression rather than an exact string.
3. The HC_PROCESS_METHOD environment variable can be passed to
the tool to vary how the tool presents output based on the user's chosen
processing method option.
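As a concrete illustration, here is a minimal sketch of a tool that adapts its output to the chosen processing method. Only the environment variable names and method names are taken from this document; everything else is illustrative:
#!/bin/bash
# Hypothetical CHC tool sketch: report basic CPU facts for this node.
# hcrun exports HC_PROCESS_METHOD (and HC_VERBOSE) to the tool, as described above.
case "${HC_PROCESS_METHOD:-plain}" in
    xcoll|xcoll_diff|compare)
        # Stable, sorted output compares cleanly across nodes
        grep -E '^(model name|cpu cores)' /proc/cpuinfo | sort -u
        ;;
    *)
        # Terse output for plain mode
        echo "CPUs online: $(getconf _NPROCESSORS_ONLN)"
        ;;
esac
# Exit code 0 reports a passing check to the framework
exit 0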
4.0 Reference
1. IBM developerWorks page for IBM Cluster Health Check:
https://www.ibm.com/developerworks/community/wikis/home?
lang=en#!/wiki/Welcome%20to%20High%20Performance%20Computing
%20%28HPC%29%20Central/page/IBM%20Cluster%20Health%20Check
2. The details about the xdsh command can be found at
http://xCAT.sourceforge.net/man1/xdsh.1.html
3. The details about the xcoll command can be found at
http://xCAT.sourceforge.net/man1/xcoll.1.html
4. The details about the xCAT sinv command can be found at
http://xcat.sourceforge.net/man1/sinv.1.html
5.0 Packaging
The rpm path structure is illustrated below:
/etc/opt/ibmchc
/opt/ibmchc/bin
/opt/ibmchc/conf/groups/chc
/opt/ibmchc/conf/templates
/opt/ibmchc/ppc64/bin
/opt/ibmchc/ppc64/node
/opt/ibmchc/share/man/man8
/opt/ibmchc/tools/config
/opt/ibmchc/tools/ext
/opt/ibmchc/tools/ib/ibtools/lib/perl
/opt/ibmchc/tools/node
/opt/ibmchc/tools/util
/opt/ibmchc/x86_64/bin
/opt/ibmchc/x86_64/node
/var/opt/ibmchc/data
/var/opt/ibmchc/logs
5.1 RPM contents

Path Description
/etc/opt/ibmchc/chc.conf Overall config file for IBM CHC.
/opt/ibmchc/conf/groups/chc Contains configuration files for main tools
groups
/opt/ibmchc/conf/groups/chc/ibswitch_check.conf Tools group for healthcheck of InfiniBand
switches.
/opt/ibmchc/conf/groups/chc/interconnect_test.conf Tools group for running InfiniBand fabric test
tools (mostly various bandwidth and
connectivity tests)
/opt/ibmchc/conf/groups/chc/node_check.conf Tools group for node healthchecks
/opt/ibmchc/tools/config/compare_results Comparing results from other tools run against
a nodegroup. Called by hcrun when -m
compare is requested
/opt/ibmchc/tools/config/compare_results.conf Config file for compare_results
/opt/ibmchc/tools/ib/ibtools/chkpairs_oversw Getting pair information across a switch fabric
/opt/ibmchc/tools/ib/lastclear Determines when the last error clear was done
by clearerrors.sh
/opt/ibmchc/tools/ib/lastclear.conf Config file for lastclear
/opt/ibmchc/tools/ib/switch_module.conf Config file for switch_module
/opt/ibmchc/tools/node/os.conf Config file for os
/opt/ibmchc/tools/util/ibdiagcounters_clear.conf Config file for ibdiagcounters_clear
6.0 Use cases
More use cases may be available on the IBM developerWorks page for IBM Cluster
Health Check: https://www.ibm.com/developerworks/community/wikis/home?
lang=en#!/wiki/Welcome%20to%20High%20Performance%20Computing%20%28HPC
%29%20Central/page/IBM%20Cluster%20Health%20Check
6.2 Verifying consistency within a nodegroup

The current method for checking consistency of results within a nodegroup is to use the
xcoll or the xcoll_diff processing method, where a nodegroup could be server nodes
or switch devices.
For example, to run the cpu tool against node n1 in the nodegroup 'compute', use:
hcrun n1 -v -t cpu
# The use of -v may vary based on the tool. The idea is to make sure you get the
# detailed results and not just pass/fail at this point.
Check the results and verify that they are as you expect. Make any updates or
repairs to get to the desired results.
Verify the nodegroup.
Depending on how you like to approach verification, you may wish to see
detailed results at first and then later use the processing method to report only the
differences from the example/seed node.
# Only report differences
hcrun compute -v -s n1 -m xcoll_diff -t cpu
# Report all results after grouping all nodes that have the same results
hcrun compute -v -s n1 -m xcoll -t cpu
Iterate until all nodes in the nodegroup have the same results
6.3 Verifying consistency within a nodegroup over time

The following example uses node n1 as the seed node and compute as the nodegroup
of interest.
As in 6.2, Verifying consistency within a nodegroup, choose a seed node and
verify that it is configured as desired.
Once the seed node is as expected, save a baseline result:
hcrun n1 -v -m compare -s n1 -t cpu
Now, you can use this baseline to check the other nodes in the nodegroup and, in
the future, to check all the nodes in the nodegroup (including the seed node)
against the baseline:
hcrun compute -v -m compare -t cpu
6.5 Using baseline history

When a new baseline is requested, the old one's filename is updated with the timestamp
of when it was last modified and the current timestamp (when the new baseline
replaces it): <baseline name>.<last modified time>-<current time>, where the format of
the timestamps is YYYYmmdd_HHMMSS.
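For example, an old cpu baseline for nodegroup f07 might be renamed as follows (the name and timestamps are illustrative):
cpu.f07.template.20140227_144208-20140301_091500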
# The baseline1 and baseline2 arguments are required as short names of the baseline
# files. This is quirky because fc.py is normally used to compare results groups, and
# those would normally be the nodegroups for each results group file. The arguments
# are not optional for fc.py. You could use diff or sdiff instead, but fc.py output
# may be easier to understand.
6.6 Running node tests

The following examples use -v and -r to assure verbose output and to get
recommendations displayed on failure.
DGEMM:
hcrun [nodegroup] -v -r -t run_dgemm
DAXPY:
hcrun [nodegroup] -v -r -t run_daxpy
6.7 Running fabric tests

The following examples use -v and -r to assure verbose output and to get
recommendations displayed on failure.
PPPING:
hcrun [nodegroup] -v -r -t run_ppping
JLINK:
hcrun [nodegroup] -v -r -t run_jlink
NSDPERF:
hcrun [nodegroup] -v -r -t run_nsdperf