Node Eviction


11g R2 RAC: Node Eviction Due To CSSDagent Stopping

In addition to the ocssd.bin process, which is responsible, among other things, for the network
and disk heartbeats, Oracle Clusterware 11g Release 2 uses two new monitoring processes,
cssdagent and cssdmonitor, which run with the highest real-time scheduler priority and are also
able to fence a server.
Find out the PIDs of cssdagent and cssdmonitor:
[root@host02 lastgasp]# ps -ef |grep cssd |grep -v grep
root 5085 1 0 09:45 ? 00:00:00 /u01/app/11.2.0/grid/bin/cssdmonitor
root 5106 1 0 09:45 ? 00:00:00 /u01/app/11.2.0/grid/bin/cssdagent
grid 5136 1 0 09:45 ? 00:00:02 /u01/app/11.2.0/grid/bin/ocssd.bin
Find out the scheduling priority of cssdagent
[root@host02 lastgasp]# chrt -p 5106
pid 5106's current scheduling policy: SCHED_RR
pid 5106's current scheduling priority: 99
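To see the scheduling class of all three daemons in one pass, here is a minimal sketch (my addition, not part of the original post), assuming pgrep and chrt are available:
for pid in $(pgrep -f 'cssdagent|cssdmonitor|ocssd.bin'); do
    ps -p "$pid" -o pid=,comm=    # which daemon this PID belongs to
    chrt -p "$pid"                # its scheduling policy and priority
done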
Since cssdagent and cssdmonitor have a scheduling priority of 99, stopping them can reset a server in case:
- there is some problem with the ocssd.bin process
- there is some problem with the OS scheduler
- CPU starvation occurs
- the OS is locked up in a driver or hardware (e.g. an I/O call)
Both of them are also associated with an undocumented timeout: if the execution of these
processes stops for more than 28 seconds, the node will be evicted.
Let us stop the execution of cssdagent for 40 seconds:
[root@host02 ~]# kill -STOP 5106; sleep 40; kill -CONT 5106
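If you prefer not to hard-code the PID, the same test can be scripted; a minimal sketch (my addition), assuming pgrep is available and that this is run as root on a disposable test cluster, since suspending cssdagent beyond the roughly 28-second limit will fence the node:
HANG_SECS=40                          # longer than the ~28 s limit, so the node gets fenced
PID=$(pgrep -f '/bin/cssdagent$')     # look up the cssdagent PID dynamically
kill -STOP "$PID"                     # freeze cssdagent
sleep "$HANG_SECS"
kill -CONT "$PID"                     # resume it; by now the node has already been rebooted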
Check the alert log of host01. Node host02 is rebooted:
[grid@host01 host01]$ tailf /u01/app/11.2.0/grid/log/host01/alerthost01.log
[ohasd(12412)]CRS-8011:reboot advisory message from host: host02, component: ag100946, with time stamp: L-2012-11-09-10:21:28.040
[ohasd(12412)]CRS-8013:reboot advisory message text: Rebooting after limit 28100 exceeded; disk timeout 28100, network timeout 27880, last heartbeat from CSSD at epoch seconds 352436647.013, 34280 milliseconds ago based on invariant clock value of 294678040
Node host02 is rebooted and the network connection with it breaks:
2012-11-09 10:21:45.671
[cssd(14493)]CRS-1612:Network communication with node host02 (2) missing for 50% of
timeout interval. Removal of this node
from cluster in 14.330 seconds
2012-11-09 10:21:53.923
[cssd(14493)]CRS-1611:Network communication with node host02 (2) missing for 75% of
timeout interval. Removal of this node
from cluster in 7.310 seconds
2012-11-09 10:21:59.845
[cssd(14493)]CRS-1610:Network communication with node host02 (2) missing for 90% of
timeout interval. Removal of this node
from cluster in 2.300 seconds
2012-11-09 10:22:02.587
[cssd(14493)]CRS-1632:Node host02 is being removed from the cluster in cluster incarnation
247848834
2012-11-09 10:22:02.717
[cssd(14493)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 .
2012-11-09 10:22:02.748
[crsd(14820)]CRS-5504:Node down event reported for node host02.
2012-11-09 10:22:10.086
[crsd(14820)]CRS-2773:Server host02 has been removed from pool Generic.
2012-11-09 10:22:10.086
[crsd(14820)]CRS-2773:Server host02 has been removed from pool ora.orcl.

11g R2 RAC : Node Eviction Due To Member Kill Escalation


If the Oracle Clusterware itself is working perfectly but one of the RAC instances is hanging,
the database LMON process will request a member kill escalation and ask the CSS process to
remove the hanging database instance from the cluster.
The following example demonstrates this in a cluster consisting of two nodes:
SQL> col host_name for a20
SQL> select instance_name, host_name from gv$instance;

INSTANCE_NAME    HOST_NAME
---------------- --------------------
orcl1            host01.example.com
orcl2            host02.example.com
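The same check can also be driven from the shell; a minimal sketch (my addition), assuming OS authentication as a SYSDBA-privileged user and a correctly set ORACLE_HOME/ORACLE_SID:
sqlplus -S / as sysdba <<'EOF'
col host_name for a20
select instance_name, host_name from gv$instance;
EOF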
- On the host02 server, stop the execution of all RDBMS processes (by sending the STOP signal).
Find out the current database processes:
[root@host02 ~]# ps -ef | grep ora_ | grep orcl2
oracle 6215 1 0 11:20 ? 00:00:00 ora_pmon_orcl2
oracle 6217 1 0 11:20 ? 00:00:00 ora_vktm_orcl2
oracle 6221 1 0 11:20 ? 00:00:00 ora_gen0_orcl2
oracle 6223 1 0 11:20 ? 00:00:00 ora_diag_orcl2
oracle 6225 1 0 11:20 ? 00:00:00 ora_dbrm_orcl2
oracle 6227 1 0 11:20 ? 00:00:00 ora_ping_orcl2
oracle 6229 1 0 11:20 ? 00:00:00 ora_psp0_orcl2
oracle 6231 1 0 11:20 ? 00:00:00 ora_acms_orcl2
oracle 6233 1 0 11:20 ? 00:00:00 ora_dia0_orcl2
oracle 6235 1 0 11:20 ? 00:00:00 ora_lmon_orcl2
oracle 6237 1 0 11:20 ? 00:00:02 ora_lmd0_orcl2

Stop the execution of all RDBMS processes (by sending the STOP signal):
[root@host02 ~]# ps -ef | grep ora_ | grep orcl2 | awk '{print $2}' | while read PID
do
kill -STOP $PID
done
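Should you want to abort the test before the eviction happens, the suspended processes can be resumed with SIGCONT; a minimal sketch (my addition, not part of the original walk-through):
ps -ef | grep ora_ | grep orcl2 | awk '{print $2}' | while read PID
do
kill -CONT $PID     # resume each suspended background process
done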
- From the client point of view the Real Application Clusters database is hanging on both
nodes. No queries or DML are possible. Try to execute a query; it will hang.

SQL> select instance_name, host_name from gv$instance;


No output; the query hangs.
- Due to missing heartbeats, the healthy RAC instance on node host01 will remove the hanging
RAC instance by requesting a member kill escalation.
Check the database alert log file on host01: the LMS process issues a request to CSSD to
reboot the node.
The node is evicted and the instance is restarted after the node rejoins the cluster.
[root@host01 trace]# tailf /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/alert_orcl1.log
LMS0 (ospid: 31771) has detected no messaging activity from instance 2
LMS0 (ospid: 31771) issues an IMR to resolve the situation
Please check LMS0 trace file for more detail.
Fri Nov 09 11:15:04 2012
Remote instance kill is issued with system inc 30
Remote instance kill map (size 1) : 2
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
Fri Nov 09 11:15:13 2012
IPC Send timeout detected. Sender: ospid 6308 [[email protected] (PZ97)]
Receiver: inst 2 binc 429420846 ospid 6251
Waiting for instances to leave:
2
Reconfiguration started (old inc 4, new inc 8)
List of instances:
1 (myinst: 1)
.. Recovery of instance 2 starts
Global Resource Directory frozen
.
All grantable enqueues granted
Post SMON to start 1st pass IR
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
Started redo scan
IPC Send timeout to 2.0 inc 4 for msg type 12 from opid 42

Completed redo scan


read 93 KB redo, 55 data blocks need recovery
Started redo application at
Thread 2: logseq 9, block 42
Recovery of Online Redo Log: Thread 2 Group 3 Seq 9 Reading mem 0
Mem# 0: +DATA/orcl/onlinelog/group_3.266.798828557
Mem# 1: +FRA/orcl/onlinelog/group_3.259.798828561
Completed redo application of 0.05MB
Completed instance recovery at
Thread 2: logseq 9, block 228, scn 1069404
52 data blocks read, 90 data blocks written, 93 redo k-bytes read
Thread 2 advanced to log sequence 10 (thread recovery)
Fri Nov 09 12:18:55 2012
.
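Instead of tailing the whole file, the member-kill related lines can be pulled out with grep; a minimal sketch (my addition), using the alert log path from this demo:
grep -E 'IPC Send timeout|Remote instance kill|instance eviction' \
     /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/alert_orcl1.log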
Check the clusterware alert log of host01.
The node is evicted and rebooted to rejoin the cluster:
[grid@host01 host01]$ tailf /u01/app/11.2.0/grid/log/host01/alerthost01.log
[cssd(14493)]CRS-1607:Node host02 is being evicted in cluster incarnation 247848838; details
at (:CSSNM00007:) in
/u01/app/11.2.0/grid/log/host01/cssd/ocssd.log.
2012-11-09 11:15:56.140
[ohasd(12412)]CRS-8011:reboot advisory message from host: host02, component: mo103324, with time stamp: L-2012-11-09-11:15:56.580
[ohasd(12412)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot,
unexpected failure 8 received from
CSS
2012-11-09 11:16:17.365
[cssd(14493)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 .
2012-11-09 11:16:17.400
[crsd(14820)]CRS-5504:Node down event reported for node host02.
Node host02 rejoins the cluster:
[cssd(14493)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 .
2012-11-09 12:18:52.713

[crsd(14820)]CRS-2772:Server host02 has been assigned to pool Generic.


2012-11-09 12:18:52.713
[crsd(14820)]CRS-2772:Server host02 has been assigned to pool ora.orcl.
After the node rejoins the cluster and the instance is restarted, re-execute the query; it now succeeds:
SQL> conn sys/oracle@orcl as sysdba
SQL> col host_name for a20
SQL> select instance_name, host_name from gv$instance;

INSTANCE_NAME    HOST_NAME
---------------- --------------------
orcl1            host01.example.com
orcl2            host02.example.com
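As a cross-check from the clusterware side (my addition, not part of the original post), assuming the environment points at the Grid Infrastructure home:
srvctl status database -d orcl       # both instances should be reported as running
crsctl stat res ora.orcl.db -t       # tabular state of the database resource on each node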

11g R2 RAC: Node Eviction Due To Missing Disk Heartbeat

In this post, I will demonstrate node eviction due to a missing disk heartbeat, i.e. a node will be
evicted from the cluster if it can't access the voting disk. To simulate it, I will stop the iSCSI
service on one of the nodes and then scan the alert logs and ocssd logs of the various nodes.
Current scenario:
No. of nodes in the cluster : 3
Names of the nodes : host01, host02, host03
Name of the cluster database : orcl
I will stop the iSCSI service on host03 so that it is evicted.
Stop the iSCSI service on host03 so that it can't access the shared storage and hence the voting disks:
[root@host03 ~]# service iscsi stop
Scan the alert log of host03. Note that the I/O errors occur at 03:32:11:
[root@host03 ~]# tailf /u01/app/11.2.0/grid/log/host03/alerthost03.log
Note that the ocssd process of host03 is not able to access the voting disks:

[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK01; details at


(:CSSNM00059:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:32:11.310
[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK03; details at
(:CSSNM00059:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:32:11.311
[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK03; details at
(:CSSNM00060:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:32:11.311
[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK01; details at
(:CSSNM00060:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:32:11.312
[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK02; details at
(:CSSNM00060:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:32:11.310
[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK02; details at
(:CSSNM00059:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
ACFS can't be accessed:
[client(8048)]CRS-10001:ACFS-9112: The following process IDs have open references on
/u01/app/oracle/acfsmount/11.2.0/sharedhome:
[client(8050)]CRS-10001:6323 6363 6391 6375 6385 6383 6402 6319 6503 6361 6377 6505
6389 6369 6335 6367 6333 6387 6871 6325 6381 6327 6496 6498 6552 6373 7278 6339 6400
6357 6500 6329 6365
[client(8052)]CRS-10001:ACFS-9113: These processes will now be terminated.
[client(8127)]CRS-10001:ACFS-9114: done.
[client(8136)]CRS-10001:ACFS-9115: Stale mount point
/u01/app/oracle/acfsmount/11.2.0/sharedhome was recovered.
[client(8178)]CRS-10001:ACFS-9114: done.
[client(8183)]CRS-10001:ACFS-9116: Stale mount point
/u01/app/oracle/acfsmount/11.2.0/sharedhome was not recovered.
[client(8185)]CRS-10001:ACFS-9117: Manual intervention is required.
2012-11-17 03:33:34.050
[/u01/app/11.2.0/grid/bin/orarootagent.bin(5682)]CRS-5016:Process
/u01/app/11.2.0/grid/bin/acfssinglefsmount spawned by agent
/u01/app/11.2.0/grid/bin/orarootagent.bin for action start failed: details at (:CLSN00010:)
in /u01/app/11.2.0/grid/log/host03/agent/crsd/orarootagent_root/orarootagent_root.log
At 03:34, the voting disks still can't be accessed even after waiting for the timeout:
2012-11-17 03:34:10.718

[cssd(5149)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file
ORCL:ASMDISK01 will be considered not functional in 99190 milliseconds
2012-11-17 03:34:10.724
[cssd(5149)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file
ORCL:ASMDISK02 will be considered not functional in 99180 milliseconds
2012-11-17 03:34:10.724
[cssd(5149)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file
ORCL:ASMDISK03 will be considered not functional in 99180 milliseconds
2012-11-17 03:35:10.666
[cssd(5149)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file
ORCL:ASMDISK01 will be considered not functional in 49110 milliseconds
2012-11-17 03:35:10.666
[cssd(5149)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file
ORCL:ASMDISK02 will be considered not functional in 49110 milliseconds
2012-11-17 03:35:10.666
[cssd(5149)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file
ORCL:ASMDISK03 will be considered not functional in 49110 milliseconds
2012-11-17 03:35:46.654
[cssd(5149)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file
ORCL:ASMDISK01 will be considered not functional in 19060 milliseconds
2012-11-17 03:35:46.654
[cssd(5149)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file
ORCL:ASMDISK02 will be considered not functional in 19060 milliseconds
2012-11-17 03:35:46.654
[cssd(5149)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file
ORCL:ASMDISK03 will be considered not functional in 19060 milliseconds
The voting files are taken offline as they can't be accessed:
[cssd(5149)]CRS-1604:CSSD voting file is offline: ORCL:ASMDISK01; details at
(:CSSNM00058:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:36:10.596
[cssd(5149)]CRS-1604:CSSD voting file is offline: ORCL:ASMDISK02; details at
(:CSSNM00058:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:36:10.596
[cssd(5149)]CRS-1604:CSSD voting file is offline: ORCL:ASMDISK03; details at
(:CSSNM00058:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:36:10.596
CSSD of host03 reboots the node as the number of voting disks available (0) is less than the
minimum required (2):

[cssd(5149)]CRS-1606:The number of voting files available, 0, is less than the minimum


number of voting files required, 2, resulting in CSSD termination to ensure data integrity;
details at (:CSSNM00018:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log
2012-11-17 03:36:15.645
[ctssd(5236)]CRS-2402:The Cluster Time Synchronization Service aborted on host host03.
Details at (:ctsselect_mmg5_1: in /u01/app/11.2.0/grid/log/host03/ctssd/octssd.log.
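The CRS-1606 message above reflects the voting-file majority rule: CSSD must be able to access a strict majority, i.e. floor(N/2) + 1 files, which for the three voting files used here is 2. With 0 accessible, CSSD terminates and the node is fenced. A minimal sketch (my addition) to confirm the configured voting files once access is restored:
crsctl query css votedisk            # lists the voting files and their state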
Scan the ocssd log of host03:
[root@host03 ~]# tailf /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log
I/O fencing for the ORCL database is carried out by CSSD at 03:32 (the same time as
when host02 got the message that orcl has failed on host03):
2012-11-17 03:32:10.356: [ CSSD][997865360]clssgmFenceClient: fencing client (0xaa14990),
member 2 in group DBORCL, no share, death fence 1, SAGE fence 0
2012-11-17 03:32:10.356: [ CSSD][997865360]clssgmUnreferenceMember: global grock
DBORCL member 2 refcount is 7
2012-11-17 03:32:10.356: [ CSSD][997865360]clssgmFenceProcessDeath: client (0xaa14990)
pid 6337 undead
..
2012-11-17 03:32:10.356: [ CSSD][997865360]clssgmFenceClient: fencing client (0xaa24250),
member 4 in group DAALL_DB, no share, death fence 1, SAGE fence 0

2012-11-17 03:32:10.356: [ CSSD][997865360]clssgmFenceClient: fencing client (0xaa6db08), member 0 in group DG_LOCAL_DATA, same group share, death fence 1, SAGE fence 0
2012-11-17 03:32:10.357: [ CSSD][864708496]clssgmTermMember: Terminating member 2 (0xaa15920) in grock DBORCL
2012-11-17 03:32:10.358: [ CSSD][864708496]clssgmFenceCompletion: (0xaa46760) process death fence completed for process 6337, object type 3
..
2012-11-17 03:32:10.358: [ CSSD][864708496]clssgmFenceCompletion: (0xaa05758) process death fence completed for process 6337, object type 2
2012-11-17 03:32:10.359: [ CSSD][852125584]clssgmRemoveMember: grock DAALL_DB, member number 4 (0xaa05aa8) node number 3 state 0x0 grock type 2
2012-11-17 03:32:11.310: [ SKGFD][942172048]ERROR: -15(asmlib ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so op ioerror error I/O Error)
2012-11-17 03:32:11.310: [ CSSD][942172048](:CSSNM00059:)clssnmvWriteBlocks: write failed at offset 19 of ORCL:ASMDISK02
2012-11-17 03:32:11.310: [ SKGFD][973764496]ERROR: -15(asmlib
ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so op ioerror error I/O Error)
2012-11-17 03:32:11.310: [ CSSD][973764496](:CSSNM00059:)clssnmvWriteBlocks: write
failed at offset 19 of ORCL:ASMDISK03

2012-11-17 03:32:11.349: [ CSSD][997865360]clssgmFenceClient: fencing client (0xaa38ae0), member 2 in group DBORCL, same group share, death fence 1, SAGE fence 0
2012-11-17 03:32:11.349: [ CSSD][997865360]clssgmFenceClient: fencing client (0xaa5e128), member 0 in group DG_LOCAL_DATA, same group share, death fence 1, SAGE fence 0
2012-11-17 03:32:11.354: [ CSSD][908748688]clssnmvDiskAvailabilityChange: voting file ORCL:ASMDISK01 now offline
2012-11-17 03:32:11.354: [ CSSD][973764496]clssnmvDiskAvailabilityChange: voting file
ORCL:ASMDISK03 now offline
2012-11-17 03:32:11.354: [ CSSD][931682192]clssnmvDiskAvailabilityChange: voting file
ORCL:ASMDISK02 now offline

2012-11-17 03:32:12.038: [ CSSD][810166160]clssnmvSchedDiskThreads: DiskPingThread for voting file ORCL:ASMDISK02 sched delay 1610 > margin 1500 cur_ms 232074 lastalive 230464
2012-11-17 03:32:12.038: [ CSSD][810166160]clssnmvSchedDiskThreads: DiskPingThread for
voting file ORCL:ASMDISK01 sched delay 1640 > margin 1500 cur_ms 232074 lastalive
230434
.
2012-11-17 03:32:12.223: [ CLSF][887768976]Closing handle:0xa746bc0
2012-11-17 03:32:12.223: [ SKGFD][887768976]Lib
:ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so: closing handle 0xa746df8 for disk
:ORCL:ASMDISK01:
2012-11-17 03:32:12.236: [ CLSF][921192336]Closing handle:0xa5cbbb0
2012-11-17 03:32:12.236: [ SKGFD][921192336]Lib
:ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so: closing handle 0xa644fb8 for disk
:ORCL:ASMDISK02:

2012-11-17 03:32:13.825: [ CSSD][997865360]clssnmvSchedDiskThreads: DiskPingThread for voting file ORCL:ASMDISK03 sched delay 3110 > margin 1500 cur_ms 233574 lastalive 230464
2012-11-17 03:32:13.825: [ CSSD][997865360]clssnmvSchedDiskThreads: DiskPingThread for voting file ORCL:ASMDISK02 sched delay 3110 > margin 1500 cur_ms 233574 lastalive 230464
2012-11-17 03:32:13.825: [ CSSD][997865360]clssnmvSchedDiskThreads: DiskPingThread for voting file ORCL:ASMDISK01 sched delay 3140 > margin 1500 cur_ms 233574

2012-11-17 03:36:10.638: [ CSSD][877279120]CALL TYPE: call ERROR SIGNALED: no CALLER: clssscExit
Scan the alert log of host01.
Note that the reboot message from host03 is received at 03:36:16:
[root@host01 host01]# tailf /u01/app/11.2.0/grid/log/host01/alerthost01.log
[ohasd(4942)]CRS-8011:reboot advisory message from host: host03, component: mo031159,
with time stamp: L-2012-11-17-03:36:16.705
[ohasd(4942)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot,
unexpected failure 8 received from CSS
2012-11-17 03:36:29.610
After host03 reboots itself, network communication with host03 is lost
[cssd(5177)]CRS-1612:Network communication with node host03 (3) missing for 50% of
timeout interval. Removal of this node from cluster in 14.060 seconds
2012-11-17 03:36:37.988
[cssd(5177)]CRS-1611:Network communication with node host03 (3) missing for 75% of
timeout interval. Removal of this node from cluster in 7.050 seconds
2012-11-17 03:36:43.992
[cssd(5177)]CRS-1610:Network communication with node host03 (3) missing for 90% of
timeout interval. Removal of this node from cluster in 2.040 seconds
2012-11-17 03:36:46.441
After network communication can't be established for the timeout interval, the node is
removed from the cluster:
[cssd(5177)]CRS-1632:Node host03 is being removed from the cluster in cluster incarnation
232819906
2012-11-17 03:36:46.572
[cssd(5177)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 .
Scan the ocssd log of host01.
Note that the ocssd process of host01 detects the missing disk heartbeat from host03 at
03:32:16:
[root@host01 cssd]# tailf /u01/app/11.2.0/grid/log/host01/cssd/ocssd.log

2012-11-17 03:32:16.352: [ CSSD][852125584]clssgmGrockOpTagProcess: clssgmCommonAddMember failed, member(-1/CLSN.ONS.ONSNETPROC[3]) on node(3)
2012-11-17 03:32:16.352: [ CSSD][852125584]clssgmGrockOpTagProcess: Operation(3)
unsuccessful grock(CLSN.ONS.ONSNETPROC[3])
2012-11-17 03:32:16.352: [ CSSD][852125584]clssgmHandleMasterJoin:
clssgmProcessJoinUpdate failed with status(-10)
.
2012-11-17 03:36:15.328: [ CSSD][810166160]clssnmDiscHelper: host03, node(3) connection
failed, endp (0x319), probe((nil)), ninf->endp 0x319
2012-11-17 03:36:15.328: [ CSSD][810166160]clssnmDiscHelper: node 3 clean up, endp
(0x319), init state 3, cur state 3

2012-11-17 03:36:15.329: [GIPCXCPT][852125584]gipcInternalDissociate: obj 0x96c7eb8 [0000000000001310] { gipcEndpoint : localAddr gipc://host01:f278-d1bd-15092f25#10.0.0.1#20071, remoteAddr gipc://host03:gm_cluster01#10.0.0.3#58536, numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x141d, pidPeer 0, flags 0x261e, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
Scan the alert log of host02.
Note that the reboot message is received at 03:36:16:
[root@host02 ~]# tailf /u01/app/11.2.0/grid/log/host02/alerthost02.log
- At 03:32, the CRSD process of host02 receives a message that the orcl database has failed on
host03, as the datafiles for orcl are on the shared storage.
[crsd(5576)]CRS-2765:Resource ora.orcl.db has failed on server host03.
2012-11-17 03:32:44.303
- The CRSD process of host02 receives a message that ACFS has failed on host03, as the
shared storage can't be accessed.
[crsd(5576)]CRS-2765:Resource ora.acfs.dbhome_1.acfs has failed on server host03.
2012-11-17 03:36:16.981
- The ohasd process receives a reboot advisory message from host03:
[ohasd(4916)]CRS-8011:reboot advisory message from host: host03, component: ag031159,
with time stamp: L-2012-11-17-03:36:16.705
[ohasd(4916)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot,
unexpected failure 8 received from CSS
2012-11-17 03:36:16.981
[ohasd(4916)]CRS-8011:reboot advisory message from host: host03, component: mo031159,
with time stamp: L-2012-11-17-03:36:16.705

[ohasd(4916)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot,


unexpected failure 8 received from CSS
2012-11-17 03:36:28.920
- The CSSD process of host02 identifies missing network communication from host03, as host03
has rebooted itself:
[cssd(5284)]CRS-1612:Network communication with node host03 (3) missing for 50% of
timeout interval. Removal of this node from cluster in 14.420 seconds
2012-11-17 03:36:37.307
[cssd(5284)]CRS-1611:Network communication with node host03 (3) missing for 75% of
timeout interval. Removal of this node from cluster in 7.410 seconds
2012-11-17 03:36:43.328
[cssd(5284)]CRS-1610:Network communication with node host03 (3) missing for 90% of
timeout interval. Removal of this node from cluster in 2.400 seconds
After network communication can't be established for the timeout interval, the node is
removed from the cluster:
2012-11-17 03:36:46.297
[cssd(5284)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 .
2012-11-17 03:36:46.470
[crsd(5576)]CRS-5504:Node down event reported for node host03.
2012-11-17 03:36:51.890
[crsd(5576)]CRS-2773:Server host03 has been removed from pool Generic.
2012-11-17 03:36:51.909
[crsd(5576)]CRS-2773:Server host03 has been removed from pool ora.orcl.
Host03 rejoins the cluster:
[cssd(5284)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 host03 .
Scan the ocssd log of host02.
Note that the ocssd of host02 discovers the missing host03 only after it has been rebooted at 03:36:
[root@host02 ~]# tailf /u01/app/11.2.0/grid/log/host02/cssd/ocssd.log
2012-11-17 03:36:15.052: [ CSSD][810166160]clssnmDiscHelper: host03, node(3) connection
failed, endp (0x22e), probe((nil)), ninf->endp 0x22e
2012-11-17 03:36:15.052: [ CSSD][810166160]clssnmDiscHelper: node 3 clean up, endp
(0x22e), init state 3, cur state 3
..
2012-11-17 03:36:15.052: [ CSSD][852125584]clssgmPeerDeactivate: node 3 (host03), death 0,
state 0x1 connstate 0x1e
.

2012-11-17 03:36:28.920: [ CSSD][841635728]clssnmPollingThread: node host03 (3) at 50% heartbeat fatal, removal in 14.420 seconds
2012-11-17 03:36:28.920: [ CSSD][841635728]clssnmPollingThread: local diskTimeout set to
27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2012-11-17 03:36:29.017: [ CSSD][810166160]clssnmvSchedDiskThreads:
DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 474884 lastalive 474074
2012-11-17 03:36:29.017: [ CSSD][810166160]clssnmvSchedDiskThreads:
DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 474884 lastalive 474074
2012-11-17 03:36:29.017: [ CSSD][810166160]clssnmvSchedDiskThreads:
DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 474884 lastalive 474074
2012-11-17 03:36:29.908: [ CSSD][852125584]clssgmTagize: version(1), type(13),
tagizer(0x80cf3ac)
2012-11-17 03:36:29.908: [ CSSD][852125584]clssgmHandleDataInvalid: grock HB+ASM,
member 1 node 1, birth 1
2012-11-17 03:36:30.218: [ CSSD][810166160]clssnmvSchedDiskThreads:
DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 475884 lastalive 475074
2012-11-17 03:36:30.218: [ CSSD][810166160]clssnmvSchedDiskThreads:
DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 475884 lastalive 475074
2012-11-17 03:36:30.218: [ CSSD][810166160]clssnmvSchedDiskThreads:
DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 475884 lastalive 475074
2012-11-17 03:36:31.408: [ CSSD][810166160]clssnmvSchedDiskThreads:
DiskPingMonitorThread sched delay 790 > margin 750 cur_ms 476864 lastalive 476074
2012-11-17 03:36:31.408: [ CSSD][810166160]clssnmvSchedDiskThreads:
DiskPingMonitorThread sched delay 790 > margin 750 cur_ms 476864 lastalive 476074
2012-11-17 03:36:31.408: [ CSSD][810166160]clssnmvSchedDiskThreads:
DiskPingMonitorThread sched delay 790 > margin 750 cur_ms 476864 lastalive 476074
2012-11-17 03:36:32.204: [ CSSD][831145872]clssnmSendingThread: sending status msg to all
nodes
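Before moving on to the next scenario, the shared storage on host03 can be restored and the stack checked; a minimal sketch (my addition), to be run on host03 after its reboot:
service iscsi start                  # re-attach the iSCSI LUNs backing ASM and the voting files
crsctl check css                     # verify that CSS is up again on this node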

11g R2 RAC: Node Eviction Due To Missing Network Heartbeat


In this post, I will demonstrate node eviction due to a missing network heartbeat, i.e. a node will be
evicted from the cluster if it can't communicate with the other nodes in the cluster. To simulate it, I
will stop the private network on one of the nodes and then scan the alert logs of the surviving nodes.
Current scenario:
No. of nodes in the cluster : 3
Names of the nodes : host01, host02, host03
Name of the cluster database : orcl
I will stop the private network on host03 so that it is evicted.
Find out the private network name:
[root@host03 ~]# oifcfg getif
eth0 192.9.201.0 global public
eth1 10.0.0.0 global cluster_interconnect
Stop the private network on host03 so that it can't communicate with host01 and host02 and
will be evicted:
[root@host03 ~]# ifdown eth1
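To bring host03 back after the test (my addition, not part of the original steps), re-enable the private interface once the node has rebooted and confirm the interconnect definition:
ifup eth1                            # bring the private interconnect interface back up
oifcfg getif                         # confirm eth1 is still registered as cluster_interconnect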
OCSSD log of host03

It can be seen that the CSSD process of host03 can't communicate with host01 and host02
at 09:43:52.
Hence the votedisk timeout is set to the Short Disk Time Out (SDTO) = 27000 ms (27 secs):
2012-11-19 09:43:52.714: [ CSSD][843736976]clssnmPollingThread: node host01 (1) at 50%
heartbeat fatal, removal in 14.880 seconds
2012-11-19 09:43:52.714: [ CSSD][843736976]clssnmPollingThread: node host01 (1) is
impending reconfig, flag 132108, misstime 15120
2012-11-19 09:43:52.714: [ CSSD][843736976]clssnmPollingThread: node host02 (2) at 50%
heartbeat fatal, removal in 14.640 seconds
2012-11-19 09:43:52.714: [ CSSD][843736976]clssnmPollingThread: node host02 (2) is
impending reconfig, flag 132108, misstime 15360
2012-11-19 09:43:52.714: [ CSSD][843736976]clssnmPollingThread: local diskTimeout set to
27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2012-11-19 09:43:52.927: [ CSSD][2833247120]clssnmSendingThread: sending status msg to
all nodes
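The 27000 ms in the log is the short disk timeout CSSD falls back to while a reconfiguration is pending (misscount minus a small reboot allowance). The configured heartbeat timeouts can be read from the Grid home; a minimal sketch (my addition), with the usual 11.2 defaults noted:
crsctl get css misscount             # network heartbeat timeout (default 30 s on Linux)
crsctl get css disktimeout           # long disk timeout (default 200 s)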

Alert log of host03

At 09:43:52, the CSSD process of host03 identifies that it can't communicate with CSSD on host01
and host02:
[cssd(5124)]CRS-1612:Network communication with node host01 (1) missing for 50% of
timeout interval. Removal of this node from cluster in 14.880 seconds
2012-11-19 09:43:52.714
[cssd(5124)]CRS-1612:Network communication with node host02 (2) missing for 50% of
timeout interval. Removal of this node from cluster in 14.640 seconds
2012-11-19 09:44:01.880
[cssd(5124)]CRS-1611:Network communication with node host01 (1) missing for 75% of
timeout interval. Removal of this node from cluster in 6.790 seconds
2012-11-19 09:44:01.880
[cssd(5124)]CRS-1611:Network communication with node host02 (2) missing for 75% of
timeout interval. Removal of this node from cluster in 6.550 seconds
2012-11-19 09:44:06.536
[cssd(5124)]CRS-1610:Network communication with node host01 (1) missing for 90% of
timeout interval. Removal of this node from cluster in 2.780 seconds
2012-11-19 09:44:06.536
[cssd(5124)]CRS-1610:Network communication with node host02 (2) missing for 90% of
timeout interval. Removal of this node from cluster in 2.540 seconds
2012-11-19 09:44:09.599
At 09:44:16, the CSSD process of host03 reboots the node to preserve cluster integrity:
[cssd(5124)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is
going down to preserve cluster integrity; details at (:CSSNM00008:) in
/u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-19 09:44:16.697
[/u01/app/11.2.0/grid/bin/orarootagent.bin(5713)]CRS-5822:Agent
/u01/app/11.2.0/grid/bin/orarootagent_root disconnected from server. Details at
(:CRSAGF00117:) in
/u01/app/11.2.0/grid/log/host03/agent/crsd/orarootagent_root/orarootagent_root.log.
2012-11-19 09:44:16.193
[ctssd(5285)]CRS-2402:The Cluster Time Synchronization Service aborted on host host03.
Details at (:ctsselect_mmg5_1: in /u01/app/11.2.0/grid/log/host03/ctssd/octssd.log.
2012-11-19 09:44:21.177

OCSSD log of host01

At 09:43:53, the CSSD process of host01 identifies that it can't communicate with CSSD on host03:

2012-11-19 09:43:53.340: [ CSSD][841635728]clssnmPollingThread: node host03 (3) at 50% heartbeat fatal, removal in 14.500 seconds
2012-11-19 09:43:53.340: [ CSSD][841635728]clssnmPollingThread: node host03 (3) is
impending reconfig, flag 132110, misstime 15500
2012-11-19 09:43:53.340: [ CSSD][841635728]clssnmPollingThread: local diskTimeout set to
27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
Alert log of host01
At 09:44:01, the alert log of host01 is updated regarding the communication failure with host03:
[cssd(5308)]CRS-1612:Network communication with node host03 (3) missing for 50% of
timeout interval. Removal of this node from cluster in 14.500 seconds
2012-11-19 09:44:01.695
[cssd(5308)]CRS-1611:Network communication with node host03 (3) missing for 75% of
timeout interval. Removal of this node from cluster in 7.450 seconds
2012-11-19 09:44:07.666
[cssd(5308)]CRS-1610:Network communication with node host03 (3) missing for 90% of
timeout interval. Removal of this node from cluster in 2.440 seconds
2012-11-19 09:44:10.606
[cssd(5308)]CRS-1607:Node host03 is being evicted in cluster incarnation 32819913; details at
(:CSSNM00007:) in /u01/app/11.2.0/grid/log/host01/cssd/ocssd.log.
2012-11-19 09:44:24.705
At 09:44:24, the OHASD process on host01 receives the reboot message from host03:
[ohasd(4941)]CRS-8011:reboot advisory message from host: host03, component: ag050107,
with time stamp: L-2012-11-19-09:44:24.373
[ohasd(4941)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot,
unexpected failure 8 received from CSS
2012-11-19 09:44:24.705
[ohasd(4941)]CRS-8011:reboot advisory message from host: host03, component: mo050107,
with time stamp: L-2012-11-19-09:44:24.376
[ohasd(4941)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot,
unexpected failure 8 received from CSS
2012-11-19 09:44:46.379
[cssd(5308)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 .
OCSSD log of host02

At 09:43:52, the CSSD process of host02 identifies the communication failure with host03:


2012-11-19 09:43:52.385: [ CSSD][841635728]clssnmPollingThread: node host03 (3) at 50%
heartbeat fatal, removal in 14.950 seconds
2012-11-19 09:43:52.386: [ CSSD][841635728]clssnmPollingThread: node host03 (3) is
impending reconfig, flag 394254, misstime 15050
2012-11-19 09:43:52.386: [ CSSD][841635728]clssnmPollingThread: local diskTimeout set to
27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2012-11-19 09:43:52.733: [ CSSD][810166160]clssnmvSchedDiskThreads: DiskPingThread for
voting file ORCL:ASMDISK01 sched delay 970 > margin 750 cur_ms 18331974 lastalive
18331004

Alert log of host02

At 09:44:01 (the same as on host01), the alert log of host02 is updated regarding the communication
failure with host03:
[cssd(5284)]CRS-1612:Network communication with node host03 (3) missing for 50% of
timeout interval. Removal of this node from cluster in 14.950 seconds
2012-11-19 09:44:01.971
[cssd(5284)]CRS-1611:Network communication with node host03 (3) missing for 75% of
timeout interval. Removal of this node from cluster in 6.930 seconds
2012-11-19 09:44:06.750
[cssd(5284)]CRS-1610:Network communication with node host03 (3) missing for 90% of
timeout interval. Removal of this node from cluster in 2.920 seconds
2012-11-19 09:44:24.520
At 09:44:24 (same as host01), OHASD process on host01 receives reboot message from host03
[ohasd(4929)]CRS-8011:reboot advisory message from host: host03, component: ag050107,
with time stamp: L-2012-11-19-09:44:24.373
[ohasd(4929)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot,
unexpected failure 8 received from CSS
2012-11-19 09:44:24.520
[ohasd(4929)]CRS-8011:reboot advisory message from host: host03, component: mo050107,
with time stamp: L-2012-11-19-09:44:24.376
[ohasd(4929)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot,
unexpected failure 8 received from CSS
2012-11-19 09:44:46.073
[cssd(5284)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 .
