Node Eviction
In addition to the ocssd.bin process, which is responsible, among other things, for the network
and disk heartbeats, Oracle Clusterware 11g Release 2 uses two new monitoring processes,
cssdagent and cssdmonitor, which run with the highest real-time scheduler priority and are also
able to fence a server.
Find out the PIDs of the cssd processes
[root@host02 lastgasp]# ps -ef |grep cssd |grep -v grep
root 5085 1 0 09:45 ? 00:00:00 /u01/app/11.2.0/grid/bin/cssdmonitor
root 5106 1 0 09:45 ? 00:00:00 /u01/app/11.2.0/grid/bin/cssdagent
grid 5136 1 0 09:45 ? 00:00:02 /u01/app/11.2.0/grid/bin/ocssd.bin
Find out the scheduling priority of cssdagent
[root@host02 lastgasp]# chrt -p 5106
pid 5106's current scheduling policy: SCHED_RR
pid 5106's current scheduling priority: 99
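The same check can be run against cssdmonitor (PID 5085 in the listing above); it is expected to report the same SCHED_RR policy and priority of 99:
[root@host02 lastgasp]# chrt -p 5085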
Since cssdagent and cssdmonitor run with a scheduling priority of 99, they are able to reset a server in case:
there is some problem with the ocssd.bin process
there is some problem with the OS scheduler
there is CPU starvation
the OS is locked up in a driver or hardware (e.g. an I/O call)
Both of them are also associated with an undocumented timeout: if the execution of these
processes stops for more than 28 seconds, the node will be evicted.
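For reference, the documented CSS heartbeat timeouts (as opposed to this undocumented agent timeout) can be queried with crsctl; on a default 11.2 installation misscount is 30 seconds, disktimeout 200 seconds and reboottime 3 seconds:
[root@host02 ~]# crsctl get css misscount
[root@host02 ~]# crsctl get css disktimeout
[root@host02 ~]# crsctl get css reboottime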
Let us stop the execution of cssdagent for 40 secs
[root@rac1 ~]# kill -STOP 5106; sleep 40; kill -CONT 5106
Check the alert log of host01: node host02 is rebooted
[grid@host01 host01]$ tailf /u01/app/11.2.0/grid/log/host01/alerthost01.log
[ohasd(12412)]CRS-8011:reboot advisory message from host: host02, component: ag100946, with time stamp: L-2012-11-09-10:21:28.040
[ohasd(12412)]CRS-8013:reboot advisory message text: Rebooting after limit 28100 exceeded; disk timeout 28100, network timeout 27880, last heartbeat from CSSD at epoch seconds 352436647.013, 34280 milliseconds ago based on invariant clock value of 294678040
Node host02 is rebooted and the network connection with it breaks
2012-11-09 10:21:45.671
[cssd(14493)]CRS-1612:Network communication with node host02 (2) missing for 50% of
timeout interval. Removal of this node
from cluster in 14.330 seconds
2012-11-09 10:21:53.923
[cssd(14493)]CRS-1611:Network communication with node host02 (2) missing for 75% of
timeout interval. Removal of this node
from cluster in 7.310 seconds
2012-11-09 10:21:59.845
[cssd(14493)]CRS-1610:Network communication with node host02 (2) missing for 90% of
timeout interval. Removal of this node
from cluster in 2.300 seconds
2012-11-09 10:22:02.587
[cssd(14493)]CRS-1632:Node host02 is being removed from the cluster in cluster incarnation
247848834
2012-11-09 10:22:02.717
[cssd(14493)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 .
2012-11-09 10:22:02.748
[crsd(14820)]CRS-5504:Node down event reported for node host02.
2012-11-09 10:22:10.086
[crsd(14820)]CRS-2773:Server host02 has been removed from pool Generic.
2012-11-09 10:22:10.086
[crsd(14820)]CRS-2773:Server host02 has been removed from pool ora.orcl.
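After the reconfiguration, cluster membership can be cross-checked from host01; olsnodes with the -s option shows the status of each node, and host02 should now be reported as Inactive:
[grid@host01 ~]$ olsnodes -s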
Stop the execution of all rdbms processes of the orcl2 instance (by sending them the STOP signal)
[root@host02 ~]# ps -ef | grep ora_ | grep orcl2 | awk '{print $2}' | while read PID
do
kill -STOP $PID
done
From the client's point of view, the Real Application Clusters database is hanging on both
nodes. No queries or DML statements are possible. Try to execute a query; the query will hang.
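As a quick test, a simple query can be run from any client session; the connect string and credentials below are placeholders for this environment:
[oracle@client ~]$ sqlplus system/<password>@orcl
SQL> select count(*) from dba_objects;
The session simply hangs at this point. To resume normal operation afterwards, the stopped processes can be continued with the CONT signal, for example:
[root@host02 ~]# ps -ef | grep ora_ | grep orcl2 | awk '{print $2}' | xargs kill -CONT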
In this post, I will demonstrate node eviction due to a missing disk heartbeat, i.e. a node will be
evicted from the cluster if it can't access the voting disks. To simulate this, I will stop the iscsi
service on one of the nodes and then scan the alert logs and ocssd logs of the various nodes.
Current scenario:
No. of nodes in the cluster : 3
Names of the nodes : host01, host02, host03
Name of the cluster database : orcl
I will stop the iSCSI service on host03 so that it is evicted.
Stop the iSCSI service on host03 so that it can't access the shared storage and hence the voting disks
[root@host03 ~]# service iscsi stop
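To confirm that the shared storage paths are really gone, the active iSCSI sessions can be listed (iscsiadm is assumed to be the initiator utility in use on this setup); with the service stopped, no active sessions should be reported:
[root@host03 ~]# iscsiadm -m session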
Scan the alert log of host03. Note that the I/O error occurs at 03:32:11
[root@host03 ~]# tailf /u01/app/11.2.0/grid/log/host03/alerthost03.log
Note that the ocssd process of host03 is not able to access the voting disks
[cssd(5149)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file
ORCL:ASMDISK01 will be considered not functional in 99190 milliseconds
2012-11-17 03:34:10.724
[cssd(5149)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file
ORCL:ASMDISK02 will be considered not functional in 99180 milliseconds
2012-11-17 03:34:10.724
[cssd(5149)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file
ORCL:ASMDISK03 will be considered not functional in 99180 milliseconds
2012-11-17 03:35:10.666
[cssd(5149)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file
ORCL:ASMDISK01 will be considered not functional in 49110 milliseconds
2012-11-17 03:35:10.666
[cssd(5149)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file
ORCL:ASMDISK02 will be considered not functional in 49110 milliseconds
2012-11-17 03:35:10.666
[cssd(5149)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file
ORCL:ASMDISK03 will be considered not functional in 49110 milliseconds
2012-11-17 03:35:46.654
[cssd(5149)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file
ORCL:ASMDISK01 will be considered not functional in 19060 milliseconds
2012-11-17 03:35:46.654
[cssd(5149)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file
ORCL:ASMDISK02 will be considered not functional in 19060 milliseconds
2012-11-17 03:35:46.654
[cssd(5149)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file
ORCL:ASMDISK03 will be considered not functional in 19060 milliseconds
Voting files are taken offline as they can't be accessed
[cssd(5149)]CRS-1604:CSSD voting file is offline: ORCL:ASMDISK01; details at
(:CSSNM00058:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:36:10.596
[cssd(5149)]CRS-1604:CSSD voting file is offline: ORCL:ASMDISK02; details at
(:CSSNM00058:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:36:10.596
[cssd(5149)]CRS-1604:CSSD voting file is offline: ORCL:ASMDISK03; details at
(:CSSNM00058:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:36:10.596
The CSSD process of host03 reboots the node, as the number of voting disks available (0) is less than the
minimum required (2)
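The voting disk configuration, and hence the majority of at least 2 of the 3 files needed here, can be cross-checked from one of the surviving nodes; the output lists the three ASM disks used as voting files:
[grid@host01 ~]$ crsctl query css votedisk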
It can be seen that the CSSD process of host03 can't communicate with host01 and host02
at 09:43:52.
Hence the voting disk timeout is set to the Short Disk TimeOut (SDTO) = 27000 ms (27 secs), i.e. misscount (30 s) minus reboottime (3 s)
2012-11-19 09:43:52.714: [ CSSD][843736976]clssnmPollingThread: node host01 (1) at 50%
heartbeat fatal, removal in 14.880 seconds
2012-11-19 09:43:52.714: [ CSSD][843736976]clssnmPollingThread: node host01 (1) is
impending reconfig, flag 132108, misstime 15120
2012-11-19 09:43:52.714: [ CSSD][843736976]clssnmPollingThread: node host02 (2) at 50%
heartbeat fatal, removal in 14.640 seconds
2012-11-19 09:43:52.714: [ CSSD][843736976]clssnmPollingThread: node host02 (2) is
impending reconfig, flag 132108, misstime 15360
2012-11-19 09:43:52.714: [ CSSD][843736976]clssnmPollingThread: local diskTimeout set to
27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2012-11-19 09:43:52.927: [ CSSD][2833247120]clssnmSendingThread: sending status msg to
all nodes
At 09:43:52, the CSSD process of host03 identifies that it can't communicate with the CSSD processes on host01
and host02
[cssd(5124)]CRS-1612:Network communication with node host01 (1) missing for 50% of
timeout interval. Removal of this node from cluster in 14.880 seconds
2012-11-19 09:43:52.714
[cssd(5124)]CRS-1612:Network communication with node host02 (2) missing for 50% of
timeout interval. Removal of this node from cluster in 14.640 seconds
2012-11-19 09:44:01.880
[cssd(5124)]CRS-1611:Network communication with node host01 (1) missing for 75% of
timeout interval. Removal of this node from cluster in 6.790 seconds
2012-11-19 09:44:01.880
[cssd(5124)]CRS-1611:Network communication with node host02 (2) missing for 75% of
timeout interval. Removal of this node from cluster in 6.550 seconds
2012-11-19 09:44:06.536
[cssd(5124)]CRS-1610:Network communication with node host01 (1) missing for 90% of
timeout interval. Removal of this node from cluster in 2.780 seconds
2012-11-19 09:44:06.536
[cssd(5124)]CRS-1610:Network communication with node host02 (2) missing for 90% of
timeout interval. Removal of this node from cluster in 2.540 seconds
At 09:44:16, the CSSD process of host03 reboots the node to preserve cluster integrity
2012-11-19 09:44:09.599
[cssd(5124)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is
going down to preserve cluster integrity; details at (:CSSNM00008:) in
/u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-19 09:44:16.697
[/u01/app/11.2.0/grid/bin/orarootagent.bin(5713)]CRS-5822:Agent
/u01/app/11.2.0/grid/bin/orarootagent_root disconnected from server. Details at
(:CRSAGF00117:) in
/u01/app/11.2.0/grid/log/host03/agent/crsd/orarootagent_root/orarootagent_root.log.
2012-11-19 09:44:16.193
[ctssd(5285)]CRS-2402:The Cluster Time Synchronization Service aborted on host host03.
Details at (:ctsselect_mmg5_1: in /u01/app/11.2.0/grid/log/host03/ctssd/octssd.log.
2012-11-19 09:44:21.177
At 09:43:53, the CSSD process of host01 identifies that it cannot communicate with the CSSD on host03
At 09:44:01 (around the same time as host01), the alert log of host02 is updated regarding the communication failure
with host03
[cssd(5284)]CRS-1612:Network communication with node host03 (3) missing for 50% of
timeout interval. Removal of this node from cluster in 14.950 seconds
2012-11-19 09:44:01.971
[cssd(5284)]CRS-1611:Network communication with node host03 (3) missing for 75% of
timeout interval. Removal of this node from cluster in 6.930 seconds
2012-11-19 09:44:06.750
[cssd(5284)]CRS-1610:Network communication with node host03 (3) missing for 90% of
timeout interval. Removal of this node from cluster in 2.920 seconds
At 09:44:24 (same as host01), the OHASD process on host02 receives the reboot advisory message from host03
2012-11-19 09:44:24.520
[ohasd(4929)]CRS-8011:reboot advisory message from host: host03, component: ag050107,
with time stamp: L-2012-11-19-09:44:24.373
[ohasd(4929)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot,
unexpected failure 8 received from CSS
2012-11-19 09:44:24.520
[ohasd(4929)]CRS-8011:reboot advisory message from host: host03, component: mo050107,
with time stamp: L-2012-11-19-09:44:24.376
[ohasd(4929)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot,
unexpected failure 8 received from CSS
2012-11-19 09:44:46.073
[cssd(5284)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 .
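To bring host03 back into the cluster after it has rebooted, a possible sequence is to start the iSCSI service again so that the voting disks become accessible, and then start the clusterware stack if it has not restarted on its own:
[root@host03 ~]# service iscsi start
[root@host03 ~]# crsctl start crs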