Node Evition

11g R2 RAC:Node Eviction Due To CSSDagent Stopping
In addition to the ocssd.bin process which is responsible, among other things, for the network
and disk heartbeats, Oracle Clusterware 11g Release 2 uses two new monitoring processes
cssdagent and cssdmonitor , which run with the highest real-time scheduler priority and are also
able to fence a server.
Find out PID for cssdagent
[root@host02 lastgasp]# ps -ef |grep cssd |grep -v grep
root 5085 1 0 09:45 ? 00:00:00 /u01/app/11.2.0/grid/bin/cssdmonitor
root 5106 1 0 09:45 ? 00:00:00 /u01/app/11.2.0/grid/bin/cssdagent
grid 5136 1 0 09:45 ? 00:00:02 /u01/app/11.2.0/grid/bin/ocssd.bin
Find out the scheduling priority of cssdagent
[root@host02 lastgasp]# chrt -p 5106
pid 5106s current scheduling policy: SCHED_RR
pid 5106s current scheduling priority: 99
Since cssdagent and cssdmonitor have schedulilng priority of 99 stopping them can reset a server
in case :
there is some problem with the ocssd.bin process
there is some problem with OS scheduler
. CPU starvation
OS is locked up in a driver or hardware (e.g. I/O call)
Both of them are also associated with an undocumented timeout. In case the execution of the
processes stops for more than 28 sec., the node will be evicted.
Let us stop the execution of cssdagent for 40 secs
root@rac1 ~]# kill -STOP 5106; sleep 40; kill -CONT 5106
check the alert log of host01
Node2 is rebooted
[grid@host01 host01]$ tailf /u01/app/11.2.0/grid/log/host01/alerthost01.log
[ohasd(12412)]CRS-8011:reboot advisory message from host: host02, component: ag100946,
with time stamp: L-2012-11-0910:21:28.040
[ohasd(12412)]CRS-8013:reboot advisory message text: Rebooting after limit 28100 exceeded;
disk timeout 28100, network
timeout 27880, last heartbeat from CSSD at epoch seconds 352436647.013, 34280 milliseconds
ago based on invariant clock
Node 2 is rebooted and network connection with it breaks
value of 294678040
2012-11-09 10:21:45.671
[cssd(14493)]CRS-1612:Network communication with node host02 (2) missing for 50% of
timeout interval. Removal of this node
from cluster in 14.330 seconds
2012-11-09 10:21:53.923
2012-11-09 10:21:59.845
2012-11-09 10:22:02.587
[cssd(14493)]CRS-1632:Node host02 is being removed from the cluster in cluster incarnation
247848834
2012-11-09 10:22:02.717
[cssd(14493)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 .
2012-11-09 10:22:02.748
[crsd(14820)]CRS-5504:Node down event reported for node host02.
2012-11-09 10:22:10.086
[crsd(14820)]CRS-2773:Server host02 has been removed from pool Generic.
2012-11-09 10:22:10.086
[crsd(14820)]CRS-2773:Server host02 has been removed from pool ora.orcl.
11g R2 RAC : Node Eviction Due To Member Kill Escalation

If the Oracle Clusterware itself is working perfectly but one of the RAC instances is hanging ,
the database LMON process will request a member kill escalation and ask the CSS process to
remove the hanging database instance from the cluster.
The following example will demonstrate it in a cluster consisting of two nodes:
SQL> select instance_name, host_name from gv$instance;
SQL> col host_name for a20
select instance_name, host_name from gv$instance;
INSTANCE_NAME HOST_NAME
-
orcl1 host01.example.com
- On host02 server stop the execution of all rdbms processes (by sending the STOP signal)
Find out current database processes
[root@host02 ~]# ps -ef | grep ora_ | grep orcl2
oracle 6215 1 0 11:20 ? 00:00:00 ora_pmon_orcl2
oracle 6217 1 0 11:20 ? 00:00:00 ora_vktm_orcl2
oracle 6221 1 0 11:20 ? 00:00:00 ora_gen0_orcl2
oracle 6223 1 0 11:20 ? 00:00:00 ora_diag_orcl2
oracle 6225 1 0 11:20 ? 00:00:00 ora_dbrm_orcl2
oracle 6227 1 0 11:20 ? 00:00:00 ora_ping_orcl2
oracle 6229 1 0 11:20 ? 00:00:00 ora_psp0_orcl2
oracle 6231 1 0 11:20 ? 00:00:00 ora_acms_orcl2
oracle 6233 1 0 11:20 ? 00:00:00 ora_dia0_orcl2
oracle 6235 1 0 11:20 ? 00:00:00 ora_lmon_orcl2
oracle 6237 1 0 11:20 ? 00:00:02 ora_lmd0_orcl2
stop the execution of all rdbms processes (by sending the STOP signal)
[root@host02 ~]# ps -ef | grep ora_ | grep orcl2 | awk {print $2} | while read PID
do
kill -STOP $PID
done
. From the client point of view the Real Application Cluster database is hanging on both
nodes. No queries or DMLs are possible. Try to execute a query. The query will hang.
SQL> select instance_name, host_name from gv$instance;

no output, query hangs
. Due to missing heartbeats the healthy RAC instance on node host01 will remove the hanging
RAC instance by requesting a member kill escalation.
Check the database alert log file on host01 : LMS process issues a request to CSSD to
reboot the node.
The node is evicted and instance is restarted after node joins the cluster.
[root@host01 trace]# tailf /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/alert_orcl1.log
LMS0 (ospid: 31771) has detected no messaging activity from instance 2
LMS0 (ospid: 31771) issues an IMR to resolve the situation
Please check LMS0 trace file for more detail.
Fri Nov 09 11:15:04 2012
Remote instance kill is issued with system inc 30
Remote instance kill map (size 1) : 2
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
Fri Nov 09 11:15:13 2012
IPC Send timeout detected. Sender: ospid 6308 [[email protected] (PZ97)]
Receiver: inst 2 binc 429420846 ospid 6251
Waiting for instances to leave:
2
Reconfiguration started (old inc 4, new inc 8)
List of instances:
1 (myinst: 1)
.. Recovery of instance 2 starts
Global Resource Directory frozen
.
All grantable enqueues granted
Post SMON to start 1st pass IR
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
Started redo scan
IPC Send timeout to 2.0 inc 4 for msg type 12 from opid 42
Completed redo scan

read 93 KB redo, 55 data blocks need recovery
Started redo application at
Thread 2: logseq 9, block 42
Recovery of Online Redo Log: Thread 2 Group 3 Seq 9 Reading mem 0
Mem# 0: +DATA/orcl/onlinelog/group_3.266.798828557
Mem# 1: +FRA/orcl/onlinelog/group_3.259.798828561
Completed redo application of 0.05MB
Completed instance recovery at
Thread 2: logseq 9, block 228, scn 1069404
52 data blocks read, 90 data blocks written, 93 redo k-bytes read
Thread 2 advanced to log sequence 10 (thread recovery)
Fri Nov 09 12:18:55 2012
.
Check the cluster clusterware alert log of host01
The node is evicted and rebooted to join the cluster
[grid@host01 host01]$ tailf /u01/app/11.2.0/grid/log/host01/alerthost01.log
[cssd(14493)]CRS-1607:Node host02 is being evicted in cluster incarnation 247848838; details
at (:CSSNM00007:) in
/u01/app/11.2.0/grid/log/host01/cssd/ocssd.log.
2012-11-09 11:15:56.140
[ohasd(12412)]CRS-8011:reboot advisory message from host: host02, component: mo103324,
with time stamp: L-2012-11-0911:15:56.580
[ohasd(12412)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot,
unexpected failure 8 received from
CSS
2012-11-09 11:16:17.365
[cssd(14493)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 .
2012-11-09 11:16:17.400
2
Node 2 joins the cluster
[cssd(14493)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 .
2012-11-09 12:18:52.713
[crsd(14820)]CRS-2772:Server host02 has been assigned to pool Generic.

2012-11-09 12:18:52.713
[crsd(14820)]CRS-2772:Server host02 has been assigned to pool ora.orcl.
7. After the node rejoins the cluster and the instance is restarted, reexecute the query it
succeeds
SQL> conn sys/oracle@orcl as sysdba
col host_name for a20
select instance_name, host_name from gv$instance;
INSTANCE_NAME HOST_NAME
-
11g R2 RAC: Node Eviction Due To Missing Disk Heartbeat
In this post, I will demonstrate node eviction due to missing disk heartbeat i.e. a node will be
evicted from the cluster, if it cant access the voting disk. To simulate it, I will stop iscsi service
on one of the nodes and then scan alert logs and ocssd logs of various nodes.
Current scenario:
No. of nodes in the cluster : 3
Names of the nodes : host01, host02, host03
Name of the cluster database : orcl
I will stop ISCSI service on host03 so that it is evicted.
Stop ISCSI service on host03 so that it cant access shared storage and hence voting disk
[root@host03 ~]# service iscsi stop
scan alert log of host03 Note that I/O error occurs at 03:32:11
[root@host03 ~]# tailf /u01/app/11.2.0/grid/log/host03/alerthost03.log
Note that ocssd process of host03 is not able to access voting disks
[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK01; details at

(:CSSNM00059:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:32:11.310
2012-11-17 03:32:11.311
2012-11-17 03:32:11.311
2012-11-17 03:32:11.312
2012-11-17 03:32:11.310
ACFS cant be accessed
[client(8048)]CRS-10001:ACFS-9112: The following process IDs have open references on
/u01/app/oracle/acfsmount/11.2.0/sharedhome:
[client(8050)]CRS-10001:6323 6363 6391 6375 6385 6383 6402 6319 6503 6361 6377 6505
6389 6369 6335 6367 6333 6387 6871 6325 6381 6327 6496 6498 6552 6373 7278 6339 6400
6357 6500 6329 6365
[client(8052)]CRS-10001:ACFS-9113: These processes will now be terminated.
[client(8127)]CRS-10001:ACFS-9114: done.
[client(8136)]CRS-10001:ACFS-9115: Stale mount point
/u01/app/oracle/acfsmount/11.2.0/sharedhome was recovered.
[client(8178)]CRS-10001:ACFS-9114: done.
[client(8183)]CRS-10001:ACFS-9116: Stale mount point
/u01/app/oracle/acfsmount/11.2.0/sharedhome was not recovered.
[client(8185)]CRS-10001:ACFS-9117: Manual intervention is required.
2012-11-17 03:33:34.050
[/u01/app/11.2.0/grid/bin/orarootagent.bin(5682)]CRS-5016:Process
/u01/app/11.2.0/grid/bin/acfssinglefsmount spawned by agent
/u01/app/11.2.0/grid/bin/orarootagent.bin for action start failed: details at (:CLSN00010:)
in /u01/app/11.2.0/grid/log/host03/agent/crsd/orarootagent_root/orarootagent_root.log
At 03:34, voting disk cant be accessed even after waiting for timeout
2012-11-17 03:34:10.718
[cssd(5149)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file
ORCL:ASMDISK01 will be considered not functional in 99190 milliseconds
2012-11-17 03:34:10.724
2012-11-17 03:34:10.724
2012-11-17 03:35:10.666
2012-11-17 03:35:10.666
2012-11-17 03:35:10.666
2012-11-17 03:35:46.654
2012-11-17 03:35:46.654
2012-11-17 03:35:46.654
Voting files are offlined as they cant be accessed
[cssd(5149)]CRS-1604:CSSD voting file is offline: ORCL:ASMDISK01; details at
2012-11-17 03:36:10.596
2012-11-17 03:36:10.596
2012-11-17 03:36:10.596
CSSD of host03 reboots the node as no. of voting disks available(0) is less than minimum
required (2)
[cssd(5149)]CRS-1606:The number of voting files available, 0, is less than the minimum

number of voting files required, 2, resulting in CSSD termination to ensure data integrity;
details at (:CSSNM00018:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log
2012-11-17 03:36:15.645
[ctssd(5236)]CRS-2402:The Cluster Time Synchronization Service aborted on host host03.
Details at (:ctsselect_mmg5_1: in /u01/app/11.2.0/grid/log/host03/ctssd/octssd.log.
scan ocssd log of host03
[root@host03 ~]# tailf /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log
I/O fencing for ORCL database is carried out by CSSD at 03:32 ( same time as
when host02 got the msg that orcl has failed on host03)
2012-11-17 03:32:10.356: [ CSSD][997865360]clssgmFenceClient: fencing client (0xaa14990),
member 2 in group DBORCL, no share, death fence 1, SAGE fence 0
2012-11-17 03:32:10.356: [ CSSD][997865360]clssgmUnreferenceMember: global grock
DBORCL member 2 refcount is 7
2012-11-17 03:32:10.356: [ CSSD][997865360]clssgmFenceProcessDeath: client (0xaa14990)
pid 6337 undead
..
2012-11-17 03:32:10.356: [ CSSD][997865360]clssgmFenceClient: fencing client (0xaa24250),
member 4 in group DAALL_DB, no share, death fence 1, SAGE fence 0
2012-11-17 03:32:10.356: [ CSSD][997865360]clssgmFenceClient: fencing client (0xaa6db08),

member 0 in group DG_LOCAL_DATA, same group share, death fence 1, SAGE fence 0
2012-11-17 03:32:10.357: [ CSSD][864708496]clssgmTermMember: Terminating member 2

(0xaa15920) in grock DBORCL
2012-11-17 03:32:10.358: [ CSSD][864708496]clssgmFenceCompletion: (0xaa46760) process

death fence completed for process 6337, object type 3
..
2012-11-17 03:32:10.358: [ CSSD][864708496]clssgmFenceCompletion: (0xaa05758) process
death fence completed for process 6337, object type 2
2012-11-17 03:32:10.359: [ CSSD][852125584]clssgmRemoveMember: grock DAALL_DB,

member number 4 (0xaa05aa8) node number 3 state 0x0 grock type 2
2012-11-17 03:32:11.310: [ SKGFD[942172048]ERROR: -15(asmlib

ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so op ioerror error I/O Error)
2012-11-17 03:32:11.310: [ CSSD][942172048](:CSSNM00059:)clssnmvWriteBlocks: write

failed at offset 19 of ORCL:ASMDISK02
2012-11-17 03:32:11.310: [ SKGFD][973764496]ERROR: -15(asmlib
ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so op ioerror error I/O Error)
2012-11-17 03:32:11.310: [ CSSD][973764496](:CSSNM00059:)clssnmvWriteBlocks: write
failed at offset 19 of ORCL:ASMDISK03
2012-11-17 03:32:11.349: [ CSSD][997865360]clssgmFenceClient: fencing client (0xaa38ae0),

member 2 in group DBORCL, same group share, death fence 1, SAGE fence 0
2012-11-17 03:32:11.349: [ CSSD][997865360]clssgmFenceClient: fencing client (0xaa5e128),

member 0 in group DG_LOCAL_DATA, same group share, death fence 1, SAGE fence 0
2012-11-17 03:32:11.354: [ CSSD][908748688]clssnmvDiskAvailabilityChange: voting file

ORCL:ASMDISK01 now offline
2012-11-17 03:32:12.038: [ CSSD][810166160]clssnmvSchedDiskThreads: DiskPingThread for

voting file ORCL:ASMDISK02 sched delay 1610 > margin 1500 cur_ms 232074 lastalive
230464
230434
.
2012-11-17 03:32:12.223: [ CLSF][887768976]Closing handle:0xa746bc0
2012-11-17 03:32:12.223: [ SKGFD][887768976]Lib
:ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so: closing handle 0xa746df8 for disk
:ORCL:ASMDISK01:
2012-11-17 03:32:12.236: [ CLSF][921192336]Closing handle:0xa5cbbb0
2012-11-17 03:32:12.236: [ SKGFD][921192336]Lib
:ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so: closing handle 0xa644fb8 for disk
:ORCL:ASMDISK02:

230464

230464
voting file ORCL:ASMDISK01 sched delay 3140 > margin 1500 cur_ms 233574
2012-11-17 03:36:10.638: [ CSSD][877279120]CALL TYPE: call ERROR SIGNALED: no

CALLER: clssscExit
scan alert log of host01
Note that reboot message from host03 is received at 03:36:16
[root@host01 host01]# tailf /u01/app/11.2.0/grid/log/host01/alerthost01.log
with time stamp: L-2012-11-17-03:36:16.705
unexpected failure 8 received from CSS
2012-11-17 03:36:29.610
After host03 reboots itself, network communication with host03 is lost
timeout interval. Removal of this node from cluster in 14.060 seconds
2012-11-17 03:36:37.988
2012-11-17 03:36:43.992
2012-11-17 03:36:46.441
After network communication cant be established for timeout interval, the node is
removed form cluster
[cssd(5177)]CRS-1632:Node host03 is being removed from the cluster in cluster incarnation
232819906
2012-11-17 03:36:46.572
Note that ocssd process of host01 discovers missing disk heartbeat from host03 at
03:32:16
[root@host01 cssd]# tailf /u01/app/11.2.0/grid/log/host01/cssd/ocssd.log
2012-11-17 03:32:16.352: [ CSSD][852125584]clssgmGrockOpTagProcess:

clssgmCommonAddMember failed, member(-1/CLSN.ONS.ONSNETPROC[3]) on node(3)
2012-11-17 03:32:16.352: [ CSSD][852125584]clssgmGrockOpTagProcess: Operation(3)
unsuccessful grock(CLSN.ONS.ONSNETPROC[3])
2012-11-17 03:32:16.352: [ CSSD][852125584]clssgmHandleMasterJoin:
clssgmProcessJoinUpdate failed with status(-10)
.
2012-11-17 03:36:15.328: [ CSSD][810166160]clssnmDiscHelper: host03, node(3) connection
failed, endp (0x319), probe((nil)), ninf->endp 0x319
2012-11-17 03:36:15.328: [ CSSD][810166160]clssnmDiscHelper: node 3 clean up, endp
(0x319), init state 3, cur state 3
2012-11-17 03:36:15.329: [GIPCXCPT][852125584]gipcInternalDissociate: obj 0x96c7eb8

[0000000000001310] { gipcEndpoint : localAddr gipc://host01:f278-d1bd-15092f25#10.0.0.1#20071, remoteAddr gipc://host03:gm_cluster01#10.0.0.3#58536, numPend 0,
numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x141d, pidPeer 0, flags
0x261e, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
@
scan alert log of host02
Note that reboot message is received ar 03:36:16
[root@host02 ~]# tailf /u01/app/11.2.0/grid/log/host02/alerthost02.log
. At 03:32, CRSD process of host02 receives message that orcl database has failed on host03
as
datafiles for orcl are on shared storage
[crsd(5576)]CRS-2765:Resource ora.orcl.db has failed on server host03.
2012-11-17 03:32:44.303
. CRSD process of host02 receives message that acfs has failed on host03 as
shared storage cant be accessed
[crsd(5576)]CRS-2765:Resource ora.acfs.dbhome_1.acfs has failed on server host03.
2012-11-17 03:36:16.981
. ohasd process receives reboot advisory message from host03
with time stamp: L-2012-11-17-03:36:16.705
2012-11-17 03:36:16.981
with time stamp: L-2012-11-17-03:36:16.705

2012-11-17 03:36:28.920
. CSSD process of host02 identifies missing network communication from host03 as host03
has rebooted itself
2012-11-17 03:36:37.307
2012-11-17 03:36:43.328
After network communication cant be established for timeout interval, the node is
removed form cluster
2012-11-17 03:36:46.297
2012-11-17 03:36:46.470
2012-11-17 03:36:51.890
[crsd(5576)]CRS-2773:Server host03 has been removed from pool Generic.
2012-11-17 03:36:51.909
[crsd(5576)]CRS-2773:Server host03 has been removed from pool ora.orcl.
[cssd(5284)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02
host03 .
note that ocssd of host02 discovers missing host03 only after it has been rebooted at 03:36
[root@host02 ~]# tailf /u01/app/11.2.0/grid/log/host02/cssd/ocssd.log
2012-11-17 03:36:15.052: [ CSSD][810166160]clssnmDiscHelper: host03, node(3) connection
failed, endp (0x22e), probe((nil)), ninf->endp 0x22e
2012-11-17 03:36:15.052: [ CSSD][810166160]clssnmDiscHelper: node 3 clean up, endp
(0x22e), init state 3, cur state 3
..
2012-11-17 03:36:15.052: [ CSSD][852125584]clssgmPeerDeactivate: node 3 (host03), death 0,
state 0x1 connstate 0x1e
.
2012-11-17 03:36:28.920: [ CSSD][841635728]clssnmPollingThread: node host03 (3) at 50%

heartbeat fatal, removal in 14.420 seconds
2012-11-17 03:36:28.920: [ CSSD][841635728]clssnmPollingThread: local diskTimeout set to
27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2012-11-17 03:36:29.017: [ CSSD][810166160]clssnmvSchedDiskThreads:
DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 474884 lastalive 474074
2012-11-17 03:36:29.908: [ CSSD][852125584]clssgmTagize: version(1), type(13),
tagizer(0x80cf3ac)
2012-11-17 03:36:29.908: [ CSSD][852125584]clssgmHandleDataInvalid: grock HB+ASM,
member 1 node 1, birth 1
2012-11-17 03:36:32.204: [ CSSD][831145872]clssnmSendingThread: sending status msg to all
nodes
11g R2 RAC: Node Eviction Due To Missing Network Heartbeat

In this post, I will demonstrate node eviction due to missing netsork heartbeat i.e. a node will be
evicted from the cluster, if it cant communicate wioth other nodes in the cluster. To simulate it, I
will stop private network on one of the nodes and then scan alert logs of the surviving nodes.
Current scenario:
No. of nodes in the cluster : 3
Names of the nodes : host01, host02, host03
Name of the cluster database : orcl
I will stop PVT. network service on host03 so that it is evicted.
Find out the pvt network name
[root@host03 ~]# oifcfg getif
eth0 192.9.201.0 global public
eth1 10.0.0.0 global cluster_interconnect
Stop pvt. network service on host03 so that it cant communicate with host01 and host02 and
will be evicted.
[root@host03 ~]# ifdown eth1
OCSSD log of host03
It can be seen that CSSD process of host03 cant communicate with host01 and host02
at 09:43:52
Hence votedisk timeouot is set to Short Disk Time OUT (SDTO) = 27000 ms (27 secs)
2012-11-19 09:43:52.714: [ CSSD][843736976]clssnmPollingThread: node host01 (1) is
impending reconfig, flag 132108, misstime 15120
2012-11-19 09:43:52.927: [ CSSD][2833247120]clssnmSendingThread: sending status msg to
all nodes
Alert log of host03

At 09:43:52, CSSD process host03 identifies that it cant communicate with CSSD on host02
and host03
2012-11-19 09:43:52.714
2012-11-19 09:44:01.880
2012-11-19 09:44:01.880
2012-11-19 09:44:06.536
2012-11-19 09:44:06.536
2012-11-19 09:44:09.599
At 09:44:16, CSSD process of host03 reboots the node to preserve cluster integrity
[cssd(5124)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is
going down to preserve cluster integrity; details at (:CSSNM00008:) in
/u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-19 09:44:16.697
[/u01/app/11.2.0/grid/bin/orarootagent.bin(5713)]CRS-5822:Agent
/u01/app/11.2.0/grid/bin/orarootagent_root disconnected from server. Details at
(:CRSAGF00117:) in
/u01/app/11.2.0/grid/log/host03/agent/crsd/orarootagent_root/orarootagent_root.log.
2012-11-19 09:44:16.193
[ctssd(5285)]CRS-2402:The Cluster Time Synchronization Service aborted on host host03.
Details at (:ctsselect_mmg5_1: in /u01/app/11.2.0/grid/log/host03/ctssd/octssd.log.
2012-11-19 09:44:21.177
Ocssd log of host01
At 09:43:53, CSSD process of host01 identifies that it cantommunicate with CSSD on host03

Alert log of host01
At 09:44:01, alert log of host01 is updated regarding communication failure with host03
2012-11-19 09:44:01.695
2012-11-19 09:44:07.666
2012-11-19 09:44:10.606
[cssd(5308)]CRS-1607:Node host03 is being evicted in cluster incarnation 32819913; details at
2012-11-19 09:44:24.705
At 09:44:24, OHASD process on host01 receives reboot message from host03
with time stamp: L-2012-11-19-09:44:24.373
2012-11-19 09:44:24.705
with time stamp: L-2012-11-19-09:44:24.376
2012-11-19 09:44:46.379
OCSSD log of host02
At 09:43:52, CSSD process of host02 identifies communication failure with host03

18331004
20
Alert log of host02
At 09:44:01 (same as host01), alert log of host02 is updated regarding communication failure
with host03
2012-11-19 09:44:01.971
2012-11-19 09:44:06.750
2012-11-19 09:44:24.520
At 09:44:24 (same as host01), OHASD process on host01 receives reboot message from host03
with time stamp: L-2012-11-19-09:44:24.373
2012-11-19 09:44:24.520
with time stamp: L-2012-11-19-09:44:24.376
2012-11-19 09:44:46.073
20

Node Evition

Uploaded by

Copyright:

Available Formats

Node Evition

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Node Evition

Uploaded by

Copyright:

Available Formats

11g R2 RAC:Node Eviction Due To CSSDagent Stopping

11g R2 RAC : Node Eviction Due To Member Kill Escalation

SQL> select instance_name, host_name from gv$instance;

Completed redo scan

[crsd(14820)]CRS-2772:Server host02 has been assigned to pool Generic.

11g R2 RAC: Node Eviction Due To Missing Disk Heartbeat

[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK01; details at

[cssd(5149)]CRS-1606:The number of voting files available, 0, is less than the minimum

2012-11-17 03:32:10.356: [ CSSD][997865360]clssgmFenceClient: fencing client (0xaa6db08),

2012-11-17 03:32:10.357: [ CSSD][864708496]clssgmTermMember: Terminating member 2

2012-11-17 03:32:10.358: [ CSSD][864708496]clssgmFenceCompletion: (0xaa46760) process

2012-11-17 03:32:10.359: [ CSSD][852125584]clssgmRemoveMember: grock DAALL_DB,

2012-11-17 03:32:11.310: [ SKGFD[942172048]ERROR: -15(asmlib

2012-11-17 03:32:11.310: [ CSSD][942172048](:CSSNM00059:)clssnmvWriteBlocks: write

2012-11-17 03:32:11.349: [ CSSD][997865360]clssgmFenceClient: fencing client (0xaa38ae0),

2012-11-17 03:32:11.349: [ CSSD][997865360]clssgmFenceClient: fencing client (0xaa5e128),

2012-11-17 03:32:11.354: [ CSSD][908748688]clssnmvDiskAvailabilityChange: voting file

2012-11-17 03:32:12.038: [ CSSD][810166160]clssnmvSchedDiskThreads: DiskPingThread for

2012-11-17 03:32:13.825: [ CSSD][997865360]clssnmvSchedDiskThreads: DiskPingThread for

2012-11-17 03:32:13.825: [ CSSD][997865360]clssnmvSchedDiskThreads: DiskPingThread for

2012-11-17 03:36:10.638: [ CSSD][877279120]CALL TYPE: call ERROR SIGNALED: no

2012-11-17 03:32:16.352: [ CSSD][852125584]clssgmGrockOpTagProcess:

2012-11-17 03:36:15.329: [GIPCXCPT][852125584]gipcInternalDissociate: obj 0x96c7eb8

[ohasd(4916)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot,

2012-11-17 03:36:28.920: [ CSSD][841635728]clssnmPollingThread: node host03 (3) at 50%

11g R2 RAC: Node Eviction Due To Missing Network Heartbeat

Alert log of host03

Ocssd log of host01

2012-11-19 09:43:53.340: [ CSSD][841635728]clssnmPollingThread: node host03 (3) at 50%

At 09:43:52, CSSD process of host02 identifies communication failure with host03

Alert log of host02

You might also like