P1-P2 Handling Instructions
Revision History:
Version  Date         Remarks
1.0      23-Apr-2014  Initial document created
1.1      03-Jun-2014  Shrirang Samudralwar updated with OS commands and SQL scripts
1.2      28-Jul-2014  Corrections and review
Reviewers / Approvers:
Name Role
Venkat Chandragiri Reviewer / Approver
Anthony Sanchez Reviewer / Approver
Ernie Lacson Reviewer / Approver
Jeff Oliveto Reviewer / Approver
Sendhil Thanigaimalai Reviewer / Approver
Mona Mohan George Reviewer / Approver
Luven Desouza (IMC) Reviewer
Deborah Kuznik (Rolta PM) Reviewer
P1-P2 HANDLING INSTRUCTIONS
Triggering Point:
Whoever notices the issue first (Rolta or the App team) should take it up with IMC for further action.
Phase I (0 – 20 minutes)
Inform IMC and make sure you get the L2 App team and monitoring team involved immediately.
Find out if monitoring is showing any services down. Troubleshooting should be a joint effort.
Have the Application team run the following connection checks if the application is experiencing
connectivity issues:
o Verify that the Oracle client home is installed.
Perform tnsping <SCAN NAME>; response time should be under 100 ms.
Connect to the database with the application username through SQL*Plus.
o Perform a JDBC-based connection check (Example ..).
o Perform telnet <SCAN NAME> 1521.
o Run nslookup <SCAN NAME>; it should resolve to 3 IPs.
o Ping <SCAN NAME> (4 times); it should resolve the 3 different SCAN IPs, with latency under 50 ms.
What the DBA team needs to do (DBAs to report the findings below to IMC and update Jira):
Check Insight tickets for that database (session wait, log scan, unresponsive, monitor fault).
Steps: The Insight tool is configured for all Prod/Non-Prod databases and raises tickets for
session waits, log scan, unresponsive, monitor faults, tablespace space issues, etc.
Steps: cd $HOME/rolta
$ sh hourly_db_status_report.sh
The generated report is delivered to all Rolta DBAs' mailboxes, and the parameters above
can be validated from that report.
Capture the server load from the OSWatcher archive directory for the last hour, or for the last 15
minutes, using the top command:
$ top OR $ uptime
top also gives other useful information, such as the highest-consuming OS processes (oracle, root, or
other users) and memory/RAM and swap usage.
Steps: Log on to the server as root, go to the /var/log folder, and tail the last 30 lines to check for any
errors or messages.
$ ps -ef | grep pmon
OR
$ ps -ef | grep asm
Log on to the ASM instance by sourcing its environment, then check whether it is running using SQL
commands.
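For example, a minimal check from SQL*Plus connected to the ASM instance (a hedged sketch, not the exact command from the original runbook):
-- Confirm the ASM instance is up; STATUS shows STARTED for a running ASM instance
select instance_name, status, host_name
from v$instance;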
For CRS status, execute the command below as root:
# ps -ef | grep crs
It will show the CRS daemon running as crsd.bin from the Grid home.
Execute the command below to check CRS status (as root, from $GRID_HOME/bin).
It will show the cluster resources and their status on the different nodes of the cluster.
# ./crs_stat -t
Sample output:
Check how much time the lgwr process is taking to sync from the log buffer to the log files.
Steps: Source the ASM instance environment, run asmcmd, and execute the iostat command.
ASMCMD> iostat
Group_Name Dsk_Name Reads Writes
DG01 POD1_PROD02_DG01_01 8656176320512 351804456960
DG01 POD1_PROD02_DG01_02 7886433042432 352465548288
DG01 POD1_PROD02_DG01_03 9159568773120 361441565696
DG01 POD1_PROD02_DG01_04 9008070394368 373149604864
DG01 POD1_PROD02_DG01_05 7625927325184 353863700992
DG01 POD1_PROD02_DG01_06 7647393869312 358348650496
DG01 POD1_PROD02_DG01_07 8277035104256 345368769536
DG01 POD1_PROD02_DG01_08 7873794318336 357313810944
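In addition to asmcmd iostat, a hedged database-side sketch of redo write latency (standard wait events; average values per wait are shown in milliseconds) is:
-- Average redo-write wait times since instance startup
select event, total_waits,
       round(time_waited_micro/1000/nullif(total_waits,0), 2) avg_wait_ms
from   v$system_event
where  event in ('log file sync', 'log file parallel write');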
Check file system usage (df -kh) and look for any 100% filled mount point.
Steps:
$ df -kh
Always look at Used, Available, and Use%. Ensure that Use% does not cross the
threshold limit, which is generally 90%.
Since db_recovery_file_dest is set to disk group +DG4, log on to the ASM instance and check the
space using asmcmd.
1. If it is an ASM disk group, check Total_MB and Free_MB.
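The original command is not reproduced in this copy; a hedged equivalent using the ASM views is:
-- Total and free space per ASM disk group
select name, state, total_mb, free_mb,
       round(free_mb/nullif(total_mb,0)*100, 2) pct_free
from   v$asm_diskgroup;
From asmcmd, lsdg reports the same Total_MB and Free_MB figures.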
Check for running RMAN backups (other than archive log backups) and terminate the backups:
$ ps -ef | grep rman
$ ps -ef | grep level0*
$ ps -ef | grep level1*
Check for running exports/imports and consult the CFDBA team about termination:
$ ps -ef | grep expdp
$ ps -ef | grep impdp
Discuss with the CFDBA team; if the export/import is not critical, the job can be killed during
critical issues.
Query the database for the top wait events of the last 30 minutes (or the last hour, if the issue lasted more
than an hour).
Steps: Use the ASH views to query the top wait events for the last hour using the commands below:
Top events:
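(The original query is not reproduced in this copy; a hedged ASH-based sketch consistent with the "Top sessions" query below, covering the last 30 minutes, is shown instead.)
-- Top wait events from ASH over the last 30 minutes
select ash.event,
       sum(ash.wait_time + ash.time_waited)/1000000 ttl_wait_time_in_seconds
from   v$active_session_history ash
where  ash.sample_time between sysdate - 30/1440 and sysdate
and    ash.event is not null
group by ash.event
order by 2 desc;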
Top sessions:
select sesion.sid,
sesion.username,
sum(ash.wait_time + ash.time_waited)/1000000/60 ttl_wait_time_in_minutes
from v$active_session_history ash, v$session sesion
where sample_time between sysdate - 60/2880 and sysdate
and ash.session_id = sesion.sid
group by sesion.sid, sesion.username order by 3 desc;
Top queries:
SELECT active_session_history.user_id,
dba_users.username,sqlarea.sql_text,
SUM(active_session_history.wait_time +
active_session_history.time_waited)/1000000 ttl_wait_time_in_seconds
FROM v$active_session_history active_session_history,
v$sqlarea sqlarea,dba_users
WHERE active_session_history.sample_time BETWEEN SYSDATE - 1 AND SYSDATE
AND active_session_history.sql_id = sqlarea.sql_id
AND active_session_history.user_id = dba_users.user_id
and dba_users.username <>'SYS'
GROUP BY active_session_history.user_id,sqlarea.sql_text, dba_users.username
ORDER BY 4 DESC;
Top segments
SELECT dba_objects.object_name,
dba_objects.object_type,
active_session_history.event,
SUM(active_session_history.wait_time +
active_session_history.time_waited) ttl_wait_time
FROM v$active_session_history active_session_history,
dba_objects
WHERE active_session_history.sample_time BETWEEN SYSDATE - 1 AND SYSDATE
AND active_session_history.current_obj# = dba_objects.object_id
GROUP BY dba_objects.object_name, dba_objects.object_type, active_session_history.event
ORDER BY 4 DESC;
Most IO:
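(The original query is not reproduced in this copy; a hedged sketch ranking SQL by physical reads from v$sqlarea is shown instead.)
-- Top 10 SQL statements by disk reads currently in the shared pool
select * from (
  select sql_id, executions, disk_reads, buffer_gets,
         substr(sql_text, 1, 60) sql_text
  from   v$sqlarea
  order by disk_reads desc
) where rownum <= 10;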
Check and record the total number of database connections with Active and Inactive status. Find out what
queries the active sessions are running. Are they all the same? From which schemas? From which app
servers?
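A hedged sketch for recording these counts (standard gv$session columns) is:
-- Connection counts by instance, status, schema, and originating machine
select inst_id, status, username, machine, count(*) cnt
from   gv$session
where  type <> 'BACKGROUND'
group by inst_id, status, username, machine
order by cnt desc;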
ii). Check for blocked sessions and the sessions blocking them:
set lines 1000;
set pages 400;
col sql_text format a35
col username format a15
column machine format a24;
column event format a24;
select username,sid,serial#,machine,event,status,s.sql_id,
seconds_in_wait/60 mins,ss.sql_text,blocking_session
from gv$session s,v$sql ss
where type <> 'BACKGROUND' and
username not like ('%SYS%') and
username is not null and status<>'KILLED'
and seconds_in_wait> 60 and s.sql_id=ss.sql_id and
blocking_session is not null order by 8;
Check for library cache wait events and analyze whether they are causing issues in the database.
o Flush the shared pool (after permission from the Bridge).
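A hedged sketch for spotting library cache contention in recent ASH samples, followed by the flush command itself (to be run only with Bridge approval), is:
-- Library cache related waits over the last 30 minutes
select event, count(*) samples
from   v$active_session_history
where  sample_time > sysdate - 30/1440
and    event like 'library cache%'
group by event
order by samples desc;
-- Only after permission from the Bridge:
alter system flush shared_pool;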
Get a count of application logins from the application tables, company-wise (last 24 hours, by hour).
Check the number of sessions that are not running any SQL statement (SQL_ID is null).
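A hedged sketch for this count is:
-- Sessions that are not currently executing any SQL statement
select inst_id, status, count(*) cnt
from   gv$session
where  type <> 'BACKGROUND'
and    sql_id is null
group by inst_id, status
order by cnt desc;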
If the total number of sessions/processes exceeds 800, pay closer attention and make sure that the
server load does not shoot up beyond 100.
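A hedged sketch for checking the current counts against the configured limits is:
-- Current and maximum sessions/processes versus their limits, per instance
select inst_id, resource_name, current_utilization, max_utilization, limit_value
from   gv$resource_limit
where  resource_name in ('sessions', 'processes')
order by inst_id, resource_name;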
Notify the Apps team of long-running jobs, with session start time, current SQL, event, machine,
and LAST_CALL_ET, and ask for instructions to terminate.
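A hedged sketch for pulling these details (the 1800-second threshold is only an example) is:
-- Active sessions running longer than 30 minutes, with details for the Apps team
select s.inst_id, s.sid, s.serial#, s.username, s.machine, s.event,
       s.logon_time, s.sql_id, round(s.last_call_et/60) mins_active
from   gv$session s
where  s.status = 'ACTIVE'
and    s.type <> 'BACKGROUND'
and    s.last_call_et > 1800
order by s.last_call_et desc;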
The key here is to identify whether the load can be attributed to one company or one feature. If so,
it may be possible to have the company stop the job and reschedule it for a non-peak
period.
If too many long-running jobs are scheduled over the same period during peak business
hours (like EUSF1 Quartz jobs), request that the Apps team schedule non-critical
jobs at night, when most of the DB resources are idle.
Find out whether you can attribute the load to a schema/company. If you can, focus on controlling the
SQL from that one schema; it could be just one job.
Find out whether the load is coming from customer-facing app servers, Quartz servers, or reporting servers.
$ free -m
From /var/log/sa, execute sar -S (as root):
$ sar -S 1 3
Linux 2.6.19-194.el5SMP (host) 09/26/2012 _i686_ (8 CPU)
07:31:06 AM kbswpfree kbswpused %swpused kbswpcad %swpcad
07:31:07 AM 8385920 0 0.00 0 0.00
07:31:08 AM 8385920 0 0.00 0 0.00
07:31:09 AM 8385920 0 0.00 0 0.00
Average: 8385920 0 0.00 0 0.00
Check "Puppet":
o ps -ef | grep puppet; logs are under /var/log/puppet*; look for the file with the current timestamp
and review it.
From the OS prompt, execute the commands below. Primarily look at the following parameter
settings at the OS level.
$ ulimit -a
Sample output:
$ ls -ltr *lgwr*
$ tail -30 ASHPCE1D3_lgwr_7112.trc
Sample output:
DBAs to check for a high number of TNS timeout errors in the listener/alert logs; this indicates a
probable issue in the network layer.
o Count the connections for the hour from the SCAN/listener logs.
More input from the Network/Infra team is needed here.
If one single SQL_ID is running in many sessions and they are active, do the following checks.
In the WHERE clause we can filter on a specific owner or table name as per the requirement.
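The original checks are not reproduced in this copy; a hedged sketch that counts active sessions per SQL_ID (the threshold of 5 is only an example, and owner/table filters can be added to the WHERE clause as noted above) is:
-- SQL_IDs being executed concurrently by many active sessions
select s.sql_id, count(*) active_sessions
from   gv$session s
where  s.status = 'ACTIVE'
and    s.type <> 'BACKGROUND'
and    s.sql_id is not null
group by s.sql_id
having count(*) > 5
order by active_sessions desc;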