Applies To:: RMAN Troubleshooting Catalog Performance Issues (Doc ID 748257.1)

Download as pdf or txt
Download as pdf or txt
You are on page 1of 4

Copyright (c) 2022, Oracle. All rights reserved. Oracle Confidential.

RMAN Troubleshooting Catalog Performance Issues (Doc ID 748257.1)

In this Document

Purpose
Troubleshooting Steps
References

APPLIES TO:

Oracle Database - Enterprise Edition - Version 9.0.1.0 and later


Oracle Database Cloud Schema Service - Version N/A and later

Oracle Database Exadata Cloud Machine - Version N/A and later

Oracle Cloud Infrastructure - Database Service - Version N/A and


later
Oracle Database Backup Service - Version N/A and later

Information in this document applies to any platform.

PURPOSE

This note discusses what to look for when queries against the RMAN repository (whether in the controlfile or a catalog
database) are performing badly and what diagnostics needs to be gathered when raising a Service Request with Oracle
Support Services.

This note is intended for use by Database Administrators and Support personnel investigating RMAN performance
problems relating specifically to catalog resyncs or queries against the controlfile or catalog database whilst running RMAN.

TROUBLESHOOTING STEPS

Identifying a problem with the RMAN repository


The RMAN repository is updated whenever new backups are taken or housekeeping is done to maintain the backup
metadata; it is queried when RMAN reports are run and during restore to check backup history and determine the best and
most efficient way to carry out the restore. If a catalog is used, a number of resyncs are done first to update the catalog
with any new information in the controlfile but once the resyncs are complete, ALL subsequent queries of RMAN metadata
(during crosscheck, report, restore etc) are run against the catalog database only.

You know there is a problem if:

An explicit catalog resync takes a long time to complete


Statspack or AWR reports for a nocatalog run shows top wait events in the target are for CPU or control file IO:
"control file parallel write"
"control file sequential read"
Statspack or AWR reports for catalog run shows top wait events in the catalog are for CPU or the catalog datafile
IO:
"db file scattered read"
"db file sequential read"
Statspack or AWR reports shows most expensive queries in terms of CPU or elapsed time are those issued against
the controlfile or rman catalog tables
You have worked through Note 360443.1: RMAN Backup Performance, and identified the bottleneck to be in the
Oracle Layer before the backup actually starts
You have worked through Note 740911.1: RMAN Restore Performance, and identified the bottleneck to be in the
Oracle Layer before the restore actually starts

Control_file_record_keep_time and recovery window


Control_file_record_keep_time specifies the minimum amount of RMAN metadata guarranteed to be retained in the
controlfile; the default is 7 days.  In practice the rate at which records age out of the controlfile depends on how long it
takes for a recyclable section to fill up which may well be more than 7 days. It is important to note that
control_file_record_keep_time and recovery window are NOT the same; the former governs when records can be
overwritten in the controlfile, primarily to prevent unlimited controlfile growth whilst the latter is concerned with retention
of RMAN metadata according to business requirements (see Bug 6448254: RAISE WARNING IF RECOVERY WINDOW IS >
CONTROL_FILE_RECORD_KEEP_TIME IN NOCAT MODE). If a catalog is used and RMAN is run daily or at a frequency
LESS than control_file_record_keep_time days, RMAN metadata will never be lost as implicit catalog resyncs ensure data is
regularly propogated into the catalog.

Integrity of RMAN metadata

Constraints in the catalog ensure the integrity of the rman metadata. For example, a backup_set must have a
corresponding backup_piece which must have corresponding backup_X (datafile, redolog ) entries. There are no such
constraints in a controlfile; records in different sections can age out at different rates. For example, the one:many
relationship between backup_set and backup_redolog entries means the backup_redologs section will fill up faster than
the backup_set section and over time, a backup_redolog entry may age out of the controlfile BEFORE its corresponding
backup_set entry, leaving a backup_set entry with missing backup_redolog entries (mis-matched). Running housekeeping
operations (report/delete obsolete, crosscheck etc) against such a CONTROLFILE with mismatched entries will produce
unpredictable results (see Bug 7357779: RMAN REPORT/DELETE OBSOLETE RETURNS ARCHIVELOG BACKUPS INSIDE
RETENTION POLICY).

To avoid any issues with mis-matched entries, set control_file_record_keep_time equal to the recovery window +1 which
ensures that RMAN metadata INSIDE the retention policy will never be affected. In practice this may not be possible if a
very large recovery window is needed so typically a catalog is used, a retention policy configured and
control_file_record_keep_time left to its default of 7 days. But there may still be times when the catalog database is
unavailable (it is being upgraded perhaps or is under investigation) and an RMAN backup must still proceed in which case:

set control_file_record_keep_time days to the recovery window plus 1 to ensure no loss of metadata whilst the
catalog is unavailable
DO NOT RUN ANY MAINTENANCE OPERATIONS until the catalog is again available expecially if your recovery
window is greater than the original setting of control_file_record_keep_time days

Tackling a performance issue

AWR snapshot reports from the target (nocatalog) or catalog database should identify the specific SQL that is causing a
problem in terms of CPU, high IO or elapsed times.

Refer to the following notes to see if the problematic SQL is a known issue:

Note 247611.1:Known RMAN Performance Problems


Note 463227.1: Oracle10g RMAN Recovery Catalog Known Performance Issues

If none of the issues in the above notes are relevant, then further investigation is needed and the following diagnostics will
be useful and should be uploaded if a Service Request is raised with Oracle Support Services. There are three processes to
consider:

The RMAN client – this should always be traced


The 1st default channel (for nocatalog operations only, this process runs all the rman queries against the controlfile)
The RMAN session in the catalog instance

All the traces will generate a lot of information so consider FIRST how to minimise the output by stripping the job down to
a SINGLE rman command. Execute each command in a script in isolation to see which command produces the problem
and trace just that one command. It is always worth running an explicit catalog resync first to see if the real problem lies
in the implicit resync done for virtually all commands.

Tracing the RMAN client

Ideally, set debug trace on the RMAN command line and execute a simplied script to illustrate the problem eg

%rman target / catalog usr/pw@<alias> log rman.log trace rman.trc debug
run {

allocate channel t1 type sbt parms=(…………);
backup database plus archivelog all;

Otherwise trace only the command that you are interested in. The following example traces only the crosscheck command:

%rman target / catalog usr/pw@<alias> log rman.log trace rman.trc


run {

allocate channel t1 type sbt parms=(….);


allocate channel t2 type sbt parms=(….);
backup database;
backup archivelog all until time ' sysdate –7' delete all input;
debug on;
crosscheck backup;
debug off;
}

Tracing queries against the RMAN repository


Make sure before you invoke rman that the following environmental variables are set:

NLS_LANG='american_america.<charset>' 
NLS_DATE_FORMAT='dd-mon-rr hh24:mi:ss' 

And for the target database (nocatalog operations) or catalog database (catalog operations):

TIME_STATISTICS=TRUE 

MAX_DUMP_FILE_SIZE=UNLIMITED 

If a recovery catalog is normally used, try again WITHOUT the catalog; if the problem persists then for diagnostic
purposes, get the trace against the CONTROLFILE only. Tracing the RMAN catalog session should only be done if the
problem only occurs when a catalog is used.

The Controlfile (nocatalog)

%rman target / log rman.log trace rman.trc debug
run { 

sql "alter session set events ''10046 trace name context forever, level 12''"; 
set command id to 'rman';

allocate channel for maintenance type sbt…………..;
backup database;

Note: Two single quotes before : 10046



: level 12
Two single quotes and a double quote after

This traces the first default channel; trace file will be generated in the target udump directory using the format
<SID>_ora_<spid>.trc. To find the <spid> of the 1st default channel:

col program format a20  
col action format a20  

select s.sid, p.spid, s.program, s.client_info,s.action, seq#, event, wait_time,  
seconds_in_wait AS sec_wait  

from v$session s, v$process p  

where s.paddr = p.addr and s.program like '%rman%'; 

Typical output:

SID SPID PROGRAM                          CLIENT_INFO


---------------------------------------------------------

ACTION              SEQ# EVENT                      WAIT_TIME SEC_WAIT


-----------------------------------------------------------------------------
139 24822 rman@celcsol4 (TNS V1-V3)

0000012 FINISHED129    853  SQL*Net

message from client 0        1794


147  4574 rman@celcsol4 (TNS V1-V3)

                       115  SQL*Net message from client 0        1797


135  5056 rman@celcsol4 (TNS V1-V3)       rman channel=ORA_DISK_1

0000046 FINISHED129   2219  SQL*Net message from client 0        1794

Note:

is the polling channel; it never has a value in ACTION
- sid 147
- sid 139 is the 1st default Channel (CLIENT_INFO is null)

- sid 135 is the allocated channel ORA_DISK_1 (per CLIENT_INFO)


The Catalog
Allow the rman job to make the catalog connection first – it may help to use the host command to allow time to find the
catalogn session:

%rman target / catalog usr/pw@<alias> log rman.log trace rman.trc debug
run { 

set command id to 'rman';

allocate channel for maintenance type sbt…………..;
host;

backup database;
}

The host command will halt execution and return control to the OS prompt.
To identify the RMAN session in the catalog database:

SELECT s.sid, s.serial#, p.spid, s.program, s.action


FROM v$session s, v$process p

WHERE s.paddr = p.addr

AND s.program like '%rman%';

Note the spid value eg 3164 then do:

oradebug setospid 3164
oradebug unlimit

oradebug event 10046 trace name context forever, level 12
exit

Once this is done, to return to the RMAN session and to continue execution of the script type: exit
A trace file will be generated in the CATALOG udump directory using the format <SID>_ora_<spid>.trc.

Upload Diagnostics

1. RMAN client trace file


2.

1st default channel 10046 trace file (nocatalog only)


3. RMAN catalog 10046 trace file (catalog issues only)

4. Related AWR/Statspack reports

REFERENCES

BUG:6448254 - RAISE WARNING IF RECOVERY WINDOW IS > CONTROL_FILE_RECORD_KEEP_TIME IN NOCAT MOD


BUG:7357779 - RMAN REPORT/DELETE OBSOLETE RETURNS ARCHIVELOG BACKUPS INSIDE RETENTION POLICY

NOTE:247611.1 - Known RMAN Performance Problems

NOTE:360443.1 - RMAN Backup Performance

NOTE:463227.1 - Oracle10g RMAN Recovery Catalog Known Performance Issues


NOTE:740911.1 - RMAN Restore Performance

Didn't find what you are looking for?

You might also like