Applies To:: RMAN Troubleshooting Catalog Performance Issues (Doc ID 748257.1)
Applies To:: RMAN Troubleshooting Catalog Performance Issues (Doc ID 748257.1)
Applies To:: RMAN Troubleshooting Catalog Performance Issues (Doc ID 748257.1)
In this Document
Purpose
Troubleshooting Steps
References
APPLIES TO:
PURPOSE
This note discusses what to look for when queries against the RMAN repository (whether in the controlfile or a catalog
database) are performing badly and what diagnostics needs to be gathered when raising a Service Request with Oracle
Support Services.
This note is intended for use by Database Administrators and Support personnel investigating RMAN performance
problems relating specifically to catalog resyncs or queries against the controlfile or catalog database whilst running RMAN.
TROUBLESHOOTING STEPS
The RMAN repository is updated whenever new backups are taken or housekeeping is done to maintain the backup
metadata; it is queried when RMAN reports are run and during restore to check backup history and determine the best and
most efficient way to carry out the restore. If a catalog is used, a number of resyncs are done first to update the catalog
with any new information in the controlfile but once the resyncs are complete, ALL subsequent queries of RMAN metadata
(during crosscheck, report, restore etc) are run against the catalog database only.
Control_file_record_keep_time specifies the minimum amount of RMAN metadata guarranteed to be retained in the
controlfile; the default is 7 days. In practice the rate at which records age out of the controlfile depends on how long it
takes for a recyclable section to fill up which may well be more than 7 days. It is important to note that
control_file_record_keep_time and recovery window are NOT the same; the former governs when records can be
overwritten in the controlfile, primarily to prevent unlimited controlfile growth whilst the latter is concerned with retention
of RMAN metadata according to business requirements (see Bug 6448254: RAISE WARNING IF RECOVERY WINDOW IS >
CONTROL_FILE_RECORD_KEEP_TIME IN NOCAT MODE). If a catalog is used and RMAN is run daily or at a frequency
LESS than control_file_record_keep_time days, RMAN metadata will never be lost as implicit catalog resyncs ensure data is
regularly propogated into the catalog.
Constraints in the catalog ensure the integrity of the rman metadata. For example, a backup_set must have a
corresponding backup_piece which must have corresponding backup_X (datafile, redolog ) entries. There are no such
constraints in a controlfile; records in different sections can age out at different rates. For example, the one:many
relationship between backup_set and backup_redolog entries means the backup_redologs section will fill up faster than
the backup_set section and over time, a backup_redolog entry may age out of the controlfile BEFORE its corresponding
backup_set entry, leaving a backup_set entry with missing backup_redolog entries (mis-matched). Running housekeeping
operations (report/delete obsolete, crosscheck etc) against such a CONTROLFILE with mismatched entries will produce
unpredictable results (see Bug 7357779: RMAN REPORT/DELETE OBSOLETE RETURNS ARCHIVELOG BACKUPS INSIDE
RETENTION POLICY).
To avoid any issues with mis-matched entries, set control_file_record_keep_time equal to the recovery window +1 which
ensures that RMAN metadata INSIDE the retention policy will never be affected. In practice this may not be possible if a
very large recovery window is needed so typically a catalog is used, a retention policy configured and
control_file_record_keep_time left to its default of 7 days. But there may still be times when the catalog database is
unavailable (it is being upgraded perhaps or is under investigation) and an RMAN backup must still proceed in which case:
set control_file_record_keep_time days to the recovery window plus 1 to ensure no loss of metadata whilst the
catalog is unavailable
DO NOT RUN ANY MAINTENANCE OPERATIONS until the catalog is again available expecially if your recovery
window is greater than the original setting of control_file_record_keep_time days
AWR snapshot reports from the target (nocatalog) or catalog database should identify the specific SQL that is causing a
problem in terms of CPU, high IO or elapsed times.
Refer to the following notes to see if the problematic SQL is a known issue:
If none of the issues in the above notes are relevant, then further investigation is needed and the following diagnostics will
be useful and should be uploaded if a Service Request is raised with Oracle Support Services. There are three processes to
consider:
All the traces will generate a lot of information so consider FIRST how to minimise the output by stripping the job down to
a SINGLE rman command. Execute each command in a script in isolation to see which command produces the problem
and trace just that one command. It is always worth running an explicit catalog resync first to see if the real problem lies
in the implicit resync done for virtually all commands.
Ideally, set debug trace on the RMAN command line and execute a simplied script to illustrate the problem eg
%rman target / catalog usr/pw@<alias> log rman.log trace rman.trc debug
run {
allocate channel t1 type sbt parms=(…………);
backup database plus archivelog all;
Otherwise trace only the command that you are interested in. The following example traces only the crosscheck command:
Make sure before you invoke rman that the following environmental variables are set:
NLS_LANG='american_america.<charset>'
NLS_DATE_FORMAT='dd-mon-rr hh24:mi:ss'
And for the target database (nocatalog operations) or catalog database (catalog operations):
TIME_STATISTICS=TRUE
MAX_DUMP_FILE_SIZE=UNLIMITED
If a recovery catalog is normally used, try again WITHOUT the catalog; if the problem persists then for diagnostic
purposes, get the trace against the CONTROLFILE only. Tracing the RMAN catalog session should only be done if the
problem only occurs when a catalog is used.
%rman target / log rman.log trace rman.trc debug
run {
sql "alter session set events ''10046 trace name context forever, level 12''";
set command id to 'rman';
allocate channel for maintenance type sbt…………..;
backup database;
This traces the first default channel; trace file will be generated in the target udump directory using the format
<SID>_ora_<spid>.trc. To find the <spid> of the 1st default channel:
col program format a20
col action format a20
select s.sid, p.spid, s.program, s.client_info,s.action, seq#, event, wait_time,
seconds_in_wait AS sec_wait
from v$session s, v$process p
where s.paddr = p.addr and s.program like '%rman%';
Typical output:
---------------------------------------------------------
-----------------------------------------------------------------------------
139 24822 rman@celcsol4 (TNS V1-V3)
Note:
is the polling channel; it never has a value in ACTION
- sid 147
- sid 139 is the 1st default Channel (CLIENT_INFO is null)
The Catalog
Allow the rman job to make the catalog connection first – it may help to use the host command to allow time to find the
catalogn session:
%rman target / catalog usr/pw@<alias> log rman.log trace rman.trc debug
run {
set command id to 'rman';
allocate channel for maintenance type sbt…………..;
host;
backup database;
}
The host command will halt execution and return control to the OS prompt.
To identify the RMAN session in the catalog database:
oradebug setospid 3164
oradebug unlimit
oradebug event 10046 trace name context forever, level 12
exit
Once this is done, to return to the RMAN session and to continue execution of the script type: exit
A trace file will be generated in the CATALOG udump directory using the format <SID>_ora_<spid>.trc.
Upload Diagnostics
REFERENCES