Troubleshooting Guide MMR
Applies to:
Oracle Server - Enterprise Edition - Version: 7.2.2.0 to 10.2.0.1 - Release: 7.2.2 to 10.2
Information in this document applies to any platform.
Checked for relevance on 27-Nov-2007
Checked for relevance on 16-Mar-2010.
Purpose
The purpose of this article is to provide basic steps for troubleshooting advanced replication
propagation and the underlying mechanism it uses: the deferred queue. Additional notes are
referenced throughout this article that address specific issues or provide additional
information on a particular component used by Advanced Replication.
GNAME STATUS
------------------------------ ---------
GROUP1 NORMAL
2.2 Replicated object triggers, packages and status
For the propagation and queuing of data changes to be successful to remote sites, replicated
tables (objects) must display as valid at all replication sites. They must also have the
associated internalised triggers and packages defined. Run the following query to check the
replicated tables (objects), in releases prior to Oracle 8.1.x:
If the above query returns replication objects that do not have the associated internalised
triggers or packages, it may be necessary to re-generate replication support. If after
re-generation the replication objects still show as invalid, run the following statement to ensure
that all dependent SYS and SYSTEM objects are valid:
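The original statement is not reproduced here. As a minimal sketch, assuming DBA access to the data dictionary, a check of this kind could look like the following. The choice of columns and ordering is an assumption, not the note's own query.

```sql
-- Sketch only: list SYS and SYSTEM objects that are not VALID.
-- No rows returned means every SYS and SYSTEM object is valid.
select owner, object_name, object_type, status
from   dba_objects
where  owner in ('SYS', 'SYSTEM')
and    status <> 'VALID'
order  by owner, object_type, object_name;
```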
note:50593.1 Initial steps required to create Multi Master and Snapshot Replication v8.0
note:117434.1 Initial steps required to create a Multi Master Replication environment v8.1 /
v9.x / v10.x
Check the existence of the public and private links with the following query at each site
involved in propagation:
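The query itself is missing from this copy of the note. A minimal sketch, assuming DBA access to DBA_DB_LINKS, might be:

```sql
-- Sketch only: list every database link defined at this site.
-- Public links are shown with owner = 'PUBLIC'; private links show the owning user.
select owner, db_link, username, host
from   dba_db_links
order  by owner, db_link;
```

Private links owned by the propagator should appear alongside a public link of the same name for each remote master site.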
Test each of the links and ensure that the global name matches the link name with the
following query:
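The test query is not reproduced here. One possible form is sketched below. DB2.WORLD is only a placeholder link name borrowed from the examples later in this note, so substitute each link in turn.

```sql
-- Sketch only: DB2.WORLD is a placeholder database link name.
-- Run this as the user who owns the private link so that the private link is used.
-- The value returned should match the name of the link.
select global_name from global_name@DB2.WORLD;
```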
It is important that the correct links exist for the user who owns the job that performs the
replication push.
3. Checking the automatic propagation mechanism is working
Oracle Replication uses the job queue to automate the propagation and purging of deferred
transactions. If the jobs are not configured properly, they will not run as expected and their
associated tasks will not be completed.
Administrators usually become aware that there may be a problem with the job queue
mechanism when they discover that the replication deferred transaction queue is building up.
To check whether the queue is growing, run:
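The original query is missing. As a hedged sketch, counting the entries in DEFTRANDEST per destination and repeating the count after a few minutes gives a rough picture of whether the queue is growing. The column alias is illustrative.

```sql
-- Sketch only: number of transactions still waiting to be pushed, per destination.
-- Run it several times and compare the counts.
select dblink, count(*) as queued_transactions
from   deftrandest
group  by dblink
order  by dblink;
```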
If the queue appears to be growing to an unusual size, use the following 3.x sections to ensure
that the automated jobs are not the cause. Do not attempt to quiesce the replication system
with SUSPEND_MASTER_ACTIVITY, as that will just try to push the queue first.
3.1 Checking for errors
If a job fails while attempting to push or purge replication data changes, errors will be written
to the alertSID.log. Additional and more detailed information will go into the following files
referred to by the alert.log:
- Pre V9 : SID_snpx_nnnnn.TRC
- V9 and above : SID_cjq0_nnnnn.TRC and SID_jnnn_nnnn.TRC
The format of these file names may vary between operating systems. Their location can be
determined by running the following from SQLPLUS:
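The command is not shown in this copy. A minimal sketch, assuming the releases covered by this note (where trace locations are controlled by the dump destination parameters), is:

```sql
-- Sketch only: the alert log and job queue process traces are written to
-- background_dump_dest; traces from user sessions go to user_dump_dest.
select name, value
from   v$parameter
where  name in ('background_dump_dest', 'user_dump_dest');
```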
job_queue_processes : A job queue process executes a single job at a time and this parameter
determines the maximum concurrent number of these. In most replication environments
configure this to be:
JOB WHAT
-------- ----------------------------------------------------------------
43 declare rc binary_integer; begin rc := sys.dbms_defer_sys.purge(
delay_seconds=>0); end;
If the push job exists it is important to know when it last ran and when it is next scheduled to
run. Check the job's schedule with:
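The original query is missing. A possible equivalent, sketched against DBA_JOBS and filtering on the text of the job in the same way as the query in section 3.6, is:

```sql
-- Sketch only: schedule and health of the push jobs.
select job, last_date, last_sec, next_date, next_sec,
       broken, failures, interval
from   dba_jobs
where  upper(what) like '%DBMS_DEFER_SYS.PUSH%';
```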
There may be two reasons that the next_date has been passed. The first is that the job has failed;
to check this see section 3.4. The second is that the job is still running; to check this see section
3.6. It is important to note that the next_date is set when the job completes, so if the interval is
10 minutes and the job took 9 minutes to run, the next_date will be 19 minutes from the time
the job started. Please refer to DBMS_JOB.CHANGE in note:61730.1 if the next date of the
job needs to be altered.
Oracle uses a lazy algorithm to purge deferred transactions from the local queue. It is
important that the purge job runs regularly to clear down this queue because the same
underlying table is used for transactions waiting to go to remote sites as for those which have
been pushed but not purged. If there is a large number of transactions to be purged it can
affect the performance of propagation. Check that the scheduled purge job exists with:
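The query is not reproduced here. A minimal sketch, assuming the purge job was scheduled through DBMS_DEFER_SYS so that its text contains the purge call, is:

```sql
-- Sketch only: find the scheduled purge job and its current state.
select job, what, last_date, next_date, broken, failures
from   dba_jobs
where  upper(what) like '%DBMS_DEFER_SYS.PURGE%';
```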
If the job is showing failures follow the advice in section 3.1. Once the underlying problem
has been resolved, unbreak the job with DBMS_JOB.BROKEN (see note:1018453.102).
Please note that the job may show failures = 0 and broken = Y if the job was manually broken
with DBMS_JOB.BROKEN.
3.5 Check the propagator and their private database links
The owner of the job that performs the push must be the replication propagator, and that user
must have a private database link to the site to which the job is pushing data. See section 2.4
for details of the required database links and how to check they are working correctly.
Use the following SQL to check the propagator, links and push jobs all match up:
DB_LINK
----------------------------------
DB2.WORLD
DB3.WORLD
JOB PUSHED_SITE_BY_PROPAGATOR
---------- -------------------------
43 DB2.WORLD
44 DB3.WORLD
3.6 Check if the push job is currently running
Oracle replication only allows a single push operation to run to a master site at a time.
Multiple push operations can occur concurrently, but they have to be to different
remote master sites. If connection qualifiers are being used then there can be multiple push
jobs to the same master site, but with different database links (see note:1024982.6).
It may also be difficult to know if data is moving between replicated sites by using the
deftrandest view, because new transactions are added all the time and transactions with
many calls may take some time to process. The following query identifies push jobs that
are currently running:
column dblink format a30
select /*+ ORDERED */ j.job, j.sid, d.dblink,
SUBSTR(TO_CHAR(J.THIS_DATE,'MM/DD/RRRR HH24:MI:SS'),1,20) START_DATE
from defschedule d, dba_jobs_running j
where j.job in (select job from dba_jobs
where upper(what) like '%DBMS_DEFER_SYS.PUSH%')
and j.job = d.job;
When a job runs it creates a Job Queue lock to protect it from being run more than once (e.g.
run manually from a user's session). Conditions have been observed where the lock is
still held by the Job Queue Process due to network failures. Use the following query to
identify the Job Queue (JQ) lock:
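The original query is missing. As a hedged sketch, JQ locks can be listed from V$LOCK. For this lock type the id2 column holds the job number.

```sql
-- Sketch only: sessions holding a Job Queue (JQ) lock.
-- id2 is the number of the job that the holding session is running.
select sid, type, id1, id2, lmode, request
from   v$lock
where  type = 'JQ';
```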
It is important to note that the above queries will not identify manual push operations that
have been initiated from a user's session; see section 4.1 for more details on identifying these.
After evaluating the above and section 4 it may be necessary to terminate the push process.
Please follow the steps defined in section 3.7 to do this.
3.7 Terminating a deferred queue push job that is currently running
There will be situations where the running push job needs to be terminated and prevented
from running again, until the current problem that is being encountered is resolved. Perform
the following steps:
- Kill the Job Queue Process from the Operating System. To do this use the sid from section
3.6 to identify the process in v$session, v$process and v$bgprocess. The process will
generally be named SNPx or Jxxx.
- After killing the process, wait approximately 1 minute, to ensure the job is removed from
dba_jobs_running.
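To map the sid from section 3.6 to an operating system process, a join of v$session and v$process along the following lines may help. This is a sketch rather than the note's own query, and &sid is a SQL*Plus substitution variable that prompts for the sid.

```sql
-- Sketch only: find the operating system process id (spid) for a given sid.
select s.sid, s.serial#, s.program, p.spid as os_process_id
from   v$session s, v$process p
where  p.addr = s.paddr
and    s.sid  = &sid;
```

The spid value is the process to kill from the operating system.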
Killing the background process from the operating system will release the Job Queue lock and
the User Lock used to protect the push operation. Once the underlying problem has been
resolved by working through sections 4, 5 and 6, restart the jobs with:
execute dbms_job.broken([job], false);
Remember to make sure that the job that would normally perform the push is not currently
running and has been prevented from running when the manual push is being tested (see
sections 3.6 and 3.7). If this is not the case the manual push will normally return immediately
without pushing any rows.
declare x integer;
begin
x := sys.dbms_defer_sys.push('[dblink]',...);
end;
/
If you get errors during the manual push, rows will generally not be pushed. Use My Oracle
Support to search for known problems that relate to these errors.
If the manual push completes without errors but the entries in deftrandest remain unchanged,
the following could be the cause:
- Another user's session is performing the same push operation. When a manual push starts it
allocates a User Lock to ensure there is only one push at a time. Use note:1059290.6 to
identify the blocking session.
- BUG:734902 (fixed in 8.0.6), which may manifest itself as a hang.
If neither of the above is the cause, check through the following 4.x sections and, if the cause
cannot be identified, raise a call with Oracle Support Services.
If the manual push hangs, check through the following 4.x sections and, if they do not identify
the cause, use section 5 to diagnose the problem.
- DEFTRANDEST : contains all transactions that have not yet been pushed to a remote master
site. Transactions appear once per master site they have to be pushed to.
** Note the deftran view also includes transactions from the deferred error queue; see sections
4.3.1 and 4.4 for additional information.
The transaction with the lowest delivery order will be the next transaction to be pushed to the
remote replication site. Prior to Oracle9 the DEFTRANDEST view was the only way to
identify how propagation was progressing. From Oracle8 onwards use the following to
investigate the current push:
- Transactions that are currently being pushed appear in the target site's DEF$_ORIGIN table.
- A transaction has been pushed if system.def$_destination.last_delivered is greater than
system.def$_aqcall.cscn.
Oracle9 includes a mechanism for identifying how far through the current transaction the
current push is. In Oracle9 and above, use the following query to identify if a transaction is
being pushed and how many rows have been pushed:
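The original query is missing. A minimal sketch against V$REPLPROP (Oracle9 and above) is shown below. How the columns are read is an assumption based on their names: xid is the transaction being propagated and sequence is the call currently being processed within it.

```sql
-- Sketch only: one row per propagation session for pushes that are in progress.
select p.sid, p.serial#, p.name, p.dblink, p.state, p.xid, p.sequence
from   v$replprop p;
```

If the sequence value for a given xid increases between runs, the push is making progress through that transaction.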
The current implementation of v$replprop only applies to transactions that are pushed using
parallel propagation; however, Oracle recommends that all customers use parallel propagation. On
systems that are CPU bound pushing the deferred queue, it may be better to run the following
query:
In Oracle9 and above, running the following query should assist database administrators in
monitoring the overall activity in the deferred transaction queue:
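The query is not reproduced in this copy. A hedged sketch using V$REPLQUEUE, whose counters are cumulative since instance startup, is:

```sql
-- Sketch only: overall deferred queue activity since the instance was started.
select txns_enqueued, calls_enqueued, txns_purged,
       last_enqueue_time, last_purge_time
from   v$replqueue;
```

Comparing txns_enqueued with txns_purged over time indicates whether purging is keeping up with the rate at which transactions are queued.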
See Section 7 for an example of a transaction being propagated to a remote site. If the current
transaction being pushed appears to be hung or running very slowly, check the following 4.x
sections of this document. If they do not help, section 5 contains a more detailed analysis
method.
Deferred                                                          Oracle
Transaction Call      Origin           Destination      Date Of   Error
ID          Number    Database         Database         Error     Number
----------- --------- ---------------- ---------------- --------- -------
5.11.152            0 DB3.WORLD        DB4.WORLD        05/16/02     1403
                                                        20:03:07
8.10.153            2 DB3.WORLD        DB4.WORLD        05/16/02     1403
                                                        20:08:01
In general, to resolve the errors that the above query returns, search MetaLink for the ORA-
error number reported. If the error is an ORA-1403, as in the above example, data between the
replication sites' tables has diverged. This data will have to be manually resynchronised, and
Oracle recommends that customers implement a conflict resolution mechanism.
- If one transaction fails and is written to the error queue, then the following transactions
succeed, there is a possibility that data at the remote master site could be logically
inconsistent, because the earlier transaction has been overtaken.
If a customer has configured stop_on_error = true, then the first transaction to fail will be
written to the error queue and subsequent transactions will not be pushed. This makes it much
easier to resolve divergent data.
Use the following query to identify what stop_on_error has been set to:
JOB WHAT
---------- ----------------------------------------------------------------
44 declare rc binary_integer; begin rc := sys.dbms_defer_sys.push(d
estination=>'DB2.WORLD', stop_on_error=>FALSE, delay_seconds=>0,
parallelism=>2); end;
declare x integer;
begin
x := sys.dbms_defer_sys.purge(delay_seconds=>0);
end;
/
If the manual purge raises errors, check MetaLink for likely causes, address the errors and run
the purge again. If the manual purge returns without error, follow the steps described in
section 4.4.2 to check that transactions are being correctly purged.
Oldest Unpurged
---------------
781633
In Oracle9 and above it may be better to run the following query, particularly if the deferred
queue is very large. It shows the number of transactions that have been purged since the
database was last started:
The lazy purge will purge transactions with a system.def$_aqcall.cscn lower than the local
low water mark for propagated transactions (this is calculated based on the minimum
last_delivered in the local system.def$_destination). This low water mark can be lower than
some cscn numbers of some previously pushed transactions, so they will not be purged
immediately. This can happen if not all the push jobs have run and still have active
transactions. In this case, the transactions will remain in def$_aqcall until the low water mark
rises above the cscn for the transaction.
A precise purge will purge transactions with a cscn lower than the low water mark for
propagated transactions to its specific destination. This means that the purge will query the
last_delivered for each dblink destination. All transactions that have been pushed from the
local site to that destination will usually fall below the low water mark and be purged. To
perform the precise purge execute:
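declare x integer;
begin
-- request the precise method through the package constant rather than a literal
x := sys.dbms_defer_sys.purge(
       purge_method=>sys.dbms_defer_sys.purge_method_precise);
end;
/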
For the replication propagation mechanism to achieve maximum throughput the deferred
queue needs to be kept as small as possible and transactions need to be propagated at regular
intervals. There are two types of large queue:
- Queues with one or two transactions with tens or hundreds of thousands of calls, usually
caused by bulk update operations (DML) or SQL loads.
- Queues with tens or hundreds of thousands of transactions, usually caused by a failure in the
propagation job due to network outage or space management issues at the remote site.
In the majority of cases the queue will have already become large before the database
administrator becomes aware of the problem and running queries against the DEF.... views
will prove difficult because with enormous queues the views are slow.
The following queries should help the database administrator to make a decision about what
to do with the large deferred queue. Please note that on some systems it may not be practical to
run these queries.
Check how many rows are in the current or next transaction to be pushed:
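The query that produced the output below is missing from this copy. One way to get a comparable result is sketched here, assuming the next transaction to be pushed is the one with the lowest delivery_order in DEFTRANDEST across all destinations. The alias names are illustrative, and on very large queues this may run slowly.

```sql
-- Sketch only: number of calls in the transaction with the lowest delivery order.
select c.deferred_tran_id, count(*) as calls
from   defcall c
where  c.deferred_tran_id in
       (select d.deferred_tran_id
        from   deftrandest d
        where  d.delivery_order =
               (select min(delivery_order) from deftrandest))
group  by c.deferred_tran_id;
```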
DEFERRED_TRAN_ID CALLS
------------------------------ ----------
1.0.704 4999
DEFERRED_TRAN_ID CALLS
------------------------------ ----------
1.0.704 4999
1.21.691 3430
1.7.725 112
10.0.669 102
Oracle 9.x and above, check how many rows have been propagated from the current
transaction so far:
Oracle 9.x and above, check the overall number of transactions and rows that have been
queued since the instance was last started:
- Ensure conflict resolution handlers are defined for tables that receive large updates. By
handling the conflicts we avoid the overhead of rolling back the transaction and re-pushing it
into the remote error queue. This operation can take considerably longer than the original
transaction.
- Monitor the push and purge jobs with Enterprise Manager events. As soon as a failure occurs
the database administrator will then be alerted and can address the problem before the queues
build up.
- There is no easy way to avoid large transactions that are generated by mistake or by ad hoc user
access, but for planned batch operations consider using procedural replication.
5. Diagnosing hanging propagation
If, after completing the analysis described in section 4, rows do not appear to be moving
between replication sites, perform the diagnostic steps described in this section.
For this query to be executed successfully, replace 'REPADMIN' with the user that the
pushing site's replication propagator pushes to (receiver user) and make sure
CATBLOCK.SQL has been run.
If the ROWNO / OBJECT_NAME column does not change, the blocking user's session will
have to be killed to allow propagation to continue. Under some circumstances more than one
row may be returned because each query slave used by parallel propagation opens a separate
session at the remote site.
Run the following query at the pushing site if propagation is being executed from
DBMS_JOB:
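The original query is not shown. A minimal sketch, assuming the aim is to see what each running job's session is currently waiting on, joins DBA_JOBS_RUNNING to V$SESSION_WAIT:

```sql
-- Sketch only: current wait event for every session that is running a job.
select r.job, r.sid, w.event, w.p1, w.p2, w.p3, w.seconds_in_wait
from   dba_jobs_running r, v$session_wait w
where  w.sid = r.sid;
```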
Run the following query at the pushing site if propagation is being executed by a user's
session, replacing REPADMIN with the propagation user:
Run the following query at the site where data is being pushed to and replace 'REPADMIN'
with the user that the propagation user pushes to (receiver user), note each parallel
propagation slave will appear as a separate session:
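The query is missing from this copy. A possible form, sketched on the assumption that the receiver's sessions are identified by user name, is shown below. 'REPADMIN' is the placeholder user name used elsewhere in this section.

```sql
-- Sketch only: wait events for all sessions connected as the receiver user.
select s.sid, s.serial#, s.username,
       w.event, w.p1, w.p2, w.p3, w.seconds_in_wait
from   v$session s, v$session_wait w
where  w.sid = s.sid
and    s.username = 'REPADMIN';
```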
If the sessions appear to be stuck on the same event for a long period of time, consult
note:61998.1 or raise a call with Oracle Support Services.
Section 4.3 can be used to identify when small transactions are being written to the error
queue, however it will not assist in identifying if there is a single transaction with thousands
of rows being written to the error queue. The following query can be used to identify if a large
error is being queued at the remote master site:
select count(*)
from v$sqlarea a, v$session s
where s.sql_address = a.address
and s.sql_hash_value = a.hash_value
and a.sql_text like '%DEF$_AQERROR%'
and s.username = 'REPADMIN';
Replace 'REPADMIN' with the user that the pushing site's replication propagator pushes to
(receiver user).
The first step is to collect an errorstack from the pushing session. This must include all query
slaves used by the push. Use the query defined in section 5.2 to identify these.
The second step is to collect an errorstack from all sessions that are applying changes at the
remote site. Use the query defined in section 5.2 to identify these.
Collect the errorstack by running the following from SQLPLUS for each session:
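The commands themselves are missing here. One commonly used approach, sketched under the assumption of a privileged (SYSDBA or INTERNAL) SQL*Plus connection, uses ORADEBUG. The value 12345 is a placeholder for the operating system process id (spid) of the session being examined.

```sql
-- Sketch only: 12345 is a placeholder for the spid taken from v$process.
oradebug setospid 12345
oradebug unlimit
oradebug dump errorstack 3
```

The dump is written to a trace file in the dump destination of the target process.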
As hung or stuck push operations may in fact be spinning or looping operations, it may also
be necessary to collect an in-depth sql_trace of all sessions listed above. Collect the required
trace by running the following from SQLPLUS for each session:
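The original commands are not reproduced. As a hedged sketch, extended SQL trace (event 10046) can be switched on for another session with ORADEBUG from a privileged SQL*Plus connection. The value 12345 is again a placeholder for the session's operating system process id, and level 12 (binds and waits) is an assumed choice.

```sql
-- Sketch only: 12345 is a placeholder spid; level 12 records binds and waits.
oradebug setospid 12345
oradebug unlimit
oradebug event 10046 trace name context forever, level 12
-- let the session run for a representative period, then switch tracing off
oradebug event 10046 trace name context off
```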
The basic steps to clear down the queue are listed below. If the queue is
large, slow or hanging, refer to note:190885.1:
--OR--
DBLINK LAST_DELIVERED
------------------------------ --------------
REP9I.WORLD 478992
ORA9I.WORLD 478878
At this point, DEFERRED_TRAN_ID OR ENQ_TID 4.49.495 has not yet been pushed
because its DELIVERY_ORDER OR CSCN of 479001 is greater than the current
LAST_DELIVERED value for all transactions going to REP9I.WORLD which is 478992.
7.4 Manually push transactions to REP901 and interrogate the deferred queue
..... dbms_defer_sys.push(destination=>'REP9I.WORLD');
no rows selected
select * from deftran;
ENQ_TID CSCN
------------------------------ ----------
4.49.495 479001
3.0.494 478972
As expected, the unpurged queue view deftran shows all transactions and the deftrandest view
is empty because all transactions have been pushed.
DBLINK LAST_DELIVERED
---------------------------- --------------
REP9I.WORLD 479017
ORA9I.WORLD 478878
select count(*)
from system.def$_aqcall
where cscn < (select last_delivered from
system.def$_destination where dblink ='[dblink]');
COUNT(*)
----------
2