SYSAUX and purging big objects
By Damir Vadas
November 15, 2017
If you find that SYSAUX is growing and has become too big, then besides figuring out why this happened (a bug, a disabled purge job, or structural problems in the objects), you need to find the objects and purge them manually.
For the AWR data, held in tables whose names begin with WRH$, the probable cause is that a number of these tables are partitioned. New partitions are created for these tables by the MMON process. Unfortunately, it seems that partition splitting is the final task in the purge process. As the later partitions are not split, they end up containing more data, which makes partition pruning within the purge process less effective.
The second component of the AWR data is the WRM$ tables, which hold metadata. In my experience, even when they are big, they are easily fixed directly by Oracle…of course, only once the data in their child WRH$ tables has been fixed first.
For the OPTSTAT data, held in tables whose names begin with WRI$, the problem is more likely related to the sheer volume of data held in the tables. The WRI$ tables hold historical statistics for every segment in the database for as long as the stats history retention period specifies. Thus, if the database contains a large number of tables and a long retention period (say, 30 days), the purge process will struggle to remove all of the old statistics within the specified window.
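You can check the current retention, and shorten it if needed, with DBMS_STATS; a quick example (the 10-day value is only an illustration):

select dbms_stats.get_stats_history_retention from dual;  -- current retention in days
exec dbms_stats.alter_stats_history_retention(10);        -- e.g. shorten it to 10 days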
A standard starting point is the Oracle-supplied awrinfo script, which reports on SYSAUX usage:

@?/rdbms/admin/awrinfo

However, this is a fixed script, and it is not easy to modify without a deeper understanding of what it does (should you need to adapt it to your needs).
break on report
compute sum OF MB on report
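These SQL*Plus settings belong to a simple query over V$SYSAUX_OCCUPANTS; a minimal sketch of it:

select occupant_desc, round(space_usage_kbytes / 1024) mb
  from v$sysaux_occupants
 order by mb;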
OCCUPANT_DESC MB
--------------------------------------------------------------- ----------
Automated Maintenance Tasks 0
Oracle Streams 1
Logical Standby 1
OLAP API History Tables 1
Analytical Workspace Object Table 1
PL/SQL Identifier Collection 2
Transaction Layer - SCN to TIME mapping 3
Unified Job Scheduler 11
LogMiner 12
Server Manageability - Other Components 16
Server Manageability - Advisor Framework 304
SQL Management Base Schema 10,113
Server Manageability - Optimizer Statistics History 14,179
Server Manageability - Automatic Workload Repository 361,724
----------
sum 386,369
With this easy script you can see directly where to search for solutions. However, another simple “top 10 SYS objects by size” script can show you more detail at the object level:
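The script boils down to a query over DBA_SEGMENTS along these lines (a sketch, not the exact script):

select *
  from (select owner, segment_name, segment_type,
               round(bytes / 1024 / 1024) mb
          from dba_segments
         where tablespace_name = 'SYSAUX'
         order by bytes desc)
 where rownum <= 10;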
For the first two LOB segments from the top of the previous result, we need to get
the tables that own those LOBs:
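The mapping goes through DBA_LOBS; a sketch, with hypothetical SYS_LOB… placeholders standing in for the two segment names from your own top-10 list:

select owner, table_name
  from dba_lobs
 where segment_name in ('SYS_LOB0000001234C00001$$',   -- hypothetical LOB segment name
                        'SYS_LOB0000005678C00002$$');  -- hypothetical LOB segment name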
OWNER TABLE_NAME
------------------------------ ------------------------------
SYS WRH$_SQLTEXT
SYS WRH$_SQL_PLAN
Let us focus on the first table, WRH$_SQLTEXT.
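First, count its rows:

select count(*) from sys.WRH$_SQLTEXT;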
COUNT(*)
----------
5,574,297
So, not too many records, but the LOBs occupy a huge amount of space, which is expected.
But the question is: how many of that table’s records might be fully obsolete (not needed), such that we can remove them?
For this table, we need to find the latest removable snap_id (snap_id is what links the rows to a timestamp).
Let us say we want the most recent snap_id that is older than 35 days. To get it, run the next query:
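A minimal sketch of that query, reading WRM$_SNAPSHOT directly (the same base table the post-purge tasks use later):

select max(snap_id)
  from sys.WRM$_SNAPSHOT
 where begin_interval_time < (trunc(sysdate) - 35);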
MAX(SNAP_ID)
------------
259459
So everything below snap_id 259459 is obsolete data and should be removed. (Please be aware that “should” does not actually mean that those data “can” be removed!)
Now let us see how many records are below that snap_id in table sys.WRH$_SQLTEXT:

select /*+ FULL(t) PARALLEL(t, 4) */ count(*)
  from sys.WRH$_SQLTEXT t
 where snap_id < 259459;
COUNT(*)
----------
5,572,096
So 5,572,096 records out of a total of 5,574,297 are obsolete (99.96%). That is really too much, and this is why we have to check all the totals against the number of records that are actually needed.
It also looks like we need to expand our focus beyond those few objects from the first top-10 list.
And this is why a delete, in any form, is not a good approach in such heavily damaged environments.
Be aware that the DBA_HIST_SNAPSHOT view exposes only a filtered subset of WRM$_SNAPSHOT:
CREATE OR REPLACE FORCE VIEW SYS.DBA_HIST_SNAPSHOT
  (SNAP_ID, DBID, INSTANCE_NUMBER, STARTUP_TIME, BEGIN_INTERVAL_TIME,
   END_INTERVAL_TIME, FLUSH_ELAPSED, SNAP_LEVEL, ERROR_COUNT, SNAP_FLAG,
   SNAP_TIMEZONE)
AS
  SELECT snap_id, dbid, instance_number, startup_time, begin_interval_time,
         end_interval_time, flush_elapsed, snap_level, error_count, snap_flag,
         snap_timezone
    FROM WRM$_SNAPSHOT
   WHERE status = 0;
And this is why counting through the view can give misleading results: rows with a non-zero status are hidden.
Purge by ‘delete’
There are many articles on how to manually purge AWR records, so I will not cover this topic, other than to point out that the number of records to delete matters when you decide whether to use this technique. In my case it was not feasible.
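For reference, the documented call behind this technique is DBMS_WORKLOAD_REPOSITORY.DROP_SNAPSHOT_RANGE; for example, with the snap_id found above:

exec dbms_workload_repository.drop_snapshot_range(low_snap_id => 1, high_snap_id => 259459);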
If you check the ASH of such a session, you will see something like:
IID    AAS SQL_ID           CNT     PCT Object (Sub Object) Type Event                     tablespace and file#
--- ------ ------------- ------ ------- ------------------------ ------------------------- -------------------------
  1   0.00 fqq01wmb4hgt8      3    2.50 KOTTD
So besides the execution plan being wrong (especially for bigger tables), deleting so many records from this table would simply take too long. Last but not least, this is just the first of more than 115 tables involved in the purge process.
The bottom line: this delete-based purge mechanism is the main reason why 99.9% of scheduled purge jobs fail and why manual work is necessary.
Regardless of how small a snap_id range you define, this procedure cannot finish successfully in the time it is given, because of the huge amount of data it has to process. And this is exactly why the scheduled job cannot finish, so it fails.
And even if it succeeded, you would still have a problem with the segment size, which remains the same regardless of the number of records deleted. So you would still have a lot of manual work to do.
If you do not want to query ASH and collect the SQLs yourself, the easiest and best way is to trace the session.
Trace part
To be able to reproduce exactly what Oracle does, the easiest way is to trace the Oracle session.
You can do that with the script trace_purge_session, while the log of the action is
in the trace_purge_session.LST file.
DBNAME and xxxx are custom values that depend on your database, while “20170831_162914” represents the date and time when you ran the trace, so it will be different in every case.
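The trace script itself is not listed here; a minimal sketch of the idea, assuming DBMS_MONITOR and the same 35-day cutoff as above:

alter session set tracefile_identifier = 'purge_trace';
exec dbms_monitor.session_trace_enable(waits => TRUE, binds => TRUE);
exec dbms_workload_repository.drop_snapshot_range(low_snap_id => 1, high_snap_id => 259459);
exec dbms_monitor.session_trace_disable;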
Using the Trace File Browser, a nice Toad feature (or TKPROF), you can easily extract the SQLs from the traced session; the result is the content of the traced_statements_to_analyze.sql file.
If you do not have a small database with the same Oracle version on which to run your trace session, another way is to truncate the tables on a database where you are allowed to do so. The script for that is truncate.sql.
By truncating the tables, your traced session will execute very quickly; in my case it took about six minutes. The end result is irrelevant, because all we want is to capture the statements Oracle uses, as quickly as possible.
Truncate scripts
So now that we have a list of delete statements, we need to create, for each table, fix scripts composed of truncate and insert statements.
All these scripts are placed in the zip file execute_scripts.zip. Extract them into the same directory where master.sql is placed.
At the top of each of the 115 generated files you may find something like:
--sys.wrh$_sql_plan
--total : 464,939,836
--useful: 13,367,438
…which shows you how many records the table holds now and how many of them are really needed. Retrieving this data was tedious, but it was an important part of the work.
So, in my case, there are 64 actual purge scripts, while the other 51 are either empty (their tables had no records in my case) or belong to WRM$ tables, which I didn’t want to touch manually.
Each script has an error-handling part and will terminate the complete execution if any unexpected error occurs. This is achieved with the WHENEVER SQLERROR command, which ensures that all important parts of the script run without error and, in case of an error, stops further execution.
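A typical form of that directive at the top of each script:

WHENEVER SQLERROR EXIT SQL.SQLCODE ROLLBACK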
If an error does occur (which has not happened to me in several runs on three different databases), execution stops, so after analyzing and fixing the problem you will need to rerun that script manually from the point at which it stopped. All commands that come after the detected error can then be run as if nothing had happened.
The beauty of these scripts is that they can run 100% online on any production
system with little overhead, allowing new AWR data to be inserted at the same time.
This is achieved by looping and processing record by record, while silently ignoring the DUP_VAL_ON_INDEX exception.
In each script source you will find a looping commit, which may put some pressure on log file sync, but nothing that cannot be tolerated for a one-time execution. The looping commits prevent deadlocks: without them, with AWR actively saving records while you execute a script, a deadlock (ORA-00060: deadlock detected while waiting for resource) would occur 100% of the time.
In the scripts I use a /*+ PARALLEL(t, 4) */ hint when saving data into the replica table; you can adapt the degree (lower or raise it) to your needs, but keep in mind that this part should get the best execution plan possible.
In all the scripts, change “SOME_TABLESPACE” to your real tablespace name. This tablespace will temporarily hold the data; do not use SYSAUX for that.
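The generated scripts are in execute_scripts.zip; purely as an illustration of the pattern they follow (table name, snap_id cutoff, and tablespace are placeholders here, not the exact generated code):

-- 1) copy the rows worth keeping to a holding table outside SYSAUX
CREATE TABLE fix_wrh$_sqltext TABLESPACE SOME_TABLESPACE AS
  SELECT /*+ PARALLEL(t, 4) */ * FROM sys.wrh$_sqltext t WHERE snap_id >= 259459;

-- 2) truncate the original table, releasing the space immediately
TRUNCATE TABLE sys.wrh$_sqltext;

-- 3) copy the rows back one by one, ignoring duplicates AWR inserted meanwhile
BEGIN
  FOR r IN (SELECT * FROM fix_wrh$_sqltext) LOOP
    BEGIN
      INSERT INTO sys.wrh$_sqltext VALUES r;
      COMMIT;  -- looping commit, prevents ORA-00060 against concurrent AWR inserts
    EXCEPTION
      WHEN DUP_VAL_ON_INDEX THEN NULL;  -- row re-created by AWR in the meantime; skip it
    END;
  END LOOP;
END;
/
DROP TABLE fix_wrh$_sqltext PURGE;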
The easiest way to generate a new purge fix script is to copy a previous one and then, with search and replace, change the WRH$ table name to the new one. As a template you may use the file 113-WRH$_MVPARAMETER_FIX.sql (included in execute_scripts.zip).
Putting all the calls in one place is done through the master.sql script, which is a wrapper for all those calls as well as some post-purge tasks (placed directly in master.sql).
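master.sql itself ships in the zip; as a sketch (assuming the generated file names follow the pattern of the template mentioned above), it boils down to one @ call per fix script followed by the post-purge tasks:

WHENEVER SQLERROR EXIT SQL.SQLCODE ROLLBACK
SET TIMING ON
SPOOL master.lst
@113-WRH$_MVPARAMETER_FIX.sql
-- ...one @ call for each of the remaining fix scripts...
-- post-purge tasks: retention checks, drop_snapshot_range, purge_stats, gather stats
SPOOL OFF

The tail of the spooled log, covering the post-purge tasks, looked like this: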
Elapsed: 00:00:00.23
17:06:34 SQL>
17:06:34 SQL>
17:06:34 SQL>PROMPT Post purge tasks ...
Post purge tasks ...
17:06:34 SQL>select dbms_stats.get_stats_history_retention from dual;
GET_STATS_HISTORY_RETENTION
---------------------------
31
Elapsed: 00:00:00.41
17:06:34 SQL>
17:06:34 SQL>select dbms_stats.get_stats_history_availability from dual;
GET_STATS_HISTORY_AVAILABILITY
---------------------------------------------------------------------------
12-MAY-15 12.57.04.400279000 AM +02:00
Elapsed: 00:00:00.02
17:06:34 SQL>
17:06:34 SQL>column a_snap_id new_value v_snap_id
17:06:34 SQL>select min(snap_id) a_snap_id from sys.WRM$_SNAPSHOT where begin_interval_time>=(trunc(sysdate)-30);
A_SNAP_ID
----------
259129
Elapsed: 00:00:00.04
17:06:34 SQL>select &&v_snap_id snap_id from dual;
SNAP_ID
----------
259129
Elapsed: 00:00:00.01
17:06:34 SQL>
17:06:34 SQL>COL a_dbid new_value v_dbid;
17:06:34 SQL>SELECT TO_CHAR(dbid) a_dbid FROM gv$database where inst_id=1;
A_DBID
----------------------------------------
928736751
Elapsed: 00:00:00.02
17:06:35 SQL>select &&v_dbid dbid from dual;
DBID
----------
928736751
Elapsed: 00:00:00.02
17:06:35 SQL>
17:06:35 SQL>exec dbms_workload_repository.drop_snapshot_range(low_snap_id => 1, high_snap_id=>&&v_snap_id);
Elapsed: 00:09:31.26
17:16:06 SQL>
17:16:06 SQL>select dbms_stats.get_stats_history_availability from dual;
GET_STATS_HISTORY_AVAILABILITY
---------------------------------------------------------------------------
24-AUG-17 12.57.04.400279000 AM +02:00
Elapsed: 00:00:00.02
17:16:06 SQL>
17:16:06 SQL>exec dbms_stats.purge_stats(sysdate-31);
17:20:43 SQL>select dbms_stats.get_stats_history_availability from dual;
GET_STATS_HISTORY_AVAILABILITY
---------------------------------------------------------------------------
24-AUG-17 05.16.06.000000000 PM +02:00
Elapsed: 00:00:00.02
17:20:43 SQL>
17:20:43 SQL>exec dbms_stats.gather_table_stats('SYS','WRM$_DATABASE_INSTANCE');
Elapsed: 00:00:00.51
17:20:44 SQL>
17:20:44 SQL>exec dbms_stats.gather_table_stats('SYS','WRM$_SNAPSHOT');
Elapsed: 00:00:01.04
17:20:45 SQL>
17:20:45 SQL>exec dbms_stats.gather_table_stats('SYS','WRM$_SNAPSHOT_DETAILS');
Elapsed: 00:00:12.85
17:20:58 SQL>
As you can see, in my case the whole action lasted around 10 hours; this depends on the size of your data as well as the speed of your database.
The whole master script can be run multiple times as a complete execution (all 115 scripts), but not in parallel. Ideally, it should be started right after the Oracle automatic purge job finishes.
Final checking
After I ran this on my DB, the situation with the SYSAUX occupants was as follows:
OCCUPANT_DESC MB
---------------------------------------------------------------- ----------
Automated Maintenance Tasks 0
Oracle Streams 1
Logical Standby 1
OLAP API History Tables 1
Analytical Workspace Object Table 1
PL/SQL Identifier Collection 2
Transaction Layer - SCN to TIME mapping 3
Unified Job Scheduler 11
LogMiner 12
Server Manageability - Other Components 16
Server Manageability - Advisor Framework 304
SQL Management Base Schema 10,113
Server Manageability - Automatic Workload Repository 13,738
Records before the purge: 1,176,879,395
Records after the purge: 44,627,169
Imagine what the overhead on the database would be to delete 1.1 billion records with a classic delete. The horror!
As you can see, 269,394 MB of space has been recovered and returned to the system.
Purge job
After all is done and fixed, check that the purge job is present and enabled. If it is not, you can create it with a simple script; this job is necessary to ensure that the data is purged automatically:
BEGIN
  sys.dbms_scheduler.create_job(
    job_name        => '"SYS"."PURGE_OPTIMIZER_STATS"',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'begin dbms_stats.purge_stats(sysdate-3); end;',
    repeat_interval => 'FREQ=DAILY;BYHOUR=6;BYMINUTE=0;BYSECOND=0',
    start_date      => systimestamp at time zone 'Europe/Paris',
    job_class       => '"DEFAULT_JOB_CLASS"',
    comments        => 'job to purge old optimizer stats',
    auto_drop       => FALSE,
    enabled         => TRUE);
END;
/
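To verify the job afterwards, a quick check (assuming the job name used above):

select job_name, enabled, state
  from dba_scheduler_jobs
 where job_name = 'PURGE_OPTIMIZER_STATS';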
To be 100% sure that there will be no AWR activity in the database while you perform maintenance, Oracle suggests doing the maintenance with the database opened in restricted mode, and then returning the DB to the normal open state afterwards:

shutdown immediate;
startup restrict;
@master.sql
shutdown immediate;
startup;
But this is general advice and, IMHO, more oriented to segments that have parent/child records, which was not the case here.
Another tip is to stop AWR (the database can stay online; no need to restart it), but in my opinion this is not really necessary either.
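If you do want to pause AWR anyway, one way is DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS; setting the interval to 0 disables automatic snapshots (the 60-minute value below is just an example of restoring it afterwards):

exec dbms_workload_repository.modify_snapshot_settings(interval => 0);   -- disable automatic snapshots
-- ...run master.sql...
exec dbms_workload_repository.modify_snapshot_settings(interval => 60);  -- restore, e.g. hourly snapshots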
Don’t touch any partitions, regardless of whether they are empty, because Oracle will drop them (if needed) by itself.
So, the general approach for any environment that cannot be handled normally with the Oracle purge call is:
1. Identify the biggest SYSAUX occupants and the objects behind them.
2. Trace a purge session to capture the statements Oracle really uses.
3. Generate truncate/insert fix scripts for the affected WRH$ tables.
4. Run master.sql (all the fix scripts plus the post-purge tasks).
5. Verify the results and make sure the automatic purge job is present and enabled.
Doc references
How to Purge WRH$_SQL_PLAN Table in AWR Repository, Occupying Large Space in SYSAUX
Tablespace. (Doc ID 1478615.1)