IBM Systems

Oracle AWR report in-depth analysis
Determine if your database can benefit from IBM FlashSystem

Contents
Highlights
What is AWR?
What should you know before examining AWR reports?
Take a top-down approach
Oracle RAC-specific pages
Time breakdown statistics
Operating system statistics
Foreground wait events
Background wait events
Wait event histograms
Service-related statistics
The SQL sections
Instance activity statistics
Tablespace I/O statistics
Buffer pool statistics
Shared pool statistics
Other advisories
Latch statistics
Segment access areas
Library cache activity sections
Dynamic memory components sections
Process memory sections
Initialization parameter changes
Streams component sections
Global enqueue and other Oracle RAC sections
Summary

Highlights
+ Speed up databases to unparalleled rates
+ Interpret AWR results to gain valuable, actionable insight
+ Utilize flash storage to improve database performance
+ Determine optimum conditions for database environments

IBM FlashSystem offers customers the ability to speed up databases to rates that may not be possible even by over-provisioning the fastest spinning disk arrays. This paper is intended to provide observations and insights on how to use some native Oracle Database tools to determine the effects that flash storage can have on Oracle Database environments. Oracle utilities, Statspack and now Automatic Workload Repository (AWR) reports, provide database administrators with detailed information concerning a snapshot of database execution time.
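For reference, a full AWR report can also be produced manually with the DBMS_WORKLOAD_REPOSITORY package (or the awrrpt.sql script under $ORACLE_HOME/rdbms/admin). A minimal sketch; the DBID, instance number, and snapshot IDs below are placeholders for your own system:

```sql
-- List available snapshots to choose a begin/end pair.
SELECT snap_id, begin_interval_time, end_interval_time
FROM   dba_hist_snapshot
ORDER  BY snap_id;

-- Generate the text-format report for a snapshot pair.
SELECT output
FROM   TABLE(DBMS_WORKLOAD_REPOSITORY.awr_report_text(
               l_dbid     => 1234567890,  -- placeholder DBID
               l_inst_num => 1,
               l_bid      => 100,         -- begin snapshot ID
               l_eid      => 101));       -- end snapshot ID
```

The companion function awr_report_html produces the HTML variant mentioned above.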
This snapshot provides statistics on wait events, latches, storage input and output volumes, and timings, as well as various views of memory and SQL activities. The statistics and insights into the memory, input/output (I/O), and SQL performance characteristics are invaluable aids in determining whether a database is functioning optimally. Unfortunately, there is such an abundance of data in AWR reports that most database administrators (DBAs) feel overwhelmed and may miss important clues to database performance issues. This paper provides a guide to interpreting AWR results so that even a novice DBA can glean valid, actionable insights from review of an AWR report. For customers lacking the time to delve into their AWR reports, IBM offers a free service to analyze your AWR reports in order to identify if your database can benefit from IBM FlashSystem.

What is AWR?
The AWR provides a set of tables into which snapshots of system statistics are stored. Generally these snapshots are taken on an hourly basis and include wait interface statistics, top SQL, memory, and I/O information that is cumulative in nature up to the time of the capture. The AWR report process takes the cumulative data from two snapshots, subtracts the earlier snapshot's cumulative data from the later snapshot, and then generates a delta report showing the statistics and information relevant for the time period requested. AWR is a more advanced version of the old Statspack reports that has been automated and made integral to Oracle's automated tuning processes for the Oracle Database. AWR reports are run internally each hour and the findings are reported to the OEM interface. The user can use OEM or manual processes to create a full AWR report in text or HTML format. These text or HTML reports are what we will be examining in this paper.

What should you know before examining AWR reports?
Before examining AWR reports, DBAs should be familiar with the basic wait events that an Oracle Database may encounter and the typical latches that may cause performance issues, plus be aware of what a typical performance profile looks like for their particular system. Usually, by examining several AWR reports from periods of normal or good performance, DBAs can acquaint themselves with the basic performance profile of their database. Things to notice in the baseline reports include normal levels of specific wait events and latches and the normal I/O profile (normal I/O rates and timings for the database files). Other items to look at are the number and types of sorts, the memory layout, and Process Global Area (PGA) activity levels. In order to be aware of which waits, latches, and statistics are significant, it is suggested that DBAs become familiar with the Oracle Database tuning guide and concepts manual. Tuning books by outside authors can also provide more detailed insight into Oracle Database tuning techniques and concepts.

Take a top-down approach
Unless you are looking for specific problems such as a known SQL issue, it is usually best to start with the top data in the AWR report and drill down into the later sections as indicated by the wait events and latch data. The first section of an AWR report shows general data about the instance; look at Figure 1 for an example header section.

Figure 1: AWR report header

The report header provides information about the instance upon which this report was run. The AWR header was expanded to include data about the platform, CPUs, cores, and sockets, as well as memory available in the server. Also, pay attention to the number of sessions at the beginning and end of the report; this can tell you if the load was constant, increasing, or decreasing during the report period.
The instance number, whether this is an Oracle RAC (Real Application Clusters) instance or not, and the release level of the database are shown here. The header also includes the startup time as well as the times for the two snapshots used to generate the report. The delta-T between the snapshots is also calculated and displayed. Following the instance and snapshot data, the cache sizes section gives basic data about the buffer pool, shared pool, standard block, and log buffer sizes. The second section of the report header contains load information and is shown in Figure 2.

Figure 2: Load related statistics in header

All of this information in the first part of the AWR header forms the foundation that we draw from to gauge whether a particular statistic is reasonable or not. For example, if we see that the system is an Oracle RAC-based system, then we know to expect Oracle RAC-related statistics to be included in the report. If we see a large number of CPUs, then we might expect to see parallel query related numbers. The knowledge of whether a system is Linux, Unix, AIX, Windows, or some other supported platform is also valuable in tracking down platform-specific issues.

You should pay attention to the duration of the snapshot window against which the report was run. If the window is too long, then too much averaging of values could distort the true problem. On the other hand, if the period is too short, important events may have been missed. All in all, you need to understand why a report was generated: Was it for a specific SQL statement? If so, it might be a short duration report. Was it for a non-specific problem we are trying to isolate? Then it could be from 15 minutes to a couple of hours in duration.
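When a report must bracket a specific operation instead of the default hourly window, snapshots can be taken manually just before and after the workload of interest. A minimal sketch using the documented DBMS_WORKLOAD_REPOSITORY calls (the interval and retention values are illustrative only):

```sql
-- Take a snapshot on demand (run once before and once after the workload).
BEGIN
  DBMS_WORKLOAD_REPOSITORY.create_snapshot;
END;
/

-- Optionally tighten the automatic cycle: 15-minute snapshots,
-- 7 days (10,080 minutes) of retention.
BEGIN
  DBMS_WORKLOAD_REPOSITORY.modify_snapshot_settings(
    interval  => 15,
    retention => 10080);
END;
/
```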
The critical things to watch for in the load section depend on the type of application issue you are trying to resolve. For example, in the header section in Figure 2 we see a large number of physical reads and logical reads with few block changes; this is a typical profile for a reporting environment such as a data warehouse or decision support system (DSS). Seeing this large number of logical and physical reads should key us to look at I/O-related issues for any database performance problems. An additional sign that this is probably a warehouse or DSS environment is the large amount of Work Area processing (W/A) occurring; this means sorting. A further indicator is that the user calls and parses are low, indicating that the transactions contain few statements and are long-running.

In a predominately online transaction processing (OLTP) system we would expect to see more logical reads, few physical reads, and many user calls, parses, and executes, as well as rollbacks and transactions. Generally speaking, report environments have fewer, longer transactions that utilize the Work Area, while OLTP environments tend to have numerous small transactions with many commits and rollbacks.

The next section of the header shows us the Instance Efficiency percentages. Generally you want these as close to one hundred percent as possible. As you can see, all of our efficiencies are near to 100 percent, with the exception of Execute to Parse and Parse CPU to Parse Elapsed. Because we are dealing with a reporting system, it will probably have a great deal of ad-hoc reports. Because these by their nature are not reusable, we will have low values for parse-related efficiencies in this type of system. In an OLTP-type system, where transactions are usually repeated over and over again, we would expect the parse-related efficiencies to be high as long as cursor sharing and bind variables were being properly utilized.
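The effect of bind variables on the parse-related efficiencies can be illustrated with a small PL/SQL sketch (the orders table and id column are hypothetical):

```sql
DECLARE
  n NUMBER;
BEGIN
  -- Literal value: every iteration is a distinct SQL text,
  -- forcing a hard parse and depressing Execute to Parse.
  FOR i IN 1 .. 1000 LOOP
    EXECUTE IMMEDIATE
      'SELECT COUNT(*) FROM orders WHERE id = ' || i INTO n;
  END LOOP;

  -- Bind variable: one shared cursor, soft parses after the first
  -- execution, keeping the parse-related percentages high.
  FOR i IN 1 .. 1000 LOOP
    EXECUTE IMMEDIATE
      'SELECT COUNT(*) FROM orders WHERE id = :id' INTO n USING i;
  END LOOP;
END;
/
```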
If we were to see that the Buffer NOWAIT and Buffer Hit percentages were low (less than 95 percent), we would investigate whether the data block buffers were being properly used and whether we might need to increase data block buffer sizes. If the library hit percentage was low, we would consider increasing the shared pool allocation, or at least look into why its percentage was low (it could be improper bind variable usage).

The redo NOWAIT percentage tells us how efficiently our redo buffers are being utilized; if the percentage is low, we would need to look at tuning the redo log buffers and redo logs. If processes are waiting on redo, then either the buffers are too small or something is blocking the redo logs from being reused in a proper manner. For example, in an archive log situation, if there are insufficient logs then the system may have to wait on the archive log process while it copies a log to the archive location, decreasing the NOWAIT percentage.

The memory sort percentage tells us if our PGA_AGGREGATE_TARGET or, if manual settings are used, SORT_AREA_SIZE, HASH_AREA_SIZE, and bitmap settings need to be examined. Numbers less than 100 for the sort percentage indicate that sorts are going to disk. Sorts going to disk are slow and can cause significant performance issues.

The soft parse percentage tells us how often the SQL statement we submit is being found in the cursor caches. This is directly related to proper bind variable usage and how much ad-hoc SQL generation is taking place in our system. Hard parses cause recursive SQL and are quite costly in processing time. In a system where SQL is being reused efficiently, this should be near one hundred percent.

Latch hit percent tells us how often we are not waiting on latches. If we are frequently spinning on latches, this value will decrease. If this percentage is low, look for CPU-bound processes and issues with latching.
The Non-Parse CPU percentage tells us how much of the time the CPU is spending on processing our requests versus how much time it is spending doing things like recursive SQL. If this percentage is low, look at the parse-related percentages because they too will be low. When this percentage is low, it indicates the system is spending too much time processing SQL statements and not enough time doing real work.

The next section of the header shows us the shared pool statistics. One of the main purposes of the shared pool is to provide a pre-parsed pool of SQL statements that can quickly be reused. This header section shows the amount of memory being utilized for reusable statements. Generally, if 70 to 80 percent (or higher) of memory is being utilized, then good reuse is occurring in the database. If the percentages are less than 70 percent, then the application should be reviewed for proper SQL reuse techniques such as PL/SQL encapsulation and bind variable utilization.

The next few sections of the header really help pinpoint where DBAs or tuning users should look for the problems causing performance issues in the database. This next section is shown in Figure 3.

Figure 3: Wait, CPU, and memory statistics

Of the statistics in this next section of the header, the top five wait statistics are probably the most important. The wait interface in the Oracle Database stores counters for waits and timings for several hundred wait events that instrument the internals of the database. By examining the wait statistics you can quickly find the pinch points for performance in your database. In our example in Figure 4 we can see that the following statistics are the dominant waits:

Figure 4: Dominant waits after analysis

Obviously, with 54 percent of the wait activity (and probably more) being I/O related, the I/O subsystem on this Oracle RAC setup is being stressed. Note that the CPU is showing 11.9 percent usage during this period.
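Outside of the report itself, a similar top-waits picture can be pulled from the wait interface directly; a sketch against v$system_event (cumulative since instance startup, so it is not a delta like AWR):

```sql
-- Top five timed wait events, excluding idle waits.
SELECT * FROM (
  SELECT event,
         total_waits,
         ROUND(time_waited_micro / 1e6) AS seconds_waited
  FROM   v$system_event
  WHERE  wait_class <> 'Idle'
  ORDER  BY time_waited_micro DESC)
WHERE ROWNUM <= 5;
```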
In a normal system, the CPU should show the dominant time percentage. Other wait events that might dominate an Oracle RAC environment could be redo log related, interconnect related, or undo tablespace related. Note that the fifth largest wait was the "gc current block 2-way." This indicates that the same block was being shared back and forth across the interconnect. Of course, because this is a parallel query environment with the parallel query being not only cross-table and cross-index but cross-node, some amount of interconnect related waits are expected. However, if "gc" related waits dominated the top five wait events section, this would indicate there was definite stress on the interconnect and it was a significant source of wait issues.

In this case the predominant wait is the "db file sequential read," which indicates that single block reads (i.e. index reads) are causing issues. Normally this would be resolved by adding more db block cache memory (server memory); however, our system is memory constrained, so if we can't remove the waits we would look to reduce the wait time per incident. By increasing the number of disks in the array and increasing the spread of the files causing the reads, we could possibly reduce this wait to as small as five milliseconds (maybe lower if we move to a more expensive cached SAN setup), but this would be the limit in a disk-based system. The only way to further reduce the value would be to increase the amount of server memory through a server upgrade or decrease the read latency by moving to IBM FlashSystem. The other read-based waits would also benefit from either more memory or faster I/O subsystems.

The other major wait that is usually seen when an I/O subsystem is stressed is the "db file scattered read," which indicates full table scans are occurring. Full table scans can usually be corrected by proper indexing. However, in DSS or warehouse situations this may not always be possible.
In the case of DSS or data warehouses, use of partitioning can reduce the amount of data scanned. However, each disk read is going to require at least five milliseconds, and the only way to beat that is through large-scale caching or the use of IBM FlashSystem to reduce latency. Where a disk-based system can have latency greater than 5 milliseconds, IBM FlashSystem provides latency as low as 100 microseconds (25x improvement).

When the "db file sequential read" or "db file scattered read" are the significant wait sources, DBAs need to look in the SQL sections of the report to review the top SQL that is generating excessive logical or physical reads. Usually, if a SQL statement shows up in two or more of the SQL subsections, it is a top candidate for tuning actions.

A key indicator for log file stress (redo logs) would be the log file sync, log file parallel write, log file sequential write, log file single write, or log file switch completion wait events dominating the top five wait event listings; however, you must make sure that the waits truly stem from I/O-related issues and not issues such as archive logging before taking action. Usually, log file stress occurs when the log files are placed on the same physical disks as the data and index files, and it can usually be relieved by moving the logs to their own disk array section. However, if high wait times for log-related events occur, then moving the logs to an IBM FlashSystem is indicated. While the AWR report does not show latency for redo log activities, redo log writes can be very latency sensitive in environments with heavy write activity, especially those with single-threaded synchronous I/O.
In heavy write environments, IBM FlashSystem reduces the latency for redo log writes, thus improving a transactional system's ability to support higher concurrency.

The next section of the report shows the breakdown of the CPU timing and run queue status (Load Average) for the same interval. The run queue tells you how many processes are waiting to execute. If the run queue exceeds the number of available CPUs and the CPUs are not idle, then increasing the number of CPUs or upgrading the speed of the CPUs is indicated, assuming all other tuning actions, such as reducing recursion, have been accomplished. As you can see from the report section above, our CPU was 83 percent idle during the period while I/O waits were 45 percent, thus CPU stress was not causing the run queue of 3. It was most likely I/O-related wait activity. The other statistics in the section show the amount of time utilized in user and system modes of the CPU, as well as the percentage of time the CPU was idle and the average I/O wait. If the I/O wait percentage is high, then increasing the number of disks (after proper tuning has occurred) may help. If you have already tuned SQL and I/O wait is still a large percentage of the total waits, then your best choice is moving to a lower latency I/O subsystem such as IBM FlashSystem.

Following the CPU timing section, the Instance CPU section shows how efficiently this instance was using the CPU resources it was given. This instance utilized the total CPU time available for only 14.8 percent of the time. Of that 14.8 percent, 85 percent of the CPU time was utilized for processing. Because no resource groups are in use in the database, zero percent of the CPU was used for resource management with the resource manager. This again points out that the system was I/O bound, leaving the system basically idle while it waited on disks to serve data.

The last section of the header deals with memory usage.
According to Oracle, you should only use about 60 percent of the memory in your system for Oracle Database; however, as memory sizes increase this old saw is showing its age. Nonetheless, in memory-constrained 32-bit systems such as the one this report came from, 60 percent is probably a good point to shoot for with Oracle Database memory usage. As you can see, this instance is using 57.64 percent, so it is pretty close to the 60 percent limit. The rest of the memory is reserved for process and operating system requirements. We can see that our System Global Area (SGA) remained fairly stable at 1,584 megabytes while our PGA usage grew from 169 to 302 megabytes. This again points to the system being a reporting, DSS, or data warehouse system utilizing a lot of sort area.

Oracle RAC-specific pages
Once we get out of the header and system profiles area, if you are using Oracle RAC you get to an Oracle RAC-specific section that will not be present if Oracle RAC is not being used. The first part of the Oracle RAC-specific statistics deals with profiling the global cache load. This section of the report is shown in Figure 5.

Figure 5: Oracle RAC load profiles

The first part of the listing shows how many instances you started with in the Oracle RAC environment and how many you ended up with. This is important because cross-instance parallel operations would be directly affected by the loss or addition of any instances in the Oracle RAC environment.

The next part of the Oracle RAC report shows the Global Cache Load Profile, which indicates how stressed the global cache may have been during the period monitored. In the report shown above we only transferred a total of 481 kilobytes per second across an interconnect capable of handling 100 megabytes per second, so we were hardly using the interconnect, in spite of our cross-instance parallel operations.
This is further shown by the low Buffer Access - Remote cache statistic of 0.65 percent, which is telling us that only 0.65 percent of our blocks came from the other node's Oracle RAC instance.

Things to watch out for in this section include severe instance unbalancing, where the Global Cache blocks received versus the Global Cache blocks served is way out of alignment (they should be roughly equal). Another possible indication of problems is excessive amounts of Database Writer (DBWR) fusion writes. Fusion writes should only be done for cache replacement, which should be an infrequent operation. If fusion writes are excessive, it could indicate inadequately sized db block cache areas, excessive checkpointing, excessive commits, or a combination of all of the above.

The next Oracle RAC-specific section deals with the actual timing statistics associated with the global cache. You need to pay close attention to the various block service timings. If the time it takes to serve a block across the interconnect exceeds the time it would take to read it from disk, then the interconnect is becoming a bottleneck instead of a benefit. The section is shown in Figure 6.

Figure 6: Global cache and enqueue workload section

The most important statistics in this entire section are the average block receive and serve times. These should be compared to an AWR report run on the other instance: if the numbers on both or all Oracle RAC instances aren't similar, then this could indicate a problem with the interconnect, either at the OS buffer level or with the NIC or interface cards themselves. Notice the high millisecond values in Figure 6; these are not correct values and probably point to an issue with the way AWR is collecting the data, because they are a direct input to the product that results in the receive times.
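When interconnect problems are suspected, you can confirm which interface each instance is actually using from the data dictionary; a sketch:

```sql
-- One row per instance and interface; IS_PUBLIC should be NO for the
-- interface carrying global cache traffic.
SELECT inst_id, name, ip_address, is_public, source
FROM   gv$cluster_interconnects
ORDER  BY inst_id;
```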
The Global Enqueue timing numbers should be less than 2-3 milliseconds in most cases. If they get anywhere near 20 milliseconds, as stated in current Oracle documentation, you have a serious issue with the global enqueue services (GES) part of the global dictionary.

The last part of this Oracle RAC-specific section deals with the interconnect. You should verify that the interconnect is the proper private interconnect and that it is not a public Ethernet. If you see excessive global cache services (GCS) values in previous sections, be sure that the proper interconnect is being used.

Time breakdown statistics
Another nice addition to the statistics in AWR over Statspack involves the time breakdown statistics that show the components of CPU, OS, and other time fields. The first shown is the CPU statistics breakdown; this is shown in Figure 7.

Figure 7: CPU time breakdown

Much of how the CPU seconds are determined is a black box. For example, it is a real mystery how in one section the CPU was only utilized 14.8 percent for this instance and yet the report shows sql execute elapsed time as 8,145.5 seconds when, with two CPUs, there are only 7,200 seconds of CPU time in an hour. It may be including all calls that completed during the interval, which of course may then include calls whose majority of time was actually outside of the measurement interval. However, that is a topic for another discussion.
In the report excerpt in Figure 7 we see that a majority of the CPU time allotted to us for this measurement interval (97.3 percent) was spent in sql execute elapsed time, and this really is precisely where we want the CPU to be spending its time. If we were to see that parse time elapsed or hard parse elapsed time were consuming a large portion of time, it would indicate that we either had an ad-hoc environment with a majority of unique SQL statements or we have an application that is not using bind variables properly. Of course, if the CPU was spending its time in any of the other areas for a majority of the reported time, that segment of processing should be investigated.

Operating system statistics
The next section of the AWR report shows operating system-related settings and statistics. Figure 8 shows an example report section for OS statistics.

Figure 8: OS statistics

The OS section of the report gives us the time breakdown in CPU ticks to support the percentages claimed in the other sections of the report. The correlation of reported ticks to actual ticks is still a bit foggy. However, examination of this section still shows that the system being examined is not CPU bound but is suffering from I/O contention, because both the idle time and I/O wait time statistics are larger than the busy time value. This part of the report also shows us the TCP/UDP buffer settings, which are useful when determining issues in Oracle RAC. One problem often noted is that the I/O wait time reported in this section may not be accurate and should not be used for analysis.

Foreground wait events
The next section of the AWR report shows the foreground wait events, which are wait events that occur in foreground processes. Foreground processes are the user or application-level processes. Figure 9 shows the excerpt from the report we are analyzing.
Figure 9: Foreground wait events

The wait events that accounted for less than 0.1 percent of DB time have been omitted for brevity's sake (the actual listing is two pages long). The thing to note about this section of the report is that we have already looked at the main top five events, which are where we should focus our tuning efforts. However, if you are attempting to tune some specific operation or database section, the other waits in this section may apply to that effort. One thing to mention: if you see a large number of waits for the read by other session event, this usually indicates your block size is too large, resulting in contention. If you see that read by other session is one of the predominant wait events, look to the Segments by X sections of the AWR for guidance on which tables and indexes are being heavily utilized, and consider moving these segments to a tablespace with a smaller than default block size, such as 4K or even 2K, or to lower latency storage such as IBM FlashSystem to reduce this contention.

Background wait events
Background wait events, as their name implies, are waits generated by the numerous background processes in the Oracle Database process stack. DBWR, the log writer process (LGWR), the system monitor process (SMON), and the process monitor (PMON) all contribute to the background wait events. The report excerpt, limited to the events with at least 0.1 percent of DB time, is shown in Figure 10.

Figure 10: Background wait events

As we can see, the events that dominate the background waits are also I/O related. If the events that are top in the foreground are similar (such as both being control file related), then that is the I/O area in which we should concentrate the tuning efforts. As we can see from the report excerpt, while the I/O-related waits are similar, the predominant ones in each section are different, indicating a general I/O issue rather than an issue with a specific set of files.
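The same segment-level view referenced above is available outside the report through v$segment_statistics; a sketch for finding the objects behind heavy read activity:

```sql
-- Top ten segments by physical reads since instance startup.
SELECT * FROM (
  SELECT owner, object_name, object_type, value AS physical_reads
  FROM   v$segment_statistics
  WHERE  statistic_name = 'physical reads'
  ORDER  BY value DESC)
WHERE ROWNUM <= 10;
```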
Wait event histograms
In the next section of the AWR, Oracle provides a time-based histogram report for the wait events. If the histogram report was ordered by the predominant wait events by time, it would be more useful; instead it is ordered by event name, making us have to search for the important events. The liberty has been taken to remove the unimportant events from the listing example in Figure 11 for brevity's sake.

Figure 11: Wait event time histograms

The most important events have been bolded in the above excerpt. Notice that the read events are more important than the write events. With an Oracle Database, unless we are talking direct writes, undo or redo log writes, or control file writes, Oracle Database uses the concept of delayed block cleanout, only writing blocks to disk when absolutely needed. This delayed block cleanout mechanism means that for most data-related writes we aren't too concerned with write times unless our application does frequent commits and the data is needed nearly immediately after the commit.

Because this application is a read-dominated application, we aren't seeing a lot of redo log and undo tablespace related events. In an OLTP-type environment, we would expect to see log writes and log syncs rise to the top in an I/O bound system as dominant events. In systems that generate a lot of transactions, we would also expect to see undo tablespace related events be more prevalent.

By looking at the histograms we can see that our read events are taking anywhere from four milliseconds (ms) to one second to complete. This is a typical disk-based histogram. However, in our histogram the largest number of reads by percent are taking more than 5 ms to complete. This indicates disk stress is happening. We could possibly reduce this nearer to 5 ms by increasing the number of disks in our disk array.
However, you cannot expect to get to less than 5 ms read or write times in a disk-based system unless you place a large amount of cache in front of the disks. Another option, which can be more cost effective and enable greater inputs/outputs per second (IOPS), is to use IBM FlashSystem technology, which should provide less than 0.5 ms latency.

Service related statistics
Since Oracle Database version 10g, Oracle is increasingly using the concept of a database "service." A service is a grouping of processes that are used to accomplish a common function. For example, all of the parallel query slaves and processes used to provide the results for a series of SQL statements for the same user could be grouped into a service. Figure 12 shows the service related section from our example report.

Figure 12: Service related statistics

The service related statistics allow you to see which users are consuming the most resources. By knowing this, you can concentrate your tuning activities on the hard hitters. In the report excerpt in Figure 12, we see the generic service "aultdb," which holds all the user processes that are non-background and non-sys owned. Because there was only one set of processes (we know this because we are good DBAs that keep tabs on what is happening in our system), we can track the usage back to a user called tpch. From looking at the second half of the report, we can see that our user experienced over 517,710 I/O-related waits for a total effective wait time of 4,446 seconds, or 8.59 milliseconds per wait. Because we know that the best wait time we can get with a disk-based, non-cached system is 5 milliseconds, a wait time of 8.59 milliseconds shows the disks are experiencing some stress. The higher this type of wait is, the more stress experienced by the I/O subsystem. This section of the report can show where the timing issues are occurring.
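The per-service numbers summarized in this section come from the service statistics views, which can also be queried live; a sketch:

```sql
-- DB time and user I/O wait time per service, cumulative since startup.
SELECT service_name, stat_name, value
FROM   v$service_stats
WHERE  stat_name IN ('DB time', 'user I/O wait time')
ORDER  BY service_name, stat_name;
```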
The SQL sections
The next sections of the report slice and dice the SQL in the shared pool by several different statistics. By using the waits and other statistics we have discussed so far, you can usually figure out which SQL area to examine. A general rule of thumb is that if a SQL statement appears in the top five statements in two or more areas, it is a prime candidate for tuning. The sections are:

+ Total elapsed time
+ Total CPU time
+ Total buffer gets
+ Total disk reads
+ Total executions
+ Total parse calls
+ Total sharable memory
+ Total version count
+ Total cluster wait time

Let's look at each section and discuss the indicators that would lead you to consider investigating the SQL in each.

Total elapsed time
If a SQL statement appears in the total elapsed time area of the report, this means its CPU time plus any other wait times made it pop to the top of the pile. If for some reason it is at the top of the total elapsed time but not at the top of total CPU time, this indicates that there is an issue with recursion associated with this statement. Generally, you will see the same SQL in both the total elapsed and total CPU time sections. Look here if you see high recursion indicators, such as sub-optimal parse ratios in the header, or if the recursive calls or recursive CPU usage statistics in the Instance Activity Statistics (the section following the SQL areas) are high.

Total CPU time
When a SQL statement appears in the total CPU time area, this indicates it used excessive CPU cycles during its processing. Excessive CPU processing time can be caused by sorting, excessive functions, or long parse times. Indicators that you should be looking at this section for SQL tuning candidates include high CPU percentages in the services section for the service associated with this SQL. (Hint: if the SQL is uppercase it probably comes from a user or application; if it is lowercase it usually comes from internal or background processes.)
To reduce total CPU time, reduce sorting by using multi-column indexes that can act as sort eliminators and use bind variables to reduce parse times.

Total buffer gets
"Total buffer gets" means a SQL statement is reading a lot of information from the db block buffers. Generally speaking, buffer gets (or logical reads in Statspack) are desirable, except when they become excessive. Like excessive disk reads, excessive buffer gets can cause performance issues, and they are reduced in the same way. To reduce excessive total buffer gets, use partitioning, use indexes, and look at optimizing SQL to avoid excessive full table scans. Total buffer gets are typified by high logical reads, a high buffer cache hit ratio (when they are driven by a poor-selectivity index), and high CPU usage.

Total disk reads
High total disk reads mean a SQL statement is reading a lot of information from disks rather than from the db block buffers. Generally speaking, disk reads (or physical reads in Statspack) are undesirable, especially when they become excessive. Excessive disk reads cause performance issues. To reduce excessive disk reads, use partitioning, use indexes, and look at optimizing SQL to avoid excessive full table scans. You can also increase the db buffer cache if memory is not an issue. Total disk reads are typified by high physical reads, a low buffer cache hit ratio, and low CPU usage with high I/O wait times. If disk reads are a natural part of your database workload (such as DSS or data warehouses, where full table scans are a natural result of their structure) then moving to IBM FlashSystem will improve your performance, sometimes dramatically.

Total executions
High total executions can be an indicator that you are doing something correct with the SQL in the database. Statements with high numbers of executions usually are being properly reused.
However, be sure that statements with high numbers of executions are supposed to be executed multiple times; an example would be a SQL statement executed over and over again in a PL/SQL, Java, or C routine in a loop when it should only execute once. Statements with high executions and high logical and/or physical reads are candidates for review, to be sure they are not being executed multiple times when a single execution would serve. If the database is seeing excessive physical and logical reads or excessive I/O wait times, then look at the SQL statements that show excessive executions along with high physical and logical reads.

Parse calls
Whenever a statement is issued by a user or process, regardless of whether it is in the SQL pool, it undergoes a parse. The parse can be a hard parse or a soft parse. If Oracle cannot find an identical hash signature in the SQL pool it does a hard parse, with loads of recursive SQL and all the rest of the parse baggage. If it finds the SQL in the pool then it simply does a soft parse, with minimal recursion to verify user permissions on the underlying objects. Excessive parse calls usually go with excessive executions. If the statement is using what are known as unsafe bind variables then the statement will be reparsed each time. If the header parse ratios are low, look here and in the version count areas.

Shareable memory
The shareable memory area provides information on SQL statements that are reused and the amount of memory in the shared pool that they consume. Only statements with more than 1,048,576 bytes of shared memory usage are shown in the report. Usually, high memory consumption is a result of poor coding or overly large SQL statements that join many tables. In a DSS or data warehouse (DWH) environment, large complex statements may be normal.
In an OLTP database, large or complex statements are usually the result of over-normalization of the database design, attempts to use an OLTP system as a DWH or DSS, or poor coding techniques. Usually large statements will result in excessive parsing, recursion, and large CPU usage.

Version count
High version counts are usually due to multiple identical-schema databases, unsafe bind variables, or software bugs. In Oracle Database 9i there are bugs that result in unsafe bind variables driving multiple versions. Multiple versions eat up SQL memory space in the shared pool. High version counts can also result in excessive parsing. Setting the undocumented parameter "_sqlexec_progression_cost" to higher than the default of 1,000 decreases versioning in susceptible versions. High values for sharable memory in the SQL pool can indicate issues if you aren't seeing good performance along with high sharable memory for statements with executions greater than 1.

Cluster wait time
As the name implies, the cluster wait time will only be present if you are using an Oracle RAC system. SQL that transfers a high number of blocks across the interconnect will be listed in this section. High levels of block transfer occur if the block size is too large, the db caches on each server are too small, or the SQL is using too much of the table data. Large update statements may appear here because updates require block transfers in many cases for current blocks. High levels of GC-type wait events indicate you need to check this section for causative SQL statements.

Instance activity statistics
The next section deals with instance activity statistics. The biggest problem with the instance activity statistics is that there are so many of them, and many are not useful except in specific tuning situations that may never occur in a normal database. The example excerpt from the report we have been examining is in Figure 13.
The excerpt has had the statistics that aren't normally a concern removed.

Figure 13: Instance activity statistics

It is best to focus on the larger summary-type statistics, at least at first, when dealing with this section of the report. Even the pared-down list in Figure 13 still has many entries that may not really help novices find problems with their databases. One of the biggest hurdles to understanding the statistics in the Instance Activity section is knowing what the time units are for the time-based statistics. Generally they will be reported in milliseconds; so, for example, the DB time value of 2,547,336 corresponds to 2,547.336 seconds out of a possible 7,200 (there are two equivalent CPUs) in an hour, yielding a percentage of total time of 35 percent. Of that 2,547 seconds, 910 (effective I/O time) were spent doing I/O-related items, leaving only 1,637 seconds of processing time, or 22 percent of available time. Looking at other CPU-related timings, parsing took another 399 milliseconds and recursive SQL took 77,189, for a total non-processing time of 77,588 milliseconds, rounded up to 78 seconds. That means only about 1,559 seconds, or 21 percent of total CPU time, was used to do actual work on our queries. Of course there are other, non-database activities that also eat a bit of CPU time, dropping the total to the reported 18 or so percent.

So what else is contained in this mine of data? We have an effective I/O time and the number of I/Os issued. From this we can see that each I/O cost an effective time of 32 milliseconds. No wonder we spent about 45 percent of the time waiting on I/O! By looking at SQL*Net roundtrips we can tell if our application is making effective use of array processing. If it is taking hundreds or thousands of roundtrips per transaction then we really need to examine how our application is handling arrays. By default, languages like C and Java only process 10-20 records at a time; SQL*Plus defaults to 10.
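The time accounting above can be reproduced directly from the report's numbers. A minimal sketch (the values are the ones quoted from the example excerpt; the variable names are our own):

```python
# Values quoted from the example AWR excerpt
db_time_ms        = 2_547_336   # DB time, milliseconds
cpu_seconds_avail = 7_200       # 2 equivalent CPUs x 3,600 s snapshot window
effective_io_s    = 910         # effective I/O time, seconds
parse_ms          = 399         # parse time
recursive_ms      = 77_189      # recursive SQL time

db_time_s = db_time_ms / 1000.0                    # 2,547.336 s
pct_of_available = db_time_s / cpu_seconds_avail   # ~35% of CPU capacity

processing_s = db_time_s - effective_io_s          # ~1,637 s not spent on I/O
overhead_s   = (parse_ms + recursive_ms) / 1000.0  # ~78 s of parse + recursion
useful_s     = processing_s - overhead_s           # ~1,559 s of real query work

print(f"{pct_of_available:.0%} of CPU capacity used, "
      f"{useful_s:.0f} s of useful work")
```

The same subtraction chain works for any snapshot: DB time, minus effective I/O time, minus parse and recursive overhead, leaves the time actually spent producing results.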
By increasing array processing via precompiler flags or by the "SET ARRAYSIZE" command in SQL*Plus we can greatly reduce roundtrips and improve performance. Bytes sent and received to and from the clients via SQL*Net can be used with roundtrips to see how large a chunk is being shipped between the client and the server, allowing insight into possible network tuning. In addition, this information can be used to see if the network is being strained (generally speaking, 1 gigabit Ethernet can handle about 100 megabytes per second of transfers).

Consistent get statistics
Consistent gets deal with logical reads and can be heavyweight (using two latches, as in a normal consistent get) or lightweight (using one latch, as in consistent get — examination). Large numbers of consistent gets can be good, or, if they are excessive because of poor index or database design, bad, because they can consume CPU resources best used for other things. These statistics are used in conjunction with others, such as those involving your heavy-hitter SQLs, to diagnose database issues.

DB block get statistics
DB block gets are current mode gets. A current mode get is for the data block in its current incarnation, with incarnation defined as all permanent changes applied (for example, if the database shut down now and restarted, this is the block you would get). This block can come from the cache, from another instance's cache, from the filesystem cache, or from disk. Sometimes it will result in a disk read or a block transfer from another instance's cache, because there can only be one version of the current block in the cache of an instance at a time.

Dirty block statistics
The dirty buffers inspected statistic tells you how many times a dirty buffer (one that has changes) was looked at when the processes were trying to find a clean buffer. If this statistic is large then you probably don't have enough buffers assigned, because the processes are continually looking at dirty buffers to find clean ones.
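The effect of array size on roundtrips is simple to estimate. A minimal sketch (the row count is made up; real fetch counts also include a final fetch that detects end-of-data, which we ignore here):

```python
import math

def fetch_roundtrips(rows, arraysize):
    """Approximate SQL*Net roundtrips needed to fetch `rows` rows."""
    return math.ceil(rows / arraysize)

rows = 100_000
print(fetch_roundtrips(rows, 10))   # SQL*Plus default arraysize -> 10,000 trips
print(fetch_roundtrips(rows, 500))  # after SET ARRAYSIZE 500    -> 200 trips
```

A 50x reduction in roundtrips for the same result set is why the SQL*Net roundtrip statistic is worth comparing against transaction counts.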
Enqueue statistics
The enqueue statistics dealing with deadlocks, timeouts, and waits tell you how often processes were waiting on enqueues and if they were successful. High numbers of enqueue deadlocks indicate there may be application locking issues; high numbers of waits and failures indicate high levels of contention. You need to look in the enqueue section of the report to see the specific enqueues that are causing the problems.

Execution count
The execute count statistic is used with other statistics to develop ratios that show how much of a specific resource or statistic applies to a single execution on the average. This can be misleading, however, if there are several long-running transactions and many short supporting transactions. For example, a large DSS query that requires a number of recursive SQL operations to parse it will drive the executions up, but you are really only interested in the large DSS query and not the underlying recursive transactions, except as they contribute to the DSS transaction itself.

Free buffer statistics
The free buffers requested versus the free buffers inspected statistics show how many buffers, while not actually dirty, were being used by other processes and had to be skipped when searching for a free buffer. If the free buffers inspected statistic is overly large and the dirty buffers inspected statistic is also large, then look at commit frequency as well as possibly increasing the total db block buffers, because the cache is probably congested.

GC statistics (global cache)
The GC statistics show the components that make up the send times for the consistent read (CR) and current blocks. The statistics for build, flush, and send for the respective type of block (CR or current) are added together and divided by the blocks of that type sent to determine the latency for that operation.
The receive times can be divided by the number of blocks received to determine that latency (and should be compared with the send latency as calculated from the other instance's AWR report). By seeing the components of the latency for send operations you can determine if the issue is internal (build or flush times are large) or external (the send time is large). The GC and GES statistics will only be present if Oracle RAC is being utilized. Remember that if send or receive times are greater than the average disk latency, then the interconnect has become a source of performance bottlenecks and needs to be tuned or replaced with a higher-speed and larger-bandwidth interconnect such as InfiniBand. If only one node is showing issues (excessive send times point to this node, excessive receive times point to the other nodes) then look to excessive load, TCP buffer settings, or NIC card issues on that node.

The global enqueue statistics haven't been shown because they haven't been a large source of performance issues. If the messaging shows large latencies, it will also show in the global cache services, because global cache activity depends on the global enqueue service.

Index scan statistics
There are two main index scan statistics:

Index fetch by key — This statistic is incremented for each "INDEX (UNIQUE SCAN)" operation that is part of a SELECT or DML statement execution plan.

Index scans kdiixs1 — This statistic is incremented for each index range scan operation that is not one of the types index fast full scan, index full scan, or index unique scan.

By comparing the two values you get an idea of the ratio of single index lookups versus range scans. In most systems, single index lookups should predominate because they are usually more efficient. However, in DSS or DWH systems, scans or fast scans may become the dominant type of index activity.
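The GC latency arithmetic described above reduces to two divisions per block class. A minimal sketch (the statistic values are made up for illustration; a real AWR report gives these separately for CR and current blocks):

```python
def gc_send_latency_ms(build_ms, flush_ms, send_ms, blocks_sent):
    """Average send latency per block: (build + flush + send) / blocks sent."""
    return (build_ms + flush_ms + send_ms) / blocks_sent

def gc_receive_latency_ms(receive_time_ms, blocks_received):
    """Average receive latency per block, for comparison with the sender's view."""
    return receive_time_ms / blocks_received

# Hypothetical CR-block numbers for one instance
send = gc_send_latency_ms(build_ms=1200, flush_ms=800, send_ms=2000,
                          blocks_sent=10_000)
recv = gc_receive_latency_ms(receive_time_ms=6000, blocks_received=12_000)
print(f"send {send:.2f} ms, receive {recv:.2f} ms")
```

If either figure exceeds your average disk latency, the interconnect, not storage, is the bottleneck; a large build or flush component points inside the node, while a large send component points at the network.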
Leaf node statistics
The leaf node statistics refer to index leaf nodes and tell you how much insert activity is happening in your database. The 90-10 splits show activity for monotonically increasing indexes (those that use sequences or dates, generally speaking) and the 50-50 splits show other types of index activity, such as text or random value indexes. If you see heavy 90-10 split operations then you might want to look at index management operations to be sure your indexes aren't getting too broad due to excessive unused space in your sequence- or date-based indexes. Usually index rebuilds are only required in databases that have monotonically increasing indexes that also undergo large amounts of random deletions, resulting in numerous partially filled blocks.

Open cursors
The open cursors cumulative statistic is used with other statistics to calculate ratios for resources used per cursor or cursors open per login, for example.

Parse statistics
The parse statistics are used to show how efficiently you are using parses. If you have large numbers of parse count (total) or large numbers of parse count (hard) it could indicate a large number of ad-hoc queries. A large number of hard parses (greater than 10 percent of parses) indicates that the system probably isn't using bind variables efficiently. If there is a large discrepancy between parse CPU and parse elapsed times, it indicates that the system is overloaded and may be CPU bound.

Physical read and write statistics
For the physical reads and writes statistics we will look at their definitions from the Oracle Database 10g Reference manual:

Physical reads — Total number of data blocks read from disk. This value can be greater than the value of "physical reads direct" plus "physical reads cache" because reads into process private buffers are also included in this statistic.

Physical read bytes — Total size in bytes of all disk reads by application activity (and not other instance activity) only.
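The 10-percent rule of thumb for hard parses is easy to check once the two counters are extracted. A minimal sketch (the counter values are made up; the function name is our own):

```python
def hard_parse_ratio(parse_total, parse_hard):
    """Fraction of all parses that were hard parses."""
    return parse_hard / parse_total

# Hypothetical 'parse count (total)' and 'parse count (hard)' values
ratio = hard_parse_ratio(parse_total=50_000, parse_hard=7_500)
if ratio > 0.10:
    # 15% hard parses: likely literal SQL instead of bind variables
    print(f"hard parse ratio {ratio:.0%}: check bind variable usage")
```

Anything over the 10 percent threshold suggests looking at the SQL-by-parse-calls section for statements being hard parsed repeatedly.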
Physical read I/O requests — Number of read requests for application activity (mainly buffer cache and direct load operation) which read one or more database blocks per request. This is a subset of the "physical read total I/O requests" statistic.

Physical read total bytes — Total size in bytes of disk reads by all database instance activity, including application reads, backup and recovery, and other utilities. The difference between this value and "physical read bytes" gives the total read size in bytes by non-application workload.

Physical read total I/O requests — Number of read requests which read one or more database blocks for all instance activity, including application, backup and recovery, and other utilities. The difference between this value and "physical read total multi block requests" gives the total number of single block read requests.

Physical read total multi block requests — Total number of Oracle Database instance read requests which read in two or more database blocks per request for all instance activity, including application, backup and recovery, and other utilities.

Physical reads cache — Total number of data blocks read from disk into the buffer cache. This is a subset of the "physical reads" statistic.

Physical reads direct — Number of reads directly from disk, bypassing the buffer cache. For example, in high bandwidth, data-intensive operations such as parallel query, reads of disk blocks bypass the buffer cache to maximize transfer rates and to prevent the premature aging of shared data blocks resident in the buffer cache.

Physical reads prefetch warmup — Number of data blocks that were read from the disk during the automatic prewarming of the buffer cache.

Physical write bytes — Total size in bytes of all disk writes from the database application activity (and not other kinds of instance activity).
Physical write I/O requests — Number of write requests for application activity (mainly buffer cache and direct load operation) which wrote one or more database blocks per request.

Physical write total bytes — Total size in bytes of all disk writes for the database instance, including application activity, backup and recovery, and other utilities. The difference between this value and "physical write bytes" gives the total write size in bytes by non-application workload.

Physical write total I/O requests — Number of write requests which wrote one or more database blocks from all instance activity, including application activity, backup and recovery, and other utilities. The difference between this value and "physical write total multi block requests" gives the number of single block write requests.

Physical write total multi block requests — Total number of Oracle Database instance write requests which wrote two or more blocks per request to the disk for all instance activity, including application activity, recovery and backup, and other utilities.

Physical writes — Total number of data blocks written to disk. This statistic's value equals the sum of the "physical writes direct" and "physical writes from cache" values.

Physical writes direct — Number of writes directly to disk, bypassing the buffer cache (as in a direct load operation).

Physical writes from cache — Total number of data blocks written to disk from the buffer cache. This is a subset of the "physical writes" statistic.

Physical writes non checkpoint — Number of times a buffer is written for reasons other than advancement of the checkpoint. Used as a metric for determining the I/O overhead imposed by setting the FAST_START_IO_TARGET parameter to limit recovery I/Os. (Note that FAST_START_IO_TARGET is a deprecated parameter.) Essentially this statistic is the number of writes that would have occurred had there been no checkpointing.
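Several useful derived values fall out of these definitions by simple subtraction, as the definitions themselves note. A minimal sketch (the statistic values are made up for illustration):

```python
# Hypothetical AWR statistic values, bytes and request counts
stats = {
    "physical read total bytes":                8_000_000_000,
    "physical read bytes":                      6_500_000_000,
    "physical read total IO requests":          500_000,
    "physical read total multi block requests": 120_000,
}

# Non-application read volume (backup, recovery, utilities):
# total bytes minus application bytes
non_app_read_bytes = (stats["physical read total bytes"]
                      - stats["physical read bytes"])

# Single-block read requests: total requests minus multiblock requests
single_block_reads = (stats["physical read total IO requests"]
                      - stats["physical read total multi block requests"])

print(non_app_read_bytes, single_block_reads)  # 1500000000 380000
```

The same two subtractions apply to the write-side statistics; a large non-application share often turns out to be backup or RMAN activity landing inside the snapshot window.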
Subtracting this value from "physical writes" gives the extra I/O for checkpointing.

Recursive statistics
The recursive calls statistic can be used in ratio with the user calls statistic to get the number of recursive calls per user call. If the number of recursive calls is high for each user call then this indicates you are not reusing SQL very efficiently. In our example printout the ratio is about 10 to 1, which is fine. If the ratio was 50 to 1 or greater it would bear investigation. Essentially, you need to determine what is a good ratio of recursive calls to user calls for your system; it will depend on the number of tables on average in your queries, the number of indexes on those tables, and whether or not the Oracle Database has to reparse the entire statement or if it can instead use a soft parse. This ratio is actually reported in the header information. We have already shown how the recursive CPU statistic is used with the CPU usage and other CPU-related timings.

Redo related statistics
The redo-related statistics can be used to determine the health of the redo log activity and the LGWR process. By dividing redo log space wait time by redo log space requests you can determine the wait time per space request. If this time is excessive it shows that the redo logs are under I/O stress and should be moved to IBM FlashSystem. In a similar calculation the redo synch time can be divided by the redo synch writes to determine the time taken during each redo sync operation. This too is an indicator of I/O stress if it is excessive. A final indicator of I/O stress is the ratio of redo write time to redo writes, giving the time for each redo write. The redo wastage statistic shows the amount of unused space in the redo logs when they were written; excessive values of redo wastage per redo write indicate that the LGWR process is being stressed.
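The three per-event redo timings described above are all the same division. A minimal sketch (the statistic values are made up; we assume the cumulative timings are in centiseconds, the unit Oracle commonly uses for these counters, and convert to milliseconds):

```python
def per_event_ms(total_time_cs, event_count):
    """Convert a cumulative timing in centiseconds to milliseconds per event."""
    return total_time_cs * 10.0 / event_count

# Hypothetical redo statistics from one AWR snapshot
space_wait = per_event_ms(total_time_cs=500, event_count=250)        # per space request
sync_wait  = per_event_ms(total_time_cs=12_000, event_count=60_000)  # per redo synch write
write_wait = per_event_ms(total_time_cs=9_000, event_count=45_000)   # per redo write
print(space_wait, sync_wait, write_wait)  # 20.0 2.0 2.0
```

In this made-up case the 20 ms per space request stands out: sessions are stalling waiting for log file space, the classic sign that the redo logs need faster storage.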
The rollback changes — undo records statistics are actually rollback changes — undo records applied. According to Jonathan Lewis, if a session's "user rollbacks" is large but its "rollback changes — undo records applied" is small (and those numbers are relative to your system), then most of the rollbacks are doing nothing. So by comparing these two metrics you can determine, relative to your system, if you have an undo issue. Undo issues deal with rollback commands, either explicit or implicit. Explicit rollbacks are generated by issuing the rollback command, while implicit rollbacks can come from DDL, DCL, or improper session handling.

Session cursor statistic
The session cursor cache hits statistic shows how often a statement issued by a session was actually found in the session cursor cache. The session cursor cache is controlled by the session_cached_cursors setting and defaults (usually) to 50. If you see that the ratio of session cursor cache hits/user
