Doug's Oracle Blog
Jul 30: AWR Differences Report
Some features in this post require a Diagnostics Pack license.
This morning I had an opportunity to use one of my favourite AWR tools, the differences report. Our system has a fairly involved overnight batch schedule consisting of multiple concurrent job streams that starts at 02:00 and usually completes at about 5:40. This morning it finished at 6:40 so I needed to work out what had gone wrong. (Better still, because this wasn't a problem with specific SQL statements, I can blog about it without using any system-specific information that needs to be obscured and that doesn't happen too often.)
Because I don't have access to the server to run $ORACLE_HOME/rdbms/admin/awrddrpt.sql, I used a SQL-only approach.
First find out the snapshots I'm interested in.
Now that I have those snapshot ids, I can generate the AWR differences report, passing in the DBID, Instance Number, Begin and End Snapshot IDs for the first period and then the second period.
The report shows information for both periods, layed out for easy comparison. What immediately jumped out at me was the much higher DB Time over the first period than the second, which is probably more easily expressed as Average Active Sessions over the period. (Some people might say AAS is the magic metric )
Yes, this is over a long 4 hour period but I have the benefit of knowing that this is the same batch process that runs every night and it's performance and workload profile is more or less the same night after night. There's still no substitute for knowing your systems, after all. So to have over 4 times the number of sessions active, averaged over a long period, is a sure sign that something weird is going on and I had an immediate suspicion about what it might be.
Let's look at the Top 5 Timed Events section.
Yes, as I suspected the first period with many more active sessions on average is showing a lot of time spent on the parallel execution-related "PX Deq Credit: send blkd' event which doesn't show up at all in the second period. In fact, that's why it has an asterisk next to it in the first period numbers, to show it's an event which doesn't appear in the top 5 over the second period. That's what I love about this report. It takes the particular strength of AWR/Statspack - comparing 'good' and 'bad' periods and lays it out in readable format. No more sitting there at a desk with print-outs of two Statspack reports, poring over the details, or flicking between two vi sessions paging through the results. Those who've done that will know precisely what I mean
We have 4 times as many sessions active in the first period and there's a lot of parallel wait time that doesn't appear in the second period so I suspect that PX wasn't being used properly in the second period. I can check various statistics for that. This is a small section of the
Instance Activity section of the report.
Yes, every parallel operation in the second period was downgraded to serial. You can see that the use of parallelism isn't very successful in the first 'good' period either, but that's a whole other story and possibly a blog post or two. That's where the big difference in DB Time has come from though - lots of parallel slave activity that wasn't there last night.
I think this is an interesting example because this particular report is showing a 'good' period with higher DB Time than a 'bad' period, which isn't what you'd normally expect to see - that's a side effect of looking at consolidated DB time - there's lots of additional wait time from the various slaves but they're actually helping push more work through the system, so they're a good thing. To put it another way, the instance was much busier during the good period, that's why it was able to push more work through and complete more quickly. Even though that might make sense, if we look at the graphical representation in OEM, here's the good period
and here's the bad
I'm not convinced that people wouldn't look at those two pictures and think 'Mmmmm, things look much worse in the first period'. Oh, I should also point out that that bump at 22:00 each day is the auto stats gather job which ran for the entire maintenance window on the first night.
The root cause was a little more difficult to find out, but I got there in the end. TOAD has a nasty habit of keeping hold of PX slaves long after a parallel query has finished. (Well, that's my experience, using version 18.104.22.168, post a comment if you know different.) Unfortunately we have parallel DEGREE settings of 8 across our logging and monitoring tables (we shouldn't) and so if I query those using TOAD and forget to disable parallel query at the session level, Oracle decides to run them with parallel plans. I had one such query which used 16 slaves - our parallel_max_servers setting - which meant that there were no slaves for any of our batch processes! Result - batch was delayed by an hour.
The punchline, though, is that although the ETL process sets DEGREE to 1 on most of the tables on completion, it doesn't set the indexes to 1 (which is an oversight). The end result is that our users use parallel query during the day for their reports and (this is definitely another blog post) it makes them run more slowly . So now that I have all slaves assigned to my query, it's helped all the user queries run more quickly, so I'm sure they're happy. In fact, maybe I should just run one of those queries each morning from TOAD
The AWR differences report is very handy sometimes and I know it can't be that well known because when I had to request one prior to having the privileges myself, the DBA denied it's existence!