Monitoring Oracle Performance with AppDynamics

One of my customers uses AppDynamics as their tool of choice for Application Performance Management of a multi-tier, B2B/B2C application environment with several Oracle Databases at the back end. This tool gives a graphic overview of the application infrastructure, discovers (JDBC or .net) database queries automatically and provides metrics on the response time for those database calls as well as metrics from the servers where the application components run on.

To my surprise, I found out that there are so-called “Machine Agents” that can be run on the servers to collect performance and other metrics, but there wasn’t any component for collecting metrics from the Oracle RDBMS itself. Thus, born was the idea to create such a component (“Monitor” in the AppDynamics terminology)!

What’s it all about?

The basic idea was to enable application administrators to gain insight on database performance as an additional source of information about the overall performance of their application infrastructure. Usually, application administrators aren’t DBAs and aren’t necessarily versatile with the interpretation of AWR and ASH reports. Furthermore, many larger organizations practice a separation of duties, meaning that an application admin has to call the DBA in charge (or – even worse – file a ticket) to find out if a performance issue of the application has its roots within the database instance or not.

On the DBA side, there are tools like “Oracle Enterprise Manager” that provide extensive insight into the database. But on one hand, those tools don’t show the complete picture of every link in the chain of a business transaction. And on the other hand, their usage is often restricted to DBAs only and therefore not visible to other stakeholders.

So, wouldn’t it be great to have a common repository of performance metrics from every system component? Combined with a front-end that allows for public dashboards and for drill-down into the collected data? Free of surplus charges and including the Java sources for your own customisation? Well – it’s here!

First of all, let’s take a look at a screenshot to demonstrate what this monitor is doing:

3_SIDs_execs

The three colored lines above display the metric “Executions Per Sec” (the number of SQL executes in a second) on three different Oracle instances in this system. This was shot during the ramp-up phase of a test scenario where you can see the different workloads unfold on each instance. The AppDynamics Machine Agent calls its monitors every 60 s (default, but configurable), thus, metrics are sampled on a 60s-basis.

License Restrictions

Many useful metrics are already aggregated in the Active Session History (ASH) or stored in the Automatic Workload Repository (AWR). There’s one problem, however:  Usage of ASH and AWR requires an additional license for Oracle’s Diagnostics Pack. Not every organization is willing to pay the additional license fees, as they can add up quite significantly on multi-CPU hardware.

So there should be an alternative to ASH/AWR or at least an option not to use views that are part of the diagnostics pack.

V$SYSMETRIC to the rescue!
This view displays the system metric values captured for the most current time interval for both the long duration (60-second) and short duration (15-second) system metrics. And with a monitor that samples every 60 seconds, the long duration metrics are a perfect fit!

To explain this in more detail: Before the advent of ASH, Oracle’s performance metrics consisted mostly of cumulative statistics that were counted up from zero at instance startup. To make sense of those metrics, one usually had to build differences of two snapshots, dividing them by the time that passed inbetween the snapshots (e.g., by using Statspack, which still works in 11g, btw). With ASH, Oracle implemented a mechanism of sampling system metrics every second. Those metrics are then aggregated in v$sysmetric for 15- and 60-s-intervals. Values older than 60 s can be retrieved from the DBA_HIST_% views (which are subject to licensing).

But why do all the snapshot and aggregation stuff yourself when – leaving some limitations aside –  Oracle already does that for you? So, to get a v1 of that monitor up and running fast, v$sysmetric became the preferred source of information, as it provides us with ASH metrics but without the cost for ASH. The “H” part of it, the historicising and rollup,  can be done by the AppDynamics Controller.

Information Hiding

In their business best-seller “re:work“, the authors mention how important it is to decide on which features not to include in a product to maintain usability. The same decision had to be made here: The Oracle kernel provides an abundance of metrics (in the hundreds, see the Oracle docs for an idea), but would an application admin really need all of them, not to mention the ability to interpret them? Which metrics would in turn be necessary to get a good picture about what is happening inside the instance?

For v1.0, we decided to collect 42 metrics (Douglas Adams reference partly intended :-) that should mainly serve two purposes:

  1. Provide throughput data to build baselines on. Those baselines can be used to assess throughput changes after changing application code or system configuration.
  2. Provide further information to identify and alert on performance bottlenecks.
ratios

Db CPU Time Ratio vs. Db Wait Time Ratio and Memory Sort Ratio

Most of the information can be retrieved from v$sysmetric and some additional information from v$sessionv$sesstatv$system_wait_class and v$waitclassmetric.

These are the metrics that we chose to collect in the first version. Some of them are scaled up by 100, because AppDynamics displays integers only, which is bad for small numbers with fractional values:

Activity

  • Active Sessions Current: Number of active sessions at the point in time when the snapshot was taken.
  • Average Active Sessions: Average number of active sessions within the last 60 s. This is maybe the single most important DB load metric and a good starting point for a drill-down.
  • Average Active Sessions per logical CPU (*100): This shows the average load the database imposes on each logical CPU (i.e. cores or hyperthreads). Values above 100 (more than 1 waiting DB session per CPU) indicate a higher demand for resources than the host can satisfy. This often marks the beginning of quickly rising response times.
  • Current OS Load: Host CPU load, when available.
  • DB Block Changes Per Sec: Database blocks changed in the buffer cache.
  • DB Block Changes Per Txn: Database blocks changed in the buffer cache per SQL transaction.
  • DB Block Gets Per Sec: Database blocks read from the buffer cache.
  • DB Block Gets Per Txn: Database blocks read from the buffer cache per SQL transaction.
  • Executions Per Sec: SQL executions/s
  • Executions Per Txn: SQL executions per SQL transaction.
  • I/O Megabytes per Second
  • Logical Reads Per Sec: Logical reads are comprised of database block reads from the buffer cache + physical reads from disk.
  • Logons Per Sec
  • Physical Reads Per Sec: Database blocks read from disk.
  • Physical Read Total Bytes Per Sec
  • Physical Write Total Bytes Per Sec
  • Txns Per Sec: Transactions per second.

Wait Class Breakdown

Shows average active sessions per each wait class. Typically, the top wait classes are “CPU” and “User I/O”. A shift to other wait classes is a good pointer for further investigation (e.g., of network latency issues).
Wait classes are documented in the Oracle Database Reference. See here: http://docs.oracle.com/cd/E11882_01/server.112/e17110/waitevents001.htm#BGGHJGII

Efficiency

  • Database CPU Time Ratio: Percentage of CPU time against all database time.
  • Database Wait Time Ratio: Complementary to “Database CPU Time Ratio” (percentage of non-CPU waits).
  • Memory Sorts Ratio: Percentage of sort operations that were done in RAM (as opposed to disk).
  • Execute Without Parse Ratio: Percentage of (soft and hard) parsed SQL against all executed SQL.
  • Soft Parse Ratio: Ratio of soft parses to hard parses.
  • Response Time Per Txn (ms)
  • SQL Service Response Time (ms)

Resource Utilization

  • # of logical CPUs: Observation for informational purpose. This count is used, among others, for the metric “Average Active Sessions per logical CPU”.
  • Total Sessions: Count of all database sessions at the time the snapshot was taken.
  • % of max sessions: Open sessions vs. DB parameter “sessions”.
  • % of max open cursors: Maximum count of open cursors in a session vs. DB parameter “open_cursors”.
  • Shared Pool Free %
  • Temp Space Used (MB): Amount of used temporary tablespace.
  • Total PGA Allocated (MB): Amount of RAM used for sorts, hashes and the like.

It would be great to get some feedback from you about their usefulness and whether other metrics should be included in a future version.
What’s your opinion? Got other questions? Feel free to comment!

Sample Code

Here’s some SQL I used to retrieve the metrics:

-- Active User Sessions
SELECT 'Active User Sessions', COUNT(*) FROM v$session WHERE status='ACTIVE' AND username IS NOT NULL;

-- Active Sessions per logical CPU (alert if >1)
SELECT a.cnt / b.cpus AS act_ses_per_cpu
FROM ( SELECT COUNT(*) cnt FROM v$session WHERE status='ACTIVE' AND username IS NOT NULL ) a
, ( SELECT value AS cpus FROM v$parameter WHERE name='cpu_count' ) b;

-- % of max open cursors
select a.crs / b.max_crs * 100 as pct_open_cur
FROM ( SELECT MAX(a.value) AS crs from v$sesstat a, v$statname b where a.statistic# = b.statistic#
AND b.name = 'opened cursors current' ) a
, ( select value AS max_crs FROM v$parameter WHERE name='open_cursors' ) b
;

-- % of max sessions
SELECT a.cnt / b.maxses * 100 AS pct_max_sessions
FROM ( SELECT COUNT(*) cnt FROM v$session ) a
, ( SELECT value AS maxses FROM v$parameter WHERE name='sessions' ) b;

-- Waits by Class
SELECT 'Waits by Class|'|| wait_class
, time_waited
FROM v$system_wait_class
WHERE wait_class != 'Idle'
ORDER BY time_waited DESC;

-- Wait Class Breakdown
SELECT 'Wait Class Breakdown|'||wait_class, ROUND( aas, 2 ) FROM(
SELECT n.wait_class, m.time_waited/m.INTSIZE_CSEC AAS
FROM v$waitclassmetric m
, v$system_wait_class n
WHERE m.wait_class_id=n.wait_class_id
AND n.wait_class != 'Idle'
UNION ALL
SELECT 'CPU', value/100 AAS
FROM v$sysmetric
WHERE metric_name = 'CPU Usage Per Sec'
AND group_id = 2
);

/* Wait Classes:
Wait Class Breakdown|Other
Wait Class Breakdown|Application
Wait Class Breakdown|Configuration
Wait Class Breakdown|Administrative
Wait Class Breakdown|Concurrency
Wait Class Breakdown|Commit
Wait Class Breakdown|Network
Wait Class Breakdown|User I/O
Wait Class Breakdown|System I/O
Wait Class Breakdown|Scheduler
Wait Class Breakdown|CPU
*/

-- v$sysmetric is not subject to diagnostics pack licensing
SELECT metric_name, value
FROM v$sysmetric
WHERE group_id = 2 -- 60 sec interval
AND metric_name IN (
'Average Active Sessions'
, 'Current OS Load'
, 'Database CPU Time Ratio'
, 'Database Wait Time Ratio'
, 'DB Block Changes Per Sec'
, 'DB Block Changes Per Txn'
, 'DB Block Gets Per Sec'
, 'DB Block Gets Per Txn'
, 'Executions Per Sec'
, 'Executions Per Txn'
, 'I/O Megabytes per Second'
, 'Logical Reads Per Sec'
, 'Memory Sorts Ratio'
, 'Physical Read Total Bytes Per Sec'
, 'Physical Write Total Bytes Per Sec'
, 'Shared Pool Free %'
, 'Execute Without Parse Ratio'
, 'Soft Parse Ratio'
, 'Temp Space Used'
, 'Total PGA Allocated'
, 'Response Time Per Txn'
, 'SQL Service Response Time'
)
ORDER BY metric_name;

-- Txns / s
SELECT 'Txns Per Sec', a.epx / b.ept
FROM (
SELECT value epx
FROM v$sysmetric
WHERE group_id = 2 -- 60 sec interval
AND metric_name = 'Executions Per Sec' ) a,
( SELECT value ept
FROM v$sysmetric
WHERE group_id = 2 -- 60 sec interval
AND metric_name = 'Executions Per Txn' ) b

--------------------------------------------------------------------------------
-- From ASH or AWR:

-- Top SQL by DB Time
select COUNT(1) as dbtime
, sql_id
from v$active_session_history
where session_type='FOREGROUND'
and sql_id IS NOT NULL
group by sql_id
order by 1 DESC;

/*
-- Alternative code for ratios
select max(a.value) / p.value * 100 as pct_open_cur
from v$sesstat a, v$statname b, v$parameter p
where a.statistic# = b.statistic#
and b.name = 'opened cursors current'
and p.name = 'open_cursors'
group by p.value;
*/

Weblinks

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s