You sit down at your desk. It's 9:10AM and your coffee is still warm. There is a smell of bacon in the air.

 

Suddenly your phone rings. The trading system is down. The time for quick thinking is now.

 

Where would you begin to troubleshoot this scenario?

 

A lot of people will reach for the biggest hammer they can find: a tool that will trace all activity as it hits the database instance. For SQL Server, that tool is typically SQL Profiler.

 

The trouble here is this: you are in a reactive mode right now. You have no idea as to the root cause of the issue. Therefore, you will configure your trace to capture as many details as possible. This is your reaction to make certain that when the time comes you are prepared to do as thorough a forensics job as possible in the hope that you can fix the issue in the shortest amount of time.

 

And this method of performance monitoring and troubleshooting is the least efficient way to get the job done.

 

When it comes to performance monitoring and troubleshooting you have two options: tracing or polling.

 

Tracing will track details and capture events as they happen. In an ironic twist this method can interfere with the performance of the queries you are trying to measure. Examples of tools that use the tracing method for SQL Server are Extended Events and SQL Profiler.

 

Polling, however, is also known by another name: sampling. A tool that utilizes polling will gather performance data at regular intervals. This is considered a light-weight option to tracing. Examples of tools that use this method are Performance Monitor (by default it samples once per second) and 3rd party applications like Database Performance Analyzer that query dynamic management objects (which are system views known as DMVs in SQL Server, and x$ and v$ objects in Oracle).

 

See, here's the secret about performance monitoring and troubleshooting that most people don't understand: when it comes to gathering performance metrics it's not the what you gather as much as it is the how you gather.


Knowing what to measure is an easy task. It really is. You can find lots of information on the series of tubes known as the internet that will list out all the metrics an administrator would want. Database size, free disk space, CPU utilization, page life expectancy, buffer cache hit ratio, etc. The list of available metrics seems endless and often overwhelming. Some of that information is even useful; a lot of it can be just noise, depending on the problem you are trying to solve.


So which method is right for you?


Both, actually.


Let me explain.


Think of a surgeon that needs to operate on a patient. There's a good chance that before the surgeon cuts into healthy skin they will take an X-ray of the area. Once they examine the X-ray, they know more about what they need to do when they operate.

Polling tools are similar to X-rays. They help you understand more about what areas you need to investigate further.  Then, when you need to take that deeper dive, that's where you are likely to use a tracing tool in order to return only the necessary information needed to solve the problem, and only for the shortest possible duration.

 

I find that many junior administrators (and developers with novice database performance troubleshooting skills) tend to rely on tracing tools for even the most routine tasks that can be done with a few simple queries against a DMV or two. I do my best to educate when I can, but it often is an uphill battle. I lost track of the number of times I've been thrown under the bus by someone saying that they can't fix an issue because I won't let them run Profiler against a production box as a first attempt at figuring out what’s going on. Rather than make people choose between one tool or the other I do my best to explain how they work well together.

 

I never use a tracing tool as a first option for performance monitoring and troubleshooting. I rely on polling in order to help me understand where to go next. Sometimes that next step requires a trace but often times I'm able to help make positive performance improvements without ever needing to run a trace. Then again, I'm lucky that I have some really good tools to use for monitoring database servers, even ones that are running on VMWare, or Amazon RDS, or Microsoft Azure.

 

There's a lot for anyone to learn as an administrator, and it can be overwhelming for anyone, new or experienced.

 

So it wouldn't hurt to double check how you are currently monitoring right now, to make certain you are grabbing the right things and at the right frequency.