Most database performance monitoring tools do a great job at real-time monitoring – by that I mean alerting us when certain counter thresholds are reached, such as Page Life Expectancy dropping below 300 or Memory Pages/sec climbing too high. While this is definitely crucial to have set up in our environment, hard alerts pose a problem of their own: how do we know that a page life expectancy of 300 is actually a problem? Maybe it's normal for a certain period of time, such as month-end processing.
This is where the baseline comes into play. A baseline, by definition, is a starting point used for comparison. In the database performance world, it's a snapshot of how our databases and servers are performing at a given point in time when they are not experiencing any issues. We can then use these performance snapshots as a starting point when troubleshooting performance issues. For instance, consider a few of the following questions:
- Is my database running slower now than it was last week?
- Has my database been impacted by the latest disk failure and RAID rebuild?
- Has the new SAN migration impacted my database services in any way?
- Has the latest configuration change/application update impacted my servers in any way?
- How has the addition of 20 VMs into my environment impacted my database?
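The core idea of a baseline comparison can be sketched in a few lines. This is a minimal, hypothetical example: the counter names and sample values are made up for illustration, and in practice the samples would come from your monitoring tool (or, on SQL Server, from a DMV such as `sys.dm_os_performance_counters`).

```python
from statistics import mean

# Hypothetical baseline samples, captured during a known-good period.
baseline_samples = {
    "page_life_expectancy": [4200, 4100, 4350, 4300],
    "avg_read_latency_ms": [4.1, 3.8, 4.4, 4.0],
}

# Hypothetical samples from right now, while troubleshooting.
current_samples = {
    "page_life_expectancy": [900, 850, 1100, 950],
    "avg_read_latency_ms": [18.2, 21.5, 17.9, 19.3],
}

def compare_to_baseline(baseline, current, tolerance=0.25):
    """Flag counters whose current average deviates from the
    baseline average by more than the tolerance (25% here)."""
    deviations = {}
    for counter, samples in baseline.items():
        base_avg = mean(samples)
        curr_avg = mean(current[counter])
        change = (curr_avg - base_avg) / base_avg
        if abs(change) > tolerance:
            deviations[counter] = round(change, 2)
    return deviations

# PLE has dropped ~78% and read latency is up ~372% versus baseline,
# so both counters are flagged for investigation.
print(compare_to_baseline(baseline_samples, current_samples))
```

The point is not the arithmetic but the comparison itself: without the stored baseline, "PLE is 950" tells you nothing about whether that is unusual for this server.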
With established baselines we can quickly answer all of these questions by comparison. But let's take this a step further and use question 5 in the following scenario.
Jim is comparing how his database server is performing now, after adding 20 new VMs to his environment, against a baseline he took a few months back. He concludes, with the data to back him up, that his server is indeed running slower: he is seeing increased read/write latency and increased CPU usage. So is the blame really to be placed on the newly added VMs? Well, it depends – what if something else is going on that is causing the latency to increase? Say month-end processing and backups are running now and weren't during the snapshot of the older baseline.
We can quickly see that baselines, while important, are only as good as the time at which you take them. Comparing a period of increased activity to a baseline taken during a period of normal activity is not very useful at all.
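One way around Jim's problem is to keep separate baselines per workload period and always compare like with like. A minimal sketch, with made-up CPU% samples and assumed period labels ("normal", "month_end" – you might instead key by hour of day or a business-calendar tag):

```python
from statistics import mean, stdev

# Hypothetical history: CPU% samples bucketed by workload period.
history = {
    "normal":    [35, 40, 38, 42, 37, 41],
    "month_end": [78, 82, 75, 80, 85, 79],
}

def is_abnormal(current_value, period, history, threshold=2.0):
    """Compare a current reading against the baseline for the SAME
    workload period: abnormal if it falls more than `threshold`
    standard deviations from that period's mean."""
    samples = history[period]
    mu, sigma = mean(samples), stdev(samples)
    return abs(current_value - mu) > threshold * sigma

# 80% CPU looks alarming against the 'normal' baseline...
print(is_abnormal(80, "normal", history))     # True
# ...but is perfectly ordinary during month-end processing.
print(is_abnormal(80, "month_end", history))  # False
```

The same reading is an incident in one period and business as usual in another, which is exactly the trap Jim walked into with his months-old baseline.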
So this week I ask you to simply tell me how you tackle baselines.
- Do you take baselines at all? How many? How often?
- What counters/metrics do you collect?
- Do you baseline your applications during peak usage? Low usage? Month end?
- Do you rely solely on your monitoring solution for baselining? Does it show you trending over time?
- Can your monitoring solution tell you, based on previous data, what is normal for this period of time in your environment?
You don’t have to stick to these questions – let's just have a conversation about baselining!