5 Essential Components For Successful Monitoring Systems

On Day Zero of being a DBA I inherited a homegrown monitoring system. It didn't do much, but it did what was needed. Over time we modified it to suit our needs. Eventually we got to the point where we integrated with OpsMgr to automate the collection and deployment of monitoring alerts and code to our database servers. It was awesome.

The experience of building and maintaining my own homegrown system, combined with working for software vendors, has taught me that every successful monitoring platform needs five essential components: identify, collect, share, integrate, and govern. Let's break down what each of those means.

Identify

A necessary first step is to identify the data and metrics you want to monitor and alert upon. I would start this process by looking at a metric and putting it into one of two classes: informational or actionable. Informational metrics were the ones I wanted to track but didn't need to be alerted upon. Actionable metrics were the ones I needed to be alerted upon because I had to take some action in response. For more details on how to identify what metrics you want to collect, check out the Monitoring 101 guide, and look for the Monitoring 201 guide coming soon.
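Here is a minimal sketch of that two-class split in Python. The metric names, the threshold, and the `should_alert` helper are all hypothetical, made up for illustration; the point is only that the class of a metric, not just its value, decides whether anyone gets alerted.

```python
from dataclasses import dataclass
from enum import Enum


class MetricClass(Enum):
    INFORMATIONAL = "informational"  # tracked, but nobody is alerted
    ACTIONABLE = "actionable"        # someone has to act in response


@dataclass(frozen=True)
class Metric:
    name: str
    metric_class: MetricClass


def should_alert(metric: Metric, value: float, threshold: float) -> bool:
    """Alert only when an actionable metric reaches its threshold."""
    return metric.metric_class is MetricClass.ACTIONABLE and value >= threshold


# Hypothetical metrics, for illustration only.
page_life = Metric("page_life_expectancy", MetricClass.INFORMATIONAL)
disk_used = Metric("disk_used_percent", MetricClass.ACTIONABLE)
```

With this in place, an informational metric is still collected and stored, but it can never trigger an alert no matter what value it reports.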

Collect

After you identify the metrics you want, you need to decide how you want to collect and store them for later use. This is where flexibility becomes important. Your collection mechanism needs to be able to consume data in varying formats. If you build a system that relies on data being in a perfect state, you will find yourself frustrated the first time some imperfect data is loaded. You will also find yourself spending far too much time playing the role of data janitor.
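As a rough sketch of what a forgiving collector could look like, the Python below accepts a sample either as a JSON object or as a `name,value` line, and sets aside anything it cannot parse instead of failing the whole load. The two input formats, the field names `name` and `value`, and the functions `parse_sample` and `collect` are assumptions chosen for the example, not a description of any particular product.

```python
import json
from typing import Iterable, List, Optional, Tuple

Sample = Tuple[str, float]


def parse_sample(raw: str) -> Optional[Sample]:
    """Return (metric_name, value), or None if the line cannot be used."""
    raw = raw.strip()
    if not raw:
        return None
    try:
        if raw.startswith("{"):
            # JSON form: {"name": "...", "value": ...}
            record = json.loads(raw)
            return str(record["name"]), float(record["value"])
        # Plain-text form: name,value
        name, value = raw.split(",", 1)
        return name.strip(), float(value)
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing field, or a non-numeric value.
        return None


def collect(lines: Iterable[str]) -> Tuple[List[Sample], List[str]]:
    """Split incoming lines into usable samples and rejected raw lines."""
    good: List[Sample] = []
    rejected: List[str] = []
    for line in lines:
        sample = parse_sample(line)
        if sample is None:
            rejected.append(line)
        else:
            good.append(sample)
    return good, rejected
```

Keeping the rejected lines, rather than silently dropping them, means you can review the bad data later without it blocking the good data now.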

Share

Now that your data is being collected, you will want to share it with others, especially when you want to help provide some details about specific issues and incidents. As much as you might love the look of raw data and decimal points, chances are that other people will want to see something prettier. And there's a good chance they will want to be able to export data in a variety of formats, too. More than 80% of the time your end-users will be fine with the ability to export to CSV format.
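A CSV export does not need to be elaborate. The sketch below uses Python's standard `csv` module; the `to_csv` function and the three-column layout (timestamp, metric, value) are assumptions for illustration.

```python
import csv
import io
from typing import Iterable, Tuple


def to_csv(samples: Iterable[Tuple[str, str, float]]) -> str:
    """Render (timestamp, metric, value) rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["timestamp", "metric", "value"])
    writer.writerows(samples)
    return buffer.getvalue()
```

Letting the `csv` module do the writing, instead of joining strings by hand, takes care of quoting for any value that happens to contain a comma or a quote.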

Integrate

With your system humming along, collecting data, you are going to find that other groups will want that data. It could also be the case that you need to feed your data into other monitoring systems. Designing a system that can integrate well with other systems requires a lot of flexibility. It's best that you think about this now, before you build anything, as opposed to trying to fit a square peg into a round hole later. And it doesn't have to be perfect for every possible case. Just focus on the major integration method used the world over that I already mentioned: CSV.

Govern

This is the component that is overlooked most often. Once a system is up and running, very few people consider the task of data governance. It's important that you take the time to define what the metrics are and where they came from. Anyone consuming your data will need this information, as well. And if you change a collection, you need to communicate the changes and the possible impacts they may have for anyone downstream.
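One lightweight way to approach this is a small catalog entry per metric that records what it is, where it came from, and who consumes it. The Python below is only a sketch under those assumptions; the `MetricDefinition` fields and the `revise` helper are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    description: str                 # what the metric means
    source: str                      # where the value comes from
    unit: str
    version: int = 1
    consumers: Tuple[str, ...] = ()  # downstream groups to notify on change


def revise(definition: MetricDefinition, **changes) -> Tuple[MetricDefinition, List[str]]:
    """Return the next version of a definition and the consumers to notify."""
    updated = replace(definition, version=definition.version + 1, **changes)
    return updated, list(definition.consumers)
```

Because each definition is immutable, a change to a collection always produces a new version, and the list of consumers returned alongside it tells you who needs to hear about the change.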

When you put those five components together you have the foundation for a solid monitoring application. I'd even surmise that these five components would serve any application well, regardless of purpose.

Comment
  • To these I'd add:

    • Staff appropriately.  No one wants to be crushed by the load, and adding monitoring brings transparency to issues, causing increased demand for timely responses by staff.
    • Train your staff to the level needed to prevent problems, troubleshoot issues, and implement best practices. Budget for each person to be out of the office for two weeks each year, to help keep them focused on the training and to ensure they're not interrupted with drive-by issues. Plan on $6,500 per class per person (includes airfare, tuition, materials, rental car, hotel, food). Don't be cheap and try to get by with online classes that a person has to try to take at their desk, or at home. I've done both; the in-person classes are far superior for learning more about what's going on, plus you get people who network pre-class, at lunch, on breaks, and in the evenings. I've learned nearly as much from those out-of-class experiences as I have in-class.
    • Plan it all with an architect of systems, network, and security.  Then work your plan.

    You'll find success comes more easily this way. Not that it can't come other ways; some businesses can't afford to send people away from work for training--they're staffed too lean. That's a path to employee burn-out, dissatisfaction, and turnover.
