Application Performance Monitoring: What Should We Be Monitoring? How Should We Be Monitoring?

Level 10

What Should We Be Monitoring?

To effectively begin getting a grasp on your application's performance, you must begin mapping out all the components in the path of your application. It might be wise to begin with the server or servers where your application lives. Again, this could be any of several scenarios depending on the architecture housing your application, or it could easily be a mixture of different architectures.

Let's say your application is a typical three-tier app that runs on several virtual servers on top of some hypervisor. You would want to start collecting logs and metrics from the hypervisor platform, such as CPU, memory, network, and storage. You would also want to collect these metrics from the virtual servers themselves. In addition to the metrics collected from the virtual servers, your application's own logs and metrics are crucial to the overall picture as well. If, for some reason, your application does not provide these capabilities, you will need to either develop them or rely on some other existing tooling that you can easily integrate into your application. These metrics are crucial for identifying potential bottlenecks and failures and for gauging the overall health of your application. In addition to what we have already identified, we also need to collect metrics from any load balancers, databases, and caching layers. Obtaining all these metrics and aggregating them for deep analysis and correlation gives us the overall view into how our application is stitched together and assists us in pinpointing where a potential issue might arise.
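As a rough sketch, that "mapping out" step can start as nothing more than an inventory of the layers in the application's path and what you expect to collect from each. The layer names and metric names below are purely illustrative, not a prescribed list:

```python
# Hypothetical inventory of everything in the application's path and the
# telemetry we want from each layer -- names are illustrative only.
APPLICATION_PATH = {
    "hypervisor":     {"metrics": ["cpu", "memory", "network", "storage"], "logs": True},
    "virtual_server": {"metrics": ["cpu", "memory", "network", "disk"],    "logs": True},
    "load_balancer":  {"metrics": ["connections", "latency"],              "logs": True},
    "application":    {"metrics": ["response_latency", "error_rate"],      "logs": True},
    "database":       {"metrics": ["query_time", "connections"],           "logs": True},
    "cache":          {"metrics": ["hit_ratio", "evictions"],              "logs": False},
}

def unmonitored_layers(inventory):
    """Return layers that still have no metrics or no logs configured."""
    return [name for name, cfg in inventory.items()
            if not cfg["metrics"] or not cfg["logs"]]

if __name__ == "__main__":
    print(unmonitored_layers(APPLICATION_PATH))  # -> ['cache'] in this example
```

Even a simple map like this makes it obvious which layers in the path have no telemetry yet.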

How Should We Be Monitoring?

We have now identified a few of the things we should be monitoring. What we need to figure out next is how we will begin monitoring these things and ensure that they are accurate and provide us with valuable telemetry data. Telemetry data comes from logs, metrics, and events.

Logging (Syslog)

First, let us begin with logging. We should have a centralized logging solution that can not only receive our log messages but also aggregate and correlate events. In addition, our logging solution should let us build graphs and customized dashboards, and provide some level of alerting capability. If we have these initial requirements available to us from our logging solution, we are already off to a good beginning.

Now we need to begin configuring all our hypervisors, servers, load balancers, firewalls, caching layers, application servers, database servers, etc. There are many, many systems to collect logs from, but we need to make sure everything directly in the path of our application is configured for logging. Also remember that your application logs need to be collected as well. With all these different components configured to send logging data, we should begin seeing events over time, painting a picture of what we might determine as normal. But remember, this might be what we termed after-the-fact application monitoring.
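For the application logs specifically, most languages can forward to a central syslog collector with very little code. Here is a minimal Python sketch; the collector hostname, port, and app name are placeholders for whatever your own logging solution actually listens on:

```python
import logging
import logging.handlers

# Forward application logs to a central syslog collector.
# "logs.example.com" and port 514 are placeholders, not a real endpoint.
syslog = logging.handlers.SysLogHandler(address=("logs.example.com", 514))
syslog.setFormatter(logging.Formatter("myapp: %(levelname)s %(message)s"))

log = logging.getLogger("myapp")
log.setLevel(logging.INFO)
log.addHandler(syslog)

log.info("checkout completed")                      # normal event
log.error("database connection pool exhausted")     # something worth alerting on
```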

Metrics

There are numerous different methods of obtaining metrics. We should be clear about one thing as we begin discussing these methods: we should not rely on SNMP polling data. Now don't get me wrong--SNMP polling data is better than nothing at all. However, there are much better sources of metric data.

Our performance metrics should be time series-based. Time series-based metrics are streamed to our centralized metrics collection solution. With time series-based metrics we can drill into a performance graph at a very fine level of detail.
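A time series sample is really just a timestamp, a metric name, a value, and a few identifying tags. The sketch below shows one hypothetical shape for such a sample; printing it as a JSON line stands in for streaming it to the collector:

```python
import json
import time

def metric_point(name, value, **tags):
    """One time series sample: timestamp, metric name, value, and tags."""
    return {"ts": time.time(), "metric": name, "value": value, "tags": tags}

# A fine-grained stream of samples like this is what lets us drill into a
# performance graph at a very detailed level.
print(json.dumps(metric_point("app.response_latency_ms", 42.7,
                              host="web01", tier="application")))
```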

Most time series-based metrics require an agent of some sort on the device we want to collect metrics from, and that agent is responsible for providing the metrics we are interested in. The metrics we are interested in include those mentioned before: CPU, memory, network, disk, etc. However, we are interested in obtaining metrics for our application stack as well, such as response latency, searches, database queries, user patterns, etc. We need all of these metrics to see what the environment looks like in terms of health and performance.
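As an illustration of what such an agent might do for the host-level metrics, here is a minimal Python sketch using the third-party psutil library (an assumption, not a requirement; any agent that can read CPU, memory, disk, and network counters will do). The host name, the collection interval, and printing instead of shipping are all placeholders:

```python
import json
import time

import psutil  # third-party library, assumed installed (pip install psutil)

def collect_host_metrics():
    """Gather the basic host metrics mentioned above: CPU, memory, disk, network."""
    net = psutil.net_io_counters()
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
        "net_bytes_sent": net.bytes_sent,
        "net_bytes_recv": net.bytes_recv,
    }

if __name__ == "__main__":
    while True:
        sample = {"ts": time.time(), "host": "web01", **collect_host_metrics()}
        # A real agent would stream this sample to the central collector;
        # printing a JSON line stands in for that here.
        print(json.dumps(sample))
        time.sleep(10)
```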

With metrics and logging in place, we can begin correlating application performance with the events/logs to start understanding what the underlying issue might be when our application performance is degraded.
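As a toy example of that correlation, the sketch below takes a handful of made-up latency samples and log events and pulls out the events that fall within a minute of any latency spike; the threshold, the window, and the data are all illustrative:

```python
from datetime import datetime, timedelta

# Made-up data: (timestamp, response latency in ms) and (timestamp, log message).
latency_samples = [
    (datetime(2023, 1, 1, 2, 30, 0), 120),
    (datetime(2023, 1, 1, 2, 31, 0), 2400),   # degraded
]
log_events = [
    (datetime(2023, 1, 1, 2, 30, 55), "db01 connection pool exhausted"),
    (datetime(2023, 1, 1, 2, 31, 10), "lb01 upstream timeout"),
]

THRESHOLD_MS = 1000          # what we consider a latency spike
WINDOW = timedelta(seconds=60)

for ts, latency in latency_samples:
    if latency < THRESHOLD_MS:
        continue
    nearby = [msg for ev_ts, msg in log_events if abs(ev_ts - ts) <= WINDOW]
    print(f"{ts}: latency {latency} ms, possibly related events: {nearby}")
```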

12 Comments
Level 14

Thanks for the article.

Level 13

Thanks - Good article

Level 10

Here is the link to the first article in this series: Application Performance Monitoring: What is it All About? How Do We Get Started?

A reaction might be "We should be monitoring everything!"  But that consumes a lot of resources, fills logs, causes alert fatigue, etc.

But when we don't monitor "everything", we may miss important clues about attacks or trends that may indicate imminent failures or crashes or reboots.

So one solution may be monitoring everything and sending that information to a tool that can interpret it, deduplicate information as needed, apply intelligent pattern recognition, and present choices or recommendations about what's going on and what you should do about it.  A powerful SIEM can do that.  But moving to an off-box / non-SolarWinds product for this pattern recognition and interpretation means losing efficiency--you're no longer working with a single interface, that "single pane of glass", or contracting with a single product for support.

I don't see the right solution for the size of network I monitor.  Yet.

Level 9

Deja Vu....

After many years of doing this I have a different perspective.  I have seen the answer to "what do you want to monitor/alert on?" evolve from EVERYTHING (so you want me to page you at 2:30 in the morning on EVERYTHING?) to letting the Subject Matter Expert decide.

As I see it, SolarWinds is already capturing Average Response Times, Packet Loss, Network Latency, CPU Load, CPU Load per Processor, CPU Capacity Forecast, Memory Utilization, Min/Max/Average Memory Usage, Storage Allocated and Available, Current Hardware Health, and Asset Information, and aggregating all this data (pretty much out of the box).  So we have the metrics; now the question is, what do we alert on?

After three tours with IBM, implementation of the 2nd largest SCOM implementation at JPMorgan Chase (monitoring 45k Windows Servers), on and on... to me it comes down to two things: first, the SME decides... and when they ask me what I suggest, I simply say two things: first, monitoring/alerting is evolutionary; second, start with a simple question: what parts of an application, server, or system can cause your system to quit providing service to its users?  This is what we want to monitor and alert on initially.  Afterwards, if something breaks with the system, add a monitor/alert to that specific component where one does not already exist.  Reactive to Proactive.

Don't get me wrong, I am not the smart one here, but I have worked with some brilliant folks along the way.  And this is the mantra I promote from the experience I have gained along the way.

To me in the simplest of terms, it's all about the service the "said" system is providing to the users, focus your monitoring and alerting efforts on these components first (that can break and cause the service to be disrupted), then grow it from there.

this is of course just my humble opinion, happy monitoring 🙂

Level 20

I'm learning to cover many of these things with LEM now.  LEM is a pretty neat tool for the money when it comes to digesting logs from various sources.

Level 9

Thanks, good article.

MVP

Good article

Somehow I don't get the "grip" on LEM. There are things that I think "should be easy to solve with LEM" where I find myself 8 hours later on the phone with support and they need to build a custom "ingest method" for my requirements.

Don't get me wrong, it is super that support will create those parsers/tools/ingest methods, but I would also like to build those myself from time to time.

Level 10

Some really good feedback here. Looking forward to more and also the upcoming posts.

Level 14

Good stuff in there.  The problem we have had here in the past is that they would install all manner of monitoring tools, do half a job configuring them, and send out millions of useless alerts.  I killed off most of the tools and consolidated with SolarWinds.  I set up the monitoring correctly and removed the useless alerts.  We now have something useful but the business is still prejudiced against alerting.  Pity.  If they had read the alert yesterday they would have avoided a P1 incident.  Ho hum.

MVP

Ah...this brings up several points that many people don't think about.

The more you monitor, the more resources it takes and the more impact it has on the environment you are monitoring; thus your server statistics end up skewed.

Secondly, just because you are monitoring something doesn't mean you have alerts set up for "everything".

Third, you don't want to alert on everything.  That is a firehose nobody can drink from.

It is our job to help the customer in picking out the mission critical (critical path) items that must be dealt with right away and then build up the others.