Application Performance Monitoring: What Should We Be Monitoring? How Should We Be Monitoring?

What Should We Be Monitoring?

To get a grasp on your application's performance, start by mapping out every component in the path of your application. A sensible starting point is the server or servers where your application lives. What that looks like depends on the architecture hosting your application, and it could easily be a mixture of different architectures.

Let's say your application is a typical three-tier app that runs on several virtual servers on top of a hypervisor. You would want to start collecting logs and metrics, such as CPU, memory, network, and storage, from the hypervisor platform. You would also want to collect the same metrics from each virtual server. In addition to the virtual server metrics, your application's own logs and metrics are crucial to the overall picture. If your application does not provide them, you will need to either develop that capability or rely on existing tooling that you can easily integrate into your application. These metrics are crucial for identifying potential bottlenecks and failures and for judging the overall health of your application. Beyond what we have already identified, we also need to collect metrics from any load balancers, databases, and caching layers. Aggregating all these metrics for deep analysis and correlation gives us an overall view of how our application is stitched together and helps us pinpoint where a potential issue might arise.

How Should We Be Monitoring?

We have now identified a few of the things we should be monitoring. Next, we need to figure out how to monitor them, and how to make sure the data we collect is accurate and gives us valuable telemetry. Telemetry data comes from logs, metrics, and events.

Logging (Syslog)

First, let us begin with logging. We should have a centralized logging solution that can not only receive our log messages but also aggregate and correlate events. In addition, our logging solution should let us view graphs and customized dashboards and give us some level of alerting. If our logging solution meets these initial requirements, we are already off to a good start.

Now we need to begin configuring all our hypervisors, servers, load balancers, firewalls, caching layers, application servers, database servers, etc., to send their logs to that solution. There are many systems to collect logs from, but we need to make sure that everything directly in the path of our application is configured for logging. Remember that your application logs need to be collected as well. With all these components sending logging data, we should begin seeing events over time, painting a picture of what we might consider normal. But remember, this might be what we termed after-the-fact application monitoring.
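As a rough illustration of the application side of this, the sketch below forwards application log records to a central syslog collector using Python's standard logging module. The function name build_syslog_logger, the logger name, the facility, and the message format are all assumptions made for the example; the collector address would be whatever your logging solution listens on.

```python
import logging
import logging.handlers
import socket


def build_syslog_logger(name: str, host: str, port: int = 514) -> logging.Logger:
    """Return a logger that forwards records to a syslog collector over UDP.

    The helper name, facility, and format are illustrative choices,
    not requirements of any particular logging product.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
        socktype=socket.SOCK_DGRAM,  # plain UDP syslog
    )
    # Prefix each message with the application name and severity so the
    # collector can filter and correlate on them.
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

An application would call something like build_syslog_logger("orders-app", "logs.example.internal") once at startup (both names are hypothetical) and then log through the returned logger as usual. UDP is used here only because it is the simplest transport; a production setup may prefer TCP or TLS if the collector supports it.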

Metrics

There are numerous methods of obtaining metrics. One thing should be clear before we discuss them: we should avoid relying on SNMP polling data. Now don't get me wrong, SNMP polling data is better than nothing at all. However, there are much better sources of metric data.

Our performance metrics should be time series-based. Time series-based metrics are timestamped measurements that are streamed to our centralized metrics collection solution. With them, we can drill into a performance graph at a very fine level of detail.

Most time series-based metrics require an agent of some sort on the device we want to collect from. The agent is responsible for gathering and sending the metrics we are interested in. These include the ones mentioned before: CPU, memory, network, disk, etc. However, we also want metrics for our application stack, such as response latency, searches, database queries, user patterns, etc. We need all of these metrics to see what the environment looks like, including its health and performance.
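To make the idea of an agent more concrete, here is a minimal sketch of what one collection pass might look like, using only the Python standard library. The MetricPoint type, the metric names, and the text line format are invented for this example; a real agent would use whatever format your metrics collection solution expects and would send the points over the network on a short interval.

```python
import os
import shutil
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class MetricPoint:
    """One time series sample: a named value at a point in time."""
    name: str
    value: float
    timestamp: float
    tags: Dict[str, str] = field(default_factory=dict)


def collect_host_metrics(host: str, now: Optional[float] = None) -> List[MetricPoint]:
    """Take one sample of a few host-level metrics (hypothetical names)."""
    ts = time.time() if now is None else now
    tags = {"host": host}
    points: List[MetricPoint] = []

    usage = shutil.disk_usage(os.sep)
    if usage.total:
        used_pct = 100.0 * usage.used / usage.total
        points.append(MetricPoint("disk.used_percent", used_pct, ts, tags))

    try:
        # Load average is only available on Unix-like systems.
        load1, _, _ = os.getloadavg()
        points.append(MetricPoint("cpu.load_1m", load1, ts, tags))
    except (AttributeError, OSError):
        pass
    return points


def to_line(point: MetricPoint) -> str:
    """Serialize a point as 'name tags value timestamp' (an assumed format)."""
    tags = ",".join(f"{k}={v}" for k, v in sorted(point.tags.items()))
    return f"{point.name} {tags} {point.value} {int(point.timestamp)}"
```

Application metrics such as response latency fit the same shape: the application records a MetricPoint each time it finishes handling a request, and the agent ships those points alongside the host-level ones.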

With metrics and logging in place, we can begin correlating application performance with the events/logs to start understanding what the underlying issue might be when our application performance is degraded.
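One simple way to picture this correlation is to take each latency sample that crosses a threshold and look up the log events that occurred shortly before it. The sketch below does exactly that in memory. The types, the threshold, and the time window are assumptions chosen for illustration; a real monitoring platform would run this kind of query against its own stores.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LatencySample:
    timestamp: float   # seconds since the epoch
    latency_ms: float


@dataclass(frozen=True)
class LogEvent:
    timestamp: float   # seconds since the epoch
    source: str        # e.g. "db", "lb", "app" (hypothetical labels)
    message: str


def correlate(
    samples: List[LatencySample],
    events: List[LogEvent],
    threshold_ms: float = 500.0,
    window_s: float = 60.0,
) -> List[Tuple[LatencySample, List[LogEvent]]]:
    """Pair each slow sample with the log events from the preceding window."""
    results = []
    for sample in samples:
        if sample.latency_ms < threshold_ms:
            continue  # performance was acceptable, nothing to explain
        nearby = [
            e for e in events
            if sample.timestamp - window_s <= e.timestamp <= sample.timestamp
        ]
        results.append((sample, nearby))
    return results
```

For example, a latency spike at a given time paired with a database log entry thirty seconds earlier points the investigation at the database tier rather than the web tier.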

  • Ah...this brings up several points that many people don't think about.

    The more you monitor, the more resources it takes and the more impact it has on the environment you are monitoring, so your server statistics are now skewed.

    Secondly, just because you are monitoring something doesn't mean you have alerts set up for "everything".

    Third, you don't want to alert on everything.  That is a firehose nobody can drink from.

    It is our job to help the customer pick out the mission-critical (critical path) items that must be dealt with right away and then build up the others.

  • Good stuff in there.  Problem we have had here in the past is that they would install all manner of monitoring tools, do half a job configuring them and send out millions of useless alerts.  I killed off most of the tools and consolidated with SolarWinds.  I set up the monitoring correctly and removed the useless alerts.  We now have something useful but the business is still prejudiced against alerting.  Pity.  If they had read the alert yesterday they would have avoided a P1 incident.  Ho hum.

  • Some really good feedback here. Looking forward to more and also the upcoming posts.

  • Somehow I don’t get the “grip” on LEM. There are things that I think “should be easy to solve with LEM” where I find myself 8 hours later on the phone with support and they need to build a custom “ingest method” for my requirements.

    Don’t get me wrong, it is super that support will create those parsers/tools/ingest methods, but I would also like to build those myself from time to time.
