As network traffic, connected devices, and data volumes grow every year, it becomes more and more difficult to keep track of everything. This is a key reason why AIOps is steadily getting a bigger spotlight across all levels of the IT world.
In this article, we’ll look closely at anomaly detection. What is anomaly detection? Why do we want/need it? What approaches can we leverage? And of course, as the title suggests, we’re going to talk about common pitfalls we encounter during development.
Metrics in IT systems are typically monitored in the form of time series. A time series is simply data indexed by time.
An anomaly is an outlying data point or one that doesn’t belong. This can happen in two ways:
- Point anomalies—points falling within low-density value regions
- Contextual anomalies—points falling outside of low-density value regions but anomalous regarding local values
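To make the distinction concrete, here is a minimal sketch in plain Python. The series, the window size, and both limits are made-up values for illustration, and the two helper functions are hypothetical names, not part of any product or library. A global z-score check stands in for "low-density value regions," and a comparison against the median of neighboring points stands in for "local values."

```python
from statistics import mean, median, stdev

# Hypothetical metric with a repeating cycle and two injected anomalies.
series = [10, 12, 15, 20, 26, 30, 26, 20, 15, 12,
          10, 12, 15, 20, 26, 30, 26, 20, 15, 12]
series[7] = 95   # point anomaly: a value the series never takes otherwise
series[15] = 11  # contextual anomaly: a common value, but not where a peak is expected

def global_outliers(values, z_limit=3.0):
    """Flag points far from the global mean (candidates for point anomalies)."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > z_limit * sigma]

def local_outliers(values, window=2, limit=10.0):
    """Flag points that deviate from the median of their immediate neighbors."""
    flagged = []
    for i in range(window, len(values) - window):
        neighbors = values[i - window:i] + values[i + 1:i + window + 1]
        if abs(values[i] - median(neighbors)) > limit:
            flagged.append(i)
    return flagged

point_like = global_outliers(series)                                    # [7]
context_like = [i for i in local_outliers(series) if i not in point_like]  # [15]
```

The global check only finds index 7. Index 15 holds a value of 11, which is perfectly ordinary for this series overall, so it only stands out once we compare it with its neighbors.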
Anomaly detection in a time series means automatically identifying these points in your data. This can be pretty handy if you oversee a huge network, host several websites, or offer many online services.
When the worst-case scenario happens and your website goes down, the impact is expensive. You spend time tracking down the problem and can lose revenue on sales. But with anomaly detection, it’s possible to prevent this. If an anomaly detector catches unusually high traffic on your website in time (maybe thanks to a celebrity tweeting about your small, unknown shop), your ITOps team can adjust accordingly and keep the page from going down completely.
Next, we’ll talk about approaches to anomaly detection—but don’t worry, I’ll keep this part as simple as possible and provide examples.
1. Static thresholds—this is a pretty old approach, but this doesn’t mean it’s useless. The problem with this approach is you need to set up the threshold, which can be different for every device and metric in your system.
2. Statistical anomaly detection—this approach leverages statistical methods to set up the thresholds. This allows you to overcome the variety in devices and metrics, and with statistical anomaly detection, you can change the threshold according to the development of the time series.
3. Artificial intelligence (AI)-/machine learning (ML)-/deep learning (DL)-based anomaly detection—these approaches are becoming more popular thanks to their flexibility and the huge amount of data we have at our disposal. In fact, watch this SolarWinds Lab to see how SolarWinds Database Performance Analyzer (DPA) is built to perform anomaly detection powered by machine learning.
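The difference between the first two approaches in the list can be shown with a small sketch. Everything here is an assumption made for illustration: the two CPU series, the fixed limit of 75, and the rolling rule of "mean plus three standard deviations over the previous five samples" are just one simple way to derive a threshold statistically.

```python
from statistics import mean, stdev

def static_threshold(values, limit):
    """Flag every point above a fixed, hand-picked limit."""
    return [i for i, v in enumerate(values) if v > limit]

def rolling_threshold(values, window=5, k=3.0):
    """Flag points more than k standard deviations above the mean of the previous `window` points."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        limit = mean(history) + k * stdev(history)
        if values[i] > limit:
            flagged.append(i)
    return flagged

# Two hypothetical devices with very different baselines.
cpu_small = [20, 22, 21, 19, 20, 21, 60, 20, 21, 22]  # spike at index 6
cpu_large = [80, 82, 81, 79, 80, 81, 80, 82, 81, 80]  # busy but steady

static_small = static_threshold(cpu_small, 75)   # [] -> the spike to 60 is missed
static_large = static_threshold(cpu_large, 75)   # every index -> constant alerting
rolling_small = rolling_threshold(cpu_small)     # [6]
rolling_large = rolling_threshold(cpu_large)     # []
```

A single static limit is wrong for both devices at once, while the rolling rule adapts to each baseline. Note that a perfectly flat history gives a standard deviation of zero, so a real detector would need a minimum band width.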
We can further divide AI-/ML-/DL-based anomaly detection based on its internal logic:
- Classification or clustering: the result is a direct statement telling you whether you’re dealing with an anomaly. Classification learns this from labeled examples, while clustering groups similar data without labels and treats whatever fits no group as anomalous.
- Forecasting: based on historical data, we predict what we think the next set of data will look like. Then we compare the prediction with the raw data we measured and, based on some thresholding, decide whether it’s an anomaly. Recurrent neural networks (RNNs) are a typical choice here.
- Reconstruction-based: based on historical data, we learn how to generate new signals without any anomalies and compare them with the ones we were able to measure.
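The forecasting idea can be sketched without any neural network. In the example below, a seasonal naive forecast (repeat the value from one period ago) stands in for the RNN purely to keep the code short. The metric, the period of four samples, and the tolerance are hypothetical values.

```python
def seasonal_naive_forecast(history, season):
    """Predict the next value as the value observed one season ago."""
    return history[-season]

def detect_by_forecast(values, season, tolerance):
    """Flag points whose measured value differs from the forecast by more than `tolerance`."""
    flagged = []
    for i in range(season, len(values)):
        predicted = seasonal_naive_forecast(values[:i], season)
        if abs(values[i] - predicted) > tolerance:
            flagged.append(i)
    return flagged

# Hypothetical metric with a period of 4 samples and one injected anomaly at index 9.
metric = [10, 20, 30, 20, 11, 21, 29, 20, 10, 45, 30, 21]

anomalies = detect_by_forecast(metric, season=4, tolerance=5)  # [9]
```

The structure stays the same if you swap in a stronger model: predict, compare with the measurement, threshold the difference. One caveat of this naive stand-in is that an anomalous value becomes the forecast one season later, so a longer series would produce an echo alert unless flagged points are excluded from the history.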
This is the part I looked forward to the most because this topic isn’t stressed enough. As a team developing anomaly detectors, we’ve gone through numerous articles and whitepapers to look for inspiration or an ultimate solution. And every time we read one, we thought, “This is the one!” But, more often than not, this wasn’t the case.
Results in whitepapers and research aren’t always precise. Sometimes, this is because they average the results using multiple data sets or only showcase a data set the solution works perfectly on.
Other times, it can be due to different evaluation approaches, as there’s no unified approach for anomaly detection evaluation. Some papers develop their own evaluation method, and others simply treat everything detected in one of the first steps as a correct detection and use it as such in the next steps.
In these cases, it’s always good to know what you’re trying to achieve and choose an approach based on real performance and not just on some evaluation results.
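To see how much the evaluation method alone can move the numbers, consider this small sketch. It scores the same detections twice: once requiring an exact match with the label, and once accepting a detection within a couple of samples of a label. The label positions, detections, and the `tolerance` parameter are all made up for illustration.

```python
def precision_recall(true_idx, detected_idx, tolerance=0):
    """Score detections, counting one as correct if it lies within `tolerance` samples of a label."""
    true_idx, detected_idx = set(true_idx), set(detected_idx)
    hits = {d for d in detected_idx if any(abs(d - t) <= tolerance for t in true_idx)}
    found = {t for t in true_idx if any(abs(d - t) <= tolerance for d in detected_idx)}
    precision = len(hits) / len(detected_idx) if detected_idx else 0.0
    recall = len(found) / len(true_idx) if true_idx else 0.0
    return precision, recall

labels = [50, 120, 300]           # hypothetical labeled anomaly positions
detections = [52, 120, 200, 301]  # hypothetical detector output

strict = precision_recall(labels, detections)                # (0.25, 0.333...)
lenient = precision_recall(labels, detections, tolerance=2)  # (0.75, 1.0)
```

The detector didn’t change, yet precision goes from 25% to 75% and recall from 33% to 100%. When a paper reports a score, it’s worth checking which of these conventions (or some other one) is behind it.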
Labeled data is very scarce. There are a few repositories containing time series data with labels, but these labels aren’t always precise, so you can’t take them as a Holy Grail and use them without a second thought. They’re good to give you a sense of direction, but they shouldn’t be your final destination.
Another common problem with the data is it doesn’t come with metadata (explanation, description, or reasoning). Sometimes, having this information can be helpful—what may be an anomaly for one customer may be normal behavior for another.
I came across a few use cases where we didn’t know anything about the data and said it was definitely anomalous. But the feedback from the customer was, “No, it’s fine. We don’t want this reported.” This is the perfect case for using metadata to make decisions, as the last thing you want is to spam your customer with alerts. Alert overload can lead to the customer ignoring them, and this can take an ugly turn.
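One simple way to put such metadata to work is to filter detections against windows the customer has declared as expected. The sketch below is hypothetical from top to bottom: the `MetricMetadata` class, its fields, and the backup-job example are assumptions I made up to illustrate the idea, not an existing schema.

```python
from dataclasses import dataclass, field

@dataclass
class MetricMetadata:
    """Hypothetical per-metric context supplied by the customer."""
    description: str = ""
    # (start, end) sample ranges the customer considers normal behavior
    expected_events: list = field(default_factory=list)

def filter_alerts(anomaly_idx, metadata):
    """Drop detections that fall inside a window the customer declared as expected."""
    def is_expected(i):
        return any(start <= i <= end for start, end in metadata.expected_events)
    return [i for i in anomaly_idx if not is_expected(i)]

backup_job = MetricMetadata(
    description="disk I/O; nightly backup runs during samples 100-130",
    expected_events=[(100, 130)],
)

raw_detections = [42, 105, 128, 300]                  # hypothetical detector output
to_report = filter_alerts(raw_detections, backup_job)  # [42, 300]
```

The detector still sees the backup spike as unusual, but the customer never gets an alert for something they already know about.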
My personal preference would be something like an LSTM classifier. But for this approach, you need a lot of data with labels. You also need periodic feedback collection from customers, who go through the system and manually label what was actually suspicious so you can continuously improve the model.
But calling it the best approach would be a bold statement. Every customer, and what works for them, is different. For some, the Holy Grail can be hidden in something “simple” like reducing false positives and the overall alert count. Others can be more demanding and request 100% recall (no missed anomalies) at the cost of numerous alerts.
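The tension between few alerts and few missed anomalies usually comes down to where you put the decision threshold. Here is a minimal sketch of that trade-off. The anomaly scores, the confirmed anomalies, and both thresholds are invented numbers, and `summarize` is a hypothetical helper.

```python
def summarize(scores, true_idx, threshold):
    """Return (alert count, precision, recall) for a given score threshold."""
    alerts = [i for i, s in enumerate(scores) if s >= threshold]
    true_pos = len(set(alerts) & set(true_idx))
    precision = true_pos / len(alerts) if alerts else 1.0
    recall = true_pos / len(true_idx)
    return len(alerts), precision, recall

scores = [0.1, 0.9, 0.4, 0.2, 0.6, 0.3, 0.8, 0.5]  # hypothetical anomaly scores per sample
actual = [1, 4, 6]                                  # hypothetical confirmed anomalies

quiet = summarize(scores, actual, threshold=0.7)     # (2, 1.0, 0.666...)
thorough = summarize(scores, actual, threshold=0.4)  # (5, 0.6, 1.0)
```

With the high threshold, every alert is real but one anomaly is missed. With the low threshold, nothing is missed, but two of the five alerts are false alarms. Neither setting is "correct"; it depends on which customer you’re talking to.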
It would be great if your service/detector could satisfy all their needs. Sometimes, this comes at the price of making the system configurable. But if you do expose configuration, keep it as simple as possible so the customer can interact painlessly with your detectors.
Do you monitor for anomalies/alerts in your systems? If you do, are the reported anomalies/alerts triggering actions on your end? If so, what kinds of actions?
Learn more about how SolarWinds DPA can help you improve performance through database anomaly detection powered by machine learning.