Cross Referencing Alarms: Network Security Monitoring and Network Outage Notifications


As anyone who has run a network of any size has surely experienced, behind one alert there is typically (but not always) a deeper issue that may or may not generate further alarms. An often overlooked correlation is the one between a security event caught by a Network Security Monitor (NSM) and an alarm from a network or service monitoring system. In certain cases, a security event will be noticed first as a network or service alarm. Due to the nature of many modern attacks, whether volumetric or targeted, the goal is typically to take a given resource offline. This “offlining,” usually labeled a denial of service (DoS), has a goal that is very easy to understand: make the service go away.

To understand how and why correlating these events is important, we need to understand what they are. In this case, the focus will be on two different attack types and how they manifest. There are significant similarities between them, but knowing the difference can save time and effort during triage, whether you're dealing with large or small outages and events.

Keeping that in mind, the obvious goals are:

1. Rooting out the cause

2. Mitigating the outage

3. Understanding the event to prevent future problems

For the purposes of this post, the two most common issues will be described: volumetric and targeted attacks.

Volumetric

This one is the most common, and it gets the most press due to the sheer scope of things it can damage. At a high level, it is just a traffic flood. It’s typically nothing fancy, just a lot of traffic generated in one way or another (there are myriad mechanisms for creating unwanted traffic flows), typically coming either from compromised hosts or from misconfigured services such as open SNMP agents, open DNS recursion, and other commonly abused protocols. The destination, or target, of the traffic is a host or set of hosts that the offender wants to knock offline.
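
As a minimal sketch of what spotting that reflected traffic in flow data might look like (the flow-record layout, the port list, and the byte cutoff are all assumptions here, not any particular collector's API):

```python
# Flag flow records that look like reflection/amplification traffic by
# matching the well-known UDP source ports of commonly abused services.
REFLECTOR_PORTS = {53: "DNS", 123: "NTP", 161: "SNMP", 1900: "SSDP"}

def likely_reflection_flows(flows, min_bytes=1_000_000):
    """Return (service, src, dst, bytes) for large flows sourced from
    ports commonly abused for amplification. `flows` is assumed to be a
    list of dicts with proto, src, dst, src_port, and bytes keys."""
    hits = []
    for f in flows:
        if (f["proto"] == "udp"
                and f["src_port"] in REFLECTOR_PORTS
                and f["bytes"] >= min_bytes):
            hits.append((REFLECTOR_PORTS[f["src_port"]], f["src"], f["dst"], f["bytes"]))
    return hits
```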

Targeted

This one is far stealthier. It’s more of a scalpel, where a volumetric flood is a machete. A targeted attack is typically used to gain access to a specific service or host; it is less like a flood and more like a specific exploit pointed at a service. This type of attack usually has a different goal: gaining access for reconnaissance and information gathering. Sometimes a volumetric attack is simply a smoke screen for a targeted attack, which can be very difficult to root out.

Given what we know about these two kinds of attacks, how can we utilize all of our data to better (and more quickly) triage the damage? Easily, actually. Recognizing a volumetric attack is fairly straightforward: traffic spikes, flatlined circuits, service degradation. In a strictly network engineering world, the problem would manifest and, best case, NetFlow data would be consulted in short order. Even with that particular dataset, it may not be obvious to a network engineer that an attack is occurring; it may just appear as a large amount of UDP traffic from different sources. Given enough traffic, a single 10G-connected host can drown out a 1G-connected site. This type of attack can also manifest in odd ways, especially if there are link speed differences along a given path, so it can look like a link failure or a latency issue when in reality it is a volumetric attack. However, if you are also running a tuned and maintained NSM of some kind, the traffic should be readily identified as a flood, and the traffic pattern can be filtered more quickly, either at the site or by the upstream ISP.
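
A rough sketch of that cross-check, assuming flow records are available as simple dicts and that the utilization alarm names the destination host (the thresholds are placeholders, not recommendations):

```python
def udp_flood_toward(flows, alarm_dst, min_gbits=1.0, min_sources=50):
    """Decide whether traffic aimed at `alarm_dst` (the host named in a
    link or utilization alarm) looks like a many-source UDP flood."""
    total_bits = 0
    sources = set()
    for f in flows:
        if f["proto"] == "udp" and f["dst"] == alarm_dst:
            total_bits += f["bytes"] * 8
            sources.add(f["src"])
    return total_bits >= min_gbits * 1e9 and len(sources) >= min_sources
```

If that comes back true for the host behind a "flatlined circuit" or latency alarm, the ticket can be handed to the security side (or to the upstream ISP for filtering) instead of being chased as a circuit fault.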

Targeted attacks will look very different, especially when performed on their own. This is where an NSM is critical. With attempts to compromise actual infrastructure hardware like routers and switches on a significant uptick, knowing the typical traffic patterns for your network is key. If a piece of your critical infrastructure is targeted, and it is inside of your security perimeter, your NSM should catch that and alert you to it. This is especially important in the case of your security equipment. Having your network tapped in front of the filtering device can greatly aid in seeing traffic destined for your actual perimeter. Given that there are documented cases of firewalls being compromised, this is a real threat. If and when it occurs, it may appear as high load on a device, an increase in memory allocation, or perhaps a set of traffic spikes, most of which a network engineer will not be concerned with as long as service is not affected. However, understanding the traffic patterns that led to those symptoms could help uncover a far less pleasant cause.
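
A minimal sketch of that kind of check, with placeholder management subnets and an assumed flow-record layout standing in for whatever your NSM actually exports:

```python
from ipaddress import ip_address, ip_network

# Placeholder ranges: substitute your router loopbacks, switch management
# VLANs, and firewall management interfaces.
INFRA_NETS = [ip_network("192.0.2.0/27"), ip_network("198.51.100.0/28")]

def hits_on_infrastructure(flows, trusted_sources=()):
    """Return flows from untrusted sources whose destination is an
    infrastructure/management address that should never see unsolicited
    traffic."""
    trusted = {ip_address(s) for s in trusted_sources}
    alerts = []
    for f in flows:
        src, dst = ip_address(f["src"]), ip_address(f["dst"])
        if src not in trusted and any(dst in net for net in INFRA_NETS):
            alerts.append(f)
    return alerts
```

Pairing those hits with the device-side symptoms mentioned above (CPU load, memory allocation jumps, odd traffic spikes) is exactly the kind of cross-reference that separates "busy box" from "box under attack."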

Most of these occurrences are somewhat rare; nevertheless, it is a very good habit to check all data sources when something outside of the baseline occurs on a given network. Perhaps more importantly, there is no substitute for good collaboration. Having a strong, positive, and ongoing working relationship between security professionals and network engineers is a key element in making any of these occurrences less painful. In many small- and medium-sized environments, these people are one and the same. But when they aren’t, collaboration at a professional level is as important and useful as the cross-referencing of data sources.

12 Comments
vinay.by
Level 16

Good article

david.botfield
Level 13

Nice write up

jkump
Level 15

Nice article.  Helpful!  Thanks!

smttysmth02gt
Level 13

Thanks for the write up. I think this is the case in most orgs...I think that Engineering, Security, and Network teams should all be in sync, but that's a difficult task, as all 3 are so vastly different.

petergwilson
Level 14

We had an incident recently. We have a pair of Linux web servers that access a back-end database. They sit behind an F5 in a load balanced pool. We monitor each server and the F5 pool for both 'down' and 'back up' status and alert on both because that is what the customer wanted. One of my colleagues updated ONE of the servers' Python install and got it wrong. Alerts started. The updated server was flip-flopping slowly and triggering both the server and the F5 pool alerts correctly. By Monday morning when I got in there were over 800 e-mail alerts and he was busy e-mailing the world telling them that the alerting system was useless and I didn't know what I was doing. Pity for him I was able to show that he had caused the alerts by breaking one of the servers and the alerting had worked perfectly. Each alert told a story, and taken all together they showed a pattern which made it easy to see what was wrong and when it happened. He doesn't talk to me now, which is no great loss.

ecklerwr1
Level 19

I hate stuff like that...

ecklerwr1
Level 19

I suppose event correlation is where the rubber meets the road.

bobmarley
Level 15

We are now sending all of our events from all of our various tools to one event collection engine so the correlation can be done in a single place. Seems like SolarWinds LEM could do something similar; however, I haven't yet tried it.

rschroeder
Level 21

A great SIEM and proper monitoring tools, and a thorough understanding of what's "normal" for any node or circuit, are critical to recognizing, stopping, preventing, and correcting a problem.

Your phrase " . . . check all data sources when something out of the baseline occurs . . ." isn't something anyone I know has time to do.  We rely on automation and tools to identify unusual activity or performance, and they need to alert us instead of having bodies assigned to become familiar with bits and bytes, using mental gears to identify and recognize and differentiate between normal and abnormal flows.

What tools do you use to accomplish this, since I want to believe this kind of work is tailor-made for machines, not human minds?

tallyrich
Level 15

Good article. Good and proper notifications are key. How many times do we monitor in a reactive mode? It's critical to have valuable alerts that not only tell us when things are "down" but also when things aren't performing to the baseline, so that we can head off issues before they become outages.

robertkcampbell
Level 8

Thanks for posting this!

buraglio
Level 10

Exactly. This kind of work is really meant for humans to do once or twice and then to automate their actions. Fortunately, that kind of thing is very easily accomplished with threshold alarming. Cross-referencing it, from my experience, is typically accomplished by custom code that is unique to an environment. There are exceptions to this, and tools available that can accomplish things like identifying excessive UDP flows to/from port 123 and then kicking off scans to identify vulnerable daemons on the src/dst hosts, etc. But yes, you're absolutely right, this is a perfect job for a bit of custom code and a machine with data sources and some CPU cycles.
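
For illustration only, a stripped-down sketch of that NTP example (the flow format and thresholds are assumptions, and the nmap/ntp-monlist follow-up is printed as a suggestion rather than executed):

```python
from collections import defaultdict

def ntp_flood_suspects(flows, byte_threshold=100_000_000):
    """Threshold alarm: hosts moving an excessive amount of UDP/123 traffic."""
    per_host = defaultdict(int)
    for f in flows:
        if f["proto"] == "udp" and 123 in (f.get("src_port"), f.get("dst_port")):
            per_host[f["src"]] += f["bytes"]
            per_host[f["dst"]] += f["bytes"]
    return [host for host, total in per_host.items() if total >= byte_threshold]

if __name__ == "__main__":
    flows = []  # replace with records pulled from your flow collector
    for host in ntp_flood_suspects(flows):
        # Print a suggested follow-up scan rather than running it automatically.
        print(f"nmap -sU -p123 --script ntp-monlist {host}")
```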

About the Author
15+ years of IT experience ranging from networking, UNIX, security policy, and incident response to anything else interesting. Mostly just a networking guy with hobbies including film, beer brewing, boxing, MMA, jiu jitsu/catch wrestling/grappling, skateboarding, cycling, and being a Husband and Dad. I don't sleep much.