IT professionals are admittedly a prideful bunch. It comes with the territory when you have to constantly defend yourself, your decisions, and your infrastructure against people who don’t truly understand what you do. This is especially true for network administrators. “It’s always the network.” Ever heard that one before? Heck, there’s even a blog out there with that expression, created by someone I respect, Colby Glass. My point is, as IT professionals, we have to be prepared at a moment’s notice to provide evidence that an issue is not related to the devices we manage. That's why it's imperative that we know our network inside and out.
With that being said, it should be no surprise that when I started my career in networking in 2010, I thought NMS platforms were pretty amazing. Pop some IP addresses in and you’re set.¹ The NMS goes about its duty, monitoring the kingdom and alerting you when things go awry. If I wanted to be certain, I could log in and verify it for myself. I could even dig in at the interface level and give you traffic statistics like discards, errors, and utilization. I had instant credibility at my fingertips. I could prove the network was in great shape at a moment's notice. Want to know if that interface to your server was congested yesterday evening at 7pm? It sure wasn't, and I have the proof! Can’t get much better than that, right?
Then I saw netflow for the first time. Netflow has a way of really opening your eyes. “How did I ever think I knew my network so well?” I thought. I had no visibility into the traffic patterns flowing through my network. Sure, I could fire up a packet capture pretty easily, but that approach is reactive and, depending on your setup, time-consuming. What if that interface really WAS congested yesterday evening at 7pm? I would have no data to reference, because I wasn't running a packet capture at that exact time or for that particular traffic flow. It’s helpful to tell someone that the interface was congested, but how about taking it a step further and telling them what was congesting it? What misbehaving application caused that link to be 90% utilized when traffic should have been relatively light at that time of day? The important thing to realize is that I’m not just an advocate for netflow, I’m also a user!² Here’s a quick recap of an instance where netflow saved my team and me.
I recently encountered a situation where having netflow data was instrumental. One day at work, we received multiple calls, e-mails, and tickets about slow networks at our remote offices. They seemed to be related, but we weren't sure at first. The slowness complaints were sporadic in nature, which made us scratch our heads even more. After looking at our instance of NPM, we definitely saw high interface utilization at some, but not all, of our remote sites. We couldn't think of any application or traffic pattern that would cause this. Was our network under attack? We thought it might be prudent to involve the security team in case it really was an attack, but before we sounded the alarm, we decided to check our netflow data first. What we saw next really baffled us.
Large amounts of traffic (think GBs/hour) were coming from our Symantec Endpoint Protection (SEP) servers to clients at the remote offices over TCP port 8014. For those of you who have worked with Symantec before, you probably already know that this is the port that the SEP manager uses to manage its clients (e.g. virus definition updates). At some point, communication between the manager and most of its clients (especially in remote offices) had failed, and the virus definitions on the clients became outdated. After a period of time, the clients would no longer request the incremental definition update; they wanted the whole enchilada. That’s okay if it’s a few clients and the download succeeds the first time. This wasn't the case in our situation. There were hundreds of clients all trying to download this 400+ MB file from one server over relatively small WAN links (averaging 10 Mb/s). The result was constantly failing downloads, which triggered the process to start over again ad infinitum.
As a quick workaround, we decided to apply QoS to the traffic based on the port number until the issue with the clients was resolved. We then brought the netflow data to the security team to show them that their A/V system was not healthy. Armed with that information, they quickly identified several issues with the SEP manager and its clients and eventually resolved them, which included standing up a redundant SEP manager. Without netflow data, we would have had to set up SPAN ports on our switches and wait for a period of time before analyzing packet captures to determine what caused the congestion. With netflow, we could immediately look back at specific times in the past to see what was traversing our network when our users were complaining.
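To make the "look back at a specific time" idea a little more concrete, here is a rough Python sketch of the kind of question a flow collector answers: which sender and port moved the most bytes in a given window? Everything in it is an assumption for illustration only. The `Flow` record is a simplified, hypothetical structure (not a real NetFlow export format or any vendor's API), and the addresses, timestamps, and byte counts are made up.

```python
from collections import defaultdict
from datetime import datetime
from typing import NamedTuple


class Flow(NamedTuple):
    """A simplified, hypothetical flow record (not a real NetFlow format)."""
    start: datetime
    src_ip: str
    dst_ip: str
    protocol: str
    src_port: int
    dst_port: int
    byte_count: int


def top_talkers(flows, window_start, window_end, limit=5):
    """Sum bytes per (source IP, protocol, source port) for flows that
    started inside the window, largest total first."""
    totals = defaultdict(int)
    for flow in flows:
        if window_start <= flow.start < window_end:
            totals[(flow.src_ip, flow.protocol, flow.src_port)] += flow.byte_count
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Made-up sample data: one server sending a lot from TCP port 8014.
SAMPLE_FLOWS = [
    Flow(datetime(2024, 1, 15, 19, 5), "10.0.0.20", "10.1.1.50", "tcp", 8014, 51000, 400_000_000),
    Flow(datetime(2024, 1, 15, 19, 10), "10.0.0.20", "10.1.1.51", "tcp", 8014, 51001, 350_000_000),
    Flow(datetime(2024, 1, 15, 19, 12), "10.0.0.30", "10.1.1.50", "tcp", 443, 51002, 20_000_000),
    Flow(datetime(2024, 1, 15, 21, 0), "10.0.0.30", "10.1.1.52", "tcp", 443, 51003, 900_000_000),
]

if __name__ == "__main__":
    window = (datetime(2024, 1, 15, 19, 0), datetime(2024, 1, 15, 20, 0))
    for (src_ip, protocol, src_port), total in top_talkers(SAMPLE_FLOWS, *window):
        print(f"{src_ip} {protocol}/{src_port}: {total / 1_000_000:.0f} MB")
```

The sketch groups on the source port because, in our case, the heavy traffic flowed from the server's port to the clients. A real collector does this grouping for you; the point is simply that the historical records are already there when the complaints come in.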
That’s just one problem netflow has solved for us. What if that port was TCP/6667 and it was coming from your CFO’s computer? Do you really think your CFO is on #packetpushers (irc.freenode.net) trying to learn more about networking? No, it’s more likely a bot checking in with its command-and-control server for its next instructions on how to make your life worse. From a security perspective, netflow is just one more tool to add to your arsenal in the never-ending fight against malware. So what are you waiting for? Get with the flow… with netflow!
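For the curious, the TCP/6667 scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FlowSummary` is a hypothetical, simplified record, the watch list contains just the one port mentioned above, and the addresses are invented. A port match alone proves nothing; it only tells you where to look next.

```python
from typing import NamedTuple

# Hypothetical watch list: destination port -> why it is worth a second look.
WATCHED_PORTS = {
    6667: "IRC, sometimes used for botnet command and control",
}


class FlowSummary(NamedTuple):
    """A simplified, hypothetical flow record (not a real NetFlow format)."""
    src_ip: str
    dst_ip: str
    protocol: str
    dst_port: int


def flag_suspicious(flows, watched_ports=WATCHED_PORTS):
    """Return (flow, reason) pairs for TCP flows whose destination port
    appears on the watch list."""
    return [
        (flow, watched_ports[flow.dst_port])
        for flow in flows
        if flow.protocol == "tcp" and flow.dst_port in watched_ports
    ]


# Made-up sample data.
SAMPLE = [
    FlowSummary("10.2.0.15", "203.0.113.7", "tcp", 6667),
    FlowSummary("10.2.0.15", "198.51.100.9", "tcp", 443),
    FlowSummary("10.2.0.22", "203.0.113.7", "udp", 6667),
]

if __name__ == "__main__":
    for flow, reason in flag_suspicious(SAMPLE):
        print(f"{flow.src_ip} -> {flow.dst_ip}:{flow.dst_port} ({reason})")
```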
1. Of course, it's never quite that easy. You'll have to configure SNMP on all of the devices you want to manage and/or monitor.