Geek Speak


The collection of operational and analytics information can be an addictive habit, especially on an interesting and active network. However, this information can quickly and easily overwhelm an aggregation system once the constant fire hose of information begins. Assuming the goal is to collect and use this information, it becomes clear that a proper strategy is required. That strategy should be composed of a number of elements, including consideration of needs and requirements before beginning a project of this scope.

 

In practice, gathering requirements will likely happen either in parallel or, as with many things in networking, be adjusted on demand, building the airplane while it is in flight. Course correction should be an oft-used tool in any technologist's toolbox. New peaks and valleys, pitfalls, and advantages should feed the constant evaluation that occurs in any dynamic environment. Critical parts of this strategy should be considered in nearly all endeavors of this kind. But even before that, the reasoning and use cases should be identified.

 

A few of the more important questions that need to be answered are:

 

  • What data is available?
  • What do we expect to do with the data?
  • How can we access the data?
  • Who can access what aspects of the data types?
  • Where does the data live?
  • What is the retention policy on each data type?
  • What is the storage model of the data? Is it encrypted at rest? Is it encrypted in transit?
  • How is the data ingested?

 

Starting with these questions can dramatically simplify and smooth the execution process of each part of the project. The answers to these questions may change, too. There is no fault in course correction, as mentioned above. It is part of the continuous re-evaluation process that often marks a successful plan.

 

Given these questions, let’s walk through a workflow to understand the reasoning for them and illuminate the usefulness of a solid monitoring and analytics plan. “What data is available?” will drive a huge number of questions and their answers. Let’s assume that the goal is to consume network flow information, system and host log data, polled SNMP time series data, and latency information. Clearly this is a large set of very diverse information, all of which should be readily available. The first mistake most engineers make is diving into the weeds of what tools to use straight away. This is a solved problem, and frankly it is far less relevant to the overall project than the rest of the questions. Use the tools that you understand, can afford to operate (both fiscally and operationally), and that provide the interfaces that you need. Set that detail aside, as answers to some of the other questions may decide it for you.

 

How will we store the data? Time series data is easy: it typically goes into an RRD (round-robin database). Will there be a need for complex queries against text-heavy sources like NetFlow and syslog? If so, an indexing tool may be needed, and there are many commercial and open source options. Keep in mind that this is one of the more nuanced parts of the plan, as the answer to this question may change the answers to the others, specifically retention, access, and storage location. Data storage is the hidden bane of an analytics system. Disk isn't expensive, but sizing it correctly, and on budget, is hard. Whatever disk space is required, always, always, always add headroom. It will be needed later; otherwise, the retention policy will have to be adjusted.
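As a rough illustration of why headroom matters, here is a back-of-the-envelope sizing sketch. The record rates, record sizes, and retention periods below are made-up placeholders, not recommendations; swap in numbers measured from your own collectors.

```python
# Rough storage estimate for retained telemetry. All rates, record sizes,
# and retention periods are hypothetical; substitute measured values.

def storage_gb(records_per_sec, bytes_per_record, retention_days, headroom=1.5):
    """Return estimated disk usage in GB, padded by a headroom factor."""
    seconds = retention_days * 86400
    raw_bytes = records_per_sec * bytes_per_record * seconds
    return raw_bytes * headroom / 1e9

sources = {
    # name: (records/sec, avg bytes per stored record, retention in days)
    "netflow": (5000, 60, 90),
    "syslog":  (800, 250, 365),
    "snmp_ts": (2000, 16, 730),
}

total = 0.0
for name, (rps, size, days) in sources.items():
    gb = storage_gb(rps, size, days)
    total += gb
    print(f"{name:8s} ~{gb:,.0f} GB with headroom")
print(f"total    ~{total:,.0f} GB")
```

Even toy numbers like these make the point: retention, record size, and headroom interact, and changing one answer changes the others.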

 

Encryption comes into play here as well. Typical security practice is to encrypt in flight and at rest, but in many cases this isn't feasible (think router syslog). Encryption at rest also incurs a fairly heavy cost, both one-time (CPU cycles to encrypt) and perpetual (decryption for access). In many cases, encryption simply cannot be justified. Exceptions should be documented and the associated risks formally accepted, so that there is a clear record of the decisions and of management's acceptance of risk in the event that sensitive information is leaked or exfiltrated.

 

With all of this data, what is the real end goal? Simple: a baseline. Nearly all monitoring and measurement systems provide, at their elemental level, a baseline. Knowing how something operates is fundamental to successful management of any resource, and networks are no exception. With stored statistical information, it becomes significantly easier to identify issues. Functionally, any data collected will likely be useful at some point if it is available and referenced. Having a solid plan for how the statistical data is handled is the foundation of ensuring those deliverables are met.
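To make "baseline" concrete, here is a minimal sketch of the idea, assuming nothing more than a series of numeric samples (interface octet deltas, flow counts, latency measurements) pulled from whatever poller is already in place. The window size and threshold are arbitrary starting points, not tuned values.

```python
import statistics
from collections import deque

def baseline_alerts(samples, window=288, threshold=3.0):
    """Yield (index, value) pairs that deviate from a rolling baseline.

    `samples` is any iterable of numeric measurements (e.g., 5-minute
    interface octet deltas); `window` is how many prior samples form the
    baseline (288 five-minute samples = one day).
    """
    history = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(history) >= window // 2:  # wait for a usable baseline
            mean = statistics.fmean(history)
            stdev = statistics.pstdev(history) or 1.0
            if abs(value - mean) > threshold * stdev:
                yield i, value
        history.append(value)

# Example: a flat series with one spike injected at the end.
series = [100 + (i % 7) for i in range(600)] + [900]
for idx, val in baseline_alerts(series):
    print(f"sample {idx}: {val} deviates from baseline")
```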

As anyone who has run a network of any size has surely experienced, behind one alert there is typically (but not always) a deeper issue that may or may not generate further alarms. An often overlooked correlation is the one between a security event caught by a Network Security Monitor (NSM) and an event seen by a network or service monitoring system. In certain cases, a security event will be noticed first by a network or service alarm. Due to the nature of many modern attacks, whether volumetric or targeted, the goal is typically to take a given resource offline. This "offlining," usually labeled a denial of service (DoS), has a goal that is very easy to understand: make the service go away.

 

To understand how and why correlating these events is important, we need to understand what they are. In this case, the focus will be on two different attack types and how they manifest. There are significant similarities between them, but knowing the difference can save time and effort in triage, whether you're dealing with large or small outages and events.

 

Keeping that in mind, the obvious goals are:

 

1. Rooting out the cause

2. Mitigating the outage

3. Understanding the event to prevent future problems

 

For the purposes of this post, the two most common issues will be described: volumetric and targeted attacks.

 

Volumetric

This one is the most common, and it gets the most press due to the sheer scope of what it can damage. At a high level, this is just a traffic flood. It's typically nothing fancy, just a lot of traffic generated in one way or another (there are myriad mechanisms for creating unwanted traffic flows), usually coming from compromised hosts or misconfigured services such as SNMP, open DNS recursion, or other common protocols. The destination, or target, of the traffic is a host or set of hosts that the offender wants to knock offline.

 

Targeted

This is far more stealthy. It's more of a scalpel, where a volumetric flood is a machete. A targeted attack is typically used to gain access to a specific service or host. It is less like a flood and more like a specific exploit pointed at a service. This type of attack usually has a different goal: gaining access for reconnaissance and information gathering. Sometimes a volumetric attack is simply a smoke screen for a targeted attack. This can be very difficult to root out.

 

Given what we know about these two kinds of attacks, how can we utilize all of our data to triage the damage better (and more quickly)? Easily, actually. Knowing that an attack is occurring is fairly easy in the case of volumetric attacks: traffic spikes, flatlined circuits, service degradation. In a strictly network engineering world, the problem would manifest and, in the best case, NetFlow data would be consulted in short order. Even with that particular dataset, it may not be obvious to a network engineer that an attack is occurring. It may just appear as a large amount of UDP traffic from different sources. Given traffic levels, a single 10G-connected host can drown out a 1G-connected site. This type of attack can also manifest in odd ways, especially if there are link speed differences along a given path. It can look like a link failure or a latency issue when in reality it is a volumetric attack. However, with a tuned and maintained NSM of some kind also in place, the traffic should be readily identified as a flood and can be filtered more quickly, either by the site or by the upstream ISP.
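As a toy example of what that triage can look like once flow records are in hand, the sketch below aggregates UDP bytes per destination and flags anything soaking up an outsized share of total traffic. The record layout, addresses, and threshold are hypothetical; a real collector export will have its own field names and far more records.

```python
from collections import defaultdict

# Minimal flood triage over already-collected flow records. The fields
# (src, dst, proto, bytes) are a simplified, hypothetical schema -- map
# them to whatever your collector actually exports.
flows = [
    {"src": "203.0.113.7",  "dst": "192.0.2.10", "proto": "udp", "bytes": 48_000_000},
    {"src": "198.51.100.3", "dst": "192.0.2.10", "proto": "udp", "bytes": 51_000_000},
    {"src": "192.0.2.20",   "dst": "192.0.2.30", "proto": "tcp", "bytes": 120_000},
]

udp_bytes = defaultdict(int)
total_bytes = 0
for f in flows:
    total_bytes += f["bytes"]
    if f["proto"] == "udp":
        udp_bytes[f["dst"]] += f["bytes"]

# Flag any destination soaking up more than half of all observed traffic as UDP.
for dst, b in sorted(udp_bytes.items(), key=lambda kv: kv[1], reverse=True):
    share = b / total_bytes
    if share > 0.5:
        print(f"{dst}: {b/1e6:.0f} MB UDP ({share:.0%} of all traffic) "
              "-- possible volumetric flood")
```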

 

Targeted attacks will look very different, especially when performed on their own. This is where an NSM is critical. With attempts to compromise actual infrastructure hardware like routers and switches on a significant uptick, knowing the typical traffic patterns for your network is key. If a piece of your critical infrastructure is targeted, and it is inside your security perimeter, your NSM should catch that and alert you to it. This is especially important in the case of your security equipment. Tapping your network in front of the filtering device can greatly aid in seeing traffic destined for the perimeter itself. Given that there are cases of firewalls being compromised, this is a real threat. If and when that occurs, it may appear as high load on a device, a memory allocation increase, or perhaps a set of traffic spikes, none of which a network engineer will be concerned with as long as service is not affected. However, understanding the traffic patterns that led to it could help uncover a far less pleasant cause.
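One simple way to put that idea into practice is to keep a list of your own infrastructure addresses and flag any flow destined for them, then compare the sources against known management hosts. The addresses and flow records in this sketch are placeholders, purely for illustration.

```python
# Flag flows whose destination is one of your own infrastructure devices.
# The address set and flow records below are illustrative placeholders.
INFRA = {"192.0.2.1", "192.0.2.2", "198.51.100.254"}  # routers, firewalls, switches

flows = [
    {"src": "203.0.113.50", "dst": "192.0.2.1", "dport": 22,  "bytes": 4200},
    {"src": "10.0.0.12",    "dst": "10.0.3.9",  "dport": 443, "bytes": 9000},
]

for f in flows:
    if f["dst"] in INFRA:
        print(f"flow to infrastructure device {f['dst']} port {f['dport']} "
              f"from {f['src']} -- verify against expected management sources")
```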

 

Most of these occurrences are somewhat rare; nevertheless, it is a very good habit to check all data sources whenever something outside the baseline occurs on a given network. Perhaps more importantly, there is no substitute for good collaboration. Having a strong, positive, and ongoing working relationship between security professionals and network engineers is a key element in making any of these occurrences less painful. In many small and medium-sized environments, these people are one and the same. But when they aren't, collaboration at a professional level is as important and useful as the cross-referencing of data sources.

Data, data, data. You want all of the data, right? Of course you do. Collecting telemetry and logging data is easy. We all do it, and we all use it from time to time. Interrupt-driven networking is a way of life and is more common than any other kind (i.e., automation- and orchestration-based models) because that is how the vast majority of us learned: "Get it working, move to the next fire." What if there were more ways to truly understand what is happening within our sphere of control? Well, I believe there are -- and the best part is that the barrier to entry is pretty low, and likely right in your periphery.

 

 

Once you have all of that data, the next step is to actually do something with it. All too often, we as engineers poll, collect, visualize, and store a wealth of data, and only on rare occasions actually leverage even a fraction of the potential it can provide. In previous posts, we touched on the usefulness of correlating collected and real-time data. This post takes that a step further. It should be noted that this is not really intended to be a tutorial, but more of a blueprint or, more accurately, a high-level recipe that may have a rotating and changing list of ingredients. We all like different flavors and levels of spiciness, right?

 

 

As noted in the previous post on related attributes, there is a stealthy enemy in our network -- a gremlin, if you will. That gremlin's name is "grey failure," and it is very hard to detect and even more difficult to plan around. Knowing this, and realizing that there is a large amount of data with related attributes, similar causes, and noticeable effects, we can start to build a framework to aid in this endeavor. We talked about the related attributes of SNMP and NetFlow. Now, let us expand that further into their familial brethren: interface errors and syslog data.

 

 

While syslog data may cast a wide, wide net, there are some interesting bits and pieces we can glean from even the most rudimentary logs. Interface error detection will manifest in many ways depending on the platform in use. There may be logging mechanisms for it, it may come as polled data, or it could reveal itself as an SNMP trap. The mechanism isn't really important. What is critical is knowing that a connection is causing an issue with an application. In fact, the application may be a key player in discovering an interface issue. Let's say that an application is working one day and the next there are intermittent connectivity issues. If the lower protocol is TCP, it will be hard to run down without packet capture because of TCP retransmissions. If, however, this application generates connectivity error logs and sends them to syslog, that can be an indicator of an interface issue. From there it can be ascertained that a path needs to be examined, and the first step of investigating a path is looking at interface errors. Here is the keystone, though: simply looking at interface counters on a router can uncover incrementing errors, but looking at long-term trends will make such an issue very obvious. In the case of UDP, this can be very hard to find, since UDP is functionally connectionless. This is where viewing the network as an ecosystem (as described in a previous blog post) can be very useful: application, system, and network all working together in symbiosis. Flow data can help uncover these UDP issues, and with the help of syslog from an appropriately built application, the job simply becomes a task of correlation.
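A bare-bones version of that correlation might look like the sketch below: take the timestamps of application connectivity errors parsed out of syslog, take the times at which an interface error counter was seen to increment, and count how often they land within the same window. The interface name, timestamps, and window size are all invented for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical, pre-parsed inputs: timestamps of application connectivity
# errors pulled from syslog, and timestamps at which an interface error
# counter was observed to increment (from SNMP polling or traps).
app_errors = [datetime(2023, 5, 1, 10, 2), datetime(2023, 5, 1, 10, 7),
              datetime(2023, 5, 1, 14, 30)]
intf_error_increments = {"xe-0/0/1": [datetime(2023, 5, 1, 10, 1),
                                      datetime(2023, 5, 1, 10, 6)]}

WINDOW = timedelta(minutes=5)

for ifname, increments in intf_error_increments.items():
    hits = sum(1 for a in app_errors
               for e in increments if abs(a - e) <= WINDOW)
    if hits:
        print(f"{ifname}: {hits} application errors within {WINDOW} of "
              f"counter increments -- inspect this path first")
```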

 

 

Eventually these tasks will become more machine-driven, and the operator and engineer will only need to feed data sources into a larger, smarter, more self-sustaining (and eventually self-learning) operational model. Until then, understanding the important components and relations between them will only make for a quieter weekend, a more restful night, and a shorter troubleshooting period in the case of an issue. 



For many engineers, operators, and information security professionals, traffic flow information is a key element of performing both daily and long-term strategic tasks. This data usually takes the form of NetFlow versions 5 and 9, IPFIX (sometimes called NetFlow version 10), and sFlow. This toolkit is widely utilized and enables insight into network traffic, performance, and long-term trends. When done correctly, it also lends itself well to security forensics and triage tasks.

 

Having been widely utilized in carrier and large service provider networks for a very long time, this powerful data set has only begun to really come into its own for enterprises and smaller networks in the last few years, as tools for collecting and, more importantly, processing and visualizing it have become more approachable and user-friendly. As the floodgates open to toolkits and devices that can export either sampled flow information or one-to-one flow records, more and more people are discovering and embracing the data. What many do not necessarily see, however, is that correlating this flow data with other information sources, particularly SNMP-based traffic statistics, can make for a very powerful symbiotic relationship.

 

By correlating traffic spikes and valleys over time, it is simple to cross-reference flow telemetry and identify statistically divergent users, applications, segments, and time periods. Now, this is a trivial task for a well-designed flow visualization tool, and it can be accomplished without even looking at SNMP traffic statistics. Where the correlation provides a different and valuable perspective, however, is in the valley periods when traffic is low. Human nature is to ignore that which is not out of spec or obviously divergent from the baseline. So, the key is in looking at lulls in interface traffic statistics. Treat these anomalies as one would a spike, and mine flow data for pre-event traffic changes. Check TCP flags to find more intricate details of the flows (note: this takes a bit of work, as TCP flags are exported as a cumulative numeric value in NetFlow v5 and v9 and must be decoded, but they can provide an additional view into other potential issues; see the sketch after this paragraph). Conversely, the flags may also be an indicator of soft failures of interfaces along a path, which could manifest as SNMP interface errors that are exported and can be tracked. Think about the instances where this may be useful: soft failures are notoriously hard to detect, and this is a step in the right direction toward detecting them. Once this kind of mentality and correlation is adopted, adding even more data sources to the repertoire of relatable data is just a matter of consuming and alerting on it. This falls well within the notion of looking at the network and systems as a relatable ecosystem, as mentioned in this post. Everything is interconnected, and the more expansive the understanding of one part, the more easily it can be related to other, seemingly "unrelated" occurrences.
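For the TCP flags detail mentioned above, the cumulative flags field in NetFlow v5/v9 is a single byte in which each bit marks a flag seen at least once during the flow. A small decoder might look like this sketch (the example values are arbitrary):

```python
# Decode the cumulative TCP flags byte exported in NetFlow v5/v9 records.
TCP_FLAGS = {0x01: "FIN", 0x02: "SYN", 0x04: "RST", 0x08: "PSH",
             0x10: "ACK", 0x20: "URG", 0x40: "ECE", 0x80: "CWR"}

def decode_flags(value: int) -> list[str]:
    """Return the flag names set in the exported numeric value."""
    return [name for bit, name in TCP_FLAGS.items() if value & bit]

# A flow that only ever carried SYNs and never completed a handshake is a
# different animal from one showing SYN, ACK, and FIN.
print(decode_flags(0x02))   # ['SYN']
print(decode_flags(0x13))   # ['FIN', 'SYN', 'ACK']
```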

 

 

This handily accomplishes two important tasks: building a table of related experience in an engineer's or operator's mind and, if done correctly, producing a well-oiled, accurate, efficient, and documented workflow of problem analysis and resolution. When this needs to be sold to management, as it will in many environments, proving that these tracked analytics can be used in concert for a more complete, more robust, and more efficient monitoring and operational experience may require hard deliverables, which can prove challenging. However, the prospect of "better efficiency, less downtime" is typically enough to spark interest in at least a few conversations.

Many of us have operated, or currently operate, in a stovepiped or siloed IT environment. For some this may just be a way of professional life, but regardless of how the organizational structure is put together, having a wide and full understanding of any environment lends itself to a smoother and more efficient system overall. As the separation of duties continues to blur in the IT world, it is becoming increasingly important to shift how we as systems and network professionals view the individual components and the overall ecosystem. As these changes and tidal shifts occur, Linux appears in the switching and routing infrastructure, servers consume BGP feeds and make intelligent routing choices, and orchestration workflows automate the network and the services it provides -- all of these things are slowly creeping into more enterprises, more data centers, and more service providers. What does this mean for the average IT engineer? It typically means that we, as professionals, need to keep abreast of workflows and IT environments as a holistic system rather than as a set of distinct silos or disciplines.

 

This mentality is especially important in the monitoring aspects of any IT organization, and it is a good habit to start even before these shifts occur. Understanding the large-scale behavior of IT in your environment will allow engineers and practitioners to accomplish significantly more with less -- and that is a win for everyone. Understanding how your servers interact with the DNS infrastructure, the switching fabric, the back-end storage, and the management mechanisms (i.e., handcrafted curation of configurations or automation) naturally lends itself to a faster mean time to repair, because it reflects a deeper understanding of the IT organization as a whole rather than of a single piece or service within it.

 

One might think, "I don't need to worry about Linux on my switches and routing on my servers," and that may be true. However, expanding the knowledge domain from a small box to a large container filled with boxes will allow a person to understand not just the attributes of their own box, but the characteristics of all of the boxes together. For example, understanding that a new application makes a DNS query for every single packet it sees, where past applications cached locally, can dramatically decrease the downtime that occurs when the underlying systems hosting DNS become overloaded and slow to respond. The same can be said for moving to cloud services: having a clear baseline of link traffic -- both internal and external -- will make it obvious that the new cloud application requires more bandwidth and perhaps less storage.
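A tiny sketch of how that DNS example could be caught early: compare observed per-application query rates against a stored baseline and flag anything wildly out of line or brand new. The application names and rates here are entirely hypothetical; in practice the rates would come from DNS server logs or flow data toward port 53.

```python
# Toy comparison of per-application DNS query rates against a stored baseline.
baseline_qps = {"legacy-app": 40, "billing": 15}
observed_qps = {"legacy-app": 42, "billing": 14, "new-app": 2600}

for app, qps in observed_qps.items():
    expected = baseline_qps.get(app)
    if expected is None:
        print(f"{app}: no baseline yet ({qps} qps) -- watch it before it hurts the resolvers")
    elif qps > expected * 3:
        print(f"{app}: {qps} qps vs baseline {expected} -- likely not caching lookups")
```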

 

Fear not! This is not a cry to become a developer or a sysadmin. It's not a declaration that there is a hole in the boat or a dramatic statement that "IT as we know it is over!" Instead, it is a suggestion to look at your IT environment in a new light. See it as a functioning system rather than a set of disjointed bits of hardware with different uses and diverse managing entities (i.e. silos). The network is the circulatory system, the servers and services are the intelligence. The storage is the memory, and the security is the skin and immune system. Can they stand alone on technical merit? Not really. When they work in concert, is the world a happier place to be? Absolutely. Understand the interactions. Embrace the collaborations. Over time, when this can happen, the overall reliability will be far, far higher.

 

Now, while some of these correlations may seem self-evident, piecing them together and, more importantly, tracking them for trends and patterns has high potential to dramatically increase the number of better-informed, fact-based decisions overall, and that makes for a better IT environment.

Incident responders: Build or buy?

There is far more to security management than technology. In fact, one could argue that the human element is more important in a field where intuition is just as valuable as knowledge of the tech. In the world of security management, I have not seen a more hotly debated non-technical issue than the figurative "build or buy" question when it comes to incident responder employees. The polarized camps are the obvious ones:

  • Hire for experience. In this model, the desirable candidate is a mid-career or senior-level, experienced incident responder. The pros and cons are debatable:
    • More expensive
    • Potentially set in their ways
    • Can hit the ground running
    • Low management overhead
  • Hire for ability. In this model, a highly motivated but less experienced engineer is hired and molded into as close to what the enterprise requires as possible. Using this methodology, the caveats and benefits are a bit different, as it is a longer-term strategy:
    • Less expensive
    • "Blank slate"
    • Requires more training and attention
    • Initially less efficient
    • More unknowns due to lack of experience
    • Can potentially become exactly what is required
    • May leave after a few years

In my stints managing and being involved with hiring, I have found it a difficult task to find a qualified, senior-level engineer or incident responder who has the personality traits conducive to melding seamlessly into an existing environment. That is not to say it isn't possible, but soft skills are a lost art in technology, and especially so in development and security. In my travels, sitting on hiring committees and writing job descriptions, I have found that the middle ground is the key. Mid-career, still-hungry incident responders who have a budding amount of intuition have been the blue chips in the job searches and hires I have been involved with. They tend to have the fundamentals and a formed gut instinct that makes them incredibly valuable and, at the same time, very open to mentorship. Now, the downside is that 40% of the time they're going to move on just when they're really useful, but the 60% that stick around a lot longer? They are often the ones who think outside the box and keep the team fresh.

What seems like a lifetime ago, I worked for a few enterprises doing various things like firewall configuration, email system optimization, and hardening of NetWare, NT4, AIX, and HP-UX servers. There were three good-sized employers: a bank and two huge insurance companies that both had financial components. While working at each and every one of them, I was subject to their security policies (one of which I helped to craft, but that is a different path altogether), and none of those policies really addressed data retention. When I left those employers, they archived my home directories, remaining email boxes, and whatever other artifacts I left behind. None of this was really an issue for me, as I never brought any personal or sensitive data in, and everything I generated on site was theirs by the nature of what it was. What did not occur to me then, though, was that this was essentially a digital trail of breadcrumbs that could exist indefinitely. What else was left behind, and was it also archived? Mind you, this was in the 1990s and network monitoring was fairly clunky, especially at scale, so the likely answer is "nothing," but I assert that the answer to that question has changed significantly in this day and age.

Liability is a hard pill for businesses to swallow. Covering bases is key, and that is where data retention is a double-edged sword. Thinking like I am playing a lawyer on TV: keeping data on hand is useful for forensic analysis of potentially catastrophic data breaches, but it can also be a liability, in that it can prove culpability when employees misbehave on corporate time, with corporate resources, or on the company's behalf. Is it worth it?

Since that time oh so long ago, I have found that the benefit of retaining the information has far outweighed the risk, especially for traffic data such as proxy logs, firewall logs, and network flows. The real issues I have, as noted in previous posts, are the correlation of said data and, more often than not, the archival method for what can amount to massive amounts of disk space.

If I can offer one nugget of advice, learned through years of having to decide what goes, what stays, and for how long, it is this: buy the disks. Procure the tape systems. Do whatever you need to do to keep as much of the data as you can get away with, because once it is gone, it is highly unlikely that you will ever get it back.

Of all of the security techniques, few garner more polarized views than interception and decryption of trusted protocols. There are many reasons to do it and a great deal of legitimate concern about compromising the integrity of a trusted protocol like SSL. SSL is the most common protocol to intercept, unwrap, and inspect, and accomplishing this has become easier and requires far less operational overhead than it did even five years ago. Weighing those concerns against the information that can be ascertained by cracking it open and looking at its content is often a struggle for enterprise security engineers due to the privacy implications. In previous lives I have personally struggled to reconcile this, but ultimately decided that the ethics involved in what I consider to be a violation of implied security outweighed the benefits of SSL intercept. With other options being few, blocking protocols that obfuscate their content seems to be the next logical step; however, with the prolific increase in SSL-enabled sites over the last 18 months, even this option seems unrealistic and, frankly, clunky. Exfiltration of data, which can be anything from personally identifiable information to trade secrets and intellectual property, is becoming a more and more common "currency," and is much more desirable and lucrative to transport out of businesses and other entities. These are hard problems to solve.

Are there options out there that make better sense? Are large and medium-sized enterprises doing SSL intercept? How is the data being analyzed and stored?

Given the current state of networking and security, with the prevalence of DDoS attacks such as NTP monlist, SNMP, and DNS amplification, very directed techniques like doxing, and, most importantly to many enterprises, exfiltration of sensitive data, network and security professionals are forced to look at creative and often innovative means of ascertaining information about their networks and traffic patterns. This can sometimes mean finding and collecting data from many sources and correlating it or, in extreme cases, obtaining access to otherwise secure protocols.

Knowing your network and computational environment is absolutely critical to classification and detection of anomalies and potential security infractions. In today's hostile environments, which have often had to grow organically over time, and given the importance and often the expense of obtaining, analyzing, and storing this information, what creative approaches are being used to accomplish these tasks? How is the correlation being done? Are enterprises and other large networks utilizing techniques like full packet capture at their borders? Are you performing SSL intercept and decryption? How are correlation and cross-referencing of security and log data accomplished in your environment? Are they tied into any performance or outage data sources?
