Geek Speak


Data, data, data. You want all of the data, right? Of course you do. Collecting telemetry and logging data is easy. We all do it, and we all use it from time to time. Interrupt-driven networking is a way of life, and it is more common than the alternatives (i.e., automation- and orchestration-based models) because that is how the vast majority of us learned: “Get it working, move to the next fire.” What if there were more ways to truly understand what is happening within our sphere of control? Well, I believe there are -- and the best part is that the barrier to entry is pretty low, and likely right in your periphery.

Once you have all of that data, the next step is to actually do something with it. All too often, we as engineers poll, collect, visualize, and store a wealth of data, yet only on rare occasions do we actually leverage even a fraction of the potential it can provide. In previous posts, we touched on the usefulness of correlating collected and real-time data. This post takes that a step further. It should be noted that this is not really intended to be a tutorial, but rather a blueprint or, more accurately, a high-level recipe with a rotating and changing list of ingredients. We all like different flavors and levels of spiciness, right?

As noted in the previous post on related attributes, there is a stealthy enemy in our network -- a gremlin, if you will. That gremlin's name is “grey failure,” and it is very hard to detect and even more difficult to plan around. Knowing this, and realizing that there is a large amount of data with related attributes, similar causes, and noticeable effects, we can start to build a framework to aid in this endeavor. We talked about the related attributes of SNMP and NetFlow. Now, let us expand that further into their familial brethren: interface errors and syslog data.

While syslog data may cast a wide, wide net, there are some interesting bits and pieces we can glean from even the most rudimentary logs. Interface error detection will manifest in many ways depending on the platform in use. There may be logging mechanisms for it. It may come as polled data. It could possibly reveal itself as an SNMP trap. The mechanism isn’t really important; what is critical is having the knowledge to understand that a connection is causing an issue with an application. In fact, the application may be a key player in the discovery of an interface issue. Let’s say that an application is working one day, and the next there are intermittent connectivity issues. If the lower protocol is TCP, the problem will be hard to run down without a packet capture, because TCP retransmissions paper over the loss. If, however, the application generates connectivity error logs and sends them to syslog, those logs can be an indicator of an interface issue. From there, it can be ascertained that a path needs to be examined, and the first step in investigating a path is looking at interface errors. Here is the keystone, though: simply looking at interface counters on a router can uncover incrementing counters, but looking at long-term trends will make such an issue very obvious. In the case of UDP, this can be very hard to find, since UDP is functionally connectionless. This is where viewing the network as an ecosystem (as described in a previous blog post) can be very useful: application, system, and network all working together in symbiosis. Flow data can help uncover these UDP issues, and with the help of syslog from an appropriately built application, the job simply becomes a task of correlation.
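
As a concrete illustration of that correlation, here is a minimal Python sketch. It is not tied to any particular collector or product; the data shapes (lists of syslog events and polled IF-MIB error-counter samples), the message matching, and the five-minute window are assumptions made for the example, and in practice you would feed it from whatever store your syslog and SNMP pollers already write to.

```python
from collections import defaultdict

WINDOW_MINUTES = 5  # size of the correlation bucket

def bucket(ts):
    """Round a timestamp down to the start of its 5-minute correlation window."""
    return ts.replace(minute=ts.minute - ts.minute % WINDOW_MINUTES,
                      second=0, microsecond=0)

def correlate(syslog_events, ifin_error_samples):
    """Flag time windows where application connectivity errors (from syslog) and
    interface error-counter increments (from SNMP polling) occur together -- a
    hint that a grey failure may be lurking in the path.

    syslog_events:      iterable of (timestamp, host, message)
    ifin_error_samples: iterable of (timestamp, device, ifname, error_counter)
    """
    app_errors = defaultdict(int)
    for ts, host, msg in syslog_events:
        text = msg.lower()
        if "connection" in text and ("error" in text or "timeout" in text):
            app_errors[bucket(ts)] += 1

    last_counter = {}
    if_error_deltas = defaultdict(int)
    for ts, device, ifname, counter in sorted(ifin_error_samples):
        key = (device, ifname)
        if key in last_counter and counter > last_counter[key]:
            if_error_deltas[bucket(ts)] += counter - last_counter[key]
        last_counter[key] = counter

    # Windows where both signals fire are the ones worth a human's attention.
    return sorted((win, app_errors[win], if_error_deltas[win])
                  for win in app_errors if if_error_deltas.get(win))
```

Even this crude pairing turns two streams of noise into a short list of time windows worth investigating, which is the whole point of treating the data sources as relatives rather than strangers.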

 

 

Eventually, these tasks will become more machine-driven, and the operator and engineer will only need to feed data sources into a larger, smarter, more self-sustaining (and eventually self-learning) operational model. Until then, understanding the important components and the relationships between them will only make for a quieter weekend, a more restful night, and a shorter troubleshooting period when an issue does arise.



For many engineers, operators, and information security professionals, traffic flow information is a key element in performing both daily and long-term strategic tasks. This data usually takes the form of NetFlow versions 5 and 9, IPFIX (also known as NetFlow v10), and sFlow. This tool kit is widely utilized and enables insight into network traffic, performance, and long-term trends. When done correctly, it also lends itself well to security forensics and triage tasks.

 

Having been widely utilized in carrier and large service provider networks for a very long time, this powerful data set has only begun to really come into its own for enterprises and smaller networks in the last few years, as tools for collecting and, more importantly, processing and visualizing it have become more approachable and user-friendly. As the floodgates open to tool kits and devices that can export either sampled flow information or one-to-one flow records, more and more people are discovering and embracing the data. What many do not necessarily see, however, is that correlating this flow data with other information sources, particularly SNMP-based traffic statistics, can make for a very powerful symbiotic relationship.

 

By correlating traffic spikes and valleys over time, it is simple to cross-reference flow telemetry and identify statistically divergent users, applications, segments, and time periods. Now, this is a trivial task for a well-designed flow visualization tool, and it can be accomplished without even looking at SNMP interface statistics. Where the SNMP data provides a different and valuable perspective, however, is in the valley periods when traffic is low. Human nature is to ignore that which is not out of spec or obviously divergent from the baseline, so the key is to look at lulls in interface traffic statistics. View these anomalies as one would a spike, and mine the flow data for pre-event traffic changes. Check TCP flags to find more intricate details of the flows (note: this takes a bit of work, since TCP flags are exported as a cumulative numerical value in NetFlow v5 and v9, but they can provide an additional view into other potential issues). The flags may also be an indicator of soft failures of interfaces along a path, which could manifest as SNMP interface errors that are exported and can be tracked. Think about the instances where this may be useful: soft failures are notoriously hard to detect, and this is a step in the right direction toward doing so. Once this kind of mentality and correlation is adopted, adding even more data sources to the repertoire of relatable data is just a matter of consuming and alerting on them. This falls well within the notion of looking at the network and systems as a relatable ecosystem, as mentioned in this post. Everything is interconnected, and the more expansive the understanding of one part, the more easily it can be related to other, seemingly “unrelated” occurrences.
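
To make the flag arithmetic concrete, here is a small, hedged Python sketch. The bit positions are the standard TCP header flag bits that NetFlow v5/v9 OR together across a flow's packets; the field name tcp_flags, the sample values, and the trouble heuristic are illustrative rather than taken from any particular exporter or collector.

```python
# Cumulative TCP flag bits as exported in NetFlow v5/v9 flow records.
TCP_FLAGS = {
    0x01: "FIN",
    0x02: "SYN",
    0x04: "RST",
    0x08: "PSH",
    0x10: "ACK",
    0x20: "URG",
}

def decode_tcp_flags(value):
    """Turn the numeric tcp_flags field of a flow record into readable names."""
    return [name for bit, name in TCP_FLAGS.items() if value & bit]

def looks_troubled(value):
    """Heuristic: a flow that carried a SYN but never saw an ACK, or that ended
    in a reset, may point at filtering or a soft failure along the path."""
    syn_without_ack = bool(value & 0x02) and not (value & 0x10)
    reset_seen = bool(value & 0x04)
    return syn_without_ack or reset_seen

print(decode_tcp_flags(26))  # 26 = SYN + PSH + ACK -> ['SYN', 'PSH', 'ACK']
print(looks_troubled(2))     # bare SYN, no ACK ever -> True
```

Run against the exported flows from those quiet periods, a filter like this narrows thousands of records down to the handful whose flag patterns deserve a closer look.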

 

 

This handily accomplishes two important things: it builds a table of relational experience in an engineer's or operator's mind, and, if done correctly, it yields a well-oiled, accurate, efficient, and documented workflow for problem analysis and resolution. When this needs to be sold to management, which it will in many environments, proving that most of these tracked analytics can be used in concert for a more complete, more robust, and more efficient monitoring and operational experience may require some hard deliverables, which can prove challenging. However, the prospect of “better efficiency, less downtime” is typically enough to generate interest in at least a few conversations.

Many of us have operated, or currently operate, in a stovepiped or siloed IT environment. For some, this may just be a way of professional life, but regardless of how the organizational structure is put together, having a wide and full understanding of the environment lends itself to a smoother and more efficient system overall. As the separation of duties continues to blur in the IT world, it is becoming increasingly important to shift how we as systems and network professionals view the individual components and the overall ecosystem. The tidal shifts are already underway: Linux is appearing in the switching and routing infrastructure, servers are consuming BGP feeds and making intelligent routing choices, and orchestration workflows are automating the network and the services it provides -- all of these things are slowly creeping into more enterprises, more data centers, and more service providers. What does this mean for the average IT engineer? It typically means that we, as professionals, need to keep abreast of workflows and IT environments as a holistic system rather than a set of distinct silos or disciplines.

 

This mentality is especially important in the monitoring aspects of any IT organization, and it is a good habit to start even before these shifts occur. Understanding the large-scale behavior of IT in your environment allows engineers and practitioners to accomplish significantly more with less -- and that is a win for everyone. Understanding how your servers interact with the DNS infrastructure, the switching fabric, the back-end storage, and the management mechanisms (i.e., hand-crafted curation of configurations, or automation) naturally lends itself to a faster mean time to repair, because you understand the IT organization as a whole rather than just the piece or service you own.

 

One might think, “I don’t need to worry about Linux on my switches and routing on my servers,” and that may be true. However, expanding the knowledge domain from a small box to a large container filled with boxes allows a person to understand not just the attributes of their own box, but the characteristics of all of the boxes together. For example, understanding that a new application makes a DNS query for every single packet it sees, where past applications did local caching, can dramatically decrease the downtime that occurs when the underlying systems hosting DNS become overloaded and slow to respond. The same can be said for moving to cloud services: having a clear baseline of link traffic -- both internal and external -- will make it obvious that the new cloud application requires more bandwidth and perhaps less storage.
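
The baseline idea lends itself to a very small piece of code. The sketch below is purely illustrative -- the sample numbers are invented and the three-sigma threshold is an assumption -- but it shows how little is needed to notice that a link, or a DNS resolver, is suddenly behaving unlike its own history.

```python
import statistics

def check_against_baseline(history, current, sigma=3.0):
    """Compare a current measurement (e.g., Mb/s on an uplink, or DNS queries/sec
    hitting the resolvers) with its historical baseline. Returns the deviation in
    standard deviations and whether it crosses the alerting threshold."""
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history) or 1.0  # avoid dividing by zero
    deviation = (current - mean) / spread
    return deviation, abs(deviation) > sigma

# Hourly uplink averages (Mb/s) from before a cloud migration, then one sample
# taken after the new cloud application goes live. All values are made up.
history = [210, 195, 230, 240, 205, 215, 225, 238, 199, 220]
deviation, alert = check_against_baseline(history, 480)
print(f"{deviation:.1f} sigma from baseline, alert={alert}")  # well above 3 sigma
```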

 

Fear not! This is not a cry to become a developer or a sysadmin. It's not a declaration that there is a hole in the boat, or a dramatic statement that "IT as we know it is over!" Instead, it is a suggestion to look at your IT environment in a new light: see it as a functioning system rather than a set of disjointed bits of hardware with different uses and diverse managing entities (i.e., silos). The network is the circulatory system, the servers and services are the intelligence, the storage is the memory, and the security is the skin and immune system. Can they stand alone on technical merit? Not really. When they work in concert, is the world a happier place to be? Absolutely. Understand the interactions. Embrace the collaborations. Over time, the overall reliability will be far, far higher.

 

Now, while some of these correlations may seem self-evident, piecing them together and, more importantly, tracking them for trends and patterns has real potential to increase the number of well-informed, fact-based decisions being made, and that makes for a better IT environment.

Incident responders: Build or buy?

There is far more to security management than technology. In fact, one could argue that the human element is more important in a field where intuition is just as valuable as knowledge of the tech. In the world of security management, I have not seen a more hotly debated non-technical issue than the figurative “build or buy” question when it comes to hiring incident responders. The polarized camps are the obvious ones:

  • Hire for experience. In this model, the desirable candidate is a mid-career or senior-level, experienced incident responder. The pros and cons are debatable:
    • More expensive
    • Potentially set in their ways
    • Can hit the ground running
    • Low management overhead
  • Hire for ability. In this model, a highly motivated but less experienced engineer is hired and molded into as close to what the enterprise requires as possible. Using this methodology, the caveats and benefits are a bit different, as it is a longer-term strategy:
    • Less expensive
    • “Blank slate”
    • Requires more training and attention
    • Initially less efficient
    • More unknowns due to lack of experience
    • Can potentially become exactly what is required
    • May leave after a few years

In my stints managing and being involved with hiring, I have found it difficult to find a qualified, senior-level engineer or incident responder who has the personality traits conducive to melding seamlessly into an existing environment. That is not to say it isn’t possible, but soft skills are a lost art in technology, and especially so in development and security. In my travels, sitting on hiring committees and writing job descriptions, I have found that the middle ground is the key. Mid-career, still-hungry incident responders with a budding amount of intuition have been the blue chips in the job searches and hires I have been involved with. They tend to have the fundamentals and a formed gut instinct that make them incredibly valuable, and at the same time they are very open to mentorship. Now, the downside is that 40% of the time they’re going to move on just when they’re really useful, but the 60% that stick around a lot longer? They are often the ones who think outside the box and keep the team fresh.

What seems like a lifetime ago, I worked for a few enterprises doing various things like firewall configuration, email system optimization, and hardening of NetWare, NT4, AIX, and HP-UX servers. There were three good-sized employers: a bank and two huge insurance companies, both of which had financial components. While working at each and every one of them, I was subject to their security policies (one of which I helped to craft, but that is a different path altogether), and none of those policies really addressed data retention. When I left those employers, they archived my home directories, remaining mailboxes, and whatever other artifacts I left behind. None of this was really an issue for me, as I never brought in any personal or sensitive data, and everything I generated on site was theirs by the nature of what it was. What did not occur to me then, though, was that this was essentially a digital trail of breadcrumbs that could exist indefinitely. What else was left behind, and was it also archived? Mind you, this was in the 1990s and network monitoring was fairly clunky, especially at scale, so the likely answer is "nothing," but I assert that the answer has changed significantly in this day and age.

Liability is a hard pill for businesses to swallow. Covering the bases is key, and that is where data retention becomes a double-edged sword. Playing a lawyer on TV for a moment: keeping data on hand is useful for forensic analysis of potentially catastrophic data breaches, but it can also be a liability, in that it can prove culpability when employees misbehave on corporate time, with corporate resources, or on the company's behalf. Is it worth it?

Since that time oh so long ago, I have found that the benefit of retaining the information has far outweighed the risk, especially for traffic data such as proxy logs, firewall logs, and network flows. The real issues I have, as noted in previous posts, are the correlation of said data and, more often than not, the archival of what can amount to massive amounts of disk space.
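
To put that disk space into perspective, a rough, back-of-the-envelope sizing looks something like the sketch below. The rates, record sizes, and retention period are purely illustrative assumptions; measure your own exporters before buying anything.

```python
def retention_bytes(records_per_second, bytes_per_record, days):
    """Order-of-magnitude sizing for flow or log retention. Every input here is
    an assumption to be replaced with numbers measured in your own environment."""
    return records_per_second * bytes_per_record * 86_400 * days

# e.g., roughly 5,000 flow records/sec at ~100 bytes each (record plus index
# overhead), kept for 90 days:
size = retention_bytes(5_000, 100, 90)
print(f"{size / 1e12:.1f} TB")  # about 3.9 TB, before any compression
```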

If I can offer one nugget of advice, learned through years of having to decide what goes, what stays, and for how long, it is this: buy the disks. Procure the tape systems. Do whatever you need to do to keep as much of the data as you can get away with, because once it is gone, it is highly unlikely that you will ever get it back.

Of all the security techniques, few garner more polarized views than the interception and decryption of trusted protocols. There are many reasons to do it, and a great deal of legitimate concern about compromising the integrity of a trusted protocol like SSL. SSL is the most common protocol to intercept, unwrap, and inspect, and accomplishing this has become easier and requires far less operational overhead than it did even five years ago. Weighing those concerns against the information that can be gained by cracking the traffic open and looking at its content is often a struggle for enterprise security engineers because of the privacy it implies. In previous lives, I have personally struggled to reconcile this, but ultimately decided that the ethics involved in what I consider a violation of implied security outweighed the benefit of SSL intercept. With other options being few, blocking protocols that obfuscate their content seems to be the next logical option; however, with the prolific increase of SSL-enabled sites over the last 18 months, even that option seems unrealistic and, frankly, clunky. Exfiltrated data -- anything from personally identifiable information to trade secrets and intellectual property -- is becoming a more and more common "currency," ever more desirable and lucrative to transport out of businesses and other entities. These are hard problems to solve.

Are there options out there that make better sense? Are large and medium-sized enterprises doing SSL intercept? How is the data being analyzed and stored?

Given the current state of networking and security -- the prevalence of DDoS attacks such as the NTP monlist attack and SNMP and DNS amplification, very targeted techniques like doxing, and, most importantly to many enterprises, the exfiltration of sensitive data -- network and security professionals are forced to look at creative and often innovative means of ascertaining information about their networks and traffic patterns. This can sometimes mean finding and collecting data from many sources and correlating it, or, in extreme cases, obtaining access to otherwise secure protocols.

Knowing your network and computational environment is absolutely critical to classifying and detecting anomalies and potential security infractions. In today’s hostile environments, which have often had to grow organically over time, and given the importance and often considerable expense of obtaining, analyzing, and storing this information, what creative ways are being used to accomplish these tasks? How is the correlation being done? Are enterprises and other large networks utilizing techniques like full packet capture at their borders? Are you performing SSL intercept and decryption? How is correlation and cross-referencing of security and log data accomplished in your environment? Is it tied into any performance or outage data sources?
