Seeing the Big Picture: Give Me All of the Data

The collection of operational and analytics information can be an addictive habit, especially on an interesting and active network. However, this information can quickly and easily overwhelm an aggregation system once the constant fire hose of information begins. Assuming the goal is to collect and use this information, it becomes clear that a proper strategy is required. That strategy should be built from a number of elements, including consideration of needs and requirements before a project of this scope begins.

In practice, gathering requirements will likely happen in parallel or, as with many things in networking, be adjusted on demand, building the airplane while it is in flight. Course correction should be an oft-used tool in any technologist's toolbox. New peaks and valleys, pitfalls, and advantages should feed into the constant evaluation that occurs in any dynamic environment. Critical parts of this strategy apply to nearly all endeavors of this kind. But even before that, the reasoning and use cases should be identified.

A few of the more important questions that need to be answered are:

  • What data is available?
  • What do we expect to do with the data?
  • How can we access the data?
  • Who can access what aspects of the data types?
  • Where does the data live?
  • What is the retention policy on each data type?
  • What is the storage model of the data? Is it encrypted at rest? Is it encrypted in transit?
  • How is the data ingested?

Starting with these questions can dramatically simplify and smooth the execution process of each part of the project. The answers to these questions may change, too. There is no fault in course correction, as mentioned above. It is part of the continuous re-evaluation process that often marks a successful plan.

Given these questions, let’s walk through a workflow to understand the reasoning behind them and to illuminate the usefulness of a solid monitoring and analytics plan. “What data is available?” will drive a huge number of the remaining questions and their answers. Let’s assume the goal is to consume network flow information, system and host log data, polled SNMP time series data, and latency information. Clearly this is a large set of very diverse information, all of which should be readily available. The first mistake most engineers make is diving straight into the weeds of which tools to use. That is a solved problem, and frankly it is far less relevant to the overall project than the rest of the questions. Use the tools that you understand, that you can afford to operate (both fiscally and operationally), and that provide the interfaces you need. Set that detail aside, as the answers to some of the other questions may decide it for you.
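
Before moving on, it helps to capture the answers to those questions in one reviewable place. Below is a minimal sketch in Python of such a data inventory; the field values (retention periods, access groups, storage targets) are illustrative assumptions, not recommendations.

    # A minimal sketch of a data inventory recording the answers to the
    # planning questions above. All values are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class DataSource:
        name: str            # what data is available?
        ingest: str          # how is the data ingested?
        store: str           # where does the data live?
        retention_days: int  # retention policy for this data type
        encrypted_at_rest: bool
        access: str          # who can access this data type?

    inventory = [
        DataSource("netflow", "flow export (UDP)", "indexer", 90, False, "netops"),
        DataSource("syslog", "syslog (UDP/TCP)", "indexer", 180, False, "netops+sec"),
        DataSource("snmp", "poller (5 min)", "rrd", 730, False, "netops"),
        DataSource("latency", "active probes", "rrd", 365, False, "netops"),
    ]

    for ds in inventory:
        print(f"{ds.name:8} -> {ds.store:8} retain {ds.retention_days}d")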

How will we store the data? Time series is easy: that typically goes into an RRD (round-robin database). Will there be a need for complex queries against things like NetFlow records and other text, such as syslog? If so, an indexing tool may be needed, and there are many commercial and open source options. Keep in mind that this is one of the more nuanced parts of the plan, as the answer here may change the answers to the other questions, specifically retention, access, and storage location. Data storage is the hidden bane of an analytics system: disk isn't expensive, but provisioning it correctly, and on budget, is hard. Whatever disk space is required, always, always, always add headroom. It will be needed later; otherwise, the retention policy will have to be adjusted.
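
One useful property of the RRD approach is that a round-robin database is sized at creation time, which forces the headroom and retention decisions into the open. Below is a minimal sketch using the python-rrdtool bindings (an assumption; the rrdtool command line takes the same arguments), with an illustrative interface counter and archive windows.

    # A minimal sketch of round-robin time series storage via the
    # python-rrdtool bindings (assumed installed); names are illustrative.
    import rrdtool

    # One counter polled every 300 seconds via SNMP (e.g., ifInOctets).
    rrdtool.create(
        "router1_ifInOctets.rrd",
        "--step", "300",                 # polling interval in seconds
        "DS:inoctets:COUNTER:600:0:U",   # 600s heartbeat, no upper bound
        "RRA:AVERAGE:0.5:1:2016",        # 5-minute averages for one week
        "RRA:AVERAGE:0.5:12:1464",       # 1-hour averages for ~61 days
    )

    # Each poll appends one timestamped sample; "N" means "now".
    rrdtool.update("router1_ifInOctets.rrd", "N:1234567")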

Encryption comes into play here as well. Typical security practice is to encrypt in flight and at rest, but in many cases this isn’t feasible (think router syslog). Encryption at rest also carries a fairly heavy cost, both one-time (CPU cycles to encrypt on ingest) and perpetual (decryption on every access). In many cases, encryption simply cannot be justified. Exceptions should be documented and the risks formally accepted, providing a record of the decision and of management's acceptance of the risk on the off chance that sensitive information is leaked or exfiltrated.
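
The one-time versus perpetual cost split is easy to demonstrate. The rough sketch below times encrypt-once against decrypt-on-every-read, assuming the Python cryptography package's Fernet recipe (any authenticated cipher shows the same shape).

    # A rough sketch of encryption-at-rest overhead, using the
    # "cryptography" package's Fernet recipe (an assumption).
    import time
    from cryptography.fernet import Fernet

    f = Fernet(Fernet.generate_key())
    record = b"<189>Oct 11 22:14:15 rtr1 %LINK-3-UPDOWN: Gi0/1, down"

    start = time.perf_counter()
    tokens = [f.encrypt(record) for _ in range(10_000)]  # paid once, at ingest
    encrypt_cost = time.perf_counter() - start

    start = time.perf_counter()
    for token in tokens:
        f.decrypt(token)                                 # paid on every read
    decrypt_cost = time.perf_counter() - start

    print(f"encrypt: {encrypt_cost:.2f}s  decrypt: {decrypt_cost:.2f}s")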

With all of this data, what is the real end goal? Simple: a baseline. Nearly all monitoring and measurement systems provide, at their most elemental level, a baseline. Knowing how something normally operates is fundamental to successful management of any resource, and networks are no exception. With stored statistical information, it becomes significantly easier to identify deviations when issues arise. Functionally, any data collected will likely be useful at some point, provided it is available and referenced. A solid plan for how the statistical data is handled is the foundation for ensuring those deliverables are met.
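
As a minimal illustration of what a baseline buys you, the sketch below flags a sample that deviates from recent history; the window values and three-sigma threshold are assumptions, not recommendations.

    # A minimal baselining sketch: flag samples that deviate from history.
    # The history window and threshold are illustrative assumptions.
    from statistics import mean, stdev

    history = [42.1, 40.8, 43.5, 41.2, 44.0, 42.7, 39.9, 43.1]  # % utilization
    current = 71.3

    mu, sigma = mean(history), stdev(history)
    if abs(current - mu) > 3 * sigma:   # simple three-sigma deviation test
        print(f"anomaly: {current}% vs baseline {mu:.1f}% +/- {sigma:.1f}%")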

  • I am one that has a two-car garage with one car in it. As a former Boy Scout, I'd rather have it and not need it than need it and not have it, because if I am considering keeping it, I usually end up needing it. That obviously comes with a price, both monetary and operational. My current mechanism, which I have found to be very, very useful, is to sock all of that data into an indexer with an API. There are a few options, both commercial and FOSS, and they're all well developed to do exactly this. I prefer something with an API so that I can write my own queries against it, but realistically it doesn't really matter as long as there is an access and query method that works within your operational model.

    I would also assert that having lots of data is necessary if you DO know the questions you need to answer, because the answers can (and usually do) lead to more questions. The real issue I think you're hinting at is retention. How *long* do I need to keep it? Having all of the data is one thing; storing it indefinitely is quite another. Long-term storage can live on a less expensive, cold medium, while ready-access data can be kept for weeks or a few months. The larger the environment, the more important it is to keep more data longer. A rough sketch of this kind of tiering follows below.
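
    As a rough sketch of that age-based tiering (tier names and cutoffs here are illustrative assumptions, not a policy):

        # A rough sketch of age-based retention tiering; cutoffs are
        # illustrative assumptions, not a recommended policy.
        def storage_tier(age_days: int, retention_days: int) -> str:
            # Decide where a record should live based on its age.
            if age_days > retention_days:
                return "expire"   # past the retention policy: delete
            if age_days > 60:
                return "cold"     # cheap, slow medium for long-term storage
            return "hot"          # ready-access indexer for recent data

        for age in (3, 90, 800):
            print(age, "days ->", storage_tier(age, retention_days=730))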

  • Data Science Engineering is one of the biggest new growth areas in Computer Science right now.

  • Agreed. Analyzing what to keep, where to keep it, and for how long is usually done post-disaster. Doing so in advance saves a lot of issues in the future, makes much better use of existing resources, and prepares for expansion.

  • Data Warehouse = Sticker Shock = Nevermind
