Data, data, data. You want all of the data, right? Of course you do. Collecting telemetry and logging data is easy. We all do it and we all use it from time to time. Interrupt-driven networking is a way of life and is more common than any other kind (e.g., automation and orchestration-based models) because that is how the vast majority of us learned. “Get it working, move to the next fire.” What if there were more ways to truly understand what is happening within our sphere of control? Well, I believe there are -- and the best part is that the barrier to entry is pretty low, and what you need is likely right in your periphery.
Once you have all of that data, the next step is to actually do something with it. All too often, we as engineers poll, collect, visualize, and store a wealth of data, yet only on rare occasions do we leverage even a fraction of its potential. In previous posts, we touched on the usefulness of correlating collected and real-time data. This post takes that a step further. It should be noted that this is not really intended to be a tutorial, but instead more of a blueprint or, more accurately, a high-level recipe with a rotating and changing ingredient list. We all like different flavors and levels of spiciness, right?
As noted in the previous post on related attributes, there is a stealthy enemy in our network -- a gremlin, if you will. That gremlin's name is “grey failure,” and it is very hard to detect and even more difficult to plan around. Knowing this, and realizing that there is a large amount of data that has related attributes, similar causes, and noticeable effects, we can start to build a framework to aid in detecting it. We talked about the related attributes of SNMP and NetFlow. Now, let us expand that further into the familial brethren of interface errors and syslog data.
While syslog data may be a wide, wide net, there are some interesting bits and pieces we can glean out of even the most rudimentary logs. Interface errors will manifest in many ways depending on the platform in use. There may be logging mechanisms for them. They may come as polled data. They could possibly reveal themselves as an SNMP trap. The mechanism isn’t really important. However, being able to recognize that a connection is causing an issue with an application is critical. In fact, the application may be a key player in the discovery of an interface issue. Let’s say that an application is working one day and the next there are intermittent connectivity issues. If the underlying transport is TCP, the problem will be hard to run down without a packet capture, because TCP retransmissions can hide packet loss. If, however, this application generates connectivity error logs and sends them to syslog, then those logs can be an indicator of an interface issue. From here it can be ascertained that there is a need to look at a path, and the first step of investigating a path is looking at interface errors. Here is the keystone, though. Simply looking at the interface counters on a router can uncover incrementing counters, but looking at long-term trends will make such an issue very obvious. In the case of UDP, the problem can be very hard to find, since UDP is connectionless and has no built-in acknowledgment or retransmission. This is where viewing the network as an ecosystem (as described in a previous blog post) can be very useful. Application, system, network, all working together in symbiosis. Flow data can help uncover these UDP issues, and with the help of syslog from an appropriately built application, the job simply becomes a task of correlation.
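To make that correlation idea a little more concrete, here is a minimal Python sketch, not a finished tool. It assumes you already have two things in hand: cumulative interface error counter samples (however they were polled) and the timestamps of connectivity errors the application sent to syslog. The function names (`bucket`, `counter_deltas`, `correlate`), the five-minute window, and the sample values are all made up for illustration; swap in whatever ingredients fit your own recipe.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def bucket(ts, width):
    """Floor a naive timestamp to the start of its time bucket."""
    seconds = int((ts - EPOCH).total_seconds())
    return EPOCH + timedelta(seconds=seconds - seconds % int(width.total_seconds()))

def counter_deltas(samples):
    """Turn cumulative (timestamp, counter) samples into per-interval increases.

    Each increase is attributed to the start of its polling interval.
    Negative differences (counter reset or wrap) are clamped to zero.
    """
    return [(t_prev, max(cur - prev, 0))
            for (t_prev, prev), (_, cur) in zip(samples, samples[1:])]

def correlate(error_samples, app_log_times, width=timedelta(minutes=5)):
    """Return (bucket, interface_errors, app_errors) for buckets where both are non-zero."""
    iface = Counter()
    for ts, delta in counter_deltas(error_samples):
        iface[bucket(ts, width)] += delta
    app = Counter(bucket(ts, width) for ts in app_log_times)
    return sorted((b, iface[b], app[b]) for b in iface if iface[b] > 0 and app[b] > 0)

# Hypothetical data: a counter polled every five minutes, and three app log entries.
t0 = datetime(2024, 1, 1, 12, 0)
samples = [(t0 + timedelta(minutes=5 * i), c)
           for i, c in enumerate([100, 100, 100, 137, 180, 180])]
app_times = [t0 + timedelta(minutes=m) for m in (16, 17, 40)]

suspects = correlate(samples, app_times)
for when, iface_errors, app_errors in suspects:
    print(when, iface_errors, app_errors)
```

The interesting output is the list of time windows where the counter climbed and the application complained at the same time. A single overlap proves little on its own; it is the repeated overlap across days or weeks of stored data that turns a hunch into a path worth investigating.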
Eventually these tasks will become more machine-driven, and the operator and engineer will only need to feed data sources into a larger, smarter, more self-sustaining (and eventually self-learning) operational model. Until then, understanding the important components and relations between them will only make for a quieter weekend, a more restful night, and a shorter troubleshooting period in the case of an issue.