Fiefdoms, Silos, and Plumbing

In the world of information technology, those of us who are tasked with providing and maintaining the networks, applications, storage, and services to the rest of the organization, are increasingly under pressure to provide more accurate, or at least more granular, service level guarantees. The standard quality of service (QoS) mechanisms we have used in the past are becoming more and more inadequate to properly handle the disparate types of traffic we are seeing on the wire today. In order to continue to successfully provide services in a guaranteed, deliberate, measurable, and ultimately very accurate manner, is going to require different tools and additional focus on increasingly more all encompassing ecosystems. Simply put: our insular fiefdoms are not going to cut it in the future. So, what are we going to do about the problem? What can we do to increase our end to end visibility, tracking, and service level guarantees?

One of the first things we ought to do is make certain that we have, at the very least, implemented some baseline quality of service policies. Things like separating video, voice, regular data, high priority data, control plane data, etc., seem like the kind of thing that should be a given, but every day I am surprised by another network that has very poorly deployed what QoS they do have. Often I see video and voice in the same class, and I see no class for control plane traffic; my guess is no policing either, but that is another topic for another day. If we cannot succeed at the basics, we most certainly should not be attempting anything more grandiose until we can fix the problems of today.

I have written repeatedly on the need to break down silos in IT, to get away from the artificial construct that says one group of people only control one area of the network, and have only limited interaction with other teams. Many times, as a matter of fact, I see such deep and ingrained silos that the different departments do not actually converge, from a leadership perspective, until the CIO. This unnecessarily obfuscated the full network picture from pretty much everyone. Server teams know what they have and control, storage teams are the same, and on down the line it goes with nobody really having an overall picture of things until you get far enough into the upper management layer that the fixes become political, and die by the proverbial committee.

In order to truly succeed at providing visibility to the network, we need to merge the traditional tools, services, and methodologies we have always used, with the knowledge and tools from other teams. Things like application visibility, hooks into virtualized servers, storage monitoring, wireless and security and everything in between need to be viewed as one cohesive structure on which service guarantees may be given. We need to stop looking at individual pieces, applying policy in a vacuum, and calling it good. When we do this it is most certainly not good or good enough.

We really don’t need QoS, we need full application visibility from start to finish. Do we care about the plumbing systems we use day to day? Not really, we assume they work effectively and we do not spend a lot of time contemplating the mechanisms and methodologies of said plumbing. In the same way, nobody for whom the network is merely a transport service cares about how things happen in the inner workings of that system, they just want it to work. The core function of the network is to provide a service to a user. That service needs to work all of the time, and it needs to work as quickly as it is designed to work. It does not matter to a user who is to blame when their particular application quits working, slows down, or otherwise exhibits unpleasant and undesired tendencies, they just know that somewhere in IT, someone has fallen down on the job and abdicated one of their core responsibilities: making things work.

I would suggest that one of the things we should certainly be implementing, looking at, etc., is a monitoring solution that can not only tell us what the heck the network routers, switches, firewalls, etc., are doing at any given time, but one in which applications, their use of storage, their underlying hardware—virtual, bare metal, containers—and their performance are measured as well. Yes, I want to know what the core movers and shakers of the underlying transport infrastructure are doing, but I also want visibility into how my applications are moving over that structure, and how that data becomes quantifiable as relates to the end user experience.

If we can get to a place where this is the normal state of affairs rather than the exception, using an application framework bringing everything together, we’ll be one step closer to knowing what the heck else to fix in order to support our user base. You can’t fix what you don’t know is a problem, and if all groups are in silos, monitoring nothing but their fiefdoms, there really is not an effective way to design a holistic, network-wide solution to the quality of service challenges we face day to day. We will simply do what we have always done and deploy small solutions, in a small way, to larger problems, then spend most of our time tossing crap over the fence to another group with a “it’s not the network” thrown in as well. It’s not my fault, it must be yours. And at the end of the day, the users are just wanting to know why the plumbing isn’t working and the toilets are all backed up.

Thwack - Symbolize TM, R, and C