In my last post on this topic, I described some scenarios where an outage was significantly extended primarily because, although the infrastructure had been discovered and was being managed, a true understanding of it remained elusive. In this post, I will meditate on what else we might need in order to find peace with our network or, as network defender so eloquently put it, "Know the network. Be the network."
Managing Your Assets
Virtualization is both the best and the worst thing to have happened to our data centers. The flexibility and efficiency of virtualization have changed the way we work with our compute resources, but from a management perspective it's a bit of a nightmare. If you have been around a baby at some point in your life, you may recognize that when a baby is very small you can put it down on its back and it will lie there in the same spot; you can leave the room for five minutes, and when you come back, the baby will be exactly where you put it. At some point, though, it all changes. If you put the baby down, you'd better watch the poor thing constantly because it has no interest in staying put; leave the room for five minutes and you may spend the next half an hour trying to find out exactly where it got to. And so it is - or at least it feels - with virtualization. A service that in the past would have had a fixed location on a specific server in a particular data center is now simply an itinerant workload looking for a compute resource on which to execute. As a result, it can move around the data center at the will of the server administrators, and it can even migrate between data centers. This is the parental equivalent of leaving the room for five minutes and coming back to find that your baby is now in next door's back yard.
We should care about this because understanding the infrastructure properly means understanding dependencies.
Know Your Dependencies
Let's look at some virtual machines (VMs) and see where there might be dependencies:
In this case, there are users on the Internet connecting to a load balancer, which will send their sessions onward to, say, VM1. In a typical environment with element management systems, we can identify if any of the systems in this diagram fail. What's less clear, however, is what the impact of that failure will be. Let's say the SERVER fails; who is affected? We need to know immediately that VM1, VM2, and VM3 will be unavailable, and from an application perspective, I need to know that those virtual machines are my web farm, so web services will be unavailable. I can identify service-level problems because I know what my VMs do, and I know where they are currently active. If a VM moves, I need to know about it so that I can keep my dependencies accurate.
The hypervisor shown is using NFS mounts to populate what the VMs will see as attached storage. It should be clear from the diagram that if anything in the path between the hypervisor and the SAN fails, while the hypervisor will initially be the one complaining, it won't take long before the VMs complain as well.
From an alerting perspective, knowing these dependencies means that:
- I could try to suppress or otherwise categorize alerts that are downstream from the main error (e.g., I don't want to see 100 NFS alerts per second when I already know that the SAN has failed);
- I know what impact a particular failure will have on the services provided by the infrastructure.
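Those two points can be sketched as a simple walk over a dependency map. The map below is a hypothetical rendering of the diagram above; the component names, and the idea of storing direct dependents in a plain dictionary, are my assumptions for illustration, not the output of any real tool:

```python
# Hypothetical dependency map based on the diagram: each component
# maps to the components and services that depend on it directly.
DEPENDENTS = {
    "SAN": ["HYPERVISOR"],
    "HYPERVISOR": ["VM1", "VM2", "VM3"],
    "SERVER": ["VM1", "VM2", "VM3"],
    "VM1": ["web-service"],
    "VM2": ["web-service"],
    "VM3": ["web-service"],
}

def impacted(component):
    """Walk the dependency map to find everything downstream of a failure."""
    seen = set()
    stack = [component]
    while stack:
        for dep in DEPENDENTS.get(stack.pop(), []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

def should_suppress(alert_source, known_failures):
    """Suppress an alert whose source is downstream of a known failure."""
    return any(alert_source in impacted(f) for f in known_failures)
```

With this in place, a SAN failure yields `impacted("SAN")` covering the hypervisor, all three VMs, and the web service, so the flood of NFS alerts from the VMs can be suppressed while the service-level impact is reported immediately.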
Application Level Dependencies
It may also be important to dig inside the servers themselves and monitor the server's performance, the applications on that server, and the server's view of the infrastructure. For example, if it is reported that MS SQL Server has a problem on a particular node, we can infer that applications dependent on that database service will also be impacted. It's possible that everything in the infrastructure is nominally OK, but there is an application or process problem on the server itself, or perhaps the VM is simply running at capacity. Tools like SolarWinds' Server & Application Monitor are very helpful when it comes to getting visibility beyond the system level. Used with knowledge of an application's service role, they can make a huge difference in pre-empting problems, quickly identifying the root cause of emergent ones, and using that information to ignore the downstream errors and focus on the real issue.
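One minimal way to capture that "knowledge of an application's service role" is a lookup from (node, application) pairs to the business services that depend on them. Everything here is illustrative; the node names, application names, and service names are invented for the sketch:

```python
# Hypothetical mapping of monitored applications to the business
# services that depend on them; all names are illustrative only.
SERVICE_ROLES = {
    ("db-node-1", "MS SQL Server"): ["order-entry", "reporting"],
    ("web-vm-2", "IIS"): ["public-website"],
}

def services_impacted(node, application):
    """Given an application-level alert, return the dependent services."""
    return SERVICE_ROLES.get((node, application), [])
```

An application alert for MS SQL Server on `db-node-1` then translates directly into "order entry and reporting are at risk," rather than a bare component failure.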
Long Distance Relationships
Let's take this a step further. Imagine that VM3 in the local datacenter runs a database that is used by many internal applications. In another location there's a web front end that accesses that database. An alert comes in that there is a high rate of dropped packets on the SAN's connection to the network. Shortly after, there are alerts from VM3 complaining about read/write failures. It stands to reason that if there's a problem with the SAN, there will be problems with the VMs, because the hypervisor uses NFS mounts for the VM hard drives. We should therefore fully anticipate that there will be problems with the web front end even though it's located somewhere else, and when those alerts come in, we don't want to waste any time checking for errors with the web server farm or the WAN link. In fact, it might be wise to proactively alert the helpdesk about the issue so that when a call comes in, no time will be wasted trying to replicate the problem or track down the issue. Maybe we can update a status page with a notice that there is potential impact to some specific services, and thus avoid further calls and emails. Suddenly, the IT group is looking pretty good in the face of a frustrating problem.
Service Oriented Management
One of the biggest challenges with element management systems is right there in the name: they manage technologies, not services. Each one, reasonably enough, is focused on managing a specific technology to the best of its abilities, but without some contextual information and an understanding of dependencies, the information being gathered cannot be used to its full potential. That's not to say that element management systems have no value; far from it, and for the engineers responsible for that technology, they are invaluable. Businesses, however, don't typically care about elements; they care about services. When there's a major outage and you call the CTO to let them know, it's one thing to tell them that "SAN-17 has failed!" but the first question a CTO should be asking in response is "What services are impacted by this?" If your answer to that kind of question would be "I don't know," then no matter how many elements you monitor and how frequently you poll them, you don't fully know your infrastructure, and you'll never reach a state of inner peace.
I'm curious to know: do Thwack users feel they have a full grip not just on the infrastructure itself, but on the services using that infrastructure and the dependencies in place?
In my next post, I'll be looking at more knowledge challenges: baselining the infrastructure and identifying abandoned compute resources.