Zen And The Ongoing Art of Infrastructure Monitoring

Level 13

In my last post on this topic, I described some scenarios where an outage was significantly extended primarily because, although the infrastructure had been discovered and was being managed, a true understanding of it remained elusive. In this post, I will meditate on what else we might need in order to find peace with our network or, as network defender so eloquently put it, "Know the network. Be the network."

Managing Your Assets

Virtualization is both the best and the worst thing to have happened to our data centers. The flexibility and efficiency of virtualization have changed the way we work with our compute resources, but from a management perspective it's a bit of a nightmare. If you have been around a baby at some point in your life, you may recognize that when a baby is very small you can put it down on its back and it will lie there in the same spot; you can leave the room for five minutes, and when you come back, the baby will be exactly where you put it. At some point though, it all changes. If you put the baby down, you'd better watch the poor thing constantly because it has no interest in staying put; leave the room for five minutes and you may spend the next half an hour trying to find out exactly where it got to. And so it is - or at least it feels - with virtualization. A service that in the past would have had a fixed location on a specific server in a particular data center is now simply an itinerant workload looking for a compute resource on which to execute. As a result, it can move around the data center at the will of the server administrators, and it can even migrate between data centers. This is the parental equivalent of leaving the room for five minutes and coming back to find that your baby is now in next door's back yard.

We should care about this because understanding the infrastructure properly means understanding dependencies.

Know Your Dependencies

Let's look at some virtual machines (VMs) and see where there might be dependencies:


In this case, there are users on the Internet connecting to a load balancer, which will send their sessions onward to, say, VM1. In a typical environment with element management systems, we can identify if any of the systems in this diagram fail. However, what's less clear is what the impact of that failure will be. Let's say the SERVER fails; who is affected? We need to know immediately that VM1, VM2, and VM3 will be unavailable, and from an application perspective, I need to know that those virtual machines are my web farm, so web services will be unavailable. I can identify service-level problems because I know what my VMs do, and I know where they are currently active. If a VM moves, I need to know about it so that I can keep my dependencies accurate.
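As a rough illustration of keeping that VM-to-server mapping current, here is a minimal Python sketch. It assumes you can obtain periodic snapshots of which host each VM is running on; the snapshot format, the host names, and the function names are all hypothetical and not taken from any particular product.

```python
def detect_moves(previous, current):
    """Compare two {vm: host} snapshots and return {vm: (old_host, new_host)}."""
    moves = {}
    for vm, new_host in current.items():
        old_host = previous.get(vm)
        # Only report VMs that existed before and are now on a different host.
        if old_host is not None and old_host != new_host:
            moves[vm] = (old_host, new_host)
    return moves


def vms_on(host, placement):
    """List the VMs that a failure of `host` would take down, per the snapshot."""
    return sorted(vm for vm, h in placement.items() if h == host)


# Hypothetical snapshots taken a few minutes apart.
earlier = {"VM1": "SERVER", "VM2": "SERVER", "VM3": "SERVER"}
later = {"VM1": "SERVER", "VM2": "SERVER", "VM3": "SERVER-2"}

moved = detect_moves(earlier, later)
affected_if_server_fails = vms_on("SERVER", later)
```

With the later snapshot, a failure of SERVER would only be expected to affect VM1 and VM2, because VM3 has wandered off to another host.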

The hypervisor shown is using NFS mounts to populate what the VMs will see as attached storage. It should be clear from the diagram that if anything in the path between the hypervisor and the SAN fails, the hypervisor will initially be the one complaining, but it won't take too long before the VMs complain as well.

From an alerting perspective, knowing these dependencies means that:

  • I could try to suppress or otherwise categorize alerts that are downstream from the main error (e.g. I don't want to see 100 NFS alerts per second when I already know that the SAN has failed);
  • I know what impact a particular failure will have on the services provided by the infrastructure.
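To make the first point concrete, here is a minimal Python sketch of separating probable root causes from downstream symptoms. The topology is an assumption loosely based on the elements described above (VMs on a SERVER that depends on a SAN), and the function names are hypothetical; a real monitoring tool would build this map from discovery data.

```python
from collections import deque

# Hypothetical dependency map: each key depends on the elements in its list.
DEPENDS_ON = {
    "VM1": ["SERVER"],
    "VM2": ["SERVER"],
    "VM3": ["SERVER"],
    "SERVER": ["SAN"],
}


def upstream_of(element, depends_on=DEPENDS_ON):
    """Return every element that `element` depends on, directly or indirectly."""
    seen = set()
    queue = deque(depends_on.get(element, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(depends_on.get(node, []))
    return seen


def split_alerts(alerting_elements, depends_on=DEPENDS_ON):
    """Split alerting elements into (probable root causes, downstream symptoms)."""
    alerting = set(alerting_elements)
    roots, symptoms = set(), set()
    for element in alerting:
        # If something this element depends on is also alerting,
        # treat this alert as a symptom rather than a root cause.
        if upstream_of(element, depends_on) & alerting:
            symptoms.add(element)
        else:
            roots.add(element)
    return roots, symptoms


roots, symptoms = split_alerts(["SAN", "SERVER", "VM3"])
```

In this example the SAN alert is the one to look at, and the SERVER and VM3 alerts could be suppressed or grouped underneath it.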

Application Level Dependencies

It may also be important to dig inside the servers themselves and monitor the server's performance, the applications running on that server, and the server's view of the infrastructure. For example, if it is reported that MS SQL Server has a problem on a particular node, we can infer that applications dependent on that database service will also be impacted. It's possible that everything in the infrastructure is nominally OK, but there is an application or process problem on the server itself, or perhaps the VM is simply running at capacity. I will say that tools like SolarWinds' Server & Application Monitor are very helpful when it comes to getting visibility beyond the system level. When used with knowledge of an application's service role, this visibility can make a huge difference in pre-empting problems, quickly identifying the root cause of emergent problems, and using that information to ignore the downstream errors and focus on the real issue.

Long Distance Relationships

Let's take this a step further. Imagine that VM3 in the local data center runs a database that is used by many internal applications. In another location there's a web front end which accesses that database. An alert comes in that there is a high level of dropped packets on the SAN's connection to the network. Shortly after, there are alerts from VM3 complaining about read/write failures. It stands to reason that if there's a problem with the SAN, there will be problems with the VMs because the hypervisor uses NFS mounts for the VM hard drives. We should therefore fully anticipate that there will be problems with the web front end even though it's located somewhere else, and when those alerts come in, we don't want to waste any time checking for errors with the web server farm or the WAN link. In fact, it might be wise to proactively alert the helpdesk about the issue so that when a call comes in, no time will be wasted trying to replicate the problem or track down the issue. Maybe we can update a status page with a notice that there is potential impact to some specific services, and thus avoid further calls and emails. Suddenly, the IT group is looking pretty good in the face of a frustrating problem.
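The scenario above can be sketched as a walk down a dependency chain. This is only an illustration: the element names, the shape of the map, and the wording of the notice are all assumptions made for the example.

```python
from collections import deque

# Hypothetical map: each key is an element, each value lists what depends on it.
DEPENDENTS = {
    "SAN": ["HYPERVISOR"],
    "HYPERVISOR": ["VM3"],
    "VM3": ["INTERNAL_DATABASE"],
    "INTERNAL_DATABASE": ["REMOTE_WEB_FRONT_END"],
}


def impacted_by(element, dependents=DEPENDENTS):
    """Breadth-first walk returning everything downstream of `element`."""
    impacted = []
    seen = {element}
    queue = deque([element])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


def helpdesk_notice(element, dependents=DEPENDENTS):
    """Build a short heads-up message for the helpdesk or a status page."""
    impacted = impacted_by(element, dependents)
    if not impacted:
        return f"{element} has a problem; no known downstream impact."
    return f"{element} has a problem; expect issues with: " + ", ".join(impacted)


notice = helpdesk_notice("SAN")
```

The useful part is that the remote web front end shows up in the notice before anybody calls to complain about it.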

Service Oriented Management

One of the biggest challenges with element management systems is in the name; they are managing technologies, not services. Each one, reasonably enough, is focused on managing a specific technology to the best of its abilities, but without some contextual information and an understanding of dependencies, the information being gathered cannot be used to its full potential. That's not to say that element management systems have no value; far from it, and for the engineers responsible for that technology, they are invaluable. Businesses, however, don't typically care about the elements; they care about services. When there's a major outage and you call the CTO to let them know, it's one thing to tell them that "SAN-17 has failed!" but the first question a CTO should be asking in response is "What services are impacted by this?" If your answer to that kind of question would be "I don't know," then no matter how many elements you monitor and how frequently you poll them, you don't fully know your infrastructure and you'll never reach a state of inner peace.
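As a sketch of what answering the CTO's question might look like, the Python below rolls element failures up into a per-service status. Everything except the name SAN-17 is invented for illustration: the service names, the VM-to-infrastructure map, and the ok/degraded/down labels are assumptions, not a description of how any particular tool works.

```python
# Hypothetical map of each VM to the infrastructure elements it relies on.
RELIES_ON = {
    "VM1": {"SERVER-A", "SAN-17"},
    "VM2": {"SERVER-A", "SAN-17"},
    "VM3": {"SERVER-B", "SAN-18"},
    "VM4": {"SERVER-B", "SAN-18"},
}

# Hypothetical business services and the VMs that deliver them.
SERVICES = {
    "Public web site": ["VM1", "VM2", "VM3"],
    "Order database": ["VM4"],
}


def service_impact(failed_element, services=SERVICES, relies_on=RELIES_ON):
    """Return {service: "ok" | "degraded" | "down"} for one failed element."""
    report = {}
    for service, vms in services.items():
        # A VM is hit if it is the failed element or relies on it.
        hit = [
            vm for vm in vms
            if vm == failed_element or failed_element in relies_on.get(vm, set())
        ]
        if not hit:
            report[service] = "ok"
        elif len(hit) == len(vms):
            report[service] = "down"
        else:
            report[service] = "degraded"
    return report


san17_report = service_impact("SAN-17")
```

Under these assumptions, the answer for SAN-17 is that the public web site is degraded (two of its three VMs are affected) and the order database is unaffected, which is a far better thing to tell a CTO than an element name.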

I'm curious to know whether Thwack users feel like they have a full grip on the services using the infrastructure and the dependencies in place, not just a view of the infrastructure itself?

In my next post, I'll be looking at more knowledge challenges: baselining the infrastructure and identifying abandoned compute resources.


There is another layer I didn't see represented... a business service that has two-way dependencies on another business service, say through some middleware such as MSMQ and/or MQ.

Are the queues backing up because a service is having a problem? Is that other service that is supposed to be pulling data from the queues impacted by an offsite B2B service?

It is more than just layers of an onion....


Level 12

I don't feel like I ever have a firm grasp of all the dependencies.  It's a never-ending battle.

Level 20

The first rule with this and VMs is to treat each VM as though it were a physical asset.  YES, this also means asset tags for your VMs.  We call them Vtags to make it obvious it's a VM and not a physical asset, but we track them and monitor them the exact same way.  Also, since sometime in 10.x.x, NPM began speaking the VMware API, which changed the game in monitoring your servers if they run on VMs in VMware.  Having NPM know which VMs were running on which ESX server at any given time really improved Orion.


Jfrazier, bleggett, ecklerwr1... all excellent points! Achieving a complete and comprehensive monitoring strategy for applications is near impossible. These applications don't exist in a vacuum, and the importance of the dependencies can quickly become an IT expert's undoing. Logs, ports, services, processes, traps, and so on are all more things to consider.



Level 21

I have found that some Business Services can't be directly monitored; it really depends on how the end Business Service is delivered.  On the other hand if it's web based it's pretty easy to monitor.

In the cases where it can't be monitored I add as much monitoring as I can to all of the different elements of it and put them in a group to represent the Business Service.

Level 13

Isn't it always the way... just like you can't secure anything 100% unless you turn it off and leave it off.

Level 10

Wait... you mean something is actually using all of these nodes I'm monitoring?!

No, I definitely don't have a handle on the dependencies and the services. We're still trying to get a full view of the infrastructure. You make an absolutely valid point, though. The business, and even IT management, don't care about the individual elements. It's the services and impact on the customer that matters to them. We all have to look out for the customers, whoever they may be, and for those of us "in the weeds", that can be easy to overlook.

Level 20

Hehe yeah off is pretty secure... also having no connection to the outside world aka airgapped works quite well too!

You can't beat Layer One security.




Level 14

With NPM we have some knowledge of what is working on our network.  We will be adding NTA, SAM, and LEM soon.  I can't wait to have the clearer view that these tools will bring to the party.

Level 13

Very true. It may be a case of figuring out how to represent that kind of indirect dependency - or whether to simply list them as if they were direct, e.g. ApplicationA has a dependency on MQ and a dependency on ApplicationB, since failure of either of those elements will cause problems for ApplicationA. It's certainly difficult to model all dependencies, and where core services are involved (e.g. DNS), it's likely that the impact could be quite widespread.

Level 13

That's for sure, and it's doubly hard when it's not established when an application is first deployed. This is the classic network engineer problem of being told there's an outage for an application and you ask "Ok, who is supposed to be talking to your application" or "What is your application trying to connect to" and the service owner can't tell you... Somehow whoever runs network management is supposed to guess all this stuff? I think not.

Level 13

I'm totally in agreement. This kind of integration is fundamental to keeping a firm grip on your assets, both virtual and physical.

Level 13

Well, ironically enough somebody might not be using all those nodes (the abandoned compute resources I mentioned at the end) 😉 We can't help being in the weeds most of the time; we're here because we are expert horticulturists. That said, being able to see all of the weeds and not just looking at our favorite species of weed is a great help when trying to figure out whether somebody threw weedkiller into the flower bed. Maybe I'm stretching the analogy a little too far. My point is, rather than getting stuck only monitoring one set of elements, it's better to have visibility of a wide range of elements even if we can't yet roll that knowledge up into a business service view.

Level 13



The point of IT is to provide a service - then demonstrate the value provided by the service.

Two key parts - (service oriented monitoring) and (the difference between monitoring and alerting)

If you don't monitor the service you're providing then you can't tell your customers how reliable your services are or the value you provide.


If you alert for things you should only be monitoring, your alert recipients start ignoring real alerts.

You know #1 is the issue when the most common question is "Server x just alerted, what service is impacted?"

Level 14

Knowledge is power.