
Zen And The Art of Infrastructure Monitoring

Level 13


(Zen Stones by Undeadstawa on DeviantArt)

Over the years, I've observed that despite running multiple element and performance management systems, most organizations still don't truly understand their IT infrastructure. In this post I'll examine how it's possible to have so much information on hand yet still have a large blind spot.

Discovery

What does discovery mean to you? For most of us I'm guessing that it involves ICMP pings, SNMP community strings, WMI, login credentials and perhaps more in an attempt to find all the manageable devices that make up our infrastructure: servers, hypervisors, storage devices, switches, routers and so forth. We spin up network management software, perhaps a storage manager, virtualization management, performance management, and finally we can sleep safely knowing that we have full visibility and alerting for our compute, storage and networking infrastructure.

At this point I'd argue that the infrastructure discovery is actually only about 50% complete. Why? Because the information gathered so far provides little or no data that can be used to generate a correlation between the elements. By way of an analogy you could say that at this point all of the trees have been identified, labeled and documented, but we've yet to realize that we're standing in the middle of a forest. To explain better, let's look at an example.

Geographical Correlation

Imagine a remote site at which we're monitoring servers, storage, printers and network equipment. The site is connected back to the corporate network over a single WAN link, and, horrifyingly, that link is about to die. What do the monitoring systems tell us?

  • Network Management: I lost touch with the edge router and six switches.
  • Storage Management: I lost touch with the storage array.
  • Virtualization Management: I lost touch with these 15 VMs.
  • Performance Management: These elements (big list) are unresponsive.

Who monitors those systems? Do the alerts all appear in the same place, to be viewed by the same person? If not, that's the first issue, as spotting the (perhaps obvious) relationship between these events requires a meat-bag (human) to realize that if storage, compute and network all suddenly go down, there's likely a common cause. If this set of alerts went in different directions, in all likelihood the virtualization team, for example, might not be sure whether their hypervisor went down, a switch died, or something else, and they may waste time investigating all those options in an attempt to access their systems.

Centralize your alert feeds
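To make that concrete, here's a minimal sketch of what a centralized correlator might do once all the feeds land in one place: key every alert by site, and flag any site that multiple independent management systems are complaining about at the same time. The system and site names below are invented for illustration; they stand in for whatever your tools actually export.

```python
from collections import defaultdict

def correlate_by_site(alerts):
    """Group alerts from separate management systems by site and return
    the sites where more than one system reports a failure at once."""
    by_site = defaultdict(set)
    for alert in alerts:
        by_site[alert["site"]].add(alert["system"])
    # Several independent systems losing touch with one site simultaneously
    # suggests a shared cause, e.g. the site's only WAN link going down.
    return {site: systems for site, systems in by_site.items()
            if len(systems) > 1}

# Hypothetical merged feed from the four systems in the scenario above.
alerts = [
    {"system": "network",        "site": "branch-07", "device": "edge-rtr-01"},
    {"system": "storage",        "site": "branch-07", "device": "array-01"},
    {"system": "virtualization", "site": "branch-07", "device": "esx-03"},
    {"system": "network",        "site": "hq",        "device": "core-sw-02"},
]
suspects = correlate_by_site(alerts)
# branch-07 is reported by three different systems; hq by only one.
```

A human can spot this pattern on a single pane of glass; the point of the sketch is that the correlation is only even possible once the feeds share a common site key.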

Suppressing Alerts

If all the alerts are coming into a single place, the next problem is that in all likelihood the router failure event led to the generation of a lot of alerts at the same time. Looking at it holistically, it's pretty obvious that the real alert should be the loss of a WAN link; everything else is a consequence of losing the site's only link to the corporate network. Personally in that situation, I'd ideally like the alert to look like this:

2016/07/28 01:02:03.123 CRITICAL: WAN Node <a.b.c.d> is down. Other affected downstream elements include (list of everything else).

This isn't a new idea by any means; alert suppression based on site association is something we should all strive for, yet so many of us fail to achieve it. One of the biggest challenges with alert monitoring is being overwhelmed by a flood of messages whose poor signal-to-noise ratio makes it impossible to pick out the important information. This is a topic I will come back to, but let's assume it's a necessary evil.

Suppress unnecessary alert noise
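A rough sketch of that suppression logic, assuming each site has exactly one known WAN edge node (an assumption; real sites may have redundant links), might look like this:

```python
def suppress_downstream(alerts, wan_edges):
    """Collapse a site's alerts into one root-cause message when that
    site's WAN edge is among the down devices.

    alerts    -- list of {"site": ..., "device": ...} for down devices
    wan_edges -- mapping of site name -> that site's WAN edge device
    """
    down_by_site = {}
    for a in alerts:
        down_by_site.setdefault(a["site"], []).append(a["device"])

    messages = []
    for site, devices in down_by_site.items():
        edge = wan_edges.get(site)
        if edge in devices:
            # The edge itself is down: everything else at the site is a
            # consequence, so emit a single critical alert listing it.
            others = [d for d in devices if d != edge]
            messages.append(
                f"CRITICAL: WAN node {edge} is down. "
                f"Other affected downstream elements: {', '.join(others)}")
        else:
            messages.extend(f"WARNING: {d} is down" for d in devices)
    return messages

# Invented example mirroring the remote-site scenario.
down = [
    {"site": "branch-07", "device": "10.7.0.1"},  # the WAN edge router
    {"site": "branch-07", "device": "sw-01"},
    {"site": "branch-07", "device": "array-01"},
]
msgs = suppress_downstream(down, {"branch-07": "10.7.0.1"})
```

Here `msgs` is a single CRITICAL line rather than three separate device-down alerts, which is exactly the shape of the ideal alert above.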

Always On The Move

In addition to receiving several hundred alerts from the devices impacted by the WAN failure, now it seems the application team is troubleshooting an issue with the e-commerce servers. The servers themselves seem fine, but the user-facing web site is generating an error when trying to populate shipping costs during the checkout process. For some reason the call to the server calculating shipping costs isn't able to connect, which is odd because it's based in the same datacenter as the web servers.

The security team is called in and begins running a trace on the firewall, only to confirm that the firewall is correctly permitting a session from the e-commerce server to an internal address on port tcp/5432 (postgres).

The network team is called in to find out why the TCP session to shipsrv01.ecomm.myco.corp is not establishing through the firewall, and they confirm that the server doesn't seem to respond to ping. Twenty minutes later, somebody finally notices that the IP returned for shipsrv01.ecomm.myco.corp is not in the local datacenter. Another five minutes later, the new IP is identified as being in the site that just went down; it looks like somebody had moved the VM to a hypervisor in the remote site, presumably by mistake, when trying to balance resources across the servers in the data center. Nobody realized that the e-commerce site had a dependency on a shipping service that was now located in a remote site, so nobody associated the WAN outage with the e-commerce issue. Crazy. How was anybody supposed to have known that?
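One cheap safeguard against this kind of surprise is a subnet-to-site lookup that flags any dependency resolving to an address outside the caller's own site. The subnets, site names and addresses below are invented for illustration; a real table would come from your IPAM.

```python
import ipaddress

# Hypothetical site-to-subnet table.
SITE_SUBNETS = {
    "dc-primary": ipaddress.ip_network("10.1.0.0/16"),
    "branch-07":  ipaddress.ip_network("10.7.0.0/16"),
}

def site_for(ip):
    """Return the site whose subnet contains this address."""
    addr = ipaddress.ip_address(ip)
    for site, net in SITE_SUBNETS.items():
        if addr in net:
            return site
    return "unknown"

def check_dependency(local_site, ip):
    """Warn when a dependency (e.g. the IP a service hostname resolves
    to) lives at a different site than the application calling it."""
    remote = site_for(ip)
    if remote != local_site:
        return (f"WARNING: dependency at {ip} is located in {remote}, "
                f"not {local_site}")
    return None
```

Run periodically against the resolved addresses of known service dependencies, a check like this would have flagged the shipping service the moment its VM landed in the remote site, long before the WAN link died.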

It seems that despite having all those management systems I'm still a way from having true knowledge of my infrastructure. When I post next, I'll look at some of the things I'd want to do in order to get a better and more holistic view of my network so that I can embrace the inner peace I desire so much.

14 Comments
Level 20

Event correlation is more an art form than science, I've come to understand... also, dependencies and groups have really helped NPM in this respect for sure!  It sure is nice to not get 50 alerts when one of your WAN routers goes down!

MVP

Sounds like dependency mapping, understanding the physical makeup of your environment,  and correlation of events.

As ecklerwr1 says, event correlation is more of an art than a science.

This was a peaceful and calming article--very zen-ish.  Obviously every job can be improved with thoughtful consideration and design.


Yup, you need to identify the root cause as quickly as possible when you have an Emailageddon going on. In Orion, turning on 'Auto Dependencies' helps a lot with this, but you're always going to want to ensure the big silos of kit are sat behind the correct edge device in these parent>child relationships.

By edge device, I don't always mean the edge device in the traditional sense. I'm talking about the monitoring edge device, the device all ICMP/SNMP/WMI traffic flows through to get to the site devices. That said, often this is the same thing when monitoring remote sites over VPNs from "HQ".

For example, if you are monitoring your remote sites over a VPN, you would look to make the device that is the endpoint of the tunnel, on the remote site end, the parent of all other objects. As I mentioned, the automatic dependencies created by Orion when you turn the feature on will help, but you can do this manually by using an Orion group that contains all other devices on that site, and then making the new group the child dependent of the VPN endpoint device. This way, if you lose the tunnel, the monitored edge will go down and you'll get an alert (if configured), but no others. That means no muddied waters for your Ops team, leading to a quicker resolution of this type of issue.

(By default all interfaces on a device are the 'child dependents' of the parent node, so you don't need to go to that level.)

Note that automatic dependencies are calculated from network topology, so it's fairly safe to turn these on, just keep an eye on them to ensure that any critical device alerts are not suppressed, as by default when the parent is down, all alerts that can fire on child objects are suppressed.
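The parent/child behaviour described above, including the caveat about critical devices, can be sketched roughly like this. To be clear, this is not the actual Orion logic, and the device names are invented; it just illustrates the rule "suppress a child's alert when its parent is down, unless the child is on a must-alert list":

```python
def filter_alerts(down_devices, parent_of, always_alert=frozenset()):
    """Return the alerts worth showing: devices whose parent is still up,
    plus any must-alert devices even when their parent is down."""
    down = set(down_devices)
    shown = []
    for dev in down_devices:
        parent = parent_of.get(dev)  # None for top-level devices
        if parent not in down or dev in always_alert:
            shown.append(dev)
    return shown

# Invented site: the VPN endpoint is the parent of everything behind it.
parent_of = {"access-sw-01": "vpn-endpoint", "core-fw": "vpn-endpoint"}
shown = filter_alerts(
    ["vpn-endpoint", "access-sw-01", "core-fw"],
    parent_of,
    always_alert={"core-fw"},  # a critical device that must never be muted
)
```

Here only the VPN endpoint and the explicitly critical firewall alert; the access switch is suppressed as a consequence of its parent being down.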

Level 14

Love the holistic view.  One must know the network, be the network, to successfully monitor the network.

Level 13

I love that. BE the network. 😉

That needs to be on a T-Shirt....

Level 12

Good read, looking forward to the next installment.

Level 14

Oh yes, Know the Network, Be the Network on a T-shirt.  That thought never occurred to me, seriously.  Great idea silverbacksays

I have a server running an application. I have a disk array. I have a switch. I have a firewall. I have a WAN.

I have NPM/SAM/NCM/NTA/Patch Mgr/LEM

How do I monitor and manage it all effectively?

There goes my zen right out the window....

MVP

Agreed, when they finally did dependencies it made things much easier for us!

Level 13

Have you been reading ahead? 😉

Level 12

Agree wholeheartedly.  Therein lies the human element that is, at least for the time being, irreplaceable in IT.  The ability to intuitively grasp the correlation of events is terribly difficult to quantify, and is founded upon a careful study of the network and its elements.  Tools such as SolarWinds clearly help, but they must be wielded effectively to reach the level of knowledge that leads to understanding disparate events in a holistic sense.

Level 17

Double-check your dependencies; I've already had a Distribution node end up as the child behind an Access Switch... these things happen, and you'll need to group your redundant pairs to really squash some of those alerts.

   - now the kicker - If anyone can get the auto dependencies to pick up on sub-interfaces and squash those alerts, I'd be in Alert Heaven... where it's all quiet like.

Also, we used to track a lot more when it came to events and their correlation... those were the days. Now I have less of an idea how close we come to 3 or 4 nines.