Over the years, I've observed that despite running multiple element and performance management systems, most organizations still don't truly understand their IT infrastructure. In this post I'll examine how it's possible to have so much information on hand yet still have a large blind spot.
What does discovery mean to you? For most of us I'm guessing that it involves ICMP pings, SNMP community strings, WMI, login credentials and perhaps more in an attempt to find all the manageable devices that make up our infrastructure: servers, hypervisors, storage devices, switches, routers and so forth. We spin up network management software, perhaps a storage manager, virtualization management, performance management, and finally we can sleep safely knowing that we have full visibility and alerting for our compute, storage and networking infrastructure.
At this point I'd argue that the infrastructure discovery is actually only about 50% complete. Why? Because the information gathered so far provides little or no data that can be used to generate a correlation between the elements. By way of an analogy you could say that at this point all of the trees have been identified, labeled and documented, but we've yet to realize that we're standing in the middle of a forest. To explain better, let's look at an example.
Imagine you have a remote site at which we are monitoring servers, storage, printers and network equipment. The site is connected back to the corporate network using a single WAN link, and—horrifyingly—that link is about to die. What do the monitoring systems tell us?
- Network Management: I lost touch with the edge router and six switches.
- Storage Management: I lost touch with the storage array.
- Virtualization Management: I lost touch with these 15 VMs.
- Performance Management: These elements (big list) are unresponsive.
Who monitors those systems? Do the alerts all appear in the same place, to be viewed by the same person? If not, that's the first issue, as spotting the (perhaps obvious) relationship between these events requires a meat-bag (human) to realize that if storage, compute and network all suddenly go down, there's likely a common cause. If this set of alerts went in different directions, in all likelihood the virtualization team, for example, might not be sure whether their hypervisor went down, a switch died, or something else, and they may waste time investigating all those options in an attempt to access their systems.
Centralize your alert feeds
If all the alerts are coming into a single place, the next problem is that in all likelihood the router failure event led to the generation of a lot of alerts at the same time. Looking at it holistically, it's pretty obvious that the real alert should be the loss of a WAN link; everything else is a consequence of losing the site's only link to the corporate network. Personally in that situation, I'd ideally like the alert to look like this:
2016/07/28 01:02:03.123 CRITICAL: WAN Node <a.b.c.d> is down. Other affected downstream elements include (list of everything else).
This isn't a new idea by any means; alert suppression based on site association is something that we should all strive to achieve, yet so many of us fail to do so. One of the biggest challenges with alert monitoring is being overwhelmed by a large number of messages, and the signal to noise ratio makes it impossible to see the important information. This is a topic I will come back to, but let's assume it's a necessary evil.
Suppress unnecessary alert noise
Always On The Move
In addition to receiving several hundred alerts from the devices impacted by the WAN failure, now it seems the application team is troubleshooting an issue with the e-commerce servers. The servers themselves seem fine, but the user-facing web site is generating an error when trying to populate shipping costs during the checkout process. For some reason the call to the server calculating shipping costs isn't able to connect, which is odd because it's based in the same datacenter as the web servers.
The security team is called in and begins running a trace on the firewall, only to confirm that the firewall is correctly permitting a session from the e-commerce server to an internal address on port tcp/5432 (postgres).
The network team is called in to find out why the TCP session to shipsrv01.ecomm.myco.corp is not establishing through the firewall, and they confirm that the server doesn't seem to respond to ping. Twenty minutes later, somebody finally notices that the IP returned for shipsrv01.ecomm.myco.corp is not in the local datacenter. Another five minutes later, the new IP is identified as being in the site that just went down; it looks like somebody had moved the VM to a hypervisor in the remote site, presumably by mistake, when trying to balance resources across the servers in the data center. Nobody realized that the e-commerce site had a dependency on a shipping service that was now located in a remote site, so nobody associated the WAN outage with the e-commerce issue. Crazy. How was anybody supposed to have known that?
It seems that despite having all those management systems I'm still a way from having true knowledge of my infrastructure. When I post next, I'll look at some of the things I'd want to do in order to get a better and more holistic view of my network so that I can embrace the inner peace I desire so much.