6 Replies Latest reply on Dec 15, 2012 12:41 PM by Joep Piscaer

    Monitoring across the stack?

    Joep Piscaer

      In a previous post, byrona brought up a valid point:

      For our internal stuff we also apply the same process, use a canned template and modify as necessary.  Internally we also use the application monitoring functionality for some interesting things such as monitoring our facilities for HVAC status, UPS status, data-center temperature, data-center humidity and eventually per cabinet power usage as well.

      In order to correlate any given alarm(s) to a root cause, monitoring across the entire stack is a must-have. I always use the example of a physical 19” rack stacked with VMware ESXi-hosts. One of those physical hosts is experiencing a failure on one of its PSUs. Due to that failure, the host is automatically placed in maintenance mode. During the evacuation of the host (using vMotion), an alarm goes off for one of the VMs being migrated, because it disappeared from the network for a second or so.

      The root cause is obviously the failing PSU, but without a monitoring solution that has picked up on that, the root cause is much harder to locate. And even when you are monitoring the correct stuff, the root cause might still be external to the monitored objects. It could very well be that the temperature in that part of the rack is much higher than anticipated, because the number of power and Ethernet cables in that part of the rack prevents heat from being dissipated correctly, which in turn led the PSU to fail.
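
      To make the rack example a bit more concrete, here is a minimal sketch of that kind of correlation. It assumes you can describe the stack as a simple "depends on" chain; the object names and the function are made up for illustration and are not taken from any particular monitoring product.

```python
# Hypothetical dependency map: each object points to the thing it depends on.
# Names are invented for this example.
DEPENDS_ON = {
    "vm-web01": "esxi-host-03",
    "esxi-host-03": "psu-03-a",
    "psu-03-a": "rack-12-upper-temp",
}


def probable_root_cause(alarmed_object, active_alarms, depends_on=DEPENDS_ON):
    """Walk down the dependency chain and return the deepest object
    that currently has an active alarm (or the starting object if none)."""
    root = alarmed_object
    current = alarmed_object
    seen = set()  # guards against accidental cycles in the map
    while current in depends_on and current not in seen:
        seen.add(current)
        current = depends_on[current]
        if current in active_alarms:
            root = current
    return root


# The VM alarm is traced back to the PSU only if the PSU is monitored.
print(probable_root_cause("vm-web01", {"vm-web01", "psu-03-a"}))
# Without the PSU alarm, the VM alarm is all you have to go on.
print(probable_root_cause("vm-web01", {"vm-web01"}))
```

      If the temperature sensor for that part of the rack is also in the alarm set, the same walk ends at the sensor instead of the PSU, which is exactly the "external" cause described above.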

      http://www.cs.bris.ac.uk/Research/Micro/img/16-core-boards/eight-board-stack-front.jpg

      I think everybody agrees with me that monitoring your hardware stack and meta-information about the physical layer (like temperature, humidity and the other metrics byrona mentioned) is critical; but how do you actually accomplish this?

      Do you spec your DC to be ‘monitoring-capable’? If you're renting cage space from a co-location provider, how do you co-operate with them so you can monitor ‘their’ objects and metrics?

        • Re: Monitoring across the stack?
          jkmills

          As we begin to go down the road of an additional remote Data Center we are focusing more and more on this sort of monitoring. The new DC will be spec'd to be monitoring capable as we will not have any onsite presence.

            • Re: Monitoring across the stack?
              Sohail Bhamani

              The best solutions will take everything you have mentioned into account from the beginning, if possible.  More often than not, though, this will need to happen as a retrofit.  That is probably harder, because cost will be a major part of the discussion, along with what it will take to get to where you need to be.

               

              I think getting information individually from the needed places in "the stack" is simple.  The hard part is piecing it together into a readable story.  SolarWinds is on the right path here with its recent features and acquisitions.

               

              The end result is that in a modern network, with all of its moving pieces and parts, a true single pane of glass for monitoring is badly needed and has yet to be properly provided by any cost-effective software vendor.

               

              Sohail Bhamani

              Loop1 Systems

              http://www.loop1systems.com

            • Re: Monitoring across the stack?
              byrona

              While I think monitoring the entire stack is important, I can't say that I have found any monitoring system very good at root cause analysis.  We have a 24x7 Enterprise Operations Center that leverages our monitoring system (powered by SolarWinds software) to identify and respond to problems.  I have found that people armed with the proper data-set are much better at root cause analysis.

               

              I like to compare it to the medical industry: the monitoring system is like the medical equipment and the EOC Tech is like the Dr.  The medical equipment is used to provide data but the Dr ultimately makes the diagnosis.

               

              With the monitoring services we provide we generally don't send alerts directly to customers.  We send the alerts to our EOC Techs for evaluation and diagnosis and then the EOC Tech notifies the customer if a real problem exists.  The way that the Tech escalates to the customer will be different depending on the severity of the issue; if the issue is critical a phone call will be made and if the issue is non-critical an email is sent and tracked via our ticketing system.  By doing this our customers get to discuss the issue with real people and don't receive a ton of false alarms in the middle of the night.  We feel that by doing these things we provide a lot of extra value to our customers.  After all, when you go to the hospital do you want to talk to the medical equipment or do you want to talk to the Dr?
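
              For anyone who wants to model this flow, the escalation rules above could be sketched roughly as follows. This is only an illustration of the process as described; the class, field and action names are hypothetical and do not come from SolarWinds or any ticketing system.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """Outcome of an EOC Tech looking at an alert (hypothetical structure)."""
    real_problem: bool
    critical: bool


def escalation_actions(evaluation):
    """Return the customer-facing actions for an evaluated alert."""
    if not evaluation.real_problem:
        # False alarm: the customer is not notified at all.
        return []
    if evaluation.critical:
        # Critical issue: the Tech calls the customer.
        return ["phone_call"]
    # Non-critical issue: email, tracked through the ticketing system.
    return ["email", "ticket"]


print(escalation_actions(Evaluation(real_problem=True, critical=True)))
print(escalation_actions(Evaluation(real_problem=True, critical=False)))
print(escalation_actions(Evaluation(real_problem=False, critical=False)))
```

              The point of the sketch is that the alert never reaches the customer directly; a person's evaluation always sits in between.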

                • Re: Monitoring across the stack?
                  Joep Piscaer

                  Root cause analysis is indeed very hard to accomplish automatically. For any monitoring solution, it helps to have as much (relevant) monitoring data as possible available to work with. I like and agree with your analogy of the medical equipment and the doctor.

                  I do think, however, that some parts of the infrastructure are harder to monitor than others. For instance: setting up monitoring for virtual machines should be a lot easier than monitoring a piece of physical equipment like a temperature or humidity gauge in one of the aisles. Even if you are able to do so, you really need multiple gauges in different places (and with different thresholds and granularity) in order to correctly monitor for temperature and humidity...
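
                  As a rough illustration of what "different thresholds per gauge" could look like, here is a small sketch. The sensor names and limit values are assumptions made up for the example; they are not recommendations and should be replaced with whatever fits your own room.

```python
# Hypothetical gauges with their own limits; all numbers are illustrative only.
SENSORS = {
    "aisle-1-cold-temp": {"unit": "C", "warn": 27.0, "crit": 32.0},
    "aisle-1-hot-temp": {"unit": "C", "warn": 40.0, "crit": 45.0},
    "aisle-1-humidity": {"unit": "%RH", "warn": 60.0, "crit": 70.0},
}


def classify(sensor, value, sensors=SENSORS):
    """Compare one reading against that sensor's own thresholds."""
    limits = sensors[sensor]
    if value >= limits["crit"]:
        return "critical"
    if value >= limits["warn"]:
        return "warning"
    return "ok"


# The same 35 degree reading means different things in different places.
print(classify("aisle-1-cold-temp", 35.0))
print(classify("aisle-1-hot-temp", 35.0))
```

                  The sketch only covers upper limits; low-humidity or low-temperature limits would need a second pair of thresholds per gauge.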

                • Re: Monitoring across the stack?
                  donpepe

                  I am in charge of several remote locations with small server rooms. The strategy we employ is to standardize our equipment in the field with monitoring in mind. So we have adopted a line of APC UPSes with network cards, which are quite versatile for environmental monitoring. Even if we don't get cooperation from our vendor, or if setting up a way to monitor their equipment is cost-prohibitive, the UPSes with their I/O attachments are quite sufficient and very easy to monitor in SolarWinds.