5 Replies Latest reply on Nov 29, 2012 4:16 PM by Joep Piscaer

    Monitoring baseline & false positives during peak business demand

    Joep Piscaer

      End-to-end monitoring and correlation of the data gathered from the entire stack is what matters. Even then, it's only useful if you have set a baseline and know what the normal state of your environment is.


      With tens of thousands of processes, services and components, how do you determine what that 'normal' state is? It is nearly impossible to map all the interrelations between these objects.

      If you've finally succeeded in setting that ‘normal’ baseline, how do you handle 'exceptions' (like end-of-quarter reporting, a Slashdot effect or a Cyber Monday deal leaving your servers and applications begging for a day off) that are actually considered 'wanted' from a business perspective, but are not seen as such by your SAM solution?


      In these cases, the SAM application gives you such an enormous number of false positives that the monitoring application becomes useless for the duration of the exception. You do want to keep monitoring your servers and applications during these critical times (when it’s not business as usual), and you need to keep everything running as smoothly as possible.

      http://thwack.solarwinds.com/servlet/JiveServlet/previewBody/169118-102-1-4923/smooth_sailing.jpg

      So how do you keep your SAM solution sailing smoothly in exceptional situations and prevent a sea of false positives from rendering your monitoring useless?

        • Re: Monitoring baseline & false positives during peak business demand
          branfarm

          Great topic.  If you're getting alerts that you feel are false, then it would seem that you don't have your thresholds set appropriately, or you're monitoring the wrong things.  After all, your monitoring is just telling you what you set it up to do.  I think there's a difference between baselining your normal traffic profiles and setting performance thresholds appropriately.  What are you more interested in -- identifying activity that's out of profile, or identifying performance problems?  Because they don't always go hand in hand. After all -- the business wants the heat, but they don't want to lose business because the servers melted or the application went up in flames.

          Overall, I think the answer boils down to load testing -- you have to know the breaking point for each component in your infrastructure, and your baseline will be a range of different metrics in which you know the application performs well.  Your alerting should then be at or near the various points where performance starts to degrade, whether that's certain CPU or memory thresholds, database transaction rates, or whatever else you measure.
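
          To make that concrete, here is a minimal sketch in Python of alerting against load-tested degradation points; the metric names and limits are placeholders you'd replace with whatever your own load tests produced, not values from any particular product.

          # Hypothetical limits from load testing: the values below stand in for the
          # points where testing showed performance starting to degrade (illustrative only).
          DEGRADATION_POINTS = {
              "cpu_percent": 85.0,                # assumed CPU level where response time climbs
              "db_transactions_per_sec": 1200.0,  # assumed rate where commit latency suffers
              "memory_percent": 90.0,
          }

          def check_metrics(current):
              """Return alerts for metrics at or past their load-tested limits."""
              alerts = []
              for metric, limit in DEGRADATION_POINTS.items():
                  value = current.get(metric)
                  if value is not None and value >= limit:
                      alerts.append(f"{metric}={value} is at/above load-tested limit {limit}")
              return alerts

          # Example poll result; the numbers are made up.
          print(check_metrics({"cpu_percent": 88.0, "db_transactions_per_sec": 900.0}))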

            • Re: Monitoring baseline & false positives during peak business demand
              byrona

              branfarm I love that you have noted the difference between identifying out-of-profile activity versus performance problems.  I agree that the monitoring system is generally better suited to notify you of performance problems, and baselining and load testing is the best way to do this.  There is also a lot of great KB data out there for things such as the Microsoft apps to help you understand when those performance metrics indicate a problem.

               

              When it comes to identifying out-of-profile activity, that is work for a systems analyst to go in and actively use the performance data that is in the monitoring system to see if and when that is happening.  I have always found that humans do a much better job of that than systems anyway, as they can correlate the data in a much more effective way; besides, if you have a system that does it, you will spend all of your time configuring it as things change.

              • Re: Monitoring baseline & false positives during peak business demand
                Joep Piscaer

                I agree that the difference between identifying normal traffic profile activity and monitoring for performance thresholds being exceeded is very important. But even if you know what the breaking point of each individual component is, it is still very hard to map all the interrelations between those components. For instance: in an isolated test, a single metric (let's say 'virtual CPU usage') can exceed 90% without indicating a problem. However, if the %RDY time for a virtual machine is above 15% in combination with virtual CPU usage above 80%, it might indicate a problem with context switching; a problem that you cannot discover using virtual CPU usage alone. So, interrelating and correlating numerous metrics is required to actually gain a deep understanding of the health and performance of your environment. This means that individual component load testing does not necessarily yield trustworthy performance threshold data unless you also do group component (entire system) testing for various specific use cases to prevent server melting or application flames.
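
                As a rough sketch of that correlation in Python (the 15% %RDY and 80% vCPU figures are the ones from the example above, used purely as an illustration, not as recommended values):

                # Sketch: combine two VM metrics instead of judging either one alone.
                def classify_vm(cpu_ready_percent, vcpu_usage_percent):
                    """Turn CPU ready time and vCPU usage into a single health verdict."""
                    if cpu_ready_percent > 15.0 and vcpu_usage_percent > 80.0:
                        # High ready time plus high usage: the VM wants CPU but has to
                        # wait for it, which a usage-only threshold would never catch.
                        return "ALERT: likely CPU scheduling contention"
                    if vcpu_usage_percent > 90.0:
                        return "WARN: sustained high vCPU usage"
                    return "OK"

                print(classify_vm(cpu_ready_percent=18.0, vcpu_usage_percent=82.0))  # ALERT
                print(classify_vm(cpu_ready_percent=3.0, vcpu_usage_percent=95.0))   # WARN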

                 

                Identifying 'out of profile' activity is a good indicator for a performance (or other) problem that might occur in the near future. Although the two aren't strictly correlated, 'out of profile' activity usually occurs before a performance problem does. This means that you need to know the baseline / nominal performance and activity, as well as the upper limit (where problems start to occur). Without either, you simply don't know what your systems are capable of (maximum) or should run at (normal activity).
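
                In code that distinction could look something like the sketch below; the nominal band and the upper limit are assumptions you would fill in from your own baseline and load tests.

                NOMINAL_RANGE = (200.0, 800.0)  # assumed 'normal activity' band (requests/sec)
                UPPER_LIMIT = 1500.0            # assumed load-tested point where problems start

                def assess_activity(requests_per_sec):
                    low, high = NOMINAL_RANGE
                    if requests_per_sec >= UPPER_LIMIT:
                        return "problem: past the load-tested limit"
                    if requests_per_sec < low or requests_per_sec > high:
                        # Not a performance problem yet, but a leading indicator worth a look.
                        return "out of profile: investigate before it becomes a problem"
                    return "normal"

                for rps in (500.0, 1100.0, 1600.0):
                    print(rps, "->", assess_activity(rps))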

                 

                My point in the first post was that even if you have set a baseline and know when problems might start to occur, you only know these values for specific use cases. If the business requires your systems and applications to perform exceptional tasks (tasks you haven't profiled for), how do you *quickly* adapt your monitoring system to handle these scenarios without flooding the administrator with false information?
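
                One way I could imagine doing that quickly is a temporary 'event profile' that relaxes thresholds only for a planned window, so the normal profile comes back by itself afterwards. A small sketch; the dates, names and numbers are all illustrative assumptions.

                from datetime import datetime

                NORMAL_THRESHOLDS = {"cpu_percent": 80.0, "requests_per_sec": 800.0}
                EVENT_THRESHOLDS = {"cpu_percent": 95.0, "requests_per_sec": 3000.0}

                # Assumed exception window, e.g. a Cyber Monday sale.
                EVENT_WINDOW = (datetime(2012, 11, 26, 0, 0), datetime(2012, 11, 27, 0, 0))

                def active_thresholds(now):
                    """Use the relaxed event profile only while the exception window is open."""
                    start, end = EVENT_WINDOW
                    return EVENT_THRESHOLDS if start <= now < end else NORMAL_THRESHOLDS

                print(active_thresholds(datetime(2012, 11, 26, 12, 0)))  # event profile
                print(active_thresholds(datetime(2012, 11, 28, 12, 0)))  # back to normal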

                  • Re: Monitoring baseline & false positives during peak business demand
                    branfarm

                    I hear what you're saying, and I certainly don't mean to imply that you can take individual metrics alone and expect to gain a clear picture of overall performance -- I definitely believe you need a combination of individual and systemic load testing to identify the meaningful numbers.  But to your question, I don't think you can necessarily change your monitoring quickly -- it should be a constant refining process that makes your monitoring/alerting more on point. Sure, you can go in and start changing alert thresholds on the fly so the emails stop flooding -- but as I said before, if you're getting false alerts, why are your thresholds set there to begin with?  If the business has a requirement that includes occasional heavy loads, shouldn't your monitoring be tuned to the maximum load your business expects to handle?

                    This might be where a little diplomacy with management can come in handy, because if your company decides to have a 99%-off sale and they don't let IT know about it ahead of time, how can you be reasonably expected to anticipate this type of scenario?  I know a lot of management types feel like IT is an infinite resource -- but we all know that is definitely not the case. ("What do you mean our website couldn't handle so many hits -- it's a server, right?")

                      • Re: Monitoring baseline & false positives during peak business demand
                        Joep Piscaer

                        I think we agree: creating a baseline is hard since you need to do systematic and extensive load testing.

                         

                        Changing your monitoring to adapt itself to ever-changing circumstances is nearly impossible, but this 'self-learning' capability is what every monitoring solution should have; although it would be very hard for the software vendor to implement.

                         

                        Setting thresholds correctly might be more complex than just setting them to a known 'good' point, as the threshold might change under a specific load. My point is that I'd like my monitoring solution to be smart enough to determine that it needs to change a threshold under a specific load or in a specific situation.
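
                        Something along these lines is what I have in mind: a threshold that follows the recent load plus a margin instead of sitting at one fixed number. The window size and margin below are assumptions to tune, not recommendations.

                        from collections import deque

                        class AdaptiveThreshold:
                            def __init__(self, window=60, margin=1.5):
                                self.samples = deque(maxlen=window)  # recent metric values
                                self.margin = margin  # how far above 'recent normal' to alert

                            def update(self, value):
                                """Record a sample; return True if it breaches the current threshold."""
                                breach = False
                                if len(self.samples) >= 10:  # need some history before judging
                                    baseline = sum(self.samples) / len(self.samples)
                                    breach = value > baseline * self.margin
                                self.samples.append(value)
                                return breach

                        t = AdaptiveThreshold()
                        for v in [50, 52, 49, 51, 50, 53, 48, 50, 52, 51, 55, 120]:
                            if t.update(v):
                                print("alert on", v)  # only the 120 spike should trigger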

                         

                        Finally: we should all migrate our stuff to the cloud, since that thing is really all about infinite resources. Solves all of our problems with management.