20 Replies Latest reply on Apr 24, 2014 6:15 AM by bluefunelemental

    Alerting setup failed us when most needed - suggestions for improvement

    thamizh85

      Yesterday we had a midnight incident with a Cisco 3750 (it seems very choosy about its time window): all end hosts connected to the switch lost network connectivity, and the switch itself became unmanageable. We finally restored service by reloading the switch. SolarWinds didn't send a single email alert during the whole outage. The network management setup failed at several levels.

       

      1. Email alerting had stopped working a few days earlier, and I only found out after the incident. It appears to have been a permission issue, but I still can't explain why the SMTP failure events were not highlighted in the web console. I spend half my time staring into that console, and I am sure I would have noticed them. It is equally inexplicable that the alerting engine worked fine for more than a year and then suddenly became fussy about running as the LOCAL SYSTEM account.

      2. Two hours before the loss of network connection, the switch stopped responding to SNMP polls. We found this out afterwards from the gap in CPU load data in the historical charts. However, the device was still pingable, so no node down event was recorded until the manual reload. To me, it seems the switch was already showing signs of trouble when it stopped responding to SNMP. Could we have noticed it earlier? Is it possible to send out email alerts when an SNMP poll fails?

       

      The greatest embarrassment for a monitoring system is when a user reports that the network is down and everyone looks surprised. Even more so when I constantly have to convince my team to tolerate several false alerts so that we don't miss any events, and then the system fails to raise any alert during a real incident.

       

      Any suggestions for improvement?

        • Re: Alerting setup failed us when most needed - suggestions for improvement
          Goliath

          Hi -

           

          First thing: you may need to upgrade the IOS version on your 3750 stacks. We had the same issue with a number of them last year, and it turned out to be a couple of bugs that cause the SNMP engine on the switch to slow down and then stop responding. Other indications are that a sho interface status command issued on the switch "hangs" (meaning you have to reconnect to the unit) and that individual connections drop off. CSCso07861 is one of the bug IDs (it's got a typically vague Cisco description!!). It particularly affected stacks with more than 4 member switches, although we did get the issue with fewer members.

           

          To answer your specific questions:

           

          1) I have a couple of email notifications set up for my nightly config/inventory jobs via NCM; this at least tests that the Orion server is sending emails correctly. I suppose you could get clever with the NPM alerting setup and configure an alert that fires when a switch is UP during a specific timeframe (i.e. only send the alert between 11:45pm and 11:50pm) to make sure that the NPM engine is operating correctly.

           

          2) There are a couple of "canned" alerts available within the advanced alert manager: "Alert me when a managed node has not been polled during the last 5 tries" and "Alert me when a managed node last poll time is 10 mins old".
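
The heartbeat idea in (1) amounts to a dead man's switch: if the expected nightly email has not arrived within some tolerance, assume the alerting path itself is broken. A minimal Python sketch, assuming you can get the timestamp of the last heartbeat email from somewhere (a mailbox, a log). The function name and the 24-hour / 15-minute values are illustrative assumptions, not anything built into Orion:

```python
from datetime import datetime, timedelta


def heartbeat_overdue(last_heartbeat, now,
                      interval=timedelta(hours=24),
                      grace=timedelta(minutes=15)):
    """Return True when the expected heartbeat email is late.

    last_heartbeat: when the last heartbeat alert email was received.
    interval:       how often the heartbeat is supposed to fire (assumed daily).
    grace:          extra tolerance before we call it missing (assumed value).
    """
    return now - last_heartbeat > interval + grace
```

The check has to run somewhere that does not depend on the Orion server's own email path, otherwise it fails silently along with the thing it is watching.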

           

          Regards,

           

          John

            • Re: Alerting setup failed us when most needed - suggestions for improvement
              thamizh85

              Goliath wrote:

               

              Hi -

               

              First thing: you may need to upgrade the IOS version on your 3750 stacks. We had the same issue with a number of them last year, and it turned out to be a couple of bugs that cause the SNMP engine on the switch to slow down and then stop responding. Other indications are that a sho interface status command issued on the switch "hangs" (meaning you have to reconnect to the unit) and that individual connections drop off. CSCso07861 is one of the bug IDs (it's got a typically vague Cisco description!!). It particularly affected stacks with more than 4 member switches, although we did get the issue with fewer members.

              We are running a higher version (that particular bug should be fixed in 12.2(44) and above), but we are indeed plagued by newer bugs. Upgrading all the switches is also a PIA, since we run well over 400 switch stacks serving 24x7 operations. But somehow we need to keep the ball rolling and upgrade IOS (some for bug fixes and some for new features).

               

              To answer your specific questions:

               

              1) I have a couple of email notifications set up for my nightly config/inventory jobs via NCM; this at least tests that the Orion server is sending emails correctly. I suppose you could get clever with the NPM alerting setup and configure an alert that fires when a switch is UP during a specific timeframe (i.e. only send the alert between 11:45pm and 11:50pm) to make sure that the NPM engine is operating correctly.

              Our NCM alerts were working fine; only the NPM alerts failed. I guess the two use different alerting engine processes. The second option is clever indeed, and I am going to implement it.

              2) There are a couple of "canned" alerts available within the advanced alert manager: "Alert me when a managed node has not been polled during the last 5 tries" and "Alert me when a managed node last poll time is 10 mins old".

               



              Have you tried these alerts? Would the polled time field get updated by a successful ICMP ping as well? I want to be alerted on SNMP failures. I will try it out and let you know whether it fits my purpose.

                • Re: Alerting setup failed us when most needed - suggestions for improvement
                  Goliath

                  thamizh85 wrote:

                   

                  We are running a higher version (that particular bug should be fixed in 12.2(44) and above), but we are indeed plagued by newer bugs. Upgrading all the switches is also a PIA, since we run well over 400 switch stacks serving 24x7 operations. But somehow we need to keep the ball rolling and upgrade IOS (some for bug fixes and some for new features).

                   

                   

                  That's a lot; I thought we had it bad with 150 of them!! In the end I managed to convince our management that it's critical we update them on a regular basis. Having quite a few just stop operating as you described certainly helped (the conversation went along the lines of "you either let us upgrade them twice a year or I can't guarantee this or something similar won't happen again"). We ended up just scheduling reloads at 4am or so, and I've not yet had a complaint from an end user that their network was unavailable at that time.

                  We did find that the issue popped back up in a couple of versions after 12.2(44), by the way. Eventually we settled on 12.2(55)SE4, which seems pretty stable; it definitely doesn't have the SNMP/complete halt issue.

                   

                  thamizh85 wrote:

                   

                  Have you tried these alerts? Would the polled time field get updated by a successful ICMP ping as well? I want to be alerted on SNMP failures. I will try it out and let you know whether it fits my purpose.

                   

                  I've been playing around with them this morning, and out of the box they don't work for this. I do think I've figured out a workaround, however:

                   

                  To test, I created an ACL on my test 3750 stack to block Orion from polling SNMP but still allow ICMP polls. Neither of the two canned alerts triggered, which suggests that the polled time field takes ICMP polls into account as well as SNMP. However, the status of the node changed to show that the 'Overall Hardware Status' has the state "could not poll":

                   

                  [Screenshot: node status.jpg]

                   

                  So, I've created an alert to trigger when a switch has this status:

                   

                  [Screenshot: Alert.jpg]

                   

                  (I've set the "do not trigger" time to 2.5 times our default polling cycle, just to suppress the alert if only a single poll is missed.)
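
The arithmetic behind that hold-down is simple enough to sketch. In the Python below, the 120-second polling interval is only an example value (the thread does not state the actual polling cycle), and the function names are hypothetical:

```python
def hold_down_seconds(poll_interval_s, factor=2.5):
    """How long the condition must persist before the alert fires."""
    return poll_interval_s * factor


def should_trigger(failing_for_s, poll_interval_s, factor=2.5):
    """True once the failure has lasted at least the hold-down period."""
    return failing_for_s >= hold_down_seconds(poll_interval_s, factor)


# Example: with a 120 s polling cycle the hold-down is 300 s, so a failure
# that has lasted 240 s (two poll cycles) is still suppressed, while one
# that has lasted 300 s raises the alert.
print(hold_down_seconds(120))    # 300.0
print(should_trigger(240, 120))  # False
print(should_trigger(300, 120))  # True
```

The trade-off is detection delay: a larger factor means fewer false alerts but a longer wait before a real SNMP failure is reported.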

                   

                  I've tested this alert and it works well: if I block SNMP via the ACL the alert triggers, and the moment I unblock it the alert clears. It has had the added benefit of identifying a couple of minor switches elsewhere in our estate that have an incorrect ACL for SNMP.
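
The distinction this test relies on (the node answers ping but not SNMP) can be written down as a small classification. This is only a sketch of the logic, not how Orion computes node status internally, and the status labels are made up for illustration:

```python
def classify_node(icmp_ok, snmp_ok):
    """Classify a node from two independent reachability checks.

    The labels are hypothetical; they are not Orion status values.
    ICMP decides up/down, matching the behaviour seen in the test above.
    """
    if not icmp_ok:
        return "down"
    if not snmp_ok:
        # Pingable but not answering SNMP: the case that went unnoticed
        # for two hours in the original incident.
        return "snmp-unreachable"
    return "up"
```

Alerting on the middle state separately from "down" is what gives the early warning.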

                   

                  If you try this let me know how you get on.

                   

                  Regards,

                  John

              • Re: Alerting setup failed us when most needed - suggestions for improvement
                zzz

                Note, this information is from Windows Server 2008 R2.

                Check the system events in Event Viewer. SolarWinds has a separate set of events under Applications and Services Logs. For example, I have a task attached to the email failure event (Event ID 3004) that mails a Gmail account whenever mail to the internal address fails.

                 

                Note that previously there were some type mismatch errors that filled up the log, but after I resolved that issue there usually aren't any events or information entries. Checking daily or just a few times per week should be OK. However, when they do occur they tend to occur in groups, so it might not be wise to alert on every single event or it could flood your mailbox.
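
One way to avoid that flood is to throttle notifications per event ID. The following Python is a minimal sketch under that assumption; the class name and the 10-minute gap are hypothetical, and it is not tied to the Windows task scheduler or to any SolarWinds API:

```python
class AlertThrottle:
    """Allow at most one notification per event ID within a minimum gap."""

    def __init__(self, min_gap_s):
        self.min_gap_s = min_gap_s
        self._last_sent = {}  # event ID -> time the last notification went out

    def allow(self, event_id, now_s):
        """Return True if a notification for event_id may be sent at now_s."""
        last = self._last_sent.get(event_id)
        if last is not None and now_s - last < self.min_gap_s:
            return False  # still inside the quiet period, suppress
        self._last_sent[event_id] = now_s
        return True


throttle = AlertThrottle(min_gap_s=600)
print(throttle.allow(3004, 0))    # True: first failure is reported
print(throttle.allow(3004, 30))   # False: burst within 10 minutes is suppressed
print(throttle.allow(3004, 700))  # True: quiet period has passed
```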

                • Re: Alerting setup failed us when most needed - suggestions for improvement
                  superfly99

                  Was there anything in the logs of the switch? Did syslog send out any kind of message? If it did, maybe turn on alerting for that type of syslog message.
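
If the switch did log something, filtering by severity is a reasonable starting point. Cisco IOS messages carry a tag of the form %FACILITY-SEVERITY-MNEMONIC, where a lower severity number is more serious. The Python below is a sketch: the regular expression is a simplification of that format, and the threshold of 3 (errors and worse) is an assumed default rather than a recommendation from this thread:

```python
import re

# Matches tags such as %LINK-3-UPDOWN: (facility, severity 0-7, mnemonic).
CISCO_TAG = re.compile(r"%([A-Z0-9_]+)-([0-7])-([A-Z0-9_]+):")


def is_alertworthy(line, max_severity=3):
    """True if the line has a Cisco-style tag at or below max_severity."""
    match = CISCO_TAG.search(line)
    return bool(match) and int(match.group(2)) <= max_severity
```

For example, a line containing "%LINK-3-UPDOWN: Interface GigabitEthernet1/0/1, changed state to down" would be flagged, while "%SYS-5-CONFIG_I: Configured from console" would not.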

                  • Re: Alerting setup failed us when most needed - suggestions for improvement
                    ssaraswat

                    Friends, I am also in the same situation.

                     

                    In my case, the interface on the device below is up, but somehow the neighborship is not there. Is there any way to get an alert for this?

                    We are using SNMP. We need to receive an alert for the issue below, because what normally happens is that services go down for a long time and still no alerts are raised.

                     

                    Below is the screen shot.

                     

                    Thanks.

                     

                    [Screenshot: Capture.PNG]
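
The underlying check here is "expected neighbor is absent even though the interface is up". The post does not say which routing protocol is involved, so the Python below is only a sketch of the comparison. How the list of observed neighbors is collected (for example from an SNMP poll) is outside its scope, and the function name is hypothetical:

```python
def missing_neighbors(expected, observed):
    """Return the expected neighbors that are not currently seen, sorted."""
    return sorted(set(expected) - set(observed))


# Example with made-up addresses: one of two expected neighbors is gone.
print(missing_neighbors(["10.0.0.1", "10.0.0.2"], ["10.0.0.2"]))  # ['10.0.0.1']
```

A non-empty result is the condition to alert on, independent of interface status.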

                    • Re: Alerting setup failed us when most needed - suggestions for improvement
                      bluefunelemental

                      I just had this happen with our ticketing system refusing to accept SolarWinds emails and create tickets. Can't wait to replace it with the SWIS API.

                      In the meantime I set up a SQL UX monitor to query my ticketing system's log table for the "can't create ticket" text. We build such complex monitoring: a monitor to monitor the monitoring monitor :-)
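
The query behind that kind of monitor can be sketched as follows. Everything specific here is an assumption for illustration: the table name ticket_log, the column name message, and the use of Python's sqlite3 module as a stand-in for whatever database the ticketing system really uses:

```python
import sqlite3


def count_ticket_failures(conn, needle="can't create ticket"):
    """Count log rows whose message contains the failure text."""
    cur = conn.execute(
        # Parameterised query, so the apostrophe in the needle is safe.
        "SELECT COUNT(*) FROM ticket_log WHERE message LIKE ?",
        (f"%{needle}%",),
    )
    return cur.fetchone()[0]


# Usage with an in-memory database and made-up log rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticket_log (message TEXT)")
conn.executemany(
    "INSERT INTO ticket_log (message) VALUES (?)",
    [("ticket created",), ("error: can't create ticket from email",)],
)
print(count_ticket_failures(conn))  # 1
```

A monitor would alert whenever the count is above zero for the most recent time window.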

                      Bar of soap cleans your hands but WHAT cleans the bar of soap?