49 Replies Latest reply on Jul 27, 2016 11:14 PM by fazl azeem

    Delayed Alerts

    fazl azeem

      Hi everyone,

       

      Our company has a client who has NPM/SAM installed and has asked us not to disclose their identity. Whenever a node in their network goes down or has any issue, the alert is sent 4-8 hours later. When my superiors discussed it with me, I was surprised, because I have worked on NPM/SAM at my previous job here in Pakistan for a client in the US, and whenever there was an issue, the alert was sent within 30 seconds. I questioned my superiors about the re-discovery schedule and the alert reporting time, and they said they had tried and checked everything, and nothing looked out of the ordinary. So they asked me to ask you for help, because some of you might have faced the same issue and could help me identify the root cause, or point me to details that would help find it. Looking forward to hearing from all the thwackians...

      CourtesyIT DanielleH aLTeReGo Jfrazier familyofcrowes jeremymayfield wabbott patrick.hubbard bsciencefiction.tv and others who could help.

        • Re: Delayed Alerts
          superfly99

          Can you post up a copy of the alert? It sounds like there's something mis-configured in the alert itself.

            • Re: Delayed Alerts
              fazl azeem

              The first questions that came to my mind were: it's impossible, it cannot be that late; are you serious; something might be misconfigured; the re-discovery schedule might be set to 4 hours or more; or the system configuration is so low that it takes time to generate the alert. That said, I don't think alert generation has anything to do with re-discovery.

            • Re: Delayed Alerts
              CourtesyIT

              That does sound a bit odd.  Given that this is a multi-continent/multi-timezone instance, I would be interested in knowing what Network Time Protocol (NTP) source and time zone you are using, and whether all your devices use the same.  Also, I am wondering how prevalent this issue is.  Is it just a couple of devices, 25 or so devices, or 100 devices spread across various time zones?  Also, can you provide more insight on your logging architecture and design?

              • Re: Delayed Alerts
                CourtesyIT

                Ok, basically when the node goes down, NPM should trigger an alert.  There are various ways the alert can be triggered and the notification sent.  Is the customer waiting for an email, or is the delay being noticed in the alerts section in NPM?  Can you supply a screenshot?

                  • Re: Delayed Alerts
                    fazl azeem

                    I have asked about all the possible things that could cause the alerts to be delayed, and asked for screenshots. Let's wait for the reply and then I will share it with you guys.

                      • Re: Delayed Alerts
                        fazl azeem

                        Another cause could be that the system configuration is not very high while the number of nodes is high. For a short while after the VM starts up, alerts are triggered on time because the RAM is free. After some time, when there are many alerts, the system starts to get jammed because a lot of RAM is in use, and they might not have the amount of RAM required for that many nodes.

                    • Re: Delayed Alerts
                      supreet299

                      Hi Fazl

                       

                      If you have configured the alert correctly and you are still not getting the alert, or the alert is delayed,

                      can you tell me how many elements you are polling from your main poller, and what the CPU utilization of the server is at the time of polling?

                       

                       


                      • Re: Delayed Alerts
                        Jfrazier

                        Since you are mentioning resources... what does the monitoring environment look like?

                        1) Number of polling engines

                        1a) provide cpu/disk/memory of the polling engines

                        2) number of nodes being polled

                        3) How are you determining a node is down (ICMP, SNMP, WMI) ?

                        4) as previously requested, can you post a copy of the alert in question ?

                        5) is there a window in which the alert is active ?

                        6) How is the client receiving the alert ?

                        and last but not least...

                        7) I am going to toss on my SolarWinds support hat and ask: are you fully patched (Orion-wise) at the current software level?  There was a version that had some issues with delayed alerts that required a hotfix to resolve or, better yet, an upgrade to a newer version.

                          • Re: Delayed Alerts
                            fazl azeem

                            Here is an overview of my chat with the client. The numbered points are my questions; the lines marked "Client:" are the client's replies:

                            Client: You can see from the events below that the 21 July event “switch KSE-MZ-SW reboot” reached NPM on 25 July at 2:20 AM.

                            1. How are the alerts configured to be sent?

                            2. Are they configured to be sent via email/SMS, or as notifications in the Events and Alerts column in NPM?

                            Client: Alerts are configured via email and we see them in NPM. Furthermore, ICMP nodes have also been added, and even they don’t change their state in the NPM web console.

                            3. Which version of NPM are you using, for example 11.5.3, 12.0, or a version lower than 11.5.3?

                            Client: NPM version 10.6.1

                            4. Can you provide me a screenshot of the received Events and Alerts?

                            Client:

                            EVENTS:

                            ALERTS:

                             

                            5. Is the issue occurring on just a couple of devices, 25 or so devices, or 100 devices? Are they spread across various time zones, or all in the same time zone?

                            Client: We are facing this issue on 10 network devices

                             

                            6. Are the alerts configured to be triggered when the re-discovery of the nodes takes place?

                             

                            Client: NO

                             

                            7. The system configuration is very low; it sends alerts on time initially because RAM is free, and after some time, as the number of alerts increases, the system gets jammed and it becomes nearly impossible for NPM to send alerts in time.

                             

                            Client: CPU and memory utilization is within normal limits

                            We are currently monitoring total of 91 nodes and 498 sensors.

                             

                            8. Kindly check these things and provide me the details of the system configuration on which NPM is running, and how many nodes are being monitored, because that has a lot to do with system configuration.

                            Client: NPM is being run on VM:

                            Windows server 2008 R2 Standard

                            Processor:           Intel Xeon CPU E5-2960 @ 2.9GHz 2.9GHz

                            RAM:                     3 GB

                            HDD:                      30 GB /  899 MB Free Space

                              • Re: Delayed Alerts
                                Jfrazier

                                I immediately see issues in resources. CPU, Memory, and potentially drive space.

                                Is the database local to the polling engine or on a separate machine?
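To put a number on the drive-space concern, here is a minimal sketch that works out the free-space percentage from the figures the client posted (899 MB free on a 30 GB disk). The 10% warning threshold and the function names are assumptions for illustration, not a SolarWinds requirement.

```python
import shutil

def free_percent(free_mb, total_mb):
    """Return free space as a percentage of total capacity."""
    return 100.0 * free_mb / total_mb

def is_low(free_mb, total_mb, threshold_percent=10.0):
    """True when free space is under the (hypothetical) warning threshold."""
    return free_percent(free_mb, total_mb) < threshold_percent

def live_free_percent(path="."):
    """Same calculation for a live volume, using the standard library."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return 100.0 * usage.free / usage.total

# Figures from the client's reply: 30 GB disk, 899 MB free.
posted_free_mb = 899
posted_total_mb = 30 * 1024

print(f"free: {free_percent(posted_free_mb, posted_total_mb):.1f}%")
print("low on space:", is_low(posted_free_mb, posted_total_mb))
```

Running `live_free_percent()` on the Orion server itself (and on the database server, if it is separate) gives the same figure for the current state of the disk.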

                                  • Re: Delayed Alerts
                                    cahunt

                                    Only if the DB is present on this box, but it is a small install... I think the CPU and RAM can take care of this small setup.

                                    We are currently monitoring total of 91 nodes and 498 sensors.
                                      • Re: Delayed Alerts
                                        Jfrazier

                                        I wonder what the polling completion rate looks like.  I suspect it falls off through the day, based on fazl azeem's comment that alerts are timely after a reboot and are delayed later.

                                        I agree the cpu and ram are likely the big culprits. If they could increase it to at least 2 cores and 6GB of ram there should be improvement.

                                          • Re: Delayed Alerts
                                            cahunt

                                            yes indeed... that reminds me to suggest checking polling...... via settings or with something like......

                                             

                                             

                                            select
                                              Engines.ServerName, Engines.IP, Engines.ServerType,
                                              -- uptime of the polling engine in hours, truncated to 2 decimals
                                              convert(varchar, round(nodes.systemuptime/60/60, 2, 1)) + ' hrs' as Uptime,
                                              Engines.Elements as Elmts, Engines.Nodes, Engines.Interfaces as [Int], Engines.Volumes as Vol,
                                              -- the columns below need subquery joins (c, a, N, I, V, A2, s) that were not included in this post:
                                              -- c.custpolls as UnDP, a.samct as SAM,
                                              -- N.Down_node, I.Down_Int, V.Down_vol, A2.Down_sam,
                                              -- s.failed as noSNMP,
                                              Engines.PollingCompletion as "%complt",
                                              nodes.nodeid, nodes.CPUload as "%CPU", nodes.percentmemoryused as "%RAM",
                                              e1.PropertyValue as NPM_Rate, e2.PropertyValue as SAM_Rate
                                            from Engines
                                            -- match each polling engine to its own node entry by IP address
                                            join nodes on engines.ip = nodes.ip_address
                                            -- NPM polling rate per engine
                                            left join (select engineproperties.engineid, EngineProperties.PropertyValue from EngineProperties where engineproperties.propertyname = 'Orion.Standard.Polling') e1
                                              on engines.engineid = e1.engineid
                                            -- SAM component polling rate per engine
                                            left join (select engineproperties.engineid, EngineProperties.PropertyValue from EngineProperties where engineproperties.propertyname = 'APM.Components.Polling') e2
                                              on engines.engineid = e2.engineid

                                             

                                             

                                            or possibly with - UDT Job Status report by polling engine

                                  • Re: Delayed Alerts
                                    fazl azeem

                                    Jfrazier, what was the hotfix you were talking about? Can you explain in more detail?

                                    • Re: Delayed Alerts
                                      Jeff Catlin

                                      As a quick item to check, I would suggest doing a query against the alertlog table in the DB and seeing if the time of the actions and the alert triggering actually lines up with the time that node went down or the email was sent.  This could at least confirm that the alerts are triggering late and eliminate the possibility that it is just emails getting delayed somehow.
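A minimal sketch of that comparison, assuming you have already pulled three timestamps for one incident: when the node actually went down, when the alert log shows the alert triggering, and when the email arrived. The timestamps below are hypothetical placeholders, not values from the client's system, and `alert_delays` is just an illustrative helper name.

```python
from datetime import datetime, timedelta

def alert_delays(node_down, alert_triggered, email_received):
    """Split the total delay into the part before the alert fired and the part after."""
    trigger_delay = alert_triggered - node_down          # time Orion took to fire the alert
    delivery_delay = email_received - alert_triggered    # time added by the mail path
    return trigger_delay, delivery_delay

# Hypothetical timestamps for illustration only.
node_down = datetime(2016, 7, 25, 9, 0, 0)
alert_triggered = datetime(2016, 7, 25, 13, 30, 0)
email_received = datetime(2016, 7, 25, 13, 31, 0)

trigger_delay, delivery_delay = alert_delays(node_down, alert_triggered, email_received)
print("trigger delay: ", trigger_delay)
print("delivery delay:", delivery_delay)

if trigger_delay > delivery_delay:
    print("Most of the delay happened before the alert fired (look at polling and alerting).")
else:
    print("Most of the delay happened after the alert fired (look at SMTP and mail routing).")
```

If the trigger delay is small and the delivery delay is hours, the mail path is the suspect; if it is the other way around, the alert engine or polling is.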

                                      • Re: Delayed Alerts
                                        jeremymayfield

                                        Have you verified there is no delay in your SMTP server settings, where your email server processes the requests from the Orion server?  It seems like with every update I get some that route to my Outlook spam folder.

                                         

                                        Also, I would ask: are you using the alerting feature in Orion or the Advanced Alerting features on the Orion server itself?  Advanced Alerting has many features which can delay sending if misconfigured.

                                        • Re: Delayed Alerts
                                          jeremymayfield

                                          Have you checked the alert properties?

                                           

                                           

                                          Name of alert: Email me when a Node goes down
                                          Description of alert: Paging
                                          Type of Property to monitor: Node
                                          Enabled (On/Off): ON
                                          Evaluation Frequency of alert: Every 15 seconds
                                          Severity of alert: Critical
                                          Alert Custom Properties: (0) No Alert Custom Properties defined
                                          Alert Limitation Category: No Limitation

                                           

                                           

                                          Also make sure there are no time-of-day settings on the alerts that would prevent them from being sent.

                                          • Re: Delayed Alerts
                                            cahunt

                                            A few things off the top

                                             

                                            - Not sure what changing IP addresses on the node is all about... maybe you are setting them up with an in-band IP and changing to an out-of-band address on the node, or another back door, loopback, etc.

                                            - Your VM reset tells me you are losing resources, or that as things start to run and query you have long run times on your filters, alerts, reports, etc. Alerts specifically, if too in-depth or nested, will cause an issue: the query cannot get through the entire DB before the next 'Check this every X minutes (or seconds)' interval comes around. Check your server application logs for issues there. You can also check out the linked alert for long-running queues; I don't remember where I came across it, but you will want to edit the conditions. It's currently set with a check of 15 minutes, so any query running longer than 15 minutes will trigger the email. I am sure you want to edit that.

                                             

                                            Long Running SQL Queue . AlertDefinition

                                             

                                             

                                            - Verify your actions and what time they trigger. You can use a simple query; customize or adjust it to fit, or to show a specific alert or other items.

                                            SELECT TOP 1000 *

                                              FROM [AlertLog]

                                              where ActionType = 'EMail'

                                              order by LogDateTime desc

                                             

                                             


                                            • Re: Delayed Alerts
                                              rschroeder

                                              From your original comments, and from analysis by folks like Jfrazier, it seems apparent the Orion VM does not have sufficient resources allocated to accomplish its tasks.

                                               

                                              Assign it more memory and CPU and see if the problem decreases.

                                                • Re: Delayed Alerts
                                                  rschroeder

                                                  Performance will improve significantly once database issues are resolved.

                                                   

                                                    • Re: Delayed Alerts
                                                      fazl azeem

                                                      So is the issue related to the DB, or to the amount of resources like CPU, hard disk and RAM?

                                                      I came across a DB query for resolving an issue like this, which removes some type of alerts from a table. Check this out and tell me whether it is safe to do or not. I found this:

                                                       

                                                      " SELECT Count (*) FROM [dbo].[AlertHistory]

                                                       

                                                      you can clear this table up (perform DB Backup first) then restart orion services. This table by the way only houses the actions and clearing these up can fix your issue with delayed alerts."

                                                    • Re: Delayed Alerts
                                                      cahunt

                                                      Looks like there is some clean up you can do here... make sure you are not over retaining your history for EVERYTHING on this small little engine that could setup.

                                                       

                                                      I would also stop sending the netflow data for those devices or interfaces that the system is just dumping..

                                                        • Re: Delayed Alerts
                                                          fazl azeem

                                                          tell me more about removing the Netflow Events and Alerts from NPM. I want to take as many things as I can to the client so that he would be at peace. And I being a Customer support Representative would feel good to bring peace to someone, there is nothing more relaxing than seeing a client at peace.

                                                      • Re: Delayed Alerts
                                                        Jfrazier

                                                        something else to also think about...how many SNMP traps are being received from various sources ?

                                                        If the trap database is large, it will also affect performance.  It needs to be trimmed from time to time.

                                                        Large numbers of traps even if they are not being alerted on or even set to be discarded will affect performance as they still have to be processed.

                                                        Firewall log entries sent as traps can cripple a monitoring solution. 

                                                        • Re: Delayed Alerts
                                                          Jfrazier

                                                          It is likely related to both DB and resources.

                                                            • Re: Delayed Alerts
                                                              fazl azeem

                                                              Is the above-mentioned query safe to run? I have requested the client to upgrade the resources in the meantime, until I have discussed the DB change with my superiors and they agree to go forward with clearing the table in the DB. I will let you guys know by or after Tuesday. Thank you everyone for all your support. I really appreciate the help.

                                                            • Re: Delayed Alerts
                                                              rjg5050

                                                              There is also an option in Advanced Alerts to "hold on" to an alert for "X" amount of time before sending it out. It is on the Trigger Conditions tab, and it's a check box for "Condition must exist for more than". I use it for several different things; for example, a node must be down for 6 minutes (which equates to 3 polling cycles) so that devices that "flap" don't generate an extreme amount of white noise. Check whether that box is checked and, if so, what the settings are.
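As a rough model of how those settings add up, the sketch below estimates a worst-case time from "node goes down" to "alert fires". It assumes the delays simply stack: one polling interval to notice the outage, the "Condition must exist for more than" hold time, and one alert evaluation cycle. The 120-second polling interval follows from "6 minutes equates to 3 polling cycles", and the 15-second evaluation frequency is taken from the alert properties posted earlier in the thread; the additive model itself is an assumption, not documented Orion behaviour.

```python
def worst_case_latency_seconds(poll_interval, hold_time, evaluation_frequency):
    """Worst case under a simple additive model (an assumption, not Orion's spec)."""
    return poll_interval + hold_time + evaluation_frequency

poll_interval = 6 * 60 // 3    # 120 s: 6 minutes spread over 3 polling cycles
hold_time = 6 * 60             # 360 s: "Condition must exist for more than" 6 minutes
evaluation_frequency = 15      # "Every 15 seconds" from the posted alert properties

latency = worst_case_latency_seconds(poll_interval, hold_time, evaluation_frequency)
print(f"worst case: {latency} s (~{latency / 60:.2f} min)")
```

Even with a 6-minute hold, this lands in the range of minutes, so a hold timer alone would have to be set to hours to explain a 4-8 hour delay.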

                                                              • Re: Delayed Alerts
                                                                ecklerwr1

                                                                Also a LOT has changed for the better since 10.6.1 in more ways than one.
