Our company has a client, who has asked us not to disclose their identity, with NPM/SAM installed. Whenever a node in their network goes down or has any issue, the alert is sent 4-8 hours later. When my superiors discussed this with me, my reaction was: what? I have worked with NPM/SAM before, at my previous job here in Pakistan with a client in the US, and whenever there was an issue, the alert was sent within 30 seconds. So I questioned my superiors about the rediscovery schedule and the alert reporting time. They said they had tried and checked everything, and nothing looked out of the ordinary; everything seemed normal. They asked me to ask you for help, since some of you might have faced the same issue and could help me identify the cause, or give me some pointers for finding the root cause. Looking forward to hearing from all the thwackians...
The first thoughts that came to my mind were: this is impossible, it cannot be that late, are you serious? Either something is misconfigured, or the rediscovery schedule is set to 4 hours or more, or the system specification is so low that it takes that long to generate the alert. That said, I don't think alert generation has anything to do with rediscovery.
One more thing: when the devices were added, they might have had an IP from another subnet, e.g. 172.16.x.x, and the subnet later changed to 10.x.x.x or 192.168.x.x, so the devices are up but their status shows as down.
That does sound a bit odd. Given that this is a multi-continent, multi-time-zone instance, I would be interested in knowing which Network Time Protocol (NTP) source and time zone settings you are using, and whether all your devices use the same ones. I am also wondering how prevalent this issue is. Is it just a couple of devices, 25 or so, or 100 devices spread across various time zones? Also, can you provide more insight into your logging architecture and design?
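To illustrate why the time zone question matters, here is a minimal Python sketch. It is not a diagnosis of this case. It assumes a hypothetical setup where the node's event is stamped in US Eastern Standard Time and the alert is stamped in UTC; the fixed offsets, timestamps, and function names are all made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets, for illustration only (no daylight saving handling).
EST = timezone(timedelta(hours=-5), "EST")  # US Eastern Standard Time, UTC-5
UTC = timezone.utc


def true_delay(event: datetime, alert: datetime) -> timedelta:
    """Real elapsed time: aware datetimes are compared on the same time line."""
    return alert - event


def apparent_delay(event: datetime, alert: datetime) -> timedelta:
    """Delay seen when the wall-clock values are compared without their zones."""
    return alert.replace(tzinfo=None) - event.replace(tzinfo=None)


# Hypothetical event: node goes down at 09:00:00 EST, alert fires 30 s later
# on a server that stamps its alerts in UTC.
node_down = datetime(2024, 1, 15, 9, 0, 0, tzinfo=EST)
alert_sent = (node_down + timedelta(seconds=30)).astimezone(UTC)

print("true delay:    ", true_delay(node_down, alert_sent))
print("apparent delay:", apparent_delay(node_down, alert_sent))
```

If the two clocks are read without their offsets, a 30-second alert can look like it arrived hours late, which is why it helps to confirm that every timestamp being compared is in the same zone.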
I just discussed this with the client. They said that they have installed NPM on VMware, and the problem has now become so severe that sometimes no alert is generated for a whole day when a node goes down or comes back up. They said that when they restart the VM, everything goes back to normal, and after some time it gets bad again. CourtesyIT, can you be more specific about what kind of logs I need to ask the client for, and where those logs can be collected from?
OK. Basically, when a node goes down, NPM should trigger an alert. There are various ways the alert can be triggered and the notification sent. Is the customer waiting for an email, or is the delay being noticed in the alerts section in NPM? Can you supply a screenshot?
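The distinction above (is the alert itself late, or only the notification?) can be sketched as a small Python helper. This is only an illustration under stated assumptions: the three timestamps would have to be collected by hand from the customer's environment, and the function name, field names, and sample values are hypothetical, not taken from NPM.

```python
from datetime import datetime, timedelta


def delay_breakdown(node_down: datetime,
                    alert_triggered: datetime,
                    notification_received: datetime) -> dict:
    """Split the total delay into detection time and delivery time."""
    detection = alert_triggered - node_down               # outage -> alert raised
    delivery = notification_received - alert_triggered    # alert raised -> email arrives
    slowest = "detection" if detection >= delivery else "delivery"
    return {"detection": detection, "delivery": delivery, "slowest_stage": slowest}


# Hypothetical sample: alert is raised after 40 s, but the email arrives hours later.
result = delay_breakdown(
    node_down=datetime(2024, 1, 15, 10, 0, 0),
    alert_triggered=datetime(2024, 1, 15, 10, 0, 40),
    notification_received=datetime(2024, 1, 15, 15, 10, 40),
)
print(result)
```

If the slowest stage turns out to be delivery, the mail path is the place to look; if it is detection, the polling and alerting side deserves the attention.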
I have asked about everything that could possibly cause the alerts to be delayed, and asked for screenshots. Let's wait for the reply, and then I will share it with you.
Another cause could be that the system specification is low while the number of nodes is high. A lot depends on the system specification: for a short while after the VM starts up, alerts are triggered successfully because RAM is free, but after some time, when there are many alerts, the system starts to choke because so much RAM is in use. They might not have the amount of RAM required for that many nodes.