If we lose a poller, we get NO alerts from any node that is monitored by that poller. No stats, no updates. Whatever state was last reported stays that way until the poller comes back and collects data again.
* What I do get is an alert that my poller went down, or rather that the server went down. If you also want alerts when a service hangs or starts to eat resources, application monitoring has to be set up for that as well.
This alert still triggers because we have more than one poller, and I cross-monitor the pollers. Poller 1 monitors Poller 3 and the Main App Engine (we keep the polling load off our main app because of heavy DB input and past web page performance issues), Poller 2 monitors Poller 1, and Poller 3 monitors Poller 2. This ensures that when a poller goes offline I see that alert, instead of realizing 25 minutes later that nodes are not updating.
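To sanity-check a cross-monitoring layout like the one above, something along these lines could work. This is only a sketch: the `watches` mapping and the poller names are made up for illustration and are not pulled from NPM.

```python
def unwatched_pollers(watches):
    """Return the pollers that no *other* poller is monitoring.

    `watches` maps each poller name to the list of pollers it monitors
    (hypothetical data, typed in by hand for this sketch).
    """
    watched = set()
    for watcher, targets in watches.items():
        for target in targets:
            if target != watcher:  # a poller watching itself does not count
                watched.add(target)
    return sorted(set(watches) - watched)


# The layout described above: 1 -> 3 and main, 2 -> 1, 3 -> 2.
watches = {
    "poller1": ["poller3", "main"],
    "poller2": ["poller1"],
    "poller3": ["poller2"],
    "main": [],
}
print(unwatched_pollers(watches))
```

An empty list means every polling engine has at least one other engine that will alert when it drops.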
In older versions of the app we had issues where polling collection just quit with NO real visible indicator, other than noticing that nothing was updating. So if you're running into that issue, take a look at the Windows counters and the SW Server App Monitoring templates. From there you can start trending and building custom alerts based on what you see.
Another thing is what type of circuits you are working with. If you always get a light, the interface will not go down unless there is a hardware issue, so you need to watch your trends and alert on something other than a down interface (of course keep that alert in place in case of hardware failure, but consider an EIGRP neighbor loss alert so you know if your WAN box is losing the connection to your routers). And monitor from both ends, of course; if possible, see if you can monitor both an internal and an external IP address for sites.
Don't overload your device though, so if both IPs hit the same box, make sure one is Ping Only. <tangent> Some Nexus devices are bad at being dually polled, especially if you are using VDCs. Keep your pollers light on everything but the main context: full details there (stats, inventory if you do it, history), and only interfaces and critical details on the other contexts.
I recently opened a support case for this same scenario. I was told that the current version of NPM has no way to roll polling of nodes over to another poller when a Remote Poller goes down. We recently lost connectivity on the primary and secondary WAN links to a location where one of our Remote Pollers is located. We got no indication from NPM that anything was wrong.
Since then I've set up ICMP-only polling of that poller from our primary NPM server, as well as ICMP-only polling of a loopback on the core switch in that same datacenter, so we'll at least get some alert that a site and our Remote Poller are unreachable.
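The two ICMP checks together tell you a bit more than either one alone. A rough sketch of how the results could be read, assuming you already have a simple up/down answer for each target (the function and the messages are hypothetical, not anything NPM produces):

```python
def classify_site(poller_responds, loopback_responds):
    """Interpret two ICMP-only results taken from the primary NPM server.

    poller_responds:   True if the Remote Poller answers ping
    loopback_responds: True if the core switch loopback answers ping
    """
    if poller_responds and loopback_responds:
        return "site and poller reachable"
    if loopback_responds:
        return "site reachable, poller not responding"
    if poller_responds:
        return "poller reachable, core switch loopback not responding"
    return "site unreachable"


# Both WAN links down: neither target answers.
print(classify_site(False, False))
```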
One thing I ended up doing was setting up polling of the various NPM servers by remote NPM pollers. That way I at least know when the regional NPM hardware is unavailable.
I also poll the regional routers / gateways from my local NPM solution, which lets me know which regional routers are actually up or down, even though their local poller is off line.
I created a Dependency and Group that helps keep me from being overloaded with alerts when a regional site or its poller is down.
Finally, there's a slick little option in the Engineer's Toolset that let me manually create a mini NPM/poller on my laptop, into which I've dumped all the regional routers' management addresses. Even if the regional AND the local NPM poller were to go offline, I can start up the local poller on my laptop and see what's actually down from my local network's point of view. That feature's been great when making local changes that have the potential to cause major outages. I can see the results of a command immediately, rather than waiting for a 2-minute poller cycle to show it in NPM.
rschroeder, we were in the same boat before. We had our main WAN link between two pollers go down and got no alerts. I opened a case, and support explained to me that since the nodes polled by the additional poller don't get updated in the database by the additional poller, they keep the same status as before (usually "up"). We had to monitor the additional poller from the main poller and vice versa. We set up both node up/down and application/component monitors. Now we actually get alerts when services on the remote poller get restarted (annoying), but we know when something bad happens. We get an alert saying that the far-side WAN router is down and the SW poller services are down or unreachable.
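Since the status just sits at "up" when nothing is writing to the database, the age of the last update is the more honest signal. Here is a minimal sketch of that idea, assuming you can get a last-polled timestamp per node from somewhere; the `last_polled` dict, the node names, and the threshold of three missed 2-minute cycles are all assumptions for illustration.

```python
from datetime import datetime, timedelta


def stale_nodes(last_polled, now, interval=timedelta(minutes=2), missed=3):
    """Return nodes whose last update is older than `missed` polling cycles.

    last_polled: dict of node name -> datetime of the last stored poll
    """
    cutoff = now - interval * missed
    return sorted(name for name, ts in last_polled.items() if ts < cutoff)


now = datetime(2024, 1, 1, 12, 0)
last_polled = {
    "branch-rtr-1": datetime(2024, 1, 1, 11, 59),  # polled a minute ago
    "branch-rtr-2": datetime(2024, 1, 1, 11, 30),  # nothing for 30 minutes
}
print(stale_nodes(last_polled, now))
```

A node can show "up" and still land in this list, which is exactly the case support described.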
You mentioned dependencies. Make sure your additional poller is not dependent on the far-side WAN router, or you won't get alerts on it.
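If the dependency tree is big, that kind of mistake is easy to miss by eye. A small sketch of the check, assuming the parent/child relationships have been written out by hand as a dict (the `parent_of` mapping and node names are hypothetical, not an NPM export format):

```python
def depends_on(node, ancestor, parent_of):
    """Return True if `node` sits anywhere below `ancestor` in the tree.

    parent_of: dict of child name -> parent name
    """
    seen = set()
    while node in parent_of and node not in seen:
        seen.add(node)  # guards against an accidental loop in the mapping
        node = parent_of[node]
        if node == ancestor:
            return True
    return False


parent_of = {
    "remote-poller": "far-wan-router",
    "far-wan-router": "near-wan-router",
}
# True here means alerts for the poller would be suppressed
# whenever the far-side WAN router is down.
print(depends_on("remote-poller", "far-wan-router", parent_of))
```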
One thing I don't have figured out is this... since the main poller is responsible for sending email to the mail server for alerts... when the Main poller goes down and the additional poller is up you get nothing. Maybe that's addressed by having a separate monitoring system monitor SolarWinds or one of those failover engines which we don't have. For now we just keep SolarWinds up on a screen - if it goes down we get a visual indication.
That's the "keep it simple" part--having the main monitor up, and watching when various views or windows go blank, which indicates your poller has issues. Low tech, but effective.