Allow 'Disable Application Monitor Polling' for Node Down

It seems logical, doesn't it? If your node is no longer responding to ICMP (or SNMP or agent up/down queries) you'd think that the entire node would be flagged as unreachable and that, like interfaces, volumes, etc. in NPM, SAM application monitors would be disabled as well. That is not the case. If you are monitoring a node's response time via ICMP and someone drops in a firewall rule that blocks ICMP, the node will appear down but the SNMP, WMI or agent-based application monitors will continue to work. You end up with an odd scenario where your application monitors have data in SAM but NPM shows the node as down.

Awesome, right?

Unless the node is actually down.

Let's say you have a remote site with 30 servers and each server has 10 component monitors in various application monitors. That is 300 component monitors. Someone accidentally disconnects the WAN (with a backhoe!) and your polling engine can no longer ping those nodes. And since the nodes are actually offline, none of the WMI/SNMP/agent-based SAM monitors are working either. Except that your polling engine will continue to try to query those components on the defined interval.

But it gets worse.

Because the defined SNMP timeout (and we'll assume you are using SNMP, but the same principle applies to the agent and WMI) is 2500ms by default and the SNMP retries are set to 3, every time you try to poll a component it waits 2500ms (2.5s) for a response and does this 3 times before it declares the component unknown. (Unknown because SAM can't assess an up/warn/down status without a response.) Now multiply that across 300 components for those 30 servers. Instead of getting a response in sub-100ms for each component (and maybe faster!) you are now waiting up to 7500ms, which is 7400ms longer (7500 - 100ms). That is 7.4 extra seconds for each and every component. And there are limited slots in the 'query this component' queue, so the more unresponsive SAM components you have, the worse the situation gets.
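To put numbers on that, here is a rough back-of-the-envelope sketch in Python. It only uses the figures from this post (2500ms timeout, 3 attempts, 300 components, ~100ms for a healthy response). The `slots` parameter is purely an assumption for illustration: I don't know how many parallel polling slots SAM actually uses, and this is not how SAM schedules jobs internally.

```python
import math

# Figures taken from the scenario described in this post.
SNMP_TIMEOUT_MS = 2500      # default SNMP timeout
SNMP_ATTEMPTS = 3           # attempts before the component is marked unknown
HEALTHY_RESPONSE_MS = 100   # rough response time when the node is reachable
COMPONENTS = 30 * 10        # 30 servers x 10 component monitors each


def wait_per_component_ms(timeout_ms: int, attempts: int) -> int:
    """Time spent waiting on one component that never answers."""
    return timeout_ms * attempts


def wall_clock_seconds(components: int, per_component_ms: int, slots: int) -> float:
    """Time to work through all dead components.

    `slots` is a hypothetical number of components polled in parallel;
    components are assumed to be processed in equal-sized batches.
    """
    batches = math.ceil(components / slots)
    return batches * per_component_ms / 1000


per_component = wait_per_component_ms(SNMP_TIMEOUT_MS, SNMP_ATTEMPTS)
extra = per_component - HEALTHY_RESPONSE_MS

print(f"wait per dead component: {per_component} ms")
print(f"extra vs. healthy poll:  {extra} ms")
print(f"one slot:  {wall_clock_seconds(COMPONENTS, per_component, 1)} s")
print(f"ten slots: {wall_clock_seconds(COMPONENTS, per_component, 10)} s")
```

Even with ten hypothetical slots, one pass over the 300 dead components ties up the queue for minutes, and that repeats on every polling interval.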

I agree that the default behaviour should remain as it is today, but I propose that we be given the option to allow SAM to skip component monitoring if a node is not responding. Basically give us a toggle switch to decide whether or not we want component monitors to also become 'Unreachable' if a node is considered down.
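As a minimal sketch of the behaviour I'm asking for (not SolarWinds code; `Node`, `poll_node_components` and `skip_when_node_down` are names I made up for illustration), the toggle would simply short-circuit component polling when the node is already known to be down:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for a monitored node."""
    name: str
    is_up: bool
    components: List[str] = field(default_factory=list)


def poll_node_components(
    node: Node,
    poll_component: Callable[[Node, str], str],
    skip_when_node_down: bool = False,  # the proposed toggle, off by default
) -> Dict[str, str]:
    """Return a status per component.

    With the toggle on and the node down, no component is polled and
    each one is reported as 'Unreachable'. Otherwise every component is
    polled as it is today.
    """
    if skip_when_node_down and not node.is_up:
        return {name: "Unreachable" for name in node.components}
    return {name: poll_component(node, name) for name in node.components}


# Usage: a poller that would normally block on timeouts for a dead node.
def slow_poll(node: Node, component: str) -> str:
    return "Up" if node.is_up else "Unknown"


server = Node("remote-site-01", is_up=False, components=["IIS", "SQL", "CPU"])
print(poll_node_components(server, slow_poll))                            # today's behaviour
print(poll_node_components(server, slow_poll, skip_when_node_down=True))  # proposed option
```

Leaving the flag off by default keeps the current behaviour, so the firewall-blocks-ICMP case still works for anyone who relies on it.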

Yes, I know, it means some data sharing between NPM and SAM and might have some other interdependencies as well, but in an enterprise environment where you could have hundreds of servers down at any given time it would make your application monitoring perform so much more cleanly.
