
Cisco Devices Falsely Reporting As Down

Hi all,

     I've been having a persistent issue with SolarWinds. Our company has a main facility as well as a number of remote facilities. A few times each week (usually in the morning), SolarWinds shows that most or all of the Cisco 3700 series switches at one of our non-primary locations are down. I can still ping these devices just fine, and by the end of the day the false alerts usually sort themselves out. I've tried tweaking our polling settings, thinking network latency might be the cause, but that hasn't yielded any results yet. I have changed the polling intervals for the nodes in question and adjusted the "node down" alert to allow more time before it fires.
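     To back up the "I can still ping them" part with data, one thing I can do is run an independent ping log from a separate box and line the timestamps up against the down alerts. A minimal sketch, assuming Python 3 on a Linux host and a hypothetical switches.txt with one switch IP or hostname per line:

```python
#!/usr/bin/env python3
"""Ping a list of switches once a minute and append timestamped up/down
results to a CSV, so the log can be compared against NPM node-down alerts."""
import subprocess
import time
from datetime import datetime

SWITCHES_FILE = "switches.txt"   # hypothetical input: one switch IP/hostname per line
LOG_FILE = "ping_log.csv"
INTERVAL_SECONDS = 60

def is_reachable(host):
    """Send two ICMP echo requests (Linux ping flags); any reply counts as up."""
    result = subprocess.run(
        ["ping", "-c", "2", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def main():
    with open(SWITCHES_FILE) as f:
        hosts = [line.strip() for line in f if line.strip()]

    while True:
        stamp = datetime.now().isoformat(timespec="seconds")
        with open(LOG_FILE, "a") as log:
            for host in hosts:
                status = "up" if is_reachable(host) else "down"
                log.write(f"{stamp},{host},{status}\n")
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    main()
```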

     I was wondering if anyone else has had similar issues and, if so, what steps you took to resolve them.

     I appreciate any and all feedback and will gladly share any more information that may be useful.

     Thanks!

  • guse Do you happen to have an open case regarding this issue?

  • I had similar problems with NPM regularly and falsely reporting the same specific devices as unavailable/down, or with NCM jobs against those same devices failing to complete.

    The solution involved a combination of things, each associated with a different aspect of the false outage reports:

    1. I theorized that the SysAdmins' server backup schedule was causing some of the APEs (Additional Polling Engines) to miss ICMP returns from certain nodes, especially when jobs were running on the APEs. I asked the SAs to move their backup schedule earlier or later, and some of the false node-down alerts stopped. (A ping-correlation sketch for checking this kind of overlap follows this list.)
    2. The same overlap was causing NCM jobs to partially fail. One NCM job fired off at 4 a.m. and would not complete before the server backup of the APEs started. Shifting the two tasks further apart fixed the issue.
    3. The Desktop Support Team had configured two PCs at each regional site (I have about a hundred sites) to act as master patching servers. Each morning, a number of these PCs would automatically pull a huge dump of Windows and other patch files across the WAN, then begin checking all the PCs at the remote site and pushing patches to them. The configuration and schedule were wrong: the job should never have grabbed ALL possible patches, only the ones needed, and in compressed form. It overlapped two of my scheduled NCM jobs and caused them to fail because the WAN pipes were fully utilized. That also caused ICMP to drop between the sites and the SolarWinds pollers, which triggered alerts that the sites were down when in fact they were not. NTA helped show the problem and the PCs and servers involved. The Desktop Team was happy to correct the misconfiguration, and the users who came into those sites early in the day stopped complaining about slow WAN performance.
    4. Two new bandwidth-intensive applications were deployed to several regional sites without consulting the Network Team. Administration had tested these heavy-bandwidth apps on campus, where we have 40 Gb/s connectivity between switch blocks and the data centers. The apps worked well on campus, so they assumed the apps would work well across the WAN. That turned out not to be the case. Each time someone started either application at a regional site, it fully utilized the site's small WAN pipe and caused problems for everyone from Citrix users to VoIP conversations. NTA made it easy to discover the problem, show how the apps negatively impacted business at the sites, and convince the Administrators that they either had to pay for more bandwidth or stop using these apps during business hours. When NPM can't get ICMP through a clogged bottleneck WAN pipe (5 Mb/s DSL at some sites, one or two T1s at others) because some app needs to push 15 GB over TCP as quickly as possible for the employees' convenience, something has to give. We got the users of those apps to move their work to evening hours, and Management bought bigger WAN circuits so the heavy-hitting apps could run during the day.
    5. I'd set the number of simultaneous NCM download connections too high, hoping to finish NCM jobs more quickly. Once I backed off the simultaneous-download setting, the jobs began completing successfully and NPM stopped having issues tied to overloading the SolarWinds servers. (A generic sketch of the concurrency-limiting idea follows the ping sketch below.)
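    For anyone chasing similar overlaps between backup or patching windows and polling, a rough way to confirm them is to log ICMP loss and latency from the poller's network segment around the clock and compare the timestamps against the job schedules. A minimal sketch, assuming Python 3 on a Linux host; the node list is made up, and the parsing assumes Linux-style ping output:

```python
#!/usr/bin/env python3
"""Record per-node ICMP loss and average round-trip time once a minute,
so the results can be lined up against backup/patching job schedules."""
import re
import subprocess
import time
from datetime import datetime

NODES = ["10.10.1.1", "10.10.1.2", "10.20.1.1"]   # made-up node IPs; export yours from NPM
PINGS_PER_SAMPLE = 5
LOG_FILE = "icmp_samples.csv"

def sample(host):
    """Return (percent_loss, avg_rtt_ms) for one burst of pings.

    RTT parsing assumes Linux ping output lines like 'time=0.123 ms'.
    """
    proc = subprocess.run(
        ["ping", "-c", str(PINGS_PER_SAMPLE), "-W", "2", host],
        capture_output=True,
        text=True,
    )
    rtts = [float(m) for m in re.findall(r"time=([\d.]+)", proc.stdout)]
    loss = 100.0 * (PINGS_PER_SAMPLE - len(rtts)) / PINGS_PER_SAMPLE
    avg = sum(rtts) / len(rtts) if rtts else None
    return loss, avg

def main():
    while True:
        stamp = datetime.now().isoformat(timespec="seconds")
        with open(LOG_FILE, "a") as log:
            for node in NODES:
                loss, avg = sample(node)
                log.write(f"{stamp},{node},{loss:.0f},{avg if avg is not None else ''}\n")
        time.sleep(60)

if __name__ == "__main__":
    main()
```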
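    On item 5, the effect of capping simultaneous transfers is easy to illustrate in general terms. This is not NCM's implementation, just a sketch of bounding concurrent downloads with a semaphore so a burst of jobs can't swamp the server or the WAN; the device names and limit are made up:

```python
#!/usr/bin/env python3
"""Generic illustration of limiting concurrent transfers with a semaphore.

Not how NCM works internally; it only shows why capping simultaneous
downloads keeps a burst of jobs from overloading a server."""
import asyncio
import random

MAX_CONCURRENT = 3                                    # analogous to a simultaneous-download limit
DEVICES = [f"switch-{i:02d}" for i in range(1, 21)]   # made-up device names

async def download_config(device, limiter):
    async with limiter:                               # at most MAX_CONCURRENT transfers at once
        print(f"start  {device}")
        await asyncio.sleep(random.uniform(0.5, 2.0)) # stand-in for the actual transfer
        print(f"finish {device}")

async def main():
    limiter = asyncio.Semaphore(MAX_CONCURRENT)
    await asyncio.gather(*(download_config(d, limiter) for d in DEVICES))

if __name__ == "__main__":
    asyncio.run(main())
```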

    Here's hoping you can fix your false alarms more easily and quickly than I did. It took serious time, testing, and record analysis to find the causes of some of these. Fortunately, all of my network's issues were easy to correct once I isolated their causes.

    Swift Packets!

    Rick Schroeder