
Monitoring - Device reports down in SolarWinds - ICMP only - but is not down - crossing VLANs

I have noticed that SolarWinds is occasionally unable to poll a device and then reports a false positive. I was OK with this when it was only TimeSkew that could not poll, or TimeSkew that reported a device down. Generally, within the next couple of polling cycles (every 120 seconds), TimeSkew resolves and reports back just fine.

Last night, I received a notification from SolarWinds that the NetApp filer was reporting SP-2 down; this lasted for over 20 minutes. My first instinct was to check the NetApp. But wait a minute: if the NetApp had a failure, it would have called home, and I would have received a panic alert from the filer and a phone call from NetApp. After looking at the NetApp logs, I found that absolutely nothing happened during the time frame in which SolarWinds was reporting the device down.

The SPs on the NetApp are listed under Unknown devices. I am polling them with ICMP only. This monitoring crosses VLANs: almost all the other devices are on the same "management" VLAN, along with the Orion and Orion SQL servers.
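One way to cross-check an ICMP-only alert like this is to run an independent ping log from another host on the management VLAN and compare its timestamps with the SolarWinds down events. The following is only a rough sketch, not anything SolarWinds provides: the address `192.0.2.10` is a placeholder, the function names are made up for illustration, and the ping flags assume Linux iputils or Windows `ping`. For scale, an outage of over 20 minutes at a 120-second interval spans roughly ten consecutive failed polls.

```python
import platform
import subprocess
import time
from datetime import datetime


def ping_command(host, timeout_s=2, system=None):
    """Build a single-echo ping command for the local OS.

    Assumes Linux iputils flags (-c count, -W timeout in seconds) or
    Windows flags (-n count, -w timeout in milliseconds).
    """
    system = system or platform.system()
    if system == "Windows":
        return ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    return ["ping", "-c", "1", "-W", str(timeout_s), host]


def probe(host, timeout_s=2):
    """Return True if one echo request got a reply (exit code 0).

    Caveat: Windows ping can exit 0 on "Destination host unreachable",
    so treat a Windows result as a hint rather than proof.
    """
    result = subprocess.run(
        ping_command(host, timeout_s),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(host, count, interval_s=120, probe_fn=probe, sleep_fn=time.sleep):
    """Probe the host `count` times, logging a timestamped line per probe."""
    results = []
    for i in range(count):
        ok = probe_fn(host)
        stamp = datetime.now().isoformat(timespec="seconds")
        print(f"{stamp} {host} {'up' if ok else 'DOWN'}")
        results.append(ok)
        if i < count - 1:
            sleep_fn(interval_s)
    return results


def longest_failure_streak(results):
    """Length of the longest run of consecutive failed probes."""
    longest = current = 0
    for ok in results:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest
```

For example, `monitor("192.0.2.10", count=30)` would log one probe every 120 seconds for about an hour. If this log stays clean while SolarWinds shows the node down, the problem is more likely on the poller side; if both show loss, the network path is the better suspect.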

I do have the Orion Platform loaded on a single VM (small site), running NPM, NetFlow, APM, Config Manager, IPAM, and Virtualization.

Is it a resource issue? I am slowly growing the monitoring environment in VMware. I can throw more resources at the Orion server, but it appears to be working just fine at this time.

Any suggestions?  Ideas? 

  • Many years ago, the groups on our overview map would go yellow and then clear as SolarWinds had problems polling the nodes.

    I reported this to the network engineers, who dismissed this as a SolarWinds polling issue, and I got steadily more frustrated.

    It started happening more frequently, and sometimes bits of our network would show RED as SolarWinds flagged odd devices as being 'down'.

    I reported this to the network engineers, who dismissed this as a SolarWinds polling issue, and I got even more frustrated.

    It got worse and worse until someone really looked at the problem and realized a network policy device that was thought to be capable of 40Gbps throughput was only capable of 20% of that in the function/mode we were using it in, and the campus network was teetering on the edge of a meltdown...

    SOOOO.... my suggestion here is to carefully map out the network path between NPM and the devices giving false positives, to see whether something common on that path is not performing quite as well as you hoped and is not reporting its imminent demise.

    [also: what are your VM resources, and where is the database located, in case it really is the VM server?]
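To make the "something common on the path" idea concrete, here is a small sketch that takes traceroute results collected from the poller to each device that produced a false positive and lists the hops every one of those paths shares. The device names and hop addresses are invented for illustration, and collecting the traceroutes themselves is left out.

```python
def hops_on_every_path(paths):
    """Return the hops that appear on every path, in the order of the first path.

    `paths` maps a device name to the list of hop addresses seen on the
    route from the poller to that device.
    """
    hop_lists = list(paths.values())
    if not hop_lists:
        return []
    common = set(hop_lists[0]).intersection(*hop_lists[1:])
    return [hop for hop in hop_lists[0] if hop in common]


# Hypothetical traceroute output for two devices that alerted falsely.
example_paths = {
    "netapp-sp-1": ["10.0.1.1", "10.0.0.254", "10.0.20.1"],
    "netapp-sp-2": ["10.0.1.1", "10.0.0.254", "10.0.20.2"],
}

suspects = hops_on_every_path(example_paths)
print(suspects)
```

Any hop that shows up for every affected device, but not for devices that never alert, is a reasonable first place to look for drops or overload.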

  • Database is on its own VM

    Orion VMware resources:
    4 CPUs
    20 GB RAM
    200 GB storage/hard drive

    Orion SQL VMware resources:
    4 CPUs
    6 GB RAM
    600 GB storage/hard drive

    I am in charge of everything, so I don't have to contend with network naysayers or server naysayers ... it is on me!!! Do you think crossing the VLANs might be the issue? It kind of jibes with the TimeSkew issue I see. I have all networking equipment managed and polled via a management VLAN. The data center and "servers" are on a different VLAN.

    I do appreciate your feedback!  Every push helps!

  • It should not be. I monitor equipment across thousands (4954) of VLANs. I would look first at the router that connects your VLANs to see if it is having problems.

    My database server is physical -- I'll leave discussion of that to someone who might have more input on that.

  • So ... Data Center to Core Switch: 10G backplane, error free, tight config (really proud of the network; SolarWinds tells me so). So, back to the SQL server: looking at the stats I sent you, would you add additional resources to the SQL server? I noticed right away when I moved the Orion application server from physical to virtual that I had to throw more resources at it. Now it performs so much better than it did on a physical box. What do you think? I am definitely not a SQL or database person, but I do have their support when I ask! I don't see the situation surface often ... I HATE FALSE POSITIVES! I know I can make it better! Really proud of the SolarWinds deployment; my site is ROCKING, except for a slight SW glitch every now and then! I don't see this issue building up like a runaway roller coaster the way you described above.

    Thanks again for your input!