Question regarding polling instability.


We have a pretty extensive infrastructure with all sorts of devices and OS versions. We are at around 50k elements and counting as the infrastructure continues to change and grow. Along the way I've noticed a weird error that I cannot for the life of me get others to accept as a valid problem. They just want to point the finger at SolarWinds and say the software is broken, when in reality it's a configuration error somewhere in the infrastructure causing this problem.

I have an old SolarWinds environment and a new environment that was built separately. We have many network devices that work on one environment and fail on the other, and vice versa. Every couple of days a device switches which environment it works on, and this is causing gaps in data. Has anyone ever experienced such a thing? The devices are currently being monitored by both environments, and I think this may be causing an issue with SNMP monitoring. To be clear, by "working" I mean SNMP monitoring: if you run SNMP tests, one day it fails, a few days down the line it works, then a few days later it fails again. And when it doesn't work on one environment, it works on the other. It is a really strange issue.

My thought is that some kind of filtering is happening. There may be a firewall, a Riverbed device, or a Palo Alto that's interfering with the packets. When it sees requests coming from two different sources, it is probably flagging one source as suspect and blocking that traffic, and it keeps flipping around on me. When it fails on either SolarWinds environment, a pcap shows nothing coming back from the device; the request just times out. When it works, everything works.
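
To make the either/or pattern concrete, the SNMP test results from both environments could be lined up side by side. This is only a minimal sketch, assuming each environment's test results have been exported as timestamp-to-pass/fail pairs; the function name `find_flips` and the sample data are hypothetical, not anything SolarWinds provides.

```python
from datetime import datetime


def find_flips(old_env, new_env):
    """Return (timestamp, working_env) each time the working environment changes.

    old_env / new_env: dicts mapping a poll timestamp to True (SNMP test
    passed) or False (timed out). Only timestamps present in both dicts
    are compared.
    """
    flips = []
    previous = None
    for ts in sorted(set(old_env) & set(new_env)):
        old_ok, new_ok = old_env[ts], new_env[ts]
        if old_ok == new_ok:
            continue  # both pass or both fail: not the either/or pattern
        working = "old" if old_ok else "new"
        if working != previous:
            flips.append((ts, working))
            previous = working
    return flips


# Hypothetical sample data: one test per day for a single device.
def day(n):
    return datetime(2024, 1, n)


old = {day(1): True, day(2): True, day(3): False, day(4): False, day(5): True}
new = {day(1): False, day(2): False, day(3): True, day(4): True, day(5): True}

for ts, env in find_flips(old, new):
    print(ts.date(), "only the", env, "environment gets SNMP responses")
```

The flip timestamps are what I would hand to the network team, so they can check for a change on the filtering devices around those times.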

There is a mix of TippingPoint devices, Palo Alto devices, F5 devices, and firewalls within the communication paths to these devices. It's not a straight shot from the SolarWinds server to the device.

We are trying to bring the new deployment live but can't because of these instabilities, which affect both environments.

Any help at all is appreciated.

  • Is it just random devices all over, or are there a few common culprits? I ask because if it's completely random, it will probably be a lot harder to pin down. I'd expect you are on the right trail; SNMP itself is not really that complicated. Like you said, you can review the captures and see pretty clearly when the devices stop responding. The tricky part is figuring out what changes in the environment are happening at those times. It's probably a pain, but I would be gathering logs from the firewall devices between your poller and endpoints to verify that they are seeing the traffic and aren't rejecting it.

    Depending on what equipment is available to you, I'd try to grab a capture from the poller or as near to it as possible, and also a capture from an endpoint (or as near to it as you can) that's giving you SNMP problems. Confirm whether the polling packets are arriving; if they aren't, you have to dig through the firewalls and such along the path to figure out what is jamming them up. You might also try setting up a NetPath probe between the poller and the problem endpoint on a port you would expect to be open, to see if you pick up anything weird there. Since SNMP usually runs only over UDP, this wouldn't be an exact match, but you might find something that correlates with your missing SNMP packets.

    I have a custom SWQL resource you can set up that might help as well. It identifies nodes that show as up in ICMP but have stopped responding to SNMP requests. The same logic could be set up as an alert, so you get notified right away when there is a gap in polling and can begin troubleshooting while the problem is hot.

    SELECT n.caption as [Node]
    ,n.detailsurl as [_linkfor_Node]
    ,'/Orion/images/StatusIcons/Small-' + n.StatusIcon AS [_IconFor_Node]
    ,n.ip_address as [IP Address]
    ,n.detailsurl as [_linkfor_IP Address]
    ,n.statusdescription as [Status Description]
    ,n.objectsubtype as [Collection Type]
    ,n.statcollection as [Interval]
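    ,case when n.objectsubtype !='SNMP' then 'Not Used'
    -- assumption: the column compared here was lost from the post; n.community is a guess
    when n.community='' then 'Not Used'
    else n.community
    end as [SNMPv2 Community]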
    ,case when c.Name is null then 'Not Used'
    else c.Name
    end AS [WMI/SNMPv3 Credential]
    ,tolocal(n.lastsystemuptimepollutc) as [Last Stat Collection]
    ,tolocal(n.lastsync) as [Last Ping]
    ,daydiff(lastsystemuptimepollUTC,getutcdate()) as [Days Since Polled]
    ,'Edit' AS [Edit]
    , '/Orion/Nodes/NodeProperties.aspx?Nodes=' + ToString(n.NodeID) AS [_LinkFor_Edit]
    ,'/Orion/images/nodemgmt_art/icons/icon_edit.gif' as [_IconFor_Edit]

    from orion.nodes n
    left JOIN Orion.NodeSettings ns ON n.NodeID = ns.NodeID and SettingName like '%Credential%'  
    left JOIN Orion.Credential c ON ns.SettingValue = c.ID  
    join Orion.Engines e on e.engineid=n.engineid
    where status<>'2'
    and status<>'9'
    and objectsubtype!='ICMP'
    and minutediff(lastsystemuptimepollUTC,getutcdate())>20

    Order by Lastsystemuptimepollutc
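
The two-capture comparison suggested above (one capture near the poller, one near the endpoint) can be sketched as follows. This is a rough illustration, assuming the SNMP request IDs seen at each capture point have already been exported into sets; the function name `classify_polls` and the sample IDs are hypothetical.

```python
def classify_polls(poller_requests, endpoint_requests,
                   endpoint_responses, poller_responses):
    """Map each SNMP request ID sent by the poller to where it was last seen.

    Each argument is a set of request IDs observed in one direction at one
    capture point (requests leaving the poller, requests arriving at the
    endpoint, responses leaving the endpoint, responses arriving at the poller).
    """
    result = {}
    for req_id in sorted(poller_requests):
        if req_id not in endpoint_requests:
            result[req_id] = "request lost before endpoint"
        elif req_id not in endpoint_responses:
            result[req_id] = "endpoint did not answer"
        elif req_id not in poller_responses:
            result[req_id] = "response lost on return path"
        else:
            result[req_id] = "ok"
    return result


# Hypothetical request IDs pulled from the two captures.
outcome = classify_polls(
    poller_requests={1, 2, 3, 4},
    endpoint_requests={1, 2, 3},
    endpoint_responses={1, 2},
    poller_responses={1},
)
for req_id, where in outcome.items():
    print(req_id, where)
```

Splitting the failures by direction narrows down which devices along the path are worth pulling logs from.
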
  • Do you monitor over the WAN, and do you use some form of WAN optimization? With one of my clients, I found they were getting strange SNMP errors after implementing WAN optimization boxes. What does your "Polling Completion" value say?

  • I like the way you construct links for various purposes. Could you please show (with a screenshot) how you configured the [Edit] link? I mean as seen in the web-based report config.

  • This query is a good one to demo that


    SELECT n.caption as Node
    ,n.detailsurl as [_linkfor_Node]
    ,'/Orion/images/StatusIcons/Small-' + n.StatusIcon AS [_IconFor_Node]
    ,n.ip_address as [IP Address]
    ,n.detailsurl as [_linkfor_IP Address]
    ,'Edit' AS [Edit]
    ,'/Orion/Nodes/NodeProperties.aspx?Nodes=' + ToString(n.NodeID) AS [_LinkFor_Edit]
    ,'/Orion/images/nodemgmt_art/icons/icon_edit.gif' as [_IconFor_Edit]
    ,'List' AS [Resources]
    ,'/Orion/Nodes/ListResources.aspx?Nodes=' + ToString(n.NodeID) AS [_LinkFor_Resources]
    ,'/Orion/images/nodemgmt_art/icons/icon_list.gif' as [_IconFor_Resources]
    ,'Assign' AS [Pollers]
    ,'/Orion/NPM/NodeCustomPollers.aspx?Nodes=' + ToString(n.NodeID) AS [_LinkFor_Pollers]
    ,'/Orion/images/device_poller.png' as [_IconFor_Pollers]

    from orion.nodes n

    I also have some resources that I like to link into other resources I build so you can use them to drill down, such as this example node status summary


    click on the down nodes and it takes you to


    SWQL below

    --Node Summary
    SELECT Count(*) as TOTAL
    ,'/Orion/images/StatusIcons/ContainerMembers/DefaultIcon.gif' as [_iconfor_TOTAL]
    ,'/orion/nodes/default.aspx' as [_linkfor_TOTAL]
    ,(SELECT Count(*) as UP FROM Orion.Nodes Where Status = 1) as UP
    , '/Orion/images/StatusIcons/Small-Up.gif' as [_iconfor_UP]
    ,'/orion/nodes/default.aspx' as [_linkfor_UP]
    ,(SELECT Count(*) as DOWN FROM Orion.Nodes Where Status = 2) as DOWN
    ,'/Orion/images/StatusIcons/Small-Down.gif' as [_iconfor_DOWN]
    ,concat('/Orion/DetachResource.aspx?ResourceID=',(SELECT top 1 ResourceID FROM Orion.Resources WHERE resourcetitle LIKE '%Down Nodes with Duration%'),'&NetObject=&') as [_linkfor_DOWN]
    ,(SELECT Count(*) as UNMANAGED FROM Orion.Nodes Where Status = 9) as UNMANAGED
    ,'/Orion/images/StatusIcons/Small-Unmanaged.gif' as [_iconfor_UNMANAGED]
    ,concat('/Orion/DetachResource.aspx?ResourceID=',(SELECT top 1 ResourceID FROM Orion.Resources WHERE resourcetitle LIKE '%Unmanaged Nodes%' and resourcename='Custom Query'),'&NetObject=&') as [_linkfor_UNMANAGED]
    ,(SELECT Count(*) as OTHER FROM Orion.Nodes Where Status not in (1,2,9)) as OTHER
    ,'/Orion/images/StatusIcons/Small-NotRunning.gif' as [_iconfor_OTHER]
    ,concat('/Orion/DetachResource.aspx?ResourceID=',(SELECT top 1 ResourceID FROM Orion.Resources WHERE resourcetitle LIKE '%Other Nodes%'),'&NetObject=&') as [_linkfor_OTHER]
    FROM Orion.Nodes

    --Down nodes
    SELECT n.caption as Node
    ,n.detailsurl as [_linkfor_Node]
    ,'/Orion/images/StatusIcons/Small-' + n.StatusIcon AS [_IconFor_Node]
    ,n.ip_address as [IP Address]
    ,n.detailsurl as [_linkfor_IP Address]
    ,case
    when n.LASTSYSTEMUPTIMEPOLLUTC is null then 'Never'
    else tostring(ToLocal(n.LASTSYSTEMUPTIMEPOLLUTC))
    end as [Last Polled]
    ,case
    --when n.LASTSYSTEMUPTIMEPOLLUTC is null then 'Forever'
    when minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())>1440 then round(minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())/1440.0,1)
    when minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())>60 then round(minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())/60.0,1)
    -- an hour or less: report plain minutes, matching the 'Minutes' unit below
    else minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())
    end as [Time Down]
    ,case
    when n.LASTSYSTEMUPTIMEPOLLUTC is null then ''
    when minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())>1440 then 'Days'
    when minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())>60 then 'Hours'
    else 'Minutes'
    end as [ ]

    from orion.nodes n
    Where status=2

    -Marc Netterfield

        Loop1 Systems: SolarWinds Training and Professional Services

  • Oh wow, this is super cool, thanks a lot for sharing. I was just wondering, what resource do you feed this script into?

  • OK, I have figured it out: it is Custom Query :) I love it, thanks a lot.

  • Nice resource. But how would one add the "Poll" button to the list?

  • I'm not sure it can be done this way. The Poll Now, Rediscover, and Maintenance actions all kick off JavaScript bits like this one, and I don't think the SWQL _linkfor_ function can parse them:

    <a href="javascript:;" onclick="return showPollNowDialog(['N:21']);" id="pollNowLnk"><img src="/Orion/images/pollnow_16x16.gif" alt=""> Poll Now</a>

  • The interesting part is that we do have Riverbed devices as well as Palo Alto and TippingPoint devices. All of these filter packets in one form or another. I've spoken with my network group, and they have dug into the logs on these machines and tell me there are no entries indicating packets are being rejected.

    We also have Cisco ASAs, and those logs were checked and showed no sign of trouble. When the problem happens, a pcap from the primary shows the packet being put on the wire and sent, with no response coming back. We did a test where a network admin watched in real time while I performed an SNMP walk, and they advised that the packet goes through without a problem. It reaches the device, the device responds, and the response packets make it through the filter but never make it to the server.

    I even went as far as looking into the VM side of the equation to see if there was a bad setting on my server VMs or a network setting problem, but couldn't find any, and the packets trace all the way back to the server. Yet the pcap shows no proof of the server receiving them.

    Our company uses the McAfee antivirus suite (ePO Enterprise). I've had a call with my security guys where McAfee reps were on the line. They helped me place exceptions, but the product acts as a layer 2 firewall, meaning anything coming in or going out passes through it regardless of exceptions. An exception, by the explanation I got, tells the software to take no action regardless of what is found.

    In the end, everyone says it's not them. They say everything is working and point the finger back at my team, saying maybe SolarWinds is malfunctioning, which never ends up being the case. Haha.

  • Would it be possible to switch off any optimization for SNMP and let that protocol run unfiltered/unchecked?

    maybe​ wants to share his experience with WAN optimization.