Question regarding polling instability.

Hi,

We have an extensive infrastructure with many different device types and OS versions. We are at around 50k elements and counting as the infrastructure continues to change and grow. Along the way I've noticed a strange error that I cannot for the life of me get others to accept as a valid problem. They just want to point the finger at SolarWinds and say the software is broken, when in reality it's a configuration error somewhere in the infrastructure that is causing this problem.

I have an old SolarWinds environment and a new environment that was built separately. Many of our network devices work on one environment and fail on the other, and vice versa. Every couple of days a device switches which environment it works on, and this is causing gaps in data. Has anyone ever experienced such a thing? The devices are currently being monitored by two separate environments. But I think this would cause an issue in SNMP monitoring. To be clear, by "working" I mean the SNMP monitoring. If you run SNMP tests, you'll see that a device fails one day, works a few days down the line, then fails again a few days later. And when it doesn't work on one environment, it works on the other. It is a really strange issue.

My thought is that there is filtering of some kind happening. There may be a firewall, a Riverbed device, or a Palo Alto that's interfering with the packets and with how things are being monitored. When it sees requests coming from two different sources, it is probably flagging one source as suspect and blocking that traffic, and it keeps flipping between the two. When polling fails on either SolarWinds environment, a pcap shows nothing coming back from the device; the request just times out. When it works, everything works.
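
One way to put numbers behind this theory is to log the SNMP test result from both environments for the same node and count how often exactly one side works. The sketch below is only an illustration: the sample data, the function names, and the 90% threshold are all made up for the example. If nearly every sample shows one side working and the other failing, that points toward something in the path treating the two sources differently; if both sides fail together, the device or a shared path segment is the more likely suspect.

```python
from collections import Counter


def classify_samples(samples):
    """Count how often each (old_ok, new_ok) combination occurs.

    `samples` is an iterable of (label, old_ok, new_ok) tuples, where the
    booleans say whether the SNMP test succeeded from each environment.
    """
    names = {
        (True, True): "both_ok",
        (True, False): "only_old_ok",
        (False, True): "only_new_ok",
        (False, False): "both_failed",
    }
    return Counter(names[(bool(old_ok), bool(new_ok))] for _, old_ok, new_ok in samples)


def looks_mutually_exclusive(samples, threshold=0.9):
    """True when at least `threshold` of the samples have exactly one side working."""
    counts = classify_samples(samples)
    total = sum(counts.values())
    if total == 0:
        return False
    exclusive = counts["only_old_ok"] + counts["only_new_ok"]
    return exclusive / total >= threshold


# Hypothetical results for one node: one SNMP test per day from each environment.
samples = [
    ("day 1", True, False),
    ("day 2", True, False),
    ("day 3", False, True),
    ("day 4", False, True),
    ("day 5", True, False),
]
print(classify_samples(samples))
print(looks_mutually_exclusive(samples))
```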

There is a mix of TippingPoint, Palo Alto, F5, and firewall devices within the communication paths to these devices. It's not a straight shot from the SolarWinds server to the device.

We are trying to bring the new deployment live but can't because of these instabilities, which involve both environments.

Any help at all is appreciated.

Parents
  • Is it just random devices all over, or are there a few common culprits? I ask because if it's completely random, it will probably be a lot harder to pin down. I'd expect you are on the right trail; SNMP itself is not really that complicated. Like you said, you can review the captures and see pretty clearly when the devices stop responding. The tricky part is figuring out what changes in the environment at those times. It's probably a pain, but I would gather logs from the firewall devices between your poller and endpoints to verify that they are seeing the traffic and aren't rejecting it.

    Depending on what equipment is available to you, I'd try to grab a capture from the poller, or as near to it as possible, and another from an endpoint (or as near to it as you can) that's giving you SNMP problems. Confirm whether the polling packets are arriving; if they aren't, you have to dig through the firewalls and other devices along the path to figure out what is jamming them up. You might also try setting up a NetPath probe between the poller and a problem endpoint, on a port you would expect to be open, to see if you pick up anything odd there. Since SNMP usually runs over UDP, this wouldn't be an exact match, but you might find something that correlates with your missing SNMP packets.
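
As a rough illustration of the two-capture comparison, the sketch below sorts SNMP request IDs by how far each one got. It assumes the request IDs are readable in the captures (for example SNMPv2c, or SNMPv3 without encryption) and have already been exported into plain lists; the function name and the sample IDs are hypothetical.

```python
def compare_captures(poller_requests, endpoint_requests, poller_responses):
    """Sort SNMP request IDs by how far each one got.

    All three arguments are iterables of request IDs taken from the captures:
    requests leaving the poller, requests seen near the endpoint, and
    responses that made it back to the poller.
    """
    sent = set(poller_requests)
    arrived = set(endpoint_requests)
    answered = set(poller_responses)
    return {
        # Left the poller but never showed up near the device.
        "dropped_on_the_way": sorted(sent - arrived),
        # Reached the device side, but no reply was seen at the poller:
        # either the device did not answer or the reply was lost on the way back.
        "no_reply_seen": sorted((sent & arrived) - answered),
        # Request went out and a response came back.
        "answered": sorted(sent & answered),
    }


# Hypothetical request IDs pulled from the two captures.
result = compare_captures(
    poller_requests=[101, 102, 103, 104],
    endpoint_requests=[101, 102, 104],
    poller_responses=[101],
)
print(result)
```

Requests that land in the first bucket point at a device between the poller and the node; requests in the second bucket need a closer look at the node itself and at the return path.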

    I have a custom SWQL resource you can set up that might help as well. It identifies nodes that show as up in ICMP but have stopped responding to SNMP requests. The same logic could be turned into an alert, so you get notified right away when there is a gap in polling and can begin troubleshooting while the problem is hot.

    select
    n.caption as [Node]
    ,n.detailsurl as [_linkfor_Node]
    ,'/Orion/images/StatusIcons/Small-' + n.StatusIcon AS [_IconFor_Node]
    ,n.ip_address as [IP Address]
    ,n.detailsurl as [_linkfor_IP Address]
    ,n.statusdescription as [Status Description]
    ,n.objectsubtype as [Collection Type]
    ,e.servername
    ,n.statcollection as [Interval]
    ,case when n.objectsubtype !='SNMP' then 'Not Used'
    when n.community='' then 'Not Used'
    else n.community
    end as [SNMPv2 Community]
    ,case when c.Name is null then 'Not Used'
    else c.Name
    end AS [WMI/SNMPv3 Credential]
    ,tolocal(n.lastsystemuptimepollutc) as [Last Stat Collection]
    ,tolocal(n.lastsync) as [Last Ping]
    ,daydiff(lastsystemuptimepollUTC,getutcdate()) as [Days Since Polled]
    ,'Edit' AS [Edit]
    , '/Orion/Nodes/NodeProperties.aspx?Nodes=' + ToString(n.NodeID) AS [_LinkFor_Edit]
    ,'/Orion/images/nodemgmt_art/icons/icon_edit.gif' as [_IconFor_Edit]


    from orion.nodes n
    left JOIN Orion.NodeSettings ns ON n.NodeID = ns.NodeID and SettingName like '%Credential%'  
    left JOIN Orion.Credential c ON ns.SettingValue = c.ID  
    join Orion.Engines e on e.engineid=n.engineid
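    -- skip nodes that are Down (2) or Unmanaged (9)
    where status<>'2'
    and status<>'9'
    -- skip ICMP-only nodes
    and objectsubtype!='ICMP'
    -- keep nodes whose last stat collection was more than 20 minutes ago
    and minutediff(lastsystemuptimepollUTC,getutcdate())>20

    Order by Lastsystemuptimepollutc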
  • I like the way you construct links for various purposes. Could you please show (with a screenshot) how you configured the [Edit] link? (I mean as seen in the web-based report config.)

Reply Children
  • This query is a good one to demo that

    pastedImage_3.png

    SELECT
    n.caption as Node
    ,n.detailsurl as [_linkfor_Node]
    ,'/Orion/images/StatusIcons/Small-' + n.StatusIcon AS [_IconFor_Node]
    ,n.ip_address as [IP Address]
    ,n.detailsurl as [_linkfor_IP Address]
    ,'Edit' AS [Edit]
    ,'/Orion/Nodes/NodeProperties.aspx?Nodes=' + ToString(n.NodeID) AS [_LinkFor_Edit]
    ,'/Orion/images/nodemgmt_art/icons/icon_edit.gif' as [_IconFor_Edit]
    ,'List' AS [Resources]
    ,'/Orion/Nodes/ListResources.aspx?Nodes=' + ToString(n.NodeID) AS [_LinkFor_Resources]
    ,'/Orion/images/nodemgmt_art/icons/icon_list.gif' as [_IconFor_Resources]
    ,'Assign' AS [Pollers]
    ,'/Orion/NPM/NodeCustomPollers.aspx?Nodes=' + ToString(n.NodeID) AS [_LinkFor_Pollers]
    ,'/Orion/images/device_poller.png' as [_IconFor_Pollers]

    from orion.nodes n

    I also have some resources that I like to link into other resources I build, so you can use them to drill down. One example is this node status summary:

    pastedImage_5.png

    Click on the down nodes and it takes you to:

    pastedImage_6.png

    SWQL below:

    --Node Summary
    SELECT Count(*) as TOTAL
    ,'/Orion/images/StatusIcons/ContainerMembers/DefaultIcon.gif' as [_iconfor_TOTAL]
    ,'/orion/nodes/default.aspx' as [_linkfor_TOTAL]
    ,(SELECT Count(*) as UP FROM Orion.Nodes Where Status = 1) as UP
    , '/Orion/images/StatusIcons/Small-Up.gif' as [_iconfor_UP]
    ,'/orion/nodes/default.aspx' as [_linkfor_UP]
    ,(SELECT Count(*) as DOWN FROM Orion.Nodes Where Status = 2) as DOWN
    ,'/Orion/images/StatusIcons/Small-Down.gif' as [_iconfor_DOWN]
    ,concat('/Orion/DetachResource.aspx?ResourceID=',(SELECT top 1 ResourceID FROM Orion.Resources WHERE resourcetitle LIKE '%Down Nodes with Duration%'),'&NetObject=&') as [_linkfor_DOWN]
    ,(SELECT Count(*) as UNMANAGED FROM Orion.Nodes Where Status = 9) as UNMANAGED
    ,'/Orion/images/StatusIcons/Small-Unmanaged.gif' as [_iconfor_UNMANAGED]
    ,concat('/Orion/DetachResource.aspx?ResourceID=',(SELECT top 1 ResourceID FROM Orion.Resources WHERE resourcetitle LIKE '%Unmanaged Nodes%' and resourcename='Custom Query'),'&NetObject=&') as [_linkfor_UNMANAGED]
    ,(SELECT Count(*) as OTHER FROM Orion.Nodes Where Status not in (1,2,9)) as OTHER
    ,'/Orion/images/StatusIcons/Small-NotRunning.gif' as [_iconfor_OTHER]
    ,concat('/Orion/DetachResource.aspx?ResourceID=',(SELECT top 1 ResourceID FROM Orion.Resources WHERE resourcetitle LIKE '%Other Nodes%'),'&NetObject=&') as [_linkfor_OTHER]
    FROM Orion.Nodes

    --Down nodes
    SELECT
    n.caption as Node
    ,n.detailsurl as [_linkfor_Node]
    ,'/Orion/images/StatusIcons/Small-' + n.StatusIcon AS [_IconFor_Node]
    ,n.ip_address as [IP Address]
    ,n.detailsurl as [_linkfor_IP Address]
    ,case
    when n.LASTSYSTEMUPTIMEPOLLUTC is null then 'Never'
    else tostring(ToLocal(n.LASTSYSTEMUPTIMEPOLLUTC))
    end as [Last Polled]
    ,CASE
    --when n.LASTSYSTEMUPTIMEPOLLUTC is null then 'Forever'
    when minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())>1440 then round(minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())/1440.0,1)
    when minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())>60 then round(minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())/60.0,1)
    else minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())
    end as [Time Down]
    ,CASE
    when n.LASTSYSTEMUPTIMEPOLLUTC is null then ''
    when minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())>1440 then 'Days'
    when minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())>60 then 'Hours'
    else 'Minutes'
    end as [ ]

    from orion.nodes n
    Where status=2
    order by LASTSYSTEMUPTIMEPOLLUTC desc
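
For anyone adapting the [Time Down] column, the unit selection in the two CASE blocks can be read as the small function below. This is just a Python paraphrase to make the thresholds easy to check, not part of the resource; the function name is made up, the handling of a never-polled node is an assumption about how the query treats NULL, and Python's rounding may differ from SWQL's on exact ties.

```python
def time_down(minutes):
    """Return (value, unit) following the [Time Down] and unit CASE blocks."""
    if minutes is None:
        # Never polled: assumed to yield an empty value and a blank unit.
        return (None, "")
    if minutes > 1440:
        # More than a day: show days with one decimal place.
        return (round(minutes / 1440.0, 1), "Days")
    if minutes > 60:
        # More than an hour: show hours with one decimal place.
        return (round(minutes / 60.0, 1), "Hours")
    # An hour or less: show the raw minute count.
    return (minutes, "Minutes")


for sample in (45, 90, 1440, 2880, None):
    print(sample, time_down(sample))
```

Note that the comparisons are strict, so exactly 60 minutes still shows as minutes and exactly 1440 minutes shows as 24.0 hours.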

    -Marc Netterfield

        Loop1 Systems: SolarWinds Training and Professional Services

  • Oh wow, this is super cool, thanks a lot for sharing. I was just wondering, what resource do you feed this script into?

  • ok, I have figured it out - it is Custom Query. I love it, thanks a lot.

  • Nice resource. But how would one add the "Poll" button to the list?

  • I'm not sure it can be done this way. Poll Now, Rediscover, and Maintenance all kick off JavaScript bits like this one, and I don't think the SWQL _linkfor_ prefix can parse them:

    <a href="javascript:;" onclick="return showPollNowDialog(['N:21']);" id="pollNowLnk"><img src="/Orion/images/pollnow_16x16.gif" alt=""> Poll Now</a>