Question regarding polling instability.

lcsw2013 over 6 years ago

Hi,

Our infrastructure is fairly extensive, with many different types of devices, OS versions, and so on. We are at an element count of around 50k and counting as the infrastructure continues to change and grow. Along the way I've noticed a weird error that I cannot for the life of me get others to accept as a valid problem. They just want to point the finger at SolarWinds and say the software is broken, when in reality it's a configuration error somewhere in the infrastructure causing this problem.

I have an old SolarWinds environment and a new environment that was built separately. Many network devices work on one environment and fail on the other, and vice versa. Every couple of days a device switches which environment it works on, and this is causing gaps in data. Has anyone ever experienced such a thing? The devices are currently being monitored by two separate environments. But I think this would cause an issue in SNMP monitoring. To be clear, by "working" I mean SNMP monitoring. If you run SNMP tests, one day it fails, a few days down the line it works, then a few days later it fails again. And when it doesn't work on one environment, it works on the other. It is a really strange issue.

My thought is that some kind of filtering is happening. There may be a firewall, a Riverbed device, or a Palo Alto that's interfering with the packets and how things are being monitored. When it sees requests coming from two different sources, it is probably flagging one source as suspect and blocking that traffic, and which source gets blocked keeps flipping around on me. When it fails on either SolarWinds environment, a pcap shows nothing coming back from the device; the request just times out. When it works, everything works.
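
One way to check the "only one source is allowed at a time" idea is to line up the SNMP test results from both environments by time and look for windows where exactly one of them fails. This is a minimal Python sketch, assuming the test results can be exported as (timestamp, success) pairs; the function name, the 5-minute bucket, and the input format are all hypothetical and not part of any SolarWinds tooling:

```python
from datetime import datetime, timedelta


def exclusive_failures(old_env, new_env, bucket=timedelta(minutes=5)):
    """Return (bucket_start, failing_env) for buckets where exactly one environment failed.

    old_env / new_env: iterables of (datetime, bool) pairs, True = SNMP test succeeded.
    Hypothetical input format; adapt to however you export your test results.
    """
    def to_buckets(results):
        buckets = {}
        for ts, ok in results:
            # Round the timestamp down to the start of its bucket.
            start = datetime.min + ((ts - datetime.min) // bucket) * bucket
            # A bucket counts as OK only if every test in it succeeded.
            buckets[start] = buckets.get(start, True) and ok
        return buckets

    old_b, new_b = to_buckets(old_env), to_buckets(new_env)
    flips = []
    # Only compare buckets in which both environments ran a test.
    for start in sorted(old_b.keys() & new_b.keys()):
        if old_b[start] != new_b[start]:
            flips.append((start, "old" if not old_b[start] else "new"))
    return flips
```

If the failing environment in that list alternates over days while the other keeps working, that pattern is at least consistent with something on the path blocking one source at a time. It does not prove it; the captures and firewall logs are still needed.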

There is a mix of TippingPoint, Palo Alto, F5, and firewall devices within the communication paths to these devices. It's not a straight shot from the SolarWinds server to the device.

We are trying to bring the new deployment live but can't because of these instabilities, which involve both environments.

Any help at all is appreciated.

0 mesverrum over 6 years ago
Is it just random devices all over, or are there a few common culprits? I ask because if it's completely random, it will probably be a lot harder to pin down. I'd expect you are on the right trail; SNMP itself is not really that complicated. Like you said, you can review the captures and see pretty clearly when the devices stop responding. The tricky part is figuring out what changes in the environment are happening at those times. It's probably a pain, but I would gather logs from the firewall devices between your poller and endpoints to verify that they are seeing the traffic and aren't rejecting it.
Depending on what equipment is available to you, I'd try to grab a capture from the poller, or as near to it as possible, and another from an endpoint (or as near to it as you can) that's giving you SNMP problems. Confirm whether or not the polling packets are arriving; if they aren't, you have to dig through the firewalls and such along the path to figure out what is jamming them up. You might also try setting up a NetPath probe between the poller and a problem endpoint on a port you would expect to be open, to see if you pick up anything weird there. Since SNMP usually runs over UDP, this wouldn't be an exact match, but you might find something that correlates with your missing SNMP packets.
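
The two-capture comparison can be reduced to a simple set check once the SNMP request IDs have been pulled out of each capture. This is a rough Python sketch, assuming you already have the request IDs seen in each direction at each capture point; the function name and the four inputs are hypothetical, and how you extract the IDs from the pcaps is up to you:

```python
def locate_loss(poller_requests, endpoint_requests, endpoint_responses, poller_responses):
    """Classify each request ID the poller sent by where it went missing.

    Each argument is a set of SNMP request IDs seen at that capture point
    (hypothetical inputs, extracted from the pcaps beforehand).
    """
    report = {
        "ok": [],                    # request and response seen at both ends
        "request_dropped": [],       # left the poller, never reached the endpoint
        "no_reply_from_device": [],  # reached the endpoint, device sent nothing back
        "response_dropped": [],      # device replied, reply never reached the poller
    }
    for rid in sorted(poller_requests):
        if rid not in endpoint_requests:
            report["request_dropped"].append(rid)
        elif rid not in endpoint_responses:
            report["no_reply_from_device"].append(rid)
        elif rid not in poller_responses:
            report["response_dropped"].append(rid)
        else:
            report["ok"].append(rid)
    return report
```

Anything in request_dropped or response_dropped points at a device on the path, while no_reply_from_device points at the endpoint itself (for example its SNMP access list).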
I have a custom SWQL resource you can set up that might help as well. It identifies nodes that show as up in ICMP but have stopped responding to SNMP requests. The same logic could be set up as an alert, so you get notified right away when there is a gap in polling and can begin troubleshooting while the problem is hot.
select
n.caption as [Node]
,n.detailsurl as [_linkfor_Node]
,'/Orion/images/StatusIcons/Small-' + n.StatusIcon AS [_IconFor_Node]
,n.ip_address as [IP Address]
,n.detailsurl as [_linkfor_IP Address]
,n.statusdescription as [Status Description]
,n.objectsubtype as [Collection Type]
,e.servername
,n.statcollection as [Interval]
,case when n.objectsubtype !='SNMP' then 'Not Used'
when n.community='' then 'Not Used'
else n.community
end as [SNMPv2 Community]
,case when c.Name is null then 'Not Used'
else c.Name
end AS [WMI/SNMPv3 Credential]
,tolocal(n.lastsystemuptimepollutc) as [Last Stat Collection]
,tolocal(n.lastsync) as [Last Ping]
,daydiff(lastsystemuptimepollUTC,getutcdate()) as [Days Since Polled]
,'Edit' AS [Edit]
, '/Orion/Nodes/NodeProperties.aspx?Nodes=' + ToString(n.NodeID) AS [_LinkFor_Edit]
,'/Orion/images/nodemgmt_art/icons/icon_edit.gif' as [_IconFor_Edit]

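from orion.nodes n
-- pull in the WMI/SNMPv3 credential assigned to the node, if any
left JOIN Orion.NodeSettings ns ON n.NodeID = ns.NodeID and ns.SettingName like '%Credential%'
left JOIN Orion.Credential c ON ns.SettingValue = c.ID
join Orion.Engines e on e.engineid=n.engineid
-- skip down (2) and unmanaged (9) nodes, and nodes polled by ICMP only
where n.status<>2
and n.status<>9
and n.objectsubtype!='ICMP'
-- last stat collection was more than 20 minutes ago
and minutediff(n.lastsystemuptimepollUTC,getutcdate())>20

Order by n.Lastsystemuptimepollutc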

0 AlexSoul over 6 years ago in reply to mesverrum

I like the way you construct links for various purposes. Could you please show (with a screenshot) how you configured the [Edit] link? (I mean as seen in the web-based report config.)

0 mesverrum over 6 years ago in reply to AlexSoul

This query is a good one to demo that:

SELECT
n.caption as Node
,n.detailsurl as [_linkfor_Node]
,'/Orion/images/StatusIcons/Small-' + n.StatusIcon AS [_IconFor_Node]
,n.ip_address as [IP Address]
,n.detailsurl as [_linkfor_IP Address]
,'Edit' AS [Edit]
,'/Orion/Nodes/NodeProperties.aspx?Nodes=' + ToString(n.NodeID) AS [_LinkFor_Edit]
,'/Orion/images/nodemgmt_art/icons/icon_edit.gif' as [_IconFor_Edit]
,'List' AS [Resources]
,'/Orion/Nodes/ListResources.aspx?Nodes=' + ToString(n.NodeID) AS [_LinkFor_Resources]
,'/Orion/images/nodemgmt_art/icons/icon_list.gif' as [_IconFor_Resources]
,'Assign' AS [Pollers]
,'/Orion/NPM/NodeCustomPollers.aspx?Nodes=' + ToString(n.NodeID) AS [_LinkFor_Pollers]
,'/Orion/images/device_poller.png' as [_IconFor_Pollers]

from orion.nodes n

I also have some resources that I like to link to from other resources I build, so you can use them to drill down. One example is this node status summary: click on the down nodes count and it takes you to the down nodes resource. The SWQL for both is below.

--Node Summary
SELECT Count(*) as TOTAL
,'/Orion/images/StatusIcons/ContainerMembers/DefaultIcon.gif' as [_iconfor_TOTAL]
,'/orion/nodes/default.aspx' as [_linkfor_TOTAL]
,(SELECT Count(*) as UP FROM Orion.Nodes Where Status = 1) as UP
, '/Orion/images/StatusIcons/Small-Up.gif' as [_iconfor_UP]
,'/orion/nodes/default.aspx' as [_linkfor_UP]
,(SELECT Count(*) as DOWN FROM Orion.Nodes Where Status = 2) as DOWN
,'/Orion/images/StatusIcons/Small-Down.gif' as [_iconfor_DOWN]
,concat('/Orion/DetachResource.aspx?ResourceID=',(SELECT top 1 ResourceID FROM Orion.Resources WHERE resourcetitle LIKE '%Down Nodes with Duration%'),'&NetObject=&') as [_linkfor_DOWN]
,(SELECT Count(*) as UNMANAGED FROM Orion.Nodes Where Status = 9) as UNMANAGED
,'/Orion/images/StatusIcons/Small-Unmanaged.gif' as [_iconfor_UNMANAGED]
,concat('/Orion/DetachResource.aspx?ResourceID=',(SELECT top 1 ResourceID FROM Orion.Resources WHERE resourcetitle LIKE '%Unmanaged Nodes%' and resourcename='Custom Query'),'&NetObject=&') as [_linkfor_UNMANAGED]
,(SELECT Count(*) as OTHER FROM Orion.Nodes Where Status not in (1,2,9)) as OTHER
,'/Orion/images/StatusIcons/Small-NotRunning.gif' as [_iconfor_OTHER]
,concat('/Orion/DetachResource.aspx?ResourceID=',(SELECT top 1 ResourceID FROM Orion.Resources WHERE resourcetitle LIKE '%Other Nodes%'),'&NetObject=&') as [_linkfor_OTHER]
FROM Orion.Nodes

--Down nodes
SELECT
n.caption as Node
,n.detailsurl as [_linkfor_Node]
,'/Orion/images/StatusIcons/Small-' + n.StatusIcon AS [_IconFor_Node]
,n.ip_address as [IP Address]
,n.detailsurl as [_linkfor_IP Address]
,case 
when n.LASTSYSTEMUPTIMEPOLLUTC is null then 'Never'
else tostring(ToLocal(n.LASTSYSTEMUPTIMEPOLLUTC))
end as [Last Polled]
,CASE
--when n.LASTSYSTEMUPTIMEPOLLUTC is null then 'Forever'
when minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())>1440 then round(minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())/1440.0,1)
when minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())>60 then round(minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())/60.0,1)
else minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())
end as [Time Down]
,CASE
when n.LASTSYSTEMUPTIMEPOLLUTC is null then ''
when minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())>1440 then 'Days'
when minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())>60 then 'Hours'
else 'Minutes'
end as [ ]

from orion.nodes n
Where status=2
order by LASTSYSTEMUPTIMEPOLLUTC desc
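
For anyone adapting the two CASE expressions above, the unit thresholds can be restated outside SWQL. This is a small Python sketch of the same logic as I read it from the query; the function name is hypothetical, and it is only meant to make the boundaries easy to see (exactly 60 minutes still reports as Minutes, exactly 1440 as Hours):

```python
def time_down(minutes):
    """Mirror the [Time Down] and unit CASE expressions for one node.

    minutes: whole minutes since the last uptime poll, or None if never polled.
    Returns (value, unit).
    """
    if minutes is None:
        # Never polled: the query leaves the value NULL and the unit blank.
        return None, ""
    if minutes > 1440:
        return round(minutes / 1440.0, 1), "Days"
    if minutes > 60:
        return round(minutes / 60.0, 1), "Hours"
    return minutes, "Minutes"
```

For example, time_down(90) gives (1.5, "Hours") and time_down(2880) gives (2.0, "Days").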

-Marc Netterfield

Loop1 Systems: SolarWinds Training and Professional Services


0 AlexSoul over 6 years ago in reply to mesverrum

Oh, wow, this is super cool, thanks a lot for sharing. I was just wondering which resource you feed this script into?

0 AlexSoul over 6 years ago in reply to AlexSoul

OK, I have figured it out: it is the Custom Query resource. I love it, thanks a lot.