
Question regarding polling instability.

Hi,

We have an extensive infrastructure with all sorts of devices and OS versions. We are at around 50k elements and counting as our infrastructure continues to undergo changes and grow. Along the way I've noticed a weird error that I cannot for the life of me get others to accept as a valid problem. They just want to point the finger at SolarWinds and say the software is broken, when in reality it's a configuration error somewhere in the infrastructure causing this problem.

I have an old SolarWinds deployment and a new environment that was built separately. Many of our network devices work on one environment and fail on the other, and vice versa. Every couple of days a device switches which environment it works on, and this is causing gaps in data. Has anyone ever experienced such a thing? The devices are currently being monitored by both environments, and I think this may be causing an issue with SNMP monitoring. To be clear, by "working" I mean SNMP monitoring. If you run SNMP tests, you'll see that one day it fails, a few days down the line it works, then a few days later it fails again. And when it doesn't work on one environment, it works on the other. It is a really strange issue.

My thought is that there is filtering of some kind happening. There may be a firewall, a Riverbed device, or a Palo Alto that's interfering with the packets and with how things are being monitored. When it sees requests coming from two different sources, it is probably flagging one source as suspect and blocking that traffic, and it keeps flipping around on me. I say this because when polling fails on either SolarWinds environment, a pcap shows nothing coming back from the device; the request just times out. And when it works, everything works.
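To make the "flipping" pattern easier to show to other teams, the daily SNMP test results from both environments could be tallied along these lines. This is only a rough sketch: the `DailyResult` record, its field names, and the sample data are hypothetical and not taken from SolarWinds. It counts the days on which exactly one environment succeeded and how often the "working" side changed.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable


@dataclass(frozen=True)
class DailyResult:
    """One day's SNMP test outcome for a single device (hypothetical record)."""
    day: date
    old_env_ok: bool
    new_env_ok: bool


def summarize_flips(results: Iterable[DailyResult]) -> Dict[str, int]:
    """Count mutually exclusive days and how often the working side changes."""
    ordered = sorted(results, key=lambda r: r.day)
    both_ok = sum(1 for r in ordered if r.old_env_ok and r.new_env_ok)
    both_fail = sum(1 for r in ordered if not r.old_env_ok and not r.new_env_ok)
    # Days where exactly one environment could poll the device.
    winners = ["old" if r.old_env_ok else "new"
               for r in ordered if r.old_env_ok != r.new_env_ok]
    # A flip is a change of the working environment between consecutive exclusive days.
    flips = sum(1 for a, b in zip(winners, winners[1:]) if a != b)
    return {
        "days": len(ordered),
        "exclusive": len(winners),
        "both_ok": both_ok,
        "both_fail": both_fail,
        "flips": flips,
    }


# Made-up sample data for one device over six days.
sample = [
    DailyResult(date(2024, 1, 1), True, False),
    DailyResult(date(2024, 1, 2), True, False),
    DailyResult(date(2024, 1, 3), False, True),
    DailyResult(date(2024, 1, 4), False, True),
    DailyResult(date(2024, 1, 5), True, False),
    DailyResult(date(2024, 1, 6), True, True),
]
summary = summarize_flips(sample)
```

A high "exclusive" count with almost no "both_fail" days would be consistent with something in the path choosing between the two sources, rather than with the device itself being unreachable.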

There is a mix of TippingPoint, Palo Alto, F5, and firewall devices within the communication paths to these devices. It's not a straight shot from the SolarWinds server to the device.

We are trying to bring the new deployment live but can't because of these instabilities, which affect both environments.

Any help at all is appreciated.


Do you monitor over the WAN, and do you use some form of WAN optimization? With one of my clients, I found they were getting strange SNMP errors after implementing WAN optimization boxes. What does your "Polling Completion" value say?

The interesting part is that we do have Riverbed devices, as well as Palo Alto and TippingPoint devices. All of these do some form of packet filtering. I've spoken with my network group; they have dug into the logs on these machines and tell me there are no entries indicating packets are being rejected.

We also have Cisco ASAs. Those logs were checked too and show no sign of trouble. When the problem happens, a pcap from the primary shows the packet being put on the wire and sent, with no response coming back. We did a test where a network admin watched in real time while I performed an SNMP walk, and they advised that the packet goes through without a problem. It shows the request reaching the device, the device responding, and the reply making it through the filter, but it never makes it to the server.

I even went as far as looking into the VM side of the equation to see if there was a bad setting on my server VMs or a network setting problem, but I couldn't find any, and the packets appear to trace all the way back to the server. Yet the pcap shows no proof of the server receiving them.

Our company uses the McAfee antivirus suite (ePO enterprise). I've had a call with my security guys where McAfee reps were on the line. They helped me place exceptions, but it is a layer 2 firewall, meaning that anything coming in or going out goes through the firewall regardless of exceptions. An exception, by the explanation I got, tells the software to do nothing regardless of what is found.

In the end everyone says it's not them. They say everything is working and point the finger back at my team, saying maybe SolarWinds is malfunctioning, which never ends up being the case. Haha.


Would it be possible to switch off any optimization for SNMP and let that protocol run unfiltered/unchecked?

Maybe bao@xxxlutz.at wants to share his experience with WAN optimization.


I've made multiple requests and provided facts to my network teams, but they push back really hard, saying that the requests are unjustified without solid proof that this will indeed fix the issue permanently. So the answer is no. My network guys fight me on this, saying that the optimizers are configured to let the traffic through and the logs indicate that they are doing just that.

I've asked for SolarWinds traffic to be unfiltered, with straight communications for monitoring purposes, and this went as high as our VPs, who sided with the network team. It's a solid no unless I have some very strong proof that could overturn their evidence and show that this will indeed fix it; then I can do it.


Sorry you have that pain -- in my environment network management runs in its own VLANs and has QoS applied, so we get a guaranteed chunk of bandwidth (if the network is melting down due to a broadcast storm, we can at least log in to our switches and shut down offending ports -- I can't tell you how good this is on a large network).

My suggestion, for what it's worth (though this will mess with your metrics on one install), is to switch to SNMP polling for latency and packet loss measurement on one of the installs. This will be less accurate than ICMP polling, but it will give you a broader view of SNMP packet loss and round-trip times without having to turn up debugging and go log-mining. That should give you evidence to take to the network folks.
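Independently of the Orion setting, the same kind of evidence could be gathered from the polling server with a small standalone probe. The sketch below is an assumption-laden illustration, not anything SolarWinds ships: it hand-encodes a minimal SNMPv2c GET for sysUpTime (OID 1.3.6.1.2.1.1.3.0), sends it over UDP, and records round-trip time or loss. The function names, the default community handling, and the defaults for count and timeout are all made up for this example.

```python
import socket
import time
from typing import List, Optional, Tuple

SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"


def _tlv(tag: int, payload: bytes) -> bytes:
    """BER tag-length-value with short-form length only (enough for this sketch)."""
    if len(payload) >= 128:
        raise ValueError("long-form BER lengths are not handled in this sketch")
    return bytes([tag, len(payload)]) + payload


def _integer(value: int) -> bytes:
    """BER INTEGER for non-negative values."""
    return _tlv(0x02, value.to_bytes(value.bit_length() // 8 + 1, "big"))


def _oid(dotted: str) -> bytes:
    """BER OBJECT IDENTIFIER from dotted notation."""
    parts = [int(p) for p in dotted.strip(".").split(".")]
    body = bytearray([parts[0] * 40 + parts[1]])
    for sub in parts[2:]:
        chunk = [sub & 0x7F]
        sub >>= 7
        while sub:
            chunk.append((sub & 0x7F) | 0x80)
            sub >>= 7
        body.extend(reversed(chunk))
    return _tlv(0x06, bytes(body))


def build_get(community: str, request_id: int, oid: str = SYS_UPTIME_OID) -> bytes:
    """Encode an SNMPv2c GetRequest for a single OID."""
    varbind = _tlv(0x30, _oid(oid) + _tlv(0x05, b""))
    pdu = _tlv(0xA0, _integer(request_id) + _integer(0) + _integer(0) + _tlv(0x30, varbind))
    return _tlv(0x30, _integer(1) + _tlv(0x04, community.encode()) + pdu)


def probe(host: str, community: str, count: int = 10,
          timeout: float = 2.0, port: int = 161) -> List[Optional[float]]:
    """Send `count` GETs; return one RTT in seconds per request, or None if lost."""
    rtts: List[Optional[float]] = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for request_id in range(1, count + 1):
            started = time.monotonic()
            sock.sendto(build_get(community, request_id), (host, port))
            try:
                # Any datagram counts as a reply; the response is not decoded here,
                # so a late answer to an earlier request could be miscounted.
                sock.recvfrom(65535)
                rtts.append(time.monotonic() - started)
            except OSError:  # timeout or a socket-level error counts as loss
                rtts.append(None)
    return rtts


def summarize(rtts: List[Optional[float]]) -> Tuple[float, Optional[float]]:
    """Return (loss percentage, average RTT of answered requests or None)."""
    answered = [r for r in rtts if r is not None]
    loss_pct = 100.0 * (len(rtts) - len(answered)) / len(rtts) if rtts else 0.0
    avg = sum(answered) / len(answered) if answered else None
    return loss_pct, avg
```

Running something like `summarize(probe("10.0.0.1", "public"))` from each polling server at the same time (the address and community here are placeholders) would give side-by-side loss figures for the two environments.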

Richard,

That sounds interesting. I might just give it a try.

Thank you!


Is it just random devices all over, or are there a few common culprits? I ask because if it's completely random then it will probably be a lot harder to pin down. I'd expect you are on the right trail; SNMP itself is not really that complicated. Like you said, you can review the captures and see pretty clearly when the devices stop responding. The tricky part is figuring out what changes in the environment are happening at those times. It's probably a pain, but I would gather logs from the firewall devices between your poller and endpoints to verify that they are seeing the traffic and aren't rejecting it.

Depending on what equipment is available to you, I'd try to grab a capture from the poller, or as near to it as possible, and also a capture from an endpoint (or as near to it as you can) that's giving you SNMP problems. Confirm whether the polling packets are arriving; if they aren't, you just have to dig through the firewalls and such along the path to figure out what is jamming them up. You might also try setting up a NetPath probe between the poller and a problem endpoint on a port you would expect to be open, to see if you pick up anything weird there. Since SNMP usually runs only over UDP, this wouldn't be an exact match, but you might find something that correlates with your missing SNMP packets.
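Once captures exist at both ends, comparing them comes down to matching SNMP request IDs seen at each point. The sketch below assumes the request IDs have already been extracted from the captures into plain sets (for instance by exporting them from a packet analyzer); the function name, argument names, and verdict strings are hypothetical.

```python
from typing import Dict, Set


def locate_loss(poller_sent: Set[int], device_seen: Set[int],
                device_replied: Set[int], poller_received: Set[int]) -> Dict[int, str]:
    """Classify each request ID by the last point at which it was observed."""
    verdicts: Dict[int, str] = {}
    for request_id in sorted(poller_sent):
        if request_id in poller_received:
            verdicts[request_id] = "ok"
        elif request_id not in device_seen:
            verdicts[request_id] = "request lost before reaching device"
        elif request_id not in device_replied:
            verdicts[request_id] = "device saw request but did not reply"
        else:
            verdicts[request_id] = "reply lost on return path"
    return verdicts


# Made-up request IDs for illustration only.
verdicts = locate_loss(
    poller_sent={101, 102, 103, 104},
    device_seen={101, 102, 104},
    device_replied={101, 104},
    poller_received={101},
)
```

A run dominated by "reply lost on return path" would point at something between the device and the poller rather than at the device or at SolarWinds.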

I have a custom SWQL resource you can set up that might help as well. It identifies nodes that show as up in ICMP but have stopped responding to SNMP requests. The same logic could be set up as an alert, so you get notified right away when there is a gap in polling and can begin troubleshooting while the problem is hot.

-- Nodes that are not down or unmanaged but have had no successful stat poll for over 20 minutes
select
n.caption as [Node]
,n.detailsurl as [_linkfor_Node]
,'/Orion/images/StatusIcons/Small-' + n.StatusIcon AS [_IconFor_Node]
,n.ip_address as [IP Address]
,n.detailsurl as [_linkfor_IP Address]
,n.statusdescription as [Status Description]
,n.objectsubtype as [Collection Type]
,e.servername
,n.statcollection as [Interval]
,case when n.objectsubtype !='SNMP' then 'Not Used'
when n.community='' then 'Not Used'
else n.community
end as [SNMPv2 Community]
,case when c.Name is null then 'Not Used'
else c.Name
end AS [WMI/SNMPv3 Credential]
,tolocal(n.lastsystemuptimepollutc) as [Last Stat Collection]
,tolocal(n.lastsync) as [Last Ping]
,daydiff(n.lastsystemuptimepollutc,getutcdate()) as [Days Since Polled]
,'Edit' AS [Edit]
,'/Orion/Nodes/NodeProperties.aspx?Nodes=' + ToString(n.NodeID) AS [_LinkFor_Edit]
,'/Orion/images/nodemgmt_art/icons/icon_edit.gif' as [_IconFor_Edit]

from orion.nodes n
left JOIN Orion.NodeSettings ns ON n.NodeID = ns.NodeID and ns.SettingName like '%Credential%'
left JOIN Orion.Credential c ON ns.SettingValue = c.ID
join Orion.Engines e on e.engineid=n.engineid
-- skip nodes that are down (2) or unmanaged (9), and ICMP-only nodes
where n.status<>'2'
and n.status<>'9'
and n.objectsubtype!='ICMP'
-- last system uptime poll is more than 20 minutes old
and minutediff(n.lastsystemuptimepollutc,getutcdate())>20

order by n.lastsystemuptimepollutc
- Marc Netterfield, Github

I like the way you construct links for various purposes. Could you please show (with a screenshot) how you configured the [Edit] link? (I mean as seen in the web-based report config.)


This query is a good one to demo that:

[Image: pastedImage_3.png]

SELECT
n.caption as Node
,n.detailsurl as [_linkfor_Node]
,'/Orion/images/StatusIcons/Small-' + n.StatusIcon AS [_IconFor_Node]
,n.ip_address as [IP Address]
,n.detailsurl as [_linkfor_IP Address]
,'Edit' AS [Edit]
,'/Orion/Nodes/NodeProperties.aspx?Nodes=' + ToString(n.NodeID) AS [_LinkFor_Edit]
,'/Orion/images/nodemgmt_art/icons/icon_edit.gif' as [_IconFor_Edit]
,'List' AS [Resources]
,'/Orion/Nodes/ListResources.aspx?Nodes=' + ToString(n.NodeID) AS [_LinkFor_Resources]
,'/Orion/images/nodemgmt_art/icons/icon_list.gif' as [_IconFor_Resources]
,'Assign' AS [Pollers]
,'/Orion/NPM/NodeCustomPollers.aspx?Nodes=' + ToString(n.NodeID) AS [_LinkFor_Pollers]
,'/Orion/images/device_poller.png' as [_IconFor_Pollers]

from orion.nodes n

I also have some resources that I like to link into other resources I build so you can use them to drill down, such as this example node status summary:

[Image: pastedImage_5.png]

Click on the down nodes and it takes you to:

[Image: pastedImage_6.png]

SWQL below:

--Node Summary
SELECT Count(*) as TOTAL
,'/Orion/images/StatusIcons/ContainerMembers/DefaultIcon.gif' as [_iconfor_TOTAL]
,'/orion/nodes/default.aspx' as [_linkfor_TOTAL]
,(SELECT Count(*) as UP FROM Orion.Nodes Where Status = 1) as UP
, '/Orion/images/StatusIcons/Small-Up.gif' as [_iconfor_UP]
,'/orion/nodes/default.aspx' as [_linkfor_UP]
,(SELECT Count(*) as DOWN FROM Orion.Nodes Where Status = 2) as DOWN
,'/Orion/images/StatusIcons/Small-Down.gif' as [_iconfor_DOWN]
,concat('/Orion/DetachResource.aspx?ResourceID=',(SELECT top 1 ResourceID FROM Orion.Resources WHERE resourcetitle LIKE '%Down Nodes with Duration%'),'&NetObject=&') as [_linkfor_DOWN]
,(SELECT Count(*) as UNMANAGED FROM Orion.Nodes Where Status = 9) as UNMANAGED
,'/Orion/images/StatusIcons/Small-Unmanaged.gif' as [_iconfor_UNMANAGED]
,concat('/Orion/DetachResource.aspx?ResourceID=',(SELECT top 1 ResourceID FROM Orion.Resources WHERE resourcetitle LIKE '%Unmanaged Nodes%' and resourcename='Custom Query'),'&NetObject=&') as [_linkfor_UNMANAGED]
,(SELECT Count(*) as OTHER FROM Orion.Nodes Where Status not in (1,2,9)) as OTHER
,'/Orion/images/StatusIcons/Small-NotRunning.gif' as [_iconfor_OTHER]
,concat('/Orion/DetachResource.aspx?ResourceID=',(SELECT top 1 ResourceID FROM Orion.Resources WHERE resourcetitle LIKE '%Other Nodes%'),'&NetObject=&') as [_linkfor_OTHER]
FROM Orion.Nodes

--Down nodes
SELECT
n.caption as Node
,n.detailsurl as [_linkfor_Node]
,'/Orion/images/StatusIcons/Small-' + n.StatusIcon AS [_IconFor_Node]
,n.ip_address as [IP Address]
,n.detailsurl as [_linkfor_IP Address]
,case
when n.LASTSYSTEMUPTIMEPOLLUTC is null then 'Never'
else tostring(ToLocal(n.LASTSYSTEMUPTIMEPOLLUTC))
end as [Last Polled]
,CASE
--when n.LASTSYSTEMUPTIMEPOLLUTC is null then 'Forever'
when minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())>1440 then round(minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())/1440.0,1)
when minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())>60 then round(minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())/60.0,1)
else minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())
end as [Time Down]
,CASE
when n.LASTSYSTEMUPTIMEPOLLUTC is null then ''
when minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())>1440 then 'Days'
when minutediff(n.LASTSYSTEMUPTIMEPOLLUTC,GETUTCDATE())>60 then 'Hours'
else 'Minutes'
end as [ ]

from orion.nodes n
Where status=2
order by LASTSYSTEMUPTIMEPOLLUTC desc

-Marc Netterfield

    Loop1 Systems: SolarWinds Training and Professional Services

- Marc Netterfield, Github

Nice resource. But how would one add the "Poll" button to the list?


I'm not sure it can be done this way. Poll Now, Rediscover, and Maintenance all kick off JavaScript bits like this one, and I don't think the SWQL _linkfor_ function can parse them:

<a href="javascript:;" onclick="return showPollNowDialog(['N:21']);" id="pollNowLnk"><img src="/Orion/images/pollnow_16x16.gif" alt=""> Poll Now</a>

- Marc Netterfield, Github

Oh wow, this is super cool, thanks a lot for sharing. I was just wondering, what resource do you feed this script into?


OK, I have figured it out: it is the Custom Query resource. I love it, thanks a lot.
