
SNMP Queue Failures

Ok, I originally posted in the NPM forum.  Since I recently installed UDT as well, we thought it best I post here.

We have recently started getting a lot of "%SNMP-3-INPUT_QFULL_ERR: Packet dropped due to input queue full" errors on my big Catalyst switches. This started happening around the same time we installed UDT - and is still happening, btw.

I opened a support case with both Cisco and SolarWinds, and have not had much luck in determining whether UDT or another Orion product is causing this. Cisco suggested I increase the queue-length on the switch (I did this twice - first to 100, then to 250 - no change) and then requested the MIBs being used by Orion, so I sent those over yesterday.  SolarWinds support provided the MIBs, and I am still looking at increasing the buffers further, but I am not sure what other impact that might have.
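For anyone who wants to try the same thing, the queue-length change Cisco suggested is just a global config command on the switch - something along these lines (250 was my second attempt; your value may differ):

configure terminal
snmp-server queue-length 250
end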

It just seems odd - has ANYONE seen anything like this before?  My syslogs have doubled in size since this issue started, and it is becoming quite worrisome for me.

Thanks.

  • I'm assuming that since nobody has remarked, commented, or anything else on this thread, NOBODY is having the issue, or at least hasn't pushed it as far as I have yet.

    So far, neither Cisco nor SolarWinds has been able to help "fix" the problem, so I have had to make some changes to how I import/monitor ports on the nodes and hope for the best.

    By default, UDT will "monitor" any port it sees as "up". This includes any virtual gateway interface, vty port, etc.  On the switches where I have this problem, I have manually "unmonitored" any port that is virtual. Hoping this helps; we shall see.

     

    BTW - we need the "What we are working on" thread so we can start posting suggestions.

  • I am also seeing this on my Cisco 3750 stacks (on the stacks that have 6 or more switches in them). It also only started happening after I installed the eval for UDT.

    I have raised a call with SW support and they are currently working on it.

  • We believe this is due to overloading the switches with SNMP queries. The UDT queries are bulk requests and can take some time to resolve. If you have another application (e.g., NPM) running SNMP queries against the same device at the same time, query timeouts are possible.

    One way to look at this is to find out more about when your jobs are running to verify they correlate with this message. Then look at what other SNMP queries may be hitting the equipment.

    NPM did not have this issue because it relies on single SNMP gets rather than bulk requests. The bulk gets are more efficient, but take longer for the device to process than individual gets.

     

    There are several config settings in UDT.BusinessLayer.dll.config which allow you to set up a periodic dump of collected data into files on disk.
    The files are saved side by side with the UDT log files in 'ProgramData\Solarwinds\Logs\Orion\'. The file names are 'UDTJobRuntimeStat_JobRawStat.csv' and 'UDTJobRuntimeStat_JobStat.csv'.
    The settings:
    <add key="UDT.JobStatUpdateInterval" value="15" />
    <add key="UDT.JobStatAveragingCount" value="8" />
    <add key="UDT.JobStatDumpData" value="False" />
    The dump is enabled by setting UDT.JobStatDumpData to True.
    UDT.JobStatAveragingCount defines how many consecutive job intervals are saved and used for internal calculations.
    If any of these settings is changed, a restart of the SolarWinds Module Engine is required.
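    For example, with the dump enabled the section would look something like this (interval and averaging count left at the values shown above):

    <add key="UDT.JobStatUpdateInterval" value="15" />
    <add key="UDT.JobStatAveragingCount" value="8" />
    <add key="UDT.JobStatDumpData" value="True" />

    Again, restart the SolarWinds Module Engine after saving the change so it takes effect.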

  • I would also suggest changing the following two settings:

    <add key="UDT.Layer3JobAllowAsync" value="True" />

    <add key="UDT.PollingJobAllowAsync" value="True" />

    from True to False. This will instruct UDT to send its SNMP queries sequentially and may help mitigate the issue.
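    After the change, those two keys in UDT.BusinessLayer.dll.config would read as follows (and, as with the settings above, a restart of the SolarWinds Module Engine is presumably needed for the change to take effect):

    <add key="UDT.Layer3JobAllowAsync" value="False" />

    <add key="UDT.PollingJobAllowAsync" value="False" />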

  • Please provide a bit more guidance on this.

    I can't find the files you mention in order to modify this setting.

    I'm seeing a similar pattern on some of my larger switches, like the 6500 and 7600 series. I'm seeing interface utilization going through the roof, even higher than is possible for the specific interface.

  • FormerMember, in reply to mavturner:

    We believe this is due to overloading the switches with SNMP queries. The UDT queries are bulk requests and can take some time to resolve. If you have another application (e.g., NPM) running SNMP queries against the same device at the same time, query timeouts are possible.

    Mav, is this resolved/improved in UDT 2.0 or NPM 10.2?  Do the settings as noted still need to be manually tweaked?

  • We started seeing these on our 3750 core switch stack today and had to shut off SNMP monitoring on the device because it was making the thing crawl.  What (if anything) has been found about the cause and a resolution?

  • After the last post I removed and then re-added the device (thus losing historical data), and the problem went away.

    Today I upgraded to NPM 10.3 and now we have the problem again. I cannot afford to keep losing historical data to fix this!  What can be done?

  • Hi ttl,

    Have you consulted with support?

    Jiri

  • Yes. They contend that this is Cisco's problem, even after I mentioned that I didn't have this problem for quite some time until I updated to NPM 10.3. No polling logs were ever requested, no packet captures, just an old link to a Cisco Community discussion that ends with a post reporting that the "fix" in the thread didn't work.