24 Replies Latest reply on Nov 21, 2012 5:52 PM by Steve Welsh

    SNMP Queue Failures

    dclick

      OK, I originally posted in the NPM forum (Re: SNMP Events). Since I recently installed UDT as well, we thought it best that I post here.

      We have recently started getting a lot of "%SNMP-3-INPUT_QFULL_ERR: Packet dropped due to input queue full" errors on my big Catalyst switches. This started happening around the same time we installed UDT, and is still happening, btw.

      I opened a support case with both Cisco and SolarWinds, and have not had much luck in determining whether UDT or another Orion product is causing this. Cisco suggested I increase the queue length on the switch (I did this twice, first to 100, then to 250, with no change), then requested the MIBs being used by Orion, so I sent those over yesterday. SolarWinds support provided the MIBs, and I am still looking at increasing the buffers further, but I am not sure what other impact this might have.

      It just seems odd. Has ANYONE seen anything like this before? My syslogs have doubled in size since this issue started, and it is becoming quite worrisome for me.
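
      To put a number on how much of the syslog growth comes from this one message, a rough tally per switch and per hour can help. The sketch below is only an illustration: it assumes each syslog line starts with an ISO-style timestamp followed by the hostname, which may not match your syslog server's layout, and the function name is made up for this example.

```python
from collections import Counter

QFULL_TAG = "%SNMP-3-INPUT_QFULL_ERR"

def count_qfull_per_hour(lines):
    """Count queue-full messages per (host, hour) bucket.

    Assumed line layout (hypothetical, adjust for your syslog server):
    '2012-11-20T14:03:11 core-sw1 %SNMP-3-INPUT_QFULL_ERR: ...'
    """
    counts = Counter()
    for line in lines:
        if QFULL_TAG not in line:
            continue
        parts = line.split(None, 2)  # timestamp, host, rest of message
        if len(parts) < 3:
            continue  # line does not match the assumed layout
        timestamp, host = parts[0], parts[1]
        hour = timestamp[:13]  # keep 'YYYY-MM-DDTHH'
        counts[(host, hour)] += 1
    return counts
```

      Feeding it the lines of a syslog file (for example `count_qfull_per_hour(open(path))`, with `path` pointing at your own log) gives a per-hour count that can be compared before and after a change such as raising the queue length.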

      Thanks.

        • Re: SNMP Queue Failures
          dclick

          I'm assuming that since nobody has remarked, commented, or anything else on this thread, NOBODY is having the issue, or has pushed it as far as I have yet.

          So far, neither Cisco nor SolarWinds has been able to help "fix" the problem, so I have had to make some changes to how I import/monitor ports on the nodes, and hope for the best.

          By default, UDT will "monitor" any port it sees as "up". This includes any virtual gateway interface, vty port, etc. On the switches where I have this problem, I have manually "unmonitored" any port that is virtual. Hoping this helps. We shall see.

           

          BTW - we need the "What we are working on" thread so we can start posting suggestions.

            • Re: SNMP Queue Failures
              nickirwin

              I am also seeing this on my Cisco 3750 stacks (on the stacks that have 6 or more switches in them). It also only started happening after I installed the eval for UDT.

              I have raised a call with SW support and they are currently working on it.

                • Re: SNMP Queue Failures
                  mavturner

                  We believe this is due to overloading the switches with SNMP queries. The UDT queries are bulk requests and could take some time for the query to resolve. If you have another application (ex: NPM) trying to run SNMP queries against the machine at the same time, it could be possible that there are query timeouts.

                  One way to look at this is to find out more about when your jobs are running to verify they correlate with this message. Then look at what other SNMP queries may be hitting the equipment.
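
                  A rough way to check that correlation is to measure what share of the error timestamps fall inside the polling job windows. The sketch below assumes you have already extracted the error times from syslog and the job start/end times from your own records; the function name and the 30-second slack are made up for illustration and are not part of any SolarWinds tooling.

```python
from datetime import datetime, timedelta

def fraction_in_windows(error_times, job_windows, slack=timedelta(seconds=30)):
    """Return the share of error timestamps that fall inside any job window.

    error_times: list of datetime objects (one per queue-full message).
    job_windows: list of (start, end) datetime pairs for polling jobs.
    slack: widens each window to allow for clock skew (assumed value).
    """
    if not error_times:
        return 0.0
    hits = 0
    for t in error_times:
        if any(start - slack <= t <= end + slack for start, end in job_windows):
            hits += 1
    return hits / len(error_times)
```

                  A value close to 1.0 suggests the messages line up with those jobs; a low value points toward some other source of SNMP queries hitting the equipment.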

                  NPM did not have this issue because it relies on single SNMP gets rather than bulk requests. The bulk gets are more efficient, but take longer to process than individual gets.

                   

                  There are several config settings in UDT.BusinessLayer.dll.config that let you set up a periodic dump of collected data to files on disk.

                  The files are saved alongside the UDT log files in 'ProgramData\Solarwinds\Logs\Orion\'. The file names are 'UDTJobRuntimeStat_JobRawStat.csv' and 'UDTJobRuntimeStat_JobStat.csv'.

                  The settings:

                  <add key="UDT.JobStatUpdateInterval" value="15" />

                  <add key="UDT.JobStatAveragingCount" value="8" />

                  <add key="UDT.JobStatDumpData" value="False" />

                  The dump is enabled by setting JobStatDumpData to True.

                  JobStatAveragingCount defines how many consecutive job intervals are saved and used for internal calculations.

                  If any setting is changed, a restart of the SolarWinds Module Engine is required.
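
                  If you would rather script the change than edit the file by hand, something along these lines should work. It is a minimal sketch, assuming the file is a standard .NET-style XML config whose settings are `<add key="..." value="..."/>` elements; the helper name is hypothetical, and Python's ElementTree drops XML comments when it rewrites a file, so back up the original first.

```python
import xml.etree.ElementTree as ET

def set_app_setting(config_path, key, value):
    """Set the value of <add key="..."> in an XML config file.

    Returns True if the key was found and the file rewritten,
    False if no matching <add> element exists (file left untouched).
    """
    tree = ET.parse(config_path)
    root = tree.getroot()
    # Look at every <add> element; the key attribute decides the match.
    for node in root.iter("add"):
        if node.get("key") == key:
            node.set("value", value)
            tree.write(config_path, encoding="utf-8", xml_declaration=True)
            return True
    return False
```

                  For example, `set_app_setting(path, "UDT.JobStatDumpData", "True")` would enable the dump, where `path` is wherever UDT.BusinessLayer.dll.config lives on your server. The restart requirement still applies afterwards.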

                    • Re: SNMP Queue Failures
                      igorvvv

                      I would also suggest changing the following two settings:

                      <add key="UDT.Layer3JobAllowAsync" value="True" />

                      <add key="UDT.PollingJobAllowAsync" value="True" />

                      Change both from True to False. This instructs UDT to send SNMP queries sequentially and may help mitigate the issue.

                        • Re: SNMP Queue Failures
                          JesperVestergaard

                          Please provide a bit more guidance on this.

                          I can't find the files you mention in order to modify this setting.

                          I'm seeing a similar pattern on some of my larger switches, like the 6500 and 7600 series. I'm seeing interface utilization going through the roof, even higher than is possible for the specific interface.

                        • Re: SNMP Queue Failures
                          rgward

                          We believe this is due to overloading the switches with SNMP queries. The UDT queries are bulk requests and could take some time for the query to resolve. If you have another application (ex: NPM) trying to run SNMP queries against the machine at the same time, it could be possible that there are query timeouts.

                          Mav, is this resolved/improved in UDT 2.0 or NPM 10.2?  Do the settings as noted still need to be manually tweaked?

                      • Re: SNMP Queue Failures
                        ttl

                           We started seeing these on our 3750 core switch stack today and had to shut off SNMP monitoring on the device because it was making the thing crawl.  What (if anything) has been found about this regarding cause and resolution?

                      • Re: SNMP Queue Failures
                        AndyRoberts

                        We are having the same problems with our 3750 stacks, although we haven't implemented UDT. We had been running 10.3 for a few months with no problems until the last 3 weeks or so.

                         

                        I raised a ticket with TAC and all they came back with was to reboot the stack of switches, which I did. Things were fine for a couple of days, then SNMP started failing again (as in, I can't list the resources or test SNMP from NPM), but I am no longer seeing the SNMP error in the logs. As this is now happening on 5 of our 12 stacks in our datacentre, I have escalated this to TAC again, although I am sure they will just come back with a recommended upgrade.

                         

                        If anyone has any further knowledge of this that they could share, that would be great.

                          • Re: SNMP Queue Failures
                            Steve Welsh

                            Hi Andy,

                             

                            Thanks for your posting on the Cisco 3750 stack issue.  Could you please update this posting again when you hear back from TAC?

                             

                            Many thanks,

                             

                            Steve

                              • Re: SNMP Queue Failures
                                AndyRoberts

                                I am still working through this with TAC at the moment. They gave me some things to try, mainly excluding particular MIBs in an SNMP view and assigning that view to a particular community, see below:

                                 

                                snmp-server view test dot1dBridge excluded

                                snmp-server view test ciscoFlashMIB excluded

                                 

                                I was also asked to use the command "no ip http secure-server", as they said it was the cause of a known bug and vulnerability. This worked for me for about 48 hours, then SNMP started to fail again, although it did improve things, so I lose the ability to poll via SNMP a lot less frequently.

                                 

                                I am expecting to be advised of a code version to upgrade to in order to resolve this. Once I hear more I will post accordingly.

                                 

                                So far I have only seen this on our 3750 stacks, although this weekend I am implementing UDT and really hoping it doesn't cause any problems with my Nexus switches and 6500s.

                            • Re: SNMP Queue Failures
                              cahunt

                              I am also looking for a solution to this issue. I am seeing this on a few of our 6500 Series VSS paired distribution boxes. A case with Cisco resulted in us turning back our polling frequency. We had not seen this issue until running NPM 10.2/10.3, and it persists even after adjusting the polling frequency. Cisco also stated that the SNMP function is not CPU intensive and that this message indicates the switch is declining to respond to the SNMP query so as not to affect CPU usage. For one thing, this clogs the logs. I know it is affecting our switches at least in a minimal sense, and I would like to find a way to resolve this before it adversely affects our distribution-level nodes.

                              We are running SNMP v2

                              • Re: SNMP Queue Failures
                                vbartosik

                                Hello - I've run into the same issue on a stack consisting of 8 3750X switches with the latest IOS, monitoring all 400 ports. Cisco TAC was / is really not helpful with this issue - their solution is to shut down SNMP, which is obviously not acceptable.

                                The problem seems to be with version 2c (bulk requests?) with many ports. The solution is to change the polling method to SNMP v1 with the same credentials or, best of all, to use v3 with authentication and encryption.

                                 

                                You can achieve this with the following config:

                                snmp-server group {groupname} v3 priv

                                snmp-server user {username} {groupname} v3 auth sha {authenticationpassword} priv aes {128 | 192 | 256} {encryptionpassword}

                                The second command doesn't show in running-config; you can check the settings with the "show snmp user" command.

                                 

                                You can of course use other authentication and privacy settings, such as MD5 for authentication and 3DES or DES for privacy. It would also be best to apply an access list that permits only the Orion server address, but you get the idea.

                                These settings obviously have to be applied in Orion as well. It's pretty straightforward; just make sure you don't put anything into the SNMPv3 context field.

                                 

                                Hope this will work for you also.

                                 

                                Have a nice day.

                                Vladislav

                                • Re: SNMP Queue Failures
                                  AndyRoberts

                                  Hi

                                   

                                  After working with TAC for some time on this, it was identified that my switches had 2 known bugs associated with SNMP and SSH. After upgrading the code on my switches to a version where these bugs were fixed, this now appears to be working for me. If it helps, the bug IDs TAC identified were CSCed69872 and CSCsv32556.

                                   

                                  It may be worth speaking to your support company and seeing if they can advise which version you should upgrade to. TAC wouldn't advise me because they didn't want to accept liability, but my service provider gave me a list of about 20 code versions this was fixed in, and one happened to be a version I was already running at another site, so I chose that one.

                                  So far it's looking good, but I will continue to monitor this.

                                   

                                  Hope this helps