12 Replies Latest reply on Oct 7, 2019 6:04 PM by mesverrum

    Cisco 3850 issue which shows up in Orion  (probably affects ALL cisco switches - UPDATE!)

    Craig Norborg

      So, I'm curious if anyone else is having an issue with Cisco 3850s, that we were kind of lucky to find, that affects Orion.

       

      The way we noticed it was a few of our switches had the "VLAN" checkbox that is normally in "List Resources" missing.

       

      ie: Working switch

       

      Not working switch

       

       

      The problem turns out to be that the OID (1.3.6.1.2.1.17.1.4.1.2  or dot1dBasePortIfIndex) cannot be scanned by Orion (or anything else!) on the switches that are not functioning right.   This is causing some issues with reports we want to generate and topology and other things.   Out of 861 switch stacks, I'm seeing it happen on 35 of ours.    The IOS versions I'm seeing affected are 03.06.06E, 16.3.7 and 16.3.8.   Switches that were just rebooted are affected, as well as those up 80+ weeks.    Can't find any commonality in terms of uptime, memory, cpu, stack size, IOS, etc...

       

      I use this SWQL query to find them now, if the 3 columns after "Caption" are NULL, the switch seems to be affected.  

       

      If you have a bunch of 3850's on your network, you can either run this in SWQL studio or add a "Custom Query" resource to any page and it would run right there.   Curious if anyone else is experiencing the issue that has a decent amount of 3850's, or if you do and aren't experiencing it, I'd be curious what IOS versions you're running.

       

      Have a case open with Cisco, but I think they're at a loss.   If anyone else is experiencing the issue, it would probably be good for you to open a case too, to get this moved to a higher priority so they figure out what's up and get it fixed.

       

      SELECT N.NodeID, N.IP_Address, N.Caption, TPE.InstanceID, TPE.Enabled, TPE.Node.Caption, N.CPUCount,  WeekDiff(N.LastBoot, GetDate()) AS Weeks, N.SwitchStack.MemberCount, N.MemoryAvailable, N.CPULoad, N.HardwareHealthInfos.Model ,N.IOSVersion, N.IOSImage

      FROM  Orion.Nodes N

      LEFT  JOIN Orion.TechnologyPollingAssignments TPE ON (TPE.InstanceID = N.NodeID) AND (TPE.TechnologyPollingID = 'Core.Topology.Vlan')

      WHERE  (N.MachineType LIKE '%38xx%')

      ORDER BY TPE.InstanceID, N.IP_Address
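If you would rather script the check than eyeball the NULL columns, here is a minimal Python sketch. It assumes you already have the query results as a list of dicts; the key names (`InstanceID`, `Enabled`) and the sample rows are made up for illustration and depend on whatever SWQL client you use.

```python
# Sketch only: `sample_rows` stands in for whatever your SWQL client returns.
# The key names are assumptions, not something Orion guarantees.

def find_affected(rows):
    """Return rows where the LEFT JOIN found no 'Core.Topology.Vlan' poller."""
    return [
        r for r in rows
        if r.get("InstanceID") is None and r.get("Enabled") is None
    ]

# Hypothetical sample data: one healthy switch, one affected switch.
sample_rows = [
    {"NodeID": 1, "Caption": "sw-ok", "InstanceID": 1, "Enabled": True},
    {"NodeID": 2, "Caption": "sw-bad", "InstanceID": None, "Enabled": None},
]

affected = find_affected(sample_rows)
print([r["Caption"] for r in affected])
```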

       

      Post your responses here, or if you have any questions on how to run the query just let me know!

        • Re: Cisco 3850 issue which shows up in Orion
          zackm

          out of 73 nodes; we have one that seems to be affected.

           

          2 members

          up for 49 weeks

          WS-C3850-24S-E

          IOS Version: 03.07.01.E RELEASE SOFTWARE (fc3)

          IOS Image: CAT3K_CAA-UNIVERSALK9-M

           

          of the 72 unaffected nodes:

           

          uptime from 1 week to 98 weeks

          35 others have the same IOS Version

          all 72 have the same IOS Image

            • Re: Cisco 3850 issue which shows up in Orion
              Craig Norborg

              Interesting!     Not far off from what I'd expect I guess though.  We're affected on about 4% of our nodes, which would be 2.92 nodes out of 73?   So 1 isn't too far off I don't think.   I think the IOS version is kind of "in the middle" of the versions we're using, so that would make sense.   Looks like a stack of 2 switches?  Out of our affected switches, only one of them is a stack.

               

              Cisco is giving a bit of a runaround right now, hoping our SE will step in.   Already asked for it to be moved to another engineer while ours was taking a day off, but they didn't listen to me.   I gave him all the info he wanted in terms of "show" commands on 3 non-working switches and 1 working, plus all the stats I generated, including an SNMP walk from one of the non-working switches.   But now he wants it from 2 other switches, 1 working 1 not, saying that getting it from these other switches will make it "easier to look up and reproduce".    Why does he need an snmpwalk from a working device from me?  It's working!

               

              Sorry, a bit annoyed with them...

               

              Going to open a case on yours?

              • Re: Cisco 3850 issue which shows up in Orion
                Craig Norborg

                Oh hey, we might have figured out a commonality that could help figure out why some switches are that way and not others.  Only gone through a few of my switches so far, but, it seems to be limited to those switches that are either 100% full and have a dynamically assigned vlan (other than vlan 1) on each port, or have been statically configured on all the end-user ports to a vlan other than vlan 1.   That, and all the trunk ports might need to have specific vlans allowed on the trunk ports too.

                 

                But, one other "symptom" we've figured out is that you can do the SNMP walk for one of the active vlans on the switch, but not for vlan 1.   ie: if all your switchports have "switchport access vlan 34" on them, this returns data:

                snmpwalk -v 2c -c MySNMPCommunity@34 <IP of Switch> 1.3.6.1.2.1.17.1.4.1.2

                while this returns nothing:

                snmpwalk -v 2c -c MySNMPCommunity <IP of Switch> 1.3.6.1.2.1.17.1.4.1.2

                Note the "@34" appended to the community string in the first example to signify the vlan...
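For anyone scripting the comparison, a small Python sketch that builds both walks. It assumes Cisco's community string indexing, where the VLAN goes on the community string as `community@vlan`. The helper name `snmpwalk_args` and the address `192.0.2.10` are placeholders for illustration; the command is only built, not executed.

```python
BASE_PORT_IFINDEX_OID = "1.3.6.1.2.1.17.1.4.1.2"  # dot1dBasePortIfIndex

def snmpwalk_args(host, community, oid, vlan=None):
    """Build an snmpwalk argument list; with `vlan`, use community@vlan."""
    indexed = community if vlan is None else f"{community}@{vlan}"
    return ["snmpwalk", "-v", "2c", "-c", indexed, host, oid]

# Placeholder host and community; substitute your own.
per_vlan = snmpwalk_args("192.0.2.10", "MySNMPCommunity", BASE_PORT_IFINDEX_OID, vlan=34)
default = snmpwalk_args("192.0.2.10", "MySNMPCommunity", BASE_PORT_IFINDEX_OID)

print(" ".join(per_vlan))
print(" ".join(default))
```

To actually run a walk you could pass one of these lists to `subprocess.run(..., capture_output=True, text=True)`, assuming the net-snmp tools are installed.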

              • Re: Cisco 3850 issue which shows up in Orion  (probably affects ALL cisco switches - UPDATE!)
                Craig Norborg

                So, did some more testing with both an Arista switch and an older Cisco switch.

                 

                Long story short, it's probably not an issue that affects multiple vendors.  The Arista switch always returned info when that OID was scanned, no matter how the ports were configured.

                 

                However, on an old Cisco 2960-24TT-L, I observed the same behavior, both via snmpwalk and with "List Resources" missing the "VLAN" box in Orion.     When all ports were configured in VLANs other than vlan 1, the issue was present.  When at least one port was left in vlan 1, the issue was not present.    This was a 12.2(X) strain of IOS too, so this issue has probably been around for quite a while!!

                 

                So, it's very possible that if you have cisco switches, quite a bit of information might be missing from them!!

                • Re: Cisco 3850 issue which shows up in Orion  (probably affects ALL cisco switches - UPDATE!)
                  superfly99

                  We have 3 out of 6 showing this behaviour. They are the top 3 in this list.

                   

                  • Re: Cisco 3850 issue which shows up in Orion  (probably affects ALL cisco switches - UPDATE!)
                    Craig Norborg

                    So, Solarwinds has apparently decided that this is not a bug.   According to them "I've discussed your findings with other members of our team to discuss if there are any other options that we have to resolve your issue. Unfortunately what you're wanting is currently not a feature of the product. You can submit a Feature Request but there is no timeline on if or when it will be implemented."

                     

                    This approach has definitely annoyed me.   I consider it a bug, in that they're not interpreting the results of their polling correctly...

                     

                    What are your thoughts?   Bug, or Feature Request?


                      • Re: Cisco 3850 issue which shows up in Orion  (probably affects ALL cisco switches - UPDATE!)
                        aLTeReGo

                        Sounds like what you are describing is a legitimate bug in IOS. I'm not sure what the feature request would be in NPM, but maybe I'm missing something here.

                          • Re: Cisco 3850 issue which shows up in Orion  (probably affects ALL cisco switches - UPDATE!)
                            mesverrum

                            So I would say this is less of a bug and more of a non-standard snmp "feature" that Cisco has implemented.  They have it documented in various places but I am not aware of any other vendors who do it the way Cisco does.

                            SNMP Community String Indexing - Cisco

                             

                            I looked at the IETF standard for the bridge mib's and really cannot see anything there that indicates that they intended for that kind of capability to get different results from an OID via these "@" contexts.

                            RFC 4188 - Definitions of Managed Objects for Bridges

                             

                            So the list resources only tests with the community string it was given, and it looks like the cisco default behavior is that if no context is given on the community string it defaults to vlan1, which may or may not be in use. 

                             

                            Not sure of a clean solution to the issue.

                            If there is another OID that just gives a list of all configured vlan id's, SW could query that and then do a series of snmp scans where they append each context, but that sounds like it could become really taxing in terms of polling load, especially on polled devices with lots of VLANs configured

                              • Re: Cisco 3850 issue which shows up in Orion  (probably affects ALL cisco switches - UPDATE!)
                                Craig Norborg

                                So, from how Cisco explains it, that particular OID (dot1dBasePortIfIndex) is a list of ifIndex's of ports in a given VLAN.   Since the "default" is VLAN 1, if no ports are in VLAN 1, they return no ports.   They say they interpret this the same way across all devices and IOS's, and from what I've seen they do.
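A toy Python model of that explanation may make the symptom easier to see. It only covers access ports with one VLAN each and ignores trunks; the function name and the ifIndex values are invented for illustration and say nothing about how IOS is actually implemented.

```python
def base_port_ifindexes(port_vlans, context_vlan=1):
    """Toy model: return the ifIndexes of ports whose access VLAN matches
    the VLAN context of the query (VLAN 1 when no context is given)."""
    return sorted(ifx for ifx, vlan in port_vlans.items() if vlan == context_vlan)

# Hypothetical switches: {ifIndex: access VLAN}
all_in_34 = {10101: 34, 10102: 34}   # every port moved off VLAN 1
one_in_1 = {10101: 34, 10102: 1}     # one port left in VLAN 1

print(base_port_ifindexes(all_in_34))       # default context: nothing in VLAN 1
print(base_port_ifindexes(all_in_34, 34))   # VLAN 34 context: both ports
print(base_port_ifindexes(one_in_1))        # default context: the VLAN 1 port
```

Under this model a poller that only asks in the default context sees an empty table on the first switch, which matches the missing "VLAN" checkbox described at the top of the thread.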

                                 

                                There are other OID's from which you get the list of VLANs and their names and such.   ie: if you query the OID below, you get a list that would be similar to what is shown.  

                                 

                                VLAN Name: 1.3.6.1.4.1.9.9.46.1.3.1.1.4

                                ‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.1’ => “default”

                                ‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.10’ => “VLAN0010”

                                ‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.100’ => “VLAN0100”

                                ‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.101’ => “VLAN0101”

                                ‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.102’ => “vlan102”

                                ‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.109’ => “VLAN0109”

                                ‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.1000’ => “VLAN1000”

                                ‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.1001’ => “VLAN1001”

                                ‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.1002’ => “fddi-default”

                                ‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.1003’ => “token-ring-default”

                                ‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.1004’ => “fddinet-default”

                                ‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.1005’ => “trnet-default”
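As a rough sketch, the VLAN IDs can be pulled out of a walk like the one above with a few lines of Python. It assumes the walk has already been collected into a dict of full OID to name, and that the last OID component is the VLAN ID, as it appears to be in the listing. The function name and sample data are illustrative only.

```python
VTP_VLAN_NAME_OID = "1.3.6.1.4.1.9.9.46.1.3.1.1.4"

def vlan_names(walk):
    """Turn {full_oid: name} into {vlan_id: name}, taking the last OID
    component as the VLAN ID and skipping anything outside the table."""
    prefix = VTP_VLAN_NAME_OID + "."
    result = {}
    for oid, name in walk.items():
        if oid.startswith(prefix):
            result[int(oid.rsplit(".", 1)[1])] = name
    return result

# A few entries copied from the listing above.
sample_walk = {
    "1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.1": "default",
    "1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.10": "VLAN0010",
    "1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.1002": "fddi-default",
}

vlans = vlan_names(sample_walk)
print(sorted(vlans))
```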

                                 

                                Taxing in terms of polling load?   Solarwinds is supposed to do this for all devices, so not really sure why it would be more taxing than its regular behavior...

                                  • Re: Cisco 3850 issue which shows up in Orion  (probably affects ALL cisco switches - UPDATE!)
                                    mesverrum

                                    So to be sure to know about all the port mappings wouldn't they need to poll

                                    1.3.6.1.2.1.17.1.4.1.2 via <mycommunity>@1

                                    1.3.6.1.2.1.17.1.4.1.2 via <mycommunity>@10

                                    1.3.6.1.2.1.17.1.4.1.2 via <mycommunity>@100

                                    1.3.6.1.2.1.17.1.4.1.2 via <mycommunity>@101

                                    1.3.6.1.2.1.17.1.4.1.2 via <mycommunity>@102

                                    1.3.6.1.2.1.17.1.4.1.2 via <mycommunity>@109

                                    1.3.6.1.2.1.17.1.4.1.2 via <mycommunity>@1000

                                    etc

                                     

                                    If they don't poll all the used vlans then wouldn't they always run a risk of having the problem you started with in the first place?  Or do all interfaces show up even if that interface doesn't carry the vlan so we just need to poll the first existing vlan?
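One way to picture the per-VLAN approach being discussed is the Python sketch below. It is not how Orion polls; it just loops over a VLAN list and walks dot1dBasePortIfIndex (1.3.6.1.2.1.17.1.4.1.2) once per `community@vlan`. The `walk` callable is a hypothetical stand-in for a real SNMP walk, and `fake_walk` returns canned data so the sketch runs without a switch.

```python
def walk_all_vlans(vlans, walk, community="mycommunity"):
    """Walk once per VLAN context; returns {vlan: {base_port: ifIndex}}.
    The number of walks equals the number of VLANs, which is the
    polling cost raised above."""
    return {vlan: walk(f"{community}@{vlan}") for vlan in vlans}

def fake_walk(indexed_community):
    # Canned results standing in for a real walk of dot1dBasePortIfIndex.
    canned = {
        "mycommunity@10": {1: 10101},
        "mycommunity@100": {2: 10102},
    }
    return canned.get(indexed_community, {})

per_vlan = walk_all_vlans([1, 10, 100], fake_walk)
seen_ifindexes = sorted({ifx for ports in per_vlan.values() for ifx in ports.values()})

print(per_vlan)
print(seen_ifindexes)
```

In the canned data VLAN 1 comes back empty, so a poller that stopped after the default context would see no ports at all, while the full loop finds both.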