So, I'm curious if anyone else is having an issue with Cisco 3850s that affects Orion, one we were kind of lucky to find.
The way we noticed it was that a few of our switches were missing the "VLAN" checkbox that normally appears in "List Resources".
ie: (screenshots omitted) a working switch shows the VLAN checkbox; a non-working switch does not.
The problem turns out to be that the OID (1.3.6.1.2.1.17.1.4.1.2, or dot1dBasePortIfIndex) cannot be scanned by Orion (or anything else!) on the switches that are not functioning right. This is causing some issues with reports we want to generate, topology, and other things. Out of 861 switch stacks, I'm seeing it happen on 35 of ours. The IOS versions I'm seeing affected are 03.06.06E, 16.3.7 and 16.3.8. Switches that were just rebooted are affected, as well as those up 80+ weeks. Can't find any commonality in terms of uptime, memory, CPU, stack size, IOS, etc...
I use this SWQL query to find them now, if the 3 columns after "Caption" are NULL, the switch seems to be affected.
If you have a bunch of 3850s on your network, you can either run this in SWQL Studio or add a "Custom Query" resource to any page and it will run right there. Curious whether anyone else with a decent number of 3850s is experiencing the issue, or, if you have them and aren't, what IOS versions you're running.
Have a case open with Cisco, but they're at a loss I think. If anyone else is experiencing the issue, it would probably be good for you to open a case also, to get this moved to a higher priority on figuring out what's up and getting it fixed.
SELECT N.NodeID, N.IP_Address, N.Caption, TPE.InstanceID, TPE.Enabled, TPE.Node.Caption,
       N.CPUCount, WeekDiff(N.LastBoot, GetDate()) AS Weeks, N.SwitchStack.MemberCount,
       N.MemoryAvailable, N.CPULoad, N.HardwareHealthInfos.Model, N.IOSVersion, N.IOSImage
FROM Orion.Nodes N
LEFT JOIN Orion.TechnologyPollingAssignments TPE ON (TPE.InstanceID = N.NodeID) AND (TPE.TechnologyPollingID = 'Core.Topology.Vlan')
WHERE (N.MachineType LIKE '%38xx%')
ORDER BY TPE.InstanceID, N.IP_Address
Post your responses here, or if you have any questions on how to run the query just let me know!
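If you want to check programmatically rather than eyeball the query output, here's a minimal Python sketch. It assumes you've already fetched the query results as a list of dicts keyed by column name (e.g. via the SWIS API); the node names in the sample data are made up.

```python
# Sketch: flag 3850s whose VLAN topology poller is missing, given rows
# already fetched from the SWQL query above. The helper itself is pure
# Python, so it can be tested offline.

def find_affected(rows):
    """Return captions of nodes where the topology-poller columns after
    Caption (TPE.InstanceID, TPE.Enabled) came back NULL."""
    affected = []
    for row in rows:
        if row.get("InstanceID") is None and row.get("Enabled") is None:
            affected.append(row["Caption"])
    return affected

# Example rows shaped like the query output (hypothetical data):
sample = [
    {"Caption": "sw-good-01", "InstanceID": 123, "Enabled": True},
    {"Caption": "sw-bad-02", "InstanceID": None, "Enabled": None},
]
print(find_affected(sample))  # ['sw-bad-02']
```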
So, from what I can tell at this point, this is due to SolarWinds taking some shortcuts to try and speed things up a bit. There are two changes you can make in a config file that is in
volume:\Program Files (x86)\SolarWinds\Orion\Topology
The file is "SolarWinds.Topology.Pollers.dll.config". The changes you need to make are below. You will need to stop all the SolarWinds services before doing this, and have admin privs to change this file. It's advised to make a backup of the original file before making these changes.
Once this is done, the "VLAN" box should show up on the devices if you go into "List Resources" and you might need to do a "Force Refresh". You will need to make sure the VLAN box is selected AND hit submit once it is.
According to SolarWinds, "The settings only have an impact on the discovery process when doing a List Resources. It will take more time now as it's looking for the other VLANs to discovery the topology pollers. It doesn't have a impact on the discovery process when you run a Discovery as it already does a full inventory. It has no impact on polling". Which makes me wonder why it's done this way at all, if changing it has such a minor impact on the product but the default potentially causes issues.
I also confirmed that if you do these changes, running the "Configuration Wizard" will revert the changes back. Not something I like to hear!!
So, no changes are required on the devices, just a couple minor changes to Orion.
So, Solarwinds has apparently decided that this is not a bug. According to them "I've discussed your findings with other members of our team to discuss if there are any other options that we have to resolve your issue. Unfortunately what you're wanting is currently not a feature of the product. You can submit a Feature Request but there is no timeline on if or when it will be implemented."
This approach has definitely annoyed me; I consider it a bug, in that they're not interpreting the results of their polling correctly...
What are your thoughts? Bug, or Feature Request?
Sounds like what you are describing is a legitimate bug in IOS. I'm not sure what the feature request would be in NPM, but maybe I'm missing something here.
So I would say this is less of a bug and more of a non-standard SNMP "feature" that Cisco has implemented. They have it documented in various places, but I'm not aware of any other vendors who do it the way Cisco does.
I looked at the IETF standard for the bridge MIBs and really cannot see anything there indicating that they intended for an OID to return different results via these "@" contexts.
So List Resources only tests with the community string it was given, and it looks like the Cisco default behavior is that if no context is given on the community string, it defaults to VLAN 1, which may or may not be in use.
Not sure of a clean solution to the issue.
If there is another OID that just gives a list of all configured VLAN IDs, SW could query that and then do a series of SNMP scans appending each context, but that sounds like it could become really taxing in terms of polling load, both for SW and for polled devices with lots of VLANs configured.
So, from how Cisco explains it, that particular object (dot1dBasePortIfIndex) is a list of the ifIndexes of ports in a given VLAN. Since the "default" is VLAN 1, if no ports are in VLAN 1, they return no ports. They say they interpret this the same way across all devices and IOS versions, and from what I've seen they do.
There are other OIDs from which you get the list of VLANs and their names. ie: if you query the OID below, you get a list similar to what is shown.
VLAN Name: 1.3.6.1.4.1.9.9.46.1.3.1.1.4
'1.1' => "default"
'1.10' => "VLAN0010"
'1.100' => "VLAN0100"
'1.101' => "VLAN0101"
'1.102' => "vlan102"
'1.109' => "VLAN0109"
'1.1000' => "VLAN1000"
'1.1001' => "VLAN1001"
'1.1002' => "fddi-default"
'1.1003' => "token-ring-default"
'1.1004' => "fddinet-default"
'1.1005' => "trnet-default"
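The VLAN-name walk above can be turned into a vlan-id → name map with a few lines of Python. This is just a sketch; the (oid_suffix, name) pair format is an assumption about how your SNMP tool presents the walk, and the sample data mirrors the list above.

```python
# Sketch: turn a vtpVlanName (1.3.6.1.4.1.9.9.46.1.3.1.1.4) walk into a
# {vlan_id: name} map. The index is <management-domain>.<vlan-id>, so the
# last number in each OID suffix is the VLAN ID.

def parse_vlan_names(pairs):
    """pairs: iterable of (oid_suffix, name), e.g. ('1.100', 'VLAN0100')."""
    vlans = {}
    for suffix, name in pairs:
        vlan_id = int(suffix.split(".")[-1])
        vlans[vlan_id] = name
    return vlans

walk = [("1.1", "default"), ("1.10", "VLAN0010"), ("1.1002", "fddi-default")]
print(parse_vlan_names(walk))  # {1: 'default', 10: 'VLAN0010', 1002: 'fddi-default'}
```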
Taxing in terms of polling load? Solarwinds is supposed to do this for all devices, so not really sure why it would be more taxing than its regular behavior...
So to be sure to know about all the port mappings wouldn't they need to poll
1.3.6.1.2.1.17.1.4 via <mycommunity>@1
1.3.6.1.2.1.17.1.4 via <mycommunity>@10
1.3.6.1.2.1.17.1.4 via <mycommunity>@100
1.3.6.1.2.1.17.1.4 via <mycommunity>@101
1.3.6.1.2.1.17.1.4 via <mycommunity>@102
1.3.6.1.2.1.17.1.4 via <mycommunity>@109
1.3.6.1.2.1.17.1.4 via <mycommunity>@1000
If they don't poll all the used VLANs, then wouldn't they always run a risk of having the problem you stated in the first place? Or do all interfaces show up even if an interface doesn't carry that VLAN, so we'd just need to poll the first existing VLAN?
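To get a feel for what that per-context polling would look like, here's a hedged Python sketch that just builds one snmpwalk command per VLAN context using Cisco's community-string indexing (community@vlan). The host, community, and VLAN list are illustrative, and actually executing the commands (e.g. with subprocess) is left out so the sketch stays testable offline.

```python
# Sketch: build one snmpwalk command per VLAN context. Cisco's reserved
# FDDI/Token Ring VLANs 1002-1005 are skipped since they aren't Ethernet
# VLANs you'd expect bridge-port data for.

BASE_PORT_TABLE = "1.3.6.1.2.1.17.1.4"  # dot1dBasePortTable

def walk_commands(host, community, vlan_ids):
    cmds = []
    for vlan in sorted(vlan_ids):
        if 1002 <= vlan <= 1005:  # reserved VLANs, skip
            continue
        cmds.append(["snmpwalk", "-v", "2c", "-c",
                     f"{community}@{vlan}", host, BASE_PORT_TABLE])
    return cmds

cmds = walk_commands("10.0.0.1", "MySNMPCommunity", [1, 10, 1003])
print(cmds[0][4])  # MySNMPCommunity@1
```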
Interestingly, I did a List Resources on one of them and it showed a cached version. I did a Force Refresh and then it showed the VLAN option; one of the interfaces now had VLAN 1 on it. On the other 2, once I added an interface to VLAN 1, I got the VLAN option.
So, did some more testing with both an Arista switch and an older Cisco switch.
Long story short, it's probably a Cisco-specific issue rather than one affecting multiple vendors. The Arista switch always returned info when that OID was scanned, no matter how the ports were configured.
However, on an old Cisco 2960-24TT-L, I observed the same behavior, both via an snmpwalk and via the missing "VLAN" box under "List Resources" in Orion. When all ports were assigned to VLANs other than VLAN 1, the issue was present. When at least one port was left in the default VLAN 1, it was not. This was a 12.2(X) strain of IOS too, so this issue has probably been around for quite a while!!
So, it's very possible that if you have Cisco switches, quite a bit of information might be missing from them!!
Out of 73 nodes, we have one that seems to be affected.
up for 49 weeks
IOS Version: 03.07.01.E RELEASE SOFTWARE (fc3)
IOS Image: CAT3K_CAA-UNIVERSALK9-M
Of the 72 unaffected nodes:
uptime from 1 week to 98 weeks
35 others have the same IOS Version
all 72 have the same IOS Image
Oh hey, we might have figured out a commonality that could help explain why some switches are affected and not others. I've only gone through a few of my switches so far, but it seems to be limited to switches where either every port has a dynamically assigned VLAN (other than VLAN 1), or all the end-user ports are statically configured to a VLAN other than VLAN 1. On top of that, the trunk ports may need to have specific VLANs allowed as well.
But, one other "symptom" we've figured out is that you can do the SNMP walk for one of the active VLANs on the switch, but not for VLAN 1. ie: if all your switchports have "switchport access vlan 34" on them, the first walk below returns data while the second (which defaults to the VLAN 1 context) returns nothing:
snmpwalk -v 2c -c MySNMPCommunity@34 <IP of Switch> 1.3.6.1.2.1.17.1.4.1.2
snmpwalk -v 2c -c MySNMPCommunity <IP of Switch> 1.3.6.1.2.1.17.1.4.1.2
Note the "@34" appended to the community string in the first example to select that VLAN's context...
Interesting! Not far off from what I'd expect I guess though. We're affected on about 4% of our nodes, which would be 2.92 nodes out of 73? So 1 isn't too far off I don't think. I think the IOS version is kind of "in the middle" of the versions we're using, so that would make sense. Looks like a stack of 2 switches? Out of our affected switches, only one of them is a stack.
Cisco is giving us a bit of the runaround right now; hoping our SE will step in. I already asked for the case to be moved to another engineer while ours was taking a day off, but they didn't listen. I gave him all the info he wanted in terms of "show" commands on 3 non-working switches and 1 working switch, plus all the stats I generated, including an SNMP walk from one of the non-working switches. But now he wants it from 2 other switches, 1 working and 1 not, saying that getting it from these other switches will make it "easier to look up and reproduce". Why does he need an snmpwalk from a working device from me? It's working!
Sorry, a bit annoyed with them...
Going to open a case on yours?