I'm curious whether anyone else is seeing an issue with Cisco 3850s that affects Orion. We were kind of lucky to find it.
The way we noticed it was that on a few of our switches the "VLAN" checkbox that normally appears in "List Resources" was missing.
ie: Working switch
Not working switch
The problem turns out to be that the OID (1.3.6.1.2.1.17.1.4.1.2 or dot1dBasePortIfIndex) cannot be scanned by Orion (or anything else!) on the switches that are not functioning right. This is causing issues with reports we want to generate, with topology, and with other things. Out of 861 switch stacks, I'm seeing it happen on 35 of ours. The IOS versions I'm seeing affected are 03.06.06E, 16.3.7 and 16.3.8. Switches that were just rebooted are affected, as well as those up 80+ weeks. Can't find any commonality in terms of uptime, memory, CPU, stack size, IOS, etc...
I use this SWQL query to find them now. If the 3 columns after "Caption" are NULL, the switch seems to be affected.
If you have a bunch of 3850s on your network, you can either run this in SWQL Studio or add a "Custom Query" resource to any page and it will run right there. I'm curious whether anyone else with a decent number of 3850s is experiencing the issue, and if you aren't, I'd be curious what IOS versions you're running.
Have a case open with Cisco, but they're at a loss I think. If anyone else is experiencing the issue, it would probably be good for you to open a case too, to get this moved to a higher priority for figuring out what's up and getting it fixed.
SELECT N.NodeID, N.IP_Address, N.Caption,
       TPE.InstanceID, TPE.Enabled, TPE.Node.Caption,
       N.CPUCount, WeekDiff(N.LastBoot, GetDate()) AS Weeks,
       N.SwitchStack.MemberCount, N.MemoryAvailable, N.CPULoad,
       N.HardwareHealthInfos.Model, N.IOSVersion, N.IOSImage
FROM Orion.Nodes N
LEFT JOIN Orion.TechnologyPollingAssignments TPE
       ON (TPE.InstanceID = N.NodeID) AND (TPE.TechnologyPollingID = 'Core.Topology.Vlan')
WHERE (N.MachineType LIKE '%38xx%')
ORDER BY TPE.InstanceID, N.IP_Address
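If the query results are exported, the NULL check described above can also be scripted. This is only a minimal sketch: it assumes each row has already been loaded as a dict keyed by column name (how the rows get out of Orion is not covered), and `find_affected` is a hypothetical helper name. Because the query uses a LEFT JOIN, a node with no 'Core.Topology.Vlan' assignment has every TPE column NULL, so checking InstanceID and Enabled should be enough.

```python
def find_affected(rows):
    """Return rows where the LEFT JOIN found no Core.Topology.Vlan assignment."""
    return [
        row for row in rows
        if row.get("InstanceID") is None and row.get("Enabled") is None
    ]


# Made-up sample rows, trimmed to the columns the check needs.
rows = [
    {"NodeID": 1, "Caption": "sw-a", "InstanceID": 1, "Enabled": True},
    {"NodeID": 2, "Caption": "sw-b", "InstanceID": None, "Enabled": None},
]

for row in find_affected(rows):
    print(row["NodeID"], row["Caption"])
```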
Post your responses here, or if you have any questions on how to run the query just let me know!
So, from what I can tell at this point, this is due to SolarWinds taking some shortcuts to try and speed things up a bit. There are two changes you can make in a config file that is in
volume:\Program Files (x86)\SolarWinds\Orion\Topology
The file is "SolarWinds.Topology.Pollers.dll.config". The changes you need to make are below. You will need to stop all the SolarWinds services before doing this, and have admin privileges to change this file. It's advisable to make a backup of the original file before making these changes.
Once this is done, the "VLAN" box should show up on the devices if you go into "List Resources" and you might need to do a "Force Refresh". You will need to make sure the VLAN box is selected AND hit submit once it is.
According to Solarwinds, "The settings only have an impact on the discovery process when doing a List Resources. It will take more time now as it's looking for the other VLANs to discovery the topology pollers. It doesn't have a impact on the discovery process when you run a Discovery as it already does a full inventory. It has no impact on polling". Which makes me wonder why it works this way, if the setting has such a minor impact on the product but potentially causes issues.
I also confirmed that if you do these changes, running the "Configuration Wizard" will revert the changes back. Not something I like to hear!!
So, no changes are required on the devices, just a couple minor changes to Orion.
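Since the Configuration Wizard can revert the file, it may help to keep a copy of both the original and the edited version and check now and then whether the live file still matches the edit. A minimal sketch, run here against a throwaway file rather than the real Orion config; `snapshot` and `was_rewritten` are hypothetical helper names, not anything SolarWinds provides.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def snapshot(path, suffix):
    """Copy the file next to itself, e.g. foo.config -> foo.config.orig."""
    src = Path(path)
    dst = src.with_name(src.name + suffix)
    shutil.copy2(src, dst)
    return dst


def was_rewritten(path, edited_snapshot):
    """True if the live file no longer matches the copy saved after editing."""
    return not filecmp.cmp(path, edited_snapshot, shallow=False)


# Demo with a temporary stand-in file (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "example.dll.config"
    cfg.write_text("<configuration>original</configuration>")
    snapshot(cfg, ".orig")                      # backup before editing
    cfg.write_text("<configuration>edited</configuration>")
    edited = snapshot(cfg, ".edited")           # copy of the edited state
    print(was_rewritten(cfg, edited))
    cfg.write_text("<configuration>original</configuration>")  # simulate a revert
    print(was_rewritten(cfg, edited))
```

The content comparison uses `shallow=False` so that a file with the same size and timestamp but different contents is still detected.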
So, Solarwinds has apparently decided that this is not a bug. According to them "I've discussed your findings with other members of our team to discuss if there are any other options that we have to resolve your issue. Unfortunately what you're wanting is currently not a feature of the product. You can submit a Feature Request but there is no timeline on if or when it will be implemented."
This approach has definitely annoyed me. I consider it a bug, in that they're not interpreting the results of their polling correctly...
What are your thoughts? Bug, or Feature Request?
Sounds like what you are describing is a legitimate bug in IOS. I'm not sure what the feature request would be in NPM, but maybe I'm missing something here.
So I would say this is less of a bug and more of a non-standard SNMP "feature" that Cisco has implemented. They have it documented in various places, but I am not aware of any other vendors who do it the way Cisco does.
I looked at the IETF standard for the bridge MIBs and really cannot see anything there that indicates they intended an OID to return different results via these "@" contexts.
So List Resources only tests with the community string it was given, and it looks like the Cisco default behavior is that if no context is given on the community string it defaults to VLAN 1, which may or may not be in use.
Not sure of a clean solution to the issue.
If there is another OID that just gives a list of all configured VLAN IDs, SW could query that and then do a series of SNMP scans where they append each context, but that sounds like it could become really taxing in terms of polling load, especially on polled devices with lots of VLANs configured.
So, from how Cisco explains it, that particular object (dot1dBasePortIfIndex) is a list of the ifIndexes of ports in a given VLAN. Since the "default" is VLAN 1, if no ports are in VLAN 1, they return no ports. They say they interpret this the same way across all devices and IOS versions, and from what I've seen they do.
There are other OIDs from which you get the list of VLANs and their names and such. E.g. if you query the OID below, you get a list similar to what is shown.
VLAN Name: 1.3.6.1.4.1.9.9.46.1.3.1.1.4
‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.1’ => “default”
‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.10’ => “VLAN0010”
‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.100’ => “VLAN0100”
‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.101’ => “VLAN0101”
‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.102’ => “vlan102”
‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.109’ => “VLAN0109”
‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.1000’ => “VLAN1000”
‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.1001’ => “VLAN1001”
‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.1002’ => “fddi-default”
‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.1003’ => “token-ring-default”
‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.1004’ => “fddinet-default”
‘1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.1005’ => “trnet-default”
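A minimal sketch of turning a VLAN-name walk like the one above into a list of VLAN IDs. It assumes the VLAN ID is the last component of each returned OID (true for the Cisco vtpVlanName column, OID 1.3.6.1.4.1.9.9.46.1.3.1.1.4), that the walk has already been captured as (oid, name) pairs, and that the 1002 to 1005 legacy defaults can be skipped. `vlan_ids` is a hypothetical helper name and the sample data is illustrative.

```python
# Legacy FDDI/Token Ring default VLANs that normally carry no ports.
LEGACY_DEFAULTS = range(1002, 1006)


def vlan_ids(walk):
    """Extract VLAN IDs from (oid, name) pairs; the ID is the last OID component."""
    ids = []
    for oid, _name in walk:
        vlan = int(oid.rsplit(".", 1)[1])
        if vlan not in LEGACY_DEFAULTS:
            ids.append(vlan)
    return sorted(ids)


# Illustrative sample, not captured from a real switch.
sample = [
    ("1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.1", "default"),
    ("1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.10", "VLAN0010"),
    ("1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.100", "VLAN0100"),
    ("1.3.6.1.4.1.9.9.46.1.3.1.1.4.1.1002", "fddi-default"),
]
print(vlan_ids(sample))
```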
Taxing in terms of polling load? Solarwinds is supposed to do this for all devices, so not really sure why it would be more taxing than its regular behavior...
So to be sure to know about all the port mappings, wouldn't they need to poll
1.3.6.1.4.1.9.9.46.1.3.1.1.4 via <mycommunity>@1
1.3.6.1.4.1.9.9.46.1.3.1.1.4 via <mycommunity>@10
1.3.6.1.4.1.9.9.46.1.3.1.1.4 via <mycommunity>@100
1.3.6.1.4.1.9.9.46.1.3.1.1.4 via <mycommunity>@101
1.3.6.1.4.1.9.9.46.1.3.1.1.4 via <mycommunity>@102
1.3.6.1.4.1.9.9.46.1.3.1.1.4 via <mycommunity>@109
1.3.6.1.4.1.9.9.46.1.3.1.1.4 via <mycommunity>@1000
If they don't poll all the used VLANs, then wouldn't they always run the risk of having the problem you started with in the first place? Or do all interfaces show up even if an interface doesn't carry the VLAN, so that we just need to poll the first existing VLAN?
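To make the polling cost concrete, here is a rough sketch that builds one net-snmp `snmpwalk` command per VLAN using Cisco's community@vlan indexing. It only builds the argument lists and does not run anything. The OID constant is dot1dBasePortIfIndex from earlier in the thread (swap in whichever OID you actually need), and the host, community and `walk_commands` name are placeholders.

```python
BASE_PORT_IFINDEX = "1.3.6.1.2.1.17.1.4.1.2"  # dot1dBasePortIfIndex


def walk_commands(host, community, vlans, oid=BASE_PORT_IFINDEX):
    """Build one snmpwalk argument list per VLAN context (community@vlan)."""
    return [
        ["snmpwalk", "-v", "2c", "-c", f"{community}@{vlan}", host, oid]
        for vlan in vlans
    ]


# Placeholder host (documentation address range) and community string.
commands = walk_commands("192.0.2.10", "mycommunity", [1, 10, 100, 101, 102, 109, 1000])
for cmd in commands:
    print(" ".join(cmd))
```

That is one full walk per configured VLAN per switch, which is where the extra load would come from on devices with many VLANs.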
Interestingly I did a List Resources on one of them and it showed a cached version. I did a Force Refresh and then it showed the VLAN option. So one of the interfaces now had VLAN 1 on it. On the other 2, once I added an interface to VLAN1, I got the vlan option.
So, did some more testing with both an Arista switch and an older Cisco switch.
Long story short, it's probably not an issue that affects multiple vendors. The Arista switch always returned info when that OID was scanned, no matter how the ports were configured.
However, on an old Cisco 2960-24TT-L, I observed the same behavior, both via an snmpwalk and with "List Resources" missing the "VLAN" box in Orion. When all ports were configured in VLANs, issues were present. When at least one port wasn't in a VLAN, issues were not present. This was a 12.2(X) strain of IOS too, so this issue has probably been around for quite a while!!
So, it's very possible that if you have Cisco switches, quite a bit of information might be missing from them!!
out of 73 nodes; we have one that seems to be affected.
up for 49 weeks
IOS Version: 03.07.01.E RELEASE SOFTWARE (fc3)
IOS Image: CAT3K_CAA-UNIVERSALK9-M
of the 72 unaffected nodes:
uptime from 1 week to 98 weeks
35 others have the same IOS Version
all 72 have the same IOS Image
Oh hey, we might have figured out a commonality that could help figure out why some switches are that way and not others. Only gone through a few of my switches so far, but, it seems to be limited to those switches that are either 100% full and have a dynamically assigned vlan (other than vlan 1) on each port, or have been statically configured on all the end-user ports to a vlan other than vlan 1. That, and all the trunk ports might need to have specific vlans allowed on the trunk ports too.
But one other "symptom" we've figured out is that you can do the SNMP walk for one of the active VLANs on the switch, but not for VLAN 1. I.e. if all your switchports have "switchport access vlan 34" on them, the first walk below returns data and the second does not:
snmpwalk -v 2c -c MySNMPCommunity@34 <IP of Switch> 1.3.6.1.2.1.17.1.4.1.2
snmpwalk -v 2c -c MySNMPCommunity <IP of Switch> 1.3.6.1.2.1.17.1.4.1.2
Note the "@34" appended to the community string in the first example; it selects that VLAN...
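The with-context versus without-context comparison can be scripted once both walks have been saved as text. This is a minimal sketch under stated assumptions: that returned ports appear as lines containing " = INTEGER: " (the usual net-snmp rendering for this object), that the sample strings below are illustrative rather than captured output, and that `count_rows` and `diagnose` are hypothetical helper names.

```python
def count_rows(walk_output):
    """Count lines in snmpwalk text output that carry an integer value."""
    return sum(1 for line in walk_output.splitlines() if " = INTEGER: " in line)


def diagnose(default_output, vlan_output):
    """Compare a walk without a VLAN context against one with community@vlan."""
    if count_rows(default_output):
        return "default context returns ports"
    if count_rows(vlan_output):
        return "only the VLAN context returns ports"
    return "no ports returned in either context"


# Illustrative text, not real device output.
default_out = "SNMPv2-SMI::mib-2.17.1.4.1.2 = No Such Instance currently exists at this OID"
vlan_out = (
    "SNMPv2-SMI::mib-2.17.1.4.1.2.1 = INTEGER: 9\n"
    "SNMPv2-SMI::mib-2.17.1.4.1.2.2 = INTEGER: 10"
)
print(diagnose(default_out, vlan_out))
```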
Interesting! Not far off from what I'd expect I guess though. We're affected on about 4% of our nodes, which would be 2.92 nodes out of 73? So 1 isn't too far off I don't think. I think the IOS version is kind of "in the middle" of the versions we're using, so that would make sense. Looks like a stack of 2 switches? Out of our affected switches, only one of them is a stack.
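For reference, the back-of-the-envelope numbers above work out as follows. This uses only the counts already quoted in the thread (35 affected out of 861 stacks, and a 73-node environment).

```python
affected, total = 35, 861
rate = affected / total

print(f"affected rate: {rate:.1%}")
print("expected in 73 nodes at the exact rate:", round(rate * 73, 2))
print("expected in 73 nodes at a flat 4%:", round(0.04 * 73, 2))
```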
Cisco is giving me a bit of a runaround right now, hoping our SE will step in. I already asked for the case to be moved to another engineer while ours was taking a day off, but they didn't listen to me. I gave him all the info he wanted in terms of "show" commands on 3 non-working switches and 1 working, plus all the stats I generated, including an SNMP walk from one of the non-working switches. But now he wants it from 2 other switches, 1 working and 1 not, saying that getting it from these other switches will make it "easier to look up and reproduce". Why does he need an snmpwalk from a working device from me? It's working!
Sorry, a bit annoyed with them...
Going to open a case on yours?