We recently migrated one of our App servers from a virtual to a physical server.
The migration went smoothly, but moments after the go-live we noticed performance issues on the server.
To rule out Orion I stopped the Orion services to see if the performance would stabilize to normal activity. CPU usage dropped but the device was still lagging.
Originally the VM had one NIC assigned to it and ran on a VM host that was dual-homed. The new hardware was configured to use 2 of the 3 onboard NICs. Each NIC has 5 ports, and they were teamed. A port channel was created on the switches for the teamed NICs.
We have since disabled the onboard NICs and plugged into the PCI Express NIC without teaming. This is temporary; we do plan on having a second connection to a separate switch.
Checking the hardware requirements, I was not able to see whether there are any restrictions on NIC teaming. Has anyone experienced similar issues? If so, what did you do to investigate and resolve the situation?
I've watched System Administrators continually be handed NIC-Teaming tasks and fail to understand the process and options and correct configurations.
In a Nexus environment, we offered four options to SAs for LACP to allow two or more server NICs to connect to resilient switches. Some of those options work well with MS products, others work well with Unix / Big Iron capabilities.
When Microsoft stopped supporting NIC Teaming, things seemed to be scattered, and too many cooks were in the kitchen, resulting in dual-NIC resilience occasionally being lost, even on brand new servers. The skill sets of the System Admin and the Network Analyst must overlap at the "good communications between each other" stage.
The Network Analyst should be able to explain what kinds of options are available on port-channels, and which ones work well for various servers.
The System Admin must be able to discover what LACP (or other) multi-NIC solutions are supported by the hardware and interpret which ones work properly with the given switch.
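channel-group [xxx] = raw (static) etherchannel, no negotiation (NX-OS defaults to mode on)
channel-group [xxx] mode on = raw (static) etherchannel hard coded, no LACP
channel-group [xxx] mode passive = LACP, answers negotiation but does not initiate it
channel-group [xxx] mode active = LACP, actively initiates negotiation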
Working with the IBM Big Iron folks, IBM recommends first trying Active, then Passive. We've also tried "On" for the experience.
For Windows servers, we recommended using "active".
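As a rough sketch of what the "active" option might look like on the switch side in NX-OS: the port-channel number (10), the member interfaces (Ethernet1/1-2) and the access VLAN (100) below are made-up values for illustration, not anything from our environment.

```
! Hypothetical numbering: port-channel 10, Ethernet1/1-2, access VLAN 100
! LACP must be enabled before the active/passive modes are accepted
feature lacp

interface port-channel 10
  switchport
  switchport mode access
  switchport access vlan 100

! Member ports facing the server's teamed NICs
interface ethernet 1/1-2
  switchport
  switchport mode access
  switchport access vlan 100
  channel-group 10 mode active
  no shutdown
```

The server team has to be set to LACP as well for this to negotiate. If the two server NICs land on two different Nexus switches, a vPC is needed on top of this, which is not shown here.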
When troubleshooting LACP, there are a number of useful commands in the Nexus world, run from the CLI on the Nexus gear, that can show what kind of LACP traffic is being accepted, offered, or rejected by servers.
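A few NX-OS show commands that are commonly used for this kind of check are sketched below. These are my own picks rather than a definitive list, and the port-channel number (10) and interface (Ethernet1/1) are hypothetical placeholders.

```
! Hypothetical numbering: port-channel 10, member Ethernet1/1
! Per-member state flags, e.g. (P) up in port-channel, (I) individual, (s) suspended
show port-channel summary

! What the server side is advertising as the LACP partner
show lacp neighbor interface port-channel 10

! LACPDUs sent and received per member; a received count stuck at zero
! suggests the server is not speaking LACP on that link
show lacp counters interface port-channel 10

! Detailed local and partner LACP state for a single member port
show lacp interface ethernet 1/1
```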
I've seen where setting the switch to "raw" mode would bring the port-channel up, but the traffic wouldn't flow because the server's LACP wasn't correctly configured.
When all else fails, RTFM. The info is in there.
Or, open a TAC case.
We have deployed SolarWinds Orion on servers (VM and physical) with teamed NICs, and while this works most of the time, we have encountered issues. Some of these were resolved by changing the VM interface type, some with drivers, but I do recall a couple where, despite all efforts, whenever the NICs on the server were teamed, data polling backed up, query performance to the SQL Server was affected, etc.
Sounds like a tricky one to blame on SolarWinds. I would be tempted to upgrade network card drivers and possibly firmware. There are many variables to be considered here.
What apps are you running?
How much memory does this machine have?
What model are the network cards?
How many CPUs + cores do you have?
I have run many large deployments on both physical and virtual infrastructure, and I've only once managed to flood a single NIC with too much traffic.