2 Replies. Latest reply: Sep 4, 2013 10:10 AM by fcpsolaradmin

What did I miss?

fcpsolaradmin

Thursday, just before the end of the week and right after a team meeting, I received the call. We all know it; it starts with "what the #$#@ is going on with the network!?" (for some reason it always gets blamed on the network, never the servers). The issue reported was that users on the east side of the building were having slow logins, slow email (whatever that means), application timeouts, and an inability to log into other applications. I logged into SolarWinds and checked:

  1. Network utilization: all looks fine; the switch in question isn't even in the top five for any list
  2. Virtualization Manager: no new issues
  3. SAM: no server errors
  4. vSphere: no alerts
  5. Pinged the servers that were reported as having issues: pinged fine
  6. Pinged workstations and printers in the department having issues: pinged fine
  7. VoIP phones: working at 100%

 

The network admin logged into the switch and noticed there was a moderate level of discards on one of the two links from the switch. He shut down the link reporting discards, and that seemed to fix the problem. The number of discards was not high enough to stand out from our other switches.
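
For anyone comparing discard counters across switches, a rough way to judge whether a "moderate" count matters is to turn two counter samples into a percentage of traffic over the polling interval. This is only a minimal sketch: the dictionary layout and the sample numbers are made up for illustration, and it assumes 32-bit counters where "packets" means packets handled in one direction, not counting the discards.

```python
COUNTER32_MAX = 2**32  # assumption: 32-bit interface counters that wrap to zero


def counter_delta(prev, curr, modulus=COUNTER32_MAX):
    """Difference between two samples of a wrapping counter."""
    return (curr - prev) % modulus


def discard_percent(prev, curr):
    """Percentage of packets discarded between two samples.

    prev and curr are hypothetical dicts holding 'discards' and 'packets'
    counters for one direction of one interface.
    """
    discards = counter_delta(prev["discards"], curr["discards"])
    packets = counter_delta(prev["packets"], curr["packets"])
    total = discards + packets
    return 0.0 if total == 0 else 100.0 * discards / total


# Made-up samples taken one polling interval apart.
earlier = {"discards": 1_000, "packets": 5_000_000}
later = {"discards": 1_600, "packets": 5_299_400}
rate = discard_percent(earlier, later)
print(f"{rate:.2f}% of packets discarded in this interval")
```

A per-interval percentage like this is easier to compare between links than the raw lifetime counter, which mostly reflects how long the switch has been up.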

 

The issue with that "fix" is that it shouldn't have made THAT much of a difference, as each switch has two 1 Gb fiber links to the main building switch. If one link was having issues, traffic could just go over the other with little impact.

 

Our HIPAA guy has been messing with IPsec, and it happens to be in the wing of the building where the issues were occurring. I know little about IPsec, but from what I have read it can cause a bottleneck. He says he hasn't enabled it on anything yet, though....

 

I will have to answer to my boss on Monday as to why this wasn't found sooner. I am not sure what to tell him.

I put together a quick diagram; switch 2 is the one where issues were reported. Any thoughts as to what I may have missed when looking for the bottleneck?

 

[Attached diagram: Blank Flowchart - New Page.png]

 
  • Re: What did I miss?
    svindler

    A few things to consider:

    If the two links are not in a port channel, spanning tree will typically block one of them, so only one link is actively carrying traffic, depending on your setup.

    If the switches are also layer 3 devices, load balancing may prefer only one of the links, even if they are in a port channel. There are a number of ways to mitigate this.

    Was the interface re-enabled to verify whether the discards continued?

    Were the discards verified on the switch itself or only through Orion? There is a bug on some Cisco switches where they occasionally report the wrong number via SNMP.

    In most cases, two links will be configured so that both handle traffic. Unless one of the links goes down, both links will continue to receive the same share of traffic, even if one of them is discarding some packets.
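
To illustrate the point about both links continuing to receive traffic, here is a minimal sketch of hash-based link selection. It assumes the bundle picks a member link from a hash of the source and destination addresses; real switches use vendor-specific hash inputs and algorithms, so the CRC32 hash, the link names, and the addresses below are purely illustrative.

```python
import zlib


def pick_link(src_ip, dst_ip, links):
    """Pick a member link for a flow from a hash of its addresses.

    CRC32 is a stand-in here; it is not what any particular switch uses.
    """
    key = f"{src_ip}->{dst_ip}".encode()
    return links[zlib.crc32(key) % len(links)]


links = ["uplink-1", "uplink-2"]  # hypothetical names for the two fiber links

# One hundred made-up workstations all talking to the same server.
flows = [(f"10.0.2.{host}", "10.0.0.10") for host in range(1, 101)]

counts = {link: 0 for link in links}
for src, dst in flows:
    counts[pick_link(src, dst, links)] += 1
print(counts)

# A link that is up but discarding packets is still a hash target, so the
# flows pinned to it keep suffering. Only removing it from the bundle
# (shutting it down) moves them to the healthy link.
after_shutdown = {pick_link(src, dst, ["uplink-2"]) for src, dst in flows}
print(after_shutdown)
```

Under this assumption, a given workstation-to-server pair always lands on the same link, which would be consistent with some users being badly affected while link-level statistics looked unremarkable.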