The holidays are over and your workforce is back at their desks, blaming the IT team for whatever isn’t working or is slow. While most application performance issues can be blamed on the application itself, there can be other factors too, like an edge or core device behaving badly, server faults, or plain low bandwidth. And there are times when even good old FTP fails and you have no idea what’s wrong.
Here is a quick list of things to check for when you are at a dead end:
IP conflicts: Unfortunately, your system does not always notify you of an IP conflict. If the conflicting device is a rogue network device or an OS that cannot resolve an IP conflict by itself, the result is intermittent connectivity. This happens because every device with the same IP responds to ARP requests sent by a switching or routing device. So, during a data transfer, some packets of the conversation go to one device while the rest go to the other.
Solution? Use a ‘user device’ tracking tool or an IP conflict detection tool.
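If you don’t have such a tool handy, the underlying check is simple: look for one IP address being claimed by more than one MAC address. Here is a minimal Python sketch of that idea. The input format (a list of IP and MAC pairs, as you might collect from ARP replies or a switch’s ARP table) and the sample addresses are assumptions made for illustration.

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Return {ip: sorted MACs} for every IP claimed by more than one MAC.

    `observations` is an iterable of (ip, mac) pairs. This input format is
    an assumption of the sketch, not the output of any particular tool.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        macs_by_ip[ip].add(mac.lower())  # normalize case before comparing
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical sample data: two different devices answer for 10.0.0.5.
sample = [
    ("10.0.0.5", "00:11:22:33:44:55"),
    ("10.0.0.9", "00:11:22:33:44:66"),
    ("10.0.0.5", "AA:BB:CC:DD:EE:FF"),
    ("10.0.0.5", "00:11:22:33:44:55"),  # repeat sighting, not a conflict
]
conflicts = find_ip_conflicts(sample)
print(conflicts)
```

Any IP that shows up in the result deserves a closer look at the switch ports where those MAC addresses were learned.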
MTU: The maximum transmission unit is the largest PDU that a communication layer can pass forward in one piece. For Ethernet version 2 networks it is 1500 bytes of payload, which gives a maximum frame size of 1518 bytes once the Ethernet header and frame check sequence are added. When a router receives a packet that is too large for the next hop, it either fragments the packet or, if the DF (Don’t Fragment) bit is set, drops it and sends an ICMP error back to the transmitter saying the packet is too large. If your application ignores the error, or something in your network blocks the ICMP error sent by the router, the application will keep sending oversized PDUs, which hurts performance. The issue is usually seen where a VPN is involved, because the encapsulation overhead pushes packets past the 1500-byte MTU.
Solution? Use ping or traceroute to find the largest packet your path can forward without fragmentation, and set that MTU on your device. And don’t forget to make sure that the ICMP error messages about MTU are not being blocked anywhere in your network.
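The ping test comes down to a bit of arithmetic plus a search. An IPv4 ping carries a 20-byte IP header (without options) and an 8-byte ICMP header, so the largest payload that fits in a 1500-byte MTU is 1472 bytes. The Python sketch below shows that calculation and a binary search for the largest size that gets through. The `probe` function is a hypothetical stand-in for "send a packet of this size with DF set and see if it arrives"; here it is simulated with a made-up 1400-byte limit rather than real network traffic.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes


def ping_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one IPv4 packet of `mtu` bytes."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def find_path_mtu(probe, low=576, high=1500):
    """Binary-search the largest packet size for which probe(size) is True.

    `probe` is a hypothetical callable: it should send a DF-marked packet of
    the given total size and return True if it got through. The search
    assumes a packet of `low` bytes always passes.
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always progresses
        if probe(mid):
            low = mid       # mid fits, try larger
        else:
            high = mid - 1  # mid is too big, try smaller
    return low


def simulated_probe(size):
    """Pretend a tunnel on the path limits packets to 1400 bytes (made up)."""
    return size <= 1400


path_mtu = find_path_mtu(simulated_probe)
print(path_mtu, ping_payload_for_mtu(path_mtu))
```

On a real network you would run the same probe by hand, for example `ping -M do -s 1472 <host>` on Linux or `ping -f -l 1472 <host>` on Windows, lowering the size until replies come back.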
Auto-Negotiation or Duplex mismatch: This one can be controversial but still, here we go. While there are network admins who always hard set the speed and duplex on each interface of their networking devices, there are others who believe that auto-negotiation issues are a myth and will never happen. In reality auto-negotiation can fail, but when? When the cabling is bad, when the devices in question are obsolete or cheap, or simply when one of the devices is set for auto-negotiation and the other is forced. Hard setting the duplex on an interface can also cause an issue when two connected devices are set to different duplex modes. The end result is an impact on performance because of packet retransmissions and a high number of errors on the affected ports.
Solution? Check for errors and retransmissions with an NMS or use auto-negotiation on all your devices. And don’t forget, when it comes to Gigabit Ethernet over copper (1000BASE-T), auto-negotiation must be used.
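When you do go through the error counters, a duplex mismatch tends to leave a recognizable pattern: the half-duplex side reports late collisions, while the full-duplex side reports CRC/FCS errors and runts. The Python sketch below turns that rule of thumb into a quick check. The counter names are hypothetical (every vendor labels them differently), and the same errors can also come from bad cabling, so treat the result as a hint and nothing more.

```python
def duplex_mismatch_hint(counters):
    """Return a hint string if the counters resemble a duplex mismatch.

    `counters` is a dict with hypothetical keys 'late_collisions',
    'crc_errors' and 'runts'. Returns None when nothing stands out.
    """
    if counters.get("late_collisions", 0) > 0:
        return "late collisions: this side may be half duplex facing a full-duplex peer"
    if counters.get("crc_errors", 0) > 0 and counters.get("runts", 0) > 0:
        return "CRC errors and runts: this side may be full duplex facing a half-duplex peer"
    return None


# Made-up counter readings for two ends of the same link.
print(duplex_mismatch_hint({"late_collisions": 12, "crc_errors": 0, "runts": 0}))
print(duplex_mismatch_hint({"late_collisions": 0, "crc_errors": 340, "runts": 57}))
```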
TCP windowing: When there is a slow connectivity issue, the first step for many organizations is to throw expensive bandwidth at the problem. But many admins forget about the TCP window size. A sender can have at most one window of unacknowledged data in flight, so throughput is capped at roughly the window size divided by the round-trip time, no matter how much bandwidth is available. While window scaling is an available solution, some routers and firewalls do not properly implement TCP window scaling, thereby causing a user’s Internet connection to malfunction intermittently for a few minutes. When transferring large files between two systems, if the connection is intermittent or slower than it should be, the cause could be a low TCP window size on the receiving system.
Solution? As stated, most systems do support TCP window scaling, but when things slow down and you don’t know what is wrong, make sure that TCP window scaling is functioning properly or try increasing the TCP receive buffer. Again, make use of an NMS tool for troubleshooting.
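To see why more bandwidth does not help here, it is worth running the numbers. The Python sketch below computes the throughput ceiling for a given window and round-trip time, and the window needed to fill a given link (the bandwidth-delay product). The 50 ms round-trip time and the 100 Mbps link are example values chosen for illustration.

```python
def max_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on TCP throughput: one full window per round trip."""
    return window_bytes * 8 / rtt_seconds


def required_window_bytes(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product: the window needed to keep the link full."""
    return bandwidth_bps * rtt_seconds / 8


# 65,535 bytes is the largest window TCP can advertise without window scaling.
ceiling = max_throughput_bps(65535, 0.05)        # example: 50 ms round trip
needed = required_window_bytes(100_000_000, 0.05)  # example: 100 Mbps link

print(f"Ceiling with a 65,535-byte window: {ceiling / 1e6:.1f} Mbps")
print(f"Window needed to fill 100 Mbps: {needed:,.0f} bytes")
```

With these example values the unscaled window tops out at about 10.5 Mbps, and filling the 100 Mbps link takes a window of about 625,000 bytes, which is only possible when window scaling works end to end.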
Flow control: The flow control mechanism allows an overloaded Ethernet device to send ‘pause’ frames to other devices that are sending data to it. Without flow control, the overloaded device drops packets, causing a major performance impact. But when it comes to the backbone or network core, flow control can cause congestion in areas that would otherwise have transmitted without issues. For example, a switch sends a pause frame to a transmitting device because one switch port is unable to match the transmitter’s speed. When the transmitting device receives the pause frame, it pauses its transmission for a few milliseconds. But that also pauses the traffic to all the other switch ports that do have the bandwidth to handle the speed.
Solution? Use flow control on computers, but don’t have your switches send out pause frames. Instead, implement QoS in the backbone to prioritize packets based on their criticality. You can find flow control best practices in this whitepaper.
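How long is a pause, exactly? An Ethernet pause frame carries a 16-bit pause time measured in quanta, where one quantum is the time it takes to transmit 512 bits at the link speed. The Python sketch below converts that into seconds. The link speeds used are example values.

```python
PAUSE_QUANTUM_BITS = 512  # one pause quantum = 512 bit times
MAX_QUANTA = 0xFFFF       # the pause time field is 16 bits wide


def pause_duration_seconds(quanta, link_bps):
    """Time a sender stays quiet after a pause frame asking for `quanta`."""
    if not 0 <= quanta <= MAX_QUANTA:
        raise ValueError("pause quanta must fit in 16 bits")
    return quanta * PAUSE_QUANTUM_BITS / link_bps


# Longest possible pause on two example link speeds.
for name, bps in [("1 Gbps", 1_000_000_000), ("10 Gbps", 10_000_000_000)]:
    ms = pause_duration_seconds(MAX_QUANTA, bps) * 1000
    print(f"{name}: up to {ms:.2f} ms per pause frame")
```

A single maximum-length pause on a 1 Gbps link holds the sender back for roughly 33.55 ms, and during that time nothing is sent to any destination behind that port, including the ones that were keeping up fine.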
Right QoS: And that brings us to QoS. Network admins use QoS because it can prioritize business applications and drop unwanted traffic. But a few network admins overdo QoS by using it for every type of traffic that passes through the device. This can leave a few business applications performing well all of the time while others act up most of the time.
Solution? Use QoS only when it is absolutely necessary. You do not have to set priority or a QoS action for all traffic that passes through your network. Prioritize what is important and set best effort or queuing for everything else. Assign bandwidth for very critical applications whose data delivery is important to business continuity.
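One way to keep a QoS policy honest is to write it down as a short plan: reserve bandwidth for the handful of critical applications and leave everything else to best effort. The Python sketch below does only that and refuses a plan that reserves more than the link can carry. The application names and numbers are made up for illustration, and this is a planning aid, not device configuration.

```python
def plan_bandwidth(link_mbps, reservations):
    """Reserve bandwidth for critical classes; the remainder is best effort.

    `reservations` maps a class name to its reserved Mbps. The class names
    used below are hypothetical examples.
    """
    reserved = sum(reservations.values())
    if reserved > link_mbps:
        raise ValueError("reservations exceed link capacity")
    plan = dict(reservations)
    plan["best-effort"] = link_mbps - reserved
    return plan


# Example: a 100 Mbps link with two critical applications.
plan = plan_bandwidth(100, {"voip": 10, "erp": 30})
print(plan)
```

If the list of reservations keeps growing until best effort has almost nothing left, that is usually a sign that QoS is being overused.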
It’s important to understand the various reasons for data delivery failure, even the uncommon ones. By doing so, you will have a better idea of where to search when issues arise. If you’ve faced data delivery problems, where did your issue stem from and how did you resolve it?