Whatever business one might choose to examine, the network is the glue that holds everything together. Whether the network is the product (e.g. for a service provider) or simply an enabler for business operations, it is extremely important for the network to be both fast and reliable.
IP telephony and video conferencing have become commonplace, taking communications that previously required dedicated hardware and phone lines and moving them to the network. I have also seen many companies mothball their dedicated Storage Area Networks (SANs) and move them closer to Network Attached Storage, using iSCSI and NFS for data mounts. I also see applications utilizing cloud-based storage provided by services like Amazon's S3, which also depend on the network to move the data around. Put simply, the network is critical to modern companies.
Despite the importance of the network, many companies seem to have only a very basic understanding of their own network performance even though the ability to move data quickly around the network is key to success. It's important to set up monitoring to identify when performance is deviating from the norm, but in this post, I will share a few other thoughts to consider when looking at why network performance might not be what people expect it to be.
MTU (Maximum Transmission Unit) determines the largest frame of data that can be sent over an ethernet interface. It's important because every frame that's put on the wire contains overhead; that is, data that is not the actual payload. A typical ethernet interface might default to a physical MTU of around 1518 bytes, so let's look at how that might compare to a system that offers an MTU of 9000 bytes instead.
What's in a frame?
A typical TCP datagram has overhead like this:
- Ethernet header (14 bytes)
- IPv4 header (20 bytes)
- TCP header (usually 20 bytes, up to 60 if TCP options are in play)
- Ethernet Frame Check Sum (4 bytes)
That's a total of 58 bytes. The rest of the frame can be data itself, so that leaves 1460 bytes for data. The overhead for each frame represents just under 4% of the transmitted data.
The same frame with a 9000 byte MTU can carry 8942 bytes of data with just 0.65% overhead. Less overhead means that the data is sent more efficiently, and transfer speeds can be higher. Enabling jumbo frames (frames larger than 1500 bytes) and raising the MTU to 9000 if the hardware supports it can make a huge difference, especially for systems moving a lot of data around the network, such as the Network Attached Storage.
What's the catch?
Not all equipment supports a high MTU because it's hardware dependent, although most modern switches I've seen can handle 9000-byte frames reasonably well. Within a data center environment, large MTU transfers can often be achieved successfully, with positive benefits to applications as a result.
However, Wide Area Networks (WANs) and the internet are almost always limited to 1500 bytes, and that's a problem because those 9000-byte frames won't fit into 1500 bytes. In theory, a router can break large packets up into appropriately sized smaller chunks (fragments) and send them over links with reduced MTU, but many firewalls are configured to block fragments, and many routers refuse to fragment because of the need for the receiver to hold on to all the fragments until they arrive, reassemble the packet, then route it toward its destination. The solution to this is PMTUD (Path MTU Discovery). When a packet doesn't fit on a link without being fragmented, the router can send a message back to the sender saying,
It doesn't fit, the MTU is... Great! Unfortunately, many firewalls have not been configured to allow the ICMP messages back in, for a variety of technical or security reasons, but with the ultimate result of breaking PMTUD. One way around this is to use one ethernet interface on a server for traffic internal to a data center (like storage) using a large MTU, and another interface with a smaller MTU for all other traffic. Messy, but it can help if PMTUD is broken.
The ethernet frame encapsulations don't end there. Don't forget there might be an additional 5 bytes required for VLAN tagging over trunk links, VXLAN encapsulation (50 bytes) and maybe even GRE or MPLS encapsulations (4 bytes each). I've found that despite the slight increase in the ratio of overhead to data, 1460 bytes is a reasonably safe MTU for most environments, but it's very dependent on exactly how the network is set up.
I had a complaint one time that while file transfers between servers within the New York data center were nice and fast, when the user transferred the same file to the Florida data center (basically going from near the top to the bottom of the Eastern coast of the United States) transfer rates were very disappointing, and they said the network must be broken. Of course, maybe it was, but the bigger problem without a doubt was the time it took for an IP packet to get from New York to Florida, versus the time it takes for an IP packet to move within a data center.
AT&T publishes a handy chart showing their current U.S. network latencies between pairs of cities. The New York to Orlando current shows that it has a 33ms latency, which is about what we were seeing on our internal network as well. Within a data center, I can move data in a millisecond or less, which is 33 times faster. What many people forget is that when using TCP, it doesn't matter how much bandwidth is available between two sites. A combination of end-to-end latency and congestion window (CWND) size will determine the maximum throughput for a single TCP session.
TCP session example
If it's necessary to transfer 100,000 files from NY to Orlando, which is faster:
- Transfer the files one by one?
- Transfer ten files in parallel?
It might seem that the outcome would be the same because a server with a 1G connection can only transfer 1Gbps, so whether you have one stream at 1Gbps or ten streams at 100Mbps, it's the same result. But actually, it isn't because the latency between the two sites will effectively limit the maximum bandwidth of each file transfer's TCP session. Therefore, to maximize throughput, it's necessary to utilize multiple parallel TCP streams (an approach taken very successfully for FTP/SCP transfers by the open source FileZilla tool). It's also the way that tools like those from Aspera can move data faster than a regular Windows file copy.
The same logic also applies to web browsers, which typically will open five or six parallel connections to a single site if there are sufficient resource requests to justify it. Of course, each TCP session requires a certain amount of overhead for connection setup. Usually a three-way handshake, and if the session is encrypted there may be a certificate or similar exchange to deal with as well. Another optimization that is available here is pipelining.
Pipelining uses a single TCP connection to issue multiple requests back to back. In HTTP protocol, this is accomplished by the HTTP header
Connection: keep-alive, which is a default in HTTP/1.1. This request asks the destination server to keep the TCP connection open after completing the HTTP request in case the client has another request to make. Being able to do this allows the transfer of multiple resources with only a single TCP connection overhead (or, as many TCP connection overheads as there are parallel connections). Given that a typical web page may make many tens of calls to the same site (50+ is not unusual), this efficiency stacks up quite quickly. There's another benefit too, and that's the avoidance of TCP slow start.
TCP slow start
TCP is a reliable protocol. If a datagram (packet) is lost in transit, TCP can detect the loss and resend the data. To protect itself against unknown network conditions, however, TCP starts off each connection being fairly cautious about how much data it can send to the remote destination before getting confirmation back that each sent datagram was received successfully. With each successful loss-free confirmation, the sender exponentially increases the amount of data it is willing to send without a response, increasing the value of its congestion window (CWND). Packet loss causes CWND to shrink again, as does an idle connection during which TCP can't tell if network conditions changed, so to be safe it starts from a smaller number again. The problem is, as latency between endpoints increases, it takes progressively longer for TCP to get to its maximum CWND value, and thus longer to achieve maximum throughput. Pipelining can allow a connection to reach maximum CWND and keep it there while pushing multiple requests, which is another speed benefit.
I won't dwell on compression other than to say that it should be obvious that transferring compressed data is faster than transferring uncompressed data. For proof, ask any web browser or any streaming video provider.
Application vs network performance
Much of the TCP tuning and optimization that can take place is a server OS/application layer concern, but I mention it because even on the world's fastest network, an inefficiently designed application will still run inefficiently. If there is a load balancer front-ending an application, it may be able to do a lot to improve performance for a client by enabling compression or Connection: keep-alive, for example, even when an application does not.
In the network itself, for the most part, things just work. And truthfully, there's not much one can do to make it work faster. However, the network devices should be monitored for packet loss (output drops, queue drops, and similar). One of the bigger causes of this is microbursting.
Modern servers are often connected using 10Gbps ethernet, which is wonderful except they are often over-eager to send out frames. Data is prepared and buffered by the server, then BLUURRRGGGGHH it is spewed at the maximum rate into the network. Even if this burst of traffic is relatively short, at 10Gbps it can fill a port's frame buffer and overflow it before you know what's happened, and suddenly the latter datagrams in the communication are being dropped because there's no more space to receive them. Anytime the switch can't move the frame from input to output port at least as fast as it's coming in on a given port, the input buffer comes into play and puts it at risk of getting overfilled. These are called microbursts because a lot of data is sent over a very short period. Short enough, in fact, for it to be highly unlikely that it will ever be identifiable in the interface throughput statistics that we all like to monitor. Remember, an interface running between 100% for half the time and 0% for the rest will likely show up as running at 50% capacity in a monitoring tool. What's the solution? MOAR BUFFERZ?! No.
I don't have space to go into detail here, so let me point you to a site that explains buffer bloat, and why it's a problem. The short story is that adding more buffers in the path can actually make things worse because it actively works against the algorithms within TCP that are designed to handle packet loss and congestion issues.
It sounds obvious, but a link that is fully utilized will lead to slower network speeds, whether through higher delays via queuing, or packet loss leading to connection slowdowns. We all monitor interface utilization, right? I thought so.
The perfect network
There is no perfect network, let's be honest. However, having an understanding not only of how the network itself (especially latency) can impact throughput, as well as an understanding of the way the network is used by the protocols running over it, might help with the next complaint that comes along. Optimizing and maintaining network performance is rarely a simple task, but given the network's key role in the business as a whole, the more we understand, the more we can deliver.
While not a comprehensive guide to all aspects of performance, I hope that this post might have raised something new, confirmed what you already know, or just provided something interesting to look into a bit more. I'd love to hear your own tales of bad network performance reports, application design stupidity, crazy user/application owner expectations (usually involving packets needing to exceed the speed of light) and hear how you investigated and hopefully fixed them!