The story so far:
- It's Not Always The Network! Or is it? Part 1 -- by John Herbert (jgherbert)
- It's Not Always The Network! Or is it? Part 2 -- by John Herbert (jgherbert)
- It's Not Always The Network! Or is it? Part 3 -- by Tom Hollingsworth (networkingnerd)
- It's Not Always The Network! Or is it? Part 4 -- by Tom Hollingsworth (networkingnerd)
- It's Not Always The Network! Or is it? Part 5 -- by John Herbert (jgherbert)
- It's Not Always The Network! Or is it? Part 6 -- by Tom Hollingsworth (networkingnerd)
What happens when your website goes down on Black Friday? Here's the seventh installment, by John Herbert (jgherbert).
The View From Above: James, CEO
It's said, somewhat apocryphally, that Black Friday is so called because it's the day when stores sell so much merchandise and make so much money that it finally puts them 'in the black' for the year. In reality, I'm told it stems from the terrible traffic on the day after Thanksgiving, which marks the beginning of the Christmas shopping season. Whether it's high traffic or high sales, we are no different from the rest of the industry in that we offer some fantastic deals to our consumer retail customers on Black Friday through our online store. It's a great way for us to clear excess inventory, move less popular items, clear stock of older models prior to a new model launch, and build brand loyalty with some simple, great deals.
The preparations for Black Friday began back in March as we looked ahead to how we would cope with the usual huge influx of orders both from an IT perspective and in terms of the logistics of shipping so many orders that quickly. We brought in temporary staff for the warehouse and shipping operations to help with the extra load, but within the head office and the IT organization it's always a challenge to keep anything more than a skeleton staff on call and available, just because so many people take the Friday off as a vacation day.
I checked in with Carol, our VP of Consumer Retail, about an hour before the Black Friday deals went live. She confirmed that everything was ready, and the online store update would happen as planned at 8AM. Traffic volumes to the website were already significantly increased (over three times our usual page request rate) as customers checked in to see if the deals were visible yet, but the systems appeared to be handling this without issue and there were no problems being reported. I thanked her and promised to call back just after 8AM for an initial update.
When I called back at about 8:05AM, Carol did not sound happy. "Within a minute of opening up the site, our third party SLA monitoring began alerting that the online store was generating errors some of the time, and for the connections that were successful, the Time To First Byte (how long it takes to get the first response content data back from the web server) is varying wildly." She continued, "It doesn't make sense; we built new servers since last year's sale, we have a load balancer in the path, and we're only seeing about 10% more traffic than last year, and we had no trouble then." I asked her who she had called, and I was relieved to hear that Amanda had been the first to answer and was pulling in our on call engineers from her team and others to cover load balancing, storage, network, database, ecommerce software, servers, virtualization and security. This would be an all-hands-on-deck situation until it was resolved, and time was not on the team's side. Heaven only knows how much money we were losing in sales every minute the site was not working for people.
The View From The Trenches: Amanda (Sr Network Manager)
So much for time off at Thanksgiving! Black Friday began with a panicked call from Carol about problems with the ecommerce website; she said that they had upgraded the servers since last year, so she was convinced that it had to be the network that was overloaded and that this was what was causing the problems. I did some quick checks in SolarWinds and confirmed that there weren't any link utilization issues, so it really had to be something else. I told Carol that I would pull together a team to troubleshoot, and I set about waking up engineers across a variety of technical disciplines so we could make sure that everybody was engaged.
I asked the team to gather a status on their respective platforms and report back to the group. The results were not promising:
- Storage: no alerts
- Network: no alerts
- Security: no alerts relating to capacity (e.g. session counts / throughput)
- Database: no alerts, CPU and memory a little higher than normal but not resource-limited.
- Load Balancing: No capacity issues showing.
- Virtualization: All looks nominal.
- eCommerce: "The software is working fine; it must be the network."
I had also asked for a detailed report on the errors showing up with our SLA measurement tool so we knew what our customers might be seeing. Surprisingly, rather than outright connection failures, the tool reported receiving a mixture of 504 (Gateway Timeout) errors and TCP resets after the request was sent. That information suggested that we should look more closely at the load balancers, as a 504 error occurs when the load balancer can't get a response from the back end servers in a reasonable time period. As for the hung sessions, that was less clear. Perhaps there was packet loss between the load balancer and those servers causing sessions to time out?
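As a rough illustration of the three symptoms we were chasing, a synthetic probe's results could be bucketed something like this. The function name and thresholds here are mine, not the actual SLA tool's:

```python
def classify_probe(status, ttfb_ms, reset):
    """Hypothetical bucketing of one synthetic-probe result, mirroring the
    mixture of symptoms the SLA monitor reported (names/thresholds assumed)."""
    if reset:
        return "tcp-reset"        # TCP RST received after the request was sent
    if status == 504:
        return "gateway-timeout"  # LB gave up waiting on a back-end server
    if ttfb_ms is not None and ttfb_ms > 1000:
        return "slow-ttfb"        # success, but the first byte took too long
    return "ok"

# The mixture we saw: some probes fine, some slow, some 504s, some resets.
samples = [(200, 85, False), (200, 2400, False), (504, None, False), (None, None, True)]
print([classify_probe(*s) for s in samples])
# → ['ok', 'slow-ttfb', 'gateway-timeout', 'tcp-reset']
```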
The load balancer engineers dug into the VIP statistics and were able to confirm that they did indeed see incrementing 504 errors being generated, but they didn't have a root cause yet. They also revealed that of the 10 servers behind the ecommerce VIP, one of them was taking fewer sessions over time than the others, although its peak concurrent session load was roughly the same as the other servers'. We ran more tests to the website for ourselves but were only able to see 504 errors, and never a hung/reset session. We decided therefore to focus on the 504 errors that we could replicate. The client to VIP communication was evidently working fine because after a short delay, the 504 error was sent to us without any problems, so I asked the engineers to focus on the communication between the load balancer and the servers.
Packet captures of the back end traffic confirmed the strange behavior. Many sessions were establishing without problem, while others worked but with a large time to first byte. Others still got as far as completing the TCP handshake, sending the HTTP request, then would get no response back from the server. We captured again, this time including the client-side communication, and we were able to confirm that these unresponsive sessions were the ones responsible for the 504 error generation. But why were the sessions going dead? Were the responses not getting back for some reason? Packet captures on the server showed that the behavior we had seen was accurate; the server was not responding. I called on the server hardware, virtualization and ecommerce engineers to do a deep dive on their systems to see if they could find a smoking gun.
Meanwhile the load balancer engineers took captures of TCP sessions to the one back end server which had the lower total session count. They were able to confirm that the TCP connection was established ok, the request was sent, then after about 15 seconds the web server would send back a TCP RST and kill the connection. This was different behavior from that of the other servers, so there were clearly two different problems going on. The ecommerce engineer looked at the logs on the server and commented that their software was reporting trouble connecting to the application tier, and the hypothesis was that when that connection failed, the server would generate a RST. But again, why? Packet captures of the communication to the app tier showed an SSL connection being initiated, then as the client sent its certificate to the server, the connection would die. One of my network engineers, Paul, was the one who figured out what might be going on.
"That sounds a bit like something I've seen when you have a problem with BGP route exchange... the TCP connection might come up, then as soon as the routes start being sent, it all breaks. When that happens, it usually means we have an MTU problem in the communication path which is causing the BGP update packets to be dropped."
Sure enough, once we started looking at MTU and comparing the ecommerce servers to one another, we discovered that the problem server had a much larger MTU than all the others. Presumably when it sent the client certificate, it maxed out the packet size which caused it to be dropped. We could figure out why later, but for now, tweaking the MTU to match the other servers resolved that issue and let us focus back on the 504 errors which the other engineers were looking at.
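The arithmetic behind that failure can be sketched in a few lines. The numbers below are illustrative (IPv4 and TCP headers without options, and assumed MTU values), not the actual values from our servers:

```python
# Illustrative header sizes: IPv4 and TCP without options.
IP_HEADER = 20
TCP_HEADER = 20

def max_segment_payload(mtu):
    """Largest TCP payload that fits in a single packet at a given MTU."""
    return mtu - IP_HEADER - TCP_HEADER

path_mtu = 1500        # what the rest of the path could actually carry
bad_server_mtu = 9000  # the one misconfigured (jumbo-frame) server, assumed

# A TLS record carrying a client certificate can easily run to a few KB.
cert_record = 3000

# The misconfigured server sizes its segments from its OWN interface MTU...
segment = min(cert_record, max_segment_payload(bad_server_mtu))

# ...so the resulting packet is larger than the path can carry. With the
# Don't Fragment bit set and no ICMP feedback, it is silently dropped and
# the SSL handshake dies mid-flight.
packet_size = segment + IP_HEADER + TCP_HEADER
print(packet_size > path_mtu)  # → True: this packet never makes it through
```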
Thankfully, the engineers were working well together, and they had jointly come up with a theory. They explained that the web servers ran Apache using something called prefork. The idea is that rather than waiting for a connection to come in before forking a process to handle its communication, Apache can create some processes ahead of time and use those for new connections because they're ready immediately. The configuration specifies how many processes should be pre-forked (hence the name), the maximum number of processes that can be forked, and how many spare processes to keep over and above the number of active, connected processes. They pointed out that completing a TCP handshake does not mean Apache is ready for the connection, because the handshake is handled by the TCP/IP stack before being handed off to a process. They added that they actually used TCP offload, so that whole exchange was taking place on the NIC, not even on the server CPU itself.
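That point about the handshake completing before the application is ready can be demonstrated with a few lines of Python: the kernel finishes the three-way handshake for sockets in the listen backlog even if the server process never calls accept(). This is a generic localhost sketch, not our production setup:

```python
import socket

# Server: bind and listen, but deliberately never call accept(), so no
# application-level handling of the connection ever happens.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

# Client: connect() still succeeds, because the SYN / SYN-ACK / ACK
# exchange is completed by the kernel's TCP/IP stack on the server side.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.settimeout(2)
client.connect(("127.0.0.1", port))
connected = True
print("handshake completed without accept()")

client.close()
server.close()
```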
So what if the session load meant that the Apache forking process could not keep up with the number of inbound sessions? TCP/IP would connect regardless, but only those sessions able to find a forked process could continue to be processed. The rest would wait in a queue for a free process, and if one could not be found in time, the load balancer would decide that the connection was dead and would issue a 504. When they checked the Apache configuration, however, not only was the number of preforked processes low, but the maximum was nowhere near where we would have expected it to be, and the number of 'spare' processes was set to only 5. The end result was that when there was a burst of traffic, we quickly hit the maximum number of processes on the server, so new connections were queued. Some connections got lucky and were attached to a process before timing out; others were not so lucky. The heavier the load, the worse the problem got. When there was a lull in traffic the server caught up again, but with only 5 spare processes ready to go, the next hard burst of connections was delayed while new processes were forked. I had to shake my head at how they must have figured this out.
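For illustration, healthier settings might look something like the fragment below. These are Apache 2.4 prefork MPM directive names with assumed values; the right numbers depend entirely on the server's memory and workload, and the values on our servers were much lower, which is what caused the queuing:

```apache
# Hypothetical prefork tuning sketch -- values are illustrative only.
<IfModule mpm_prefork_module>
    StartServers            50    # processes forked at startup
    MinSpareServers         25    # ours was effectively 5: too few for a burst
    MaxSpareServers        100
    MaxRequestWorkers      500    # hard cap on concurrent connections;
                                  # ours was far too low for Black Friday load
    MaxConnectionsPerChild 10000  # recycle processes to contain leaks
</IfModule>
```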
Their plan of attack was to increase the maximum process count and the spare process count on one server at a time. We'd lose a few active sessions on each, but avoiding those 504 errors would be worth it. They started on the changes, and within 10 minutes we had confirmed that the errors had disappeared.
I reported back to Carol and to James that the issues had been resolved, and when I got off the phone with them, I asked the team to look at two final issues:
- Why did we not see any session RST problems when we tested the ecommerce site ourselves; and
- Why did PMTUD (Path MTU Discovery) not automatically fix the MTU problem with the app tier connection?
It took another thirty minutes, but finally we had answers. The security engineer had been fairly quiet on the call so far, but he was able to answer the second question. There was a firewall between the web tier and the app tier, and the firewall had an MTU matching the other servers. However, it was also configured neither to pass through nor to generate the ICMP messages indicating an MTU problem. We had shot ourselves in the foot by blocking the very mechanism which would have detected the MTU issue and worked around it! For the RST issue, one of my engineers came up with the answer again. He pointed out that while we were using the VPN to connect to the office, our browsers had to use the web proxy to access the Internet, and thus our ecommerce site (another Security rule!). The proxy made all our sessions appear to come from a single source IP address, and through bad luck if nothing else, the load balancer had chosen one of the 9 working servers for that address, then kept using that same server because it was configured with session persistence (sometimes known as 'sticky' sessions).
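The sticky-session effect is easy to sketch. Real load balancers typically use persistence tables or cookies rather than a bare hash, and the server names and proxy address below are made up, but the outcome is the same: one source IP, one back-end server, every time:

```python
import hashlib

# Hypothetical pool of the 10 back-end servers behind the ecommerce VIP.
servers = ["web%02d" % i for i in range(1, 11)]

def pick_backend(client_ip):
    """Source-IP persistence sketch: hash the client address so the same
    client always lands on the same back-end server."""
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

# All of our test traffic left the corporate web proxy from ONE address,
# so every probe we ran hit the same (healthy) server -- which is why we
# never reproduced the RSTs from the one broken pool member.
proxy_ip = "203.0.113.10"  # documentation address, assumed
picks = {pick_backend(proxy_ip) for _ in range(1000)}
print(len(picks))  # → 1: one proxy IP means one back-end server, every time
```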
I'm proud to say we managed to get all this done within an hour. Given some of the logical leaps necessary to figure this out, I think the whole team deserves a round of applause. For now though, it's back to turkey leftovers, and a hope that I can enjoy the rest of the day in peace.