The story so far:
- It's Not Always The Network! Or is it? Part 1 -- by John Herbert (jgherbert)
- It's Not Always The Network! Or is it? Part 2 -- by John Herbert (jgherbert)
- It's Not Always The Network! Or is it? Part 3 -- by Tom Hollingsworth (networkingnerd)
The holidays are approaching, but that doesn't mean a break for the network team. Here's the fourth installment of the story, by Tom Hollingsworth (networkingnerd).
The View From Above: James (CEO)
I'm really starting to see a turn around in IT. Ever since I put Amanda in charge of the network, I'm seeing faster responses to issues and happier people internally. Things aren't being put on the back burner until we yell loud enough to get them resolved. I just wish we could get the rest of the organization to understand that.
Just today, I got a call from someone claiming that the network was running slow again when they tried to access one of their applications. I'm starting to think that "the network is slow" is just code to get my attention after the unfortunate situation with Paul. I decided to try and do a little investigation of my own. I asked this app owner if this had always been a problem. It turns out that it started a week ago. I really don't want to push this off on Amanda, but a couple of my senior IT managers are on vacation and I don't have anyone else I can trust. But I know she's going to get to the bottom of it.
The View From The Trenches: Amanda (Sr Network Manager)
Well, that should have been expected. At least James was calm and polite. He even told me that he'd asked some questions about the problem and got some information for me. I might just make a good tech out of the CEO after all!
James told me that he needed my help because some of the other guys had vacation time they had to use. I know that we're on a strict change freeze right now, so I'm not sure who's getting adventurous. I hope I don't have to yell at someone else's junior admin. I decided I needed to do some work to get to the bottom of this. The app in question should be pretty responsive. I figured I'd start with the most basic of troubleshooting - a simple ping. Here's what I found out:
icmp_seq=0 time=359.377 ms
icmp_seq=1 time=255.485 ms
icmp_seq=2 time=256.968 ms
icmp_seq=3 time=253.409 ms
icmp_seq=4 time=254.238 ms
Those are terrible response times! It's like the server is on the other side of the world. I pinged other routers and devices inside the network to make sure the response times were within reason. A quick check of other servers confirmed that response times were in the single digits, not even close to the bad app. With response times that high, I was almost certain that something was wrong. Time to make a phone call.
Brett answered when I called to the server team. I remember we brought him on board about three months ago. He's a bit green, but I was told he's a quick learner. I hope someone taught him how to troubleshoot slow servers. Our conversation started off as well as expected. I told him what I found and that the ping time was abnormal. He said he'd check on it and call me back. I decided to go to lunch and then check in on him when I got finished. That should give him enough time to get a diagnosis. After all, it's not like the whole network was down this time, right?
I got back from lunch and checked in on Brett The New Guy. When I walked in, he was massaging his temples behind a row of monitors. When I asked what was up, he sighed heavily and replied, "I don't know for sure. I've been trying to get into the server ever since you called. I can communicate with vCenter, but trying to console into the server takes forever. It just keeps timing out."
I told Brett that the high ping time probably means that the session setup is taking forever. Any lost packets just make the problem worse. I started talking through things at Brett's desk. Could it be something simple? What about the other virtual machines on that host? Are they all having the same problem?
Brett shrugged his shoulders. His response, "I'm not sure? How do I find out where they are?"
I stepped around to his side of the desk and found a veritable mess. Due to the way the VM clusters were setup, there was no way of immediately telling which physical host contained which machines. They were just haphazardly thrown into resource pools named after comic book characters. It looked like this app server belonged to "XMansion" but there were a lot of other servers under "AsteroidM". I rolled my eyes at the fact that my network team had strict guidelines about naming things so we could find it at a glance, yet the server team could get away with this. I reminded myself that Brett wasn't to blame and kept digging.
It took us nearly an hour before we even found the server. In El Paso, TX. I didn't even know we had an office in El Paso. Brett was able to get his management client to connect to the server in El Paso and saw that it contained exactly one VM - The Problem App Server. We looked at what was going on and figured that it would work better if we moved it back to the home office where it belonged. I called James to let him know we fixed the problem and that he should check with the department head. James told me to close the ticket in the system since the problem was fixed.
I hung up Brett's phone. Brett spun his chair back to his wall of monitors and put a pair of headphones on his head. I could hear some electronic music blaring away at high volume. I tapped Brett on the shoulder and told him, "We're not done yet. We need to find out why that server was halfway across the country."
Brett stopped his music and we dug into the problem. I told Brett to take lots of notes along the way. As we unwound the issues, I could see the haphazard documentation and architecture of the server farm was going to be a bigger problem to solve down the road. This was just the one thing that pointed it all out to us.
So, how does a wayward VM wind up in the middle of Texas? It turns out that the app was one of the first ones ever virtualized. It had been running on an old server that was part of a resource pool called "SavageLand". That pool only had two members: the home server for the app and the other member of the high availability pair. That HA partner used to be here in the HQ, but when the satellite office in El Paso was opened, someone decided to send the HA server down there to get things up and running. Servers had been upgraded and moved around since then, but no one documented what had happened. The VMs just kept running. When something would happen to a physical server, HA allowed the machines to move and keep working.
The logs showed that last week, the home server for the app had a power failure. It rebooted about ten minutes later. HA decided to send the app server to the other HA partner in El Paso. The high latency was being caused by a traffic trombone. The network traffic was going to El Paso, but the resources the server needed to access were back here at the HQ. So the server had to send traffic over the link between the two offices, listen for the response, and then send it back over the link. Traffic kept bouncing back and forth between the two offices, which saturated the link. I was shocked that the link was even fast enough to support the failover link. According to Brett's training manuals, it barely met the minimum. We were both amused that the act of failing the server over to the backup cause more problems than just waiting for the old server to come back up.
Brett didn't know enough about the environment to know all of this. And he didn't know how to find the answers. I made a mental note to talk to James about this at the next department meeting after everyone was back from vacation. I hoped they had some kind of documentation for that whole mess. Because if they didn't, I was pretty sure I knew where I could find something to help them out.