The View From Above: James (CEO)

 

Another week, another network problem. On Tuesday morning I received an angry call from our CFO, Phyllis, who was visiting our Austin, TX site. "The whole network is a mess," she told me. "Nothing is working properly and I can't do my job." I asked for more detail, but she just said the network was a nightmare and she couldn't even send emails. Great start to the day. Austin is our main manufacturing plant, and if the network was as bad as Phyllis said, we were in for a bad week: our supply chain would get out of sync, which could hurt both our cash flow and our production output.

 

I called our new Senior Network Manager, Amanda, to let her know that the Austin office was down. She sounded surprised; apparently she had just been talking to the Inventory Management team, who had told her that they were quite pleased with the performance of the company's inventory tool, especially given that it runs out of our data center in Raleigh, NC. I put her in touch with Phyllis and told her to figure out what was going on, because clearly things in Austin weren't going as well as she thought.

 

The View From The Trenches: Amanda (Sr Network Manager)

 

Two weeks have passed since I installed SolarWinds' Network Performance Monitor (NPM), and so far things have been good. I should have guessed that the quiet wouldn't last long, however. I got a call from James around 10 AM on Tuesday, and he was mad. Apparently Phyllis was on site in Austin, TX and had told him that the network was broken. I was fairly sure it wasn't; I had just been talking to the Inventory Management team about a project to implement handheld (WiFi) scanners. They've been testing their old wired scanners in parallel with the WiFi scanners, and both have been working just great, which suggested that both the wireless and wired networks were functioning OK. Still, if Phyllis is upset, it's more than my job's worth to ignore her.

 

Phyllis is without question good at her job, but I get the impression that she would be happier using a large paper ledger and a pot of ink (and maybe even a feather quill pen). Computers are, in her eyes, an irritation, and trying to troubleshoot her problems over the phone is challenging to say the least. However, after a while I did manage to figure out what the problem was. It turns out that "everything is down" actually meant "my email is working intermittently." About 9 months ago we moved our email to Microsoft's Office365, so the mail servers are now accessed via the Internet. I confirmed with Phyllis that she was able to access our intranet without issue, which told me that our site network was not the problem (I knew it!), but when she tried accessing the Internet -- including Office365 -- she was having problems. It wasn't a total loss of connectivity, but things were slow, and she would sometimes lose her connection to the server altogether. Sounds like an Internet issue, but what -- and where?

 

Time to fire up a browser and open NPM. I checked the basics, but all the network hardware seemed fine, including our Internet routers and edge firewalls, so maybe it was something on the Internet itself. Unfortunately I know how these things work; if I can't prove where the problem is, the assumption is still that the network is at fault. As I stared at the screen, the phone rang; Phyllis was on the line. "I don't know why it took so long," she said, "but it looks like whatever you did worked. Finally I can get on with my day's work." And she hung up. Had she stayed on the line I'm not sure I would have admitted that I'd done nothing, but at least the immediate pressure seemed to be off. But what caused the problem? And worse, now that the problem had cleared itself up, there weren't really any tests I could run to troubleshoot it. At this point, I remembered NetPath.

 

When I installed NPM, I also deployed a bunch of probes and set up monitoring of a number of services to see what it would look like. My idea was that I'd be able to monitor network performance from a few sites, but I got so consumed with setting up device monitoring that I pushed that aside for a bit. In the background, however, the probes had been faithfully gathering data for me about their connectivity to a number of key services including -- by incredible good fortune -- the email service. I started off by checking what the NetPath traffic graph looked like right now, when data was successfully flowing to Office365. NetPath had identified that traffic passed through one of three potential service providers between our Austin site's Internet provider and the Office365 servers, with the vast majority (around 80%) likely to be sent through TransitCo, a large provider in Texas and the South Central states. At the bottom of the screen was the Path History bar, and it was clear that while everything was now green, there was a large chunk of red on the timeline for both availability and latency. Time to wind the clock back.

 

I clicked on one of the red blocks, the NetPath display updated, and ... whoa ... OK, that explains it. TransitCo's router was lit up in red (along with some attached links), and NetPath was reporting 90% packet loss and extremely high latency through that path. No wonder Phyllis was having problems staying connected! Data in hand, I called up TransitCo to ask them about their service interruption, and they confirmed that an interface had gone bad but the routing engine had for some reason kept on pumping traffic down that link. They had completed a reboot and an interface replacement around 30 minutes earlier, and service was restored. Amazing. Our own Internet provider wouldn't have reported this as it wasn't directly their problem, and there's no way we could sign up for alerts from every other provider just to keep abreast of their outages. If we hadn't had this tool, I'd still be scratching my head wondering what on earth had happened this morning. Still, while I figure out a way to get a better handle on upstream provider problems, at least I can now go back and report on the cause and scope of the outage. And maybe I can sell my VP on funding a secondary Internet link out of Austin from another provider, just in case something like this happens again.

 

I've not even had it installed for a month, but SolarWinds NPM saved the day (or my reputation, at least). I think I'll be checking out what other products they have.

 

 

>>> Continue reading this story in Part 3