The View From Above

Being a CEO is not the easy job some people think it is. I'm pulled in multiple directions and have to do my best to balance the needs of the business with the needs of the shareholders, deal with crises as they arise, and reassure both our investors and our employees that the company is strong and has a positive future. All of this gets a bit tricky when our websites -- the place where we make 60% of our revenue, by the way -- keep on going down and we lose sales.

For the technical teams the problem ends when the website starts working again, but for me the ripples from each outage keep spreading for months, by way of missed revenue targets, the impact on our supply chain as our order volume fluctuates, and the requests for interviews from analysts who are concerned that we won't make the numbers we anticipated if our customers can't buy things. A single big outage can make my life a misery for weeks on end, so it's not surprising, perhaps, that I am less than impressed with our network, or the “NOTwork” as I have come to know it over the last six months.

You know, my home network stays up for months on end without interruption, so with all the money we spend on equipment and employees I'd hope we could do the same, but apparently I'm wrong. If we don't fix this soon, I'm just going to instruct the CTO to move everything to the cloud and we'll dump those useless network idiots. I fired the Senior Network Manager last month, and not a moment too soon if you ask me. I only hope his replacement is better than he was; I'd like to get a good night's sleep for once, without worrying about whether the stock price will be plunging tomorrow.

The View From The Trenches

My first few weeks as the Senior Network Manager have been, well, challenging to say the least. My predecessor, Paul, was fired after a shouting match with the CEO over yet another major outage being blamed on the network. After our websites had been down for two hours, James (our CEO) stormed down, practically dragged Paul into a conference room, and slammed the door behind them. I wasn't actually in the room when the showdown occurred, but that didn't make much difference; even through the closed door there was no mistaking who the CEO felt was responsible, despite Paul's protestations to the contrary.

Three weeks later we're still not entirely sure how that outage started, and worse, we have no idea how it finally ended. Of course, it may be that somebody actually does know but doesn't want to admit to being the culprit, especially after hearing James lose his mind in that room. The next day, Paul didn't come to work and I was called into my VP's office, where I was given the news that I was being promoted to his position, effective immediately. Did you ever get a gift you weren't sure you really wanted? Yeah; that.

So, now that you're caught up on how I got into this mess, I need to get back to figuring out what seems to be wrong with the network. I always thought Paul had his finger on the pulse of the network, but once I started spending more time looking at the network management systems, I began to wonder how he figured anything out. It seems that our “availability monitoring” was being accomplished by a folder full of home-grown perl and shell scripts which pinged the network equipment in the data center and would send an email to Paul when a device became unavailable. I mean, that sort of worked, but the scripts weren't logging anything, so there was no historical data we could use to calculate uptime. Plus, a ping could take up to a second to respond before it timed out and was counted as a failure, so even if the network or device performance was completely terrible, nobody would have known about it. What I realized is that when Paul was proudly telling the Board that the network had “four-nines uptime”, he must have been pulling that figure out of the air. I can't believe he got away with it for so long. He might have been right, but neither he nor I could prove it, and I refuse to lie about it now that my neck is on the line.
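And that four-nines claim isn't a small one: 99.99% availability allows only about 52 minutes of total downtime in an entire year, which is exactly the kind of number you need logged data to back up. Just to show how little it would have taken to make the old approach defensible, here is a minimal sketch (not Paul's actual scripts, and the device IPs, filename, and one-minute polling interval are all made-up placeholders) of a ping monitor that records every probe with a timestamp, assuming a Linux-style ping where -W is the timeout in seconds:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: a ping monitor that actually logs its results.

The point is the logging, not the pinging. With every probe written to a
CSV alongside a timestamp, uptime can later be calculated from real data
instead of being quoted from memory.
"""
import csv
import subprocess
import time
from datetime import datetime, timezone

DEVICES = ["10.0.0.1", "10.0.0.2"]  # placeholder device IPs
TIMEOUT_SECONDS = 1                 # same one-second cutoff the old scripts used
LOGFILE = "availability_log.csv"    # placeholder log location


def probe(host: str) -> bool:
    """Return True if the host answers a single ping within the timeout."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(TIMEOUT_SECONDS), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def main() -> None:
    while True:
        now = datetime.now(timezone.utc).isoformat()
        with open(LOGFILE, "a", newline="") as f:
            writer = csv.writer(f)
            for host in DEVICES:
                writer.writerow([now, host, "up" if probe(host) else "down"])
        time.sleep(60)  # one probe cycle per minute


if __name__ == "__main__":
    main()
```

From a log like that, availability over any period is just the number of "up" samples divided by the total samples for each device, which is the proof I never had when I went looking for it.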

First order of business, then, was to get some proper network management in place. I didn't inherit a huge budget and I was in a hurry, so I used my corporate Amex to grab a copy of Solarwinds NPM. At least now I'm gathering some real data to work with, and if (when!) the next outage occurs, maybe I'll spot something that gives me a clue about what's going on. The executive team has finally put a woman in charge of the network, and I'm going to show them just what I'm capable of.


>>> Continue reading this story in Part 2