War story

 

In a network far, far away, on a sunny Friday afternoon, something was badly broken and nobody was there to repair it. I was called in as an external consultant to help the local IT team fix the problem. I had never seen that network before, and the error description said only "outage in the network". When I arrived at the car park, a lot of sad-looking employees were leaving the office building.

The first thing I always do in these situations is ask for the network documentation. There was an uncomfortably long silence after I asked. Finally somebody said: "Yeah, our documentation is completely outdated, and the guy who knew all the details about the network has just left the company..." The monitoring looked like an F1 race win in Monza: everything was blinking red. Monitoring would really have helped, but unfortunately it was down as well.

When you don't know what to look for, it is like looking for a needle in a haystack. In an outage like this, proper documentation and working monitoring would have reduced the time needed to find the actual problem. Instead of debugging the actual problem, you spend an enormous amount of time exploring the network, desperately trying to find out in which general direction to troubleshoot further. You will probably also get sidetracked by minor misconfigurations, bad network design and other odd details: things that have been there for many years but are not causing the problem you are trying to fix right now.

It is hard to figure out what the actual problem is in these situations. You also get constant pressure from management during the outage: while you are still exploring the network, you already have to report what could have caused it. To summarize, without valid documentation and with non-working monitoring, you face some time-consuming challenges before you can bring the network back to life. It is not an ideal position, and you should do everything you can to avoid it.

 

 

Lessons to be learned

 

Up-to-date documentation helps a lot. I know that keeping documentation up to date is a lot of work. Many network engineers are constantly firefighting and are under pressure to roll out new boxes for a "high priority" project. It helps to build monitoring and documentation into the rollout workflow, so that no device can be added without both.

In my little example from the beginning, somebody was trying to troubleshoot the initial problem and, using the trial-and-error method, disconnected the production virtualization host on which the monitoring was running. To avoid situations like this, it makes sense to run the monitoring system on separate infrastructure that keeps working independently even when there is an outage in the production environment.

For documentation, less is sometimes better. You need a solid foundation, for example a good diagram that shows the basic network topology. Because documentation is outdated the moment somebody finishes it, it is better to look at live data whenever possible. I am never sure whether possibly outdated documentation shows the correct switch types or interfaces. In the monitoring, you can be sure that this information has been polled live and is updated automatically if somebody makes even a minor change such as a software update.

Some problems can be fixed fast, and some are more of a long-term effort. Take the mysterious "performance problem", for example. These tickets circulate around all IT departments, and nobody can find anything. Here it helps to lay out the complete picture: find all the components that are involved and how they depend on each other. This can be a very time-consuming job, but sometimes it is the only way to figure out what is really causing the "performance problems". With that knowledge integrated into the monitoring, you get live data for the involved systems. I have had great success with this method in fixing long-term problems, and afterwards you have the capability to monitor the issue in case it shows up again.
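
Laying out the components and their dependencies can start very small. The following is a minimal Python sketch, not a real inventory tool: the component names, the `DEPENDS_ON` map and the set of monitored systems are all hypothetical, and in practice this data would come from your own documentation or discovery. It walks the dependency map to collect everything a service relies on, directly or indirectly, and lists what the monitoring does not cover yet.

```python
from collections import deque

# Hypothetical example data: each component maps to the components it
# depends on directly. Replace this with data from your own environment.
DEPENDS_ON = {
    "webshop": ["load-balancer", "app-server"],
    "app-server": ["database", "core-switch"],
    "load-balancer": ["core-switch"],
    "database": ["storage", "core-switch"],
    "storage": [],
    "core-switch": [],
}


def all_dependencies(component, depends_on):
    """Return every component that `component` depends on, directly or indirectly."""
    seen = set()
    queue = deque(depends_on.get(component, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; this also protects against cycles
        seen.add(current)
        queue.extend(depends_on.get(current, []))
    return seen


def missing_from_monitoring(component, depends_on, monitored):
    """List dependencies of `component` that the monitoring does not cover yet."""
    return sorted(all_dependencies(component, depends_on) - set(monitored))


if __name__ == "__main__":
    # Hypothetical set of systems that are already monitored.
    monitored = {"load-balancer", "app-server", "core-switch"}
    for name in missing_from_monitoring("webshop", DEPENDS_ON, monitored):
        print(f"not monitored: {name}")
```

Even a flat map like this makes the gaps visible: every component the affected service relies on should show up in the monitoring before you start chasing the "performance problem".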