Houston, We Have A Problem Child

Level 11

Think about your network architecture. Maybe it's something older that needs more attention. Or perhaps you're lucky enough to have something shiny and new. In either case, the odds are very good that you have a few devices in your environment that you just can't live without. Maybe it's some kind of load balancer or application delivery controller. Maybe it's an IP Address Management (IPAM) device that was built ages ago but hasn't been updated in forever.

The truth of modern networks is that many of them rely on devices like this as a linchpin to keep important services running. If those devices go down, so too do the services that you provide to your users. Gone are the days when a system could just be powered down until the screaming started. Users have come to rely on the complicated mix of products in their environment to refine their workflows and get the most work done in the shortest amount of time. So how can these problem devices be dealt with?

Know Your Enemy

First and foremost, you must know about these critical systems. You need to have some kind of monitoring system in place that can see when these devices are performing at peak efficiency or when they aren't doing so well. You need to have a solution that can look outside simple SNMP strings and give you a bigger picture. What if the hard drive in your IPAM system is about to die? What about when the network interface on your VPN concentrator suddenly stops accepting 50% of the traffic headed toward it? These are things you need to know about ASAP so they can be fixed with a minimum of effort.
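To make the "bigger picture" idea concrete, here is a minimal sketch of checking individual components rather than just whether a box answers a poll. Everything in it is an assumption for illustration: the metric names, the threshold values, and the device names are hypothetical, and how you actually collect the readings depends on your monitoring tool.

```python
from dataclasses import dataclass

# Hypothetical per-metric limits; real values depend on your hardware
# and vendor guidance.
THRESHOLDS = {
    "disk_reallocated_sectors": 10.0,  # count of remapped sectors
    "interface_drop_percent": 5.0,     # percent of inbound packets dropped
}


@dataclass
class Reading:
    device: str
    component: str
    metric: str
    value: float


def find_degraded(readings):
    """Return readings whose value exceeds the limit for their metric."""
    degraded = []
    for r in readings:
        limit = THRESHOLDS.get(r.metric)
        # Metrics without a configured limit are ignored, not flagged.
        if limit is not None and r.value > limit:
            degraded.append(r)
    return degraded


# Sample readings (made up): a failing IPAM disk and a VPN interface
# dropping half its traffic, plus one healthy interface.
readings = [
    Reading("ipam01", "disk0", "disk_reallocated_sectors", 42),
    Reading("vpn01", "eth1", "interface_drop_percent", 50.0),
    Reading("vpn01", "eth0", "interface_drop_percent", 0.2),
]

for r in find_degraded(readings):
    print(f"{r.device}/{r.component}: {r.metric}={r.value}")
```

Both devices in this example would still show as "up" to a simple reachability check, which is exactly the gap the paragraph above is warning about.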

Your monitoring solution should help you keep track of these devices while giving you plenty of options for alerts. A VPN concentrator isn't going to be a problem if it's offline during the workday. But if it goes down the night before the quarterly reports are due, the CFO is going to be calling and demanding answers. Make sure you can configure your device profiles with alert patterns that give you a chance to fix things before they become problems. Also make sure that the alerts help you keep track of the individual pieces of the solution, not just the up or down status of the whole unit.
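One way to picture an alert pattern like this is a per-device profile that escalates only during the windows when the device really matters. The sketch below is an assumption-laden illustration, not any product's feature: the `AlertProfile` class, the "page" and "ticket" labels, and the dates are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class AlertProfile:
    device: str
    # Hypothetical list of (start, end) windows when an outage is urgent.
    critical_windows: list = field(default_factory=list)

    def severity(self, when: datetime) -> str:
        """Return 'page' inside a critical window, otherwise 'ticket'."""
        for start, end in self.critical_windows:
            if start <= when <= end:
                return "page"
        return "ticket"


# Example: the VPN concentrator matters most the night before
# quarter-end reports are due (dates are made up).
vpn = AlertProfile(
    device="vpn01",
    critical_windows=[(datetime(2024, 3, 30, 18, 0), datetime(2024, 4, 1, 9, 0))],
)

print(vpn.severity(datetime(2024, 3, 12, 14, 0)))  # ordinary workday afternoon
print(vpn.severity(datetime(2024, 3, 31, 2, 0)))   # night before reports are due
```

The same profile could carry one set of windows per component instead of per device, which keeps the alerting tied to the individual pieces rather than the whole unit.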

Be Ready To Replace

The irony of being stuck with "problem child" devices is that they are the ones you most want to replace but can't seem to find a way to remove. So how can you advocate for the removal of something so critical?

The problem with these devices is not that the hardware itself is indispensable. It's that the service the hardware (or software) provides is critical. Services can be provided in many different ways. So long as you know what service is being provided, you can create an upgrade path to remove hardware before it gets to the "problem child" level of annoyance.

Most indispensable services and devices get that way because no one is keeping track of who is using them or how they are being used. Workflows created to accomplish a temporary goal often end up becoming a permanent fixture. It's important to keep a record of all the devices in your network and know how often they are being used. Regularly update that list to know what has been recently accessed and for how long. If the device is something that is scheduled to be replaced soon, a preemptive email about the service change will often find a few laggard users that didn't realize they were even using the device. That will help head off any calls after it has been decommissioned and retired to a junk pile.
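The record-keeping described above can be as simple as a list of who touched which device and when. Here is a rough sketch under stated assumptions: the log format, device names, and users are invented for illustration, and in practice the data would come from whatever access or flow logs you already have.

```python
from datetime import date

# Hypothetical access records: (device, user, day of last recorded use).
access_log = [
    ("ipam-old", "alice", date(2024, 5, 2)),
    ("ipam-old", "bob", date(2023, 11, 15)),
    ("lb-legacy", "carol", date(2022, 1, 9)),
]


def users_to_notify(log, device, since):
    """Users who touched `device` on or after `since`: the people to
    email before the service changes."""
    return sorted({user for dev, user, day in log if dev == device and day >= since})


def idle_devices(log, since):
    """Devices whose most recent recorded use is before `since`."""
    last_seen = {}
    for dev, _user, day in log:
        if dev not in last_seen or day > last_seen[dev]:
            last_seen[dev] = day
    return sorted(dev for dev, day in last_seen.items() if day < since)


cutoff = date(2023, 6, 1)
print(users_to_notify(access_log, "ipam-old", cutoff))
print(idle_devices(access_log, cutoff))
```

The first list is who gets the preemptive email; the second is a starting point for deciding what can be retired with the least fuss.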

Every network has problem devices that are critical. The trick to keeping them from becoming real problems lies less in trying to do without them and more in knowing how they are performing and who is using them. With the right solutions in place to keep a wary eye on them and a plan in place to replicate and eventually replace the services they provide, you can sleep a bit better at night knowing that your problem children will be a little less problematic.

17 Comments
MVP

Documentation and discovery 101.

This all needs to be compared to the CIs in your RTSM solution so you have a complete picture of your environment.

You've presented (again) the concept of Disaster Recovery, and it's not too soon to have people rethink what they have, and how old it is, and how important it is.

Imagine living without that device or application--or site!--for thirty minutes--or thirty days!

When something ubiquitous vanishes, you truly learn how important it is.

For example, years ago we had Microsoft DHCP servers that occasionally would all fail in a cascade. We learned "We live and die by DHCP!" Consequently it became apparent to Management that replacing the unreliable solution with something more reliable was imperative.

Next we said "We live and die by DNS" and you know what failed then.  We may not always learn from our past mistakes, and when we don't, Murphy is always ready and willing to re-educate us.

Level 17

Very Nice!

MVP

rschroeder wrote:

  We may not always learn from our past mistakes, and when we don't, Murphy is always ready and willing to re-educate us.

So true! You can warn management of issues but sometimes until it fails, they won't act upon it.

Level 14

Nice article... An old manager of mine used to say "Delay on important things leads to disaster!"

rschroeder wrote:

When something ubiquitous vanishes, you truly learn how important it is.

And sometimes... that's all it takes to get everyone onboard with upgrades....

The important thing, in my opinion, is that you took the opportunity to warn Management pre-incident, and that there is record of it.

I don't like needing a "Get out of jail free!" card, but having one beats the alternative.  It's one reason why I rely on e-mail instead of voice mail or telephone conversations.  Not only do I have a reference to my advice to Management, which I call up if something unfortunate occurs, but I also have a searchable set of information to enable me to refresh past conversations.  The more items on my plate, the more project meetings I have to attend, and the more stress that comes my way, the more I find myself wondering "What did we agree on for this topic . . .?"

Of course, the extra gray hair I sport may have its own contributing factors to why I appreciate the ability to refresh my mind about a particular conversation or meeting.

Level 13

This article is "so true".  Good job.

Level 14

I once worked on a legacy application that could not be replaced.  It was home grown, using three different flavors of Oracle, used to document time spent on a multi-level maintenance process.  That was twelve years ago.  I bet it is still in use.

Our Achilles' heel right now is that we kept on building upon, and adding to, a single Cisco 6509 switch as our main datacenter switch. So now that we are ready to re-architect our LAN, we are faced with all sorts of issues, including the physical cabling mistake of racking the switch in the same rack as the patch panel. Hundreds of cables cascade down in front of this switch now. Ugh!

MVP

Trying to get this very subject comprehended by ... application masters that have evolved into IT folks; my primary enemy right now ... 1 employee that does not understand the ramifications of his actions! I have been able to recover up to this point. My job is an attempt to stay a couple of steps ahead of my co-workers! We are a very small operation - no internet access for my 200 users ... but I have 14 people that scare the living daylights out of me sometimes! I have been fortunate enough to work for an employer that is supporting the changes I am making, and is supporting me financially to get to the NG Security, and ultimately the NG 911 infrastructure that will be focused on serving the community. Legacy equipment is awesome!!! My latest nightmare has been upgrading the 911 analog system to the IP-based system; users really don't understand how much more vulnerable this has made the operation. Technology is cool ... the old stuff was rock solid ... we now must present layers of security and be quite diligent in our efforts to educate users ... and then #1 - backup / restore ...

NCM set to report configuration changes in real time will help you keep track of those loose cannons.

Level 9

Good Article!! Sometimes you can warn the chain of command but they seldom listen until stuff hits the fan!

Level 21

These types of systems should be treated as risks to the company and evaluated as such.  If the risk is high enough a remediation plan should be put into place to reduce or remove the risk.

Level 20

I've still got a few problem children... some just won't go away!

MVP

rschroeder​ That Murphy... pfft... what a jerk! He's always messing up my plans...

MVP

wluther, you got one of those Murphys too? Small world!

Level 11

Wonderful article. The price of peace and calm for a sysadmin is proactivity and vigilance. That and an ounce of monitoring is always worth a pound of troubleshooting.

About the Author
A nerd that happens to live and breathe networking of all kinds. Also known to dip into voice, security, wireless, and servers from time to time. Warning - snark abounds.