
Geek Speak

4 Posts authored by: lindsayhill

Better network configuration management promises a lot: networks that are more reliable, and that can respond as quickly as the business needs. But it’s a big jump from the way we've run traditional networks. I'm wondering what’s holding us back from making that jump, and what we can do to make it less scary.


We've all heard stories about the amazing network configuration management at the Big Players (Google, Facebook, Twitter, Amazon, etc). Zero Touch Provisioning, Google making 30,000 changes per month, auto-magic fine-grained path management, etc. The network is a part of a broader system, and managed as such. The individual pieces aren't all that important - it's the overall system that matters.


Meanwhile, over here in the real world, most of us are just scraping by. I've seen many networks that didn't even have basic automated network device backups. Even doing something like automated VLAN deployment is crazy talk. Instead we're stuck in a box-by-box mentality, configuring each device independently. We need to think of the network as a system, but we're just not in a place to do that.
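Even that first step doesn't have to be a big project. Here's a minimal sketch of the kind of automated device backup I mean, assuming the netmiko Python library and a hand-maintained device list. The hostnames and credentials below are placeholders, not a recommendation for how to store them:

# Minimal config-backup sketch using the netmiko library (assumed available:
# pip install netmiko). Device details are placeholders.
from datetime import date

from netmiko import ConnectHandler

DEVICES = [
    {"device_type": "cisco_ios", "host": "sw1.example.net",
     "username": "backup", "password": "changeme"},
    {"device_type": "cisco_ios", "host": "sw2.example.net",
     "username": "backup", "password": "changeme"},
]

def backup_all(devices):
    """Pull the running config from each device and write it to a dated file."""
    for dev in devices:
        conn = ConnectHandler(**dev)
        config = conn.send_command("show running-config")
        conn.disconnect()
        filename = f"{dev['host']}_{date.today().isoformat()}.cfg"
        with open(filename, "w") as f:
            f.write(config)
        print(f"Saved {filename}")

if __name__ == "__main__":
    backup_all(DEVICES)

Run something like that from cron every night and you've at least climbed out of the "no backups at all" hole. Yet plenty of networks never even get that far.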


Why is that? What's stopping us from moving ahead? I think it’s a combination of being nervous of change, and of not yet having a clear path forward.


Are we worried about greater automation because we're worried about a script replacing our job? Or do we have genuine concerns about automation running amok? I hear people say things like "Oh our Change Management team would never let us do automated changes. They insist we make manual changes." But is that still true? For server management, we've had tools like Group Policy, DRS, Puppet/Chef/Ansible/etc for years now. No reasonably sized organisation would dream of managing each of their servers by hand. Change Management got used to that, so why couldn't we do the same for networking? Maybe we're just blaming Change Management as an excuse?


Maybe the problem is that we need to learn new ways of working, and change our processes, and that’s scary. I’m sure that we can learn new things - we’ve done it before. But A) do we want to? and B) do we even know where to start?


If you’re building an all-new network today you’d bake in some great configuration management. But we, as a wider industry, need to figure out how to improve the lot of existing networks. We can’t rip & replace. We’ve got legacy gear, often with poor interfaces that don’t work well with automation toolsets. We need to figure out transition plans - for both technology & people.


Have you started changing the way you approach network configuration management? Or are you stuck? What’s holding you back? Or if you have changed, what steps did you take? What worked, and what didn’t?

A finely-tuned NMS is at the heart of a well-run network. But it’s easy for an NMS to fall into disuse. Sometimes that can happen slowly, without you realizing. You need to keep re-evaluating things, to make sure that the NMS is still delivering.

 

Regular Checkups

Consultants like me move between different customer sites. When you go back to a site after a few weeks or months, the step changes in behavior stand out. That usually makes it obvious whether people are using the system or not.


If you're working on the same network for a long time, you can get too close to it. If behaviors change slowly over time, it can be difficult to detect. If you use the NMS every day, you know how to get what you want out of it. You think it's great. But casual users might struggle to drive it. If that happens, they'll stop using it.


If you're not paying attention, you might find that usage is declining, and you don't realize until it’s too late. You need to periodically take an honest look at wider usage, and see if you're seeing any of the signs of an unloved NMS.

 

Signs of an unloved NMS

Here are some of the signs of an unloved NMS. Keep an eye out for these:

  1. Too many unacknowledged alarms (see the sketch after this list for one way to keep an eye on this)
  2. No-one has the NMS screen running on their PC - they only log in when they have to
  3. New devices aren't being added
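The first sign is the easiest one to measure. Here's a rough sketch that counts unacknowledged active alerts through the SolarWinds Information Service, using the orionsdk Python client. The server name and credentials are placeholders, and the SWQL entity and field names are my assumptions - check them against your own SWIS schema before trusting the numbers:

# Count unacknowledged active alerts via SWIS, using orionsdk
# (pip install orionsdk). Entity/field names are assumptions --
# verify against your SWIS schema.
from orionsdk import SwisClient

swis = SwisClient("orion.example.net", "api_user", "api_password")  # placeholders

results = swis.query(
    "SELECT AlertActiveID, TriggeredDateTime "
    "FROM Orion.AlertActive "
    "WHERE Acknowledged = false"
)
unacked = results["results"]
print(f"{len(unacked)} unacknowledged active alerts")

if len(unacked) > 50:  # arbitrary threshold -- tune it to your environment
    print("The alert backlog is growing. Time to ask why nobody is clearing it.")

Schedule something like that weekly and graph the count. A steadily climbing line is your early warning that the NMS is sliding into shelfware.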

 

What if things aren't all rosy?

So what if you've figured out that your NMS isn't as loved as you thought? What now? First, don't panic. It's recoverable. Things you can do include:

  1. Talk. Talk to everyone. Find out what they like, and what’s not working. It might just be a training issue, or it might be something more. Maybe you just need to show them how to set up their homepage to highlight key info.
  2. Check your version. Some products are evolving quickly. Stay current, and take advantage of the new features coming out. This is especially important with usability enhancements.
  3. Check your coverage: Are you missing devices? Are you monitoring all the key elements on those devices? Keep an ear to the ground for big faults and outages: Is there any way your NMS could have helped to identify the problem earlier? If people think that the NMS has gaps in its coverage, they won't trust it. (A quick coverage-check sketch follows this list.)
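Sign three and step three are two sides of the same problem, and checking it can be as dull as a set difference. This sketch compares a device inventory export against an export of the nodes the NMS is monitoring. The two file names are placeholders, and I'm assuming simple one-hostname-per-line text exports:

# Quick coverage check: inventory vs. what the NMS actually monitors.
# Input files are one-hostname-per-line text exports; names are placeholders.
def load_hostnames(path):
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

inventory = load_hostnames("inventory_export.txt")   # from the CMDB/asset list
monitored = load_hostnames("nms_nodes_export.txt")   # exported from the NMS

missing = sorted(inventory - monitored)
stale = sorted(monitored - inventory)

print(f"{len(missing)} devices in inventory but not monitored:")
for host in missing:
    print(f"  {host}")

print(f"{len(stale)} monitored nodes not in inventory (retired, or missing from the CMDB?):")
for host in stale:
    print(f"  {host}")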

 

All NMS platforms have a risk of becoming shelfware. Have you taken a good honest look recently to make sure yours is still working for you? What other signs do you look for to check if it's loved/loathed/ignored? What do you do if you think it might be heading in the wrong direction?

Is User Experience (UX) monitoring going to be the future of network monitoring? I think that the changing nature of networking is going to mean that our devices can tell us much more about what’s going on. This will change the way we think about network monitoring.


Historically we’ve focused on device & interface stats. Those tell us how our systems are performing, but don't tell us much about the end-user experience. SNMP is great for collecting device & interface counters, but it doesn't say much about the applications.


NetFlow made our lives better by giving us visibility into the traffic mix on the wire. But it couldn't say much about whether the application or the network was the pain point. We need to go deeper into analysing traffic. We've done that with network sniffers, and tools like SolarWinds Quality of Experience help make it accessible. But we could only look at a limited number of points in the network. Typical routers & switches don't look deep into the traffic flows, and can't tell us much.


This is starting to change. The new SD-WAN (Software-Defined WAN) vendors do deep inspection of application performance. They use this to decide how to steer traffic. This means they’ve got all sorts of statistics on the user experience, and they make this data available via API. So in theory we could also plug this data into our network monitoring systems to see how apps are performing across the network. The trick will be in getting those integrations to work, and making sense of it all.
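To be clear about how speculative this still is, the sketch below shows the shape of such an integration rather than any particular vendor's API. The controller URL, endpoint, token, and JSON field names are all hypothetical; I'm just using the Python requests library to illustrate pulling per-application experience stats and flagging the poor performers:

# Illustrative only: pull per-application experience stats from an SD-WAN
# controller and flag poor performers. URL, endpoint, token and JSON schema
# are hypothetical -- every vendor's API and data model is different.
import requests

CONTROLLER = "https://sdwan-controller.example.net"   # hypothetical
TOKEN = "REPLACE_ME"

resp = requests.get(
    f"{CONTROLLER}/api/v1/app-performance",            # hypothetical endpoint
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
resp.raise_for_status()

for app in resp.json().get("applications", []):        # hypothetical schema
    name = app["name"]
    latency_ms = app["latency_ms"]
    loss_pct = app["loss_pct"]
    if latency_ms > 150 or loss_pct > 1.0:              # example thresholds only
        print(f"{name}: latency {latency_ms} ms, loss {loss_pct}% -- investigate")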


There are many challenges in making this all work. Right now, each SD-WAN vendor has its own APIs and data exchange formats. We don't yet have standardised measures of performance either. Voice has MOS, although there are arguments about how valid it is. We don't yet have an equivalent for apps like HTTP or SQL.
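For context on what a standardised measure looks like, here's roughly how MOS gets derived from network measurements. This is a sketch of a commonly used simplification of the ITU-T G.107 E-model; the impairment constants are the usual rule-of-thumb values rather than the full standard, so treat them as approximations:

# Simplified E-model: one-way delay and loss -> R-factor -> MOS.
# The impairment terms are a common simplification of ITU-T G.107;
# treat the exact constants as approximations.
def estimate_mos(one_way_delay_ms, loss_pct):
    # Delay impairment: grows slowly, then sharply past ~177 ms.
    delay_impairment = 0.024 * one_way_delay_ms
    if one_way_delay_ms > 177.3:
        delay_impairment += 0.11 * (one_way_delay_ms - 177.3)

    # Loss impairment: rough, codec-agnostic approximation.
    loss_impairment = 2.5 * loss_pct

    r = 93.2 - delay_impairment - loss_impairment

    # Standard R-to-MOS mapping.
    if r < 0:
        return 1.0
    if r > 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

print(estimate_mos(one_way_delay_ms=80, loss_pct=0.5))   # roughly 4.3: a good call
print(estimate_mos(one_way_delay_ms=250, loss_pct=3.0))  # roughly 3.7: users will notice

Nothing like that formula exists yet for HTTP or SQL response times, and that's the gap I mean.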


Standardising around SNMP took time, and it can still be painful today. But I'm hopeful that we'll figure it out. How would it change the way you look at network monitoring if we could measure the user experience from almost any network device? Will we even be able to make sense of all that data? I sure hope so.

Capacity planning is an important part of running a network. To me, it’s all about two things: Fewer surprises, and better business discussions. If you can do that, you'll get a lot more respect.


When I was working for an ISP, we had several challenges:

  • Average user Internet usage was steadily increasing
  • Users were moving to higher-bandwidth access circuits, which meant even more usage
  • Upstream WAN bandwidth still cost money. Sometimes lots of money.

I built up a capacity planning model that took into account current & projected usage, and added in marketing estimates of future user changes. It wasn’t a perfect model, but it was useful. It gave me something to use to figure out how we were tracking, and where the pain points would be.
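To make that concrete, here's a toy version of the same idea: project peak demand from a subscriber forecast and per-user usage growth, and flag the month a link crosses an upgrade threshold. Every number below is invented for illustration:

# Toy capacity model: project peak WAN demand and flag the month it crosses
# an upgrade threshold. All figures are invented placeholders.
LINK_CAPACITY_MBPS = 10_000          # upstream WAN capacity
UPGRADE_THRESHOLD = 0.80             # start the upgrade conversation at 80%
CURRENT_USERS = 40_000
PEAK_MBPS_PER_USER = 0.12            # measured average contribution at peak
USAGE_GROWTH_PER_MONTH = 0.03        # organic growth in per-user usage
NEW_USERS_PER_MONTH = 800            # marketing's subscriber forecast

def project(months=24):
    users, per_user = CURRENT_USERS, PEAK_MBPS_PER_USER
    for month in range(1, months + 1):
        users += NEW_USERS_PER_MONTH
        per_user *= 1 + USAGE_GROWTH_PER_MONTH
        demand_mbps = users * per_user
        utilisation = demand_mbps / LINK_CAPACITY_MBPS
        print(f"Month {month:2d}: {demand_mbps:8.0f} Mbps ({utilisation:.0%})")
        if utilisation >= UPGRADE_THRESHOLD:
            print(f"-> Crosses {UPGRADE_THRESHOLD:.0%} in month {month}. "
                  "Start the procurement conversation well before then.")
            break

project()

A real model has far more inputs than this, but the structure is the same: current usage, growth assumptions, and a threshold that triggers a conversation.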


Fewer Surprises

No-one likes surprises when running a network. If your VM runs out of memory, it’s easy enough to allocate more. But if your WAN link reaches 90%, it might take weeks to get more bandwidth from your provider. If you hit that peak due to foreseeable growth, it makes for an awfully uncomfortable discussion with the boss. Those links can be expensive too. You'll be in trouble with the bean-counters if the budgets have been set, and then you tell them that you need another $10,000/month. You can't always get it right. There are always situations where a service proves far more popular than expected, or a marketing campaign takes off. But reducing the surprises helps your sanity, and it improves your credibility.


Better Business Discussions

I like to use capacity planning and modeling tools for answering those “What if?” questions. For example, the marketing team will come to you with questions like these (a toy “what if” sketch follows the list):

  • What if we add another 5,000 users to this service? What will that do to our costs?
  • What if we move 10,000 customers from this legacy service to this new one? How will our traffic patterns change?
  • Do we have capacity to run a campaign in this area? Or should we target some other part of the country?
  • Where should we invest to improve user experience?
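With a model like the toy one above, some of those answers fall out almost for free. The per-user and cost figures here are invented, but the shape of the answer is the point:

# "What if we add another 5,000 users?" -- answered with the same invented
# figures as the projection sketch earlier.
PEAK_MBPS_PER_USER = 0.12      # measured average contribution at peak
COST_PER_MBPS_MONTH = 2.50     # what upstream bandwidth costs per month

def what_if_added_users(extra_users):
    extra_demand_mbps = extra_users * PEAK_MBPS_PER_USER
    extra_cost = extra_demand_mbps * COST_PER_MBPS_MONTH
    return extra_demand_mbps, extra_cost

demand, cost = what_if_added_users(5_000)
print(f"+5,000 users: about {demand:.0f} Mbps more at peak, "
      f"about ${cost:,.0f}/month more in bandwidth")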


If you've been doing your capacity planning, then you've got the data to help answer those questions. You get a lot more respect when you're able to have those sorts of discussions, and answer questions sensibly.


This does take real effort though. Getting the data together and making sense of it can be tough. Tying it to business changes in particular is tough. No capacity planning model fully captures everything. But it doesn't have to be perfect - you can always refine it over time.


Are you actively doing capacity planning? How is it helping? (Assuming it is!) If you're not doing any capacity planning, what’s been holding you back? Or have you had any really nasty surprises, where you've run out of capacity in an embarrassing way?
