
Growing Network Complexity and KISS

Level 10

Hey Everyone!

My name is Ryan Booth, and I've been given the opportunity to post as an Ambassador focusing on the Network Management section of Thwack. I'm excited to bounce my thoughts and ideas off everyone and look forward to the conversations.

The overall theme for my posts will be the growing complexity of networks and how to still keep things simple (aka KISS).

Everyone knows K.I.S.S. as Keep It Simple, Stupid, and it's something I push with every change, design, and project. Networks are getting increasingly complex, with everything from converged storage/data networking to automated workload migrations in virtualized environments. BYOD, mobile devices, and virtual desktop infrastructure (VDI) are also making the enterprise environment super crazy.

So my question is: How do you handle increasing complexity in your network while still maintaining simplicity?

Here are several open-ended questions to get the conversation going:

  • What makes more sense: five 9's and High Availability (HA), or being able to recover quickly?
    • HA = complexity. Do you really need three, four, or five 9's? (See the downtime math below.)
  • Do you design a perfect fit each time or stay consistent across the network?
    • Ex. Deploy the latest switch/server model, or stick with the same model used in every closet?
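For concreteness, here's the annual downtime budget each level of 9's actually buys you (straight arithmetic against a 365-day year, so treat the figures as approximate):

  • 99.9% (three 9's): ~8 hours 46 minutes of downtime per year
  • 99.99% (four 9's): ~52.6 minutes per year
  • 99.999% (five 9's): ~5.26 minutes per year
  • 99.9999% (six 9's): ~31.5 seconds per year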

A little about me:

I've been in the IT game for about 10 years now, with the majority of my experience focused on routing and switching (8+ years). I currently hold 3 CCNP certifications (R&S, SP, and Design) and am working towards my CCIE. I also have experience with servers, both Windows and Linux, along with some virtualization experience, but my passion is routing and switching.

I blog at my own site blog.movingonesandzeros.net and can be found on Twitter @That1guy_15. You can also find me on various forums and hangouts under That1Guy15.

So enough about me, let’s get this ball rolling!

49 Comments

It's a mixed bag - in our shop, we concentrate most HA/five-9s efforts on our server and application platforms, since data access to our vital apps is paramount. MPLS between our sites at least gives us a good SLA for uptime on those connections, and falling back to IPsec VPNs as a failover solution is an acceptable path from the business's perspective. That said, we've had very good results for uptime on our MPLS pipes over the past calendar year, and are right at four 9s there as well (knocking on wood until the splinters fly here).
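For anyone curious what that MPLS-primary/IPsec-fallback routing can look like at its simplest, here's a minimal sketch of the common floating-static-route approach (assuming routes are learned via BGP over the MPLS link; the tunnel, prefix, and addresses are illustrative, not the poster's actual config):

```
! Remote-site router: routes learned over MPLS (e.g., eBGP, AD 20)
! are preferred; the floating static below only enters the routing
! table if the MPLS-learned route disappears.
interface Tunnel1
 description GRE-over-IPsec backup path to HQ across the Internet
 ip address 172.16.1.2 255.255.255.252
 tunnel source GigabitEthernet0/1
 tunnel destination 203.0.113.10
!
! AD 250 keeps this route dormant while the MPLS path is healthy
ip route 10.20.0.0 255.255.0.0 Tunnel1 250
```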

Consistency is what we shoot for and we've standardized on switchgear, wireless, virtual/physical HW, and the like. This consistency (as I think you're alluding to) breeds simplicity, since provisioning/builds are replicable cookie-cutter situations with few exceptions.

Level 10

Yeah, cookie-cutter is exactly what I was talking about.

Level 17

Cookie cutter good; 5 9's bad! We need no myths when it comes to network data and its so-called available time. Five 9's is a poor stretch of the imagination anyway, only because the feller who came up with it did not want to split decimal places with 6 9's.

I wanna Rock and Roll All Night!

Level 10

HA is not as big an issue for a lot of companies that feel they need 5 9's. I think these companies would do just as well, if not better, if they used a simple redundant setup and had solid documentation and procedures for outages.

Don't get me wrong, folks....I don't walk around talking to the business about five 9s. Does our team do our absolute best to maximize effective uptime? Sure. Do we come close to achieving end-to-end five 9s (including all the pieces of public infrastructure that are OUT of our hands)? Of course not. The knocking-on-wood part means we've been lucky over the past months, and we know it.

MVP

5 9's can be attainable, but at a cost...until that backhoe working nearby cuts all your fiber, because everything comes into the building at the same point or everything is tied to the same vendor.

It is an SLA that makes the business happy but gives IT headaches and insomnia. One good outage (bad IOS code, hardware, power, etc.) can ruin that SLA in short order.

Consistency across the network is a must!!

It makes maintaining the environment and monitoring much easier (cookie cutter) and therefore more scalable.

MVP

+1 for KISS

MVP

While the intent to keep solutions simple is noble and worthy of consideration, it should not be a constraint on any design. The fact is that some problems are complicated, and some solutions must be complicated in order to solve the problem properly.

As always, it's a cost-benefit decision that the company needs to make. I'll build you a 99.9999% uptime solution, no problem. But you'll pay a fortune for it. Instead, I'd rather help you thoroughly define and understand the problem, then design a solution.

MVP

I like to operate by the 80/20 rule.

Build the simple stuff to handle 80% of the situations, and use one-offs for the other 20% that can't fit into the standard.

That allows for a simpler overall approach that is scalable but still allows for other solutions where needed.

Level 12

We missed 5 9s this year in part because our service provider sent out a router update that put 16 sites into ROMMON and then couldn't get techs to those locations. Keeping it simple is great in theory, but there was no failover plan for something like that, and we paid the price. As previously stated, price is often the limiting factor, even more so than complexity.

Level 10

This is a good point, and I don't disagree with it at all. Sometimes requirements drive complexity.

I'd be curious to know what most VARs/contractors run into. Do you walk in and fully design and build based on the customer's needs, or are you handed a general design with requirements and asked to make it happen?

Level 17

Hey wait, did you just call me stupid?  hehe

Level 14

Several have mentioned the dreaded "cost vs. risk" argument...ain't it the truth! I'm especially thrilled when something happens to impact profits, I'm asked to provide a solution, and then find it's not that big of an issue because the cost to fix it is too high. Suddenly, the risk is acceptable. So, instead of telling folks we can achieve 5x9's, I started telling them I'm happy to provide 9x5's. At least that is realistic based on the budgets most of us are given.

I'm also a big fan of the cookie-cutter design approach. I find that if your cookie-cutter design is based on best practices, you shouldn't need to deviate from it. If I find myself deviating from a standard best practice, then someone has asked me to do something the product wasn't designed to do. It won't be supported, and typically the solution is unstable and gets scrapped.

D

We aim for balance. HA is something we strive for, with redundancy and strict SLAs with our providers. Usually, if we have an issue with availability, it traces back to the service provider. We had a situation with one of our providers (we won't call them by name; let's just call them Earth L, or E Link for short) where an engineer decided to do unscheduled maintenance without clearing it with any of their customers and took down an entire central office for half a day.

I say that to say technical people and management look at this differently. Management, who unfortunately did not always come up through the ranks, throws money at the 5 9's, not realizing the obstacles are often out of our control. Technical people do their best to find the balance of keeping the infrastructure secure and the DR model a quick one. Unfortunately, management usually wins out. We are in the process of moving from a hot site with a DR site to having two hot sites and a DR site.

Level 10

You're completely right; the human-error aspect is one that gets overlooked a lot.

Years ago the C-level execs wanted no downtime. We would always build battleship systems without regard for recovery or maintenance efforts. These systems were, and some still are, complex. Frankly, C-I-E-I-O (see eye e eye o) did not care how hard we worked.

Now that those people have moved on, the system admins are making headway simplifying servers and systems.

RT

Level 10

HA! Kinda ironic for this post...

This morning we lost one of several PDUs in our DC, which should not have been a problem. BUT several racks and dual-homed services went down with it. The cause? Rack PDUs, or the devices themselves, were dual-connected to that one PDU instead of being distributed between the A side and the B side.

Sometimes all the HA in the world cannot save you from yourself. It's all about the details...

MVP

That's funny...dual-homed and routed through a single point of failure. That wasn't a matter of if, but rather when, it was going to bite.

Kind of like when the procurement department at a prior employer ordered the cheaper set of redundant power supplies for the unified Cisco switch....looked great on paper until one of the power supplies became unplugged...but that's another story. The cheaper power supply was cheaper for a reason: not as much capacity, so it could not support the entire chassis, and our core switch bit the dust. Not sure it looked so good on paper anymore after the loss of revenue that day.....

Level 14

Oh, how many stories do we all have just like these! I bet the 9x5s SLA looks pretty good to you right about now. LOL

D

Level 12

What makes more sense, five 9’s and High Availability (HA) or being able to quickly recover?

For us, definitely quick recovery. HA is not needed everywhere, and it is folly to try to achieve it where it is not.

Do you design a perfect fit each time or stay consistent across the network?

We try to keep every site configuration as close to our model as possible, when the technology is available, of course. I hate managing exceptions.

Level 13

Until we brought NPM into our environment we didn't even have a way to measure uptime, so there was no talk of any amount of 9's.

We do try to design HA into all our sites: dual uplinks from access switches to distribution, dual supervisors, dual distribution switches with spanning-tree and HSRP priorities set to balance the VLANs and layer 3 interfaces, active/standby load balancers and firewalls, etc. (a quick sketch of that HSRP/spanning-tree pairing is below).
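To make the VLAN-balancing piece concrete, here's a minimal sketch of the HSRP and spanning-tree pairing on one of a pair of distribution switches (the VLAN numbers, addresses, and priorities are illustrative only, not the poster's actual config):

```
! Distribution switch A: HSRP active (and STP root) for VLAN 10,
! standby for VLAN 20. Switch B mirrors this with the values swapped,
! so each switch carries half the VLANs in steady state.
spanning-tree vlan 10 priority 4096
spanning-tree vlan 20 priority 8192
!
interface Vlan10
 ip address 10.0.10.2 255.255.255.0
 standby 10 ip 10.0.10.1
 standby 10 priority 110
 standby 10 preempt
!
interface Vlan20
 ip address 10.0.20.2 255.255.255.0
 standby 20 ip 10.0.20.1
 standby 20 priority 90
 standby 20 preempt
```

Keeping the HSRP active router and the spanning-tree root on the same box for each VLAN avoids hairpinning traffic across the inter-distribution link.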

To make all that dual-everything easier to support, we pretty much had to take a cookie-cutter approach and standardize on hardware models for each network function. We used NCM merely for config backups when a previous engineer was managing it. Once I inherited responsibility, I started using the Policy Reporter functionality to ensure that our configs are as cookie-cutter as possible too.

Funny story: I was in a meeting yesterday. One of our business units is demanding 6 9's. If I do my math right, that is only about 31.5 seconds of downtime a year. There was much laughter in the meeting.

MVP

How far you go towards 5 9's depends on the goals of your current job. When I worked in a 24x7x365 production environment that ran on the JIT principle, uptime was king, and even getting any time to do upgrades or maintenance was a pain. My current job is much different; while keeping things up and running is important, they place much less emphasis on that relative to cost.

In some ways I prefer a business that is willing to pay a bit more for an excellent uptime record; it makes my job easier in some ways. However, when trouble happens, it's also much more intense.

I also find it interesting how other countries can interpret such goals differently. Here in the US we worked towards full redundancy (equipment, multiple power circuits, diverse connections, etc.) to get to high uptimes, while in Germany they were more interested in spending money on a network management system that would isolate where a fault was, and in having people and equipment on hand to swap it out quickly. They couldn't understand that with redundant distribution switches, if one went out, our SolarWinds would quickly isolate that switch as the problem, and we could take our time replacing it. Their uptime suffered as a result; unfortunately, they were also corporate HQ, so in the end their methodology was bound to win. Oddly enough, due to the increased manpower, their IT budgets were quite a bit higher too.

Level 10

Always has with this network. I've been here for less than a year, and I find fun stuff like that all the time.

"What The!!" is uttered by me on a regular basis

Level 14

"What the!!"

"Who the!!"

"Why the!!"

"Seriously!?!"

"AGAIN!!!"

LOL

MVP

"REALLY ?!!"

Level 7

The problem I've always faced is that cost matters. Whether you work for a big company or a small one, they always want the cheapest option. When you inform them about 5 9's, they always say it is important, since downtime is money; but when you present the solution, I get the same question: what is the cost? And when that is presented, I always get asked how to do it cheaper.

As for using the same equipment everywhere, it can be beneficial but can also limit you. I always look at the life cycle of the equipment to try to determine whether I should stay with that model or move to another. I've seen both situations help me and hurt me. I guess in my mind there is no perfect answer.

Level 10

Interesting to see the different views from different parts of the world.

I think if HA is not a priority, then full visibility and alerting need to be rock solid.

Level 10

Nice post, nice blog!

I could not agree more.

RT

Level 21

I am in complete agreement with you on keeping things as simple as possible. We are a service provider specializing in building environments that are personalized to each customer's needs and requests. Some of our customers that are most concerned with HA and uptime have some of the most complicated designs, and ultimately have more problems than our customers with very simple environments.

Level 7

Right

Level 10

My next post is circling around this topic.

Level 21

I am definitely looking forward to that!

MVP

The biggest thing for us is the dollar signs attached to much of this.

Often we get asked for high levels of redundancy linked to low tolerances for downtime, only to have them revert to 9-5 due to extreme sticker shock.

Thankfully, we are getting closer to classifying data and services to prioritize what REALLY needs a high level of recoverability or limited downtime.

Level 10

Yeah, once sticker shock and budgets come into play, the game plan changes.

I like to call it the "Model T" method: every network exactly the same (when possible), like a Ford Model T. You can have whatever color you want, as long as it's black. The Model T was one of the first cars with interchangeable parts, which means we generally use the same network manufacturer and model in every office. So, if we close an office, we can easily reuse the gear in other offices.

We also joke that there shouldn't be any "fancy" add-ons (no fuzzy dice, curb feelers, or carpet on the steering wheel) or one-offs at an office, because generally only one or two IT people understand a one-off, and are they always going to be around to fix or troubleshoot it?

KISS and the Model T method are key for the company I work for, as we have 13,000 staff and over 250 offices with a small network team to maintain them. <insert Plug for SolarWinds> SolarWinds NPM makes it all easier to maintain and monitor. </insert Plug for SolarWinds>

I'm curious - what's your definition of a 'small' network team?

Level 17

4 guys for 15+ on-site buildings and 25 outlying sites (multiple buildings in many cases), with 250+ access-layer boxes, 3 WAN distribution boxes, 2 routers at each smaller remote site, dual distribution at the larger ones, multiple circuits and circuit types, and a DR DC.

Yes, 4 guys.... Well, 5 since yesterday.

That's progress! +1s are hard to accomplish sometimes....IT headcount is not always the easiest sell.

Level 17

Yes, our team got +1... the Main Campus team got -1... but they get the next contractor.

Very hard to get a +1... we are still backfilling some positions that were not moved to another department.

I doubt we will ever truly get an honest +1... without the -1 first.

LOL.. The network team is 12 staff; about half of them are full-time network guys, mostly within the datacenter team. The other half share some responsibility with the server and desktop teams. When I say network team, I mean network as in switches, routers, cabling, Riverbeds, etc. (the true meaning of network). The only servers that we maintain are the content filter, RADIUS, LDAP, and SolarWinds. Also, we use SolarWinds only to monitor network devices; servers and applications are someone else's job LOL..

Network Elements: 5,898
Nodes: 1,633
Interfaces: 4,232

We're also running NTA and NCM.

Level 17

Separation of powers at its finest!

Level 11

Interesting one... Can you elaborate more, or provide a scenario?

Level 12

+1.. Thanks that1guy15, interesting discussion

Level 12

Very interesting indeed

Level 15

I'm of the cloth that standardized is better than bleeding edge. Have a sandbox to test equipment and concepts, but try to keep switches and routers the same in both hardware and configuration. Be conscious of the budget, and try to keep project scope and expectations in line with the business goals and objectives.

Tough balancing act.

MVP

You had me at KISS

We mainly use quick recovery for our networking gear. There's redundancy for all the really important gear, but small sites are built for quick recovery. If a link is down for too long, the users need to go mobile or to another site. The big sites all have redundant links (separate paths) as well as redundant cores.

Level 15

Interesting topic and interesting discussion. Our network gear is also in the quick-recovery camp: we keep spares of common components available, and the critical stuff is configured with redundancy.