The Importance of Baselining During Times of Crisis

I consider myself lucky to have delivered many of our SolarWinds admin training courses, and I would like to think that the people attending have also felt lucky to have spent a week shut in a room with me whilst I teach the finer art of managing an Orion installation. I like to get real-world content into my training, and I often use the example of bad weather, train strikes, etc. to bring up the importance of baselining when it comes to monitoring.

The analogy I use to highlight this function of monitoring is that during such times the business may require different things of IT. In this example, and the reason for my writing this post, staff are often unable to get into their normal work location. In our current world health crisis, with the coronavirus having such a devastating impact around the globe, staff, the companies themselves and even national governments are dictating that people should work from home.

IT is therefore in a position where the business is demanding that staff are able to work effectively from home, and a key issue here is how users perform their jobs when outside of their workspace. Typically, the answer comes in two forms: first, the applications and services they require are SaaS-based and can therefore be accessed anywhere with an Internet connection; second, remote VPN access is provided to internal resources.

No doubt you have at least one appliance that is able to provide remote client or SSL-based VPN tunnels to remote users. If your current average number of concurrent VPN connections is 50, how will your environment cope if the business sends an all-hands message that everyone is to work from home from tomorrow? If and when this question comes down from above, do you know whether your infrastructure can cope with a change from 50 to, say, 2,000 concurrent VPN tunnels?

The questions you need answers to when put in this position, and which a solid baseline will help you answer, are going to be such things as:

  1. Is your VPN termination device, typically your firewall, able to cope with such numbers? When you purchased that firewall and decided which model was right for you, did you contemplate such a situation? Throughput, active connection rates, IPS throughput, etc. are all reasons why there is more than one model in your firewall vendor's range.
  2. How much network bandwidth is consumed by remote VPN users?
    1. Is your WAN connection able to cope with an increase?
  3. What services do your remote users need access to?
    1. What is the makeup of the traffic going to look like?
    2. Are the routing and policy rules in place for the resources users will be expecting to reach?
  4. How much compute resource will be consumed delivering remote VPN services?
  5. Where do your limits lie?

If you are currently monitoring the metrics that will provide answers to the above, then great: you are in a position to provide more accurate answers. Extrapolating your current consumption and utilisation figures should, at the very least, give you data on which to base important decisions. If you are not monitoring them, get that monitoring in place as quickly as you can.
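To make the idea of extrapolating from a baseline concrete, here is a minimal Python sketch. It assumes, purely for illustration, that each metric scales linearly with the number of concurrent VPN users, which real devices rarely do exactly. The metric names and baseline readings are hypothetical placeholders and not figures from any real environment; only the 50 and 2,000 user counts come from the example above.

```python
# Hypothetical baseline readings taken at the current average VPN load.
# Replace these with figures from your own monitoring platform.
BASELINE_USERS = 50
TARGET_USERS = 2000

baseline = {
    "wan_mbps": 40.0,          # WAN bandwidth consumed by VPN users (assumed)
    "firewall_cpu_pct": 2.0,   # firewall CPU attributable to VPN (assumed)
    "vpn_tunnels": 50.0,       # concurrent tunnels
}


def project(value, baseline_users, target_users):
    """Scale a baseline reading linearly to a new user count.

    Linear scaling is a simplifying assumption; treat the result as a
    rough first estimate, not a guarantee.
    """
    return value * target_users / baseline_users


projected = {
    name: project(value, BASELINE_USERS, TARGET_USERS)
    for name, value in baseline.items()
}

for name, value in projected.items():
    print(f"{name}: {baseline[name]} at {BASELINE_USERS} users "
          f"-> roughly {value} at {TARGET_USERS} users")
```

With these made-up readings, the projection comes out at roughly 1,600 Mbps of WAN bandwidth and 80% firewall CPU for 2,000 users. The point is not the numbers themselves but that a baseline lets you produce such an estimate at all.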

What we do not want is a situation where the infrastructure will see a significant increase in this specific service function tomorrow and we can only guess at potential break points.

If that break point is around 1,500 users and we open up to the 2,000 figure, the experience for users could be poor at best and unusable at worst. Does this push your device into a failsafe mode in which traffic is not being scanned, which is a security risk? Do we have visibility of the makeup of the traffic, given that the urgent nature of this requirement may open the door to data leaks?

In this unprecedented time, your monitoring platform should be providing you with information on which you can make informed decisions, and it serves the important function of closely observing what actually happens when such events change the way your infrastructure is used. Maybe, by analysing your current metrics, you are able to deduce that 1,500 is the figure you feel you can support without degrading performance for those now-remote staff. This is surely better than everyone being unable to get a workable connection.
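As a rough illustration of how a supportable figure such as 1,500 might be deduced, the Python sketch below estimates a break point from per-user costs and device limits. Everything in it is an assumption made for the example: the metric names, the per-user costs, the capacity limits and the 75% headroom target are hypothetical, and the model again assumes that load grows linearly with user count.

```python
import math

# Hypothetical per-user cost of each resource, derived from a baseline.
per_user = {
    "wan_mbps": 0.5,              # Mbps of WAN bandwidth per VPN user (assumed)
    "vpn_tunnels": 1,             # one tunnel per user
    "firewall_cpu_pct": 0.03125,  # CPU percentage points per user (assumed)
}

# Hypothetical hard limits of the infrastructure.
capacity = {
    "wan_mbps": 1000,         # WAN link size (assumed)
    "vpn_tunnels": 2500,      # maximum tunnels the device supports (assumed)
    "firewall_cpu_pct": 100,
}


def break_point(per_user, capacity, headroom=0.75):
    """Return (max_users, bottleneck_metric).

    max_users is the largest user count that keeps every metric at or
    below headroom * capacity, assuming linear per-user cost.
    """
    limits = {
        metric: math.floor(capacity[metric] * headroom / cost)
        for metric, cost in per_user.items()
    }
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck


users, metric = break_point(per_user, capacity)
print(f"Estimated supportable users: {users} (limited by {metric})")
```

With these invented inputs the WAN link is the first resource to run out, at 1,500 users. In practice you would feed in the per-user figures your own baseline gives you and then watch the real metrics closely as usage ramps up, since actual behaviour is unlikely to be perfectly linear.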

Does the analysis show that you need to increase the power of your firewall? Does it suggest you need to implement configuration to control traffic flow, or change your QoS definitions? There are many more such questions to answer. As I state many times when consulting with our clients, information is key. Your monitoring platform should provide that information first proactively, so that you can make informed decisions about the future, and then reactively, showing what is happening here and now so that changes can be applied to keep everything delivering the service the business needs.

Be sure to look out for other posts which we hope will help during this dramatic period and please stay safe!