
Geek Speak


A storage system on its own is not useful. Sure, it can store data, but how are you going to put any data on it? Or read back the data that you just stored? You need to connect clients to your storage system. For this post, let’s assume that we are using block protocols like iSCSI or traditional block storage systems. This article also applies to file protocols (like NFS and SMB) and to some extent even to hyper-converged infrastructure, but we will get back to that later.


Direct attaching clients to the storage system is an option. There is no contention between clients on the ports, and it is cheap. In fact, I still see direct attached solutions in cases where low cost wins over client scalability. However, direct attaching your clients to a storage system does not really scale well in number of clients. Front-end ports on a storage array are expensive and limited.


Add some network

Therefore, we add some sort of network. For block protocols, that is a SAN. The two most commonly used protocols are the FC protocol (FCP) and iSCSI. Both protocols use SCSI commands, but the network equipment is vastly different: FC switches vs. Ethernet switches. Both have their advantages and disadvantages, and IT professionals will usually have a strong preference for either of the two.


Once you have settled on a protocol, the switch line speed is usually the first thing that comes up. FC commonly runs at 16Gbit, and 32Gbit switches have been entering the market lately. Ethernet, however, is making bigger jumps, with 10Gbit being standard within a rack or wiring closet and 25/40/100Gbit commonly used for uplinks to the data center cores.


The current higher speeds of Ethernet networks are often one of the arguments why “Ethernet is winning over FC.” 100Gbit Ethernet has already been on the market for quite some time, and the next obvious iteration of FC is “only” going to achieve 64Gbit.



Once you start attaching more clients to a storage system than it has storage ports, you start oversubscribing. 100 servers attached to 10 storage ports means you have on average 10 servers on each storage port. Even worse, if those servers are hypervisors running 30 virtual machines each, you will now have 300 VMs competing for resources on a single port.
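The oversubscription arithmetic above is simple enough to script. This is a minimal sketch using the numbers from the example; the function names are illustrative and do not come from any vendor tool.

```python
def fan_in_ratio(client_ports: int, storage_ports: int) -> float:
    """Average number of client ports sharing one storage port."""
    if storage_ports <= 0:
        raise ValueError("storage_ports must be positive")
    return client_ports / storage_ports


def vms_per_storage_port(servers: int, storage_ports: int, vms_per_server: int) -> float:
    """Average number of VMs competing for a single storage port."""
    return fan_in_ratio(servers, storage_ports) * vms_per_server


# Numbers from the example: 100 servers, 10 storage ports, 30 VMs per hypervisor.
print(fan_in_ratio(100, 10))              # servers per storage port
print(vms_per_storage_port(100, 10, 30))  # VMs per storage port
```

The same ratio can be computed for any hop in the fabric, for example the client ports on one switch versus the ISL ports leading to the switch that holds the storage.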


Even the most basic switch will have some sort of bandwidth/port monitoring functionality. If it does not have a management GUI that can show you graphs, third-party software can pull that data out of the switch using SNMP. As long as traffic in/out does not exceed 70% you should be OK, right?
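The utilization figure in such a graph is typically derived from two samples of a port's octet counter. Here is a rough sketch of that calculation, assuming the samples come from a counter such as IF-MIB's ifHCInOctets. The function name and sample values are made up for illustration, and the SNMP polling itself is left out.

```python
def port_utilization_pct(octets_start: int, octets_end: int,
                         interval_s: float, link_bps: float,
                         counter_bits: int = 64) -> float:
    """Percent utilization of a link between two octet-counter samples."""
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    # The modulo handles a counter that wrapped around between the two samples.
    delta_octets = (octets_end - octets_start) % (1 << counter_bits)
    return 100.0 * delta_octets * 8 / (interval_s * link_bps)


# Assumed sample: 52.5 GB received in 60 seconds on a 10 Gbit/s port.
util = port_utilization_pct(1_000, 1_000 + 52_500_000_000, 60, 10_000_000_000)
print(f"{util:.1f}%")  # 70.0%
```

Note that this is an average over the polling interval. A port can look comfortable at a five-minute average and still be saturated for seconds at a time.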


The challenge is that this is not the whole truth. Other, more obscure limitations might ruin your day. For example, you might be sending a lot of very small I/O to a storage port. Storage vendors often brag about 4KB I/O performance specs. 25,000 4KB IOPS only accounts for roughly 100MB/s or 800Mbit/s (excluding overhead). So, while your SAN port shows a meager 50% utilization, your storage port or HBA could still be overloaded.
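To make the small-I/O arithmetic concrete, here is a minimal sketch that converts an IOPS figure into bandwidth. It ignores protocol overhead, and the function name is only illustrative.

```python
def iops_to_throughput(iops: int, io_size_bytes: int) -> tuple[float, float]:
    """Return (MB/s, Mbit/s) for a given IOPS rate and I/O size, excluding overhead."""
    bytes_per_s = iops * io_size_bytes
    return bytes_per_s / 1e6, bytes_per_s * 8 / 1e6


# 25,000 IOPS at a decimal 4 KB (4,000 bytes), as in the rough figure above.
mb_s, mbit_s = iops_to_throughput(25_000, 4_000)
print(f"{mb_s:.1f} MB/s, {mbit_s:.1f} Mbit/s")  # 100.0 MB/s, 800.0 Mbit/s

# The same rate at a binary 4 KiB (4,096 bytes) comes out slightly higher.
mb_s, mbit_s = iops_to_throughput(25_000, 4_096)
print(f"{mb_s:.1f} MB/s, {mbit_s:.1f} Mbit/s")  # 102.4 MB/s, 819.2 Mbit/s
```

Either way, the bandwidth is a small fraction of a 10Gbit or 16Gbit link, which is why a bandwidth graph alone says little about whether a port or HBA is keeping up with the I/O rate.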


It becomes more complex once you start connecting SAN switches and distributing clients and storage systems across this network of switches. It is hard to keep track of how many storage and client ports traverse the ISLs (Inter-Switch Links). In this case, it is a smart move to keep your SAN topology simple and to be careful with oversubscription ratios. Do the oversubscription math, and look beyond the standard bandwidth graphs. Check error counters, and in an FC SAN that has long-distance links, check whether the Buffer-to-Buffer credits deplete on a port.


Ethernet instead of FC

The same principles apply to Ethernet. One reason a company chooses an Ethernet-based SAN is that it already has LAN switches in place. In these cases, be extra vigilant. I am not opposed to sharing a switch chassis between SAN and normal client traffic. However, ports, ISLs, and switch modules/ASICs are prime contention points. You do not want your SAN performance to drop because a backup, restore, or large data transfer starts between two servers, and both types of traffic start fighting for the available bandwidth.


Similarly, hyper-converged infrastructure solutions such as VxRail and others built on VMware vSAN place high demands on the Ethernet uplinks. Ideally, you would want to ensure that vSAN traffic uses dedicated, high-speed uplinks.

Which camp are you in? FC or Ethernet, or neither? And how do you ensure that the SAN doesn’t become a bottleneck? Comment below!

Welcome to another edition of the Actuator. I hope everyone is enjoying some warm spring weather. It's nice to be able to sit outside for an hour at the end of the day.


As always, here are some links from the Intertubz that I hope will hold your interest. Enjoy!


Alexa and Siri Can Hear This Hidden Command. You Can’t.

Fun fact for you: There is no law against sending subliminal messages to humans, or machines. The practice is discouraged and *may* be considered an invasion of privacy (for humans, not for machines). Another example of where the laws lag far behind the technology.


Digital Photocopiers Loaded With Secrets

Not mentioned in the article: the embarrassing photos from office Christmas parties.


Stanford Study Shows the Astonishing Productivity Boost of Working From Home

Glad to see we are putting some data into the productivity levels for people working from home. I’ve been doing it for eight years now. I know it’s made me more productive, happier, and healthier. I can’t go back to having a real job, ever.


Don't Skype Me: How Microsoft Turned Consumers Against a Beloved Brand

“[Using Skype] is like Tim Tebow trying to be a baseball player.” Ouch.


Amazon’s Fake Review Problem

I’ve been frustrated for years with the reviews on Amazon. I find many of them to be fake. These days I focus on the three-star ratings and do my best to discern the truth. To be fair, Amazon is not the only company with an online review problem.


Are My Friends Really My Friends?

Interesting analysis showing that despite being surrounded by constant interactions, we are more alone now than ever before.


Security researchers discover critical flaw in PGP encryption that reveals plaintext

Everything is terrible.


Seems legit:

Public cloud providers have greatly simplified the process of creating a backup, but the challenge has always been managing that at scale with things like retention policies, simple granular file-level restores, or regulatory-focused dashboards. This is the value added by many of the backup management solutions discussed in the following post, and it becomes critical once an environment scales past a few instances and databases.


The benefits of managed backup management are:

  1. Simplified Management - The backup management solutions offered by public cloud providers are generally account or subscription focused and don't offer a holistic view of the entire environment.
  2. Scalability - Fully managed backup and SaaS solutions have been built to scale to the largest environments without any performance impact or major concern for running out of storage space. This eliminates the need to re-architect the backup management deployment to scale with the needs of the organization for things such as keeping data for twice as long because of a new mandate.
  3. Multi-Cloud Support - Many of the legacy backup products that are available on the market only support backing up data to public cloud providers or only support backing up workloads in a single cloud provider. More and more companies are implementing multi-cloud strategies and a solution that supports multiple clouds is essential to simplifying operations.


Unmanaged Deployment

The following solutions are unmanaged deployments. This means that the software is available to be installed by the customer, or has been packaged in a native cloud format such as an AWS AMI, but is not available in any of the cloud providers' marketplaces.


Rubrik Cloud Data Management

Rubrik Cloud Data Management is a software appliance that can be deployed to AWS, Azure and GCP. The Cloud Data Management platform supports policy based snapshot management along with advanced analytics to generate operational insights.


Managed Deployment

The following solutions are managed deployments. This means the backup management software company has added a deployment solution to the respective cloud provider's marketplace to allow the infrastructure to be provisioned with the click of a button.


Veritas CloudPoint

Veritas CloudPoint is a backup management solution that supports automated deployment into Azure but supports backing up workloads on AWS, Azure and GCP. In addition to IaaS workloads across the three major clouds CloudPoint also supports application level backups such as Microsoft SQL, MongoDB and AWS Aurora.


SaaS Deployment

The following solutions are Software as a Service (SaaS) deployments. This means the backup management software company hosts the software for its customers.


Druva Apollo

Druva Apollo is a SaaS solution that provides data protection of AWS EC2, RDS, S3, EBS, and Glacier. Druva Apollo also includes SLA-based snapshot retention policies in addition to tiering to reduce costs as older snapshots are moved to cheaper storage and eventually deleted.


Rubrik Polaris

Rubrik Polaris is a SaaS solution that integrates with Rubrik's Cloud Data Management hardware and software appliances to provide a unified management platform for both on-premises and cloud-based workloads.


CloudRanger Backup and Recovery

CloudRanger Backup and Recovery is a SaaS solution that provides backup management of AWS EC2, RDS, and Redshift instances using native AWS snapshots. Instance and file level backups are supported along with multi-region and multi-account backup restore points.


Fully Managed

The following solutions are fully managed backup management solutions such that the cloud provider manages backups on your behalf.


Built-in Snapshots

Public cloud providers allow administrators to create snapshots of virtual machine instances, databases, etc. This doesn't provide a robust feature set in terms of management but does allow administrators to back up and restore to a given point in time.


Backup management is an unsexy topic to most, but of course it has tremendous value when there is a disaster. Many of the new backup management solutions are becoming much more than a way to create a snapshot via the public cloud providers' native snapshot APIs.

By Paul Parker, SolarWinds Federal & National Government Chief Technologist


Here is an interesting article from my colleague Joe Kim, in which he discusses what we can expect to see in the future as federal IT professionals.


Over the past couple of years, administrators have confronted a rapidly changing landscape that seems to shift overnight. Public vs. private cloud is an old argument; now, administrators are grappling with the reality of implementing and managing hybrid IT infrastructures.


As such, their skill sets are being tested like never before. Sixty-two percent of respondents to a recent SolarWinds IT trends survey indicated that hybrid IT has required that they acquire new skills, while 11 percent said it has altered their career path. Meanwhile, 57 percent of public sector organizations have already hired or reassigned IT personnel, or plan to do so, for the specific purpose of managing cloud technologies.


The skills that IT administrators are learning today will have a large impact on what IT management will look like ten years from now.


From service managers to service consumers


Our survey found that 96 percent of respondents have moved at least some applications and aspects of critical infrastructure to the cloud. This migration has caused federal administrators to sharpen their “as-a-service” skills, since many of the tools they are using have become software-defined and exist both on-premises and in hosted environments.


Federal IT professionals have gone from being service managers to service consumers who work with cloud providers to manage their infrastructures. In this service-oriented world, administrators are finding themselves interacting more with software than they are with hardware switches and routers. These interactions are precursors to IT practitioners’ inevitable evolution from traditional network managers into areas that may be more familiar to developers, and toward becoming service brokers, rather than service managers.


From network manager to network developer


Administrators previously needed to be savvy about command lines and hands-on management of network components, but the move toward hybrid IT and Software-as-a-Service (SaaS) applications has greatly reduced the need for these types of skills. Administrators must now begin to be able to manage the different pieces of code that comprise applications and allow those programs to work with each other.


Tomorrow’s network administrators will be familiar with application program interfaces (APIs) — essentially app building blocks — and how they can be used to solve common problems, from network management to security challenges. They will create highly customized and dynamic networks that fit the unique needs of their agencies. Furthermore, they’ll have a greater amount of control over these networks, as they will be able to tap into APIs to dictate policy, rules, user access, and more.


From servicing people to self-service


They’ll also move from being service managers to service brokers. Instead of provisioning more storage or spending their time clicking around user interfaces, they’ll be assigning applications and access rights to individual users so that those users can easily set up services on their own. The standard practice of a user submitting an online request for access to a new application will be a rare occurrence; everything that a person needs or is authorized to use will be at their fingertips in this self-service environment.


Network administrators will also have more opportunity to add strategic value to their agencies. Today, administrators spend a lot of time servicing users. Moving toward self-service will allow users to check their own boxes, download their own applications, and authorize their own access, all without having to go through their system administrators. In turn, future administrators will have more time to work on higher value services, such as developing plans for stronger security measures or using predictive analytics to anticipate and remediate network issues.


From the future to the past


Despite all of these changes, administrators will still need to focus on the “bread and butter” aspects of network management, including performance, availability, and compliance. To ensure success in each of these areas, administrators will need to use many of the same tools and processes that are commonplace right now.


Indeed, some of these tools will be even more important than they are today. For instance, network performance monitoring will be critical, particularly as IT becomes increasingly hybrid- and application-based. Agencies will need solutions that provide automated and unfettered insight into the performance of these applications, whether they are on-premises or hosted, just as they do today.


Learning these and other solutions will take some work, but there are resources available. Online communities such as SolarWinds’ THWACK provide forums where administrators can exchange ideas and ask questions, and most vendors will be more than willing to offer information on best practices.


These resources provide administrators with a chance to hone their skills today while preparing for their future. That future undoubtedly will be challenging, but it also will present many opportunities for federal IT professionals who want to expand their horizons and add more value to their agencies.


Find the full article on GovLoop.

The Simple Network Management Protocol (SNMP) has been a key part of managing network devices in the data centre for some time. It really is a pretty simple protocol to work with (hence the name), and I think it’s underrated as a key tool for monitoring unusual events. Unfortunately, SNMP has had some issues over time. One of these has been sending out a lot of information over the network in an insecure fashion. SNMP v3 was developed to address this. Another issue has been that Joe Sysadmin doesn’t always take the time to configure custom strings to use in the environment with the devices he’s trying to manage. Instead, the default “public” community string is left configured on the devices with more access than is required. This kind of behaviour drives information security folks nuts, and has operations staff questioning whether SNMP is worth the hassle.


SNMP is an extremely flexible solution that provides a robust framework with which you can leverage things like vendor-specific management information base (MIB) files. You can use these to provide both read-only and write access to networked devices. The advantage to this approach is that you can feed information into your management system that provides useful insights, rather than simply showing whether the device is up or down.


Following on from this, alerting that aligns with your devices gives you a better chance of identifying unusual issues in your environment. You could, for example, set your devices to send a trap when local user credentials are used to log in to a device rather than directory credentials. This type of activity may indicate that someone’s up to no good in your environment.


Security in your environment isn’t just about people cracking device credentials, though. It’s also about having devices available to provide the appropriate services to applications and their users. Configuring devices to send meaningful information via SNMP as issues occur can be a great way to get to minor problems before they become major issues. If one of your two firewall devices has suffered a failure, your infrastructure is compromised and you need to address the problem. I’ve seen plenty of situations where internal systems failures go unnoticed for far too long, leading to reduced performance in the environment and angst for both the operations staff and end-users. But people don’t just become unhappy with the infrastructure. They start to use workarounds to get their work done, which can involve unsafe practices such as storing unsecured corporate data in personal mailboxes or on publicly accessible file sharing sites.


A lot of people would agree that data centre operations can be a difficult thing to do well, particularly at a large scale. There always seems to be some device or another that’s run out of capacity, has a failed component, or has simply stopped doing what it’s meant to do. That’s why tools such as SNMP and syslog can help tremendously with keeping things under control in the DC. There’s a wide range of management systems available in the marketplace that can be used to do some pretty cool stuff with SNMP. Most devices that can be deployed in a 19” rack can speak SNMP and syslog, so why not get as much information about what’s happening in your environment as you can? The investment in effort upfront can save you a lot of time and headaches down the track when things invariably go awry.

In a previous Geek Speak blog post, I talked about the viability of practice leader being a career path for IT professionals. A large part of practice leadership is being fluent in vendor technologies, i.e. products. This opens up an interesting set of paths for IT professionals: product management and product marketing.


Are you passionate about discovering IT pain points and finding solutions for them? Product management focuses on customers and their problems. A product manager (PM) is the voice of the entire customer spectrum; the PM delivers market metrics that guide product decisions.


Do you like to tell stories? If you like sharing your problem-solving knowledge by using a product in a one-to-many fashion, you'd probably enjoy working as a product marketing manager (PMM). As Peter Drucker said, product marketing focuses on the product selling itself; creating a product that people want to buy; and creating an environment that encourages people to buy.


Ideally, PMs work to understand real customer problems and use that knowledge to turn solutions into products, while PMMs work to smooth the friction between products and consumers in the marketplace.



Would you consider transitioning into a new role as PM or PMM? Let me know in the section below.


P.S. We are hiring for both PM and PMM positions.

Over the years, I’ve read more than a few articles about the qualities that are found in the best administrators. These articles focus on soft skills, but sometimes will list hard skills as well. In all those years, and in all those articles, I rarely see advice on how those skills are to be applied. It’s as if the author expects the reader to just know what to do with the skills, once acquired.


No matter what type of administrator you are (network, database, systems, etc.), the best way to apply your skills is by being responsive and responsible. I wrote about this in my book more than eight years ago, and the advice is still true today.


Being responsive means you take action on an item. It matters not if the item or task is your responsibility. For example, if a disk fails and you aren’t a member of the server team, it’s not something for you to fix. But if someone is reaching out to you for help, you must be responsive. The person reaching out to you has no idea what tasks you are responsible for; they just need help. You want that customer to have the perception that you are responsive to their needs.


The hardest part of being responsive is the hours. It can be difficult to be responsive at all times of the day. And the better you get at your role, the more your services will be in demand.


Being responsible is taking ownership for something. It should be common sense that you would take responsibility for tasks that are central to your job role as an administrator. But you should also take responsibility for your mistakes. When something goes wrong it can be easy to deflect blame to other teams or specific people. You must resist that urge.


Here’s an example from my past life as a database administrator. It was 3 a.m. and I needed to rebuild a server. It failed because a LUN was erased by mistake (and that mistake replicated, quickly, which is why HA is not the same as DR, but I digress). I needed to restore the master database. We were using a third-party backup software product, and the master restore required some different syntax that didn’t want to work.


It took me longer than it should have to restore the master. I wasn’t happy. And I could have blamed the backup vendor, or the engineer that wiped out the LUN, or the manager that confused HA and DR. But I knew that wouldn’t help anyone. So, the next day I informed my managers that I could have done better. I outlined a training plan so that any member of my team would be able to perform the same tasks in the correct amount of time. I took responsibility.


I didn’t have to inform my managers about the delay. They have no idea how long it takes to restore a master database. But I wanted to show them that I was being responsible. I could have easily blamed others, and nobody would have thought twice.


Be responsive and responsible. I believe it pays off in the long run.

The image of a modern office has been rapidly changing in recent years. From the medical field, to sales, to technical support, users are on the road and working from just about everywhere possible. Firewalls, malware protection, and other security practices are great when a user is on-premises at your location, but what happens when they are working on an assignment at the coffee shop or showing off a report at a lunch meeting? How can you ensure these protections are installed and working correctly? The fact is that a network and systems perimeter logically cannot be defined simply by the walls of your location. Users are on the go, and this requires special accommodations from us in the support field.


Endpoint Protection

When a user is on location, it is obviously easier to deploy security policies against their device so that it uses your network securely. In the past, endpoint protection meant just providing users a firewall and some antivirus software. With malware and other intrusions getting more advanced, this can no longer be the extent of endpoint security. Modern endpoint protection has progressed to a point where endpoint agents can now contact an organization’s security management systems to get their policies, definitions, etc. This contrasts with the old way, where many things could only be controlled when a user was on-site. This approach allows an organization to provide maximum security to users on-site and on the road alike. It can be especially useful for users who are constantly on the go, as they will still receive security updates and configuration as soon as they are published while you -- the administrator -- can stay in the loop too.


Making Remote Network Access Easy

Security is often the number one priority (as it should be, in my opinion), but to an end-user, ease of use is often the most important. Their focus is often just getting their work done in the easiest and most hassle-free way possible. There are multiple ways to keep a remote user (and the company network) safe and secure while allowing them to work. A couple of the most popular methods are remote or virtual desktops and VPN connections with profiling. Virtual desktops provide remote access to users as if they were sitting at your main location. This could include everything from fileshare access to applications. This makes things easy and uniform company-wide as the virtual desktop image can be maintained, and the system that a user is accessing from is irrelevant. The virtual desktop would remain secure and under your control. The other option is VPN access with profiling. Best of both worlds, right? A user could access network resources with their own device and use their own applications. The profiling aspect would allow you as the administrator to ensure malware protection is installed, the firewall is on, policies are up to date, and the device is current with operating system updates. Both solutions have their merits and a place in certain situations depending on what you might be looking for.
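The profiling step can be pictured as a simple gate in front of the VPN. The following is only a sketch of that idea, using the four checks named above; the class and function names are hypothetical and do not correspond to any particular VPN or NAC product.

```python
from dataclasses import dataclass


@dataclass
class DevicePosture:
    """Hypothetical posture report sent by a remote device before it connects."""
    malware_protection_installed: bool
    firewall_enabled: bool
    policies_current: bool
    os_updates_current: bool


def failed_checks(posture: DevicePosture) -> list[str]:
    """Names of the posture checks that the device does not satisfy."""
    return [name for name, ok in vars(posture).items() if not ok]


def allow_vpn(posture: DevicePosture) -> bool:
    """Admit the device only when every posture check passes."""
    return not failed_checks(posture)


laptop = DevicePosture(True, False, True, True)
print(allow_vpn(laptop), failed_checks(laptop))
```

Returning the list of failed checks, rather than only a yes or no, gives the user something actionable, which matters when ease of use is their main concern.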


The Bottom Line

As I mentioned earlier, I believe security should be the number one focus (as a lot of you do too, I’m sure). I know in my day job, the security of the network is my main focus. In a lot of cases we can’t simply place the network and servers on lockdown to the outside world. That sure would be easier! Working to provide remote access where it's explicitly needed is something we as network and server admins will be tasked with even more going forward. It’s how we all approach that challenge that will determine the security of our organizations going forward.

Following my previous post about logging, I'd like to talk about another tool to manage logs that is more advanced than syslog.


Logwatch is essentially a system log analyzer and reporter. It processes the logs that syslog simply collects. This kind of evolution is simplifying the daily job of modern system and network administrators. Logs are everywhere and almost everything produces logs -- not only specific IT systems, but also the elements forming the so-called "Internet of Things," or IoT. The innovation comes from this last application: "things" producing logs could be managed in a smarter way if those logs are analyzed and reported, so that the appropriate behavior can be worked out and acted on.

This tool is very simple but, at the same time, very powerful. In its basic form it is CLI-only and organized into directories.


Shaping Behaviors

The real power (but not the only one) of Logwatch is the possibility to shape its behavior according to the administrator’s needs. Shaping the tool can be simple or more elaborate, but it should let administrators carry out their daily job as easily as possible, so they can redirect their attention to solving the issues that these tools reveal.


Customizing also means filtering, as described in my last post. Filtering is easier if the tool emphasizes the keywords to be filtered. So, the first customization should be catching these keywords using variables in the main configuration. As an example, we could be interested in raising an alert when a website returns a 500 error, but only after “n” occurrences rather than immediately, because we already know that this web server is backed by an application server that can be overloaded in particular known situations. There is no need to produce tons of logs when we already know that the issue occurs in the app server (THIS is the issue to solve, not the error coming from the web server).
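The "alert only after n errors" rule can be illustrated with a small sliding-window counter. This is a Python sketch of the idea, not Logwatch configuration (Logwatch's own scripts are written in Perl); the class name, threshold, and window length are assumptions chosen for the example.

```python
from collections import deque


class ThresholdAlert:
    """Signal an alert only once `threshold` 500 errors fall within `window_s` seconds."""

    def __init__(self, threshold: int, window_s: float) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._hits: deque[float] = deque()

    def observe(self, timestamp: float, status: int) -> bool:
        """Record one response; return True when an alert should be raised."""
        if status != 500:
            return False
        self._hits.append(timestamp)
        # Forget errors that are older than the window.
        while timestamp - self._hits[0] > self.window_s:
            self._hits.popleft()
        return len(self._hits) >= self.threshold


alert = ThresholdAlert(threshold=3, window_s=60)
for ts, status in [(0, 500), (10, 200), (20, 500), (30, 500)]:
    print(ts, status, alert.observe(ts, status))
```

In this run only the third 500 error within the window returns True, so the isolated errors that the overloaded app server is already known to cause stay quiet.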


Preventive modeling

This process implies a previous analysis by the administrator, as you can understand. As usual, this is a tool, and it works according to your decisions. It can’t make its own decisions. Other more advanced tools could give inputs and advice, but not these basic tools.


Different services may have different logging configurations. At the same time, some services could be ignored, others overridden. The security level matters too: some services are not so critical, so their logging can be less detailed, while others are security-sensitive and need a very high volume of logs. In the case of Logwatch, these different behaviors are written in Perl.


Conversely, there are cases when we need not different behavior but a homogeneous output, as with IoT components produced by different vendors that log in various ways. For easier and more effective comprehension, customizing how logs are written and processed can “normalize” the logs written by any of them, which helps when comparing their behaviors.

Syslog evolution

We can compare an advanced tool like Logwatch with more basic ones like syslog mainly on scripting and the ability to choose how to behave in different situations. Syslog only allows the administrator to filter the most representative lines for troubleshooting. Logwatch can build a particular structure for the line itself and then filter based on that shape, which makes it a friendlier helper for the advanced administrator. In this way, troubleshooting and monitoring improve, and so do analysis and the development of new models to apply to log writing.

May has arrived and so has some warmer weather. It feels good to be able to sit outside. I'm enjoying it while I can; we have only a few weeks of warm weather before the mosquitos arrive.


As always, here are some links from the Intertubz that I hope will hold your interest. Enjoy!


Twitter urges all users to change passwords after glitch

Just in case you didn’t hear about this, yet. Twitter came forward to say that a “bug” had accidentally caused user passwords to be stored in clear text. I’d like more technical details on how this bug could happen. In the meantime, change your passwords. Or don’t and use this as an excuse to tweet weird things and then say you were hacked.


Home DNA Kits: What Do They Tell You?

Not much, even for the dog. Only one company identified the DNA as not human. Something to think about for those folks that are thinking of spending money on such services.


The Gambler Who Cracked the Horse-Racing Code

On the heels of the Kentucky Derby comes this riveting story of a man who built and has used a predictive model for horse racing to win $1 billion over the past 30 years. The next time someone says data analytics isn’t a real thing, show them this story.


Police Tested Facial Recognition at a Major Sporting Event. The Results Were Disastrous

Then again, maybe data analytics still has a way to go. Or maybe we should put some prize money behind facial recognition efforts. If someone knew they could make a billion dollars with the right predictive model, I’m certain we would have a solution by now.


Unroll.me to close to EU users saying it can’t comply with GDPR

If you really want people to know that you don’t care about their privacy, just tell the world you won’t comply with the GDPR. This single act tells me everything I need to know about whether or not I can trust Unroll.me with my data. (The answer is "no.")


Yes, it’s Bad. Robocalls, and Their Scams, are Surging.

Robocalls are increasing. People are mad. And nothing is being done to stop the volume of spam phone calls from increasing.


This Lego Breakfast Machine Can Make You Bacon And Eggs

Father’s Day is coming up. #JustSayin


Spring is finally here, and we all know what that means:


By Paul Parker, SolarWinds Federal & National Government Chief Technologist


It’s no secret that public cloud is becoming an increasingly popular option for enterprises to adopt as a means of storing data in a way that is easily accessible. To encourage the public sector in the UK to take the steps required to adopt public cloud, their government introduced the “Cloud First” policy in 2013; the policy states that when making technology decisions, all public sector organizations should consider using the public cloud before other options.


A recent Freedom of Information request that SolarWinds conducted found that, despite this policy, less than two thirds (61%) of central government departments have adopted public cloud in their organization. Of the departments with 25% or less public cloud usage, 65% attributed their lack of adoption to their legacy technology, half blamed security concerns, and 35% claimed that a lack of skills prevented them from using public cloud more.


The research also revealed that over a third (35%) of central government departments with low public cloud adoption are trying to monitor their on-premises technology and their cloud services with different monitoring tools, making it nearly impossible to manage the whole landscape accurately.


Lost in legacy tech


In recent years, the UK government has paid for on-premises solutions and there is little incentive to move away from this technology. Although fit for purpose now, departments will need to embrace the public cloud to ensure that they are able to maintain their services.


Full cloud adoption is unrealistic, but part of the problem with only adopting a hybrid cloud environment is that there can be added complexity. One method of easing the transition would be to strategically plan a smooth integration between cloud and legacy technology. Another would be to use specially designed monitoring tools that can manage across both environments and reduce the complexity.


No penalties means no need


One of the reasons that 39% of departments haven’t yet implemented public cloud over the last five years is that there are no consequences for not complying with the policy.


What needs to change?


One step that should be taken sooner rather than later is to implement incentives for adhering to the policy. Organizations across the public sector of the UK that proactively adopt public cloud could receive benefits, such as additional funding or special training, to reward their efforts. As an alternative, those that haven’t demonstrated an effort to consider cloud adoption could suffer a loss of budget or resources.


The United States supports its cloud initiatives with a US Federal Government certification called FedRAMP. This certification provides a common set of controls under which public cloud providers have been judged to be secure by the government and a Third-Party Assessment Organization (3PAO). This allows agencies and bureaus to leverage the cloud environment with an assurance of availability and security. With a similar initiative in the UK, the public sector may feel more confident about adopting public cloud, and therefore would potentially be able to realize many of the benefits that come from proper cloud adoption.


Find the full article on Open Access Government.

Building a monitoring and alerting system should always be driven by your business needs. This is always a debate between the IT organization, which tends to focus on granular measures, and the business users, who would like to see more of an end-to-end picture of the organization. An example of this would be uptime--as a DBA, if my database is available and servicing requests, I feel as though I’ve met my uptime goals, whatever they may be. However, if a load balancer goes down, taking away access to the application tier, the application is unavailable to users, and that is all that matters. Building a monitoring solution that looks at systems holistically is challenging, and sometimes requires working backwards from desired monitoring objectives (is the system up?) to choosing indicators (is the database service available and writeable?), and then building a target.


Defining Service Level Objectives


You want to focus on what your users care about, and not necessarily what is easy to measure. There are two main areas you will want to use to define these objectives--performance and uptime. One notion that comes from Google’s Site Reliability Engineering is the error budget--a rate at which these service level objectives can be missed. Additionally, having an error budget can allow you to be more aggressive with upgrades and resolving technical debt. While evaluating projects and change control efforts, you can know that if you are well ahead of your SLO budget you can be more aggressive with rollout. If you are behind the curve, you may curtail some migration efforts.
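
To make the error budget idea concrete, here is a minimal sketch of the arithmetic for an uptime SLO. The 30-day window and the function names are assumptions chosen for illustration; your own SLO period and how you count downtime are things to agree on with the business.

```python
def error_budget_minutes(slo_percent, period_days=30):
    """Minutes of downtime allowed in the period without missing the SLO."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent, downtime_minutes, period_days=30):
    """Positive: budget left to spend on risky changes. Negative: SLO already missed."""
    return error_budget_minutes(slo_percent, period_days) - downtime_minutes


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43 minutes of downtime.
    print(round(error_budget_minutes(99.9), 1))
    # After a 30-minute outage, only about 13 minutes remain for the rest of the period.
    print(round(budget_remaining(99.9, 30), 1))
```

A team with most of its budget left late in the period could reasonably push a riskier rollout; a team that has already spent it would hold back, which is the trade-off described above.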


Target Values for SLOs


Target values will be a negotiation between IT and the business. From an IT perspective it is important to not overpromise--for example, if you only have one physical server in your stack, you probably aren’t going to reach 99.99% uptime. This is important for a few reasons, but in my opinion the biggest is helping the business users understand the correlation between resource cost and availability. In the above one-server example, if the business wants that application to deliver 99.99% uptime, it is going to have to invest in redundancy at several levels. There are a few other tenets to think about:


  • Past performance isn’t a predictor of future performance--While building a performance target off of your historic baseline is a good start, it does not address the problem of a system that performs well at its current level, but that will fall off a cliff without a major reengineering effort.
  • Don’t Overthink Your Targets--While it may be tempting to bring in someone from the data science team to create your new targets using a complex machine learning K-means clustering algorithm, you are better off creating simple targets like percentage uptime and throughput. If you can’t explain your target in a sentence, it is likely too complex.
  • Absolutes are bad--The notion of a system that is always available and can scale infinitely is completely unrealistic. Even hyperscale cloud providers have difficulties delivering 99.999% uptime. It’s better to promise what you can deliver and make the business understand what the cost of delivering more is.
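
The link between redundancy and achievable uptime can be sketched with the standard availability formulas. This is a simplified model that assumes component failures are independent, which real systems rarely satisfy perfectly; the availability figures below are made-up inputs, not measurements.

```python
def parallel(availability, copies):
    """Availability of redundant copies where any one copy is enough.

    Assumes failures are independent (a simplifying assumption).
    """
    return 1 - (1 - availability) ** copies


def series(*availabilities):
    """Availability of a chain where every component must be up."""
    result = 1.0
    for value in availabilities:
        result *= value
    return result


if __name__ == "__main__":
    single_server = 0.99  # hypothetical figure for one physical server
    print(parallel(single_server, 2))  # two redundant servers
    # A load balancer in front of a database: both must be up for users.
    print(series(0.999, 0.999))
```

Two takeaways follow from the model: adding a redundant copy improves availability sharply, and every extra component in the request path (such as the load balancer in the earlier example) lowers what the end user actually sees.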


This process allows you to set clear expectations with your business and reduces some of the finger pointing during outages. It does require a strong relationship between IT management and senior leadership of your organization, but in the end delivers IT that can be kept up to date while meeting the business needs of the organization.




At some point, your first storage system will be “full." I’m writing it as “full” because the system might not actually be 100% occupied with data at that exact point in time. The system could be full for another technical reason. For example, shared components in a system (e.g. CPUs) are overloaded before you ever install the maximum amount of drives, and upgrading those would be too expensive. Or it could be an administrative decision that has made you decide to not hand out new capacity from an existing system. For example, you’re expecting a rapid organic growth of several thin provisioned volumes, which would soon fully utilize the capacity headroom of the current system.


The fact that a single system has reached the maximum capacity, either for technical or administrative purposes, does not mean you need to turn away customers. IT should be a facilitator to the business. If the business needs to store additional data, there’s often a good reason for it. In health care it could be storing medical images. For a service/cloud provider, hosting more (paying) customers. So instead of communicating “Sorry, we’re full, go somewhere else!”, we should say something in the direction of “Yes, we can store your data, but it’s going to land on a different system”. In fact, just store the data and leave out the system part!


More of the same?

When your first system is full and you’re buying another one, you could buy a similar system and install it next to the original one. It might be a bit faster, or a bit more tuned for capacity. Or completely identical, if you were happy with the previous one.


On the other hand, this might also be a good moment to differentiate between the types of data in your company. For example, if you’ve started out with a block storage, maybe this is the time to buy a NAS and offload some of the file data to it.


Regardless of type, introducing a second system will create a couple of challenges for the IT department. First, you’ll now have to decide which system you want to land new data on. With identical type systems, it might be a fill and spill principle where you fill up the first system and then move over to the second box.
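
A fill and spill placement rule can be sketched in a few lines. The system names, the dictionary layout, and the 20% headroom reserve are assumptions for illustration, not a recommendation for any particular array.

```python
def place_volume(systems, size_gb, headroom=0.2):
    """Place a new volume on the first system that still has room.

    systems  -- ordered list of dicts with "name", "capacity_gb" and "used_gb"
    headroom -- fraction of capacity kept free (hypothetical 20% reserve)
    Returns the name of the chosen system, or None if nothing fits.
    """
    for system in systems:
        usable_gb = system["capacity_gb"] * (1 - headroom)
        if system["used_gb"] + size_gb <= usable_gb:
            system["used_gb"] += size_gb
            return system["name"]
    return None


if __name__ == "__main__":
    arrays = [
        {"name": "array-1", "capacity_gb": 1000, "used_gb": 750},
        {"name": "array-2", "capacity_gb": 1000, "used_gb": 0},
    ]
    print(place_volume(arrays, 40))  # still fits on the first system
    print(place_volume(arrays, 40))  # first system is "full", spills to the second
```

Note that the first system counts as full well before it reaches 100% occupancy, because the reserve is held back for growth of existing volumes.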


Once you introduce different types and speeds of systems though, you need to differentiate between types of storage and the capabilities of systems. Some data might be better suited to land on a NAS, other data on a spinning disk SAN, and another flavor of data on an all-flash SAN array. And you need to keep track of which clients/devices are attached to which systems, so documentation and a clear naming convention are paramount.


Keeping it running

Then there’s the challenge of keeping all the storage systems running. You can probably monitor a handful of systems with the in-box GUI, but that doesn’t scale well. At some point, you need to add at least central monitoring software, to group all the alerts and activities in a single user interface. Even better would be central management, so you don’t have to go back to the individual boxes to allocate LUNs and shares.


With an increasing number of storage systems comes an increasing number of attached servers and clients. Ensuring that all clients, interconnects and systems are on the right patch levels is a vertical task across all these layers. You should look at the full stack to ensure you don’t break anything by patching it to a newer level.


If you glue too many systems together, you’ll end up with a spaghetti of shared systems that make patch management difficult, if not impossible. Some clients will be running old software that prevents other layers (like the SAN or storage array) from being patched to the newest levels. Other attached clients might rely on these newer codes because they run a newer hypervisor. You’ll quickly end up with a very long string of upgrades that need to be performed, before you’re fully up to date and compliant. So, it’s probably best to create building blocks of some sort.


How do you approach the “problem” of growing data? Do you throw more systems at it, or upgrade capacity/performance of existing systems? And how do you ensure that the infrastructure can be managed and patched? Let me know!

Patch management at any sort of scale has always been a mundane and time-consuming task that most administrators would like to avoid at all costs. With the proliferation of DevOps methodologies and the public cloud, the practice of immutable infrastructure has eliminated the need for patch management in the eyes of some, given the fact that there would be no long-living servers. In contrast to that notion, most environments have long-living servers that are still around and will be for the foreseeable future due to various reasons. The public cloud and DevOps are the new flavors of the month in IT for many valid reasons, but patch management is still a critical aspect of securing IT environments that can be made easier through the use of managed solutions.


The benefits of managed patch management are:

  1. Simplified Management - The patch management solutions offered by cloud providers provide a single management interface to simplify operations. In addition to the proverbial single pane of glass, most cloud providers offer a simplified way to deploy the patch management agents to instances, which helps speed up deployment.
  2. Scalability - Fully managed solutions have been built to scale to the largest of environments without any performance impact. This eliminates the need to rearchitect the patch management deployment to scale with the needs of the organization.
  3. Managed Upgrades - One of the advantages of utilizing a fully managed patch management solution is the fact that the system for managing patches is automatically patched itself. This is a major win for many organizations that are already short on IT staff.


Managed Deployment

The following solutions are managed deployments. This means the patch management software company has added a deployment solution to the respective cloud provider's marketplace to allow the infrastructure to be provisioned with the click of a button.


ManageEngine Patch Manager Plus

ManageEngine Patch Manager Plus is a patch management solution that supports Windows, Linux and Mac OS endpoints. This solution is only available on AWS as a marketplace deployment option.


SaaS Deployment

The following solutions are Software as a Service (SaaS) deployments. This means the patch management software company hosts the software for its customers.


Kaseya VSA

Kaseya VSA is an RMM management platform created by Kaseya that includes patch management functionality. The patch management solution includes support for Windows, Mac OS X and 3rd party software.



Automox

Automox is a next generation patch management platform hosted in AWS that aims to provide a unified platform for managing patches across all environments. The patch management solution includes support for Windows, Mac OS X, Linux and 3rd party software.


Fully Managed

The following solutions are fully managed patch management solutions such that the cloud provider manages your patch management platform on your behalf and allows engineers to focus on ensuring that instances are up-to-date with their patches.


AWS Systems Manager (Patch Manager)

Patch Manager is AWS' managed patch management solution that rolls up underneath AWS Systems Manager. Patch Manager supports both Linux and Windows operating systems as well as on-premises workloads.


Azure Automation (Update Management)

Update Management is Azure's managed patch management solution that rolls up underneath Azure Automation. Azure Automation Update Management supports both Linux and Windows operating systems.


Patch management for many is simply a necessary evil that often goes overlooked but has a critical impact on the security posture of all environments. Leveraging a managed solution for patch management helps to make life that much easier for administrators. Patch management doesn't provide any business value for most organizations, but it has to be done lest the organization become another headline about a security breach due to unpatched systems.

Striking a balance between good visibility into infrastructure events and too much noise is difficult. I’ve worked in plenty of enterprise environments where multiple tools are deployed (often covering the same infrastructure elements) to monitor for critical infrastructure events. They are frequently deployed with minimal customization and seen as a pain to deal with by the teams who aren’t directly responsible for them. They’re infrequently updated, and it’s a lot of hard work to get all of the elements of your infrastructure covered. Some of this is due to the tools being difficult to deploy, and some of it is due to a lack of available resources, but mostly the problem is people.


Information security is a similar beast to deal with. There are many tools available to help monitor your environment for security breaches and critical issues, and lots of data centers with multiple tools installed. Yet, enterprises continue to suffer from having poor visibility into their environments. 


Syslog is an underrated tool. I call it the blue-collar worker of the DC. It will happily sit in your environment and collect your log data for you, ready to dish up information about problems and respond to issues at a moment’s notice. All you have to do is tell your devices to send information in its direction and it takes care of the rest. It catches the messages, stores them, analyzes them, and sends you information when there’s something you should probably look at.


But it’s not just a useful way to view events in your DC. It can also be a very useful tool when it comes to managing security incidents. 


First, set up sensible alerting in your syslog environment. An important consideration is understanding what you want to alert on. Turn on too much information and the alerts will get sent to someone’s trash file rather than read. Turn on too little and you won’t catch the minor issues before they become major ones. You should also think about the right mechanism to use for alerting. Some people work well with email, while others prefer messages or dashboard access.


The second thing is to understand what you want to look for in your events. Events such as logins using local credentials when your devices use directory services are an example of something that should trigger an alert. The key is to understand what’s normal in your environment and alert on things that aren’t. It can take some time to understand what is normal, but it’s worth the effort.
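
As one way to picture the local-credentials example, the sketch below scans collected log lines for successful logins by accounts that should not normally be used. The set of local account names is an assumption, and the sample lines imitate the "Accepted password for ..." message that OpenSSH writes; other devices will use different wording and need their own patterns.

```python
import re

# Matches successful SSH logins in the style OpenSSH logs them (assumed input format).
LOGIN_RE = re.compile(r"Accepted \S+ for (?P<user>\S+) from (?P<source>\S+)")

# Hypothetical local accounts; everyone else is expected to use directory services.
LOCAL_ACCOUNTS = {"admin", "root"}


def local_login_alerts(lines):
    """Return (user, source) pairs for logins that used a local account."""
    alerts = []
    for line in lines:
        match = LOGIN_RE.search(line)
        if match and match.group("user") in LOCAL_ACCOUNTS:
            alerts.append((match.group("user"), match.group("source")))
    return alerts


if __name__ == "__main__":
    sample = [
        "May  9 10:00:01 sw1 sshd[101]: Accepted password for admin from 10.0.0.5 port 51515 ssh2",
        "May  9 10:02:11 sw1 sshd[102]: Accepted password for jdoe from 10.0.0.6 port 51516 ssh2",
    ]
    for user, source in local_login_alerts(sample):
        print(f"local login by {user} from {source}")
```

The rule encodes one piece of "what is normal" for the environment; the list of accounts and patterns would grow as you learn more about your own baseline.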


Third, you must understand how to respond when an alert comes your way. Syslog is a great tool to have running in the DC because it centralizes your infrastructure logging and provides a single place to look when things go awry. But you still need to evaluate the severity of the problem, the available resolutions, and the impact on the business.


The key to managing security events with syslog is to have the right information at hand. Syslog gives you the information you need in terms of what, when, and who. It can’t always tell you how it happened, but having that information in one place makes working that out easier.


Infrastructure operations can be challenging, particularly when there’s been some kind of security incident. Having the right tools in place gives you a better chance of getting through those events without too many problems and your sanity intact.
