Geek Speak

8 Posts authored by: gregwstuart

When it comes to IT, things go wrong from time to time. Servers crash, memory goes bad, power supplies die, files get corrupted, backups get corrupted...there are so many things that can go wrong. When things do go wrong, you work to troubleshoot the issue and end up bringing it all back online as quickly as humanly possible. It feels good; you might even high-five or fist-bump your co-worker. For the admin, this is a win. However, for the higher-ups, this is where the finger pointing begins. Have you ever had a manager ask you “So what was the root cause?” or say “Let’s drill down and find the root cause”?

 

 

I have nightmares of having to write after action reports (AARs) on what happened and what the root cause was. In my imagination, the root cause is a nasty monster that wreaks havoc in your data center, the kind of monster that lived under your bed when you were 8 years old, only now it lives in your data center. This monster barely leaves a trace of evidence as to what he did to bring your systems down or corrupt them. This is where a good systems monitoring tool steps in to save the day and help sniff out the root cause. 

 

Three Things to Look for in a Good Root Cause Analysis Tool

A good root cause analysis (RCA) tool can accomplish three things for you, which together give you the best shot at pinpointing what the root cause most likely is and how to prevent it in the future. 

  1. A good RCA tool will…be both reactive and predictive. You don’t want a tool that simply points to logs or directories where there might be issues. You want a tool that can describe what happened in detail and point to the location of the issue; you can't begin to track down the issue if you don’t understand what happened and have a clear timeline of events. Beyond that, the tool should learn patterns of activity within the data center so it can become predictive and warn you when it sees things going downhill. 
  2. A good RCA tool will…build a baseline and continue to update that baseline as time goes by. The idea here is for the RCA tool to learn what “normal” looks like in your environment: the usual set of activities and events that take place within your systems. Once a consistent and accurate baseline is established, the RCA tool can be much more accurate about what the root cause might be when something happens outside of that norm (see the sketch after this list). 
  3. A good RCA tool will…sort out what matters, and what doesn’t matter. The last thing you want is a false positive when it comes to root cause analysis. The best tools can accurately separate false positives from real events that can do serious damage to your systems. 
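
To make the baseline idea concrete, here is a minimal, hypothetical sketch (not tied to any particular product) of how a monitoring tool might maintain a rolling baseline for a single metric and flag samples that fall outside it:

```python
from collections import deque
from statistics import mean, stdev

class MetricBaseline:
    """Rolling baseline for one metric (e.g., CPU %), flagging outliers."""

    def __init__(self, window=288, threshold=3.0):
        self.samples = deque(maxlen=window)  # e.g., 24 hours of 5-minute samples
        self.threshold = threshold           # how many std-devs counts as "abnormal"

    def observe(self, value):
        """Record a sample and return True if it deviates from the learned baseline."""
        is_outlier = False
        if len(self.samples) >= 5:           # tiny minimum for the demo; use far more history in practice
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and abs(value - mu) > self.threshold * sigma:
                is_outlier = True
        self.samples.append(value)           # the baseline keeps updating as time goes by
        return is_outlier

baseline = MetricBaseline()
for cpu_pct in [22, 25, 24, 23, 26, 95]:     # toy data; the spike should stand out
    if baseline.observe(cpu_pct):
        print(f"Anomaly: CPU at {cpu_pct}% is outside the learned baseline")
```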

 

Use More Than One Method if Necessary

Letting your RCA tool become a crutch to your team can be problematic. There will be times that an issue is so severe and confusing that it’s sometimes necessary to reach out for help. The best monitoring tools do a good job of bundling log files for export should you need to bring in a vendor support technician. Use the info gathered from logs, plus the RCA tool output and vendor support for those times when critical systems are down hard, and your business is losing money every minute that it’s down.

If you are old enough, you might remember the commercial that played late at night and went something like this: “It’s 10 p.m., do you know where your children are?” It was a super-short commercial that ran on late night TV in the 1980s, and it always kind of creeped me out a bit. So the title of this post is slightly different, changing the time to 2 a.m., because accessing your data is much more than just a 10 p.m. or earlier affair these days. We want access to our data 24/7/365! The premise of that commercial was all about the safety of children after “curfew” hours. If you knew your children were asleep in their beds at 10 p.m., then you were good. If not, you had better start beating the bushes to find out where they were. Back then you couldn’t just send a text saying “Get home now!!!” with an angry emoji. Things are different now: we’re storing entire data centers in the cloud, so I think it’s time to look back at this creepy commercial from late night 80s TV and apply it to our data in the cloud. “It’s 2 a.m., do you know where your data is?”

 

Here are some ways we can help ensure the safety of our data and know where it is, even at 2 a.m.

 

Understanding the Cloud Hosting Agreement

Much like anything else, read the fine print! How often do we actually read it? I’m guilty of rarely reading it unless I’m buying a house or dealing with a legal matter. But for a cloud hosting agreement, you need to read the fine print and understand what you are gaining or losing by choosing that provider for your data. I’m going to use Amazon Web Services (AWS) as an example (this is by no means an endorsement of Amazon). Amazon has done a really good job of publishing the fine print in a way that’s actually easy to read and won’t turn you blind. Here’s an excerpt from the data privacy page on their website:

Ownership and Control of customer content:

Access: As a customer, you manage access to your content and user access to AWS services and resources. We provide an advanced set of access, encryption, and logging features to help you do this effectively (such as AWS CloudTrail). We do not access or use your content for any purpose without your consent. We never use your content or derive information from it for marketing or advertising.

Storage: You choose the AWS Region(s) in which your content is stored. We do not move or replicate your content outside of your chosen AWS Region(s) without your consent.

Security: You choose how your content is secured. We offer you strong encryption for your content in transit and at rest, and we provide you with the option to manage your own encryption keys. 

 

 

Choose Where Your Data Lives

Cloud storage companies don’t put all their eggs, or more accurately, all your eggs, in one basket. Amazon, Microsoft, and Google have data centers all over the world. Most of them allow you to choose which region you wish to store your data in. Here’s an example with Microsoft Azure: in the U.S. alone, Microsoft has 8 data centers from coast to coast where your data is stored at rest. The locations are Quincy, WA; Santa Clara, CA; Cheyenne, WY; San Antonio, TX; Des Moines, IA; Chicago, IL; Blue Ridge, VA; and Boydton, VA. AWS offers customers the choice of which region they wish to store their data in, with the assurance that they “won’t move or replicate your content outside of your chosen AWS Region(s) without your consent (https://aws.amazon.com/compliance/data-privacy-faq/).” With both of these options, it’s easy to know where your data lives at rest.
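
As a minimal sketch of what “choosing where your data lives” looks like in practice, here is how you might pin an S3 bucket to a specific AWS Region with boto3 and then confirm where it landed (the bucket name and Region below are just placeholders):

```python
import boto3

REGION = "us-west-2"                      # placeholder: the Region you want your data to live in
BUCKET = "example-bucket-name-12345"      # placeholder: bucket names must be globally unique

s3 = boto3.client("s3", region_name=REGION)

# Create the bucket in the chosen Region; outside us-east-1 you must pass a LocationConstraint.
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={"LocationConstraint": REGION},
)

# Verify where the data actually lives at rest.
location = s3.get_bucket_location(Bucket=BUCKET)["LocationConstraint"]
print(f"{BUCKET} is stored in: {location}")
```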

 

 

Figure 1 Microsoft Azure Data Centers in the US

 

Monitor, Monitor, Monitor

It all comes back to systems monitoring. There are so many cloud monitoring tools out there. Find the right one for your situation and configure it to monitor your systems the way you want them to be monitored. Create custom dashboards, run health checks, manage backups, and make sure it’s all working how you want it to work. If there is a feature you wish was included in your systems monitoring tool, ping the provider and let them know. For most good companies, feedback is valued and feature requests are honestly considered for future implementation. 

Systems monitoring has become a very important piece of the infrastructure puzzle. There might not be a more important part of your overall design than having a good systems monitoring practice in place. There are good options for cloud hosted infrastructures, on-premises, and hybrid designs. Whatever situation you are in, it is important that you choose a systems monitoring tool that works best for your organization and delivers the metrics that are crucial to its success. When the decision has been made and the systems monitoring tool(s) have been implemented, it’s time to look at the best practices involved in ensuring the tool works to deliver all it is expected to for the most return on investment.

 

 

The term “best practice” has been known to be overused by slick salespeople the world over; however, there is a place for it in the discussion of monitoring tools. The last thing anyone wants to do is purchase a monitoring tool and install it, just for it to slowly die and become shelfware. So, let’s look at what I consider to be the top 5 best practices for systems monitoring. 

 

1. Prediction and Prevention              

We’ve all heard the adage that “an ounce of prevention is worth a pound of cure.”  Is your systems monitoring tool delivering metrics that help point out where things might go wrong in the near future? Are you over-taxing your CPU? Running out of memory? Are there networking bottlenecks that need to be addressed? A good monitoring tool will include a prediction engine that will alert you to issues before they become catastrophic. 
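
As an illustration of the kind of prediction engine I mean, here is a small, generic sketch (not modeled on any particular product) that fits a linear trend to recent disk-usage samples and estimates how many days remain before the volume fills up:

```python
from statistics import linear_regression  # Python 3.10+

# Hypothetical daily disk-usage samples for one volume, in percent used.
days = [0, 1, 2, 3, 4, 5, 6]
used_pct = [61.0, 62.5, 63.8, 65.2, 66.9, 68.1, 69.6]

slope, intercept = linear_regression(days, used_pct)   # growth rate (% per day) and starting point

if slope <= 0:
    print("Usage is flat or shrinking; no capacity alert needed.")
else:
    days_until_full = (100.0 - used_pct[-1]) / slope
    print(f"Growing ~{slope:.2f}%/day; volume full in roughly {days_until_full:.0f} days.")
    if days_until_full < 30:
        print("ALERT: raise a capacity warning before this becomes an outage.")
```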

 

2. Customize and Streamline Monitoring        

For an administrator tasked with implementing systems monitoring, the job can bring lots of anxiety and visions of endless, seemingly useless emails filling up your inbox. It doesn’t have to be that way. The admin needs to triage what will trigger an email alert and customize the reporting accordingly. Along with email alerts, most tools allow you to create custom dashboards to monitor what is most important to your organization. Without some level of customization, systems monitoring can quickly become an annoying, confusing mess.
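
To show what that triage might look like, here is a tiny, hypothetical example of routing alerts by severity so that only the events you care about generate email (the metrics, thresholds, and destinations are made up):

```python
# Hypothetical triage rules: metric thresholds mapped to a severity and a destination.
RULES = [
    {"metric": "cpu_pct",  "threshold": 95, "severity": "critical", "notify": "email+page"},
    {"metric": "cpu_pct",  "threshold": 80, "severity": "warning",  "notify": "dashboard"},
    {"metric": "disk_pct", "threshold": 90, "severity": "critical", "notify": "email"},
    {"metric": "disk_pct", "threshold": 75, "severity": "info",     "notify": "dashboard"},
]

def triage(metric, value):
    """Return the most severe rule the sample violates, or None if all is well."""
    for rule in sorted(RULES, key=lambda r: r["threshold"], reverse=True):
        if rule["metric"] == metric and value >= rule["threshold"]:
            return rule
    return None

hit = triage("disk_pct", 92)
if hit:
    print(f"{hit['severity'].upper()}: disk_pct=92 -> notify via {hit['notify']}")
else:
    print("No alert; keep it off the inbox.")
```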

 

3. Include Automation

Automation can be a very powerful tool and can save the administrator a ton of time. In short, automation makes life better, so long as it’s implemented correctly. Many tools today have an automation feature where you can either create your own automation scripts or choose from a list of common, out-of-the-box automation scripts. This best practice goes along with the first one in this list, prediction and prevention. When the tool notices that a certain VM is running low on memory, it can reach back to vCenter and add more memory before it’s too late, assuming it has been configured to do so. This makes life much easier, but proceed with caution, as you don’t want your monitoring tool doing too much. It’s easy to be overly aggressive with automation. 
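
As a rough sketch of that kind of remediation, here is how a script might grow a VM’s memory through vCenter using the pyVmomi SDK. Treat this as illustrative only: the host name, credentials, and VM name are placeholders, memory hot-add must already be enabled on the VM, and real automation needs guardrails (approvals, caps, logging):

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

context = ssl._create_unverified_context()            # lab-only; use proper certificates in production
si = SmartConnect(host="vcenter.example.com",         # placeholder vCenter
                  user="automation@vsphere.local",    # placeholder account
                  pwd="********", sslContext=context)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "app-vm-01")   # placeholder VM name

    # Only act if memory hot-add is enabled and we stay under a sane cap.
    current_mb = vm.config.hardware.memoryMB
    if vm.config.memoryHotAddEnabled and current_mb < 16384:
        spec = vim.vm.ConfigSpec(memoryMB=current_mb + 2048)   # grow by 2 GB
        vm.ReconfigVM_Task(spec=spec)
        print(f"Requested memory increase for {vm.name}: {current_mb} -> {current_mb + 2048} MB")
finally:
    Disconnect(si)
```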

 

4. Documentation Saves the Day

Document, document, document everything you do with your systems monitoring tool. The last thing you want is to have an alert come up and the night shift guy on your operations team not know what to do with it. “Ah, I’ll just acknowledge the alarm and reset it to green, I don’t even know what IOPS are anyways.” Yikes! If you have a “run book” or manual that outlines everything about the tool, where to look for alerts, who to call, how to log in, and so on, then you can relax and know that if something goes wrong, you can rely on the guy with the manual to know what to do. Ensure that you also track changes to the document because you want to monitor what changes are being made and check that they are legit, approved changes.

 

5. Choose Wisely

Last, but definitely not least, pick the right tool for the job. If you migrated your entire workload to the cloud, don’t mess around with an on-premises solution for systems monitoring; use the cloud provider’s own tool and run with it. That being said, get educated on their tool and make sure you can customize it to your liking. Don’t pick a tool based on price alone. Shop around and focus on the options and customization available in the tool. Always choose a tool that achieves your organization's goals in systems monitoring. The latest isn’t always the greatest.

 

Putting monitoring best practices in place is a smart way to approach a plan to help ensure your tool of choice is going to perform its best and give you the metrics you need to feel good about what’s going on in your data center.

What happens to our applications and infrastructure when we place them in the cloud?  Have you ever felt like you’ve lost insight into your infrastructure after migrating it to the cloud?  There seems to be a common complaint among organizations that at one point in time had an on-premises infrastructure or application package. After migrating those workloads to the cloud, they feel like they don’t have as much ownership and insight into it as they used to.

 

That is expected when you migrate an on-premises workload to the cloud: it no longer physically exists within your workplace.  On top of your applications or infrastructure being out of your sight physically, there is now a web service (depending on the cloud service) that adds another layer of separation between you and your data. This is the world we now live in; the cloud has become a legitimate option to store not only personal data, but enterprise data and even government data. It’s going to be a long road to 100% trust in storing workloads in the cloud, so here are some ways you can still feel good about monitoring your systems/infrastructures/applications that you’ve migrated to the cloud.

 

Cloud Systems Monitoring Tools

Depending on your cloud hosting vendor, you may have some built-in tools that you can use to maintain visibility into your infrastructure and applications. Here’s a look at each of the big players in the cloud hosting game and the built-in tools they offer for systems monitoring:

 

Amazon Web Services CloudWatch

AWS has become a titan in the cloud hosting space, and it doesn't look like they're slowing down anytime soon. Amazon offers a utility called Amazon CloudWatch that gives you visibility into your cloud resources and applications. CloudWatch allows you to see metrics such as CPU utilization, memory utilization, and other key metrics that you define. Amazon’s website summarizes CloudWatch as follows:

“Amazon CloudWatch is a monitoring and management service built for developers, system operators, site reliability engineers (SRE), and IT managers. CloudWatch provides you with data and actionable insights to monitor your applications, understand and respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health. CloudWatch collects monitoring and operational data in the form of logs, metrics, and events, providing you with a unified view of AWS resources, applications and services that run on AWS, and on-premises servers. You can use CloudWatch to set high resolution alarms, visualize logs and metrics side by side, take automated actions, troubleshoot issues, and discover insights to optimize your applications, and ensure they are running smoothly (AWS CloudWatch).”
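
For a feel of what pulling those metrics looks like, here is a minimal boto3 sketch that fetches an EC2 instance’s average CPU utilization for the last hour (the instance ID and Region are placeholders):

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")   # placeholder Region

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder instance
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,                # 5-minute datapoints
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Average"]:.1f}%')
```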

 

Microsoft Azure Monitor:

Azure Monitor is a monitoring tool that lets users navigate key metrics gathered from applications, application logs, the guest OS, host VMs, and activity logs within the Azure infrastructure. Azure Monitor visualizes those key metrics through graphics, portal views, dashboards, and different charts. Through Azure Monitor’s landing page, admins can onboard, configure, and manage their infrastructure and application metrics. Microsoft describes Azure Monitor as follows:

“Azure Monitor provides base-level infrastructure metrics and logs for most services in Microsoft Azure. Azure services that do not yet put their data into Azure Monitor will put it there in the future (MS Azure website)… Azure Monitor enables core monitoring for Azure services by allowing the collection of metrics, activity logs, and diagnostic logs. For example, the activity log tells you when new resources are created or modified.”
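
Here is a comparable sketch using Python’s azure-monitor-query and azure-identity packages to pull a VM’s CPU metric. The resource ID is a placeholder, and metric names and SDK details can differ by service and version, so treat this as an assumption-laden illustration rather than the definitive way to use Azure Monitor:

```python
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Placeholder resource ID of the VM you want metrics for.
RESOURCE_ID = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.Compute/virtualMachines/<vm-name>"
)

client = MetricsQueryClient(DefaultAzureCredential())

result = client.query_resource(
    RESOURCE_ID,
    metric_names=["Percentage CPU"],
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=5),
    aggregations=[MetricAggregationType.AVERAGE],
)

for metric in result.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(point.timestamp, point.average)
```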

 

Full-Stack Monitoring, Powered by Google:

Google Cloud has made leaps and bounds in the cloud hosting space in the last few years and is poised to be Amazon’s main competitor. Much like Microsoft and Amazon, Google Cloud offers a robust monitoring tool called Full-Stack Monitoring, Powered by Google. Full-Stack works to offer the administrator complete visibility into their applications and platform. Full-Stack presents the admin with a rich dashboard of metrics such as performance, uptime, and health of the cloud-powered applications stored in Google Cloud. Google lays out a great explanation and list of benefits that Full-Stack Monitoring provides to the end user:

“Stackdriver Monitoring provides visibility into the performance, uptime, and overall health of cloud-powered applications. Stackdriver collects metrics, events, and metadata from Google Cloud Platform, Amazon Web Services, hosted uptime probes, application instrumentation, and a variety of common application components including Cassandra, Nginx, Apache Web Server, Elasticsearch, and many others. Stackdriver ingests that data and generates insights via dashboards, charts, and alerts. Stackdriver alerting helps you collaborate by integrating with Slack, PagerDuty, HipChat, Campfire, and more (Google Cloud website).”
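
And a similar sketch against the Stackdriver (Cloud Monitoring) API using the google-cloud-monitoring Python client. The project ID is a placeholder and the metric type shown is just one common example, so adapt it to your own resources:

```python
import time
from google.cloud import monitoring_v3

PROJECT = "projects/my-project-id"       # placeholder GCP project

client = monitoring_v3.MetricServiceClient()
now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now)}, "start_time": {"seconds": int(now - 3600)}}
)

# Pull the last hour of CPU utilization for Compute Engine instances in the project.
series_list = client.list_time_series(
    request={
        "name": PROJECT,
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in series_list:
    for point in series.points:
        print(point.interval.end_time, f"{point.value.double_value:.2%}")
```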

 

Trust but Verify

While there are several great proprietary tools provided by the cloud vendors, it’s imperative to verify that the metrics gathered are accurate. There are many free tools out there that can be run against your cloud infrastructure or cloud-driven applications. While it’s become increasingly acceptable to trust large cloud vendors such as Google, Amazon, and Microsoft, the burden rests on the organization to verify the data it receives in return. 

Before we get into my list of patch management tools, a note: we've all used WSUS, and some of us have become proficient with SCCM, but those tools aren't in my top 3 list... they don't even crack my top 10! However, from the point of view of an enterprise that is primarily Windows, those tools are great and they get the job done. I want to talk about 3 tools that are easy to set up, easy to use, and provide good value to the admin team when it comes to managing updates and patches. Administrators who have to manage patches (which is just about all of us) want an easy solution that's not going to require a ton of overhead. I feel like SCCM is a monster when it comes to management and overhead; maybe that's not your experience. The end result we all desire is to move away from manual patching and find a solution that will do that work for us. My list is by no means definitive; these are tools that I've actually worked with in the past and that I've found to be helpful and easy to use. Without further ado, here's my top 3 list of patch management tools (in no particular order), each with an accompanying video:

 

LANDesk

GFI LanGuard

SolarWinds Patch Manager

What do you think?  Am I way off?  Did I leave off any good tools that some of you are using out there?  I'd love to hear from you.

It goes without saying that patching and updating your systems is a necessity. No one wants to deal with the aftermath of a security breach because you forgot to manually patch your servers over the weekend, or because your SCCM/WSUS/YUM solution wasn't configured correctly. So how do you craft a solid plan of attack for patching? There are many different ways to approach it; in previous posts I talked about what you are patching and how to patch Linux systems, but we still need to discuss creating a strategic plan so that patch and update management doesn't let you down. What I've done is lay out a step-by-step process for creating a Patching Plan of Attack, or PPoA (not really an official acronym, but it looks like one).

 

Step 1: Do you even know what needs to be patched?

The first step in our PPoA is to do an assessment or inventory to see what is out there in your environment that needs to be patched: servers, networking gear, firewalls, desktop systems, etc. If you don't know what's out there in your environment, then how can you be confident in creating a PPoA? You can't! For some this might be easy due to the smaller size of their environment, but for others who work in a large enterprise with hundreds of devices it can get tricky. Thankfully, tools like SolarWinds LAN Surveyor and SNMP v3 can help you map out your network and see what's out there. Hopefully you are already doing regular data center health checks where you actually set your Cheetos and Mt. Dew aside, get out of your chair, and walk to the actual data center (please clean the orange dust off your fingers first!).
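
For a taste of what an SNMP v3-based inventory sweep might look like, here is a rough Python sketch using the pysnmp library to ask each address on a subnet for its sysDescr. The subnet, user name, and keys are placeholders, and real discovery tools do far more than this:

```python
from ipaddress import ip_network
from pysnmp.hlapi import (
    SnmpEngine, UsmUserData, UdpTransportTarget, ContextData,
    ObjectType, ObjectIdentity, getCmd,
)

def snmp_sysdescr(ip, user="monitor", authkey="********", privkey="********"):
    """Return the device's sysDescr string via SNMPv3, or None if it doesn't answer."""
    error_ind, error_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        UsmUserData(user, authkey, privkey),              # placeholder SNMPv3 credentials
        UdpTransportTarget((str(ip), 161), timeout=1, retries=0),
        ContextData(),
        ObjectType(ObjectIdentity("1.3.6.1.2.1.1.1.0")),  # SNMPv2-MIB::sysDescr.0
    ))
    if error_ind or error_status:
        return None
    return str(var_binds[0][1])

# Sweep a placeholder management subnet and print whatever responds.
for ip in ip_network("10.0.10.0/28").hosts():
    descr = snmp_sysdescr(ip)
    if descr:
        print(f"{ip}: {descr}")
```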

 

Step 2:  Being like everyone else is sometimes easier!

How many flavors of Linux are in your environment? How many different versions are you supporting? Do you have Win7, XP, and Win8 all in your environment? It can get tricky if you have a bunch of different operating systems out there, and even trickier if they are all at different service pack levels. Keep everything the same; if everything is the same, you'll have an easier time putting together your PPoA and streamlining the process of patching. Patching is mind-numbing and painful, and you don't want to add complexity to it if you can avoid it.

 

Step 3:  Beep, beep, beep.... Back it up!  Please!

Before you even think about applying any patches, your PPoA must include a process for backing up all of your systems prior to and after patching. The last thing anyone wants is an RGE (resume-generating event) on their hands! We shouldn't even be talking about this; if you aren't backing up your systems, run and hide and don't tell anyone else (I'll keep your secret). If you don't have the storage space to back up your systems, find it. If you are already backing up your systems, good for you; here's a virtual pat on the back!
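
If your servers are VMs, a quick pre-patch snapshot is a cheap safety net on top of regular backups (not a replacement for them). A hedged pyVmomi sketch, where the vCenter connection details and the VM naming convention are placeholders:

```python
from datetime import datetime
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

context = ssl._create_unverified_context()                 # lab-only convenience
si = SmartConnect(host="vcenter.example.com", user="patchadmin@vsphere.local",
                  pwd="********", sslContext=context)      # placeholder connection details
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        if vm.name.startswith("web-"):                     # placeholder naming convention
            # Quiesced, memory-less snapshot taken right before the patch window.
            vm.CreateSnapshot_Task(
                name=f"pre-patch-{datetime.now():%Y%m%d}",
                description="Automatic snapshot before monthly patching",
                memory=False,
                quiesce=True,
            )
            print(f"Snapshot requested for {vm.name}")
finally:
    Disconnect(si)
```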

 

Step 4:  Assess, Mitigate, Allow

I'm sure I've got you all out there super excited and jonesing to go patch away. Calm down, I know it's exciting, but let me ask you a question first: do you need to apply every patch that comes out? Are all of your systems "mission critical"? Before applying patches and creating an elaborate PPoA, do a risk assessment to see if you really need to patch everything that you have. The overhead that comes with patching can get out of hand if you apply every patch available to every system you have. For some, e.g., federal environments, you have to apply them all, but for others it might not be necessary. Can you mitigate the risk before patching it? Are there things you can do ahead of time to reduce the risk or exposure of a certain system or group of systems? Finally, what kinds of risks are you going to allow in your environment? These are all aspects of good risk management that you can apply to your planning.

 

Step 5:  Patch away!

Now you have your PPoA and you are ready to get patching, so go for it. If you have a good plan of attack and you feel confident that everything has been backed up and all risks have been assessed and mitigated, then have at it. Occasionally you are going to run into a patch that your systems aren't going to like, and they will stop working. Hopefully you've backed up your systems, or better yet, you are working with VMs and can revert to an earlier snapshot. Keep these 5 steps in mind when building out your PPoA so you can feel confident tackling probably the most annoying task in all of IT.

Let's talk about patching for our good friend Tux the Linux penguin (if you don't know about Tux, click here). How many of us out there work in a Linux-heavy environment? In the past it might have been a much smaller number; however, with the emergence of virtualization and the ability to run Linux and Windows VMs on the same hardware, it's become a common occurrence to support both OS platforms. Today I thought we'd talk about patching techniques and methods specifically related to Linux systems. Below I've compiled a list of the 3 most common methods I've used for patching Linux systems. After reading the list you may have a dozen methods that are more successful and easier to use than the ones I've listed here; I encourage you to share your list with the forum so we get the best coverage of methods for patching Linux systems.

 

Open Source Patching Tools

There are a few good open source tools out there for patching your Linux systems. One tool that I've tested in the past is called Spacewalk. Spacewalk is used to patch systems that are derivatives of Red Hat, such as Fedora and CentOS. Most federal government Linux systems are running Red Hat Enterprise Linux; in that case you would be better off utilizing the Red Hat Satellite suite of tools to manage patches and updates for your Red Hat systems. If your government or commercial client allows Fedora/CentOS as well as open source tools for managing updates, then Spacewalk is a viable option. For a decent tutorial and article on Spacewalk and its capabilities, click here.

 

 

YUMmy for my tummy!

No, this has nothing to do with Cheetos, everybody calm down. Configuring a YUM repository is another good method for managing patches in a Linux environment. If you have the space, or even if you don't, you should make the space to configure a YUM repository. Once you have this repository created, you can build some of your own scripts to pull patches down and apply them on demand or on a configured schedule. It's easy to set up a YUM repository, especially when utilizing the createrepo tool. For a great tutorial on setting up a YUM repository, check out this video.
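
As a rough idea of the "build your own scripts" part, here is a small Python wrapper (assuming a local repository whose repo ID is a placeholder) that you could run on demand or from cron to apply only the packages you've staged in that repo:

```python
#!/usr/bin/env python3
"""Apply updates from a dedicated local YUM repo and log the result. Minimal sketch."""
import subprocess
import sys
from datetime import datetime

REPO_ID = "local-patches"              # placeholder: the repo id of your internal YUM repository
LOG_FILE = "/var/log/local-patch-run.log"

def run_yum(*args):
    """Run yum restricted to the local repo and append the output to the log."""
    cmd = ["yum", "--disablerepo=*", f"--enablerepo={REPO_ID}", *args]
    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(LOG_FILE, "a") as log:
        log.write(f"{datetime.now().isoformat()} {' '.join(cmd)}\n")
        log.write(result.stdout + result.stderr + "\n")
    return result.returncode

# Refresh metadata for the local repo, then apply whatever it offers.
if run_yum("clean", "expire-cache") != 0:
    sys.exit("Failed to refresh repo metadata; see log.")
sys.exit(run_yum("-y", "update"))
```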

 

 

Manual Patching from Vendor Sites

Obviously the last method I'm going to talk about is manual patching. For the record, I abhor manual patching; it's a long process and it can become quite tedious if you have a large environment. I will preface this section by stating that if you can test a scripted/automated process for patching and it's successful enough to deploy, then please, by all means, go that route. If you simply don't have the time or aptitude for scripting, then manual patching it is. The most important thing to remember when you are downloading patches from an FTP site is to ensure that it's a trustworthy site. With Red Hat and SUSE, you're going to get their trusted and secured site to download your patches; however, with other distros of Linux such as Ubuntu (Debian-based) or CentOS, you're going to have to find a trustworthy mirror that won't introduce a Trojan to your network. The major drawback of manual patching is security; unfortunately, there are a ton of bad sites out there that will help you introduce malware into your systems and corrupt your network. Be careful!
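
One simple habit that helps when pulling packages by hand is verifying the published checksum (and, where offered, the GPG signature) before installing anything. A minimal sketch, assuming you've copied the expected SHA-256 value from the vendor's or mirror's signed checksum file (file name and checksum below are placeholders):

```python
import hashlib
import sys

def sha256sum(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholders: the file you downloaded and the checksum the vendor published for it.
downloaded_file = "kernel-update.rpm"
expected_sha256 = "paste-the-vendor-published-checksum-here"

actual = sha256sum(downloaded_file)
if actual != expected_sha256.lower():
    sys.exit(f"Checksum mismatch for {downloaded_file} -- do not install this package!")
print(f"{downloaded_file} matches the published checksum.")
```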



That's all folks! Does any of this seem familiar to you? What do you use to patch your Linux systems? If you've set up an elaborate YUM or APT repository, please share the love! 


Tux out!!

All of us who have any experience in the IT field have had to deal with patching at some point in time. It's a necessary evil. Why an evil? Well, if you've had to deal with patches, then you know it can be a major pain. When I hear words like SCCM or Patch Tuesday, I cringe, especially if I'm in charge of patch management. We all love Microsoft (ahem), but let's be honest, they have more patches than any other software vendor in this galaxy! VMware has its patching, Linux machines are patched, but with Windows Servers there is some heavy lifting when it comes to patching. Most of my memories of staying up past 12 a.m. to do IT work revolve around patching, and again, it's not something that everybody jumps to volunteer for. While it's definitely not riveting work, it is crucial to the security of your servers, network devices, desktops, <plug in system here>. Most software vendors, such as Microsoft, are good about pushing out up-to-date patches for their systems; however, there are other types of systems whose patches we as IT staff have to go out and pull down from the vendor's site, and this adds more complexity to the patching.

 

My question is, what are you doing to manage your organization's patching? Are you using SCCM, WSUS, or some other type of patch management? Or are you out there still banging away at manually patching your systems? Hopefully not, but maybe you aren't a full-blown enterprise. I'm curious, because to me patching is the most mundane and painful process out there, especially if you are doing it manually.

