
Geek Speak


You sit down at your desk. It's 9:10AM and your coffee is still warm. There is a smell of bacon in the air.

 

Suddenly your phone rings. The trading system is down. The time for quick thinking is now.

 

Where would you begin to troubleshoot this scenario?

 

A lot of people will reach for the biggest hammer they can find: a tool that will trace all activity as it hits the database instance. For SQL Server, that tool is typically SQL Profiler.

 

The trouble here is this: you are in a reactive mode right now. You have no idea as to the root cause of the issue. Therefore, you will configure your trace to capture as many details as possible. This is your reaction to make certain that when the time comes you are prepared to do as thorough a forensics job as possible in the hope that you can fix the issue in the shortest amount of time.

 

And this method of performance monitoring and troubleshooting is the least efficient way to get the job done.

 

When it comes to performance monitoring and troubleshooting you have two options: tracing or polling.

 

Tracing will track details and capture events as they happen. In an ironic twist, this method can interfere with the performance of the very queries you are trying to measure. Examples of tools that use the tracing method for SQL Server are Extended Events and SQL Profiler.

 

Polling, however, is also known by another name: sampling. A tool that utilizes polling will gather performance data at regular intervals. This is considered a lightweight alternative to tracing. Examples of tools that use this method are Performance Monitor (by default it samples once per second) and third-party applications like Database Performance Analyzer that query dynamic management objects (the system views known as DMVs in SQL Server, and the x$ and v$ objects in Oracle).
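
To make the polling idea concrete, here is a minimal sketch of sampling a DMV on a timer from Python. It assumes the pyodbc package and a SQL Server ODBC driver are installed; the connection string and the choice of sys.dm_os_wait_stats are illustrative, not a prescription.

```python
import time
import pyodbc  # assumes a SQL Server ODBC driver is installed on the machine

# Hypothetical connection string -- adjust server, database, and authentication to your environment.
CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=master;Trusted_Connection=yes"

QUERY = """
SELECT TOP 5 wait_type, wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
"""

def sample_waits(interval_seconds=15, samples=4):
    """Poll the top waits every `interval_seconds` -- a lightweight alternative to tracing."""
    with pyodbc.connect(CONN_STR) as conn:
        for _ in range(samples):
            rows = conn.cursor().execute(QUERY).fetchall()
            print(time.strftime("%H:%M:%S"), [(r.wait_type, r.wait_time_ms) for r in rows])
            time.sleep(interval_seconds)

if __name__ == "__main__":
    sample_waits()
```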

 

See, here's the secret about performance monitoring and troubleshooting that most people don't understand: when it comes to gathering performance metrics, it's not so much what you gather as how you gather it.


Knowing what to measure is an easy task. It really is. You can find lots of information on the series of tubes known as the internet that will list out all the metrics an administrator would want. Database size, free disk space, CPU utilization, page life expectancy, buffer cache hit ratio, etc. The list of available metrics seems endless and often overwhelming. Some of that information is even useful; a lot of it can be just noise, depending on the problem you are trying to solve.


So which method is right for you?


Both, actually.


Let me explain.


Think of a surgeon who needs to operate on a patient. There's a good chance that before the surgeon cuts into healthy skin, they will take an X-ray of the area. Once they examine the X-ray, they know more about what they need to do when they operate.

Polling tools are similar to X-rays. They help you understand which areas you need to investigate further. Then, when you need to take that deeper dive, that's where you are likely to use a tracing tool, returning only the information needed to solve the problem, and only for the shortest possible duration.

 

I find that many junior administrators (and developers with novice database performance troubleshooting skills) tend to rely on tracing tools for even the most routine tasks that can be done with a few simple queries against a DMV or two. I do my best to educate when I can, but it is often an uphill battle. I've lost track of the number of times I've been thrown under the bus by someone claiming they can't fix an issue because I won't let them run Profiler against a production box as a first attempt at figuring out what's going on. Rather than make people choose between one tool or the other, I do my best to explain how they work well together.

 

I never use a tracing tool as a first option for performance monitoring and troubleshooting. I rely on polling to help me understand where to go next. Sometimes that next step requires a trace, but often I'm able to make positive performance improvements without ever needing to run one. Then again, I'm lucky that I have some really good tools to use for monitoring database servers, even ones that are running on VMware, Amazon RDS, or Microsoft Azure.

 

There's a lot to learn as an administrator, and it can be overwhelming for anyone, new or experienced.

 

So it wouldn't hurt to double-check how you are monitoring right now, to make certain you are grabbing the right things at the right frequency.

I’ll be honest, when I initially saw the words configuration management, I only thought of managing device configurations. You know, things like keeping backup copies of configurations in case a device bit the bucket. However, the longer I’ve been in the IT field, the more I’ve learned how short-sighted I was in relation to what configuration management truly meant. Hopefully, by the end of this post, you will either nod and agree or thank me for opening your eyes to an aspect of IT that is typically misunderstood or severely neglected.

 

There are several components of configuration management that you, as an IT professional, should be aware of:

 

  • Device hardware and software inventory
  • Software management
  • Configuration backup, viewing, archiving, and comparison
  • Detection and alerting of changes to configuration, hardware, or software
  • Configuration change management

 

Let’s briefly go over some of these and why they are so integral to maintaining a healthy network.

 

Most (hopefully all) IT teams keep an inventory of the hardware and software that they support. This is imperative for things like service contract renewals and support calls. But how you keep track of this information is usually the real question. Are you manually tracking it in Excel spreadsheets or something similar? I would agree that it works, but in a world so hellbent on automation, why risk human error? What if you forget to add a device and it goes unnoticed? Wouldn't it be easier to have software that automatically performs an inventory of all your devices?
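
To illustrate the automation argument, here is a small Python sketch (the file names and CSV column are hypothetical) that reconciles a manually maintained inventory export against the device list produced by a discovery scan, flagging anything present in one but not the other:

```python
import csv

def load_devices(path, column="hostname"):
    """Read one column of device names from a CSV export into a normalized set."""
    with open(path, newline="") as f:
        return {row[column].strip().lower() for row in csv.DictReader(f) if row.get(column)}

# Hypothetical files: an export of the manual inventory and a discovery scan result.
inventory = load_devices("inventory_export.csv")
discovered = load_devices("discovery_scan.csv")

print("On the network but missing from inventory:", sorted(discovered - inventory))
print("In inventory but not seen on the network:", sorted(inventory - discovered))
```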

 

One of my favorite components of configuration management is configuration backup and the ability to view those backups as well as compare them to previous backups. If your core switch were to fail today, right now, are you prepared to replace it? I'm not talking about calling your vendor's support to have them ship out a replacement. I'm talking about rebuilding that shiny new piece of hardware to its predecessor's last working state. If you have backups, that process is made easy. Grab the latest backup and slap it on the new device when it arrives. This will drastically cut down the recovery time in a failure scenario. Need to know what's changed between the current configuration and the one from six months ago for audit purposes? Having those backups and a mechanism for comparing them goes a long way.
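
The comparison piece doesn't have to be exotic. As a rough sketch of the idea, Python's standard difflib can show exactly what changed between two saved backups (the file names below are hypothetical):

```python
import difflib

# Hypothetical backup files: a config from six months ago and today's.
with open("core-switch_2014-08-01.cfg") as old, open("core-switch_2015-02-01.cfg") as new:
    old_lines, new_lines = old.readlines(), new.readlines()

# Produce a unified diff showing only what changed between the two backups.
for line in difflib.unified_diff(old_lines, new_lines,
                                 fromfile="2014-08-01", tofile="2015-02-01"):
    print(line, end="")
```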

 

There are a number of ways to know when an intruder’s been in your network. One of those methods is through the detection and alerting of changes made to your devices. If you don’t have something in place that can detect these changes in real-time, you’ll be in the dark in more ways than one. How about if a co-worker made an “innocent” change before going on vacation that starts to rear its ugly head? Being able to easily generate real-time alerts or reports will help pinpoint the changes and get your system purring like a kitten once again.
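
Change detection can start as simply as fingerprinting each device's configuration on a schedule and reporting when the fingerprint moves. A minimal sketch, with the config-fetching left as a placeholder for whatever mechanism you already use:

```python
import hashlib
import json

STATE_FILE = "config_hashes.json"  # hypothetical local state store

def fingerprint(config_text: str) -> str:
    """Return a stable fingerprint of a device configuration."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()

def detect_changes(current_configs: dict) -> list:
    """Compare current config hashes to the last known state and return the devices that changed."""
    try:
        with open(STATE_FILE) as f:
            known = json.load(f)
    except FileNotFoundError:
        known = {}  # first run: nothing to compare against yet
    changed = []
    for device, text in current_configs.items():
        digest = fingerprint(text)
        if known.get(device) not in (None, digest):
            changed.append(device)          # configuration moved since the last check
        known[device] = digest
    with open(STATE_FILE, "w") as f:
        json.dump(known, f, indent=2)
    return changed
```

Feed the returned list into whatever alerting you already have, and you have real-time-ish change detection with a few dozen lines of glue.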

 

In conclusion, configuration management is not just about keeping backups of your devices on hand. It involves keeping inventories of those devices as well as being able to view, archive, and compare their configurations. It also includes being able to easily detect and alert on changes made to your devices for events like catching network intruders. Are you practicing good configuration management techniques?

Whether it be at work or in my personal life, I like to plan ahead and be readily prepared. Specifically, when it comes to allocating storage, you definitely need to plan your allocation strategically. This is where thin provisioning comes in: organizations can adopt this strategy to avoid costly problems and increase storage efficiency.


Thin provisioning is a way of efficiently optimizing available space in storage area networks (SANs). It allocates disk storage space among multiple users based on what each user requires at a given time.


Days before Thin Provisioning:           

Traditionally, admins allocated additional storage beyond their current need, anticipating future growth. In turn, admins would end up with a significant amount of unused space, directly resulting in a loss on the capital spent purchasing disks and storage arrays.


Applications require storage to function properly. In traditional provisioning, a Logical Unit Number (LUN) is created and each application is assigned to a LUN. Creating a LUN the traditional way means allocating a portion of empty physical storage from the array and then assigning that space to the application so it can operate. At first, the application will not occupy the whole storage space allocated; gradually, however, the space will be utilized.


How Thin Provisioning Works:

In thin provisioning, a LUN is created from a common pool of storage. A thin-provisioned LUN will be larger than the actual physical storage assigned to it. For example, if an application needs 50GB of storage to start working, 50GB of virtual storage is assigned to it so that it can become operational. The application uses the LUN in the normal way. Initially, the LUN will be backed by only a portion of the needed storage (say 15GB); the remaining 35GB is virtual. As actual utilization grows, additional storage is automatically drawn from the common pool of physical storage. The admin can then add more physical storage (based on requirements) without disturbing the application or altering the LUN. This helps admins eliminate the initial physical disk capacity that would otherwise go unused.
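
To make the 50GB/15GB example concrete, here is a small Python sketch of the bookkeeping a thin pool performs: each LUN advertises a virtual size, physical space is drawn from the shared pool only as data is written, and the pool warns when it runs low. The numbers and the 80% threshold are illustrative, not taken from any particular array.

```python
class ThinPool:
    def __init__(self, physical_gb):
        self.physical_gb = physical_gb   # real disk behind the pool
        self.used_gb = 0                 # space actually written so far
        self.luns = {}                   # name -> virtual size and bytes written

    def create_lun(self, name, virtual_gb):
        """Advertise a LUN of virtual_gb without reserving physical space up front."""
        self.luns[name] = {"virtual_gb": virtual_gb, "written_gb": 0}

    def write(self, name, gb):
        """Writing data consumes physical space from the shared pool on demand."""
        if self.used_gb + gb > self.physical_gb:
            raise RuntimeError("Pool exhausted -- add physical disk before writing more")
        self.luns[name]["written_gb"] += gb
        self.used_gb += gb
        if self.used_gb / self.physical_gb > 0.8:
            print(f"WARNING: pool {self.used_gb}/{self.physical_gb} GB used -- plan expansion")

# The example from the text: a 50GB virtual LUN initially backed by 15GB of physical space.
pool = ThinPool(physical_gb=15)
pool.create_lun("app01", virtual_gb=50)
pool.write("app01", 10)   # fine: 10 of 15 GB used
pool.write("app01", 3)    # crosses 80% of the pool -- the warning fires
```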

[Diagram 1: how thin provisioning draws from a common storage pool]

A use case:

Consider an organization that has 3 servers running different applications—a database, a file processing system, and email. All these applications need storage space to work and the organization has to consider storage space for future growth.


With traditional provisioning, say each application needs 1 TB to operate. Out of that 1 TB, only 250 GB (25%) will be used initially; the rest will be utilized gradually. With the whole 3 TB already allocated to the existing three applications, what happens if you need a new server/application in the organization? In that case you will need more storage, and unfortunately it won't be cheap: you will need to go looking for budget.

[Diagram 2: the traditional provisioning scenario]

Now let's look at how thin provisioning can help with the aforementioned situation. In this scenario, each server/application is provided with 1 TB of virtual storage, but the actual physical space provided is just 250 GB. Space is only allocated from the pool when needed. When a new server/application is added, you can assign it 250 GB from the physical storage, while the server/application still sees a total of 1 TB of virtual storage. The organization can add the new server/application without purchasing additional storage, and can grow the physical pool as a whole when needed.

[Diagram 3: the thin provisioning scenario]

Thin provisioning has 2 advantages in this use case:

  • Adding a new server/application is no longer an issue.
  • The organization avoids provisioning a total of 3 TB while setting up the servers. It can start with 1 TB and add more storage as and when needed without disturbing the setup (a quick check of these numbers follows below).
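
Here is that quick check, using only the illustrative figures from this use case:

```python
servers = 3
virtual_tb_each = 1.0           # what each application is promised
initial_use_tb_each = 0.25      # what each actually writes at first (25%)

committed = servers * virtual_tb_each               # 3.0 TB promised to applications
physically_used = servers * initial_use_tb_each     # 0.75 TB actually consumed at the start
oversubscription = committed / physically_used      # 4.0x

starting_pool_tb = 1.0
headroom = starting_pool_tb - physically_used       # 0.25 TB left for growth or the next server

print(f"Committed: {committed} TB, in use: {physically_used} TB, "
      f"over-subscription: {oversubscription:.1f}x, headroom: {headroom} TB")
```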


When to use thin provisioning:

Whether to use this type of provisioning depends more on the use case than on the technology. Thin provisioning is beneficial in the following situations:

  • Where the amount of resources used is much smaller than allocated.
  • When the administrator is not sure about the size of the database that needs to be allocated.
  • Situations where servers/applications are frequently added to the network.
  • When the organization wants to reduce the initial investment. 
  • When the DB administrator wants to get maximum utilization from their storage.


Thin provisioning is not a silver bullet in the virtualization world. It, too, has its limitations. For example:

  • Performance can become a major factor: thin-provisioned storage becomes fragmented very easily, in turn decreasing performance.
  • Storage over-allocation: thin provisioning makes it possible to commit more space than the pool physically contains. If the pool runs out, write operations can fail badly (sometimes unrecoverably) on one or several volumes.


Even though thin provisioning has drawbacks, all of them can be managed with continuous storage monitoring. Now what you need to do is transform your 'fat' volumes into thin ones. But issues can arise while doing so. Have you experienced any issues while moving your storage? If so, how did you resolve them?

In the first installment of the AppStack series, Joel Dolisy took you for A Lap around AppStack, providing a high-level overview of the concept. Lawrence Garvin then connected the dots from a systems perspective in A Look at Systems’ Role in the AppStack. As Lawrence concluded his piece, he stated, “The complexity of the systems monitoring space is continuing to grow. Virtualization and shared storage was just the first step.” So, let’s take a look at how those layers and the application itself affect the AppStack.

 

Virtualization Management

Recent announcements from VMware (vSphere 6), Microsoft (Windows 10), and cloud service providers like Amazon Web Services highlight the advances made to accelerate rapid provisioning, dynamic resource scaling, and continuous application & services delivery. These capabilities extend IT consumption of anything-as-a-Service (XaaS) from on-premises to off-premises, from private cloud to public cloud, from physical to virtualization to cloud and back again.


Storage Management

Policy-based storage, aka software-defined storage, is the latest trend that abstracts storage constructs from the underlying storage hardware. The objective is to port the advantages inherent to virtualization over to storage for actions involving storage capacity, performance, and utilization in order to meet Quality-of-Service (QoS) service level agreements (SLAs).


The Application is What Matters

Constantly changing variables in each layer make it more complex to manage the entire environment. The bottlenecks and trouble spots can be either virtual or physical constructs. And the only thing that matters is application delivery and consumption. IT management is needed to monitor, troubleshoot, and report on this complex and quickly changing environment. To adapt to that speed of IT and business, SolarWinds AppStack provides the context to connect these layers and quickly provide a single point of truth on any given application, as SolarWinds CTO/CIO Joel Dolisy pointed out. And as fellow Head Geek Lawrence Garvin pointed out, monitoring is converging toward consolidated monitoring and comprehensive awareness of the end-user experience.


Indispensable to Software-Defined IT Professionals

All of the above make AppStack indispensable to software-defined IT professionals who make their living in multiple clouds, driving multiple container vehicles, and engineering the automation and orchestration of self-healing and auto-scaling policies in their ecosystem.

 

For the next installment of this series, Patrick Hubbard will share his insights and experience on the AppStack concept.

IT professionals are admittedly a prideful bunch. It comes with the territory when you have to constantly defend yourself, your decisions, and your infrastructure against people who don’t truly understand what you do. This is especially true for network administrators. “It’s always the network.” Ever heard that one before? Heck, there’s even a blog out there with that expression, created by someone I respect, Colby Glass. My point is, as IT professionals, we have to be prepared at a moment’s notice to provide evidence that an issue is not related to the devices we manage. That's why it's imperative that we know our network inside and out.

 

With that being said, it should be no surprise to you that when I started my career in networking in 2010, I thought NMS platforms were pretty amazing. Pop some IP addresses in and you’re set.¹ The NMS goes about its duty, monitoring the kingdom and alerting you when things go awry. I could log in and verify it for myself if I wanted to be certain, and I could even dig in at the interface level and give you traffic statistics like discards and errors, utilization, etc. I had instant credibility at my fingertips. I could prove the network was in great shape at a moment's notice. Want to know if that interface to your server was congested yesterday evening at 7pm? It sure wasn't, and I have the proof! Can’t get much better than that, right?

 

Until…

 

I saw NetFlow for the first time. NetFlow has a way of really opening your eyes. “How did I ever think I knew my network so well?” I thought. I had no visibility into the traffic patterns flowing through my network. Sure, I could fire up a packet capture pretty easily, but that approach is reactive and time-consuming depending on your setup. What if that interface really WAS congested yesterday evening at 7pm? I'd have no data to reference because I wasn't running a packet capture at that exact time or for that particular traffic flow. It’s helpful to tell someone that the interface was congested, but how about taking it a step further with what was congesting it? What misbehaving application caused that link to be 90% utilized when traffic should have been relatively light at that time of the day? The important thing to realize is that I’m not just an advocate for NetFlow, I’m also a user!² Here’s a quick recap of an instance where NetFlow saved my team and me.

 

I recently encountered a situation where having NetFlow data was instrumental. One day at work, we received multiple calls, e-mails, and tickets about slow networks at our remote offices. They seemed to be related, but we weren't sure at first. The slowness complaints were sporadic in nature, which made us scratch our heads even more. After looking at our instance of NPM, we definitely saw high interface utilization at some, but not all, of our remote sites. We couldn't think of any application or traffic pattern that would cause this. Was our network under attack? We thought it might be prudent to involve the security team in case it really was an attack, but before we sounded the alarm, we decided to check out our NetFlow data first. What we saw next really baffled us.

 

Large amounts of traffic (think GBs/hour) were coming from our Symantec Endpoint Protection (SEP) servers to clients at the remote offices over TCP port 8014. For those of you who have worked with Symantec before, you probably already know that this is the port the SEP manager uses to manage its clients (e.g., virus definition updates). At some point, communication between the manager and most of its clients (especially in remote offices) had failed and the virus definitions on the clients became outdated. After a period of time, the clients would no longer request the incremental definition update; they wanted the whole enchilada. That’s okay if it’s a few clients and the download succeeds the first time. This wasn't the case in our situation. There were hundreds of clients all trying to download this 400+MB file from one server over relatively small WAN links (avg. 10Mb/s). The result was constantly failing downloads, which triggered the process to start over again ad infinitum. As a quick workaround, we decided to apply QoS to the traffic based on the port number until the issue with the clients was resolved. With this information at our disposal, we brought it to the security team to show them that their A/V system was not healthy. Armed with the information we gave them, they were quickly able to identify several issues with the SEP manager and its clients, which helped them eventually resolve several problems, including standing up a redundant SEP manager. Without NetFlow data, we would have had to set up SPAN ports on our switches and wait for a period of time before analyzing packet captures to determine what caused the congestion. With NetFlow, we were able to look back at specific times in the past to determine what was traversing our network when our users were complaining.
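
This kind of answer falls out of flow data almost trivially once the records are somewhere you can query. As a rough illustration (the CSV layout is hypothetical; most flow collectors can export something similar), totaling bytes per source and destination port over the window in question is enough to surface a pattern like the SEP traffic on TCP 8014:

```python
import csv
from collections import Counter

def top_talkers(flow_csv, top_n=5):
    """Sum bytes per (source address, destination port) from an exported flow record file."""
    totals = Counter()
    with open(flow_csv, newline="") as f:
        for row in csv.DictReader(f):  # expected columns: src_addr, dst_port, bytes
            totals[(row["src_addr"], row["dst_port"])] += int(row["bytes"])
    return totals.most_common(top_n)

# Hypothetical export covering yesterday evening's congestion window.
for (src, port), total in top_talkers("flows_1900-2000.csv"):
    print(f"{src} -> tcp/{port}: {total / 1e9:.2f} GB")
```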

 

That’s just one problem NetFlow has solved for us. What if that port was TCP/6667 and the traffic was coming from your CFO’s computer? Do you really think your CFO is on #packetpushers (irc.freenode.net) trying to learn more about networking? No, it’s more likely a command-and-control botnet obtaining its next instructions on how to make your life worse. From a security perspective, NetFlow is just one more tool in the never-ending fight against malware. So what are you waiting for? Get with the flow… with NetFlow!

 

1. Of course it's never quite that easy. You'll have to configure SNMP on all of your devices that you want to manage and/or monitor.

2. A nod to the old Hair Club for Men marketing ("I'm not only the president, I'm also a client").

Most networks by now are slowly making the transition from IPv4 addresses to IPv6. This new availability and abundance of global IPv6 addresses will enable businesses to easily provide services to their customers and internal users. However, there are a few things to make note of when you have IPv6 running in your network.

  1. Unknown IPv6-enabled devices: Many current operating systems not only support, but also enable IPv6 by default. Devices like firewalls and IDS equipment may not be configured to recognize IPv6 traffic in the network. Unfortunately, the attacker community can leverage this gap to infiltrate and attack both IPv4 and IPv6 networks. Unauthorized clients using IPv6 auto-configuration can configure their own global address if they are able to find a global prefix (one quick way to spot IPv6-enabled hosts is sketched after this list).
  2. Multicast: A Multicast address is used for one-to-many communication, with delivery to multiple interfaces. In IPv6, Multicast replaces Broadcast for network discovery functions like dynamic auto-configuration of devices and DHCP services. The IPv6 address range FF00::/8 is reserved for multicast and by combining this with a scope, unauthorized users can easily reach hosts or application servers if they want.
  3. Stateless Address Auto-configuration (SLAAC): SLAAC allows network devices to automatically create valid IPv6 addresses. It permits hot plugging of network devices i.e. no need for manual configuration. Devices can simply connect to the network, operate stealthily, and go unnoticed for a very long time.
  4. Devices not supporting IPv6: Some network security devices like firewalls, filters, NIDS, etc. may not support IPv6 or may not be configured to work with IPv6. As a result, IPv6-enabled hosts can access the Internet with no firewall protection or network access controls. In turn, malicious tools can be used to detect IPv6-capable hosts, taking control of IPv6 auto-configuration and tunneling IPv6 traffic in and out of IPv4 networks undetected.
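
As promised above, one quick way to spot hosts that quietly have IPv6 enabled is to ask the resolver for IPv6 addresses on names you already manage and flag anything that answers. This sketch uses only the Python standard library; the hostnames are placeholders, and it will only catch addresses registered in DNS, not purely link-local ones:

```python
import socket

# Placeholder hostnames -- substitute the systems you already monitor over IPv4.
HOSTS = ["fileserver.example.com", "mail.example.com", "printer01.example.com"]

for host in HOSTS:
    try:
        # Ask only for IPv6 results; gaierror means no IPv6 address was found for the name.
        results = socket.getaddrinfo(host, None, socket.AF_INET6)
        addresses = sorted({r[4][0] for r in results})
        print(f"{host}: IPv6 present -> {', '.join(addresses)}")
    except socket.gaierror:
        print(f"{host}: no IPv6 address found")
```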

 

Mitigate Risks from IPv6 in Your Network


Here are a few important tips to help you stay in control of your network, while maintaining optimum use of your IPv6 address space:

  • Deploy network controls to be both IPv4 and IPv6 aware
  • Network admins should have the same level of monitoring for both protocols
  • Define and implement baseline security controls for IPv6 environments to meet the same or better security as IPv4 environments
  • To actively manage these risks, organizations are encouraged to adopt a comprehensive IP management strategy

 

It’s important to know your IPv6 management needs and be aware of what is required to efficiently and securely manage your IPv6 address space. IPv6 addresses are longer, more complex, and harder to remember. IPv6 environments also rely far less on statically assigned addresses, using SLAAC and DHCPv6 instead. Furthermore, the coexistence of IPv4 and IPv6 addresses in the network increases the complexity of managing the entire IP space.

 

IPv6 is much more complex, so spreadsheets simply won’t work for it: the address boundaries are far harder to calculate and the addresses themselves are much longer. Further, dynamic assignment of addresses makes it very difficult to manually update spreadsheets and maintain up-to-date information.
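
This is exactly the arithmetic that tooling should do for you. Even the Python standard library's ipaddress module handles IPv6 boundaries that would be painful to track in a spreadsheet; here is a quick sketch using the documentation prefix purely for illustration:

```python
import ipaddress

# 2001:db8::/32 is the IPv6 documentation prefix, used here purely for illustration.
allocation = ipaddress.ip_network("2001:db8::/48")

# Carve the /48 into /64s (the usual subnet size) and show the first few boundaries.
subnets = allocation.subnets(new_prefix=64)
for _, subnet in zip(range(4), subnets):
    print(subnet)

print("Total /64 subnets available:", 2 ** (64 - allocation.prefixlen))

# Membership checks are one-liners -- no spreadsheet math required.
print(ipaddress.ip_address("2001:db8:0:2::10") in allocation)  # True
```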

 

In short, IPv6 networks need a comprehensive, automated IP Address Management solution. Utilizing automated software allows you to effectively track all IPv6 addresses in the network, manage IPv6 network boundaries, and track dynamic IPv6 assignments, while helping you ensure that the presence of IPv6 in the network is not creating a security threat.

I started a thread on Twitter this week asking, "Who are some awesome women in monitoring?" One of the common reactions (privately and respectfully, I'm happy to say) has been asking me why I started the discussion in the first place. I thought that question deserved a response.

Because, I'm a feminist. Yes, Virginia, Orthodox Jewish middle-aged white guys can be feminists, too. I think that anything that can be done to promote and encourage women getting into STEM professions should be done. Full stop.

"But why 'women in monitoring'?" I'm then asked. "Why not 'awesome women in I.T.'?"

Running a close second to my first response is that I'm a monitoring enthusiast. I think monitoring (especially monitoring done right) is awesome, a lot of fun, and provides a huge value to organizations of all sizes.

I also think it's an under-appreciated discipline within I.T. Monitoring today reminds me of InfoSec, storage, or virtualization about a decade ago: it was a set of skills, but few people claimed it as their sole role within a company.

I want to see monitoring recognized as a career path, the same as being a Voice engineer or a data analytics specialist.

Of course, this all ties back to my role as Head Geek. Part of the job of a Head Geek is to promote the amazing: amazing solutions, amazing trends, amazing companies, and amazing groups, as they relate to monitoring.

One reason this is explicitly part of my job is to build an environment where those people who are quietly doing the work, but not identifying as part of "the group" feel more comfortable doing so. The more "the group" gains visibility, the more that people who WANT to be part of the group will gravitate towards it rather than falling into it by happenstance.

Which brings me back to the point about "amazing women in monitoring". This isn't a zero-sum competition. Looking for amazing women doesn't somehow imply women are MORE amazing than x (men, minorities, nuns, hamsters, etc).

This is about doing my part to start a conversation where achievements can be recognized for their own merit.

I know that's a pretty big soapbox to balance on a series of Twitter posts, but I figure it's gotta start somewhere.

So, if you know of any exceptional women in monitoring: share their thwack ID or Twitter handle below to help me give them a shout out.

There’s no way to write a good opening sentence for this post. Yesterday we lost Head Geek LGarvin, who passed away of natural causes at home. He’ll be missed by more than just the admins he interacted with on thwack: the huge TechNet community he touched in a decade as a devoted Microsoft MVP, the Houston Rodeo and other organizations where he volunteered, and of course, most of all, his family. I will miss my friend.

 

Lawrence taught me more about how to be a productive technical creative, wonderfully curmudgeonly yet optimistic when published, opinionated yet receptive in debate, and above all how to dig in when you’re right, than anyone I’ve met.  Yes he occasionally sent terrifically long emails, but they were always worth the read, each word with a purpose.  The Exchange server won’t miss them, but the Geek team certainly will.

 

The next SolarWinds Lab will be his last (taped) appearance, and the prospect of his first absence from live chat is difficult to imagine.  However, the video team is putting together a tribute and we’ll get the whole gang together for signoff, so if you’re a fan of Lawrence you’ll want to be there.  I found a couple of fun Lab promos today and linked them below, as well as our first episode when it was just the two of us trying to figure out what the heck we were doing.

 

Larry, you’ll live on in thousands of posts on admin communities all over the world, in your articles, and among your friends here. Thanks for spending this time with us.

 

NO CARRIER


-Patrick


Classic YouTube Lawrence:

 

 

When it comes to the enterprise technology stack, nothing has captured my heart and imagination quite like enterprise storage systems.

 

Stephen Foskett once observed that all else is simply plumbing, and he’s right. Everything else in the stack exists merely to transport, secure, process, manipulate, organize, index, or in some way serve and protect the bytes in your storage array.

 

But it's complex to manage, especially in small/medium enterprises where the storage spend is rare and there are no do-overs. If you're buying an array, you've got to get it right the first time, and that means you've got to figure out a way to forecast how much storage you actually need over time.

 

I've used Excel a few times to do just that: build a model for storage capacity. Details below!

 

Using Excel to model storage capacity

Open up Excel or your favorite open-source alternative, and input your existing storage capacity in its entirety. Let’s say you have (direct, shared, converged, or otherwise) 75 terabytes of storage in your enterprise.

 

Now calculate how much of that is available for use both in absolute terms (TBs!) and as a percentage of used, committed space. Of that 75TB, maybe you've got 13 terabytes left  as usable capacity. Perhaps some of that 13TB is in a shared block or NFS storage array; perhaps all that remains for your use is direct-attached storage inside production servers. If the latter, you need to find a way to reflect the difficulty you’ll have in using the free capacity you have.

 

Now array that snapshot of your existing storage (75TB, 13TB free, 83% Committed) in separate columns, and in a new column, populate Month 1, 2, 3 all the way out to 36 or 72. What's your storage going to look like over time?

 

The reality is that unless you work in one of those rare IT environments that’s figured out the complex riddle of data retention, most IT environments will see storage demand grow over time. Use this to your advantage.

 

Sometimes that growth is predictable, other times it’s not. Reflect both scenarios in the same column. Assume that in months 1-12 there will be a 0.3% demand for more storage against your 13TB of remaining capacity (about 39 GB every month).

 

In month 13, imagine that some big event happens (a new product launch, a merger/acquisition, etc.) and demand that month is extreme: the business needs about 15% of what is by then roughly 12.7TB of remaining storage. If you’re in a place that only has direct-attached storage, this might be the month your juggling act becomes especially acute; what if none of your servers has 2 terabytes of free space?
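
If you'd rather prototype the model outside of a spreadsheet, the same exercise is a dozen lines of Python. The growth rate, spike month, and spike size below are the illustrative numbers from this post, not a recommendation:

```python
capacity_tb = 75.0
free_tb = 13.0

monthly_growth = 0.003                   # 0.3% of remaining free space per month (~39 GB at first)
spike_month, spike_fraction = 13, 0.15   # one-off event consuming 15% of whatever is left

print("month  free_tb  committed_pct")
for month in range(1, 37):
    demand = free_tb * (spike_fraction if month == spike_month else monthly_growth)
    free_tb = max(free_tb - demand, 0.0)
    committed = 100 * (capacity_tb - free_tb) / capacity_tb
    print(f"{month:>5}  {free_tb:7.2f}  {committed:13.1f}")
```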

 

By now your storage capacity model should be taking shape. You can plug in different categories of storage (scale up your S3 or Azure Blob storage to meet sudden spikes in demand, for instance), and you can contemplate disposal of some of your existing legacy storage, too.

 

[Chart: projected storage demand over time]

Once your model is finished, you should be able to make a nice chart to present to your CIO: a model that says you've done your homework answering the question, "How much storage do we need?"

 

“If someone steals your password, you can change it. But if someone steals your thumbprint, you can’t get a new thumb. The failure modes are very different.” –Bruce Schneier

 

In my last post I talked about how the traditional security model is dead, and that companies have to start thinking in terms of “we’ve already been hacked” and move into a mitigation and awareness strategy. The temptation to put a set of really big, expensive, name brand firewalls at the edge of your network, monitor known vulnerabilities, and then walk away smug in the knowledge that you’ve not only checked a box on your next audit, but done all you can to protect your valuable assets is a strong one. But that temptation would be shortsighted and wrong.

 

Since I wrote last, one of the largest security breaches ever—and possibly the most damaging—was reported by the insurance giant, Anthem BlueCross BlueShield. Over 80 million accounts were compromised, and what makes this hack worse than most is that it included names, addresses, social security numbers, income, and some other stuff—pretty much everything that makes up your identity. In other words, you just got stolen. A credit card can be shut down and replaced, but it’s not so easy when it’s your whole identity.

 

Anthem is using wording suggesting that the company was the victim of “a very sophisticated external cyber attack” which, while plausible and largely face-saving, is almost guaranteed to not be the case. While the attack was probably perpetrated by an external entity, the sophistication of said attack is probably not high. In most of these cases it’s as simple as getting one employee inside the company to open the wrong file, click the wrong link, reveal the wrong thing, etc. The days of poking holes in firewalls and perpetrating truly sophisticated attacks from the outside in are largely gone, reserved for movies and nation-state cyber warfare.

 

The one thing we can take from this attack, absent of any further details, is that the company self-reported. They discovered the problem and responded immediately. What isn’t known is how long the attackers had access to the system before the company’s security team discovered and closed the breach. Hopefully we’ll get more information in the coming days and will get a better picture of the scope and attack vector used.

 

So, what do you think of the Anthem attack? Do you have processes in place today to respond to this sort of breach? Would you even know if you’d been breached?

If you are a security practitioner and haven’t heard about the 80 million personal records lifted from Anthem’s database yesterday, you missed some exciting news, both good and bad. Clearly the loss of so many records is bad news and very troubling. However, the good news was that Anthem identified the breach themselves. Even though they caught the breach at the end of the kill chain (see below), they still did catch it before the records were exploited or showed up on a cyber underground sale site.

 

Targeted breaches such as Anthem’s are notoriously difficult to identify and contain, in part because the tradecraft for such attacks is specifically designed to avoid traditional detection solutions such as anti-virus and intrusion detection. So as the FBI tries to determine who hijacked these records, the rest of us are trying to figure out why. Although motive, like attribution, is difficult to nail down, it is a useful data point if you are trying to predict whether your organization is at risk.

 

In the absence of your own security analyst or FBI task force to determine motive or attribution, what can the ordinary practitioner do to lower organizational risk?

 

First – Determine if your organization is a possible target

 

Don’t think that because you are smaller or less well known you are not a target. Cyber thieves not only want data they can sell; they need compute power to launch their attacks from, and they need identities they can use to trick their ultimate target into allowing a malicious link or payload into their environment.

 

Who has not recently noticed a strange email from a colleague or friend that, upon further inspection, turns out not to be from their legitimate email address?

 

Second – Learn the kill chain and use it to validate your security strategy

 

Do you collect information from available sources across the kill chain into your SIEM?  The earlier in the kill chain you identify a potential attack, the lower the risk, and the simpler the mitigation. For example:

 

Collecting and reporting on unusual email activity may allow you to catch a recon attempt. Identifying such behavior might lead you to increase logging on high-value targets such as privileged accounts, domain controllers, or database servers.

 

Another useful indicator is spikes in network traffic on sensitive segments, or increases in authorized traffic exiting the organization.
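
Spotting a spike doesn't require a full analytics platform to get started; a simple baseline-and-threshold check over samples you already collect goes a long way. A rough sketch, where the sample list stands in for whatever your SIEM or flow tool exports:

```python
from statistics import mean, stdev

def flag_spikes(samples_mb, threshold_sigmas=2.0):
    """Flag egress samples that sit well above the baseline (mean plus N standard deviations)."""
    baseline, spread = mean(samples_mb), stdev(samples_mb)
    cutoff = baseline + threshold_sigmas * spread
    return [(i, value) for i, value in enumerate(samples_mb) if value > cutoff]

# Hypothetical hourly outbound MB from a sensitive segment; the last value is the anomaly.
hourly_egress = [120, 118, 131, 125, 122, 119, 127, 124, 121, 126, 123, 980]
print(flag_spikes(hourly_egress))   # -> [(11, 980)]
```

A production version would compute the baseline from a trailing window that excludes the point being tested, but the idea is the same.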

 

In the worst case, by evaluating all log sources and ensuring you are collecting across the kill chain, you will empower your IT or security team to conduct forensics or a post-incident analysis effectively.

 

Finally – Have an incident response plan

 

It does not need to be elaborate, but executives, marketing, and IT should all know who is going to be the team coordinator, who is going to be the communicator, and who is going to be the decision maker.

 

By following these guidelines you are doing your part to leverage the value in your security investment, and reduce organizational risk.

 

About the kill chain.

The kill chain was originally conceptualized and codified by Lockheed Martin. Today it is used by cyber security professionals in many roles to communicate, plan and strategize how to effectively protect their organization.

 

[Diagram: the cyber kill chain]

As Joel Dolisy described in A Lap around AppStack, the first installment of the AppStack series, there are many components in the application stack, including networking, storage, and computing resources. The computing resources include hypervisors, virtual machines, and physical machines that provide applications, infrastructure support services, and in some cases storage. Collectively, we refer to these resources as systems.


Systems really are the root of the application space. In the earliest days of computing, an application ran on a single machine; the user went to that machine to use the application, and the machine had no connectivity to anything (save perhaps a printer). Today, systems make a myriad of contributions to the application space, and each of those contributions has its own monitoring needs.


Server Monitoring

Historically, with the one-service/one-machine approach, a typical server ran at only ten to twenty percent of capacity. As long as the LAN connection between the desktop and the server was working, it was highly unlikely that the server was ever going to be part of a performance problem. Today, it is critical that servers behave well and share resource responsibility with others. (Other servers, that is!) As a result, server monitoring is now a critical component of an application-centric monitoring solution. 

User and Device Monitoring

One component that is often overlooked in the monitoring process is the set of systems used directly by the end-user. The typical user may have two or three different devices all accessing the network simultaneously, and sometimes multiple devices accessing the same application simultaneously. Tracking which devices are being used, who is using them, and how they are impacting other applications, and ensuring that end-users get the optimal application experience on whatever device they’re using, is also part of this effort.

Consolidated Monitoring and Coexistence 

The benefit of monitoring the entire application stack as a consolidated effort is a comprehensive awareness of how the end-user is experiencing their interaction with the application and an understanding of how the various shared components of an application are co-existing with one another.

 

By being aware of where resources are shared, for example LUNs in a storage array sharing disk IOPS or virtual machines on a hypervisor sharing CPU cycles, performance issues affecting one or more applications can be more rapidly diagnosed and remediated. It’s not unusual at all for an application to negatively impact another application, without displaying any performance degradation itself. 

Increased Complexity

The last thing to be aware of is that the complexity of the systems monitoring space is continuing to grow. Virtualization and shared storage was just the first step. For the next blog in this series, Kong Yang will discuss how that impacts the AppStack.

 

The Top 5 Issues in Companies Today (and how monitoring can solve them)

 

I spend a lot of time talking about the value that monitoring can bring an organization, and helping IT professionals make a compelling case for expanding or creating a monitoring environment. One of the traps I fall into is talking about the functions and features that monitoring tools provide while believing that the problems they solve are self-evident.

 

That assumption is often wrong when speaking to non-technical decision makers, and it can come as a surprise that the problems are sometimes not obvious even to a technical audience!

 

So I have found it helpful to describe the problem first, so that the listener understands and buys into the fact that a challenge exists. Once that’s done, talking about solutions becomes much easier.

 

With that in mind, here are the top 5 issues I see in companies today, along with ways that sophisticated monitoring addresses them.

 

Wireless Networks

Issue #1:

Ubiquitous wireless has directly influenced the decision to embrace BYOD programs, which has in turn created an explosion of devices on the network. It’s not uncommon for a single employee to have 3, 4, or even 5 devices.

 

This spike in device density has put an unanticipated strain on wireless networks. In addition to the sheer load, there are issues with the type of connections, mobility, and device proximity.

 

The need to know how many users are on each wireless AP, how much data they are pulling, and how devices move around the office has far outstripped the built-in options that come with the equipment.

 

Monitoring Can Help!

Wireless monitoring solutions tell you more than when an AP is down. They can alert you when an AP is over-subscribed, or when an individual device is consuming larger-than-expected amounts of data.

 

In addition, sophisticated monitoring tools now include wireless heat maps – which take the feedback from client devices and generate displays showing where signal strength is best (and worst) and the movement of devices in the environment.

 

Capacity Planning

Issue #2

We work hard to provision systems appropriately, and to keep tabs on how that system is performing under load. But this remains a largely manual process. Even with monitoring tools in place, capacity planning—knowing how far into the future a resource (CPU, RAM, disk, bandwidth) will last given current usage patterns—is something that humans do (often with a lot of guesswork). And all too often, resources still reach capacity without anyone noticing until it is far too late.

 

Monitoring Can Help!

This is a math problem, pure and simple. Sophisticated monitoring tools now have the logic built-in to consider both trending and usage patterns day-by-day and week-by-week in order to come up with a more accurate estimate of when a resource will run out. With this feature in place, alerts can be triggered so that staff can act proactively to do the higher-level analysis and act accordingly.
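
Under the hood, that logic is mostly a trend line. Here is a bare-bones illustration of the idea: a least-squares fit over recent daily samples, projected forward to the day usage crosses capacity. Real tools layer day-by-day and week-by-week usage patterns on top of this.

```python
from statistics import mean

def days_until_full(daily_used_gb, capacity_gb):
    """Fit a straight line to recent usage samples and project when it hits capacity."""
    xs = list(range(len(daily_used_gb)))
    x_bar, y_bar = mean(xs), mean(daily_used_gb)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, daily_used_gb))
             / sum((x - x_bar) ** 2 for x in xs))
    if slope <= 0:
        return None  # flat or shrinking usage: no projected exhaustion
    intercept = y_bar - slope * x_bar
    return (capacity_gb - intercept) / slope - xs[-1]

# Hypothetical two weeks of disk usage on a 500 GB volume, growing about 5 GB per day.
usage = [400 + 5 * d for d in range(14)]
print(f"Projected exhaustion in about {days_until_full(usage, 500):.0f} days")
```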

 

Packet Monitoring

Issue #3

We’ve gotten very good at monitoring the bits on a network – how many bits per second in and out; the number of errored bits; the number of discarded bits. But knowing how much is only half the story. Where those bits are going and how fast they are traveling is now just as crucial. User experience is now as important as network provisioning. As the saying goes: “Slow is the new down.” In addition, knowing where those packets are going is the first step to catching data breaches before they hit the front page of your favourite Internet news site.

 

Monitoring Can Help!

A new breed of monitoring tools includes the ability to read data as it crosses the wire and track source, destination, and timing. Thus you can get a listing of internal systems and who they are connecting to (and how much data is being transferred) as well as whether slowness is caused by network congestion or an anaemic application server.

 

 

Intelligent Alerts

Issue #4

“Slow is the new down”, but down is still down, too! The problem is that knowing something is down gets more complicated as systems evolve. Also, it would be nice to alert when a system is on its way down, so that the problem could be addressed before it impacts users.

 

Monitoring Can Help!

Monitoring tools have come a long way since the days of “ping failure” notifications. Alert logic can now take into account multiple elements simultaneously such as CPU, interface, and application metrics so that alerts are incredibly specific. Alert logic also now allows for de-duplication, delay based on time or number of occurrences, and more. Finally, the increased automation built into target systems allows monitoring tools to take action and then re-test at the next cycle to see if that automatic action fixed the situation.
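
The "delay based on number of occurrences" idea is worth a tiny sketch of its own, since it is what separates a useful alert from a 3 a.m. false alarm. The class below is my own illustration, not any particular product's logic: it raises an alert only after N consecutive failed checks and then suppresses duplicates until the condition clears.

```python
class AlertGate:
    """Alert only after `required_failures` consecutive failures; de-duplicate after that."""

    def __init__(self, required_failures=3):
        self.required_failures = required_failures
        self.consecutive_failures = 0
        self.alert_active = False

    def record(self, check_passed: bool) -> bool:
        """Return True only at the moment an alert should actually be sent."""
        if check_passed:
            self.consecutive_failures = 0
            self.alert_active = False
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.required_failures and not self.alert_active:
            self.alert_active = True
            return True   # threshold just crossed: send exactly one alert
        return False      # still failing, but we've already alerted (or haven't hit the threshold)

gate = AlertGate(required_failures=3)
results = [True, False, False, False, False, True]   # one blip clears; three failures alert once
print([gate.record(r) for r in results])   # -> [False, False, False, True, False, False]
```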

 

Automatic Dependency Mapping

Issue #5

One device going down should not create 30 tickets. But it often does. This is because testing upstream/downstream devices requires knowing which devices those are, and how each depends on the other. This is either costly in terms of processing power, difficult given complex environments, time-consuming for staff to configure and maintain, or all three.

 

Monitoring Can Help!

Sophisticated monitoring tools now collect topology information using devices’ built-in commands, and then use that to build automatic dependency maps. These parent-child lists can be reviewed by staff and adjusted as needed, but they represent a huge leap ahead in terms of reducing “noise” alerts. And by reducing the noise, you increase the credibility of every remaining alert so that staff responds faster and with more trust in the system.
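
The payoff of a dependency map is easy to express in code: if any ancestor of a down device is also down, suppress the child's alert. A minimal sketch, with a hand-built parent-child map standing in for the topology a real tool would discover automatically:

```python
# Hypothetical topology: child -> parent. A real tool would discover this from the network.
PARENT = {
    "core-sw1": None,
    "dist-sw1": "core-sw1",
    "access-sw7": "dist-sw1",
    "server42": "access-sw7",
}

def should_alert(device, down_devices):
    """Alert on a down device only if none of its ancestors is also down."""
    ancestor = PARENT.get(device)
    while ancestor is not None:
        if ancestor in down_devices:
            return False          # an upstream outage explains this one; suppress the ticket
        ancestor = PARENT.get(ancestor)
    return True

down = {"dist-sw1", "access-sw7", "server42"}   # one upstream failure takes out the chain
for device in sorted(down):
    print(device, "->", "ALERT" if should_alert(device, down) else "suppressed")
```

The result is one actionable ticket for the distribution switch instead of three, which is exactly the noise reduction described above.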

 

So, what are you waiting for?

At this point, the discussion doesn’t have to spiral around whether a particular feature is meaningful or not. The audience just has to agree that they don’t want to find out what happens when everyone piles into conference room 4, phones, pads, and laptops in tow; or when the “free” movie streaming site starts pulling data out of your drive; or when the CEO finds out that the customer site crashed because of a disk that had been steadily filling up for weeks with no one noticing.

 

As long as everyone agrees that those are really problems, the discussion on features more or less runs itself.

As I was catching up on the latest IT industry news, I landed on Amit’s Technology Blog. Amit (@amitpanchal76) was a delegate at this year’s Virtualization Field Day 4. I found it very cool that Amit was highlighting his visit to SolarWinds and providing his view on AppStack in his blog post, “How SolarWinds Aims to Offer a Simple Perspective – VFD4.”

 

In his blog, Amit says, “At the end of the day, application owners and end users only care if their application is working and healthy and don’t want to know about the many cogs and wheels that make up the health.” I’d have to say, Amit’s premise is spot on. And SolarWinds shares this vision and delivers this clear, concise value to IT admins and their end-users. The SolarWinds AppStack removes all the noise from information overload and quickly surfaces the root cause of trouble with an application from a single point of truth.

 

Amit also shares an AppStack use case. He believes that “AppStack will come in useful as you could simply provide a link to a custom dashboard for a particular application and let the application owner have this as their monitoring dashboard.” This customization of the monitoring, troubleshooting, and reporting dashboards to the application owner implies that the platform needs to be easy to use and easy to consume. Because, again, the app owner only cares about whether the app is working and healthy. Ease of use and ease of consumption are core tenets of the products that form AppStack. So AppStack inherits those properties by default.

 

As SolarWinds CTO/CIO Joel Dolisy states, “AppStack is related to the products…such as Server & Application Monitor (SAM), Virtualization Manager (VMan), and Storage Resource Monitor (SRM).” AppStack is a natural extension of the app-centric view with connected context to all the major subsystems like compute, memory, network, and disk across the physical and virtual layers. Amit thinks this would be useful. Do you?

 

For background on AppStack, check out Joel’s AppStack blog post.

Robert Mueller, former Director of the FBI, has said of security that “there are only two types of companies: those that have been hacked, and those that will be”. From Home Depot and Target to Skype and Neiman Marcus, it often seems as if nobody is safe any more. What’s worse is that most of these attacks have come from within the security perimeter and were undetected for long periods of time, leaving the attackers plenty of time to do what they came to do.

 

According to a Mandiant M-Trends report from 2012 and 2013, the median length of time an attacker went undetected in a system after compromise was 243 days with an average of 43 systems accessed. What’s worse is that in 100% of those cases valid credentials were used to access the system, and 63% of victims were notified of the breach by an external entity. Those are certainly not promising statistics for those of us trying to manage IT operations for a large enterprise. It’s even worse for smaller companies who can’t staff quality security personnel.

 

While some companies might believe they are immune, or not a high value target based on any number of factors, they couldn’t be more wrong. Hackers these days are not the script-kiddies of the last 20 years, but rather nation states or organized collectives with a variety of motivations. Sometimes the attackers are looking for money and target credit cards, other times they are looking for identities—social security numbers tied to names and addresses—with which they can take a more advanced and longer term view of the value of their attack.

 

The other threat that a lot of companies fail to plan for, however, is reputation damage. Even if you have nothing of value that a hacker might want access to, you likely have a brand whose value can be severely and systemically damaged. Consider the most recent attacks against Yahoo, Sony Corporate, and both the PlayStation and Xbox networks. These attacks may not cause direct financial damage, but the lasting brand damage can cost millions more.

 

Given this current state of security, I’m curious what you do to secure your network and monitor for advanced persistent threats against your infrastructure. Are you relying on logging and firewalls alone, or have you moved into a more advanced monitoring model?
