
Geek Speak


This month, we’ve shined our IT Blogger Spotlight on Larry Smith, who runs the Everything Should Be Virtual blog and tweets as @mrlesmithjr. As usual, we’ve asked some deeply philosophical questions about everything from the nature of truth to the meaning of life. OK, maybe not, but we still had fun chatting with Larry. Check it out below!

 

SW: Let’s mix things up a bit this month and first talk about you before we get to Everything Should Be Virtual. Who is Larry Smith?

 

LS: Well, I’m currently a senior virtualization engineer for a major antivirus company, but prior to that I worked for a major retail company that specialized in children’s clothing. Overall, though, I’ve been in IT for 19-plus years. I’ve done everything from network administration to systems engineering. And when I’m not working or blogging—which is extremely rare, it seems—I most enjoy spending time with my family.

 

SW: Wow, 19 years in IT! That’s some staying power. How did it all begin?

 

LS: I started programming when I was about 12 years old on a TRS-80 back in the early 1980s. I always knew I wanted to be in the computer field because it came very naturally to me. However, I decided after attending college for a while that programming was not for me! So, I started getting more involved with networking, servers and storage. And then I got into x86 virtualization in the late 1990s and early 2000s.

 

SW: With such an illustrious career, surely you’ve come to hold some tools of the trade in higher esteem than others. Any favorites?

 

LS: Really, my favorite tools are anything automation-related. I enjoy writing scripts to automate repeatable tasks, so I'm learning PowerCLI, and I like anything Linux-based when it comes to shell scripting. I also really enjoy finding new open source projects that I feel I can leverage in the virtualization arena.

 

SW: OK, switching gears now, tell me about Everything Should Be Virtual. Judging by the name, I’m guessing it kind of sometimes touches ever-so-lightly on virtualization.

 

LS: You guessed it. Everythingshouldbevirtual.com is, of course, focused on why everything should be virtual! This could be about an actual hypervisor, storage, networking or any type of application stack that can leverage a virtual infrastructure. I enjoy learning and then writing about new virtualization technologies, but I also really enjoy metrics, so I spend a lot of time writing about performance data and logging data, again with the main focus being virtualization. I spend a great deal of time using Linux—Windows, too—but what I find is that it is extremely difficult to find a good Linux post that is complete from beginning to end. So, my goal when writing about Linux is to provide a good article from beginning to end, but also to create shell scripts that others can utilize to get a solution up and running with very minimal effort. I do this because I want something that is repeatable and consistent, while also understanding that others may not necessarily want to go through some of the pain points of getting a Linux solution up and running.

 

SW: How long has it been around now?

 

LS: I started it in 2012, so a couple of years now. I got started with blogging as a way to keep notes and brain-dump the day-to-day activities I encountered, especially if they revolved around something out of the norm. The more I blogged, the more I realized how beneficial some of the posts were to others as well. This, of course, inspired me to write even more. I’ve always had a passion for learning at least one new technology per week, and the blog allows me to share with others what I’m learning in hopes of helping someone else.

 

SW: Any specific posts stick out as ones that proved most helpful or popular?

 

LS: Yeah, some of my most popular posts are around metrics and logging—Cacti, Graylog2 and the ELK (Elasticsearch, Logstash, Kibana) stack. While these are typically the most popular, there are probably just as many hypervisor-based articles that are really popular as well. I think this shows the value you can provide to the community as a blogger.

 

SW: As per the norm, let’s finish things off with your perspective on the most significant trend or trends in IT that have been on your mind lately.

 

LS: One of the major trends that's still fairly new and will be a real game changer is software defined networking (SDN). I have the luxury right now of learning VMware NSX in a production environment from the ground up, so I am extremely excited about this development. This area is really going to set the stage for so much more to come in the future. Obviously, another area that I have enjoyed watching take shape is storage. The idea of getting away from expensive shared SAN arrays makes a lot of sense in so many ways. Being able to scale compute and storage as your requirements change is huge. Instead of just rolling in an expensive SAN array and then having to pay very expensive scaling costs in the future, you can scale in smaller chunks at a more reasonable cost, which also provides more compute resources. Here's a link to a post I wrote a few months back that explains a bit more about using VSAs or VSAN.

brad.hale

Announcing NPM 11

Posted by brad.hale Jul 30, 2014

Stop the finger pointing with NEW Deep Packet Inspection & Analysis


We know you've been there. The network is always the first to be blamed when application performance is poor. Then the finger pointing begins. It's the network! No, it's the application! Now you can tell whether application performance problems are the result of the network or the application with Network Performance Monitor version 11's deep packet inspection and analysis.

 

Using packet traffic captured by network or server packet analysis sensors, NPM's new Quality of Experience dashboard analyzes and aggregates the information into an easy-to-read view.

These statistics make it easy to tell at a glance not only if you have a user experience issue, but whether the issue is a problem with the network or the application.  In addition to response time statistics, we are capturing aggregate volume metrics, and have the ability to classify application traffic on your network by risk-level and relation to typical business functions.
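To make that distinction concrete, here is a minimal, hypothetical sketch (not NPM's actual logic) of how packet-derived timings can separate the two: the TCP handshake round trip approximates network latency, while the gap between the request and the first response byte approximates server-side application time. The budget values are illustrative assumptions.

```python
def blame(handshake_rtt_ms: float, time_to_first_byte_ms: float,
          network_budget_ms: float = 100.0, app_budget_ms: float = 500.0) -> str:
    """Rough attribution of slowness from two packet-derived timings.

    handshake_rtt_ms      -- SYN to SYN/ACK round trip (network latency proxy)
    time_to_first_byte_ms -- request to first response byte (includes server time)
    Budgets are assumptions; tune them to your own environment.
    """
    network_slow = handshake_rtt_ms > network_budget_ms
    app_slow = (time_to_first_byte_ms - handshake_rtt_ms) > app_budget_ms
    if network_slow and app_slow:
        return "network and application"
    if network_slow:
        return "network"
    if app_slow:
        return "application"
    return "neither (within budget)"

print(blame(20, 1400))  # application
print(blame(250, 300))  # network
```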

 

[Screenshot: Quality of Experience dashboard]

[Screenshot: Business vs. social application traffic]


Learn more


For customers under active maintenance, NPM 11 may be found in your SolarWinds Customer Portal.


You can learn more about NPM 11 and deep packet inspection and analysis here.


Sign up for SolarWinds Lab, where the topic of discussion will be the new QoE functionality: July 30th, 1PM CT - LAB #16: Deep Packet Inspection Comes To NPM


Register for our upcoming Quality of Experience Monitoring with Deep Packet Inspection and Analysis from SolarWinds webcast:  Thursday, August 7, 2014 11:00 AM - 12:00 PM CDT


Organizations implement or plan to implement virtualization for benefits such as cost savings, improved server utilization, and reduced hardware. Many of these benefits, however, can be limited by the additional complexity of management, configuration, monitoring, and reporting that the new layer of abstraction introduces.

 

Given the ease of creating new VMs and how dynamic most virtual environments are, they can be very difficult to manage manually using only the raw information provided by the hypervisor. As a result, one simple way to increase the return on investment is to adopt an automated virtualization management tool such as SolarWinds’ Virtualization Manager. Such a tool will aid in configuring, monitoring, and managing virtual servers. Moreover, a virtualization management tool automates the process of data collection and report generation (CPU usage, storage, datastores, memory usage, network consumed, etc.), reducing the time and effort invested in manual monitoring and administration.

 

Time and money are valuable to every organization, and a virtualization management tool reduces both. Some of the ways it does so:

  • Reduced number of admins needed to manage VMs
  • Faster problem resolution reducing downtime
  • Foresee upcoming issues and assist in proactive troubleshooting
  • Avoid VM sprawl and improve resource utilization
  • Reporting made easier and faster

 

Now, let’s take a look at some return on investment figures with examples that explain how a virtualization management tool reduces both time and money spent.

Consider an organization with about 50 hosts and 100 sockets (i.e., 2 sockets per host). At an average of only 5 VMs per socket or CPU (it can easily be higher), that gives a system of about 500 VMs to manage. Below we look at some of the key factors driving the cost and efficiency of such a system, and the potential cost implications of not having a management tool.


Uptime:

Adopting a virtualization management tool in an organization definitely increases uptime, as the tool aids in faster problem resolution. Also, virtualization management tools can help foresee or predict the most pressing issues and, in turn, increase the uptime of VM servers.

According to this EMA whitepaper, the average uptime for a virtual environment is 99.5%. Utilizing a virtualization management tool easily increases uptime to 99.95%, with best-in-class operations achieving 99.999%.


VM sprawl:

VM sprawl occurs when VMs are created and left without proper management. This situation causes bottlenecks in system resources and can lead to performance problems as well as the economic impact of wasted resources. Keeping track of IOPS, network throughput, memory, and CPU of the VM helps find VM sprawl. A virtualization management tool helps identify idle or zombie VMs and can eliminate resource waste in the organization.

An article by Denise Dubie, using data from an Embotics study, estimates that an organization with 150 VMs has anywhere from $50,000 to $150,000 locked up in redundant VMs. This includes costs from infrastructure, management systems, server software, and administration resources and expenses.


Admin headcount:

A VM admin must configure, monitor, and manage VM servers. To get the best VM performance, admins should configure each VM with only the resources (processors, memory, disks, network adapters, etc.) required by the user. A virtualization management tool will help monitor the performance (CPU utilization, VM memory ballooning, memory utilization, etc.) of the configured VM servers.

Without a VM management tool in place, admins face constant challenges with monitoring VM environments. Some of these challenges include:

  • Identifying bottleneck issues, especially ones related to storage I/O. Identifying CPU and memory issues is fairly simple; however, a storage bottleneck can be difficult to pin down because it can stem from many causes (a slow disk, heavy network traffic, too many VMs on a SAN volume, etc.).
  • Monitoring usage trends, system availability, and resource bottlenecks. For these results, admins should correlate data from multiple places.
  • Storage allocation is also an on-going challenge when it comes to VM management.

 

Finding the true problem without a VM management tool can be a nightmare. A virtualization management tool provides insight into VM performance, delivers key metrics, gives visibility into the root causes of resource contention, and helps optimize virtual environments. A study by EMA indicates that, on average, a virtual admin can manage 77 VMs without a virtualization management tool. By contrast, experts say that virtual admins can manage up to 150 VMs with a virtualization management tool in place. This means the business can scale more easily with existing staff.

Also, do not forget the expense of downtime, represented by MTTR (mean time to repair). Extra admins are required when an issue occurs. According to VMware®, it takes 79 minutes for an admin to fix an issue without a virtualization management tool and 10 minutes with one, thus reducing downtime in a VM environment. A virtualization management tool also reduces the dollars paid to admins, because issues take less time to resolve. Uptime of 99.5% a year means 43.8 hours of downtime per year; with a virtualization management tool, uptime increases to 99.95% (4.4 hours of downtime per year). Assuming an average cost of $50/hour for each of two IT administrators to identify and solve each problem, an organization will pay $4,380 in direct labor for virtual troubleshooting without a virtualization management tool (2 admins x 43.8 hours x $50), compared to $440 (2 x 4.4 x 50) with one, for an annual savings of $3,940.
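The downtime-labor arithmetic above is easy to reproduce; here is a small sketch using the same assumptions (8,760 hours in a year, two admins at $50/hour):

```python
HOURS_PER_YEAR = 24 * 365   # 8,760
ADMINS = 2
RATE_PER_HOUR = 50          # dollars per hour, per admin

def downtime_labor_cost(uptime_pct: float) -> float:
    """Direct troubleshooting labor for a given yearly uptime percentage."""
    downtime_hours = HOURS_PER_YEAR * (1 - uptime_pct / 100)
    return downtime_hours * ADMINS * RATE_PER_HOUR

without_tool = downtime_labor_cost(99.5)   # 43.8 h of downtime -> $4,380
with_tool = downtime_labor_cost(99.95)     # 4.38 h of downtime -> $438 (the post rounds to 4.4 h / $440)
print(round(without_tool), round(with_tool), round(without_tool - with_tool))
```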


Revenue:

It’s also important to consider how an outage or downtime will impact revenue. Revenue differs from one organization to another; based on a report from the National Center for the Middle Market at Ohio State University, most midsize organizations' revenue is between $10 million and $1 billion per year. Organizations these days are heavily dependent on their IT environment. During VM downtime, let's assume the organization loses half of the revenue it would have earned during that period. For an organization with revenue of $10 million per year, that works out to roughly $1,142 of revenue per hour ($10M / 8,760 hours), so a loss of about $571 per hour when the VM environment fails.

Using publicly available data sources, we compiled the data to estimate the type of savings that could be possible using an automated virtualization management tool. We then compared it to manually managing the virtual environment using virtual infrastructure/hypervisor information. The virtualization management software licensing costs are based on SolarWinds Virtualization Manager published prices and may be higher for other management products.


Table 1 -Estimated Cost Savings with Virtualization Management (based on a 500 VM environment):

 

Category | Quantity | Cost per unit | Cost ($/yr) | Comments

No virtualization management tool
Downtime | 43.8 hr/year | $571 per hour | $24,999 |
VM sprawl | 500 VMs | $50K per 150 VMs | $166,665 | 150 VMs have $50,000 to $150,000 lost to VM sprawl; taking the minimum $50,000 gives $333 per VM
Admins - daily routine | 6 persons | $50 per hour | $360,000 | 50 weeks x 5 days x 8 hrs x $50/hr x number of admins (= $100,000 x number of admins)
Admins - to work on issues | 43.8 hrs | $100 per hour / 2 admins | $4,380 | hours of downtime x number of admins x cost per hour
Total | | | $556,044 |

1st year of virtualization management tool
Downtime | 4.4 hr/year | $571 per hour | $2,511 |
VM sprawl | 500 VMs | | $41,666 | With a VM management tool, assume VM sprawl decreases to 25% of the baseline figure
Admins - daily routine | 3 persons | $50 per hour | $180,000 | 50 weeks x 5 days x 8 hrs x $50/hr x number of admins (= $100,000 x number of admins); VMs per admin roughly double, from 77 to 150
Admins - to work on issues | 4.4 hrs | $100 per hour / 2 admins | $440 | hours of downtime x number of admins x cost per hour
Cost of virtualization management tool (SolarWinds) | | | $23,995 | List price of a VM112 license (up to 112 sockets) for SolarWinds Virtualization Manager
Total | | | $248,612 |

2nd year onwards of virtualization management tool
Downtime | 4.4 hr/year | $571 per hour | $2,511 |
VM sprawl | 500 VMs | | $41,666 | With a VM management tool, assume VM sprawl decreases to 25% of the baseline figure
Admins - daily routine | 3 persons | $50 per hour | $180,000 | number of admins x cost per hour x 8 hours x 365 days
Admins - to work on issues | 4.4 hrs | $100 per hour / 2 admins | $440 | hours of downtime x number of admins x cost per hour
Maintenance cost of virtualization management tool (SolarWinds) | | | $4,799 | Estimated annual maintenance charge
Total | | | $229,416 |

 

Using this data, Table 2 provides a summary of cost savings estimated at $307K for the first year, and slightly more for the second year with a very strong ROI.


Table 2 - Virtualization Management Estimated Savings


Costs with no VM Mgmt (baseline) | $556,044
Cost with VM Mgmt - Year 1 | $248,612
With VM Mgmt - Year 1 savings | $307,432
Year 1 ROI | 12.8
Cost with VM Mgmt - Year 2+ | $229,416
With VM Mgmt - Year 2 savings | $326,628
Year 2 ROI | 68.1


While these costs will vary for each company, they help illustrate how quickly an automated virtualization management tool can pay for itself. Using a similar methodology, a company could plug in alternate values that are applicable to its situation to customize the estimate. These ranges may be high, but the opportunity to produce meaningful savings is certainly there, so it's probably worth the effort to evaluate potential savings in your environment.
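As a starting point for plugging in your own numbers, here is a rough sketch of the Table 2 arithmetic (savings and ROI derived from the yearly totals in Table 1); the figures are the illustrative ones used above.

```python
def savings_and_roi(baseline_cost: float, cost_with_tool: float, tool_spend: float):
    """Savings and ROI as computed in Table 2.

    baseline_cost  -- yearly cost with no management tool (Table 1 total)
    cost_with_tool -- yearly cost including the tool (Table 1 total)
    tool_spend     -- the portion of that cost spent on the tool itself
    """
    savings = baseline_cost - cost_with_tool
    return savings, savings / tool_spend

print(savings_and_roi(556_044, 248_612, 23_995))  # year 1:  (307432, ~12.8)
print(savings_and_roi(556_044, 229_416, 4_799))   # year 2+: (326628, ~68.1)
```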

 

To conclude, virtual environments are dynamic and new VMs are constantly being created, making them very difficult to manage manually. The facts and numbers indicate there’s a substantial upside to utilizing a virtualization management tool. You can find out more about SolarWinds’ Virtualization Manager on our product page.

Stephen Covey, creator of the “7 Habits of Highly Effective People,” often said, “As long as you think the problem is out there, that very thought is the problem.”

There’s truth to that statement. But all motivational phrases and buzzwords aside, you can expend a lot of mental energy dwelling on the fact that there are problems in your day-to-day endeavors. Of course, when your role in life is a system administrator, you are no stranger to a plethora of problems. In fact, the notion that there are problems out there in your IT world is what keeps you moving each day.


The Information Technology Infrastructure Library (ITIL) tells us that a problem is the unknown root cause of one or more existing or potential incidents. Furthermore, it stands to reason that an incident is an event which is not part of the normal operation of an IT service. For example, if the employees within your corporate IT network are unable to access network related services like email, printing, and Internet services, it may be a problem to them, but according to ITIL, that’s an incident. If the network access is hampered by an issue with the core router, that’s the root cause of the incident and hence, the actual problem.


Enter the help desk. Given that problems are your help desk’s reason for living, proactive problem and incident management is necessary for effective IT support and problem resolution.

Your help desk software can do a lot toward proactively managing your IT administration issues. Here are some useful features for an effective help desk solution:

  • Ability to integrate with network performance and monitoring solutions.
  • Ability to set up your help desk solution to receive and assimilate alert data and automatically assign tickets to specific technicians (see the routing sketch after this list).
  • Automatic triggering of new alerts and note updates on existing tickets according to changes within the network parameters.
  • Configurable variables to filter alerts based on severity and location. (This can provide information about your operating system, machine type, IP address, DNS, total memory, system name, location, alert type, etc.).
  • Simplified ticketing, such as linking an unlimited number of incident tickets to a single problem.
  • The necessary tools, ITSM incident diagnostics, and ticket routing features, allowing for better integration and relationships with knowledge base articles, CMDB asset associations, service requests, known problems, change requests, and SLAs.
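As a toy illustration of that routing idea (generic logic, not any particular help desk product's API; the queue names and fields are made up), an incoming alert could be mapped to a technician queue by severity and location:

```python
# Hypothetical routing table: (severity, location) -> technician queue.
ROUTES = {
    ("critical", "datacenter"): "network-oncall",
    ("critical", "branch"):     "field-team",
    ("warning",  "datacenter"): "ops-queue",
}

def route_alert(alert: dict) -> str:
    """Pick a ticket queue for an incoming monitoring alert."""
    key = (alert.get("severity", "warning"), alert.get("location", "branch"))
    return ROUTES.get(key, "triage")  # anything unmatched goes to a triage queue

print(route_alert({"severity": "critical", "location": "branch",
                   "node": "router-12", "alert_type": "node down"}))  # field-team
```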


Getting to the root cause of incidents and resolving problems fast creates a good-to-great situation for your help desk department and staff. When customer satisfaction is key, allying yourself with an efficient help desk solution is how you achieve that.

Here's the deal. I have a Windows 7 laptop. Updates are pending, they get installed, machine reboots, updates fail. Rinse, repeat daily. Every day I go through this and I'm LOSING MY MIND! The updates keep accumulating and nothing ever gets installed. I just get failure, after failure, after failure! @#$%^ Microsoft!

wu.png

Here's what I have done so far:

  • Reboot - nothin
  • System restore - nothin
  • Download each update individually - still fails
  • MS Fix It tool - didn't help
  • A/V off
  • Admin rights
  • Uninstall previous updates - still nothing
  • Clear all the logs and .dat files - nothing helped
  • No viruses or any other bad thingys. All clear.
    • Will attempt to uninstall the Windows Update Service - that might help, but I doubt it.
    • My last resort is the battery re-seating, which actually WORKED in the following scenario: Troubleshooting. (The Hard Way.)

 

Anyone have any other ideas short of formatting? I'm losing it here. Coming close to a "Naked Bronx in a clock tower" moment. Thanks.

Mobile computing is pervasive, with more and more work done on mobile devices. Folks may access your website through a mobile browser or a native mobile application. Does this change the entire web performance monitoring stack, or can we use the same applications to monitor both mobile and web apps?

 

There seems to be a lot of discussion around mobile application performance monitoring, which provides end-to-end stack visibility, all the way from the device to the back-end services and infrastructure, plus real-time visibility into end-user experience. With a large variety of devices in the wild, how do you measure end-user experience and ensure that your mobile services are working as well as your website? The metrics you report on will be different from standard website performance monitoring. In fact, by being able to track deeper into the user experience on mobile, you get more insight into how users actually use your web services.

 

Mobile applications are at times more complex, since you're dealing with different devices, software versions, network coverage, and latency. It's imperative you have insight into how these areas are performing in addition to your website. Incorporating crash reporting into your mobile apps provides an additional level of intelligence that you likely won't get from your website alone.

 

Some folks see mobile application monitoring as another level up the stack, but I see it as on par with web application monitoring in terms of getting metrics on user experience. I also believe that mobile application performance monitoring is a set of tools used in addition to your web performance monitoring, but in some cases it may be the same product.

 

Are you measuring end-user experience on mobile devices?

 

Do you feel mobile application performance monitoring is a different discipline than web performance monitoring?

 

Do you correlate issues between your mobile web applications and your website?

 

Are you transitioning more of your web applications to mobile and how are you monitoring their performance?

 

Do you feel mobile application performance monitoring is a new area of growth or just a fad?

In my last post, I talked about some of the limitations of DPI and the kinds of tools used to combat them. However, DPI is useful in a couple of other scenarios, and in this final post I'll cover those as well.


Traffic Optimisation


As traffic flows across a network, it needs to be managed as it crosses points of contention. The most common points of (serious) contention are WAN links, but even inter-switch links can be put under pressure. Simple source/destination/port policies are often used to match protocols against prioritisation queues. However, for much the same reason port matching is not good enough for security (lack of specificity), it’s not really good enough for traffic optimisation.


In an enterprise network, consider a branch office that accesses applications in a server farm at a remote data centre. Creating a class for HTTP alone isn’t much use, since it’s used by many distinct applications. Creating a class based upon destination subnet alone isn’t going to be much cop either: in virtualized or VDI environments, server IPs are going to change often. IP/port classes are helpful when you need to pull protocols (such as Citrix or Oracle) to the top of the pile, but that isn’t good enough in the context of highly contended or long-haul WAN links.


DPI is used by packet shaping and WAN optimization devices to identify traffic flows at a molecular level: down to the individual user and/or Citrix/virtual desktop application. This is necessary for two reasons:


  1. So that the administrator may apply granular policies to individual applications
  2. To identify traffic that may be optimized, either through simple priority access to bandwidth or through something more invasive such as Layer 4-7 protocol optimization.


In the context of protocol optimisation (read: WAN optimization or acceleration), correctly identifying traffic flows is critical. As an example, many years ago Citrix moved the default port for sessions from TCP 1494 to TCP 2598. Many bandwidth management policies identified Citrix by the TCP port alone. When the port moved, unless the network or Citrix administrator was paying particular attention, latency-sensitive traffic was thrown in with the “best effort” class. Unsurprisingly, this usually resulted in an overwhelmingly negative user experience.
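As a tiny, hypothetical illustration of why port-only classification is brittle (the class names are invented), a rule keyed on Citrix's old default port silently stops matching once the default changes:

```python
# Port-keyed QoS classification -- the brittle approach described above.
PORT_CLASSES = {1494: "interactive", 443: "transactional"}

def classify_by_port(dst_port: int) -> str:
    return PORT_CLASSES.get(dst_port, "best-effort")

print(classify_by_port(1494))  # "interactive"  -- Citrix ICA on its old default port
print(classify_by_port(2598))  # "best-effort"  -- the newer default is silently deprioritized
```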


Troubleshooting


Deep packet inspection is incredibly useful when it comes to troubleshooting network behaviour. DPI can be used to identify applications on the network (as in the Citrix example above), but also to identify the behaviour of applications, and it is the final tool for spotting "something bad" happening on the network.
Just to recap, here is a summary of the “something bad” that DPI technologies can address:


  1. Firewall PAD: Filtering out a deliberately malformed packet that would crash a web server
  2. Firewall Signature: Identifying an otherwise correctly formatted packet that produces bad behavior on a web server that otherwise would lead to an exploit (such as dropping a root kit onto a vulnerable host)
  3. Firewall SSL Inspection: Looking into an encrypted session for either of the previous two attacks.
  4. Traffic Optimisation: Identifying and limiting applications that wish to use excessive amounts of bandwidth
  5. Troubleshooting: Identifying applications that are behaving badly despite being "clean" from a security perspective and correctly optimized.


Making this a bit more real world, consider the following scenario. External users complain that a web application you host is performing very badly. Firewall PAD and signatures show that the traffic between the client and the web server is clean, and there are no other apparent attacks. Traffic optimization ensures this critical application has priority access to bandwidth between the web server and its database server. However, performance still sucks. DPI tools can be used to analyze the flows between the client, web, and database server. With access to the entire application flow, poorly optimized SQL queries may be identified. This cascade effect can only be identified by tools that understand the application flows not from a security or bandwidth perspective, but in their native context. In my view, these kinds of tools are the least well-implemented and understood, and their widespread and proper use could massively improve time-to-resolution on faults and surface many issues that previously went unnoticed.


Deep Packet Inspection techniques are used in many places on the network to address a variety of challenges; I’ve focused on security but there are many other applications of DPI technology. And hopefully I’ve made clear that the  correct identification of network traffic is critical to the proper operation of networks and application management in general.


This is my last post in Geek Speak, and I'd like to thank everyone who's read and commented. I've really enjoyed writing these articles for this community, and the feedback has steered the conversation in a really interesting direction. I've no doubt that I'll continue lurking here for some time to come!


Peace


Glen Kemp (@ssl_boy)




Virtualization came as a boon to the IT industry, and many organizations have benefited from investing in a virtual infrastructure. Showing a strong ROI and delivering consistent performance and availability of a virtualized environment are top priorities for any IT or virtual admin. Despite its advantages, virtualization infrastructure can have its own issues that cause performance bottlenecks, resulting in VM and application downtime. Here are some best practices on how to address them:


Datastore IOPS & Latency:

Latency occurs at various levels of the VM architecture and storage stack, typically due to:

  • Increased resource contention:
    Storage systems are sized for a certain workload and can only handle so much. If the workload increases from one VM to another, it can cause performance problems with shared storage IOPS at the data center or storage layer that impact other VMs on the shared resources.
  • Improper configuration:
    When settings like cache size and network-specific settings for iSCSI (Internet Small Computer System Interface) and NFS (Network File System) aren’t properly set, or the host is configured with the wrong multipathing policies, performance issues can occur.
  • Architecture Isn’t properly designed:
    If the storage architecture isn’t properly designed, it can’t support the VM workload. Examples of poorly designed architecture include low-RPM HDDs behind a RAID level with a high write penalty, or write and array caches that aren’t sized properly. Problems that occur in the physical server can impact the application layer as well, so it’s important to understand how VMs will react to your physical server.


To overcome latency, it’s important to monitor LUNs, datastores, arrays, etc. Beyond just monitoring, it’s also important to understand the high-volume transactions that consume most of the server’s resources and impact performance. Lastly, define IOPS limits to keep a single VM from flooding the array.
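A minimal sketch of that monitor-and-cap idea, with metric names and thresholds invented for illustration (real numbers would come from your hypervisor's stats API or a monitoring tool):

```python
# Illustrative per-datastore metrics; replace with values pulled from your environment.
DATASTORES = {
    "ds-sql01": {"latency_ms": 32.0, "iops": 5400},
    "ds-web01": {"latency_ms": 8.5,  "iops": 900},
}

LATENCY_BUDGET_MS = 20.0   # assumed thresholds -- tune to your workloads
IOPS_BUDGET = 4000

for name, m in DATASTORES.items():
    if m["latency_ms"] > LATENCY_BUDGET_MS or m["iops"] > IOPS_BUDGET:
        print(f"{name}: {m['latency_ms']} ms latency, {m['iops']} IOPS -- investigate or cap")
```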


Memory Ballooning or Swapping:

The hypervisor or host allocates memory resources to all of the VMs in the infrastructure. To reclaim memory when it runs short, the hypervisor uses a balloon driver, which actually resides in your guest OS. This balloon driver searches for free memory available in the VM. Once the balloon driver finds free memory, it pins (or reserves) it so that the VM doesn’t consume it. The balloon driver then tells the hypervisor it can take back the pinned memory, allowing it to be allocated to another VM.


When memory ballooning starts increasing, the guest OS starts paging to disk, which leads to high I/O and can cause performance issues for the VM. Additionally, memory ballooning can lead to memory swapping, which isn’t good for the virtual infrastructure: disks aren’t as fast as RAM, so swapping creates a bottleneck.


Memory management is a key area to investigate to improve performance of your virtual environment. Some tips to help you with managing & troubleshooting memory swapping include:

  • Allocate sufficient memory to the VM to support the application
  • Avoid over-allocating memory so that unused memory can be used to support other VMs
  • Keep track of memory ballooning to stop swapping


Lack of resources caused by poor resource utilization:

Zombie, stale, orphaned, and over-allocated VMs are among the top causes of performance bottlenecks in any virtual infrastructure. They consume resources that could be used by other VMs that actually need them for a productive infrastructure. Some recommendations to control VM sprawl:

  • Systematically compare allocated to used resources to right size VMs
  • Scan for idle or zombie VMs using compute resources
  • Remove old and large snapshots consuming storage
  • Confirm the requirement for VMs with all business departments
  • Analyze resources required before creating the VM
  • Scale resource consumption at each level of the virtual infrastructure

 

By controlling VM Sprawl you can save resources, increase IT productivity, and postpone purchasing of new hardware.


Virtualization is one of the hottest topics within the IT industry. Moreover, its continued growth opens the door for new product solutions & methods to monitor virtual environments. Therefore, there’s constant pressure on virtual admins to deliver consistent performance and availability of their virtualized environment. Performance bottlenecks cause VM and application downtime, or worse, business downtime. However, these bottlenecks can be avoided when you have a proactive monitoring system in place that will alert you of VM & application performance issues before the end-user is impacted. Learn how you can avoid performance bottlenecks and how to monitor and manage your virtual environment.

Resolving WAN issues can be time consuming and challenging—especially at enterprise companies with distributed networks. A common approach could look like this: The network administrator troubleshoots the problem with a local, disconnected management tool or logs into a router remotely and then checks logs. Another approach could be to send someone to the location with a packet sniffer to diagnose the problem. Although these approaches would work, they tend to be overly time consuming and we all know that in IT, Time = Money.

 

Today, business processes are becoming more and more communications-dependent. As a result, continuous network availability has become a focal point for IT. The challenge to avoid network downtime becomes even greater in enterprises with distributed networks. Unmonitored branch offices pose a significant risk to network performance, availability, reliability, and security. Network engineers handling geographically dispersed locations are well acquainted with the difficulty of obtaining a collective understanding of the entire multi-vendor, multi-device network from a centralized location.

 

Some of the biggest challenges in distributed enterprise networks include:

  • Overcoming enterprise branch-office visibility blind spots
  • Tracking and measuring business-critical services at branch locations
  • Proactively monitoring to improve end-user experience
  • Mitigating network issues at offsite locations

 

With a global view of the network, administrators have the required data to analyze issues and also plan for requirements. Whether it’s making informed decisions, monitoring critical parameters, or being notified on thresholds crossed, seamless and centralized network visibility aids in reducing network downtime and simplifies troubleshooting of network issues. 


A centralized view of branch office network monitoring provides:

  • Reduced time to troubleshoot and resolve branch network problems from the central location
  • The ability to analyze performance trends in the entire network
  • Integrated and proactive network management

 

Whether you’re at a small company or an enterprise, it all boils down to how quickly the administrator can figure out the network problem. It could be a configuration issue on a router or switch, an application like backup task taking up network bandwidth at peak hours, poor hardware performance on a server, or any number of other things. In every situation, it’s important to think about what is at risk and how long it will take to fix the issue.

 

In addition, a lot of the focus is moving away from the reactive ‘break-fix’ approach and more toward predictable, reliable performance—which is proactive. However, this requires total visibility and continuous monitoring. This is possible only with an integrated network management approach.

 

SolarWinds Enterprise Operations Console (EOC) is one such tool that consolidates critical monitoring data from multiple locations onto a single screen. Check out the new EOC 1.6 that focuses attention on mission-critical issues with global Top 10 views of bandwidth utilization, response time, CPU, memory, disk space utilization, Web applications, virtual hosts, and more.

I am a big fan of data and metrics (that is, data we can turn into knowledge, and Commander Data from Star Trek). In IT we tend to collect a lot of it, especially as it relates to our infrastructure, and reporting it to our business folks is sometimes a challenge. For one, we tend to report in terms we understand; take, for example, DNS resolution time, TCP connection time, HTTP redirect time, and full page object load time. But do business folks really understand what we are saying?

 

It's important that our systems and reporting translate those metrics into understandable business language: an intuitive web dashboard that helps both IT and the business understand the impact of a web performance metric. Take, for instance, a simple dashboard with a green-yellow-red system for indicating problems, where the metric next to it is a roll-up of different IT data points. It's so much easier to deal with "Web forums from Austin are OK" than "process xyz on server-abc is not running." It's easier because you know the complete impact of the issue rather than focusing on one item and not understanding how it's tied to others.
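Here is a minimal sketch of that kind of roll-up, with the metric budgets invented for illustration: several IT-level timings collapse into one business-facing green/yellow/red status.

```python
# Raw IT metrics for one business service, in milliseconds.
metrics = {"dns_resolution": 40, "tcp_connect": 90,
           "http_redirect": 120, "full_page_load": 4200}

# Assumed budgets per metric: (yellow threshold, red threshold).
budgets = {"dns_resolution": (100, 300), "tcp_connect": (150, 400),
           "http_redirect": (200, 500), "full_page_load": (3000, 6000)}

def rollup(metrics, budgets):
    """Collapse individual timings into a single green/yellow/red status."""
    status = "green"
    for name, value in metrics.items():
        yellow, red = budgets[name]
        if value >= red:
            return "red"
        if value >= yellow:
            status = "yellow"
    return status

print("Web forums (Austin):", rollup(metrics, budgets))  # yellow -- page load is over budget
```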

 

I think machine learning is going to play an important part in future web performance monitoring applications, as business demands more from IT. Machine learning will help correlate IT and business metrics. The business will want to know more than just whether something is down or up; they need metrics delivered in a way that helps them understand business impact. BTW, Commander Data would also communicate in terms the Captain and staff would understand after compiling the data.

 

Are you more concerned about actionable IT metrics and not concerned with the business impact?

 

What are the metrics that matter to you?

 

Do you translate IT metrics for the business or do they manage a different set of metrics?

 

Have your business teams started asking for more relationship mapping between metrics?

 

Do you see machine learning as a next phase in matching IT and business metrics?

In my last post I talked (briefly) about Protocol Anomaly Detection (PAD) and signatures. My original intention was to talk about other topics, but given that the security aspect has been so popular, I'm going to talk more about the security aspects and (quickly) circle back to optimisation and performance management later.



Obfuscation


Like any security inspection technology, Deep Packet Inspection (DPI) has limitations and trade-offs. DPI can only make security decisions on visible traffic, and that is becoming an increasingly common problem. Encrypting traffic with SSL or TLS was traditionally reserved for banking applications and the like, but the exploits of the NSA have led many organisations to adopt an “encrypt everything” approach, including simple web traffic. Applying PAD or protocol signature checks to network flows after they have been encrypted is difficult, but not impossible. The size of the problem is influenced by the direction of the traffic you are trying to inspect. For example:


  • Inspecting your organisation's inbound traffic (e.g. heading towards a public web server) is the easier case. For a server under your administrative control, you should have access to the private key used for encryption. In this case, the firewall (or other network security device) can decrypt the traffic flows on the fly and make a decision based on the payload. Alternatively, if you have a server load balancer that supports SSL offload (and I can't think of any that don't), you may choose to inspect the traffic after it has been decrypted.
  • Inspecting traffic as it leaves your network (e.g. headed to a 3rd party website) is trickier. When the firewall sees a client-to-server SSL session being established (after the TCP 3-way handshake), the firewall takes over and acts as a “man in the middle” (MITM). The client establishes an SSL session to the firewall, and in turn the firewall creates a separate session to the outside server. This allows the firewall (or other secure-proxy device) to inspect traffic transiting the network, looking for either malicious or undesirable traffic (such as credit card numbers heading out of the network). The drawback is that this method totally breaks the browser trust model (well, breaks it even more). When the SSL session is set up, the server proves its identity by sending a public key: a certificate counter-signed by a Certificate Authority known to (or trusted by) the browser. This mechanism is designed to prevent MITM attacks from happening. However, when the firewall intercepts the session, the user's browser would spot the mismatch between the identity of the real server and the one provided by the firewall. Encryption will still work, but the user will get a very visible and very necessary “nag” screen. The trick to preventing the “nag” is to mess with the trust model further by creating a “wildcard” certificate. This allows the firewall to impersonate any SSL-enabled site. For this to work, the CA key for this bogus certificate is placed in the “Trusted Root CA Authorities” list on any device that needs to connect through your network inspection point (firewall, web proxy, etc.). If your network consists of Windows 2000->8.x domain-member workstations, this is about five minutes' work. If you have almost anything else connected, it can be a significant logistical exercise. In fact, a significant part of the challenges around “Bring Your Own Device” (BYOD) policies are around establishing client-to-network trust, and most of them have to mess about with certificates to do it. (A small client-side sketch of spotting this kind of interception follows this list.)
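As that sketch, here is one way an admin might check from a client whether outbound sessions are being re-signed, assuming the inspection CA has already been pushed into the local trust store so the handshake still verifies: compare the issuer of the certificate actually presented with the public CA you expect.

```python
import socket
import ssl

def cert_issuer_org(host: str, port: int = 443) -> str:
    """Return the organization that issued the certificate presented for host."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    issuer = dict(field[0] for field in cert["issuer"])
    return issuer.get("organizationName", "unknown")

# If this prints your company's inspection CA rather than the public CA you expect,
# the session is being decrypted and re-signed on its way out of the network.
print(cert_issuer_org("www.example.com"))
```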


Performance


As has been mentioned by several thread commenters, applying these advanced protocol-level defences have an impact on performance.

  • PAD features are relatively lightweight and are often implemented in hardware, as they deal with a limited selection of parameters. There is only a limited definition of what a protocol-compliant service should be expected to handle, so traffic can be matched as “good” or “bad” fairly cheaply. I would expect basic PAD features to be enabled on almost all firewalls.
  • Protocol pattern matching is a bit more difficult. Each packet that comes in has to be matched against a potentially huge list of “known bad” stuff before it can be classed as good. For common services such as HTTP, there are many thousands of signatures that have to be processed, even when the traffic has been successfully identified. Inescapably, this takes time and processor cycles. A few firewall vendors use specialized ASICs to perform the inspection (fast but costly), but most use conventional x86 processor designs and take the hit. This is why having an appropriate, updated suite of patterns attached to your security policy is critical. Telling a firewall (or other security device) to match against all known vulnerabilities is a fool's errand; it is far better to match against the most recent vulnerabilities and the ones that are specific to your environment. For example, inspecting traffic heading to an Apache server looking for IIS vulnerabilities wastes processing resources and increases latency.
  • SSL inspection creates a significant workload on the firewall or secure web proxy; as a result, many hardware devices use dedicated SSL processors to perform the encrypt/decrypt functions. Any given device has a finite capacity, which makes it critical to decide up front what, and how much, encrypted traffic you want to inspect.


A “good” firewall is one that is properly matched to your environment and properly maintained. All the fancy application identification and protocol security techniques won’t help if the device barfs the first time you turn them on under production loads, or if you fail to regularly review your policy.


In my next and final post, I shall touch on the performance management and troubleshooting aspects of DPI. Thanks for reading, and thank you even more to those who participated in the little survey at the end of my last post!




In my last blog, I introduced some of the basics of network security and why enforcing traffic on the standard source port isn't enough to determine what’s in a payload. I've talked mostly about port 80 and HTTP, but the ports associated with DNS (UDP 53) are also often abused.


In the context of firewalls, Deep Packet Inspection (DPI) is often the first line of defence against port tunnelling or applications that obfuscate their intent. Different security vendors call it different things, but the differences amount to marketing and implementation. Firewall DPI looks beyond the simple source/destination/port three-tuple and attempts to understand what is actually going on. There are two ways this is commonly achieved: protocol anomaly detection (PAD) and signature checks.


Protocol Anomaly Detection


Commonly used protocols usually have an IETF RFC associated with them (the ones that don’t tend to be closed implementations of client/server apps). Each RFC defines the rules that two hosts must follow if they are to communicate successfully. As an example, the specification for HTTP 1.1 is defined in RFC 2616*. It lays out the standard actions for GET, POST, DELETE, etc. PAD inspects the traffic flowing through the firewall (or IDS device, for that matter) and compares it with a literal definition of the RFCs (plus any common customizations). If a TCP packet contains an HTTP payload but is using the ports typically associated with DNS, then clearly something is amiss. With PAD enabled, some applications that attempt to tunnel using any open port (Skype and VPN clients are common culprits) may be stopped by the firewall. Additionally, it prevents some vendor-implementation attacks. For example, if bounds checking isn't properly implemented, a malformed string may cause a process to crash, or arbitrary code execution. A nice, tight PAD engine should pick this up and protect a vulnerable server. Code Red was the classic example, and to this day, very occasionally, I see a signature match in my firewall logs as some lonely, un-patched IIS server trawls the net looking for someone to infect.
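As a toy illustration of the PAD idea (a real engine validates full protocol state machines, not just a header), a check on traffic using the DNS port might at least verify that the payload parses as a DNS header rather than, say, tunnelled HTTP:

```python
import struct

def looks_like_dns(payload: bytes) -> bool:
    """Crude sanity check: does a UDP/53 payload resemble a DNS message?

    A real PAD engine goes much further; this only inspects the fixed
    12-byte DNS header for obviously out-of-spec values.
    """
    if len(payload) < 12:
        return False
    _ident, flags, qdcount, ancount, _nscount, _arcount = struct.unpack("!6H", payload[:12])
    opcode = (flags >> 11) & 0xF
    if opcode > 5:                       # opcodes 6-15 are unassigned
        return False
    if qdcount == 0 and ancount == 0:    # no questions and no answers is suspect
        return False
    return True

# An HTTP request crammed onto port 53 fails the check.
print(looks_like_dns(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"))  # False
```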


PAD has its limitations, though:


  • Some protocols change often and are not that well documented; valid traffic can sometimes be blocked by an out-of-date or too restrictive implementation of PAD
  • It’s not that hard to write an application that appears to conform to the RFC but still allows the exchange of data


As a result, PAD is often combined with another common defence: protocol signatures.


Protocol Signatures


Protocol signatures are analogous to desktop antivirus signatures. Every time something “bad” is detected (let's take Heartbleed as a recent example), security vendors scramble to create a pattern that matches it so it can be identified and blocked. I've written about this elsewhere before, but signatures are also an imperfect measure of intention. There are often shades of grey (and no, there are not 50 of them, you filthy people) between what is definitely good traffic and definitely bad. But it's not all bad: as a side effect, signatures can also be used to provide fine control over users' web activities.
This is not an exact science, and vendor implementations vary greatly in what they can offer. For example, identifying Facebook over HTTP is straightforward enough, but blocking it outright is unlikely to win many friends, especially at the executive level. Apparently, a lot of business is conducted on Facebook, so draconian policies are effectively impossible. As a result, one has to disappear down the rabbit hole of Facebook applications. Some vendors boast of being able to identify thousands of individual applications, but trying to establish a meaningful corporate security policy for each one would be futile. The best firewall implementations break it down into categories, and enable the creation of a policy along these lines (a toy sketch of such a policy follows the list):


  • Permit Chat applications, but warn the user they are logged (and not just by the NSA)
  • Block Games outright, except after 6pm
  • Otherwise Allow Facebook
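A toy sketch of how that kind of category policy could be expressed (the categories, cut-off hour, and actions are the hypothetical ones from the list above, not any vendor's syntax):

```python
from datetime import datetime

# Category -> function of the current time returning an action.
POLICY = {
    "chat":     lambda now: "allow+log",    # permitted, but the user is warned it's logged
    "games":    lambda now: "allow" if now.hour >= 18 else "block",
    "facebook": lambda now: "allow",        # otherwise allow Facebook
}

def decide(category: str, now: datetime = None) -> str:
    now = now or datetime.now()
    return POLICY.get(category, lambda _now: "allow")(now)

print(decide("games", datetime(2014, 7, 1, 15, 0)))  # block (before 6pm)
print(decide("games", datetime(2014, 7, 1, 19, 0)))  # allow
print(decide("chat"))                                # allow+log
```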


This is not perfect, and indeed determined users will always find a way past, but it might deter that idiot in Sales from abusing his ex-girlfriend in Accounts on company time and equipment.


In most cases, both PAD and signatures are enabled on the firewall. Signatures are good for identifying stuff that is known to be bad, whilst PAD can mop up *some* of the unknown stuff and prevent some protocol tunnelling attacks.


For my next post, I’m going to move on to the “making things go faster” aspects of DPI. Edit: Sorry, after the popular vote I've covered the security limitations of DPI. Plenty more to come!


* During the research for this post I discovered that RFC 2616 is essentially deprecated, having been rewritten as a series of more precise RFCs, 7230-7235. I stumbled across this blog by one of the revision authors, so I thought it would be nice to share.





More and more software is delivered through the web as software-as-a-service (SaaS), and web performance monitoring tools are no different. There are a number of companies that allow you to pay a monthly fee to use their service, but many people feel it's better to manage your infrastructure or web application using tools in-house. I think there is space for a hybrid option that can take advantage of both scenarios.

 

One of the benefits I see with an external monitoring tool is that you get to see the performance of your web application from an internet user's perspective. I've seen issues where the application is performing poorly externally, but internally folks don't have the same experience. Then the old tried-and-true call comes: "...can someone use the cable modem and test out our site?" Depending on how deep you want the external monitoring to go, there are times when you need to install an agent on each server you want managed; that can add some overhead, and your security team may not be too thrilled about it.

 

Some folks prefer internal monitoring only, but I don't think they are getting the complete picture when it comes to external web performance. Internal monitoring can at times alert you to issues before they become known to the broader community. Some folks have also found that by keeping monitoring internal they avoid outside issues that may skew their performance results, whether it's a poorly operating content delivery network (CDN) or a bad internet connection from their provider.

 

What are your thoughts on in-house vs hosted web performance monitoring solutions?

 

Is there a difference between the two?

 

Do you leverage both services, targeting specific areas for each to monitor?

 

Do you prefer one over the other?

 

Does cost come into play when deciding on a solution?

 

Let me know your thoughts, and how we can build better monitoring services that can manage the entire stack from infrastructure to application.

A recap of the previous month’s notable tech news items, blog posts, white papers, video, events, etc. - For the week ending Friday, June 27th, 2014.

 

News

 

The Enterprise Is Ready for the Internet of Things—But 57% of Networks Are Not

Survey reveals how the IoT will impact IT infrastructure—network capacity and security are biggest concerns.

 

ICYMI: 'Shoddy' PayPal, Google Glass & hacking BYOD

The latest on PayPal's two-factor authentication, the launch of Google Glass and new banking attacks.

 

What is a CPU-based SDN switch? Infoblox's Stu Bailey explains

Article discusses the differences between LINCX and an ASIC-based switch, the types of controllers used for testing, and how users can get involved with the development of LINCX.

 

Cisco: WiFi Traffic Will Exceed Wired By 2018

Within five years, more IP traffic will come from devices other than PCs, the networking vendor says in its latest Visual Networking Index.

 

Network Stories

 

Kiwi firm rakes in “million dollar” savings through BYOD…

Savings of $1 million a year – and happier staff to boot – have convinced Cogent boss Ray Noonan he’s on to a winner with BYOD.

 

Blogger Exploits

 

Who Should Have Access to Your IT Monitoring Tools

Data should be available to those who use it effectively to make decisions. But who are those people – and where in your organization will you find them?

 

Turning network resource management on its head through software-defined WANs

Although much of the Software Defined Networking discussion to date has been about the data center, software-defined WANs (SDW) also shows great promise.

 

Duo Security Researchers Uncover Bypass of PayPal’s Two-Factor Authentication

Researchers at Duo Labs, the advanced research team at Duo Security, discovered that it is possible to bypass PayPal’s two-factor authentication.

 

Open source PCI DSS: A strategy for cheaper, easier PCI compliance

Do open source products have a place in enterprise PCI compliance strategies? Take a look at the open source opportunities for meeting three specific compliance needs: logging, file integrity monitoring and vulnerability scanning.

 

Network Device Management, Part 1: SNMP & Logging

Learn best-practices for monitoring your network devices with SNMP and for tracking events the devices generate.

 

Webinars & Podcasts

 

BYOD: How To Protect Your Company And Employees

The BYOD trend continues to gain momentum. However, there are important steps companies need to take to protect their intellectual property from security breaches.

 

Windows Server 2003 end of life: Plan your WS2012 migration now

Extended support for Windows Server 2003 ends on 14 July 2015. The webcast guides you through steps to help you make the best migration decisions. No registration is required.

 

Food for Thought

 

More Working From Home: Getting Things Done - The Peering Introvert, Ethan Banks

Imagine starting your day in your “war room” with folks from different parts of the IT organization. Your teams have been assembled to deal with the typical “Our website is slow…” comments from external customers and internal business units. You have a multi-tiered web application with the latest hardware and application code. As you go around the table, the first response you typically get is “...not my problem, maybe it's…”. Where do you start?

 

When dealing with web performance monitoring, folks typically start at the infrastructure or application layer, but this siloed approach can only get you so far.  Multi-tier applications are interconnected and very complex.  We need an end-to-end approach that will help connect the dots.  This approach could also help prevent unnecessary hardware spend, a typical response from folks who think just adding more compute, memory or storage would solve the issue.

 

I would be interested in hearing from you regarding which tools you use to help monitor your web performance.  There are a number of products that provide a holistic point of view that could help to reduce your Mean Time To Repair (MTTR).   Solutions such as Pingdom, Keynote and AppNeta may be the tools for you.

 

Do you currently monitor your website performance?

 

Do your users notify you when your website is slow?

 

Do you have a passive or active monitoring solution?

 

What kinds of features would you like to see from current day solutions?

 

Do you use a hosted or in-house web performance monitoring solution?

 

We have had a lot of success with web performance monitoring tools, let me know how things are going for you.
