
Geek Speak


One of the common challenges in troubleshooting performance issues is that the multiple dimensions belong to different teams, and coordinating the troubleshooting across those teams brings its own challenges. I really like the PerfStack feature where the dashboard URL contains all of the information required to recreate the dashboard. The net result is that I can paste that one URL into my help desk ticket to include the evidence when I hand off an issue to another team. Equally, when another team sends me a ticket, it can already contain a dashboard to jump-start my troubleshooting.
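
To make the idea concrete, here is a minimal sketch of how dashboard state can be packed into a single shareable URL. To be clear, this is not the actual PerfStack URL format, which SolarWinds defines; the base address and parameter names below are invented purely for illustration.

```python
# Hypothetical illustration of a state-carrying dashboard URL.
# The base address and parameter names are invented; the real
# PerfStack URL format is defined by SolarWinds and not shown here.
from urllib.parse import urlencode

BASE = "https://orion.example.com/ui/perfstack"  # assumed server address

def build_dashboard_url(metrics, start, end):
    """Pack everything needed to recreate the view into query parameters."""
    query = urlencode({
        "charts": ";".join(metrics),  # which counters to stack
        "startTime": start,           # ISO 8601 timestamps
        "endTime": end,
    })
    return f"{BASE}?{query}"

url = build_dashboard_url(["node42.cpu", "vm7.disk_latency"],
                          "2017-03-01T09:00:00Z", "2017-03-01T11:00:00Z")
print(url)  # paste this one string into the ticket; it carries the evidence
```

Because all of the state lives in the URL itself, the receiving team gets a live, editable view rather than a static screenshot.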

 

I've seen help desk tickets bounce around from team to team within large organizations. As any network engineer will tell you, the network is always blamed first. To prove the issue isn't the network, you put together some graphs showing that all the latency is in a VM. Then you paste a screenshot of the graphs into the help desk system and reassign the ticket to the virtualization team. Shortly afterward, the virtualization team replies that they are unable to see the issue and asks you to provide more details. This poor handoff between departments slows the whole process and makes it harder to resolve the problem for the application end-users. It also leaves every team feeling that the other teams are idiots because they cannot see the obvious problems.

 

With PerfStack, you can hand the virtualization team a live graph showing the performance issue as a VM problem. The virtualization team can take that URL and make changes to the dashboard. They might add VM-specific counters as well as information from inside the operating system. The VM team may identify that the issue is happening within SQL Server. They hand it off to the DBAs, with the URL for an updated dashboard. The DBAs rebuild the indices (or something) and all the performance problems go away. The important thing is that the handoff between teams carries far more actionable information. Each team can take the information from the previous team and adapt it to their own view of the world. The context of each team's information remains through the URLs in the ticket. This encapsulation into a URL was one of my favorite little features of the PerfStack demonstration.

 

One thing to keep in mind is that collaborative troubleshooting is more productive than playing help desk ticket ping-pong. It definitely helps the process to have experts across the disciplines working together in real time. It helps both with resolving the problem at hand and with future problems. Often each team can learn a little of the other teams' specializations and better understand the overall environment. Another under-appreciated aspect is that it helps people understand that the other teams are not complete idiots, and that each specialization has its own issues and complexity.

I have been talking about the complexity of resolving performance issues in modern data centers, and particularly about how it is a multi-dimensional problem in which virtualization significantly increases the number of dimensions. My report of having been forced to use Excel to coordinate performance data brought some interesting responses. It is, indeed, a very poor tool for consolidating that data.

 

I have also written in other places about management tools that are focused on the data they collect, rather than helping to resolve issues. What I really like about PerfStack is the ability to use the vast amount of data in the various SolarWinds tools to identify the source of performance problems.

 

The central idea in PerfStack is to gain insights across all of the data that is gathered by the various SolarWinds products. Importantly, PerfStack allows the creation of ad hoc collections of performance data. Performance graphs for multiple objects and multiple resource types can be stacked together to identify correlation. My favorite part was adding multiple performance counters from the different layers of infrastructure to a single screen. This is where I had the Excel flashback, only here the consolidation is done programmatically. No need for me to make sure the time series match up. I loved that the performance graphs were redrawing in real time as new counters were added. Even better, the redraw was fast enough that counters could be added on the off chance that they were relevant. When they are not relevant, they can simply be removed. The hours I wasted building Excel graphs translate into minutes of building a PerfStack workspace.
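
As a rough sketch of the consolidation work this takes off your hands, here is what stacking and correlating just two counters looks like when done by hand with pandas. The file and column names are made up for illustration.

```python
# Align two exported counters on a shared time index and check correlation.
# File and column names are invented for illustration.
import pandas as pd

cpu = pd.read_csv("hypervisor_cpu.csv", parse_dates=["timestamp"],
                  index_col="timestamp")
lat = pd.read_csv("vm_disk_latency.csv", parse_dates=["timestamp"],
                  index_col="timestamp")

# Resample both series to a common one-minute interval so the points line up.
combined = pd.DataFrame({
    "cpu_pct": cpu["value"].resample("1min").mean(),
    "latency_ms": lat["value"].resample("1min").mean(),
}).dropna()

print(combined.corr())        # correlation matrix across the two counters
combined.plot(subplots=True)  # stacked graphs, one panel per counter
```

Multiply that by every counter you want to try "on the off chance," and the appeal of a tool that redraws instantly is obvious.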

 

I have written elsewhere about systems management tools that get too caught up in the cool data they gather. These tools typically have fixed dashboards that give pretty overviews. They often cram as much data as possible into one screen. What I tend to find is that these tools are inflexible about the way the data is combined. The result is a dashboard that is good at showing that everything is, or is not, healthy but does not help a lot with resolving problems. The dynamic nature of the PerfStack workspace lends itself to getting insight out of the data and helping identify the root cause of problems. Being able to rapidly assemble data on the hypervisor load, the VM operating system, and the application statistics speeds troubleshooting. The ability to add performance counters for the application's other dependencies lets you pinpoint the cause of an issue quickly. It may be that the root cause is a domain controller with an overloaded CPU, while the symptom is an unresponsive SharePoint server.

 

PerfStack allows very rapid discovery of the causes of issues. Its value will vastly increase as it is rolled out across the entire SolarWinds product suite.

 

You can see the demonstrations of PerfStack that I saw at Tech Field Day on Vimeo: NetPath here and SAM here.

I've discussed the idea that performance troubleshooting is a multi-dimensional problem and that virtualization adds more dimensions. Much of the time it is sufficient to look at the layers independently. The cause of a performance problem may be obvious in an undersized VM or an overloaded vSphere cluster. But sometimes you need to correlate the performance metrics across multiple layers. Worst of all is when the problem is intermittent; apparently random application slowdowns are the hardest to troubleshoot. The few times that I have needed to do this correlation, I have always had a sinking feeling. I know that I am going to end up gathering a lot of performance logs from different tools, then identifying the metrics that are important and graphing them together. That feeling comes from knowing I need to get the data from Windows Perfmon, the vSphere client, the SAN, and maybe a network monitor into a single set of graphs.

 

My go-to tool for consolidating all this data is still Microsoft Excel, mostly because I have a heap of CSV files and want a set of graphs. Consolidating this data has a few challenges. The first is getting consistent start and finish times for the sample data. The CSV files are generated by separate tools, and the time stamps may even be in different time zones. Usually, looking at one or two simple graphs identifies the problem time window. Once we know where to look, we can trim each CSV file to the time range we want. Then there are challenges with getting consistent intervals for the graphs. Some tools log every minute and others every 20 minutes. On occasion, I have had to re-sample the data to the coarsest interval just to get everything on one graph. That graph also needs sensible scales, meaning applying scaling to the CSV values before we graph them. I'm reminded how much I hate having to do this kind of work and how much it seems like something that should be automated.
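
For the curious, here is roughly what that manual cleanup looks like when scripted in pandas instead of Excel, assuming two CSV exports whose timestamps are in different time zones. The file names, column names, zones, and time window are all invented for illustration.

```python
# Sketch of the manual CSV cleanup described above; all names are invented.
import pandas as pd

perfmon = pd.read_csv("perfmon.csv", parse_dates=["time"], index_col="time")
san = pd.read_csv("san_stats.csv", parse_dates=["time"], index_col="time")

# Normalize both indexes to UTC so the samples actually line up.
perfmon.index = perfmon.index.tz_localize("US/Eastern").tz_convert("UTC")
san.index = san.index.tz_localize("UTC")

# Trim both files to the problem window found from the first quick graphs.
window = slice("2017-02-28 14:00", "2017-02-28 16:00")
perfmon, san = perfmon.loc[window], san.loc[window]

# Re-sample to the coarsest logging interval (the SAN logs every 20 minutes).
df = pd.concat([perfmon.resample("20min").mean(),
                san.resample("20min").mean()], axis=1)

# Scale wildly different units onto one comparable 0-1 axis, then graph.
df = (df - df.min()) / (df.max() - df.min())
df.plot()
```

Even scripted, every new data source means another round of time zone guessing and re-sampling, which is exactly the busywork that should be automated away.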

 

Usually, when I'm doing this, I am an external consultant brought in to deal with a high-visibility issue. Senior management is watching closely and demanding answers. Usually, I know the answer early and spend hours putting together the graph that proves it. If the client had a good set of data center monitoring tools and well-trained staff, they would not need me. It troubles me how few organizations spend the time and effort to get value out of their monitoring tools.

 

I have been building this picture of the nightmare of complex performance troubleshooting for a reason. Some of you have guessed it: PerfStack will be a great tool for avoiding exactly this problem. Seeing an early demo of PerfStack triggered memories. Not good memories.

I wrote last week about how performance troubleshooting is a problem with multiple dimensions. From the client, across the network, and into the application server, there are many places where application performance can be impacted. One of the key parts of performance troubleshooting is sorting through all those dimensions, then finding the one that is limiting application performance. Unfortunately, there are many more dimensions when your applications are inside virtual machines. Understanding and considering these dimensions is critical to performance troubleshooting in a virtual environment.

 

The first place we add more dimensions is between the VM and the hypervisor. VM sizing, as well as virtual HBA and NIC selection, all play a part in application performance. Then there is the hypervisor and its configuration. A lot of the same issues that affect operating systems also affect hypervisors. Are updates applied? Is the storage configuration correct? How about NIC teaming? Even simple things like applying the storage vendor's recommended optimizations can make a big difference to the performance of applications in the VMs.

 

The next dimension comes from multiple VMs sharing a single physical server, so your application's performance may depend on the behavior of other VMs. If another VM, or group of VMs, uses most of the physical CPU, then your application may run slowly. If your VM resides on the same storage as a bunch of VDI desktops, then storage performance will fluctuate. We see this noisy-neighbor problem most often in cloud environments, but noisy neighbors absolutely can happen in on-premises virtualization. A particular challenge is the invisibility of hypervisor issues to the VM and its applications. The operating system inside a VM will report the clock speed of the underlying physical CPU, even if it is only getting a small fraction of the CPU cycles. A VM that uses 100% of its CPU is not necessarily getting 100% of a CPU; it is just using 100% of what it gets from the hypervisor.
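
The hypervisor's side of the story is largely invisible from inside the guest, but on Linux the kernel does expose one telltale counter: "steal" time, the share of cycles the hypervisor handed to someone else. A quick sketch of reading it follows; the five-second sample window is an arbitrary choice.

```python
# Measure "steal" time from /proc/stat inside a Linux guest.
# High steal percentages suggest noisy neighbors on the hypervisor.
import time

def cpu_times():
    with open("/proc/stat") as f:
        fields = f.readline().split()       # aggregate "cpu" line
    values = [int(v) for v in fields[1:9]]  # user..steal (skip guest columns)
    return sum(values), values[7]           # total jiffies, steal jiffies

total1, steal1 = cpu_times()
time.sleep(5)                               # arbitrary sample window
total2, steal2 = cpu_times()

steal_pct = 100.0 * (steal2 - steal1) / (total2 - total1)
print(f"CPU steal over sample: {steal_pct:.1f}%")
```

A VM reporting busy CPUs plus significant steal time is the classic signature of a contended host.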

 

There is a time dimension here, too, one related to the portability and mobility of VMs. Your application server may move from one virtualization host to another over time. So can other VMs. The noisy-neighbor VM that caused a performance issue half an hour ago may be nowhere near your VM right now. On most hypervisors, VMs can also move from one piece of storage to another. That performance issue yesterday might have triggered an automated move of your VM to a better storage location, and the migration made the performance problem disappear. This VM mobility can make performance issues intermittent: they only manifest when a specific other VM is on the same physical host.

 

Another virtualization dimension appears when the user's PC is really a VM. VDI adds still more dimensions to performance. There is a whole network between the user and their PC, as well as another device right in front of the user that may bring its own performance issues. Users seldom have the right words to differentiate a slow VDI session from a slow application. All of the noisy-neighbor issues we see with server VMs are multiplied with desktops.

 

Virtualization has enabled many awesome changes in the data center, but it has also added serious complexity to performance troubleshooting. There are many dimensions to understand before you can find and resolve the one that is restricting performance.

There is a reason why resolving application performance issues is so hard. Actually, I think there are a lot of reasons, but I’m going to spend a few posts looking at one set of reasons.

 

A lot of human endeavors are linear: move in one direction until you reach your goal. Troubleshooting application performance is non-linear. It has a lot of different, yet interrelated, dimensions. The only way to resolve application performance issues is to address the dimension that is limiting performance. But because the dimensions are interrelated, the symptom may be far removed from the constraining dimension; the cause of a performance issue can be a long way from its visible effects.

 

One dimension is the performance of the user interface. Whether it is a native application or a web browser, every application has a User Interface (UI). Since the UI is close to the user, it is what they perceive as the application. Something as simple as a bug in Adobe Flash, or a laptop that hasn't been rebooted in months, can lead to a poor user experience. Of course, the application that is slow for the user may not be the cause. Another application taking up 99% of the laptop's CPU can be the bully; the application the user is trying to use may be the victim. Just inside the user's computer, there are multiple performance dimensions.
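
Spotting the bully is usually quick if you can run something on the machine. Here is a small sketch using the third-party psutil package (installable with pip); the two-second sample window is arbitrary.

```python
# List the top CPU consumers over a short sample window using psutil.
import time
import psutil

for p in psutil.process_iter():
    try:
        p.cpu_percent(None)  # prime the per-process CPU counters
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass

time.sleep(2)  # arbitrary sample window

snapshot = []
for p in psutil.process_iter(["name"]):
    try:
        snapshot.append((p.cpu_percent(None), p.info["name"] or "?"))
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass  # process exited or is off-limits; skip it

for pct, name in sorted(snapshot, reverse=True)[:5]:
    print(f"{name:30s} {pct:5.1f}%")
```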

 

The next dimension is the network between the user's device and the application servers. From our desks outside the data center, the network can be very fast. But if your users are at the end of a busy WAN circuit, over a VPN, or halfway around the world, then they will have a very different experience. What about load balancers, firewalls, and even simple things like name resolution? Having an out-of-date primary DNS server can add a minute to the application's initial load time. The network adds still more dimensions to the troubleshooting problem.
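
Name resolution is easy to test in isolation before blaming anything else. A small sketch follows; the hostname is a placeholder.

```python
# Time DNS resolution by itself; a dead primary DNS server shows up here
# as a multi-second stall before the application ever opens a connection.
import socket
import time

host = "app.example.com"  # hypothetical application server
start = time.monotonic()
try:
    addrs = socket.getaddrinfo(host, 443)
    print(f"resolved {host} in {time.monotonic() - start:.2f}s "
          f"-> {addrs[0][4][0]}")
except socket.gaierror as err:
    print(f"resolution failed after {time.monotonic() - start:.2f}s: {err}")
```

If this takes seconds from the user's network segment but milliseconds from yours, you have found a dimension worth digging into.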

 

Then there are dimensions on the application server. Does it have enough RAM and CPU? Are its disks up to the load of the application? What about the optimization of the operating system? Have the drivers for the storage and network adapters been updated to the latest release? Or have they been updated to a faulty one? Is the cause of the performance problem a periodic scan by the anti-virus software? The application server adds more dimensions to any performance problem.

 

Maybe there is a time dimension, too. End users in another country may access the application server at an unusual time. Are there maintenance tasks that make the application slow at exactly the time your European users need to get their end-of-day processing done? With all these possible dimensions to investigate, it is no surprise that performance troubleshooting can take some time.

 

Performance troubleshooting has many dimensions. Until you identify the limiting dimension you will never fully resolve the performance problem. The number of dimensions increases substantially when you add virtualization to the mix. Next week I will look at some additional dimensions that you need to consider when virtualization comes into play.
