I’ve discussed the idea that performance troubleshooting is a multi-dimensional problem and that virtualization adds more dimensions. Much of the time it is sufficient to look at the layers independently: the cause of a performance problem may be obvious in an undersized VM or an overloaded vSphere cluster. But sometimes you need to correlate performance metrics across multiple layers, and worst of all is when the problem is intermittent; apparently random application slowdowns are the hardest to troubleshoot. The few times I have needed to do this correlation, I have had a sinking feeling. I know I am going to end up gathering a lot of performance logs from different tools, identifying the metrics that matter, and usually graphing them together. Getting the data from Windows Perfmon, the vSphere client, the SAN, and maybe a network monitor into a single set of graphs is a job I dread.
My go-to tool for consolidating all this data is still Microsoft Excel, mostly because I have a heap of CSV files and want a set of graphs. Consolidating the data has a few challenges. The first is getting consistent start and finish times for the sample data: the CSV files come from separate tools, and the time stamps may even be in different time zones. Usually, looking at one or two simple graphs identifies the problem time window; once we know where to look, we can trim each CSV file to the time range we want. Then there is the challenge of getting consistent intervals for the graphs. Some tools log every minute and others every 20 minutes, so on occasion I have had to re-sample the data down to the coarsest resolution just to get everything on one graph. That graph also needs sensible scales, which means scaling the CSV values before graphing them. I’m reminded how much I hate doing this kind of work, and how much it seems like something that should be automated.
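For the curious, the manual steps above (normalizing time zones, trimming to the problem window, re-sampling to the coarsest interval, and scaling units) can be sketched in a few lines of pandas. This is just an illustration, not any tool's actual export format: the column names, time zones, and sample values are all made up.

```python
# A minimal sketch of the CSV-consolidation chore described above, using
# pandas. File contents, column names, and time zones are hypothetical.
import io
import pandas as pd

# Two imaginary exports: a Perfmon-style log every minute in local time,
# and a SAN-style log every 5 minutes in UTC.
perfmon_csv = io.StringIO(
    "timestamp,cpu_pct\n"
    "2017-03-01 09:00,35\n"
    "2017-03-01 09:01,90\n"
    "2017-03-01 09:02,88\n"
    "2017-03-01 09:03,40\n"
    "2017-03-01 09:04,38\n"
    "2017-03-01 09:05,41\n"
)
san_csv = io.StringIO(
    "time,read_latency_us\n"
    "2017-03-01 14:00,1200\n"
    "2017-03-01 14:05,9500\n"
)

perfmon = pd.read_csv(perfmon_csv, parse_dates=["timestamp"], index_col="timestamp")
san = pd.read_csv(san_csv, parse_dates=["time"], index_col="time")

# Normalize both series to UTC (the Perfmon log is assumed to be US/Eastern).
perfmon.index = perfmon.index.tz_localize("US/Eastern").tz_convert("UTC")
san.index = san.index.tz_localize("UTC")

# Trim to the problem window, then re-sample the fine-grained data down to
# the coarsest interval (5 minutes) so both series fit on one graph.
perfmon_5m = perfmon.loc["2017-03-01 14:00":"2017-03-01 14:05"].resample("5min").mean()
san_5m = san.loc["2017-03-01 14:00":"2017-03-01 14:05"]

# Scale latency from microseconds to milliseconds so the axes stay sensible,
# then join the two series on their now-common timestamps.
combined = perfmon_5m.join(san_5m["read_latency_us"] / 1000.0)
combined = combined.rename(columns={"read_latency_us": "read_latency_ms"})
print(combined)
```

From `combined` it is one more call (`combined.plot()`) to get the single correlated graph that takes hours to assemble by hand in Excel.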
Usually, when I’m doing this, I am an external consultant brought in to deal with a high-visibility issue. Senior management is watching closely and demanding answers. Usually, I know the answer early and spend hours putting together the graph that proves it. If the client had a good set of data center monitoring tools and well-trained staff, they would not need me. It troubles me how few organizations invest the time and effort to get value out of their monitoring tools.
I have been building this picture of the nightmare of complex performance troubleshooting for a reason, and some of you have guessed it: PerfStack will be a great tool for avoiding exactly this problem. Seeing an early demo of PerfStack triggered memories. Not good memories.