
Engineer to Engineer – Visibility into the Full Application Stack Helps Pinpoint Root Cause of Problems


The following is an account of a discussion that one of our engineers, Matt Quick, had with a customer, told in his own words.

A customer evaluating SolarWinds Server and Application Monitor (SAM) called us wanting an extension on the trial because “SAM was broke, keeps alerting a component when we know everything was fine.”  I asked to take a look at the customer’s historical data.  The component in question was from the Windows 2003-2012 Services and Counters template; specifically, “Pages/sec” was going critical.  I’d seen this before, and it always relates back to the disk.

“But this VM is backended by a NetApp!!  It can do 55,000 IOPS!!!”  I was suspicious of that number, so I asked, “OK, do you have Storage Manager (STM) or another storage monitoring product installed so we can check?”  He did, and he promptly informed me that NetApp’s Balancepoint was telling him that while he averaged about 860 IOPS over the day, during that hour he spiked to 1350 IOPS, still well within his supposed “55,000” IOPS limit.

OK, so I go into SolarWinds Storage Manager, hit search in the upper right, and find the VM with the component in question.  I go to the storage tab, open Logical mapping, and find which LUN and aggregate it belongs to.  Next, I go into the NetApp and look at the RAID report to see how many IOPS he can do.  A quick calculation later, I estimate about 3,500 IOPS total.  The customer then realizes the original number of “55,000 IOPS” probably is not real in his specific setup.  Then I look at the volume IOPS report on the NetApp for the same timeframe.  Sure enough, March 1st @ 8:30 pm, a 3,500 IOPS spike.
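The “quick calculation” is not spelled out in the story. One common rule of thumb, sketched here as an assumption and not necessarily the method used on the call, multiplies the number of data disks by a per-disk IOPS figure and then discounts for the RAID write penalty. The function name and every input value below (disk count, per-disk IOPS, read fraction, write penalty) are hypothetical and chosen only for illustration; they are not the customer's numbers.

```python
def estimate_frontend_iops(data_disks, iops_per_disk, read_fraction, write_penalty):
    """Rule-of-thumb estimate of the IOPS a RAID group can serve to hosts.

    All inputs are assumptions supplied by the caller; real arrays
    (caching, write coalescing, mixed disk types) will deviate from this.
    """
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    raw_iops = data_disks * iops_per_disk
    write_fraction = 1.0 - read_fraction
    # Each host write costs `write_penalty` back-end I/Os; reads cost one.
    return raw_iops / (read_fraction + write_fraction * write_penalty)


# Hypothetical example: 24 data disks at 175 IOPS each, 70% reads,
# and a write penalty of 2.
estimate = estimate_frontend_iops(
    data_disks=24, iops_per_disk=175, read_fraction=0.7, write_penalty=2
)
print(f"Estimated front-end IOPS: {estimate:.0f}")
```

The point of the sketch is the order of magnitude: a few dozen spindles land in the low thousands of IOPS, nowhere near a headline figure like 55,000.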

“But Balancepoint says I got 1350 at that time!”  So, I ask him to open it up, and sure enough, 1350 @ 9pm.  I ask him to look at the next data point…800 IOPS @ 11pm.  He was looking at two-hour aggregates.  Sure enough, if you average over the window that contains the 8:30 pm spike, you get 1350.  And we couldn’t figure out how to zoom in to a finer granularity in NetApp’s software.  At this point the customer is speechless as he realizes his current tools were giving him incomplete information.
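To see how a two-hour aggregate can hide a spike like this, here is a minimal sketch. The sample interval, the baseline, and the length and position of the spike are assumptions made up for illustration; only the 3,500 IOPS peak and the 800 IOPS quiet level echo figures from the story.

```python
from statistics import mean

# Assumed values for illustration only.
BASELINE_IOPS = 800   # quiet-period level
SPIKE_IOPS = 3500     # peak seen in the volume IOPS report

# One two-hour window of 24 five-minute samples, all at baseline...
samples = [BASELINE_IOPS] * 24
# ...except five consecutive samples (25 minutes) at the peak.
for i in range(6, 11):
    samples[i] = SPIKE_IOPS

window_average = mean(samples)   # what a two-hour aggregate reports
window_peak = max(samples)       # what the disks actually had to serve

print(f"two-hour average: {window_average:.1f} IOPS")
print(f"peak sample:      {window_peak} IOPS")
```

With these assumed samples the average lands in the same ballpark as the 1350 the customer saw, while the peak is more than twice that. Keeping the maximum alongside the average for each window would have surfaced the spike.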

Then I ask him if Virtualization Manager (VMan) is installed, and sure enough it is.  I look in STM to see which datastores are on that aggregate in the NetApp, add all of them to a performance chart in VMan for the same timeframe, and isolate the problem to a single datastore.  From there I add all the VMs on that datastore to the chart, and boom, we find the culprit VM:  apparently someone was running some kind of backup every day @ 8:30 pm.
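The drill-down was done visually with charts, but the logic behind it is simple: at the time of the spike, find the member that contributes the most I/O, first among datastores and then among VMs. A rough sketch of that last step follows. The VM names, the sample values, and the `top_contributor` helper are all hypothetical, and nothing here calls an actual VMan or STM API.

```python
def top_contributor(iops_by_vm, spike_index):
    """Return (vm_name, iops) for the VM with the highest IOPS at spike_index."""
    name = max(iops_by_vm, key=lambda vm: iops_by_vm[vm][spike_index])
    return name, iops_by_vm[name][spike_index]


# Hypothetical per-VM IOPS samples for one datastore; index 2 is the
# sample that lines up with the spike.
iops_by_vm = {
    "vm-web01": [120, 130, 125, 118],
    "vm-db01": [300, 310, 320, 305],
    "vm-backup01": [90, 95, 2900, 100],
}

culprit, culprit_iops = top_contributor(iops_by_vm, spike_index=2)
print(f"{culprit} peaked at {culprit_iops} IOPS during the spike")
```

The same ranking works one level up: feed it per-datastore series to decide which datastore to open first.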

All this from what looked like an ‘erroneous’ SAM alert.

This story exemplifies the value of an integrated set of tools that gives you visibility across the extended application stack, from the application and its processes and services through the underlying infrastructure, so that you can identify the root cause and then solve hard problems. The following video gives an overview of how we are making this possible by integrating Server and Application Monitor, Virtualization Manager, and Storage Manager to provide extended application stack visibility.

If you have used SAM, STM, VMan, and Database Performance Analyzer to find the root cause of tricky problems or to prevent problems, please share your story (your story is worth a cool 50 thwack points)!

Level 14

This story exposes the double-edged sword wielded by those opposed to additional network monitoring tools.  One edge is the "prove that I have a problem with my existing processes" argument, which is nearly impossible to answer without investing at least some time in a trial.  The other edge is the "everything looks good to me" argument, often closely resembling the "I don't want to see the problems unless it is costing me money" argument.

Everyone knows that problems exist in their infrastructure.  Having a heterogeneous toolset, albeit one built on a common framework, is key to rooting out those problems.  Convincing others that the investment of time and money is worth it... well, maybe SolarWinds can make a tool for that too!

Level 15

Thanks for the information.


Just another reason I dislike aggregated values... they hide the truth.

About the Author
Like SolarWinds, I have roots in Oklahoma and have been fond of land grant schools, going from Oklahoma State University south to Texas A&M University.  Like my college career (accounting, political science, Russian language, and then an MBA), I have suffered ADD in my professional career, moving from finance to strategy to product management and marketing.  I have, however, settled on the broad niche of systems management and have acquired knowledge in this space over the last 11 years.  I was very happy to join the SolarWinds team in January 2012 and have been very impressed with the technology.  I look forward to engaging with this community.