The following is an actual description of a discussion one of our engineers, Matt Quick, had with a customer as told in his words.
A customer evaluating a trial copy of SolarWinds Server and Application Monitor (SAM) called us wanting an extension on the trial because “SAM was broke, keeps alerting a component when we know everything was fine.” I asked to take a look at the customer’s historical data. The component in question was from the Windows 2003-2012 Services and Counters template; specifically, “Pages/sec” was going critical. I’d seen this before, and it always traces back to the disk.
“But this VM is backended by a NetApp!! It can do 55,000 IOPS!!!” I was suspicious of that number, so I asked, “OK, do you have Storage Manager (STM) or another storage monitoring product installed so we can check?” He sure did, and promptly informed me that NetApp’s Balancepoint was telling him that while he averaged about 860 IOPS per day, during that hour he spiked to 1,350 IOPS, still well within his supposed “55,000 IOPS” limit.
OK, so I go into SolarWinds Storage Manager, hit search in the upper right, and find the VM with the component in question. I go to the storage tab, open Logical mapping, and find which LUN and aggregate it belongs to. Next, I go into the NetApp and look at the RAID report to see how many IOPS the array can actually deliver. A quick calculation later, I estimate about 3,500 IOPS total. The customer then realizes the original “55,000 IOPS” figure probably isn’t realistic for his specific setup. Then I look at the volume IOPS report on the NetApp for the same timeframe. Sure enough: March 1st @ 8:30 pm, a 3,500 IOPS spike.
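That “quick calculation” is a standard back-of-the-envelope spindle-count estimate. Here is a minimal sketch in Python using hypothetical numbers (20 data spindles at a rule-of-thumb 175 IOPS each for 15K drives); the real figures depend on drive type, RAID layout, and I/O mix, so treat this as illustration rather than the exact math from the call:

```python
# Rough aggregate IOPS estimate from spindle count (hypothetical numbers).
# Real-world capacity also depends on the RAID write penalty and read/write mix.

PER_DISK_IOPS = {"7.2K": 75, "10K": 125, "15K": 175}  # common rule-of-thumb values

def estimate_raw_iops(spindles: int, drive_class: str) -> int:
    """Raw (read) IOPS the aggregate's data disks can sustain."""
    return spindles * PER_DISK_IOPS[drive_class]

def effective_iops(raw: int, write_fraction: float, write_penalty: int) -> float:
    """Usable front-end IOPS once the RAID write penalty is factored in.

    Each front-end write costs `write_penalty` back-end I/Os
    (e.g. 2 for RAID 10, 4 for RAID 5, 6 for RAID 6-style layouts).
    """
    return raw / ((1 - write_fraction) + write_fraction * write_penalty)

raw = estimate_raw_iops(20, "15K")   # 20 spindles * 175 = 3,500 raw IOPS
print(raw)
print(round(effective_iops(raw, 0.3, 4)))  # same disks with 30% writes on RAID 5: much less
```

With a write-heavy workload the effective number drops well below the raw figure, which is one more reason a vendor’s headline IOPS rating rarely matches what a specific aggregate can do.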
“But Balancepoint says I got 1,350 at that time!” So I ask him to open it up, and sure enough, 1,350 @ 9 pm. I ask him to look at the next data point… 800 IOPS @ 11 pm. He was looking at a bi-hourly aggregate: average the samples from the two-hour window around the 8:30 pm spike and you get about 1,350. And we couldn’t figure out how to zoom in any further in NetApp’s software. At this point the customer is speechless as he realizes his current tools were giving him incomplete information.
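The smoothing effect is easy to reproduce. A quick sketch with made-up sample values (seven 15-minute samples near the background load plus the 3,500 IOPS spike) shows how a two-hour average buries the spike:

```python
# Eight 15-minute IOPS samples covering one two-hour window (hypothetical values):
# steady background load, then the 8:30 pm backup spike.
samples = [1043, 1043, 1043, 3500, 1043, 1043, 1043, 1043]

peak = max(samples)                          # what SAM and the volume report saw
bi_hourly_avg = sum(samples) / len(samples)  # what the bi-hourly view reported

print(peak)                  # 3500
print(round(bi_hourly_avg))  # 1350
```

Any downsampled view trades exactly this kind of peak information for a smoother chart, which is why the spike was invisible in the aggregated report but obvious in the raw data.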
Then I ask him if Virtualization Manager (VMan) is installed, and sure enough it is. I look in STM at which datastores live on that aggregate, add all of them to a performance chart in VMan for the same timeframe, and isolate the problem to a single datastore. From there I chart all the VMs on that datastore, and boom, we find the culprit VM: apparently someone was running some kind of backup every day @ 8:30 pm.
All this from what looked like an ‘erroneous’ SAM alert.
This story exemplifies the value of an integrated set of tools that gives you visibility across the extended application stack, from the application and its processes and services down through the underlying infrastructure, so that you can identify the root cause and solve hard problems. The following video gives an overview of how we are making this possible with the integration of Server and Application Monitor, Virtualization Manager, and Storage Manager to provide extended application stack visibility.
If you have used SAM, STM, VMan, and Database Performance Analyzer to find the root cause of tricky problems or to prevent them, please share your story (your story is worth a cool 50 thwack points)!