8 Replies Latest reply on Mar 24, 2016 3:29 PM by Jenya

    SolarWinds to the rescue!  A real-life success story


      These days, I've been traipsing around North America visiting various client sites to work on what has become an extremely wide variety of projects.  At one site, upon my arrival, I was asked if I knew anything about storage (I do) and if I could help them figure out whether they were having performance issues with their NetApp.  The organization was beginning to experience problems with both its vSphere environment and a production Microsoft cluster, but could not identify the root cause.


      We had some small pieces of information to use as a starting point:

      • The servers housing a Microsoft cluster were breaking and errors in their logs suggested storage as a root cause.
      • Basic reporting tools for vSphere were showing some occasional latency issues, but nothing comprehensive.




      To find out what was going on, I turned to SolarWinds Storage Manager.  The clues we had were helpful, but each gave only a partial picture.  The client's NetApp SAN serves multiple needs:

      • Most of the client's vSphere environment is supported by the SAN.
      • Many of the client's non-virtualized systems use the SAN for storage.


      As such, relying on vSphere alone for storage information gave us an incomplete picture.  It was a good starting point, but hardly comprehensive.


      I let Storage Manager run for a few days, and we began to get a much more complete picture of what was going on across the entire storage environment, not just the part visible from vSphere.


      Armed with this information, we learned that there was, indeed, a major latency issue.  We also learned that the worst of the issue was taking place at specific times throughout the day and we could track exactly which volumes were being affected in an attempt to identify a commonality.  We did learn that all of the affected volumes were on the same aggregate, which gave us a place to focus.  As such, we started to investigate that aggregate and it was eventually determined that someone had created a snapshot schedule for the entire aggregate.  This operation was taking place even during the heaviest periods of activity, resulting in service-disrupting levels of latency.
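
      The reasoning in that paragraph (find the hours with the worst latency, then look for what the affected volumes have in common) can be sketched in a few lines of Python.  This is only an illustration, not how Storage Manager works internally: the sample tuples, volume and aggregate names, the 20 ms threshold, and both function names are hypothetical, and it assumes the latency samples have already been exported from whatever monitoring tool is in use.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical exported samples: (hour_of_day, volume, aggregate, latency_ms).
# The names and numbers are made up for illustration only.
SAMPLES = [
    (2, "vol_sql", "aggr1", 4.0),
    (2, "vol_vmfs01", "aggr1", 3.5),
    (2, "vol_home", "aggr2", 4.5),
    (9, "vol_sql", "aggr1", 45.0),
    (9, "vol_vmfs01", "aggr1", 38.0),
    (9, "vol_home", "aggr2", 5.0),
    (13, "vol_sql", "aggr1", 52.0),
    (13, "vol_vmfs01", "aggr1", 41.0),
    (13, "vol_home", "aggr2", 6.0),
]


def worst_hours(samples, threshold_ms=20.0):
    """Map each aggregate to the hours whose mean latency exceeds the threshold."""
    by_aggr_hour = defaultdict(list)
    for hour, _volume, aggregate, latency_ms in samples:
        by_aggr_hour[(aggregate, hour)].append(latency_ms)

    result = defaultdict(list)
    for (aggregate, hour), values in by_aggr_hour.items():
        if mean(values) > threshold_ms:
            result[aggregate].append(hour)
    return {aggregate: sorted(hours) for aggregate, hours in result.items()}


def affected_volumes(samples, threshold_ms=20.0):
    """Map each aggregate to the volumes with at least one sample above the threshold."""
    result = defaultdict(set)
    for _hour, volume, aggregate, latency_ms in samples:
        if latency_ms > threshold_ms:
            result[aggregate].add(volume)
    return dict(result)


if __name__ == "__main__":
    print(worst_hours(SAMPLES))
    print(affected_volumes(SAMPLES))
```

      If every slow volume lands under the same aggregate and the bad hours repeat day after day, that points toward something scheduled at the aggregate level, which is where a snapshot schedule like the one we found would show up.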


      The moral of the story:  No matter what, having the right tools to monitor the environment is critical.  Without them, administrators are left guessing at issues and may not have what they need to identify the true culprit.  With Storage Manager, we could much more easily identify the scope of the problem so that we could correct it and ensure that applications were getting the storage resources they need.


      What are your thoughts?  How do you start troubleshooting potential storage-related performance issues?





      Reply to this post to get 50 thwack points and an entry in the March Ambassador Engagement contest. An iPod Nano sits in the balance!

        • Re: SolarWinds to the rescue!  A real-life success story

          Since just about all of our storage systems are NetApp, we use the NetApp tools for measuring performance and troubleshooting storage-related performance issues.  We also use vCenter performance metrics as well as the network performance data from Orion.


          Since it sounds like you have worked with both NetApp and SolarWinds Storage Manager, I am curious how it compares to the NetApp performance tools.  I am looking for a use case that shows why I would choose SolarWinds Storage Manager over the NetApp tools in a NetApp storage environment.

          • Re: SolarWinds to the rescue!  A real-life success story

            I think a better moral of the story should have been "Peer review all changes / Use change control" so that hopefully someone would have caught the config change before it went to production.

            • Re: SolarWinds to the rescue!  A real-life success story

              My previous employer was a managed service provider.  We had some customers that wanted to do their own monitoring but didn't want to take the time to figure out SW and get it stood up efficiently.  With these customers I would do consulting gigs.  I would spend about a week turning up SW, tuning it to their environment, and giving admin training to the staff.


              On every single turn-up, something was found that hadn't been detected by the previous monitoring systems: interfaces taking millions of errors in a day, several T1s down within the environment, mismatches in code versioning, etc.

              • Re: SolarWinds to the rescue!  A real-life success story

                Pretty sure we let the NetApp monitoring tools do their job. However, I would like to see what SolarWinds can do for us in this department.

                • Re: SolarWinds to the rescue!  A real-life success story

                  Not only is it important to have the right tools in place, but also management that stands behind the tools.  For years, the culture of our organization was to blame someone first and, if the problem could then be fixed, great.  This led to an environment where no one wanted to accept that something was wrong.  Our number one response to an alert used to be, "I can log in, so this must be a false positive."  Now, thanks to the power of our monitoring suite and a management structure that answers accusations of false positives with the facts of collected data, LOBs are beating down our doors to help.  We have evolved from an up/down monitor on network infrastructure and servers to a Monitoring and Alerting Control Tower for all infrastructure and applications in the organization.

                  People's willingness to use the tools has allowed us to diagnose everything from firewall issues to code problems.

                  • Re: SolarWinds to the rescue!  A real-life success story

                    Would be great to have some kind of chart of awesomeness illustrating what you can do with SolarWinds products that Cisco's DCNM is not capable of. Because I know a "guy" who knows a "guy" who says "they" don't need Orion because "they" have DCNM.

                    • Re: SolarWinds to the rescue!  A real-life success story

                      We had a similar situation where we were experiencing performance issues with our NetApp. During this, we installed a demo of Storage Manager to help troubleshoot. Latency was one of our problems as well. For us, this was across all aggregates. Unfortunately we didn't have a single issue to fix that would make everything better. So we used Storage Manager and a few other tools to determine a number of configuration issues that were causing problems.

                      So far, we have found:

                      •   Multiple volumes running dedupe jobs during busy hours.
                      •   Misaligned VMs. We used NetApp tools for this. It would be nice if this were in Storage Manager, though it does look like it is in Virtualization Manager, which we are configuring.
                      •   One volume doing practically 1/3 of all IO on an aggregate. We tracked that down to a poorly written SQL query being run repeatedly. This started after we installed Storage Manager, and with it and Virtualization Manager the problem was very quick and easy to notice and to trace to its cause.


                      We're still building out the Storage Manager setup as well. We're just starting to play with File Analysis, which has been interesting. Never would have guessed we would have 2 TB of .tif files.

                      • Re: SolarWinds to the rescue!  A real-life success story

                        I too would set up a trial install of SRM. Actually, that is what we did when we had a latency issue on one of our SANs. Similar to the original post, we had both virtual and physical machines using this NetApp, so vSphere errors did not tell the whole story. We were already using NetApp tools to monitor all of our SANs across all data centers. We use SolarWinds to monitor everything else (we have NPM, NTA, NCM, SAM, UDT, VNQM). I like the idea of having another monitoring tool focused specifically on the problem area (the SRM trial in this case) to compare with the in-house monitoring tool. Sometimes you have to verify the data you get out of your monitoring tool to be confident that it is doing what it is supposed to.