
Thresholds and Alerting: Where the Magic Happens

Level 9

In the previous two posts, we covered high-level performance information and then dove into the details of storage performance at the array, pool, and LUN/Volume level. Now let's talk about thresholds and alerting. This is where Storage Resource Monitor starts adapting to your environment and showing the performance information that matters to you.

Thresholds

Setting thresholds is a key step in making sure your data center runs efficiently. When you start SolarWinds® Storage Resource Monitor for the first time, pre-set thresholds are already in place, based on general best practices. These work for most situations, but some workloads need something more specific. Some applications in your environment require low latency, and any deviation causes major headaches. Others require a specific number of IOPS, and any dip slows the business down and fills your inbox with not-so-nice requests for information. Setting your thresholds properly can help you avoid "fire drills." The "SRM Settings" section is where you set global thresholds for key storage resources.

pastedImage_2.png

Thresholds can be set for IOPS, throughput, I/O size, capacity, and latency (LUN and Volume only). Some of these can also be set separately for read, write, or total, so you can customize for applications that are read-heavy or write-heavy.

pastedImage_4.png

Global settings let you tailor monitoring to your data center as a whole, but some applications differ from the rest and need special attention. Storage Resource Monitor has you covered there as well: on each details screen (array, pool, and LUN/Volume), you can adjust the thresholds for that specific resource. Say Pool 1 needs to maintain 500 IOPS and you need to know when it is heading below that. You can set the warning threshold to IOPS less than or equal to 600 and the critical threshold to IOPS less than or equal to 550. Or say LUN 2 has to keep latency at or below 50ms. You can set the warning threshold at 40ms and the critical threshold at 50ms. The thresholds you set for individual resources carry over to the summary screens we talked about before, so you can see at a glance whether the required performance is being met.
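To make the Pool 1 and LUN 2 examples concrete, here is a minimal Python sketch of how a two-level threshold could be evaluated. It is only an illustration of the logic, not how Storage Resource Monitor implements it; the `Threshold` class, the `Status` enum, and the `lower_is_worse` flag are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass(frozen=True)
class Threshold:
    """Hypothetical two-level threshold (not an SRM API)."""
    warning: float
    critical: float
    lower_is_worse: bool  # True for an IOPS floor, False for a latency ceiling

    def evaluate(self, value: float) -> Status:
        # Check the critical level first so it wins over warning.
        if self.lower_is_worse:
            if value <= self.critical:
                return Status.CRITICAL
            if value <= self.warning:
                return Status.WARNING
        else:
            if value >= self.critical:
                return Status.CRITICAL
            if value >= self.warning:
                return Status.WARNING
        return Status.OK


# Values taken from the examples above.
pool1_iops = Threshold(warning=600, critical=550, lower_is_worse=True)
lun2_latency_ms = Threshold(warning=40, critical=50, lower_is_worse=False)

print(pool1_iops.evaluate(580))      # Status.WARNING
print(lun2_latency_ms.evaluate(50))  # Status.CRITICAL
```

Note that the direction of the comparison flips between the two cases: for IOPS the problem is a value that is too low, while for latency it is a value that is too high.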

pastedImage_6.png

pastedImage_7.png

Alerting

So now you're thinking, "Thresholds are great, but I need to be alerted when my custom thresholds are reached." Combined with custom thresholds, custom alerts make sure you find out quickly when something goes wrong. As before, the standard alerts in Storage Resource Monitor will get you going, but custom alerts help you confirm that all of your resources are performing as required. You can create custom alerts for groups of resources with the same performance profile or for specific resources with a unique requirement.

pastedImage_9.png

You can set a single alert for a specific storage resource or one alert for multiple resources that share a common performance profile. You can customize everything from which team handles the alert, to how long the condition has to exist before the alert fires, to the time of day during which the alert is enabled, to name a few. Limiting a custom alert to specific times helps avoid unwanted alerting noise during expected downtime or planned degraded performance.
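The "condition has to exist for a period of time" and "time of day" options above can be sketched in a few lines of Python. This is an illustrative model only, assuming a simple polling loop; `SustainedAlert` and its fields are hypothetical names and do not correspond to anything in the Storage Resource Monitor alert engine.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional


@dataclass
class SustainedAlert:
    """Hypothetical alert that fires only after a sustained breach
    and only inside a daily active window."""
    hold: timedelta      # how long the breach must persist
    active_from: time    # start of the daily alerting window
    active_until: time   # end of the daily alerting window
    breach_started: Optional[datetime] = None

    def in_window(self, now: datetime) -> bool:
        t = now.time()
        if self.active_from <= self.active_until:
            return self.active_from <= t < self.active_until
        # Window wraps past midnight, e.g. 22:00 to 06:00.
        return t >= self.active_from or t < self.active_until

    def update(self, breached: bool, now: datetime) -> bool:
        """Feed one poll result; return True if the alert should fire."""
        if not breached:
            self.breach_started = None  # condition cleared, reset the timer
            return False
        if self.breach_started is None:
            self.breach_started = now
        sustained = now - self.breach_started >= self.hold
        return sustained and self.in_window(now)


alert = SustainedAlert(hold=timedelta(minutes=10),
                       active_from=time(8, 0), active_until=time(18, 0))
start = datetime(2024, 1, 2, 9, 0)
print(alert.update(True, start))                          # False
print(alert.update(True, start + timedelta(minutes=10)))  # True
```

Requiring the breach to persist filters out short spikes, and the window check keeps the alert quiet during planned maintenance hours.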

pastedImage_10.png

By using thresholds and custom alerts, Storage Resource Monitor has you covered when monitoring storage performance for all your applications. Along with dashboards and storage resource details, you can easily stay ahead of your storage performance needs and track when more resources are needed.

What are some of your best practices around thresholds? What are the items you customize with alerts?

7 Comments

Excellent. I use solutions like this continually with NPM. When a WAN site with a small uplink pipe reaches a specific utilization, we want an alert. When that utilization exceeds a specific amount, and when it's caused by a specific application (a bad WAN killer is Dragon, a voice-to-text transcription application), we want to see that so we can advise the site of the cause of their WAN problems.

Similarly, if someone's streaming audio/video for personal entertainment and it's impacting the limited bandwidth available to a site, we want to know about it so we can train the site and its users to more responsibly manage their needs.  QoS is helpful, but results in different calls to the Help Desk with complaints of slow Guest Wireless Internet performance.

Level 9

That is good stuff and shows how thresholds and alerting translate across the data center.


Great point on the custom alerting. Alerting & thresholds go hand in hand. Without the customization and intelligence added to the alerting rules, you have white noise coming from your monitoring system. Custom alerting is part art and part science. And it can be a fun exercise in the realm of, "What if..."

"So what if the Exchange server falls off the map, and the primary WAN link is down, and it is the Friday after Thanksgiving, but it is within 5 days of EOM, and...."

Level 9

@james_honey This is great, but how do you set multiple LUN overrides? I have 35 LUNs that need a different threshold than the global one (all the same value). Going into each one manually seems cumbersome. Sometimes there are hundreds of LUNs that all need the same threshold adjusted. How is this handled? The only way I can think of right now is by writing a SQL query. If that is the only route, does anyone already have the SQL query written that they can share? (I can hunt and peck for the tables otherwise.) The specific counter I'm looking to override is the Total Latency. Thank you.

 

Level 9

@bmoline I am no longer at SolarWinds, but I think @jvb might be able to point you in the right direction.

Product Manager

@bmoline Your instinct is correct. You would need to do that via a direct query. Currently there is no way to do it through the UI or via the SDK. I don't have a query offhand to offer but if you look at the tables in either SWQL Studio or Database Manager, you should be able to easily find the tables where those thresholds are stored.

Level 9

@jfb thank you. Yesterday I was able to find the table (Orion.SRM.LUNThresholds) and if I change one manually (from global warning 2 to 10 and critical from 2.5 to 15) I can see these columns change: 

ThresholdType changes from 0 to 2
ThresholdOperator changes from 1 to 0
Level1Value changes from 2 to 10
Level1Formula changes from NULL to 10
Level2Value changes from 2.5 to 15
Level2Formula changes from NULL to 15

Does that appear to be everything on these?  It sure would be nice if there was a UI to select multiple LUNs and just adjust. Feature request suggestion.
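
For anyone scripting this, the single manual change listed above can be captured as a small Python helper that produces the same set of column values for any warning/critical pair. This is a sketch based only on that one observation: the meanings of `ThresholdType` 2 and `ThresholdOperator` 0 are assumptions, the formula columns are assumed to hold the value as text, and `observed_override_columns` is a hypothetical name. It does not write to the database and is not a verified schema.

```python
def observed_override_columns(warning: float, critical: float) -> dict:
    """Column values mirroring the one manual override reported above."""
    return {
        "ThresholdType": 2,      # observed 0 -> 2; assumed to mean "custom override"
        "ThresholdOperator": 0,  # observed 1 -> 0; meaning not confirmed
        "Level1Value": warning,
        "Level1Formula": f"{warning:g}",   # observed NULL -> "10"; text type assumed
        "Level2Value": critical,
        "Level2Formula": f"{critical:g}",  # observed NULL -> "15"; text type assumed
    }


row = observed_override_columns(10, 15)
print(row["Level1Formula"], row["Level2Formula"])  # 10 15
```

Check the result against a second manually edited LUN before applying anything in bulk.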