Thresholds and Alerting: Where the Magic Happens

In the previous two posts, we covered high-level performance information and then dug into storage performance details at the array, pool, and LUN/volume level. Now let's talk about thresholds and alerting. This is where Storage Resource Monitor starts adapting to your environment and surfacing the performance information that matters to you.

Thresholds

Setting thresholds is a key step in making sure your data center runs efficiently. When you start SolarWinds® Storage Resource Monitor for the first time, preset thresholds based on general best practices are already in place. These work for most situations, but some workloads need something more specific. Some applications in your environment require low latency, and any deviation causes major headaches. Others require a specific number of IOPS, and any dip slows the business down and fills your inbox with not-so-nice requests for information. Setting your thresholds properly helps you avoid "fire drills." The "SRM Settings" section is where you set global thresholds for key storage resources.

Thresholds can be set for IOPS, throughput, I/O size, capacity, and latency (LUN and volume specific). Some of these can also be set separately for read, write, or total, so you can tune them for applications that are read-heavy or write-heavy.

Global settings let you tailor monitoring for your data center as a whole, but, as you know, some applications differ from the rest and need special attention. If that's the case, Storage Resource Monitor has you covered. On each details screen (array, pool, and LUN/volume), you can adjust the thresholds for that specific resource. Say Pool 1 needs to maintain 500 IOPS and you need to know before it drops below that. You can set the warning threshold to trigger when IOPS are less than or equal to 600 and the critical threshold when IOPS are less than or equal to 550. Or say LUN 2 has to stay within 50 ms of latency. You can set the warning threshold at 40 ms and the critical threshold at 50 ms. The thresholds you set for individual resources carry over to the summary screens we talked about before, so you can see at a glance whether the required performance is being met.
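To make the Pool 1 and LUN 2 examples concrete, here is a minimal Python sketch of how a global threshold with per-resource overrides could be evaluated. This is not Storage Resource Monitor's API or internal logic. The names `Threshold`, `GLOBAL_DEFAULTS`, `OVERRIDES`, and `status_for` are hypothetical, and the global default values are made up for illustration. Only the Pool 1 and LUN 2 numbers come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass(frozen=True)
class Threshold:
    """Hypothetical warning/critical pair for one metric."""
    warning: float
    critical: float
    lower_is_worse: bool  # True for an IOPS floor, False for a latency ceiling

    def evaluate(self, value: float) -> Status:
        if self.lower_is_worse:
            # Trips when the value falls to or below the limit.
            if value <= self.critical:
                return Status.CRITICAL
            if value <= self.warning:
                return Status.WARNING
        else:
            # Trips when the value rises to or above the limit.
            if value >= self.critical:
                return Status.CRITICAL
            if value >= self.warning:
                return Status.WARNING
        return Status.OK


# Assumed global defaults (illustrative values, not the product's presets).
GLOBAL_DEFAULTS = {
    "iops": Threshold(warning=100, critical=50, lower_is_worse=True),
    "latency_ms": Threshold(warning=20, critical=30, lower_is_worse=False),
}

# Per-resource overrides taken from the examples in the text.
OVERRIDES = {
    ("Pool 1", "iops"): Threshold(warning=600, critical=550, lower_is_worse=True),
    ("LUN 2", "latency_ms"): Threshold(warning=40, critical=50, lower_is_worse=False),
}


def status_for(resource: str, metric: str, value: float) -> Status:
    """Use the resource-specific threshold if one exists, else the global one."""
    threshold = OVERRIDES.get((resource, metric), GLOBAL_DEFAULTS[metric])
    return threshold.evaluate(value)


if __name__ == "__main__":
    print(status_for("Pool 1", "iops", 580))      # between 550 and 600
    print(status_for("LUN 2", "latency_ms", 50))  # at the critical limit
```

Note that the warning limit sits above the requirement for IOPS and below it for latency, so in both cases the warning fires while there is still headroom.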

Alerting

So now you're thinking, "Thresholds are great, but when a custom threshold is reached, I need to be alerted." In addition to custom thresholds, custom alerts make sure you find out quickly when something goes wrong. As before, the standard alerts in Storage Resource Monitor will get you going, but custom alerts help confirm that all of your resources are performing as required. You can create custom alerts for groups of resources with the same performance profile or for specific resources with unique requirements.

You can set a single alert for a specific storage resource or one alert for multiple resources that share a common performance profile. Among other things, you can choose which team handles the alert, require that the condition persist for a period of time before the alert fires, and enable the alert only during certain times of day. Restricting an alert to specific times helps avoid unwanted alert noise during expected downtime or planned periods of degraded performance.
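As a rough illustration of two of those options (a condition that has to persist, and an alert that is only enabled during part of the day), here is a small Python sketch. It is an assumption-laden model, not how Storage Resource Monitor implements alerts. `AlertRule`, `observe`, and the field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional


@dataclass
class AlertRule:
    """Hypothetical alert rule: fires only after a sustained breach in a time window."""
    name: str
    sustain: timedelta          # how long the condition must hold
    active_from: time           # alert enabled from this time of day...
    active_until: time          # ...until this time of day
    breach_started: Optional[datetime] = None

    def in_window(self, now: datetime) -> bool:
        t = now.time()
        if self.active_from <= self.active_until:
            return self.active_from <= t < self.active_until
        # Window wraps past midnight, e.g. 22:00 to 06:00.
        return t >= self.active_from or t < self.active_until

    def observe(self, breached: bool, now: datetime) -> bool:
        """Record one sample; return True if the alert should fire."""
        if not breached or not self.in_window(now):
            # Condition cleared or alert disabled right now: reset the timer.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.sustain


if __name__ == "__main__":
    rule = AlertRule("LUN 2 latency", timedelta(minutes=10), time(8, 0), time(18, 0))
    start = datetime(2024, 1, 8, 9, 0)
    for minutes in (0, 5, 10):
        fired = rule.observe(True, start + timedelta(minutes=minutes))
        print(minutes, fired)
```

One design choice in this sketch: a breach that begins outside the enabled window does not count toward the sustain period, so the timer only starts once the alert is active.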

With thresholds and custom alerts, Storage Resource Monitor has you covered when monitoring storage performance for all your applications. Combined with dashboards and storage resource details, they help you stay ahead of your storage performance needs and track when more resources are needed.

What are some of your best practices around thresholds? What are the items you customize with alerts?

Comments
  • Excellent. I use solutions like this continually with NPM. When a WAN site with a small uplink pipe reaches a specific utilization, we want an alert. When that utilization exceeds a specific amount, and when it's caused by a specific application (a bad WAN killer is Dragon, a voice-to-text transcription application), we want to see that so we can advise the site of the cause of their WAN problems.

    Similarly, if someone's streaming audio/video for personal entertainment and it's impacting the limited bandwidth available to a site, we want to know about it so we can train the site and its users to more responsibly manage their needs.  QoS is helpful, but results in different calls to the Help Desk with complaints of slow Guest Wireless Internet performance.
