Orion Platform 2020.2 - Enhanced Volume Status

If you've been keeping your Orion products up to date, you have hopefully noticed the many improvements we made to Node Status in the 2019.4 Orion Platform release. This represented a fairly significant shift in how status for the node was calculated, and gave users a tremendous amount of flexibility to decide for themselves what influences the status of a node. Prior to the 2019.4 release, node status was fairly binary and represented only whether the node was up or down via ICMP. With the 2019.4 release, CPU, memory, and child objects can finally be rolled up and contribute to the overall status of the node.

What's been missing since the dawn of the Orion Platform, however, has been any sensible status for volumes. Volumes are arguably central to the overall health of a node, yet they have never represented any meaningful status beyond up/green when present, or unknown/gray when missing. Those are fairly limited and not very useful statuses for something so critical to the overall health of a server. I mean, sure, it's nice to know if a volume isn't attached or mounted to a server anymore, but what about something as basic as volume status being affected by the amount of remaining space on the disk?

As anyone who has ever run out of free disk space can likely tell you, when this occurs most servers cease to perform their designated function. The applications the server is hosting may perform slowly, erratically with unusual errors, or simply not at all. While disk space is cheap by most measures, many organizations still find themselves running out of space, especially on thinly provisioned machines. Regardless of why you may have run out of space, surfacing this information clearly in Orion is essential so you can take action before it impacts critical business systems.

New Volume Statuses

In the 2020.2 Orion Platform release, volumes can now reflect warning and critical statuses based upon their volume usage thresholds. As you would imagine, this status now appears essentially anywhere a volume is shown within the Orion web interface. This includes AppStack, Orion Maps, and the popover menu that appears when you hover your mouse over a volume. In the immortal words of Martha Stewart, it's a good thing.

pastedImage_9.png

Volume Thresholds

In previous releases of the Orion Platform, volume thresholds were defined globally. This meant that if you wanted unique thresholds for individual volumes, you had no option other than to define a complex series of alerts for each volume in order to be notified when it exceeded its custom thresholds. Some clever individuals went so far as to utilize custom properties to make this process more manageable. While this may have worked as a crutch to ease volume threshold alerting needs, it was hardly the kind of unexpected simplicity customers have come to expect from SolarWinds, and it did absolutely nothing to affect how volume status was actually reflected within the Orion web interface. We knew we could do better, and that's exactly what we did.

When editing a volume you now have a new option entitled 'Override Orion Global Thresholds'. As the name suggests, this will allow you to define your own unique warning and critical thresholds for this individual volume, or even multiple volumes simultaneously when using multi-edit.

Volume THRESHOLDS.png

Null Thresholds

We've heard countless times that not everything monitored is something customers need or want to be alerted on, or highlighted within the Orion web interface as a problem for the boss to ask probing questions. Yeah, boss, it's perfectly normal for the SWAP volume to be near or at 100% capacity at all times when that volume has a fixed-sized SWAP file. Now, don't you have some TPS reports to review?

In cases such as these, the volume is likely being monitored for IOPS, Latency, or perhaps simply to know that the volume is still attached to the server. So there's likely no good reason to have thresholds defined for volume usage. Fortunately, this is now possible by simply deselecting the checkbox next to the warning and critical thresholds.

Null Thresholds.png

The warning and critical thresholds can also be enabled or disabled independently for each volume, allowing you, for example, to have only a warning threshold defined if that's what you want. In that case the volume would never go into a critical state. This can be useful when certain volumes are deemed more important than others and status is used to convey how critical it is that a given volume has exceeded its threshold. For example, perhaps you have a volume dedicated solely to things like the original installation media, drivers, and application installers for apps used on the system. If this volume fills to 100%, it really has no impact on the server or the applications it's serving, so raising the severity to critical doesn't make much sense. However, you probably still want to know the volume is out of space. In cases such as these, defining only a warning severity is probably more appropriate.
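To make the override and null-threshold behavior concrete, here is a minimal sketch in Python. It is not Orion's implementation: the `Thresholds` class, the `volume_status` function, and the 80/90 global defaults are all hypothetical names and values made up for illustration. It only assumes what is described above: a volume either uses the global thresholds or its own override, and either level can be switched off.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thresholds:
    """Percent-used thresholds; None means that level is disabled (unchecked)."""
    warning: Optional[float] = None
    critical: Optional[float] = None


# Hypothetical global defaults, chosen only for illustration.
GLOBAL_THRESHOLDS = Thresholds(warning=80.0, critical=90.0)


def volume_status(percent_used: float, override: Optional[Thresholds] = None) -> str:
    """Return 'critical', 'warning', or 'up' for a volume that is present.

    If an override is supplied it replaces the global thresholds entirely,
    mirroring the 'Override Orion Global Thresholds' checkbox.
    """
    t = override if override is not None else GLOBAL_THRESHOLDS
    # 'Greater than' is the default operator named in the article.
    if t.critical is not None and percent_used > t.critical:
        return "critical"
    if t.warning is not None and percent_used > t.warning:
        return "warning"
    return "up"


# A SWAP volume with both thresholds deselected never leaves 'up'.
swap_status = volume_status(100.0, Thresholds())
# An installer-media volume with only a warning threshold never goes critical.
media_status = volume_status(100.0, Thresholds(warning=90.0))
```

Modeling a disabled threshold as `None` rather than as a very large number keeps "no threshold" distinct from "a threshold that is hard to reach", which is the distinction the checkboxes express.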

Sustained Thresholds

Building upon the work done in the Orion Platform 2019.2 release with node thresholds, volumes now also include support for sustained thresholds. This can be enormously useful in a wide variety of scenarios. One such example is backup operations that rely upon the Windows Volume Shadow Copy Service. When backups occur, volume usage can temporarily shoot up to high levels, causing nuisance alerts to trigger and then reset shortly thereafter. Even though this is normal behavior for the Volume Shadow Copy Service, the resulting noise during your nightly backup routines can lead to alert fatigue and cause you to miss otherwise critical alerts.

Sustained Thresholds.png

Now with the Orion Platform 2020.2 release, volume thresholds can affect status immediately after only a single poll, after a user-definable number of consecutive polls, or even after exceeding a threshold a definable number of times across a number of polling intervals (X out of Y polls). This provides a tremendous amount of flexibility in how volume usage affects overall volume status, allowing you to cut through the noise and raise only actionable alerts.
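The three evaluation modes described above can be sketched as follows. This is an illustrative Python sketch only, not how Orion evaluates polls internally. The function names and the `history` list (percent-used samples, oldest first) are assumptions introduced for the example.

```python
from typing import Sequence


def exceeds(value: float, threshold: float) -> bool:
    """Default 'Greater than' comparison."""
    return value > threshold


def single_poll(history: Sequence[float], threshold: float) -> bool:
    """Breach as soon as the most recent poll exceeds the threshold."""
    return bool(history) and exceeds(history[-1], threshold)


def consecutive_polls(history: Sequence[float], threshold: float, n: int) -> bool:
    """Breach only if the last n polls all exceed the threshold."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(history) < n:
        return False
    return all(exceeds(v, threshold) for v in history[-n:])


def x_of_y_polls(history: Sequence[float], threshold: float, x: int, y: int) -> bool:
    """Breach if at least x of the last y polls exceed the threshold."""
    if not 1 <= x <= y:
        raise ValueError("require 1 <= x <= y")
    recent = history[-y:]
    return sum(exceeds(v, threshold) for v in recent) >= x


# A brief spike during a backup window, then back to normal.
backup_spike = [70.0, 96.0, 72.0, 71.0]
# Usage that keeps bouncing above 90%.
flapping = [70.0, 96.0, 72.0, 95.0, 97.0]
```

With a threshold of 90, the spike in `backup_spike` trips `single_poll` at the moment it happens but never satisfies `consecutive_polls(..., n=3)`, while `flapping` satisfies `x_of_y_polls(..., x=3, y=5)` even though the high readings are not all consecutive.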

 

Threshold Operators

If somehow you need even greater levels of flexibility when defining your volume thresholds, well we have you covered there too. Need to define warning or critical thresholds when the usage of a particular volume is less than a specified value? Not a problem. Volume threshold operators in Orion Platform 2020.2 allow you to define values as 'Greater than' (Default), 'Greater than or Equal to', 'Equal to', 'Less than or Equal to', 'Less than', or 'Not Equal to'.

Threshold Operators.png
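As a rough illustration of the six operators listed above, the sketch below maps each operator name to a comparison from Python's standard `operator` module. The `OPERATORS` table and the `threshold_breached` function are hypothetical names for this example and say nothing about how Orion stores or evaluates operators.

```python
import operator

# Operator names as listed in the article, mapped to standard comparisons.
OPERATORS = {
    "Greater than": operator.gt,
    "Greater than or Equal to": operator.ge,
    "Equal to": operator.eq,
    "Less than or Equal to": operator.le,
    "Less than": operator.lt,
    "Not Equal to": operator.ne,
}


def threshold_breached(value: float, threshold: float,
                       op_name: str = "Greater than") -> bool:
    """Return True when `value <op> threshold` holds for the chosen operator."""
    return OPERATORS[op_name](value, threshold)


# Flag a volume whose usage has dropped below 10%, for example one that
# should always hold data and may have been wiped or remounted empty.
suspiciously_empty = threshold_breached(5.0, 10.0, "Less than")
```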

Dynamic Thresholds

That's right! We finally brought all the power of Orion Platform's dynamic baseline thresholds to Volumes! In previous releases, dynamic thresholds were available exclusively to Node and Interface metrics. All that same power you know and love can now be unleashed upon your volumes. You can even utilize dynamic baselines in conjunction with sustained thresholds to even further reduce nuisance alerts and eliminate alert fatigue, allowing you to focus on those alerts which are truly actionable. Just point, click and go. Set dynamic thresholds on one or multiple volumes simultaneously. Once you have a minimum of seven days of historical data for the baseline to be calculated, the dynamic threshold(s) will take effect and update nightly.

Baseline Thresholds.png
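For readers curious what a baseline-derived threshold might look like, here is a minimal Python sketch. The article does not describe the statistics behind the baseline, so the rule used here (warning at the mean plus two standard deviations, critical at the mean plus three) is purely an assumption for illustration, as are the function name and the one-sample-per-day input. The only detail taken from the text is the seven-day minimum.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence, Tuple

MIN_DAYS = 7  # minimum history required, per the article


def baseline_thresholds(
    daily_percent_used: Sequence[float],
    warning_sigmas: float = 2.0,   # assumed multiplier, not from the article
    critical_sigmas: float = 3.0,  # assumed multiplier, not from the article
) -> Optional[Tuple[float, float]]:
    """Return (warning, critical) thresholds, or None if history is too short."""
    if len(daily_percent_used) < MIN_DAYS:
        return None
    mu = mean(daily_percent_used)
    sigma = pstdev(daily_percent_used)
    # A percent-used threshold above 100 could never be exceeded, so cap it.
    warning = min(100.0, mu + warning_sigmas * sigma)
    critical = min(100.0, mu + critical_sigmas * sigma)
    return warning, critical


# Hypothetical week of daily usage samples for one volume.
week = [40.0, 42.0, 44.0, 46.0, 48.0, 50.0, 52.0]
result = baseline_thresholds(week)
```

Recomputing this once per night over a sliding window would approximate the "update nightly" behavior described above.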

Node Status Roll-up

Continuing our efforts to simplify how the status of all monitored aspects of a node is rolled up to represent its true status, volumes, like other entities, can now contribute to the overall status of a node. This gives you at-a-glance visibility into the node and whether any volumes on that node could be responsible for poor performance or application availability issues. This helps accelerate the troubleshooting process and reduce mean time to resolution by eliminating the need to dig through various screens of the UI to ferret out the root cause of the issue impacting your end-users.

Roll Up Status.png

Node Status Contributors

In our never-ending pursuit to make substantive, life-changing improvements to the Orion Platform while simultaneously limiting the impact those changes might have on existing customers, volume status by default will not affect node status when upgrading to Orion Platform 2020.2. We understand that the last thing you want to happen after an upgrade is to be flooded with a deluge of alerts. Some of you have toiled for countless hours carefully crafting and further refining your alert definitions to exacting specifications, and the last thing you want to worry about when upgrading is how much work it's going to be to rewrite those. So fret not my OCD friends, there's nothing to worry about.

For those upgrading who want to take full advantage of volume status improvements and have their status roll-up to the node, which has numerous benefits outlined in my previous blog post, it's really quite easy. Simply navigate to [Settings > All Settings > Node Child Status Participation] and click the 'Volume' slider to the on-state, and presto, you're done!

Node Status Contibutors.png

Alternatively, if you're just installing Orion for the first time, then you're already set. There's nothing additional you need to do. Volume Status is already enabled as a Node status contributor for all new installations of Orion.
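The effect of the 'Volume' slider can be sketched in a few lines of Python. This is a simplified model, not Orion's actual roll-up logic: it assumes a "worst status wins" roll-up, a three-level severity ordering, and hypothetical names (`SEVERITY`, `node_status`, `volumes_contribute`).

```python
from typing import Sequence

# Assumed severity ordering; higher is worse.
SEVERITY = {"up": 0, "warning": 1, "critical": 2}


def node_status(own_status: str,
                volume_statuses: Sequence[str],
                volumes_contribute: bool) -> str:
    """Roll child volume statuses up to the node using worst-status-wins.

    `volumes_contribute` models the 'Volume' slider: off after an upgrade,
    on for new installations.
    """
    candidates = [own_status]
    if volumes_contribute:
        candidates.extend(volume_statuses)
    return max(candidates, key=SEVERITY.__getitem__)


volumes = ["up", "critical"]
after_upgrade = node_status("up", volumes, volumes_contribute=False)
new_install = node_status("up", volumes, volumes_contribute=True)
```

With the slider off the node stays up regardless of its volumes, which is why existing alerts are undisturbed after an upgrade. Turning it on lets the critical volume surface at the node level.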

Volume Management Actions

The purpose of enhancing status is ultimately to make it easier to manage volumes in Orion. To that same end, we felt it was long overdue to make volumes a full-fledged citizen of Orion by adding a Management resource to the Volume Details view. This makes it far easier and much more convenient than in previous releases to perform management functions against the volume you're currently viewing. From this new management resource, you can perform the following functions.

  • Edit Volume - Modify thresholds, modify the display name, and adjust the polling frequency of the volume
  • Poll Now - As the name suggests, this forces the volume to be polled immediately rather than waiting for the next scheduled poll
  • Rediscover - Forces a rediscovery of any property changes on the volume, such as the volume name, size, etc.
  • Add New Alert - Takes you into the Alert creation wizard, prepopulated with the entity type and name of the volume
  • Performance Analyzer - Launches PerfStack pre-populated with metrics for that specific volume
  • Maintenance Mode - Allows you to unmanage volumes, another first and a completely new feature of the 2020.2 release!

aLTeReGo_0-1588352277974.png

Epilogue

Volume status improvements may feel somewhat pedestrian, or even mundane, when compared to the spotlight-stealing Orion Maps and Modern Dashboards. However, we feel that these are foundational components that are absolutely essential for any monitoring solution. Status drives alerting, and signifies the severity or impact a particular issue has within the environment. Without a powerful and flexible means to customize thresholds and determine how, when, and where that status is calculated and rolled up, amazing features such as AppStack and Orion Maps are simply not possible. At best they're confusing to interpret. At worst, they misrepresent the truth, either giving you a false sense of well-being or overloading your senses with illegitimate issues in the environment.

I sincerely hope you are as excited about these improvements as we are to deliver them. We thrive upon your feedback, both positive and negative. So please share your thoughts or ask any questions you might have regarding these improvements in the comments section below. Operators are standing by.