Bug with Volume Alerts using 'Orion General Thresholds'

monitoringlife over 4 years ago

This might be a long post and I apologize in advance. It requires some background on how we implemented alerts and discovered an odd behavior with Orion resetting the alert during database maintenance and immediately re-triggering.

Our Volume alerts are fairly standard. Alert if volume percent used is greater than threshold.

We decided to use 'Orion General Thresholds' to set our standard.

URL: /Orion/NetPerfMon/Admin/NetPerfMonSettings.aspx

This gives us the ability to set a standard that all volumes get, and allows us to 'override' the threshold on a volume by volume basis.

This all worked out fairly easily.
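To make the general-plus-override idea concrete, here is a minimal Python sketch. It is not Orion code: the function names, parameter names, and numbers are hypothetical, and it only illustrates how one general threshold can apply to every volume unless a volume carries its own value.

```python
from typing import Optional


def effective_threshold(general: float, override: Optional[float] = None) -> float:
    """Return the per-volume override if one is set, else the general threshold.

    Both arguments are hypothetical stand-ins for the general threshold
    and an optional per-volume override.
    """
    return general if override is None else override


def is_over_threshold(
    percent_used: float, general: float, override: Optional[float] = None
) -> bool:
    """Alert test: percent used is greater than the effective threshold."""
    return percent_used > effective_threshold(general, override)
```

With a general threshold of 90, a volume at 92 percent used would alert, while the same volume with an override of 95 would not.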

The first problem I ran into is the alert builder WebUI. When the alert context ("I want to alert on:") is set to 'volume', the 'Orion General Threshold' is not exposed. It is only exposed under ‘volume capacity forecasting’.

I can link from ‘volume capacity forecasting’ to 'volume', but not the other way. While SWQL has linkages between 'Volume' and ‘volume capacity forecasting’, the alert builder WebUI cannot follow them in both directions.

This was fairly easy to work around. I set the alert context to ‘volume capacity forecasting’ and linked to 'Volume' to get percent used.

Screenshot: alert builder WebUI with alert set to ‘volume capacity forecasting’, exposing ‘Orion General Thresholds’

Screenshot: ‘volume capacity forecasting’ linkage to ‘volume’

Screenshot: alert builder WebUI with alert set to ‘volume’, not exposing ‘Orion General Thresholds’

Screenshot: ‘volume’ has no linkage to ‘volume capacity forecasting’

The second issue: I discovered that alerts created under ‘volume capacity forecasting’ were resetting every night at the same time, then opening new alerts 1 minute later. After a lot of research and trial and error, I tracked this down to database maintenance, which runs at 2:15am by default. I moved it forward an hour and the alert resets moved to match. I moved database maintenance back to the default and the alert resets followed again. These erroneous resets only happen during database maintenance. I confirmed on the impacted volumes that they stayed over threshold, were not bouncing above and below it, and that no cron/batch cycle was clearing things. Disk usage remained consistently over threshold during, between, and after the resets.

After a lengthy support case we set a custom reset condition on the alerts. Originally they used the default "Reset this alert when trigger condition is no longer true (Recommended)".

The new reset condition is a copy of the trigger condition with the operator reversed.

Screenshot: trigger condition

Screenshot: reset condition

This seems to get around the issue, but seems like a bug. Does anyone else use a similar alert with 'Orion General Thresholds'? Anyone else seen any odd behavior with reset conditions?

Top Replies

0 serena over 4 years ago

I filed an improvement request under CORE-13349 to look at the placement of the threshold in the alert builder and to check the design of alert resets.

0 aLTeReGo over 4 years ago

Volumes do not currently have overrides for global thresholds. The value you are overriding is specific to capacity planning. What you are looking for is currently a feature request that can be voted on here >

0 dgsmith80 over 4 years ago in reply to aLTeReGo

I suspect that Capacity Forecast values are re-calculated during DB Maintenance, and that could be what is causing this issue.
In terms of improvement, I would suggest setting Custom Properties against your Volumes, which you can then use to customise your alerts.

0 monitoringlife over 4 years ago in reply to aLTeReGo

aLTeReGo Thanks for looking into this and replying with the existing feature request. It looks like the one you linked is marked as 'Implemented', with a note that it was part of 'SAM 6.2'. The linked example is what we're doing with 'Orion General Thresholds'.

0 aLTeReGo over 4 years ago in reply to monitoringlife

Thank you for bringing that to my attention. That statement was unfortunately incorrect. I have re-opened the feature request.

0 monitoringlife over 4 years ago in reply to dgsmith80

dgsmith80 I agree that it looks like an issue with DB Maintenance. I suspect it happens during the roll-up of the table from detail to daily, leaving a blank/null entry until the next polling interval. We tested moving the maintenance window forward and back an hour and the issue moved with it.
I will look into custom properties, but the implementation seems to work apart from the oddity with the reset and the alert builder UI not linking from 'volume' to ‘volume capacity forecasting’ (despite the link existing from forecasting to volume, as well as in SWQL studio).
What I find more odd is that we are not using ‘volume capacity forecasting’ at all. We are using 'Orion.Volume.VolumePercentUsed' and comparing it against the warning/critical threshold (found on ‘volume capacity forecasting’).
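The blank/null suspicion above can be illustrated with a small Python sketch. This is only a model of the hypothesis, not how Orion is known to evaluate alerts: the function names, the use of None for a missing reading, and the sample values are all assumptions. It compares the default "trigger no longer true" reset with a reset that copies the trigger and reverses the operator.

```python
from typing import Callable, List, Optional

# None stands in for a hypothetical blank/null reading during maintenance.
Sample = Optional[float]


def trigger(percent_used: Sample, threshold: float) -> bool:
    """Trigger condition: a real reading above the threshold."""
    return percent_used is not None and percent_used > threshold


def default_reset(percent_used: Sample, threshold: float) -> bool:
    """Reset when the trigger is no longer true; a null reading also resets."""
    return not trigger(percent_used, threshold)


def custom_reset(percent_used: Sample, threshold: float) -> bool:
    """Copy of the trigger with the operator reversed; needs a real low reading."""
    return percent_used is not None and percent_used < threshold


def count_resets(
    samples: List[Sample],
    threshold: float,
    reset: Callable[[Sample, float], bool],
) -> int:
    """Count how often an active alert is reset over a series of readings."""
    active = False
    resets = 0
    for value in samples:
        if active and reset(value, threshold):
            active = False
            resets += 1
        elif not active and trigger(value, threshold):
            active = True
    return resets


# A volume that stays at 95 percent used, with one missing reading.
samples: List[Sample] = [95.0, 95.0, None, 95.0, 95.0]
print(count_resets(samples, 90.0, default_reset))  # 1
print(count_resets(samples, 90.0, custom_reset))   # 0
```

In this model the single missing reading is enough to reset and re-open the alert under the default condition, while the reversed-operator reset ignores it and still resets on a genuine drop below the threshold.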

0 monitoringlife over 4 years ago in reply to aLTeReGo

aLTeReGo no stress at all. Thanks for the time and assistance looking into this!

0 monitoringlife over 4 years ago in reply to serena

serena Thanks for looking into this and raising the issue. Please feel free to reach out with a UI session if your team has any questions.

0 daniel.neeves over 3 years ago

Hi monitoringlife, I came across this while searching for the exact same issue you describe above (the second issue). Our volume alerts (Linux filesystems) are resetting each night at around 02:15, causing our Ops team a massive headache. Did you ever get a proper fix from SW for this, or are you still working around it with a special reset condition?

Thanks in advance.

0 monitoringlife over 2 years ago in reply to daniel.neeves

Sorry for the late reply. There is no fix, only a workaround. I wrote a custom reset condition: copy the logic from the trigger and change the operator for the reset (the trigger being greater than, change the reset to less than).