
Bug with Volume Alerts using 'Orion General Thresholds'

This might be a long post, and I apologize in advance.  It requires some background on how we implemented our alerts and how we discovered an odd behavior: Orion resets the alert during database maintenance and then immediately re-triggers it.

Our volume alerts are fairly standard: alert if the volume's percent used is greater than the threshold.

We decided to use 'Orion General Thresholds' to set our standard.

     URL: /Orion/NetPerfMon/Admin/NetPerfMonSettings.aspx

     pastedImage_2.png

     pastedImage_0.png

     pastedImage_1.png

This lets us set a standard threshold that all volumes get, while still allowing us to 'override' the threshold on a volume-by-volume basis.

     pastedImage_3.png
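The global-default-plus-override arrangement described above can be sketched roughly as follows. This is only an illustration in Python, not how Orion implements it; the names `GLOBAL_CRITICAL_PERCENT`, `Volume`, `effective_threshold`, and `is_over_threshold`, and the value 90, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical global default, standing in for the 'Orion General Thresholds' value.
GLOBAL_CRITICAL_PERCENT = 90.0


@dataclass
class Volume:
    name: str
    percent_used: float
    # None means "no per-volume override; use the global default".
    critical_override: Optional[float] = None


def effective_threshold(volume: Volume) -> float:
    """Return the per-volume override if set, otherwise the global default."""
    if volume.critical_override is not None:
        return volume.critical_override
    return GLOBAL_CRITICAL_PERCENT


def is_over_threshold(volume: Volume) -> bool:
    """Alert condition: percent used is greater than the effective threshold."""
    return volume.percent_used > effective_threshold(volume)
```

With these assumed values, a volume at 92% used would alert under the global default, but not if it carried an override of 95.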

This all worked out fairly easily.

The first problem I ran into is in the alert builder WebUI.  When the alert context ("I want to alert on:") is set to 'volume', the 'Orion General Threshold' is not exposed.  It is only exposed under ‘volume capacity forecasting’.

I can link from ‘volume capacity forecasting’ to 'volume', but not the other way.  While SWQL has linkages between 'Volume' and ‘volume capacity forecasting’, the alert builder WebUI cannot see in both directions.

This was fairly easy to work around. I set the alert context to ‘volume capacity forecasting’ and linked to 'Volume' to get percent used.

     pastedImage_5.png

     Alert builder WebUI with alert set to ‘volume capacity forecasting’, exposing ‘Orion General Thresholds’

     pastedImage_10.png

     ‘Volume capacity forecasting’ linkage to ‘volume’

     pastedImage_11.png

     Alert builder WebUI with alert set to ‘volume’, not exposing ‘Orion General Thresholds’

     pastedImage_12.png

     ‘Volume’ has no linkage to ‘volume capacity forecasting’

     pastedImage_13.png

The second issue: I discovered that alerts created under ‘volume capacity forecasting’ were resetting every night at the same time, then opening new alerts 1 minute later.  After a lot of research and trial and error, I tracked this down to database maintenance, which happens at 2:15am by default.  I moved it forward an hour and the alert resets moved to match.  I moved database maintenance back to the default and the alert resets followed again.  These erroneous resets only happen during database maintenance.  I checked the impacted volumes and confirmed they were over threshold the whole time, not bouncing above and below it, and that no cron/batch cycle was clearing things.  Disk usage remained consistently over threshold during, between, and after the resets.

After a lengthy support case, we set a custom reset condition on the alerts.  Originally they used the default, "Reset this alert when trigger condition is no longer true (Recommended)".
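The check described above (do the resets follow the maintenance time when it is moved?) can be sketched in Python. This is a minimal illustration, not something from the support case: the function names and the 15-minute tolerance are hypothetical, and the reset timestamps would have to come from your own alert history. It also assumes the maintenance time is not close to midnight.

```python
from datetime import datetime, time


def minutes_from_maintenance(reset_at: datetime, maintenance: time) -> float:
    """Absolute gap, in minutes, between a reset and that day's maintenance start."""
    start = datetime.combine(reset_at.date(), maintenance)
    return abs((reset_at - start).total_seconds()) / 60.0


def resets_track_maintenance(resets, maintenance: time, tolerance_minutes: float = 15) -> bool:
    """True if every reset falls within the tolerance of the maintenance start."""
    return all(
        minutes_from_maintenance(reset_at, maintenance) <= tolerance_minutes
        for reset_at in resets
    )
```

If the resets match `time(2, 15)` while maintenance is at the default, and match the new time after the window is moved, that points at maintenance rather than at real disk usage changes.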

     pastedImage_7.png

The new reset condition is a copy of the trigger condition with the operator reversed.

     Trigger

     pastedImage_8.png

     Reset

     pastedImage_9.png

This seems to get around the issue, but it looks like a bug.  Does anyone else use a similar alert with 'Orion General Thresholds'?  Has anyone else seen odd behavior with reset conditions?

  • Filed an improvement request under CORE-13349 to take a look at the placement of threshold in alert builder and check the design of alert resets.

  • Volumes do not currently have overrides for global thresholds. The value you are overriding is specific to capacity planning. What you are looking for is currently a feature request that can be voted on here >

  • I suspect that Capacity Forecast values are re-calculated during DB Maintenance and that could be what is causing this issue?

    In terms of improvement, I would suggest setting Custom Properties against your Volumes, which you can then use to customise your alerts.

  • aLTeReGo​ Thanks for looking into this and replying with the existing feature request.  It looks like the one you linked is marked as 'Implemented', with notes that it was part of 'SAM 6.2'.  The linked example is what we're doing with 'Orion General Thresholds'.

  • Thank you for bringing that to my attention. That statement was unfortunately incorrect. I have re-opened the feature request.

  • dgsmith80​ I agree that it looks like an issue with DB Maintenance.  I suspect it happens during the roll-up of the table from detail to daily, and there is a blank/null entry until the next polling interval.  We tested moving the maintenance window forward and back an hour and the issue moved with it.

    I will look into custom properties, but the implementation seems to work apart from the oddity with the reset, and the alert builder UI not linking from 'volume' to ‘volume capacity forecasting’ (despite the link existing from forecasting to volume, as well as in SWQL Studio).

    What I find more odd is that we are not using ‘volume capacity forecasting’ at all.  We are using 'Orion.Volume.VolumePercentUsed', and comparing it against the warning/critical threshold (found on ‘volume capacity forecasting’).

         pastedImage_0.png

  • aLTeReGo​ no stress at all.  Thanks for the time and assistance looking into this!

  • serena​ Thanks for looking into this and raising the issue.  Please feel free to reach out with a UI session if your team has any questions.

  • Hi, I came across this while trying to find a hit on the exact same issue you mention above (the second issue). Our volume alerts (Linux filesystems) are resetting each night at around 02:15, causing our Ops team a massive headache. Did you ever get a proper fix from SW for this, or are you still working around it with a special reset condition?

    Thanks in advance.

  • Sorry for the late reply.  There is not a fix but a workaround.  I wrote a custom reset condition.  Copy the logic from the trigger and change the operator for the reset (trigger being greater than, change reset to less than).
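
One possible reading of why the workaround helps, assuming the suspicion raised earlier in the thread is right and the polled value is briefly blank/null during maintenance: a null value makes the trigger comparison "not true", which satisfies the default reset, while an explicit less-than comparison against null is also "not true" and so does not reset. The Python sketch below only illustrates that reasoning with SQL-style null handling. It is not Orion's evaluation logic, and all function names are hypothetical.

```python
from typing import Optional


def greater_than(value: Optional[float], threshold: float) -> Optional[bool]:
    """SQL-style comparison: an unknown (None) value gives an unknown result."""
    if value is None:
        return None
    return value > threshold


def less_than(value: Optional[float], threshold: float) -> Optional[bool]:
    """SQL-style comparison: an unknown (None) value gives an unknown result."""
    if value is None:
        return None
    return value < threshold


def default_reset(value: Optional[float], threshold: float) -> bool:
    """'Reset when trigger condition is no longer true': anything but True resets."""
    return greater_than(value, threshold) is not True


def custom_reset(value: Optional[float], threshold: float) -> bool:
    """Copy of the trigger with the operator reversed: only a True result resets."""
    return less_than(value, threshold) is True
```

Under these assumptions, `default_reset(None, 90.0)` resets the alert while `custom_reset(None, 90.0)` does not. Note that with a strict less-than, a value exactly equal to the threshold neither triggers nor resets.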