A little confused while creating alerts

Hi Team,

Greetings!

I'm trying to create alerts based on the criteria below for 11 nodes that we monitor with SW agents.

1. RAM Utilization >= 96 Warning

2. RAM Utilization >= 98 Critical

3. CPU Utilization >= 80 Warning

4. CPU Utilization >= 90 Critical

5. Disk Utilization >= 75 Warning

6. Disk Utilization >= 85 Critical

7. Alert when any node is down

I tried to create a few alerts to meet the above requirements, but as far as I know, we can't achieve them all in one alert. When I created an alert for disk utilization, it fired incorrectly; maybe I used the wrong field.

I have not yet tested the alerts for RAM, CPU, and node down. Can you please tell me the best way to meet the above requirements? We have NTA and NPM in our SW environment.

  • There are a few ways to achieve this. 

    First, it's important to understand that an alert triggers when a predefined condition is met on specific objects such as nodes, volumes, or interfaces. By default, the dropdown list offers no option to 'bind' nodes, volumes, CPU, and memory utilisation in a single alert.

    One option is to use the threshold values for each object/stat (CPU, RAM, Volumes), take advantage of the Node child status participation, and create an alert with a simple condition: the node is in a warning status, in a critical status, or down. This would, of course, also trigger whenever the node is in a warning or critical state for any other reason.
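    To make the child status idea concrete, here is a minimal Python sketch, not the product's actual rollup logic. It assumes a simplified "worst child status wins" rule, uses the thresholds from the question, and the `Status` values and function names are hypothetical.

```python
from enum import IntEnum


class Status(IntEnum):
    # Hypothetical ordering for this sketch, not SolarWinds status codes.
    UP = 0
    WARNING = 1
    CRITICAL = 2
    DOWN = 3


def classify(value, warning, critical):
    """Map a utilisation percentage to a status using >= thresholds."""
    if value >= critical:
        return Status.CRITICAL
    if value >= warning:
        return Status.WARNING
    return Status.UP


def node_status(responding, cpu, ram, volumes):
    """Roll child statuses up to the node: the worst child status wins."""
    if not responding:
        return Status.DOWN
    children = [classify(cpu, 80, 90), classify(ram, 96, 98)]
    children += [classify(used, 75, 85) for used in volumes]
    return max(children)


def alert_triggers(status):
    """The simple condition: node is warning, critical, or down."""
    return status in (Status.WARNING, Status.CRITICAL, Status.DOWN)
```

    Note that `alert_triggers` only sees the rolled-up status, so the alert itself cannot say which child (CPU, RAM, or a volume) caused it.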

    If you're comfortable with SQL/SWQL queries (SWQL is better) you can create your own trigger definition, see sample below:

    However, I don't recommend any of the above methods, as they are either not accurate enough or won't give any immediate, clear indication of why the alert was raised. Instead, I would ask whether there is any specific reason to consolidate all the trigger conditions into a single alert. As a general principle, I try to keep alerts that have to be actioned (e.g., a node is down <- needs to be investigated) and avoid informational ones (such as CPU is warning but not critical yet) to minimise the noise.

    I'd suggest creating four alerts, as below, unless there's a specific requirement:

    - CPU usage above its critical threshold for XX minutes (to avoid spikes)
    - RAM Utilization above its critical threshold for XX minutes (avoiding spikes again)
    - Node is down
    - Volume usage is above critical OR free space is below XX GB (percentages don't give you absolute figures: 5% free space on a 1TB volume is very different from 5% free space on a 10GB volume)
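    The two ideas in the list above, "sustained for XX minutes" and "percentage versus absolute free space", can be sketched in Python. This is an illustration under stated assumptions, not how the alert engine evaluates conditions: the function names are hypothetical, samples are assumed to arrive at a fixed polling interval, and the numbers in the usage note are made-up examples rather than recommended values for XX.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """True if the most recent `min_consecutive` samples are all >= threshold.

    A single spike does not qualify; the breach has to persist.
    """
    if len(samples) < min_consecutive:
        return False
    return all(s >= threshold for s in samples[-min_consecutive:])


def free_space_gb(capacity_gb, percent_used):
    """Convert a used-percentage into absolute free space in GB."""
    return capacity_gb * (100 - percent_used) / 100


def volume_alert(capacity_gb, percent_used, critical_percent, min_free_gb):
    """Trigger on the percentage threshold OR on low absolute free space."""
    too_full = percent_used >= critical_percent
    too_little = free_space_gb(capacity_gb, percent_used) < min_free_gb
    return too_full or too_little
```

    For example, `sustained_breach([50, 99, 50], 90, 3)` is false because the spike does not persist, while `sustained_breach([95, 60, 92, 93, 94], 90, 3)` is true. At 95% used, `free_space_gb(1000, 95)` gives 50 GB free but `free_space_gb(10, 95)` gives only 0.5 GB, which is why the absolute check is worth having.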

    Antonis

  • Thanks for your time. I have created individual alerts for every requirement to avoid a mess in the future.

  • I think this might be the best way. Yeah, you can do whatever you want with SWQL queries, and I'm a big fan, but as a general best practice with alerts, especially if you are not really strong with SWQL, I usually recommend that people make their trigger conditions only as complex as absolutely needed. A Boolean logic engine (which is what powers our conditional logic in the Platform) is very powerful, but you can design alerts with nested groups and such that get so complex it becomes very hard to troubleshoot where your logic is going wrong if you have issues. 

    Of course, it is a fine balance. What is too simple (which commonly creates problems), and what is too complex? LOL, it's very contextual.

    In the case of alerting on different metric thresholds for different monitored entities, I would probably split those into different alerts.

    Be sure to use the Scope feature to aim those alerts at specific entities. That way, you don't have too many alerts firing on too many "things."
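    As a rough illustration of why scoping matters, the Python sketch below separates "which entities does this alert look at" from "what condition fires it". Everything here is hypothetical: the `Node` fields, the choice of agent-monitored nodes as the scope (borrowed from the question), and the 90% CPU trigger. It is not the alert engine's implementation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    caption: str
    agent_monitored: bool
    cpu_percent: float


def in_scope(node):
    """Scope: only the agent-monitored nodes are candidates (assumed criterion)."""
    return node.agent_monitored


def trigger(node):
    """Trigger condition: CPU at or above the critical threshold."""
    return node.cpu_percent >= 90


def evaluate(nodes):
    """Return captions of nodes that are in scope AND meet the trigger."""
    return [n.caption for n in nodes if in_scope(n) and trigger(n)]
```

    With this split, a busy node outside the scope never fires the alert, even though it meets the trigger condition.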

  • I have to genuinely question the intent, to make sure I understand what you want.

    Are these meant to trigger only when they occur at the same time, or are they general criteria? Do you want to alert on the node in question being in warning/critical?

    If the latter, you could do this with a single alert (or two at most, one for warnings and one for critical, using an alert action as a template for both) for everything except node down, by having any of these criteria trigger the alert. Just have the fields output in the body and caption/cpu/memory in the subject line.
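    A minimal Python sketch of the "any of these criteria" approach follows. It assumes the thresholds from the question, folds warning and critical into one evaluation for brevity, and uses a made-up subject format; the names `THRESHOLDS`, `severity`, `breaches`, and `subject` are hypothetical and are not alert variables from the product.

```python
# (warning, critical) thresholds in percent, taken from the question.
THRESHOLDS = {"cpu": (80, 90), "memory": (96, 98), "disk": (75, 85)}


def severity(metric, value):
    """Return 'critical', 'warning', or None for one metric reading."""
    warning, critical = THRESHOLDS[metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return None


def breaches(metrics):
    """Collect every metric that crosses a threshold (any-of logic)."""
    found = {}
    for name, value in metrics.items():
        level = severity(name, value)
        if level is not None:
            found[name] = level
    return found


def subject(caption, metrics):
    """Build a subject line naming the node and the breaching metrics."""
    found = breaches(metrics)
    if not found:
        return None  # nothing breached, so the alert does not fire
    level = "critical" if "critical" in found.values() else "warning"
    details = ", ".join(f"{m}={metrics[m]}%" for m in found)
    return f"[{level.upper()}] {caption}: {details}"
```

    Putting the breaching values in the subject is what keeps a combined alert readable: the recipient can see which criterion fired without opening the alert.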

    I am a strong supporter of creating fewer alerts, not more, for reasons of organization and growth. I also recommend creating a custom property for alerts with the name of the team that owns the alert; otherwise you quickly end up with an alert management mess.

    As noted above, this isn't a single-solution situation.