Process Count with Thresholds.apm-template

Version 1

    POWERSHELL PROCESS COUNTING

     

    My team is in the process of migrating a large number of monitors from our old monitoring platform (Nagios) into Orion (NPM & SAM). One of the major issues I have run into is monitoring Windows process counts. This is simple out of the box in Nagios but not so easy in SAM. We frequently get requests to monitor the number of instances of a specific process running on a given machine, and each request generally comes with a Warning and a Critical threshold.

     

    Now I know what a lot of you are saying, "This is a simple request that I can pull off with a simple PowerShell script." You would be right... and you would be wrong!

     

    You can easily return the process count with a few lines of PowerShell and set Warning and Critical thresholds in the SAM component. The problem lies with the thresholds themselves: one of our most common requests is a Warning when the count is less than X and a Critical when it is greater than Y.

     

    For instance, you are asked to monitor a given Windows machine for chrome: there should always be at least 2 instances running at any time (Warning threshold), and there should be no more than 10 instances running at a given time (Critical threshold).

     

    In SAM, when you set the Warning threshold, the Critical threshold is also bound by the same operator (equal, greater than, less than, etc.), so a single component cannot warn on "too few" and go critical on "too many".
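
    For reference, the "few lines of PowerShell" version that only reports the count might look something like the sketch below. The function name Get-ProcessCount is a hypothetical helper of mine, not part of SAM or of the attached template, and chrome is just the example process from above.

```powershell
# Hypothetical helper (not part of SAM): count running instances of a process by name.
function Get-ProcessCount {
    param([string]$Name)

    # SilentlyContinue makes a missing process produce no output instead of an error.
    # Wrapping the call in @() guarantees an array, so .Count is 0 when nothing matches.
    return @(Get-Process -Name $Name -ErrorAction SilentlyContinue).Count
}

$count = Get-ProcessCount -Name 'chrome'

# SAM reads the component's statistic from a line in this format.
Write-Host "Statistic: $count"
```

    With this version the thresholds still have to be set in the component, which is exactly where the shared-operator limitation gets in the way.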

     

    I have written a PowerShell script that takes care of all of this for you.

    In this script you pass the process name, Warning threshold, and Critical threshold as arguments to the script (e.g. chrome, 1, 5). This allows the script to scale easily: all you need to do is copy the component and adjust your process name and thresholds.
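
    Because everything rides on those three arguments, it can be worth checking them before they are used. The sketch below is one possible way to do that. Read-MonitorArguments is a hypothetical helper that is not in the attached template, and the rule that the Warning value must be lower than the Critical value is my assumption based on the "too few / too many" use case described above.

```powershell
# Hypothetical helper: turn the raw script arguments into typed, validated values.
function Read-MonitorArguments {
    param([object[]]$RawArgs)

    if ($RawArgs.Count -lt 3) {
        throw 'Expected three arguments: process name, warning threshold, critical threshold.'
    }

    # Cast explicitly so that '10' is handled as the number 10 rather than as text.
    $warn = [int]$RawArgs[1]
    $crit = [int]$RawArgs[2]

    # Assumption: warning is the lower bound and critical is the upper bound.
    if ($warn -ge $crit) {
        throw "Warning threshold ($warn) must be lower than critical threshold ($crit)."
    }

    return [pscustomobject]@{
        ProcessName = [string]$RawArgs[0]
        Warning     = $warn
        Critical    = $crit
    }
}

# Example with the arguments mentioned above.
$settings = Read-MonitorArguments -RawArgs @('chrome', '1', '5')
```

    Inside a real script you would pass $args instead of the literal array.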

     

    "But without setting the Warning and Critical thresholds in the component it will always show UP even when the thresholds are crossed, how do you fix that, huh?"

     

    After the script gets the raw count of the process, that count gets mauled by a switch statement that looks for four conditions, checked in this order:

    1. Is the count equal to zero? If so, return the count as the Statistic value and exit, setting the component to DOWN (see my explanation of PowerShell SAM exit codes below).
    2. Is the count less than or equal to the Warning threshold? If so, return the count as the Statistic value and exit, setting the component to WARNING.
    3. Is the count greater than or equal to the Critical threshold? If so, return the count as the Statistic value and exit, setting the component to CRITICAL.
    4. If none of the other conditions are met, return the count as the Statistic value and exit, setting the component to UP.


    Here is the actual script out of the Application Template:


    $ErrorActionPreference = "SilentlyContinue"; # Suppresses the error Get-Process raises when no matching process is running.

    $process = Get-Process -Name $args[0]; # Gets the running processes whose name matches the first script argument.
    $warn = $args[1]; # Sets the warning threshold to the second script argument.
    $crit = $args[2]; # Sets the critical threshold to the third script argument.
    $count = ($process | Measure-Object).Count; # Counts the matching processes (0 if none are running).

    switch ($count)
      {
        {$count -eq 0} {Write-Host 'Statistic:' $count; exit(1)} # If the process is not running, the script exits as Down.
        {$count -le $warn} {Write-Host 'Statistic:' $count; exit(2)} # If the count is less than or equal to the Warning threshold, the script returns the count and exits as Warning.
        {$count -ge $crit} {Write-Host 'Statistic:' $count; exit(3)} # If the count is greater than or equal to the Critical threshold, the script returns the count and exits as Critical.
        default {Write-Host 'Statistic:' $count; exit(0)} # If the process is running and crosses neither threshold, the script returns the count and exits as Up.
      }


    Of course, you can change the operators for any of the tests in the switch statement; these are the ones most frequently used here.
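
    As an illustration of swapping the operators, here is a sketch that uses strict comparisons, so the chrome example from earlier ("at least 2, no more than 10") can be passed as 2 and 10 directly. Get-ExitCodeForCount and its Minimum/Maximum parameters are hypothetical names of mine, and the decision is pulled into a function that returns the exit code (instead of calling exit) purely so it can be tried out at a prompt.

```powershell
# Hypothetical helper: map a process count to a SAM exit code using strict comparisons.
function Get-ExitCodeForCount {
    param([int]$Count, [int]$Minimum, [int]$Maximum)

    if ($Count -eq 0)        { return 1 }  # DOWN: the process is not running at all
    if ($Count -lt $Minimum) { return 2 }  # WARNING: fewer instances than the required minimum
    if ($Count -gt $Maximum) { return 3 }  # CRITICAL: more instances than the allowed maximum
    return 0                               # UP: the count is within the allowed range
}

Get-ExitCodeForCount -Count 0  -Minimum 2 -Maximum 10   # 1 (DOWN)
Get-ExitCodeForCount -Count 1  -Minimum 2 -Maximum 10   # 2 (WARNING)
Get-ExitCodeForCount -Count 2  -Minimum 2 -Maximum 10   # 0 (UP)
Get-ExitCodeForCount -Count 10 -Minimum 2 -Maximum 10   # 0 (UP)
Get-ExitCodeForCount -Count 11 -Minimum 2 -Maximum 10   # 3 (CRITICAL)
```

    In a component script you would print the Statistic line and then call exit with the returned value.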


    PowerShell Exit Codes:

    As you can see in the above script, each branch of the switch statement has a specific exit code. The exit code tells SAM what status to set the component to when the script exits.

    The values for the exit codes are:

    0 = UP
    1 = DOWN
    2 = WARNING
    3 = CRITICAL
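
    When testing a script like this outside of SAM, it can help to translate the exit code back into a status name. The lookup below is a small sketch of that. Get-SamStatusName is a hypothetical helper, and labelling any unlisted code as UNKNOWN is my assumption rather than something covered in this post.

```powershell
# Exit codes from the list above, keyed by their numeric value.
$SamStatusByExitCode = @{
    0 = 'UP'
    1 = 'DOWN'
    2 = 'WARNING'
    3 = 'CRITICAL'
}

# Hypothetical helper: translate an exit code into the status name it stands for.
function Get-SamStatusName {
    param([int]$ExitCode)

    if ($SamStatusByExitCode.ContainsKey($ExitCode)) {
        return $SamStatusByExitCode[$ExitCode]
    }
    return 'UNKNOWN'  # Assumption: codes outside the list are labelled UNKNOWN here.
}

Get-SamStatusName -ExitCode 2   # WARNING
```

    After running the component script by hand in a child PowerShell process, you could pass $LASTEXITCODE to this helper to see which status SAM would be told to set.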


    I have attached a sample SAM application template that includes this script. I hope it spares you some of the headaches I went through getting this put together.