The Curious Case of the "Dead" Nodes


The Problem…

As the "monitoring people," we often find ourselves responsible for keeping the records in the database correct and current. The problem is, no matter how hard we try, our end users don't always keep us up to date when a device is turned off. Normally, we find out a device was turned off when we see a NODE DOWN alert hit the board. The team responsible will sometimes ignore the notification because, to them, the node is no longer in use. They delete the email and never circle back to ensure the device is removed from all the different databases, including the monitoring database.

Well, one day a few weeks ago, coming off a great SolarWinds User Group (SWUG) in New York, my brain was spinning with ideas on how to automate simple tasks when the idea of "Dead Nodes" hit me. I thought about the common problem of having nodes on a report showing as "Down" when they were no longer in use. Knowing the power of Server & Application Monitor (SAM), and some of the things I'd already done within that tool, I knew it was possible to address this use case easily within SAM.

The Birth of an Idea

So, I turned to THWACK to see if anyone else had the same idea. I found a great post with a great script, and I wanted to take it one step further. The original post I found would deal with the dead nodes, but it wasn't integrated into an alert. Since I wanted notifications sent to the system owners, this wouldn't work for me. So, I reached out to Kevin M. Sparenberg (KMSigma), told him my idea, and he came back with "Let me try it out!" A few hours later, while at an amusement park, I found myself working with Kevin to perfect the alert. The alert was key for me because, as the monitoring guy, I think it's important to at least share with my end users what I'm doing with their devices. And that was lacking from the original post I found on THWACK. Kevin and I worked together to develop the SWQL query to define the conditions, write the script to run in PowerShell to do the heavy lifting, and craft the email notification.

I'm going to walk you through the way I built this alert with some help from the community. I'll cover three basic areas: Frequency of the Alert, Trigger Condition, and Alert Actions.

At the very end of this post are some things you may encounter using the examples in your environment. I ran into a few of them, I knew about a few others, and Kevin reminded me about one or two. I highly recommend you review the Some System Requirements section before importing the alert and scripts.

I've done my due diligence and provided you the necessary warnings. Now it's off to the races!

What's the Frequency?

01_Frequency.jpg

Since our dead nodes alert isn't exactly mission-critical—it's more like good housekeeping—there's no need to check it every minute (which is the default). After a little discussion, I decided once an hour was enough for our needs. You could scale this back to once a day or even once a week (168 hours) if you like.

The Power of SQL/SWQL in an Alert Trigger

Thanks to Kevin's knowledge and understanding of SQL and SWQL, he was able to develop the original SWQL query based on the key points I wanted, which were straightforward. I wanted to find all the nodes in my system that had been reporting as "DOWN" for the past 30 days. He came back with the following, based on the original thread:

SELECT Nodes.Uri, Nodes.DisplayName FROM Orion.Nodes AS Nodes
JOIN Orion.ResponseTime AS RT
ON Nodes.NodeID = RT.NodeID 
WHERE RT.DateTime > ADDDAY(-30, GETUTCDATE())
AND Nodes.UnManaged = False
GROUP BY Nodes.NodeID, Nodes.Caption, Nodes.Uri, Nodes.UnManaged
HAVING MAX(RT.Availability) = 0

I opened SWQL Studio and ran this query to see if it passed the "sniff" test. The results looked pretty good, so I looped in my manager.
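
To make the HAVING clause a little less abstract, here's a rough PowerShell sketch of the same grouping logic run against a handful of made-up availability samples. The $Samples data and the node IDs are hypothetical stand-ins for Orion.ResponseTime rows, not output from a real system. The only point is to show that a node is picked up when its best availability reading in the window is 0, and not otherwise.

```powershell
# Hypothetical sample data standing in for Orion.ResponseTime rows (not real output)
$Samples = @(
    [pscustomobject]@{ NodeID = 1; Availability = 0 }
    [pscustomobject]@{ NodeID = 1; Availability = 0 }
    [pscustomobject]@{ NodeID = 2; Availability = 0 }
    [pscustomobject]@{ NodeID = 2; Availability = 100 }
    [pscustomobject]@{ NodeID = 3; Availability = 100 }
)

# Group the samples by node, then keep only the groups whose highest
# availability is 0, mirroring GROUP BY ... HAVING MAX(RT.Availability) = 0
$DeadNodeIDs = @(
    $Samples |
        Group-Object -Property NodeID |
        Where-Object { ( $_.Group | Measure-Object -Property Availability -Maximum ).Maximum -eq 0 } |
        ForEach-Object { $_.Group[0].NodeID }
)

$DeadNodeIDs   # only node 1 qualifies
```

Node 2 had at least one good poll in the window, so it's left alone even though it was down for part of the time.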

After speaking with my manager, I realized we'd cast our net a little too wide. Within my environment, I have some nodes that have been down for over 30 days but shouldn't be considered "Dead." These nodes are normally found within some of our locations and might be offline because of a natural disaster or because the stores are simply being remodeled. So, I took what Kevin gave me and changed it up to make sure it wasn't pulling in any devices that were down for a known reason. The result was this:

SELECT Nodes.Uri, Nodes.DisplayName FROM Orion.Nodes AS Nodes
JOIN Orion.ResponseTime AS RT
ON Nodes.NodeID = RT.NodeID 
WHERE RT.DateTime > ADDDAY(-30, GETUTCDATE())
AND Nodes.UnManaged = False
  AND Nodes.CustomProperties.Store_Known_Down = False
GROUP BY Nodes.NodeID, Nodes.Caption, Nodes.Uri, Nodes.UnManaged
HAVING MAX(RT.Availability) = 0 

It should be noted that Store_Known_Down is a Yes/No custom property I've tied to nodes so I can mark them as being down for a known reason. Your alert logic will probably differ, but it's important to think about these edge cases.

Defining the Alert Actions

With the list of devices from the alert trigger in hand, we next had to address the actions when the trigger occurs. For me, it was key to have both an email message to the system owners and the alert add a "Decommissioned Date" to the existing custom property with the same name. We use this custom property within my environment to track when a node is no longer in use, so having this date was critical for both reporting and alerting logic.

Kevin again came to the rescue and helped me develop the PowerShell script. We then tested the alert in his test lab and BINGO! The system was unmanaged and the custom property value was updated with the current date/time. But more details on the script later.

The Proof Is in the Results

So, after perfecting the query and the script, it was time to test it out. Kevin spun things up in his lab. We started by crafting a new alert and testing the query logic in the alert editor:

02_TriggerConditions.jpg

The query passed validation, so we've got no syntax errors and are good to move on to the next step.

Manually Testing the PowerShell Script

Before we could define the alert actions, we wanted to test all the parts, including the PowerShell script.

The complete script is below and is thoroughly commented. There are a few important parts to this script. The only place you will absolutely need to edit is the authentication block at lines 21-23, where you'll need to put in your Orion server and credentials.

<#
Script: Alert_Unmanage-Node.ps1
Arguments: The node ID in question
Authors:    Ben Keen (the_ben_keen) and Kevin M. Sparenberg (KMSigma)

Version: 1.0 - initial release


#>
if ( -not $args[0] )
{
    Write-Error -Message "You must provide the Node ID as a parameter to the script"
}
else
{

    # I hate using the "args" nomenclature, so I'm just going to assign it to a better name
    $NodeID = $args[0]
   
    # Authentication
    $SwisHostname = "MyOrionServer.Domain.Local"
    $SwisUsername = "MyAdminAccount"
    $SwisPassword = "MyAdminPassword"

    # Build a SWIS Connection
    $SwisConnection = Connect-Swis -Hostname $SwisHostname -UserName $SwisUsername -Password $SwisPassword

    # When does the unmanage start?  Right now!
    $CurrentDate = ( Get-Date ).ToUniversalTime()
   
    # Flip the status to Unmanaged, effectively indefinitely (the end date is 10 years out)
    # The parameters are:
    # - The Node ID (in N:##) format
    # - The start date of the unmanage time
    # - The end date of the unmanage time (now + 10 years)
    # - false - no clue why this is required, but it is
    $Results = Invoke-SwisVerb -SwisConnection $SwisConnection -EntityName "Orion.Nodes" -Verb "Unmanage" -Arguments @( "N:$( $NodeID )", $CurrentDate, $CurrentDate.AddYears(10), $false )

    # We need the full URI to set properties
    $Uri = Get-SwisData -SwisConnection $SwisConnection -Query "SELECT Uri FROM Orion.Nodes WHERE NodeID = $NodeID"
    # Then we need to append it with the CustomProperties identifier
    # The [$Uri += "/CustomProperties"] is the equivalent of [$Uri = $Uri + "/CustomProperties"]
    $Uri += "/CustomProperties"

    # Set the Custom Property
    # Parameters are:
    # - The URI of the node in question's custom properties
    # - A hashtable of the properties and the values
    #      Denoted by @{ PropertyName1 = PropertyValue1; PropertyName2 = PropertyValue2; ... }
    $CustomProperty = @{ "Decommissioned_Date" = $CurrentDate }
    Set-SwisObject -SwisConnection $SwisConnection -Uri $Uri -Properties $CustomProperty
}

You'll notice on line 18, we make reference to the $args variable. These are the parameters you pass to this script. For this script, it's the Node ID of the device we want to decommission.  This script is only expecting a single node ID to be passed, so we are only looking at $args[0] (the first argument in the variable).

On line 37, we set the device to the Unmanaged status, and later, on lines 50-51, we set the decommission date custom property. In reality, there are only about six lines of this script that do any work. The rest are comments so we can understand what we did even years down the road.
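
If you'd rather avoid $args altogether, one possible variation is to use a param() block and build the verb arguments in a small helper. This is only a sketch and not what's in the script above: the function name New-UnmanageArgument and its parameters are made up for illustration, and it doesn't talk to Orion at all. It just returns the same four values, in the same order, that the script passes to the Unmanage verb.

```powershell
function New-UnmanageArgument {
    param(
        # Node ID as passed in by the alert action
        [Parameter(Mandatory = $true)]
        [ValidateRange(1, 2147483647)]
        [int]$NodeID,

        # Start of the unmanage window (defaults to now, in UTC)
        [datetime]$StartUtc = ( Get-Date ).ToUniversalTime(),

        # How many years to keep the node unmanaged
        [int]$Years = 10
    )

    # Same order as the script: net object ID, start, end, trailing $false
    @( "N:$NodeID", $StartUtc, $StartUtc.AddYears( $Years ), $false )
}

# Hypothetical usage with the same test node ID used in this post
$UnmanageArguments = New-UnmanageArgument -NodeID 62
$UnmanageArguments[0]   # N:62
```

The resulting array could then be handed to Invoke-SwisVerb through -Arguments in place of the inline array. A missing or non-numeric node ID is rejected by the parameter validation before anything else runs.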

To test it, we opened a PowerShell prompt and then typed:

D:\Scripts\Alert_Unmanage-Node.ps1 62

This is the full path to the script, including the extension, followed by a space and then the node ID of the device to mark as "dead."

When executed against a testing node, we got no errors in the PowerShell prompt and the Orion pages showed the results we expected. Nice!

03_ExampleExecution.jpg

As you can see, the node was switched to Unmanaged and a Decommissioned Date was added.

Now that I know the script works, I can add it to an alert action.

Add an action for Execute an External Program and then fill in the details.

04_ExecuteExternal.jpg

The full path doesn't show up in a screenshot, so I'll put it all here for you:

"C:\WINDOWS\System32\WindowsPowerShell\v1.0\powershell.exe" -File "D:\Scripts\Alert_Unmanage-Node.ps1" ${N=SwisEntity;M=NodeID}

It's a very long line, but simple in execution. Let me break it down:

"C:\WINDOWS\System32\WindowsPowerShell\v1.0\powershell.exe"
    Full path to the PowerShell executable

-File
    Parameter telling PowerShell to run the script in the next position

"D:\Scripts\Alert_Unmanage-Node.ps1"
    The full path to the script. If you save this elsewhere on your computer, be sure to update the path.

${N=SwisEntity;M=NodeID}
    SolarWinds variable containing the NodeID for the alerted node

Customizing the Alert Email

The user experience is key in everything, but especially in monitoring. If you're going to use the information in this post, make sure you spend some time crafting the message sent. I wrote it based on how my end users digest their alerts.  Your end users may view their alerts differently. I don't need much more than the basics for this type of alert message. I kept most of the default message and then just added some language about it being a dead node. Below is my example of the alert message.  [Yes, I know I have a typo in the first line]

05_AlertEmail.jpg

So, I have the Frequency of the Alert, Trigger Condition, and Alert Actions (execute a script and send an email)—everything we need for this alert. When completed, the trigger actions list looked like this:

06_TriggerActions.jpg

And that's pretty much it for the alert. There are no reset actions, so we're done. I just clicked through the wizard to save it. In my environment, I didn't enable the alert yet. I needed to make everyone aware of what was happening first.

The Results Are In

After clearing it with the necessary teams, I enabled the alert. Within a few minutes, the first system was found, flipped, and timestamped.

07_Results.jpg

The results speak for themselves. My Orion server will no longer waste compute power trying to poll devices that have been offline for 30 days, the associated teams got a message saying I've stopped watching their devices, and I can make a simple custom query resource to show me all unmanaged devices with a decommission date.

Edit a dashboard, add a new widget, search for a Custom Query widget, drag it into your dashboard, and then save the layout.

Edit the widget, provide a clear name, and enter the following for the custom SWQL query:

SELECT  N.Caption AS [Node Name]
      , CONCAT('/NetPerfMon/images/Vendors/', N.VendorIcon) AS [_IconFor_Node Name]
      , N.DetailsURL AS [_LinkFor_Node Name]
      , N.CustomProperties.Decommissioned_Date AS [Decommission Date]
FROM Orion.Nodes AS N
WHERE N.Unmanaged = 'TRUE'
   AND N.CustomProperties.Decommissioned_Date IS NOT NULL

If you want to enable search, enter the following for the search query:

SELECT  N.Caption AS [Node Name]
      , CONCAT('/NetPerfMon/images/Vendors/', N.VendorIcon) AS [_IconFor_Node Name]
      , N.DetailsURL AS [_LinkFor_Node Name]
      , N.CustomProperties.Decommissioned_Date AS [Decommission Date]
FROM Orion.Nodes AS N
WHERE N.Unmanaged = 'TRUE'
   AND N.CustomProperties.Decommissioned_Date IS NOT NULL
   AND N.Caption LIKE '%${SEARCH_STRING}%'

When done, it'll look like this:

08_CustomQuery.jpg

Save that resource and now you have a quick and easy way to search for unmanaged nodes, with hover-over information to boot.

09_CustomQueryResults.jpg

In Summary

After all this was completed, I was very pleased with the results, but began to look around for some other changes. I've already thought of some ways to tweak this logic, improve the alert language, and leverage the SolarWinds Orion API to do more of my work for me.


Some System Requirements

Since this was my first foray into using a script action, I needed to do some additional work. You may not need to do all of these, depending on the way your infrastructure is architected.

PowerShell Execution Requirements

Depending on how your Orion server is configured, you may not be able to natively execute PowerShell scripts. This is part of the Execution Policy and it's controlled by several things, including Group Policy. To check the execution policy, open PowerShell as an Administrator and execute:

Get-ExecutionPolicy

If the result is RemoteSigned, Unrestricted, or Bypass, you can already run PowerShell scripts on this machine. If it's anything else, you'll need to change the policy. This falls outside the scope of this document, but you can find more information about Execution Policies in the Microsoft documentation.

SolarWinds Orion PowerShell Module

To connect to the SolarWinds Information Service, you'll need to install the SolarWinds Orion PowerShell Module (SwisPowerShell). This module is freely available and published on the PowerShell Gallery. To install it on your server, open PowerShell as an Administrator and execute:

Install-Module -Name SwisPowerShell -Scope AllUsers -Force

If this is the first PowerShell module you're installing, you may get prompted to approve the NuGet package provider. This is expected, and you can answer "Yes."  The above line says to install the PowerShell module and make it available for all users on that machine.

To validate the module was installed correctly, execute:

Get-Module -ListAvailable -Name SwisPowerShell

If you get a result showing a version, then it's installed correctly.
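
If you'd like to roll the execution policy and module checks into one quick preflight, something along these lines should do it. Treat it as a sketch: the variable names are mine, and I'm assuming Bypass should count as a permissive policy alongside RemoteSigned and Unrestricted. It only reads settings and doesn't change anything on the machine.

```powershell
# Policies that allow a local, unsigned script to run
# (including Bypass here is an assumption of this sketch)
$PermissivePolicies = 'RemoteSigned', 'Unrestricted', 'Bypass'

# Effective execution policy for the current session, as a string
$Policy = [string]( Get-ExecutionPolicy )
$PolicyOk = $PermissivePolicies -contains $Policy

# Newest installed copy of the SwisPowerShell module, if there is one
$Module = Get-Module -ListAvailable -Name SwisPowerShell |
    Sort-Object -Property Version -Descending |
    Select-Object -First 1
$ModuleOk = [bool]$Module

if ( -not $PolicyOk ) {
    Write-Warning "Execution policy is '$Policy'; scripts may be blocked"
}
if ( -not $ModuleOk ) {
    Write-Warning "SwisPowerShell was not found; run Install-Module first"
}
```

If neither warning shows up, the machine should be ready to run the alert script.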

Custom Properties

For the script to execute correctly, you need to have a custom property called "Decommissioned_Date" with the date/time data type and assigned to nodes. To create this custom property, within your admin pages, navigate to the Manage Custom Properties page and click "Add Custom Property."

10_CustomProperty_Add.jpg

This custom property will be based on nodes.

11_CustomProperty_Node.jpg

Provide the name, give it a description, and select the format as Date/Time. Be sure to keep the "required property" checkbox deselected.

12_CustomProperty_Details.jpg

Lastly, don't manually assign this custom property to any nodes. We'll let the script do the work.

13_CustomProperty_Assignment.jpg

Note: if you choose to use a different name for your custom property, be sure to update it within the PowerShell script (line 50).

attachments.zip
Comments
  • amazing stuff; kudos good sir!

    One thing I would add, you don't have to change your global execution policy, you can simply add a bypass in-line that will only apply during the execution of the script; and leave the global settings alone (and keeping the security people in their dungeons) ;)

    ex:

    "C:\WINDOWS\System32\WindowsPowerShell\v1.0\powershell.exe" -ExecutionPolicy Bypass -File "D:\Scripts\Alert_Unmanage-Node.ps1" ${N=SwisEntity;M=NodeID}
