
Problem with a script to suppress alerts on specified nodes

I am having a problem that I cannot figure out.  I have a set of 4 scripts that run every night as a scheduled task under the System account.  The 4 scripts work exactly the same way: each generates a list of servers from a query and turns on Alert Suppression on those nodes for 10 hours.  The problem is that the scripts are randomly failing.  The script that runs at 8pm will succeed, but the ones at 9pm, 10pm, and 11pm will fail.  I have error handling set up in the script, but the error generated during the Alert Suppression task is "one or more errors occurred".  Can someone point me to the log file that the SWIS SDK PowerShell cmdlets use, so I can see what the one or more errors are and attempt to fix the problem?  I have added the script that I am using.

Function Create-EventSource

{

  param(

    [string]$logfileName,

    [string]$sourceName

  )

 

  If([System.Diagnostics.EventLog]::Exists($logfileName))

    {

        write-output "$logFileName Exists"

        If(-not [System.Diagnostics.EventLog]::SourceExists($sourceName))

        {

            New-EventLog -LogName $logfilename -Source $sourceName

        }

    }

    else

    {

        $event = "Creating $logfilename EventLog"

        New-EventLog -LogName $logfileName -Source $sourceName

        Limit-EventLog -LogName $logFileName -OverflowAction OverwriteAsNeeded -MaximumSize 50MB

        Write-EventLog -LogName Application -Source OVO -EntryType Information -Category 0 -EventId 200 -Message $event

    }

}

Function Send-Email

{

    param(

        [string]$body

    )

    $subject = "Script Failed to Connect to SolarWinds"

    $smtpTo = "some email address"

    $smtpFrom = "some other address"

    $smtpServer = "smtp server"

   

    Send-MailMessage -From $smtpFrom -To $smtpTo -Body $body -Subject $subject -SmtpServer $smtpServer

}

Function Validate-MaintenanceMode

{

    Param(

        [string]$nodeName,

        [string]$uri,

        [object]$swisConnection

    )

    $tmpquery1 = "select entityuri, suppressfrom, suppressuntil from orion.AlertSuppression where entityuri = '$uri'"

    $AlertSuppressionInfo = Get-SwisData -SwisConnection $swisConnection -Query $tmpQuery1

    If(([string]::IsNullOrEmpty($AlertSuppressionInfo.suppressfrom)) -and ([string]::IsNullOrEmpty($AlertSuppressionInfo.suppressuntil)))

    {

        $IsSet = "No"

    }

    elseif([string]::IsNullOrEmpty($AlertSuppressionInfo.suppressuntil))

    {

        $IsSet = "Manual"

    }

    else

    {

        $IsSet = "Yes"

    }

    $obj = New-Object PSObject -Property @{

                            NodeName = $nodeName

                            MMStartTime = $AlertSuppressionInfo.suppressfrom

                            MMEndTime = $AlertSuppressionInfo.suppressUntil

                            URI = "$uri"

                            isSet = $IsSet

                       }

    return $obj

}

######################### Main #############################################################

$source = "ArgoMaintenanceModeMountain"

$eventLogMessage = ""

$eventLogName = "EventMonitoring"

Create-EventSource -logfileName $eventLogName -sourceName $source

#Add the snapin | import the module for SolarWinds API

#Add-PSSnapin SwisSnapin

Import-Module C:\SLW\PSModules\SwisPowerShell\2.3.0.108\SwisPowerShell.psd1

#create a connection to solarwinds

$hostname = 'orion server ip address'

$username = 'username'

$password = 'password'

#Set DateTime Parameters

$startTime = get-date

$endtime = $startTime.AddHours(10)

$eventLogMessage += "Getting the starting time and end time for Alert Suppression for the Argo Servers in Eastern Time Zone`r`n"

$eventLogMessage += "Start Time:  $startTime`r`n"

$eventLogMessage += "End Time: $endTime`r`n"

$eventLogMessage += "Creating Connection to SolarWinds Application`r`n"

try{

    $swis = Connect-Swis -Hostname $hostname -UserName $username -Password $password -ErrorAction Stop

}

catch

{

    $body = "$($MyInvocation.InvocationName) was unable to connect to Production SolarWinds through API.  Please Validate the username and password are correct"

    $eventLogMessage += $body

    Send-Email -body $body

    Write-EventLog -LogName $eventLogName -Source $source -EntryType Error -Category 0 -EventId 3245 -Message $eventLogMessage

    exit

}

$eventLogMessage += "Generating the List of Argo Servers to be placed into Maintenance Mode`r`n"

$query = "select caption,uri from orion.nodes where caption like '%azargmtw%'"

try

{

    $uri = Get-SwisData -SwisConnection $swis -query $query -ErrorAction Stop

}

catch

{

    $body = "$($MyInvocation.InvocationName) hit an exception during the query to get server information.  The Error:`r`n $($error[0])"

    $eventLogMessage += "$body"

    send-email -body $body

    Write-EventLog -LogName $eventLogName -Source $source -EntryType Error -Category 0 -EventId 3245 -Message $eventLogMessage

    Exit

}

if([string]::IsNullOrEmpty($uri))

{

    $body = "$($MyInvocation.InvocationName): SolarWinds Query did not return any values.  Please Validate Query in the SWQL Studio`r`n The query was '$query'"

    $eventLogMessage += $body

    Send-Email -body $body

}

else

{

    $eventlogMessage += "Found $($uri.count) servers to put into maintenance mode`r`n"

    $eventlogMessage += "Placing Servers into maintenance mode`r`n"

    try

    {

        Invoke-SwisVerb -SwisConnection $swis -EntityName Orion.AlertSuppression -Verb SuppressAlerts -Arguments @([string[]] $uri.uri, $startTime, $endtime) -ErrorAction Stop | Out-Null

    }

    catch

    {

        $body = "$($MyInvocation.InvocationName): Exception when attempting to Set AlertSuppression on the Node.  The Error:`r`n $($error[0])"

        $eventLogMessage += $body

        Send-Email -body $body

    }

    $eventLogMessage += "Validating Maintenance Mode was Set on the Servers`r`n"

    foreach($info in $uri)

    {

        $result = Validate-MaintenanceMode -nodeName $info.caption -uri $info.uri -swisConnection $swis

        Switch($result.IsSet)

        {

            {$_ -eq "No"} {$eventLogMessage += "$($result.nodeName): Maintenance Mode was not set`r`n"}

            {$_ -eq "Manual"} {$eventLogMessage += "$($result.nodeName): Maintenance Mode was set at $($result.MMStartTime.ToLocalTime()) but needs to be manually reset`r`n"}

            {$_ -eq "Yes"} {$eventLogMessage += "$($result.nodeName): Maintenance Mode was set at $($result.MMStartTime.ToLocalTime()) to $($result.MMEndTime.ToLocalTime())`r`n"}

        }

    }

}

$eventLogMessage

Write-EventLog -LogName $eventLogName -Source $source -EntryType Information -Category 0 -EventId 200 -Message $eventLogMessage

  • Two places to look:

    1. The SWIS log: C:\ProgramData\SolarWinds\InformationService\v3.0\Orion.InformationService.log

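    2. More details from PowerShell: output "$($error[0] | format-list -force | out-string)" instead of just $($error[0]). Without out-string, the string expansion yields only the type names of the formatting objects rather than the error text.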
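
A minimal sketch of the second suggestion, for use inside a catch block. Get-ErrorDetail is a hypothetical helper (it is not part of the posted script or of SwisPowerShell), and the thrown AggregateException is only a stand-in for the failing Invoke-SwisVerb call. Besides dumping the error record, it walks the InnerException chain, which is usually where the cause behind "one or more errors occurred" sits.

```powershell
# Hypothetical helper (not from the original script): renders an ErrorRecord
# as text, including the chain of inner exceptions.
function Get-ErrorDetail {
    param([System.Management.Automation.ErrorRecord]$ErrorRecord)

    # Format-List emits formatting objects; Out-String turns them into text.
    $text = $ErrorRecord | Format-List * -Force | Out-String

    # Walk InnerException to surface what the outer exception wraps.
    $current = $ErrorRecord.Exception
    while ($current) {
        $text += "$($current.GetType().FullName): $($current.Message)`r`n"
        $current = $current.InnerException
    }
    return $text
}

try {
    # Stand-in for the failing call: an AggregateException wrapping a timeout.
    $cause = New-Object System.TimeoutException -ArgumentList 'The wait operation timed out'
    throw (New-Object System.AggregateException -ArgumentList 'One or more errors occurred.', $cause)
}
catch {
    $detail = Get-ErrorDetail -ErrorRecord $_
    $detail
}
```

In the posted script, the same call could replace `$($error[0])` when building `$body` in each catch block.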

  • I see the following error in the log file for the time that the scripts run:

    System.Data.SqlClient.SqlException: Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception: The wait operation timed out

    What can be some of the causes of the above error message?  I have looked at the SWNetPerfMon.db file and see that Timeout = 40, CommandTimeout = 600 and SqlCommandTimeout = 600.

    In working with support, I found that I do have subscription errors that need to be cleaned up, which I have scheduled for next week.  Would those subscription errors cause this problem?

    Thanks.

  • A few weeks ago you posted about poor performance of this verb: Alert Suppression Question. Are these timeouts just a result of those performance issues? How many nodes are you suppressing in a batch?

  • I don't know if these timeouts are a result of the performance issues.  The scripts worked flawlessly for more than 25 days.  I asked the gentleman who supported the Orion platform before me and he said that he noticed different timeouts in the log file, some he could track down others he couldn't.  Of the 4 scripts that I am running, the first one is approximately 180 nodes, the 2nd one is approximately 150 nodes, the 3rd one is less than 20 nodes and the 4th one is approximately 100 nodes.

    I do have another node cleanup script that runs every 30 minutes.  I have changed its scheduled task to see if it was a cause of the issues that I am experiencing, even though the alert suppression failures started before I enabled my cleanup script.

    How much of a performance difference is there between running a query in SWQL Studio compared to Get-SwisData?  Are the timeout variables from SWNetPerfMon.db set reasonably?  How can I tell which timeout I hit?

    I do appreciate all of the help.

  • Get-SwisData and running a query from SWQL Studio are using essentially the same code paths. Get-SwisData has a "-Timeout" option (takes a timeout value in seconds) that you can use to override the default timeout of 30 seconds.

    Is it actually timing out on the Get-SwisData line? That query looks simple - unless something is really off the rails, I would not expect that to time out.
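
One way to narrow down which timeout was hit is to record how long the call ran before it failed and compare that with the configured values (the 30-second cmdlet default mentioned above, or the 40 and 600 from SWNetPerfMon.db). This is only a sketch: Measure-Call is a hypothetical helper, and the Start-Sleep/throw pair stands in for the real Get-SwisData or Invoke-SwisVerb call.

```powershell
# Hypothetical helper: runs an action, times it, and reports success or the
# error message together with the elapsed seconds.
function Measure-Call {
    param([scriptblock]$Action)

    $stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
    $errorText = $null
    try {
        & $Action | Out-Null
    }
    catch {
        $errorText = $_.Exception.Message
    }
    finally {
        $stopwatch.Stop()
    }

    New-Object PSObject -Property @{
        Succeeded = ($null -eq $errorText)
        Seconds   = [math]::Round($stopwatch.Elapsed.TotalSeconds, 1)
        Error     = $errorText
    }
}

# Stand-in for a slow call that eventually fails.
$result = Measure-Call -Action { Start-Sleep -Milliseconds 200; throw 'simulated timeout' }
"Succeeded: $($result.Succeeded)  Seconds: $($result.Seconds)  Error: $($result.Error)"
```

A failure that consistently lands near one of the configured values may point to that setting, though this is a heuristic and not a definitive diagnosis.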

  • The line that I am getting the exception on is the following:

    Invoke-SwisVerb -SwisConnection $swis -EntityName Orion.AlertSuppression -Verb SuppressAlerts -Arguments @([string[]] $uri.uri, $startTime, $endtime) -ErrorAction Stop | Out-Null

    Where $uri.uri is the collection of all of the uris from the nodes generated from the query.

    Instead of sending all of the uris at once, I changed the script to cycle through them with a foreach loop.  It may not be as efficient, but if the invoke-swisverb cmdlet hits the timeout exception, it now affects potentially only 1 server at a time instead of all the remaining servers, as it would if I passed all of the uris at once.  I did open an incident ticket with you guys about the sql timeout errors that I am seeing in the information service log file, because there are more timeouts than just when my automation scripts are running.
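
The per-node loop described above might look roughly like this. The poster's actual loop was not shared, so this is an assumption about its shape: Invoke-PerNodeSuppression is a hypothetical helper, and the demo nodes and their URIs are made up so the block runs without an Orion server.

```powershell
# Hypothetical helper: runs $SuppressAction once per node and collects the
# failures instead of aborting the whole batch on the first exception.
function Invoke-PerNodeSuppression {
    param(
        [object[]]$Nodes,              # objects with caption and uri properties
        [scriptblock]$SuppressAction   # receives one node URI as its argument
    )

    $failures = @()
    foreach ($node in $Nodes) {
        try {
            & $SuppressAction $node.uri | Out-Null
        }
        catch {
            $failures += New-Object PSObject -Property @{
                NodeName = $node.caption
                Uri      = $node.uri
                Error    = $_.Exception.Message
            }
        }
    }
    $failures
}

# Made-up nodes; the action fails for one of them to simulate a timeout.
$demoNodes = @(
    New-Object PSObject -Property @{ caption = 'nodeA'; uri = 'swis://example/Orion/Orion.Nodes/NodeID=1' }
    New-Object PSObject -Property @{ caption = 'nodeB'; uri = 'swis://example/Orion/Orion.Nodes/NodeID=2' }
    New-Object PSObject -Property @{ caption = 'nodeC'; uri = 'swis://example/Orion/Orion.Nodes/NodeID=3' }
)
$failed = @(Invoke-PerNodeSuppression -Nodes $demoNodes -SuppressAction {
    param($nodeUri)
    if ($nodeUri -like '*NodeID=2') { throw 'simulated timeout' }
})
$failed | ForEach-Object { "$($_.NodeName): $($_.Error)" }
```

In the posted script, the action would wrap the original call, for example `{ param($nodeUri) Invoke-SwisVerb -SwisConnection $swis -EntityName Orion.AlertSuppression -Verb SuppressAlerts -Arguments @([string[]]@($nodeUri), $startTime, $endtime) -ErrorAction Stop }`, with `$uri` passed as `-Nodes`.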

  • Can I ask:

    - version of powershell and OS running script?

    - any DB maint tasks running at that time?

    - if you have a debugger you might put a watch on parts of the script

  • Version of Powershell - 5.1

    OS - Was 2008 r2 and is now 2012 r2

    No, my DB maintenance runs after the suppression script runs

    I made two changes since my last comments on April 24th.  The first change was to stop passing the full list of URIs to invoke-swisverb and use a foreach loop instead.  This way, if there was an exception, it would only affect that one server and not the rest of the servers.  The amount of time it takes the script to run is still about the same.  The second change was upgrading my primary application server to 2012 r2.  Since my switch to 2012 r2, I haven't been experiencing the failures.