Setting effective thresholds and polling periods

I have been working through our environment to cut down on the chatter from our Windows server alerts. To take two items that always come back: I have a basic template assigned to my Windows servers that monitors Total Available MB and % Processor Time (performance counters), among other things.

I currently have the template set to use baseline thresholds. They go to warning after 5 of 8 polls and to critical after 3 of 5 polls (I currently poll the servers every 5 minutes). Even after manually recalculating baselines, and in some cases looking at the historical graphs and setting thresholds by hand, I still receive a lot of junk alerts.
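
For anyone unfamiliar with the "X of Y polls" idea, here is a minimal Python sketch of how such a rule can be evaluated. It is only an illustration and not how Orion implements it internally; the function name `breaches_x_of_y`, the sample readings, and the 80/90 thresholds are made up for the example.

```python
def breaches_x_of_y(samples, threshold, x, y):
    """Return True if at least x of the last y samples exceed threshold."""
    window = list(samples)[-y:]
    return sum(1 for value in window if value > threshold) >= x


# Hypothetical % Processor Time readings, one per 5-minute poll (oldest first).
busy_server = [95, 40, 96, 97, 30, 98, 99, 35]
single_spike = [20, 20, 99, 20, 20, 20, 20, 20]

# Warning: 5 of the last 8 polls above 80. Critical: 3 of the last 5 above 90.
busy_warning = breaches_x_of_y(busy_server, threshold=80, x=5, y=8)
busy_critical = breaches_x_of_y(busy_server, threshold=90, x=3, y=5)
spike_warning = breaches_x_of_y(single_spike, threshold=80, x=5, y=8)
```

A single spike never satisfies either rule, which is the whole point of counting polls. The trade-off is that a wider window delays the alert by several polling periods.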

So here is the question that has been asked a thousand times, I'm sure, with a different answer each time: what is everyone else doing to set their thresholds? How do you balance polling types, number of polls, and threshold values to receive good alerts?

EDIT #1

I have been doing a lot of tweaking on my Windows server templates and found that there is a lot of information out there for various purposes. As I find pieces useful, I am going to update this thread and combine them into one group of ideas that can be easily referenced by other users who find themselves in the same boat I was in.

  • tszilagyi, first let me ask you this: do you have notifications on the trigger action for both out-of-the-box alerts, "Alert me when an Application... (goes down and/or goes into warning or critical)" and "Alert me when a Component goes ... (goes down and/or goes into warning or critical)"? If so, you are receiving two notifications every time a component is triggered. You could disable the "Alert me when an Application..." alerts and leave only the component alerts enabled. You can still see which application it is by using variables in the alert notification.

    I know it's time consuming, but the next step is to go through all of your applications, audit the individual components, and determine which ones are important for your environment and which are not. You can then disable the components in the application template that aren't important to you. Most of the component monitors give pretty good detailed descriptions of their functions. I would make a list of the components that you get alerted on the most and check first whether those are critical.

    For example, I've had some component monitors for memory usage consistently remain critical on VM nodes. It turned out that the OS will allocate x amount of memory to VMs, so you need to either adjust the threshold or disable the monitor for that particular node.

    That link she posted above gives some very good advice as well. 

    The last thing you want is to start ignoring your alerts because you are receiving too many.

  • OK, so I wanted to swing back around on this for anyone else who may be looking for similar info and post what I have done so far.

    First of all, I read through the info jkuvlesk posted. To answer jpr7311's question: I had already disabled the application alerts, leaving only the component alerts, but we were still getting a large number of alerts.

    After a lot of digging I came across a few posts that have helped a lot in limiting the number of false positive alerts here.

    First of all, I went through and disabled all the critical / warning thresholds on the template. The template is now set up so that it really just monitors and provides historical trends. The alerts described below just pull from the template's data (baseline thresholds did nothing but cause issues for me).

    Just to put it out there, I'm monitoring the following using Windows performance counters:

    Disk counters all related only to the C:\ drive

    - % free space

    - avg read queue length

    - avg write queue length

    - % idle time

    - avg disk sec/read

    - disk reads/sec

    - avg disk sec/write

    - disk writes/sec

    - split IO/sec

    Memory counters

    - total available memory

    - virtual memory

    - pages/sec

    - page file usage

    - memory - working set

    - pages output/sec

    CPU

    - % processor time

    - Processor queue length

    I also built in a generic monitor to capture all errors in the Application and System logs,

    and a WMI monitor to alert me if a system stops reporting back via WMI (more on that one later).

    Next up: how I set up my CPU monitoring.

  • Next was info from Leon Adato in the post The Ultimate CPU Alert. There's a lot there, but it's worth the read.

    I took his info and modified it a bit based on some info I found online and on my environment.

    For my setup I changed it a bit: I have a template that monitors the processor CPU utilization and the queue length. Also, per Microsoft's documentation, the queue length should be no more than 10 per CPU core, so the limit is 10 times the number of cores, not just the number of cores.

    With all that taken into consideration, I ended up with the following SQL for the alert trigger and reset actions (notice that for the reset action, as Leon noted, you need to modify the WHERE clause).

    Note: On a few of my systems Orion couldn't figure out how many cores the server had, so it ended up with a NULL value. Hence the ISNull in there, which assumes 2 cores; that is a safe guess for my environment, YMMV. I still need to swing back around with support to figure that one out.

    ****************************************************************************************************************************************************************************

    ****Alert trigger****

    SELECT Nodes.NodeID, Nodes.Caption

    FROM Nodes

    INNER JOIN APM_AlertsAndReportsData as DATA1

    ON (Nodes.NodeID = DATA1.NodeId)

    INNER JOIN APM_AlertsAndReportsData as DATA2

    ON Nodes.nodeid = DATA2.NodeiD

    INNER JOIN (SELECT SUM(ISNull(AssetInventory_Processor.NumberOfThreads,2)) as CPU_Count, AssetInventory_Processor.NodeID FROM AssetInventory_Processor

    GROUP BY AssetInventory_Processor.NodeID) as CPU

    On Nodes.nodeid = CPU.nodeid

    WHERE

    (DATA1.ComponentName = 'Processor Queue Length'

    AND DATA1.StatisticData > CPU.CPU_Count *10 )

    AND

    (DATA2.ComponentName = '% Processor Time'

    AND DATA2.StatisticData > 90)

    ****************************************************************************************************************************************************************************

    ****Alert reset****

    SELECT Nodes.NodeID, Nodes.Caption

    From Nodes

    INNER JOIN APM_AlertsAndReportsData as DATA1

    ON (Nodes.NodeID = DATA1.NodeId)

    INNER JOIN APM_AlertsAndReportsData as DATA2

    ON Nodes.nodeid = DATA2.NodeiD

    INNER JOIN (Select SUM(ISNull(AssetInventory_Processor.NumberOfThreads,2)) as CPU_Count, AssetInventory_Processor.NodeID FROM AssetInventory_Processor

    GROUP BY AssetInventory_Processor.NodeID) as CPU

    On Nodes.nodeid = CPU.nodeid

    WHERE

    (DATA1.ComponentName = 'Processor Queue Length'

    AND DATA1.StatisticData <= CPU.CPU_Count *10)

    OR

    (DATA2.ComponentName = '% Processor Time'

    AND DATA2.StatisticData <= 90)

    ****************************************************************************************************************************************************************************
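
    To make the trigger and reset logic above easier to reason about, here is a small Python sketch of the same conditions outside of SQL. It is an illustration under assumptions, not part of the alert itself: the function names are hypothetical, and `thread_counts` stands in for the per-processor NumberOfThreads rows, with None playing the role of NULL.

```python
DEFAULT_LOGICAL_CPUS = 2   # same fallback as ISNull(..., 2) in the SQL
QUEUE_PER_CPU = 10         # allowed queue length per logical CPU
CPU_PERCENT_LIMIT = 90     # % Processor Time limit


def cpu_count(thread_counts):
    """Sum logical CPUs across processor rows, using the default for unknown (None) rows."""
    return sum(DEFAULT_LOGICAL_CPUS if n is None else n for n in thread_counts)


def should_trigger(queue_length, cpu_percent, thread_counts):
    """Trigger only when BOTH the queue and the utilization are over their limits."""
    limit = cpu_count(thread_counts) * QUEUE_PER_CPU
    return queue_length > limit and cpu_percent > CPU_PERCENT_LIMIT


def should_reset(queue_length, cpu_percent, thread_counts):
    """Reset as soon as EITHER value is back at or under its limit."""
    limit = cpu_count(thread_counts) * QUEUE_PER_CPU
    return queue_length <= limit or cpu_percent <= CPU_PERCENT_LIMIT
```

    The reset is simply the logical negation of the trigger (AND becomes OR, and each > becomes <=), which is why the WHERE clause has to be rewritten rather than copied. One difference from the SQL: a node with no processor rows at all is dropped by the INNER JOIN, while this sketch would treat it as having 0 CPUs.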

    Next up is the alert trigger action. I have the same basic setup as Leon's.

    Initially I run the following command to start collecting process info. I wasn't running the latest version of Orion, so I needed to update to the latest release and look at some changes that came with hotfixes. I believe that just updating to the latest release and then making the change to the APM commands listed in the hotfix got everything working for me.

    SAM v6.2 Hot Fix 1 is now available

    ****************************************************************************************************************************************************************************

    So as soon as the issue occurs, I run the following command:

     

    APM\SolarWinds.APM.RealTimeProcessPoller.exe -n=${NodeID} -alertid=${N=Alerting;M=AlertID} -timeout=240

    ****************************************************************************************************************************************************************************

    and send out an email with the following (I used a lot of SQL here as well):

    SUBJECT:

    Test CPU alert on ${NodeName} Initial Email

    BODY:

    View Node here: ${N=SwisEntity;M=DetailsUrl}

    Current CPU utilization: ${SQL:Select APM_AlertsAndReportsData.StatisticData from APM_AlertsAndReportsData where APM_AlertsAndReportsData.NodeId = ${NodeID} and APM_AlertsAndReportsData.ComponentName = '% Processor Time'}

    Current Processor Queue Length: ${SQL:Select APM_AlertsAndReportsData.StatisticData from APM_AlertsAndReportsData where APM_AlertsAndReportsData.NodeId = ${NodeID} and APM_AlertsAndReportsData.ComponentName = 'Processor Queue Length'}

    Number of CPU's: ${SQL:Select SUM(ISNull(AssetInventory_Processor.NumberOfThreads,2)) FROM AssetInventory_Processor where AssetInventory_Processor.nodeid = ${NodeID} }

    The Processor Queue Length should be no more than 10 times the # of CPU's.

    Top 10 processes at the time of the alert are being calculated and a second email should be sent in 5 minutes with the results.  If the server is too busy to process the request it may be blank.

    ****************************************************************************************************************************************************************************

    Next, I send out a second email about 5 minutes later that should have the results:

    Subject:

    Test CPU alert on ${NodeName} Second email

    Body:

    View Node here: ${N=SwisEntity;M=DetailsUrl}

    Top 10 processes at the time of the alert are:

    ${N=Alerting;M=Notes}

    ****************************************************************************************************************************************************************************

    Playing with the setting for how long the condition must exist has done a really good job of eliminating false positives on CPU usage. I ran into them so often because we are primarily a VMware shop and the servers are right-sized with Veeam ONE, so a server sitting at 90% CPU utilization most of the time is not an issue as long as it's keeping up with the work. This monitoring also helps make sure that the servers are not under-spec'd.
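
    As a rough illustration of what "how long the condition must exist" buys you, here is a short Python sketch that only reports a problem when every recent poll covering the required duration was bad. This is an assumption about the general technique, not a description of how the Orion alert engine evaluates it; the function name `sustained` and the sample values are hypothetical.

```python
def sustained(samples, predicate, required_seconds, poll_seconds):
    """True if predicate held for every one of the most recent polls spanning required_seconds."""
    needed = max(1, -(-required_seconds // poll_seconds))  # ceiling division
    recent = samples[-needed:]
    return len(recent) == needed and all(predicate(s) for s in recent)


def is_hot(cpu_percent):
    """Hypothetical check matching the 90% CPU limit used in the alert."""
    return cpu_percent > 90


# Hypothetical CPU readings at a 300-second polling interval, oldest first.
steady_load = [50, 95, 96, 97]
brief_spike = [95, 50, 96]

steady_alert = sustained(steady_load, is_hot, required_seconds=900, poll_seconds=300)
spike_alert = sustained(brief_spike, is_hot, required_seconds=900, poll_seconds=300)
```

    With a 15-minute requirement and 5-minute polls, three consecutive bad polls are needed, so the brief spike is ignored while the steady load still alerts.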

    Next up: how I set up my memory alerting.

  • This one was based on the same concept as the CPU alert, but I really didn't have any secondary monitoring item to clear up alerts. So in this case I took a look at my chatty servers and found that most of them would alert when they got down to about 4% memory available. I used the SQL below in an alert trigger that goes off at 3% to clear out false positives.

    ****************************************************************************************************************************************************************************

    ****Alert trigger****

    SELECT Nodes.NodeID, Nodes.Caption From Nodes

    INNER JOIN

    (SELECT StatisticData, NodeID FROM APM_AlertsAndReportsData WHERE ComponentName = 'Total Available Memory (MB)' and (ComponentStatus <> 'Unknown' and ComponentStatus <> 'Unreachable') )

    AS D1

    ON Nodes.NodeID = D1.NodeID

    INNER JOIN

    (SELECT VolumeSize, NodeID FROM Volumes WHERE Caption = 'Physical Memory')

    AS D2

    On Nodes.NodeID = D2.NodeID

    WHERE

    D1.StatisticData

    <

    ((D2.VolumeSize/1048576)*.03)

    ****************************************************************************************************************************************************************************

    **** Alert reset ****

    The reset condition is not SQL, just the option to reset when the trigger condition is no longer true (recommended).
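
    For reference, the arithmetic in the trigger works out as in this small Python sketch. The function names are hypothetical; the only things taken from the SQL are the division by 1048576 to turn bytes into MB and the 0.03 factor.

```python
BYTES_PER_MB = 1048576  # same divisor as in the SQL trigger


def memory_threshold_mb(physical_memory_bytes, fraction=0.03):
    """Available-memory floor in MB for a node with the given physical memory."""
    return physical_memory_bytes / BYTES_PER_MB * fraction


def memory_alert(available_mb, physical_memory_bytes, fraction=0.03):
    """True when available memory (MB) has dropped below the floor."""
    return available_mb < memory_threshold_mb(physical_memory_bytes, fraction)


# Hypothetical node with 16 GB of physical memory.
sixteen_gb = 16 * 1024 * BYTES_PER_MB
floor_mb = memory_threshold_mb(sixteen_gb)
```

    On that hypothetical 16 GB server the floor is roughly 491.5 MB, so a server hovering around 4% available (about 655 MB) no longer trips the alert.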

    ****************************************************************************************************************************************************************************

    The notification was similar

    First run the following command

    APM\SolarWinds.APM.RealTimeProcessPoller.exe -n=${NodeID} -alertid=${N=Alerting;M=AlertID} -sort=PhysicalMemory

    ****************************************************************************************************************************************************************************

    Then wait 2 min to send out an email with details

    SUBJECT:

    Test MEMORY alert on ${NodeName}

    BODY:

    View Node here: ${N=SwisEntity;M=DetailsUrl}

    Total Available Memory: ${SQL:Select APM_AlertsAndReportsData.StatisticData from APM_AlertsAndReportsData where APM_AlertsAndReportsData.NodeId = ${NodeID} and APM_AlertsAndReportsData.ComponentName = 'Total Available Memory (MB)'} MB - ${SQL:Select Round(D1.StatisticData/(Nodes.TotalMemory/1048576),4)*100 From Nodes Inner Join (Select APM_AlertsAndReportsData.NodeID, APM_AlertsAndReportsData.StatisticData From APM_AlertsAndReportsData Where APM_AlertsAndReportsData.ComponentName = 'Total Available Memory (MB)' ) AS D1 On Nodes.NodeId = D1.NodeID Where Nodes.NodeID = ${NodeID} }%

    Alert Threshold is 3% of Total Memory

    ${SQL:Select Round( ((Volumes.VolumeSize/1048576)*.03),2) from Volumes where Volumes.NodeId = ${NodeID} and Volumes.Caption = 'Physical Memory'} MB

    Other Information

    Total Physical Memory - ${SQL:select Round(TotalMemory/1073741824,2) from Nodes Where Nodes.nodeid = ${NodeID} } GB

    Page File Size - ${SQL:select Round(VolumeSize/1073741824,2) from Volumes Where Volumes.Caption = 'Virtual Memory' and Volumes.nodeid = ${NodeID} } GB

    Page File % Used: ${SQL:Select Round(APM_AlertsAndReportsData.StatisticData,2) from APM_AlertsAndReportsData where APM_AlertsAndReportsData.NodeId = ${NodeID} and APM_AlertsAndReportsData.ComponentName = 'Page File Usage'} %

    Memory - Working Set: ${SQL:Select Round(APM_AlertsAndReportsData.StatisticData/1048576,2) from APM_AlertsAndReportsData where APM_AlertsAndReportsData.NodeId = ${NodeID} and APM_AlertsAndReportsData.ComponentName = 'Memory - Working Set'} MB

    Pages Output/Sec: ${SQL:Select Round(APM_AlertsAndReportsData.StatisticData,2) from APM_AlertsAndReportsData where APM_AlertsAndReportsData.NodeId = ${NodeID} and APM_AlertsAndReportsData.ComponentName = 'Pages Output/sec'}

    Pages/Sec: ${SQL:Select Round(APM_AlertsAndReportsData.StatisticData,2) from APM_AlertsAndReportsData where APM_AlertsAndReportsData.NodeId = ${NodeID} and APM_AlertsAndReportsData.ComponentName = 'Pages/sec'}

    Top 10 processes by memory utilization are:

    ${N=Alerting;M=Notes}

    This information can be up to 2 minutes behind based on the initial alert.

    ****************************************************************************************************************************************************************************

    Once again, after playing with how long the condition had to exist, I cleared up a lot of false positives.

  • Next up, I updated my disk monitoring and alerting. For this one I borrowed from Alex Soul.

    The initial post is here:

    Universal Monitoring and Alerting on free space for all logical disks across all servers

    He has a second one here:

    Universal Disk Free Space Monitoring (One Template Will Handle All Logical Disks + Exceptions And Overrides)

    Both are worth a read.

    ****************************************************************************************************************************************************************************

    On my side, I downloaded a copy of his template and imported it, then copied the script and everything else into my Windows server template. This gave me the monitor for all the logical drives as he had set it up.

    Then I followed his notes and set up an alert with the per-volume variables that he described, with one addition: I added v_ovrd_disable as a YES/NO field. With my alert trigger set up this way, I can set it on any volume where I want to completely disable alerting.

    trigger3.JPG

    ****************************************************************************************************************************************************************************

    Then for my alert notification I used the following to get some detailed info out of the system. In the email you'll see the SQL commands that convert the byte counts into GB to make them easier to read.

    Subject:

    Disk Space alert on ${NodeName}

    Body:

    View Node here: ${N=SwisEntity;M=DetailsUrl}

    Volume Name: ${N=SwisEntity;M=FullName}

    Volume Size: ${N=SwisEntity;M=VolumeSize}B

    Volume Used Space: ${N=SwisEntity;M=VolumeSpaceUsed}B

    Volume Available Space: ${N=SwisEntity;M=VolumeSpaceAvailable}B - Default alert is 20GB or less

    Volume Percent Available: ${N=SwisEntity;M=VolumePercentAvailable} - Default alert is 5% or less

    Override settings - If overrides are not being used these will be blank.

    Reason for override:

    ${N=SwisEntity;M=CustomProperties.v_ovrd_desc}

    Override bytes settings: ${SQL:SELECT volumes.v_ovrd_bytes/1073741824 FROM volumes where volumes.nodeid = ${NodeID} and volumes.volumeid = ${VolumeID} } GB (This setting has been converted to GB from the bytes setting in the variable)

    Override percent setting: ${N=SwisEntity;M=CustomProperties.v_ovrd_prcnt} %

    ****************************************************************************************************************************************************************************
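
    Since the actual trigger only exists as a screenshot, here is a hedged Python sketch of how the defaults and overrides described above could fit together. The 20 GB and 5% defaults and the division by 1073741824 come from the email text; everything else is an assumption, in particular the function names and the choice to require BOTH limits to be breached before alerting.

```python
BYTES_PER_GB = 1073741824  # same divisor as in the email's SQL
DEFAULT_MIN_FREE_BYTES = 20 * BYTES_PER_GB
DEFAULT_MIN_FREE_PERCENT = 5.0


def bytes_to_gb(value):
    """Convert a byte count to GB, rounded for readability."""
    return round(value / BYTES_PER_GB, 2)


def disk_alert(free_bytes, free_percent, ovrd_bytes=None, ovrd_prcnt=None, ovrd_disable=False):
    """Decide whether a volume should alert, honoring per-volume overrides."""
    if ovrd_disable:  # v_ovrd_disable = YES switches alerting off entirely
        return False
    min_bytes = DEFAULT_MIN_FREE_BYTES if ovrd_bytes is None else ovrd_bytes
    min_percent = DEFAULT_MIN_FREE_PERCENT if ovrd_prcnt is None else ovrd_prcnt
    # Assumption: both limits must be breached, so a huge volume with a low
    # percentage but plenty of free GB stays quiet.
    return free_bytes <= min_bytes and free_percent <= min_percent
```

    If your trigger should fire when either limit is breached instead, the last line would use `or`; check the logic against your own alert definition before relying on it.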

    This one is fairly new to my template but so far seems to be working and has a lot of promise.

  • Finally, the WMI monitor that I mentioned. I also have it set up to make sure that perfmon data doesn't stop reporting in on a system.

    This one was actually really simple. I set up the following WMI monitor in my basic template:

    wmi.JPG

    Then I set up an alert like this:

    monitor.JPG

    This way, if anything stops reporting in, I can get an alert and try to fix the WMI / perfmon related issue, so I'm not finding out about it when I actually need the data to fix a server.

    As for alerting, it's just a basic email with a link to the node, so I won't clutter up the post with it.

    ****************************************************************************************************************************************************************************

    For everyone who followed along, thanks for the read. If you have any recommendations, comments, or improvements, I'd love to hear them.

    And thanks again to the other users who posted information that I was able to borrow and roll into my own server monitoring solution.

  • I forgot to mention: to give greater control over who receives email alerts, I plan to pull from Malik Haider's post. I will update here when I finally have a chance to play with it.

    Using Custom Properties sending Alert emails

    Edit #1

    I finally got around to testing this, and it works almost as expected. The only issue was that the macros listed in the post were generic, like ${email}, for referencing the custom property I created on the node. When I set it up this way, all emails stopped and errors were generated, because the variable did not translate properly and the To: field ended up with an invalid email address that the mail server could not handle.

    The quick fix was to qualify the macro as ${node.email}. After I made that change, everything started working.

    This looks like it's really going to work in my environment, as we have a set group that watches servers in general and smaller groups that are responsible for specific applications on each server.

    I can use this to have a static email address on the alert go to the main team for all alerts, and then use the custom field so that the application specialists only receive alerts for their specific servers.
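
    The routing idea boils down to something like the following Python sketch. It is purely illustrative: `build_recipients`, the dictionary of node properties, and the example.com addresses are all hypothetical, and the real work is done by the alert's To: field and the custom property macro.

```python
def build_recipients(static_address, node_properties):
    """Combine the team-wide address with the node's optional custom 'email' property."""
    recipients = [static_address]
    extra = (node_properties.get("email") or "").strip()
    # Skip blank or duplicate values so the To: field never contains an invalid entry.
    if extra and extra not in recipients:
        recipients.append(extra)
    return recipients


# Hypothetical nodes: one owned by an application team, one with no owner set.
app_server = build_recipients("server-team@example.com", {"email": "sql-admins@example.com"})
plain_server = build_recipients("server-team@example.com", {"email": None})
```

    Nodes without the custom property still alert the main team, and nodes with it add the application specialists on top.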

  • Wow!  Thank you for all of the detailed information!  I may have to use some of your techniques where necessary. 

  • In my constant evolution of this monitor I ran into an issue. By monitoring CPU as a performance counter on all of my servers (along with a lot of other perf counters that were not really getting me any useful data), I started overworking our poller, which caused the servers to slow down.

    Because of this I have redone my custom views and gone back to pretty much the original SQL query provided in The Ultimate CPU Alert.

    My code is posted below.

    ***Alert Trigger***

    SELECT Nodes.NodeID, Nodes.Caption From Nodes

    INNER JOIN APM_AlertsAndReportsData as DATA1

    ON (Nodes.NodeID = DATA1.NodeId)

    INNER JOIN(SELECT SUM(ISNull(AssetInventory_Processor.NumberOfThreads,2)) AS CPU_Count, AssetInventory_Processor.NodeID FROM AssetInventory_Processor

    GROUP BY AssetInventory_Processor.NodeID) AS CPU

    ON Nodes.nodeid = CPU.nodeid

    WHERE

    (DATA1.ComponentName = 'Processor Queue Length'

    AND DATA1.StatisticData > CPU.CPU_Count *10)

    AND

    Nodes.CPULoad > 90

    ***Alert Reset***

    SELECT Nodes.NodeID, Nodes.Caption From Nodes

    INNER JOIN APM_AlertsAndReportsData as DATA1

    ON (Nodes.NodeID = DATA1.NodeId)

    INNER JOIN(SELECT SUM(ISNull(AssetInventory_Processor.NumberOfThreads,2)) AS CPU_Count, AssetInventory_Processor.NodeID FROM AssetInventory_Processor

    GROUP BY AssetInventory_Processor.NodeID) AS CPU

    ON Nodes.nodeid = CPU.nodeid

    WHERE

    (DATA1.ComponentName = 'Processor Queue Length'

    AND DATA1.StatisticData <= CPU.CPU_Count *10)

    OR

    Nodes.CPULoad <= 90

    In the email actions I also had to swap out the line for the CPU utilization, from:

    Current CPU utilization: ${SQL:Select APM_AlertsAndReportsData.StatisticData from APM_AlertsAndReportsData where APM_AlertsAndReportsData.NodeId = ${NodeID} and APM_AlertsAndReportsData.ComponentName = '% Processor Time'}

    and changed it to:

    Current CPU utilization: ${N=SwisEntity;M=CPULoad}

    *****************************

    One other consideration: the default polling interval for CPU load by the built-in monitors is every 10 minutes, but my monitor for watching the Processor Queue Length was set to 5 minutes. Make sure you adjust the polling intervals as needed, as well as how often the alert checks and how long the condition needs to exist.
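
    As a rough rule of thumb (my own sketch, not anything from the product documentation), the "condition must exist" window should be at least as long as the slowest polling interval involved, otherwise one of the two values in the alert may not have been refreshed yet. The function names below are hypothetical.

```python
def minimum_condition_window(poll_intervals_minutes, polls_required=1):
    """Shortest window (minutes) that spans polls_required fresh samples from every data source."""
    return max(poll_intervals_minutes) * polls_required


def samples_in_window(window_minutes, poll_interval_minutes):
    """How many polls of one source fit in the window."""
    return window_minutes // poll_interval_minutes


# CPU load polled every 10 minutes, Processor Queue Length every 5 minutes.
window = minimum_condition_window([10, 5], polls_required=2)
queue_samples = samples_in_window(window, 5)
cpu_samples = samples_in_window(window, 10)
```

    With these intervals, requiring two fresh CPU load samples means a 20-minute window, during which the queue length is polled four times.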