Hey everyone, I'm a fairly new user of Solarwinds and NPM, but I have a pretty good understanding of how things work and how alerts work.
First time post.
I'll provide background for my monitoring/alerting needs; anyone who wants to get to the actual question can skip to the break:
We have 2 Windows 7 virtual machines responsible for MRP in our organization that run nightly to update our inventory, BOMs, etc. These processes have a small window of time to run, from the end of the day in Central Standard Time in the United States until the start of the day in Central European Time in Amsterdam. I do not manage these machines, our DBA does, but for a long time we have had issues. We run Macola as an ERP solution (awful for a corporation as large and diverse as ours, but that's a whole other issue). This ERP has been twisted and manipulated so badly by our predecessors to meet the selfish needs of our different branches that it often doesn't operate correctly and cannot be upgraded. As a result, the nightly MRP machines often hit errors during their process. We have tried many different ways to get alerted when this happens in the middle of the night, but since no event is created in the Event Log until someone acknowledges the error, we have had little luck. Therefore, the DBA must either check these machines every night to make sure they kick off, or check at random and risk having very unhappy users in the morning.
When running through the trial period of Solarwinds NPM, I realized that when one of these errors occurs, the CPU utilization drops to zero. I have attempted to create an Advanced Alert that alerts me when the CPU Load falls below 5%.
It is my understanding, from reading another article on Thwack, that the CPU Load is a cumulative amount across all CPUs.
However, for some reason unknown to me, the alert still creates false positives: an alert is sent even though the CPU is still running above a cumulative 5%.
I have determined that each MRP machine has 2 CPUs, CPU #1 and CPU #2. CPU #1 always operates at a very low percentage, often below the 5% threshold. CPU #2, however, always operates at a very high percentage, which should bring the cumulative figure for the two well above the 5% threshold. It would be ideal if I could alert on CPU #2 alone, as it seems to carry most of the weight anyway, but I have yet to find a way to set up that alert.
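To make the arithmetic concrete, here is a minimal sketch of how two per-CPU readings would combine into one node-level figure. The readings are made-up numbers, not measurements from the MRP machines, and `combined_load` is a hypothetical helper, not anything from NPM. Whether NPM's CPU Load is a sum or an average across CPUs is an assumption here, so both are shown.

```python
def combined_load(per_cpu, mode="average"):
    """Combine per-CPU utilization percentages into one node-level figure.

    mode="average" divides by the CPU count; mode="sum" adds the readings.
    Which of these NPM's CPU Load corresponds to is an assumption.
    """
    if not per_cpu:
        raise ValueError("need at least one CPU reading")
    if mode == "average":
        return sum(per_cpu) / len(per_cpu)
    if mode == "sum":
        return sum(per_cpu)
    raise ValueError("mode must be 'average' or 'sum'")


THRESHOLD = 5  # percent, same as the alert threshold

# Illustrative readings only: CPU #1 nearly idle, CPU #2 busy.
busy_night = [3.0, 90.0]
# Illustrative readings only: both CPUs quiet for a moment.
quiet_moment = [3.0, 6.0]

for label, readings in (("busy", busy_night), ("quiet", quiet_moment)):
    avg = combined_load(readings, "average")
    total = combined_load(readings, "sum")
    print(label, "average:", avg, "sum:", total,
          "average at or below threshold:", avg <= THRESHOLD)
```

With CPU #2 busy, both the sum and the average sit far above 5%. If CPU #2 briefly goes quiet, though, an averaged figure can dip to 5% or below while a summed figure would not, which is one possible (unconfirmed) source of false positives.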
At this point I am out of ideas on how to eliminate these false positives so we can effectively rely on NPM for alerts until we get a new ERP system in the next year or so.
Some info on how my alerts are set:
Alert Evaluation Frequency: 10 Minutes
Trigger Condition: Node Name is equal to mrp 1
field CPU Load is less than or equal to value 5
Do not trigger this action until condition exists for more than: 10 Minutes
Reset Condition: Node Name is equal to mrp 1
field CPU Load is greater than or equal to value 6
Do not reset this action until condition exists for more than: 10 Minutes
Time of Day: 06:00 PM To 02:00 AM (all days of the week)
Trigger Actions: Email sent to the DBA, with 'Execute this Action only between specific hours' checked, 06:00 PM To 02:00 AM, all days of the week.
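For anyone who wants to reason about these settings outside of NPM, here is a rough sketch of one possible reading of them: trigger at or below 5, reset at or above 6, each only after the condition has held for more than 10 minutes, and only inside a 06:00 PM to 02:00 AM window that crosses midnight. This is not how NPM is implemented; the function names, the choice to ignore samples outside the window, and the sample data are all assumptions for illustration.

```python
from datetime import datetime, time, timedelta

TRIGGER_AT_OR_BELOW = 5           # CPU Load <= 5 starts the trigger timer
RESET_AT_OR_ABOVE = 6             # CPU Load >= 6 starts the reset timer
HOLD = timedelta(minutes=10)      # condition must exist for MORE than this
WINDOW_START = time(18, 0)        # 06:00 PM
WINDOW_END = time(2, 0)           # 02:00 AM the next day


def in_window(t):
    """True if time-of-day t is inside the window, which may cross midnight."""
    if WINDOW_START <= WINDOW_END:
        return WINDOW_START <= t < WINDOW_END
    return t >= WINDOW_START or t < WINDOW_END


def evaluate(samples):
    """Walk (datetime, cpu_load) samples in time order.

    Returns a list of (datetime, "TRIGGER" or "RESET") events.
    Assumption: samples outside the window are ignored and clear the timer.
    """
    events = []
    active = False   # is the alert currently triggered?
    since = None     # when the pending trigger/reset condition first held
    for ts, load in samples:
        if not in_window(ts.time()):
            since = None
            continue
        if active:
            condition = load >= RESET_AT_OR_ABOVE
        else:
            condition = load <= TRIGGER_AT_OR_BELOW
        if not condition:
            since = None
            continue
        if since is None:
            since = ts
        if ts - since > HOLD:
            active = not active
            events.append((ts, "TRIGGER" if active else "RESET"))
            since = None
    return events


# Hypothetical polls every 10 minutes starting at 10:00 PM.
start = datetime(2024, 1, 1, 22, 0)
loads = [50, 3, 2, 1, 40, 45, 50]
samples = [(start + timedelta(minutes=10 * i), v) for i, v in enumerate(loads)]
for ts, kind in evaluate(samples):
    print(ts.strftime("%H:%M"), kind)
```

Under this reading, a single low poll between two high ones never fires an alert, because the timer is cleared as soon as the load rises back above the threshold. With a 10-minute evaluation frequency and a "more than 10 minutes" delay, it takes three consecutive low polls before the trigger fires.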
I will also provide a screen shot of what a typical night's process CPU usage looks like.
Any help anyone can provide will be greatly appreciated!
Thank you! Let me know if there are any questions!