Currently I have CPU and Memory alerts set up. For both, I have two stages: one that goes off when over 90% for 15 minutes, and one that goes off when over 95% for 15 minutes.
The thinking is that if a box is over those thresholds for 15 minutes steady, then there's something wrong.
What we didn't realize was that the box may or may not be STEADY over that threshold for 15 minutes. What actually happens is that Orion sees the threshold exceeded, starts a timer, and checks periodically over that 15 minutes (at whatever Statistics Collection interval you have set up). If it's still over threshold on the LAST check within the 15 minutes, it sets off an alarm. If not, no harm no foul: it waits for the next time usage goes over threshold, dutifully resets the timer, and starts the process all over again.
So we can't be positive that it's actually STEADY over that threshold for the whole 15 minutes... which is all well and good; after all, Orion is not a precision instrument, so we can't expect it to know second by second precisely what is in use, only at the moments when it checks. It's a "good enough" situation--it is what it is. The other thing: it turns out that resource consumption is a constantly changing picture, so you can't really nail it down anyway; you pretty much have to take a myopic view and hope you press the panic button at the right time.
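To make that reset behavior concrete, here's a rough sketch of the timer mechanics described above. This is my own illustration in Python, not SolarWinds' actual code; the sample data and parameter names are invented for the example.

```python
# Sketch (NOT SolarWinds code) of the sustained-threshold behavior described
# above: the timer starts on the first over-threshold poll, and the alert
# fires only if the poll at the END of the window is still over threshold.
# Any single under-threshold poll resets the timer.

def sustained_alert(samples, threshold=90, window=15):
    """samples: list of (minute, value) polls in time order.
    Returns the minute the alert fires, or None if it never does."""
    timer_start = None
    for minute, value in samples:
        if value > threshold:
            if timer_start is None:
                timer_start = minute            # first breach: start the timer
            elif minute - timer_start >= window:
                return minute                   # still over on the last check: fire
        else:
            timer_start = None                  # dipped under: reset, no alert
    return None

# A box polled every 5 minutes that dips under once at minute 10 doesn't
# alert until minute 30, even though it was over threshold at 6 of 7 polls:
polls = [(0, 95), (5, 96), (10, 88), (15, 97), (20, 94), (25, 93), (30, 96)]
```

Note that anything the box does *between* polls is invisible to this logic, which is exactly why "steady for 15 minutes" is an approximation.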
My laissez-faire attitude is not shared by my customers, however, who are irritated when, after Orion sends them an alert, Orion cannot then tell them what is consuming the resources. Often, by the time they look, either 1) usage has gone back down under the threshold, and the alert is accused of being a "false alert" (the ultimate crime here), or 2) the offending process has already shut down, leaving "WMIPRVSE.EXE" as the highest consumer... meaning they see Orion itself as the cause of the problem, because WMIPRVSE.EXE is the process under which Orion polls WMI to ask what the Top Ten highest processes are.
So they want to shut off the Orion alerts as useless... because in the instances where Orion actually DOES catch the culprit, it's McAfee's virus scan. I have a sneaking suspicion that McAfee is sabotaging Orion's monitoring by spinning up a scan, which triggers Orion, then dropping "beneath the radar" before the "Get the Top Ten" script can pick it up... again leaving Orion holding the bag when the alert goes out.
Why the long treatise? It is what it is, right?
I have set up Process Monitors on several of the boxes, watching what resources McAfee consumes and keeping a running tally, and I can demonstrate that it is spinning up the CPU and Memory. But what I get back is, "McAfee must run; we are not shutting off antivirus. WHY CAN'T YOU JUST INSTRUCT ORION NOT TO ALERT if it's just McAfee?" The implication being, "If you can't, we'll just shut off your alerts and use vRealize for all this stuff instead."
So here's my last-ditch attempt... does anyone know how to add a condition to the alert, so that the alert definition looks like this:
Scope: Only the following set:
Nodes - where Vendor - is equal to - Windows
Nodes - on - specific subnets
Trigger conditions: Trigger alert when
Node - CPU Load - is greater than - 90%
Process consuming CPU - is not equal to - "McAfee ANYTHING"
Only the last part is what I'm after.
Thoughts? For instance, is that information even available in a custom SWQL or SQL query?
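In case it helps frame the question: failing a native alert condition, the check could in principle run outside Orion, in a script fired by the alert action, using the same Top Ten process list the alert already gathers via WMI. Here's a rough sketch of the suppression logic I have in mind; the process names and the shape of the input are assumptions, not anything Orion provides out of the box, so adjust to whatever your WMI poll actually returns.

```python
# Hypothetical suppression helper (NOT a built-in Orion feature): given the
# "Top Ten" process list gathered at alert time, decide whether the breach is
# attributable to McAfee alone. The process names below are assumptions --
# substitute whatever McAfee executables actually run in your environment.

MCAFEE_PROCESSES = {"mcshield.exe", "mcafeeframework.exe", "masvc.exe"}

def suppress_alert(top_processes, threshold=90.0):
    """top_processes: list of (process_name, cpu_percent) tuples.

    Suppress (return True) only if the non-McAfee processes by themselves
    would NOT have breached the threshold -- i.e., McAfee is the culprit."""
    non_mcafee_load = sum(cpu for name, cpu in top_processes
                          if name.lower() not in MCAFEE_PROCESSES)
    return non_mcafee_load <= threshold

# Scan pushed the box over, but everything else is under threshold: suppress.
# suppress_alert([("McShield.exe", 60.0), ("sqlservr.exe", 35.0)])  -> True
# SQL Server alone breached the threshold: let the alert through.
# suppress_alert([("sqlservr.exe", 95.0)])                          -> False
```

The design choice here is deliberate: rather than blindly swallowing any alert where McAfee appears in the list, it only suppresses when removing McAfee's share would bring the box back under threshold, so a real runaway process hiding behind a scan still alerts.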