Orion failed to restart a service and I'm trying to figure out why. The odd thing is that there is no entry in the alert log, yet I know the event occurred because APM shows the event log entry.
Events are populated by NPM and the modules outside of alerting as well as from alert actions. My guess is there is something in the alert that suppressed the trigger.
I did have suppression set: if the node is down, do not act. The node was up, though.
I have two defined alert actions. One is the action of restarting the service... the other is the action of writing to the eventlog. Since I see the entry in the event log... my assumption is the alert did trip (i.e. because one of the two actions worked).
Support reviewed my diags and is suggesting that I do a clean-up. To be frank, I'm a bit concerned if I don't get a clean explanation, because it appears that the service restart cannot be relied on.
I have two alerts defined for this monitor. The first does the restart. The second sends an email if the service has been down for more than 30 minutes. The email did not go out either.
I just re-read what you wrote... "Events are populated by NPM and the modules outside of alerting as well as from alert actions"
So then it "is" possible to have an event log entry and NOT have an alert log entry.
That said, maybe the network to the server went down, causing the suppression to kick in? Is there any way I could prove this? Is there a suppression log? The db table seems to show only what is being suppressed now versus what was suppressed in the past.
The email not going out and the restart not happening make it look more like the alert didn't fire. Can you post the trigger and suppression?
Andy
Looks like we were typing at the same time.
I'll check on suppression logging.
Suppressions are not logged because they run as part of the trigger query. This means the condition must pass both the configured trigger and the suppression at the same time for the alert to fire. If the suppression eliminates the trigger, we do nothing, so there is nothing to log.
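To illustrate why there is nothing to log, here is a minimal Python sketch of a trigger and suppression evaluated together. It is only an illustration of the logic described above, not Orion's actual implementation; the Snapshot class, the function names, and the alert_log list are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical view of one polling cycle; not an Orion data structure."""
    service_up: bool
    node_up: bool


def trigger(s: Snapshot) -> bool:
    # Trigger condition: the monitored service is down.
    return not s.service_up


def suppression(s: Snapshot) -> bool:
    # Suppression condition: the node itself is reported down.
    return not s.node_up


def should_fire(s: Snapshot) -> bool:
    # Both conditions are evaluated in a single pass. A suppressed trigger
    # simply yields False, so nothing about the suppression gets recorded.
    return trigger(s) and not suppression(s)


alert_log = []  # hypothetical stand-in for the alert log
for snap in [Snapshot(service_up=False, node_up=True),
             Snapshot(service_up=False, node_up=False)]:
    if should_fire(snap):
        alert_log.append(snap)
```

In this sketch only the first snapshot lands in alert_log. The second one (service down, node down) produces no entry at all, which matches the "nothing to log" behavior.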
You could look at the availability and events for the suppression you have specified (the device(s)) and verify that it was properly suppressed that way.
After talking with our DBA, I removed the suppression entirely.
On this particular server, just about everything in Orion that monitors it (network interface, memory, CPU, process stats) read ZERO during the time in question. This suggests that connectivity was an issue.
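For anyone wanting to make the same check less of an eyeball exercise, here is a rough Python sketch that tests whether every polled metric read zero inside a time window. It assumes the samples have already been exported from Orion somehow; the tuple layout, the metric names, and the values below are made up for illustration and are not real Orion data or table names.

```python
from datetime import datetime

# Hypothetical exported samples: (timestamp, metric name, value).
# These values are invented for the example.
samples = [
    (datetime(2024, 1, 1, 9, 50), "cpu", 37.0),
    (datetime(2024, 1, 1, 9, 50), "memory", 61.0),
    (datetime(2024, 1, 1, 10, 0), "cpu", 0.0),
    (datetime(2024, 1, 1, 10, 0), "memory", 0.0),
    (datetime(2024, 1, 1, 10, 10), "cpu", 0.0),
    (datetime(2024, 1, 1, 10, 10), "memory", 0.0),
]


def all_zero_in_window(samples, start, end):
    """Return True if the window has samples and every one of them is zero."""
    window = [value for (ts, _name, value) in samples if start <= ts <= end]
    # An empty window proves nothing, so treat it as False.
    return bool(window) and all(value == 0 for value in window)


outage = all_zero_in_window(
    samples, datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 10)
)
normal = all_zero_in_window(
    samples, datetime(2024, 1, 1, 9, 45), datetime(2024, 1, 1, 9, 55)
)
```

If every metric is flat zero across the window where the restart should have happened, that points at lost connectivity rather than at the alert action itself.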
Querying for the node ID shows nothing in either the event or alert logs. If I had a "node-down" entry for this, it would explain everything.
This node does not have an up/down alert defined on it.
I will add one. With that and the suppression removed... hopefully there will not be a recurrence.
Andy,
Thanks much for the help. I'll post if I get anything more from support.
Glenn.
Coolio!