Orion failed to restart a service and I'm trying to figure out why. The odd thing is that there is no entry in the alert log, yet I know the event occurred because APM shows the event log entry.
Events are populated by NPM and the modules outside of alerting as well as from alert actions. My guess is there is something in the alert that suppressed the trigger.
I did have suppression set: if the node is down, do not act. The node was up, though.
I have two defined alert actions. One is the action of restarting the service... the other is the action of writing to the eventlog. Since I see the entry in the event log... my assumption is the alert did trip (i.e. because one of the two actions worked).
Support reviewed my diags and is suggesting that I do a clean-up. To be frank, I'm a bit concerned if I don't get a clean explanation, because it appears that the restart-service action cannot be relied on.
I have two alerts defined for this monitor. The first does the restart. The second sends an email if the service has been down for more than 30 minutes. The email did not go out either.
I just re-read what you wrote... "Events are populated by NPM and the modules outside of alerting as well as from alert actions"
So then it "is" possible to have an event log entry and NOT have an alert log entry.
That said, maybe the network to the server went down, causing suppression to kick in? Is there any way I could prove this? Is there a suppression log? The db table seems to only show what is being suppressed now, versus what was suppressed in the past.
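If the database really does hold only the current suppression state, one workaround is to snapshot that state on a schedule and log the changes yourself. Below is a rough sketch of the idea only, not anything Orion ships. How you read the current suppressed set from the database is left out on purpose, and the key format ("alert@node") and the names `SuppressionHistory` and `diff_suppression` are made up for illustration.

```python
from datetime import datetime, timezone


def diff_suppression(previous, current, at):
    """Return history rows for keys that entered or left suppression."""
    rows = []
    for key in sorted(current - previous):
        rows.append((at.isoformat(), key, "suppressed"))
    for key in sorted(previous - current):
        rows.append((at.isoformat(), key, "released"))
    return rows


class SuppressionHistory:
    """Keeps a running log built from periodic snapshots of current state."""

    def __init__(self):
        self.previous = set()
        self.rows = []

    def record(self, current, at=None):
        # 'current' is whatever the poller read as suppressed right now
        # (hypothetical keys such as "restart-service@node1").
        at = at or datetime.now(timezone.utc)
        current = set(current)
        self.rows.extend(diff_suppression(self.previous, current, at))
        self.previous = current
        return self.rows
```

Calling `record()` from a scheduled job every minute or so would give a "what was suppressed, and when" trail to compare against the time of the event log entry. Anything that starts and ends between two polls would still be missed.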
The email not going out and the restart not happening make it look more like the alert didn't fire. Can you post the trigger and suppression?
Andy
Looks like we were typing at the same time.
I'll check on suppression logging.