Hi,
We have a very simple alert set up as follows:
Trigger Condition:
Machine Type is MachineType1
Node Status is Down
Condition must exist for 10 minutes to trigger
Reset Condition:
When Trigger Condition is no longer true
Trigger Action:
Log an entry to NPM log
Log an entry to Windows Event Log
Reset Action:
Log an entry to NPM log
Issue:
When the trigger condition is met, multiple duplicate entries are written to both the NPM log and the Windows Application log (EventID 3003). In one of our environments we get 5 duplicate entries each time; in the other we get 3. Then, when the node comes back up, the reset action writes the same number of entries to the NPM log: 5 in the first environment and 3 in the other. This happens with all MachineType1 nodes, not any one in particular. When we test the alert with the Advanced Alert Manager, only 1 entry is written each time, which is the behavior we want.
Note that the alert is not triggering and then immediately resetting. For example (newest events first):
7/12/2017 5:01 AM Event Type RESOLVED: CRITICAL P4 Machine1 is now back up 7/12/2017 5:01 AM
7/12/2017 5:01 AM Event Type RESOLVED: CRITICAL P4 Machine1 is now back up 7/12/2017 5:01 AM
7/12/2017 5:01 AM Event Type RESOLVED: CRITICAL P4 Machine1 is now back up 7/12/2017 5:01 AM
7/12/2017 4:57 AM Event Type Machine1 is Up
7/12/2017 4:57 AM Event Type Machine1 is responding again. Response time is 32 milliseconds.
7/12/2017 4:51 AM Event Type CRITICAL P4 Machine1 status is down 7/12/2017 4:51 AM
7/12/2017 4:51 AM Event Type CRITICAL P4 Machine1 status is down 7/12/2017 4:51 AM
7/12/2017 4:51 AM Event Type CRITICAL P4 Machine1 status is down 7/12/2017 4:51 AM
7/12/2017 4:38 AM Event Type Machine1 is Down
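In case it helps anyone compare against their own logs, here is a rough Python sketch for counting identical entries in an exported event list. The function name `count_duplicates` and the plain-text, one-event-per-line format are assumptions for illustration, not anything NPM provides; the sample lines are the ones pasted above.

```python
from collections import Counter

# Sample events copied from the list above. The plain-text,
# one-event-per-line layout is an assumed export format.
EVENTS = """\
7/12/2017 5:01 AM Event Type RESOLVED: CRITICAL P4 Machine1 is now back up 7/12/2017 5:01 AM
7/12/2017 5:01 AM Event Type RESOLVED: CRITICAL P4 Machine1 is now back up 7/12/2017 5:01 AM
7/12/2017 5:01 AM Event Type RESOLVED: CRITICAL P4 Machine1 is now back up 7/12/2017 5:01 AM
7/12/2017 4:57 AM Event Type Machine1 is Up
7/12/2017 4:57 AM Event Type Machine1 is responding again. Response time is 32 milliseconds.
7/12/2017 4:51 AM Event Type CRITICAL P4 Machine1 status is down 7/12/2017 4:51 AM
7/12/2017 4:51 AM Event Type CRITICAL P4 Machine1 status is down 7/12/2017 4:51 AM
7/12/2017 4:51 AM Event Type CRITICAL P4 Machine1 status is down 7/12/2017 4:51 AM
7/12/2017 4:38 AM Event Type Machine1 is Down
""".splitlines()


def count_duplicates(lines):
    """Return {line: count} for every non-empty line seen more than once."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return {line: n for line, n in counts.items() if n > 1}


duplicates = count_duplicates(EVENTS)
for line, n in duplicates.items():
    print(f"{n}x  {line}")
```

With the sample above, only the trigger message and the reset message are reported, each with a count of 3, while the built-in "is Down" and "is Up" node events appear once.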
This really has me stumped. Does anybody have any idea what the issue might be?