
Monitor event log for status controlled by separate events

I'm trying to monitor for a specific condition in my application whereby it sometimes loses access to its database.  When it does this an event is logged into the Windows Application Event log.  When it regains connection another event is logged.

I know how to create an event log monitor for tracking a specific event but I need help logically creating a monitor that will show the system as down if the connection loss event occurs, but then returns to normal when the recovery event occurs.  From what I can see this is not possible with the default windows event log component monitor for 2 reasons.  

1) If the original event falls outside of the polling interval it will no longer be considered "found", so no issue is detected even if the service has been down for quite some time, and 

2) I see no way to tie two different events together in said monitor

Am I going to have to build something custom for this?  E.g. PowerShell?  

  • Yep, the default event log tool isn't really built for correlating events. You can duct tape a solution together within Orion if you are motivated, but if you have PowerShell skills handy I expect you'll have more luck with a scripted monitor.
  • I was expecting that.  Sigh.  Now the question is, how to actually go about doing this.  Sounds like I'll have to search for both expected events on each poll, and if the most recent "up" event is newer than the most recent "down" event, it's up; otherwise it's down.  Thoughts?
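
The comparison described above can be sketched as follows. This is only an illustration of the logic, written in Python so it can be run anywhere; a real scripted monitor would more likely be PowerShell reading the Application log. The event IDs `1001` and `1002` and the helper names are hypothetical, and treating "no down event found" as up is an assumption.

```python
from datetime import datetime
from typing import Iterable, Optional, Tuple

# Hypothetical event IDs; substitute the IDs your application actually logs.
DOWN_EVENT_ID = 1001
UP_EVENT_ID = 1002

Event = Tuple[datetime, int]  # (time written, event ID)


def latest(events: Iterable[Event], event_id: int) -> Optional[datetime]:
    """Return the newest timestamp for event_id, or None if it never occurs."""
    times = [ts for ts, eid in events if eid == event_id]
    return max(times) if times else None


def connection_status(events: Iterable[Event]) -> str:
    """Report "up" or "down" based on which event type was logged last."""
    events = list(events)
    last_down = latest(events, DOWN_EVENT_ID)
    last_up = latest(events, UP_EVENT_ID)
    if last_down is None:
        # Assumption: no connection-loss event in the log means healthy.
        return "up"
    if last_up is None or last_up < last_down:
        return "down"
    # Recovery is at least as recent as the last loss (a tie counts as up).
    return "up"
```

Because only the newest event of each type matters, the order in which the log is read does not affect the result.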

  • That's the approach I had in mind as well.  Something that might help it run faster is to leave a placeholder file, maybe in windows/temp on each node, that records the timestamp of the last poll and whether the component should be up or down based on the most recent event.  That way you only need to scan for the opposite event, and only need to look at events from the last few minutes rather than crawling through all 9 billion windows events.  That adds some complexity to the whole affair but could ultimately save you from parsing a lot of extra data all the time.
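
A rough sketch of the placeholder-file idea, again in Python purely to show the logic. The JSON layout, the function names and the event IDs are assumptions, not anything Orion provides. The sketch replays every event newer than the last poll in time order rather than scanning only for the opposite event, which is a small variation on the suggestion above.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Iterable, Tuple

# Hypothetical event IDs; substitute the IDs your application actually logs.
DOWN_EVENT_ID = 1001
UP_EVENT_ID = 1002

Event = Tuple[datetime, int]  # (time written, event ID)


def load_state(path: Path) -> dict:
    """Read the placeholder file, or assume "up" on the very first poll."""
    if path.exists():
        return json.loads(path.read_text())
    return {"last_poll": None, "status": "up"}


def poll(path: Path, events: Iterable[Event], now: datetime) -> str:
    """Update the stored status using only events logged since the last poll."""
    state = load_state(path)
    if state["last_poll"]:
        since = datetime.fromisoformat(state["last_poll"])
    else:
        since = datetime.min
    status = state["status"]
    # Replay the new events oldest-first so the last one decides the status.
    for _, event_id in sorted(e for e in events if since < e[0] <= now):
        if event_id == DOWN_EVENT_ID:
            status = "down"
        elif event_id == UP_EVENT_ID:
            status = "up"
    path.write_text(json.dumps({"last_poll": now.isoformat(), "status": status}))
    return status
```

Since the status is persisted, a poll that finds no new events keeps reporting the previous state, which addresses the case where the original event has aged out of the search window.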

  • One potential way of doing it is with the event log monitors and custom node status. I've done similar previously with triggering and resetting an alert but not setting the node status.

    1. Configure your application monitor with the 2 event log monitors
      1. Event log monitor A - connection lost - configure component to go critical or down if found
      2. Event log monitor B - connection restored - configure component to warning if found  (if it's an Info event, use the Event Count and set the warning threshold to >0 )
    2. Configure a component alert
      1. Trigger condition - match the connection lost event log monitor and status = down/critical
      2. Reset condition - match the connection restored event log monitor and status = warning
      3. Trigger action
        1. Set Custom Status - change to status Critical (or Down if you want the node to show as Down)

      4. Reset action
        1. Set Custom Status - use polled status (will revert to whatever node status is based on polled/collected data)

    If the disconnection event is found, the alert will trigger and set the node to critical or down (based on what you configure). The alert will stay active.

    When the connection restored event is found, the alert will reset and change the node back to polled status. If you have enhanced node status on and application status on, potentially the node will be warning for a poll or two until the application status returns to up/green.

  • Interesting solution. However, given that it doesn't really factor in the timestamps of the events, couldn't it get confused about the actual status of the service?  For example:

    If the service goes down (event 1 triggered), comes up (event 2 triggered) and goes back down again (event 3 triggered) all within a given polling cycle, both down and up events would be found simultaneously, and this solution would see the service as UP, ignoring that the last event marked it as down, no?
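
The scenario above can be reproduced with a small Python sketch. The timestamps are made up for illustration, and `presence_based` is a deliberate simplification of "found / not found" evaluation as described in this post, not a model of how the alert engine actually works.

```python
from datetime import datetime

DOWN, UP = "down", "up"

# Illustrative timestamps: down, up, down again inside one polling cycle.
events = [
    (datetime(2020, 6, 29, 10, 1), DOWN),  # event 1
    (datetime(2020, 6, 29, 10, 2), UP),    # event 2
    (datetime(2020, 6, 29, 10, 3), DOWN),  # event 3
]


def presence_based(window):
    """Simplified 'found' logic: any up event in the window resets to up."""
    kinds = {kind for _, kind in window}
    if UP in kinds:
        return UP
    return DOWN if DOWN in kinds else UP


def order_based(window):
    """Timestamp-aware logic: the most recent event decides the status."""
    if not window:
        return UP
    return max(window, key=lambda event: event[0])[1]


print(presence_based(events))  # up
print(order_based(events))     # down
```

The two functions disagree only when both event types land in the same window, which is exactly the flapping case raised here.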

  • The only way you'll be able to cover every condition like the above will be scripting. The conditions you come up with will determine how complex your script needs to be.
    You can tweak the polling interval (default 5 mins) or modify how far back in the logs it looks (default is 1.5 x polling_interval), but as you're dealing with events rather than looking at an object status, it won't cover every scenario.
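
As a worked example of the defaults mentioned above: a 5 minute polling interval with a 1.5x look-back gives a 7.5 minute search window, so an event logged 10 minutes before a poll is no longer found. The Python below only does that arithmetic; the function names are hypothetical and the inclusive window boundaries are an assumption.

```python
from datetime import datetime, timedelta


def lookback_window(polling_interval: timedelta, multiplier: float = 1.5) -> timedelta:
    """How far back the monitor searches: multiplier x polling interval."""
    return polling_interval * multiplier


def is_found(event_time: datetime, poll_time: datetime,
             polling_interval: timedelta = timedelta(minutes=5)) -> bool:
    """True if the event falls inside the look-back window of this poll."""
    window_start = poll_time - lookback_window(polling_interval)
    return window_start <= event_time <= poll_time


poll_time = datetime(2020, 6, 29, 10, 0)  # illustrative timestamp
print(lookback_window(timedelta(minutes=5)))                 # 0:07:30
print(is_found(poll_time - timedelta(minutes=5), poll_time))   # True
print(is_found(poll_time - timedelta(minutes=10), poll_time))  # False
```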