Linking an Alert with a NetPerfMon Event Log

SPOLER WARNING: I am not, nor have I ever been, a developer.  This is a personal challenge and experience post.

One of the best ways to keep a record of alerts is to utilize the internal Orion event logger. It's probably the alert action I use most of all.  I like to use it for every alert and every escalation level.  I, personally, use it as an internal record of how an alert has progressed.

MicrosoftTeams-image (16).png

Quick scenario: Imagine an alert with one escalation after 15 minutes and a reset condition.

Personally, I log an event...

  • when the alert first triggers - I include all the relevant details (server seeing the alert conditions, triggering object, reason for alert, when the next escalation will take place, other stuff specific to the alert)
  • when the escalation takes place - I include the same as above, but mention that the escalation has been triggered
  • when the reset condition happens - I include that the alert is now cleared and things are back to normal

Up until yesterday I always thought of the Event Log as a one-way communication.  Something happens and you write a log.  Maybe you go back later to review the logs, but you never really "source" from them any other way.

But then I had to take a step back.  I can't know how everyone will use event details in their own environment, so I started to look at ways to tie the alerts back to the event log it wrote.

Because the Event Log isn't really designed to be used this way, I wasn't sure it was possible.  There's no out-of-the-box way to handle this, but I really wanted to look deeper in the way the data was stored and the schema of how it was related.

DISCLAIMER: This process has been tested with the Orion 2020.2 releases.  It may or may not work for other builds.

I knew that I had to start with the alert, so I opened SWQL Studio (part of the Orion SDK) and starting crawling around the Orion.Alert tables.

KMSigma_0-1596206348533.png

After looking through a few of them, I ended up on Orion.AlertActive.  This looks to be a list of all the current active alerts as opposed to Orion.AlertHistory, which appears to be historical.

KMSigma_1-1596206463178.png
SELECT TOP 10 AlertActiveID
, AlertObjectID
, Acknowledged
, AcknowledgedBy
, AcknowledgedDateTime
, AcknowledgedNote
, TriggeredDateTime
, TriggeredMessage
, NumberOfNotes
, LastExecutedEscalationLevel
FROM Orion.AlertActive
ORDER BY AlertActiveID DESC

Since alerts are triggered on a specific object that is being monitored, I needed to get that information as well.  I've got some details of the alert in general, but I still need the triggered alert.

After querying a few tables, I found what I wanted in the Orion.AlertObjects table.

 
Orion.AlertObjects.png
SELECT TOP 10 AlertObjectID
, AlertID
, EntityUri
, EntityType
, EntityCaption
, EntityDetailsUrl
, EntityNetObjectId
, RelatedNodeUri
, RelatedNodeId
, RelatedNodeDetailsUrl
, RelatedNodeCaption
, RealEntityUri
, RealEntityType
, TriggeredCount
, LastTriggeredDateTime
, Context
, AlertNote
FROM Orion.AlertObjects
ORDER BY AlertID DESC
 

Both tables have an AlertObjectID, so I knew I could join the two together.  I also saw that the Orion.AlertObjects table has the field EntityNetObjectId, which contains a way that we uniquely identify monitored elements in Orion, so I took note.

Last, but not least I needed to look at the Orion.Events table.

 
Orion.Events.png
SELECT TOP 10 EventID
, EventTime
, NetworkNode
, NetObjectID
, NetObjectValue
, EngineID
, EventType
, Message
, Acknowledged
, NetObjectType
, TimeStamp
FROM Orion.Events
ORDER BY EventID DESC

In the Orion.Events table I could see the NetObjectType and the NetObjectID.  If I put a colon between these two entries, it looks just like the EntityNetObjectId from the Orion.AlertObjects table.  So, let's add that to the output using the CONCAT function and remove the stuff from the SWQL query I don't need.  [Note: The CONCAT function just joins the elements provided by the parameters into one long string.  You can do a similar thing with the '+' operator, but I prefer the function.]

Orion.Events_NewNetId.png
SELECT TOP 10 EventID
, NetObjectID
, NetObjectType
, CONCAT(NetObjectType, ':', NetObjectID) AS [Test ID]
FROM Orion.Events
ORDER BY EventID DESC

Well, that doesn't exactly look right.  It looks like the NetObjectType field is padded with spaces.  If they are spaces, then we should be able to replace them with nothing and get the result we want.  So, I tried again to see if my hunch was correct using the REPLACE function.

Orion.Events_NewNetId_replace.png
SELECT TOP 10 EventID
, NetObjectID
, NetObjectType
, CONCAT(REPLACE(NetObjectType, ' ', '') , ':', NetObjectID) AS [Test ID]
FROM Orion.Events
ORDER BY EventID DESC

Much better.  Then it was just a matter of stringing my three tables together using some JOINs.  For the example below, I chose an arbitrary AlertObjectID to use for testing.

Orion.EventsToAlerts_JOINs.png
SELECT TOP 10 [AlertObjects].AlertObjectID
, [Events].EventID
, [Events].EventTime
, [Events].NetworkNode
, [Events].NetObjectID
, [Events].NetObjectValue
, [Events].EngineID
, [Events].EventType
, [Events].Message
, [Events].Acknowledged
, [Events].NetObjectType
, [Events].TimeStamp
, [Events].DisplayName
, [Events].Description
, [Events].InstanceType
, [Events].Uri
, [Events].InstanceSiteId
FROM Orion.Events AS [Events]
INNER JOIN Orion.AlertObjects AS [AlertObjects]
ON CONCAT(REPLACE([Events].NetObjectType, ' ', ''), ':', [Events].NetObjectID) = [AlertObjects].EntityNetObjectID
INNER JOIN Orion.AlertActive AS [ActiveAlerts]
ON [ActiveAlerts].AlertObjectID = [AlertObjects].AlertObjectID
WHERE [ActiveAlerts].AlertActiveID = 2615420

So it looks like we have everything we need to add any of these fields to an alert trigger.

For my example, I duplicated the Datastore Low Free Space alert and called it Datastore Low Free Space (with Event) becuase I am not creative.

I kept the original trigger action in place (Write to the NetPerfMon Event Log) and then created an escalation layer.  I created the escalation layer because I wanted to be 100% sure that the Orion.Events table would have the record when I went to query it.  I probably could have done this any time after the initial log write, but I'm being overly cautious.

For that second layer, you can use any type of action, because the important part is when you get to edit the message that's used.  In my example I used a log to a file because it was easy to test.

Orion.Event_AlertFramework.png

When you edit the message body (for an email, raising a ticket/incident, writing to a log file, sending a syslog/trap, or whatever else), you'll need to insert a custom variable.

Orion.Events_ActionBody.png

When you are in the body, click Insert Variable, check the box for Define SQL/SWQL Variable (Advanced), and move the radio button over to SWQL.

And this is the query I used.

SELECT [Events].EventID
FROM Orion.Events AS [Events]
INNER JOIN Orion.AlertObjects AS [AlertObjects]
ON CONCAT(REPLACE([Events].NetObjectType, ' ', ''), ':', [Events].NetObjectID) = [AlertObjects].EntityNetObjectID
INNER JOIN Orion.AlertActive AS [ActiveAlerts]
ON [ActiveAlerts].AlertObjectID = [AlertObjects].AlertObjectID
WHERE [ActiveAlerts].AlertActiveID = ${N=Alerting;M=AlertActiveID}

There are two important parts here: the first is that I'm only returning one element (because all I want is the EventID) and the second is that thing at the end between the curly braces.  This is the unique identifier bound to this alert when it's fired.  (The equivalent to my earlier test with 2615420 as as arbitrary choice).

Paste the above in that Custom Variable box and you'll get just the EventID when the alert triggers.

And I know that this works because after I saved my new alert, this is what showed up in that flat log file:

Orion.Events_Output.png

There's also at least one caveat: if you have multiple messages being logged for a single alert, you may need to use TOP 1 and an ORDER BY clause in your SWQL to only get the most recent.  I've included an example of what I think that would look like in the attached document along with the original SWQL query.

So, I hope that you've enjoyed your journey with me on how I took a problem that I thought was impossible and turned it into that was, if not simple, definitely possible.  One thing I've always said, if Orion has the data, we can find a way to get the data.

  • This is a very nice article.  This may just be the answer I'm looking for.  I have been asked to include the time that an alert is reset in the email that Orion sends.  I also write to the NetPerfMon every time an alert is triggered or reset.

    I have followed along with your example and I had to make a minor change in the query because my alert is using a UnDP poller.  In the Orion.AlertObjects table the NetObjectID field is NULL for a UnDP so I had to change the query to use the NetObjectValue for the CONCAT function.

    My question is I would like to include the timestamp for the event that is recorded when the reset action takes place in the reset email.  How would I change the query so I can get the timestamp?

  • Maybe I am misunderstanding what you are looking for, so take the following with a grain of salt.

    We USE the event table extensively.
    Most all of our alerts trigger to 2 actions
    1 – Send an email to the team responsible for the device
    2 – Write an event in the event log.
    The alert reset work the same
    1 – Send an email to the team responsible for the device
    2 – Write an event in the event log.

    This is the unique part
    The Trigger Event places a formatted event in the event log
    ORION ALERT: ${N=Alerting;M=AlertName} BREACH ${N=SwisEntity;M=Caption}
    Other information is included, Node Information, Event Timing, And Ticketing information.
    The Reset Event also places a formatted event in the event log
    ORION ALERT: ${N=Alerting;M=AlertName} NORMALIZED ${N=SwisEntity;M=Caption}
    Other information is included, Node Information, Event Timing (Start, End time Duration), And Ticketing information.

    With this information in the Event Table, I am able to generate reports on
    When and how long and what happened by node. Simple query, and counts of how often, etc.
    The data is there and can be quired as needed.

    Why all of this – We using a event analyzer that pulls the information in the event table every 10 seconds. Then uses that information is to create Ticking based on rules for the type of event.
    It looks for “ORION ALERT: BREACH”, start of an issue and then NORMALIZED to end the event and close the ticket created (if the rules allow).

    Hope this might help as an idea. But again may be way off base