
Integrating Monitoring Tools into ServiceNow Through Event Management

Why are we discussing this?

I didn't see any similar discussions or posts. I'm not sure how many of you are using the event management suite in ServiceNow or something similar, but I wanted to put this out there to help you avoid some of the pitfalls we ran into when we rolled it out. There are a lot of nice features available to help reduce outages and increase efficiency for your team and (if you have one) the EOC/NOC. Our process started as a way to stop multiple incidents from being created for the same outage when several tools alerted on it. By sending these alerts through an event management process, we can now correlate them, display them on a service map, and generate only a single incident.

Definitions you need to know

  • Event Management – The process responsible for managing events throughout their lifecycle. Event management is one of the main activities of IT operations. It is a way to consolidate all events/alerts from disparate monitoring systems in one place, both to give you more information and to reduce noise for your teams. Not all events should become alerts, and not all alerts should become incidents.
  • Event – A change of state that has some significance for the management of an IT service or a configuration item.  These records can vary greatly in their importance from “telling you of the addition of a device to monitoring” to “telling you a Data Center is offline”.
  • Alert – A notification of a threshold breach, something has changed, or a failure has occurred. Monitoring tools create and manage alerts.  The event management process manages the lifecycle of an alert.  An alert must have first been an event.
  • Incident – An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet affected service is also an incident.
  • Noise – Alerts that are unneeded, duplicated, or correlated to a larger issue.
  • Signal – Unique alerts that are usable, actionable, and result in either the creation of an incident or automated remediation.

Why ServiceNow?

There are other tools that can do many of the same things that I will be talking about here. I am focusing on ServiceNow because I have experience building out this integration from monitoring tools to ServiceNow. Regardless of the tool you use to handle event management, this discussion should still help you on your journey.

Why would I want to use an event management tool?

While monitoring tools have their own ways of handling events and alerts and of creating incidents, they do not talk to each other. Unless you have a single tool handling all your monitoring, you will likely run into issues where two or more tools generate an incident for the same thing. This is avoidable by using a tool like the Event Management module inside ServiceNow to reduce noise. Other benefits include the ability to track alerts per device, mute alerts during a change window based on the change record, look for trends, create reports, and of course build automation to eliminate repetitive tasks caused by alerts (e.g., a service has hung and needs to be restarted).

Where do I start?

When you have multiple tools handling your monitoring for the enterprise, which tool should you start with? Well, that depends on a lot of factors. Is there a tool that's super easy to integrate, a tool that has the most reliable alerts, or a tool that most of IT is using? To throw some buzzworthy phrases at you: "Don't try to boil the ocean", "Get the low-hanging fruit", and "What gets you the most bang for your buck". Meaning simply this: start small, since you can always expand; knock out the easy stuff if it makes sense; and ultimately pick whatever provides the greatest benefit for the least effort. I will leave that decision to you, but I will tell you what we did to get where we are today.

We started with a single tool that was hosting a lot of our monitoring.  There was no out of the box connector for it in ServiceNow and that tool was sending emails to open incidents.  We found that while we were unable to pull data from that tool using an API call, it did support sending the data through API.  We built out the integration so that when the tool generates an alert, it sends this to ServiceNow via the API to the event table.  This allowed me to then build rules on how to handle these events.
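To make the "send it to the event table" step a bit more concrete, here is a rough sketch of shaping one alert into a JSON body for the ServiceNow event table (em_event), which could then be POSTed through the Table API (/api/now/table/em_event). This is not our actual integration code. The input object shape, the list of required fields, and the fallback message key are assumptions made for illustration only.

```javascript
// Minimal sketch: turn a monitoring-tool alert into an em_event record body.
// The input shape (alert.source, alert.messageKey, alert.extra, ...) is hypothetical.
function buildEventPayload(alert) {
  // Assumption: we refuse to send events that lack these basics.
  const required = ["source", "node", "severity", "description"];
  for (const field of required) {
    if (alert[field] === undefined || alert[field] === "") {
      throw new Error("Missing required field: " + field);
    }
  }
  return {
    source: alert.source,            // which tool raised the event
    node: alert.node,                // hostname by default
    type: alert.type || "",          // e.g. server/application/network
    resource: alert.resource || "",  // what on the device has the issue, e.g. CPU
    severity: String(alert.severity),
    description: alert.description,
    // Illustrative fallback key; a sender-supplied unique key is preferable.
    message_key: alert.messageKey ||
      [alert.source, alert.node, alert.type, alert.resource].join("_"),
    additional_info: JSON.stringify(alert.extra || {})
  };
}

const body = buildEventPayload({
  source: "Vistara",
  node: "web01",
  type: "server",
  resource: "CPU",
  severity: 2,
  description: "High CPU on web01"
});
console.log(JSON.stringify(body, null, 2));
```

Keeping the payload builder separate from the HTTP call makes it easy to check the field mapping before anything reaches the instance.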

The next decision was to integrate with SCOM. There was an out of the box integration for this one and it was in place fast.  The same process followed: cleaning up noisy events and building rules to provide better alerts.

Various other tools used an email integration to event management. The emails were sent as plain text in a JSON format. These were fairly easy, but this is not a preferred method because it adds points of failure: our (or the tool vendor's) email servers and the ServiceNow email servers.

Next in the line was SolarWinds.  SolarWinds was interesting for two reasons.  The first was that there was an out of the box connector.  The second was SolarWinds had a plugin for ServiceNow integration.  What I came to find was that the plugin was for incident creation and not event creation and the out of the box connector worked but needed tweaking. 

I will explain more later in the lessons learned section.  We found a few issues along the way, but it came together nicely.  I am now able to build reports for teams that show their health as reported from all tools, associate devices in a change window to their alert and mark it as in maintenance, build a customer experience dashboard, and (thanks to the work of our CMDB guy) we can feed these alerts to service maps.

What is the plan moving forward?

There are other features we haven't started playing with yet. Operational Intelligence is the feature I am most interested in pursuing. This is the portion of the suite that collects the metric data from your tools, looks for anomalies, and proactively alerts based on machine learning.

What lessons did you learn?

Integrating alerts was not as simple as we originally thought it would be. I want to go over the lessons learned for each tool and end on what we learned about the ServiceNow platform itself. Hopefully sharing this information will save you from the same issues when you do your integration.

Having Vistara send the events via an API call to ServiceNow worked well.  We had a few instances of their API service dying, but over two years that’s not bad. When we started receiving the events from Vistara in ServiceNow, we found many of these weren’t actionable and built rules to silence them.  The remaining events then became alerts.  For the alerts, I made a different set of rules that provided additional information for our EOC (Enterprise Operations Center) about the alert.  That information could be things like a knowledge base article that tells them how to fix the issue, who to contact, if this should become an incident, what the severity of the incident should be, and much more.

SCOM was a bit of a pain. We found that with our instance there were two different places where we had to build out the integration. To pull the alerts from SCOM, we had the connector hit one of our web servers. The metric data was not accessible from the same server, so that connector had to point directly at the DB server. This worked well until security locked down the ports and we couldn't connect to it anymore. The alerting has since been moved to email integration to work around the security "features" blocking our API connection, and we had to disable the metrics collection altogether.

The email integrations are a stopgap until the monitoring these tools provide is moved to SolarWinds. The plus side is that they are easy to customize and quick to set up; the flip side is that they have additional points of failure. Another issue we have encountered is getting the clear messages to work. This comes down to the message key. A message key is what event management uses to separate different occurrences of the same issue and to associate a clear message with the alert it clears. If you run into this issue, work with the team sending the email to define a unique message key for their alerts.

The SolarWinds connector pulls data from the event table in SolarWinds instead of the alerts. Events in SolarWinds can be triggered either by the thresholds assigned to the machine directly or by something forcing the system to write to the event log. This means that you will need to add a filter to hide the noise you were already filtering out by setting up alert actions in SolarWinds. One of the ways we combatted that was to block any alert without an eventType of 5000 or 5001. Event type in SolarWinds is a number that identifies what triggered the event. A 5000 event says that an alert rule caused an entry to be written to the event log. A 5001 event says the issue is cleared. That simple change in ServiceNow stopped over 9000 additional noisy alerts per day. The biggest thing we found is that the "swEventId" does not make a good message key. This forced us to create our own message key using the initial event time field. An example would be like this JSON piece: {"initial_event_time":"5/19/2018 16:43:00", "netObjectId":"10053"} becomes a message key of 2018.170.16.43.00.10053
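As an illustration of building a message key from the initial event time, here is a hedged sketch. It assumes the key layout is year.dayOfYear.HH.mm.ss.netObjectId; that reading of the second field is an assumption, and under it the sample date above (May 19) yields 139 rather than the 170 shown, so the real format may differ. The function name and parsing approach are hypothetical, not the connector's code.

```javascript
// Sketch: derive a stable message key from initial_event_time + netObjectId.
// Assumed layout: year.dayOfYear.HH.mm.ss.netObjectId
function buildMessageKey(event) {
  const m = /^(\d{1,2})\/(\d{1,2})\/(\d{4}) (\d{1,2}):(\d{2}):(\d{2})$/
    .exec(event.initial_event_time);
  if (!m) {
    throw new Error("Unrecognized time format: " + event.initial_event_time);
  }
  const month = Number(m[1]);
  const day = Number(m[2]);
  const year = Number(m[3]);
  // Day of year: whole days elapsed since Dec 31 of the previous year (UTC math
  // avoids daylight-saving offsets).
  const dayOfYear =
    (Date.UTC(year, month - 1, day) - Date.UTC(year, 0, 0)) / 86400000;
  const hours = String(Number(m[4])).padStart(2, "0");
  return [year, dayOfYear, hours, m[5], m[6], event.netObjectId].join(".");
}

const trigger = { initial_event_time: "5/19/2018 16:43:00", netObjectId: "10053" };
console.log(buildMessageKey(trigger)); // 2018.139.16.43.00.10053
```

Because the key depends only on the initial event time and the object ID, a clear event (5001) that carries the same two values produces the same key, which is what lets it close the matching alert.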

ServiceNow has several connectors out of the box; however, I would still recommend having someone who can program in JavaScript go through the default connector or build your own connector. I would not change the default connector, but instead make a copy of the default if you want to make changes. Code patches could overwrite your changes to a default connector definition. Here are a few other quick hints:

  • Map out which fields you want to use on the alert form before you start
    • Default for node is the hostname
    • Default for resource is what on the device/application has an issue (e.g., CPU for a High CPU alert)
    • Type needs to be something in the CMDB (e.g., server/application/network)
    • Severities need to be in number format (1-5 Exception to Informational)
    • Custom fields can be added to the alert form
  • Event and Alert Rules can let you change the entire alert message and field data
  • Technical services are for things like Exchange, not a custom monitor
  • Discovered Services are for service maps
  • Manual Services can be used for custom-monitored services
  • If you plan to use the dashboard to display the health of your services, ensure that all services are using the default numbered Business criticality as the non-standard criticalities will break the dashboard display
  • Link KB articles to alerts and provide instructions for handling the alert or how to fix the issue
  • You can build out automation around what to do with alerts

Where does this leave us?

The ITOM suite has provided us a wealth of information to improve our services, increase how often we are first to know, and identify trends to avoid issues in the future. While we encountered some issues when we began this journey, the destination was well worth the trouble. The key takeaway: planning what data you want to collect and how you want to use it is critical to making this successful.

Have any of you run into these or other issues?  Do you have any other suggestions/comments/concerns?

46 Replies

Hi dmartzall,

We have an Event Rule created in ServiceNow that expects the events from SolarWinds in a particular format. All rules work except the one for Group events. When I checked in SolarWinds, I do see the events being created when a Group goes down, but these events aren't captured in ServiceNow. It's specific to Group events only. Is there any specific condition or integration update we need on the ServiceNow end for the connector instance to capture Group events from SolarWinds?

0 Kudos

Are you triggering an alert action for the group events? You could also check the regex you are using on the ServiceNow side with an online regex tester and debugger to validate that it matches your alerts.

0 Kudos

Thanks for posting this. It's these types of threads, shared from personal experience in both the original article and the comments from the community, that keep me engaged and informed.

0 Kudos

dmartzall

I have one question for you; not sure if you have already come across this.

We are doing the SolarWinds-SNOW integration through a MID server so that we have control over event management. I applied the setting to take only the 5000 and 5001 event types.

But I also see another problem. For most of the alerts, the resource field comes through blank. If it's a CPU, device down, or disk space alert, the resource field should show the actual type, right?

If it comes through blank, correlation becomes very difficult. Any suggestions?


0 Kudos

There are several different things we did to ensure all of the fields populated. First, for any field we wanted populated on the SNOW form, we ensured that the field information was contained within the event. Second, we built a regular expression to parse these fields from the event using event rules. For example, we used XXX-(.*): (.*) for (.*) to parse out the LoB, the metric, and the resource from the description.
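To show what that pattern captures, here is a small sketch run outside ServiceNow. The sample descriptions are made up, and "XXX" is treated as a literal prefix standing in for whatever the real alerts start with. The field names lob, metric, and resource are just labels for the three capture groups.

```javascript
// Sketch: apply the thread's regex to an event description.
// The descriptions below are hypothetical examples, not real alert text.
const DESCRIPTION_PATTERN = /XXX-(.*): (.*) for (.*)/;

function parseDescription(description) {
  const match = DESCRIPTION_PATTERN.exec(description);
  if (!match) {
    return null; // description does not follow the expected layout
  }
  return { lob: match[1], metric: match[2], resource: match[3] };
}

console.log(parseDescription("XXX-Finance: High CPU for server01"));
// Greedy groups split on the LAST " for " in the text:
console.log(parseDescription("XXX-Ops: Waiting for disk for db02"));
console.log(parseDescription("Node is down")); // null
```

Note that (.*) is greedy, so if " for " appears more than once the resource is whatever follows the last occurrence. If your alert text can vary, tightening the groups (for example with a character class) is worth considering.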

0 Kudos

Thanks, so my assumption was correct.

One more thing: if we alert using a custom poller and the trap viewer, will those alerts still reach SNOW when we suppress all events other than 5000 and 5001?

Asking in case you have already tested such a scenario.

0 Kudos

The custom poller will for sure.  I have not tested the Trap viewer alerts.

0 Kudos

I tried this for a custom poller, but it doesn't reach SNOW. It does come up in the event log, however.

0 Kudos

Have you already made a rule in SolarWinds for the poller? If you look at the alert you made, is it set to alert on the custom node poller?

0 Kudos

Yes, I have already created the trigger condition and action. I am alerting on both the node poller and the table poller. I tried once, but the events never reached ServiceNow, and I don't know why.

0 Kudos

Hi, thanks for sharing this.

For SolarWinds, what kind of credential is required to pull the events from SolarWinds and show them in ServiceNow? I know that it's a bit tricky for Nagios, as the password is the API key that needs to be generated for the logged-in user.

Can we still use basic authentication for SolarWinds, and what kind of username/password is needed?

thanks

0 Kudos

It needs the ability to administer alerts.

0 Kudos

rahman.mahmoodi

We used basic auth for our instance.  The account needed to be able to administer events and alerts, but should not need to be a full admin account.

0 Kudos

You can filter the events in the API query by modifying the "WHERE" clause. Change

        var query = "SELECT TOP " + MAX_EVENTS_TO_FETCH +
            " EventID, EventTime, NetworkNode, NetObjectID, EventType, Message, Acknowledged, " +
            "NetObjectType, Timestamp FROM Orion.Events " +
            "WHERE NetworkNode > 0 AND NetObjectID > 0 ";

to

            "WHERE Events.EventType = 5000 OR Events.EventType = 5001 ";
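If you would rather keep the original NetworkNode/NetObjectID conditions and add the event type filter on top, one possible shape is sketched below. It is an illustration rather than the shipped connector code: buildEventQuery and the value given to MAX_EVENTS_TO_FETCH are hypothetical, and whether to keep the original conditions is a judgment call for your environment. The parentheses matter, because AND binds tighter than OR.

```javascript
// Sketch: build the Orion.Events query with an event type filter added.
// MAX_EVENTS_TO_FETCH exists in the connector script; the value here is made up.
const MAX_EVENTS_TO_FETCH = 3000;

function buildEventQuery(maxEvents, eventTypes) {
  for (const t of eventTypes) {
    if (!Number.isInteger(t)) {
      throw new Error("Event types must be integers, got: " + t);
    }
  }
  const typeFilter = eventTypes
    .map(function (t) { return "EventType = " + t; })
    .join(" OR ");
  return "SELECT TOP " + maxEvents +
    " EventID, EventTime, NetworkNode, NetObjectID, EventType, Message, Acknowledged, " +
    "NetObjectType, Timestamp FROM Orion.Events " +
    // Parentheses keep the OR from swallowing the earlier AND conditions.
    "WHERE NetworkNode > 0 AND NetObjectID > 0 AND (" + typeFilter + ")";
}

console.log(buildEventQuery(MAX_EVENTS_TO_FETCH, [5000, 5001]));
```

Filtering at the query means the unwanted events never leave SolarWinds, so you do not need a discard rule per event type on the ServiceNow side.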

0 Kudos

Is this the only way to filter down to just the 5000 and 5001 events? I have been trying with a discard event rule, but I can't seem to get it to work with two conditions, and I am not about to build a separate rule for every event ID outside of these two.

0 Kudos

There is a limitation in the logic that defines the severity of an event. The severity is defined by EventTypes.Icon. This means that every event of type 5000 (Alert Triggered) will have the exact same severity. The database structure in SolarWinds does not allow you to join the alert severity, found in AlertConfigurations.severity, with the event data, or at least I cannot figure out how. So the only way to pass through the alert severity is to include it in the Netperfmon message and parse it out with an event rule.

0 Kudos

Is a severity variable available by default?

0 Kudos

Adding an attachment to assist in the explanation. When an alert is created in SW, there is a drop-down choice for "Severity of Alert". This information is not passed through to sNow. The severity that is passed through to sNow is defined by the event type. See sheet "SolarWindsEventTypes", column E = Icon; that is what is passed through to sNow for the severity. So for event type 5000 (Alert Triggered), the icon RedYield is translated to 4 = Warning. See sheet "sNow Mapping Tables": the solarwinds-icon-severity table translates RedYield to 4, and the Severity Mapping table translates 4 to Warning. Following through to the SW database tables where the data resides (sheet "SWDatabaseTables"), you can see that there is no straightforward way to join the severity of an alert, which is held in the AlertConfigurations table, to the event.
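To illustrate the two options discussed in this thread (default icon-based severity versus a severity carried in the message), here is a hedged sketch. Only the RedYield to 4 to Warning mapping comes from the thread; the other table entries are deliberately left out. The "Severity=<n>" token in the message is a hypothetical convention you would have to add to your own alert message text, and resolveSeverity is an illustrative name, not a ServiceNow function.

```javascript
// Sketch: pick a severity for an incoming SolarWinds event.
// Partial tables: only the mapping mentioned in the thread is included.
const ICON_TO_SEVERITY = { RedYield: 4 };
const SEVERITY_LABELS = { 4: "Warning" };

function resolveSeverity(event) {
  // Hypothetical convention: the alert message carries "Severity=<0-5>".
  const match = /\bSeverity=([0-5])\b/.exec(event.message || "");
  if (match) {
    return Number(match[1]);
  }
  // Otherwise fall back to the icon mapping, which is the same for every 5000 event.
  const fromIcon = ICON_TO_SEVERITY[event.icon];
  return fromIcon === undefined ? null : fromIcon;
}

const byIcon = resolveSeverity({ icon: "RedYield", message: "Node web01 is down" });
console.log(byIcon, SEVERITY_LABELS[byIcon]); // 4 Warning

const byMessage = resolveSeverity({ icon: "RedYield", message: "Severity=1 Node web01 is down" });
console.log(byMessage); // 1
```

In ServiceNow itself the same idea would live in an event rule: a regex extracts the severity from the message when it is present, and the field mapping default applies when it is not.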

SW_Alert.PNG

0 Kudos

I know about this option, but I was referring to this text in your comment: "So the only way to pass through the alert severity is to include it in the Netperfmon message and to parse it out with an event rule."

So do we have a SQL variable for severity that we can pass in the message field?

0 Kudos

What we did for the severity was assign it through the ServiceNow "Event Rules - transform and compose alert" page, where you can tweak things to fit. The trade-off is how simple or complex you want to make the alerts. You can pass the severity you want through the alert and parse it with a regex to make the event rules more dynamic, or, if you have tons of event rules in SNOW, you can assign the severity you want directly from the transform page.

If you leave it at the default, it will go off the icon or whatever you use in the event field map.

0 Kudos