
    Integrating Monitoring Tools into ServiceNow Through Event Management

    dmartzall

      Why are we discussing this?

I didn't see any similar discussions or posts.  I'm not sure how many of you are using the Event Management suite in ServiceNow or something similar, but I wanted to put this out there to help you avoid some of the pitfalls we ran into when we rolled it out.  There are a lot of nice features available to help reduce outages and increase efficiency for your team and (if you have one) the EOC/NOC.  Our process started as a way to address multiple incidents being created for the same outage because alerts came from multiple tools.  By running these alerts through an event management process, we are able to correlate them, display them on a service map, and generate only a single incident.

       

       

      Definitions you need to know

      • Event Management – The process responsible for managing events throughout their lifecycle. Event management is one of the main activities of IT operations. It is a way to consolidate all events/alerts from disparate monitoring systems in one place, giving your teams more information while reducing noise.  Not all events should become an alert, and not all alerts should become incidents.
      • Event – A change of state that has some significance for the management of an IT service or a configuration item.  These records can vary greatly in importance, from “a device was added to monitoring” to “a data center is offline.”
      • Alert – A notification that a threshold has been breached, something has changed, or a failure has occurred. Monitoring tools create and manage alerts.  The event management process manages the lifecycle of an alert.  An alert must have first been an event.
      • Incident – An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet affected service is also an incident.
      • Noise – Alerts that are unneeded, duplicated, or correlated to a larger issue.
      • Signal – Unique alerts that are usable, actionable, and result in either the creation of an incident or automated remediation.

       

      Why ServiceNow?

      There are other tools that can do many of the same things I will be talking about here.  I am focusing on ServiceNow because I have experience building out this integration from monitoring tools to ServiceNow.  Regardless of the tool you use to handle event management, the discussion should still help your journey.

       

      Why would I want to use an event management tool?

      While tools have their own ways of handling events, alerts, and incident creation, they do not talk to each other.  Unless you have a single tool handling all your monitoring, you will likely run into issues where two or more tools generate an incident for the same thing.  This is avoidable using tools like the Event Management module inside ServiceNow to reduce noise.  Other benefits include the ability to track alerts per device, mute alerts during a change window, look for trends, create reports, and of course build automation to eliminate repetitive tasks caused by alerts (e.g., a service hung and needs to be restarted).

       

      Where do I start?

      When you have multiple tools handling monitoring for the enterprise, which tool should you start with?  Well, that depends on a lot of factors.  Is there a tool that’s super easy to integrate, a tool that has the most reliable alerts, or a tool that most of IT is using?  To throw some buzzworthy phrases at you: “Don’t try to boil the ocean,” “Get the low-hanging fruit,” and “What gets you the most bang for your buck.”  Simply put: start small since you can always expand, knock out the easy stuff if it makes sense, but ultimately go after what provides the greatest benefit with the least effort.  I will leave that decision to you, but I will tell you what we did to get where we are today.

       

      We started with a single tool that was hosting a lot of our monitoring.  There was no out-of-the-box connector for it in ServiceNow, and the tool was sending emails to open incidents.  We found that while we were unable to pull data from that tool using an API call, it did support pushing the data out through an API.  We built out the integration so that when the tool generates an alert, it sends it to the ServiceNow event table via the REST API.  That allowed me to then build rules on how to handle these events.
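
      To give a feel for what that push looks like, here is a minimal sketch in Node-style JavaScript.  It assumes the standard inbound Event API endpoint (/api/global/em/jsonv2) and a basic-auth integration account; the instance name, credentials, and field values are placeholders, so check what your version of ServiceNow actually exposes before reusing any of it.

      // Minimal sketch: push one event from a monitoring tool into the
      // ServiceNow event table via the inbound Event API.  Instance name,
      // credentials, and field values are placeholders.
      const instance = 'https://yourinstance.service-now.com';
      const auth = Buffer.from('integration.user:password').toString('base64');

      async function sendEvent() {
        const payload = {
          records: [{
            source: 'MonitoringToolX',            // which tool raised it
            node: 'appserver01.example.com',      // hostname, used for CI binding
            type: 'Server',                       // should line up with a CMDB class
            resource: 'CPU',                      // what on the node has the issue
            severity: '2',                        // numeric, 1 (critical) through 5 (info)
            message_key: 'appserver01.cpu.high',  // same key on the trigger and the clear
            description: 'CPU above 95% for 10 minutes',
            additional_info: JSON.stringify({ threshold: 95, value: 98 })
          }]
        };

        const res = await fetch(`${instance}/api/global/em/jsonv2`, {
          method: 'POST',
          headers: {
            'Content-Type': 'application/json',
            'Authorization': `Basic ${auth}`
          },
          body: JSON.stringify(payload)
        });
        console.log('Event API responded with HTTP', res.status);
      }

      sendEvent().catch(console.error);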

       

      The next decision was to integrate with SCOM.  There was an out-of-the-box integration for this one, and it was in place quickly.  The same process followed: cleaning up noisy events and building rules to provide better alerts.

       

      Various other tools used an email integration to Event Management.  The emails were sent as plain text in a JSON format.  These were fairly easy to set up, but not a preferred method because they add points of failure: our (or the tool vendor's) email servers and ServiceNow's email servers.
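
      For what it's worth, the body of one of those emails looked roughly like the object below; the field names are illustrative rather than the exact schema we used, and an inbound email action on the ServiceNow side parses the JSON into an event record.

      // Sketch of the plain-text JSON body an email-based tool would send.
      // Field names are illustrative; the receiving inbound action decides
      // how they map onto the event table.
      const emailBody = JSON.stringify({
        source: 'LegacyToolY',
        node: 'dbserver02.example.com',
        resource: 'Disk C:',
        severity: '3',
        message_key: 'dbserver02.disk.c.space',  // must match on the clear email (more on this below)
        description: 'Free space below 10%'
      }, null, 2);
      console.log(emailBody);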

       

      Next in line was SolarWinds.  SolarWinds was interesting for two reasons.  The first was that there was an out-of-the-box connector.  The second was that SolarWinds had a plugin for ServiceNow integration.  What I came to find was that the plugin was for incident creation, not event creation, and the out-of-the-box connector worked but needed tweaking.

       

      I will explain more later in the lessons learned section.  We found a few issues along the way, but it came together nicely.  I am now able to build reports for teams that show their health as reported from all tools, associate devices in a change window with their alerts and mark them as in maintenance, build a customer experience dashboard, and (thanks to the work of our CMDB guy) feed these alerts to service maps.

       

      What is the plan moving forward?

      There are other features we haven’t started playing with yet.  Operational Intelligence is the feature I am most interested in pursuing.  This is the portion of the suite that collects metric data from your tools, looks for anomalies, and alerts proactively based on machine learning.

       

      What lessons did you learn?

      Integrating alerts was not as simple as we originally thought it would be.  I want to go over the lessons learned for each of the tools and end with what we learned about the ServiceNow platform itself.  Hopefully sharing this information will save you from the same issues when you do your integration.

       

      Having Vistara send the events via an API call to ServiceNow worked well.  We had a few instances of their API service dying, but over two years that’s not bad.  When we started receiving the events from Vistara in ServiceNow, we found many of them weren’t actionable and built rules to silence them.  The remaining events then became alerts.  For the alerts, I made a different set of rules that provided additional information about the alert for our EOC (Enterprise Operations Center).  That information could be things like a knowledge base article that tells them how to fix the issue, who to contact, whether this should become an incident, what the severity of the incident should be, and much more.
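
      To illustrate the idea, here is plain JavaScript showing the kind of lookup logic those rules encode; it is not the actual ServiceNow rule configuration, and the KB numbers and group names are made up.

      // Sketch of the enrichment our alert rules perform for the EOC: given
      // the alert's resource, attach a KB article, a contact group, and
      // guidance on whether an incident should be created and at what severity.
      const runbook = {
        'CPU':       { kb: 'KB0010123', contact: 'Server Team',  createIncident: true,  incidentSeverity: 3 },
        'Heartbeat': { kb: 'KB0010456', contact: 'Network Team', createIncident: true,  incidentSeverity: 2 },
        'Disk':      { kb: 'KB0010789', contact: 'Storage Team', createIncident: false }
      };

      function enrich(alert) {
        const guidance = runbook[alert.resource] || {};
        return { ...alert, ...guidance };  // the EOC sees the KB link, contact, and incident guidance on the alert
      }

      console.log(enrich({ node: 'appserver01', resource: 'CPU', severity: 2 }));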

       

      SCOM was a bit of a pain.  We found that with our instance there were two different places we had to build out the integration.  To pull the alerts from SCOM, we had the connector hit one of our web servers.  The metric data was not accessible from the same server, and that connector had to point directly at the DB server.  This worked well until security locked down the ports and we couldn’t connect anymore.  The alerting has since been moved to an email integration to work around the security “features” blocking our API connection, and we had to disable the metrics collection altogether.

       

      The email integrations are a stopgap until the monitoring these tools provide is moved to SolarWinds.  The plus side is that they are easy to customize and quick to set up, but the flip side is that they have additional points of failure.  Another issue we encountered was getting the clear messages to work.  This comes down to the message key.  The message key is what Event Management uses to separate different occurrences of the same issue and to associate clear messages with the triggered alert.  If you run into this issue, work with the team sending the email to build a unique message key for their alerts.
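
      As a sketch of what “associating properly” means: the trigger and its clear have to carry the same message key.  The convention we rely on is sending the clear with severity 0, but confirm how your instance and rules are set up before assuming that.

      // A trigger event and its clear sharing one message key, so Event
      // Management closes the right alert.  Severity '0' as the clear signal
      // is our convention; verify it against your own configuration.
      const messageKey = 'dbserver02.disk.c.space';

      const trigger = {
        source: 'LegacyToolY',
        node: 'dbserver02.example.com',
        resource: 'Disk C:',
        severity: '3',
        message_key: messageKey,
        description: 'Free space below 10%'
      };

      const clear = {
        ...trigger,
        severity: '0',                            // clears the open alert
        description: 'Free space back above threshold'
      };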

       

      The SolarWinds connector pulls data from the event table in SolarWinds instead of the alerts.  Events in SolarWinds can be triggered either by the thresholds assigned to the machine directly or by something forcing the system to write to the event log.  This means you will need to add a filter to hide the noise you were already filtering out by setting up alert actions in SolarWinds.  One of the ways we combatted that was to block any event without an eventType of 5000 or 5001.  Event type in SolarWinds is a number that identifies what triggered the event.  A 5000 event says that an alert rule caused an entry to be written to the event log.  A 5001 event says the issue is cleared.  That simple change in ServiceNow stopped over 9,000 additional noisy alerts per day.  The biggest thing we found is that the “swEventId” does not make a good message key.  This forced us to create our own message key using the initial event time field.  For example, this JSON piece: {“initial_event_time”:”5/19/2018 16:43:00”, "netObjectId":"10053"} becomes a message key of 2018.170.16.43.00.10053
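
      Roughly, the transform looks like the sketch below, whether you build it into an event rule’s composed key or a script.  This is plain JavaScript illustrating the logic only, and the year.day-of-year.time layout is just what we settled on; use whatever your team standardizes, as long as the trigger and clear produce the same key.

      // Sketch: build a message key from initial_event_time and netObjectId
      // because swEventId is not stable enough to correlate on.
      function buildMessageKey(additionalInfo) {
        // e.g. {"initial_event_time":"5/19/2018 16:43:00","netObjectId":"10053"}
        const info = JSON.parse(additionalInfo);
        const t = new Date(info.initial_event_time);

        const startOfYear = new Date(t.getFullYear(), 0, 0);
        const dayOfYear = Math.floor((t - startOfYear) / 86400000);
        const pad = (n) => String(n).padStart(2, '0');

        return [
          t.getFullYear(),
          dayOfYear,
          pad(t.getHours()),
          pad(t.getMinutes()),
          pad(t.getSeconds()),
          info.netObjectId
        ].join('.');
      }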

       

      ServiceNow has several connectors out of the box; however, I would still recommend having someone who can program in JavaScript go through the default connector or build your own connector.  I would not change the default connector, but instead make a copy of it if you want to make changes.  Code patches could overwrite your changes to a default connector definition.  Here are a few other quick hints (a field-mapping sketch follows the list):

      • Map out which fields you want to use on the alert form before you start
        • Default for node is the hostname
        • Default for resource is what on the device/application has an issue (e.g., CPU for a high CPU alert)
        • Type needs to be something in the CMDB (e.g., server/application/network)
        • Severities need to be in number format (1-5, Exception to Informational)
        • Custom fields can be added to the alert form
      • Event and Alert Rules can let you change the entire alert message and field data
      • Technical Services are for things like Exchange, not a custom monitor
      • Discovered Services are for service maps
      • Manual Services can be used for custom monitored services
      • If you plan to use the dashboard to display the health of your services, ensure that all services use the default numbered Business criticality values; non-standard criticalities will break the dashboard display
      • Link KB articles to alerts to provide instructions for handling the alert or how to fix the issue
      • You can build out automation around what to do with alerts
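
      Tying those hints together, here is one example event showing the field mapping we aim for; the values, and the custom fields inside additional_info, are hypothetical.

      // One example event, mapped per the list above: node = hostname,
      // resource = what has the issue, type = a CMDB class, numeric severity,
      // and custom fields carried in additional_info.
      const exampleEvent = {
        node: 'appserver01.example.com',
        resource: 'CPU',
        type: 'Server',
        severity: '1',
        message_key: 'appserver01.cpu.high',
        additional_info: JSON.stringify({
          u_contact_group: 'Server Team',   // hypothetical custom field on the alert form
          kb_article: 'KB0010123'           // hypothetical, per the KB-linking hint above
        })
      };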

       

      Where does this leave us?

      The ITOM suite has provided us a wealth of information to improve our services, improve our ability to be first to know, and identify trends to avoid issues in the future.  While we encountered some issues when we began this journey, the destination was well worth the trouble.  The key takeaway: planning what data you want to collect and how you want to use that information is critical to success.

       

       

      Have any of you run into these or other issues?  Do you have any other suggestions/comments/concerns?