Regular Checkups: Is your NMS still loved?

A finely tuned NMS is at the heart of a well-run network. But it’s easy for an NMS to fall into disuse, and sometimes that happens slowly, without you realizing it. You need to keep re-evaluating to make sure the NMS is still delivering.

Regular Checkups

Consultants like me move between different customers. When you go back to a customer site after a few weeks or months, you can see step changes in behavior. This usually makes it obvious whether people are using the system or not.


If you're working on the same network for a long time, you can get too close to it. If behaviors change slowly over time, they can be difficult to detect. If you use the NMS every day, you know how to get what you want out of it. You think it's great. But casual users might struggle to drive it. If that happens, they'll stop using it.


If you're not paying attention, you might find that usage is declining, and not realize it until it’s too late. You need to periodically take an honest look at wider usage, and see if you're seeing any of the signs of an unloved NMS.

Signs of an unloved NMS

Here are some of the signs of an unloved NMS. Keep an eye out for these:

  1. Too many unacknowledged alarms
  2. No-one has the NMS screen running on their PC - they only log in when they have to
  3. New devices are not being added

What if things aren't all rosy?

So what if you've figured out that maybe your NMS isn't as loved as you thought? What now? First, don't panic. It's recoverable. Things you can do include:

  1. Talk. Talk to everyone. Find out what they like, and what’s not working. It might just be a training issue, or it might be something more. Maybe you just need to show them how to set up their homepage to highlight key info.
  2. Check your version. Some products are evolving quickly. Stay current, and take advantage of the new features coming out. This is especially important with usability enhancements.
  3. Check your coverage: Are you missing devices? Are you monitoring all the key elements on those devices? Keep an ear to the ground for big faults and outages: Is there any way your NMS could have helped to identify the problem earlier? If people think that the NMS has gaps in its coverage, they won't trust it.
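
Making that coverage check repeatable is easy to script. Below is a minimal Python sketch, assuming you can export your device inventory and the NMS node list to two CSV files; the file names and the "hostname" column are assumptions for the example, not taken from any particular product.

    # coverage_check.py - compare a device inventory against monitored nodes.
    # Assumes two CSV exports (file names and columns are illustrative):
    #   inventory.csv - column "hostname": every device you believe you own
    #   nms_nodes.csv - column "hostname": every node the NMS is monitoring
    import csv

    def load_hostnames(path, column="hostname"):
        """Return a set of lower-cased hostnames from one column of a CSV file."""
        with open(path, newline="") as f:
            return {row[column].strip().lower() for row in csv.DictReader(f) if row.get(column)}

    inventory = load_hostnames("inventory.csv")
    monitored = load_hostnames("nms_nodes.csv")

    missing = sorted(inventory - monitored)   # devices the NMS knows nothing about
    stale = sorted(monitored - inventory)     # monitored nodes missing from the inventory

    print(f"{len(missing)} devices are not monitored:")
    for host in missing:
        print(f"  {host}")
    print(f"{len(stale)} monitored nodes are not in the inventory:")
    for host in stale:
        print(f"  {host}")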

All NMS platforms have a risk of becoming shelfware. Have you taken a good honest look recently to make sure yours is still working for you? What other signs do you look for to check if it's loved/loathed/ignored? What do you do if you think it might be heading in the wrong direction?

  • Perhaps some rules to live by:

    1.  Keep NCM up to date

    2.  Learn the product, then learn how the product can accommodate more customers.  One of the best things I've done is offer monitoring services to other departments, especially non-IT staff.  They had no idea they could be notified automatically when systems they work with are unavailable.  I've made some great friends and allies this way.

    3.  Share your dreams with others, show them how you're accomplishing them via Orion tools, use those pretty reports and stats and trends to prove your value. 

    And then get ready for some attention from folks who can change your life.

  • This sounds like a solid way to keep the NMS in front of the team and keep it performing well.  I may see if something like this will work in my team.  We are just reaching the point where we want to increase our monitoring capabilities.

  • If you are interested in some of the things I do to accomplish this in my shop, read on...

    First, I have monthly meetings with the teams to review new developments within the NMS, as well as to understand what changes they would like to see and what training they feel they need.  This was mentioned by Jfrazier below, although I found in my shop that meeting more often than monthly was too much interruption to the teams' schedules.  But not only do I provide the forum, I also encourage the team to contact me directly if anything comes up outside of the meetings.  Many Engineers just don't like talking about certain topics in a meeting setting and respond better to an open-door policy.  In any case, without a platform for discussion, there is a lot you may not know that could be hurting the success of your NMS.

    Second, I added a notification service to my overall NMS solution which allows only targeted Engineers to receive the alerts.  If they don't respond within a given period, the alert rolls to the next Engineer.  If it is a Critical alert and no one responds within a given period of time, a Manager is notified, with the same sort of roll to a secondary Manager.  If an alert, at the Engineer or Manager level, is acknowledged, the acknowledgement times out and re-triggers after a given period to ensure it is not being ignored.  This service also lets each Engineer choose how, and how often, they are notified, and change that whenever they wish.  They are in control of how they are notified and thus more invested.
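
    For anyone curious what that escalation flow looks like, here is a rough Python sketch of the logic: targeted notification, roll-over when there is no acknowledgement, and Manager escalation for Critical alerts.  The names, timeout, and the notify()/acknowledged() stubs are all invented for the example, not any particular product's API.

        # escalation_sketch.py - illustrative alert escalation with acknowledgement timeouts.
        import time

        ENGINEERS = ["alice", "bob"]     # primary, then secondary (hypothetical names)
        MANAGERS = ["carol", "dave"]     # manager escalation chain for critical alerts
        ACK_TIMEOUT = 15 * 60            # seconds to wait for an acknowledgement

        def notify(person, alert):
            """Stub: send the alert however this person prefers (email, SMS, chat...)."""
            print(f"notifying {person}: {alert['summary']}")

        def acknowledged(person, alert):
            """Stub: ask the alerting system whether this person acknowledged the alert."""
            return False

        def escalate(alert):
            chain = list(ENGINEERS)
            if alert["severity"] == "critical":
                chain += MANAGERS            # critical alerts roll on to the Managers
            for person in chain:
                notify(person, alert)
                time.sleep(ACK_TIMEOUT)      # a real service would wait asynchronously
                if acknowledged(person, alert):
                    return person            # someone owns it; stop escalating
            return None                      # nobody acknowledged: flag for review

        escalate({"summary": "core-switch-1 down", "severity": "critical"})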

    Third, I do everything possible to eliminate the noise by using Node Dependencies and also by building in dependency scenarios for node-level components, like applications running on top of other applications or login ports.  Getting alerts for multiple components all related to the same cause does nothing to help in getting a speedy resolution to that cause, let alone maintaining the credibility of your NMS.
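
    To make the dependency idea concrete, here is a small Python sketch of suppressing child alerts when a parent is already down.  The topology and failure data are invented for illustration; a real NMS would do this through its own dependency features rather than a script like this.

        # dependency_suppression.py - alert only on the root cause, not every child.
        # The dependency map and the set of failed components are invented examples.
        DEPENDS_ON = {
            "web-app": "app-server",      # the web app runs on the app server
            "login-port": "app-server",   # so does the login port
            "app-server": "edge-switch",  # the app server sits behind the edge switch
        }

        def root_cause(component, down):
            """Walk up the dependency chain and return the highest failed component."""
            cause = component
            parent = DEPENDS_ON.get(component)
            while parent is not None and parent in down:
                cause = parent
                parent = DEPENDS_ON.get(parent)
            return cause

        down = {"edge-switch", "app-server", "web-app", "login-port"}
        alerts = {root_cause(c, down) for c in down}
        print("alert only on:", alerts)   # -> {'edge-switch'}: one alert instead of four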

    Fourth, I have found that Engineers too often forget to inform the NMS Admin of new components that have been added.  That lack of monitoring can go on for years before something happens with a component that should have been caught by the NMS - a black eye for the NMS, even if responsibility is clearly defined as the Engineer's rather than the NMS Admin's.  I am now building monitors that keep track of the component types that should be monitored on a node and compare them to the components actually being monitored.  If these lists don't match for more than a given period of time, depending upon importance, the difference is alerted to the appropriate team member.  If they don't act on it within a second period of time, the NMS Admin is alerted to ensure something is done to get that component added.
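
    Here is a rough Python sketch of that comparison, assuming you can pull, for each node, both the component types that should be monitored and those actually being monitored (for example from a standards document or CMDB export plus the NMS's own inventory).  The node names, component lists, and the alerting step are purely illustrative.

        # component_drift.py - flag nodes whose monitored components don't match expectations.
        # The expected/actual data below is invented; in practice it would come from
        # a CMDB or standards list and from the NMS's own inventory.
        EXPECTED = {
            "db-server-1": {"cpu", "memory", "disk", "sql-service"},
            "web-server-1": {"cpu", "memory", "disk", "http-port"},
        }
        ACTUAL = {
            "db-server-1": {"cpu", "memory", "disk"},               # sql-service never added
            "web-server-1": {"cpu", "memory", "disk", "http-port"},
        }

        def drift_report(expected, actual):
            """Return {node: missing components} for every node with a monitoring gap."""
            report = {}
            for node, should_have in expected.items():
                missing = should_have - actual.get(node, set())
                if missing:
                    report[node] = sorted(missing)
            return report

        for node, missing in drift_report(EXPECTED, ACTUAL).items():
            # In the workflow described above, this would notify the responsible
            # Engineer first, then the NMS Admin if nothing changes in time.
            print(f"{node} is missing monitors for: {', '.join(missing)}")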

  • I believe that in order for an NMS to be successful you need to target the right staff, keep them engaged in the NMS, keep monitored components current and eliminate the noise.  Those should be top goals on any NMS Administrator's list and are at the top of mine.  Without the discipline to keep that focus, you start to lose the audience.

    If you are following the above, having an Engineer ignore an alert for any reason is just not acceptable.  As an NMS team member, they need to be engaged in either resolving the reported issue or explaining to the NMS Administrator why they should not be getting an alert for that particular topic.

    To the point lindsayhill mentioned, sometimes that flexibility is not available.  But...  any NMS Admin worth their salt needs to at least make sure the decision makers understand the impact that not taking these steps, and not having these goals, will have on the overall success of the significant investment the company has made to get to the point of having a functioning NMS.

    IMHO

  • Terry Auspitz wrote:

    Wow. I'm not a manager, but if I were, that would result in, at the very least, a closed-door meeting with the engineer to discuss why that was unacceptable.

    I understand the sentiment, but I'd probably start from a gentler position. Usually people take action like that simply to try and manage the volume of irrelevant email they receive.

    Sometimes people setting up monitoring systems think (or get told) "Let's email the team whenever a device goes down!" The problem is that most of the team isn't responsible for doing anything about that - e.g. it might be the Operations team, or the on-call person, that needs to do something. Everyone else just ignores those emails, and eventually they auto-delete them.

    The other problem is that when you send notifications to an email list, making it "everyone's" responsibility, no-one takes ownership of it. E.g. people often tell me "Send an email to the whole team if the disk is getting full, and send an SMS to the on-call engineer if it gets critically full." The problem is that those emails just get ignored, because no-one is responsible for them. Instead everyone does nothing, then it becomes critical, and the on-call person has to deal with it.

    These days I'm a fan of using some sort of alert management/escalation tool to better tune notifications. That gives you much better control. It's also easier than trying to update the 20 different cron scripts that send email to the wrong team...
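
    As a small illustration of what better-tuned notifications can look like compared to emailing the whole team, here is a short Python sketch of routing rules that send each alert to whoever is actually responsible.  The rules, targets, and alert fields are invented for the example and not drawn from any specific tool.

        # alert_routing.py - route each alert to one responsible target, not a team list.
        # Rules and recipients are invented for illustration; first match wins.
        ROUTES = [
            (lambda a: a["severity"] == "critical", "sms:on-call-engineer"),
            (lambda a: a["source"] == "storage", "email:storage-team-queue"),
            (lambda a: True, "ticket:ops-backlog"),   # default: low-priority ticket
        ]

        def route(alert):
            """Return the single target responsible for acting on this alert."""
            for matches, target in ROUTES:
                if matches(alert):
                    return target

        print(route({"severity": "critical", "source": "network"}))  # sms:on-call-engineer
        print(route({"severity": "warning", "source": "storage"}))   # email:storage-team-queue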
