Regular Checkups: Is your NMS still loved?

Level 9

A finely-tuned NMS is at the heart of a well-run network. But it’s easy for an NMS to fall into disuse. Sometimes that can happen slowly, without you realizing. You need to keep re-evaluating things, to make sure that the NMS is still delivering.

Regular Checkups

Consultants like me move between different customers. When you go back to a customer site after a few weeks or months, you can see step changes in behavior. This usually makes it obvious whether people are using the system or not.


If you're working on the same network for a long time, you can get too close to it. If behaviors change slowly over time, it can be difficult to detect. If you use the NMS every day, you know how to get what you want out of it. You think it's great. But casual users might struggle to drive it. If that happens, they'll stop using it.


If you're not paying attention, you might find that usage is declining, and you don't realize until it’s too late. You need to periodically take an honest look at wider usage, and see if you're seeing any of the signs of an unloved NMS.

Signs of an unloved NMS

Here are some of the signs of an unloved NMS. Keep an eye out for these:

  1. Too many unacknowledged alarms
  2. No-one has the NMS screen running on their PC - they only log in when they have to
  3. New devices not added
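The first sign is easy to put a number on. Here is a minimal sketch, assuming you can export alarms from your NMS as records with a raised time and an acknowledged flag. The field names and the three-day threshold are assumptions for illustration, not any product's schema.

```python
from datetime import datetime, timedelta

def stale_unacknowledged(alarms, now, max_age=timedelta(days=3)):
    """Return alarms that are still unacknowledged and older than max_age."""
    return [
        alarm for alarm in alarms
        if not alarm["acknowledged"] and now - alarm["raised"] > max_age
    ]

# Hypothetical export from the NMS; the field names are assumptions.
now = datetime(2024, 1, 10)
alarms = [
    {"id": 1, "raised": datetime(2024, 1, 1), "acknowledged": False},
    {"id": 2, "raised": datetime(2024, 1, 9), "acknowledged": False},
    {"id": 3, "raised": datetime(2024, 1, 2), "acknowledged": True},
]
stale_ids = [alarm["id"] for alarm in stale_unacknowledged(alarms, now)]
```

If that list keeps growing from one checkup to the next, people have probably stopped looking.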

What if things aren't all rosy?

So what if you've figured out that maybe your NMS isn't as loved as you thought? What now? First, don't panic. It's recoverable. Things you can do include:

  1. Talk. Talk to everyone. Find out what they like, and what’s not working. It might just be a training issue, or it might be something more. Maybe you just need to show them how to set up their homepage to highlight key info.
  2. Check your version. Some products are evolving quickly. Stay current, and take advantage of the new features coming out. This is especially important with usability enhancements.
  3. Check your coverage. Are you missing devices? Are you monitoring all the key elements on those devices? Keep an ear to the ground for big faults and outages: Is there any way your NMS could have helped to identify the problem earlier? If people think that the NMS has gaps in its coverage, they won't trust it.

All NMS platforms have a risk of becoming shelfware. Have you taken a good honest look recently to make sure yours is still working for you? What other signs do you look for to check if it's loved/loathed/ignored? What do you do if you think it might be heading in the wrong direction?

23 Comments
Level 15

Vote up the idea of AppInsight for Orion or AppInsight for Solarwinds

Level 17


Very good points. The 3 signs that your NMS is not loved anymore could not be more spot on.

MVP

I wish I knew why some engineers are drawn to tools like Orion NPM while others view them as purely operational. I keep a tab open for each monitoring tool we have, even though I'm not responsible for monitoring (at least not on paper). Who doesn't do that, right?

We have a few dozen tools in the environment, and I bet half of them are in major need of care and feeding. Cleaning up the old alerts, as you mentioned, is HUGE. People get used to seeing alerts if they aren't cleared within a few days. Then you get so used to seeing red that legitimate alerts get lost. It's the old signal-to-noise ratio problem that we've been dealing with for... ever.

If you want to know what tools are loved, ask your coworkers if they have any rules set up in Outlook to auto-delete or auto-sort messages from the NMS. Easy.

Level 10

I think #3 might be most important. Devices not getting added or removed, or slipping through the cracks, will devalue any NMS very quickly.

Level 9

Yeah... auto-discovery can help, but it's not perfect. In a small-to-medium organisation, you can keep your ear to the ground and hear about projects deploying new gear, but in bigger shops it's hard.

Level 9

michael stump wrote:

If you want to know what tools are loved, ask your coworkers if they have any rules set up in Outlook to auto-delete or auto-sort messages from the NMS. Easy.

Oh I love this one - that's perfect!

Level 15

And don't forget lots of audit tools.

Currently adding a custom property in IPAM which gets auto populated with nodeID or na to indicate it's not meant to be monitored. Everything else gets reported.

Level 9

Sounds like a good approach. Let us know how it's been working in practice after a few months.

Level 15

Already in use for, say, 6 months as a live report joining IPAM and NPM by IP address. This is v2, where we have a method for noting either the node ID or that it's to be ignored.

Another level of reporting has been around for years and provides a view of what is (NMS) vs what should be (CMDB).
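For anyone who wants to try the same idea, here's a rough sketch of the join-by-IP logic in Python. It assumes you can export both lists as plain records. The `monitor_ref` field name and the record layout are made up for illustration and are not the actual IPAM or NPM schema.

```python
def reconcile(ipam_records, npm_nodes):
    """Fill in each IPAM record's monitoring reference and report the gaps.

    Records marked "na" are deliberately unmonitored and are skipped.
    Records whose IP matches an NMS node get that node's ID.
    Everything else is returned for follow-up.
    """
    node_id_by_ip = {node["ip"]: node["node_id"] for node in npm_nodes}
    unaccounted = []
    for record in ipam_records:
        if record.get("monitor_ref") == "na":
            continue  # explicitly flagged as not meant to be monitored
        node_id = node_id_by_ip.get(record["ip"])
        record["monitor_ref"] = node_id
        if node_id is None:
            unaccounted.append(record["ip"])
    return unaccounted

# Hypothetical exports; the field names are assumptions.
ipam = [
    {"ip": "10.0.0.1"},
    {"ip": "10.0.0.2", "monitor_ref": "na"},
    {"ip": "10.0.0.3"},
]
npm = [{"ip": "10.0.0.1", "node_id": 101}]
missing = reconcile(ipam, npm)
```

Here `missing` holds the addresses that are neither monitored nor flagged as ignorable, which is the list that gets reported.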

Level 15

I wander around the office and what do I find? I am the only one with the NMS open. The others only open it when they need something. I am working on the training side to change that. Also, if nodes are not kept current, then it quickly falls apart. Good post!

Level 13

Excellent points. I'm working on a new deployment currently, and will be sure to share this.

Level 13

Another issue I've directly observed is that of knowledge drain.

If the teams using or managing the NMS have too much turnover then the knowledge of what the NMS is looking at and why can be lost.  Over time the knowledge may be less useful to the newer teams and thus the NMS becomes deprioritized, alerts get auto-routed to spam folders, node change/add/delete stops occurring, and the product withers on the vine.

BTW - A product upgrade is a great opportunity to get teams re-educated and re-engaged.

MVP

Unfortunately I know a lot of engineers who have filters sending it straight to deleted. Or they have a filter for the NMS but never look at the folder the messages are sent to.

MVP

One of the biggest issues I've come across is the "build and forget" deployment. The customer purchased SolarWinds years ago, got the system up and running, loaded up all their devices, configured alerting, etc.... and then nothing. New devices are added, old devices removed, IP addresses updated, etc., but no-one updates the NMS. That's usually the point where we get a call from their management asking us to come in and help clean it up, or to explain why they should stick with the products.

If it becomes part of the operational process that whenever a change is made, a change is made to the NMS, this system has a lot less chance of becoming useless - or as they say, "garbage in, garbage out".

Level 13

Wow. I'm not a manager, but if I were, that would result in, at the very least, a closed-door meeting with the engineer to discuss why that was unacceptable.

When an alert is received, one of two things has happened:

  1. The alert indicates a system problem, and that system needs attention.
  2. The alert is "noise," in which case the alert needs to be tuned to trigger only when there is a system problem, or deactivated if it's just not useful at all.

In either case, I would expect the engineer to explain to me why they thought that ignoring or deleting the alert was a better course of action than the ones listed above.

Level 15

Or I have been part of a system where the NOC gets the alerts but fails to escalate them to the engineers. This is the opposite side of tuning. The problem then is that the engineers don't know something is wrong and fail to act.

Level 13

I've always been an advocate of sending the alerts directly to those responsible to act upon them.

MVP

Your "Talk, Talk, Talk" point is key here. I set up bi-weekly meetings with various groups just for that purpose. It becomes a forum where we can let each party know about future plans, what is headed the other's way, what we can do to help them do their job better, that sort of thing.

Level 9

Terry Auspitz wrote:

Wow. I'm not a manager, but if I were, that would result in, at the very least, a closed-door meeting with the engineer to discuss why that was unacceptable.

I understand the sentiment, but I'd probably start from a gentler position. Usually people take action like that simply to try and manage the volume of irrelevant email they receive.

Sometimes people setting up monitoring systems think (or get told) "Let's email the team whenever a device goes down!" The problem is that most of the team isn't responsible for doing anything about that - e.g. it might be the Operations team, or the on-call person, that needs to do something. Everyone else just ignores those emails, and eventually they auto-delete them.

The other problem is that when you send notifications to an email list, when it's "everyone's" responsibility, then no-one takes ownership of it. E.g. people often tell me "Send an email to the whole team if the disk is getting full, and send an SMS to the on-call engineer if it gets critically full." The problem is that those emails just get ignored, because no-one is responsible for them. Instead everyone does nothing, and then it becomes critical, and the on-call person has to deal with it.

These days I'm a fan of using some sort of alert management/escalation tool, to better tune notifications. That gives you much better control. It's also easier than trying to update the 20 different cron scripts that send email to the wrong team...

Level 12

I believe that in order for an NMS to be successful you need to target the right staff, keep them engaged in the NMS, keep monitored components current and eliminate the noise. Those should be top goals on any NMS Administrator's list and are at the top of mine. Without the discipline to keep that focus, you start to lose the audience.

If you are following the above, having an Engineer ignore an alert for any reason is just not acceptable. As an NMS team member, they need to be engaged in either resolving the reported issue or explaining to the NMS Administrator why they should not be getting an alert for the particular topic.

To the point lindsayhill mentioned, sometimes the flexibility above is not available. But any NMS Admin worth their salt needs to at least make sure the decision makers understand the impact that not taking these steps, or not having these goals, will have on the overall success of the significant investment the company has made to get to the point of having a functioning NMS.

IMHO

Level 12

If you are interested in some of the things I do to accomplish this in my shop, read on...

First, I have monthly meetings with the teams to review new developments within the NMS, as well as to understand what changes they would like to see and what training they feel they need. This was mentioned by Jfrazier below, although I found in my shop that more often than monthly was too much of an interruption to the teams' schedules. I don't just provide the forum, though. I also encourage the team to contact me directly if anything comes up outside of the meetings. Many Engineers just don't like talking about certain topics in a meeting setting and respond better to the open door policy. In any case, without a platform for discussion, there is a lot you may not know that could be hurting the success of your NMS.

Second, I added a notification service to my overall NMS solution which allows only targeted Engineers to receive the alerts. If they don't respond within a given period, the alert rolls to the next Engineer. If it is a Critical alert and no one responds within a given period of time, a Manager is notified, with the same sort of roll to a secondary Manager. If an alert, Engineer or Manager level, is acknowledged, the acknowledgement times out and re-triggers after a given period to ensure they are not ignoring it. This service also allows the Engineer to choose how, and how often, they are notified, as well as the ability to change that whenever they wish. They are in control of how they are notified and thus more invested.
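To make the roll-over idea concrete, here is a minimal sketch of that escalation logic. It is not the notification service itself. The function names, the 15-minute step and the 60-minute acknowledgement timeout are all assumptions chosen for illustration.

```python
def current_recipient(minutes_unacknowledged, engineers, managers,
                      critical, step_minutes=15):
    """Pick who should be notified for an alert nobody has acknowledged.

    The alert rolls to the next person every step_minutes. Managers are
    only appended to the chain for critical alerts. Once the end of the
    chain is reached, the last person stays on the hook.
    """
    chain = list(engineers) + (list(managers) if critical else [])
    index = min(minutes_unacknowledged // step_minutes, len(chain) - 1)
    return chain[index]

def needs_renotify(minutes_since_ack, ack_timeout_minutes=60):
    """An acknowledgement expires so an acked alert cannot sit ignored."""
    return minutes_since_ack >= ack_timeout_minutes

# Hypothetical rota.
engineers = ["alice", "bob"]
managers = ["carol", "dave"]
first = current_recipient(0, engineers, managers, critical=False)
rolled = current_recipient(20, engineers, managers, critical=False)
escalated = current_recipient(35, engineers, managers, critical=True)
```

With this rota, a non-critical alert never leaves the engineers, while a critical one reaches the first manager after two missed steps.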

Third, I do everything possible to eliminate the noise by using Node Dependencies and also by building in dependency scenarios for node level components (see and vote for ‌) like applications running on top of other applications or login ports.  Getting alerts for multiple components all related to the same cause does nothing to help in getting a speedy resolution to that cause let alone maintaining the credibility of your NMS.

Fourth, I have found that Engineers too often forget to inform the NMS Admin of new components added. This lack of monitoring will sometimes go on for years before something happens with that component that should have been caught by the NMS (black-eye). Even if it's not the NMS Admin's eye, because responsibility is clearly defined as being the Engineer's, it's still a black-eye on the NMS. I am now building monitors that keep track of the component types monitored on a node and then compare them to the actual monitored components. If these lists don't match for more than a given period of time, depending upon importance, the difference is alerted to the appropriate team member. If they don't act on it within a second period of time, the NMS Admin is alerted to ensure something is done to get that component added.
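The comparison itself can be sketched in a few lines. This is only an illustration of the idea, not the monitors I'm building. The data layout and both grace periods (7 days each) are assumptions.

```python
def component_gaps(expected_by_node, monitored_by_node):
    """Return, per node, the expected components that are not monitored."""
    gaps = {}
    for node, expected in expected_by_node.items():
        missing = set(expected) - set(monitored_by_node.get(node, ()))
        if missing:
            gaps[node] = sorted(missing)
    return gaps

def who_to_alert(days_mismatched, first_period_days=7, second_period_days=7):
    """Alert the team member after the first period, and the NMS Admin
    if the gap is still there after a further second period."""
    if days_mismatched >= first_period_days + second_period_days:
        return "nms_admin"
    if days_mismatched >= first_period_days:
        return "team_member"
    return None

# Hypothetical inventory: what each node should have vs what is monitored.
expected = {"db01": ["cpu", "disk", "sql"], "web01": ["cpu"]}
monitored = {"db01": ["cpu", "disk"], "web01": ["cpu"]}
gaps = component_gaps(expected, monitored)
```

In practice the periods would vary with the importance of the component, as described above.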

Level 15

This sounds like a solid way to keep the NMS in front of the team and keep it performing well. I may see if something like this will work in my team. We are just cresting that point where we want to increase our monitoring capabilities.

Perhaps some rules to live by:

1.  Keep NCM up to date

2.  Learn the product, then learn how the product can accommodate more customers.  One of the best things I've done is offer monitoring service to other departments.  Especially non-IT staff.  They'd no idea they could be notified automatically when systems they work with are unavailable.  I've made some great friends and allies this way.

3.  Share your dreams with others, show them how you're accomplishing them via Orion tools, use those pretty reports and stats and trends to prove your value. 

And then get ready for some attention from folks who can change your life.

About the Author
Lindsay is a network & security consultant based in New Zealand.