
Useful Alerts Help You Be Proactive

Level 10

In previous weeks, I have talked about running a well-managed network and about monitoring services beyond simple up/down reachability states. Now it's time to talk about coupling alerting with your detailed monitoring.


You may need to have an alert sent if an interface goes down in the data center, but you almost certainly don't want an alert if an interface goes down for a user's desktop. You don't need (or want) an alert for every event in the network. If you receive alerts for everything, it becomes difficult to find the ones that really matter in the noise. Unnecessary alerts train people to ignore all alerts, since those that represent real issues are (hopefully) few. Remember the story of the boy who cried wolf? Keep your alerts useful.


Useful alerts leverage your detailed monitoring to help you be proactive: they can let you know that a circuit is over some percentage of utilization, that a backup interface has gone down, or that a device is running out of disk space. Alerts like these help you better manage your network by resolving problems before they become an outage, or at least by allowing you to react to an outage more quickly. It's always nice to know something is broken before your users do, especially if they call and you can tell them you are already working on it.


What's your philosophy on alerts? What proactive alerts have helped you head off a problem before it became an outage?


45 Comments
Level 13

It's funny that you mention interfaces. Until now, we have not been monitoring interfaces because, previously, many switches were added with interfaces monitored that shouldn't have been. I spent all of last week cleaning this up, either by removing interfaces we don't particularly care about from monitoring, or by changing interfaces we want to monitor for errors, but not for up/down, to report as unplugged instead of down.

The reason we weren't alerting on interfaces was exactly the reason you described; the noise would have trained everyone (myself included) to ignore the signal.

Level 9

Alerts are great. Like you mentioned, when you get a call, being able to tell the caller you're already working on the issue is invaluable. Basic up/down alerts are very important, but even more important are the things you can't see, such as environmental alerts for a temperature threshold being violated or a fan tray going bad. Now with NPM's ability to monitor route changes, network admins can further strengthen their ability to troubleshoot and monitor their networks and alert on unwanted routing changes.

Level 12

Agree with naburleson, alerts are awesome. We monitor everything to try to head it all off as soon as possible; the less the managers and execs know, the better.

Level 9

I have our alerts set up to monitor node up/down status and server reboots. Although we do monitor some (not all) interfaces, we don't have any alerting set up for them. This is our next step in getting a managed network instead of a status-update network. I think our group has moved from just wanting to know whether something is on to how we can make it perform better -- kind of like keeping the network invisible so no one actually notices it is running. We never want to get to the point of everyone knowing the network is down. I'm looking forward to seeing how/what/why everyone is monitoring and alerting on. I'm sure I'll get lots of takeaways.

Level 11

I would say that currently we don't have proactive alerts, but this article has made me consider it more and more. We typically have critical alerts that blast our phones when major issues are taking place, and then informative alerts that annoy the heck out of us. I'm working on trimming down the informative alerts.

Jim

Level 17

Proactive alerts are more than useful, and this is why I enjoy cultivating trap alerts for some cases. There is a fine line to over-alerting; as Scott McDermott has indicated, crying wolf only gets you ignored, so when the real huffer-puffer comes through, your cries get no response until you have already pulled your hair out. The alerts that end up telling a story I reserve for myself, or for only a few key people who can understand the sequence; carefully placed, these are good. I do not need to know about every disconnect, but the major lines, even the redundant ones, need consistent and active alerting.

I like to use a breakdown of custom properties to group or label my nodes; by this property it is easy to squash the interface alert on my access layer nodes while leaving the uplinks to trigger an alert. Another easy way is to standardize your description or labeling: looking for a certain keyword, DIST or otherwise, to know that this is an important interface makes the alerting easy. Just document it and onboard your folks properly so they are aware of your standards.
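To make the keyword idea concrete, here is a rough sketch of the kind of check I mean (the keywords, roles, and function are all illustrative, not any particular NMS API):

```python
# Sketch: decide whether an interface-down event should alert, based on a
# naming convention (keyword in the description) or the role of the parent
# node. Keywords, roles, and example captions are hypothetical.
IMPORTANT_KEYWORDS = ("DIST", "CORE", "UPLINK")

def should_alert(interface_description: str, node_role: str) -> bool:
    """Alert on uplinks/distribution links; stay quiet on access-layer user ports."""
    if node_role.lower() == "access" and not any(
        kw in interface_description.upper() for kw in IMPORTANT_KEYWORDS
    ):
        return False  # access-layer user port: squash the alert
    return True

# Example: a user port on an access switch vs. its uplink
print(should_alert("Gi1/0/12 - user desktop", "access"))       # False
print(should_alert("Te1/1/1 - DIST uplink to core", "access"))  # True
```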

There are a million ways to reduce the noise; as Terry Auspitz has indicated, all that noise only trains you to ignore the little boy in the machine.

Level 9

We use Alerts in many different scenarios, but speaking of circuits, we use Alerts on older equipment to help us watch changes in state, because that helps us determine whether a chassis, interface, etc. is aging to the point that a replacement is necessary. It is important to note that we look at the change in state over time and whether the changes are accelerating. Additionally, we create a group of devices and also look at the incoming link, to ensure that it is the older piece of equipment--and not the incoming link, node, etc.--that is the root problem. Lastly, by looking at all interfaces over time, we are able to determine that it is a bad device/chassis, and not simply a bad interface.

This is not to say that we Alert on each change in state; rather, we only Alert on critical changes in state. However, we also run a report every N days or weeks--depending on the criticality of the piece of equipment and the type of Alert--which, taken together, helps us know what requires replacement and how soon we must replace it.

Great topic and I look forward to reading other responses and uses--for inclusion into our environment! (Thank you in advance.)

Level 12

Definitely agree that a sea of alerts causes all of them to be ignored.

I have several alerts configured that I use for reactive as well as proactive response.

I have alerts configured for all my nodes, as any of these being down would require a reactive response. They can be considered proactive when I receive one during non-business hours, which allows me to resolve the issue before business hours commence so users are not affected. The same goes for alerts on my critical interfaces.

Proactive monitoring of my disaster recovery traffic (interface status AND traffic thresholds) alerts me to trouble brewing that can be addressed before the issue escalates. Same concept for video surveillance traffic on my network.

Proactive alerts on EoL and EoW for devices definitely give me a better handle on aging equipment.

Summary reports on scheduled config backups might be considered proactive in a sense, since they constitute a state of readiness if a device were to fail; available configs mean a shorter recovery time.

I also have some monitoring alerts set up on the UPSs that power my IDFs. I am not responsible for the power supply on my network, but when I get an alert that a UPS is running on battery, at least I can contact the electrical department to look at the circuit before the UPS runs dead and the users are affected.

Level 14

I see a ton of comments about alerts for hardware related issues, but in my world, nothing is an outage unless a user is going to complain (or already has). For this reason, I tend to build my alerts from an IT Service/User Experience perspective. If a circuit/interface goes down, but I built my environment with redundancy, then the user is not impacted. So, while I get an alert for an "actionable task!", it is not critical or user impacting. However, if web services stop for our Oracle environment and users are unable to work, then I have a CRITICAL alert that needs immediate attention. So, monitoring hardware is only a piece of the puzzle, services are also very important.

My alerts are configured with a subject line that includes CRITICAL, MAJOR, MINOR or CLEARED.

  • CRITICAL means the users are adversely impacted - IT Service is unavailable
  • MAJOR means I have something that needs immediate attention, but users are not impacted - generally something redundant failed or network congestion
  • MINOR means something that is not redundant has failed, but it's a non-critical service
  • CLEARED means whatever was broken is back online

In every environment I've been in that has implemented monitoring and effective alerting, the entire environment is taken into consideration from end to end, and alerts are built based on this philosophy. I know in a world of silos this can be difficult, but in my experience this has been the most successful practice.
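If it helps picture the convention, a tiny sketch of the idea (the prefixes are the ones above; the node name and the code itself are purely illustrative):

```python
# Sketch: map a severity to the subject-line prefix described above.
# The prefix makes it trivial to build email handling rules downstream.
SEVERITY_PREFIX = {
    "critical": "** CRITICAL **",  # IT service unavailable, users impacted
    "major":    "** MAJOR **",     # needs attention now, users not impacted
    "minor":    "** MINOR **",     # non-redundant but non-critical failure
    "cleared":  "** CLEARED **",   # whatever was broken is back online
}

def subject(severity: str, body: str) -> str:
    return f"{SEVERITY_PREFIX[severity.lower()]} - {body}"

# e.g. subject("major", "Network Node CORE-SW01 is Down")
#  ->  "** MAJOR ** - Network Node CORE-SW01 is Down"
```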

D

Level 10

I really like these ideas, particularly the clear indication of the severity level right on the subject line that is also easy to build into email handling rules. Thanks for sharing!

Level 11

I'm in the middle of a "redesign" of my alerts, and if "imitation is the highest form of flattery," then I hope you soon find yourself flattered. I'm going to borrow some of this. deverts, do you have a standard alert email (or whatever) that is sent out using variables and such to keep the alert consistent regardless of what is down?

Jim

Level 17

I like these flags; once documented and implemented as procedure based on the written policy, they will keep the use of these flags standardized.

Very nice use of a basic concept! deverts, well played.

Level 14

I started with one alert and modified several accordingly; custom fields are also required.

  1. We used custom field "City" and defined it as "<city> - Production" or "<city> - Development", etc.
  2. We used custom field "Comments" and defined it as the environment like "Oracle" or "E-mail", etc.
  3. Then I built alerts based on combinations of these and whatever other criteria I needed

Most of my environment is redundant, so this is an example of an e-mail I get for a node down:

Subject:  ** MAJOR ** - Network Node ${NodeName} is ${Status}

Message:

${NodeName} is ${Status} as of ${AlertTriggerTime}

Node Details:  ${NodeDetailsURL}

-----

If I have a circuit outage:

Subject:  ** MAJOR ** - MPLS Circuit on ${NodeName} is ${Interface.Status}

Message:

Interface:  ${Interface.Caption}

Node:       ${NodeName}

Status:      ${Interface.Status}.

-----

If we have a CRITICAL service down:

Subject:  ** CRITICAL ** - ${ComponentName} ${ComponentStatus}

Message:

${ComponentName} is  ${ComponentStatus} on ${NodeName}

at ${AlertTriggerTime}.

${APM:ComponentDetailsURL}

Acknowledge:

"${AcknowledgeURL}"

Alert that triggered this email:

${AlertName}

-----

The alerts are all based on your criteria, but I recommend taking FULL advantage of the custom fields.

D

Level 14

I think the very best alerts are proactive. deverts nailed it with that design, but I think designing proactive alerts takes a collaborative approach from the monitoring team. You have to know the business and all of the moving bits in the business so that you can understand how to best serve the needs of everybody, from the customer-facing to the back-office teams.

I also believe that there is a time and place for the monitoring team to simply say that something is out of scope. Truly proactive alerting means more than just triggering under certain conditions; it requires a response. If that response doesn't exist, does the alert really matter?

My 5 cents.  (Canadians don't have pennies anymore.  I rounded up.)

Level 11

We recently replaced the switches in our remote offices, around 100 of them. There are only two of us who monitor everything in SolarWinds, and we noticed it was running very slowly. To make a long story short, just about every interface on those switches was being monitored, so we both spent a good deal of time removing unwanted interfaces. We also get a lot of calls from users complaining about slow response times. The best alert that helps us get ahead of the game is the email we get when a site has high utilization. Normally, before that office can call us, we have researched the office, found the problem, and can remedy it.

Level 11

We keep it pretty proactive around here, but we still (and always will) have some cleaning-up to do.

Level 15

I spend a good amount of time on an almost weekly basis discussing alert design and theory with my various clients. Generally speaking, over the years I have found that changing the subject line of your emails (as mentioned several times above) tends to be the single most effective practice.

However, effective is by no means always correct. There have been many, many times that the local teams have convinced me that they have a use case way outside of the 'norm' for what I would consider proactive alerting. Some teams will forever be reactive to issues in the enterprise. But if that's by their own design, who are we to argue?

Proactive alerting, IMHO, should ALWAYS be the goal of any NMS. (In this context, I include all of the Orion core modules in NMS). Otherwise, it feels that we are spending a lot of money on licensing and overhead to have a really fancy way of getting an email at about the same time the users are calling.

Every enterprise is unique in its needs and problems, so it's really hard to say that X alerts will allow for proactive monitoring. However, I have found that tiered alerts will really help bring attention to a focused area before critical incidents occur.

For example: a WARNING level alert at interface utilization > 70% for longer than 10 minutes; CRITICAL level for > 90% longer than 5 minutes.
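A rough sketch of that tiered evaluation (illustrative only; not tied to any particular polling engine):

```python
# Sketch: tiered utilization alerting with a sustained-duration requirement.
# Thresholds and durations are the illustrative numbers above, not recommendations.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    min_utilization: float  # percent
    sustained_minutes: int

TIERS = [  # evaluated most severe first
    Tier("CRITICAL", 90.0, 5),
    Tier("WARNING", 70.0, 10),
]

def evaluate(samples_per_minute: list[float]) -> str | None:
    """samples_per_minute holds one-minute utilization samples, newest last."""
    for tier in TIERS:
        window = samples_per_minute[-tier.sustained_minutes:]
        if len(window) >= tier.sustained_minutes and all(
            s > tier.min_utilization for s in window
        ):
            return tier.name
    return None

# e.g. evaluate([65, 72, 75, 91, 92, 93, 95, 96]) -> "CRITICAL"
```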

Of course, these numbers are made up. But that really brings up another key topic. BASELINE. BASELINE. BASELINE. (it's a golden nugget from the monitoring gods really)

Level 21

While there is a lot more complexity (custom properties, specific criteria, etc.) to our alerts, at the most basic level some of our most useful alerts have been the following...

  • Upstream interface down
  • Volume filling up
  • Service down

Level 11

I agree, too many alerts will just be ignored. Alerts need to be meaningful to the person receiving them. One useful alert I have seen is a low drive space alert, which tells you when a system is on the verge of shutting down because the drive is full.

Level 9

UPS alerts for when you have power issues are helpful. Gives you a good warning before they drain out and stuff starts shutting down. Also lets you know if your building has any weird power issues.     

Level 14

A good alert to have is for backup jobs that didn't complete.  I would like to know that I have a backup when I go to use it.  I don't like needing to use a backup, only to find out that the backups have not completed successfully for the last month.  lol

Level 11

I started scheduling everything in my life to help with productivity.

Level 13

The users love when, after the power goes out in a branch, we call them to confirm it. It freaks them out a little though.

MVP

We are looking at this currently and are attempting to make sure that we get alerts only for those things that require alerts, but still allow other monitors to trigger business-hours alerts or email-only alerts, etc.

I see this as an Event Management function (ITIL) rather than a monitoring function - if it's important enough to send a text alert to a tech, there should be a record in the service desk.

Those two functions should not be disconnected; otherwise you end up losing track of what was done and why, as well as what other systems were affected.

Level 12

This is the number one thing I deal with on a weekly basis with clients - figuring out how to slow the deluge of alerts into a meaningful trickle.

Level 11

I love SNMP.

MVP

For some events we will send an "advisory" alert... more of a heads-up that a condition exists that may be indicative of an outage.

We also have the network team label the interfaces of the switches with a criticality number (1-4), a responsible-group designator (2 characters), and the name of the attached device.
All interface alerts go to the team specified by the two-character designator.

The criticality level determines the types of alerts, similar to the following list:

1 - email, ticket, and phone call

2  - email and ticket

3 - ticket

4 - email
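Conceptually, the label parsing and routing work something like this sketch (illustrative only; the real logic lives in our alert rules, and the example label format is made up):

```python
# Sketch: parse an interface description labeled per the convention above
# ("<criticality 1-4> <2-char group> <attached device name>") and decide
# which notification actions to take. Purely illustrative.
ACTIONS_BY_CRITICALITY = {
    1: ("email", "ticket", "phone call"),
    2: ("email", "ticket"),
    3: ("ticket",),
    4: ("email",),
}

def route(label: str) -> tuple[str, tuple[str, ...]]:
    """Return (responsible group, notification actions) for a labeled interface."""
    criticality, group, _device = label.split(" ", 2)
    return group, ACTIONS_BY_CRITICALITY[int(criticality)]

# e.g. route("1 NW core-router-uplink") -> ("NW", ("email", "ticket", "phone call"))
```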

MVP

That sounds extremely cool.

Are you able to share how you did that?

Level 17

My philosophy was simple:

  1. Alert for when action is necessary, otherwise
  2. Don't

Being alerted to something that doesn't require action is just white noise. You are better off logging that information somewhere to be reviewed later.

Level 11

Sometimes I feel like if I have enough alerts set up, I can almost wait for a problem to arise and get the instant gratification that an alert is working and saved the day. I like that part of system administration. Knowing before anyone else that there is a problem, and being able to respond any time of day. I believe everyone on here likes that feeling.

Today we had a major circuit outage... and did not know it. I monitor the interfaces, including this one, but it still shows Up - Up. It is connected to a telecom switch, and that switch is up... but the network circuit on the other end of their switch is down.

We have a different circuit at our other data center that took over for the failed circuit, and we have a large circuit between data centers. So NPM was happy that all was well... but it was not.

So I need to make an alert that looks for 0 bytes transferred for two minutes and apply it to that interface.
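Roughly what I have in mind, sketched out (assuming per-minute polled octet counters; nothing vendor-specific, and the function is illustrative):

```python
# Sketch: flag an "up/up but dead" circuit when the interface byte counters
# stop increasing for a sustained window. Assumes in/out octet totals are
# already being polled once a minute; the counter source is hypothetical.
def circuit_looks_dead(octet_samples: list[int], window_minutes: int = 2) -> bool:
    """octet_samples: per-minute totals of (in + out) octets, newest last."""
    if len(octet_samples) < window_minutes + 1:
        return False  # not enough history yet
    window = octet_samples[-(window_minutes + 1):]
    deltas = [b - a for a, b in zip(window, window[1:])]
    return all(d == 0 for d in deltas)  # zero bytes moved for the whole window

# e.g. circuit_looks_dead([10_500, 10_900, 10_900, 10_900]) -> True
```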

Anyone have a better plan?

RT

Level 14

Are you using Cisco gear? Do you have VNQM? I have a couple of situations similar to yours, and I use IP SLA to get that alert.

D

MVP

Can you see the device at the other end of the circuit?

Can you ping something on the other end that will not be available when that circuit is down?

Yes to the first question...no to the second.  VNQM is one of the few I do not have.

RT

Jfrazier I could ping its peer IP.  That would be the simplest way to monitor it.  I will ping the peer and name it something ominous like "MPLS at location X outage".  That will get some attention.

The way I found out it was down: a SolarWinds user has been tracking a certain system's usage with NetFlow every day. Today the graphs said "No Data Available" and he asked why. The graphs were correct; with the circuit outage, that data was not available.

We then told him to track the usage using the circuit that was running and he got his data.

I have a bunch of modules, but next year I hope to add a few others, like VNQM, FSM, Database Performance Analyzer, and Mobile Admin.

Thanks,

RT

Level 10

We had a very nice alert system. The first alert went out immediately to the on-call phone (we used a directory number in Call Manager, and it was changed every Friday morning at 0800 to the cell of the on-call), then 5 minutes later to the on-call again; if it still was not acknowledged after 10 minutes, it went to the backup on-call (we had 12 guys, for a 6-week rotation). After 20 minutes it went to the boss. Needless to say, we would acknowledge the alerts ASAP. We were a 24x7 business.
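Sketched out, the escalation timing looked something like this (illustrative only; ours was built from escalation tiers in the alert engine, not code):

```python
# Sketch: escalation schedule for an unacknowledged alert, per the timing above.
# Minutes are measured from the first notification; acknowledging at any point
# stops further escalation. Recipient names are placeholders.
ESCALATION_SCHEDULE = [
    (0, "on-call phone"),    # immediately
    (5, "on-call phone"),    # repeat to the on-call
    (10, "backup on-call"),  # still unacknowledged
    (20, "boss"),            # nobody wants to reach this step
]

def recipients_due(minutes_unacknowledged: int) -> list[str]:
    """Everyone who should have been notified by this point."""
    return [who for after, who in ESCALATION_SCHEDULE if minutes_unacknowledged >= after]

# e.g. recipients_due(12) -> ["on-call phone", "on-call phone", "backup on-call"]
```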

Level 12

I think the worst ones are the ones you didn't think you needed to alert on. We weren't monitoring state on our ShoreTel gear, and when the phones started going straight to voicemail we found that we needed at least ICMP for them. Lesson learned.

Level 17

You have to be a bit careful with these; something where the light stays on, or a circuit that runs through the *shudder* Cloud, will a lot of times only show down on one side, or neither.

EIGRP change alerts or trap alerts are good; check your events to see if there is an interface speed change or other flag you could use. You might enable an alert or three based on what you see... sent just to yourself, to see if it really works for what you need, then open it up to the masses or make it on-call inclusive.

Level 11

I think threshold or anomalous-behavior alerts are where it's at for being proactive. "Disk Full" or "Interface Down" are definitely important alerts, but if there is something going on before it gets to that point, I want to know. Also, tracking config changes is critical; I really want to know when someone on my team has made a change on a switch or router, not to pin blame but to see when changes are made and what changed, to help speed up remediation.

Level 9

I agree that Alerts are useful in two critical instances:

  1. Proactive measures to mitigate a problem; and
  2. Reactive steps--taken to correct the problem (after the fact).

Although we cannot dream up every possible proactive alert that is necessary, nor are we able to respond quickly enough to every proactive alert before a problem occurs, beginning with a proactive stance makes for a more successful team and better use of the SolarWinds tool.

Level 10

Haven't set anything up yet, but hope to soon 🙂

Level 12

I do something similar to deverts with CPs. One identifies whether the node is in my Live environment, so my team can see it or not. Another identifies whether a SAM App Monitor is live, so I can add/QA/upgrade/test without Alerts going to the team. Still another identifies the Support Team(s) related to the node, so one Business Unit (BU) can handle their alerting differently than another. And another identifies each node as either Vital or Non-Vital, meaning that things that go wrong on the node could affect the Production environment (the business-critical environment).

Vital would therefore extend to nodes such as network devices, UPS, HVAC, etc.  Anything that can cause an impact on the Production environment.

I then structure the Alert notifications to read something like "Critical Alert:  Vital serverA System Faults Statistics are currently Critical", which immediately tells the Engineer that it is a "Critical Alert" and must be dealt with immediately, and that it is on a "Vital" node, so not doing so could affect the Production environment. For Non-Vital nodes, even for Critical events, the Alert status drops to Warning. So any notification that starts with "Warning Alert", be it against a Vital or Non-Vital node, immediately tells the Engineer that they can put it further down in their list of to-dos for the day. This configuration is used both for Alert email notifications that go out to the full BU Support Team, so they are aware of the issue, and for APIs that go out to a notification service that contacts specific Engineers to deal with the issue and Managers for immediate notification of Critical Alerts and for escalation when an Engineer does not respond, based on time of day, rotating on-call schedules, normal work-hour schedules, etc.
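The Vital/Non-Vital downgrade rule, sketched out (illustrative only; in practice it lives in the alert definitions and custom properties, not code, and serverB is a made-up example):

```python
# Sketch: notification severity derived from the event severity and the
# node's Vital/Non-Vital custom property, per the rule described above.
# A critical event on a Non-Vital node is downgraded to Warning.
def notification_severity(event_severity: str, node_is_vital: bool) -> str:
    if event_severity.lower() == "critical" and not node_is_vital:
        return "Warning Alert"
    return f"{event_severity.title()} Alert"

def subject(event_severity: str, node_is_vital: bool, node: str, detail: str) -> str:
    vitality = "Vital" if node_is_vital else "Non-Vital"
    return f"{notification_severity(event_severity, node_is_vital)}:  {vitality} {node} {detail}"

# e.g. subject("Critical", True, "serverA", "System Faults Statistics are currently Critical")
#  ->  "Critical Alert:  Vital serverA System Faults Statistics are currently Critical"
# e.g. subject("Critical", False, "serverB", "...") starts with "Warning Alert:"
```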

Together with creating intelligent monitors that zero in on the root cause, saving troubleshooting time for the Engineers and avoiding alert storms, and fixing issues with monitors offline and ASAP, I am able to meet my top goals.  Those are:

  1. Eliminate as many unnecessary/inaccurate Alerts as possible
  2. Only Alert on the root cause whenever possible
  3. Only engage those that need to be engaged to resolve the issue

Level 12

Agreed, a flag that's always red means nothing and is probably just monitoring the wrong thing.

Level 12

We usually alert when a node is down, when CPU on a switch (internet switch) is nearing 90%, or when business/patient web servers are running IIS at 85-100%.

We also alert when certain interfaces to certain systems are down, or our distros and core switches are down.

We've set up some of our critical servers to alert us on CPU utilization at 85-100%, low drive space, and maxed-out RAM.

We also alert when there is a configuration change on our switches. And we send ourselves reports every day so we can see what changes have been made and whether backups of these configs are being made.

Thank you!

Cheryl

Level 15

Thinking about all this and trying to avoid alert fatigue.  Good feedback and food for thought.

About the Author
I'm a network engineer working primarily with R&S and Wi-Fi. You can follow me on Twitter at @scottm32768 or on my blog at www.mostlynetworks.com. I hold multiple IT certifications, including SolarWinds Certified Professional (SCP2532).