
TIPS & TRICKS: Stop the madness! Avoiding alerts but continuing to pull statistics.

This is the first in a series of posts where, in the name of giving back to the community, I’m going to share some of the customizations that make SolarWinds a little more robust for us and our customers.

First, a little background about my company and how we use SolarWinds. Sentinel is an IT solutions provider that focuses on communications technologies, Data Center, and Outsourced / Managed Solutions.

One of our key services (and the thing that lets me put food on the table) is a remote monitoring solution (based on SolarWinds, of course). All we have to do is drop a VPN router onto the customer’s premises and set up NATs for the devices they want (read “pay us”) to monitor, and we’re good to go. This is a perfect fit for our customer base, where they don’t want to divert resources for the ongoing investment in staff, software, and skills to set up an enterprise-wide monitoring and management solution (not to mention figuring out who’s going to handle all those pesky tickets).

So our model, where we have many independent customers with different sets of values, different monitoring requirements, and so on, has driven us to come up with some customizations that focus on:
•    How to stop alerting on various devices (because of pilot projects, new customer onboarding, or maintenance windows) while continuing to collect statistics
•    How to set thresholds for devices when that could be different on nearly a device-by-device basis
•    How to ignore alerts based on the built-in monitors for CPU/RAM, etc on older or closed-architecture devices where a custom OID gave better data

This post is going to look at our solution for the first bullet – how to stop alerting but continue to collect statistics.

Of course, we all know that SolarWinds has the “unmanage” feature. This is a nifty little function that even has a scheduler associated with it, and can handle one-time or recurring events.

But our problem was that in some cases we needed to continue to collect the statistics even during the window where alerts would be a problem. For example, when a circuit goes down, our Network Operations Center (NOC) staff contact the customer’s carrier and act as the point of contact for testing and resolution. During that time, *we* want to know the status of the WAN circuit, but we don’t want additional alerts (read “tickets, where we have an SLA that carries $$$ penalties if we fail to acknowledge and close”). Unmanage would certainly turn off the alerts, but we’d have no way of knowing what SolarWinds thought about the circuit status until we managed the interface again and – you guessed it – potentially cut another ticket.

So we developed the “MUTE” field. The logic is very simple:
1.    Set up a custom property (a yes/no field) labeled "mute"
2.    For specific nodes, set that property to "yes"
3.    Within your alerts, make sure one of your logic checks is something like "MUTE is not equal to YES"

That’s the basic idea. But here at Sentinel we’ve made it a bit more granular. The following mute fields are in place:

•    n_mute - node mute. This is an overall mute. All alerts should check for node-mute, and if it is set to "yes", the alert should be ignored.
•    i_mute - interface mute. This is, as the name implies, used in any interface-related alert
•    v_mute - volume mute. Again, the name should be a good clue to the usage. Very valuable when you have disks that are always at the edge of being full, but (for whatever reason) you don't care.
•    APM_mute - This mute option is very useful when you are bringing new applications online and want to pilot them, but you still need to get hardware alerts (CPU, RAM, etc).

The logic for any alert then looks like this:

Where ALL of the following are true
N_MUTE is not equal to YES
<the rest of your alert criteria>

For an interface alert, the logic would simply include two lines:

Where ALL of the following are true
N_MUTE is not equal to YES
I_MUTE is not equal to YES
<the rest of your alert criteria>

Along with the MUTE fields, there are associated DESCRIPTION (n_mute_desc, i_mute_desc, etc) fields. That way we can add comments about when and why the element was muted.
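The mute checks and description fields above can be sketched in a few lines of Python. This is only an illustration of the logic, not how SolarWinds evaluates alerts internally; the `Element` class and the `problem` flag (standing in for "the rest of your alert criteria") are hypothetical names made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """One monitored element with its mute-related custom properties."""
    name: str
    n_mute: bool = False      # node mute; unchecked (False) by default
    i_mute: bool = False      # interface mute
    n_mute_desc: str = ""     # free-text note: when and why it was muted
    problem: bool = False     # stands in for "the rest of your alert criteria"


def node_alert_fires(e: Element) -> bool:
    # "N_MUTE is not equal to YES" AND the rest of the criteria
    return (not e.n_mute) and e.problem


def interface_alert_fires(e: Element) -> bool:
    # Interface alerts check both the node mute and the interface mute
    return (not e.n_mute) and (not e.i_mute) and e.problem


wan = Element("wan-circuit", problem=True)
muted_wan = Element("wan-circuit", i_mute=True, problem=True,
                    n_mute_desc="carrier ticket open")
```

Because the flags default to unchecked, an element that nobody has touched still alerts; only an explicit mute silences it.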

As long as everything stays nice and standard, views and reports can be designed that let you know which elements are muted and why.

We’ve developed a standard set of terms for use in these description fields so that, for example, we can create a view that shows all the muted nodes – so that we can know when a device has been muted for too long - but ignores ones that are purposely muted forever based on customer requirements.

IN THE NEXT POST: How to easily set per-device thresholds.

Leon Adato is a monitoring engineer at Sentinel Technologies. Sentinel is an independent technology company providing integrated, customized IT solutions including remote systems Monitoring and Management. Find out more at http://www.sentinel.com/

All comments

KwameB

This is the perfect solution to my problem of missing stats collection during maintenance windows...the unmanage function is ok but didn't fully achieve what I needed...

I truly appreciate this post, it is a concise explanation of a simple solution to a complex problem.

I'm already anticipating your upcoming posts.

byrona

We are a hosting solutions provider and provide similar monitoring offerings to our hosted customers, I have even used a very similar solution to what you mention here (great write up by the way).

Assuming that you provide your monitored customers access to Orion to see their stuff, I would be very interested in hearing about how you have configured Orion as a multi-tenant solution as this is something I am constantly working to improve.

adatole

For those who are following this series, the second post is now up. You can find it here:

adeimel

excellent work, question on where you are adding your apm_mute field?

I only recently brought our first APM box online and noticed there is apparently no ability to add custom fields to the application tables? If I have multiple templates on a node, it doesn't seem plausible to mute the entire node if, say, only 1 of 3 apps is down for maintenance. I'm curious if you've found a way to get down to the template level of granularity?

Thanks

mdriskell

Is there any way to script this? I am looking for the ability to do this in the same manner as unmanage. I need to be able to set it for a future time and have it restore at a set time.

Unmanage is great for a node that is down IMHO but for a maintenance event I would love to be able to script this muting so that I don't lose all my data for a node that might go down for 5 minutes in a 6 hour maintenance window.

adatole

I'm sure there is a way to script it - the data is a simple field in the nodes table after all. So you can write a web front-end that changes that field at will.

At Sentinel, we've created a system for our NOC staff to mute/unmute devices without needing to give them "manage nodes" permissions within SolarWinds. We've also scripted the ability to mute a mass number of nodes at a time (everything for a single customer).

We haven't scripted the mute/unmute on a schedule - although we've discussed it and call it a "blackout" feature, similar to unmanage.

What's stopped me is that I really don't feel like coding all the various calendar routines. It's pure laziness, but our need for this kind of feature hasn't been that great.

That having been said, that's all it would require. IMHO.

- Leon

Gavin55

Thanks for the very useful info. I agree a script would be helpful.

mdriskell

"We haven't scripted the mute/unmute on a schedule - although we've discussed it and call it a "blackout" feature, similar to unmanage."

That's exactly what we are trying to accomplish. We are replacing an Openview suite and their ENotify product has the ability to schedule blackout periods for a device.

So basically if I can get one of my SQL developers to give us some kind of web front end to have the NOC go in and blackout a device that would be ideal as I'm not big on giving out manage node functionality either.

"We've also scripted the ability to mute a mass number of nodes at a time (everything for a single customer)."

Hmm...sounds like putting a customer on service hold for non payment

freemen

adatole,

Please forgive my denseness here, but I need a little help understanding how the MUTE custom property works.

Will the value of the MUTE property always be Yes for those nodes, or is it manually set to Yes under certain circumstances?

mdriskell

Freemen,

It is up to the administrator. If you have the device/vol/app muted, it will not alert. Once you set that property, the element is excluded from any alert whose conditions check the mute field.

Here is an example of where I would use muting. We get a proactive maintenance notification from Verizon telling us that a circuit is going to be down somewhere between 12AM - 5AM tonight. I would have the NOC mute that device at 12AM and unmute at 5AM (unless I can figure out how to script it). Why do this instead of simply unmanaging? Well, if I unmanage, SolarWinds doesn't collect anything for that node during the unmanage period. If I use the muting concept, it will collect stats but simply not alert. I use the Verizon proactive maintenance as an example because these maintenance windows are typically about 5 hours but the impact is usually 15 minutes or less. I don't want to lose 5 hours of data for a 15 minute hit.

If you left a node always muted, SolarWinds would collect data but never alert. The reason you would do this is so you don't have to create unique exclusions in your alerts based on node name or something else, but rather simply always check for the node muted field.

I hope that this helps explain the concept a little better. I really love the idea of doing this but I need to figure out how to script this from a web front end for it to be really useful in our cases.

freemen

Yes, I understand the idea now, but this only matters on interfaces, volumes and applications correct? If the node is what will actually be down, there is no sense in polling it - you will get no data. Correct?

If a child element or application is going down, then you can still get data on the node itself. Is that the idea?

netlogix

for a scheduler, you could use windows task that runs a batch file:

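@echo off
rem Mute nodes by setting the N_Mute custom property; run this from a Windows scheduled task.
rem Alternative: target specific nodes by ID.
rem Set SQLCMD="Update Nodes Set N_Mute='Yes' where NodeID in (1,2,3,4)"
rem The LIKE wildcard is doubled (%%) because a lone percent sign is dropped inside a batch file.
Set SQLCMD="Update NetPerfMon.dbo.Nodes Set N_Mute='Yes' where Caption Like '<servername>.%%'"
rem -E uses Windows authentication; -h -1 omits column headers; -W trims trailing spaces.
sqlcmd -S <ORION_DB_SERVER> -E -Q %SQLCMD% -h -1 -W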

Another cool concept might be to use a date/time instead of "yes"/"no" that would auto-expire a maintenance window (like the SolarWinds built-in one). I think you would have to do an advanced SQL alert then, though.
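The bulk update in the batch file above can be tried out safely against a throwaway database. The sketch below uses Python's built-in `sqlite3` module with an in-memory table, so nothing here touches a real Orion database. The table layout, the captions, and storing `N_Mute` as 0/1 are assumptions for illustration only.

```python
import sqlite3

# In-memory stand-in for the Nodes table (hypothetical, simplified schema)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Nodes ("
    "NodeID INTEGER PRIMARY KEY, Caption TEXT, N_Mute INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO Nodes (NodeID, Caption) VALUES (?, ?)",
    [
        (1, "acme-fw01.example.com"),
        (2, "acme-sw01.example.com"),
        (3, "globex-rtr01.example.com"),
    ],
)


def set_mute(db, caption_pattern, muted):
    """Mute or unmute every node whose caption matches a LIKE pattern."""
    cur = db.execute(
        "UPDATE Nodes SET N_Mute = ? WHERE Caption LIKE ?",
        (1 if muted else 0, caption_pattern),
    )
    db.commit()
    return cur.rowcount  # number of nodes changed


changed = set_mute(conn, "acme-%", True)
muted_now = [
    row[0]
    for row in conn.execute(
        "SELECT Caption FROM Nodes WHERE N_Mute = 1 ORDER BY NodeID"
    )
]
```

Running the same function with `muted=False` at the end of the window restores alerting, which is the pair of scheduled tasks the batch-file approach would need.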

mdriskell

Let me give you another example.

Say you are dealing with a circuit issue. You have a node going up and down constantly and the carrier can't fix the issue until Monday. You would potentially still want SolarWinds collecting data so that you have a history of outage information (to potentially recoup money from the carrier for an uptime agreement), but do you need to get alerted every time it happens if it's a known issue?

The entire point of this method is to suppress the alert but still poll. If you don't need to poll, then yes, unmanage works perfectly for you and you don't need this method.

adatole

Freemen:

Yes, BUT...

You might conceivably mute a node during a maintenance window; when the node is just coming online but in pilot mode; when it's having an intermittent long-term problem that is already being looked into; etc.

Otherwise, if the node was just down-down and it was going to stay down, no sense in muting.

adatole

Netlogix:

Absolutely right. What's slowing me down is the actual user interface and the back-end data to keep track of the on/off. It's not HARD, but I just don't feel like writing the web-based calendar applet, then the shim to write the nodeID, mute-on, mute-off data to the db, and the OTHER interface to list out the upcoming blackouts with a set of MAC (move add change) options. I know it's small potatoes for someone who codes all day. Just not something I do a lot of.

When I get annoyed enough, it'll happen. And a week later SolarWinds will announce it's integrated into their system, and I will weep in self-pity.

;-)

- Leon

mvjames

While I like the idea of implementing this, if you don't use the unmanage/manage option, would that not affect the uptime reports? While this would suppress alerts, the web console, charts, and REPORTS would still show the downtime.

Wouldn't it be nice if the Alerts would exclude based on a date in the custom property fields... FROM and TO.

d09h

Regarding alert suppression without getting holes in your data/ graphs, you could set up a separate alert covering the time to exclude. Example:

main alert--M-F 0800 - 1700

after hours alert 1701 - 0759 M - F

weekend after hours alert 0001 - 2400 Sa Su

Same alert content except for actions (no email, no SMS, etc.) on the non-main alerts. Before using the unmanage utility, I had to do this for thousands of nodes being backed up.
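The time-of-day split described above can be sketched as a small Python function. The profile names and the `ACTIONS` table are hypothetical; the windows follow the example (main alert Monday to Friday, 0800 to 1700, everything else treated as after hours or weekend).

```python
from datetime import datetime, time

# Notification actions per alert copy; only the main alert notifies anyone
ACTIONS = {
    "main": ["email", "sms"],
    "after_hours": [],
    "weekend": [],
}


def alert_profile(ts: datetime) -> str:
    """Pick which copy of the alert is active at a given timestamp."""
    if ts.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return "weekend"
    if time(8, 0) <= ts.time() <= time(17, 0):
        return "main"
    return "after_hours"


def actions_for(ts: datetime) -> list:
    return ACTIONS[alert_profile(ts)]
```

All three copies still trigger and log, so the data and alert history stay intact; only the notification actions differ.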

mmelton

This has worked great for me and my company. I get tired of sending emails "Don't worry this is me!"

Thanks,

jbeville

This seems like it would work great.

I created the n_mute and i_mute custom fields. These are not showing up in the Advanced Alerts. Do you know if I must restart services for these to show?

Thanks,

Jay

Orion NPM - 10K elements and growing!!!

d09h

For ReportWriter to see your custom properties, you will need to update report schemas. I suspect this is also the case with Advanced Alerts. This might help:

Launch Custom Property Editor. Right-click the top near the icons...click Customize. Select the Commands tab. Click Update Report Schemas.

Drag that to where the icons appear at the top of Custom Property Editor.

Click that new icon (looks like a Dialog box with a lightning bolt). You'll see "Custom Properties have been added to all Report Schemas".

jwhitten

Nice idea-- and the other post too. One useful improvement that could be made, is to add a second set of mute fields, so that dual alerts could be generated and/or suppressed for a single device. Say a customer set and a provider set. Each would have their own mute fields they could twiddle to control their own alert sets without interfering with the other's.

mdriskell

I submitted an idea http://thwack.solarwinds.com/ideas/1056 for SolarWinds to expand on this concept and hopefully have it built into the product. I urge other users faced with this issue to consider voting on it.

Thanks

adatole

Thanks! I had put in the feature request back in 2010 - but after it was clear we weren't going to get it soon we coded around it, and the solution you see is what we came up with. Hopefully if enough people ask, they can bake it into the interface.

mdriskell

Yeah, if I can get my hands on a dev to write the code I need I'll do the same but haven't had the resources here available for me to do just that. I want to do it from a web ui and have it fully controlled for both on/off which is why I submitted the idea. Hoping to create a demand for it which is why I chimed in on this thread so everyone that has notifications will be aware of it

jbiggley

I upped the ante a bit on a current customer build. I built two maintenance fields, one for maintenance start and the other for maintenance end. I perform three checks -- is maintenance start less than maintenance end (both fields are time/date) and I check to make sure both fields are not empty. This allows the customer to schedule maintenance in advance and have alert suppression enabled when the maintenance comes along, but have the node also come out of suppression as soon as the change window is closed.

I would have liked to check N_Maint_Start (I have the same checks for interfaces and volumes) against current times instead of having to do a comparison of values, but I couldn't find a way to get that done.

I have also built some reports (or I am in the process of building them, I should say) that will notify the NOC of the following conditions:

1. If the maintenance start time > maintenance end time. (This is an invalid condition)

2. If the maintenance end time has passed, but the change procedure has not resulted in the clearing of the maintenance values from the node/interface/volume, etc.

I'd love to get some feedback on other reports, alerts, etc. you use to keep custom field data aligned with the design.

Josh
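The maintenance-window checks and the two NOC reports described above can be sketched in Python as follows. This assumes the current time is available to the check, which the comment notes was not possible in the alert builder, so treat it as a statement of the intended logic rather than something the product does. The function names are made up for the example.

```python
from datetime import datetime
from typing import Optional


def in_maintenance(start: Optional[datetime],
                   end: Optional[datetime],
                   now: datetime) -> bool:
    """True only when both fields are set, ordered, and 'now' is inside them."""
    if start is None or end is None:
        return False
    return start < end and start <= now <= end


def maintenance_problem(start: Optional[datetime],
                        end: Optional[datetime],
                        now: datetime) -> Optional[str]:
    """Return a reason the NOC should look at this element, or None."""
    if start is None or end is None:
        return None
    if start > end:
        return "invalid: start is after end"
    if end < now:
        return "stale: window ended but values were not cleared"
    return None
```

An alert would then fire only when `in_maintenance(...)` is false, and the report would list every element for which `maintenance_problem(...)` returns a reason.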

adatole

Sounds like you will soon be learning about SQL alerts, instead of using the query builder. Ditto reports.

;-)

Honestly, your report requirement is the easiest. Again, by using a SQL report, it would look something like this (I'm creating this off the top of my head with no testing. Caveat executor!):

select nodes.nodeid, nodes.caption, nodes.n_maint_start, nodes.n_maint_end

from nodes

where nodes.n_maint_start > nodes.n_maint_end

or nodes.n_maint_end < getdate()

As for the first item (alerts that compare current date), your best bet (again, this is off the top of my head at 1:00am, so run at your own risk) is to create the basic alert structure like you normally would, then switch the alert type from Node to SQL Custom. THEN modify the resulting query code to include a check for:

where nodes.n_maint_start < getdate()

and nodes.n_maint_end > getdate()

...instead of comparing the two dates to each other.

jbiggley

Ahh yes -- it looks like I am going to have to finally learn SQL query language, eh? Here's what I have so far, but it doesn't appear to be working.

Select Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name

FROM Nodes

WHERE

(

(Interfaces.I_Maint_End > getdate()) AND

(Interfaces.I_Maint_Start< getdate()) AND

(Interfaces.I_Maint_Start IS NOT NULL) AND

(Interfaces.I_Maint_End IS NOT NULL)

)

However, that throws this error when I try and validate the SQL:

SQL Error:

-2147217900 - The multi-part identifier "Interfaces.I_Maint_End" could not be bound.

Any ideas for a SQL simpleton?

adatole

It's a common mistake if you are new to SQL. You are referencing items in the interfaces table, but you haven't told the SQL how to "get" there (i.e., how to link your interface info to your node info).

Try this (again, I haven't tested this *at all*)

Select Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name

FROM Nodes

join interfaces on nodes.nodeid = interfaces.nodeid

WHERE

(

(Interfaces.I_Maint_End > getdate()) AND

(Interfaces.I_Maint_Start< getdate()) AND

(Interfaces.I_Maint_Start IS NOT NULL) AND

(Interfaces.I_Maint_End IS NOT NULL)

)

NOW... that said, if the majority of the information you want is actually out of the interfaces table (meaning it's not the NODE that goes into maint, it's each individual INTERFACE) then you might be better off making this an interface alert rather than a node alert. Then your query would start to look like this:

Select interfaces.interfaceid AS NetObjectID, interfaces.fullname AS Name

FROM interfaces

WHERE

(

(Interfaces.I_Maint_End > getdate()) AND

(Interfaces.I_Maint_Start< getdate()) AND

(Interfaces.I_Maint_Start IS NOT NULL) AND

(Interfaces.I_Maint_End IS NOT NULL)

)

HTH

- Leon
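For anyone who wants to see the join fix in action without touching a production database, here is a minimal sketch using Python's built-in `sqlite3` module. The tables, captions, and dates are made up, SQLite stands in for SQL Server, and the current time is passed in as a `:now` parameter instead of calling `getdate()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Nodes (NodeID INTEGER PRIMARY KEY, Caption TEXT);
CREATE TABLE Interfaces (
    InterfaceID   INTEGER PRIMARY KEY,
    NodeID        INTEGER REFERENCES Nodes(NodeID),
    FullName      TEXT,
    I_Maint_Start TEXT,   -- ISO timestamps compare correctly as text
    I_Maint_End   TEXT
);
INSERT INTO Nodes VALUES (1, 'core-sw'), (2, 'edge-rtr');
INSERT INTO Interfaces VALUES
    (10, 1, 'core-sw - Gi0/1',  '2024-01-01 00:00:00', '2024-01-01 05:00:00'),
    (11, 1, 'core-sw - Gi0/2',  NULL, NULL),
    (20, 2, 'edge-rtr - Se0/0', '2024-02-01 00:00:00', '2024-02-01 05:00:00');
""")

now = {"now": "2024-01-01 02:30:00"}

# Without the JOIN, the Interfaces columns cannot be resolved
try:
    conn.execute(
        "SELECT NodeID FROM Nodes WHERE Interfaces.I_Maint_End > :now", now
    )
    unjoined_error = None
except sqlite3.OperationalError as exc:
    unjoined_error = str(exc)

# With the JOIN, the query finds interfaces currently in maintenance
IN_MAINT = """
SELECT Nodes.NodeID, Nodes.Caption, Interfaces.FullName
FROM Nodes
JOIN Interfaces ON Nodes.NodeID = Interfaces.NodeID
WHERE Interfaces.I_Maint_Start < :now
  AND Interfaces.I_Maint_End > :now
"""
rows = conn.execute(IN_MAINT, now).fetchall()
```

Rows with empty maintenance fields drop out on their own here, because a comparison against NULL never evaluates to true.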

sja

Hi Leon

First, I'd like to say thanks for sharing your wisdom and experience.

I like the "i_mute" and "n_mute" fields.

I have been experimenting with "mute" so that there is still an alert, but no SMS or email is triggered to the NOC.

So I made a new alert named "alert me when a node goes down (mute)":

n_mute is equal to YES

Node is down

The trigger action just posts to the "event log-active alert".

I like it because it doesn't hide the alert from the NOC.

Is that something you have tried or played with?

/SJA

FormerMember

Thanks a tonne I have to say that, I and my firm are entirely new to outsourced managed services such as this, so this wasn't even something I considered as a potential problem but after reading this post I'm really appreciative that you have created such a simple work around to something which they should consider adding as a feature. Especially given how many of the users on here seem to think that this workaround is a stroke of genius.

rharland2012

Leon,

this is excellent stuff - thanks for sharing it.

AlexSoul

This is great post, thank you very much for sharing.

I have been using it for a while and stumbled upon a problem: how do I efficiently track all those nodes which are muted only temporarily (I guess most of them will be muted on a temporary basis)? Reports are fine, but they require extra work to regularly review and decide what can be un-muted and what can stay muted. Besides, it gets more difficult when the system is being managed by different people.

Here is my improvement:

1. Replace n_mute (boolean) with n_mute_until (date)

2. Here is what the NODE DOWN alert condition would look like:

Thanks,

Alex

AlexSoul

you don't need SQL query. See my post below...

AlexSoul

by the way, if I want to mute indefinitely (which is practically the same as using n_mute) I would do the following:

n_mute_until = 01/01/3000

I hope my successors will not be disappointed too much by excessive alerts on the New Year
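The n_mute_until idea from the comments above can be sketched in Python like this. It is only an illustration of the comparison, not the actual alert condition from the product; `is_muted` and `MUTE_FOREVER` are hypothetical names, and an empty field is assumed to mean "not muted".

```python
from datetime import datetime
from typing import Optional

# Far-future date used to mean "muted indefinitely"
MUTE_FOREVER = datetime(3000, 1, 1)


def is_muted(n_mute_until: Optional[datetime], now: datetime) -> bool:
    """A node is muted while the current time is before n_mute_until."""
    return n_mute_until is not None and now < n_mute_until


def node_down_alert_fires(is_down: bool,
                          n_mute_until: Optional[datetime],
                          now: datetime) -> bool:
    return is_down and not is_muted(n_mute_until, now)
```

The mute expires by itself once the date passes, so nobody has to remember to clear it.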

jason1320

We have several members of our team adding nodes, and I can easily see one of them forgetting to set this property, so the way the alert is written ("If n_mute IS NOT YES"), all new nodes would default to no alerts. I see this as a potential problem.

I was going to ask how you manage the introduction of new nodes, since there appears to be no way to default "n_mute" to NO and no way to make the custom field mandatory when adding a new node. I'm still curious if either one of these is possible, but I like Alex Slv's suggestion of using the date instead of the Yes/No.

jason1320

Also, what is the benefit of writing the mute logic in to the trigger condition? Wouldn't it simplify things to use the alert suppression field instead?

FormerMember

I use the Alert Suppression field
while (Maintenance = Yes)
{
suppress, node down, node reboot, and high packet loss;
}

adatole

Remember, this is a YES/NO field type. Not text. So it can only be checked or unchecked. Since checkboxes default to blank (ie: "no") then stating "where n_mute is NOT yes" means you DO get the alert. Only if someone explicitly sets the checkbox is muting turned on.

If the "if n_mute is NOT yes" is too convoluted, then re-adjust your alert logic to "if n_mute is NO".

adatole

The problem with alert suppression is that it's NOT specific to a particular node. If any node anywhere has maintenance set to yes, then the alert in question is suppressed.

Where alert suppression works (and it's a really REALLY limited case) is if one of your key systems (like the core switch or something) is down, you can suppress an alert.

Otherwise, don't use it.

IM(ns)HO
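The difference described above can be sketched in Python. The global-suppression behavior modeled here is taken from the comment itself and has not been checked against the product; the node dictionaries and function names are made up for the example.

```python
def alerts_with_trigger_check(nodes):
    """Mute check inside the trigger: evaluated per node."""
    return [n["name"] for n in nodes if n["down"] and not n["muted"]]


def alerts_with_global_suppression(nodes):
    """Suppression as described above: one match anywhere silences the alert."""
    if any(n["muted"] for n in nodes):
        return []
    return [n["name"] for n in nodes if n["down"]]


nodes = [
    {"name": "core-sw", "down": True, "muted": False},
    {"name": "lab-rtr", "down": True, "muted": True},
    {"name": "edge-fw", "down": False, "muted": False},
]
```

With the trigger check, only the muted lab router stays quiet; with suppression modeled this way, the core switch outage would be silenced too.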

dave_mcmillan

This is awesome. I actually read this a couple of years ago and even set up the "Mute" custom property. I hate to say it, but after so many changes and focusing on multiple projects, it was forgotten and just became a custom property we did not use. Now that we are moving forward with the patch management program, this will save me a lot of trouble.

pdicky76

Hey Leon, great notes, thank you!

I'm a rookie so forgive me but, I'm wondering if you created this system with a utility within SolarWinds or just something external accessing the DB? I'm guessing external but if I'm wrong where would I start within SolarWinds to build that?

Again many thanks!

Paul

marunderwood

A similar idea is being worked on, https://thwack.solarwinds.com/ideas/1056. Whoot!! Whoot!!

mdriskell

Yes basically I tried to take his original concept and proposed it as a feature request to get it built in the code.

chadsikorra

It would be nice to actually see an expected ETA on this sort of feature. Looks like that idea has been around for quite some time! I ended up implementing a PowerShell script using the Orion API to automatically set an "InMaintenance" custom property on nodes for their maintenance windows, based on WSUS group information. It works quite well and is all automated, but it would be nice to have a built-in feature for this.

fsiomar

I know this thread is really old but would appreciate any help anyone may be able to provide.

I've created a volume_mute custom property against the "Volumes" object type with a Yes/No format in SolarWinds Orion (Versions: Orion Platform 2015.1.2, IVIM 2.1.0, DPA 10.0.0, NPM 11.5.2, QoE 2.0, SAM 6.2.2), and during the creation process it asked me which volumes it would apply to. I added the volume I want to mute, but when going to that volume on the node I cannot see the newly created custom property (nor can I see it on the node itself).

Under Admin > Manage Custom Properties I can now see the property I created:

Full Custom Property list on the Volume itself:

As you can see I cannot see the volume_mute property listed.

Am I looking in the wrong place?

*edit* - not sure if it's worth mentioning that I only want to set the volume_mute property on ONE of this node's 2 volumes.