This is the first in a series of posts where, in the name of giving back to the community, I’m going to share some of the customizations that make SolarWinds a little more robust for us and our customers.
First, a little background about my company and how we use SolarWinds. Sentinel is an IT solutions provider that focuses on communications technologies, Data Center, and Outsourced / Managed Solutions.
One of our key services (and the thing that lets me put food on the table) is a remote monitoring solution (based on SolarWinds, of course). All we have to do is drop a VPN router onto the customer’s premises and set up NATs for the devices they want (read “pay us”) to monitor, and we’re good to go. This is a perfect fit for our customer base, where they don’t want to divert resources for the ongoing investment in staff, software, and skills to set up an enterprise-wide monitoring and management solution (not to mention figuring out who’s going to handle all those pesky tickets).
So our model, where we have many independent customers with different sets of values, different monitoring requirements, and so on, has driven us to come up with some customizations that focus on:
• How to stop alerting on various devices (because of pilot projects, new customer onboarding, or maintenance windows) while continuing to collect statistics
• How to set thresholds for devices when that could be different on nearly a device-by-device basis
• How to ignore alerts based on the built-in monitors for CPU/RAM, etc. on older or closed-architecture devices where a custom OID gave better data
This post is going to look at our solution for the first bullet – how to stop alerting but continue to collect statistics.
Of course, we all know that SolarWinds has the “unmanage” feature. This is a nifty little function that even has a scheduler associated with it, and can handle one-time or recurring events.
But our problem was that in some cases we needed to continue to collect the statistics even during the window where alerts would be a problem. For example, when a circuit goes down, our Network Operations Center (NOC) staff contact the customer’s carrier and act as the point of contact for testing and resolution. During that time, *we* want to know the status of the WAN circuit, but we don’t want additional alerts (read “tickets, where we have an SLA that carries $$$ penalties if we fail to acknowledge and close”). Unmanage would certainly turn off the alerts, but we’d have no way of knowing what SolarWinds thought about the circuit status until we managed the interface again and – you guessed it – potentially cut another ticket.
So we developed the “MUTE” field. The logic is very simple:
1. Set up a custom property (a yes/no field) labeled "mute"
2. For specific nodes, set that property to "yes"
3. Within your alerts, make sure one of your logic checks is something like "MUTE is not equal to YES"
That’s the basic idea. But here at Sentinel we’ve made it a bit more granular. The following mute fields are in place:
• n_mute - node mute. This is an overall mute. All alerts should check for node-mute, and if it is set to "yes", the alert should be ignored.
• i_mute - interface mute. This is, as the name implies, used in any interface-related alert
• v_mute - volume mute. Again, the name should be a good clue to the usage. Very valuable when you have disks that are always at the edge of being full, but (for whatever reason) you don't care.
• APM_mute - This mute option is very useful when you are bringing new applications online and want to pilot them, but you still need to get hardware alerts (CPU, RAM, etc).
The logic for any alert then looks like this:
Where ALL of the following are true
N_MUTE is not equal to YES
<the rest of your alert criteria>
For an interface alert, the logic would simply include two lines:
Where ALL of the following are true
N_MUTE is not equal to YES
I_MUTE is not equal to YES
<the rest of your alert criteria>
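For anyone who thinks better in code, here is a minimal Python sketch of the trigger logic above. It is only a model, not anything SolarWinds runs: the `Element` class and function names are made up for illustration, and I'm assuming the interface record carries its parent node's n_mute value alongside its own i_mute.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    """Hypothetical monitored element (node or interface) with custom properties."""
    name: str
    status: str = "Up"
    props: dict = field(default_factory=dict)  # e.g. {"n_mute": True}

def is_muted(element, *mute_fields):
    # An unset Yes/No property counts as "no", so only an explicit yes mutes.
    return any(element.props.get(f, False) for f in mute_fields)

def node_down_alert(node):
    # "N_MUTE is not equal to YES" plus the rest of the alert criteria.
    return not is_muted(node, "n_mute") and node.status == "Down"

def interface_down_alert(interface):
    # Interface alerts check both the node-level and interface-level mute.
    return not is_muted(interface, "n_mute", "i_mute") and interface.status == "Down"

if __name__ == "__main__":
    print(node_down_alert(Element("core-sw", "Down")))                         # fires
    print(node_down_alert(Element("core-sw", "Down", {"n_mute": True})))       # muted
    print(interface_down_alert(Element("Gi0/1", "Down", {"i_mute": True})))    # muted
```

The point of the sketch is that the mute check sits inside the trigger condition itself, so polling and statistics collection are untouched.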
Along with the MUTE fields, there are associated DESCRIPTION (n_mute_desc, i_mute_desc, etc) fields. That way we can add comments about when and why the element was muted.
As long as everything stays nice and standard, views and reports can be designed that let you know which elements are muted and why.
We’ve developed a standard set of terms for use in these description fields so that, for example, we can create a view that shows all the muted nodes – so that we can know when a device has been muted for too long - but ignores ones that are purposely muted forever based on customer requirements.
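As a rough illustration of that kind of view, here is a small Python sketch that filters a list of nodes down to the ones that are muted but not marked as permanent. The term "PERMANENT" and the dictionary layout are hypothetical; this post does not list the actual standard terms, so substitute whatever convention you settle on.

```python
# Hypothetical standard term; the real terms are not given in this post.
PERMANENT_TERM = "PERMANENT"

def muted_for_review(nodes):
    """Return names of muted nodes whose description is not marked permanent."""
    result = []
    for node in nodes:
        desc = (node.get("n_mute_desc") or "").upper()
        if node.get("n_mute") and not desc.startswith(PERMANENT_TERM):
            result.append(node["name"])
    return result

if __name__ == "__main__":
    nodes = [
        {"name": "rtr-01", "n_mute": True, "n_mute_desc": "ONBOARDING - pilot until cutover"},
        {"name": "lab-sw", "n_mute": True, "n_mute_desc": "PERMANENT - customer request"},
        {"name": "fw-01", "n_mute": False, "n_mute_desc": ""},
    ]
    print(muted_for_review(nodes))  # only rtr-01 needs a review
```

This only works if the description fields stay consistent, which is why the standard terms matter.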
IN THE NEXT POST: How to easily set per-device thresholds.
Leon Adato is a monitoring engineer at Sentinel Technologies. Sentinel is an independent technology company providing integrated, customized IT solutions including remote systems Monitoring and Management. Find out more at http://www.sentinel.com/
I know this thread is really old but would appreciate any help anyone may be able to provide.
I've created a volume_mute custom property against the "Volumes" object type with a Yes/No format in SolarWinds Orion (Versions: Orion Platform 2015.1.2, IVIM 2.1.0, DPA 10.0.0, NPM 11.5.2, QoE 2.0, SAM 6.2.2), and during the creation process it asked me which volumes it would apply to. I added the volume I want to mute, but when going to that volume on the node I cannot see the newly created custom property (nor can I see it on the node itself).
Under Admin > Manage Custom Properties I can now see the property I created:
Full Custom Property list on the Volume itself:
As you can see I cannot see the volume_mute property listed.
Am I looking in the wrong place?
*edit* - not sure if it's worth mentioning that I only want to set the volume_mute property on ONE of this node's two volumes.
It would be nice to actually see an expected ETA on this sort of feature. Looks like that idea has been around for quite some time! I ended up implementing a PowerShell script using the Orion API to automatically set an "InMaintenance" custom property on nodes for their maintenance windows, based on WSUS group information. It works quite well and is all automated, but it would be nice to have a built-in feature for this.
This is awesome. I actually read this a couple of years ago and even set up the "Mute" custom property. I hate to say it, but after so many changes and a focus on multiple projects, it was forgotten and just became a custom property we did not use. Now that we are moving forward with the patch management program, this will save me a lot of trouble.
We have several members of our team adding nodes, and I can easily see one of them forgetting to set this property. With the alert written as "If n_mute IS NOT YES", all new nodes would default to no alerts. I see this as a potential problem.
I was going to ask how you manage the introduction of new nodes, since there appears to be no way to default "n_mute" to NO and no way to make the custom field mandatory when adding a new node. I'm still curious whether either of these is possible, but I like Alex Slv's suggestion of using a date instead of the Yes/No.
Remember, this is a YES/NO field type, not text, so it can only be checked or unchecked. Since checkboxes default to blank (i.e., "no"), stating "where n_mute is NOT yes" means you DO get the alert. Muting is turned on only if someone explicitly sets the checkbox.
If the "if n_mute is NOT yes" is too convoluted, then re-adjust your alert logic to "if n_mute is NO".
This is a great post, thank you very much for sharing.
I have been using it for a while and stumbled upon a problem: how do I efficiently track all the nodes that are muted only temporarily (I guess most of them will be on a temporary basis)? Reports are fine, but they require extra work to regularly review and decide what can be un-muted and what can stay muted. It also gets more difficult when the system is being managed by different people.
Here is my improvement:
1. Replace n_mute (boolean) with n_mute_until (date)
2. Here is what the NODE DOWN alert condition would look like:
Also, what is the benefit of writing the mute logic into the trigger condition? Wouldn't it simplify things to use the alert suppression field instead?
The problem with alert suppression is that it's NOT specific to a particular node. If any node anywhere has maintenance set to yes, then the alert in question is suppressed.
Where alert suppression works (and it's a really REALLY limited case) is if one of your key systems (like the core switch or something) is down, you can suppress an alert.
Otherwise, don't use it.
I am the one managing it, though. So if you have a fairly central management structure for your node monitoring and default the property to no, then you have no problem. You have to deliberately set the node property to yes, and it is node-specific, since each node has its own maintenance property. It would make no sense if you had a reboot on switch1 and it used polling stats from switch 2.
With that said, there are a few ways to quickly update the MUTE (or if you want to call it "Maint") field (updating properties from ManageNodes, using a direct DB query, using the Orion SDK, etc) whether you want to do it for one node or many. But the key is to get the logic into the alert trigger.
Feature request for what?
Here is a doc which describes in detail how SolarWinds' alerting and suppression works:
Yes, you are right, it WOULD make no sense if you had a reboot on switch one and it used the "maint" field setting from switch 2. But it does. Honest.
See this comment from netlogix (he makes it often in various posts) from this thread: http://thwack.solarwinds.com/message/210026#210026
"The suppression tab is for global suppression. suppress everything if it is true anywhere. If you suppress "Comment = Printer", then if you have any comments that say printer for any node, suppress for every node until there isn't a comment of printer. Try to stay away from the suppression tab, unless you can not do it in the alert tab."
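To make the difference concrete, here is a small Python model of the two behaviors as they are described in this thread. It is not taken from SolarWinds documentation or code; the node dictionaries and function names are invented for the example, and the "global" behavior is simply what the quote above says happens.

```python
def alerts_with_global_suppression(nodes):
    # Suppression tab as described above: if the suppression condition is
    # true for ANY node, the alert is suppressed for EVERY node.
    if any(n["maint"] for n in nodes):
        return []
    return [n["name"] for n in nodes if n["status"] == "Down"]

def alerts_with_trigger_check(nodes):
    # Mute logic in the trigger condition: each node is judged on its own flag.
    return [n["name"] for n in nodes if n["status"] == "Down" and not n["maint"]]

if __name__ == "__main__":
    nodes = [
        {"name": "switch1", "status": "Down", "maint": False},
        {"name": "switch2", "status": "Up", "maint": True},
    ]
    print(alerts_with_global_suppression(nodes))  # switch2's flag hides switch1's outage
    print(alerts_with_trigger_check(nodes))       # switch1 still alerts
```

In this model, switch1 is down and not in maintenance, yet the suppression version stays silent because switch2 has its flag set. That is the reason for putting the check in the trigger.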
By the way, if I want to mute indefinitely (which is practically the same as using n_mute), I would do the following:
n_mute_until = 01/01/3000
I hope my successors will not be disappointed too much by excessive alerts on that New Year's Day.
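The exact alert condition for the date-based approach is not shown in this thread, so the following Python sketch is only a guess at the idea. It assumes that a blank n_mute_until means "not muted" and that the node stays muted through the date itself; both are assumptions, and the function name is made up.

```python
from datetime import date

def node_down_alert(status, n_mute_until, today=None):
    """Return True if a node-down alert should fire under a date-based mute."""
    today = today or date.today()
    # Assumption: no date set means the node was never muted.
    muted = n_mute_until is not None and today <= n_mute_until
    return status == "Down" and not muted

if __name__ == "__main__":
    today = date(2024, 1, 5)
    print(node_down_alert("Down", None, today))              # never muted, so it fires
    print(node_down_alert("Down", date(2024, 1, 10), today)) # still inside the mute window
    print(node_down_alert("Down", date(3000, 1, 1), today))  # the "indefinite" mute
```

The appeal is that a temporary mute expires on its own, so nobody has to remember to un-mute the node.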
Thanks a tonne. I have to say that my firm and I are entirely new to outsourced managed services such as this, so this wasn't even something I had considered as a potential problem. After reading this post, I really appreciate that you have created such a simple workaround for something which they should consider adding as a feature, especially given how many of the users on here seem to think that this workaround is a stroke of genius.
First, I'd like to say thanks for sharing your wisdom and experience.
I like the "i_mute" and the "n_mute".
I have been playing with the "mute" so that there is still an alert, but it does not trigger an SMS or email to the NOC.
So I made a new alert with the name "alert me when a node goes down (mute)":
n_mute is equal to YES
Node is down
The trigger action is just a post in "event log-active alert".
I like it because it does not hide the alert from the NOC.
Is that something you have tried working or playing with?