
Monitoring and the dreaded patch cycle - dealing with node down notifications

Level 10

Over the last couple of companies I've worked in, I have used a variety of different monitoring and management solutions. One of the top features that I look for in my system is the ability to effectively manage nodes during planned outages.

I'm dabbling with the SolarWinds Orion SDK to unmanage nodes during planned maintenance windows, and I'm slowly making headway with it. Ultimately I'd like to have my whole system manageable through RESTful URLs or SMTP, so that my patch management tool can actively disable monitoring before it touches a system.
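
For anyone curious what an unmanage call over REST could look like, here is a minimal Python sketch that only builds the request and does not send it. It assumes the SDK exposes an Unmanage verb on Orion.Nodes through the SolarWinds Information Service, taking a net object ID, a start time, an end time, and a relative-time flag. The host name orion.example.com, port 17778, the URL path, and the helper name build_unmanage_request are placeholders or assumptions, so check them against the SDK documentation for your version.

```python
import json
from datetime import datetime, timedelta, timezone

# Assumed endpoint: host is a placeholder; port and path are assumptions
# about the SolarWinds Information Service (SWIS) JSON interface.
SWIS_BASE = "https://orion.example.com:17778/SolarWinds/InformationService/v3/Json"


def build_unmanage_request(node_id, start, duration_minutes):
    """Return (url, json_body) for a hypothetical Orion.Nodes Unmanage call."""
    end = start + timedelta(minutes=duration_minutes)
    url = f"{SWIS_BASE}/Invoke/Orion.Nodes/Unmanage"
    # Assumed argument order: net object ID, unmanage time, remanage time,
    # and a flag saying whether the times are relative.
    args = [f"N:{node_id}", start.isoformat(), end.isoformat(), False]
    return url, json.dumps(args)


# Example: unmanage node 42 for a 90-minute patch window.
window_start = datetime(2024, 1, 13, 2, 0, tzinfo=timezone.utc)
url, body = build_unmanage_request(42, window_start, 90)
print(url)
print(body)
```

A patch tool could POST that body to the URL with an authenticated HTTP client just before it starts deploying updates.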

What tools have you been able to use for this? I'd love to hear how anyone has used tools like Microsoft SCCM, LANDesk, or other patch deployment solutions to interact with a monitoring solution like SolarWinds Orion.

  • Do you actively disable monitoring during patch cycles?
  • Do you have a simple method to unmanage monitored devices such as email or a RESTful API?
  • What are the top features that you would add to your application/node monitoring tools?

Today I'm using SCOM 2007 R2 and migrating to SCOM 2012, with SolarWinds Orion, plus SCCM 2007 R2 for patch management. Has anyone gotten similar infrastructure to be more self-aware through better processes?

Hope to hear your solutions, as there are a lot of folks looking for good alternatives to this.

42 Comments
Level 12

We do not disable monitoring during patch cycles.  We want to know immediately if something does not come back up after patching.

I like to use the Schedule Unmanaged Utility as I can select multiple nodes and schedule in advance.

A nice feature to have would be the ability to select multiple nodes to unmanage all on a single page.

Level 10

Our mgmt decided not to disable monitoring during maintenance windows (we had staff argue both sides of it).  We have a 24 x 7 NOC who always want to see what's happening, even if it's planned maintenance.

Level 12

The company I'm at uses a crazy number of tools, from SCOM to HP OpenView and everything in between. Got to make sure we don't miss a single thing.

Level 10

Good to know Richard. I seem to get a lot of folks who line up with the same requirements. I love the idea of the multi-node unmanage option. That would be my number 1 feature request for sure in the absence of a full RESTful API.

Level 10

A common message here is that many folks do still want to see the down/up cycle in their monitoring even though it generates a lot of traffic. I just worry about what I call "notification blindness", where too many false positive alerts hide some genuine issues and they get missed.

Level 10

We in IT sure have the love of an all-seeing eye don't we?

Level 8

Maintenance schedules and/or blackouts are always a challenge.  Leaving the alerting on during patching can screw with performance statistics if your company does any kind of SLAs or tracks change management.  The most successful approach I've used was forwarding everything to a manager of managers (MoM), then using the CMDB to schedule the blackout/maintenance period.  When an alert came in, the MoM checked the CMDB to see if the device/node/application was supposed to be active.  Alerts still came in and could be tracked at the MoM, but were suppressed based on the time scheduled in the CMDB.  Since the SLA reports were generated from data in the CMDB, no one was penalized for downtime.
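
The CMDB lookup described above can be sketched roughly as follows. This is only an illustration in Python: the MaintenanceWindow record, the should_suppress helper, and the in-memory list standing in for the CMDB are all hypothetical, not part of any particular product.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MaintenanceWindow:
    """One scheduled blackout, as it might be stored in a CMDB."""
    node: str
    start: datetime
    end: datetime


def should_suppress(node, alert_time, windows):
    """True if the alert falls inside a scheduled window for that node."""
    return any(
        w.node == node and w.start <= alert_time < w.end
        for w in windows
    )


# Stand-in for the CMDB: web01 is being patched from 02:00 to 04:00.
windows = [
    MaintenanceWindow("web01", datetime(2024, 1, 13, 2, 0), datetime(2024, 1, 13, 4, 0)),
]

# The alert is still received and logged; only the notification is held back.
print(should_suppress("web01", datetime(2024, 1, 13, 3, 0), windows))
print(should_suppress("db01", datetime(2024, 1, 13, 3, 0), windows))
```

The point of the design is that the monitoring tool keeps collecting everything, and the suppression decision lives in one place that the SLA reports can also read.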

MVP

This is an issue I've seen discussed with various companies. By not unmanaging the devices in SolarWinds, you can see what's happening as the devices go up/down during the maintenance but any downtime during maintenance will affect your availability statistics. If you unmanage, you miss statistics during the maintenance window but the availability statistic should give a more accurate SLA.

Your example above solves this problem by having SolarWinds continue to monitor during the window and having a separate system for uptime SLA reports (with suppression for the maintenance window).

Level 13

Perhaps there is a threshold of importance? If I restart some machines to apply updates, I'm not so worried about that, but if one never comes back up, then yes, that's a problem. Or if it came back up, but there was a snafu with the update or a software conflict and some service didn't work right, I want to know about that too. Make sense?

Level 13

Currently at my company, we do not unmanage for updating, they'd rather wade through the cruft. And for distributing updates we take the good old flashdrive approach or just downloading them again. Doesn't make much sense to me. If we have 20 servers, then we should at least have WSUS.

I didn't know that SolarWinds had an SDK. I'll have to check that out.

Level 9

I would love to see a "Disable Notifications" feature. We are at the tail end of migrating off of Nagios, which has exactly that feature. It lets you disable notifications either for all assigned monitors, or for all assigned monitors and the node itself, while still gathering metrics.

I believe it would be very beneficial to have this in SolarWinds. It gives you the best of both worlds: you can keep the noise floor down while still gathering metrics for the node.

Level 10

Great stuff! I like this idea as well. The ability to maintain the statistics, but at least suppress the notifications during known outages would be a very cool feature.

Level 10

The SDK is pretty cool. I've got to spend some time with it and dig into the programming a bit more. I do a lot of PowerShell scripting, but I have only dabbled with full development platforms like .NET. Many of us on the sysadmin side definitely need to be a jack of many trades to fully leverage the tools.

Level 13

Ironically I'm the other way round. Never touched PowerShell, but I'm familiar with full VB.NET. Done a little Batch, and even Bash (mostly just sed and awk for data manipulation). Of course anything that you could make in Batch I could make in VB.NET; it's just dependent on the .NET Framework.

Level 10

sed and awk FTW!

Level 10

We don't disable monitoring during patching/maintenance in case something happens or doesn't come back correctly. The scheduled unmanage feature is pretty useful, though, and I've been looking into the SDK as well, just not for maintenance windows atm.

Level 13

Something I've seen mentioned in a few previous posts is the use of another custom field to signify Alert Y/N.  If the field is Y and node is down then send alert, otherwise don't send alert.  This allows the continued gathering of stats while suppressing alerts for rebooting nodes.

Now if someone provides a means to modify this field via API call from SCOM both prior to and X minutes after a patch install, then I see that becoming really popular.
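
A rough sketch of that Alert Y/N idea in Python, assuming each node is represented as a simple dictionary. The alert_enabled field name and both helpers are hypothetical; a real setup would update the custom property through whatever API the monitoring tool offers, and the "X minutes after" delay is left out for brevity.

```python
from contextlib import contextmanager


def should_alert(node):
    """Alert only when the node is down and its alert flag is 'Y'."""
    return node["status"] == "Down" and node["alert_enabled"] == "Y"


@contextmanager
def alerts_paused(node):
    """Set the flag to 'N' for the duration of a patch run, then restore it."""
    previous = node["alert_enabled"]
    node["alert_enabled"] = "N"
    try:
        yield node
    finally:
        # Restore the flag even if the patch step raises an error.
        node["alert_enabled"] = previous


node = {"name": "app01", "status": "Up", "alert_enabled": "Y"}

with alerts_paused(node):
    node["status"] = "Down"            # simulated reboot during patching
    alert_during_patch = should_alert(node)

# The node never came back, so the alert fires once the flag is restored.
alert_after_patch = should_alert(node)
print(alert_during_patch, alert_after_patch)
```

Statistics keep flowing the whole time because the node is never unmanaged; only the alert condition changes.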

Level 9

Depends on which client. For some we do switch off monitoring. But without sounding like a broken record, we also have some that like to be notified of every little blip and cough...

Level 10

Great point Michael. I kind of glossed over the fact that the feature would be great in general and may not be tied directly to the patching cycle. Agreed that the scheduled unmanage would be a popular item for sure.

Level 10

Haha very true! There are folks who do like to see all of the up/down detail. What I try to do for that is to assure them that it is all logged, but the email should be treated like a secondary step. It is challenging to meet everyone's needs for sure.

Level 10

There are options in the Advanced Alert section to exclude some nodes from sending out emails, but they still log. You can create a Group and set that as the exclusion. I've started working on some exotic email notification scenarios and it is proving to be a pretty flexible tool in that way.

Level 12

yeah got a few buddies that like to call me an Internet Nazi lol

Level 21

Well it seems I am amongst friends here. 

We also do not disable monitoring during patching because we (like others have said) want to know if something breaks as a result of patching.  We have a NOC that is aware of our patching schedule so they know to watch for problematic behavior as a result of patching.

Level 10

We are not alone!

MVP

Same here. In fact, I like catching the reboot traps to confirm that the servers have restarted after the patch installed. I'll run a report after the patching is done, and confirm that all of my hosts rebooted as expected.

MVP

We generally disable monitoring during patching since that is a known scheduled downtime, so we don't count it against SLAs.  The challenge is when you have two back-to-back (on different days) patching windows or scheduled changes for the same device. Orion only allows you to have one scheduled unmanage time, so we have to set reminders to go in and add the other.

Level 20

We unmanage some devices but not all... only the ones we know for sure are OK to leave unmonitored during patching.  100% availability isn't the be-all and end-all for us.

Level 7

We do not unmanage devices either. It is helpful to us to see the alerts when certain devices go down that we know are supposed to. Then when maintenance is complete the alerts show the devices coming back up, and if something is still down it helps in troubleshooting.

Level 10

That is an interesting challenge on multiple scheduled outages. This is where having the SDK trigger an unmanage process from the patch management system would integrate nicely, so that the system is self-maintaining. Looks like I've got some investigation ahead of me to get a good process cooked up!

Level 10

Great point! Chasing uptime is rough because there are genuine needs for downtime and many systems don't need to be online 24/7. Balance is key

Level 10

Definitely a common theme here among most of the responses that unmanaging devices isn't a high priority during patch cycles. Thanks for the great feedback.

Level 10

I wish there was a clean way to disable alerting on a group of servers. I've looked at updating custom properties with a SQL query, but you would think there would be an out-of-the-box way to accomplish this.

Level 13

Lol, telling your bosses: "Our business is closed from 7pm - 6am. That's almost HALF the day. That means server costs would drop by a third." Boss says: "Nah, then they have to get started back up in the morning."

Level 7

Not yet


Level 10

Group management would be really great too. I should dig into the SDK and see if there is a way to affect servers based on group membership. I do some alerting based on group, but never tried managing/unmanaging tasks.

Level 13

Yeah. And I think that you could get fancy. Server start is initiated at 6:30, and you base your alerting on 6:45 or later. That way if the server wasn't up 15 minutes after start, you could alert the related IT guy that such and such hit a snag before it became a production problem.
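
That grace-period check could be sketched like this in Python. The startup_overdue name and the 15-minute default are just illustrations of the idea above, and how "is the server up" gets determined is left to whatever the monitoring tool reports.

```python
from datetime import datetime, timedelta


def startup_overdue(scheduled_start, now, is_up, grace=timedelta(minutes=15)):
    """True if the server is still down once the grace period has passed."""
    return (not is_up) and now >= scheduled_start + grace


start = datetime(2024, 1, 15, 6, 30)

# 06:40, still booting: inside the grace period, so no alert yet.
print(startup_overdue(start, datetime(2024, 1, 15, 6, 40), is_up=False))
# 06:45, still down: time to page someone.
print(startup_overdue(start, datetime(2024, 1, 15, 6, 45), is_up=False))
```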

Level 16

We also chose to monitor during maintenance because our surveillance engineers watch the events 24/7 and will cancel any tickets generated before they go to the responsible group.

Maintenance NEVER goes as planned, so it's hard to plan downtime.  People start late, or early, and never finish at the correct time.  Also, an engineer can look at the big screen to see if his/her system came up correctly before they complete the change.

Level 16

Isn't that why we have NTA?

NO NET FOR YOU!!

Level 16

Terrific idea!!!

Level 16

We have power outages at plants and the scheduled outages could come in handy for these, but the above idea of disabling notifications for a node or group would be much better...

Level 13

That makes a lot of sense when they are monitored 24/7.

Level 10

Good topic.  Creating scheduled maintenance windows for each department or client that disable alerts, plus a setting in Kaseya to reboot and disable alerts for XXX minutes.

About the Author
Eric Wright is a Systems Architect, VMware vExpert, Cisco Champion, OpenStack enthusiast, and overall IT generalist with a background in virtualization, Business Continuity, PowerShell scripting and systems automation in many industries including financial services, health services and engineering firms. As the author behind www.DiscoPosse.com, a technology and virtualization blog, Eric is also a regular contributor to community driven technology groups such as the VMUG organization in Toronto, Canada. You can connect with Eric at www.twitter.com/DiscoPosse. Eric is also the co-creator of Virtual Design Master (www.virtualdesignmaster.com), which debuted in 2013 as the first ever technology oriented reality competition. When not working in technology, you may find him with a guitar in his hand, riding a local bike race, or climbing over the obstacles on a Tough Mudder course. Eric also commits time regularly to charity bike rides and running events to help raise awareness and funding for cancer research through a number of organizations.