This is a touchy subject for a lot of people, if hanging around some of the other thwackians is any indication.
I'm always between being strict about alerting practices
Or a more loose utilization
Or probably the worst use
Now, having lived some of this, I find that lots of admins, engineers, bosses, and users conflate the golden "Actionable" alert with the warm and fuzzy "notification" that frees the recipient from any responsibility.
This seems to me like a concept that could use some retooling in Orion.
I'd like to see some natively supported divisions between an "alert", a full-fledged supported thing that carries the expectation that someone will use it to fix a problem, and a "notification", where one or many users receive some information about an alertable element with no strings attached.
Specifically, I added "natively" because I believe this should be built in. Sure, you could create your own methodology for this, or slice your existing alerts apart with some Custom SQL or SWQL and try to leverage the CC or BCC, or maybe use Orion as just another source of data to relay to some type of information management system where your "real" logic is - but Orion is so close to that now, at least for network monitoring related intelligence. Some might consider event management core to this subject.
I'd love to see Orion gain the ability to offer Push Notifications that might run alongside Alerts but would be a bit of a different animal: less tracking, fewer alert actions, more single-serving data points. Keep the automation with "Alerts."
And (naturally) taking that one step further, what if the Orion platform had its own natively integrated status page for each of these paradigms? One would focus on the platform, showing operational statuses of modules and reported incidents, with the ability for anyone to start an email, text, or RSS subscription for those status changes. The other would focus on all the elements your system is monitoring, letting you configure which ones appear and can be subscribed to in the same fashion.
Anywho - I'm curious what thoughts and ideas your collective hive-minds out there have about these topics.
Preach jeremyxmentzell. I will confess to looking for an event management tool to do some of that more complex logic. Heck, even Kiwi can do "If X events in Y time". Complex event logic is necessary to the future of our environment.
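For anyone who wants to see what an "If X events in Y time" rule boils down to, here is a minimal Python sketch using a sliding window. The class name `EventRateRule` and its parameters are made up for illustration; this is not how Kiwi or Orion implements it.

```python
from collections import deque


class EventRateRule:
    """Fire when at least `threshold` events arrive within `window_seconds`."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one event time (in seconds); return True if the rule fires."""
        self._timestamps.append(timestamp)
        # Drop events that have aged out of the window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


# Hypothetical example: 3 events within 60 seconds trips the rule.
rule = EventRateRule(threshold=3, window_seconds=60)
results = [rule.record(t) for t in (0, 20, 50, 200)]
```

Only the third event trips the rule here; by the time the fourth arrives, the earlier ones have aged out of the window.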
I recommend CatieM20's talk from Monitorama2016: Monitorama Live Stream Day 1 - YouTube
All alerts have actions, even if the action is to delete the alert.
I also recommend the Checklist Manifesto as a way to ensure the right actions are carried out when alerts go off. (Recommended by both Catie and my sister.)
In my environment we push the most significant alerts into a Slack channel using a Perl script, so if people are distracted from SolarWinds, then 10 minutes after the alert triggers they can go and look at the actual alert and figure out what should be done.
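A rough Python equivalent of that relay might look like the sketch below (not the Perl script itself). It assumes the 10 minutes is a deliberate delay before relaying and that the channel uses a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names and the webhook URL you would pass in are placeholders.

```python
import json
import urllib.request

DELAY_SECONDS = 10 * 60  # assumed: relay 10 minutes after the alert triggers


def is_due(triggered_at, now):
    """True once the alert has been active for at least the delay."""
    return (now - triggered_at) >= DELAY_SECONDS


def build_payload(node, message):
    """Build the JSON body for a Slack incoming webhook."""
    return json.dumps({"text": f"[{node}] {message}"}).encode("utf-8")


def post_to_slack(webhook_url, payload):
    """POST the payload to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Only the alerts you have flagged as significant should ever reach `post_to_slack`, otherwise the channel ends up as one more ignored folder.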
Good breakdown. From most of my IT experience, getting the proper documentation and process in place is one of the hardest parts. Either the team is small, sometimes just one person, and doesn't feel like it has time to do it, or the team is a bit larger but overworked, and getting ownership of the various areas can be tough. I've seen situations where everyone knew their responsibilities but didn't want to - or weren't allowed to - cross over into other areas, sometimes called "silos." And I've seen situations where "everyone does everything," which leads to no one owning anything. In the first situation there may be responsibilities, but it often leads to a fractured team with finger pointing. In the second situation there is less finger pointing but also less accountability. In either case the documentation and processes get neglected. Personally, I like to see a balance - everyone has their areas where they are experts and held accountable - yet they each have enough breadth to assist in other areas.
I believe our people really want to provide a quality product and/or service to our customers, but since few people really enjoy documentation/paperwork it tends to fall to management to drive this.
When it comes to alerting/notifying, people want to be notified about everything - until you notify them about everything. It takes good leadership to decide the who/what/where/when and especially the why of alerts and notifications.
I always push my clients toward a model where all email-based alerts are actionable. If there is warm and fuzzy stuff, I try to consolidate it into reports or dashboards. Once a shift, someone can review all the objects that have hit warning-type thresholds in the previous day (or whatever period), see how often it happens, and try to determine a trend; criticals, on the other hand, generate an email or, if we are integrated, a ticket. If the alerts are false positives, then we need to work on thresholds and the alert trigger parameters to make them accurate; never let them pile up and get ignored. I see lots of people really drop the ball on their thresholds in SAM, but it usually only takes a little time talking to their SME to get real actionable thresholds.
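To make that split concrete, here is a minimal Python sketch of the routing idea: criticals go out immediately, warnings are only counted per object for the once-a-shift review. The event dictionaries and the `severity` and `node` field names are assumptions for illustration, not an Orion data format.

```python
from collections import Counter


def route(events):
    """Split events into immediate actionable alerts and a per-shift digest."""
    # Criticals are the only events that generate an email or ticket.
    actionable = [e for e in events if e["severity"] == "critical"]
    # Warnings are tallied per node so a reviewer can spot repeat offenders.
    warning_counts = Counter(
        e["node"] for e in events if e["severity"] == "warning"
    )
    return actionable, warning_counts.most_common()


# Hypothetical sample data.
events = [
    {"node": "sql-01", "severity": "warning"},
    {"node": "sql-01", "severity": "warning"},
    {"node": "web-02", "severity": "critical"},
    {"node": "web-03", "severity": "warning"},
]
actionable, digest = route(events)
```

A node that keeps showing up at the top of the digest is a candidate for either a threshold fix or a real ticket.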
Runbooks are mandatory if you're going to have great buy-in and efficient responses to alerts or e-mails.
One environment I'm familiar with has alerts to technical staff and e-mails to managers and teams whose systems could be affected by the problem.
Another environment gets overwhelmed with all the e-mailed NPM Alerts that are too general and too broad--those folks set up rules for all that e-mail to be dumped into a folder, and who knows when it gets reviewed?
A third relies on Dependencies, which REALLY is a great way to cut down spam/alerts. But those who receive or see those alerts MUST be familiar with what they mean. One Parent router down could mean thousands of PCs are down, since they hang off dozens of switches behind that router. Understanding what you see (when Group Dependencies are in play) is critical to properly triaging and prioritizing repairs.
It just crossed my mind: to get some more visibility into the downstream impact of an outage, I could probably build a SQL selection for the alert message that reports how many downstream nodes have the downed node as a parent. When I get off this plane I will have a go at it.
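Until that query exists, the counting logic itself can be sketched in a few lines of Python. This assumes you have already exported the dependencies as a simple child-to-parent mapping with one parent per node; the mapping and node names below are hypothetical, and nothing here touches the actual Orion schema.

```python
from collections import defaultdict, deque


def count_downstream(parent_of, down_node):
    """Count every node that sits behind `down_node`, directly or indirectly."""
    # Invert the child -> parent mapping into parent -> [children].
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    seen = set()
    queue = deque([down_node])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    seen.discard(down_node)  # guard against a cyclic mapping counting the node itself
    return len(seen)


# Hypothetical topology: one router, two switches, three PCs.
parent_of = {
    "rtr1": None,
    "sw1": "rtr1",
    "sw2": "rtr1",
    "pc1": "sw1",
    "pc2": "sw1",
    "pc3": "sw2",
}
impact = count_downstream(parent_of, "rtr1")
```

Dropping that number into the alert message would make it obvious whether a single downed parent is hiding five nodes or five thousand.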