The average user should not be given rights to set up monitors and alerts.
One bad config can cause epic damage.
I saw it happen with an alert configured as "any/any" without any kind of logic: it generated 5 million alert emails and brought down a large company's mail system for 4 hours.
You cannot easily give normal, non-SolarWinds admins the ability to create any kind of monitors.
None of my DBAs, Help Desk staff, or engineers who lack even the slightest passion for working with SolarWinds could manage it.
It takes a dedicated guy (or several) who knows what he's doing.
We use a CP "owner" to multi-tenant the different user groups.
If the company has some type of central CMDB, you could minimize the risk by automating some parts.
orioncrack has something about who should be allowed to do what...
but I see that less as a problem with the software and more as a problem of setting "ground rules" for the users,
multi tenant or not :-)
yes, we do the same with a CP 'sector'
We use JIRA to manage requests for new alerts.
Alerts require a run-book, and using Jira we can subtask that out so someone has to write documentation on what the alert means, the <5 things to check, and the ways it can be fixed.
No runbook, no new alert...
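The "no runbook, no new alert" rule can be checked mechanically before a request is approved. What follows is a minimal Python sketch, not the actual JIRA workflow described above: the `AlertRequest` fields and the `runbook_problems` and `approve` helpers are hypothetical names chosen for illustration, and the only rules encoded are the ones stated in the post (what the alert means, fewer than five things to check, and the ways it can be fixed).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlertRequest:
    """Hypothetical shape of a request for a new alert."""
    name: str
    meaning: str = ""                                 # what the alert means
    checks: List[str] = field(default_factory=list)   # things to check
    fixes: List[str] = field(default_factory=list)    # ways it can be fixed


def runbook_problems(req: AlertRequest) -> List[str]:
    """Return the reasons the runbook is incomplete (empty list if none)."""
    problems = []
    if not req.meaning.strip():
        problems.append("missing: what the alert means")
    if not req.checks:
        problems.append("missing: things to check")
    elif len(req.checks) >= 5:
        problems.append("too many checks: keep it to fewer than 5")
    if not req.fixes:
        problems.append("missing: ways it can be fixed")
    return problems


def approve(req: AlertRequest) -> bool:
    """No runbook, no new alert."""
    return not runbook_problems(req)
```

A request with only a name fails all three checks, so `approve` returns `False` until someone writes the documentation.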
I manage a NOC for 7 different networks stretching from Tokyo to Chicago, and also four hospitals including Harborview Medical Center. We're here 24*365.25
The runbook is NOT detailed; see The Checklist Manifesto for the ethos behind this (https://www.amazon.com/Checklist-Manifesto-How-Things-Right/dp/0312430000).
That is, it doesn't say anything about acknowledging alerts, creating tickets, or other basic tasks everyone should get right every time.
The monitoring team is the same as the NOC, and we do advanced troubleshooting, firmware upgrades, router reloads, and planned traffic re-routes, i.e. the things one would expect a highly qualified NOC to do.
If we were just sat here waiting for alarms to trigger and then page someone I'd outsource it.
Here is a VERY typical runbook for a low-level alert that we use for preemptive action. As you can see, it says why the alert triggered and why you should care about it, reminds people to make sure the device is up, to read the syslog message, and to use their Juniper account to see what the message means, and then suggests appropriate actions. (Now that I read it, I see I should also remind them to open a JTAC case if the message is not documented.)
The KB# appears in the alert message and is tied to our ticketing system, so there's no ambiguity about which KB article the alert applies to. In this case the document has been referred to 22 times, which tells me it's not something we do very often.
This is a continual service improvement process, and I try to update an article or otherwise clean up an alert each week.
SWO: Juniper Chassis
This alert triggers when a Juniper router or switch reports more than ten (10) syslog messages from chassism in an hour (60 minutes).
The alert clears if fewer than ten (10) syslog messages from chassism are received in the last hour (60 minutes).
Impact to Customers:
This normally indicates some serious issue that needs to be investigated. In most cases there might not be an immediate impact but it could indicate some Access Points are not getting power, that one power supply has failed, or some other foreshadowing of something major.
- Check the device is reachable
- Check syslog for the messages
- Use your Juniper account to search for the message and determine its impact:
  - the default impact should be 3 - Low
  - if adjacent devices (access points, UPS) are impacted, increase the impact to 2 - Medium
- Consider if an out of hours page is necessary and increase urgency to 1 if needed.
Network Core / Layer_3 for routers, and icas switches
NIM -- for other switches
[links to JIRA and alert definition]
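To make the runbook's trigger, clear, and impact rules concrete, here is a rough Python sketch. It is not how the alert is defined in SolarWinds; `ChassisAlert` and `triage` are hypothetical names, and the sketch assumes a rolling 60-minute window in which a message exactly 60 minutes old no longer counts. Note that the runbook says "more than ten" to trigger and "fewer than ten" to clear, so at exactly ten messages this sketch keeps whatever state the alert was already in.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=60)   # "in an hour (60 minutes)"
THRESHOLD = 10                   # ten (10) chassism syslog messages


class ChassisAlert:
    """Rolling count of chassism syslog messages for one device."""

    def __init__(self):
        self.events = deque()    # timestamps of messages, oldest first
        self.active = False

    def record(self, ts: datetime) -> bool:
        """Register one chassism message and re-evaluate the alert."""
        self.events.append(ts)
        return self.evaluate(ts)

    def evaluate(self, now: datetime) -> bool:
        """Drop messages older than the window, then apply the rules."""
        while self.events and now - self.events[0] >= WINDOW:
            self.events.popleft()
        count = len(self.events)
        if count > THRESHOLD:        # trigger: more than ten in the hour
            self.active = True
        elif count < THRESHOLD:      # clear: fewer than ten in the hour
            self.active = False
        return self.active           # exactly ten: state is unchanged


def triage(adjacent_devices_impacted: bool, page_out_of_hours: bool) -> dict:
    """Impact and urgency as described in the runbook's checklist."""
    impact = "2 - Medium" if adjacent_devices_impacted else "3 - Low"
    # The runbook gives no default urgency, so None means "leave unchanged".
    urgency = "1" if page_out_of_hours else None
    return {"impact": impact, "urgency": urgency}
```

Feeding eleven messages one minute apart into `record` activates the alert on the eleventh; calling `evaluate` later, once enough of them have aged out of the window, clears it again.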
Loving the ideas, keep them coming. I really like the runbook idea; maybe I can get this added to our change request process.
Really liking the JIRA idea. Currently we run a home-brew system that is outdated, and the original creator is no longer around. It was designed more for networking devices, and the flow does not work well for application or server monitoring.