We recently deployed SRM in our environment and found that our dashboard is flooded with critical and warning events for every category of storage array.
We tried setting the thresholds based on the dynamic baseline, but the number of triggered alerts did not seem to drop. The problem occurs because SRM compares values against the thresholds every second and logs an event as critical/warning each time. Since we have many storage arrays, the dashboard is always cluttered with a very high number of these useless entries, and they overload my SolarWinds server and database.
There is no way to control the triggering of these events by setting a wait time, nor is there a way to turn them off as you can in the alert settings. I presented this case to SolarWinds support, but they were unable to help and recommended that we submit a feature request.
So this is a second post, following my original post three days ago on the feature request page. I am posting it again to a wider audience to get opinions and comments on this problem.
The feature request is for the ability to control event triggering in SRM by adding a wait time and/or the ability to disable it. I would also like to see how the SolarWinds community copes with this problem.
Thanks for your suggestion.
If I remove the event summary from my dashboard view, I will lose the benefit of getting an overview of my monitored environment. In addition, it will not address the underlying problem, which is the overloading of the server and database.
The problem with looking at the events widget is that you have no control over what SolarWinds deems an event worth recording, or how severe it displays it as. For me the most obvious example was the case where a network interface negotiates a different speed with the remote system. SolarWinds natively displays that as a warning event, but it happens every time a computer idles down and enters a power-saving mode, and it is really not the kind of thing I cared to have my techs chasing down. The same goes for your example in this request, with SRM usage constantly generating noise. The events themselves are just not a reliable indicator of the problems in an environment. They are a multicolored list of stuff that happened, good, bad, or indifferent, and most of it is just noise.
If you want control over what gets displayed, then the events list is a bad plan. Instead, build alerts for the bad events and display only those alerts; then you get to make the decisions about what is important.
Yes, they already have this ability in their SAM application components, and they recently added it to Node thresholds (CPU, Memory, etc.). They're also rolling it out to Volume and Interface thresholds, so hopefully they will bring it to SRM too. I wouldn't be too hopeful that it will come soon, though, because I doubt SRM is high on their priority list; it isn't nearly as popular as NPM.
My hope is that eventually every single threshold in every Orion module has a time-based modifier available.
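To make the idea concrete, here is a minimal sketch of what I mean by a time-based modifier: the threshold only fires once the value has stayed over the limit for a set wait time, so short spikes never become events. This is not SolarWinds code or how Orion implements it; the class name `SustainedThreshold`, its parameters, and the sample values are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedThreshold:
    """Hypothetical threshold that fires only after a sustained breach."""
    limit: float            # a sample above this value counts as a breach
    hold_seconds: float     # how long the breach must persist before firing
    _breach_start: Optional[float] = None  # timestamp of the first breaching sample

    def update(self, timestamp: float, value: float) -> bool:
        """Feed one sample; return True once the breach has lasted hold_seconds."""
        if value <= self.limit:
            # Any sample back at or under the limit resets the timer.
            self._breach_start = None
            return False
        if self._breach_start is None:
            self._breach_start = timestamp
        return timestamp - self._breach_start >= self.hold_seconds


# Made-up latency samples as (seconds, milliseconds) with a 5-minute wait time.
latency = SustainedThreshold(limit=20.0, hold_seconds=300)
samples = [(0, 25.0), (60, 30.0), (120, 10.0), (180, 22.0), (480, 27.0)]
fired = [latency.update(t, v) for t, v in samples]
print(fired)
```

With these samples, the dip at 120 seconds resets the timer, so only the last sample (over the limit continuously from 180 to 480 seconds) would trigger anything. Without the wait time, four of the five samples would each have logged an event.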
Thanks for your comments and info.
It is good to hear that they are working on adding this feature. What would be the point of having two engines comparing actual performance against thresholds? That is, if an alert is already available for a given metric, there is no need to also compare it and trigger an event. Alerts offer a much greater degree of customization based on the environment.
I also hope that the rollout to SRM will be given considerable weight, taking into account SRM's advantage: it offers much more granular storage performance information, linked with server and application performance, for troubleshooting problems and identifying the root causes of service degradation via PerfStack analysis.