Example: since midnight (00:00) we have received 60 emails. It's now 10.29, so that's 629 minutes, which works out to roughly one every 10 minutes. (I am aware that's different from one every minute, but we received one at 7.39 about a hardware alert going down, then one at 7.40 about a hardware alert coming back up for the same node.) Would you not look at an alert if it's alerting? Either the alert is redundant, in which case why is it alerting, or it needs investigating and the alert at least needs acknowledging.
You will usually want to tweak the thresholds somewhat to fit your environment. Has your instance been running for at least six months? If so, you will be able to build a good baseline for how your environment performs on a daily basis. If not, you can still start making small changes to bring the thresholds down. You can also choose whether a condition sends you an alert or just changes on the dashboard. Keep in mind while making these changes that you only want to reduce the non-actionable alerts. Here is an example of each:
With six months of data: bandwidth utilization spikes to 95% every Sunday from 1 AM until 5 AM, but outside that window it's usually around 60%.
In the alerts section you can suppress the bandwidth alerts on Sunday during that window, and at the same time change the threshold for all other times to, say, 70%.
With a few weeks of data: say the bandwidth utilization alert is set to 35% and the norm you see during your sample period is 40%.
You are not going to want to bump it all the way to 65%, as that could see you missing a leading indicator.
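To make the two examples a bit more concrete, here is a rough Python sketch of the logic. It is not how any particular monitoring product does it. The function names, the 5-point margin, the 10-point cap per change, and treating the Sunday window as ending just before 5 AM are all assumptions picked for illustration; only the 95%/60%/70% and 35%/40%/65% figures come from the examples above.

```python
from datetime import datetime

# Assumed values taken from the six-month example above.
SUPPRESS_WEEKDAY = 6       # datetime.weekday(): Monday is 0, Sunday is 6
SUPPRESS_START_HOUR = 1    # 1 AM
SUPPRESS_END_HOUR = 5      # 5 AM, treated as exclusive (assumption)
THRESHOLD_PCT = 70.0       # threshold for all other times


def should_alert(utilization_pct: float, when: datetime) -> bool:
    """Return True if a bandwidth alert should fire (hypothetical helper)."""
    in_window = (
        when.weekday() == SUPPRESS_WEEKDAY
        and SUPPRESS_START_HOUR <= when.hour < SUPPRESS_END_HOUR
    )
    if in_window:
        # Known, expected Sunday spike: non-actionable, so stay quiet.
        return False
    return utilization_pct >= THRESHOLD_PCT


def next_threshold(
    current_pct: float,
    observed_norm_pct: float,
    margin_pct: float = 5.0,     # assumed headroom above the observed norm
    max_step_pct: float = 10.0,  # assumed cap on how far one change may go
) -> float:
    """Suggest a small upward threshold change (hypothetical helper).

    Aims for the observed norm plus a margin, but never moves more than
    max_step_pct in one go, so a leading indicator is not tuned away.
    """
    target = observed_norm_pct + margin_pct
    step = min(max(target - current_pct, 0.0), max_step_pct)
    return current_pct + step


if __name__ == "__main__":
    sunday_2am = datetime(2024, 1, 7, 2, 0)   # a Sunday
    monday_2am = datetime(2024, 1, 8, 2, 0)   # a Monday
    print(should_alert(95.0, sunday_2am))     # suppressed window
    print(should_alert(95.0, monday_2am))     # outside the window
    print(next_threshold(35.0, 40.0))         # few-weeks example
```

With the few-weeks numbers, the suggested threshold moves from 35% to 45% rather than jumping straight to 65%. You would then watch for a while and repeat once you have more data.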
This has been in place for maybe a year, possibly more. I wanted to do this next week. If I am being truthful, I wanted to start this today, as my boss wants all green on the dashboard (this includes applications, CPU, HDD and memory). I tried to explain that I had run a report that gives us six months to a year of how a VM had been performing, and that we can start tweaking the thresholds for each CPU and then manage the alerts so that they reflect the new thresholds that have been set. I wanted to start with a small number of servers (only 50 that have the same corresponding peak values for the year), run that, see how the servers perform for a week, and then slowly change the thresholds as we receive more information.
The pushback was 'using resources isn't bad for a server', and then I was told to leave it for a future project. I am certain right now that I am still correct in my thinking, and it has left me slightly disgruntled.