I ran into a problem where the Alerting Engine in Orion hit an error and stopped sending us alerts. The error occurred because the max pool size was reached, which resulted in a loss of connection to the database. I contacted support and they showed me how to increase the size to 1000 connections so that this should not happen again. Thinking it over since then, though, the incident exposed a single point of failure in our monitoring environment. The cliché of "who's watching the watcher," I guess. I need an automated way to know when alerts are being triggered but not sent.
I thought about using Task Scheduler to set up a task that sends an email whenever there is an error with the Alerting Engine. I've never actually done anything with Task Scheduler, though, so I'm not sure how to go about it. I'm also not sure whether this would be adequate to keep us from being blindsided by something like this in the future. Has anyone else experienced something similar, and what steps did you take to prevent it? I would appreciate any and all comments.
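
To make the Task Scheduler idea concrete, here is a rough sketch of the kind of script such a task might run. It is only an illustration under several assumptions: the log path, SMTP host, and email addresses are hypothetical placeholders, and the error patterns are a guess at what the Alerting Engine log would contain, so they would need to be checked against the real log before relying on this.

```python
import re
import smtplib
import sys
from email.message import EmailMessage

# Hypothetical values: replace with your own log path, mail relay, and addresses.
LOG_PATH = r"C:\path\to\alerting-engine.log"
SMTP_HOST = "smtp.example.com"
SENDER = "watchdog@example.com"
RECIPIENT = "oncall@example.com"

# Assumed error signatures; adjust to whatever the log actually contains.
ERROR_PATTERN = re.compile(r"max pool size|\bERROR\b", re.IGNORECASE)


def find_errors(lines):
    """Return the stripped lines that match ERROR_PATTERN."""
    return [line.strip() for line in lines if ERROR_PATTERN.search(line)]


def build_alert(matches):
    """Build a plain-text email summarising the matching log lines."""
    msg = EmailMessage()
    msg["Subject"] = f"Alerting Engine watchdog: {len(matches)} error line(s) found"
    msg["From"] = SENDER
    msg["To"] = RECIPIENT
    msg.set_content("\n".join(matches))
    return msg


def main():
    # Read the whole log and collect any lines that look like errors.
    with open(LOG_PATH, encoding="utf-8", errors="replace") as fh:
        matches = find_errors(fh)
    # Only send mail when something matched.
    if matches:
        with smtplib.SMTP(SMTP_HOST) as smtp:
            smtp.send_message(build_alert(matches))
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

A task along the lines of `schtasks /Create /SC MINUTE /MO 5 /TN "AlertingWatchdog" /TR "python C:\path\to\watchdog.py"` (task name and script path are again placeholders) would run it every five minutes. As written, the sketch rescans the whole log on every run, so it would keep re-sending mail for old errors until the log rotates. It also only catches failures that get logged, and it should send through a mail path that does not depend on Orion itself.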