OK, so I think I understand how the hot standby engine works, and I also know how I would like Orion to be more proactive and redundant when it comes to monitoring itself. By that I mean just being alerted that something is wrong; I am not even talking about it failing over to something or trying to fix itself. The whole issue right now is that pollers or other functions will stop and you will have no clue unless you manually check or someone says "hey, we have not gotten any alerts in some time???".
So with all that said, I would like to actually propose a possible solution instead of just asking for a feature.
I have done this in the past with other less capable monitoring systems and it has always worked well.
The scenario is that for a nominal fee you buy what is basically a second, stripped-down version of your monitoring software (optimally placed at a secondary site with its own internet pipe, so alerting still works even if the main site's internet goes down). In Orion's case it could obviously be tailored for this purpose, i.e. it would be a special version.
It would monitor the following:
-all pollers (their actual polling function)
-all core services on all Orion servers
-all Orion servers themselves (i.e. node up/node down)
-required paths/circuits/nodes for alerting, like the internet circuit at the main site where email would normally go out in one form or another (BES or otherwise)
-job processing on pollers
I am sure others can add more to this list but these are my primary concerns.
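To make the list above a bit more concrete, here is a rough sketch of the kind of check the secondary system could run. This is only an illustration in Python, not anything Orion actually provides: the `Heartbeat` record, the `stale_components` function, the component names, and the thresholds are all hypothetical. The assumption is that the secondary system can record, by whatever means, the last time it saw evidence that each item was doing its job.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Heartbeat:
    component: str        # hypothetical label, e.g. "poller-01 polling"
    last_seen: datetime   # last time the watchdog saw evidence of activity
    max_age: timedelta    # how long silence is tolerated before alerting


def stale_components(heartbeats, now):
    """Return the names of components whose last activity is too old."""
    return [hb.component for hb in heartbeats if now - hb.last_seen > hb.max_age]


if __name__ == "__main__":
    # Fixed timestamp so the example is repeatable; a real check would use the current time.
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    beats = [
        Heartbeat("poller-01 polling", now - timedelta(minutes=2), timedelta(minutes=10)),
        Heartbeat("poller-02 job processing", now - timedelta(minutes=45), timedelta(minutes=10)),
        Heartbeat("main-site internet circuit", now - timedelta(seconds=30), timedelta(minutes=5)),
    ]
    for name in stale_components(beats, now):
        # Placeholder for a real notification sent over the secondary site's own path.
        print(f"ALERT (sent from secondary site): no activity from {name}")
```

With the sample data, only "poller-02 job processing" gets flagged, since it has been silent for 45 minutes against a 10 minute limit. The point is that the check looks for the absence of activity, which is exactly the failure that goes unnoticed today.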
Now the idea here is that this secondary system can be placed at another site that has its own internet pipe. This way, if ANYTHING at the main site fails that would stop even alerting, you would still know about it, even if it's just a text message to a core group of folks, assuming something has happened at the main site and the primary alerting paths are broken.
My main idea is to plug ALL holes where something might fail, in the Orion system or outside of it, that would stop it from alerting you that something is wrong and it cannot do its job.