In a multi-poller 24x7 shop like ours, it is vital that we can take one polling engine out of service at a time for regular maintenance, upgrades, service pack installation or new module additions without impacting the other engines and, ideally, without impacting any data collection. At the moment, any such change requires a full shutdown of all services on all polling engines, so our support desk becomes blind to issues and is very reluctant to authorize change requests. Our multi-server shutdown and restart typically takes around half an hour, and together with the actual work this often causes a blind spot of about an hour.
There must be a better way of doing maintenance. Perhaps all of a polling engine's services could be swapped out to a fully capable Hot Standby while that polling engine is being upgraded. This would, however, require support for differing patch levels and even for different modules (as in adding a new module or upgrading VoIP to IPSLA) across servers while the database continues to be used.
Some other multi-server solutions use multiple primary pollers, each with its own SQL Express instance, attached to a master primary. This allows any of the pollers to go offline for up to 2 weeks and, on reconnection, replicate its data back to the main SQL server. This gives them the piecemeal maintenance and upgrade path that I need for Orion. Server licensing is still enforced within the cluster, because the extra pollers cannot be operated autonomously: they must connect to the core within 2 weeks.
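To make the store-and-forward idea concrete, here is a minimal Python sketch of how I picture it. It is only an illustration under my own assumptions, not how Orion or any other product actually works: the `Poller` class, the `MAX_OFFLINE` constant and the plain list standing in for the local SQL Express store are all hypothetical names.

```python
from datetime import datetime, timedelta

# Hypothetical limit, taken from the two-week window described above.
MAX_OFFLINE = timedelta(days=14)


class Poller:
    """Hypothetical poller that buffers samples locally and replays them on reconnect."""

    def __init__(self, name, last_sync):
        self.name = name
        self.last_sync = last_sync   # last successful contact with the core
        self.local_buffer = []       # stands in for the local SQL Express store

    def license_valid(self, now):
        # The poller may run on its own only within the offline window.
        return now - self.last_sync <= MAX_OFFLINE

    def collect(self, sample, now):
        # Refuse to collect once the poller has been away from the core too long.
        if not self.license_valid(now):
            raise RuntimeError(
                f"{self.name}: offline for more than {MAX_OFFLINE.days} days"
            )
        self.local_buffer.append((now, sample))

    def reconnect(self, core_store, now):
        # Replicate everything buffered while offline, then restart the window.
        replayed = len(self.local_buffer)
        core_store.extend(self.local_buffer)
        self.local_buffer.clear()
        self.last_sync = now
        return replayed
```

The point of the sketch is that the same timestamp drives both behaviours: it lets a poller keep collecting while the core is down for maintenance, and it stops that poller from being run as a free-standing server indefinitely.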
This could lead to an almost transparent maintenance and upgrade path: fail a server's services over to a local alternate Hot Standby host, perform the maintenance or upgrades on the server, and then fail the services back to their natural server, without suffering any outage of data collection or web site delivery.
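The sequence I have in mind can be sketched as a simple loop. Again, this is a toy model under my own assumptions rather than anything the product provides today: `Engine`, `move_services` and `rolling_maintenance` are hypothetical names, and the sketch assumes the Hot Standby starts with no services of its own.

```python
class Engine:
    """Hypothetical host that runs a set of named services."""

    def __init__(self, name):
        self.name = name
        self.services = set()   # services currently running on this host


def move_services(source, target):
    """Move every service from source to target and return what was moved."""
    moved = set(source.services)
    target.services |= moved
    source.services.clear()
    return moved


def rolling_maintenance(engines, standby, upgrade):
    """Upgrade engines one at a time, parking their services on the standby."""
    log = []
    for engine in engines:
        parked = move_services(engine, standby)    # fail over to the standby
        log.append((engine.name, "failed over", len(parked)))
        upgrade(engine)                            # engine carries no services here
        move_services(standby, engine)             # fail back to the natural server
        log.append((engine.name, "failed back", len(engine.services)))
    return log
```

Only one engine is ever out of service at a time, and its services keep running on the standby throughout, which is exactly the property that would let the support desk keep its view of the network during a change.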
What plans are there regarding resilient maintenance?