The single servers are reporting and alerting well, but the CCR clusters are proving more trying. All of the servers are monitored in NPM by SNMP, and each server has numerous APM templates assigned with various component types. Some of the thresholds have been tuned, while others have been removed so that those components never raise alerts and are used for historical tracking only. The CCR cluster servers are also monitored by both NPM and APM. The intention is to give users visibility of the service (CCR status), but also to track and alert on the detail of the individual member servers.
Some services naturally run on the Active node but not on the Passive node. If I test for all of the Active services, the template will alarm on the Passive node. If I instead use different templates for the Active node and the Passive node, they will be fine until a failover occurs. After that, the Active-specific services will not be tested on the new Active node, which would still have the Passive template assigned to it.
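To make the template problem concrete, here is a minimal Python sketch of a check that picks its expected service list from the node's current role rather than from a fixed template. It does not use any APM API. The role labels, the function name and the service lists are illustrative placeholders, not a verified list of what runs on each CCR node.

```python
# Illustrative placeholders only: adjust to the services that really
# run on each role in your environment.
EXPECTED_SERVICES = {
    "active": {"MSExchangeIS", "MSExchangeRepl", "MSExchangeSA"},
    "passive": {"MSExchangeRepl"},
}


def missing_services(role, running):
    """Return the services expected for this role that are not running."""
    return sorted(EXPECTED_SERVICES[role] - set(running))


# A Passive node running only its own services is healthy.
print(missing_services("passive", ["MSExchangeRepl"]))
# The same service list on a node that has become Active is not.
print(missing_services("active", ["MSExchangeRepl"]))
```

With a check keyed on role like this, the same logic stays valid on both members after a failover, provided something can tell it which role the node currently holds.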
Our third-party mailbox backup solution will only back up from the Active node and will fail against the Passive node, so I need to be able to initiate a scripted changeover of the server name in the backup software. I therefore need to detect the failover and run a script, but not produce an alert.
We run a number of wall-mounted overview screens showing the overall topology and various production service workflows. Engineers can drill down through these to the detail of an incident, and they all roll up red on a single failure.
If a CCR Active server fails over to its Passive, I need:
1 to know that the failover has occurred
2 to know that the service to the users has been maintained
3 to run a script that changes the server specified in the data backup routines
4 to create an alarm so that the broken server gets fixed
5 but not to create an alarm for the failover itself
6 nor for the fact that one member is Passive and running fewer services
7 and, if using different Active and Passive templates, to swap them over.
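The seven requirements above could be expressed as decision logic along these lines. This is only a sketch of the behaviour I am after, not something Orion or APM provides: the function, the `Plan` structure and the action names (`set-backup-source`, `swap-templates`) are all hypothetical, and it assumes something else can supply the name of the current Active node and the up/down state of each member on every poll.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    events: list = field(default_factory=list)   # informational, never alerts
    alerts: list = field(default_factory=list)   # should turn a view red
    scripts: list = field(default_factory=list)  # actions to run


def plan_for_poll(previous_active, current_active, node_up):
    """Decide what to do after one poll of a two-node cluster.

    node_up maps node name -> True/False (hypothetical input).
    """
    plan = Plan()

    # Requirement 4: a member that is down must be alarmed so it gets fixed.
    for node, up in sorted(node_up.items()):
        if not up:
            plan.alerts.append(f"{node} is down")

    # Requirement 2: the service is maintained only if some node is
    # Active and that node is up.
    if current_active is None or not node_up.get(current_active, False):
        plan.alerts.append("cluster service unavailable")
        return plan

    # Requirements 1, 3, 5 and 7: a failover is recorded as an event
    # (no alert) and triggers the backup change and the template swap.
    if previous_active is not None and current_active != previous_active:
        plan.events.append(f"failover {previous_active} -> {current_active}")
        plan.scripts.append(("set-backup-source", current_active))
        plan.scripts.append(("swap-templates", current_active, previous_active))

    # Requirement 6: nothing is raised simply because a node is Passive.
    return plan


plan = plan_for_poll("NODE-A", "NODE-B", {"NODE-A": False, "NODE-B": True})
print(plan.events, plan.alerts, plan.scripts)
```

The point of separating `events` from `alerts` is that the failover itself can be recorded and acted on without rolling up red on the overview screens, while the failed member still does.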
I can’t see how to do this. Have I missed something very basic?
SolarWinds Orion Core 2010.1.0 SP1, APM 3.5, IPSLAMGR 3.5, NPM 10.0.0 SP1, NTA 3.6