Insights into HA status changes?

Hello, 

We have HA with 1 Active Server and 1 Standby. 

Periodically, HA switches servers.

We have increased the Member Down Interval to 9 minutes, but there have still been random failovers and failbacks.

From the HA log: 

----

ReasonOfFail: SuicideRule/Rev:46, StatusMessage: 'Pool member committed suicide.'

ReasonOfFail: NotResponding/Rev:19, StatusMessage: 'Pool member is not responding.' 

----
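If the HA log contains many entries like the two above, tallying them by reason can show which kind of failure dominates. The following is only a rough sketch: it assumes every entry of interest contains the `ReasonOfFail: <name>/Rev:<n>, StatusMessage: '<text>'` pattern exactly as quoted here, and the function name `tally_reasons` is made up for illustration. Real log lines may carry timestamps or other fields, which the pattern simply skips over.

```python
import re
from collections import Counter

# Pattern based only on the two log lines quoted above (an assumption about
# the general format, not a documented log schema).
LINE_RE = re.compile(
    r"ReasonOfFail:\s*(?P<reason>\w+)/Rev:(?P<rev>\d+),\s*"
    r"StatusMessage:\s*'(?P<msg>[^']*)'"
)

def tally_reasons(lines):
    """Count how often each ReasonOfFail value appears in the given lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[match.group("reason")] += 1
    return counts

# The two entries from the post, used as sample input.
sample = [
    "ReasonOfFail: SuicideRule/Rev:46, StatusMessage: 'Pool member committed suicide.'",
    "ReasonOfFail: NotResponding/Rev:19, StatusMessage: 'Pool member is not responding.'",
]

for reason, count in tally_reasons(sample).items():
    print(f"{reason}: {count}")
```

To run it against a real file, pass an open file object instead of `sample`, since iterating over a file yields its lines.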

I am looking for suggestions or ideas on possible root causes and potential remedies.

thank you

  • Hi

    Take a look at the bullets below. Hopefully going through these will help you get to the bottom of your problem:

    1. Check the Message Center for the time/date of the pool failover events:

      Change the filter to the hostname of the previous primary server of the HA pool, and review all event and audit events around the time of the failover. These might give you an insight into what is going on within the platform at the time of the failover.

    2. If you have SAM, apply the "Orion Server xxxx - Main Polling Engine" template to both the primary and secondary servers. This will give you visibility of the key Orion services. I suspect one or more of them are failing, which is causing the HA pool to fail over.

    3. Review the requirements for HA and ensure that nothing has been missed during the initial setup.

    4. Install Orion agents on both HA pool members. Your original server will have one already, but check that the secondary also has one, as you will then see the Application Dependency for the aforementioned SAM templates in point 2 (I'm a big fan of SAM :P).

    5. You could even set up a NetPath Service to monitor SWIS (TCP 17777), Orion HA (TCP 5671) and RabbitMQ (TCP 4369 and TCP 25672) between the two HA pool members. These ports need to be open bidirectionally without interruption for healthy HA operation.
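
    For a quick one-off check of the ports in point 5, something like the Python sketch below could be run from one pool member against the other. It only tests whether a TCP connection can be opened at that moment, so it is no substitute for continuous NetPath monitoring, and it needs to be run from each member towards the other to cover both directions. The hostname in the comment is a placeholder and the function names are made up for illustration.

```python
import socket

# Ports listed in point 5; the labels are only for display.
HA_PORTS = {
    "SWIS": 17777,
    "Orion HA": 5671,
    "RabbitMQ 4369": 4369,
    "RabbitMQ 25672": 25672,
}

def check_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False

def check_ha_ports(host, ports=HA_PORTS, timeout=3.0):
    """Check every port in `ports` on `host`; return {label: reachable}."""
    return {name: check_port(host, port, timeout) for name, port in ports.items()}

# Example usage (hypothetical hostname, replace with the other pool member):
# for name, ok in check_ha_ports("ha-secondary.example.com").items():
#     print(f"{name}: {'open' if ok else 'unreachable'}")
```

    A result of False for any port at the time of a failover would point towards the network path or a firewall rather than the Orion services themselves.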

    Hope the above helps! Let me know how you get on!

    -Jez
