
HA FAILOVER CAUSE

Does anyone know what the "real" is? :)

Pool 'SolarHA' experienced failover. Pool member 'HA1' changed state to Critical or Down. Pool member 'HA2' overtook responsibilities.
Reason of fail: Orion engine resource is critical due to unexpected engine services status. Service 'NetFlowService' has expected status Running but the real is Stopped.


  • The message is saying that the service should be 'Running' but its actual state is 'Stopped'. That is the reason for the failover.
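
    If you want to double-check that yourself on the primary poller, a quick query of the Windows service is enough. Below is a minimal sketch, assuming the Windows service name really is 'NetFlowService' as quoted in the failover message, and using the psutil package (service queries are Windows-only):

    ```python
    # Minimal sketch: report the current state of the NetFlow service on this box.
    # Assumes the Windows service name matches the one quoted in the failover message.
    import psutil

    SERVICE_NAME = "NetFlowService"  # from the HA message; verify the exact name in services.msc

    try:
        svc = psutil.win_service_get(SERVICE_NAME)
        print(f"{svc.display_name()}: expected Running, actual {svc.status()}")
    except psutil.NoSuchProcess:
        print(f"Service '{SERVICE_NAME}' not found - check the exact service name")
    ```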

  • Thanks. We are trying to understand the sequential process of how a fail-over is triggered and logged.

    1. At 9:01:50 pm the NetFlow service chokes.

    2. At 9:02:03 pm the NetFlow service recovers. 13 seconds! That doesn't seem long enough to trigger HA, if the Windows event log is to be trusted.

    3. NPM writes the fail-over event to the NPM event log at 9:36 pm.

    Why the 30+ minute delay between the NPM event and the Windows event log entries? And why is 30 seconds the maximum fail-over event window? That seems low. We're worried that the slightest hiccup is going to cause a fail-over, though of course ignoring real issues isn't the goal either.
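
    For what it's worth, one way to line the two logs up is to pull the Service Control Manager state-change events (Event ID 7036) straight out of the Windows System log and compare their timestamps against the 9:36 pm HA entry. The sketch below is just that, a sketch: it assumes it runs on the primary poller and that the service description contains 'NetFlow'; wevtutil itself is a built-in Windows tool.

    ```python
    # Sketch: dump recent Service Control Manager state-change events (ID 7036)
    # so their timestamps can be compared against the fail-over entry in the NPM/HA logs.
    import subprocess

    cmd = [
        "wevtutil", "qe", "System",
        "/q:*[System[Provider[@Name='Service Control Manager'] and (EventID=7036)]]",
        "/c:50",     # last 50 matching events
        "/rd:true",  # newest first
        "/f:text",
    ]
    output = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

    # Keep only the events that mention the NetFlow service.
    for block in output.split("Event["):
        if "NetFlow" in block:
            print("Event[" + block.strip(), end="\n\n")
    ```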

    [Screenshots attached: NetFlow crash event, NetFlow recovery event, NPM HA fail-over event]

  • There wasn't a later incident of the NetFlow service failing closer to ~9:35, was there? If you review the logs for the HA service you will see it normally checks all the services every minute (maybe it was 2 minutes, I can't recall for sure off the top of my head). When I'm investigating things like this I typically open up the HA logs for the time period in question, as well as the NetFlow logs, to try to pin down the root cause of the service going down and to see where the HA service noticed the issue and triggered the failover.
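
    For reference, a quick way to see when the HA service first noticed the problem is to grep its log for the service name around the failover window. A minimal sketch is below; the log path is an assumption (SolarWinds log locations vary by version), so point it at wherever your HA service actually writes its logs:

    ```python
    # Sketch: scan an HA service log for lines mentioning the NetFlow service or a failover.
    # The path below is hypothetical - adjust it to your actual HA log location,
    # commonly somewhere under C:\ProgramData\SolarWinds\Logs.
    from pathlib import Path

    LOG_FILE = Path(r"C:\ProgramData\SolarWinds\Logs\Orion\HighAvailability.log")  # hypothetical path
    KEYWORDS = ("NetFlowService", "failover", "Critical")

    for line in LOG_FILE.read_text(errors="ignore").splitlines():
        if any(keyword in line for keyword in KEYWORDS):
            print(line)
    ```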

    And yes, the HA service does trigger failovers even on brief outages. Basically, at the first sign of trouble it is going to kick over, and that can be a bit of a pain. What I've found causes me the least pain during those failovers is having the Additional Web Server product: instead of a failover dropping my connection to the active web server and landing me on a cold system with no cached data, I keep my connection in place, all the necessary IIS stuff stays cached, and the changeover happens on the back end relatively invisibly to users.

  • Thank you very much for the insight. Support recommends upgrading NTA, since its new SQL architecture should suppress the timeouts, so we're going in that direction. Hopefully a successful band-aid.

    I definitely agree about the Additional Web Server license. I got a quote last week.

    It seems HA is, as described in another thread, "overly sensitive..."

    I understand the need for the sensitivity, otherwise why even have HA? But for a large environment it will be another beastly aspect of the product to keep under control, on top of the down alerts we get for devices that engineers added without whitelisting the standby HA box. This will be our biggest hurdle in quelling the negative PR momentum around purchasing and deploying HA. Obviously, it isn't SolarWinds' responsibility to police customer engineers into following new, obscure processes. :)