5 Replies Latest reply on Jul 30, 2019 3:46 PM by rcbarr

    RabbitMQ - Multi-Subnet HA - CortexEvents Queue

    rcbarr

      We are in the middle of implementing a Multi-Subnet HA solution on all-physical servers across two datacenters, with 8 pollers at each datacenter.  We are running NPM 12.3 HF6.

      ** We have 10 Gb links between the datacenters.

       

      We are actively managing roughly 75,000 elements, about 9,500 elements per poller, which is well within supportable tolerances.

       

      On the backend we are using beefy Dell database servers with 24 cores and 192 GB of memory.  The databases live on 3.2 TB FusionIO drives, the fastest storage on the planet imo.

      We are running with SQL Server 2016 Standard at each datacenter utilizing SQL Server 2016's flavor of "Always On" synchronization between the databases (we have tested both sync and async, currently running async).

       

      ** Reading and understanding queue behavior in RabbitMQ is new to me.

      Effectively, what we are seeing is that the CortexEvents queue keeps growing over time, to the point where we believe message processing to the database will never catch up.

       

      Snapshot                Ready     Unacked   Total
      03/23/2019 @ 2:30pm     6,455     50        6,505
      03/23/2019              5,155     101       5,256
      03/24/2019 @ 9:53am     10,275    50        10,325

       

      ** I have seen Ready exceed 45,000.

       

       

      On the last chart above (03/24/2019), our publish rate fluctuates between 150/s and 570/s.
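For scale, the net growth implied by the first and last snapshots can be worked out directly. This is only a rough sketch: it assumes the queue grew steadily between the two timestamps, which the numbers alone cannot confirm, and the function name is illustrative.

```python
from datetime import datetime

# Queue totals (Ready + Unacked) from the first and last snapshots above.
first = (datetime(2019, 3, 23, 14, 30), 6505)
last = (datetime(2019, 3, 24, 9, 53), 10325)

def net_growth_per_second(start, end):
    """Return the average net queue growth in messages per second."""
    (t0, n0), (t1, n1) = start, end
    elapsed = (t1 - t0).total_seconds()
    return (n1 - n0) / elapsed

rate = net_growth_per_second(first, last)
print(f"net growth: {rate:.3f} msg/s ({rate * 3600:.0f} msg/hour)")
# prints: net growth: 0.055 msg/s (197 msg/hour)
```

Under that assumption the backlog grows by a few hundred messages per hour, a tiny fraction of the 150/s to 570/s publish rate, so most published messages are being consumed and only a small subset is piling up.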

       

      Is anyone else seeing this behavior?  Any help/advice would be greatly appreciated.

      We have case number 00282902 open for this issue.

       

      Lastly, I have a few questions that will help me better understand the RabbitMQ CortexEvents queue. If you could, please share the following from your environment:

       

      Number of Pollers:

      Number of Total Elements:

      CortexEvents Queue rates from RabbitMQ

           - Ready

           - Unacked

           - Total

           - Publish

       

      thanks
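For anyone who wants to collect these numbers without reading them off the management UI, here is a minimal sketch using the RabbitMQ management plugin's HTTP API (`GET /api/queues/{vhost}/{name}`). The base URL, port, vhost, credentials, and exact queue name are assumptions and will differ per Orion install; only the JSON field names come from the management API itself.

```python
import base64
import json
import urllib.parse
import urllib.request

def summarize_queue(queue):
    """Pick out Ready, Unacked, Total and publish rate from one queue object."""
    stats = queue.get("message_stats", {})
    return {
        "ready": queue.get("messages_ready", 0),
        "unacked": queue.get("messages_unacknowledged", 0),
        "total": queue.get("messages", 0),
        "publish_rate": stats.get("publish_details", {}).get("rate", 0.0),
    }

def fetch_queue(base_url, vhost, name, user, password):
    """Fetch one queue object from the management API using basic auth."""
    url = "{}/api/queues/{}/{}".format(
        base_url.rstrip("/"),
        urllib.parse.quote(vhost, safe=""),  # the default vhost "/" becomes %2F
        urllib.parse.quote(name, safe=""),
    )
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": "Basic " + token})
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Offline example shaped like an API response, using the 03/24 figures.
sample = {
    "messages_ready": 10275,
    "messages_unacknowledged": 50,
    "messages": 10325,
    "message_stats": {"publish_details": {"rate": 150.0}},
}
print(summarize_queue(sample))
```

Against a live server the call would look something like `summarize_queue(fetch_queue("http://localhost:15672", "/", "CortexEvents", user, password))`, where every argument is a placeholder to adjust for your environment.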

        • Re: RabbitMQ - Multi-Subnet HA - CortexEvents Queue
          rcbarr

          One thing I have learned: RabbitMQ handles pub/sub communication between the pollers, while MSMQ is used as temporary storage for polling results.

          • Re: RabbitMQ - Multi-Subnet HA - CortexEvents Queue
            rcbarr

            We are also implementing a new hotfix (buddy drop) from Core development today to address the queuing issue we are seeing on an Orion Platform 2018.4 HF3 implementation.  The issue seems to be as follows:

             

            Note: We have a multi-subnet HA implementation.

             

            ############################################################

            The issue with RabbitMQ is that it is trying to talk to the services on both the
            active Primary and the inactive Primary poller. Since the inactive server has no
            services running, no one consumes the message and it just sits there. This
            happens when something like a Cortex job gets scheduled and sends its message
            to the Cortex service on both servers: the active server processes the message,
            while the copy on the inactive server just sits there. So, yes, your assumptions
            are correct, but it is messages to the services and not metrics.

            ############################################################
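One way to spot this condition is to look for queues that hold ready messages but have no consumers. The sketch below filters a list of queue objects shaped like the output of the management API's `GET /api/queues` (fields `name`, `consumers`, `messages_ready`); the queue names in the sample are made up for illustration and are not real Orion queue names.

```python
def unconsumed_queues(queues, min_ready=1):
    """Return names of queues that hold ready messages but have zero consumers."""
    return sorted(
        q["name"]
        for q in queues
        if q.get("consumers", 0) == 0 and q.get("messages_ready", 0) >= min_ready
    )

# Hypothetical data: one queue served by the active member, one orphaned
# queue belonging to the inactive member, and one idle empty queue.
sample_queues = [
    {"name": "queue-on-active-member", "consumers": 1, "messages_ready": 12},
    {"name": "queue-on-inactive-member", "consumers": 0, "messages_ready": 10275},
    {"name": "empty-idle-queue", "consumers": 0, "messages_ready": 0},
]
print(unconsumed_queues(sample_queues))
# prints: ['queue-on-inactive-member']
```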

              • Re: RabbitMQ - Multi-Subnet HA - CortexEvents Queue
                rcbarr

                We implemented a solution today and are letting it cook for a while.

                  • Re: RabbitMQ - Multi-Subnet HA - CortexEvents Queue
                    johnlad

                    Curious what the solution was, as we have the same issue running a primary and HA.  It seems the ready messages in the queue just steadily increase and never get processed.

                      • Re: RabbitMQ - Multi-Subnet HA - CortexEvents Queue
                        rcbarr

                        Hey John, for us it was a development-provided hotfix that solved the problem.  (I am not sure whether the fix was included in the GA 12.4x release.)

                         

                        Fortunately for us, one other customer had seen the problem before we did and had already engaged development, who had built a fix and were validating it.

                         

                        The BuddyDrop was 49752. It included one DLL, SolarWinds.Orion.Swis.PubSub.dll, and the fix required Orion Platform 2018.4 or higher.

                        ------

                        The fix addresses an issue where a RabbitMQ queue is created for each HA Main pool member, and the down member's queue keeps growing because it has no consumer.

                        It changes how RabbitMQ bus:// queues are created. After the fix is applied, only one bus:// queue is created per HA pool.

                        ------
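To make the before/after behavior concrete, here is a toy model of the two queue layouts described in the fix note. It is not SolarWinds code and uses no RabbitMQ library; the member names and the `simulate` function are purely illustrative.

```python
from collections import deque

def simulate(messages, members, queue_per_member):
    """Toy model: return the leftover message count per queue.

    members maps a pool member name to True (services running) or
    False (standby, nothing consuming).
    """
    if queue_per_member:
        # Before the fix: every pool member has its own queue and gets a copy.
        queues = {name: deque(range(messages)) for name in members}
        consumers = {name: [name] if active else [] for name, active in members.items()}
    else:
        # After the fix: a single queue is shared by the whole HA pool.
        queues = {"pool": deque(range(messages))}
        consumers = {"pool": [name for name, active in members.items() if active]}
    for name, queue in queues.items():
        if consumers[name]:
            queue.clear()  # any running consumer drains its queue
    return {name: len(queue) for name, queue in queues.items()}

pool = {"active-primary": True, "standby-primary": False}
print(simulate(1000, pool, queue_per_member=True))
# prints: {'active-primary': 0, 'standby-primary': 1000}
print(simulate(1000, pool, queue_per_member=False))
# prints: {'pool': 0}
```

With one queue per member, the standby's copy never drains; with one queue per pool, whichever member is active consumes everything.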

                         

                        We are in production with HA today, and it is working like a charm...