We are mid-implementation of a Multi-Subnet HA solution on all physical servers, with two datacenters running 8 pollers each. We are running NPM 12.3 HF6.
** We have 10Gb links between the datacenters.
We are actively managing roughly 75,000 elements, about 9,500 elements per poller, well within supportable tolerances.
On the backend we are using beefy Dell database servers with 24 cores and 192 GB of memory. The databases live on 3.2 TB FusionIO drives, the fastest storage on the planet IMO.
We are running SQL Server 2016 Standard at each datacenter, using SQL Server 2016's flavor of "Always On" synchronization between the databases (we have tested both sync and async; currently running async).
** New to me: reading up on and trying to understand queue behavior in RabbitMQ.
Effectively, what we are seeing is that the CortexEvents queue keeps growing over time, to the point where we believe message processing into the database will never catch up.
03/23/2019 @ 2:30pm
- Ready: 6,455
- Unacked: 50
- Total: 6,505

03/23/2019
- Ready: 5,155
- Unacked: 101
- Total: 5,256

03/24/2019 @ 9:53am
- Ready: 10,275
- Unacked: 50
- Total: 10,325
** I have seen Ready > 45,000.
In the last snapshot above (03/24/2019), our publish rate fluctuates between 150/s and 570/s.
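To sanity-check whether a backlog like this can ever drain, here is a rough back-of-envelope sketch. The publish range is from our numbers above; the ack rates used in the example are hypothetical, since we don't have measured consumer rates to share:

```python
def drain_time_seconds(backlog, publish_rate, ack_rate):
    """Estimate seconds to drain a queue backlog.

    Returns float('inf') when consumers can't keep up
    (ack rate <= publish rate), since the backlog only grows.
    """
    net = ack_rate - publish_rate
    if net <= 0:
        return float("inf")
    return backlog / net

# Using the 03/24 snapshot above: 10,325 messages backlogged.
# If consumers ack at a hypothetical 600/s while publishers push 570/s,
# the net drain is only 30/s -- roughly 344 seconds to catch up.
print(drain_time_seconds(10325, 570, 600))

# If the ack rate ever sits at or below the publish rate, it never
# catches up, which matches the steady growth we are seeing:
print(drain_time_seconds(10325, 570, 500))
```

The takeaway: even a small, sustained gap between publish and ack rates turns into an unbounded queue.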
Is anyone else seeing this behavior? Any help/advice would be greatly appreciated.
We have case number 00282902 open for this issue.
Lastly, I have one request that will help me better understand the RabbitMQ/CortexEvents queue. If you could, please answer the following from your environment:
Number of pollers:
Number of total elements:
CortexEvents queue stats from RabbitMQ:
- Ready:
- Unacked:
- Total:
- Publish rate:
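If it helps anyone gather these numbers, the stats above map directly onto fields in the JSON that RabbitMQ's management API returns per queue (e.g. GET /api/queues/<vhost>/CortexEvents, which assumes the management plugin is enabled and you have credentials for it). A small sketch that pulls them out of one queue object:

```python
import json

def cortex_queue_stats(queue_json):
    """Extract the counters asked about above from one queue object
    returned by RabbitMQ's management HTTP API."""
    stats = queue_json.get("message_stats", {})
    return {
        "ready":   queue_json.get("messages_ready", 0),
        "unacked": queue_json.get("messages_unacknowledged", 0),
        "total":   queue_json.get("messages", 0),
        "publish_rate": stats.get("publish_details", {}).get("rate", 0.0),
    }

# Sample payload shaped like the management API's response, using the
# 03/24 numbers above (the publish rate here is illustrative):
sample = json.loads("""{
  "name": "CortexEvents",
  "messages_ready": 10275,
  "messages_unacknowledged": 50,
  "messages": 10325,
  "message_stats": {"publish_details": {"rate": 312.4}}
}""")
print(cortex_queue_stats(sample))
```

The same counters are visible in the management UI's queue page; this is just a way to collect them on a schedule.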
Thanks!