We are mid-implementation of a Multi-Subnet HA solution on all physical servers, with two datacenters running 8 pollers each. We are running NPM 12.3 HF6.
** We have 10Gb links between the datacenters.
We are actively managing roughly 75,000 elements, about 9,500 elements per poller, well within supportable tolerances.
On the backend we are using beefy Dell database servers with 24 cores and 192 GB of memory. The databases live on 3.2 TB FusionIO drives, the fastest storage on the planet IMO.
We are running SQL Server 2016 Standard at each datacenter, using SQL Server 2016's flavor of "Always On" synchronization between the databases (we have tested both sync and async; currently running async).
** New to me: reading up on and trying to understand queue behavior in RabbitMQ.
Effectively, what we are seeing is that the CortexEvents queue keeps growing over time, to the point where we believe message processing into the database will never catch up.
03/23/2019 @ 2:30pm
- Ready: 6,455
- Unacked: 50
- Total: 6,505

03/23/2019
- Ready: 5,155
- Unacked: 101
- Total: 5,256

03/24/2019 @ 9:53am
- Ready: 10,275
- Unacked: 50
- Total: 10,325
** I have seen Ready > 45,000.
In the last snapshot above (03/24/2019), our publish rate fluctuates between 150/s and 570/s.
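To sanity-check whether a backlog like this can ever drain, here is a rough back-of-envelope sketch. The publish range is from our numbers above; the ack rates used in the example are hypothetical, since we don't have measured consumer rates to share:

```python
def drain_time_seconds(backlog, publish_rate, ack_rate):
    """Estimate seconds to drain a queue backlog.

    Returns float('inf') when consumers can't keep up
    (ack rate <= publish rate), since the backlog only grows.
    """
    net = ack_rate - publish_rate
    if net <= 0:
        return float("inf")
    return backlog / net

# Using the 03/24 snapshot above: 10,325 messages backlogged.
# If consumers ack at a hypothetical 600/s while publishers push 570/s,
# the net drain is only 30/s -- roughly 344 seconds to catch up.
print(drain_time_seconds(10325, 570, 600))

# If the ack rate ever sits at or below the publish rate, it never
# catches up, which matches the steady growth we are seeing:
print(drain_time_seconds(10325, 570, 500))
```

The takeaway: even a small, sustained gap between publish and ack rates turns into an unbounded queue.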
Is anyone else seeing this behavior? Any help/advice would be greatly appreciated.
We have case number 00282902 open for this issue.
Lastly, I have one request that will help me better understand the RabbitMQ/CortexEvents queue. If you could, please answer the following from your environment:
Number of pollers:
Number of total elements:
CortexEvents queue stats from RabbitMQ:
- Ready:
- Unacked:
- Total:
- Publish rate:
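If it helps anyone gather these numbers, the stats above map directly onto fields in the JSON that RabbitMQ's management API returns per queue (e.g. GET /api/queues/<vhost>/CortexEvents, which assumes the management plugin is enabled and you have credentials for it). A small sketch that pulls them out of one queue object:

```python
import json

def cortex_queue_stats(queue_json):
    """Extract the counters asked about above from one queue object
    returned by RabbitMQ's management HTTP API."""
    stats = queue_json.get("message_stats", {})
    return {
        "ready":   queue_json.get("messages_ready", 0),
        "unacked": queue_json.get("messages_unacknowledged", 0),
        "total":   queue_json.get("messages", 0),
        "publish_rate": stats.get("publish_details", {}).get("rate", 0.0),
    }

# Sample payload shaped like the management API's response, using the
# 03/24 numbers above (the publish rate here is illustrative):
sample = json.loads("""{
  "name": "CortexEvents",
  "messages_ready": 10275,
  "messages_unacknowledged": 50,
  "messages": 10325,
  "message_stats": {"publish_details": {"rate": 312.4}}
}""")
print(cortex_queue_stats(sample))
```

The same counters are visible in the management UI's queue page; this is just a way to collect them on a schedule.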
Thanks!