
RabbitMQ - Queues without Consumers

Hello,

We are facing an issue with a Multi-Subnet HA SolarWinds deployment across datacenters. While one server is HA Active, a queue named bus://<hostname>/Subscription/SolarWinds/Cortex/Service-SolarWinds/DataCollection/ continues to grow, where <hostname> is the Standby server, and no consumer is subscribed to the queue. This queue grew to over 400K messages and crashed the Active server, leaving the GUI inaccessible. Forcing a failover by rebooting the Active server gave the newly active server a consumer for that out-of-control queue; the queue drained, and we started getting all sorts of alerts from the backlog of messages.

We are realizing that this is now no longer a real-time Monitoring System, since critical Monitoring messages are being "stuck" in this Queue.  We are not receiving real-time Monitoring of our Production Systems by SolarWinds, since SolarWinds is just "caching" critical monitoring messages into a RabbitMQ Queue with no consumer.

SolarWinds support is very slow and unresponsive, so hoping someone here might be able to help.

Thanks.
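
A queue like the one described above (a growing backlog with zero consumers) can be spotted before it takes the server down. The following is a minimal Python sketch, assuming the RabbitMQ management plugin is enabled and reachable on the Orion server; that may not be true of the RabbitMQ instance SolarWinds bundles. The base URL, credentials, and threshold are hypothetical placeholders. It relies on the `name`, `messages`, and `consumers` fields that the management HTTP API returns from `GET /api/queues`.

```python
import base64
import json
import urllib.request


def find_unconsumed_queues(queues, min_messages=1):
    """Return (name, messages) pairs for queues that hold a backlog but have
    no consumer attached, sorted with the largest backlog first."""
    stuck = [
        (q["name"], q.get("messages", 0))
        for q in queues
        if q.get("consumers", 0) == 0 and q.get("messages", 0) >= min_messages
    ]
    return sorted(stuck, key=lambda item: item[1], reverse=True)


def fetch_queues(base_url, user, password):
    """Fetch the queue list from the management HTTP API (GET /api/queues).

    Assumption: the management plugin is enabled and base_url points at it,
    for example "http://localhost:15672" on a default install.
    """
    req = urllib.request.Request(base_url.rstrip("/") + "/api/queues")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


# Example usage (hypothetical URL, credentials, and threshold):
# queues = fetch_queues("http://localhost:15672", "monitor_user", "secret")
# for name, backlog in find_unconsumed_queues(queues, min_messages=1000):
#     print(f"{backlog:>10}  {name}")
```

Running something like this on a schedule would at least raise a flag while the backlog is still in the thousands rather than the hundreds of thousands.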

  • If there is no subscriber then it's likely because the RabbitMQ service can't connect to the other member in the pool. Also note that RabbitMQ is used only for Pub/Sub communication. It is not used at all as temporary storage for monitoring or statistic collection. I recommend you double check that all the appropriate ports are open between members of the HA pool.

    Port requirements

    PORT  | PROTOCOL | SERVICE/PROCESS                      | DIRECTION     | DESCRIPTION
    53    | UDP      | SolarWinds High Availability Service | outbound      | Used when failing over with a virtual hostname to update the virtual hostname's DNS entry and for periodic monitoring.
    4369  | TCP      | RabbitMQ                             | bidirectional | Port 4369 must be open between the main and secondary servers to allow RabbitMQ clustering between the two servers. These ports exchange EPMD and Erlang distribution protocol messages for RabbitMQ. They do not need to be open in additional polling engine pools.
    5671  | TCP      | SolarWinds High Availability         | bidirectional | Port 5671 must be open into the HA pool with the main Orion server from all Orion servers. Traffic is encrypted using TLS 1.2.
    17777 | TCP      | SolarWinds installer                 | bidirectional | Used when installing the standby server software. You can close this port after installation.
    25672 | TCP      | RabbitMQ                             | bidirectional | Port 25672 must be open between the main and secondary servers to allow RabbitMQ clustering between the two servers. These ports exchange EPMD and Erlang distribution protocol messages for RabbitMQ. They do not need to be open in additional polling engine pools.
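
To double-check the TCP ports listed under Port requirements above, a quick reachability test can be run from each HA pool member against the other. This is a minimal Python sketch, not a SolarWinds tool: the hostname in the usage comment is a hypothetical placeholder, and a successful TCP connect only shows that the port is reachable, not that RabbitMQ clustering is healthy. UDP 53 is not covered.

```python
import socket

# TCP ports taken from the port requirements table above.
HA_TCP_PORTS = {
    4369: "RabbitMQ clustering",
    5671: "SolarWinds High Availability (TLS)",
    17777: "SolarWinds installer (only needed during standby install)",
    25672: "RabbitMQ clustering",
}


def check_ports(host, ports, timeout=3.0, connect=socket.create_connection):
    """Try a TCP connect to each port on host.

    Returns {port: True/False}. The connect parameter exists only so the
    function can be exercised without a network; by default it uses
    socket.create_connection.
    """
    results = {}
    for port in ports:
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            results[port] = False
        else:
            conn.close()
            results[port] = True
    return results


# Example usage (hypothetical standby hostname):
# for port, ok in check_ports("orion-standby.example.local", HA_TCP_PORTS).items():
#     status = "open" if ok else "BLOCKED"
#     print(f"{port:>5}  {status:<8} {HA_TCP_PORTS[port]}")
```

Run it in both directions, since the table lists these ports as bidirectional.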
  • Hello,

    Yes, thank you, but we've checked the ports a million times.  I just got a response from SolarWinds Development and we are indeed hitting a bug since upgrading to the latest version.  This will most likely impact any customer using Multi-Subnet HA.  For now, the only work-around is to disable HA and run our Core Orion in a SPOF, otherwise messages get caught in the inactive queue with no consumer, and eventually the Active member's CPU hits 100% and the GUI is unavailable.

  • nickcat, I appreciate the feedback. We are tracking this bug internally under CORE-12365.

  • We are trying to implement HA now, 8 pollers per datacenter.  We are seeing very similar issues with the queues.

  • aLTeReGo, we put in a UDT buddy drop today (provided by dev for us and for what we are seeing with UDT). It did not fix it; we are still queuing on both MSMQ and RabbitMQ. I will tell A. tomorrow about CORE-12365 and see if that issue tracks with what we are seeing here. Thanks for the response.

    nickcat, we just upgraded, hoping to fix our message queuing issue. No go. We too are set up with a Multi-Subnet HA solution. Thank goodness we are not the only ones seeing this behavior; I thought I was losing my mind. We have dug into and tweaked everything we can think of and are actively working with a very good application engineer. Suffice it to say it has kicked our a$$.

  • nickcat, what does your MSMQ look like? Look under Computer Management/Services and Applications/Message Queuing/Private Queues. If you're queuing here, yeah, problem.

    Sort on "Number of Messages", what do you see?

    For us, the [solarwinds/collector/processingqueue/solarwinds.udt.wireless.snmp] queue is the only one queuing. It is at 5674 right now and will continue to grow; it should always be ZERO.

  • That's not to say queuing with RabbitMQ is not a problem, at all. We are seeing that too, but to this point we thought it was a symptom of a larger problem, idk.

  • Oh, and we have you beat, nickcat: we saw the RabbitMQ/CortexEvents queue get to 19 million. Even so, syslog, traps, events, alerts, polling completion rate, and database syncs were all solid, with no apparent issues. It's crazy. Go Dell, go physical; a little FusionIO on the backend helps a lot too :-)

  • aLTeReGo, we put in the fix for CORE-12365 today, and it worked for RabbitMQ.

    We still have an issue with MSMQ, one single queue (but we are killing it, meaning running a lot of messages through it), but we will continue to work with A. and get that one going our way as well.

    > solarwinds/collector/processingqueue/solarwinds.udt.wireless.snmp

    We are managing a whole lot of wireless APs.

    Anyway

    Thank you very much for tracking that bug above and bringing it here, very, very much appreciated.

    I owe you one, you and A.

    -Richard