Orion & Distributed Deployments

Whether your organization is a multinational conglomerate or you're simply deploying Orion into Azure, AWS, or another public cloud provider, there are plenty of reasons to take advantage of Orion's distributed polling architecture.

One key reason you may decide to deploy Additional Polling Engines closer to the devices being polled is to ensure latency and packet loss measurements are representative of conditions at that location. As you can imagine, polling across large geographic expanses via the internet, or even dedicated circuits, incurs latency overhead and potential packet loss for every device polled at that remote location. This means that any problem along the path between the Main Poller at the central office and the router at the regional branch office will be multiplied by the number of nodes monitored at that remote location. An example of this is depicted in the diagram below. Similarly, if for any reason there's an outage anywhere along that same path, all entities monitored at that location would appear 'Down' in Orion, likely resulting in a flood of alerts.

If 'Availability' is a success metric your organization measures closely, then your bonus, raise, promotion, or possibly even your job itself might be tied closely to it. If you're an MSP, customer contracts may be structured around service delivery and, more specifically, availability. These contracts can, and often do, contain penalty clauses that could result in substantial financial penalties, or a breach of contract that would allow the customer to move to a competitor, if certain metrics are not achieved. As a result, you may be looking for ways to provide the most accurate representation of availability for the services at each of these locations. One of the best ways to achieve that is to perform the polling as close to the monitored devices as possible.

Another distinct advantage local Polling Engines have over remote or 'centralized' polling is their store-and-forward queuing system. The queue provides temporary local storage for polling results in the unfortunate event the Polling Engine is unable to communicate directly with the Orion SQL database server for any reason. Should this occur, the APE continues executing its assigned polling jobs on their scheduled intervals, collecting those values and storing them in the queue until a connection to the Orion SQL database server is reestablished. For those running Orion Platform 2020.2.6 and later, the message queue system used by Orion is RabbitMQ; for those running Orion Platform 2020.2.5 and earlier, the Microsoft Message Queuing service (MSMQ) was used.

Provided latency between the Additional Polling Engine (APE), Main Poller, and the Orion SQL database is less than 100ms, APEs can be deployed in branch offices, remote data centers, and even the cloud. When doing so, however, it's important to understand the bidirectional communication requirement between Orion's Main Poller and any Additional Polling Engines deployed. While this blog post isn't intended to serve as a lesson on how stateful vs. stateless network sessions are (or are not) managed in today's modern networks, the key takeaway is that bidirectionality, as defined here, means either the Orion Main Poller or an Additional Polling Engine can initiate TCP-based network communication to the other, independent of the other.

[Diagram: ports and directions of communication between the Orion Main Poller and an Additional Polling Engine]

As the diagram above outlines, all communication is unidirectional, with the exception of the SolarWinds Information Service (SWIS) running over TCP port 17777. That protocol is bidirectional, which can present some challenges when attempting to deploy an APE to a remote location across a NAT. This becomes even more interesting when both the Main Poller and the APE each reside behind their own site's NAT firewall. In that scenario, there's simply no direct route for each server to communicate with the other, which is why many customers who choose to deploy Additional Polling Engines at remote sites often utilize site-to-site VPNs. The site-to-site VPN eliminates the Network Address Translation (NAT) traversal problem by making the remote subnet appear and function no differently than any other local network segment.
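Before opening firewall change requests, it can help to verify which of these ports are actually reachable from each poller. The short Python sketch below simply attempts a TCP connection to each required port. Note that only TCP 17777 (SWIS) and TCP 5671 (RabbitMQ) come from this post; the hostnames and the port-to-host mapping are illustrative assumptions, so substitute your own pollers and port list.

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refusals, timeouts, and DNS resolution failures
        return False

# Hypothetical poller hostnames -- replace with the names from your
# Engines table. Run this from the Main Poller to test the APE, and
# from the APE to test the Main Poller (SWIS must work both ways).
REQUIRED = {
    "ORION-MAIN":  [17777, 5671],  # SWIS (bidirectional) + RabbitMQ
    "ORION-APE01": [17777],        # SWIS (bidirectional)
}

if __name__ == "__main__":
    for host, ports in REQUIRED.items():
        for port in ports:
            state = "open" if port_reachable(host, port) else "BLOCKED"
            print(f"{host}:{port} -> {state}")
```

Run from each side of the NAT, this quickly shows whether the hosts-file overrides and port forwards described later in this post are doing their job.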

I'm by no means advocating that anyone ditch their VPN, but what if a site-to-site VPN simply isn't an option? It's not uncommon for customers undergoing an acquisition to find that the equipment at the newly acquired company is incompatible with the corporate HQ VPN standard, or that there are substantial amounts of overlapping private (RFC 1918) IP address space between the two organizations. There are also situations where a VPN may not be desirable. Certainly VPNs add a level of encryption to the connection, but that also comes with additional overhead and latency, as tunneling can force packets that exceed the maximum transmission unit (MTU) to be fragmented. VPNs can also add a layer of complexity, as well as yet another failure point along the path, that some customers simply feel is unnecessary.

If for any reason you find yourself in a similar situation, it is possible to deploy an Orion Additional Polling Engine at a remote location where both ends of the connection are hidden behind a NAT. If you're concerned about snooping of data in motion, the SolarWinds Information Service (SWIS) connection running on TCP 17777, as well as the RabbitMQ connection running over TCP 5671, are fully encrypted end-to-end, similar to an SSL VPN. Also, since the 2017.3 Orion Platform release, even the SQL connection to the Orion database can be fully encrypted using TLS. Additionally, utilizing Windows Authentication for the Orion SQL database connection provides an additional layer of credential protection.

Example

Before we delve into the details, it's helpful to understand how communication is established between the Main and Additional Pollers. Contrary to popular belief, the IP address of either server is never referenced when communicating between APEs and the Main Poller. Instead, Orion relies upon the hostname of the machine it wishes to communicate with, as seen in the Engines table of the Orion database. This can be seen by running the following query against the Orion database, or by running the 'hostname' command from the Windows command prompt of each poller.

SELECT TOP 1000 * FROM [dbo].[Engines]

The result is most typically the short name of the Windows server, rather than the fully qualified domain name (FQDN) or DNS name you might more commonly use to connect to these servers. Using the hostname rather than the IP address to connect to the pollers, however, has its advantages. The most obvious is that the IP this name resolves to can be easily overridden just on the pollers themselves, simply by updating the hosts file, located by default under 'C:\Windows\System32\drivers\etc'. In the example below, there is no direct route between the two sites because each resides behind a NAT. Therefore, it will be necessary for Orion to communicate via the routable IP addresses of each firewall performing the NAT. This is a concept you're probably already familiar with, called port forwarding.

The idea is relatively simple: update the 'hosts' file on each poller so the name of the other poller it needs to communicate with resolves to the external, routable IP address of that site's firewall. From there, we simply add port forwarding rules at each firewall to forward the appropriate ports on to the polling engine behind the NAT, and voila!
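As a concrete illustration, the hosts-file entries on each poller might look like the following. The hostnames and IP addresses here are hypothetical placeholders (documentation address ranges), not values from a real deployment; use your own poller names from the Engines table and your firewalls' actual external addresses.

```
# On the Main Poller (HQ) -- C:\Windows\System32\drivers\etc\hosts
# Resolve the APE's hostname to the branch firewall's external IP
203.0.113.25    ORION-APE01

# On the Additional Polling Engine (branch) -- same hosts file path
# Resolve the Main Poller's hostname to the HQ firewall's external IP
198.51.100.10   ORION-MAIN
```

Each firewall then forwards the required ports (including TCP 17777 for SWIS, which must be reachable in both directions) to the internal address of the poller sitting behind it.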

While this solution does provide end-to-end encryption similar to a VPN, it does not provide authorization the way a VPN ordinarily would. For that reason, I always strongly suggest limiting inbound traffic on these forwarded ports to only the IP address of the other site's firewall. This should prevent external, nefarious individuals from probing these ports remotely.
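Exact syntax depends entirely on your firewall vendor, but as an illustrative sketch, on a Linux-based edge device that source restriction might look like this. All addresses are hypothetical: 198.51.100.10 stands in for the peer site's firewall and 10.20.0.5 for the poller behind the NAT.

```shell
# Forward SWIS (TCP 17777) to the internal poller, but only when the
# connection originates from the peer site's firewall
iptables -t nat -A PREROUTING -p tcp -s 198.51.100.10 --dport 17777 \
  -j DNAT --to-destination 10.20.0.5:17777

# Explicitly drop forwarded SWIS traffic arriving from any other source
iptables -A FORWARD -p tcp -d 10.20.0.5 --dport 17777 ! -s 198.51.100.10 -j DROP
```

The same source-address restriction should be applied to every forwarded port, not just 17777.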

Do you have an unconventional Orion deployment of your own, or are you planning a deployment architecture that seems like it might require an unorthodox approach? Let me know in the comments below. I'd love to hear from you. I'd especially appreciate those of you who have deployed Orion in unusual ways sharing those successes as well as failures. For example, what were you trying to accomplish that necessitated a unique approach? What worked for you, and what didn't?

Anonymous
  • So I have been tasked to come up with a replacement for our Orion monitoring seeing as I doubt we will be bringing it back online anytime soon, if at all. I am looking for something that is fairly scale-able as we have over 1k network devices (only counting switches and routers). We are currently using Statseeker (we had both up and running as redundancy) as our only monitoring tool and it leaves a bit to be desired, especially compared to Orion.

    SkylightPaycard