Last night (roughly 12:30 a.m. U.S. Central Time) I saw my NPM report WAN outages affecting some of my businesses across multiple regions and states. The outages lasted about three to five minutes, which is what you might expect from a power event. Everything came back up without issue.
I opened a ticket with my WAN Service Provider, listed the cities and circuits involved, and asked for an explanation of the outage. Usually the WAN provider is very good about proactively notifying us of scheduled maintenance, and they're also excellent about owning up to mistakes on their part. In this case, some of the circuits involved use secondary or tertiary regional WAN providers like AT&T. It's not unusual for sites that share nodes this way to fail and come back at the same time. But this outage involved only some of those sites, and some of the other circuits that bounced last night don't use secondary providers at all.
While waiting for a response to my query from the WAN provider, I SSH'd into some of the affected remote routers to review their logs. I was surprised to find no physical or logical errors reported on their WAN interfaces. Usually I'll see that the WAN interface went down, or the entire router rebooted, or a BGP or EIGRP neighbor relationship was lost at the time NPM reported the site down. But nothing like that showed up in the logs. As far as the routers are concerned, they never had a WAN outage.
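For anyone who wants to repeat that log review across several routers, here is a minimal Python sketch of the check described above. It assumes Cisco IOS-style syslog mnemonics for link, line-protocol, BGP, EIGRP, and restart events, which may not match your platform. The function name and the sample log lines are made up for illustration and are not taken from my routers.

```python
import re

# Assumption: Cisco IOS-style syslog mnemonics. Adjust for other platforms.
WAN_EVENT_PATTERN = re.compile(
    r"%(LINK-3-UPDOWN|LINEPROTO-5-UPDOWN|BGP-5-ADJCHANGE"
    r"|DUAL-5-NBRCHANGE|SYS-5-RESTART)"
)


def find_wan_events(log_lines):
    """Return the log lines that look like link, neighbor, or reboot events."""
    return [line for line in log_lines if WAN_EVENT_PATTERN.search(line)]


# Hypothetical sample lines for illustration only.
sample_log = [
    "Mar  1 00:31:02: %SYS-5-CONFIG_I: Configured from console by admin on vty0",
    "Mar  1 00:31:40: %LINEPROTO-5-UPDOWN: Line protocol on Interface "
    "Serial0/0/0, changed state to down",
]

events = find_wan_events(sample_log)
print(len(events))  # prints 1: only the line-protocol event matches
```

An empty result for the window in which NPM reported the site down is what I saw: the router itself recorded nothing that looks like a WAN event.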
Further, no one from any of my 7x24 sites called our Help Desk to report a problem. And THAT'S unusual if there really were an outage.
When the WAN Service Provider's NOC responded to my query, this is what they shared:
"You are using our Solarwinds to look at graphing for your sites. When a puller goes down they do not report anything on a site. The sites were up but no data was being pulled from those nodes."
To be clear, I do not use the WAN provider's Orion. My organization runs its own regional NPM pollers, which report to our main NPM instance. The pollers were not down, and our main NPM instance was not down. Further, when a poller goes down, the site is not reported down. The site's status just stops updating, as far as I can see.
Yet the WAN provider's Orion polling was unavailable at the same time that MY Orion said some of my remote sites were down.
A very strange coincidence.
It would be even stranger if other Thwack members saw outages reported on their systems at the same time as mine. I'm leaning towards a bug in Orion, but I'd love to hear from other Thwack members about whether they saw anything flaky last night. In the meantime, I'll be waiting on further feedback from my WAN provider.