First off, I apologize for the lengthy post.
tl;dr version: high response times, lots of troubleshooting, can't figure it out.
I've had a response time problem with all the nodes in my network for years now (since at least 6.x) and I'm confused as to why it's happening. Orion records response times much higher than what a CLI session shows, even over the course of 30 minutes to an hour. I've tried everything I can think of to get Orion to report somewhat normal response times.
- Tuned the poller to slightly above the recommended maximum (usually by no more than 5 polls per second) so that adding new nodes doesn't force a re-tune every week. Setting it to exactly the recommended values makes no discernible difference in the results.
- Experimented with NIC settings such as IP checksum offload, on the theory that the server's processors might verify checksums faster than the chip on the NIC. No change.
- Configured my direct upstream firewall (which shows response times in Orion of 45ms!) to treat ICMP as high priority traffic. This had the opposite effect: response times jumped into the hundreds of milliseconds in Orion, but not in the CLI.
- Set the ping data portion in Orion NPM Settings/Network to 0 bytes. Unfortunately, one of my nodes stopped responding when I did this, so I had to set it to a minimum of 18 bytes. No change whatsoever.
- Took Wireshark captures of the pings, which appeared to show replies arriving mostly within 1 to 2 ms of the requests leaving the Orion server, matching what the CLI reports. Orion nonetheless records these times in the tens of milliseconds. See edit below.
- No ports are misconfigured as far as duplex / speed settings and none have errors.
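For anyone who wants to double-check capture timings instead of eyeballing them, here is a rough Python sketch that pairs each echo request with its reply and prints the round-trip times. The tuple layout (timestamp, ICMP type, identifier, sequence) and the sample values are made up for illustration; they would have to be filled in from a real capture export.

```python
def icmp_round_trips(packets):
    """Pair ICMP echo requests with their replies; return RTTs in ms.

    packets: iterable of (timestamp_seconds, icmp_type, ident, seq),
    where icmp_type is 8 for an echo request and 0 for an echo reply.
    This tuple layout is a hypothetical export format, not a Wireshark API.
    """
    pending = {}  # (ident, seq) -> timestamp of the request
    rtts = []
    for ts, icmp_type, ident, seq in packets:
        key = (ident, seq)
        if icmp_type == 8:
            pending[key] = ts
        elif icmp_type == 0 and key in pending:
            rtts.append((ts - pending.pop(key)) * 1000.0)
    return rtts


# Hypothetical sample: one fast reply and one slow reply.
sample = [
    (0.0000, 8, 1, 1),
    (0.0015, 0, 1, 1),
    (1.0000, 8, 1, 2),
    (1.0450, 0, 1, 2),
]
rtts = icmp_round_trips(sample)
for rtt in rtts:
    print(f"{rtt:.1f} ms")
```

Running the same function over the Orion pings and the CLI pings separately should make any difference between the two obvious.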
I'm not entirely convinced that the problem is with Orion or the server itself, but I'm not ruling it out. It seems that everything I do to pin down the problem area leads to conflicting results, sending me back to square one. For instance, response times for a box which is on the same subnet as the Orion server are what I would expect (<1ms), so that would lead one to believe the problem is further upstream. However, after performing the QoS configuration on my firewall, Orion response time recording showed increased latency while CLI response times remained the same. I would expect that if I had misconfigured something, then my CLI response times would reflect an increase just like Orion.
This response time problem has some pretty important repercussions. We can't provide this data to our customers as SLA proof that we're "up to speed", so to speak. I also can't configure alerts to go off on abnormally high response times for nodes, and on our regional network 15 to 20 ms is considered abnormal.
Has anyone experienced these same issues, and if so, were you able to resolve them? I'm willing to try just about anything to get these values to normal levels. My biggest question for SolarWinds is: does Orion use anything besides ping times to calculate response time? For instance, SNMP response time, database write time, etc.
EDIT: Comparing the Orion pings and the CLI pings in Wireshark actually shows vastly different results. The Orion ping responses really do come in between 10 and 75 ms (I was reading the times wrong originally), while the CLI pings show proper response times in the single digits (<3 ms). Furthermore, the CLI ping packets are larger than the Orion pings (74 bytes for the CLI compared to 62 bytes for Orion).
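To put the size difference in terms of ping data bytes, here is a quick bit of arithmetic in Python. It assumes the sizes Wireshark shows are untagged Ethernet frames without the FCS, carrying IPv4 with no options; if either assumption is wrong, the numbers shift accordingly.

```python
ETHERNET_HEADER = 14   # bytes; assumes no VLAN tag and no FCS in the capture
IPV4_HEADER = 20       # bytes; assumes no IP options
ICMP_ECHO_HEADER = 8   # bytes; type, code, checksum, identifier, sequence


def icmp_payload_bytes(frame_len):
    """Return the ICMP echo data size for a captured frame length."""
    return frame_len - (ETHERNET_HEADER + IPV4_HEADER + ICMP_ECHO_HEADER)


for label, frame_len in (("CLI", 74), ("Orion", 62)):
    print(label, icmp_payload_bytes(frame_len))
# CLI 32
# Orion 20
```

If those assumptions hold, the CLI pings carry 32 bytes of data, which matches the default Windows ping size, and the Orion pings carry about 20 bytes, close to the 18-byte setting mentioned above.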