Hi Team,
During this year we have noticed some issues with SQL performance on our multi-module Orion installation with 20 additional polling engines and 6 additional web servers.
SolarWinds is great for troubleshooting itself, so we have installed DPA in trial license and added our own Orion SQL instance.
Our first DPA results were very surprising. We have noticed that there were two SQL queries (green and purple), which are consuming most or our CPU wait time.
You can see how many times those queries were executed, around 1.600 times per hour.
We have checked deeper and found out that Log Management service (Log Viewer/Log Analyzer) service was executing them.
This was strange, as we did not send that many Syslog/Trap messages to our Orion installation. So why Log Viewer is consuming so much CPU wait time?
We have created a new support case 01253910 and we have investigate it further.
SolarWinds Development team reached out to us and explained that Log Viewer/Log Analyzer is frequently refreshing its list of IP addresses, for which Syslog/Trap messages are being saved in the database.
The issue here is that each poller needs to perform its own synchronization, which is performed very very often. So, in a very large environment, the synchronization itself can consume most of the SQL CPU resources.
Thankfully, there is an easy fix.
We needed to modify “Node Cache Refresh Interval” and “Node Cache Refresh cool down” in the Advanced Configuration. The change requires a restart of Log Viewer/Log Analyzer service (which Adv Config page can do for you)
By default, those values are 10 seconds and 5 minutes. Please check below DPA results and value changes for 1min&10min as well as 5min&20min.
The difference in CPU wait time is huge and we really noticed a better SQL performance.
After some testing, we decided to use 10 minutes for NodeCacheCooldownPeriod and 30 minutes for NodeCacheRefreshInterval.
I asked Support what are possible downsides and they replied that only for newly added nodes, syslog and trap messages would have up-to 30min delay in saving to the database. After 30 minutes (starting from the event of adding a node to Orion), all Syslog/Trap messages will be saved without any delay or drops, indefinitely.
Already added nodes are not affected. Also synchronization for LV/LA rules is not affected.
In conclusion, I do not see any issues or important disadvantages in performing those modifications and I strongly recommend performing them for your large environment
I believe that in the near future, support will perform some changes to those settings or will create a technical article, but for now there are no public document, so I wanted to share it with you.
Take care, Marcin.