I recently applied the following template to measure the performance of my SW environment: Orion Server.apm-template
I have concerns about one specific component of this template, Job Scheduler v2: Results Notified Error. The documentation describes it as follows:
Job Scheduler v2: Results Notified Error
This monitor returns the number of errors that occurred when sending the results back. This value should be zero at all times.
The value should be zero at all times, yet I have values over 10K on one of my polling engines. In fact, 3 out of 4 of my servers show a value greater than 0. However, the value isn't incrementing on any of them. Can someone explain in more detail what this counter is? Could these be old errors? Is there a way to clear it and see whether any new errors develop? What particular problem does this indicate, and how do I go about resolving it?
Any help would be appreciated.
This counter represents a serious underlying issue in my environments.
When I get this alert, it's a big deal that requires immediate attention. All statistics collection for many agents and SNMP devices stops for us.
A service restart or reboot is always required for me.
Just my 2 cents, in case you have a real issue rather than a non-critical statistical counter issue with the monitor itself.
I have the same problem. I have "Count as Difference" set to True and am still receiving the error, as msawyer said. The static value is about 107k. How can I solve this problem? I couldn't find any solution on the internet. Any ideas?
Thanks for your time.
This metric shows the number of failed attempts to deliver a polling job result from the Job Engine v2 service to a registered job result consumer. The consumer usually runs in the Collector Polling Controller service or the Orion Module Engine service. If there are active polling jobs producing results and the result consumer is not running (not accepting results), you can observe quite fast growth of this counter. The job engine has a buffer for results and a retry mechanism for result delivery, so short-term growth of this counter (e.g. during a restart of services, when the consumer service needs more time to start than the job engine) is not a real problem. However, because this buffer is limited, permanent growth of this counter means that polling data is being discarded and there will be gaps in historical data.
The value of this counter should be cleared by restarting the Job Engine v2 service.
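To make the buffer-and-retry behaviour above a bit more concrete, here is a toy model in Python. It is only a sketch of the idea, not how Job Engine v2 is actually implemented: the class `ResultBuffer`, its fields, and the capacity of 3 are all made up for illustration.

```python
from collections import deque


class ResultBuffer:
    """Hypothetical model of a bounded result buffer with delivery retries."""

    def __init__(self, capacity):
        self.pending = deque()
        self.capacity = capacity
        self.notify_errors = 0  # plays the role of "Results Notified Error"
        self.discarded = 0      # results lost because the buffer was full

    def submit(self, result):
        # When the buffer is full, the oldest result is dropped (assumed policy).
        if len(self.pending) >= self.capacity:
            self.pending.popleft()
            self.discarded += 1
        self.pending.append(result)

    def flush(self, consumer_up):
        # Try to deliver everything; a failed attempt is counted and retried later.
        delivered = 0
        while self.pending:
            if not consumer_up:
                self.notify_errors += 1
                break
            self.pending.popleft()
            delivered += 1
        return delivered


buf = ResultBuffer(capacity=3)
for cycle in range(5):           # consumer is down for five polling cycles
    buf.submit(f"result-{cycle}")
    buf.flush(consumer_up=False)
delivered = buf.flush(consumer_up=True)  # consumer comes back
```

In this model a short outage only raises the error counter, because the buffered results are delivered on a later retry. Once the outage outlasts the buffer, results start being discarded, which corresponds to the gaps in historical data. The counter also stays at its last value after recovery, which matches the "stuck" non-zero values reported in this thread.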
This is very probably a cumulative error counter, so it would be logical to have it defined as "count as difference" so that it warns only when its value grows. Does it drop to zero when you restart the Orion services on the given engine?
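For anyone unsure what "count as difference" would mean for a cumulative counter, here is a rough Python sketch of the idea. It is not SAM's actual implementation; the function name `count_as_difference` and the reset handling are my own assumptions. The sample readings reuse the 0 to 781 jump reported in this thread.

```python
def count_as_difference(samples):
    """Turn cumulative counter readings into per-poll differences."""
    deltas = []
    previous = None
    for value in samples:
        if previous is None:
            # The first poll has no baseline, so no difference can be reported yet.
            deltas.append(None)
        elif value < previous:
            # Counter went backwards: assume it was reset (e.g. by a service restart).
            deltas.append(value)
        else:
            deltas.append(value - previous)
        previous = value
    return deltas


readings = [0, 781, 781, 781]  # raw counter value on four consecutive polls
deltas = count_as_difference(readings)
```

With a threshold applied to the differences rather than the raw value, a counter that is stuck at a non-zero value reports 0 on every later poll, and only fresh errors would trigger the alert.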
I've logged the absence of the "count statistic as difference" option being set for this component monitor as a bug with this template. It is being tracked internally as FB141719.
On which metric do you have the 'Count Statistic as Difference' enabled? Also note that the effect is not immediate. You will need to wait until the next scheduled poll before the value will be updated.
FYI, I'm running 10.7 with SAM 6.1 & the template has not changed for this.
My counter went from 0 yesterday to 781 today, and is stuck at 781 now.
Of course, it is flagged as a critical problem in the Orion server template.
....I'm modifying the template for this now.