I've been having sporadic issues with one of my polling engines. I'm running NPM 10.2.2 and on my primary engine polling seems to just stop even though all services are operational. I have gone through these steps from this KB but the issue keeps occurring.
http://knowledgebase.solarwinds.com/kb/questions/2517/Collector+Data+Processor+and+Collector+Polling+Controller+start+and+stop+intermittently
I'm not seeing any services stop but the poller just ceases collecting data. I'm opening a case with support and will reference this thread.
Can you post here your case number?
How big were the collector SDF files before you reset them?
Thanks
I did it yesterday and don't remember the sizes.
As of right now the polling controller is around 200 MB, the job tracker 75 MB, and the Job Engine v2 61 MB
Case # 350334
Running diags now for support... as soon as I'm done I'm going to replace all the SDF files, as this seems to get the system running again.
We've seen a similar issue on one of our pollers as well (NPM 10.2.2, Core 2011.2.2). We rebuilt the box thinking the OS was corrupted, but the issue did not disappear with the rebuild.
A consistent symptom is that we can try to stop all services but the Job Engine v2 never finishes stopping. Reboot of the server is required.
One thing we've noticed is that the issue seems to occur anytime that OS patches are applied to the server but the server isn't rebooted.
We haven't opened a ticket on this. We figured we'd wait until we migrate to NPM 10.3 so we can skip the answers of "it's fixed in the next version".
JobEngine V2 will not stop using the service manager for me either. I had to kill the process using task manager before doing my repairs.
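If you'd rather not click through Task Manager every time this happens, here's a minimal Python sketch of that same force-kill step using the standard Windows taskkill command. The process image name shown in the example comment is an assumption, not confirmed by anyone in this thread; check Task Manager on your own poller for the exact executable name before using anything like this.

```python
import subprocess

def kill_command(process_name: str) -> list[str]:
    """Build the taskkill invocation for a hung Windows process.
    /IM selects the process by image name, /F forces termination,
    and /T takes any child processes down with it."""
    return ["taskkill", "/IM", process_name, "/F", "/T"]

def force_kill(process_name: str) -> bool:
    """Run taskkill and report whether it claimed success."""
    result = subprocess.run(kill_command(process_name),
                            capture_output=True, text=True)
    return result.returncode == 0

# Example (the image name here is a guess -- verify it in Task
# Manager on your poller first):
# force_kill("SWJobEngineWorker2.exe")
```

Obviously force-killing a service process is a last resort; it is only worth scripting because, as noted above, the service manager leaves Job Engine v2 stuck in "stopping" otherwise.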
Really hate that the system censors my use of the word K .I .L. L
Hah! I think it made the post funnier when I inserted my favorite 4-letter word while reading.
Apparently not a unix-oriented vocabulary in the filters.
Pardon my interjection, but I would appreciate it if SolarWinds would produce an official white paper on this issue. The tendency of these SDF files to require periodic regeneration has been around a long time. I realize that there is a KB article on the topic, but a white paper could also shed light on why this happens, and even how to predict it. Could those of us with SAM benefit from using File Size monitoring to keep an eye on how the SDF files grow and alert us when they reach a dangerous size?
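For what it's worth, here's a minimal Python sketch of the kind of check such a File Size monitor would be doing, assuming the collector SDF files all sit in one directory. Both the directory path and the 500 MB threshold are assumptions (the 500 MB figure is the point at which a reply elsewhere in this thread suggests collecting diagnostics and opening a case).

```python
from pathlib import Path

# Assumed threshold: elsewhere in this thread, 500 MB is suggested
# as the point to collect diagnostics and open a case.
THRESHOLD_BYTES = 500 * 1024 * 1024

def oversized_sdfs(directory: str,
                   threshold: int = THRESHOLD_BYTES) -> list[tuple[str, int]]:
    """Return (filename, size_in_bytes) for every .sdf file in the
    given directory whose size is at or above the threshold."""
    hits = []
    for sdf in Path(directory).glob("*.sdf"):
        size = sdf.stat().st_size
        if size >= threshold:
            hits.append((sdf.name, size))
    return hits
```

In SAM itself you'd express the same idea with the built-in File Size monitor and an alert on the returned statistic; the sketch is just the logic it would be evaluating.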
So to make matters even worse, the repair of the Orion Core services has truncated my trap rules again. This marks the third time this has happened (it also did it on my last two NPM upgrades). Specifically, the Trap Detail column of the TrapRules table gets truncated down to 30 characters, which breaks most of the rules we have in place. I'm beyond frustrated right now, as I have team after team calling to find out why they are getting traps that should be caught by the filters we have written. Working with my DBA to restore that table.
Agreed. In fact, I'm working on a script to do it for me when the need arises.
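In case it helps anyone else, here's a minimal Python sketch of the detection half of such a script, assuming you can pull rule names and their Trap Detail text out of the TrapRules table (the actual SQL/database access is left out, and how you'd wire that up is your call). The 30-character cutoff is the one observed above.

```python
# Observed in this thread: the repair truncates Trap Detail to 30 chars.
TRUNCATION_LENGTH = 30

def suspect_truncated(rules: dict[str, str]) -> list[str]:
    """Given {rule_name: trap_detail} pulled from the TrapRules table,
    return the names whose detail text sits exactly at the truncation
    cutoff -- a strong hint that a repair/upgrade clipped them."""
    return [name for name, detail in rules.items()
            if len(detail) == TRUNCATION_LENGTH]
```

A rule whose detail was legitimately exactly 30 characters would be a false positive, so this only flags candidates for review against your DBA's backup rather than restoring anything automatically.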
Hi,
There was a leak in the Collector SDFs in some special situations, but this was fixed in 10.2.2. Maybe (just maybe) your upgrade didn't finish cleanly, and that's why you can still see this type of issue. If you are on 10.2.2 and your SDF is over 500 MB, please collect diagnostics plus the SDF and open a case. I would love to look at it.
Otherwise it's caused by something else, and I can only suggest opening a support ticket.
I also have this issue and have been trying to figure out how to fix it. I haven't opened a support ticket yet because I need to get it working right away, so I reboot, and the problem with that is it then becomes harder to troubleshoot. This started happening regularly (2-3 times a month) within the last 3-4 months. I am in the process of rebuilding the server OS and splitting things out to see if it is load based (moving syslog/traps to Kiwi, NCM to its own server, adding the additional website, etc. I wish I could separate SAM from NPM). After I am done with that I was going to open a support case. Now I am also going to watch this thread to see what else I can steal from you guys.
We have killed the filter so that it no longer censors the word "kill".