I've been having sporadic issues with one of my polling engines. I'm running NPM 10.2.2 and on my primary engine polling seems to just stop even though all services are operational. I have gone through these steps from this KB but the issue keeps occurring.
http://knowledgebase.solarwinds.com/kb/questions/2517/Collector+Data+Processor+and+Collector+Polling+Controller+start+and+stop+intermittently
I'm not seeing any services stop but the poller just ceases collecting data. I'm opening a case with support and will reference this thread.
Can you post your case number here?
How big were the collector SDF files before you reset them?
Thanks
I did it yesterday and don't remember the sizes.
As of right now the polling controller is around 200 MB, the job tracker 75 MB, and the Job Engine V2 61 MB.
Case # 350334
Running diags now for support...as soon as I'm done I'm going to replace all the SDF files as this seems to get the system running again.
We've seen a similar issue on one of our pollers as well: NPM 10.2.2, Core 2011.2.2. We rebuilt the box thinking the OS was corrupted, but the issue did not disappear with the rebuild.
A consistent symptom is that we can try to stop all services but the Job Engine v2 never finishes stopping. Reboot of the server is required.
One thing we've noticed is that the issue seems to occur anytime that OS patches are applied to the server but the server isn't rebooted.
We haven't opened a ticket on this. We figured we'd wait until we migrate to NPM 10.3 so we can skip the answers of "it's fixed in the next version".
JobEngine V2 will not stop using the service manager for me either. I had to kill the process using task manager before doing my repairs.
Really hate that the system censors my use of the word K .I .L. L
Hah! I think it made the post funnier when I inserted my favorite 4-letter word while reading.
Apparently not a unix-oriented vocabulary in the filters.
Pardon my interjection, but I would appreciate it if SolarWinds would produce an official white paper on this issue. The tendency of these SDF files to require periodic regeneration has been around a long time. I realize that there is a KB on the topic, but a white paper could also shed light on why this happens, and even how to predict it. Could those of us with SAM benefit from using File Size monitoring to keep an eye on how the SDF files grow and alert us when they reach a dangerous size?
So to make matters even worse, the repair of the Orion Core services has truncated my trap rules again. This marks the third time this has happened (it did it on my last two NPM upgrades). Specifically, the Trap Detail column of the TrapRules table gets truncated down to 30 characters, which breaks most of the rules we have in place. I'm beyond frustrated right now as I have team after team calling to find out why they are getting traps that should be caught by the filters we have written. Working with my DBA to restore that table.
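If anyone else gets hit by this, a quick way to spot damaged rules after a repair is to look for detail text that landed at exactly 30 characters. This is only a sketch - I'm assuming the column is named TrapDetails and lives in the usual NetPerfMon database; check the actual column name in your schema first:

```sql
-- Find trap rules whose detail text was cut to exactly 30 characters,
-- the truncation symptom described above.
-- NOTE: table/column names are an assumption; verify against your DB version.
SELECT *
FROM [NetPerfMon].[dbo].[TrapRules]
WHERE LEN(TrapDetails) = 30
```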
Agreed. In fact, I'm working on a script to do it for me when the need arises.
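For anyone wanting to do the same, the backup half of that script can be as simple as snapshotting the table before any upgrade or repair. A sketch, assuming the standard NetPerfMon database name - adjust to match yours:

```sql
-- Snapshot the trap rules into a backup table before an upgrade/repair,
-- so a truncated TrapRules table can be restored from the copy afterwards.
-- NOTE: database name and a pre-existing TrapRules_Backup would need dropping first.
SELECT *
INTO [NetPerfMon].[dbo].[TrapRules_Backup]
FROM [NetPerfMon].[dbo].[TrapRules]
```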
Hi,
There was a leak in the Collector SDFs in some special situations, but this was fixed in 10.2.2. Maybe (just maybe) your upgrade didn't complete cleanly and that's why you can still see this type of issue. If you are on 10.2.2 and your SDF is over 500 MB, please collect diagnostics plus the SDF and open a case. I would love to look at it.
Otherwise it's caused by something else and I can only suggest opening a support ticket.
I also have this issue and have been trying to figure out how to fix it. I haven't opened a support ticket yet because I want to get it working right away, so I reboot, and the problem with that is it then becomes harder to troubleshoot. This started happening regularly (2-3 times a month) within the last 3-4 months. I am in the process of rebuilding the server OS and splitting things out to see if it is load based (moving syslog/traps to Kiwi, NCM to its own server, adding the additional website, etc. - I wish I could separate SAM from NPM). After I am done with that I was going to open a support case. Now I am also going to watch this thread to see what else I can steal from you guys.
Oh and as for it being load based...it happened earlier this week and I only had 3 devices on that polling engine at the time.
We have killed the filter so that it no longer censors the word "kill".
Looks like it just happened again. Restarting the services doesn't correct it; I have to rebuild the SDF files to get polling to resume.
That differs from our symptoms. Restart of services does restart the polling. But as I mentioned before, we end up rebooting the server to get the services restarted.
Looks like the NPM polling restarts by killing the services but the SAM polling does not unless I rebuild the SDF files.
What about the table "APM_CurrentComponentStatus"?
Something like:
SELECT TOP 1000 *
FROM [NetPerfMon].[dbo].[APM_CurrentComponentStatus]
where Availability = 1 and (PercentCPU is null or PercentMemory is null or PercentVirtualMemory is null)
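When a poller hangs, rows for components that are up but returning NULL stats pile up, so a per-engine count makes it easier to see which poller is affected. This is just a sketch: it assumes APM_CurrentComponentStatus carries a NodeID column to join back to Nodes; if your version only exposes ApplicationID, you'd need to join through the APM application table first.

```sql
-- Count stale SAM component rows per polling engine so the hung poller stands out.
-- ASSUMPTION: ccs.NodeID exists; verify the join column in your schema.
SELECT n.EngineID, COUNT(*) AS StaleComponents
FROM [NetPerfMon].[dbo].[APM_CurrentComponentStatus] ccs
INNER JOIN [NetPerfMon].[dbo].[Nodes] n ON ccs.NodeID = n.NodeID
WHERE ccs.Availability = 1
  AND (ccs.PercentCPU IS NULL OR ccs.PercentMemory IS NULL OR ccs.PercentVirtualMemory IS NULL)
GROUP BY n.EngineID
ORDER BY StaleComponents DESC
```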
Thanks steve....I'll test that at the next outage....Had another one at 12:45. I have since moved almost everything off of poller 1 to minimize impact except a small handful of "test nodes" so I can see if it fails.
Still waiting to hear from support....submitted diags almost 7 hours ago.
I apologize; support emailed 2 hours ago and it went to my junk folder. The recommendation was to repair, which I informed them was done prior to opening the case.
Ok, I have finished my first attempt at a single alert per engine for this:
What it is - Looks for Stale Last Sync or Null in RAM or CPU for at least 50% of the nodes monitored by an engine
Requirements - The engine must be monitored by Orion, and its IP address must be the same in the Nodes table as in the Engines table
Custom SQL Node alert with:
where IP_Address in (
    Select Engines.IP
    From Nodes
    inner Join (
        Select COUNT(*) as BadNodes, EngineID
        From Nodes
        WHERE (Nodes.Status = '1' AND (Nodes.PercentMemoryUsed IS NULL OR Nodes.CPULoad IS NULL
            OR DATEDIFF(ss, Nodes.LastSync, getdate())/Nodes.PollInterval >= 10))
        Group By EngineID
    ) as BadNodeCount on Nodes.EngineID = BadNodeCount.EngineID
    inner join Engines on Nodes.EngineID = Engines.EngineID
    WHERE Nodes.Status = '1'
    Group by BadNodes, Engines.IP
    having BadNodes * 100 / COUNT(*) > 50
)
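To sanity-check the idea outside the alert engine, here's a self-contained version you can run directly in Management Studio. It lists each engine with its stale-node count and percentage instead of feeding an alert, so you can eyeball the numbers first (a sketch against the standard Nodes/Engines tables; "stale" here means up nodes with NULL CPU/memory stats or a LastSync more than 10 poll intervals old - tune the threshold to taste):

```sql
-- Per-engine view of how many up nodes look stale, to verify the alert logic
-- before wiring it into a custom SQL alert.
SELECT e.IP,
       b.BadNodes,
       COUNT(*) AS TotalNodes,
       b.BadNodes * 100 / COUNT(*) AS PctBad
FROM Nodes n
INNER JOIN (
    SELECT EngineID, COUNT(*) AS BadNodes
    FROM Nodes
    WHERE Status = '1'
      AND (PercentMemoryUsed IS NULL OR CPULoad IS NULL
           OR DATEDIFF(ss, LastSync, GETDATE()) / PollInterval >= 10)
    GROUP BY EngineID
) b ON n.EngineID = b.EngineID
INNER JOIN Engines e ON n.EngineID = e.EngineID
WHERE n.Status = '1'
GROUP BY b.BadNodes, e.IP
ORDER BY PctBad DESC
```

Any engine showing PctBad above 50 is one the alert above would fire on.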