I've been having sporadic issues with one of my polling engines. I'm running NPM 10.2.2 and on my primary engine polling seems to just stop even though all services are operational. I have gone through these steps from this KB but the issue keeps occurring.
http://knowledgebase.solarwinds.com/kb/questions/2517/Collector+Data+Processor+and+Collector+Polling+Controller+start+and+stop+intermittently
I'm not seeing any services stop but the poller just ceases collecting data. I'm opening a case with support and will reference this thread.
Can you post your case number here?
How big were the collector SDF files before you reset them?
Thanks
I did it yesterday and don't remember the sizes.
As of right now the polling controller is around 200 MB, the job tracker 75 MB, and the Job Engine V2 61 MB.
Case # 350334
Running diags now for support...as soon as I'm done I'm going to replace all the SDF files as this seems to get the system running again.
We've seen a similar issue on one of our pollers as well: NPM 10.2.2, Core 2011.2.2. We rebuilt the box thinking the OS was corrupted, but the issue did not disappear with the rebuild.
A consistent symptom is that we can try to stop all services but the Job Engine v2 never finishes stopping. Reboot of the server is required.
One thing we've noticed is that the issue seems to occur anytime that OS patches are applied to the server but the server isn't rebooted.
We haven't opened a ticket on this. We figured we'd wait until we migrate to NPM 10.3 so we can skip the answers of "it's fixed in the next version".
JobEngine V2 will not stop using the service manager for me either. I had to kill the process using task manager before doing my repairs.
Really hate that the system censors my use of the word K .I .L. L
Hah! I think it made the post funnier when I inserted my favorite 4-letter word while reading.
Apparently not a unix-oriented vocabulary in the filters.
Pardon my interjection, but I would appreciate it if SolarWinds would produce an official white paper on this issue. The tendency of these SDF files to require periodic regeneration has been around a long time. I realize that there is a KB on the topic, but a white paper could also shed light on why this happens, and even how to predict it. Could those of us with SAM benefit from using File Size monitoring to keep an eye on how the SDF files grow and alert us when they reach a dangerous size?
So to make matters even worse, the repair of the Orion Core services has truncated my trap rules again. This marks the third time this has happened (it did it on my last two NPM upgrades). Specifically, the Trap Detail column of the TrapRules table gets truncated down to 30 characters, which breaks most of the rules we have in place. I'm beyond frustrated right now as I have team after team calling to find out why they are getting traps that should be caught by the filters we have written. Working with my DBA to restore that table.
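If anyone else gets hit by this, a quick way to spot damaged rules after a repair is to look for detail text that landed at exactly 30 characters. This is only a sketch - I'm assuming the column is named TrapDetails and lives in the usual NetPerfMon database; check the actual column name in your schema first:

```sql
-- Find trap rules whose detail text was cut to exactly 30 characters,
-- the truncation symptom described above.
-- NOTE: table/column names are an assumption; verify against your DB version.
SELECT *
FROM [NetPerfMon].[dbo].[TrapRules]
WHERE LEN(TrapDetails) = 30
```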
Agreed. In fact, I'm working on a script to do it for me when the need arises.
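For anyone wanting to do the same, the backup half of that script can be as simple as snapshotting the table before any upgrade or repair. A sketch, assuming the standard NetPerfMon database name - adjust to match yours:

```sql
-- Snapshot the trap rules into a backup table before an upgrade/repair,
-- so a truncated TrapRules table can be restored from the copy afterwards.
-- NOTE: database name and a pre-existing TrapRules_Backup would need dropping first.
SELECT *
INTO [NetPerfMon].[dbo].[TrapRules_Backup]
FROM [NetPerfMon].[dbo].[TrapRules]
```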
Hi,
There was a leak in the Collector SDFs in some special situations, but this was fixed in 10.2.2. Maybe (just maybe) your upgrade didn't complete cleanly and that's why you can still see this type of issue. If you are on 10.2.2 and your SDF is over 500 MB, please collect diagnostics plus the SDF and open a case. I would love to look at it.
Otherwise it's caused by something else and I can only suggest opening a support ticket.
I also have this issue and have been trying to figure out how to fix it. I haven't opened a support ticket yet because I want to get it working right away, so I reboot, and the problem with that is it then becomes harder to troubleshoot. This started happening regularly (2-3 times a month) within the last 3-4 months. I am in the process of rebuilding the server OS and splitting things out to see if it is load based (moving syslog/traps to Kiwi, NCM to its own server, adding the additional website, etc. - I wish I could separate SAM from NPM). After I am done with that I was going to open a support case. Now I am also going to watch this thread to see what else I can steal from you guys.
Oh and as for it being load based...it happened earlier this week and I only had 3 devices on that polling engine at the time.
We have killed the filter so that it no longer censors the word "kill".
Looks like it just happened again. Restarting the services doesn't correct it; I have to rebuild the SDF files to get polling to resume.
That differs from our symptoms. Restart of services does restart the polling. But as I mentioned before, we end up rebooting the server to get the services restarted.
Looks like the NPM polling restarts by killing the services but the SAM polling does not unless I rebuild the SDF files.
What about the table "APM_CurrentComponentStatus"?
Something like:
SELECT TOP 1000 *
FROM [NetPerfMon].[dbo].[APM_CurrentComponentStatus]
where Availability = 1 and (PercentCPU is null or PercentMemory is null or PercentVirtualMemory is null)
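When a poller hangs, rows for components that are up but returning NULL stats pile up, so a per-engine count makes it easier to see which poller is affected. This is just a sketch: it assumes APM_CurrentComponentStatus carries a NodeID column to join back to Nodes; if your version only exposes ApplicationID, you'd need to join through the APM application table first.

```sql
-- Count stale SAM component rows per polling engine so the hung poller stands out.
-- ASSUMPTION: ccs.NodeID exists; verify the join column in your schema.
SELECT n.EngineID, COUNT(*) AS StaleComponents
FROM [NetPerfMon].[dbo].[APM_CurrentComponentStatus] ccs
INNER JOIN [NetPerfMon].[dbo].[Nodes] n ON ccs.NodeID = n.NodeID
WHERE ccs.Availability = 1
  AND (ccs.PercentCPU IS NULL OR ccs.PercentMemory IS NULL OR ccs.PercentVirtualMemory IS NULL)
GROUP BY n.EngineID
ORDER BY StaleComponents DESC
```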
Thanks steve....I'll test that at the next outage....Had another one at 12:45. I have since moved almost everything off of poller 1 to minimize impact except a small handful of "test nodes" so I can see if it fails.
Still waiting to hear from support....submitted diags almost 7 hours ago.
I apologize; support emailed 2 hours ago and it went to my junk folder. The recommendation was to repair, which I informed them was done prior to opening the case.
Ok, I have finished my first attempt at a single alert per engine for this:
What it is - Looks for Stale Last Sync or Null in RAM or CPU for at least 50% of the nodes monitored by an engine
Requirements - The engine must be monitored by Orion, and its IP address must be the same in the Nodes table as in the Engines table
Custom SQL Node alert with:
where IP_Address in (
    Select Engines.IP
    From Nodes
    inner Join (
        Select COUNT(*) as BadNodes, EngineID
        From Nodes
        WHERE (Nodes.Status = '1' AND (Nodes.PercentMemoryUsed IS NULL OR Nodes.CPULoad IS NULL
            OR DATEDIFF(ss, Nodes.LastSync, getdate())/Nodes.PollInterval >= 10))
        Group By EngineID
    ) as BadNodeCount on Nodes.EngineID = BadNodeCount.EngineID
    inner join Engines on Nodes.EngineID = Engines.EngineID
    WHERE Nodes.Status = '1'
    Group by BadNodes, Engines.IP
    having BadNodes * 100 / COUNT(*) > 50
)
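To sanity-check the idea outside the alert engine, here's a self-contained version you can run directly in Management Studio. It lists each engine with its stale-node count and percentage instead of feeding an alert, so you can eyeball the numbers first (a sketch against the standard Nodes/Engines tables; "stale" here means up nodes with NULL CPU/memory stats or a LastSync more than 10 poll intervals old - tune the threshold to taste):

```sql
-- Per-engine view of how many up nodes look stale, to verify the alert logic
-- before wiring it into a custom SQL alert.
SELECT e.IP,
       b.BadNodes,
       COUNT(*) AS TotalNodes,
       b.BadNodes * 100 / COUNT(*) AS PctBad
FROM Nodes n
INNER JOIN (
    SELECT EngineID, COUNT(*) AS BadNodes
    FROM Nodes
    WHERE Status = '1'
      AND (PercentMemoryUsed IS NULL OR CPULoad IS NULL
           OR DATEDIFF(ss, LastSync, GETDATE()) / PollInterval >= 10)
    GROUP BY EngineID
) b ON n.EngineID = b.EngineID
INNER JOIN Engines e ON n.EngineID = e.EngineID
WHERE n.Status = '1'
GROUP BY b.BadNodes, e.IP
ORDER BY PctBad DESC
```

Any engine showing PctBad above 50 is one the alert above would fire on.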