
After installing UDT, I frequently see this alert: "New UDT polling jobs (388) are still running..." Support can't fix it. What does this mean?

A couple of months ago I upgraded NPM to 12.2, along with the matching upgrades for NCM and NTA.  I also installed the recommended hotfixes.

Then I bought IPAM and UDT and installed them, along with one more APE.

They ran OK for a few weeks, but now I've got this message showing up in the Alerts part of NPM's home page:


New UDT polling jobs (388) are still running...

I searched for that message in Support & Thwack but haven't found anything on the issue.  I was hoping it was something I could correct by changing UDT polling timeouts, but although I've doubled them twice now, there's been no improvement.

Might you have an idea about what's going on here?  I've had recurring issues where the C: drives on my APEs have filled, and SW Support seems to think the issue is possibly one of these:

  • The C drives are sized too small (they're not; they match SW's recommendations)
  • Anti-Virus is stopping reports from getting to the main NPM instance (AV is configured to exempt the recommended directories, but the logs say communications are failing between the APEs and NPM)
  • Failing NetPath monitors (I removed them in hopes that might fix it; no luck)

If you have ideas on what's going on with my UDT, based on this alert, I'd love to hear from you.

  • This was a symptom I had after initially upgrading to UDT v3.3, but a hotfix seems to have taken care of it. There were some extra steps that support had me do, though. Here's the full Hotfix procedure they gave me:

    1) Do the following for ALL POLLING ENGINES with UDT installed

    Use an Administrator account to log in to the server that hosts your SolarWinds Orion installation.

    Run the Orion Service Manager, click Shutdown Everything, and close the Orion Service Manager.

    Run the SolarWinds-UDT-v3.3.0-HotFix1.msp file ON ALL POLLING ENGINES with UDT installed.

    This hotfix can be installed through Falcon installer as well.

    2) Execute the following SQL queries (a row-count pre-check sketch appears after step 3):

    DELETE FROM udt_nodecapability

    DELETE FROM udt_job

    ! THESE QUERIES WILL UNMONITOR ALL UDT NODES (REMOVE ALL NODES FROM UDT MONITORING).

    3) Clean up Collector 'sdf' data and JobEngine 'sdf' data on each Poller

    JobEngine data: C:\ProgramData\Solarwinds\JobEngine.v2\Data

    Collector data: C:\ProgramData\SolarWinds\Collector\Data

    To clean up the data files, follow these steps:

    Stop all SolarWinds services on all polling engines.

    Open up Programs and Features

    Right-click and uninstall the SolarWinds Job Engine, SolarWinds Job Engine v2, and the SolarWinds Collector.

    Reinstall them by navigating to C:\ProgramData\SolarWinds\Installers and running the Job Engine/Collector MSI files located there.

    **NOTE: if you have issues getting the Job Engine to reinstall, you might need to open a command prompt as admin and then run the MSI file via the command line**

    After cleaning up all UDT polling engines, start all SolarWinds services on every engine.
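
    Not part of Support's official procedure, but if you want to see what step 2 is about to remove before you run it, a rough Python sketch along these lines can print the row counts first and only delete when you explicitly allow it. Treat it as an illustration only: pyodbc is a separate install, and the driver, server, and database names are placeholders you'd adjust for your own Orion SQL instance.

    import pyodbc

    # Placeholder connection details: point these at your own Orion database.
    # Windows authentication is assumed here.
    CONN_STR = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=YOUR-SQL-SERVER;"
        "DATABASE=SolarWindsOrion;"
        "Trusted_Connection=yes;"
    )

    # Leave this False until you've reviewed the counts and accepted that the
    # deletes remove ALL nodes from UDT monitoring (per the warning in step 2).
    REALLY_DELETE = False

    with pyodbc.connect(CONN_STR) as conn:
        cur = conn.cursor()
        for table in ("udt_nodecapability", "udt_job"):
            count = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
            print(f"{table}: {count} rows would be deleted")
            if REALLY_DELETE:
                cur.execute(f"DELETE FROM {table}")
        if REALLY_DELETE:
            conn.commit()

    Running it with REALLY_DELETE left at False is just a read-only preview; the counts it prints are exactly the rows the hotfix step would remove.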

  • Those are great notes, and they are exactly what SW Support did on the pollers that had their C drives filled.  But Support did NOT do this on all pollers...

  • I thought the problem was caused by too few UDT polling engines for the number of UDT monitored devices/ports.

    Although the hotfix and the cleanup procedures provided above are excellent, and seem to help, I don't believe they are sufficient for prevention; they only provide temporary recovery from the problem.

    I've found conflicting SW documentation about UDT.  One resource says a poller can handle only up to 3,000 UDT devices; another says a poller can handle 100,000 UDT nodes.

    I opened a support case with SW and they provided a link to UDT sizing guidance, with the caveat that they wouldn't give me a hard & fast rule for how many nodes a single UDT instance can poll, simply because the poller typically has other SolarWinds modules on it.  If it's handling NCM jobs, doing NTA tasks, and polling for NPM, of course it'll have fewer resources available for UDT jobs.

    Some of the documentation seems to indicate these could be helpful options:

    1. Add more processing power to the existing servers to ensure they can complete their jobs in the allotted time (not a problem in my case)
    2. Add more pollers to handle the demand (likely the best solution from a SW viewpoint; not so great from my SysAdmins' viewpoint, eating up eighteen VM servers where I was getting by with five). It turns out this is NOT necessary, since UDT pollers can handle up to 100,000 UDT nodes (per the documentation Support sent me).
    3. Reduce the polling frequency so the jobs get done on just five pollers, at the risk of generating stale UDT information because devices may have moved before the next polling schedule begins. Again, this should NOT be necessary because UDT pollers can handle 100,000 UDT nodes.

    So, this isn't resolved yet.

    Per Support's request, I've blown away three recently built APEs and rebuilt them from scratch, this time with different permissions applied.  They seem to be working MUCH better in several ways, but I still see these UDT job errors indicating jobs are still running when they should have completed.

  • Understanding what physical resources UDT requires is an interesting conundrum.

    I see what appears to me to be conflicting information about how many UDT devices can be monitored by a single UDT poller.

    https://support.solarwinds.com/@api/deki/files/9924/ScalabilityEngineGuidelines.pdf

    The first document linked shows the Primary Poller limit is 100K ports, and each APE can manage 100K additional ports.  The limit is 500K UDT ports per instance (1 main poller and 4 APEs).  My environment is identical to this, although I'm not clear whether all UDT polling is occurring from only the main instance, or from it and all the APEs.

    https://support.solarwinds.com/Success_Center/Network_Performance_Monitor_(NPM)/NPM_Documentation/Scalability_Engine_Guidelines_for_SolarWinds_Orion_Products

    [screenshot of the Scalability Engine Guidelines table]

    However, the onboard SolarWinds UDT Help page in my SW Orion deployment says only 3,000 UDT devices per poller:

    [screenshot of the onboard UDT Help page showing the 3,000-device limit]

    Knowing which information is correct remains the challenge.  I've requested clarification from Support.

  • I've been running into the same issue since we upgraded to UDT 3.3.  I'm not sure if the problem was merely masked before, or if 3.3 introduced it.  Did you get clarification from Support on the official number of supported UDT devices/ports per poller?

  • Support told me this is the correct information:

    [screenshot of the sizing figures Support confirmed as correct]

    However, it doesn't explain why I still have the UDT job timeout errors.

  • I would look at the SQL Server --

    If you have multiple polling engines running over the same L2/L3 infrastructure, they may be trying to update the same tables simultaneously, leading to deadlocks and timeouts in the database (see the blocking-check sketch at the end of this reply). I did a LOT of work with the UDT folks a couple of years ago to get UDT 3.2.4 stable in our environment; they had to tweak a lot of the SQL to avoid deadlocks.

    UDT job polling is controlled/managed by the JobEngine v2. You may have to clean up jobs there (a scripted sketch of these steps follows below):

    a) stop the job engines

    b) in C:\ProgramData\Solarwinds\JobEngine.v2\Data

    copy " JobEngine35 - Blank.sdf" over " JobEngine35.sdf"

    c) restart the job engine (this file will get repopulated as it resyncs)

    [I've never had to reinstall the JobEngine]
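
    If you end up doing this .sdf reset more than once, steps a) through c) are easy to script. A rough Python sketch is below; the service name ("SWJobEngineSvc2") and the exact file names are assumptions, so verify the real service name in services.msc and the file names in the Data folder before running anything like this, and run it from an elevated prompt.

    import shutil
    import subprocess
    from pathlib import Path

    # Assumed names: verify the Job Engine v2 service name and the .sdf
    # file names on your own poller before using this.
    SERVICE = "SWJobEngineSvc2"
    DATA_DIR = Path(r"C:\ProgramData\Solarwinds\JobEngine.v2\Data")
    LIVE_SDF = DATA_DIR / "JobEngine35.sdf"
    BLANK_SDF = DATA_DIR / "JobEngine35 - Blank.sdf"

    # a) stop the job engine
    subprocess.run(["net", "stop", SERVICE], check=True)

    # b) keep a copy of the old job database, then overwrite it with the blank one
    shutil.copy2(LIVE_SDF, LIVE_SDF.with_name(LIVE_SDF.name + ".bak"))
    shutil.copy2(BLANK_SDF, LIVE_SDF)

    # c) restart the job engine; the file repopulates as the poller resyncs
    subprocess.run(["net", "start", SERVICE], check=True)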

    I'm going to be looking at alternatives to UDT, because it has been more than 5 years since I asked for VRF polling and we're gradually losing the utility of UDT without it.
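
    Going back to the deadlock point at the top of this reply: while the UDT jobs overlap, you can watch SQL Server's blocking DMVs to see whether the pollers really are blocking each other. Here's a rough Python/pyodbc sketch; it needs VIEW SERVER STATE permission, and the connection details are placeholders for your own Orion SQL instance.

    import pyodbc

    # Placeholder connection details: adjust for your own Orion database.
    CONN_STR = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=YOUR-SQL-SERVER;"
        "DATABASE=SolarWindsOrion;"
        "Trusted_Connection=yes;"
    )

    # Requests that are currently blocked, plus the session blocking them.
    # host_name/program_name show which polling engine each session came from.
    BLOCKING_SQL = """
    SELECT r.session_id, r.blocking_session_id, r.wait_type, r.wait_time,
           r.command, s.host_name, s.program_name
    FROM sys.dm_exec_requests AS r
    JOIN sys.dm_exec_sessions AS s ON s.session_id = r.session_id
    WHERE r.blocking_session_id <> 0;
    """

    with pyodbc.connect(CONN_STR) as conn:
        rows = conn.cursor().execute(BLOCKING_SQL).fetchall()
        if not rows:
            print("No blocked requests right now; re-run while UDT jobs overlap.")
        for r in rows:
            print(f"session {r.session_id} blocked by {r.blocking_session_id} "
                  f"({r.wait_type}, {r.wait_time} ms) from {r.host_name} / {r.program_name}")

    If the blockers keep coming from the same polling engine host, that's a useful hint about whether staggered polling or another APE would actually help.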

  • I'm surprised--I rely on UDT to see where end devices live, and who's logged into them.  It saves me a good chunk of time tracking devices to ports manually, and the SQL database is mined by my CMDB for tracking devices & users, too.  But that's all I got it for, and it does the job pretty well.

    I monitor everything else with NPM, which can discover/ping/monitor VRFs.

    What am I missing by not being able to use UDT to poll VRFs? How do you imagine UDT would make your life better if it could poll or discover VRFs?

  • rschroeder, would you share (or PM me) what the different permissions applied were?

    TIA!

  • On Juniper routers, the ARP data is stored by VRF context, not in a global MIB table.

    We don't have it linked to Active Directory, because not everyone logs into an Active Directory domain controller.

    I do have a really good source of login information through our SSO portal, but there is no way to feed that information in.

    It also lacks an API to turn on UDT monitoring and make sure everything is monitored.

    We only had 4 VRFs 7 years ago, and I could manage by using additional pollers to poll the VRF contexts.

    Now we're up to 144, and we'll probably go through another explosion in VRFs after we get the departmental-IT-managed firewall service rolled out so departments can create their own virtual networks of VLANs behind their own managed firewall... we're doing for networks what VMware did for physical servers.

    So basically, all UDT offers me today is MAC<>switch matching for a third of the network.