This was a symptom I had after initially upgrading to UDT v3.3, but a hotfix seems to have taken care of it. There were some extra steps that support had me do, though. Here's the full Hotfix procedure they gave me:
1) Do the following for ALL POLLING ENGINES with UDT installed
Use an Administrator account to log in to the server that hosts your SolarWinds Orion installation.
Run the Orion Service Manager, click Shutdown Everything, and close the Orion Service Manager.
Run the SolarWinds-UDT-v3.3.0-HotFix1.msp file ON ALL POLLING ENGINES with UDT installed.
This hotfix can be installed through Falcon installer as well.
2) Execute SQL query
DELETE FROM udt_nodecapability
DELETE FROM udt_job
! THIS QUERY WILL UNMONITOR ALL UDT NODES (REMOVE ALL NODES FROM UDT MONITORING).
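Because step 2 is destructive, it's worth counting rows before deleting so you know exactly how many UDT nodes you're about to unmonitor. A minimal sketch of that pattern, using Python's built-in sqlite3 as a stand-in for the real Orion SQL Server database (only the table names come from the procedure above; the column layouts here are made up for the demo):

```python
import sqlite3

# In-memory stand-in for the Orion database. Table names are from the
# hotfix procedure; the columns are hypothetical, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE udt_nodecapability (NodeID INTEGER, Capability INTEGER)")
conn.execute("CREATE TABLE udt_job (JobID INTEGER, NodeID INTEGER)")
conn.executemany("INSERT INTO udt_nodecapability VALUES (?, ?)", [(1, 2), (2, 2)])
conn.executemany("INSERT INTO udt_job VALUES (?, ?)", [(10, 1), (11, 2)])

# Count first, then delete, so the scope of the change is visible.
for table in ("udt_nodecapability", "udt_job"):
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    print(f"{table}: {count} rows to delete")
    conn.execute(f"DELETE FROM {table}")
conn.commit()
```

On the real database you'd run the same `SELECT COUNT(*)` / `DELETE` pair through your usual SQL tooling; the point is just to record the row counts before they're gone.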
3) Clean up the Collector .sdf data and JobEngine .sdf data on each poller
JobEngine data: C:\ProgramData\Solarwinds\JobEngine.v2\Data
Collector data: C:\ProgramData\SolarWinds\Collector\Data
To clean up the data files, do the following:
Stop all SolarWinds services on all polling engines.
Open up Programs and Features
Right-click and uninstall the SolarWinds Job Engine, SolarWinds Job Engine v2, and the SolarWinds Collector.
Reinstall them by navigating to C:\ProgramData\Solarwinds\Installers and running the Job Engine/Collector MSI files located there.
**NOTE: if you have issues getting the Job Engine to reinstall, you might need to open a command prompt as admin and then run the MSI file via the command line**
After cleaning up all UDT polling engines, start all SolarWinds services on every engine.
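If you have to repeat this cleanup across several pollers, the data-folder portion can be scripted. A hedged sketch, assuming the default C:\ProgramData paths quoted above; the demo runs against a throwaway temp directory so it can be tried safely, and you'd only point it at the real folders after all SolarWinds services are stopped:

```python
import shutil
import tempfile
from pathlib import Path

def backup_and_clear(data_dir: Path, backup_root: Path) -> list:
    """Move everything out of data_dir into a backup folder (rather than
    deleting outright) and return the names that were moved."""
    backup = backup_root / data_dir.name
    backup.mkdir(parents=True, exist_ok=True)
    moved = []
    for item in data_dir.iterdir():
        shutil.move(str(item), str(backup / item.name))
        moved.append(item.name)
    return moved

# Demo against a temp directory. On a real poller you would instead use
#   Path(r"C:\ProgramData\SolarWinds\JobEngine.v2\Data")
#   Path(r"C:\ProgramData\SolarWinds\Collector\Data")
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "Data"
    data.mkdir()
    (data / "JobEngine35.sdf").write_text("stale job data")
    moved = backup_and_clear(data, Path(tmp) / "Backup")
    print(moved)                  # ['JobEngine35.sdf']
    print(list(data.iterdir()))   # []
```

Moving rather than deleting means Support can still inspect the old .sdf files if the cleanup doesn't fix the symptom.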
Those are great notes. And they are exactly what SW Support did on the pollers that had their C drives filled. But Support did NOT do this on all pollers...
I thought the problem was caused by too few UDT polling engines for the number of UDT monitored devices/ports.
Although the hotfix and the cleanup procedures provided above are excellent, and seem to help, I don't believe they are sufficient for prevention--only for temporary recovery from the problem.
I've found conflicting SolarWinds documentation about UDT. One resource says a poller can only handle up to 3,000 UDT devices. Another says a poller can handle 100,000 UDT nodes.
I opened a support case with SW and they provided a link to UDT sizing documentation, with the caveat that they wouldn't give me a hard & fast rule for how many nodes a single UDT instance can poll, simply because the poller typically has other SolarWinds modules on it. If it's handling NCM jobs, doing NTA tasks, and polling for NPM, of course it'll have fewer resources available for UDT jobs.
Some of the documentation seems to indicate these could be helpful options:
- Add more processing power to the existing servers to ensure they can complete their jobs in the allotted time (not a problem in my case)
- Add more pollers to handle the demand (likely the best solution from a SW viewpoint; not so great from my SysAdmins' viewpoint, eating up eighteen VM servers where I was getting by with five). It turns out this is NOT necessary, since UDT pollers can handle up to 100,000 UDT nodes (per the documentation Support sent me).
- Reduce the polling frequency so the jobs get done on just five pollers, at the risk of generating stale UDT information because devices may have moved before the next polling schedule begins. Again this should NOT be necessary because UDT Pollers can handle 100,000 UDT nodes.
So, this isn't resolved yet.
Per Support's request, I've blown away three recently-built APE's and rebuilt them from scratch, this time with different permissions applied. They seem to be working MUCH better in several ways, but I still see these UDT Job errors that indicate jobs are still running when they should have completed.
Sigh. I would. But it's been too long, too much water over the dam.
I recommend you talk with Support if you have similar concerns.
But I also suspect that any remaining issues, permissions included, will be fully resolved by the just-released updater and/or Scalability Engine solution: install the latest hotfixes on your Main Poller, then pull them down to each APE using the Polling Engines option on each of them. That latest fix cleaned up a LOT of my environment's issues.
Understanding what physical resources UDT requires is an interesting conundrum.
I see what appears to me to be conflicting information about how many UDT devices can be monitored by a single UDT poller.
The first document linked shows the Primary Poller limit is 100K ports, and each APE can manage 100K additional ports. The limit is 500K UDT ports per instance (1 main poller and 4 APEs). My environment is identical to this, although I'm not clear whether all UDT polling is occurring from only the main instance, or from it and all the APEs.
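The arithmetic behind that document's instance limit is worth spelling out, since it's what makes the two sources look inconsistent. A quick check of the numbers it quotes:

```python
ports_per_poller = 100_000   # per the sizing document: main poller or APE
pollers = 1 + 4              # 1 main poller + 4 APEs per instance

instance_limit = ports_per_poller * pollers
print(instance_limit)        # 500000 -- matches the 500K-ports-per-instance figure
```

So the 100K and 500K figures are consistent with each other; it's only the 3,000-devices-per-poller number from the onboard help that doesn't fit.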
However, the onboard SolarWinds UDT Help page in my Orion deployment says only 3,000 UDT devices per poller.
Knowing which information is correct remains the challenge. I've requested clarification from Support.
I am running into the same issue since we upgraded to UDT 3.3. I'm not sure if the problem was merely masked before, or if 3.3 introduced the issue. Did you get clarification from support on the official number of supported UDT devices/ports per poller?
I would look at the SQL Server --
If you have multiple polling engines running over the same L2/L3 infrastructure, then they may be trying to update the tables simultaneously, leading to deadlocks and timeouts in the database. I did a LOT of work with the UDT folks a couple of years ago to get UDT 3.2.4 stable in our environment. They had to tweak a lot of the SQL to avoid deadlocks.
UDT job polling is controlled/managed by the JobEngine v2. You may have to clean up jobs there:
a) stop the job engines
b) in C:\ProgramData\Solarwinds\JobEngine.v2\Data
copy "JobEngine35 - Blank.sdf" over "JobEngine35.sdf"
c) restart the job engine (this file will get repopulated as it resyncs)
[I've never had to reinstall the JobEngine]
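Step (b) above is just an overwrite-with-the-blank-template copy. A minimal sketch of it, run against a temp directory so it's safe to try (on a real poller the files live in C:\ProgramData\Solarwinds\JobEngine.v2\Data, and per steps a and c the job engines must be stopped first and restarted after):

```python
import shutil
import tempfile
from pathlib import Path

def reset_jobengine_db(data_dir: Path) -> None:
    """Step (b): overwrite the live JobEngine database with the blank template.
    The live file gets repopulated when the job engine restarts and resyncs."""
    blank = data_dir / "JobEngine35 - Blank.sdf"
    live = data_dir / "JobEngine35.sdf"
    shutil.copyfile(blank, live)

# Demo in a temp directory; point data_dir at the real JobEngine.v2\Data
# folder only with the job engines stopped.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp)
    (data / "JobEngine35 - Blank.sdf").write_text("blank template")
    (data / "JobEngine35.sdf").write_text("bloated job data")
    reset_jobengine_db(data)
    print((data / "JobEngine35.sdf").read_text())  # blank template
```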
I'm going to be looking at alternatives to UDT because it has been more than 5 years since I asked for VRF polling, and we're gradually losing the utility in UDT without it.
I'm surprised--I rely on UDT to see where end devices live, and who's logged into them. It saves me a good chunk of time tracking devices to ports manually, and the SQL database is mined by my CMDB for tracking devices & users, too. But that's all I got it for, and it does the job pretty well.
I monitor everything else with NPM, which can discover/ping/monitor VRF's.
What am I missing by not being able to use UDT to poll VRF's? How do you imagine UDT would make your life better if it could poll or discover VRF's?
On Juniper routers, the ARP data is stored per VRF context, not in a global MIB table.
We don't have it linked to Active Directory, because not everyone logs into an active directory controller.
I do have a really good source of login information through our SSO portal, but there is no way to feed that information in.
It also lacks an API to turn on UDT monitoring and make sure everything is monitored.
We only had 4 VRFs 7 years ago, and I could manage by using additional pollers to poll the VRF contexts.
Now we're up to 144 and will probably go through another explosion in VRFs after we get departmental-IT managed firewall service rolled out so departments can create their own virtual networks of VLANs behind their own managed firewall... we're doing for networks what VMware did for physical servers.
So basically all UDT offers me today is MAC<>switch matching for 1/3rd of the network.
Hmph. It almost seems like a limitation with VRF's, and a tiny reason to stay away from them. I realize there are much larger reasons to go TO VRF's.
VRF's ought to fully support your needs, and UDT ought to fully support VRF's.
That's the way it would be in fictional "Rick-Perfect-World."