I've been working with SolarWinds support for 30 days on this problem, and for some reason my !@#$ass didn't post anything out here. So here goes.
We are running: Orion Core 2011.2.2, NCM 7.0.2, NPM 10.2.2, UDT 2.0.0, IVIM 1.2.0
All of this is on a 2008 VM w/ 8 Xeon 3.something procs, 16 gig RAM, and a dual 10 gig fiber backbone to the router core. The SQL system is uplinked at 10 gig to a SQL 2008 RC 2 server w/ similar fancy specs.
Our hospital network has approx. 1,500 managed nodes, with UDT monitoring around 49,000 interfaces in total. Our environment is a mix of L2/L3 networks. We recently purchased the UDT unlimited module and added it to the mix for the security team. The goal is to locate MAC addresses that have previously caused issues w/ virus activity, or known, potentially spoofed addresses. It also provides easy tracking of users that violate P2P policies and such. Blah blah blah, you get it.
On the first go at the install, I found that by default it (UDT) chooses to poll all devices as both L2 and L3 capable. You can go back and uncheck these node by node, or by groups such as machine type. However, on a first-time install you don't get that choice up front. The flood of 49,000 interfaces killed the NPM jobengine.exe process and we received approx. 200 false-positive down-node reports. This happened around the time the L3 node poll took place.
After my first call to tech support, we found the L3 issue and removed L3 polling from every device. The issues stopped, but of course we were now not getting 100% out of our UDT install. Even adding 20 L3 nodes back would cause false down reports for up to 10 devices. The falsely reported nodes are the same every time, even though I have removed them and added them back in. None are in the same building, and in fact four are in a different city.
We also deploy HP Network Node Manager, which is essentially a really expensive Orion tool, but it's clunky and difficult to use. It, however, never sees these down nodes. What's more, I can ping/telnet to all the "down" nodes from the Orion application server without issue. During my console sessions into the switches, I see the proc usage is next to nothing. Bandwidth isn't a problem, and we don't packet shape.
I need to figure out why the nodes are reporting as down. ICMP is obviously getting through between the Orion server and the nodes in question. I'm looking to start a topic on the NPM/UDT jobengine, and how the two can be managed to work together in this type of installation.
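For anyone who wants to repeat the manual ping cross-check without doing it by hand, here is a rough sketch of how it could be scripted. It is not anything SolarWinds ships; it assumes Python is available on the Orion application server, the function names (ping_once, find_false_downs) are made up for illustration, and the ping flags assume the Windows ping command (-n for count, -w for timeout in milliseconds).

```python
import subprocess
from typing import Callable, Iterable, List


def ping_once(host: str, timeout_ms: int = 1000) -> bool:
    """Send one ICMP echo with the Windows ping command.

    Assumption: Windows flags (-n count, -w timeout in ms).
    Caveat: Windows ping can exit 0 on a "Destination host unreachable"
    reply from a gateway, so treat a True result as a rough indication only.
    """
    result = subprocess.run(
        ["ping", "-n", "1", "-w", str(timeout_ms), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_false_downs(
    reported_down: Iterable[str],
    probe: Callable[[str], bool] = ping_once,
    attempts: int = 3,
) -> List[str]:
    """Return the nodes reported as down that answered at least one probe."""
    reachable = []
    for host in reported_down:
        # Stop probing a host as soon as one attempt succeeds.
        if any(probe(host) for _ in range(attempts)):
            reachable.append(host)
    return reachable
```

Feed it the list of addresses Orion is flagging as down, e.g. find_false_downs(["192.0.2.10", "192.0.2.11"]) (placeholder addresses), ideally right around the time the L3 poll runs. Anything it returns is a node that Orion calls down but that still answers ICMP from the same server.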
Holy crap, no more typing. Thanks for the help.