6 Replies Latest reply on Jul 30, 2018 10:34 AM by lcsw2013

    Agent monitoring brings me endless headaches.


      Pain points:


      1. Agents are causing some monitored servers to spike in CPU, others in RAM, and some in both. We're on NPM 12.2. I can't for the life of me figure out why.

      2. Agents are causing problems with the pollers: the Job Engine spikes, collectors crash, and ephemeral port usage spikes. How many agents per poller can the pollers handle? Am I overloading my system?

      3. On my DMZ servers I can't get anything to work, and agents even less so. Even when installed manually, I can't get them to communicate with the host. Should I place a poller in the DMZ to make this happen?

      4. When moving devices from one poller to another, I have to go into Manage Agents and move the devices to the new poller manually.

      5. 2003 servers are a pain to monitor no matter which method you choose. Even with agents they're still problem children. I'm struggling to find a good way to monitor these servers. Is the agent the way to go?


      Those are just the top 5 pressing issues. Any help would be appreciated, thanks.

        • Re: Agent monitoring brings me endless headaches.

          Agreed. I have always discouraged my clients from using agents as their standard polling method on Windows, because I think they bring in many more headaches than WMI. I feel they help enough on Linux machines to be worth the hassle there, and so far the new mapping features have not quite been enough to convince me to change my standard recommendation.


          1) I've personally never seen the agents themselves using a lot of CPU/memory, but I have heard of people having that issue. I have seen cases where the server is using high CPU for something else and the agent stops responding or sending new data, so the metrics you see in Orion are stale and you have to restart the agent to get them updated. I can't offer any ideas on the agent itself having high consumption, but maybe others will weigh in.
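A rough way to catch the stale-metrics case described in 1) is to compare each node's last-poll time against a threshold. The sketch below is a minimal Python illustration, assuming you have already exported node names and last-poll timestamps from Orion by whatever means you use (that export is not shown). The function name find_stale_nodes, the node names, and the 30-minute threshold are all hypothetical.

```python
from datetime import datetime, timedelta


def find_stale_nodes(last_polled, now, max_age=timedelta(minutes=30)):
    """Return the sorted names of nodes whose latest data is older than max_age.

    last_polled: dict mapping node name -> datetime of the last successful poll
    (hypothetical input; how you pull it out of Orion is up to you).
    """
    return sorted(name for name, ts in last_polled.items() if now - ts > max_age)


# Illustrative data only.
now = datetime(2018, 7, 30, 10, 0)
last_polled = {
    "web01": now - timedelta(minutes=5),
    "app02": now - timedelta(hours=2),
    "db03": now - timedelta(minutes=45),
}
print(find_stale_nodes(last_polled, now))
```

Nodes that show up on this list repeatedly are candidates for an agent restart, which matches the symptom above of data going stale while the node still looks up.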


          2) According to the scalability guide, each APE (Additional Polling Engine) is supposed to be able to handle about 1,000 agents.

          Scalability Engine Guidelines for SolarWinds Orion Products (SolarWinds Worldwide, LLC, Help and Support)
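For a rough sense of how many polling engines that figure implies, here is a minimal Python sketch of the arithmetic. It assumes the ~1,000 agents per engine figure above. The names engines_needed and headroom_pct are hypothetical, and real capacity also depends on component counts and other load on each engine, so treat the result as a lower bound rather than a sizing answer.

```python
def engines_needed(agent_count, per_engine=1000, headroom_pct=0):
    """Minimum number of polling engines for agent_count agents.

    per_engine: assumed agents per engine (about 1,000 per the scalability guide).
    headroom_pct: hypothetical percentage of each engine's capacity to keep free.
    """
    usable = per_engine * (100 - headroom_pct) // 100
    if usable <= 0:
        raise ValueError("headroom leaves no usable capacity per engine")
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-agent_count // usable)
```

For example, engines_needed(3000) returns 3, and reserving 20% headroom with engines_needed(3000, headroom_pct=20) returns 4.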


          3) Placing a poller in the DMZ certainly makes it easier. Otherwise you just have to sit down with whoever manages the firewall and watch for packets to/from the polling engines hitting the rules until you identify which rules are interfering with the communication.
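Before that sit-down with the firewall team, a quick TCP connect test run from the polling engine can show which host/port pairs are being blocked outright. This is a minimal Python sketch using only the standard library. tcp_reachable is a hypothetical helper, and the port numbers mentioned below are assumptions to check against the port requirements documentation for your Orion version.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP handshake to host:port completes within timeout seconds.

    Refused connections, timeouts, and DNS failures all raise OSError
    (socket.timeout is a subclass), which is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical usage from a polling engine:
# for host in ["dmz-web01.example.com", "dmz-web02.example.com"]:
#     print(host, tcp_reachable(host, 17790))
```

Port 17790 is commonly used for server-initiated agent communication; the agent-initiated direction (commonly TCP 17778 toward the Orion server) has to be tested from the DMZ host instead. A connect test only shows whether the handshake succeeds, so it will not catch a firewall that drops traffic later in the session.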


          4) Agreed, I've never come across a good mechanism for changing the poller assigned to an agent. If you change it in Manage Nodes, the agent just stops responding.


          5) The main thing I find with 2003 servers is that you need to use RPC or SNMP for any SAM monitors; the version of WMI they run is half-baked at best.


          Good luck

            • Re: Agent monitoring brings me endless headaches.

              Good sir, I think we've communicated several times on here.


              One of your colleagues is working with me. I believe you're with Loop1, right? I'm working with Douglas K., sharp guy. And between him, the several tickets I have open with SolarWinds, and the THWACK forum, I still can't get my issues resolved, haha.


              You offered some good ideas here. But before moving to agents on the Windows side, we were on WMI and had an even bigger headache, with WMI failing constantly. I tried to build a SAM component whose sole purpose was to detect the WMI connection going down and then send me an email or generate a report listing the servers dropping off monitoring. Every day that list had 100+ servers on it, and of course my Windows team didn't own up to the problem but instead blamed it as a "SolarWinds problem". SolarWinds doesn't have much trust here; it's a perfect storm of never-ending issues, tickets, and nightmares, haha.
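If you go the report route, tallying failures per server over several days tends to make better evidence than a single day's list, because it separates chronically flapping servers from one-off blips. A minimal Python sketch, assuming poll results exported as hypothetical (server, status) pairs where status is "up" or "down"; dropped_servers_report and the sample data are illustrative, not an Orion API.

```python
from collections import Counter


def dropped_servers_report(events):
    """Count "down" results per server and return (server, count) pairs, worst first.

    events: iterable of (server, status) tuples; status is "up" or "down"
    (a hypothetical export format, not an Orion data structure).
    """
    failures = Counter(server for server, status in events if status == "down")
    # Sort by failure count descending, then by name so the report is stable.
    return sorted(failures.items(), key=lambda item: (-item[1], item[0]))


# Illustrative data only.
events = [
    ("web01", "down"), ("app02", "up"), ("web01", "down"),
    ("db03", "down"), ("app02", "up"),
]
for server, count in dropped_servers_report(events):
    print(f"{server}: {count} failed polls")
```

Servers that never fail are left out of the report entirely, which keeps the list focused on the ones worth taking to the Windows team.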


              Because of this, no one really trusts a system that's known to always have issues and show false information. So it's easy for others to be convinced it's SolarWinds even after I offer proof that it wasn't a SolarWinds problem. To avoid those headaches we started switching to agents. We have moved 1,000 of our 3,000 Windows servers to agents, and since then we have been fighting to keep those 1,000 servers from falling off.


              In our environment SNMP, WMI, and agents are all unstable. And I have no ammunition or solid evidence to prove beyond all doubt that it's not SolarWinds. Someone always finds something on the SolarWinds side to discredit my information.


              I have enough pain points to share with everyone on this forum and still have leftovers.


              But I will give some of these things a try. I'll update this thread later with more information. Thanks again.

            • Re: Agent monitoring brings me endless headaches.

              So we are slowly backing away from agent polling and only using it on servers that actually need it. The DMZ servers, I believe, will just have to wait in our case. We have a Palo Alto directing traffic to and from the DMZ, and it is a nightmare when it comes to SolarWinds traffic: it lets some of it through, but not all. Using the free SolarWinds traceroute tool, I saw that TCP packets are getting dropped at our Palo Alto for a majority of servers, but oddly enough it lets them through for a small group. Even our network guys are scratching their heads over this behavior. Our SAM component count isn't quite at 1,000 components per server, but it's close. I still can't understand why SAM throttles itself. We're also considering switching our 2003 machines to SNMP and their templates to RPC.


              We're trying anything at this point to stabilize this environment and end the problems.

              • Re: Agent monitoring brings me endless headaches.

                Windows 2003 servers are our biggest pain point. I suggested we move them to an SNMP and RPC combo, but the idea was denied and I was told agents are the only option they want to go with. Thankfully, management has said we are reducing the number of 2003 servers. But it'll take time, and the ones still active need to be monitored through agents until they are gone. I guess it's just something we will have to deal with.


                As for the DMZ, we have a stinking Palo Alto with an algorithm that likes to drop packets that aren't perfect. So if SolarWinds sends one bad packet, it starts dropping SolarWinds traffic. This is pain point number two, which we'll have to deal with until we can get approval to place a poller inside the DMZ.


                All other issues have for the most part been resolved, but the whole thing still remains largely unstable. We are planning to move to 12.3 and hope that the new agents will give us more stability.