35 Replies Latest reply on Jul 2, 2018 3:29 PM by aLTeReGo

    Issues with agents following upgrade to SAM 6.2.3?

    shuth

      Has anyone else had issues with agents following an upgrade to SAM 6.2.3?

       

      One particular install I've been working on went from SAM 6.2.1 and NCM 7.4 to SAM 6.2.3 and NCM 7.4.1, then we put NPM 11.5.3 on top of that. The install seemed to go fine, no errors during the Config Wizard, and the web console loaded fine after each install. Things went downhill from there:

      • None of the approximately 623 agents are returning any data (they were working previously).
      • The Orion Module Engine service is crashing every 30-45 minutes, potentially related to the above.
      • Approximately 2,800 Cisco devices are showing Hardware Health as "Unknown" even though Hardware Health polling for these nodes is disabled. This is confirmed via List Resources as well as Manage Pollers. I assume this comes from putting NPM on top of an existing NCM/SAM install and enabling hardware health polling (along with VLAN polling, routing polling, etc.).

       

      I presume the agent version would be the same regardless of which module I upgraded above, as they were all released the same day.

       

      Currently have a case open with SolarWinds support but thought I'd post this while waiting for an AE. Case # 934710.

       

      We will give support a bit of time to analyse and hopefully come up with a fix/solution but we have a database backup and server snapshot we can roll back to if it doesn't look likely.

        • Re: Issues with agents following upgrade to SAM 6.2.3?
          squinsey

          Can't help you there, sorry shuth. I've actually only just started planning my upgrade, but it mentioned upgrading NPM first, then SAM, followed by NCM last.

          I recall a post about having to run the SCWizard again but since you didn't have NPM to begin with...

           

          Maybe running the installer for SAM now that NPM is there may assist. Though you do have a backup of the Db don't you?

            • Re: Issues with agents following upgrade to SAM 6.2.3?
              shuth

              Correct I didn't have NPM initially. I thought it would be best to bring SAM and NCM up to date first before putting on NPM.

               

              We reinstalled the job engines and collector services, the module services, cleared out the jobs and collector files from ProgramData and replaced with the blank files, and reran the Config Wizard.

               

              I noticed that Orion is generating a lot of TaskProperties.xxxxxxxxxxxxxxxxxxxxxxxxx files in Windows\Temp - to the tune of 85,000 files. These seem to be in use by the SolarWinds BusinessLayerHost.exe but I guess when it crashes it doesn't clear them out.

               

              The Orion application monitor also found that the job scheduler has generated a lot of errors. The File Count Monitor works again now that I have cleared out many of the 85,000 files, but the count is climbing rapidly again.
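A minimal sketch of how the leftover files described above could be counted before any clean-up is decided. It assumes the files match the name pattern `TaskProperties.*` and live directly in `C:\Windows\Temp`, as reported in the post; the function name `stale_task_files` and the one-hour age threshold are hypothetical choices for illustration, not anything SolarWinds provides. The script only reports files and does not delete anything, since some of them may still be in use by BusinessLayerHost.exe.

```python
import time
from pathlib import Path


def stale_task_files(temp_dir, pattern="TaskProperties.*",
                     min_age_seconds=3600, now=None):
    """Return matching files whose last-modified time is at least
    min_age_seconds in the past (hypothetical helper, report only)."""
    now = time.time() if now is None else now
    stale = []
    for path in Path(temp_dir).glob(pattern):
        try:
            if path.is_file() and now - path.stat().st_mtime >= min_age_seconds:
                stale.append(path)
        except OSError:
            # The file vanished or is locked; skip it rather than fail.
            continue
    return sorted(stale)


if __name__ == "__main__":
    temp = Path(r"C:\Windows\Temp")  # location reported in the thread
    if temp.is_dir():
        print(f"{len(stale_task_files(temp))} TaskProperties files older than 1 hour")
```

Running this on a schedule and logging the count would show whether the files are still accumulating after a configuration change.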

              [Attached screenshot: orion_components.JPG]

            • Re: Issues with agents following upgrade to SAM 6.2.3?
              shuth

              I've been doing some additional work with support. We disabled the DPI and Agent Management plugins in a couple of .config files. Since then the service has stopped crashing. Progress!

               

              [Attached screenshot: orion_blh-process.JPG]

               

              It also seems to have stopped generating a large number of files in C:\Windows\Temp, although it then filled up the drive with VMware results logs.

              • Re: Issues with agents following upgrade to SAM 6.2.3?
                shuth

                Following additional requests from SW Support, I increased the connection limits in the Agent Management config file and re-enabled DPI and AMS. The Orion Module Engine service started crashing again, so I disabled them once more and uploaded diagnostics.

                • Re: Issues with agents following upgrade to SAM 6.2.3?
                  chadsikorra

                  I just did an upgrade from SAM 6.2.1 to SAM 6.2.3 in our environment on Monday. We are at NCM 7.4 and NPM 11.5.2 (with the latest hotfixes). I haven't upgraded to NPM 11.5.3 yet. I haven't had any of the issues you're experiencing yet. The only issue I noticed is that all of our servers with the Agent (which is about 75 at the moment) deselected their Hardware Health Sensors from monitoring after the upgrade. So I had to go to each agent node and re-select Hardware Health Sensors for monitoring...quite tedious. I have a case open with support about it to see what they say (It's the second time that has happened actually. It happened when I installed the latest NPM hotfixes too).

                  • Re: Issues with agents following upgrade to SAM 6.2.3?
                    shuth

                    Working with support, it seems there is a memory issue relating to agent plugin deployment logic. We changed the agent plugin deployment retry interval from 10 seconds to 10 minutes and re-enabled the AMS components.
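To put the interval change above in perspective, here is a rough back-of-the-envelope calculation. It assumes a worst case in which every one of the roughly 623 agents mentioned earlier in the thread is pending a plugin deployment and is retried once per interval; that assumption, and the function name `attempts_per_second`, are illustrative only and do not describe the actual deployment logic.

```python
def attempts_per_second(agent_count, retry_interval_seconds):
    """Average retry attempts per second if every agent is retried
    once per interval (worst-case assumption, not measured)."""
    return agent_count / retry_interval_seconds


agents = 623                                   # agent count from the original post
before = attempts_per_second(agents, 10)       # 10-second retry interval
after = attempts_per_second(agents, 10 * 60)   # 10-minute retry interval

print(f"{before:.1f} -> {after:.2f} attempts/s ({before / after:.0f}x fewer)")
# prints "62.3 -> 1.04 attempts/s (60x fewer)"
```

Under that assumption the change cuts the retry load by a factor of 60, which is consistent with the service stabilising once the interval was lengthened.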

                     

                    The service continued to crash roughly every 40 minutes for about 20 hours but then stabilised (presumably once it had updated the agents). Most of the agents are now running version 1.4.0.9, with some still unknown (the server is currently down/unreachable or we need to manually reinstall the agent).

                     

                    Agent Status

                    OK = 625

                    Unknown = 22

                    RebootRequired = 17

                    PluginErrorOccurred = 2

                     

                    We'll start moving the polling back to the Agents from WMI and see how the system goes.

                    • Re: Issues with agents following upgrade to SAM 6.2.3?
                      colew

                      I hate to resurrect a thread from almost 2 years ago, but we recently started having this issue in one of our Orion environments that is completely agent-based (about 9500 agents total). Support hasn't been able to give us any specific reasons as to why it's happening, nor any resolution other than upgrading to 12.2.

                       

                      Did you guys ever get your environments back stabilized?  Our issues started happening when a group deleted around 2,000 nodes and checked the remove agent box while doing so (all 2000 were showing offline at the time so the thinking was that the agent wouldn't get removed and would just re-register when the machines came back online).  I've had a case open for almost a month now, and we just aren't making any progress...

                       

                      I'm having the exact same symptoms mentioned here: no agents returning statistics, the pollers' Windows\Temp folders filling up with TaskProperties files, all components in templates returning 'Initial Poll in Progress', ...

                       

                      Current environment is Orion Platform 2017.1.3 SP3, DPAIM 11.0.0, NPM 12.1, VIM 7.1.0, NetPath 1.1.0, QoE 2.3, CloudMonitoring 1.0.0, SAM 6.4.0

                        • Re: Issues with agents following upgrade to SAM 6.2.3?
                          aLTeReGo

                          When you delete the node and state that you want to remove the Agent as well, the agent record and its associated private key are purged from the system. That agent will never be able to communicate with the Orion server again until it's reprovisioned. For agents in 'Active' mode, that involves opening the Agent Control Panel in Windows or running the swiagent command in Linux and reauthenticating to the Orion server to re-establish communication. Alternatively, you can redeploy/push the agent from the Orion web console to the servers you wish to re-manage to accomplish the same thing. For Passive Agents, you can reprovision those from the Orion web console by going to the 'Manage Agents' page and adding an existing deployed agent.

                           

                          If you had deleted the node and not checked the 'Uninstall Agent' checkbox (which is unchecked by default), then you could have simply re-added the nodes via the 'Add Node' wizard or via the Sonar Discovery Wizard.

                            • Re: Issues with agents following upgrade to SAM 6.2.3?
                              colew

                              Thanks for the info.  That sheds some light into how the agents actually work, which we haven't really found great documentation on.

                               

                              What do you think would cause the instability issues with all of the agents that didn't get removed, and with the polling jobs?  Is it that the agents whose entries were deleted are constantly trying to communicate back to the pollers and send info to them, so the pollers are tied up trying to figure out what they are?  I am just curious whether reinstalling the 2000 or so agents that were removed is going to stabilize things again...

                                • Re: Issues with agents following upgrade to SAM 6.2.3?
                                  aLTeReGo

                                  What is the maximum number of agents you have assigned to any single polling engine? If you have 9500 Agents, you would need at least one main poller and nine Additional Polling Engines to handle that amount of Agents.

                                    • Re: Issues with agents following upgrade to SAM 6.2.3?
                                      chadsikorra

                                      Is there a technical reason why the number of agents per poller is limited to 900? Is it due to the amount of data they process from the agents?

                                        • Re: Issues with agents following upgrade to SAM 6.2.3?
                                          aLTeReGo

                                          Each polling engine supports 1000 Agents. The Agent Management service running on the Orion server is responsible for encryption and compression for all Agents, which are in constant communication. These channels are therefore persistent, much like a VPN tunnel, and they're not infinitely scalable without standing up Additional Polling Engines to distribute the load.

                                           

                                          Note that if you have 2000 disassociated Agents continuing to phone home to the same Orion server that services other functioning agents, that could be contributing significantly to your instability issue. That is because the Agent Management Service must still deal with these incoming requests and identify them as coming from unmanaged agents before ultimately discarding them. The Agent on the other end doesn't know why it's unable to communicate with the Orion server and, in turn, aggressively attempts to reestablish communication continuously. This adds substantial strain to the Orion server when thousands of Agents are all doing this in parallel, and can even lead to port exhaustion issues.
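The sizing figures quoted in this exchange (1000 Agents per polling engine, about 9500 Agents in total) can be checked with a short calculation. This is only an arithmetic sketch: the per-engine limit is taken from the post above, and the function names are hypothetical.

```python
import math

AGENTS_PER_ENGINE = 1000  # per-engine limit quoted in the thread


def engines_needed(total_agents, per_engine=AGENTS_PER_ENGINE):
    """Total polling engines (main poller included) needed for the agent count."""
    return math.ceil(total_agents / per_engine)


def additional_engines_needed(total_agents, per_engine=AGENTS_PER_ENGINE):
    """Additional Polling Engines needed on top of the one main poller."""
    return max(engines_needed(total_agents, per_engine) - 1, 0)


total = 9500
print(engines_needed(total))             # 10
print(additional_engines_needed(total))  # 9
```

This matches the earlier advice of one main poller plus nine Additional Polling Engines for 9500 Agents.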

                                            • Re: Issues with agents following upgrade to SAM 6.2.3?
                                              colew

                                              So it looks like this may be our issue with this particular environment then.  The engineer who spec'd this environment before me set it up as a 6 poller, 1 core environment, which currently has about 1300 agents reporting to each system.  It's odd because at one point I remember seeing 1500+ reporting to each poller, and they have run fine since September without any issues.  I guess the agent deletions were the straw that broke the camel's back in this case...

                                               

                                              So, moving forward, I believe I am going to request that all agents be stopped and set to disabled, upgrade the environment over to 12.2, and bring all modules up-to-date.  At that point, I'll bring in a few more pollers, and package up the plugin files to push out to the systems prior to doing the agent updates.  One question though - would the agent update that will take place after our environment upgrade be enough to allow the agents to re-register with the system and start communicating again?

                                               

                                              If anyone has a better suggestion to getting this thing back stabilized, I'm all ears....or eyes in this case I guess

                                                • Re: Issues with agents following upgrade to SAM 6.2.3?
                                                  aLTeReGo

                                                  If you could add a few polling engines, spread the load more evenly across them, stop the Agent service on those 2000 disassociated agents, then upgrade, you should be fine.

                                                • Re: Issues with agents following upgrade to SAM 6.2.3?
                                                  colew

                                                  So we are finally at a place where we are completely current on the core/modules, and ready to attempt the upgrade to the most current agent version.  Question though...before doing the install of the new agent, should we delete all nodes/agents from the system?  They have been in a stopped state for weeks now, and I feel like starting with a clean slate may be best unless you think we will run into issues once everything starts coming back online similar to the way we did the first time we did a mass delete.

                                                    • Re: Issues with agents following upgrade to SAM 6.2.3?
                                                      aLTeReGo

                                                      By default, any Agents that are communicating with the Orion server should automatically upgrade themselves provided you haven't disabled that feature. For those agents which have lost connection with the Orion server, simply doing an in-place upgrade is usually enough. If you'd prefer, you can uninstall the old version and install the latest without losing history in Orion provided you don't delete the Agent or Node in the Orion web interface.

                                                        • Re: Issues with agents following upgrade to SAM 6.2.3?
                                                          colew

                                                          We have disabled that feature because a lot of these nodes are behind extremely low bandwidth connections.  We use SCCM to do deployments/upgrades, so that's the plan for these.  We tried an in-place upgrade on a couple earlier, and completely new agents were registered while the originals were left orphaned.  I'm just trying to determine the "cleanest" way to do this to avoid any issues similar to the ones that originally got me on this thread.

                                            • Re: Issues with agents following upgrade to SAM 6.2.3?
                                              ecklerwr1

                                              Are you holding off on upgrading to 12.2 because you're on Windows Server 2008 R2?  If not, then I'd upgrade, as it's easy once you're on a newer server OS.

                                                • Re: Issues with agents following upgrade to SAM 6.2.3?
                                                  colew

                                                  We are holding off because of the number of agents we have out there that would have to be updated post upgrade.  We have 9500+ total, with some at external locations with extremely limited bandwidth, so pushing the agent updates isn't a possibility.  We have had to work up a method with SCCM where we deploy all of the plugins to distribution points and pull them to the agents at those sites before initiating the agent installs.  Letting SolarWinds send the plugins tied up all of the bandwidth at many of these sites when we went through this exercise initially.  It's mainly an issue of the effort involved to make it happen...