
    APE Servers going in Hung State - Issue not solved

    nks7892

Hi,

       

I am posting this issue because nothing has worked out so far - support cases, Windows troubleshooting, log capturing, and so on.

       

       

      Environment -

1. NPM, SAM, and SRM on the primary poller. Configuration: 160 GB RAM, 16-core CPU. OS: Windows Server 2008 R2 SP1.

2. SQL DB on a separate server. Configuration: 256 GB RAM, 16-core CPU. OS: Windows Server 2008 R2 SP1.

3. Three more additional polling engines (APEs) for NPM and SAM. SRM is also installed, but we are not using it so far. Configuration for each APE: 40 GB RAM, 16-core CPU. OS: Windows Server 2012 Standard.

       

      Issue -

An APE server goes into a hung state every 3-4 days, which causes SolarWinds to stop monitoring and generates a bulk of false email alerts for past events and DCOM failure events. It happens only in the evening, around 4-6 PM EST.

Initially we had only one APE. When we distributed the load of the primary poller, it worked fine for one week, but after that it started going into a hung state. Nothing could be done except reboot the server, after which monitoring worked fine again.

After opening a case with the SolarWinds support team, they finally suggested rebuilding another APE machine. We built another APE machine with the same configuration mentioned above for the APEs and moved all the nodes from the old APE to the new one. The issue then started on the new APE.

We had multiple cases where SolarWinds said this is a system issue, not an issue with the SolarWinds product.

There were also a lot of subscription errors, but they stopped after upgrading SolarWinds to Orion 2017 SP2. The polling rate is also normal.

       

I have another APE (same configuration, on the same host and same LAN) assigned to another SolarWinds instance that has only NPM, NTA, and IPAM, and that APE is running fine. No hanging issues so far since it was built.

If the issue were with the host system or its configuration, it should also happen with the other APE on the other SolarWinds instance, because all of them were built at the same time with the same configuration.

       

Steps we have taken so far:

1. Registry modifications for TCP port exhaustion have been applied (a sketch of the typical changes follows this list).

2. Excluded SolarWinds folders from antivirus scanning.

3. Rebuilt the APEs.

4. Increased resources.

5. Disabled all Down or Unknown AppInsight for SQL and AppInsight for IIS applications.

6. Unmanaged all down nodes and any nodes not responding to WMI or SNMP.

7. Upgraded the SolarWinds platform to the latest version, i.e. Orion 2017 SP2 with NPM 12.1, SRM 6.4, and SAM 6.4.
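For anyone comparing notes, this is a minimal sketch of the kind of registry change usually meant by the TCP port exhaustion tweak in step 1. The values shown are illustrative examples, not necessarily what we used; confirm them against the SolarWinds guidance before applying, and reboot afterwards.

# Run in an elevated PowerShell session on the affected poller.
$tcpip = 'HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters'

# Highest ephemeral port number the OS will hand out (example value).
New-ItemProperty -Path $tcpip -Name 'MaxUserPort' -PropertyType DWord -Value 65534 -Force

# Seconds a closed connection sits in TIME_WAIT before the port can be reused (example value).
New-ItemProperty -Path $tcpip -Name 'TcpTimedWaitDelay' -PropertyType DWord -Value 30 -Force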

       

SolarWinds support case numbers:

      118543

      1189038

       

The Windows team has verified everything and could not find anything. We tried building a new server, but the same issue occurred there. So finally SolarWinds says there is no issue at the application level, and the Windows team says there is no issue at the server level. We are stuck in between; we do not know what is causing the SolarWinds APE servers to go into a hung state in the same pattern, i.e. every 3-4 days in the evening, around 4-6 PM EST.

Please help us find the root cause.

        • Re: APE Servers going in Hung State - Issue not solved
          nks7892

Based on our continuous observation and experiments, we have found the cause of this issue, but it is not resolved yet.

There are a few nodes, around 10-15 out of thousands, that are either down or not responding to SNMP and WMI; a few of them also have SAM templates assigned.

For some reason, the established TCP connections for these nodes keep increasing - 10k for one node, can you believe it, but yes, it is happening. For a few other nodes it is 7k, 5k, and so on.

This builds up over a period of time and then probably causes the system to go into a hung state, forcing a reboot of the server.

I do not have an answer for why this is happening. I believe SolarWinds should have some mechanism to automatically close connections that sit in the "established" state for more than some specific time period, but unfortunately they never move to a wait state and keep increasing until the system crashes. Can somebody help me figure out why this is happening? Is this a system-level issue, or is something wrong with the SolarWinds application's behavior?
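If it helps anyone checking the same thing, here is a minimal sketch (assuming the poller runs Windows Server 2012 or later, where Get-NetTCPConnection is available) to see which local process owns the piling-up established connections, which should show whether it is an Orion service or something else:

# Count established TCP connections per owning process on the local poller.
Get-NetTCPConnection -State Established |
    Group-Object -Property OwningProcess |
    Sort-Object -Property Count -Descending |
    Select-Object -First 10 -Property Count,
        @{ Name = 'PID';     Expression = { $_.Name } },
        @{ Name = 'Process'; Expression = { (Get-Process -Id $_.Name -ErrorAction SilentlyContinue).ProcessName } }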

            • Re: APE Servers going in Hung State - Issue not solved
              tigger2

              I've been experiencing similar issues and support basically found a few things and said "upgrade to NPM 12.1 and then if the problem persists, reopen the case".

               

              For what it's worth:

              I'm on NPM 12.0.0, with SAM, UDT, NCM, and IPAM all on 1 Windows 2008 R2 SP1 server (not latest versions, but not too far behind).  DB on separate server.

              Support DID find a few oddities from the log dumps I sent them:

1. There was "lots of DB fragmentation" and "DB auto defragmentation was turned off". For this, our DBAs confirmed they did not have the built-in MS SQL auto defragmentation turned on (per their best practice) and that they had custom jobs running daily to defragment. They checked, found more fragmentation than there should be, and discovered their scripts were not running properly against the Orion DB. I poked around and it appears that SolarWinds turns off (or does not turn on?) DB auto-defragmentation as a best practice (I found it somewhere on Thwack or on the support site), so it's not supposed to be on anyway. In addition, it appears that the NPM nightly maintenance job was running, because we don't get the "maintenance not running" alerts that appear when maintenance doesn't run. However, maybe the maintenance doesn't run fully? I can manually run the maintenance and it runs without visible issues. (A quick way to spot-check the fragmentation yourself is sketched after this list.)

2. I looked through the release notes of all the modules I have, up to NPM 12.1, and there appear to be fixes for several things I've seen on my server, so it seemed reasonable that support may be right about just needing to upgrade to 12.1.

3. SW support did (for one crash dump) point out the large number of connections/ports I had open at one time. I have been noticing it since then; previously my system froze so hard I couldn't log in to see it.
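For anyone wanting to spot-check the fragmentation claim themselves, here is a minimal sketch. It assumes the SqlServer PowerShell module (for Invoke-Sqlcmd) is available, and the instance and database names are placeholders to replace with your own:

# Report the most fragmented indexes in the Orion database (server/database names are placeholders).
$query = @"
SELECT TOP 20
       OBJECT_NAME(ips.object_id)        AS TableName,
       i.name                            AS IndexName,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.page_count > 1000
ORDER BY ips.avg_fragmentation_in_percent DESC;
"@
Invoke-Sqlcmd -ServerInstance 'YourSqlServer' -Database 'SolarWindsOrion' -Query $query | Format-Table -AutoSize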

               

              However...since you're at 12.1 .... I'll send you all the speculation I have about this issue...

               

              I have had "Orion issues" that involved the system just grinding to a halt (fast and slow) since v10. I've been on the same old server since I inherited it so I surmised it was just symptoms of an old system, with many uninstalls and re-installs of software over the years, coupled with bugs in X version of software.  So basically I've mostly just put up with it.

              Without going into a ton of detail, here's what I've got:

It appears there is a "slow grind to a halt" condition (e.g. it takes a week) and a "fast grind to a halt" condition (maybe 2 days can pass, then the system is suddenly not responsive). I think the symptoms may cross over depending on what is happening.

               

- Possible speculative cause #1: The system/OS is rebooted and Orion processes do not shut down properly, corrupting things like the poller and job engine databases, and maybe other things. I have noticed that when using the Orion Service Manager to shut down services, all services will stop (well, OK, maybe MSMQ or a few non-Orion core services stay up). I can go to the service manager and they are all stopped. I then open the process manager and there seem to be a few processes still running (I believe JobEngine or InformationService processes). These may take several minutes to shut down on their own. I believe that during an automated reboot (like for patching), or if you just believe the services are down and reboot, the reboot prematurely kills whatever processes are still running and causes some corruption. After Orion comes back up, it seems to hurt the system: symptoms slowly start manifesting over about a week ("jobs lost" and other Orion self-monitoring metrics hop up and down, the system starts slowing down, etc.). The fix for this, for me, was to clean up the job engine and other local databases on the Orion server. There are several guides on how to do so on the support site, most of which amount to: find the installer for the service, uninstall it (which clears the DB files), and then re-install. The problem is you really don't know "why" or "where" the issue lies, so maybe these fixes don't work. It seems the "best" option is the more involved one where you run the config wizard and have it reinstall/repair the DB/services/website. It takes more downtime, but most of the time the system then stays up and doesn't exhibit symptoms... until it starts again, usually a week after automated patching. If I manually stop the services and wait until any background processes stop on their own, it seems the system does not experience this issue.
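To make the "manually stop and wait" part concrete, here is a minimal sketch of what I mean, assuming the relevant services and processes all carry a SolarWinds prefix (adjust the filters if yours are named differently):

# Stop SolarWinds services, then wait for any lingering SolarWinds processes to exit before rebooting.
Get-Service -DisplayName 'SolarWinds*' |
    Where-Object { $_.Status -eq 'Running' } |
    Stop-Service -Force

# Poll for up to 15 minutes until no SolarWinds-named processes remain.
$deadline = (Get-Date).AddMinutes(15)
do {
    $lingering = Get-Process | Where-Object { $_.ProcessName -like 'SolarWinds*' }
    if ($lingering) { Start-Sleep -Seconds 30 }
} while ($lingering -and ((Get-Date) -lt $deadline))

if ($lingering) {
    Write-Warning "Still running after the wait: $($lingering.ProcessName -join ', ') - not safe to reboot yet."
} else {
    Write-Output 'All SolarWinds-named processes have exited; safe to reboot.'
}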

               

- Possible speculative cause #2: I have noticed (and there appear to be fixes for this in NPM 12.1) that the SolarWinds Job Engine v2 process crashes quite frequently (like a few times a day; sometimes it can go a day without crashing, other times it crashes 4 times in an hour... and then it's OK for a while). I also have the SNMP Trap service crashing rather frequently. Calls to support have not been too helpful, as the fix above (uninstall/reinstall the service) DOES seem to clear it up for a while until it comes back. I believe this constant intermittent crashing can cause or contribute to the corruption issues in #1. I also believe it is a symptom of another issue, the "weird connection issues" you have hit on in your posts.

               

- Possible speculative cause #3: Weird connection issues. This seems to be what I have experienced for the "fast grind to a halt" issue, and I'm still speculating on it. I've had the "fast grind to a halt" issue happen maybe 10 times; usually it just keeps happening until I find some temporary cure. Because of this, and because calls to support identified it, I've built a SAM template collecting data on the number of connections from Orion and the number of connections to the most common target system (which is usually the Orion DB). Here's what I have seen and my guesses:

               

a. Many connections to a **single** server. Like thousands. This has happened only once, and I have only one guess. In Orion we have set up a speculative "Windows OS soft hang" alert, which basically looks at the CPU data collected on the Windows servers. Since we use WMI for all our Windows servers, it seems a Windows server will sometimes go unresponsive for any number of reasons (MS support blames some driver or similar issue, so maybe "soft memory fault" is a better term). When this happens, we see some metrics in Orion still collect data while CPU data stops. The server is still pingable. You can map a drive, walk the filesystem, and remotely connect to it via the service manager... but generally you cannot log in (or login hangs). The "fix" is to hard boot the server/VM and maybe hunt down the app/driver issue, if you can. This issue IMHO has been in Windows for as far back as I can remember, and I've been monitoring for it for over a decade via various means. Anyway, when a server gets to this point we normally get an alert. One time Orion had the "fast crash" and I was able to log in and get a diagnostics dump. I then rebooted Orion, ran the fixes above, etc. Support looked at the dump and pointed to all the connections open to a single server. I checked that server and it was in the "soft hang" state. Normally this doesn't appear to hurt Orion, but my guess is the remote server was stuck in some special state where connections to it (WMI) would not get the hint to drop. I put this node into maintenance in Orion, hard booted the server, and put it back. The issue has not come back. So the takeaway is that there looks to be some issue where Orion will not meter its connections to a remote server, and the remote server will just keep soaking them up if it's having serious issues.

               

b. Many connections to **multiple** servers. Like thousands. I've seen this where we have been patching groups of servers, and afterwards Orion has the fast crash. On the Orion server there are several connections (5-10) to many servers. I suspect that when those servers are being patched / applying patches, they do the same thing as above: they start accepting connections for things like WMI, but they never return any info and the connection won't drop. The "fix" for Orion has been the same as above (shut down Orion services, run maintenance/reinstalls, reboot the Orion server).

               

c. Many connections to **multiple** servers. Like thousands, but patching is not occurring and has not occurred recently. I suspect the crashing-process issue I have is causing this. I'm betting Orion is running (JobEngine or Polling Engine, whatever is crashing today), then it crashes but some processes remain up. This is similar to the above, except I believe the process starts back up and suddenly there are two "threads" polling/collecting info at the same time... and this causes the sudden jump in connections.

               

For the cases above, I found I cannot stop remote systems from causing issues, but I have been much more careful about when we patch servers Orion is monitoring (being prepared to nicely shut down Orion after it's over and reboot the Orion app server), as well as manually patching the Orion server itself (so I can shut it down nicely). Identifying problematic external systems and dealing with them (removal/maintenance mode) has been done for a few systems that are "mostly not up all the time". When I have to repair Orion, I do the full "config wizard complete reinstall/repair", not the piecemeal single-service fixes.

               

Also: I mentioned I have some monitoring (and alerting) for the "many connections open" issue. It hasn't done much good, because it happens so fast the system cannot alert me (and the alerting system is failing at that time too, as is everything). I know I'm experiencing the port exhaustion issue because after it's all over and I look at this monitoring, I see maybe 1 or 2 data points (where there should have been many) where it actually saved data, and it jumps from a "normal" number of, say, 1.5k connections to 7k+. During these times the CPUs in my Orion server get pegged (as evidenced when I can log in, and by what few data points can be collected). All the errors in the Windows event logs make it look like a database issue (cannot open connection to database) or a network issue with the DB (cannot resolve name of DB server, etc.).

               

I've checked whether backups are contributing and they are not. The antivirus exclusions on my system are not quite up to SolarWinds specs (they're close; they were probably put in long ago and not updated) and I'm trying to get their specs approved, but I suspect that's not the cause.

               

Also: I've noticed that sometimes my Orion system will "recover" from this issue on its own (usually when I can't get to it fast enough to fix it). Suddenly I get a lot of alerts (after it's recovered), I check the data for the "connection monitor" and I see it had issues. I look in the Orion and Windows logs and the Orion system/services restarted itself. Normally I don't wait around for it to self-recover... but it appears to be able to do so sometimes.

               

              Anyway, hope this helps

                • Re: APE Servers going in Hung State - Issue not solved
                  tigger2

                  Also: had the issue again this Sunday afternoon.  It fits into the "Possible speculative cause #3:", part "c."

                   

New info: This time I was barely able to log into the system and do a "netstat -a" to confirm the connection issue. I was eventually able to shut down the SolarWinds services (via the Orion Service Manager) and the system became responsive again (Orion processes were eating up the CPU). After everything was shut down, I noticed there were several "conhost" processes and "cscript ..." processes still running (like 20 of them) that looked like they were related to monitors we have in SAM (custom-built scripts). I was able to kill the conhost processes, but the cscript processes could not be killed (I tried multiple built-in Windows tools, including Sysinternals tools). It seems, from some internet reading, that these processes were owned by SYSTEM and could not be killed for that reason. Each of the cscript processes had a timeout setting in its command-line string that looks like it was being ignored, since they were still hanging around.
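For anyone who wants to see what those leftover script processes are before killing anything, here is a minimal sketch (nothing assumed beyond the cscript/conhost process names mentioned above):

# List lingering cscript/conhost processes with their command line, owner, and start time.
Get-CimInstance -ClassName Win32_Process -Filter "Name='cscript.exe' OR Name='conhost.exe'" |
    ForEach-Object {
        $owner = Invoke-CimMethod -InputObject $_ -MethodName GetOwner
        [pscustomobject]@{
            PID         = $_.ProcessId
            Name        = $_.Name
            Owner       = "$($owner.Domain)\$($owner.User)"
            StartedAt   = $_.CreationDate
            CommandLine = $_.CommandLine
        }
    } | Format-List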

After cleanup, I rebooted the server, and (it seems to happen only when this occurs) it did not go down all the way and had to be manually powered off. I then performed the full "run the NPM config wizard and have it reinstall/repair the DB/services/website" and that fixed the issue.

                   

                  Anyway: I'm not sure if the cscript processes are a cause or a result of the symptoms. Probably a result.  We have had these monitors in our environment for a long time and they have not given us issues like this, and they all open remote connections to servers ( to read logs, etc) so they probably just hung trying to connect.

                   

In addition: I have gone through the timelines of some of the recent issues I've been having, and it seems that even though I've seen this occur for years (intermittently), it has only recently cropped up again with such a vengeance. It looks like I started having these recurring issues around the time the Microsoft patches were released in June, as I had not had this issue in the months prior to that. Maybe something in those patches changed the OS slightly and is not playing well with the Orion app? I've had the issue re-occurring about every 4-6 days in the last 30 days, since Jul 17th, and another cluster of them in June. I've been shutting down all Orion services and restarting every few days to see if that helps, but I got lazy and didn't do it this week, hoping my issues were resolved (nope).

                    • Re: APE Servers going in Hung State - Issue not solved
                      aLTeReGo

After reviewing the entirety of the symptoms you described above, I would concur with support's assessment that upgrading to NPM 12.1 would be the most prudent action. I say this not because it's seemingly what every software vendor says whenever you encounter an issue, but because many of the symptoms you describe sound identical to issues that were addressed as part of the NPM 12.1 release and subsequent hotfixes for that release.

                        • Re: APE Servers going in Hung State - Issue not solved
                          nks7892

We are already on NPM 12.1. The issue was there with 11.5.3, but we upgraded hoping that 12.1 would take care of such issues.

But no luck; even after upgrading we are facing these issues. The only option we have is to unmanage those nodes.

                          • Re: APE Servers going in Hung State - Issue not solved
                            tigger2

                            I'll be finding out very soon as I'm upgrading tomorrow.    Given nks7892's issues I'm not too hopeful, but I would also assume that an issue this observable would be happening to others so maybe there's something common amongst our environments that's not common for everyone else?  nks7892 mentions it occurring with APE's, and my system is "everything on a single server" that I've been upgrading since it was an eval setup of NPM 10.x (with all the products installed) with SQLExpress all on the same server, so I can't imagine our environments have too much in common.

                             

Another possible piece of the puzzle, if it continues or if anyone else is reading this: I talked to our DBA and he mentioned that our Orion database server is something like 10x *underutilized* due to how we build DB servers here and what our initial plans were for the database server. We're planning on moving it to a less powerful system and using our current DB server for a more business-centric app. Anyway, this takes some of the possible focus away from it being a DB performance issue (not really mentioned above, but we were looking into it). During our talk he also mentioned that we are not in a failover DB cluster (which we knew and are wanting to move toward). He indicated that there have been known issues with an MS SQL 2012 DB set up to be in a cluster but not actually clustered (a "go nowhere" cluster), and maybe that could be causing some of our symptoms. Once we upgrade to 12.1, if we have more issues we're going to move the DB to a "standard" cluster and see if that helps. It's a long shot, but we're going to do it anyway, so we might as well see whether we have to do it now because the issue persists after the upgrade, or later when it's actually convenient and in our plans.

                             

Given this, is there a SolarWinds "recommended setup for MS SQL clustering of the DB" guide or other info? I know there are DB tuning guides and other things; I just don't recall ever seeing anything DB-cluster specific. If it matters, we're thinking about building out our system to use the new HA features to cluster the app components.

                              • Re: APE Servers going in Hung State - Issue not solved
                                tigger2

                                Tiny update:  I just now upgraded to "the latest of things that support Win 2008 R2 SP1", which are for me:

                                Orion Platform 2017.1.3 SP3,

                                NCM 7.5.1,

                                IPAM 4.3.2,

                                VIM 7.0.0,

                                NetPath 1.1.0,

                                UDT 3.2.4,

                                DPA 10.0.1,

                                NPM 12.1,

                                QoE 2.3,

                                SAM 6.2.4

                                 

In addition, our DBAs have found what they believe may be an issue with a "go nowhere" database cluster like ours, where (possibly) some recent Windows updates made a change so that the standalone DB does a "reset" of its "lease" (like maybe a checksum or something) and drops the connection for a short period. It's like having a heartbeat between MS SQL nodes, but there's never another server to respond to the heartbeat, so the single server decides to renew its "lease"/connection to reconnect to the cluster, and once it tries, it gets a new lease to use for a while. At my site, two apps (Orion and another) are set up like this and have recently been having intermittent issues. Our DBAs set up a fake replication DB (or something, I don't really understand it) so the heartbeat/lease check will succeed, even though there's no other DB on the other side to talk/replicate to.

                                 

                                So: I upgraded to NPM 12.1, and we made some changes to our DB.  Waiting and seeing how it goes...

                                  • Re: APE Servers going in Hung State - Issue not solved
                                    nks7892

                                    Thanks for the updates. Let us know if that works.

The only gap I can see in my environment is the OS upgrade; everything else is updated.

                                      • Re: APE Servers going in Hung State - Issue not solved
                                        tigger2

Update: The system had the same issue (apparently the same; the symptoms were the same) last night at 6 PM. The upgrades I did to go from NPM 12.0.0 to 12.1 took all day and were finished around 7 PM on Wednesday, so it went down on its own about 24 hours after the upgrade.

I've re-opened my support case, and I took a diagnostics dump last night. Looking at my history of support cases, I've been having "system crashes, uses up CPU resources" type issues that I've opened tickets on (because I do know how to temporarily remedy it, and sometimes it stays fixed for a while) since Dec 2014 (NPM 10.6). The tickets get closed because the system gets fixed (support may find some anomalies, or recommend a re-install of a component) and seems stable. Then it crops up again, but it's never been this bad in the past.

                                         

For what it's worth, last night I was barely able to log into the app server, take a 'netstat -a', and shut down the Orion services. From the netstat I didn't see anything that looked too weird, but the port monitoring I set up, for the few data points it could collect, showed 5.6k total connections. Normally (after the upgrade) it seems to be about 1.2k connections total, with 250 to the Orion DB. (Note: before the upgrade it was about 1.5k total, with 56 connections to the Orion DB, so something about the upgrade made the connections to the DB go up about 5x.) Anyway, when I shut down the Orion services, the InformationServiceV3 and the Collector service took a long time to shut down (like 3+ minutes) after everything else shut down rather quickly. It seemed the system was not very responsive until they shut down, but I was not doing a lot of testing to try to prove anything. Looking at the running processes, I did not see any that were "hung" after all services stopped, like I've seen in the past, so I didn't reboot the server.

                                         

Another note: I ran the diagnostics tool to get diag info to upload... and it took about an hour to run. I think it's taken this long in the past, but I also think it didn't used to take this long. I poked around and found some directories with lots (100k) of very old log files the diag tool should not be picking up, so I'm wondering if maybe there are other places in the filesystem with many logs/files that the diag tool slowly parses through, or that the Information Service/Collector service has to deal with, which would explain why they shut down so slowly. It seems like a long shot, but that's where I am at this point, and I'm going to see if I can identify and clean up some of the junk on the system.

                                         

                                        Also: To get the system going last night, I ran the NPM config tool and just had it reinstall all the services.

                                          • Re: APE Servers going in Hung State - Issue not solved
                                            tigger2

More info, probably useless:

I did some checking and it appears that our "persistent issues" started on 6/18. OS patching occurred on 6/15, but *also* it appears we added AppInsight for IIS monitoring to 3 servers on 6/12. Since this appears to use PowerShell remoting, I was hesitant, because we've had issues with monitoring that uses PowerShell remoting in the past due to (apparently) known issues with WinRM in some situations not releasing connections and causing memory issues on the remotely monitored servers (we had this issue with AppInsight for Exchange a while back). Anyway, I warned the app support person about this, and we added the monitoring. I'm not saying this is the cause, just something I noticed in the timeline that seems to correlate with changes, and with connections causing issues. I'm seeing if we can put this monitoring into maintenance if the system crashes again, just to see whether things clear up or stay up longer. I'm not too hopeful it's this, though.

                                              • Re: APE Servers going in Hung State - Issue not solved
                                                aLTeReGo

                                                tigger2, do you have an open case with support? If so, what is your case number?

                                                  • Re: APE Servers going in Hung State - Issue not solved
                                                    tigger2

                                                    I've had a case I've re-opened a few times.  It's: 1184575.

                                                     

If it helps, look into these cases as well, as they look like similar issues I've opened in the past. I haven't read through them all, so maybe they're not related, but the details/symptoms are very similar:

869987 <- Issues with NPM using up system resources, polling failing

727472 <- System instability after hard power down of server OS

                                                      • Re: APE Servers going in Hung State - Issue not solved
                                                        tigger2

Update: Apparently, as part of our migration from NPM 12.0.0 to 12.1, our upgrade of IPAM didn't uninstall the old version properly (as I see it; maybe it's my fault), so we had two IPAM installations on our app server (i.e. in "Programs and Features" there were multiple IPAM installations in the list). Worked with support to uninstall both of them, then re-installed the same/latest version that we had upgraded to. Ran the NPM config wizard to finish up.

                                                         

Apparently the original issue we had should have been fixed by migrating to NPM 12.1, and the recent issues, which have similar symptoms, were caused by the multiple IPAM installations on the same server.

                                                         

I'll update on how it goes if we run into anything else major/eye-opening, or if this resolves it. We're still looking into some changes to make on our side, as there are various updates here and there we can make on the OS/VMware/hardware, etc.

                                                          • Re: APE Servers going in Hung State - Issue not solved
                                                            tigger2

Update: Even after the fix, and some additional config changes to our SNMP settings (our SNMP Trap service was also crashing every 4 hours), we're still babysitting the system. Basically this means letting it run X days, then restarting services, and then letting it run slightly longer. It's not absolutely necessary to do this, but I'm hoping to see if there's an upper limit to how long it can go before it happens again. If I can get a solid week without issues and memory/resources are stable, etc., I think I'd call it "good". We had a similar issue occur this Sunday, but we have a definite cause for that one: there was a confirmed connectivity "blip" of a few seconds in our database cluster affecting multiple databases (not just Orion's). What makes this annoying is that the Orion app server did not recover from the blip; it just degraded into the same state(s) as posted previously, with all the same symptoms, event log messages, similar-looking netstat output, many down alerts sent to support teams, etc.

                                                             

I'm still seeing that after services show as shut down (in Windows and the SolarWinds service manager), some of the SolarWinds processes tied to those services hang around for a while before they exit. This time it was one process (a JobEngine, I think) that stopped on its own about a minute after being listed as shut down.

                                                             

The same fix of using the config wizard to reinstall all services appears to have fixed it. I took some diagnostics and sent them to support because, even though the cause is different and the system seems *much* more stable than in the past, it doesn't seem able to handle a slight loss of DB connectivity and degrades quickly (this time it took about 2 hours before alert flooding occurred and it was noticed). I'm still working with my management to get more changes made, specifically the licensing and setting up the new HA pools, so I don't know if that will alleviate everything or if the issue will just be less noticeable (to users) with the secondary/HA server.

                                                             

nks7892, have you seen or been able to look into the database side for connectivity/latency issues? Our DBAs definitely saw this one on their side, but I'm not sure where/how they saw it. I can hit them up for more info if it would help.

                                                              • Re: APE Servers going in Hung State - Issue not solved
                                                                nks7892

Yes, we checked database connectivity from the primary poller and the other additional polling engine servers, but that was fine in our case.

And I have not seen any issues since we stopped monitoring the down nodes, which is strange, because the TCP connection count for those nodes kept increasing up to 10k, 12k, and so on.

 

We do not have any other solution for now except unmanaging those nodes that are down or not responding to WMI; this is how we are managing it.

                                                            • Re: APE Servers going in Hung State - Issue not solved
                                                              aLTeReGo

                                                              tigger2, I have asked to have your case 1184575 escalated.

                                                                • Re: APE Servers going in Hung State - Issue not solved
                                                                  tigger2

Thanks! I've been putting together a migration plan to get our single Orion system off of our old physical 2008 R2 server (we have to do it anyway) and onto some VMs with the HA setup. If the system can just behave well enough until we can migrate it onto a fresh install of "everything", maybe the issues will go away. We've been upgrading our current system for years on the same hardware/OS, and it has a lot of extra stuff on it since it was originally a POC for multiple SolarWinds products, some of which we don't use anymore (or were EOL'd) and were uninstalled (or partially uninstalled). If nks7892 didn't have similar issues, I'd be pointing the finger much more at the old OS, out-of-date firmware, and other cruft that's too hard to track down.

                                                                    • Re: APE Servers going in Hung State - Issue not solved
                                                                      tigger2

                                                                      In case it's of any help to anyone here:

Moving away from our old Windows 2008 R2 server onto a new Windows 2012 R2 server (a fresh install of both the OS and Orion), currently on NPM 12.1, appears to have fixed everything in our environment. The system no longer hangs/crashes on its own, and it also seems able to survive a reboot (automated Windows patching) without damaging anything (I'm still babysitting it to make myself feel better). In the past, shutting down services (via the Orion Service Manager) took a few minutes and sometimes some processes never shut down. Now all processes shut down in under 15 seconds.

                                                                       

                                                                      The only tuning we did on this new default install was this, which I believe is the "port exhaustion" tunings mentioned in the initial post: https://support.solarwinds.com/Success_Center/Server_Application_Monitor_(SAM)/Tweaking_performance_of_Windows_Server

                                                                       

Additionally, we moved IPAM and UDT onto their own servers and databases. This was both to alleviate possible load on the core system and to hand these over to other teams for support/maintenance. IPAM now seems to run its scan jobs much faster than before. Our "core" NPM environment now has just NPM 12.1, SAM 6.4.0, and NCM 7.6.

                                            • Re: APE Servers going in Hung State - Issue not solved
                                              nks7892

Below are the support cases we raised with the support team; one of them is still open.

                                              118543

                                              1189038

                                              • Re: APE Servers going in Hung State - Issue not solved
                                                j_dennis

                                                Any update to this case? We have had issues in our environment with this too.

                                                  • Re: APE Servers going in Hung State - Issue not solved
                                                    nks7892

If this is happening with some pattern, like the server going into a hung state after every 5 or 7 days, then it means something is growing over a period of time.

 

It could be CPU or memory usage for a process or service, or, the most probable cause we found in our environment, TCP connections (if you are using SAM).

                                                     

                                                    So here are the steps -

                                                     

1. Log in to the problematic server.

 

2. Open a command prompt and run the commands below:

 

netstat -noa | find /c /i "estab"

netstat -noa | find /c /i "wait"

                                                     

                                                     

                                                     

If the output of the above commands is more than 3000 (in my case), then something is not right with a few nodes. The next step is to find out which node is consuming the most TCP connections; the sketch below shows one way to do that.
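For example, here is a minimal PowerShell sketch (run locally on the poller; nothing environment-specific assumed) that groups established connections by remote address so the worst offenders stand out:

# Group established TCP connections by remote IP address and show the top talkers.
netstat -noa |
    Select-String -Pattern 'ESTABLISHED' |
    ForEach-Object { ($_.Line.Trim() -split '\s+')[2] -replace ':\d+$', '' } |
    Group-Object |
    Sort-Object -Property Count -Descending |
    Select-Object -First 10 -Property Count, Name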

                                                     

                                                    Try this and let me know.

                                                      • Re: APE Servers going in Hung State - Issue not solved
                                                        j_dennis

                                                        Thanks. I have done that exercise when I am able to RDP into the box. If there's port exhaustion, I cannot login obviously. I now monitor for port consumption on my APEs to band-aid fix it, but still never found a root cause of why it's happening. There doesn't appear to be a common node that consumes a ton of ports. The APEs are running on Windows 2012 R2.

                                                          • Re: APE Servers going in Hung State - Issue not solved
                                                            tigger2

                                                            If it helps anyone...

                                                            I have this in a SAM template inside a "Windows Powershell monitor" component.  The template is associated with my main poller server (this won't run against a remote server, but you could probably find a way to do it).  I have it scheduled to run at 10 minute intervals.

                                                             

                                                            The code runs and produces 2 metrics/statistics and a message:

                                                            Metric: MostHostRemoteConnections = The count of TCP ports of the IP address that is using up the most ports

                                                            Message: Message.MostHostRemoteConnections: The IP address that is using up the most TCP ports

                                                            Metric: TotalRemoteConnections = The total number of TCP connections

                                                             

                                                            For me, the MostHostRemoteConnections is always pointing to my Orion database server and is very stable (unless there's an issue with port exhaustion and it's a single server using up all the ports), and the TotalRemoteConnections stays mostly stable unless port exhaustion starts occurring (Could be a single server or multiple servers with many connections. It climbs fairly quickly if it's many).  I set alerts on both metrics that were slightly higher than the "normal" data range of these metrics and in the past this has allowed me to get an alert and log into the Orion server/poller before the system freezes up.  Once port exhaustion hits the odds of getting the alert are low so if it occurs very fast or your thresholds get set too high you may not get any alert.

                                                             

You should be able to run this at the PowerShell command line to see the output and compare it to standard "netstat" output. I'm not making any claims as to how well this code works, so use it with two grains of salt and test/understand it as much as you can before attempting to use it (as always).

                                                             

                                                            #####################

                                                             

Try {
    # To be run locally from the Orion App server ONLY

    $results = netstat -ano | Select-String -Pattern '\s+(TCP)'

    # $totalCount = $results.count
    $total_remote_count = 0
    $host_table = @{}

    foreach($result in $results) {
        $item = $result.line.split(' ',[System.StringSplitOptions]::RemoveEmptyEntries)

        if($item[1] -notmatch '^\[::'){
            # parse the netstat line for the remote address and port
            if (($ra = $item[2] -as [ipaddress]).AddressFamily -eq 'InterNetworkV6'){
                $remoteAddress = $ra.IPAddressToString
                $remotePort = $item[2].split('\]:')[-1]
            }
            else {
                $remoteAddress = $item[2].split(':')[0]
                $remotePort = $item[2].split(':')[-1]
            }

            if(
                ($remoteAddress -ne "0.0.0.0") -and
                ($remoteAddress -ne "127.0.0.1")
            ){
                # write-host "$remoteAddress : $remotePort"
                $total_remote_count++
                if($host_table.ContainsKey($remoteAddress)) {
                    $val = $host_table.Get_Item($remoteAddress)
                    $val++
                    $host_table.Set_Item($remoteAddress, $val)
                }
                else {
                    $host_table.Add($remoteAddress, 1)
                }
            }
        }
    }

    # sort the hash by values so the first entry is the highest, then take just that entry
    $most_host_connections = $host_table.GetEnumerator() | Sort-Object Value -descending | Select-Object -First 1

    ##########
    # Send the data back to Orion
    ##########
    write-host "Message.MostHostRemoteConnections: Most connections from this host are to $($most_host_connections.Key)"
    write-host "Statistic.MostHostRemoteConnections: $($most_host_connections.Value)"
    write-host "Statistic.TotalRemoteConnections: $($total_remote_count)"

    # write-host "TOTAL: $totalCount"
}
Catch {
    # write-host ("Message.Errors: Error: {0}" -f $_.Exception.Message);
    # write-host ("Statistic.Errors: -1") ;
    # This will set the component to DOWN. The two commented-out lines above are for manual troubleshooting.
    Exit (1)
}

# This will set the component UP
Exit (0)

                                                              • Re: APE Servers going in Hung State - Issue not solved
                                                                nks7892

Finally, the issue has been resolved. The root cause was definitely the high TCP connection count, which was consuming all the system resources over a period of time. The other culprit was a memory leak caused by some application (that has to be checked by a Windows expert to determine which app is eating the system's memory).

So we created multiple application monitors to detect both issues well before they break our environment, and we are living happily now.