
APE Servers Going into a Hung State - Issue Not Solved

Hi,

I am posting this issue because nothing has worked out so far - support cases, Windows troubleshooting, log capturing, etc.

Environment -

1. NPM, SAM, and SRM on the primary poller. Configuration: 160 GB RAM and 16-core CPU. OS: Windows Server 2008 R2 SP1.

2. SQL DB on a separate server: 256 GB RAM and 16-core CPU. OS: Windows Server 2008 R2 SP1.

3. Three Additional Polling Engines (APEs) for NPM and SAM. SRM is also installed, but we are not using it so far. Configuration for each APE: 40 GB RAM, 16-core CPU. OS: Windows Server 2012 Standard.

Issue -

Every 3-4 days an APE server goes into a hung state, which causes SolarWinds to stop monitoring and generates a flood of false email alerts for past events and DCOM failure events. It happens only in the evening, around 4-6 PM EST.

Initially we had only one APE. After we distributed the load from the primary poller it worked fine for one week, but then it started going into a hung state. Nothing could be done except a reboot, after which monitoring worked fine again.

After a case with the SolarWinds support team, they eventually suggested rebuilding the APE. We built another APE machine with the same configuration mentioned above and moved all the nodes from the original APE to the second one. The issue then started on the new APE.

We have had multiple cases where SolarWinds said this is a system issue, not an issue with the SolarWinds product.

There were a lot of subscription errors, but they stopped after upgrading SolarWinds to Orion 2017 SP2. The polling rate is also normal.

I have another APE (same configuration, on the same host and the same LAN) assigned to another SolarWinds instance that has only NPM, NTA, and IPAM, and that APE is running fine. No hanging issue so far since it was built.

If the issue were with the host system or configuration, it should also happen with the other APE on the other SolarWinds instance, because all of them were built at the same time with the same configuration.

Steps we have taken so far:

1. Registry modifications for TCP port exhaustion (see the sketch after this list).

2. Excluded SolarWinds folders from antivirus scanning.

3. Rebuilt the APEs.

4. Increased resources.

5. Disabled all AppInsight for SQL and AppInsight for IIS applications that are Down or Unknown.

6. Unmanaged all down nodes and all nodes that are not responding to WMI or SNMP.

7. Upgraded the SolarWinds platform to the latest version, i.e. Orion 2017 SP2 with NPM 12.1, SRM 6.4, and SAM 6.4.
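
For reference, the registry/netsh changes for step 1 were roughly along these lines (a minimal sketch; the exact values are examples only and should be confirmed against current SolarWinds/Microsoft guidance for your OS):

```powershell
# Typical TCP port-exhaustion tuning on Windows Server 2008 R2 / 2012.
# Values are examples; a reboot is required for the registry entries to take effect.

$tcpParams = 'HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters'

# Raise the upper bound of the ephemeral port range (legacy value, still honored on 2008 R2).
Set-ItemProperty -Path $tcpParams -Name 'MaxUserPort' -Type DWord -Value 65534

# Shorten how long closed sockets linger in TIME_WAIT (OS default is 240 seconds).
Set-ItemProperty -Path $tcpParams -Name 'TcpTimedWaitDelay' -Type DWord -Value 30

# On 2008 R2 / 2012 the dynamic port range is also controlled through netsh;
# widen it explicitly and then display it to verify the change.
netsh int ipv4 set dynamicport tcp start=10000 num=55535
netsh int ipv4 show dynamicport tcp
```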

SolarWinds support case numbers:

118543

1189038

The Windows team has verified everything and could find nothing, and building a new server reproduced the same issue there. So SolarWinds says there is no issue at the application level, and the Windows team says there is no issue at the server level. We are stuck in between. We don't know what is causing the SolarWinds APE servers to go into a hung state in the same pattern, i.e. every 3-4 days in the evening, around 4-6 PM EST.

Please help us find the root cause.

  • Based on our continuous observation and experiments, we have found the cause of this issue, but it is not resolved yet.

    There are a few nodes, around 10-15 out of thousands, that are either down or not responding to SNMP and WMI; a few of them also have SAM templates assigned.

    For some reason, the number of established TCP connections to these nodes keeps increasing: 10k for one node (can you believe it? but yes, it is happening), and 7k, 5k, and so on for a few others. A quick way to count these per node is shown at the end of this post.

    This builds up over a period of time and is probably what causes the system to go into a hung state and forces us to reboot the server.

    I do not know why this is happening. I believe SolarWinds should have some mechanism to automatically close connections that stay in the "established" state for longer than some specific period. Unfortunately they never move to a wait state and just keep increasing until the system crashes. Can somebody help me figure out why this is happening? Is this a system-level issue, or is something wrong with the SolarWinds application behavior?
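
    For reference, this is roughly how we are counting them on the APE (an illustrative snippet I put together, not something SolarWinds ships; it parses netstat so it works on both 2008 R2 and 2012, and assumes IPv4 addresses):

    ```powershell
    # Count ESTABLISHED TCP connections per remote address by parsing netstat output.
    # Run on the APE to find the nodes that are accumulating thousands of connections.
    netstat -ano |
        Where-Object { $_ -match '^\s*TCP\s' -and $_ -match 'ESTABLISHED' } |
        ForEach-Object {
            # netstat columns: Proto, Local Address, Foreign Address, State, PID
            $cols = ($_ -split '\s+') | Where-Object { $_ }
            # Strip the port from the foreign address ("a.b.c.d:port"; IPv4 assumed).
            ($cols[2] -split ':')[0]
        } |
        Group-Object |
        Sort-Object Count -Descending |
        Select-Object -First 20 Count, Name |
        Format-Table -AutoSize
    ```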

  • I've been experiencing similar issues and support basically found a few things and said "upgrade to NPM 12.1 and then if the problem persists, reopen the case".

    For what it's worth:

    I'm on NPM 12.0.0, with SAM, UDT, NCM, and IPAM all on 1 Windows 2008 R2 SP1 server (not latest versions, but not too far behind).  DB on separate server.

    Support DID find a few oddities from the log dumps I sent them:

         1. There was "lots of DB fragmentation" and "we had DB auto fragmentation turned off". For this, our DBAs confirmed they didn't have the built-in MS SQL auto-defragmentation turned on (per their best practice) and that they had custom jobs running daily to defragment. They checked, found there was more fragmentation than there should be, and discovered their scripts were not running properly against the Orion DB. I poked around and it appears that SolarWinds turns off (or does not turn on?) DB auto-defragmentation as a best practice (I found it somewhere on Thwack or on the support site), so it's not supposed to be on anyway. In addition, it appears that the NPM nightly maintenance job was running, because we don't get the "maintenance not running" alerts that appear when the maintenance doesn't run. However, maybe the maintenance doesn't run fully? I can manually run the maintenance and it runs without visible issues.

         2. I looked into the release notes of all the modules I have, up through NPM 12.1, and there appear to be fixes for several things I've seen on my server, so it seemed reasonable that support may be right about just needing to upgrade to 12.1.

         3. SW support did (for one crash dump) point out the large number of connections/ports I had open at one time. I have been noticing it since then; previously my system froze so hard I couldn't log in to see it.

    However... since you're already at 12.1, I'll send you all the speculation I have about this issue...

    I have had "Orion issues" that involved the system just grinding to a halt (fast and slow) since v10. I've been on the same old server since I inherited it so I surmised it was just symptoms of an old system, with many uninstalls and re-installs of software over the years, coupled with bugs in X version of software.  So basically I've mostly just put up with it.

    Without going into a ton of detail, here's what I've got:

    It appears there is a "slow grind to a halt" condition (e.g. it takes a week) and a "fast grind to a halt" condition (maybe 2 days can pass and then the system is suddenly not responsive). I think the symptoms may cross over depending on what is happening.

    - Possible speculative cause #1: The system/OS is rebooted and the Orion processes do not shut down properly, corrupting things like the poller and job engine databases, and maybe other things. I have noticed that when using the Orion Service Manager to shut down services, all services will stop (well, OK, maybe MSMQ or a few non-Orion core services stay up). I can go to Service Manager and they are all stopped. I open Process Manager and there seem to be a few processes still running (I believe JobEngine or InformationService processes). These may take several minutes to shut down on their own. I believe that during an automated reboot (like for patching), OR if you just believe the services are down and reboot, the reboot prematurely kills whatever processes are still running and causes some corruption. After Orion comes back up, it seems to hurt the system: symptoms slowly start manifesting over about a week ("jobs lost" and other Orion self-monitoring metrics hop up and down, the system starts slowing down, etc.). The fix for this, for me, was to clean up the job engine and other local databases on the Orion server. There are several guides on how to do so on the support site, most of which amount to: find the installer for the service, uninstall it (which clears the DB files), and then re-install. The problem is, you really don't know "why" or "where" the issue lies, so maybe these fixes don't work. It seems the "best" option is the more involved one, where you run the Configuration Wizard and have it reinstall/repair the DB, services, and website. It takes more downtime, but most of the time the system then stays up and doesn't exhibit symptoms... until it starts again, usually a week after automated patching. If I manually stop the services and wait until any background processes stop on their own, the system does not seem to experience this issue.

    - Possible speculative cause #2: I have noticed (and there appear to be fixes for this in NPM 12.1) that the SolarWinds Job Engine v2 process crashes quite frequently (like a few times a day; sometimes it can go a day without crashing, other times it crashes 4 times in an hour... and then it's OK for a while). I also have the SNMP Trap service crashing rather frequently. Calls to support have not been too helpful, as the fix above (uninstall/reinstall the service) DOES seem to clear it up for a while until it comes back. I believe (myself) that this constant intermittent crashing can cause or contribute to the corruption issues in #1. I also believe it is a symptom of another issue, which you have hit on in your posts about "weird connection issues".

    - Possible speculative cause #3: Weird connection issues. This seems to be what I have experienced for the "fast grind to a halt" issue, and I'm still speculating on it. I've had the "fast grind to a halt" issue happen maybe 10 times; usually it just keeps happening until I find some temporary cure. Because of this, and because calls to support identified it, I've built a SAM template collecting data on the number of connections from Orion and the number of connections to the most common system (which is usually the Orion DB). Here's what I have seen and my guesses:

    a. Many connections to a **single** server. Like thousands. This has happened only one time, and I have only one guess. In Orion we have set up a speculative "Windows OS soft hang" alert, which basically looks at the CPU data collected on the Windows servers. Since we use WMI for all our Windows servers, it seems that a Windows server will sometimes go "unresponsive" for any number of reasons (MS support blames this on some driver or similar issue, so maybe "soft memory fault" is a better term). When this happens, we see some metrics in Orion still collecting data, while CPU data stops. The server is still pingable. You can map a drive, walk the filesystem, and remotely connect to it via Service Manager... but generally you cannot log in (or the login hangs). The "fix" is to hard boot the server/VM, and maybe hunt down the app/driver issue, if you can. This issue IMHO has been in Windows for as far back as I can remember, and I've been monitoring for it for over a decade via various means. Anyway, when a server gets to this point we normally get an alert. One time Orion had the "fast crash" and I was able to log in and get a diagnostics dump. I then rebooted Orion, ran the fixes above, etc. Support looked at the dump and pointed to all the connections open to a single server. I checked that server and it was in that "soft hang" state. Normally this doesn't appear to hurt Orion, but my guess is the remote server was stuck in some special state where connections to it (WMI) would not get the hint to drop. I put this node into maintenance in Orion, hard booted the server, and put it back. The issue has not come back. So the takeaway is that it looks like some issue exists where Orion will not meter its connections to a remote server, and the remote server will just keep sucking them up if it's having serious issues.

    b. Many connections to **multiple** servers. Like thousands. I've seen this when we have been patching groups of servers and afterwards Orion has the fast crash. On the Orion server there are several connections (5-10) to many servers. I suspect that when those servers are being patched/applying patches, they do the same thing as above: they start accepting connections for things like WMI, but they won't return any info and the connections won't drop. The "fix" for Orion has been the same as above (shut down the Orion services, run maintenance/reinstalls, reboot the Orion server).

    c. Many connections to **multiple** servers. Like thousands, but patching is not occurring and has not occurred recently. I suspect the crashing-process issue I have is causing this. I'm betting Orion is running (JobEngine or the polling engine, whatever is crashing today), then it crashes, but some processes remain "up". This is similar to the above, except I believe the process starts back up and suddenly there are two "threads" polling/collecting info at the same time... and this causes the sudden jump in connections.

    For the cases above, I found I cannot stop remote systems from causing issues, but I have become much more careful about when we patch servers Orion is monitoring (being prepared to nicely shut down Orion after patching is over and reboot the Orion app server), as well as manually patching the Orion server itself (so I can shut it down nicely). Identifying problematic external systems and taking care of them (removal/maintenance mode) has been done for a few systems that are "mostly not up all the time". When I have to repair Orion I do the full "Config Wizard complete reinstall/repair" and not the piecemeal single-service fixes.
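
    Roughly, the "shut down Orion nicely" routine I use looks like the sketch below (my own script, so treat it as an assumption-laden example; the 'SolarWinds*' and 'SWJobEngine*' name filters in particular may need adjusting for your install):

    ```powershell
    # Stop the SolarWinds services, then wait for any leftover SolarWinds processes
    # (JobEngine, InformationService, etc.) to exit on their own before rebooting.
    Get-Service |
        Where-Object { $_.DisplayName -like 'SolarWinds*' -and $_.Status -eq 'Running' } |
        ForEach-Object { Stop-Service -Name $_.Name -Force -ErrorAction SilentlyContinue }

    # Poll for lingering SolarWinds processes for up to 15 minutes.
    $deadline = (Get-Date).AddMinutes(15)
    do {
        $left = Get-Process |
            Where-Object { $_.ProcessName -like 'SolarWinds*' -or $_.ProcessName -like 'SWJobEngine*' }
        if (-not $left) { break }
        Write-Host ("Still running: " + (($left | Select-Object -ExpandProperty ProcessName -Unique) -join ', '))
        Start-Sleep -Seconds 30
    } while ((Get-Date) -lt $deadline)

    if ($left) { Write-Warning 'SolarWinds processes still running; investigate before rebooting.' }
    else       { Write-Host 'All SolarWinds processes have exited; safe to reboot.' }
    ```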

    Also: I mentioned I have some monitoring (and alerting) for the "many connections open" issue. It hasn't done much good, because it happens so fast the system cannot alert me (and the alerting system is failing at that point too, as is everything). I know I'm experiencing the port exhaustion issue because after it's all over, when I look at this monitoring, I see maybe one or two data points (where there should have been many) where it actually saved data, and it jumps from a "normal" number of, say, 1.5k connections to 7k+. During these times the CPUs on my Orion server get pegged (as evidenced when I can log in, and by what few data points can be collected). All the errors in the Windows event logs make it look like a database issue (cannot open a connection to the database) or a network issue with the DB (cannot resolve the name of the DB server, etc.).
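
    The monitor itself is roughly along these lines (a sketch only, not my exact template: the DB address is a placeholder, and the Statistic./Message. output lines follow SAM's Windows PowerShell script-monitor convention):

    ```powershell
    # SAM-style PowerShell script monitor: report total established TCP connections
    # and how many of them are to the Orion DB server.
    $dbServer = '10.0.0.25'   # placeholder: the Orion DB server's IP as it appears in netstat

    $established = netstat -ano | Where-Object { $_ -match '^\s*TCP\s' -and $_ -match 'ESTABLISHED' }
    $total = ($established | Measure-Object).Count
    # Matching the whole netstat line is good enough here, since the poller is not the DB.
    $toDb  = ($established | Where-Object { $_ -match [regex]::Escape($dbServer) } | Measure-Object).Count

    Write-Host "Statistic.TotalEstablished: $total"
    Write-Host "Message.TotalEstablished: $total established TCP connections on this server"
    Write-Host "Statistic.ToOrionDB: $toDb"
    Write-Host "Message.ToOrionDB: $toDb established connections to $dbServer"
    exit 0   # 0 = Up in SAM's script exit-code convention
    ```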

    I've checked whether backups are contributing and they are not. The antivirus exclusions on my system are not up to SolarWinds specs (they're close; they were probably put in long ago and not updated) and I'm trying to get their specs approved, but I suspect that's not the cause.

    Also: I've noticed that sometimes my Orion system will "recover" from this issue (usually when I can't get to it fast enough to fix it). Suddenly I get a lot of alerts (after it's recovered), I check the data from the "connection monitor", and I see it had issues. I look in the Orion and Windows logs and the Orion system/services restarted themselves. Normally I don't wait around for it to self-recover... but it appears to be able to do it sometimes.

    Anyway, hope this helps.

  • Also: I had the issue again this Sunday afternoon. It fits into "Possible speculative cause #3", part "c".

    New info: This time I was barely able to log into the system and do a "netstat -a" to confirm the connections issue. I was eventually able to shut down the SolarWinds services (via the Orion Service Manager) and the system became responsive again (the Orion processes were eating up the CPU). After everything was shut down, I noticed there were several "conhost" processes and "cscript ..." processes that looked like they were related to monitors we have in SAM (custom-built scripts) and that were still running (like 20 of them). I was able to kill the conhost processes, but the cscript processes could not be killed (I tried multiple built-in Windows tools, including Sysinternals tools). It seems, from some internet reading, that these processes were owned by SYSTEM and could not be killed for that reason. Each of the cscript processes had a timeout setting in its command-line string that looks like it was being ignored, since they were still hanging around.
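
    For what it's worth, this is roughly how I've been listing those leftover processes, with their owners and command-line/timeout arguments (an illustrative snippet; it uses WMI so it also works on 2008 R2):

    ```powershell
    # List leftover cscript/conhost processes with owner, start time, and full command line.
    Get-WmiObject Win32_Process -Filter "Name='cscript.exe' OR Name='conhost.exe'" |
        Select-Object ProcessId, Name,
            @{ n = 'Owner';     e = { $o = $_.GetOwner(); "$($o.Domain)\$($o.User)" } },
            @{ n = 'StartedAt'; e = { $_.ConvertToDateTime($_.CreationDate) } },
            CommandLine |
        Sort-Object StartedAt |
        Format-List
    ```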

    After cleanup, I rebooted the server, and (it seems to happen only when this occurs) it did not go down all the way and had to be manually powered off. I then performed the full "run the NPM Configuration Wizard and have it reinstall/repair the DB, services, and website", and that fixed the issue.

    Anyway: I'm not sure if the cscript processes are a cause or a result of the symptoms. Probably a result. We have had these monitors in our environment for a long time and they have not given us issues like this, and they all open remote connections to servers (to read logs, etc.), so they probably just hung trying to connect.

    In addition: I have gone through the timelines of some of the recent issues I've been having, and it seems that, even though I've seen this occur for years (intermittently), it has only cropped up again recently with such a vengeance. It looks like I started having these recurring issues around the time the Microsoft patches were released in June, as I had not had this issue in the months prior to that. Maybe something in those patches changed the OS slightly and is not playing well with the Orion app? I've had the issue recurring about every 4-6 days over the last 30 days, since Jul 17th, plus another cluster of occurrences in June. I've been shutting down all the Orion services and restarting them every few days to see if that helps, and I got lazy and didn't do it this week, hoping my issues were resolved (nope).

  • After reviewing the entirety of the symptoms you described above, I would concur with support's assessment that upgrading to NPM 12.1 would be the most prudent action. I say this not because it's seemingly what every software vendor says whenever you encounter an issue, but because many of the symptoms you describe sound identical to issues which were addressed as part of the NPM 12.1 release and subsequent hotfixes for that release.

  • We are already on NPM 12.1. The issue was there with 11.5.3, but we upgraded hoping that 12.1 would take care of such issues.

    But no luck; even after upgrading we are facing the same issues. The only option we have is to unmanage those nodes.

  • I'll be finding out very soon as I'm upgrading tomorrow. Given nks7892's issues I'm not too hopeful, but I would also assume that an issue this observable would be happening to others, so maybe there's something common among our environments that's not common for everyone else? nks7892 mentions it occurring with APEs, and my system is "everything on a single server" that I've been upgrading since it was an eval setup of NPM 10.x (with all the products installed) with SQL Express on the same server, so I can't imagine our environments have too much in common.

    Another possible piece of the puzzle: if it continues, or if anyone else is reading this, I talked to our DBA and he mentioned that our Orion database server is something like 10x *underutilized* due to how we build DB servers here and what our initial plans were for that server. We're planning on moving the DB to a less powerful system and using our current DB server for a more business-centric app. Anyway, this takes some of the possible focus away from it being a DB performance issue (not really mentioned above, but we were looking into it). During our talk he mentioned that we are not in a failover DB cluster (which we knew and want to move toward). He indicated that there have been known issues with an MS SQL 2012 DB set up to be in a cluster but not actually clustered (a "go nowhere" cluster), and maybe that could be causing some of our symptoms. Once we upgrade to 12.1, if we have more issues we're going to move the DB to a "standard" cluster and see if that helps. It's a long shot, but we're going to do it anyway, so we might as well find out whether we have to do it now because the issue persists after the upgrade, or later when it's actually convenient and in our plans.

    Given this, is there a SolarWinds "recommended setup for MS SQL clustering of the DB" guide or info? I know there are DB tuning guides and other things; I just don't recall ever seeing anything DB-cluster specific. If it matters, we're thinking about building out our system to use the new HA features to cluster the app components.

  • Tiny update: I just now upgraded to "the latest of everything that supports Win 2008 R2 SP1", which for me are:

    Orion Platform 2017.1.3 SP3,

    NCM 7.5.1,

    IPAM 4.3.2,

    VIM 7.0.0,

    NetPath 1.1.0,

    UDT 3.2.4,

    DPA 10.0.1,

    NPM 12.1,

    QoE 2.3,

    SAM 6.2.4

    In addition, our DBAs have found what they believe is a possible issue with a "go nowhere" database cluster like the one I have. They believe that (possibly) some recent Windows updates made a change so that the standalone DB does a "reset" of its "lease" (like maybe a checksum or something) and drops the connection for a short period. It's like having a heartbeat between MS SQL nodes, but there's never another server to respond to the heartbeat, so the single server decides to renew its "lease"/connection to rejoin the cluster, and once it tries it gets a new lease to use for a while. At my site, there have been two apps (Orion and another) that are set up like this and have recently been having intermittent issues. Our DBAs set up a fake replication DB (or something; I don't really understand it) so the heartbeat/lease check will succeed, even though there's no other DB on the other side to talk to/replicate to.

    So: I upgraded to NPM 12.1, and we made some changes to our DB.  Waiting and seeing how it goes...

  • Thanks for the updates. Let us know if that works.

    The only gap I can see in my environment is the OS upgrade; everything else is updated.

  • Update: The system had the same (apparently the same; the symptoms were the same) issue last night at 6 PM. The upgrades I did to go from NPM 12.0.0 to 12.1 took all day and were finished around 7 PM on Wednesday, so it went down on its own about 24 hours after the upgrade.

    I've re-opened my support case, and I took a diagnostics dump last night. Looking at my past history of support cases, I've been having "system crashes, uses up CPU resources" type issues that I've opened tickets on (because I do know how to temporarily remedy them, and sometimes things stay fixed for a while) since Dec 2014 (NPM 10.6). The tickets get closed because the system gets fixed (support may find some anomalies, or recommend a re-install of a component) and seems stable. Then it crops up again, but it has never been this bad in the past.

    For what it's worth, last night I was barely able to log into the app server, take a "netstat -a", and shut down the Orion services. From the netstat I didn't see anything that looked too weird, but the port monitoring I set up, for the few data points it could collect, showed 5.6k total connections. Normally (after the upgrade) it seems to be about 1.2k connections total, with 250 to the Orion DB. (Note: before the upgrade it was about 1.5k total, with 56 connections to the Orion DB, so something about the upgrade made the number of connections to the DB go up about 5x.) Anyway, when I shut down the Orion services, the InformationServiceV3 and Collector services took a long time to stop (like 3+ minutes or more) after everything else shut down rather quickly. It seemed the system was not very responsive until they stopped, but I was not doing a lot of testing to try to prove anything. Looking at the running processes, I did not see any that were "hung" after all the services stopped, like I've seen in the past, so I didn't reboot the server.
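
    A rough way to put numbers on those slow stops is something like this (again my own sketch; the 'SolarWinds*' display-name filter is an assumption):

    ```powershell
    # Time how long each running SolarWinds service takes to stop, to identify the
    # slow ones (for me, InformationServiceV3 and the Collector).
    Get-Service |
        Where-Object { $_.DisplayName -like 'SolarWinds*' -and $_.Status -eq 'Running' } |
        ForEach-Object {
            $sw = [System.Diagnostics.Stopwatch]::StartNew()
            Stop-Service -Name $_.Name -Force -ErrorAction SilentlyContinue
            # Stop-Service returns once the service reports Stopped (or the call fails).
            $sw.Stop()
            "{0,-55} stopped in {1,7:N1} s" -f $_.DisplayName, $sw.Elapsed.TotalSeconds
        }
    ```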

    Another note: I ran the diagnostics tool to get diag info to upload... and it took about an hour to run. I think it has taken this long in the past, but I also think it didn't used to take this long. I poked around and found some directories with lots (100k) of very old log files that the diag tool should not be picking up, so I'm wondering if maybe there are other places in the filesystem with many logs/files that the diag tool slowly parses through, OR that the Information Service/Collector service has to deal with, which might be what makes them shut down so slowly. It seems like a long shot, but that's where I am at this point, and I'm going to see if I can identify and clean up some of the junk on the system.
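
    The cleanup sweep I have in mind is something like this (the root paths and the 90-day / 1,000-file thresholds are assumptions to adjust):

    ```powershell
    # Find directories under the SolarWinds install/data paths that hold large numbers
    # of old files, which is the kind of junk the diagnostics tool may be crawling through.
    $roots  = 'C:\Program Files (x86)\SolarWinds', 'C:\ProgramData\SolarWinds'
    $cutoff = (Get-Date).AddDays(-90)   # "old" = not written to in 90+ days

    Get-ChildItem -Path $roots -Recurse -Force -ErrorAction SilentlyContinue |
        Where-Object { -not $_.PSIsContainer -and $_.LastWriteTime -lt $cutoff } |
        Group-Object DirectoryName |
        Where-Object { $_.Count -gt 1000 } |
        Sort-Object Count -Descending |
        Select-Object Count, Name |
        Format-Table -AutoSize
    ```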

    Also: To get the system going last night, I ran the NPM config tool and just had it reinstall all the services.

  • More info, probably useless:

    I did some checking and it appears that our "persistent issues" started on 6/18. OS patching occurred on 6/15, but *also* it appears we added AppInsight for IIS monitoring to 3 servers on 6/12. Since this appears to use PowerShell remoting, I was hesitant, because we've had issues with monitoring that uses PowerShell remoting in the past due to (apparently) known issues with WinRM in some situations not releasing connections and causing memory issues on the remotely monitored servers (we had this with AppInsight for Exchange a while back). Anyway, I warned the app support person about this, and we added the monitoring. I'm not saying this is the cause, just something I noticed in the timeline that seems to correlate with changes, and with connections causing issues. I'm going to see if we can put this into maintenance if the system crashes again, just to see if things clear up or stay up longer. I'm not too hopeful it's this, though.