
Solarwinds is still not stable


The other thread is closed, so I figured I would start a new one. I usually get more help here than from actually contacting support.

Same issues as before, but instead of the server going unresponsive after 36 hours or so, it took maybe a week. They are still the SAME issues.

1. Server stopped sending alerts out sometime around 11AM on the 4th.

2. Logged onto the server and opened Orion Service Manager; both the module engine and the administration service were going back and forth between running and stopping.

3. Orion could not connect to SQL

4. I have some alerts that are going out, but I am not sure if they are legit or not.

5. After the reboot I noticed that a good chunk of my nodes' interfaces are 'unknown'. This looks like it fixes itself, but again, something else is going on.

I have applied the 'hotfix' that you all pushed out to try to fix this.

I have done the change from streaming to buffered

I have done the registry change for the ports

The only thing I have not done is revert the snapshots back to June 14th, prior to the update, so Solarwinds is stable again.

At this point I am going to schedule a task in VMware to reboot the server every night. That is pretty much the only way I will know Solarwinds will actually work.

Thoughts? serena aLTeReGo

1 Solution
Product Manager

Orion Platform Hotfix 3 was released yesterday to address the ephemeral port exhaustion issue which is likely the cause of the issue you are experiencing.
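For anyone who wants to see whether a server is actually getting close to ephemeral port exhaustion, here is a rough Python sketch that tallies TCP states and local ports in the dynamic range from saved `netstat -an` output. It is not a SolarWinds tool. It assumes the Windows default dynamic range of 49152-65535 (`netsh int ipv4 show dynamicport tcp` shows what a given server really uses), the function name `summarize_netstat` is made up for this example, and the addresses in the sample are invented.

```python
from collections import Counter

# Assumption: default Windows dynamic (ephemeral) TCP port range.
DYNAMIC_START = 49152
DYNAMIC_END = 65535


def summarize_netstat(text, start=DYNAMIC_START, end=DYNAMIC_END):
    """Return (state counts, distinct local ports in range, size of range)."""
    states = Counter()
    ephemeral_ports = set()
    for line in text.splitlines():
        fields = line.split()
        # A TCP row looks like: Proto, Local Address, Foreign Address, State
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        local, state = fields[1], fields[3]
        # rsplit also handles IPv6 forms such as [::]:135
        port_text = local.rsplit(":", 1)[-1]
        if not port_text.isdigit():
            continue
        states[state] += 1
        port = int(port_text)
        if start <= port <= end:
            ephemeral_ports.add(port)
    return states, len(ephemeral_ports), end - start + 1


if __name__ == "__main__":
    # Invented sample rows, only to show the expected input shape.
    sample = """
      Proto  Local Address          Foreign Address        State
      TCP    10.0.0.5:49152         10.0.0.9:1433          ESTABLISHED
      TCP    10.0.0.5:49153         10.0.0.9:1433          TIME_WAIT
      TCP    0.0.0.0:135            0.0.0.0:0              LISTENING
    """
    states, used, total = summarize_netstat(sample)
    print(dict(states), f"{used} of {total} dynamic ports in use")
```

A large and growing TIME_WAIT count, or a used-port count approaching the size of the range, would be consistent with the port exhaustion described here, but it is only a hint and not a diagnosis.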


129 Replies

So I was able to find the custom properties with no grouping and that worked for now. Thanks dodo123!! I am still not sure how to remove the group, but at least I have another Solarwinds workaround.

0 Kudos

dodo123 is correct. The 'Most Commonly Used' group by can be a very expensive query depending upon the number of items that could be returned. If the query never returns, please open a case with support so we can determine what exactly is going on. Simply removing this group by option from your selection though should make the query return much faster.

0 Kudos

So Support just said it was because I am not running things as the local admin? Oh, I have a case open with support. The current one is 00150177, but I have had issues since updating on June 18th, so I think we are coming up on the two-month anniversary of an unstable Solarwinds suite. Well, actually DPA is the only thing that has been rock solid so far, but that doesn't really touch Orion... Thoughts aLTeReGo?

0 Kudos

https://thwack.solarwinds.com/people/martian%20monster  wrote:

So Support just said it was because I am not running things as the local admin? Oh, I have a case open with support. The current one is 00150177, but I have had issues since updating on June 18th, so I think we are coming up on the two-month anniversary of an unstable Solarwinds suite. Well, actually DPA is the only thing that has been rock solid so far, but that doesn't really touch Orion... Thoughts aLTeReGo?

It's possible, however unlikely, that you may encounter permissions issues when installing or upgrading as a domain admin account. This is because some environments apply user-level policies to domain admin accounts that aren't applied to local admin user accounts. If the support engineer is seeing permission failures in the install or configuration wizard logs, that's a good indication that some degree of policy/security tightening is going on. I would start with running the Permissions Analyzer and fixing any permissions issues identified there.

0 Kudos

For those playing at home, to get to the permissions checker: (Drive that has Solarwinds installed)\Program Files (x86)\SolarWinds\orion\OrionPermissionChecker.exe

Ran it and the only two errors were that the Network Service could not write to C:\Windows\Temp, so I ran the fix and now it has access.

Still having all the same problems.  What would you like me to try next?  aLTeReGo

0 Kudos

You'll need to recycle Orion services after the permission checker fixes permissions for the changes to take effect.

Are you using DPA to monitor your SolarWinds database? Ever since the 12.3 update we've had:

[Attached image: pastedImage_0.png]

We failed over both our DB and MP to run out of a co-location site and still no improvement on the random blocking. Support has the latest DPA results with the blocking statements, but so far, no answer.

And now removed the Solarwinds SQL server from DPA. 

0 Kudos

hendersonwa - I actually had a ticket open because one of my DBAs found some indexing issues in the Solarwinds database and we reached out to Solarwinds for support. It got passed to development and I have not heard a peep since. Part of me is thinking this DB issue is related.

Are you running NTA in your environment? NPM 12.3 and NTA 4.4 were released at the same time, and we updated both to the latest version after waiting a few weeks. With NTA 4.4 they switched over to SQL Server for flow data. Our DBAs internally have speculated that this might be causing overall issues due to the size of our NTA environment. I am with you, martian monster, in thinking that it is DB-related as well, but no official answer from support yet.

hendersonwa - Yes, we are running NTA, and I did the upgrade to SQL when I updated to NTA 4.4. I agree with you that the NTA 4.4 move to SQL is causing problems; I have been thinking of that as the core of the issue this whole time. Things were rock solid prior to moving NTA to SQL. Now that NTA is on the same SQL server that runs both the Solarwinds and DPA DBs, maybe adding this third database is choking Solarwinds somewhere. I wonder if I could spin up a new SQL VM and move the NTA DB to it to see if that would fix the issue. All the errors I am seeing are between the app server and the SQL server. Thoughts on the NTA SQL upgrade causing these issues, aLTeReGo?

0 Kudos

[Attached image: Double Facepalm.jpg]

0 Kudos

The new mantra to fix things at Solarwinds  "reboot until it works" 

0 Kudos

Agents were amongst my major concerns due to past instability.

So far, all is OK. I'm very surprised nothing barked.

0 Kudos

Thank you, Martin. I can feel the heat now. I am in the middle of the upgrade process and have found so many bugs while doing it. I have been working with the support team since the system went down. Now it looks like it is working. I will not be surprised if I find more bugs later on, like you said.

Thanks.

0 Kudos

nks7892

Are you saying you had issues with 12.2 upgrade? And your system was down?

Sorry, just trying to clarify your post from a version standpoint since you say you're in between upgrades.

Thanks

0 Kudos

I had issues while upgrading from 12.2 to 12.3. It is now complete, but I had to reboot my polling engine server multiple times in order to finish the configuration wizard.

I never saw these issues in any of the previous upgrades, but this time it took me 4 extra hours because of rerunning the configuration wizard.

Also, having the scalability engines download the installer from the primary poller is very painful. For a few servers the internet connection is slow and the latency between the main Orion servers and that particular APE is a little high, so it has been downloading the package since yesterday. I wish I could have the latest APE setup offline.

I am getting "error while rendering the Netflow collector services", and the custom property description is not updating when you edit custom property definitions. I am expecting many other issues.

0 Kudos

Thanks. So, how many previous in-place upgrades were performed on this host?

It seems that building out a new host is a good countermeasure, but not everyone has the time and resources to do that.

Perhaps the PowerShell script support uses to wipe Orion away may need to go public.

0 Kudos

We are planning to upgrade our Orion Platform from 2017.3.5 to 2018.2. Please tell me if this is stable or not. We can halt our upgrade process if there are any bugs in the new version.

0 Kudos

nks7892

I did my upgrade back on June 18th. Everything ran fine for a few days until I joined the group that had issues with port exhaustion and the streaming versus buffered issues. Those went on for roughly 3 weeks until Solarwinds figured out what was going on and released HF3. That pretty much fixed the issue and things were stable for a few weeks, then I started having more odd issues. I opened a new ticket on 7/27 for the new issue, which seemed to be like the old one, but Solarwinds never really crashed; it would spike the CPU for 30 mins or so and do other odd things, then kind of return to normal. Last week Thursday I applied HF 4 and things seem to be stable for now. If Solarwinds stays up for a month with no issues I will call HF 4 a success.

So you might be good as long as you apply all the hotfixes. Only time will tell.

Hope this helped. - Dave


I debated opening a separate thread, but figured that I would reply here to maintain visibility to other users. We applied HF4 to our environment yesterday and are still experiencing the same issues we have had since we first opened our case on July 9th. At one point, support suggested simply building new VMs, but we would prefer not to pursue that option. We have our SW environment monitored by DPA and we continue to see significant blocking generated at random times from the main polling engine.


aLTeReGo​, does HF4 address the need to update the TransferMode or should that still be changed?