Good morning,
I have been having a significant issue with NPM for quite some time, and I have a ticket open, 129840. Every 3-4 days NPM stops alerting that devices are down. It also stops showing in the web console that devices have dropped. I discover that the server is no longer performing when I go to view "Events" through the web console and get the error about the Business Layer and job engine. (My apologies, I don't have the error verbatim.)
Over the last couple of months of trying to fix this, we have discovered that the issue starts when the "Alert Engine" logs errors in the SolarWinds event log. The Alert Engine logs 2 distinct errors. They are as follows:
EventID 4001
[AlertCheckingThread] ERROR All - Service was unable to open new database connection when requested.
Exception Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.
Connection string - Data Source=(local);Initial Catalog=NetPerfMon;User ID=SolarWindsNPM;Password=*******;Connect Timeout=1200;Load Balance Timeout=120;Application Name="Orion Alerting Engine";Workstation ID=NPM01
and also
EventID 0
[AlertCheckingThread] ERROR Error - Error in ExecuteSQLAlert() -
System.NullReferenceException: Object reference not set to an instance of an object.
at AlertingEngine.CheckAlert.ExecuteSQLAlert()
When this happens, the only remedy is a reboot of the server. As you can imagine, rebooting a production server twice a week is not a long-term plan, nor one that management appreciates.
I have read a couple of things about setting the connection timeout to 30 seconds, but I don't know if that is a good idea. I am quite weak with SQL and worry that I will make a change that renders the whole box useless. I also cannot determine why this happens only in the middle of the night; it does not correlate with any maintenance on the SQL server.
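To make sure I am reading the connection string from the EventID 4001 error correctly, here is a rough Python sketch that splits it into its individual settings. This is only my own illustration and not anything from SolarWinds; `parse_conn_str` is a helper name I made up. As far as I understand, `Connect Timeout` is in seconds, and when no `Max Pool Size` is given the ADO.NET default of 100 pooled connections applies.

```python
# Connection string copied from the EventID 4001 error (password masked as logged).
CONN_STR = (
    'Data Source=(local);Initial Catalog=NetPerfMon;User ID=SolarWindsNPM;'
    'Password=*******;Connect Timeout=1200;Load Balance Timeout=120;'
    'Application Name="Orion Alerting Engine";Workstation ID=NPM01'
)


def parse_conn_str(conn_str):
    """Split a 'key=value;key=value' string into a dict.

    Hypothetical helper for illustration only. It splits naively on ';',
    so it would mis-handle a quoted value that itself contains ';'.
    """
    settings = {}
    for part in conn_str.split(";"):
        if not part.strip():
            continue  # skip empty fragments such as a trailing ';'
        key, _, value = part.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


settings = parse_conn_str(CONN_STR)
connect_timeout = int(settings["Connect Timeout"])
has_pool_limit = "Max Pool Size" in settings

print("Connect Timeout (seconds):", connect_timeout)
print("Max Pool Size set explicitly:", has_pool_limit)
```

If I am reading this right, the string asks for a 1200-second connect timeout and does not set a pool size at all, so I am not sure whether the timeout or the pool limit is the setting I should be looking at.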
There are no Scheduled Tasks taking place at the time of failures.
I have checked perfmon many times, and it does not show any problems. There is an occasional slight jump in I/O, but nothing significant.
I am currently running NPM 9.5 SP4 with hotfix 2 installed.
SQL 2005
Windows 2003 Server
Any and all assistance would be appreciated.
If you require more information about anything, please let me know and I will provide it to the best of my ability.
Thank you in advance,
--adam