This discussion has been locked. The information referenced herein may be inaccurate due to age, software updates, or external references.
You can no longer post new replies to this discussion. If you have a similar question you can start a new discussion in this forum.

IPAM 4.6, Killing my database.

Hi,

It seems like when we do scanning it kills my database. I stop scanning and everything is good. I start scanning and I got database connection problems again. Here is an example of an error we see repeated many many times over and over again:

WARN  SolarWinds.IPAM.BusinessLayer.Helpers.RealtimeMaintenance (null) - Unhandled Exception during Realtime Maintenance

System.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out

   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)

   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)

   at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)

   at System.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()

   at System.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()

   at System.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()

   at System.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value)

   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)

   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption)

   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, Boolean inRetry, SqlDataReader ds, Boolean describeParameterEncryptionRequest)

   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry)

   at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(TaskCompletionSource`1 completion, String methodName, Boolean sendToPipe, Int32 timeout, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry)

   at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()

   at SolarWinds.Orion.Common.SqlHelper.ExecuteNonQuery(SqlCommand command, SqlConnection connection, SqlTransaction transaction)

   at SolarWinds.IPAM.Storage.DAL.IPHistoryDAL.DeleteByRetention(Int32 retentionDays)

   at SolarWinds.IPAM.BusinessLayer.Helpers.RealtimeMaintenance.PurgeHistory(SettingDAL DAL)

   at SolarWinds.IPAM.BusinessLayer.Helpers.RealtimeMaintenanceWorker.TryDoWork(SettingDAL DAL, Log log)

   at SolarWinds.IPAM.BusinessLayer.Helpers.RealtimeMaintenance.<>c__DisplayClass3_0.<DoWork>b__0(RealtimeMaintenanceWorker obj)

ClientConnectionId:7d226dd7-6f07-413d-a8b2-b012aba40155

Error Number:-2,State:0,Class:11

Can anyone tell me why am I getting this? Is it more SQL related? Is it to many jobs flooding the database? I've been trying to basically trace problems from point a to point b and everything in between to try and determine where the problems are. But this one has me stomped. Any help is appreciated.

thanks!

  • I would open up a ticket with support.

  • how many records do you have in IPAM_IPHistory table? IPHistoryDAL.DeleteByRetention just tries to delete all records older N days.

    do you have othe SQL server related issues?

  • The job or query the system was trying to run is timing out. Either there is too much data, or the system is not getting the chance to run.

    So without too much more data to go off;

    What are your resources like on the SQL Server? CPU, Memory, Disk?

    When you run the Configuration Wizard (first install mainly), did you use the IP Address of the SQL Server;s IP Address, or Name?

         If you used the name - re run the CW and change to the IP Address.

    Any changes to the system recently - New Alerts, reports, views or was anything created/setup just before this started?

    -CharlesH

    Loop1 Systems: SolarWinds Training and Professional Services

  • Hi Charles,

    We have been working with Douglas K from your team. But I haven't passed this information to him yet. Haven't had time.

    But here you go:

    SQL: 24vCPU's, 100 gigs of ram, Disk's are on a SSD lun on fast policy, 10gig virtual NIC.  Everything is over provisioned according to recommendations made by solarwinds. I've even had them look it over and tell me it's as requested.  I used ip address when I ran config wizard first time on all my servers.

    No new changes.

  • So turns out the iphistory table is huge nearly 2 gigs. Simple select statement runs and runs and runs till eventually it gives me a ran out of memory exception. I can pull up till around the first million rows. Trying to pull anything more than a million rows errors out. Judging by data inside this table it looks like it's not being maintained by database maintenance.   Which would make since being that a simple select statement causes sql never to return a reply.  Wondering if I can truncate this table. My thinking is that due to it's size it hasn't been maintained in some time as it's not synced with the current retention's. Truncating the tables if there is no harm I believe should sync things up and clear this problem.

    Any thoughts or ideas on this thought process?

  • Truncate only if you are okay with losing data. What are the retention settings(under IPAM Maintenance Settings)? Compare that with how far back the data goes, if you have more data than what your retention is set for check your DB maintenance job/run db maintenance.

    Default retention is a year/365 days, and if you have a lot of sub nets or a high scan frequency you may need to reduce that retention or frequency. Though reducing the frequency won't reduce your table size until you get a little ways past that year you currently have in the DB. If you change the retention, run the DB Maintenance after.

    If you can spot a choke point when that runs, you could try to 'give it more power.'

  • The table actually had near 20 million rows and over 3 gigs in size. Due to this is the reason why this was hanging it up. Every time anything needed to be done to this table it had to go through so much table data that it would time out or run into out of memory exceptions. It was very inefficient because it was not being maintained. So without re-indexing this table was extremely slow. I did a delete from and got rid of the data we did not need. This dropped the table down to 5 million rows. This allowed queries to process without failure. This appears to allow maintenance to complete and the problem so far appears to be gone.

    Now the only other problem we had is not even related to ipam. It's related to SAM. Our sam tables where hundreds of millions of rows per table. One table nearly a billion rows. So I reduced retention across our heaviest tools and optimized the tables. Which appears to have cleaned alot of problems.

    Unfortunately we aren't even close to being done yet haha. But some progress is better than no progress.

  • I've a ticket opened with support for months. With regards to this however, I did a delete from on the table within a certain time frame and than did a database maintenance and it fixed the problem. The table has been in a manageable size sense.

    Only remaining item now is the bug this version has. I've begged and pleaded with support for a buddy drop they keep telling me they have but then they back out on it. And try to force me to upgrade to 4.7 I'm not apposed to it, I just need them to reassure and prove to me 110% that my current problems are fixed in 4.7 and they can't do that so management have said no upgrades until they give us that. If they don't give it we will not under any circumstances upgrade. It's a pain point we've been dealing with.

  • When you deleted the older data from your IPhistory table, did you also check or adjust your maintenance settings/retention?

    Also, how often are you scanning subnets?

    An errorof this type does point to the data and how you are handling that. I keep going to a couple of lines

    *at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)

    *at SolarWinds.IPAM.Storage.DAL.IPHistoryDAL.DeleteByRetention(Int32 retentionDays)

    this makes me want to double check the retention setting within IPAM Maintenance as well as scan frequency. Also in part because the table was so large to begin with.

    Has your table grown at all since the deletion of data? There is some data management within the system that has limitations due to the way some of the jobs or updates have been scripted. Making sure your settings keep you within those parameters is key in keeping you out of that scenario again, especially if what you experience is a *bug* or lack or programming consideration for all data scenarios.

    Sometimes the adjustment we need might be in the update. I can't say that for sure, but if other previous cases showing your same symptoms have led to an update to the current release then what SW says makes sense. I could see a case going to dev if there were elements that did not fall in line with the previous cases. Don't get me wrong either, I am all about Verify then Trust when it comes to tech and I feel your pain of being against that rock with pressure from the other side.

    I'll look forward to feedback regarding your db tables. Making sure those stay in line is the after effect of a proper running maintenance. Unless you are adding more sub nets to scan it should stay somewhat consistent if running for more than a year.

  • Yes retention was configured properly. And reduced. Table size hasn't really grown from my account. And maintenance isn't failing. So by all accounts all that you mentioned has been covered and double checked. I appreciate your help.