11 Replies Latest reply on Sep 4, 2018 4:18 PM by cahunt

    IPAM 4.6, Killing my database.

    lcsw2013

      Hi,

       

      It seems like when we do scanning it kills my database. I stop scanning and everything is good. I start scanning and I got database connection problems again. Here is an example of an error we see repeated many many times over and over again:

       

      WARN  SolarWinds.IPAM.BusinessLayer.Helpers.RealtimeMaintenance (null) - Unhandled Exception during Realtime Maintenance

      System.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out

         at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)

         at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)

         at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)

         at System.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()

         at System.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()

         at System.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()

         at System.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value)

         at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)

         at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption)

         at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, Boolean inRetry, SqlDataReader ds, Boolean describeParameterEncryptionRequest)

         at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry)

         at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(TaskCompletionSource`1 completion, String methodName, Boolean sendToPipe, Int32 timeout, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry)

         at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()

         at SolarWinds.Orion.Common.SqlHelper.ExecuteNonQuery(SqlCommand command, SqlConnection connection, SqlTransaction transaction)

         at SolarWinds.IPAM.Storage.DAL.IPHistoryDAL.DeleteByRetention(Int32 retentionDays)

         at SolarWinds.IPAM.BusinessLayer.Helpers.RealtimeMaintenance.PurgeHistory(SettingDAL DAL)

         at SolarWinds.IPAM.BusinessLayer.Helpers.RealtimeMaintenanceWorker.TryDoWork(SettingDAL DAL, Log log)

         at SolarWinds.IPAM.BusinessLayer.Helpers.RealtimeMaintenance.<>c__DisplayClass3_0.<DoWork>b__0(RealtimeMaintenanceWorker obj)

      ClientConnectionId:7d226dd7-6f07-413d-a8b2-b012aba40155

      Error Number:-2,State:0,Class:11

       

       

      Can anyone tell me why am I getting this? Is it more SQL related? Is it to many jobs flooding the database? I've been trying to basically trace problems from point a to point b and everything in between to try and determine where the problems are. But this one has me stomped. Any help is appreciated.

       

      thanks!

        • Re: IPAM 4.6, Killing my database.
          neomatrix1217

          I would open up a ticket with support.

            • Re: IPAM 4.6, Killing my database.
              lcsw2013

              I've a ticket opened with support for months. With regards to this however, I did a delete from on the table within a certain time frame and than did a database maintenance and it fixed the problem. The table has been in a manageable size sense.

               

              Only remaining item now is the bug this version has. I've begged and pleaded with support for a buddy drop they keep telling me they have but then they back out on it. And try to force me to upgrade to 4.7 I'm not apposed to it, I just need them to reassure and prove to me 110% that my current problems are fixed in 4.7 and they can't do that so management have said no upgrades until they give us that. If they don't give it we will not under any circumstances upgrade. It's a pain point we've been dealing with.

                • Re: IPAM 4.6, Killing my database.
                  cahunt

                  When you deleted the older data from your IPhistory table, did you also check or adjust your maintenance settings/retention?

                   

                  Also, how often are you scanning subnets?

                   

                  An errorof this type does point to the data and how you are handling that. I keep going to a couple of lines

                   

                  *at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)

                  *at SolarWinds.IPAM.Storage.DAL.IPHistoryDAL.DeleteByRetention(Int32 retentionDays)

                   

                  this makes me want to double check the retention setting within IPAM Maintenance as well as scan frequency. Also in part because the table was so large to begin with.

                  Has your table grown at all since the deletion of data? There is some data management within the system that has limitations due to the way some of the jobs or updates have been scripted. Making sure your settings keep you within those parameters is key in keeping you out of that scenario again, especially if what you experience is a *bug* or lack or programming consideration for all data scenarios.

                   

                  Sometimes the adjustment we need might be in the update. I can't say that for sure, but if other previous cases showing your same symptoms have led to an update to the current release then what SW says makes sense. I could see a case going to dev if there were elements that did not fall in line with the previous cases. Don't get me wrong either, I am all about Verify then Trust when it comes to tech and I feel your pain of being against that rock with pressure from the other side.

                   

                  I'll look forward to feedback regarding your db tables. Making sure those stay in line is the after effect of a proper running maintenance. Unless you are adding more sub nets to scan it should stay somewhat consistent if running for more than a year.

                   

                   

                    • Re: IPAM 4.6, Killing my database.
                      lcsw2013

                      Yes retention was configured properly. And reduced. Table size hasn't really grown from my account. And maintenance isn't failing. So by all accounts all that you mentioned has been covered and double checked. I appreciate your help.

                        • Re: IPAM 4.6, Killing my database.
                          cahunt

                          That sounds very good. In addition to that you could have your DBA verify waits or locks on those ipam tables, and if they are still happening. If you see it happening still, even at a small degree see if your dba can optimize. If that is not an option increasing logging can show you more details, searching around those table names, or ipam could help you find entries that show run times or completion of the jobs/queries that were causing issues. After review if everything looks good, turn logging back down so you don't fill up your drive space.

                  • Re: IPAM 4.6, Killing my database.
                    maryan.dmytriv

                    how many records do you have in IPAM_IPHistory table? IPHistoryDAL.DeleteByRetention just tries to delete all records older N days.

                    do you have othe SQL server related issues?

                    • Re: IPAM 4.6, Killing my database.
                      cahunt

                      The job or query the system was trying to run is timing out. Either there is too much data, or the system is not getting the chance to run.

                      So without too much more data to go off;

                      What are your resources like on the SQL Server? CPU, Memory, Disk?

                      When you run the Configuration Wizard (first install mainly), did you use the IP Address of the SQL Server;s IP Address, or Name?

                           If you used the name - re run the CW and change to the IP Address.

                       

                      Any changes to the system recently - New Alerts, reports, views or was anything created/setup just before this started?

                       

                      -CharlesH

                      Loop1 Systems: SolarWinds Training and Professional Services

                        • Re: IPAM 4.6, Killing my database.
                          lcsw2013

                          Hi Charles,

                           

                          We have been working with Douglas K from your team. But I haven't passed this information to him yet. Haven't had time.

                           

                          But here you go:

                           

                          SQL: 24vCPU's, 100 gigs of ram, Disk's are on a SSD lun on fast policy, 10gig virtual NIC.  Everything is over provisioned according to recommendations made by solarwinds. I've even had them look it over and tell me it's as requested.  I used ip address when I ran config wizard first time on all my servers.

                           

                          No new changes.

                        • Re: IPAM 4.6, Killing my database.
                          lcsw2013

                          So turns out the iphistory table is huge nearly 2 gigs. Simple select statement runs and runs and runs till eventually it gives me a ran out of memory exception. I can pull up till around the first million rows. Trying to pull anything more than a million rows errors out. Judging by data inside this table it looks like it's not being maintained by database maintenance.   Which would make since being that a simple select statement causes sql never to return a reply.  Wondering if I can truncate this table. My thinking is that due to it's size it hasn't been maintained in some time as it's not synced with the current retention's. Truncating the tables if there is no harm I believe should sync things up and clear this problem.

                           

                          Any thoughts or ideas on this thought process?

                            • Re: IPAM 4.6, Killing my database.
                              cahunt

                              Truncate only if you are okay with losing data. What are the retention settings(under IPAM Maintenance Settings)? Compare that with how far back the data goes, if you have more data than what your retention is set for check your DB maintenance job/run db maintenance.

                               

                              Default retention is a year/365 days, and if you have a lot of sub nets or a high scan frequency you may need to reduce that retention or frequency. Though reducing the frequency won't reduce your table size until you get a little ways past that year you currently have in the DB. If you change the retention, run the DB Maintenance after.

                              If you can spot a choke point when that runs, you could try to 'give it more power.'

                                • Re: IPAM 4.6, Killing my database.
                                  lcsw2013

                                  The table actually had near 20 million rows and over 3 gigs in size. Due to this is the reason why this was hanging it up. Every time anything needed to be done to this table it had to go through so much table data that it would time out or run into out of memory exceptions. It was very inefficient because it was not being maintained. So without re-indexing this table was extremely slow. I did a delete from and got rid of the data we did not need. This dropped the table down to 5 million rows. This allowed queries to process without failure. This appears to allow maintenance to complete and the problem so far appears to be gone.

                                   

                                  Now the only other problem we had is not even related to ipam. It's related to SAM. Our sam tables where hundreds of millions of rows per table. One table nearly a billion rows. So I reduced retention across our heaviest tools and optimized the tables. Which appears to have cleaned alot of problems.

                                   

                                  Unfortunately we aren't even close to being done yet haha. But some progress is better than no progress.