Transaction Logs filling up, problem solved.

Question

We encountered something recently where the TNX logs were filling up and SolarWinds was crashing. It wasn't actually the Database Maintenance procedure doing it. Within 30 minutes of turning Orion services back on, 128GB of transaction logs would be full!!! This was our path to solving the problem. I am copying and pasting this from our DBA, and if I have any comments, I'll interject. Admins, if this needs to be moved to a more appropriate location, please let me know. I chose SAM because the entries we were concerned with below were inserted into the db by SAM.

===========================

Introduction:

We had an issue that initially presented as transaction logs filling up. SolarWinds maintenance was not completing, SolarWinds in some areas was not usable, and nightly backups of the SolarWinds database were also failing.

We went through the usual checks and verified that the database was in simple recovery mode. (Chris here: We followed instructions from clearing out the MSMQ files, we recreated job engines, we ran the configuration wizard.) However, things continued to get worse - the transaction logs would fill up and SolarWinds could no longer collect data. Stopping the application resulted in needing to wait for SQL Server to run recovery on the database - upwards of an hour to recover. We also rebuilt the indexes on all the tables in the database to see if it was a performance issue. We even stopped the SolarWinds application, manually ran the Orion Database Maintenance Application, which completed without complaint, and then restarted SolarWinds but to no avail for a fix of the issue. The transaction logs would fill up in less than 30 minutes.

Troubleshooting and Solution:

We started with a query that returns log file space used by all running transactions. The base for this can be found on this web page https://stackoverflow.com/questions/5085093/finding-what-is-writing-to-the-transaction-log-in-sql-server

We observed that a SolarWinds process with the application name "SolarWinds.BusinessLayerHost@domain SolarWinds.APM.Businesslayer.dll" appeared to be attempting to delete records but was doing it in mass or in bulk and blowing out transaction log space upwards of 32G. Thinking that it might just need "a little more space" we attempted to pacify the process by granting it more transaction log space. However, once we reached 120G of transaction log space in our only 180G database, we realized it was a futile option. From the monitoring query above, the sql being run by the above SolarWinds dll process was observed to be the one below.

DELETE FROM APM_WindowsEvent_Detail WHERE ComponentID = @ComponentID        -- delete all the component statuses based on this component

A check on [SolarWindsOrion].[dbo].[APM_WindowsEvent_Detail] table showed that it had over 69 million rows in the table making it the second largest table in the database. Investigating in the SolarWinds GUI for database statistics did not show this table in its list of large tables... it appeared that it was being "ignored" by SolarWinds. Further investigation on Thwack led us to believe that this was not intended to be a large table as noted by this article: https://thwack.solarwinds.com/message/231653#231653

A query into this table (listed below) and grouping by the [ComputerName] column showed that about 69million records belonged to a particular server. This virtual Windows OS server had had hosted software with significant issues and that server had been removed from SolarWinds monitoring two months prior. (Chris here: This was a McAfee Policy server that ended up getting rebuilt. One of the issues that it was experiencing was the event log was just GROWING exponentially. This is the ultimate cause of the problem, because the IIS monitor in SAM was just importing the event log en masse, and there had been no time to age out the records. The server was removed from SolarWinds before we rebuilt the server.) However, this [APM_WindowsEvent_Detail] table still had 69million records in it related to this server - which now no longer existed in Solarwinds. Note that in order to query on this table, we had to stop SolarWinds since it was simultaneously and actively attempting to insert and delete records in this table.

-- Sorting the records in the APM Windows Detail table

SELECT [ComputerName], count(*) as rec_count, min([TimeGeneratedUtc]), MAX([TimeGeneratedUtc])
  FROM [SolarWindsOrion].[dbo].[APM_WindowsEvent_Detail]
  group by [ComputerName]
 go

After shutting down Solarwinds I did a simple count loop to delete 500,000 records at a pass from this table for the retired and no longer monitored server:

SELECT 'Starting' --sets @@ROWCOUNT
WHILE @@ROWCOUNT <> 0
    DELETE TOP (500000) FROM [SolarWindsOrion].[dbo].[APM_WindowsEvent_Detail]
     where [ComputerName]='windows-retired-server.mydomain.local'
go

While the above delete loop ran ran, it maxed out at about 3GB of use per loop pass in the transaction logs. A far different response than the 120GB that the other "SolarWinds.BusinessLayerHost" process was doing. The deletion took about three hours to complete. After that completed, just under 50,000 records were in the table so we rebuilt indexes and verified that the remaining records in the table were for known servers.

After this completed, we empted the recently added transaction logs and dropped them; we also shrank and dropped a database file. since about a fourth of our database space was also freed up by this cleanup. SolarWinds came back up with no issues and was remarkably much more responsive in all screens. Running the Maintenance App manually further cleaned up the [APM_WindowsEvent_Detail] table down to 1,700 records. We called that "fixed".

Concluding Thoughts:

Again, we are not sure why the records in this table were not originally cleaned up when the server was unmonitored from SolarWinds. (Chris here: A possibility is that SolarWinds DID try to remove the records, and was STILL trying, but could not delete 69 million rows in a mass delete without running into transaction log space issues that we continued to experience until my DBA resolved the issue for us. This highlights a possible MAJOR issue with the SQL statement that is executed when a node with LOTS of records is removed from SolarWinds. It never ran or never completed because it just ate up all space available.) Another issue for the SolarWinds architecture is that this should have been handled by the routine maintenance process in a way similar to what I manually did above to remove these records from this table since the target server was gone from SolarWinds. We are not sure why this happened and the large amount alerts from the retired Windows server event logs were due to the failing software on that long since gone server. Perhaps a fluke but hopefully this note will help and provide others a solution in the future should this happen to them.

Message was edited by: Chris Dow