As an accidental DBA, I'm having trouble determining how to handle maintenance on our Orion DB. When using the built in DB maintenance plan, the maintenance jobs inevitably end up running longer and longer due to HUGE amounts of data being dumped in a few tables daily. Eventually, the jobs run for days on end. The issue is several tables are growing to tens of millions of rows with one growing to hundreds of millions before the configured 7 day retention pruning occurs. Without shortening the retention period to clean out the daily tables quicker or turning down the volume on collected node info, I'm weighing my options for running maintenance to re-org/rebuild indexes and update statistics without having locks on these tables for days.
Here's my thoughts - open for comment:
1. Disable the built-in DB maintenance (how...rename the .exe?)
2. Create stored procedures for each of the following (based on Ola Hallengren's SQL Maintenance Solution at https://ola.hallengren.com/):
a. Database backups nightly
b. Data integrity checks nightly
c. Nightly index re-org/rebuilds and update statistics (actions determined by % fragmentation)
It's my understanding that stored procedures run more efficiently than SQL Maintenance Plans or maintenance jobs built by SQL queries (which is how I believe the .exe is doing it). So, how do those of you with large environments and tons of data being stored in your DB's handle this?
Thanks for any help!
So I am fighting a similar battle myself and I find some of your points interesting.
Here is my scenario:
When we turned on SNMP traps, we flooded the database, Traps and TrapVarbinds tables...this caused us to run out of space on the databas...this seemed to have had many significant down stream effects, other tables were not getting maintained because the DB was simply in an unhealthy state then. Once that happens, everything gets fragmented horribly, although I have since turned on the Defragmentation bit and everything seems to get fragmented daily anyway
We proceeded to Truncate a bunch of the tables (Traps, TrapsVarbinds, InterfacedAvailability_Detail, NPMMultiast, maybe a couple more, and after each truncate I would run the DB Maintenance job manually to see if it made it past the "bad" table...when it did I noticed it get stuck on a couple more we would truncate those and then move on until it made it through the entire job...I can't tell if its the database or the table spaces causing the issue, but we ran fine for 7 days after this...on the 7th day, something happened during the DB Maintenance where a Select statement was getting blocked by a Delete, and the DBAs killed it...from then on, during this last week I haven't been able to run DB maintenance and we are again going to truncate the InterfaceAvailability_Detail table...this table has the most rows in our database, and is where the Maintenance job is currently getting stuck.
My first question is, there is a second job that runs every Sunday called the Interface Baseline, I wonder if these 2 jobs conflicted at that time, one doing the Delete and the other doing a Select on the same table, or if some other outside force was hitting the DB at that time, user generated query maybe.
I have found the DB Maintenance to work very well on a healthy database, and from the sounds of yours, if the maintenance is taking longer than an hour, you may need to Truncate tables to fix things up.
You can tell which tables to truncate by watching the swdebugMaintenance.log file on the C:\ProgramData\Solarwinds\Logs\Orion folder on your main poller...the bigger tables do take a few minutes longer to get through, but we are talking minutes not hours, when its hours, there is an issue.
Support came back with kind of a canned response about re-installing the InformationServicev3 on the poller, but I don't see an issue there (anyway, but thats a different story)
We are truncating tonight to fix the DB up again and seeing where we get in 7 days, I am concerned that this will happen again, and thats where I find your solution interesting using Stored Procedures instead of the Nightly Maintenance.
Have you gotten anywhere with this?
We ended up truncating the TrapVarbinds table several more times while the app owner was tweaking what data we were collecting. Turns out the massive data collection was coming from traffic monitoring on our ASA - apparently finding info around SNMP detail from Cisco is worse than finding needles in a haystack. In the end, we were able to dial in our monitoring and data collection to keep the table sizes in check. Afterwards, the maintenance plan would "generally" finish with no issues. I was checking swdebugMaintenance.log daily and cleaning up any indexes not related to daily tables on a regular basis.
Since then, I have done the following and things are much nicer now (although we still have a few small offenders and a storage hardware challenge):
1. Shrink the database to a size based on the tweaked data collection (monitored size for two weeks and shrunk based on that baseline)
2. Moved the data and log files to 64kb block size volumes (SQL Server stores data in 64kb blocks)
3. Added three data files to the PRIMARY file group to increase read/write performance on data stored in that filegroup
4. Implemented reindexing nightly and DBCC integrity checks twice each week via SQL SPs
The key to fixing the Orion Maintenance reindexing failures was tweaking the data we were collecting - close those flood gates and you can actually swim!
Hope this helps! Sorry it took so long to reply.
SolarWinds solutions are rooted in our deep connection to our user base in the THWACK® online community. More than 150,000 members are here to solve problems, share technology and best practices, and directly contribute to our product development process.