
SQL Trace Issues - STIG

STIGs. Everybody run now.

So we've been upgrading our SQL servers and working through the STIGs. After migrating the databases to the new SQL server, we started having problems with SQL Server crashing. Our SQL admins have correlated the issue to the SQL Server/database STIG requirement that tracing be enabled at all times; on top of that, if tracing cannot run, the requirement is to shut down the database. They said our trace files, sized at 512 MB, get cycled roughly every minute and only up to 10 files are kept, so effectively any trace lasts 10 minutes at most given the transaction volume our Orion DB is processing. They tried a few changes, but ultimately it appears the SQL trace just cannot keep up with the Orion transactions and consequently shuts down the DB. Right now we have tracing disabled for this DB, since otherwise it blows up.
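
For reference, a quick way to see how a trace like this is configured and whether it is falling behind is the sys.traces catalog view (standard SQL Server, nothing Orion-specific):

    -- List server-side traces with their rollover/shutdown flags and the
    -- running count of events dropped because the trace could not keep up.
    SELECT id, path, max_size, max_files,
           is_rollover, is_shutdown,
           event_count, dropped_event_count
    FROM sys.traces;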

Has anyone run into this issue? I'm looking for some insight/experience. I'm going to open a ticket with support as well, but since there's no guaranteed way to reproduce the failure, it's going to be a little difficult to provide adequate scenarios for testing.

EDIT: If you read pretty much anywhere on the internet, it says that leaving tracing on forever for everything is something you should never do. Those of us in the Federal government know this is the type of thing we're forced to do as the norm, even when everyone says it's not a good idea.

  • What I would do is take a look at the disk the trace log is writing to. The STIG rule states that if the trace cannot keep up, SQL Server shuts down; if the trace has to wait on a busy disk to write, it fails. I would first point the trace files at a different drive than any of the SQL databases, including tempdb, and see if they can be moved to a RAID 10 set or an SSD.
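
    A minimal T-SQL sketch of what that would look like, assuming a dedicated drive mounted as T: (substitute your own path) and reusing the 512 MB / 10 file numbers from the original post. The option value 6 combines TRACE_FILE_ROLLOVER (2) with SHUTDOWN_ON_ERROR (4), which is the shutdown behavior the STIG requires:

        -- Recreate the server-side trace with its files on a dedicated drive.
        -- T:\SQLTrace is an assumed path; point it at your RAID 10/SSD volume.
        DECLARE @TraceID int;
        EXEC sp_trace_create
            @traceid     = @TraceID OUTPUT,
            @options     = 6,        -- 2 (file rollover) + 4 (shutdown on error)
            @tracefile   = N'T:\SQLTrace\stig_trace',  -- .trc is appended automatically
            @maxfilesize = 512,      -- MB per rollover file
            @stoptime    = NULL,
            @filecount   = 10;

        -- A real STIG trace defines many event/column pairs; one is shown here
        -- as a placeholder (event 14 = Audit Login, column 1 = TextData).
        DECLARE @on bit = 1;
        EXEC sp_trace_setevent @TraceID, 14, 1, @on;

        EXEC sp_trace_setstatus @TraceID, 1;  -- start the trace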

    Alerts

    One other thing you can modify is the alert checking. Turn off all alerts that are not critical. Most alerts are configured to check the database once a minute; change those to check every 3-5 minutes, and set volume alerts to 30, 45, or 60 minutes, since volumes only rarely fill quickly.
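
    If you want to see which alerts are re-evaluating most aggressively before changing them, a SWQL query along these lines can help. This is a sketch, not T-SQL; run it through SWQL Studio or the SWIS API, and note that the Orion.AlertConfigurations entity and its Frequency field (in seconds) are assumptions based on recent Orion versions:

        -- SWQL: list alerts ordered by how often they re-evaluate,
        -- most frequent first (Frequency assumed to be in seconds).
        SELECT Name, Enabled, Frequency
        FROM Orion.AlertConfigurations
        ORDER BY Frequency ASC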

    Slowing down the polling cycles on devices can also help the trace keep up. I would recommend setting up a custom property under Settings > Manage Custom Properties: add a Node property called Device_Priority and assign each device a priority of Low, Medium, High, Critical, etc. This way you will know which devices you can slow down without affecting critical monitoring. (See the SWQL sketch after these steps to verify the results.)

    Next, go into Settings > Manage Nodes and group by Device_Priority.

    Select Low and then select all devices on the right, and select Edit Properties.

    Select the checkbox next to Polling, then enter the new intervals: Statistics of 30 (minutes) and Status of 600 (seconds). Set these slower first; you can always speed the polling back up once the trace is keeping up.

    Select Submit

    You can do the same for Medium: set Statistics to 20 and Status to 300.

    You can also change polling for all devices at once under Settings > Polling Settings.
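
    As mentioned above, here is a SWQL sketch to spot-check the new intervals after the bulk edit. The Orion.Nodes fields used (PollInterval in seconds, StatCollection in minutes) and the CustomProperties.Device_Priority navigation are assumptions based on recent Orion versions:

        -- SWQL: audit polling intervals by priority after the edit.
        -- PollInterval = status polling (seconds); StatCollection =
        -- statistics collection (minutes).
        SELECT n.Caption,
               n.CustomProperties.Device_Priority AS Priority,
               n.PollInterval,
               n.StatCollection
        FROM Orion.Nodes n
        ORDER BY n.CustomProperties.Device_Priority, n.Caption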

  • Thanks. We will look into some of these things. It's probably worth noting (I should have said so in the first post) that we are running NPM, NCM, SAM, IPAM, UDT, and NTA on our SolarWinds instance, all at the latest versions. Technically NTA uses a separate DB server now, but from what I can tell it still ties back into the primary SQL DB.