Lately I've been seeing poor performance on our Orion NPM polling and database servers, and I've been trying different things to fix or improve it. I think we are experiencing SQL timeouts between the polling engine and the SQL database. I'm not a DBA, but I know a little about SQL, so here goes.
Polling server: We are running NPM and NCM on a dedicated server, a Dell 2950 - (2) dual-core 3 GHz CPUs, 8 GB RAM, Win2k3 R2, 64-bit OS. We also have the VoIP and NTA modules. I haven't had a chance to fully dive into the VoIP monitor, but we do use NTA to troubleshoot sites with heavy WAN utilization. We currently have about 2700 nodes/4000 elements, mostly Cisco devices.
Database server: A dedicated database server for Solarwinds, a Dell 2950 - (2) dual-core 3 GHz CPUs, 16 GB RAM, Win2k3 R2, 64-bit OS, MSSQL 2005 Enterprise, also 64-bit. The drives are configured like this:
- C: RAID 1; (2) 72GB SAS drives; OS and system databases except for TEMPDB
- E: RAID 1; (2) 72GB SAS drives; MSSQL log files for TEMPDB and Solarwinds related databases
- F: RAID 1; (2) 136GB SAS drives; MSSQL database files for TEMPDB and Solarwinds related databases
We recently added the last two drives, separated the database and log files, and moved TEMPDB off the C: drive. That helped, but we still have a high disk queue length on the E: drive, and database activity is constant for the NPM database.
Looking for other ways to improve performance, I noticed under the DB properties for the two Solarwinds databases (NPM and NCM) that the "full-text indexing" option is checked. Some DBAs I've talked to said that turning it off should improve performance. Is there any requirement for the Solarwinds databases to have that enabled? I searched this forum and the Admin guide for NPM but couldn't find anything on that setting.
We've also recently upgraded the firmware and drivers for the servers, RAID controllers, and NICs, and reduced our polling frequency, to try to help.
Some of our upper management is losing faith in Solarwinds because we sometimes get alerts for multiple sites going down and then coming back up hours later, while other tools never see the outage. Another group here wants to add all of their servers, about 800 more "nodes", along with their UPSs - about 1000 of those. I don't think it can handle the additional load!
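To put a rough number on that extra load, here is a quick back-of-the-envelope sketch. It assumes each server and each UPS would count as exactly one node; I don't know how many elements (interfaces, volumes, etc.) each would add, so the element count isn't projected here.

```python
# Rough projection of the node count if the other group's devices are added.
# Assumption: every server and every UPS is counted as a single node.
current_nodes = 2700   # approximate current node count
extra_servers = 800    # approximate servers the other group wants to add
extra_ups = 1000       # approximate UPSs the other group wants to add


def projected_nodes(current, *additions):
    """Return the node count after adding each group of new devices."""
    return current + sum(additions)


def percent_increase(current, projected):
    """Return the growth from current to projected as a percentage."""
    return (projected - current) / current * 100


total = projected_nodes(current_nodes, extra_servers, extra_ups)
growth = percent_increase(current_nodes, total)
print(f"Projected nodes: {total} ({growth:.0f}% increase)")
```

Under those assumptions that works out to roughly two-thirds more nodes on the same poller and database server.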
What else can I do to get this application working reliably? I don't think we are pushing the limits of the system, are we? I would rather not have to tell management that we need to spend another $20K on another poller!
HELP!