Sounds like it might be a SQL performance problem?
What kind of SQL setup do you have? RAID 10? How many HDDs? How fast are the drives?
I believe if you can add some high-speed drives you should be able to increase performance.
We have a slightly larger setup - fewer pollers (4), but more elements (20,000). We run a little NetFlow, plus IPSLA and APM, with a separate web and database server. The web server actually runs on the database server - that seems to help a little.
So within the same ballpark as you in terms of the environment size.
We do have a problem with performance of our maps (we got a little carried away with embedding and the number of elements on the maps), but in general things like node details, interface details are all quite quick (~2 seconds for a full display to be populated on most occasions).
Database server disk I/O speed seems to be a key factor - we were running a RAID10 across 6x15k drives (DL380G6 with P400i 512Meg raid card).
We then went solid state with some Fusion-io drives. It didn't make a noticeable difference, which showed that the RAID10 above wasn't at the limit of its performance (the SSDs were mainly to see if we could decrease the map load times).
Thanks for the replies! Dave, this confirms that we do have some performance issue somewhere.
I am collecting more data on our setup to isolate the various aspects to investigate. I have also been reading the Managing Orion Performance PDF.
That guide suggests that the NPM database should be around 20GB in size. I know this must be a general rule, but ours is over 106GB. I found out that we are doing a considerable amount of NetFlow, and this may be why. I also learned that our database files are not on the physical SQL box, but rather on a SAN. We are going to get with our SAN group and find out if any tuning can be done. If it is already tuned as well as possible, we may consider bringing the database onto the SQL box.
I have a slightly bigger install I think. NPM (21,000+ elements), NTA (450+ interfaces), APM, IPAM and IPSLA.
We have a dedicated SQL box, 2 web servers load balanced and 5 pollers.
My website runs pretty speedy.
My point in all that is that you should be able to get yours acting within reason. Like the others, I think SQL is your issue. I would get some metrics off your SQL box and see if it is bottlenecking somewhere.
So, we are having our DBA people take a closer look at our SQL server. We definitely have a few issues to iron out with it.
I wonder if anyone can chime in on the database size issue. The Orion Performance guide gives a general rule that the DB should be about 20GB, and looking at the MS SQL Management Studio, ours is 106GB. This is the NPM DB specifically. Doesn't include NCM or other modules.
Also, another thing we have noticed: if we take down services on our main Orion server/poller, the web console is no longer accessible - it gives an error. Our thought is that this could also indicate a problem, as it seems to us that the web server portion of Orion NPM should still be able to contact the SQL server and get any data it needs to present to the user. Does anybody know if this is normal or not?
For the Performance Guide, are you referring to this one? If not, this is a good resource for troubleshooting performance issues. The 20GB size is usually for the Orion server installation, not the SQL server. The NPM database contains tables for any installed modules. The database size you have seems reasonable, but there is a chance that your SQL server is having performance issues. Check the disk queue write length - anything over 2x the effective spindle count is a problem. Also, what are your software versions? The web service has to connect to Orion, so if you take Orion down it will fail. Orion generates the content; the web server formats and delivers it.
Hi Andy! Thanks for the info. Yes, that is the guide I am following.
Can you verify the info about the 20GB size? This is what I found directly from the guide:
"Total size of the Orion database. Normally less than 20GB for NPM with moderate syslog data and no NetFlow data."
Maybe I am misunderstanding your explanation or the guide?
I have been looking at the Performance Monitor on our SQL server and am still trying to make sense of it all - specifically, the disk queue length. Perfmon shows this as a percentage - 0 to 100 - so I don't see the correlation to the multiples (2x, etc.). Still reading up on this. I may need to change a setting in Perfmon, but haven't found it yet.
By software levels, do you mean the versions we are running? If so:
SolarWinds Orion Core 2010.2.1, APM 3.5, NCM 6.0, NPM 10.1.1, NTA 3.7, IVIM 1.0.0
I should probably edit that guide to specify that the 20GB is for a single poller deployment. With APM, NTA and multiple pollers the database size can increase easily.
Perfmon shows the y axis at 0-100 by default but it is not percent. Use the numbers under the graph to get a better idea of the queue length. 2x the spindle count isn't really a hard threshold but more of a guideline. If the avg is staying low, you are OK. If it goes up for long periods of time you have a problem. This counter must be started on the SQL box and only applies if you are using locally attached storage. For SAN/NAS you will have to ask the DBA to look for you.
Now that I read that table again, I realize it does say that Netflow can increase the size to several hundred GB.
Thanks for the tips. We will be focusing most of our efforts on the SQL server and also the fact that the DB is actually on a SAN. I wonder if anyone is getting satisfactory results with their DB hosted on a SAN?
A SAN was specifically not recommended by the SolarWinds development engineer and DBA I spoke to.
SAN write performance has improved significantly over the past couple of years, and we now have some clients running quite well on SANs. Because of this we have softened our stance on SANs from "not recommended" to "proceed with caution". Performance of a directly attached RAID subsystem has far fewer variables than that of a SAN, so it is easier to understand, troubleshoot, and measure. So you can use a SAN if you can determine that the write capabilities are sufficient. SANs are the way of the future, and I expect they will all rival the fastest subsystems soon.
With your database, what are the large tables? If you have SQL 2005+, you can use a built-in report (right-click on the DB, Reports, Standard Reports, Disk Usage by Top Table). That might show you what's so big in your DB. Maybe it's syslog, SNMP traps, NetFlow... whatever the big one is, that might be what you need to look at.
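If you'd rather query it directly than use the report, something like this T-SQL (a sketch using the standard SQL 2005+ catalog views, run in the context of the Orion database) lists the tables hogging the most space:

```sql
-- Rough equivalent of the "Disk Usage by Top Table" report.
-- Sums reserved pages per table; pages are 8 KB each.
SELECT TOP 10
    t.name AS TableName,
    SUM(a.total_pages) * 8 / 1024 AS ReservedMB
FROM sys.tables t
JOIN sys.indexes i
    ON t.object_id = i.object_id
JOIN sys.partitions p
    ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.allocation_units a
    ON p.partition_id = a.container_id
GROUP BY t.name
ORDER BY ReservedMB DESC;
```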
I have read people posting about too many syslog entries putting too much load on the DB, or odd syslog rules putting a high CPU load on the primary poller - that could be fixed by just changing which server takes care of syslog, or balancing it between the other pollers. Same thing with SNMP traps.
I was having a problem with statistics being rolled up from detail to hourly to daily and that was creating a large DB for me.
Thanks for that tip Netlogix!! That gives some good insight into what is going on in our DB.
Surprisingly, Syslog is not one of the larger tables. Looks like the bulk of it (by a HUGE margin) is Traps.Varbinds and then Traps. Netflow looks to be in third place, even taking into account that it seems to span several tables.
You might want to look at what SNMP traps are coming into the server and start writing drop rules for some of the noisy, pointless traps (or find the source and fix them if it is a valid problem generating the noise).
One of the ones I started to drop was Domain Authentication Failures for the domain, and some BES noise that was just filling the tables with useless information that is already available in other logs.
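To figure out which traps are worth a drop rule, a query along these lines can rank them by volume. This is only a sketch: the Traps table and its TrapType/DateTime columns are assumptions about a typical Orion NPM schema and may differ in your version.

```sql
-- Count traps by type over the last week to spot the noisy ones.
-- Table and column names are assumed; check your Orion schema first.
SELECT TrapType, COUNT(*) AS TrapCount
FROM Traps
WHERE [DateTime] > DATEADD(day, -7, GETDATE())
GROUP BY TrapType
ORDER BY TrapCount DESC;
```

Whatever shows up at the top of that list is the first candidate for a drop rule (or for fixing at the source).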
Filtering out unwanted traps is something I had not even considered before. I know there are some we could drop. We will be taking a look at this, too. Getting rid of background noise will have the added benefit of making it easier to sift through all the data to get to what we really want to see.
Just wanted to say thanks for all the help, and also to provide some status update, in case others are dealing with similar issues and could benefit.
So, we have decided to focus on tweaking our SAN while we also look into what we will need to do to move over to a local RAID array. Our SAN people have already identified some things that are definitely affecting our performance:
1) Our data files and backups reside on a single logical unit. We will probably be re-arranging our files/volumes.
2) Our logical units are on tier 2 drives. Hopefully, we will be moving our non-archival/backup stuff to tier 1.
I have been working with some other engineers, and one of them says that current SAN technology can actually outperform a local RAID array due to the horsepower available in SAN hardware (i.e., huge amounts of high-performance cache) versus a typical RAID card. Of course there are several factors, but the potential is there.
Please feel free to provide any more suggestions/ideas for improving performance. I will post again when we have made progress or find out more info.
If you're not running RAID 10, you need to be - that is the only way to get performance with the way SolarWinds writes to the DB. We have 6 pollers and all Orion products, run NTA on over 1,500 routers with another 1,000 to add, and have separate SQL servers for NPM and NCM.
Windows 2003 Enterprise.
SQL 2005 Enterprise - be sure to configure SQL not to use all the memory; leave at least 2 GB for the OS.
Dell server with 32 GB of memory (wish I had gone with 64).
6 internal drives and a 15-drive external array:
2-drive mirror "C": OS and SQL
2-drive mirror "D": DB temp
2-drive mirror "E": DB log
6-drive RAID 10 with 1 hot spare "F": primary filegroup (this is everything but NTA)
6-drive RAID 10 with 1 hot spare "G": filegroups 1 to 4 (these filegroups are all NTA)
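Capping SQL's memory as mentioned above can be done with sp_configure. A sketch, assuming a 32 GB box where you want to leave roughly 2 GB for the OS (so a 30720 MB cap - adjust to your own RAM):

```sql
-- Cap SQL Server's memory so the OS keeps ~2 GB headroom.
-- 30720 MB assumes a 32 GB server; change to suit yours.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 30720;
RECONFIGURE;
```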