This is our setup:
Software: Solarwinds Orion NPM 9.1 (with Application Performance Monitor and Netflow Traffic Analysis, both fully patched to the latest service packs)
Hardware: Polling Engine - Dell PE1950, 3.2GHz Xeon (4 cores), 4 GB RAM, Win 2003 SP1. The database server is on identical hardware.
We have around 500 active nodes in the system with just shy of 6300 elements - well within the supposed capability of a single polling engine.
To put it bluntly, however, our performance sucks. And we are having issues generating some of the reports we need.
First on the performance side:
Node view pages are taking 3-5 minutes to come up. APM and Netflow pages are taking even longer. Getting a Netflow report for more than 2 hours of traffic is a joke: a one-day report takes around 4 hours to run. Once you get a node view page up, half of the graphs are missing, and even when you can coax them into loading, data points are not there (a one-day view of a switch interface, for example, is lucky to have more than 50% of the data).
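To put a number on the missing data, here is a rough sketch of the arithmetic in Python. It assumes the 5-minute polling interval also applies to interface statistics (an assumption; Orion may collect statistics on a separate schedule), and the function names are made up for illustration, not part of any Orion API.

```python
def expected_samples(window_minutes: int, interval_minutes: int) -> int:
    """Number of data points a gap-free graph would contain."""
    return window_minutes // interval_minutes


def completeness(actual_samples: int, window_minutes: int, interval_minutes: int) -> float:
    """Fraction of the expected data points that are actually present."""
    expected = expected_samples(window_minutes, interval_minutes)
    return actual_samples / expected if expected else 0.0


if __name__ == "__main__":
    one_day = 24 * 60  # minutes in a one-day view
    # Assumed 5-minute statistics interval (see note above).
    print(expected_samples(one_day, 5))   # 288
    print(completeness(144, one_day, 5))  # 0.5
```

Under that assumption, a one-day graph should hold 288 points per interface, so "50% of the data" means roughly 144 polls per interface per day never made it into the graph.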
I talked to support and our account team, and they sent me here for any optimization techniques to get things usable again. Their only suggestion so far has been to lengthen the polling and rediscovery intervals (i.e. poll less often) - but the polling interval is already at 5 minutes, and going higher than that may mean we miss a critical server or application outage. There are apparently no whitepapers etc. on performance tuning.
So what techniques have others used to make things usable with this (small-ish) size of data? (And getting another polling engine is not really on the table. I can't really see why something running at 60% of max capacity would drive us that way).
On the reporting side:
There has to be a better way to get reports for the following:
1. Alerts that are set up, and whether they are assigned or not assigned. Again, support told us to write our own SQL queries, but apparently there is no documentation on the schema.
2. Netflow reporting. The tables all look to be there in the database from the SQL viewer, but again, without schema documentation it looks impossible to decode (and as near as I can tell, only the IP addresses are stored - so does Netflow really do a lookup on all IPs when it is preparing its reports? Seems suboptimal to me).
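In the absence of schema documentation, one way to at least map out the tables is to ask SQL Server itself through its standard INFORMATION_SCHEMA.COLUMNS view. The following is only a sketch in Python: it works with any DB-API 2.0 cursor, the helper names (describe_schema, find_tables) are invented for illustration, and it tells you nothing about what the Orion tables actually mean.

```python
from collections import defaultdict

# INFORMATION_SCHEMA.COLUMNS is a standard SQL Server catalog view.
COLUMNS_QUERY = (
    "SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE "
    "FROM INFORMATION_SCHEMA.COLUMNS "
    "ORDER BY TABLE_NAME, ORDINAL_POSITION"
)


def describe_schema(cursor):
    """Return {table: [(column, type), ...]} using any DB-API 2.0 cursor."""
    cursor.execute(COLUMNS_QUERY)
    tables = defaultdict(list)
    for table, column, data_type in cursor.fetchall():
        tables[table].append((column, data_type))
    return dict(tables)


def find_tables(schema, keyword):
    """Case-insensitive search for tables whose name contains keyword."""
    return sorted(t for t in schema if keyword.lower() in t.lower())
```

With a driver such as pyodbc installed (an assumption), a cursor from `pyodbc.connect(...).cursor()` can be passed to `describe_schema`, and `find_tables(schema, "alert")` or `find_tables(schema, "flow")` would then narrow the list to candidate tables worth inspecting by hand.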
For reference - the database server hardly ever seems to break a sweat on CPU load. Memory usage is around 2.65 GB on the polling engine, and CPU sits around 30-50% most of the time.
Sorry for the rant but we nearly missed a critical server outage over the weekend with our CRM system because it didn't get picked up by Orion.