In my last post, "What NPM Tips and Tricks do You Have?", I asked about tips and tricks, expecting a mashup of different things from all over the NPM world, and to a certain extent that's what happened. Interestingly, however, a large section of the thread turned into a discussion about two things: maps and speed.
The speed issue is particularly intriguing to me, since there are a lot of times when, let's be honest here, NPM is a bit of a dog when it comes to response. The web interface is notoriously slow, and gets even worse when you have a ton of custom widgets, doodads, and whatchamacallits loading on a screen. Several people mentioned that a lot of speed can be picked up by getting in at the database level and pre-packaging certain things.
Stored procedures and custom views created in the DB save us countless man-hours, and, in my experience, working directly in the DB can really expand your knowledge of NPM's overall architecture. I highly recommend that every SolarWinds engineer challenge themselves to learn more SQL. I am by no means a DBA, but I can pull every bit of data you can get from the website, and I can do it faster 90% of the time.
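To give a flavor of what I mean by pulling data directly, here's a minimal sketch of an ad-hoc query. The table and column names are what you'd find in a typical Orion schema (dbo.Nodes), but treat them as assumptions and verify them against your own database first:

```sql
-- Sketch only: dbo.Nodes and its columns are assumed from a typical
-- Orion schema; confirm names in your own DB before relying on this.
-- Ten worst responders right now, straight from the database.
SELECT TOP 10
    Caption,
    IP_Address,
    ResponseTime,
    PercentLoss
FROM dbo.Nodes
ORDER BY ResponseTime DESC;
```

A query like this returns in a fraction of the time the equivalent web resource view takes to render, which is exactly the point the thread keeps circling back to.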
NPM is an incredibly flexible and extensible product, especially in recent revisions, and offers a lot of opportunity for people willing to really dig in behind the scenes. As usual, I have more questions:
* What SQL version and architecture are you using (separate database, named instances, etc.)?
* What architecture have you found helps in the speed department?
As an example of what I'm interested in: we run Cisco UCS servers, with VMware as the hypervisor layer, backed by NetApp FAS3240 fully licensed arrays with Flash Cache, etc. We tier our storage manually and have full production SQL and Oracle instances virtualized. The storage is connected to the UCS with an aggregated 80 Gb/s, and the UCS to the core at 160 Gb/s.
I've recently spent a lot of time and effort on this, even consulting with Atlantic Digital, Inc., and SolarWinds directly.
The outcome is that we are focusing on the SAN and the data stores presented to the DB server.
I was initially able to noticeably increase my web performance by moving to four SSDs (purchased at Fry's, so they are SATA SSDs) in a RAID 10 configuration for redundancy.
At that point, I had the log (.LDF) file on a SAN drive and the DB (.MDF) file on the local SSD array.
I had Atlantic Digital out for training and consultation, and they had me move the log (.LDF) file onto the SSD drives. This alone further DOUBLED my performance.
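For anyone wanting to make the same move, the T-SQL looks roughly like this. The database name, logical file name, and target path below are placeholders, not the actual names from my environment; look yours up in sys.master_files before running anything:

```sql
-- Placeholder names throughout: find your real logical file name with
--   SELECT name, physical_name FROM sys.master_files
--   WHERE database_id = DB_ID('SolarWindsOrion');
ALTER DATABASE SolarWindsOrion SET OFFLINE WITH ROLLBACK IMMEDIATE;

ALTER DATABASE SolarWindsOrion
    MODIFY FILE (NAME = SolarWindsOrion_log,
                 FILENAME = 'S:\SQLLogs\SolarWindsOrion_log.ldf');

-- Copy the .ldf file to the new path before bringing the DB back.
ALTER DATABASE SolarWindsOrion SET ONLINE;
```

MODIFY FILE only updates the catalog; the physical copy of the file to the new drive is a manual step while the database is offline.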
We are about to do a migration from Big Brother to SAM, so we decided to go enterprise and get a dedicated SAN. We are getting a Dell PowerVault with 10 SAS SSDs that will present three separate "drives" to the DB server.
One for the DB, one for the log and temp files, and one for the NetFlow filegroup.
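A sketch of what that dedicated filegroup looks like in T-SQL; the database name, filegroup name, path, and sizes here are all placeholders, not values from my actual build:

```sql
-- Placeholder names and sizes: adjust for your own environment.
-- A filegroup on its own drive isolates NetFlow I/O from the rest
-- of the database.
ALTER DATABASE SolarWindsOrion ADD FILEGROUP NetFlowFG;

ALTER DATABASE SolarWindsOrion
    ADD FILE (NAME = NetFlowData,
              FILENAME = 'N:\SQLData\NetFlowData.ndf',
              SIZE = 50GB,
              FILEGROWTH = 5GB)
    TO FILEGROUP NetFlowFG;
```

With the filegroup in place, the flow tables can live on the new drive while the rest of the database stays where it is.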
I believe SolarWinds is working on a document on this very subject.
Yeah, disk I/O on the database servers is always a big performance bottleneck/opportunity. My Oracle DBA and I work together on a lot of these same issues (moving certain mounts to certain disk arrays, etc.) to squeeze as much performance as possible out of the databases. I haven't spent as much time as I'd like on the SQL Server side, but the benefits are the same, if executed a little differently.
Good stuff, thanks!
Just some friendly advice.
You really shouldn't RAID SSDs. Doing so can more than halve the number of bytes that can be written to the SSD in its lifetime, because you're wearing out the flash memory. You're also maxing out bandwidth on internal I/O instead of spending it on read/write IOPS. I'll give your SSDs three years max until they die abruptly, and three months until you start losing performance.
The only RAID format that works reasonably with SSDs is RAID 0. Any other RAID level will severely impair the internal maintenance routines built into the SSD firmware, namely garbage collection and TRIM command support. Garbage collection takes care of data that can be deleted (data cannot be overwritten in place on SSDs, unlike on spindle disks; instead it has to be moved to an empty block on the SSD). TRIM is an OS-side command that forces the data to be nulled, which lets I/O operations continue without waiting for garbage collection to complete, sparing your internal SSD I/O capacity. RAID controllers today still do not pass TRIM commands from the OS through to the drives (unless RAID 0 is in use). Furthermore, for garbage collection and TRIM to work effectively, you'll need at least 10% unpartitioned space on the SSD. Using 100% of an SSD's capacity is the worst thing you can do, whether the drive is in an array or not.
If you're going to use SSDs on a SQL server, then use them for transaction logs, which write and read data sequentially. That is where SSDs shine: sequential data access, NOT random data access. Random access performance is actually the benefit of striping multiple spindle disks.
Use only one SSD for logs, or two, but only if they are in RAID 0.
Use traditional spindle disks for data. Hybrid disks are better, though.
In my experience, NPM performance issues are due to a misconfigured SQL server. It's similar to a VDI project: things work well up to a certain point, at which you hit the IOPS limit on your LUN and all of your desktops grind to a halt. NPM is the same, where an initial deployment with little to no stored data works well, even if your SQL server isn't configured to best practices (CPU/memory allocations, storing logs and databases on separate volumes, creating a sane maintenance plan, et cetera). But once you collect a few months' worth of data, and maybe turn on syslog, and take the plunge to capture flows, SQL can't keep up. Build a solid SQL server, wrap it in proper maintenance, and NPM should be happy.
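As a concrete example of the "sane maintenance plan" part, a bare-minimum sketch might look like the following. The table name is illustrative; a real plan would target specific fragmented indexes identified via sys.dm_db_index_physical_stats rather than rebuilding everything blindly:

```sql
-- Minimal maintenance sketch, run during a quiet window.
-- dbo.Nodes is an illustrative target; in practice, pick tables and
-- indexes based on fragmentation reported by
-- sys.dm_db_index_physical_stats.
ALTER INDEX ALL ON dbo.Nodes REBUILD;

-- Refresh out-of-date statistics across the database so the query
-- optimizer makes sensible plans against the grown data set.
EXEC sp_updatestats;
```

Even something this simple, scheduled nightly or weekly, beats the no-maintenance default that a lot of NPM installs end up running with.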
I've also used another server for the web GUI to reduce load on the NPM server, but I did this in concert with migrating the NPM database to a proper MSSQL cluster, so I'm not certain how much performance benefit can be attributed to the web console being broken out. But I ended up with a traditional three-tiered web app architecture, and performance was no longer a problem.
All were vSphere VMs on an EMC backend, with separate LUNs for each volume on the SQL server, and Windows Server 2003 R2 for all nodes (I admit, this was a while ago). A maintenance plan is key, and tuning your NPM data retention settings is a big part of it, too. The older the data is, the more it should be summarized. YMMV.
What data retention settings are you using? We are using Detail=30, Hourly=180, and Daily=780. Our DB is 120 GB and response is NOT optimal. We built the DB server dedicated to Orion: RAID 10, 15K drives, separate partitions/drives for the MDF, log file, etc.
We run NPM, SAM, NTA, UDT, IPAM, VNQM, WPM, and NCM. The web server is on the main poller with two additional pollers.
Response time has been our biggest issue with Orion. I am hoping SAM 6.0 will help us determine where our issues are on the SQL server.
I am NOT a DBA and know very little SQL.
It's great hearing what you guys are doing to get good ideas. SSD drives sound great, but is there a downside to SSDs?
SSD downside: not so much anymore, besides cost.
The failure rates seem to be greatly reduced from when they first appeared.
SAS over SATA will make a difference, and within SAS you have MLC and SLC, of which SLC has better performance.
We run all of those modules as well, but are just starting a full SAM deployment. Retention for NetFlow is minimal to reduce load. I expect I will bump it up when we move that filegroup to another drive, though.
Otherwise, we run 7, 90, 365. Depending on trap volume, we are anywhere between 40 and 100 GB.
I also run three additional web servers, because our current mapping has nearly EVERY network device/interface on a map at some level, all of which has to be queried to paint the map.
I am the only one who uses the primary NPM web server, and I don't monitor the maps.
Like I said above, though, even four SATA SSDs in RAID 10 (so only writing to two "heads") made a HUGE difference in performance for us.
My retention settings are exactly half of yours, 'murder'.
Another big hitter for performance not yet mentioned is number of elements polled and polling interval.
Status polling is mostly innocuous, but the statistics polling interval is the heavy hitter.
The majority of our polled elements are interfaces, and these are configured to use the default 9-minute interval.
For backbone and distribution router interfaces we set this to 5 minutes, and for international WAN links on our PEs (about two dozen) I set it to 1-minute intervals.
This provides good performance for most pages, except for the large, detailed national network map pages, which take up to 20 seconds to load.
This brings up an additional question: how many elements do you put on your largest map, and when do you start nesting maps for optimal performance?
Along the same lines, now that the node details page can be broken up into smaller chunks, how many tabs do you use?
These are a couple of ways of improving NPM performance, along with optimizing the DB.
It's now occurring to me that, if you really want to dig into performance problems with your SQL server, you could always point SAM at it and load up the SQL performance counters. Could be a quick way for non-SQL people to learn where their bottlenecks are.
One other thing to note: I mentioned keeping your SQL databases and logs on separate volumes. Make sure they're on separate, high-IOPS LUNs, too, so you're getting the most performance out of your storage system.
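A quick way to verify that separation is to ask SQL Server where the files actually live; adjust the name filter below to match your own Orion database name:

```sql
-- List data (ROWS) and log (LOG) file locations for the Orion DB.
-- If the physical paths share a volume, the separation isn't real.
SELECT DB_NAME(database_id) AS database_name,
       type_desc,
       physical_name
FROM sys.master_files
WHERE DB_NAME(database_id) LIKE '%Orion%';  -- placeholder filter
```

If both type_desc values come back pointing at the same drive letter or mount point, the "separate volumes" advice hasn't actually been applied at the storage layer.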