Level 11

Orion NPM Architecture, Speed, and SQL

In my last post, "What NPM Tips and Tricks do You Have?" I asked about tips and tricks, expecting a mashup of different things from all over the NPM world and to a certain extent that's what happened. Interestingly, however, a large section of the thread turned into a discussion about two things: maps and speed.

There were certainly a lot of good map tips, and you can find more at Solarwinds Labs.  In fact, you can even find out how to make your boss happy with a Big Green Button.

The speed issue is particularly intriguing to me since there are a lot of times where, let's be honest here, NPM is a bit of a dog when it comes to response. The web interface is notoriously slow, and gets even worse when you have a ton of custom widgets, do-dads, and whatchamacallits loading on a screen. Several people mentioned that a lot of speed can be picked up by getting in at the database level and pre-packaging certain things.

ZachM wrote:

Stored Procedures and custom Views created in the DB save us countless man hours and, in my experience, working directly in the DB can really expand your knowledge of the architecture of NPM overall. I highly recommend every SolarWinds engineer to challenge themselves to learn more SQL. I am by no means a DBA, but I can pull every bit of data you can get from the website, and I can do it faster 90% of the time.

NPM is an incredibly flexible and extensible product, especially in recent revisions, and offers a lot of opportunity for people willing to really dig in behind the scenes. As usual, I have more questions:

* What SQL version and architecture are you using (separate database, named instances, etc.)?

* What architecture have you found helps in the speed department?

As an example of what I'm interested in: we run Cisco UCS servers, with VMware as the hypervisor layer, backed by NetApp FAS3240 fully licensed arrays, with Flash Cache, etc. We tier our storage manually and have full production SQL and Oracle instances virtualized. The storage is connected to the UCS with an aggregated 80 Gb, and the UCS to the core at 160 Gb.

35 Replies
Level 10

We use SQL 2008 Standard on a dedicated server. This server is running with 15K SAS 6Gbps hard drives, 48GB RAM and 24 cores. Only Solarwinds named instances will ever be on this SQL Server.

I do not think we are going to run into issues here. We may start to see issues with what is actually monitored, but hey, this is why we have Solarwinds NTA: to figure out which devices need to be upgraded so our backbone and edge handle traffic better, or to assign the right QoS policies throughout the network if needed.

0 Kudos
Level 15

First of all, thanks for the mention

As to your questions:

* What SQL version and architecture are you using (separate database, named instances, etc.)?

     - SQL 2008 Standard x64 using MS Failover Clustering

     - Named Instances; separate Warehouse and Production DBs

* What architecture have you found helps in the speed department?

     - We are running physical clusters that have something like 16 cores and 64 GB memory with gigabit connections to the network

     - To date, I have not personally seen any performance or speed issues that would warrant us looking at improving performance. Our average setup is 6 polling engines, all at 6,500 elements or more, and we have not had an issue yet. Heck, we haven't even seen a problem with capacity that would make us want to start planning for an issue in the future.

Maybe we're just lucky, but SQL has not been a performance bottleneck for us yet.

0 Kudos
Level 15

Another issue that can cause poor performance is index fragmentation.

Solarwinds has added a warning for this in the events view in the last few releases.
It might even be worthwhile adding an e-mail alert for this message until you get this issue resolved.
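For anyone who wants to see which indexes are actually fragmented, here is a rough Python sketch. The T-SQL string uses SQL Server's `sys.dm_db_index_physical_stats` view; the thresholds (reorganize above 5%, rebuild above 30%, skip indexes under 1,000 pages) are the commonly cited rule of thumb and are an assumption here, not something taken from Orion or from this thread. The function and constant names are made up for illustration.

```python
# Query to run against the Orion database (for example via your usual SQL client).
# It lists every named index with its fragmentation percentage and size in pages.
FRAGMENTATION_QUERY = """
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name AS index_name,
       s.avg_fragmentation_in_percent,
       s.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE i.name IS NOT NULL;
"""

# Assumed rule-of-thumb thresholds; tune them for your own environment.
REORGANIZE_ABOVE = 5.0   # percent fragmentation
REBUILD_ABOVE = 30.0     # percent fragmentation
MIN_PAGES = 1000         # small indexes are rarely worth touching


def maintenance_statement(table, index, frag_pct, page_count):
    """Return the T-SQL for one index, or None if it can be left alone."""
    if page_count < MIN_PAGES or frag_pct <= REORGANIZE_ABOVE:
        return None
    action = "REBUILD" if frag_pct > REBUILD_ABOVE else "REORGANIZE"
    return f"ALTER INDEX [{index}] ON [{table}] {action};"


if __name__ == "__main__":
    # Hypothetical rows, shaped like the output of FRAGMENTATION_QUERY.
    sample_rows = [
        ("ExampleTable", "IX_Example_A", 45.0, 5000),
        ("ExampleTable", "IX_Example_B", 12.0, 5000),
        ("TinyTable", "IX_Tiny", 80.0, 10),
    ]
    for row in sample_rows:
        print(row[1], "->", maintenance_statement(*row))
```

Feed it the rows the query returns and review the generated statements before running any of them; a rebuild on a large table can be disruptive on a busy server.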

I had my dba follow the instructions here, but am still receiving this message.

I opened a case with Solarwinds support - Ticket# 511730

I have been asked to run additional SQL scripts, but this involves stopping all Orion services first.
This is not something I normally consider, just for troubleshooting, so I have not yet scheduled an outage for this.

I too would love to hear the fix for this. I have the same issue all the time. The auto indexing isn't working for me either.

0 Kudos

Great point. I too have enabled the automatic indexing. I still have many tables at various percentages of fragmentation, based on the maintenance logs. I've even run maintenance a few times to see results, to no real avail.

I'd love to hear what happens with your ticket NG.  Thx!

MVP

In my experience, NPM performance issues are due to a misconfigured SQL server. It's similar to a VDI project: things work well up to a certain point, at which point you hit the IOPS limit on your LUN and all of your desktops grind to a halt. NPM is the same: an initial deployment with little to no stored data works well, even if your SQL server isn't configured to best practices (CPU / memory allocations, storing logs and databases on separate volumes, creating a sane maintenance plan, et cetera). But once you collect a few months' worth of data, and maybe turn on syslog, and take the plunge to capture flows, SQL can't keep up. Build a solid SQL server, wrap it in proper maintenance, and NPM should be happy.

I've also used another server for the web GUI to reduce load on the NPM server, but I did this in concert with migrating the NPM database to a proper MSSQL cluster, so I'm not certain how much performance benefit can be attributed to the web console being broken out. But I ended up with a traditional three-tiered web app architecture, and performance was no longer a problem.

All were vSphere VMs, EMC backend with separate LUNs for each volume on the SQL server, 2003 R2 for all nodes (I admit, this was a while ago). A maintenance plan is key, and tuning your NPM data retention settings is a big part of it, too. The older the data is, the more it should be summarized. YMMV.

Good tips for sure.  We've thought about moving the web front-end for a while now... just haven't gotten to it yet. 

0 Kudos

One other thing that I found was to make sure your roll up is happening.  I had something break once in the nightly roll-up, and it was a month or two till I noticed it.  So it was an extra month or two of full detail which really slowed things down for me.  So watch c:\ProgramData\Solarwinds\Logs\Orion\swdebugMaintenance.log for "] Error "
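A minimal sketch of that check in Python, for anyone who would rather script it than eyeball the file. The path and the `"] Error "` marker are the ones from the post above; the function names, and the assumption that the log can be read as plain text, are mine.

```python
from pathlib import Path

# Path and marker as given in the post; adjust if your install differs.
LOG_PATH = Path(r"c:\ProgramData\Solarwinds\Logs\Orion\swdebugMaintenance.log")
MARKER = "] Error "


def find_error_lines(lines):
    """Return (line_number, text) for every line that contains the marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if MARKER in line
    ]


def scan_log(path=LOG_PATH):
    """Read the maintenance log and return its error lines."""
    # errors="replace" keeps the scan going if the file has odd bytes in it.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_error_lines(handle)


if __name__ == "__main__":
    if LOG_PATH.exists():
        for number, text in scan_log():
            print(f"{number}: {text}")
    else:
        print(f"log not found: {LOG_PATH}")
```

Run it from a scheduled task after the nightly maintenance window and you will hear about a broken roll-up the next morning rather than a month or two later.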

- The biggest downside of SSD is GB/$. SSD seems *really* expensive when compared to 15k disks, but when you look at it as IOPS/$, SSD usually beats 15k, especially if you add IOPS/kWh and/or IOPS/BTU.

0 Kudos

What data retention settings are you using? We are using Detail=30, Hourly=180 and Daily=780. Our DB is 120 GB and response is NOT optimal. We built the DB server dedicated to Orion, RAID 10, 15K drives, separate partitions/drives for the MDF, log file, etc.
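To put rough numbers on what retention settings like these cost, here is a back-of-the-envelope Python sketch. It assumes one detail row per poll, one row per hour in the hourly tier and one row per day in the daily tier, with each tier covering only the age window after the previous one (detail is rolled into hourly, hourly into daily). It also assumes the 9-minute statistics interval that another poster in this thread calls the default. Real Orion tables will differ, so treat the output as relative, not absolute.

```python
MINUTES_PER_DAY = 24 * 60


def rows_per_element(detail_days, hourly_days, daily_days, interval_minutes=9):
    """Estimate stored statistics rows for one polled element.

    Simplified model (an assumption, not Orion's actual schema):
    detail rows for the newest `detail_days`, hourly rows from there
    out to `hourly_days`, daily rows from there out to `daily_days`.
    """
    polls_per_day = MINUTES_PER_DAY // interval_minutes
    detail_rows = detail_days * polls_per_day
    hourly_rows = (hourly_days - detail_days) * 24
    daily_rows = daily_days - hourly_days
    return detail_rows + hourly_rows + daily_rows


if __name__ == "__main__":
    print(rows_per_element(30, 180, 780))  # 9000 (settings from this post)
    print(rows_per_element(7, 90, 365))    # 3387 (a shorter 7/90/365 scheme)
```

In this model the detail tier dominates: 4,800 of the 9,000 rows come from the 30 days of detail, so shortening detail retention is where the savings are.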

We run NPM, SAM, NTA, UDT, IPAM, VNQM, WPM, and NCM.  The web server is on the main poller with two additional pollers.

Response time has been our biggest issue with Orion.  I am hoping SAM 6.0 will help us determine where our issues are on the SQL server.

I am NOT a DBA and know very little SQL.

It's great hearing what you guys are doing to get good ideas. SSD drives sound great, but is there a downside to SSDs?

0 Kudos

My retention settings are exactly half of yours 'murder'.

Another big hitter for performance not yet mentioned is number of elements polled and polling interval.

Status polling is mostly innocuous, but statistics polling intervals are the heavy hitter.

The majority of our polled elements are interfaces and these are configured to use the default 9 minute interval.

For Backbone and Distribution Router interfaces we set this to 5 minutes, and for International WAN links on our PE's (about 2 dozen) I set it to 1 minute intervals.
This provides good performance for most pages, except for the large detailed National network map pages which take up to 20 seconds to load.
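A quick way to see why statistics intervals are the heavy hitter is to count polls per day. The sketch below is plain arithmetic in Python using the 9-, 5- and 1-minute intervals from this post; the function name is made up, and it assumes every poll produces one stored sample per element.

```python
MINUTES_PER_DAY = 24 * 60


def polls_per_day(interval_minutes, elements=1):
    """Statistics polls per day for `elements` polled at the given interval."""
    return elements * (MINUTES_PER_DAY // interval_minutes)


if __name__ == "__main__":
    print(polls_per_day(9))        # 160 per interface at the default interval
    print(polls_per_day(5))        # 288 per interface
    print(polls_per_day(1))        # 1440 per interface
    print(polls_per_day(1, 24))    # 34560 for two dozen links at 1 minute
    print(polls_per_day(9, 24))    # 3840 for the same links at the default
```

Each interface moved from 9 minutes to 1 minute costs as much as nine default interfaces, which is why it pays to keep the short intervals to a couple dozen critical links.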

This brings up the additional question - how many elements do you put on your largest map & when do you start nesting maps for optimal performance?

Along the same lines, now that the node details page can be broken up into smaller chunks, how many tabs do you use?

These are a couple of ways of improving NPM performance, along with optimizing the DB.

It's now occurring to me that, if you really want to dig into performance problems with your SQL server, you could always point SAM at it and load up the SQL performance counters. Could be a quick way for non-SQL people to learn where their bottlenecks are.

Other thing to note: I mentioned keeping your SQL databases and logs on separate volumes. Make sure they're on separate, high-IOPS LUNs, too, so you're getting the most performance out of your storage system.

SSD downside: not so much any more, besides cost. 

The failure rates seem to be greatly reduced from when they first appeared. 

SAS over SATA will make a difference, and within SAS you have MLC and SLC, of which SLC performs better.

We run all of those modules as well but are just starting full SAM deployment. Retention for NetFlow is minimal to reduce load. I expect I will bump it up when I move that filegroup to another drive, though.

Otherwise, we run 7, 90, 365. Depending on trap volume, we are anywhere between 40 and 100 GB.

I also run 3 additional web servers because our current mapping has nearly EVERY network device/interface on a map at some level, which all has to be queried to paint the map.

I am the only one that uses the primary NPM web server, and I don't monitor the maps.

Like I said above though, even 4 SSD SATA drives in RAID 10 (so only writing to 2 "heads") made a HUGE difference in performance for us.

Good Luck!

0 Kudos
Level 17

I've recently spent a lot of time and effort on this, even consulting with Atlantic Digital, Inc., and Solarwinds directly.

The outcome is we are focusing on the SAN and the data stores presented to the DB server.

Initially, I was able to noticeably improve my web performance by moving to 4 SSDs (purchased at Fry's, so they are SATA SSDs) in RAID 10 for redundancy.

Here, I had the LOG .LDF file on a SAN drive and the DB .MDF file on the local SSD array.

I had Atlantic Digital out for training and consultation and they had me move the LOG .LDF file onto the SSD drives.  This alone further DOUBLED my performance.

We are about to do a migration from Big Brother to SAM, so we decided to go enterprise and get a dedicated SAN.  We are getting a Dell PV with 10 SAS SSDs that will have 3 separate "drives" on the DB Server.

One for the DB, one for Log and Temp, and one for Netflow [FileGroup].

I believe Solarwinds is working on a document on this very subject.

Hi,

Just some friendly advice.

You really shouldn't RAID SSDs. This will more than halve the amount of bytes that can be written to the SSD in its lifetime as you're wearing out the flash memory. You're also maxing bandwidth on internal IO instead of using them on read/write IOPS. I'll give your SSDs max 3 years until they die abruptly. 3 months until you start losing performance.

The only supported RAID format for SSDs is RAID 0. Any other RAID format will severely impair the internal maintenance protocols built into the SSD BIOS. The protocols, or plans rather, include Garbage Collection and TRIM command compatibility. Garbage Collection takes care of data that can be deleted (data cannot be overwritten on SSDs, unlike on spindle disks; instead it needs to be moved to an empty block on the SSD). TRIM is an OS-side command that forces the data to be nulled and by doing so allows IO operations to continue without having to wait for Garbage Collection to complete. This spares your internal SSD IO capacity. RAID controllers nowadays still do not support TRIM commands passed on by the OS (unless RAID 0 is in use). Furthermore, for successful Garbage Collection and TRIM commands to be passed on to the SSD, you'll need at least 10% of unpartitioned space on the SSD. Using 100% of SSD capacity is the worst you can do, whether using it as a single drive or not.

If you're going to use SSDs on a SQL server, then use them for transaction logs, which write and read data sequentially. This is where SSDs shine: in sequential data access, NOT random data access. Random data access performance is actually the benefit of RAIDing multiple disks.

Only use 1 SSD for logs or 2, but only if they are in RAID 0.

Use traditional spindle disks for Data. Hybrid disks are better, though.

0 Kudos

Yeah, disk I/O on the database servers is always a big performance bottleneck/opportunity.  My Oracle DBA and I work together on a lot of these same issues (moving certain mounts to certain disk arrays, etc.) to squeeze as much performance as possible out of the databases.  I haven't spent as much time as I'd like on the SQL Server side, but the benefits are the same, if executed a little differently.

Good stuff, thanks!

0 Kudos