Orion NPM Architecture, Speed, and SQL

In my last post, "What NPM Tips and Tricks do You Have?", I asked about tips and tricks, expecting a mashup of different things from all over the NPM world, and to a certain extent that's what happened. Interestingly, however, a large section of the thread turned into a discussion about two things: maps and speed.

There were certainly a lot of good map tips, and you can find more at SolarWinds Labs. In fact, you can even find out how to make your boss happy with a Big Green Button.

The speed issue is particularly intriguing to me since there are a lot of times when, let's be honest here, NPM is a bit of a dog when it comes to response. The web interface is notoriously slow, and gets even worse when you have a ton of custom widgets, doodads, and whatchamacallits loading on a screen. Several people mentioned that a lot of speed can be picked up by getting in at the database level and pre-packaging certain things.

ZachM wrote:

Stored Procedures and custom Views created in the DB save us countless man hours and, in my experience, working directly in the DB can really expand your knowledge of the architecture of NPM overall. I highly recommend that every SolarWinds engineer challenge themselves to learn more SQL. I am by no means a DBA, but I can pull every bit of data you can get from the website, and I can do it faster 90% of the time.

NPM is an incredibly flexible and extensible product, especially in recent revisions, and offers a lot of opportunity for people willing to really dig in behind the scenes. As usual, I have more questions:

* What SQL version and architecture are you using (separate database, named instances, etc.)?

* What architecture have you found helps in the speed department?

As an example of what I'm interested in: we run Cisco UCS servers, with VMware as the hypervisor layer, backed by NetApp FAS3240 fully licensed arrays, with Flash Cache, etc. We tier our storage manually and have full production SQL and Oracle instances virtualized.  The storage is connected to the UCS with an aggregated 80GB, and the UCS to the core at 160GB.

  • I've recently spent a lot of time and effort on this, even consulting with Atlantic Digital, Inc., and Solarwinds directly.

    The outcome is that we are focusing on the SAN and the Data Stores presented to the DB Server.

    I was initially able to noticeably improve my web performance by moving to 4 SSDs (purchased at Fry's, so they are SATA SSDs) in RAID 10 for redundancy.

    Here, I had the LOG .LDF file on a SAN drive and the DB .MDF file on the local SSD array.

    I had Atlantic Digital out for training and consultation and they had me move the LOG .LDF file onto the SSD drives.  This alone further DOUBLED my performance.

    We are about to do a migration from Big Brother to SAM, so we decided to go enterprise and get a dedicated SAN.  We are getting a Dell PV with 10 SAS SSDs that will have 3 separate "drives" on the DB Server.

    One for the DB, one for Log and Temp, and one for Netflow [FileGroup].

    I believe Solarwinds is working on a document on this very subject.

  • In my experience, NPM performance issues are due to a misconfigured SQL server. It's similar to a VDI project: things work well up to a certain point, at which point you hit the IOPS limit on your LUN and all of your desktops grind to a halt. NPM is the same, where an initial deployment with little to no stored data works well, even if your SQL server isn't configured to best practices (CPU / memory allocations, storing logs and databases on separate volumes, creating a sane maintenance plan, et cetera). But once you collect a few months' worth of data, and maybe turn on syslog, and take the plunge to capture flows, SQL can't keep up. Build a solid SQL server, wrap it in proper maintenance, and NPM should be happy.

    I've also used another server for the web GUI to reduce load on the NPM server, but I did this in concert with migrating the NPM database to a proper MSSQL cluster, so I'm not certain how much performance benefit can be attributed to the web console being broken out. But I ended up with a traditional three-tiered web app architecture, and performance was no longer a problem.

    All were vSphere VMs, EMC backend with separate LUNs for each volume on the SQL server, 2003 R2 for all nodes (I admit, this was a while ago. :) ). Maintenance plan is key, and tuning your NPM data retention settings is a big part of it, too. The older the data is, the more it should be summarized. YMMV.

  • What data retention settings are you using? We are using Detail=30, Hourly=180 and Daily=780. Our DB is 120 GB and response is NOT optimal. We built the DB server dedicated to Orion, RAID 10, 15K drives, separate partitions/drives for MDF, log file, etc.

    We run NPM, SAM, NTA, UDT, IPAM, VNQM, WPM, and NCM.  The web server is on the main poller with two additional pollers.

    Response time has been our biggest issue with Orion.  I am hoping SAM 6.0 will help us determine where our issues are on the SQL server.

    I am NOT a DBA and know very little SQL.

    It's great hearing what you guys are doing to get good ideas. SSD drives sound great, but is there a downside to SSDs?

  • SSD downside: not so much any more, besides cost. 

    The failure rates seem to be greatly reduced from when they first appeared. 

    SAS over SATA will make a difference, and within SAS you have MLC and SLC, of which SLC has the better performance.

    We run all of those modules as well but are just starting full SAM deployment. Retention for NetFlow is minimal to reduce load. I expect I will bump it up when I move that filegroup to another drive, though.

    Otherwise, we run 7, 90, 365. Depending on trap volume, we are anywhere between 40 and 100 GB.

    I also run 3 additional web servers because our current mapping has nearly EVERY network device/interface on a map at some level, which all has to be queried to paint the map.

    I am the only one who uses the primary NPM web server, and I don't monitor the maps.

    Like I said above though, even 4 SATA SSDs in RAID 10 (so only writing to 2 "heads") made a HUGE difference in performance for us.

    Good Luck!
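
To put the retention numbers in this thread side by side, here is a rough sketch of how many rows a single polled element might keep under each setting. It assumes one detail row per poll at the default 9 minute interval mentioned in this thread, one summary row per hour and one per day, and it treats each tier as holding its full window, so it is an upper bound rather than a description of how Orion actually stores data. The `Retention` class and `rows_per_element` function are made-up names for illustration.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass(frozen=True)
class Retention:
    """Retention windows in days (hypothetical helper, not an Orion API)."""
    detail_days: int
    hourly_days: int
    daily_days: int


def rows_per_element(retention, poll_minutes=9):
    """Estimate rows kept per polled element under the stated assumptions."""
    detail = retention.detail_days * MINUTES_PER_DAY // poll_minutes
    hourly = retention.hourly_days * 24  # one summary row per hour
    daily = retention.daily_days         # one summary row per day
    return {
        "detail": detail,
        "hourly": hourly,
        "daily": daily,
        "total": detail + hourly + daily,
    }


# The two sets of settings quoted in the thread.
long_retention = rows_per_element(Retention(30, 180, 780))
short_retention = rows_per_element(Retention(7, 90, 365))

print(long_retention)
print(short_retention)
```

Under these assumptions the detail tier is where most of the difference comes from, which fits the advice above about summarizing older data.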

  • My retention settings are exactly half of yours 'murder'.

    Another big hitter for performance not yet mentioned is number of elements polled and polling interval.

    Status polling is mostly innocuous, but statistics polling intervals are the heavy hitter.

    The majority of our polled elements are interfaces and these are configured to use the default 9 minute interval.

    For Backbone and Distribution Router interfaces we set this to 5 minutes, and for International WAN links on our PE's (about 2 dozen) I set it to 1 minute intervals.
    This provides good performance for most pages, except for the large detailed National network map pages which take up to 20 seconds to load.

    This brings up the additional question - how many elements do you put on your largest map & when do you start nesting maps for optimal performance?

    Along the same lines, now that the node details page can be broken up into smaller chunks, how many tabs do you use?

    These are a couple of ways of improving NPM performance, along with optimizing the DB.
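
The effect of the statistics polling interval can be estimated with simple arithmetic: each group of elements contributes (element count / interval) polls per minute. The sketch below uses the 9, 5, and 1 minute intervals and the "about 2 dozen" WAN links from the comment above; the other element counts are invented purely for illustration, and `polls_per_minute` is a hypothetical helper, not part of NPM.

```python
def polls_per_minute(groups):
    """Sum statistics polls per minute over groups of (element_count, interval_minutes)."""
    return sum(count / interval for count, interval in groups.values())


# Intervals come from the comment above; the 9000 and 500 counts are made up.
groups = {
    "default interfaces": (9000, 9),
    "backbone/distribution": (500, 5),
    "international WAN links": (24, 1),
}

# The same made-up counts, but with everything polled every minute.
everything_at_one_minute = {name: (count, 1) for name, (count, _) in groups.items()}

print(polls_per_minute(groups))
print(polls_per_minute(everything_at_one_minute))
```

With these made-up counts, tiered intervals keep the poll rate at a fraction of what polling everything every minute would generate, which is why short intervals are best reserved for the links that really need them.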

  • Another issue that can cause poor performance is index fragmentation.

    Solarwinds has added a warning for this in the events view in the last few releases.
    It might even be worthwhile adding an e-mail alert for this message until you get this issue resolved.

    I had my DBA follow the instructions here, but I am still receiving this message.

    I opened a case with Solarwinds support - Ticket# 511730

    I have been asked to run additional SQL scripts, but this involves stopping all Orion services first.
    This is not something I normally consider, just for troubleshooting, so I have not yet scheduled an outage for this.

  • Great point. I too have enabled the automatic indexing. I still have many tables at various percentages of fragmentation, according to the maintenance logs. I've even run maintenance a few times to see results, to no real avail.

    I'd love to hear what happens with your ticket NG.  Thx!
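
For anyone triaging a list of fragmented indexes, here is a small sketch of a commonly cited SQL Server rule of thumb: reorganize an index between roughly 5% and 30% fragmentation, rebuild it above 30%, and ignore very small indexes. These thresholds are general guidance, not the criteria SolarWinds uses for its warning, and the index names, page counts, and `maintenance_action` function below are all hypothetical.

```python
def maintenance_action(frag_percent, page_count, min_pages=1000):
    """Suggest an action for one index using a common rule of thumb."""
    if page_count < min_pages:
        return "skip"        # small indexes rarely benefit from defragmenting
    if frag_percent > 30:
        return "rebuild"
    if frag_percent > 5:
        return "reorganize"
    return "ok"


# Made-up sample data: (index name, fragmentation %, page count).
sample = [
    ("IX_Example_Detail", 62.0, 150_000),
    ("IX_Example_Hourly", 18.5, 40_000),
    ("IX_Example_Small", 90.0, 12),
    ("IX_Example_Daily", 2.1, 8_000),
]

for name, frag, pages in sample:
    print(f"{name}: {maintenance_action(frag, pages)}")
```

In practice the fragmentation and page count figures would come from the database itself; the point here is only the decision logic.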

  • Yeah, disk I/O on the database servers is always a big performance bottleneck/opportunity.  My Oracle DBA and I work together on a lot of these same issues (moving certain mounts to certain disk arrays, etc.) to squeeze as much performance as possible out of the databases.  I haven't spent as much time as I'd like on the SQL Server side, but the benefits are the same, if executed a little differently.

    Good stuff, thanks!

  • Good tips for sure. We've thought about moving the web front-end for a while now... just haven't gotten to it yet. :)

  • One other thing that I found was to make sure your roll up is happening.  I had something break once in the nightly roll-up, and it was a month or two till I noticed it.  So it was an extra month or two of full detail which really slowed things down for me.  So watch c:\ProgramData\Solarwinds\Logs\Orion\swdebugMaintenance.log for "] Error "
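
Checking that log by hand is easy to forget, so here is a minimal sketch of a script that scans a log file for the "] Error " marker mentioned above and returns the matching lines. The function name `find_errors` is made up, and the sketch assumes the log is plain text that can be read as UTF-8; adjust the encoding if your log differs.

```python
def find_errors(log_path, marker="] Error "):
    """Return (line_number, line) pairs for lines containing the marker."""
    hits = []
    # errors="replace" keeps the scan going if the file has odd bytes.
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for line_number, line in enumerate(handle, start=1):
            if marker in line:
                hits.append((line_number, line.rstrip("\r\n")))
    return hits


if __name__ == "__main__":
    # Path taken from the comment above.
    path = r"c:\ProgramData\Solarwinds\Logs\Orion\swdebugMaintenance.log"
    try:
        for line_number, line in find_errors(path):
            print(f"{line_number}: {line}")
    except FileNotFoundError:
        print(f"Log not found: {path}")
```

Run from a scheduled task, something like this could flag a broken nightly roll-up within a day instead of a month or two.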

    - The biggest downside of SSD is GB/$. SSD seems *really* expensive when compared to 15k disks, but when you look at it as IOPS/$, SSD usually beats 15k, especially if you add IOPS/kWh and/or IOPS/BTU.
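
The GB/$ versus IOPS/$ point can be shown with a quick calculation. Every figure below (price, capacity, IOPS, power draw) is invented for illustration and does not describe any real drive; `Drive` and `metrics` are hypothetical names. IOPS per watt stands in for IOPS/kWh here, since the two are proportional for a drive that runs continuously.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    """Illustrative drive specs; all values used below are made up."""
    name: str
    price_usd: float
    capacity_gb: float
    iops: float
    watts: float


def metrics(drive):
    """Compute the cost-efficiency ratios discussed above."""
    return {
        "gb_per_usd": drive.capacity_gb / drive.price_usd,
        "iops_per_usd": drive.iops / drive.price_usd,
        "iops_per_watt": drive.iops / drive.watts,
    }


hdd_15k = Drive("15k disk (made-up specs)", price_usd=300, capacity_gb=600, iops=200, watts=10)
ssd = Drive("SSD (made-up specs)", price_usd=600, capacity_gb=400, iops=20_000, watts=5)

for drive in (hdd_15k, ssd):
    print(drive.name, metrics(drive))
```

With numbers shaped like these, the spinning disk wins on capacity per dollar while the SSD wins by a wide margin on IOPS per dollar and per watt, which is the trade-off described above.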