Thanks in advance for any tips or other thoughts on this effort. This is admittedly a lengthy post.
We are a national healthcare organization with a coast-to-coast footprint. We've been struggling with our 15 or so Orion deployments (NPM, NTA, NCM, SAM, QoE) for the 3 years since their original deployment. Some of the instances are a few thousand elements while others are 13K+. Drivers for my consolidation project include:
- Moving targets on hardware requirements between versions, which makes it difficult to get all instances upgraded
- Varying levels of toolset administrative expertise in the network group and varying VM capacities in the farms within the hospital computer rooms
- General lack of priority put on keeping the toolset up & running
Those 15 instances are spread throughout our Midwest & West regions. Due to M&A activity we'll be adding an East region to the solution, which under our distributed model would require another 10 or so instances. Given the issues we've had, and then looking at the TCO to add the East region sites, we have decided that we need to consolidate the instances into 'mega instances', if you will.
We've had extensive conversations around this with our account team & their engineer. Here are some of the key takeaways that we've come up with together that I'd love to hear some comment on as I work to size the hardware:
A little background:
- Today we are mostly virtualized except for 2 of our corporate data center instances, including the SQL server.
- A comprehensive inventory reveals that under our current configuration / policy we have around 180K elements. We feel that this can be reduced by an order of magnitude, as most of it is user-facing switchports. Our tools group has discussed this and we are hard pressed to show that instrumenting these ports provides any significant operational value. The intent is to change our default policy to NOT monitor access-layer switchports. The exception will be server aggregation switches in the hospital computer rooms.
- We have three corporate data centers interconnected with 10 Gbps fiber. They all have redundant MPLS head ends as well, and all hospitals have redundant 100 Mbps MPLS connections.
- We do a lot of NPM, a good bit of NTA (mainly core & branch WAN), a little SAM (WAN-related servers/apps only), and a bit of NCM (limited, as we have CiscoWorks Prime for most needs).
- General need to accommodate separate administrative domains while still providing for cross-hospital support WITHIN a region.
New design info:
- Interface polling interval will be somewhere between 5 and 10 minutes, TBD; possibly shorter for critical backbone WAN interfaces.
- By removing the access-layer ports and doing some spreadsheet work, I project that the existing regions will be reduced to around 15K elements, at least to begin.
- There are slightly fewer overall nodes in our new East region, but I'm still setting aside 15K elements for them, for a total of 30K to begin.
- We have been told by leadership that through additional M&A we should expect to double in size over the next few years.
- As we grow toward 100K+ elements, I expect (based on other posters) to need on the order of 100K IOPS on the SQL back end. The SQL server(s) will be physical.
- We worked with our IBM SAN team, and though they can provide the IOPS, flash, etc., they feel that the strain on the SAN would be too much for a non-core application.
- The plan for the SQL needs, then, is HP ProLiant hardware with 128 GB RAM. The disk will be local flash (HP, not off-the-shelf).
- We have all of the licensing that we can possibly need, except for maybe some add'l web licenses.
- My thought is that we will first try a single instance (while leaving all the legacy instances in place just in case).
- If it does not perform well or if we grow to a size that causes it not to perform well, THEN we will replicate the new high performance design to a 3 instance design that mirrors our 3 regions and then implement EOC (licensed already).
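To sanity-check the intervals and element counts above, here's a quick back-of-envelope sketch. It treats each element as one poll per cycle, which understates reality since Orion collects multiple statistics per interface, so read these as lower bounds:

```python
# Back-of-envelope aggregate polling load, using the element counts and
# intervals from this post. One poll per element per cycle is assumed;
# scale up for the multiple statistics Orion actually collects.

def polls_per_second(elements: int, interval_seconds: int) -> float:
    """Aggregate poll completions per second across all engines."""
    return elements / interval_seconds

# Day-one target: 30K elements at a 5-minute (300 s) interval.
print(polls_per_second(30_000, 300))   # 100.0 polls/sec

# After the projected M&A doubling to 60K elements, same interval.
print(polls_per_second(60_000, 300))   # 200.0 polls/sec

# Relaxing to a 10-minute interval halves the rate again.
print(polls_per_second(60_000, 600))   # 100.0 polls/sec
```

Even the doubled figure is modest in aggregate; the question is how it distributes across engines, which the APE plan below addresses.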
Anticipated compute (all HP ProLiant or VMware):
- SQL - SQL Server 2014, HP ProLiant physical, 128 GB, mirrored OS disk, 4x HP SSD
- NTA - ProLiant physical, 64 GB, mirrored OS disk, DB disk TBD
- Core - virtual, 48 GB, SAN-based RAID 5 (no polling on core)
- APE x 'n' - virtual, 16 GB, SAN-based RAID 5. Intent is to start with one polling engine for each of the 3 regions and add APEs (up to 3 per region) as required, based on monitoring the polling completion rates.
- Web x 3 - virtual, 16 GB, SAN-based RAID 5. One web server per region (3 total).
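As a rough illustration of the "add APEs as required" plan: the sketch below assumes ~12K elements per polling engine, which is an illustrative figure only; the real ceiling depends on the Orion version and polling interval, so verify it against the scalability guide for your release.

```python
import math

# Sketch of the APE headroom math using the post's numbers. The
# elements-per-engine capacity is an assumed placeholder, NOT vendor
# guidance -- check the scalability guide for the Orion version in use.
ENGINE_CAPACITY = 12_000   # assumed max elements per polling engine

def engines_needed(region_elements: int) -> int:
    return math.ceil(region_elements / ENGINE_CAPACITY)

# Day one: 30K elements across 3 regions, ~10K per region.
print(engines_needed(10_000))   # 1 engine per region

# After the projected doubling: ~20K per region.
print(engines_needed(20_000))   # 2 engines per region

# Ceiling of the 3-APEs-per-region plan under this assumption:
print(3 * ENGINE_CAPACITY)      # 36000 elements per region
```

Under that (assumed) capacity, the 3-per-region cap leaves comfortable headroom even after the doubling, which supports starting with one APE per region.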
So please comment on any holes you see. My main outstanding questions that I need some input on are:
- Flash layout for SQL - how many SSDs, and how should I lay them out? I'd like to avoid splitting data / log / tempdb across separate volumes if at all possible, but what's the best layout for either 2 or 4 SSDs (planning on the 800 GB write-intensive variant)?
- Disk requirements for NTA - should I spec flash? The drives are VERY expensive. Would SAN-based RAID 5 (very high-end IBM gear) be OK for this, or does it carry the exact same caveats as the SQL server? Also, to anyone with experience: how does NTA scale in very large environments?
- Do the APE specs look OK? I'm intending to load the APEs up and maintain 5-10 minute polling intervals.
- Thoughts on CPU count and quantity for SQL? Most of the ProLiant gear I'm looking at maxes out at 2.6 GHz. The guides have recommended 3 GHz 'or better' for years now, but honestly, even a 2.4 GHz part of today is so much better than a 3 GHz from a couple of generations back. My thought is dual proc, 2.6 GHz, 12 cores each.
- Any thought that I should just START with 3 instances + EOC? I don't know if I can find that money, but I'd certainly entertain any opinion on this. If we did go this route, would you place each instance within its region or all centrally next to the EOC?
- Any other thoughts?
Appreciate any input. Also greatly appreciate all of the previous posts over the last couple of years; some good nuggets in there, to be sure. I'm just trying to provide the best possible user experience while consolidating the platform sprawl down into something supportable. My users are all WAN guys, so as you can imagine, a very tough & unforgiving crowd; trying to avoid any misstep.