14 Replies Latest reply on Oct 18, 2018 12:44 PM by josh.haberman

    Cloud Fever, throwing all the SolarWinds into AWS

    josh.haberman

      After his 3/13 LIVE WEBCAST: IF AN APPLICATION FAILS IN THE DATACENTER AND NO USERS ARE ON IT, WILL IT CUT A TICKET? presentation, I picked the amazing adatole's brain about migrating our SolarWinds environment to AWS, and he recommended I let the THWACK community weigh in.

       

      Here's my situation in a nutshell:

       

      Currently, our Orion SQL DB is in a SQL cluster shared by other applications. The Orion DB is killing performance for the other DBs in that cluster. Based on current resource utilization and guidelines for future growth, I'm looking at getting the DB its own "server", whether virtual or physical, with at least 128GB of RAM. The SQL cluster has 64GB of shared RAM. We're not able to deploy a virtual server with that much RAM in our virtual environment.

       

      So what about a physical server?

       

      Well...

       

      My company has caught the "Cloud Fever" and the only cure is more Cloud!

       

      Our parent company based in France has pushed out an IT edict that all 26 of its international entities (North America, UK, China, etc.) must convert 50% of their data center based hardware and virtual servers into "the cloud" by 2020. Unfortunately, cost concerns and performance be ******, nobody is asking the important questions like "Why?"

       

      So I'm being told that with this edict in play, any requests for a new hardware server would be instantly denied.

       

      With this, I'm trying to make this work as best as possible given the situation. The slight advantage I have is that there are massive amounts of money being thrown at this cloud effort, so I can leverage that to make this as smooth as possible.

       

      Another side issue is that our IT department is extremely siloed. My title is "Network Engineer" which means I'm a member of the "Network Team". However, it's 2018 and IT should all be on the same team.  The server team is full of old school Microsoft fanboys and girls that have fought AWS tooth and nail (and not for logical reasons). We have a very developed and robust Orion environment with 3 very dedicated individuals maintaining it and end users actively using it across many teams including some outside of IT.

       

      The server team uses a neglected instance of SCOM 2012 to monitor servers, AD, and databases, using mostly out-of-the-box alerts that are sent only to members of their team. Its web portal is accessible only by them.

       

      I have graciously offered to take on the task of assisting them with integrating our servers and AD environment into SAM, which would incur no cost as we are already licensed. I got immediate pushback with no logical reasons, almost like it's some sort of childish turf war for them. So asking them for any assistance with the Orion servers is a pain, all because I offered to help them.

       

      Here's what my Orion environment looks like now:

      NAM 3000 with ACM 250. We're using NPM, SAM, NTA, NCM, IPAM, VNQM, UDT, and WPM.

       

      Main PE: polling 12893 Elements with a job weight of 5650.

      APE in Canada, whose local subnets aren't routed to the "Main" network, hence the need for an APE: polling 5008 elements with a job weight of 1929.

      APE in our SCADA industrial controls system DMZ (we're planning on rolling this into the MPE, since we should be able to poll these nodes with "routing and firewall magic").

       

      Our AWS environment is in the very earliest stages, only 1 test application has been migrated so far, so I have a lot of freedom to plan out how to monitor that.

       

      Our AT&T managed MPLS cloud has a direct connect into our AWS instance, so that should help alleviate some latency issues with our remote location polling.

       

      Some of the advice Leon offered includes the following:

       

      1. Installing SW into AWS
         1. The first and most important thing you need to ensure is that the timing between the primary poller and the database remains low – under 1500 milliseconds. If you have latency that is longer than that, you are going to experience errors and data corruption.
         2. The second (and only slightly less important) thing is to ensure that your database is set up for the transaction volume – in on-prem terms, it needs to be RAID 10 or flash. Not RAID 5.
         3. The third thing is that you will likely be monitoring your on-prem environment using an additional polling engine, unless you have fewer than 100 devices on-prem that you wish to monitor.
      With all of that said, there is a guide to help you:

      1)      Put the primary poller and the db in the cloud so that your timing between them is as short as possible. The primary poller will have very little to monitor (at least right now) and That’s OK ™

      2)      Put an additional poller in the main site, and another APE in your secondary site. They cost nothing, so why not. They can be virtual. You can play with the hardware they’re assigned until you’ve salted to taste.

      3)      If you can, install the AWS-based instance of DPA (it’s in the Amazon store) and watch your SW database with it. You will have the ability to see how it’s truly performing and where any bottlenecks might crop up.

      a.       It’s also a great “advertisement” to your DBA team to show the capabilities of the tool. No I’m not trying to upsell you. It’s just a nice tool in your toolbox if you don’t have something else. And it’s natively cloud-based, so you can score some points from corporate.

      4)      Make sure you add your cloud credentials to SAM. Again, score some corp brownie points.
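      As a rough way to sanity-check the poller-to-database timing mentioned in the advice above, something like the following Python sketch could be run from the primary poller. It only times TCP connections to the default SQL Server port (1433), which approximates a network round trip and says nothing about query performance. The function names and the 5-sample default are illustrative assumptions, not anything from SolarWinds.

```python
import socket
import statistics
import time
from typing import Dict, Sequence

# Threshold quoted in the advice above, in milliseconds.
MAX_LATENCY_MS = 1500.0


def tcp_connect_ms(host: str, port: int = 1433, timeout: float = 5.0) -> float:
    """Time one TCP connection to host:port and return it in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connect only; no SQL traffic is sent
    return (time.perf_counter() - start) * 1000.0


def summarize(samples_ms: Sequence[float],
              limit_ms: float = MAX_LATENCY_MS) -> Dict[str, object]:
    """Report median and worst sample, and whether the worst is under the limit."""
    worst = max(samples_ms)
    return {
        "median_ms": statistics.median(samples_ms),
        "max_ms": worst,
        "within_limit": worst < limit_ms,
    }


def probe(host: str, count: int = 5) -> Dict[str, object]:
    """Take `count` connection samples against the SQL Server port and summarize."""
    return summarize([tcp_connect_ms(host) for _ in range(count)])
```

      Calling `probe("orion-sql.example.internal")` (a made-up hostname) from the poller returns the median and worst connect times plus a pass/fail flag against the 1500 ms figure. Treat it as a first smoke test only; DPA or SQL-level timing would give a truer picture.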

      Paging jbiggley at adatole's suggestion to weigh in.
      TL;DR I have to move my Orion environment to AWS because of corporate politics. Any advice is very appreciated.
      So, what THWACKsters out there have installed/migrated Orion in the Cloud either by choice or by corporate politics gunpoint?

       

      Edit for grammar DERP.