Orion Platform 2020.2: The Easiest Upgrade in Years

Hi, I am @jm_sysadmin, but you can call me Jake. I work at Nationwide Children's Hospital as a Senior Systems Engineer, have been with NCH for 16 years in various engineering roles, and have been a THWACK MVP for about three years. My company's first product on the Orion Platform was IP Address Manager about 10 years ago. Since then, the number of products and servers has grown. Before I specifically talk about my upgrade experience, let me tell you a little about my deployment.

My Monitoring Infrastructure

[Image: monitoring infrastructure diagram]

All of our Orion servers (and the SQL Server) are virtualized and running Windows 2016. We’re using VMware as our hypervisor of choice and the monitoring environment in our main data center consists of a SQL Server, a Main Polling Engine, two Additional Polling Engines (APEs), and an Additional Web Server (AWS) for normal production workloads.

Upgrade Summary

Monitoring Infrastructure

  • Datacenter
    • Main Polling Engine (MPE)
    • 2 × Additional Polling Engines (APEs)
    • 1 × Additional Web Server (AWS)
  • Disaster Recovery Site
    • 1 × Main Polling Engine HA (MPE+HA)
    • 2 × Additional Polling Engines HA (APE+HA)
  • O/S: Windows Server 2016
  • SQL: SQL Server 2016 SP2
  • Primary Orion Database: ~100 GB
  • 9 products on 7 Orion servers

Monitored Elements

  • 5,000 Nodes (2,800 Network / 2,200 Servers)
  • 16,000 Interfaces
  • 8,500 Volumes
  • 2,600 Applications
  • 6 Virtual Centers
  • 24 Hypervisor Clusters
  • 250+ Hypervisors
  • 8,000 Virtual Machines
  • 13 Storage Arrays
    • 9 × Pure Storage
    • 2 × Nimble
    • 2 × NetApp
  • 41 Storage Pools
  • 1,300 Storage LUNs

Upgrade Plan

  1. Disable alert actions
  2. Backups and snapshots
  3. Install any prerequisites
  4. Disable High Availability (HA)
  5. Upgrade MPE via web
  6. Upgrade MPE+HA via web
  7. Upgrade APEs, APE+HAs, and the AWS via web
  8. Planned Change Window: 6 hours

Upgrade From/To

  • NPM 2019.4.1 --> 2020.2
  • NCM 2019.4.1 --> 2020.2
  • NTA 2019.4.1 --> 2020.2
  • SAM 2019.4.1 --> 2020.2
  • SCM 2019.4.1 --> 2020.2
  • SRM 2019.4.1 --> 2020.2
  • VMAN 2019.4.1 --> 2020.2
  • VNQM 2019.4.1 --> 2020.2

Total Upgrade Time: 75 minutes

The SQL Server is dedicated to the Orion Platform. It runs a single instance, and no other applications use that instance. The SQL Server is backed up locally, and the backups are shipped to the Disaster Recovery (DR) site.

In the DR data center, we have HA configured for the Main Polling Engine and the two APEs. The DR site isn't on campus, but it's connected over dark fiber, so it's close enough that we can almost pretend it's local. In the event of a failure, we have a standby SQL Server at the DR site where we can perform a restore, then point web traffic to the HA Main Polling Engine using a DNS CNAME.

Most of this is pseudo-automated and can happen quickly if needed, but I have a personal goal to improve it even more. When things go wrong, that’s when you need the Orion Platform the most.

On our Orion Platform, we have several products, most with very large licenses due to our organizational requirements. All of them were on 2019.4.1 prior to looking to upgrade.

Our products are listed in the Upgrade From/To section above.

Our database has about 100 GB of data in it right now, and our retention levels are pretty standard. Data retention contributes directly to database size, which in turn affects upgrade time and overall performance, so I figured it would be best to share. (A quick way to check your own database size is sketched right after the list.)

  • Retain Detail Stats: 7 Days
  • Retain Hourly Stats: 14 Days
  • Retain Daily Stats: 365 Days
  • Retain Container Detail Stats: 7 Days
  • Retain Container Hourly Stats: 14 Days
  • Retain Container Daily Stats: 365 Days
  • Retain Events: 45 Days
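
If you want to check your own database size before an upgrade, a minimal sketch with Python and pyodbc could look like the following. The driver, server, and database names are placeholders, not my actual values.

```python
# Minimal sketch, assuming the "ODBC Driver 17 for SQL Server" and a Windows-authenticated login.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sql-orion.example.com;DATABASE=SolarWindsOrion;"  # placeholder server/database names
    "Trusted_Connection=yes;"
)
cur = conn.cursor()

# sp_spaceused with no arguments reports the size of the current database
cur.execute("EXEC sp_spaceused")
db_name, db_size, unallocated = cur.fetchone()
print(f"{db_name}: {db_size} total, {unallocated} unallocated")
```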

We have 122 alerts enabled. Leon would be tempted to publicly shame me, but most of our alerts are for recordkeeping, kicking off scripts to automate things, or other internal tracking. Only a handful notify humans, so hopefully I’ll be spared his wrath.

These are some rough numbers about what we're monitoring (a sketch for pulling a few of these counts from the API follows the list):

  • Nodes: 5,000 (2,800 Network Gear / 2,200 Servers)
  • Interfaces: 16,000
  • Volumes: 8,500
  • Virtual Centers: 6
  • Applications: 2,600
  • Hypervisor Clusters: 24
  • Hypervisors: 250+
  • Virtual Machines: 8,000
  • Storage Arrays: 13 (9 × Pure Storage, 2 × Nimble, and 2 × NetApp)
  • Storage Pools: 41
  • Storage LUNs: 1,300
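
If you'd rather pull numbers like these from the API than count them by hand, here's a rough sketch using the orionsdk Python package. The hostname, account, and the exact SWQL entity names are my assumptions here, so adjust them for your own deployment.

```python
import requests
from orionsdk import SwisClient

requests.packages.urllib3.disable_warnings()  # self-signed cert on the Orion server
swis = SwisClient("orion.example.com", "svc_readonly", "********")  # placeholder host/account

# Placeholder SWQL entities; verify the names in your own SWIS schema
counts = {
    "Nodes": "SELECT COUNT(NodeID) AS C FROM Orion.Nodes",
    "Interfaces": "SELECT COUNT(InterfaceID) AS C FROM Orion.NPM.Interfaces",
    "Volumes": "SELECT COUNT(VolumeID) AS C FROM Orion.Volumes",
    "Applications": "SELECT COUNT(ApplicationID) AS C FROM Orion.APM.Application",
}

for label, swql in counts.items():
    result = swis.query(swql)["results"][0]["C"]
    print(f"{label}: {result}")
```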

Monitoring the complete application stack (from web/application interface down to the storage) is critical for us in troubleshooting problems quickly. We’re also running Virtual Desktop Infrastructure (VDI) within some of our hypervisors. None of the VDI machines are monitored in the Orion Platform as nodes (because of the ephemeral nature of VDI machines), but they’re monitored as part of the virtual infrastructure monitoring.

If you’ve read this far, I’ll guess you’re interested in my upgrade story. You have an understanding of my infrastructure, and now I can share the details about crafting the plan and executing the upgrade.

Preparation, Planning, and Upgrade

July 28

Day and time for upgrade chosen (August 12 at 9 a.m.), approved, and “power users” informed. This is contingent on the Change Approval Board (CAB) giving me the thumbs up.

Change control plan made with the help of the “My Orion Deployment—Orion Update Tool” to identify the current and planned versions we would upgrade to.

July 30

CAB reviews and approves final plan.

I went to “Make a Plan” under the “Updates & Evaluations” tab on “My Orion Deployment.” Unfortunately, it came back not so good.

[Screenshot: the upgrade pre-flight check showing connection failures to the other servers]

Oh no! My main engine couldn’t talk to any of the other servers for the upgrade check. I knew they were polling and reporting data, so this looked to be only a communication issue regarding the upgrade process.

Time to check on the normal things (a rough script for the first check follows the list):

  • Are all the SolarWinds services running?
  • Restart the services on all the servers.
  • Reboot the servers.
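
For that first check, a rough scripted version is below. It assumes Windows PowerShell 5.1 and rights to query services remotely; the server names are placeholders.

```python
import subprocess

SERVERS = ["orion-mpe", "orion-ape1", "orion-ape2", "orion-aws"]  # placeholder names

for server in SERVERS:
    # List any SolarWinds services that are not currently running on the remote server
    ps = (
        f"Get-Service -ComputerName {server} -DisplayName 'SolarWinds*' | "
        "Where-Object Status -ne 'Running' | Select-Object -ExpandProperty DisplayName"
    )
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", ps],
        capture_output=True, text=True,
    )
    stopped = [line for line in result.stdout.splitlines() if line.strip()]
    status = "all SolarWinds services running" if not stopped else f"stopped: {stopped}"
    print(f"{server}: {status}")
```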

Well, that normally does it. I admitted to myself I might not know everything, checked the Success Center, searched the knowledge base, and immediately found an easy fix! Just kidding, nothing really applied. Normally I can find my answer in a Success Center knowledge base (KB) article, but this time none of them was a 100% match for my circumstances.

August 4

After reflecting a little bit, I decided I went as far as I could on my own. Worst case, I knew I could get away with manual installs like I’ve done in the past, but I really wanted to give the full web install a try. I opened a ticket with SolarWinds support, documenting the issue as best I could as a medium severity ticket.

Why a medium severity and not higher? The systems were all still polling and otherwise working 100%, so it wasn’t a critical issue, I had some time before my planned upgrade, and I had a fallback to install manually if needed.

August 7 – 10

On a Friday, I started emailing with the SolarWinds Technical Support engineer who took the case. He provided several KB articles I hadn't seen before to review, in hopes we could get the upgrade engines to connect. We went back and forth trying different things, including disabling High Availability (HA) on my polling engines (I need to do this before an upgrade anyway, so doing it now doesn't hurt).

Note to self: always check your case status at 6 p.m. The case was still open over the weekend because I didn’t close the loop.

August 10

Today was a good day: I added more memory to my SQL Server. During troubleshooting, we identified we might need more resources. We tend to run significantly under the admin guide recommendations. I say we’re frugal (or I use the word “cheap” when I’m in a sour mood); my VMware admins say we’re “rightsized.” More specifically, I said we can’t live without it. We got the SQL Server bumped from 52 GB to 96 GB of RAM. The monitoring databases behave much like OLTP databases, so having enough RAM and high-speed disks in your SQL Server makes the Orion Platform happy.
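
If you want to sanity-check the memory situation yourself after a bump like this, a minimal sketch with pyodbc might look like the following; the connection details are placeholders.

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sql-orion.example.com;"
    "DATABASE=master;Trusted_Connection=yes;"  # placeholder server name
)
cur = conn.cursor()

# Configured ceiling for the buffer pool
cur.execute(
    "SELECT value_in_use FROM sys.configurations WHERE name = 'max server memory (MB)'"
)
print("max server memory (MB):", cur.fetchone()[0])

# Page life expectancy is a quick sanity check that the buffer pool isn't starved
cur.execute(
    "SELECT cntr_value FROM sys.dm_os_performance_counters "
    "WHERE counter_name = 'Page life expectancy' "
    "AND object_name LIKE '%:Buffer Manager%'"
)
print("page life expectancy (seconds):", cur.fetchone()[0])
```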

August 11

After a Webex with SolarWinds Technical Support, the root issue was still not found. We agreed to run the upgrade via the executables and manually upgrade servers as needed. I was hoping after the upgrade of the Main Polling Engine, the other servers would start communicating. In a quick effort to validate my thinking was correct, I struck up a conversation with Kevin and gave him my details. He’s not in support, but he’s done enough of these upgrades that a second set of eyes can’t hurt. He agreed with the plan and after discussing it with a few others on THWACK, I also agreed it was our best plan of action. Again: worst case, I can upgrade by hand like I’ve done in the past.

Thank you @alain.sotto for working this case with me. We’re ready for upgrade day.

August 12, 2020 - Upgrade Day

8:00 a.m.

Notification emails explaining upcoming downtime drafted and sent to the Server and Network Device Engineering and Operations teams.

I downloaded the installer files using the Orion Upgrade tool to all the servers I could, and by some dark magic, all but my Additional Web Server (AWS) were green.

[Screenshot: installer staging status, green on every server except the Additional Web Server]

Part of me wanted to run it right there, but I had to wait until my change window started. (I still had an hour before I could take the systems offline.)

I dropped the DNS CNAME TTL we use for the Orion Platform to one minute. I chose to do this in case I wanted to direct traffic to a specific server. Hopefully I won't need it, but fortune favors the bold (well, the prepared).
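
For anyone who prefers to script a TTL change like that, here's a rough sketch using the dnspython package. It assumes the zone accepts dynamic updates from your host; with an AD-integrated, secure-only zone you'd use GSS-TSIG or just change it in the DNS console. The names and addresses are placeholders.

```python
import dns.query
import dns.update

# Re-assert the CNAME with a 60-second TTL in the placeholder zone "example.com"
update = dns.update.Update("example.com")
update.replace("orion", 60, "CNAME", "orion-mpe.example.com.")

response = dns.query.tcp(update, "10.0.0.53")  # placeholder authoritative DNS server
print("DNS update rcode:", response.rcode())
```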

Next, I unmanage the Orion servers from within the Orion Platform. I know new SAM templates should be used after the upgrade, so disabling the old ones now in preparation for the new templates is just good form.
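
Unmanaging a handful of nodes is quick in the UI, but if you'd rather script it, a minimal sketch with the orionsdk package could look like this. The hostname, account, and caption filter are placeholders, not how my servers are actually named.

```python
from datetime import datetime, timedelta

import requests
from orionsdk import SwisClient

requests.packages.urllib3.disable_warnings()
swis = SwisClient("orion.example.com", "svc_orion", "********")  # placeholder host/account

# Placeholder caption filter; adjust to however your Orion servers are named
rows = swis.query(
    "SELECT NodeID, Caption FROM Orion.Nodes WHERE Caption LIKE 'orion-%'"
)["results"]

start = datetime.utcnow()
end = start + timedelta(hours=6)  # match the planned six-hour change window

for row in rows:
    # Orion.Nodes exposes an Unmanage verb: (netObjectId, unmanageTime, remanageTime, isRelative)
    swis.invoke(
        "Orion.Nodes", "Unmanage",
        "N:{}".format(row["NodeID"]), start.isoformat(), end.isoformat(), False,
    )
    print("Unmanaged", row["Caption"], "until", end.isoformat())
```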

Disabling the High Availability pairing is normally part of preparing for an upgrade, but since we already did it with support during our troubleshooting, I didn’t need to do it now. I’ll just add it to my list of things to re-implement after the upgrade.

9:00 a.m.

Time for one last chat post, in a channel specifically for Change Control notices, to tell all the various Information Services teams the upgrade is starting.

I validate nightly SQL backups one last time, shut down, snapshot, and power the servers back up (safety first).

Go to the "All Active Alerts" page and pause the actions of all alerts. This way, after the upgrade, you can let the missed polling cycles settle out and verify none of your alerts behave differently than before the update, without mass emails distracting all your coworkers while you address anything that needs attention.
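
If you want a before/after record to compare against, a rough sketch for snapshotting which alerts are enabled before pausing actions (again using the orionsdk package; connection details are placeholders) might be:

```python
import json

import requests
from orionsdk import SwisClient

requests.packages.urllib3.disable_warnings()
swis = SwisClient("orion.example.com", "svc_readonly", "********")  # placeholder host/account

rows = swis.query("SELECT AlertID, Name, Enabled FROM Orion.AlertConfigurations")["results"]
enabled = [r for r in rows if r["Enabled"]]
print(f"{len(enabled)} of {len(rows)} alerts currently enabled")

# Keep a copy to diff against after the upgrade
with open("alerts_before_upgrade.json", "w") as fh:
    json.dump(rows, fh, indent=2)
```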

[Screenshot: pausing alert actions from the All Active Alerts page]

9:15 a.m.

I get delayed by my manager with questions on completely unrelated topics. (This even happens to the best of us, right?)

9:26 a.m.

OK, now I'm ready to kick off the upgrade. Installation on the Main Polling Engine started via the web. As I feared, the communication issues mostly returned. One Additional Polling Engine was green, so at least two servers (the Main Polling Engine and this APE) will be upgraded using the fancy web install pages.

9:37 a.m.

Install completed and the Configuration Wizard on the Main Polling Engine started. The config wizard does three things: configures services, updates the database, and builds the website. In previous upgrades, the configuration wizard is where most of my time went, and it was not exciting. It was just staring at a progress bar.

9:55 a.m.

The configuration wizard completes on my Main Polling Engine. I'm sorry, what? In the past, it was not unheard of for the config wizard to take up to three hours. It just completed in 18 minutes against a 100 GB database! I'm sure my previously memory-starved SQL Server had something to do with those three-hour runs, but 18 minutes? I'm just dumbstruck.

As soon as the configuration wizard on the Main Polling Engine finished, the online install for the one green Additional Polling Engine started without waiting for my input. I literally could have walked away.

Clearly this is much faster than before, but I still have five servers to think about after these two complete.

10:05 a.m.

First Additional Polling Engine install completes and starts its own configuration wizard.

10:10 a.m.

First Additional Polling Engine configuration wizard completed! I’m sorry, five minutes? Color me impressed. I clicked “finish” on the Update Tool, then restarted the servers that were upgraded. While the servers were rebooting, I had my fingers crossed hoping to see green when I logged back in.

10:13 a.m.

Orion Deployment—Upgrade Tool connects to all the servers this time. The first two (the original green ones) look up to date. Using the web-based tool, I initiated downloads and installs on all other servers. I note @KMSigma still has some admin chops and isn’t just a pretty face.

I kick off the upgrade for the other servers and we’re off to the races.

[Screenshot: the remaining servers upgrading in parallel from the web-based tool]

10:25 a.m.

First of the five servers to be upgraded completes—an Additional Polling Engine at my primary data center.

10:26 a.m.

The Additional Polling Engine at my primary data center completes and the configuration wizard begins. It completes sometime in the next few minutes, but I missed seeing it as the other servers were doing their thing.

10:31 a.m.

Additional Web Server upgrade and config wizard completes.

10:34 a.m.

HA for Additional Polling Engine upgrade and config wizard completes.

10:41 a.m.

HA failover for my Main Polling Engine completes.

That's it. That was the last server to complete its install and configuration wizard. My platform upgrade is complete. It started at 9:26 a.m. and finished at 10:41 a.m. That's 75 minutes for seven servers! I've got hours left in my change window. Talk about under-promising and over-delivering, right?

To finish up, I go about getting everything back to normal.

Post-Upgrade Steps

After the upgrade, there are a few things I need to do either to revert to the state where things were before or to prepare for the future.

  • Reboot everything and validate services automatically startup and run correctly (always a good step).
  • Set up HA again (mentioned earlier).
  • Commit virtual machine snapshots—don’t leave these floating out there. Delta disks are an I/O sinkhole. Clean them up when you are done with them.
  • Revert the DNS CNAME TTL change I put in place before the upgrade.
  • Resume Alert Actions since things look good (except one of the mailbox servers in our Exchange infrastructure seems to have stopped playing nice, but this is a legit issue, so let those alerts fly).
  • Remove the “What’s New” widgets from summary homepages. In the past, I’ve found these can be distracting for the other staff members, particularly if it’s a feature where I need to do the work and I haven’t had the time to learn the feature. I’ll take care of notifying the various teams of new features as I get up to speed on them.
  • Validate any customizations still work. Most work right out of the box, but I have one application status summary for management needing a little bit of love to get back to 100%.
  • Update SAM templates to monitor the Orion Platform 2020.2, assign them, and remove the old ones.
  • Notify staff of upgrade completion via email and chat.
  • Carve out a little time on my calendar to get familiar with new features, including re-reading the 2020.2 GA announcements (Scalability and Performance Improvements, Enhanced Volume Status, Orion Map improvements, Modern Dashboards).
  • Make some documentation about what the new features are and send some links to power users, so they see the new stuff right away. It’s good to associate downtime with new stuff and not just potential missed alerts.

Summation

My total upgrade for seven servers took 1 hour, 15 minutes (9:26 a.m. to 10:41 a.m.). Considering that a few versions ago just the configuration wizard on one server (my Main Polling Engine) took over twice that long, I was stoked. This is an amazing improvement made by the Orion Platform team. They should be very proud of the work they've done.

Although I was biting my nails at the beginning of this upgrade, it went smoother than I could have imagined. Upgrading via the web console is a blessing: no more RDP sessions, no more questioning which server is on what step, no more waiting for one server to finish before starting on the next. It's no exaggeration to say that doing multi-server upgrades in parallel saves a boatload of time (28 minutes for five servers). Even the first machine, the Main Polling Engine, installed and upgraded in what felt like record time.

For me, I’m most looking forward to playing around with the new Modern Dashboards. I like how I can build what I need for me while other teams can build what they need for themselves. Then if either of us like a particular widget the other built, we can just “steal” it from the other dashboard to add to our own.

  • Orion has a separate download that can check the patch level. Use the service management tool to disable SolarWinds itself. This way you can have the server on but not doing anything. Have your security team keep checking for outbound communication and validating nothing is going out.

  • Thanks for sharing, Jake, and nicely done. I'm really happy everything ran better than you expected.

  • Great write-up. One thing just to chime in, I got similar connectivity "issues" or "warnings" both during our upgrade and our latest hotfix application (which I got done yesterday).

    Regarding the hotfixes, the pre-flight check said my APE couldn't be centrally upgraded due to "product inconsistencies". I talked to support and we agreed to do the main poller using the web and do the APE manually if necessary because trying to find/fix the alleged inconsistencies was a bust.

    We went to stage the files and the APE was happy, said it could be centrally upgraded, we went forward, and in roughly 30 minutes we were running the latest and greatest on our main poller and APE.

  • That's great! After a little sleep (change windows can be brutal), you'll have to let us know how much you like the new features!

  • We actually completed our upgrade last night from 2019.4, and I'm still a little sleepy from the late start but not so late finish!

    We were unsure if the new, 'quicker' approach would work, so we had already downloaded the offline install update and manually copied it to our 14 servers. Anyway, colour us very happy. After starting the upgrade on our primary NPM box, we thought we'd gamble and dive into the assisted upgrade.

    Just over 2hrs later we returned to normal service and our change window still had 3hrs to run.
