Disclaimer: You can do some of this already, by having two alerts for each group of cloudy elements, but let's make Orion think smart about the cloud!
Picture this:
You have a group, and this group contains Azure or EC2 resource group nodes, and associated elements. Say they have applications assigned to them as well, for good measure.
Got it? Great. So, let's take a look at the alerts which encompass these devices:
You have alerts which include:
- Node down
- High response time
- High memory use
- High CPU usage
- An application template which looks at "Critical App#1"
You get the idea! These all work to notify you when any of these go down, or breach their thresholds. Sounds like we have everything covered, right? Wrong! The bean counters have decreed that all cloud-based servers are turned OFF out of hours!
Our operations teams have been inundated by alerts and our ITSM is chock-full of tickets. In short, Ops are sad-pandas. So, how can we fix this WITHOUT setting active hours on our alerts, and then creating yet another alert to tell us if somebody forgot to turn Skynet off? Simple! We use Cloud Maintenance!
This is an expansion of cloud management which tells Orion which times are production times for a given resource group, and which times all elements should be dormant, by setting rules for cloud-based elements. The Cloud Maintenance settings tell Orion how to handle alerting, how to display suspended instances while they are dormant, and how to deal with devices which breach their dormancy periods.
Here's how I would see this working:
- When you set up your new cloud integration, an additional page of options is displayed, which allows you to configure Cloud Maintenance.
- This allows you to set dormancy intervals, per resource group. When a resource group is dormant, all alerts are muted OTHER THAN the Cloud Maintenance alert!
- The Cloud Maintenance alert (you can create more than one and assign them per Resource Group if required) will alert your chosen recipients when any instance in a resource group which should be powered off is still up after the dormancy period starts.
- Rules within Cloud Maintenance alerts allow you to AUTOMATICALLY power down said resource (if enabled), in a similar fashion to the way VMAN allows users to automatically manage resources within vCenters / Hyper-V. Peace of mind for all involved.
- Overrides can be set, per node (if managed as an Orion node), for patching etc, in a similar fashion to how node maintenance works now.
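To make the rules above a bit more concrete, here's a rough Python sketch of the decision logic. None of this is a real Orion API: `DormancyRule`, `alert_allowed`, the field names and the alert name string are all hypothetical, just my guess at how the pieces could fit together. It assumes a single daily dormancy window per resource group, which may run past midnight.

```python
from dataclasses import dataclass, field
from datetime import datetime, time

# Hypothetical name for the one alert that stays live during dormancy.
CLOUD_MAINTENANCE_ALERT = "Cloud Maintenance"


@dataclass
class DormancyRule:
    """Hypothetical per-resource-group rule (not an actual Orion object)."""
    resource_group: str
    start: time                      # dormancy begins
    end: time                        # dormancy ends (may be the next morning)
    auto_power_down: bool = False    # optional automatic power-down
    overrides: set = field(default_factory=set)  # nodes exempt, e.g. for patching

    def is_dormant(self, now: datetime) -> bool:
        t = now.time()
        if self.start <= self.end:
            return self.start <= t < self.end
        # Window wraps past midnight, e.g. 20:00 -> 06:00.
        return t >= self.start or t < self.end


def alert_allowed(rule: DormancyRule, alert_name: str, node: str,
                  powered_on: bool, now: datetime) -> bool:
    """Return True if this alert may fire for this node right now."""
    if node in rule.overrides or not rule.is_dormant(now):
        # Normal alerting applies; the Cloud Maintenance alert has nothing to report.
        return alert_name != CLOUD_MAINTENANCE_ALERT
    if alert_name == CLOUD_MAINTENANCE_ALERT:
        # During dormancy, only fire if the instance is still up.
        return powered_on
    # Every other alert is muted while the group is dormant.
    return False
```

With a rule of 20:00 to 06:00, a "Node down" alert at 23:00 is muted, while the Cloud Maintenance alert fires only for instances which are still powered on. A node listed in `overrides` is treated as if no dormancy applied.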
Not only would this streamline the management of cloud resources, but it would also allow organisations to ensure the mandatory power-down rules are respected.
Dormant nodes which are powered off will have a new status and colour within Orion, perhaps a faded cloud? Whatever it is, it'll be obvious when you see it.
Similarly, cloud instances which are breaching the dormancy rule will have another clear status, and there should be a dashboard widget which covers this.
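As a sketch of how those statuses and the widget could be derived, something like the following would do. The status names and the `classify` / `widget_summary` helpers are made up for illustration, and it assumes each instance is described only by whether its resource group is currently in a dormancy window and whether it is powered on.

```python
from collections import Counter
from enum import Enum


class CloudStatus(Enum):
    """Hypothetical status names; the real ones would be up to the product team."""
    ACTIVE = "active"
    DORMANT = "dormant"                 # powered off, as expected
    BREACHING = "breaching dormancy"    # still up when it should be off
    DOWN = "down"                       # off during production hours


def classify(in_dormancy_window: bool, powered_on: bool) -> CloudStatus:
    if in_dormancy_window:
        return CloudStatus.BREACHING if powered_on else CloudStatus.DORMANT
    return CloudStatus.ACTIVE if powered_on else CloudStatus.DOWN


def widget_summary(instances):
    """Count statuses for a dashboard widget.

    instances: iterable of (name, in_dormancy_window, powered_on) tuples.
    """
    return Counter(classify(dormant, on) for _, dormant, on in instances)
```

The widget would then only need to show the counts per status, with the breaching instances listed first.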
This is very much version 1 of this idea, but for me this would be a natural evolution of cloud support. Keeping down the cost of the cloud should be something we all want to optimise, and having NPM help with that will go a long way to driving home the relevance (and expertise) of Orion in the hybrid-IT era.
Comments welcome, Thwacksters!
Edit: Renamed to "Cloud Maintenance" due to sensible feedback! Not everything has to have a buzz word, after all.