Identifying Cost-Saving Opportunities in Azure DevOps
You and your company are likely dealing with the new reality of a contracting and uncertain economy. Now is a good time to deliver value by turning your attention toward identifying some cost-saving measures for the various systems you are using.
The process of examining, observing, and reporting on spend is part of a healthy DevOps lifecycle, no matter the state of the economy. Therefore, if you are reading this post-COVID-19, it should still help you properly evaluate your spending within your Azure DevOps estate. With that in mind, I want to share some strategies for saving money in Azure DevOps that I think you'll find very useful under the current circumstances.
The cost of any software delivery process depends on the efficiency of your value stream. There are easy-to-quantify material costs, such as the number of licenses you are using. There are also plenty of costs that exist a few degrees of separation from any simple number on a spreadsheet. How you weigh more complex costs is largely dependent on your specific business. Only business insiders (or a highly paid consultant) are going to be able to give you the correct modifier for the value a given phase of your process is providing your team. (This is my way of saying not everything in this post is universally applicable!)
I am going to focus on two areas for potential savings:
- The first is license usage, which is easier to measure.
- The second, pipeline usage, is more difficult to measure.
As I discuss each area, I will cover how to manually audit costs and share scripts to help automate the process. At the end of this post, I share an Azure Pipeline that can tie all the automation together and start tracking spend continuously. (If you are a code-first type of person, you can access the Azure Pipeline here.)
The first step to evaluating your cloud spend is to audit your license usage within Azure DevOps. Azure DevOps has three or four different tiers of license usage, depending on who you ask.
- Basic—First five users free, then $6 per user per month
- Basic + Test Plans—$52 per user per month
- Visual Studio Subscriber—$45 - $250 a month
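To put rough numbers against these tiers, here is a quick back-of-the-envelope sketch. The head counts and the $45 subscriber rate are illustrative; plug in your own numbers and negotiated rates:

```python
# Rough monthly license cost model based on the list prices above.
# Head counts and the $45 Visual Studio rate are illustrative assumptions.

def monthly_license_cost(basic_users, test_plan_users, vs_subscribers, vs_rate=45):
    basic = max(basic_users - 5, 0) * 6   # first five Basic users are free, then $6 each
    test_plans = test_plan_users * 52     # Basic + Test Plans at $52 per user
    vs = vs_subscribers * vs_rate         # Visual Studio subscribers ($45-$250 range)
    return basic + test_plans + vs

# Example: 25 Basic users, 3 Test Plans users, 10 Visual Studio subscribers
print(monthly_license_cost(basic_users=25, test_plan_users=3, vs_subscribers=10))  # → 726
```

Stakeholder licenses are free, so they never enter the calculation; that is one reason auditing who actually needs a paid license pays off.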
I won’t dive too deep into who should have what type of license in this post because that is dependent on how your team operates. As a general guideline, I would expect that:
- Your engineers have Visual Studio subscriptions.
- Teams adjacent to your engineers have basic licenses.
- Anyone else in your org who needs visibility has a stakeholder license (these licenses are free).
- Any team members outside of Visual Studio subscribers who need to review or access test plans on a regular basis might have the Test Plan add-on license. This would typically include product owners and possibly support engineers who are subject matter experts.
I would recommend taking a close look at the number of Visual Studio subscribers you have and making sure that everyone is actively using their license.
Alright, let's get a handle on who is assigned what license type. To pull this information manually, navigate to the Users page for your organization using the following URL:
In the top right corner, you should see an Export Users button. Clicking this button will export the user information in a .csv file.
I have also written a PowerShell script that pulls the user information for you using the Azure DevOps API and the user entitlements endpoint.
This script creates a markdown file similar to the .csv file produced via the Export Users button and includes an additional column flagging any users on paid license types who have not accessed Azure DevOps in the past thirty days.
Below is a preview of what the script will output. I have drawn an arrow to highlight that the license for this user is Out of Compliance, as it's a paid license type and it has been more than 30 days since this user accessed Azure DevOps.
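The core of that compliance check is simple. Here is a minimal Python sketch of the logic; the field names (`license`, `lastAccessedDate`) and sample users are simplified stand-ins for the actual user entitlements response, not its exact shape:

```python
from datetime import datetime, timedelta, timezone

# Paid license display names (simplified; match against your actual entitlement data).
PAID_LICENSES = {"Basic", "Basic + Test Plans", "Visual Studio Subscriber"}

def out_of_compliance(user, now=None, days=30):
    """Flag users on a paid license who have not signed in within `days` days."""
    now = now or datetime.now(timezone.utc)
    if user["license"] not in PAID_LICENSES:
        return False  # stakeholder licenses are free, never out of compliance
    last_access = datetime.fromisoformat(user["lastAccessedDate"])
    return (now - last_access) > timedelta(days=days)

# Hypothetical audit data, evaluated against a fixed "now" for illustration.
now = datetime(2020, 6, 1, tzinfo=timezone.utc)
users = [
    {"name": "dev1", "license": "Basic", "lastAccessedDate": "2020-05-28T00:00:00+00:00"},
    {"name": "dev2", "license": "Basic", "lastAccessedDate": "2020-03-01T00:00:00+00:00"},
    {"name": "pm1", "license": "Stakeholder", "lastAccessedDate": "2020-01-01T00:00:00+00:00"},
]
flagged = [u["name"] for u in users if out_of_compliance(u, now=now)]
print(flagged)  # → ['dev2']
```

Only dev2 is flagged: a paid license, idle for roughly three months. The PowerShell script applies the same rule when building its markdown report.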
We will work this script into our cost audit pipeline, but if you want to see it now, you can access it here.
When thinking about spend as it relates to pipeline efficiency, it's natural to focus on hosted pipelines. After all, there is a measurable direct cost ($40 per month) for each concurrent pipeline. These costs can add up quickly, especially if you are using thirty or forty concurrent hosted agents. (For example, the cost for forty agents is $1,600 a month.)
When considering these costs, you don't want to underestimate the potential hidden costs of the alternative to hosted pipelines, which is maintaining your own build agents. They are certainly not free. That cost is largely dependent on how much manual lifting you are doing and the scale at which you are operating.
There are plenty of costs associated with maintaining your own agents. That is true even when your process is relatively clean and you are using good Infrastructure as Code (IaC) practices. The cost is even greater if those pipelines are using hand-crafted machines, are people powered, and resource intensive to maintain. If that’s the case, the best time to invest in a better process is now.
Given the real cost of ownership for both Microsoft-hosted and self-hosted agents, the optimization ideas that follow should be applicable no matter which type of agent you are using. So, let's dive into some strategies to examine your pipeline usage and gain some efficiency.
- Use hosted agents. Okay, so I just said that both Microsoft-hosted agents and self-hosted agents have a real cost; however, in my experience, the $40 per month for Microsoft-hosted agents is absolutely worth it. (That is why this is the top suggestion.) The inevitable complexity of maintaining your own agents is rarely worth it. Hosted agents do not support all scenarios but, in general, will save you money, time, and sanity.
There are some exceptions. If your tooling needs can't be met by the Microsoft-hosted agents, or if self-hosted agents are cost effective in your case, you might consider basing your agents on the GitHub Actions Virtual Environments and extending them to meet your needs. You don't have to reinvent the wheel.
- Focus on optimizing the pipelines that trigger critical decision points for your team in delivering outcomes to your customers. If you have 25 product-focused pipelines, and only two of those products represent 75 percent of your customer base, focus on what is making you money. Invest your time and resources into those critical business inflection points. Consider this to be an opportunity to banish noise. Noise is cognitive load, and cognitive load is the enemy of efficiency.
- Next, examine your pipelines that have the longest run times and the highest run counts. Remember, a critical tenet of DevOps is a fast feedback cycle. Saving three minutes once is trivial, but saving three minutes for a pipeline that runs 20 times a day will free up potential compute time to the tune of an hour a day, more than a day per month, and more than 15 days per year.
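  That arithmetic is easy to sanity-check:

  ```python
  # Freed-up compute time from shaving minutes off a frequently run pipeline.
  saved_minutes = 3
  runs_per_day = 20

  per_day_hours = saved_minutes * runs_per_day / 60
  per_month_days = per_day_hours * 30 / 24
  per_year_days = per_day_hours * 365 / 24

  print(f"{per_day_hours:.1f} hours/day, {per_month_days:.2f} days/month, "
        f"{per_year_days:.1f} days/year")
  # → 1.0 hours/day, 1.25 days/month, 15.2 days/year
  ```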
- For key pipelines, consider devoting additional agents and parallelizing the workload into multiple concurrent stages. You will be occupying an additional agent with concurrent stages, but if the pipeline is business critical, it is worth it.
- Move your legacy GUI-based pipelines to YAML-based pipelines. This is especially true for your build definitions, as there is 100 percent feature parity and then some. Multi-stage pipelines, which encapsulate the ideas of build and release definitions, are not far behind. Once your pipelines are in code, it is much simpler to treat them like any other piece of code (code that needs to be fed, optimized, and owned).
- Ask yourself if each step, job, and stage is always needed for every pipeline run. Could you wrap any blocks in conditional logic and save time?
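  For example, a YAML pipeline can gate a job with a `condition` expression. The job name and script below are placeholders:

  ```yaml
  jobs:
    - job: Package
      # Illustrative: only run packaging on main; feature branch runs skip it.
      condition: eq(variables['Build.SourceBranch'], 'refs/heads/main')
      steps:
        - script: echo "packaging..."
  ```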
- When using hosted agents, take a look at the available caching mechanisms to cut down on repetitive restore operations.
- Closely examine any pipelines that have `*`-based (wildcard) triggers and ask yourself if your team is getting value from every run. Is there value in building each feature branch for every project? Can you cut down on the workload by instead building only pull requests and mainline builds for certain projects? Consider configuring some path filters to build feature branches only when critical parts of the code base are touched. You likely already know what those critical parts are.
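  A trigger block combining branch and path filters might look like this; the `src/critical-service` path is a hypothetical example:

  ```yaml
  # Build main on every push, but build feature branches only when the
  # critical part of the code base changes.
  trigger:
    branches:
      include:
        - main
        - feature/*
    paths:
      include:
        - src/critical-service
  ```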
- Consider modifying your pipeline to include an initial job or stage that can check pool queue length. For non-critical branches, you could choose to move to the next stage only if the current pool workload is light.
- Educate your teammates about the commit options that can suppress pipeline runs. Pushing a commit with any of the following will suppress triggers:
- `[skip ci]` or `[ci skip]`
- `skip-checks: true` or `skip-checks:true`
- `[skip azurepipelines]` or `[azurepipelines skip]`
- `[skip azpipelines]` or `[azpipelines skip]`
- `[skip azp]` or `[azp skip]`
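If you want to audit your own commit history for these markers, a quick check might look like the following. The case-insensitive substring matching here is a simplified approximation, not a reimplementation of Azure DevOps's exact parsing:

```python
# Tokens that suppress pipeline triggers when present in a commit message.
SKIP_TOKENS = [
    "[skip ci]", "[ci skip]",
    "skip-checks: true", "skip-checks:true",
    "[skip azurepipelines]", "[azurepipelines skip]",
    "[skip azpipelines]", "[azpipelines skip]",
    "[skip azp]", "[azp skip]",
]

def suppresses_ci(commit_message):
    """Rough check: does this commit message contain a CI-skip marker?"""
    msg = commit_message.lower()
    return any(token in msg for token in SKIP_TOKENS)

print(suppresses_ci("Fix typo in README [skip ci]"))  # → True
print(suppresses_ci("Refactor build scripts"))        # → False
```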
The point of this exercise should be to tune your pipeline workload, just as you would optimize a SQL query or .NET class. When your pipelines are running more efficiently, you are using fewer resources, engineers are context switching less, and your company is making more money.
There is a good bit of nuanced advice in the section above. Pipeline efficiency clearly falls into the bucket of more complex things to measure. To begin to tackle this area, there are some things you can measure pretty easily, such as queue time, run time, frequency of runs, and pipeline type.
Azure DevOps has added some great pipeline analytics views over the past year. On any pipeline, you can now see an Analytics tab that provides an awesome breakdown of the stats for the past 14, 30, or 180 days. The issue is getting a 30,000-foot view across multiple pipelines. For that level of information, you can turn to Azure DevOps Analytics OData.
There are some cool solutions out there that the community has provided leveraging the OData service. Colin Dembovsky and Wouter de Kort first exposed me to some of these solutions last year. FlowViz by Nicolas Brown is also a great example of leveraging additional analytics data. Here at SQL Sentry, we have rolled our own solution, which pushes pipeline stats from the OData service into our Grafana estate using an Azure Function. (Let me know if you would like to see a post on that solution!)
For the purposes of this blog post, we will be using PowerShell to build a report that pulls pipeline stats across an entire project directly from the OData endpoints. This is a great way to get started and can complement a larger observability dashboarding project.
The script is designed to work out of the box in any Azure DevOps collection. It makes use of the available pipeline variables to build the contextually correct URLs for querying the OData feed and other REST API endpoints. Here’s a quick summary of the script:
- Finds all the pipeline definitions that exist in the project using the Build Definitions REST API
- Loops through the pipelines, checking to see if there has been a run of that pipeline over the past 14 days using the Analytics OData service
- For each pipeline with recent runs, gathers the stats of the past 14 days and outputs those stats to a markdown file
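To give you a feel for the query side, here is a sketch of how such an OData URL can be assembled. The `v3.0-preview` endpoint shape and the `PipelineRuns`/`CompletedDate` names reflect my reading of the Analytics OData service; verify them against the metadata for your own collection:

```python
from datetime import date, timedelta
from urllib.parse import quote

def pipeline_runs_url(organization, project, days=14):
    """Build an Analytics OData query for pipeline runs completed in the last `days` days."""
    since = date.today() - timedelta(days=days)
    base = (f"https://analytics.dev.azure.com/{organization}/{project}"
            f"/_odata/v3.0-preview/PipelineRuns")
    # OData filter on the run completion timestamp.
    filt = f"CompletedDate ge {since.isoformat()}T00:00:00Z"
    return f"{base}?$filter={quote(filt)}"

print(pipeline_runs_url("myorg", "myproject"))
```

The PowerShell script builds equivalent URLs from the predefined pipeline variables, so it stays contextually correct wherever it runs.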
Here is a preview of what the script outputs so that you can get an idea of the stats produced.
I ran this script against one of our smaller projects and manipulated the data a bit to help illustrate the output (and to protect the innocent and guilty). We do not use this naming convention for pipelines, although I might pitch the idea of naming all the pipelines like this in the future!
The output is designed to give you a good starting point to jump in and decide which pipeline might be worth your time to tune. The first pipeline listed, SQL SentryPipeline-1, is a bit of a pig, with an average duration of an hour and a decently sized six-minute average queue. It's also still using the designer (i.e., it is a GUI-based build). My first instinct might be to work on converting this definition to YAML and start tuning it, but it only ran five times in the past 14 days. So, I think you could make an argument that your time might be better spent elsewhere.
Instead, it is probably a good idea to work on tuning SentryOnePipeline-2, which has run 66 times in the specified time period. It is getting canceled fairly often, and it also fails at a 21 percent clip. Seems like this pipeline is getting exercised a good bit and needs some love.
If you click the Analytics link, it will jump you over to the Azure DevOps analytics duration pipeline page. You can use this page to do a deep dive into what is contributing to your pipeline's duration. It has a breakdown of the top 10 steps by duration across the entire pipeline. You can also use the Stage drop-down menu to see a further breakdown of step duration per stage.
Here is a screenshot of one of the pipelines that we tuned recently for SQL Sentry Portal.
This pipeline suffered from some natural bloat as the project picked up steam. We identified it needed attention and set about optimizing it. Following the steps I detailed above, including breaking the workload out into parallel stages, we were able to get about a 40 percent improvement in build time—reducing it from 26 minutes to 14 minutes.
Since you can't actually have a proper Azure DevOps blog post without including an Azure Pipeline, I have wrapped both the cost savings license and the cost savings pipeline scripts in an azure-pipelines.yml. The pipeline does the following:
- Runs the cost-savings-pipelines.ps1 script
- Runs the cost-savings-license.ps1 script
- Publishes an artifact consisting of the generated markdown files for the audited users and the pipeline stats
- Pushes a Slack notification to a user or channel with those same markdown files
You should be able to take this pipeline, drop it in, and run it in your collection, with some limited configuration as follows:
- The Project Collection Build Service will need View Analytics rights for the project the pipeline is running in.
- If you would like to push a notification to a Slack channel, you will need a token to do so. That token needs to be set as a secret variable, SlackToken, in your pipeline. If you want to skip the Slack message, that's not a problem. Simply comment out or remove those tasks from the pipeline file; the markdown files will still be available to download as artifacts.
You could set a schedule and run this script weekly, daily, or on-demand. In our case, we have a good number of meta tasks that run on a weekly schedule. Go ahead and clone the repository to get started!
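For reference, a weekly schedule in the azure-pipelines.yml might look like this; the cron time and branch name are just examples:

```yaml
# Run the cost audit every Monday at 06:00 UTC, even if nothing has changed.
schedules:
  - cron: "0 6 * * 1"
    displayName: Weekly cost audit
    branches:
      include:
        - main
    always: true
```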
In this post, I shared some strategies to evaluate your Azure DevOps costs. We looked at a way to audit your Azure DevOps license usage and dove into pipeline efficiency (because time is money!). We leveraged PowerShell, the Azure DevOps REST API, and the OData analytics endpoints to pull additional data, and then wrapped it all in a reusable Azure Pipeline that can be run in any Azure DevOps project.
I have plans for a part 2 of this post, where I will walk you through auditing your artifact storage usage, auditing some general Azure spend as it relates to dev/test, and reporting all this information back to stakeholders.
Thanks for reading! I hope you found this blog post useful. If you want to connect and talk DevOps, pipelines, process, automation, or people, I am always interested. Reach out and connect with me on Twitter, LinkedIn, or StackOverflow.