What is CI/CD and How to Build ETL Processes
In today’s world of heterogeneous data ecosystems, managing and consuming data can be cumbersome. Organizations often have multiple systems of truth in corresponding to the applications managing the data.
While data engineers dream of software that would make it easy to consume and digest different data streams from disparate systems, that scenario rarely comes to fruition. Often, data professionals must develop a way to extract data from one source, potentially modify or massage the data to fit a particular data structure, and then subsequently insert into another data store. People in the industry often call this an ETL process.
Until recently, continuous integration (CI) and continuous deployment (CD), sometimes also called continuous delivery or continuous development, were found mostly in application development cycles. Combined, these two acronyms defined an overarching process in which code, regardless of whether it’s an application, ETL code, or even database code, is released seamlessly into environments. This process is manifested using tools specifically designed to conduct the code release with little to no human intervention. Due to this limited human interaction with the actual release process, errors are mitigated and reduced, thus allowing a team to spend more time on development work.
DevOps takes advantage of CI/CD by helping to bridge the divide between “development” and “operations,” thus DevOps is born. The idea of DevOps is it helps organizations at a holistic level to encompass multiple teams and deliver application changes at a faster rate of change. This increased delivery velocity further helps to accelerate ETL pipelines and improve product quality over the use of traditional development lifecycle management tools and processes. The latter typically delivers large and monolithic releases whereas the former delivers small, more easily controlled code changes which allows for better quality work.
Let’s examine the components of CI/CD in a DevOps world.
CI depicts how developers interact with source control, also called version control. Keep in mind the term “developers” doesn’t necessary imply application developers; it can be anybody who writes a line of code in any language and then checks it into source control. Even ETL developers can incorporate continuous integration into their workflows as they design data flows.
As developers work with subsets of code, often in independent branches, each developer checks each subset routinely into source control. Each check-in automatically triggers a build process that validates the checked-in code and subsequently builds a package that can then be released (if configured to do so).
One of the benefits of deploying continuous integration is rigorous testing can be incorporated into the build process to ensure the quality of every build. Another benefit of continuous integration is automation, which removes the human element from the deployments.
CD can stand for two different things reflecting different approaches to the deployment lifecycle. In Continuous Delivery, a code deployment reaches the environment just before production. This allows for some level of human intervention to verify and validate the deployment is suitable for the production environment. With Continuous Deployment, a code deployment goes straight to production without delay. This further means there won’t be human involvement to double check the release. When the build process is complete and passes all tests, the release is subsequently deployed to appropriate environments with no further intervention (Figure 1).
Figure 1: Continuous Delivery vs. Continuous Deployment
Keep in mind automated testing can fall into two basic categories at a high level. Test Driven Development (TDD) dictates that before a single line of code is written, it must be testable. This means a valid test must be written to ensure the resulting code can be tested. The other aspect of this is Code First, which basically states the code is written first and then tests are developed after the fact. Both are valid ways of development and either can be appropriate for your organization. In either case, automated testing is critical as it helps to provide better and faster feedback to not only developers but business users as well. It also helps facilitate a consistent and manageable way to ensure that as the lines of code increase, the testing is consistent and grows right along with it. Automated testing can be integrated into the CI/CD world via pipelines.
One key aspect of any CI/CD methodologies are pipelines. Pipelines are typically considered to be a unique set of grouped together tasks.
Some example pipelines might be:
- Build – this is where the application changes are developed and compiled
- Test – Upon a successful check-in, everything is tested thoroughly as possible
- Deploy – Given all tests are passed, release the deployment to any or all environments
- Verification – Lightly (or smoke) test things to ensure the deployment was accurate
Keep in mind the above pipelines may or may not be suitable for your unique organization. Many businesses must flex the pipelines, so they better suit their own individual requirements. That’s the nature of CI/CD and DevOps and that’s the way it’s supposed to be done. It’s flexible.
ETL stands for Extract, Transform, and Load. Any process that consumes data from a source, modifies it, and then pushes it to another data repository is doing an ETL process. It doesn’t matter what development platform the process is constructed in: if it’s moving and massaging data prior to loading it somewhere, it’s doing ETL.
The actual use of data usually involves an online transactional processing (OLTP) system, which runs something such as a point-of-sale system that processes business transactions. In most cases, an ETL process runs independently of the OLTP system, so while you develop and release the ETL process, you lower the risk to systems making you money. However, this doesn’t mean a release process for an ETL process shouldn’t be carefully orchestrated. This is where CI/CD comes into focus.
While continuous integration is great, source control is more central to modern development. Coding efforts of any nature should be kept in source control. This means application code, database code (database queries and updates are just code, after all), and even ETL code should be kept in a source control repository. There are multiple source control vendors today to choose from, offering different attributes for a wide range of requirements.
Source control manages and tracks any changes to code. Remember source control is an integral part of the continuous integration lifecycle, which starts with checking a code change into a repository. A commit to the main branch of the source control is commonly the trigger that kicks off the integration and build process, which subsequently leads to continuous delivery or deployment.
If you want to start down the path of continuous integration and continuous deployment, source control is where to start. This is the foundation on which you’ll build automated deployment to your respective environments.
Now that we’ve talked about the components of CI/CD, how do we get an ETL process into it? The key is to remember ETL is just another coding language. If you develop a SQL Server Integration (SSIS) package (which does ETL work), it’s composed of a file containing XML, which is code. In general, SSIS packages are constructed with Visual Studio, an integrated development environment (IDE) that can be fully integrated into a source control platform. This means any changes to the package are tracked, managed, and then subsequently built and deployed when changes are checked in and continuous integration takes over.
If you’re just starting to build an ETL process, check to make sure whatever platform you’re using can integrate well with a source control system. Most ETL platform vendors today will work with one version of source control or another.
Continuous Integration and Continuous Delivery/Deployment are great things to have in your environment. When deployed correctly, these tools offer up a better work-life balance for everybody involved because when implemented right, the less humans must be involved and provide a clear and concise path to deliver code changes to respective environment. Development teams embracing agile methodologies (like Scrum or eXtreme Programming) will find CI/CD tools enhance the ability to deliver code changes faster and better.
As you embark on your path to enlightenment, make sure to try SolarWinds Task Factory. Its plethora of features allows for numerous enhancements to add to your ETL process, including helping you make SSIS packages run faster without heavily increasing your development costs. You can download a free 14-day trial of Task Factory here.