Finding the Gaps in Your Data Causing Data Drift
When drift happens within a database, it can occur at a couple of different levels. Drift occurs when entities (tables, views, or even the data itself) fall out of synchronization with each other. This could be a difference in schema structure, in data, or even in operational metadata like permissions. Often, drift happens between two different environments, such as development and staging databases. If your data is static and you use a continuous delivery model for deployments, you could even have drift between what’s contained in your source control (you’re using source control, right?) and upper environments.
Keep in mind your production database should be the source of truth for transactional data. Transactional data is the data driving your organization’s business. Static data, which is reference data driving applications (for example, populating a drop-down menu), should be kept safe in a source-controlled repository along with the database schema.
Of course, over the course of a development life cycle, some data drift is expected and normal. As application requirements are defined, data corresponding to those requirements might need to be altered. For example, “test” data values may get propagated into a database. This will cause a lower environment, like development, to be out of sync with a higher environment until the change is deployed all the way to production.
The notion of data drift exists in the world of machine learning (ML), too. Data drift in the context of ML refers to the change in model input data leading to model performance degradation. This is a different type of data drift outside the scope of this article. This article addresses data drift occurring within online transactional processing systems.
So how does data drift happen, and how do we fix it?
Drift can happen when data is missing: rows have somehow disappeared from the destination table, and without the data, the application may not function correctly, if at all.
If the data is new, meaning you’re adding to the data set as part of a development cycle, then this is to be expected. This data hasn’t been deployed yet. However, if the missing data is unexpected, this should be investigated immediately to discover who or what caused the data to be removed.
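To investigate missing rows, an anti-join against the known good source is a quick first check. A minimal sketch, assuming hypothetical SourceDatabase and TargetDatabase names on the same instance and a dbo.Sales table keyed on id:

```sql
-- SourceDatabase, TargetDatabase, and the id key are assumed names.
SELECT s.id
FROM SourceDatabase.dbo.Sales AS s
LEFT JOIN TargetDatabase.dbo.Sales AS t
    ON t.id = s.id
WHERE t.id IS NULL;  -- rows present in the source but missing from the target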
Sometimes the data in a production system differs from the data in a test environment. If you don’t have, and adhere to, a well-defined deployment process, data may change in the wrong location. This generally occurs with manual deployment processes: a developer generates a script and then passes it off to another person to be executed in the target environment. This method is prone to mistakes simply because we’re all human. Automating this type of process with validation unit tests can help mitigate data issues moving forward.
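A validation unit test can be as simple as a post-deployment check that fails loudly when environments disagree. A sketch, assuming hypothetical SourceDatabase and TargetDatabase names on the same instance:

```sql
-- Compare row counts after a deployment; the database names are assumptions.
DECLARE @sourceCount int, @targetCount int;

SELECT @sourceCount = COUNT(*) FROM SourceDatabase.dbo.Sales;
SELECT @targetCount = COUNT(*) FROM TargetDatabase.dbo.Sales;

IF @sourceCount <> @targetCount
    THROW 50001, 'Row count mismatch in dbo.Sales: possible data drift.', 1;
```

A row count check won’t catch modified values, only missing or extra rows, so it’s a first line of defense rather than a full comparison.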
If you have true database continuous integration/continuous delivery (CI/CD) in your environment, it’s critical to look at the permissions in your environments. The only entities needing elevated permissions are the tool sets doing the actual work. Developers and other end users only need read access to the data for verification purposes. Locking down your database environments is one of the benefits of putting continuous integration and delivery into place, and limiting access in this way prevents accidental changes, whether they be data deletion or just accidental modifications.
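Locking things down can be as simple as placing end users in a read-only role while reserving write access for the deployment tooling’s service account. A sketch using hypothetical principal names:

```sql
-- DevTeam and DeployToolSvc are hypothetical database principals.
ALTER ROLE db_datareader ADD MEMBER [DevTeam];       -- developers: read-only verification
ALTER ROLE db_datawriter ADD MEMBER [DeployToolSvc]; -- deployment tooling: write access
```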
If you don’t have any testing at all within your environment, then you’re just doing things wrong, and we should talk about it!
But if any level of testing is present, whether it’s manual or automated, having a drift or a gap in the data can drastically skew testing results. When drift is subtle, it can sometimes be difficult to pinpoint exactly where the issue might be. Large data sets continue to compound the issue because reconciling the drifts or the gaps can be cumbersome. Large data modification scripts must be constructed and subsequently tested to ensure the issue is resolved.
There are a couple of ways you can fill in data gaps. Some are harder than others, and each comes with varying degrees of risk when making the change.
Repaving the existing tables with a known good data set is a clean and effective way to get data synchronized in most cases. You’ll truncate the target table and then repopulate it by reading from the source.
INSERT INTO dbo.Sales (id, orderId, OrderName)
SELECT id, orderId, OrderName
FROM SourceDatabase.dbo.Sales;
This example uses three-part naming, which requires the two databases to reside on the same SQL Server instance. If the databases aren’t adjacent to each other, which is normally the case, you’ll have to employ other means—like an SSIS package or linked server—to get the data inserted. You may also need to enable the IDENTITY_INSERT setting for your table if you’re inserting explicit identity values from another table.
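Putting the truncate, identity handling, and insert together, a repave might look like the following sketch, assuming dbo.Sales has an identity column named id:

```sql
TRUNCATE TABLE dbo.Sales;

SET IDENTITY_INSERT dbo.Sales ON;  -- allow explicit values for the identity column

INSERT INTO dbo.Sales (id, orderId, OrderName)
SELECT id, orderId, OrderName
FROM SourceDatabase.dbo.Sales;

SET IDENTITY_INSERT dbo.Sales OFF;
```

Note the explicit column list: SQL Server requires it whenever IDENTITY_INSERT is on.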
The more painful approach to solving this problem, which is often referred to as looking for a needle in a haystack, is to write a custom process to compare the data between two sources, identify the differences, and then correct them. This can be done via means such as Biml, SSIS packages, or even aggressive T-SQL scripts designed to perform the comparison and then apply the differences. This method vastly increases complexity and is harder to maintain.
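For a hand-rolled comparison, T-SQL’s EXCEPT operator can surface differences in both directions. A minimal sketch, assuming the hypothetical SourceDatabase and TargetDatabase reside on the same instance and share the dbo.Sales structure:

```sql
-- Rows in the source with no exact match in the target:
SELECT id, orderId, OrderName FROM SourceDatabase.dbo.Sales
EXCEPT
SELECT id, orderId, OrderName FROM TargetDatabase.dbo.Sales;

-- Rows in the target with no exact match in the source:
SELECT id, orderId, OrderName FROM TargetDatabase.dbo.Sales
EXCEPT
SELECT id, orderId, OrderName FROM SourceDatabase.dbo.Sales;
```

Running both directions distinguishes rows missing from the target from rows added or modified there, but you still have to write and test the scripts to apply the fixes.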
Microsoft has its own tool, tablediff, designed to identify differences between data sets. It’s a free tool, but it’s command-line based, which makes it a bit less intuitive. tablediff has some nice features, such as the following:
- The ability to compare across servers
- The ability to perform a fast comparison by only comparing row counts or schema
- The ability to perform a comparison at the column level
- The ability to generate T-SQL to fix discrepancies
If you like command-line tools, this is one you should examine and keep in your tool belt.
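As a sketch, a run exercising the column-level comparison and fix-script features above might look like the following; the server, database, and output file names are assumptions, so check tablediff’s documentation for the full switch list:

```
REM -c performs a column-level comparison; -f writes a T-SQL script to fix discrepancies.
tablediff.exe ^
  -sourceserver SQLDEV01 -sourcedatabase SalesDb -sourceschema dbo -sourcetable Sales ^
  -destinationserver SQLSTAGE01 -destinationdatabase SalesDb -destinationschema dbo -destinationtable Sales ^
  -c ^
  -f FixSales.sql
```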
With the power of Visual Studio and SQL Server Data Tools, you can perform a data comparison directly from within the application. If you’re familiar with the Visual Studio interface, this might be an easy solution to discover where data gaps exist within your databases.
Figure 1—Example of starting a Visual Studio data comparison
The image in Figure 1 is from Visual Studio 2017, but any supported version of Visual Studio has this functionality.
There are many open-source and commercial tools capable of delivering similar functionality. The nice thing about going the tool route is someone else has likely already addressed issues such as dropping constraints potentially blocking data modifications from occurring. Furthermore, most of the tools out there can either perform the synchronization for you with a click of a button or generate a data modification script for you, which allows you to deploy at the appropriate time.
Make sure you’re doing your due diligence when it comes to identifying data gaps. If data gets skewed between environments, the downstream effects could be drastic.
When you start researching tools for monitoring for data drift, look at SolarWinds® DBA xPress. DBA xPress is a free tool designed to help you identify where data gaps exist and provide the means to correct them. Download the free tool here.