Culture of Data Protection: Data Quality, Privacy, and Security

Level 12

As I explained in previous posts on building a culture of data protection, we in the technology world must embrace data protection by design:

To Reward, We Must Measure. How do we fix this?  We start rewarding people for data protection activities. To reward people, we need to measure their deliverables.

An enterprise-wide security policy and framework that includes specific measures at the data category level:

    • Encryption design, starting with the data models
    • Data categorization and modeling
    • Test design that includes security and privacy testing
    • Proactive recognition of security requirements and techniques
    • Data profiling testing that discovers unprotected or under-protected data
    • Data security monitoring and alerting
    • Issue management and reporting

Traditionally, we relied on security features embedded in applications to protect our data. But in modern data stories, data is used across many applications and end-user tools. This means we must help ensure our data is protected as close as possible to where it persists. That means in the database.

Data Categorization

Before we can properly protect data, we have to know what data we steward and what protections we need to give it. That means we need a data inventory and a data categorization/cataloging scheme. There are two ways that we can categorize data: syntactically and semantically.

When we evaluate data items syntactically, we look at the names of tables and columns to understand the nature of data. For this to be even moderately successful, we must have reliable and meaningful naming standards. I can tell you from my 30+ years of looking at data architectures that we aren't good at that. Tools that start here do 80% of the work, but it's that last 20% that takes much more time to complete. Add to this our shameful habit of changing the meaning of a column/data item without updating its name, and we have a lot of manual work to do to properly categorize data.
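
As a minimal sketch of the syntactic approach, the snippet below matches column names against a few naming patterns. The patterns, category labels, and column names are all hypothetical, chosen only for illustration; a real scheme would come from your own naming standards.

```python
import re

# Hypothetical naming patterns mapped to hypothetical category labels.
NAME_RULES = [
    (re.compile(r"(card|cc)_?(num|number|no)", re.IGNORECASE), "payment_card"),
    (re.compile(r"ssn|social_?sec", re.IGNORECASE), "national_id"),
    (re.compile(r"e?mail", re.IGNORECASE), "contact"),
    (re.compile(r"birth|dob", re.IGNORECASE), "personal"),
]


def categorize_by_name(column_name):
    """Return the first category whose pattern matches the name, else None."""
    for pattern, category in NAME_RULES:
        if pattern.search(column_name):
            return category
    return None


# Illustrative column names only.
columns = ["CustomerEmail", "cc_number", "Notes", "DOB"]
results = {name: categorize_by_name(name) for name in columns}
```

Note that a column such as Notes gets no category at all, even though free-text fields often hold sensitive values. That gap is the manual 20% that name matching alone can't close.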

Semantic data categorization involves looking at both item names and actual data via data profiling. Profiling data allows us to examine the nature of data against known patterns and values. If I showed you a column of fifteen- or sixteen-digit numbers that all had a first digit of three, four, five, or six, you'd likely be looking at credit card data. How do I know this? Because these numbers follow an established standard with those rules. Sure, it might not be credit card numbers. But knowing this pattern means you know you need to focus on this column.
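
The credit card pattern above can be sketched as a simple profiling check. This is an illustration under stated assumptions, not a production detector: the function names are hypothetical, and the Luhn check-digit test is an addition of mine (it is the standard checksum for payment card numbers) that helps weed out random digit strings that merely have the right shape.

```python
import re

# Shape described in the text: 15 or 16 digits, first digit 3 through 6.
CARD_SHAPE = re.compile(r"^[3-6]\d{14,15}$")


def luhn_valid(number):
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    total = 0
    for index, ch in enumerate(reversed(number)):
        digit = int(ch)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def looks_like_card(value):
    """True if the value has the card shape and passes the Luhn check."""
    cleaned = re.sub(r"[ -]", "", value)  # ignore spaces and hyphens
    return bool(CARD_SHAPE.match(cleaned)) and luhn_valid(cleaned)


def card_match_ratio(values):
    """Fraction of non-empty values in a column that look like card numbers."""
    non_empty = [v for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if looks_like_card(v))
    return hits / len(non_empty)
```

A column with a high ratio deserves a closer look regardless of what it is named. The threshold for "high" is a judgment call that depends on how clean the data is.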

Ideally we'd use special tools to help us catalog our data items, plus we'd throw in various types of machine learning and pattern recognition to find sensitive data, record what we found, and use that metadata to implement data protection features.

Data Modeling

The metadata we collected and designed during data categorization should be managed in both logical and physical data models. Most development projects capture these requirements in user stories or spreadsheets. These formats make these important characteristics hard to find, hard to manage, and almost impossible to share across projects.

Data models are designed to capture and manage this type of metadata from the beginning. They form the data governance deliverables around data characteristics and design. They also allow for business review, commenting, iteration, and versioning of important security and privacy decisions.
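
To make the idea concrete, here is a rough sketch of protection metadata held as structured model data rather than in a spreadsheet. The class, field names, and category labels are hypothetical and far simpler than what a real data modeling tool records, but they show why structured metadata is easier to query and share.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnProtection:
    """Hypothetical per-column protection metadata in a data model."""
    table: str
    column: str
    category: str            # e.g. "payment_card", "public"
    encrypted: bool = False  # is column-level encryption specified?


# Hypothetical set of categories the organization treats as sensitive.
SENSITIVE_CATEGORIES = {"payment_card", "national_id"}


def unprotected_columns(model):
    """List sensitive columns that have no encryption specified."""
    return [
        f"{c.table}.{c.column}"
        for c in model
        if c.category in SENSITIVE_CATEGORIES and not c.encrypted
    ]


# Illustrative model entries only.
model = [
    ColumnProtection("Payment", "CardNumber", "payment_card", encrypted=True),
    ColumnProtection("Customer", "TaxId", "national_id"),
    ColumnProtection("Product", "Name", "public"),
]
gaps = unprotected_columns(model)
```

Because the metadata is data, a check like this can run on every model revision instead of relying on someone rereading a requirements document.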

In a model-driven development project, they allow a team to automatically generate database and code features required to protect data. It's like magic.

Encryption

As I mentioned in my first post in this series, for years, designers were afraid to use encryption due to performance trade-offs. However, in most current privacy and data breach legislation, the use of encryption is a requirement. At the very least, it significantly lowers the risk that data is actually disclosed to others.

Traditionally, we used server-level encryption to protect data. But this type of encryption only protects data at rest. It does not protect data in motion or in use. Many vendors have introduced end-to-end encryption to offer data security between storage and use. In SQL Server, this feature is called Always Encrypted. It works with the .NET Framework to encrypt data at the column level, and it provides protection from disk to end use. Because it's managed as a framework, applications do not have to implement any additional features for this to work. I'm a huge fan of this holistic approach to encryption because we don't have a series of encryption/decryption processes that leave data unencrypted between steps.

There are other encryption methods to choose from, but modern solutions should focus on these integrated approaches.

Data Masking

Data masking obscures data at presentation time to help protect the privacy of sensitive data. It's typically not a true security feature because the data isn't stored as masked values, although it can be. In SQL Server, Dynamic Data Masking allows a designer to specify a standard, reusable mask pattern for each type of data. Remember that credit card column above? There's an industry standard for masking that data: all but the last four characters are masked with stars or Xs. This standard exists because the other digits in a credit card number have meanings that could be used to guess or social engineer information about the card and card holder.
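
The mask pattern itself is simple. The sketch below only illustrates the "all but the last four" rule in application code; it is not how SQL Server's Dynamic Data Masking is configured, and the function name and defaults are hypothetical.

```python
def mask_card_number(card_number, mask_char="X", visible=4):
    """Mask every digit except the last `visible` ones.

    Non-digit separators (spaces, hyphens) are dropped before masking.
    Values with `visible` digits or fewer are returned unmasked.
    """
    digits = [ch for ch in card_number if ch.isdigit()]
    hidden = max(len(digits) - visible, 0)
    return mask_char * hidden + "".join(digits[hidden:])


masked = mask_card_number("4111 1111 1111 1111")
```

The stored value is untouched here; only the presented value changes, which is why masking on its own is not a substitute for encryption.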

Traditionally, we have used application or GUI logic to implement masks. That means that we have to manage all the applications and client tools that access that data. It's better to set a mask at the database level, giving us a mask that is applied everywhere, the same way.

There are many other methods for data protection (row level security, column level security, access permissions, etc.) but I wanted to cover the design approaches that have changed recently to better protect our data. In my future posts, I'll talk about why these are better than the traditional methods.

16 Comments

I think your approach of data categorization (I would call it classification) is the key point in modern data protection strategies. Knowing your "Crown Jewels" makes it easier to focus on what needs to be secured. Also with new privacy regulations you need to make sure that you also delete backup data when a deletion request comes in (or make the data unusable if you can't delete it).

I see encryption as an important part as well. I like your "holistic" approach, makes it easier not to forget some path that might be unencrypted.

thanks for your posts

Level 13

Good Article. I'm enjoying the series.

Level 15

I found in my IT travels that getting people to think about their data in terms of critical and non-critical is the biggest hurdle. They have a public folder where people drop data as they see fit, without an underlying understanding of the effects of it being visible. Once we get past this hurdle, the next one is the concept that I need to archive data forever in uncategorized storage. Just because you put something out there today does not mean it is still needed in 4 years. People need to be held accountable to police their storage and purge what needs to be purged.

Thanks for the thought-provoking article.

MVP

Good article

Level 12

Thanks.  There are a few names out there for this: Categorize, Classify, Catalog. I’m waiting to see what the data community ends up settling on.

Level 20

Data at rest crypto is becoming a requirement now for us.

I'm feeling the entire end-user issue of training (or the lack thereof) is missed.  End users are a huge source of security implementations or security risks.

Keeping training in the mix of solutions for prevention of insecure practices is important!

Level 9

Right on!

Level 12

Data retention is indeed another critical aspect of data protection. It's impacted by GDPR and other privacy issues, as well as being important for many compliance challenges.  Plus, there's just the whole cost of retaining, backing up and managing the security of older data.  Thanks for mentioning this.

Level 12

And to think that data at rest is only one part of the entire data protection requirement scheme.

Level 12

Yes, always important.  But I've seen the biggest risks taken by IT users.  Likely because they have greater access to technical resources.

Level 12

YEAH!

MVP

Good information, and I think the categorization is one of the most important elements. How many times, in IT, have you heard the statements "Just back up everything" or "Just encrypt everything"?

Proper classifications will lead to better protection, greater efficiency and cost savings. Not to mention you will know what you have.

Level 12

Yes, important. We now have data features to target data protection. Design is about cost, benefits, and risks. By applying a huge blanket tactic we often incur costs for little gain, while introducing risks for data that is under-protected.

Level 13

One of the problems I keep running into involves what is commonly called data governance or a data glossary. Just exactly what does the data mean (or *not* mean, which can be more important)? Also, a lot of what is called data is actually pretty much garbage because it wasn't properly vetted or can't be verified in such a way that it fits within the broader context of the enterprise, and so is rendered pretty much useless. A similar problem is where data is aggregated without keeping the detail. That can be very useful if all you want is trends, but you can't extract detail from summarized data after the fact, no matter how hard you try.

Level 12

thanks for the article

About the Author
Data Evangelist, Sr. Project Manager and Architect at InfoAdvisors. I'm a consultant, frequent speaker, trainer, blogger. I love all things data. I'm a Microsoft MVP. I work with all kinds of databases in the relational and post-relational world. I'm a NASA 2016 Datanaut! I want you to love your data, too.