Entry Level: Compression and Data Types

SQL Server Data Compression

Earlier this year, I co-presented a webinar with Kevin Kline (b|t) on SQL Server Data Compression. In this post, I'm going to focus on the segment of that webinar which (1) covered the two types of SQL Server Data Compression and (2) illustrated which types of data will compress well and which will not.

Data Compression Methods

Row Compression

Row compression is essentially smarter, more efficient storage.

  • An int uses 4 bytes of storage, whether the value is 2 or 2,147,483,647. With row compression, storing a 2 should take only 1 byte of storage space.
  • Trailing zeros, padding, and blank characters are not stored (i.e., they take up 0 bytes), where applicable.

For a detailed list of how row compression works for each data type, refer to the Row Compression Implementation article on MSDN.
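If you want to try row compression out, the syntax is simple. Below is a minimal sketch, assuming a placeholder table named dbo.MyTable; the rebuild rewrites the existing rows in the compressed format:

ALTER TABLE dbo.MyTable
REBUILD WITH (DATA_COMPRESSION = ROW);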

Page Compression

Page Compression is what I like to refer to as "compression for real this time" because it goes well beyond the smart storage method of row compression and uses patterns/repeating values to condense the stored data.

First, to gain a better understanding of this method, check out a simple representation of a page of data. This is illustrated below in Figure 1. You'll notice that there are some repeating values (e.g. SQLR) and some repeated strings of characters (e.g. SSSLL).

Figure 1: No Data Compression

Second, take a look at a page of data that has Prefix Compression applied (Figure 2). Some of the values have been removed from the data part of the page and stored once in the compression information metadata section below the page header. These are the prefix values.

Next, spot what happens to some entries that share part of those values. "SSSQLL" is at the top of one section, and below that, "SSSQQ" has been replaced by "4Q": the first 4 characters of "SSSQLL" followed by a "Q" reproduce the original "SSSQQ" in fewer characters. Rather than storing these repeating prefixes over and over, SQL Server stores them once and uses references to complete the values. At this stage, the prefixes are limited to their respective columns. As a result, you'll see that the "4Q" refers to the "SSSQLL" prefix above it in the middle column.

Figure 2: Prefix Compression

Finally, observe what happens when Dictionary Compression is utilized (Figure 3). Unlike prefix compression, this level allows for compression of duplicated values across the entire page. The "1" across the two right columns corresponds to the "0QQQ".

Figure 3: Dictionary Compression

For more information on Page Compression Implementation, see this article on MSDN.
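Enabling page compression uses the same syntax with a different option value, and it can be applied to a single index as well as a whole table. Here is a minimal sketch with placeholder names (dbo.MyTable and IX_MyIndex); note that page compression includes row compression, so enabling PAGE gives you both:

ALTER INDEX IX_MyIndex
ON dbo.MyTable
REBUILD WITH (DATA_COMPRESSION = PAGE);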

Will It Compress Well?

Yes Example

Once you understand how compression works with repeated values and prefixes, you'll have a better idea of whether your data will compress well.

In the "Yes" example below, I estimated savings on a table and discovered that one index would compress by about 80%:

EXEC sys.sp_estimate_data_compression_savings @schema_name = N'dbo'
     , @object_name = N'EventSourceHistoryDetail'
     , @index_id = NULL
     , @partition_number = NULL
     , @data_compression = N'PAGE';

Estimating Savings

Savings by Index

Upon seeing that % savings, I took a closer look at the index. As shown below, it has three columns:

Index Info

I then ran a query to see how many of those values were unique:

SELECT COUNT(DISTINCT EventSourceID) AS UniqueEventSourceIDs
     , COUNT(DISTINCT RemoteSequenceID) AS UniqueRemoteSequenceIDs
     , COUNT(DISTINCT RemoteObjectID) AS UniqueRemoteObjectIDs
     , COUNT(*) AS TotalEventSources
FROM dbo.EventSourceHistoryDetail;

Querying for repeating values

Below, you can see in the results that across 253,127 rows, there are only 16 distinct values for EventSourceID and 241 for RemoteObjectID. RemoteObjectID has the potential to be the largest of the three (up to 256 bytes). Consequently, being able to refer to those values once instead of duplicating them across a quarter million rows will certainly add up to a high % savings in storage space.

Unique Table Data

Furthermore, RemoteSequenceID is a bigint, which uses twice as much storage as an int. Row compression is able to store those values more efficiently because they do not require the maximum space of a bigint; a value of 2 can be stored in a single byte instead of eight.
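You can see that effect for yourself with a quick experiment. The sketch below (the table name and row count are arbitrary) loads small values into a bigint column and compares the space used before and after row compression; the second sp_spaceused result should report a noticeably smaller data size, since none of the values needs all eight bytes:

CREATE TABLE dbo.RowCompressionDemo (ID bigint NOT NULL);

-- Load 100,000 small values into the bigint column
INSERT INTO dbo.RowCompressionDemo (ID)
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;

-- Space used while uncompressed
EXEC sys.sp_spaceused @objname = N'dbo.RowCompressionDemo';

ALTER TABLE dbo.RowCompressionDemo
REBUILD WITH (DATA_COMPRESSION = ROW);

-- Space used with row compression
EXEC sys.sp_spaceused @objname = N'dbo.RowCompressionDemo';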

No Example

Perhaps the most noteworthy thing about this example is that the data not only fails to compress well, but the compressed size is actually greater than the original:

EXEC sys.sp_estimate_data_compression_savings @schema_name = N'dbo'
     , @object_name = N'PerformanceAnalysisPlan'
     , @index_id = NULL
     , @partition_number = NULL
     , @data_compression = N'PAGE';

Estimating Savings

Recall that page compression creates an area of metadata on the page in order to work. In some cases, that metadata overhead can actually increase the storage size.

Savings by Index

Taking a closer look at the index in question reveals that it consists of a plan hash, which is a binary data type. According to Microsoft documentation, binary data types are row compressed by removing trailing zeros.

Index Info

Querying the data shows that each plan hash is a unique value:

SELECT COUNT(DISTINCT PlanHash) AS UniquePlans
     , COUNT(*) AS TotalPlans
FROM dbo.PerformanceAnalysisPlan;

Querying for repeating values

Unique Table Data

If you're not familiar with what a plan hash is, you might be wondering why the prefix values don't compress here. Just because the values are unique doesn't mean they don't share prefixes, right? Due to the nature of a plan hash, that isn't the case.
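If you'd like to see what these hash values look like on your own system, you can pull a few query plan hashes from the plan cache (this assumes you have VIEW SERVER STATE permission; the query_plan_hash column is a binary(8)):

SELECT TOP (5) query_plan_hash
FROM sys.dm_exec_query_stats;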

Plan Hash Data

As shown above, the prefix values for the plan hash are rather limited. There's the 0x, but the remainder of the hash has little to offer the prefix and dictionary compression methods. Notice that the removal of trailing zeros is not helpful here, either, so row compression is unable to store this data more efficiently.
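If you want to confirm that, you can run the same estimate with row compression instead; it's the original call with a different @data_compression value, and based on the analysis above you would expect it to show little to no savings here:

EXEC sys.sp_estimate_data_compression_savings @schema_name = N'dbo'
     , @object_name = N'PerformanceAnalysisPlan'
     , @index_id = NULL
     , @partition_number = NULL
     , @data_compression = N'ROW';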

Summary

While there are a multitude of factors to consider before applying compression to your data, one of the first things you might want to determine is whether or not your data actually compresses well enough to use it. Just because you have a large table (or index), that doesn't mean it will benefit from either data compression method. Having an understanding of what is in your data and how those methods work is an excellent start.
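As a starting point for that investigation, a query along these lines (a rough sketch using sys.dm_db_partition_stats) can surface the largest indexes in the current database, which you can then feed into sp_estimate_data_compression_savings one at a time:

SELECT TOP (10) OBJECT_SCHEMA_NAME(ps.object_id) AS SchemaName
     , OBJECT_NAME(ps.object_id) AS TableName
     , i.name AS IndexName
     , SUM(ps.used_page_count) * 8 / 1024 AS UsedSpaceMB
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
     ON i.object_id = ps.object_id
     AND i.index_id = ps.index_id
WHERE OBJECT_SCHEMA_NAME(ps.object_id) <> N'sys'
GROUP BY ps.object_id, i.index_id, i.name
ORDER BY UsedSpaceMB DESC;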

More Information

Finally, if you're interested in learning more about how to decide when it's a good idea to use data compression, check out the aforementioned webinar or read my other posts detailing how I decided where to use it in the SQL Sentry database.

The Webinar

Have you heard of the data compression feature in the Enterprise Editions of SQL Server 2008 and later? It promises great savings on storage space. But did you know it can also improve I/O performance and memory utilization in SQL Server? It has costs, too; in particular, it takes additional CPU to compress data on the fly. But how much CPU is spent on data compression overhead? Is it too much to be worth your time? Or is it an acceptable cost for the big gains in storage space, I/O performance, and memory? To find out all about this cool technology, join Kevin Kline (@kekline) and Melissa Connors for this case study webinar. They'll show how data compression works, how to choose the tables and indexes to compress, and the positive and negative performance impact of implementing data compression. Attend this webinar to learn the important ways that data compression affects workloads.