Posted by: Denny Cherry
Backup & recovery, compression, SQL Server, SQL University, Storage
If you have a modern tape backup solution you probably have some sort of dedup process in your backup environment. What these deduplication processes do normally is look at the data which is being written to the backup platform and remove duplicate values in order to save space on the tape backup system.
However if you are using backup compression, either the native backup compression or a 3rd party backup compression tool, this will pretty much completely kill the effectiveness of the deduplication application.
How the deduplication systems work
There are two different kinds of deduplication systems out there (which come as an appliance or as a software package), which do fixed length deduplication or variable length deduplication. The fixed width deduplication systems use a fixed width process to look for duplicate data. They take the file and break it down into fixed width strings (different vendors use different length string, but for the sake of argument we’ll assume a 1k fixed width string) of 1k in size. Then everywhere the string is duplicated a pointer is put in which points to the original 1k length string.
The variable length systems do basically the same thing, but they don’t use the fixed length string. Because they use variable length strings they have a better change of removing smaller duplicate values, possibly as small as every word if needed.
For example, if you have a table with employee names in it and you are using a deduplication system which has a fixed width string size of 1k, you probably won’t get any deduplication. However if you have a system which has a variable length string size, if there are duplicate names (such as within the last name column) then you will get some deduplication with the amount of deduplication simply depending on the data within the database file.
Why doesn’t this work with compression?
The reason this doesn’t work with a compressed database backup (either the fixed or variable length deduplication) is because every value will be unique. When the data is compressed every row which has the last name “Smith” in it will already have the duplicates removed. Besides that databases (at least OLTP databases) already have the bulk of the duplicate data removed via the process of normalization.
A couple of weeks ago, I took a compressed database backup (some of the data within the backup was encrypted) to Quantum’s lab here in Southern California so we could see just how well it would dedupe a compressed backup. We achieved what the guys at Quantum considered to be the impossible, we got 0% dedupe of our compressed backup file.
As the DBA, why do I care?
As the DBA we care about this, because we want to play nice with the rest of the Enterprise environment and not waste resources if we don’t need to. Because of this you will want to work with your storage and backup teams to come up with the best plan for your systems. That may mean that you are taking backups which are not compressed so that the deduplication process can deduplicate the file as depending on your data set that may be more efficient. On the SQL Server side this will make your backups take a little longer, but as long as the backups still fit within your backup window that may be ok. You’ll only be able to find this out after working with your backup and/or storage admins to find out what will work best in your specific environment.
Now I’ll be going back to Quantum in the next couple of days to try and dedup an uncompressed backup file to see how well that works compared to the traditional SQL Server compression. I’ll be sure to post an update after we run this second set of tests to get some example numbers comparing the two.