Data de-duplication (also written "deduplication") is a specialized compression technique that removes duplicate copies of data, which can make backup and recovery simpler. Knowledge is half the battle when it comes to minimizing its drawbacks, so I will start with an overview. De-duplication comes in several forms: file-level, block-level, inline, and post-process. File-level de-duplication searches for identical files, while block-level looks for identical blocks within files. Inline de-duplication processes data as it is received by the storage system, before it is written to disk. Finally, post-process de-duplication writes the data first and de-duplicates it later, when resources are available.
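To make block-level de-duplication concrete, here is a minimal sketch. It assumes a fixed block size and an in-memory dictionary as the block store (real products use variable-size chunking and on-disk indexes); only unseen blocks are written, and each file is reduced to a "recipe" of block hashes.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size; real products often vary it

def deduplicate(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks, storing each unique block once.

    `store` maps a SHA-256 digest to the block's bytes. Returns the list
    of digests (the "recipe") needed to rebuild the original data.
    """
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # write the block only if unseen
        recipe.append(digest)
    return recipe

store = {}
payload = b"A" * 8192 + b"B" * 4096  # two identical "A" blocks plus one "B" block
recipe = deduplicate(payload, store)
print(len(recipe), len(store))  # 3 block references, only 2 unique blocks stored
```

The duplicate "A" block is referenced twice in the recipe but stored only once, which is exactly the space saving block-level de-duplication is after.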
Data can accumulate quickly, and multiple copies of the same file can be stored in a number of ways and in various places. Data de-duplication can help reduce this, but it comes with challenges of its own. In this blog, I will discuss the de-dupe tax: what it is, how it arises, and most importantly, how to limit its negative impact. De-duplication can slow backups down significantly, especially when disk staging is used, resulting in longer backup windows and lower performance. Restores of de-duplicated backups can also take much longer than restores of data that was never de-duplicated. This overhead is what the de-dupe tax is all about.
To help protect against this, one must decide whether the de-dupe system will serve as the source for copying to tape, since the de-dupe tax is much higher for large restores and tape copies. Furthermore, knowing everything one can about the restore in question, which vendor's products are being used, and which de-duplication method is being implemented helps in planning and reduces the impact of the de-dupe tax. Although de-duplication can make backups more efficient, it can slow restores because less of the data is stored contiguously on disk. The resulting fragmentation forces the backup software to locate these scattered blocks, which takes time and tests patience.
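The restore side of the picture can be sketched the same way. In this hypothetical example, rebuilding a file means one lookup per block in its recipe; on real hardware each of those lookups can land anywhere on disk, and those random reads are the root of the de-dupe tax on restores.

```python
import hashlib

def restore(recipe: list, block_store: dict) -> bytes:
    """Rebuild original data from its block recipe.

    Each digest lookup may hit a block stored far from its neighbours on
    disk, so a restore that was one sequential read before de-duplication
    becomes many scattered reads afterwards.
    """
    return b"".join(block_store[d] for d in recipe)

# hypothetical store: two unique 4 KiB blocks shared by a 12 KiB file
a, b = b"A" * 4096, b"B" * 4096
store = {hashlib.sha256(x).hexdigest(): x for x in (a, b)}
recipe = [hashlib.sha256(x).hexdigest() for x in (a, a, b)]
data = restore(recipe, store)
print(len(data))  # 12288 bytes reassembled from only 2 stored blocks
```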
While there are many forms of data de-duplication, every form suffers from the de-dupe tax in some way. With the aid of vendors, there are ways to reduce it, such as increasing the number of available disks or moving to solid-state disks. Caching is another method for lessening its impact. Early de-duplication products had a problem where a single corrupted block could affect many files that referenced it, so products now monitor how often each block is used and store a second copy of heavily referenced blocks on the same disk. This prevents corruption from spreading and provides a copy of the block as a safety net. Also, caching the most frequently used blocks is an excellent way of reducing read time during restores, further decreasing the de-dupe tax.
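The caching idea above can be illustrated with a small least-recently-used (LRU) block cache. This is a sketch, not any vendor's implementation: hot blocks are served from memory, and only misses pay for a (simulated) disk read.

```python
from collections import OrderedDict

class BlockCache:
    """Tiny LRU cache of recently used blocks (illustrative only).

    Keeping frequently referenced blocks in memory avoids repeated
    random disk reads during restores.
    """
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._cache = OrderedDict()
        self.hits = self.misses = 0

    def get(self, digest, read_from_disk):
        if digest in self._cache:
            self._cache.move_to_end(digest)   # mark as most recently used
            self.hits += 1
            return self._cache[digest]
        self.misses += 1
        block = read_from_disk(digest)        # the slow path: go to disk
        self._cache[digest] = block
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)   # evict least recently used
        return block

disk = {"h1": b"A" * 4096, "h2": b"B" * 4096}  # stand-in for on-disk block store
cache = BlockCache(capacity=8)
for d in ["h1", "h1", "h2", "h1"]:             # repeated reads of a hot block
    cache.get(d, disk.__getitem__)
print(cache.hits, cache.misses)  # 2 hits, 2 misses
```

A heavily shared block is read from disk once and then served from memory on every later reference, which is exactly how caching trims restore times.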
Data de-duplication can be of great help to any user. By reducing duplicate data, storage footprints shrink, which can improve performance and speed up operations. However, de-duplication can also fragment disks and slow backup and restore operations, so the de-dupe tax needs to be planned for.
Preston, W. Curtis. (2010, November). Solving common data deduplication system problems. Retrieved from .
Posey, Brien. (2013, April). The “deduping tax” and other issues to consider with deduplication. Retrieved from .