At the end of February 2009 the Celerra team introduced Celerra Data Deduplication. The design precept of EMC's data reduction strategy is to decrease storage requirements without negatively impacting operational requirements.
As stated in a previous blog post, compression was found to be an effective general-purpose data reduction technique for unstructured data. Ultimately, though, it isn't a case of favoring one reduction technology over another; it's about using the correct data reduction technology (or technologies) on the right data.
The idea is to make data reduction intelligent rather than brute force: take the savings where they make sense, and don't burn resources trying to reduce something that won't reduce well, if at all.
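To make the "take the savings where it makes sense" idea concrete, here is a minimal sketch in Python. It is not how the Celerra decides what to reduce; the function name `worth_compressing`, the 10% savings threshold, and the 64 KB sample size are all assumptions chosen for illustration. It compresses only a small sample of the data and skips anything that doesn't shrink enough to be worth the effort.

```python
import zlib


def worth_compressing(data: bytes, min_savings: float = 0.10,
                      sample_size: int = 64 * 1024) -> bool:
    """Return True if a sample of `data` shrinks by at least `min_savings`.

    Hypothetical helper for illustration only: the threshold and sample
    size are assumptions, not values taken from any product.
    """
    sample = data[:sample_size]
    if not sample:
        return False  # nothing to save on an empty file
    compressed = zlib.compress(sample, 6)
    savings = 1 - len(compressed) / len(sample)
    return savings >= min_savings
```

Highly repetitive data such as logs or office documents will usually pass this kind of check, while already-compressed media or encrypted data will usually fail it, which is exactly the data you don't want to spend CPU on.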
Building on the work already in place, this latest release removes the 200MB maximum file size ceiling, allowing active files and files sized in terabytes to undergo space reduction.
NDMP tar and dump backups now output space-reduced data, cutting down the amount of data transferred during a backup and reducing the backup's time to completion. In the previous release only Volume Based Backup (VBB) preserved the space savings, and it only allowed volume-level restores. NDMP tar/dump now not only maintains the space savings but also allows individual file restores back to their originating production file systems while maintaining the initial space savings.
Along with increasing the range of files in a file system that become candidates for Celerra Data Deduplication, how files are identified as candidates has also been extended. Previously the options were to use SHA1 hash values or to disable deduplication and just apply compression. As of this release a byte-by-byte comparison may also be selected.
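The difference between the two identification options can be sketched in a few lines of Python. This is a generic illustration of file-level duplicate detection under my own assumptions, not the Celerra implementation; the names `sha1_of`, `same_bytes`, `find_duplicates`, and the `verify_bytes` switch are hypothetical. Files are first grouped by SHA-1 digest, and the optional byte-by-byte pass confirms that files sharing a digest really are identical.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_bytes(a, b, chunk_size=1 << 20):
    """Compare two files byte by byte, stopping at the first difference."""
    if os.path.getsize(a) != os.path.getsize(b):
        return False
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ca, cb = fa.read(chunk_size), fb.read(chunk_size)
            if ca != cb:
                return False
            if not ca:  # both files hit end-of-file together
                return True


def find_duplicates(paths, verify_bytes=False):
    """Group paths with identical content.

    With verify_bytes=False, a matching SHA-1 digest is trusted.
    With verify_bytes=True, each digest group is re-checked byte by byte.
    """
    by_digest = defaultdict(list)
    for p in paths:
        by_digest[sha1_of(p)].append(p)

    groups = []
    for members in by_digest.values():
        if len(members) < 2:
            continue
        if not verify_bytes:
            groups.append(members)
            continue
        clusters = []  # each cluster holds byte-identical files
        for p in members:
            for cluster in clusters:
                if same_bytes(cluster[0], p):
                    cluster.append(p)
                    break
            else:
                clusters.append([p])
        groups.extend(c for c in clusters if len(c) > 1)
    return groups
```

The trade-off is the usual one: trusting the hash costs one read of each file, while the byte-by-byte pass costs extra reads but removes any reliance on the hash being collision-free.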
Since Celerra Data Deduplication works on a per file system basis, not only can you exclude specific file types, you can also exclude specific pathnames on individual file systems. So if your file system structures have directories containing data you don't want to deduplicate, you can specify them for exclusion and they will not be processed for deduplication.
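As a rough illustration of how an exclusion policy like this behaves, the Python sketch below filters files by extension and by directory. It does not use Celerra commands or settings; the function `is_excluded`, its parameters, and the sample paths are all hypothetical.

```python
from pathlib import PurePosixPath


def is_excluded(path, excluded_extensions, excluded_dirs):
    """Return True if `path` should be skipped by deduplication.

    excluded_extensions: lowercase suffixes such as {".jpg", ".zip"}
    excluded_dirs: directory pathnames whose contents are skipped
    (Hypothetical policy check, for illustration only.)
    """
    p = PurePosixPath(path)
    if p.suffix.lower() in excluded_extensions:
        return True
    for d in excluded_dirs:
        d = PurePosixPath(d)
        # Match the directory itself or anything beneath it, comparing
        # whole path components so /fs1/scratch does not match
        # /fs1/scratchpad.
        if d == p or d in p.parents:
            return True
    return False
```

For example, `is_excluded("/fs1/scratch/tmp/report.doc", set(), ["/fs1/scratch"])` returns `True`, while the same file under `/fs1/scratchpad` would still be processed.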
On the topic of intelligently specifying which files to deduplicate and which not to: using the CIFS compressed file attribute, Windows clients can enable and disable deduplication on a per-file or per-directory basis directly from Windows Explorer. When the attribute is set, the Celerra will always attempt to ensure the file or directory contents are deduplicated as new or modified data is written to the files or directories specified.
So if you don't like how the policy engine is deduping files automatically, you can simply select which files and directories you want deduplicated and choose in what fashion they are deduplicated.
All in all, these advancements lay the groundwork for the next set of improvements, which will follow soon.
My thanks to the Celerra team for answering my questions and supplying technical input on this post.