--Updated 7/7/09--
Having looked at this stuff on and off for a while now, I find it's best not to be wedded to any particular data reduction technology.
To give an example of the pragmatism involved: any time I log on to the EMC corporate network, various NAS (Network Attached Storage) file systems are mapped to my laptop. There are ten or so of them in total, and they're all vast oceans of unstructured data.
So when we looked for the biggest win for de-duplication in our own environment, going after all this unstructured data in file systems was at the top of the list, and it's sure to be up there for customers as well.
So to begin with, the question could be asked "Why compression?".
The simple answer is "It's the best technology approach for this kind of data".
Notice I said for this kind of data. I did not say it's the best technology full stop. This wasn't an exercise in favouring one approach over the other; it was an examination of what could be done to decrease storage needs without hitting system performance.
The more elaborate answer is that compression is not memory intensive, and it's an infinitely variable, bit-level, intra-object deduplication process which offers the greatest savings for the oceans of unstructured data typically found in NAS devices.
That isn't an opinion; it's a scientific observation. We used our own NAS file system data in our tests and I've inlined some of our results.
| Technology | "Typical" Space Savings | Resource Footprint |
| File-level deduplication | 10% | Low |
| Fixed block deduplication | 20% | High |
| Variable block deduplication | 28% | High |
| Compression | 40% - 50% | Medium |
Looking at our own data and taking them from the top, we saw File Level Deduplication (AKA Single Instance Storage) provide some modest space savings, but it was pretty light on system resources.
Fixed and Variable Block De-Dup come next. Both are resource intensive, but variable block de-dup had the edge in space reduction. This is also shown by its use in the de-duplication of backup data, where the most benefit is seen in savesets containing repeated but block-misaligned data. It's the same with Virtual Machines. When you think of it they're just like backups: put a bunch of them together containing the same OS image and a variable block de-dup system will see a lot of repeated but block-misaligned data.
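To make the "block-misaligned" point concrete, here is a minimal sketch in Python. It assumes that variable blocks are cut wherever the content itself says so (often called content-defined chunking); that is one common way to do it, not necessarily how any particular product works. The constants, function names and the toy "sum of the last few bytes" boundary rule are all made up for illustration.

```python
import hashlib
import random

WINDOW = 16    # bytes of content examined when deciding on a boundary (illustrative)
DIVISOR = 64   # a boundary lands on roughly 1 in 64 positions (illustrative)
BLOCK = 4096   # fixed block size (illustrative)


def fixed_blocks(data, size=BLOCK):
    """Split data at fixed offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_blocks(data, window=WINDOW, divisor=DIVISOR):
    """Cut wherever the last `window` bytes sum to a multiple of `divisor`.

    The cut points depend on content, not on offset, so they move with the data.
    """
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if sum(data[end - window:end]) % divisor == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared(chunks_a, chunks_b):
    """Count distinct chunks (by SHA-256 digest) that appear in both lists."""
    digests_a = {hashlib.sha256(c).digest() for c in chunks_a}
    digests_b = {hashlib.sha256(c).digest() for c in chunks_b}
    return len(digests_a & digests_b)


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
shifted = b"X" + original  # same content, pushed one byte out of alignment

fixed_shared = shared(fixed_blocks(original), fixed_blocks(shifted))
variable_shared = shared(variable_blocks(original), variable_blocks(shifted))
print("fixed blocks shared:   ", fixed_shared)
print("variable blocks shared:", variable_shared)
```

A single inserted byte shifts every fixed block boundary, so the fixed scheme finds nothing in common, while the content-defined cuts line up again after the first chunk.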
Finally we come to Compression, which showed the highest percentage of space savings for file system data, anywhere up to (and sometimes above) 50%, and while it does use resources, they are CPU-centric.
So if you look at those numbers, scientific measurement showed us that for file system data, compression provided around 75% of the cumulative space reduction of the three other approaches combined.
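As a rough sanity check, the comparison can be redone from the table. This sketch assumes the figure means compression savings divided by the sum of the other three rows; that reading is an assumption, and the variable names are illustrative.

```python
# "Typical" space savings from the table, in percent.
others = {"file-level": 10, "fixed block": 20, "variable block": 28}
compression_low, compression_high = 40, 50

combined = sum(others.values())                      # 10 + 20 + 28
midpoint = (compression_low + compression_high) / 2  # middle of the 40% - 50% range

for label, value in [("low", compression_low),
                     ("midpoint", midpoint),
                     ("high", compression_high)]:
    print(f"{label:>8}: {value / combined:.0%} of the other three combined")
```

The low end of the range comes out a little under three quarters and the midpoint a little over, which is consistent with the ballpark quoted.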
By combining intra-object de-duplication with file level de-duplication we found you can realise great space savings at a reasonable resource cost.
What happens when you need to read that data? Yes, you need to decompress it, but it isn't that cut and dried. De-compression happens in memory, not on disk, so you read a smaller amount of data off disk than you normally would, which in turn offsets and sometimes eliminates the time spent on de-compression. But it goes even deeper than that. It's an open secret at this stage that we're using the RecoverPoint compression algorithms. I mentioned in a previous post that RecoverPoint was a treasure trove of technology; its compression algorithms are just one of the clever technologies in that product.
Celerra deduplication can decompress just the required data to service an IO, so you don't even have to spend cycles decompressing the entire file if the entire file isn't what's needed.
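One way to picture partial decompression is to compress a file as a series of independent chunks, so a read only has to inflate the chunks it touches. The sketch below uses Python's zlib as a stand-in; it is not the RecoverPoint algorithm, and the chunk size and function names are assumptions made for the example.

```python
import zlib

CHUNK = 8192  # uncompressed bytes per independently compressed chunk (illustrative)


def compress_chunks(data, chunk=CHUNK):
    """Compress each fixed-size slice of the file on its own."""
    return [zlib.compress(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def read_range(chunks, offset, length, chunk=CHUNK):
    """Return (data[offset:offset + length], number of chunks decompressed)."""
    if length <= 0:
        return b"", 0
    first = offset // chunk
    last = (offset + length - 1) // chunk
    touched = chunks[first:last + 1]
    buf = b"".join(zlib.decompress(c) for c in touched)
    skip = offset - first * chunk
    return buf[skip:skip + length], len(touched)


data = bytes(range(256)) * 400  # 102,400 bytes of very repetitive "file" content
chunks = compress_chunks(data)
piece, inflated = read_range(chunks, 20000, 100)

print("bytes stored on disk:", sum(len(c) for c in chunks), "of", len(data))
print("chunks decompressed for a 100 byte read:", inflated, "of", len(chunks))
```

The read comes back identical to the original bytes, less data is stored than the file holds, and only a fraction of the file is ever decompressed to service the IO.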
So when I go to Celerra Manager and check the Data De-Duplication box under the NAS File System, is that an all or nothing operation?
No, it's not. The way the system is designed, we have a policy engine which is always running in the background, using tunable parameters to decide what should be de-duplicated and what should be left alone. If it considers the file to be active, or the file type has been singled out to not be de-duplicated, it'll ignore the file. If the file meets the tunable criteria for being inactive, it'll be de-duplicated. First the compression engine will check to see if there's any benefit to be gained from compressing the file. If there is, it'll be compressed and single instanced. If there isn't, it won't spend CPU cycles attempting to compress the file and it'll just be single instanced.
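The decision flow described above can be sketched roughly as follows. Everything named here is hypothetical: the `Policy` fields, their default values and the zlib-based benefit check are assumptions for illustration, not Celerra's actual parameters or algorithm.

```python
import os
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical tunables; the real names and defaults are not given here."""
    min_idle_days: float = 30.0                      # younger than this counts as active
    excluded_suffixes: tuple = (".zip", ".jpg", ".mp3")
    min_saving: float = 0.10                         # compress only if it saves at least 10%


DEFAULT_POLICY = Policy()


def compression_saving(sample):
    """Fraction of space saved by compressing the sample (negative if it grows)."""
    if not sample:
        return 0.0
    return 1.0 - len(zlib.compress(sample)) / len(sample)


def decide(name, idle_days, sample, policy=DEFAULT_POLICY):
    """Return what the sketch policy would do with one file."""
    if name.lower().endswith(policy.excluded_suffixes):
        return "ignore"                              # file type singled out
    if idle_days < policy.min_idle_days:
        return "ignore"                              # file still considered active
    if compression_saving(sample) >= policy.min_saving:
        return "compress and single-instance"
    return "single-instance only"                    # no benefit, skip the CPU cost


text = b"quarterly report " * 1000
noise = os.urandom(16384)  # random bytes do not compress

print(decide("report.doc", 90, text))
print(decide("report.doc", 2, text))
print(decide("archive.zip", 400, text))
print(decide("blob.bin", 90, noise))
```

Keeping the benefit check separate from the activity and file type checks mirrors the order in the description: a file is only considered for compression once the policy has already decided it is fair game.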
Drinking our own champagne, on average across our systems we're seeing between 30% and 40% space savings with file system de-duplication turned on.
But the biggest win for me is that Celerra de-duplication is integrated with VBB for even higher speed backup and recovery. VBB is Celerra's block based high speed backup and recovery technology. Now it'll back up and restore de-duplicated data, which means you've less to write to backup media when you take backups and less to read from it when you restore.
There's also full integration with Celerra Manager/Replicator/SnapSure/FileMover.
To conclude, as we could go into the guts of this forever: the first data reduction use case we went after was file systems, and we found that compression with file level de-dup does a great job. But what about Databases or VMware or anything else?
Plugin architecture.
Use of one or more technologies does not preclude the use of any other technology or any combination of technologies. Since we have a policy engine for orchestration, we don't have to standardise on a single all or nothing approach.
I have no interest in a debate on the merits of one approach over another, since at its core it's a very simple question.
How much can you decrease your storage needs without impacting operational requirements?
And that's one of the questions driving our development of primary storage de-duplication.
My thanks to Celerra Engineering for supplying the technical details and figures contained in this post.