I’m very interested in the feedback on the last entry, The DAS Future? All comments are appreciated, and I’m doing a lot of thinking about it myself.
That’s not the only thing I’m doing. I caught a stupidly bad head cold about five weeks ago. It came and went having run riot, but after the air travel I had to do, bronchitis moved in and refused to be turfed out no matter how many cough bottles I took or boiling hot whiskeys I necked.
And I necked a lot of hot whiskey to try and dislodge the 40 shades of green I was attempting to cough up out of my lungs but to no avail.
Like any other man I just assumed I’m Superman and waited “one more week” for three weeks until I was pushed into seeing a Doctor. After listening to weird sounds emanating from my lungs he looked over at me and asked “Didn’t you think the sensation of drowning in your own snot was something you should have acted on sooner?”
I had planned on it Doc…maybe next week? Or when it had turned to pneumonia.
To keep me occupied while I waited for my respiratory system to stop trying to escape its meat puppet prison I needed a few small projects, and one of those projects was to investigate LessFS.
LessFS is a fixed block inline deduplicating virtual file system delivered as a FUSE component.
FUSE (Filesystem in Userspace) is a loadable kernel module for Linux/UNIX/BSD operating systems that allows developers to write file systems without mucking around in the kernel. Because a FUSE file system is virtual, storing data (the easy part) and reading it back correctly (the hard part) is left to an underlying file system that’s known to work. In my case that was Ext4. The value in FUSE virtual file systems comes in the levels above that of worrying if the data you’re reading really is the data you’ve written.
In a LessFS virtual file system data is taken and cut into fixed size blocks, each block is hashed with an algorithm, in this case one provided by the mhash library, and the blocks are stored compressed in a key value store like Redis.
But in my case it wasn’t Redis as Redis doesn’t appear to be supported yet. I chose the exotic sounding Tokyo Cabinet as my key value store for the technical reason that I liked the name Tokyo Cabinet.
Tokyo Cabinet appears to store records in a file, the records created by LessFS being compressed fixed block chunks. Of course, if I lose that database I lose everything I’ve written into the directory, and it appears Tokyo Cabinet can be fatally damaged in particular circumstances. If the data in there is your primary data, backups can really matter.
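To make the idea concrete, here is a minimal Python sketch of fixed block deduplication. It is not LessFS code and shouldn’t be read as how LessFS does it internally: a plain dict stands in for Tokyo Cabinet, SHA-256 from hashlib stands in for mhash, zlib stands in for whatever compressor is configured, and the 4096 byte block size and the DedupStore name are assumptions made purely for illustration.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


class DedupStore:
    """Toy fixed block deduplicating store (hypothetical, not LessFS)."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}  # block hash -> compressed block (the "key value store")
        self.files = {}   # file name -> ordered list of block hashes

    def write(self, name, data):
        hashes = []
        # Cut the data into fixed size blocks; the last one may be shorter.
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            key = hashlib.sha256(block).digest()
            # Only store a block the first time its hash is seen.
            if key not in self.blocks:
                self.blocks[key] = zlib.compress(block)
            hashes.append(key)
        self.files[name] = hashes

    def read(self, name):
        # Rebuild the file by looking up each block by its hash.
        return b"".join(zlib.decompress(self.blocks[k]) for k in self.files[name])

    def stored_bytes(self):
        # Space actually used by unique, compressed blocks.
        return sum(len(v) for v in self.blocks.values())
```

Writing the same block twice, whether inside one file or across several, stores it only once. Data that never repeats gains nothing beyond compression, and losing the `blocks` dict loses every file, which is the same worry as losing the database.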
Having whacked a minimal Linux installation onto a laptop I had lying around I set about building my virtual file system, a failing ./configure command leading me by the nose as to what I needed to install until I managed to get the whole way through to run make install.
I mentioned mhash earlier. It looks pretty interesting, as not only does it offer the old reliables like MD and SHA at various lengths but also things like the Whirlpool crypto hash.
Having built and installed mhash, Tokyo Cabinet and LessFS from source I started up mklessfs and created a Tokyo Cabinet database and then associated that with a mount point. LessFS mounted on the mount point and after a few permissions changes and some poking around in the LessFS config file to see what options I had (changing the hashing algorithm, compression algorithm and configuring replication) I was good to go.
My in depth testing, which involved nothing more than throwing in some office files, media files and VMDK files while checking the output of a hidden stats file in the directory I was deduping, proved that it does appear to work. It does space reduce files you’d expect it to space reduce and doesn’t for the ones you’d expect it not to. I’ll admit to not tweaking anything for speed in the config file so this is completely unoptimized, but overall I found it slow as hell while running.
Slow on the way in and slow on the way out.
I’m sure I could be doing more to optimise it. I’m sure I could be adding more spindles, making config file changes, getting Tokyo Cabinet to scream, but I’m not and I didn’t so it is what it is. There’s probably also the edging factors covered by Steve in his entry about component performance in userland, but not being a programmer I wouldn’t guess at where such bottlenecks might crop up.
You might use it for home disk to disk backups which you have a lot of time to run, or maybe to archive inactive files to squeeze back some precious hard drive space, but you’re not going to be running games or other applications out of it. Then again it is really as fresh as the dew on a daisy and I’m certain it’ll develop over time.
Whether this belongs as its own project or as a feature in a mainstream file system is a question that comes to mind, and I’m sure it’ll work itself out one way or another.
All in all a quick little assembly project.
Though at least with models or ships in bottles you have something to put on a shelf.