Mozy made some pieces of its infrastructure code available in two open source projects last week. Now, Mozy operates at a different scale than many of us. They have more than 50 PB of data from more than 1 million customers stored in multiple data centers around the globe. As such, I thought anything they could release would be niche, but interesting to anyone considering operating at hyperscale.
It turns out, after speaking with Reverend Ted Haeger over at Mozy, that I was only half right.
As per usual, I go for the jugular with the first question.
99.999999% of the developers out there aren't operating at the scale you are and may not see a need for any of this. Why release it?
We're big believers in Open Source Software, and what we have over on Mozy Code are two projects that were started to solve problems specific to us but that we think may now be useful in someone else's projects, even if they're only applicable to that 0.000001% of developers out there. They most certainly would have been useful for us to have had when we started out.
Let's go deeper into these projects, starting with Ruby Protocol Buffers. I know that Ruby is an object-oriented programming language, but what's a Protocol Buffer?
A Protocol Buffer is a low-fat data exchange format, invented by Google.
Let's go a bit deeper and consider the advantage of JavaScript Object Notation (JSON) over Extensible Markup Language (XML) for object serialization: for each data field in an object, XML uses both an opening and a closing tag (e.g. <fullname>Joe Tucci</fullname>). JSON is a bit more efficient, because it names the field once and uses a comma to separate it from the next one (e.g. "fullname": "Joe Tucci",).
For a low-use API, the smaller number of characters gives JSON only a trivial advantage over XML. As you scale up, XML begins to show its bulk relative to JSON. But JSON still carries redundant information: the "fullname" label for the actual data could be eliminated. As long as the data gets transferred in a predictable position in a data stream, it doesn't need the label.
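To put rough numbers on that comparison, here is a minimal Ruby sketch that serializes the single "fullname" field three ways and counts the bytes. The helper name encoded_sizes and the one-byte length prefix are assumptions made for illustration; the label-free form is not the Protocol Buffers wire format, just the simplest way to show what dropping the label saves.

```ruby
require 'json'

# Byte counts for one field serialized three ways.
# encoded_sizes is a hypothetical helper, not part of any library.
def encoded_sizes(fullname)
  xml  = "<fullname>#{fullname}</fullname>"
  json = JSON.generate({ "fullname" => fullname })
  # Label-free form: a 1-byte length prefix, then the raw bytes.
  # Assumes the value is shorter than 256 bytes.
  packed = [fullname.bytesize, fullname].pack("Ca*")
  { xml: xml.bytesize, json: json.bytesize, packed: packed.bytesize }
end

puts encoded_sizes("Joe Tucci").inspect
```

For "Joe Tucci" this works out to 30 bytes as XML, 24 as JSON, and 10 in the label-free form. The saving per field is small, but it repeats for every field of every record.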
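That's what Google implemented with Protocol Buffers. They serialize the data in a more rigid, predictable binary format for faster communication between systems: a schema shared by both sides describes each field, and a short numeric tag takes the place of the text label on the wire. As long as both systems understand the format, you can eliminate a bunch of overhead.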
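As a rough illustration of how little overhead is left, the sketch below hand-encodes one string field the way the Protocol Buffers wire format lays it out: a tag (field number and wire type), a length, then the raw bytes. It assumes a hypothetical message whose only field is a string numbered 1. The helper names are made up for this example, and it does not use the ruby-protobufs API, which generates this kind of code from a schema for you.

```ruby
# Encode a non-negative integer as a base-128 varint:
# 7 bits per byte, high bit set on every byte except the last.
def encode_varint(n)
  bytes = []
  loop do
    byte = n & 0x7F
    n >>= 7
    if n.zero?
      bytes << byte
      break
    end
    bytes << (byte | 0x80)
  end
  bytes.pack("C*")
end

# Encode one length-delimited field (wire type 2), such as a string.
def encode_string_field(field_number, value)
  data = value.b                     # binary copy of the string
  tag  = (field_number << 3) | 2     # field number plus wire type
  encode_varint(tag) + encode_varint(data.bytesize) + data
end

encoded = encode_string_field(1, "Joe Tucci")
puts encoded.bytesize
```

The nine-byte name costs two extra bytes here: one for the tag and one for the length. The receiver needs the same schema to know that field 1 means "fullname", which is the trade-off described above.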
Google created Protocol Buffers to optimize data transmission in their data centers, where high-speed data transfer between systems is all managed through lower-level languages. So the first implementations were in languages like C and C++. Naturally, Google also made a Python implementation, making Python the first higher-level language to use Protocol Buffers.
Mozy is largely a Ruby shop for our web logic, but on the back-end we're heavily C++, so at the scale we require, Protocol Buffers made a lot of sense. However, we needed a working Ruby implementation to get that advantage. When our developers tried an existing Ruby implementation, they found that not only did it not work well for what we needed, but also that it would be easier for us to build our own from the ground up.
And that's what we did.
So, the developer who would most likely be interested in our ruby-protobufs project would be a Ruby developer who needs rapid data exchange with back-end applications (likely written in a lower-level language) to reach a level of scale that cannot be so easily achieved using formats like XML and JSON.
XML and JSON will probably support the majority of developers' needs, but ruby-protobufs is a straightforward implementation for anyone who chooses to use it.
Right, I can see where that would be useful to Ruby developers. Mordor is an I/O library. How is it different from what's already out there? I've heard of threads, but what's a Fiber?
Asynchronous I/O libraries have become well-known tools of the trade, addressing major limitations of synchronous I/O.
Synchronous I/O requires a program to initiate a task (such as an HTTP GET or a file write operation) and then wait: the application is blocked until the task completes.
Asynchronous I/O libraries allow a program to initiate a task, and then continue doing other things. When the task completes, a "callback" function typically kicks off a short routine for handling the task completion (such as doing something with an HTTP response, or recording success of the file write event).
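The difference can be sketched in a few lines of Ruby. The names write_sync and write_async are hypothetical, and the asynchronous version simply hands the write to a background thread and calls a block when it finishes. It shows the callback pattern only, not how Mordor or any particular library implements it.

```ruby
require 'tmpdir'

# Synchronous: the caller waits here until the write has finished.
def write_sync(path, data)
  File.write(path, data)   # returns the number of bytes written
end

# Asynchronous (thread-backed sketch): returns right away and
# runs the callback block once the write has completed.
def write_async(path, data, &on_complete)
  Thread.new do
    on_complete.call(File.write(path, data))
  end
end

Dir.mktmpdir do |dir|
  puts "sync wrote #{write_sync(File.join(dir, 'sync.txt'), 'hello')} bytes"

  worker = write_async(File.join(dir, 'async.txt'), 'hello, world') do |bytes|
    puts "callback: async wrote #{bytes} bytes"
  end
  puts "caller keeps working while the write is in flight"
  worker.join   # wait for the callback before the temp dir is removed
end
```

The last two messages can appear in either order, which is the point: the caller no longer has to wait for the write before moving on.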
Most current asynchronous I/O libraries use Threads as their base unit of execution. Threads allow simultaneous execution of tasks in a program, with each task using its own thread. This greatly increases program performance and resilience. The Chrome browser applies the same isolation idea one level up, running each tab in its own process, so a bad web app only hangs a single tab rather than the whole browser. And when you close a tab, the memory it consumes is returned to the system.
But threads have downsides.
Because threads are managed by the OS kernel, the kernel allocates their memory and manages their scheduling, and that overhead adds up as the number of threads grows. Mordor uses finer-grained execution units called Fibers. Conceptually, fibers are similar to threads, but instead of the kernel, the application schedules multiple fibers in the context of a single thread. This finer-grained approach to cooperative tasks reduces demand on the kernel for allocating, managing, and de-allocating threads. The result is much higher performance and lower demand on system resources.
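Ruby happens to ship with its own Fiber class, so the idea of application-level scheduling can be sketched without Mordor, which is a C++ library with a different API. The toy round-robin scheduler below is an assumption made for illustration: make_task, run_round_robin, and the :pending and :done markers are invented names, and a real library would switch fibers when I/O would block rather than after every step.

```ruby
# Build a fiber that logs each step, then hands control back.
def make_task(name, steps, log)
  Fiber.new do
    steps.times do |i|
      log << "#{name}#{i + 1}"
      Fiber.yield :pending   # give control back to the scheduler
    end
    :done                    # final value returned by the last resume
  end
end

# Toy scheduler: run each fiber until it yields, then move to the next.
# Everything happens on one thread; the kernel is never asked to switch.
def run_round_robin(fibers)
  queue = fibers.dup
  until queue.empty?
    fiber = queue.shift
    queue.push(fiber) unless fiber.resume == :done
  end
end

def demo_interleaving
  log = []
  run_round_robin([make_task("a", 2, log), make_task("b", 3, log)])
  log
end

puts demo_interleaving.join(" ")   # prints "a1 b1 a2 b2 b3"
```

The two tasks take turns even though only one thread is running, and each switch is an ordinary call inside the program rather than a trip through the kernel's scheduler.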
The developers who would be interested in Mordor are those concerned with I/O task performance. Mordor provides developers with an Async I/O library through a simple API, with the performance benefits of a complex fiber-based system.
Great. All I need now is a compiler and I'm writing high performance code.
Well, be sure to check out http://code.mozy.com before you do. You might find something useful!
Any more stuff coming?
These two are just the beginning. Going forward we have to consider that getting the right person to lead a project is as important as the technology itself. That's what will dictate what we release and when we release it. Right now there's great potential for future projects.
