One of the interesting technologies EMC is working on with the scientific community is PLFS, a checkpoint file system for parallel applications.
When running across thousands of CPUs on hundreds of nodes against a parallel file system such as Lustre, you can assume something is going to fail at some point. HPC application authors therefore tend to design their applications to take regular checkpoints, so the application can restart from the last one when such a failure occurs.
The two common ways of doing this are either to create a file per process and save each process’s state to its own file, or to share a single file between processes and save all the state to that.
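To make the two patterns concrete, here is a minimal sketch in Python. It is only an illustration: a plain loop stands in for the parallel processes, the state is a few dummy bytes, and every name in it (`make_state`, `checkpoint_file_per_process`, `checkpoint_shared_file`, the `ckpt.*` file names) is my own invention rather than anything from a real HPC code.

```python
import os
import tempfile

STATE_SIZE = 8  # bytes of state per process; tiny, purely for illustration


def make_state(rank):
    """Stand-in for a process's in-memory state (hypothetical)."""
    return bytes([rank]) * STATE_SIZE


def checkpoint_file_per_process(directory, rank):
    """File per process: each rank writes its own checkpoint file."""
    path = os.path.join(directory, f"ckpt.{rank}")
    with open(path, "wb") as f:
        f.write(make_state(rank))
    return path


def checkpoint_shared_file(path, rank):
    """Shared file: each rank writes its state at its own offset in one file."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        os.lseek(fd, rank * STATE_SIZE, os.SEEK_SET)
        os.write(fd, make_state(rank))
    finally:
        os.close(fd)


def demo(num_ranks=4):
    """Run both patterns, with a loop standing in for the parallel ranks."""
    with tempfile.TemporaryDirectory() as d:
        per_process = [checkpoint_file_per_process(d, r) for r in range(num_ranks)]
        names = sorted(os.path.basename(p) for p in per_process)
        shared = os.path.join(d, "ckpt.shared")
        for r in range(num_ranks):
            checkpoint_shared_file(shared, r)
        with open(shared, "rb") as f:
            return names, f.read()
```

In the first pattern you end up with one file per rank to manage; in the second you get a single file, but every rank is writing into it at once.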
However, the workload profile of these checkpoints tends to choke the bandwidth of parallel file systems, especially when many processes write to a single shared file, so you end up trading an aspect of checkpoint performance for protection.
What PLFS does is sit between the application and the parallel file system and rearrange the writes on the fly into a sequence more conducive to optimising the bandwidth of the underlying parallel file system, so you don’t suffer an order-of-magnitude loss of bandwidth.
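As a rough picture of what "rearranging the writes" can mean, here is a toy log-structured sketch in Python. It is not PLFS code and makes no claim to match its on-disk format. The assumption is simply that each writer appends to its own sequential log while an index records where each piece belongs in the logical shared file. The class `LogStructuredFile` and its methods are hypothetical names, and everything lives in memory.

```python
class LogStructuredFile:
    """Toy model: strided writes to one logical file become per-writer appends."""

    def __init__(self):
        self.logs = {}   # writer id -> bytearray holding that writer's append-only log
        self.index = []  # entries: (logical_offset, length, writer, log_offset)

    def write(self, writer, logical_offset, data):
        """Append data to the writer's own log and remember its logical position."""
        log = self.logs.setdefault(writer, bytearray())
        self.index.append((logical_offset, len(data), writer, len(log)))
        log += data

    def read_all(self):
        """Rebuild the logical file by replaying the index in write order."""
        size = max((off + n for off, n, _, _ in self.index), default=0)
        out = bytearray(size)
        for off, n, writer, log_off in self.index:
            # Later index entries overwrite earlier ones at the same offset.
            out[off:off + n] = self.logs[writer][log_off:log_off + n]
        return bytes(out)


# Two writers interleave 4-byte records in one logical file.
f = LogStructuredFile()
f.write(0, 0, b"AAAA")
f.write(1, 4, b"BBBB")
f.write(0, 8, b"CCCC")
f.write(1, 12, b"DDDD")
```

Each writer's log only ever grows at the end, which is the kind of sequential stream a parallel file system handles well, while `read_all` still returns the bytes in the order the application expects.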
It’s an interesting technology for the HPC market and something I’ll be watching develop.