Luke Lonergan is a co-founder of Greenplum and the CTO of the Data Computing Products Division at EMC.
The Data Computing Products Division provides data computing solutions focused on "Big Data" processing. These include the Greenplum Database, Chorus, Embedded Analytics, Hadoop and other kinds of processing infrastructure to be discussed later. Their latest solution is the recently announced Greenplum Data Centre Appliance. I quickly snapped a picture of two GP1000s on the manufacturing floor at the Cork Center Of Excellence as they were heading towards shipping.
It was interesting to see finished product on its way out the door just a few days after I had spoken with Luke. I had some questions, he had some answers and, as per usual, I get to the point right up front.
Is the Database market done? The Online Transaction Processing (OLTP) business has been long dominated by three commercial entities, isn’t that game now over?
Over? No. Technologically, the “Database Wars” as we call them were fought by knights jousting in a field. By comparison, the coming battle is going to be fought in space. We're in the era of Big Data, hence the focus on really niche hardware designs from some as they struggle to square their existing technology with the needs of Online Analytical Processing (OLAP). There will be different players and new winners in Big Data.
Define Big Data for me? It conjures images of an ocean of storage.
Well, it can be an ocean of storage, and some customers have an ocean of storage, but Big Data isn't about quantity. It's about gaining quality insight from multiple data sources. It’s about being presented with what you need to make a decision.
OLTP. OLAP. What's the difference?
Let’s take online shopping, for example. OLTP is where you buy a product and it's recorded that something just happened: we sold something, so decrement the inventory number for that item and get the order packaged and shipped. OLTP is the now. OLAP is the process of examining everything that's happened and gleaning insight from what's relevant. When shopping online you've probably seen suggestions for products they think you might be interested in as a result of your past buying behaviour…
I've seen varying degrees of accuracy and some very unusual suggestions with those things.
Right. Some of those were developed from scratch over the course of a decade, and we have people working for us who worked on some of the most well known implementations at the very beginning. As you look at what followed, you can see the accuracy getting better and better as the ideas were refined. These systems now suggest things I'm interested in looking at even if I choose not to purchase.
That's a consumer facing example. If we look at it from the business side, let’s take the mobile phone provider. Customer churn is a massive problem, as it costs a lot more to win a customer back after you've lost them than it does to keep them. By examining a customer's usage behaviour, you could offer them a more favourable deal before they dump you and sign on with a competitor.
In both cases quality information leads to a beneficial outcome.
OLTP and OLAP are two distinct categories right now. Do you see those merging?
Yes.
Is the Greenplum Data Centre Appliance (DCA) a step towards that merge?
Wow, is that the time?!?
Ha! It strikes me that the appliances in this space can be oddities in the IT infrastructure. They don’t integrate well with what people have already bought.
That’s because of how they’re sold and I know this as I’ve been there. A Business Unit goes to the Operational IT team and says it has these particular requirements. Operational IT thinks about it, comes back with how long it’ll take to get approval, how long it’ll take to deploy, configure, test and put into production as well as what SLAs they can provide when it’s running. This process usually takes so long that the Business Unit is out the door and looking at other options. They’re not in the business of provisioning and managing infrastructure so they want this up and running in the next hour, not 6 to 12 months from now. They want the box with a power button on the front and a network cable coming out the back.
The strength of the appliance model is that it lands on the floor tested and configured at the point of manufacture. The weakness has been that many of these products are infrastructure islands. What’s good for the Business Unit can be bad for the Operational IT team, as the appliance stands apart from all the existing production infrastructure.
What we’ve done with the Greenplum Data Centre Appliance is bridge that gap between the requirements of the Business Unit and the requirements of Operational IT. The DCA can be deployed and operated as a standalone appliance: turn it on and data goes in while decisions come out. But you can connect it to an EMC array if you choose, replicate it with RecoverPoint and back it up to Data Domain. You’re now storing the data on your production arrays, getting long distance continuous remote replication with bookmarking, and backing it up to deduplication storage with built in integrity checking and bandwidth optimized replication.
All of that means it’s no longer an island in your data centre. It’s part of the infrastructure.
It’s all industry standard components here. There are Intel processors, disk drives and Ethernet. Nothing esoteric.
We don’t need anything esoteric. We put 16 segment servers in a rack and have qualified it to scale out to 24 racks. There are 4608 Intel processing cores in that 24 rack configuration and we can use all of them. I'm not ruling out adopting anything beyond processors, Ethernet and disk drives if it makes sense for us to do so, but we get our performance from our Massively Parallel Processing architecture. What the appliance model does for us is increase our delivery options to customers: standalone software, appliances and virtual appliances for virtual infrastructure.
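For readers who want to check the scale-out figures, the short calculation below works backwards from the numbers Luke quotes. Only the 16 servers per rack, 24 racks and 4608 cores come from the interview. The per-server and per-rack core counts are derived, and they assume that every counted core sits in a segment server, which the interview does not state.

```python
# Figures quoted in the interview.
SEGMENT_SERVERS_PER_RACK = 16
MAX_RACKS = 24
TOTAL_CORES = 4608

def scale_out_summary(servers_per_rack, racks, total_cores):
    """Derive per-server and per-rack core counts from the quoted totals.

    Assumption: all quoted cores belong to segment servers.
    """
    servers = servers_per_rack * racks
    cores_per_server, remainder = divmod(total_cores, servers)
    if remainder:
        raise ValueError("total cores do not divide evenly across servers")
    return {
        "segment_servers": servers,
        "cores_per_server": cores_per_server,
        "cores_per_rack": cores_per_server * servers_per_rack,
    }

if __name__ == "__main__":
    print(scale_out_summary(SEGMENT_SERVERS_PER_RACK, MAX_RACKS, TOTAL_CORES))
```

Under that assumption the quoted totals are self-consistent: 384 segment servers at 12 cores each gives 4608 cores.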
Are you working with VMware?
We have a number of active projects with VMware. In the near term we’re focusing with them on operation in purely virtualized infrastructure. That’s the starting point.
Tell me about Greenplum Chorus?
Chorus provides a collaboration framework for data analysis and sharing. We’ve wrapped it in an interface kind of like Facebook as we want to make it easy to create sandboxes and share documents and data between people of different skill sets who have different data requirements. We’re working to get the Chorus Beta ready for early 2011 so we’ll be talking more about that the closer we get to the Beta.
Greenplum and Cloudera made an announcement recently around MapReduce. Where’s the win there? I thought Greenplum already supported MapReduce?
MapReduce (MR) is like Assembly language for data programming. Greenplum has supported a MapReduce interface directly into our parallel execution engine for several years. Our relationship with Cloudera allows us to support even more configurations of Greenplum and Hadoop, Hadoop being one of the popular flavours of MR. Programmers can need a higher level of abstraction, though, so our objective is to facilitate alternative bindings for Perl, Python, C, Java and other languages in a naturally parallel manner. Our high level SQL interface is more powerful for the experienced data professional, and in addition to SQL and MR we are providing other powerful programming abstractions.
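The "Assembly language" comparison is easier to see with a small example. The sketch below is a generic, single-process word count written in the MapReduce style in plain Python. It is not Greenplum's MapReduce interface or Hadoop's API, and the function names are made up for this illustration.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    """Map: turn one input record into (key, value) pairs.
    Each call is independent, so calls could run on separate nodes."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: group all values by key so each key reaches one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all the values for one key into a single result."""
    return key, sum(values)

def map_reduce(records):
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Roughly the same question in SQL, assuming a one-column table words(word):
#   SELECT word, COUNT(*) FROM words GROUP BY word;

if __name__ == "__main__":
    print(map_reduce(["big data", "Big insight", "data data"]))
```

The programmer spells out the map, shuffle and reduce steps by hand, where the SQL version states only the result it wants. That gap is the higher level of abstraction Luke is referring to.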
Got any Scott McNealy stories you want to share?
Well, that depends on how much beer you have...?
