Cloudera is, by most estimates, the market share leader for commercial distributions of the big data framework Hadoop. But the company evidently has ambitions beyond Hadoop itself, which is primarily a system for distributing data storage and processing across clusters of hardware.
The real value of Hadoop for businesses is as a platform for analytics, and Cloudera has been building something of a data science brain trust to lead its push up the analytics stack.
The team is led by chief scientist Jeff Hammerbacher who, before co-founding Cloudera, led the data analytics unit at Facebook.
And the latest addition, Cloudera revealed this week, is an alum of London's very own Tech City.
Sean Owen is a US-born software engineer by training who, having once worked for Google, came to London to work for a venture capital firm.
As a hobby, Owen would work on a number of open source software development projects. One of these was Apache Mahout, whose aim was to use Hadoop as a platform for machine learning – the use of algorithms to automatically infer patterns from data.
"Before Hadoop, most people thought about computations as something you did on a single computer," Owen explains. That means most of the established machine learning algorithms do not work with Hadoop's highly parallelised approach to computation.
Porting machine learning to Hadoop offered rich improvements. "In general, machine learning works better with more data," Owen explains. "The more data you have, the better your predictions will become."
"The limiting factor for a lot of applications has been their ability to use more and more data types, data sources, volume of data," he adds.
Using Hadoop as a platform for machine learning, he claims, "means you can go through 100 times more data."
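To illustrate why single-machine algorithms have to be restructured for a parallel platform, here is a minimal sketch in plain Python of a map-and-reduce style computation of an average. It does not use Hadoop or Mahout; the function names and the sample partitions are hypothetical, and the list comprehension merely stands in for work that a cluster would spread across machines.

```python
from functools import reduce

def map_partition(values):
    """Map step: summarise one partition of the data as (sum, count)."""
    return (sum(values), len(values))

def combine(a, b):
    """Reduce step: merge two partial summaries into one."""
    return (a[0] + b[0], a[1] + b[1])

def distributed_mean(partitions):
    """Compute a mean from per-partition summaries instead of one big list."""
    # On a real cluster each partition would be summarised on a different machine.
    partials = [map_partition(p) for p in partitions]
    total, count = reduce(combine, partials, (0.0, 0))
    return total / count if count else 0.0

# Hypothetical data, already split into three partitions.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean = distributed_mean(partitions)
```

The point of the sketch is that each partition is reduced to a small summary that can be merged in any order, which is the property an algorithm needs before it can be spread over many machines.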
So engrossed in the project did Owen become that last year, he decided to found his own start-up, Myrrix. The company offers a recommendation engine, one of the best known business applications of machine learning, based on Apache Mahout.
The most familiar recommendation engine is Amazon.com's system for suggesting books and DVDs its customers might like. Owen says that around two thirds of Myrrix's customers are either in e-commerce, or operate large content sites that recommend videos or songs their visitors might like.
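As a rough illustration of the idea behind a recommendation engine, the sketch below suggests items that were frequently bought alongside things a customer already owns. It is a simple co-occurrence count in plain Python, not the algorithm used by Myrrix, Mahout or Amazon; the function names and the sample baskets are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence(baskets):
    """Count how often each pair of items appears in the same basket."""
    counts = defaultdict(lambda: defaultdict(int))
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def recommend(counts, owned, top_n=3):
    """Rank items the customer does not own by co-occurrence with owned items."""
    scores = defaultdict(int)
    for item in owned:
        for other, n in counts.get(item, {}).items():
            if other not in owned:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

# Hypothetical purchase histories.
baskets = [
    ["book_a", "book_b", "dvd_x"],
    ["book_a", "book_b"],
    ["book_b", "dvd_y"],
    ["book_a", "dvd_x"],
]
counts = build_cooccurrence(baskets)
suggestions = recommend(counts, {"book_a"})
```

Production systems typically use more sophisticated models and far larger datasets, but the input is the same kind of record of who bought or viewed what.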
"One of our biggest customers is a fashion flash sale website in the Middle East," he explains. "Their products only last for 24 hours, so they need to learn what products to target in a very short amount of time."
There are non-web applications of the technology too, Owen says. "If I'm a bank, I could analyse my customers' transactions and see which kinds of customer shop where," he explains. "Then, if there's a customer who I would expect to be shopping at a particular store but isn't, I could help that store target that customer."
Tech City roots
Owen worked from home and at Google's Campus workspace near Old Street roundabout. "It's important to be around other start-ups, to see what they need, what they're interested in."
Indeed, had it not been for the Tech City start-up scene, which he got to know while working in venture capital, Owen might never have started Myrrix, he says.
"I wouldn't have set up shop in London if there wasn't a tech scene there," he says. "I needed to meet engineers and potential co-founders, and I wanted to talk to 100 different start-ups about what they needed."
"As a network, Tech City is strong and it's useful."
The next stage in Owen's plan had been to raise investment for Myrrix. As it happens, though, he got talking to Hammerbacher and another Cloudera data scientist, Josh Wills.
"They had a very similar impression of what needed to be done as I do, and they wanted to do the same things," he says. "That was a big piece of the appeal of Cloudera for me."
Cloudera has now acquired Myrrix and appointed Owen as its London director of data science. He will continue his work on machine learning, he says, with the aim of delivering what he calls "big learning" – the machine learning equivalent of big data.
Part of the plan is to do for machine learning what Cloudera (and others) have done for Hadoop: turn a complex code base into a business-ready product.
"I don't know exactly what form it's going to take yet, and we're going to be doing a lot of work with customers to find out what they need," he says.
There's certainly plenty of work to be done, Owen says. "We could spend years on doing this. There's so much ground to cover."