The Hadoop ecosystem and big data have moved from buzzwords and hype to business reality, as organisations realise the value and begin to understand the use cases. But for many, the pressure to use big data may be outpacing their ability to store all that data effectively and appropriately for processing.
A data lake is one of the most common and cost-effective approaches to storing big data in one place while removing the silos that separate data sets in relational database environments.
It works on the idea that a single vast reservoir allows all of your enterprise data to be stored in one place, and accessed and used equally by every business application, without any need to specially prepare it.
It can also cut costs, because the data repository can be built on commodity servers, which are significantly cheaper than the hardware needed for more traditional data warehousing platforms.
But building a data lake alone won’t solve all of your big data analytics worries. ‘Solely building a data lake starts this from the wrong place,’ says Patrick McFadin, chief evangelist for Apache Cassandra at DataStax.
‘Instead, you should ask: why are we storing more data in the first place? Can it be used in specific ways to improve service or reduce customer churn?’
It’s possible to apply analytics to data lakes, but getting value out of information after it has been stored is often more difficult than developing the data storage and analytics strategies together. McFadin warns: ‘It’s like moving a boulder – it’s very difficult to get a static rock moving, but much easier to keep a big project going once it’s moving.
‘I suggest thinking about big data in three ways. First is new transaction data – you have to be able to keep up with all this new data being created by new devices or new actions. Second is near real-time analytics – can you capture that stream of transaction data and use it as part of delivering a service back to the customer? Third is graph, which models the relationships between objects and uses this to find patterns.’
> See also: The 5 phases of overcoming hybrid cloud data integration
Alongside Hadoop for storing data, technologies are developing to fit these requirements: Cassandra for transaction data, Spark for analytics, and distributed graph databases for modelling relationships.
‘This combination of transaction data, analytic data and graph data can be really powerful,’ says McFadin, ‘particularly when you are able to run multiple data models alongside each other. This big data approach can create more value in the moment, rather than keeping huge amounts of data “just in case”.’
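To make that idea concrete, here is a minimal sketch of what running analytics alongside transaction data can look like: Spark reading a Cassandra transaction table and aggregating it per customer. The keyspace, table and column names are hypothetical, and the connector coordinates are an assumption you would pin to your own Spark version.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("transactions-analytics")
    # Connector coordinates are an assumption; match the version to your Spark build.
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.5.0")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

# Read the live transaction table straight out of Cassandra
# (keyspace and table names here are hypothetical).
txns = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="shop", table="transactions")
    .load()
)

# A simple aggregate over the working data set: spend and transaction count
# per customer, the kind of figure that could feed a churn model or dashboard.
revenue = (
    txns.groupBy("customer_id")
    .agg(F.sum("amount").alias("total_spend"),
         F.count("*").alias("txn_count"))
)
revenue.show()
```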
Blowing hot and cold
As we’ve seen, by developing analytics and storage strategies in tandem, organisations have a place to start. But one size does not fit all when it comes to big data storage requirements.
When assessing where to store their data, organisations need to make use of the various tiers of their platform, weighing the volume, velocity and variety of each data set to decide where it belongs.
As Richard Simmons, chief technologist for cloud solutions company Logicalis UK, explains, a useful rule of thumb is to classify data storage needs as cold, warm or hot.
‘Ideally, organisations want to store data as cost effectively as possible in a way that still allows it to be accessible to employees,’ says Simmons. ‘What they don’t want to do is put all their data into an expensive platform from the get go, or equally place it in a cold layer, as employees will find it difficult to access. It’s about finding the right balance.’
The process of moving data between tiers needs to be seamless, with storage shifting to warm and then hot as value is established and the need for performance increases.
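As a minimal illustration of the tiering idea, the sketch below classifies a data set as hot, warm or cold purely by how recently it was accessed. The thresholds are illustrative assumptions; a real policy would also weigh access frequency, data value and service-level targets.

```python
from datetime import datetime, timedelta
from typing import Optional

# Illustrative thresholds, not a recommendation.
HOT_WINDOW = timedelta(days=7)     # actively queried: fast, expensive storage
WARM_WINDOW = timedelta(days=90)   # occasionally queried: mid-tier storage

def choose_tier(last_accessed: datetime, now: Optional[datetime] = None) -> str:
    """Classify a data set as 'hot', 'warm' or 'cold' from its last access time."""
    age = (now or datetime.utcnow()) - last_accessed
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    return "cold"

# A data set last touched 30 days ago lands on the warm tier.
print(choose_tier(datetime.utcnow() - timedelta(days=30)))  # -> warm
```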
Ultimately, the key requirements of big data storage are that it can handle large amounts of data, it’s flexible enough for data to be moved, it’s capable of scaling to keep up with data growth, and it can deliver appropriate, fit-for-purpose performance to the data analytics engines. Making all of this work in a cost-effective manner can be a challenge.
Storage options and costs vary significantly based on performance expectations and availability specifications. Power and cooling can come to dominate acquisition costs over a system’s lifetime, so organisations should consider total cost of ownership rather than just total cost of acquisition. And as Joe Fagan, senior director of cloud initiatives EMEA at data storage company Seagate, advises, in a very large pool of storage, devices and interconnects will fail.
‘The uptime specification, together with recovery time objectives (RTOs), will dictate the cost of adding redundancy to the storage,’ he says. ‘Big data is often protected today by triplicating the data to deliver the reliability and performance, taking a hit on capacity costs. Erasure coding techniques are reducing that 3x capacity requirement, but are not yet mature outside a few environments or implementations.’
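The capacity arithmetic behind Fagan’s point is easy to sketch. Assuming a Reed-Solomon-style k+m scheme (the 10+4 layout below is an illustrative choice, not a recommendation), erasure coding can protect the same data with far less raw capacity than triplication:

```python
def raw_capacity_needed(usable_tb: float, data_shards: int, parity_shards: int) -> float:
    """Raw storage needed to hold `usable_tb` under a k+m erasure-coding scheme.

    Triplication is the degenerate case of one data shard plus two full copies,
    i.e. an overhead factor of 3.0.
    """
    overhead = (data_shards + parity_shards) / data_shards
    return usable_tb * overhead

usable = 1000.0  # 1 PB of user data, expressed in TB

print(raw_capacity_needed(usable, 1, 2))   # triplication: 3000.0 TB raw (3.0x)
print(raw_capacity_needed(usable, 10, 4))  # RS(10,4):     1400.0 TB raw (1.4x)
```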
People power
The problem with data lake environments is that they can be extremely complex to manage and build, driving up costs.
‘The difference between a data lake and a big data set is that the latter is expected to accommodate analytics workloads only, while a data lake is also expected to support normal transactional workloads, so the lake contains the working data set too. The advantage is that you’re analysing live data,’ says Fagan. ‘Disadvantages are complexity, cost and reliability.’
Because of this inherent complexity, companies might not have the necessary skills in-house, so they need to educate their teams in order to manage the technology appropriately.
‘It’s very easy for organisations to build a platform they can just throw data into,’ says Simmons. ‘However, if they don’t have a good governance strategy around what data they are putting in, it can very quickly become a swamp rather than a lake – ill-managed, poor quality and impossible to draw value from.’
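One common guard against the swamp is to refuse any data set that arrives without minimum governance metadata. The sketch below illustrates the idea; the required tags (owner, source system, retention period) are illustrative assumptions, not a standard.

```python
# Minimum governance metadata every data set must carry before entering the lake.
# These tag names are illustrative, not a standard.
REQUIRED_TAGS = {"owner", "source_system", "retention_days"}

def validate_ingest(metadata: dict) -> None:
    """Reject data sets that arrive without the minimum governance metadata."""
    missing = REQUIRED_TAGS - metadata.keys()
    if missing:
        raise ValueError(f"refusing ingest: missing governance tags {sorted(missing)}")

# A properly tagged data set is admitted; an anonymous dump is turned away.
validate_ingest({"owner": "analytics", "source_system": "crm", "retention_days": 365})
```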
> See also: The road to high-quality analytics starts with high-quality data
Another often-overlooked ‘people and process’ hurdle is adoption.
There’s a danger that an organisation will build a data lake assuming employees will naturally come and use it. But if it has neglected to consider how the workforce will actually use the lake and how it will benefit the business, it can end up with a very expensive data lake that provides little value.
First and foremost in any big data storage strategy is aligning people with the data strategy by identifying why you are looking at your data. If you are unable to identify value from your data and are putting the technology cart before the business value horse, then the whole exercise is pointless, argues Simmons.
‘There are a lot of businesses that feel pressured to do something with big data, simply because of its current hype, but haven’t yet identified the business challenge they are trying to overcome,’ he says. ‘In simple terms, the analysis of big data should either save a company money or identify avenues to make additional profit. If it doesn’t do either of these things then there is no point to it.’
> See also: Why open source can save companies drowning in the data lake
Businesses today have the ability to be more creative with their data than ever before thanks to the technological opportunities that weren’t available to them five or ten years ago.
The savvy CIO will ask what it is they are trying to achieve and how they are going to measure that success, and develop their big data storage strategy from there.
Another key part of any strategy is looking at existing data in new and interesting ways, and being more creative about how data can be used to overcome business challenges.
‘Ideally, you would have a smaller, more focused, team of creative people, such as data analysts and data scientists, who spend their time solely looking at your data and finding new ways to apply it,’ says Simmons.
That kind of engine, he argues, builds the business cases that justify the investment in storing all that data.