Data lakes: Does size really matter?

  • Bad governance will see data lakes become data swamps
  • Look at the value of the data, not the size of the lake

 
The ‘data lake’ – where disparate sources of data are stored in a single repository, with a processing overlay – is shaping up to be the next big thing, but the concept is fraught with challenges, according to Teradata Corp chief technology officer Stephen Brobst.
 
Citing findings from analyst firm Gartner Inc, he said that through 2018, 90% of deployed data lakes would be useless, as they would be overwhelmed with information assets captured for uncertain use cases.
 
And the size of the data lake does not really matter, he added.
 
“Most companies measure the success of their data lake by looking at how big it is. But the size of the data lake has no value unless you’re curating the data.
 
“Another thing is that we don’t know what we’re bringing into the data lake. There may be copies of the same data, yet we’re still dumping it in because the bigger your data lake, the more ‘successful’ you are,” he told the Teradata Innovation Forum 2016 in Kuala Lumpur recently.
 
 
This is a reflection of bad governance – the data lake is becoming a data dumping ground, Brobst argued.
 
“If we’re not doing a good job of governance with our data lakes, we’re going to get a data swamp – a data-dumping ground that has no value – and this is not good.
 
“We have to curate the data that we put in, meaning we’re doing the care and feeding of the data asset,” he added.
 
Brobst likened the data curation process to house-cleaning: “No-one likes to do it,” but it must be done.
 
“Data curation includes finding the right data structures to map into data stores; creating metadata to describe the schema for storing the data; integration across multiple data stores; and lifecycle management.
 
“And I argue that … before we put data into the data lake, we should automate the process of capturing what data we put in there, who put it there, and when we put it there,” he said.
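The kind of automated capture Brobst describes can be sketched in a few lines of code. The snippet below is a minimal, hypothetical illustration – the ingest function, the landing-zone path and the catalogue format are assumptions made for the example, not Teradata’s implementation – that records what was loaded, who loaded it and when, alongside a content hash that helps flag duplicate copies of the same data.

```python
import getpass
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

LAKE_DIR = Path("data_lake")           # hypothetical landing zone for raw files
CATALOG = LAKE_DIR / "catalog.jsonl"   # append-only metadata catalogue


def ingest(source: str, description: str) -> dict:
    """Copy a file into the lake and record provenance metadata:
    what was ingested, who ingested it, and when."""
    src = Path(source)
    LAKE_DIR.mkdir(exist_ok=True)
    dest = LAKE_DIR / src.name
    shutil.copy2(src, dest)

    record = {
        "file": dest.name,
        "description": description,                              # what the data is meant for
        "sha256": hashlib.sha256(src.read_bytes()).hexdigest(),   # helps spot duplicate copies
        "ingested_by": getpass.getuser(),                         # who put it there
        "ingested_at": datetime.now(timezone.utc).isoformat(),    # when it was put there
    }
    with CATALOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record


if __name__ == "__main__":
    print(ingest("sales_2016.csv", "Raw point-of-sale export, KL region"))
```

In practice the same record would more likely be pushed to a data catalogue or governance tool than to a flat file; the point is that provenance is captured at ingest time rather than reconstructed later.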
 
According to Brobst, companies can also use crowdsourcing to gather metadata. He cited Wikipedia as an example, where readers contribute to managing the content of the online encyclopaedia.
 
“Just like Wikipedia, where people will constantly argue over and edit the topics that are most read, data that are constantly being used become more accurate.
 
“Topics that no-one reads about on Wikipedia will be left unedited and unchallenged – and the same goes for data.
 
“Data that are not useful will be left behind as companies know there is no need for them to invest time to analyse those data,” he added.
 
 
 

 