Data lake: “A massive, easily accessible data repository built on (relatively) inexpensive computer hardware for storing big data.” (Source: Wiktionary.) “The opposite of a data warehouse, meaning they're huge pools of data stored in its original format instead of being collated, sorted and filed.” (Source: ReadWrite.)
The last several years have seen a growing focus on the uses of “big data” (the term was a strong contender for word of the year in 2012). Perhaps inevitably, there’s been a reaction against what’s perceived as big-data hype. Now, reports Matt Asay in the online technology publication ReadWrite, a prominent industry analyst, Gartner’s Nick Heudecker, “has come out swinging” against “the data lake fallacy”:
It was … surprising to see Heudecker go after one of the latest buzzwords making its way around Big Data circles: the data lake. Espoused by a variety of vendors (usually Hadoop vendors, but not exclusively so), the data lake is a mythical happy place for data to reside in its native format until someone within the enterprise needs to analyze it.
(Hadoop is “an open-source framework for processing and storing vast amounts of business data.” It is a software project rather than a company; it was started in 2005 and named after a toy elephant owned by the son of one of its creators.)
From a Gartner press release dated July 28, 2014:
“The fundamental issue with the data lake is that it makes certain assumptions about the users of information,” said Mr. Heudecker. “It assumes that users recognize or understand the contextual bias of how data is captured, that they know how to merge and reconcile different data sources without ‘a priori knowledge’ and that they understand the incomplete nature of datasets, regardless of structure.”
While these assumptions may be true for users working with data, such as data scientists, the majority of business users lack this level of sophistication or support from operational information governance routines. Developing or acquiring these skills or obtaining such support on an individual basis, is both time-consuming and expensive, or impossible.
Coinage of “data lake” is credited to James Dixon, co-founder and chief technology officer of the open-source business-analytics company Pentaho*. In an October 14, 2010, blog post, Dixon wrote:
If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
It’s possible that the lake metaphor was influenced just a bit by the data cloud metaphor, which has been in circulation for about 20 years. When referring to new, abstract, and abstruse phenomena, it may help to create parallels to old, concrete, and familiar concepts.
* The “penta” in Pentaho is a reference to the company’s five founders. I haven’t found an explanation for the “ho”; the company is based in Orlando, Florida, not Tahoe or Idaho.