Intelligencia.co

Data Lakes

A data lake is a hub or repository of all data that business users have access to, where the data is ingested and stored in as close to its raw form as possible without enforcing any restrictive schema. This gives anyone an unrestricted view of the data for running ad-hoc queries and performing cross-source navigation and analysis on the fly. Successful data lake implementations respond to queries in real time and give users an easy, uniform access interface to the disparate sources of data. Data lakes retain all data, support all data types and all users, adapt easily to change and provide faster insights.

With a conventional data warehouse, a business must decide on the structure (schema) of the data when creating the warehouse, before it is populated with any data (schema on write). With a Hadoop-based data lake like Datameer, however, the business simply stores the data and structures it later, when it is needed for each query or use case (schema on read).
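The difference between the two approaches can be illustrated with a minimal Python sketch. It is not how Datameer or Hadoop implement either model. The sample records, the `WAREHOUSE_COLUMNS` tuple and both function names are hypothetical, chosen only to show where the schema is applied.

```python
import json

# Hypothetical raw events, stored exactly as they arrived.
RAW_EVENTS = [
    '{"user": "alice", "amount": "19.99", "channel": "web"}',
    '{"user": "bob", "amount": "5.00"}',
    '{"user": "carol", "amount": "12.50", "device": {"os": "ios"}}',
]

# Hypothetical fixed schema, decided before any data is loaded.
WAREHOUSE_COLUMNS = ("user", "amount", "channel")


def load_schema_on_write(lines):
    """Schema on write: validate each record against the fixed schema at load time."""
    stored, rejected = [], []
    for line in lines:
        record = json.loads(line)
        if set(record) == set(WAREHOUSE_COLUMNS):
            stored.append({
                "user": record["user"],
                "amount": float(record["amount"]),
                "channel": record["channel"],
            })
        else:
            # Records that do not fit the schema never reach the warehouse.
            rejected.append(line)
    return stored, rejected


def query_schema_on_read(lines, fields):
    """Schema on read: leave raw lines untouched and pick fields per query."""
    for line in lines:
        record = json.loads(line)
        # Missing fields come back as None instead of blocking the load.
        yield {name: record.get(name) for name in fields}


stored, rejected = load_schema_on_write(RAW_EVENTS)
print(len(stored), "stored,", len(rejected), "rejected")

# A question nobody anticipated at load time can still be asked of the raw data.
print(list(query_schema_on_read(RAW_EVENTS, ["user", "device"])))
```

In this sketch the fixed schema rejects two of the three records, while the schema-on-read query can still reach the `device` field that the warehouse schema never planned for.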

In recent years, businesses have come to realize that data warehouses, while perfectly able to handle the BI and analytics needs of yesterday, don’t always work in today’s complex IT environments, which contain structured, unstructured and semi-structured data. Relational databases worked just fine when business users were restricted to proprietary databases and the scope of work was limited to canned reports and modest dashboards with limited drill-down functionality. Today, however, with the inclusion of so much unstructured social media and IoT data, limitations abound. Data warehouses need built-in, understandable schemas, but unstructured data, by definition, doesn’t have schemas that are definable, accessible and understandable in every case. Data lakes were a response to these sizeable limitations.

As Sundeep Sanghav (2014) explains in his article Data Lakes vs. Data Warehouses:

A data lake is a hub or repository of all data that any organization has access to, where the data is ingested and stored in as close to the raw form as possible without enforcing any restrictive schema. This provides an unlimited window of view of data for anyone to run ad-hoc queries and perform cross-source navigation and analysis on the fly. Successful data lake implementations respond to queries in real-time and provide users an easy and uniform access interface to the disparate sources of data.

In particular, data warehouses can’t handle many aspects of Big Data because, as Sanghav (2014) notes, “it is very difficult to ascertain upfront all the intelligence and insights one would be able to derive from the variety of different sources, including proprietary databases, files, 3rd party tools to social media and web, that keep cropping up on a regular basis.” While a business user or company executive might have a long list of questions they want answered during the setup phase, they have no way of knowing what they might want to ask once the answers to the first batch of questions come in (Sanghav 2014). For a sports book, the ability “to navigate from a starting question or data point to different directions, slicing and dicing the data in any ad-hoc way that the train-of-thought of analysis demands is essential for real data discovery,” Sanghav continues (2014). Those Rumsfeldian “unknown unknowns” can be the most revelatory discoveries of all and, by definition, they are not on the table at the start of these questioning exercises.

Data lakes do carry substantial risks, however. As Gartner (2014) warns: “The most important is the inability to determine data quality or the lineage of findings by other analysts or users that have found value, previously, in using the same data in the lake.” The warning continues: “By its definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp. And without metadata, every subsequent use of data means analysts start from scratch” (Gartner 2014).
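One way to picture the "descriptive metadata and a mechanism to maintain it" that Gartner calls for is a small catalog that records where each dataset came from. The following Python sketch is an illustration under stated assumptions, not a description of any particular governance product. The `DatasetEntry` and `Catalog` classes and the dataset names are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DatasetEntry:
    """Descriptive metadata for one dataset in the lake (hypothetical fields)."""
    name: str
    source: str
    description: str
    derived_from: list = field(default_factory=list)
    registered_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class Catalog:
    """Minimal in-memory metadata catalog that tracks dataset lineage."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Refuse datasets whose parents are unknown, so lineage never has gaps.
        missing = [p for p in entry.derived_from if p not in self._entries]
        if missing:
            raise KeyError(f"unknown parent datasets: {missing}")
        self._entries[entry.name] = entry

    def lineage(self, name):
        """Return every ancestor of a dataset, nearest first."""
        seen = []
        queue = list(self._entries[name].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent not in seen:
                seen.append(parent)
                queue.extend(self._entries[parent].derived_from)
        return seen


catalog = Catalog()
catalog.register(DatasetEntry("raw_clicks", "web logs", "Unprocessed clickstream"))
catalog.register(
    DatasetEntry("sessions", "derived", "Clicks grouped into sessions", ["raw_clicks"])
)
catalog.register(
    DatasetEntry("churn_scores", "derived", "Per-user churn estimates", ["sessions"])
)

print(catalog.lineage("churn_scores"))  # ['sessions', 'raw_clicks']
```

With such a record in place, an analyst who finds value in `churn_scores` can trace it back to the raw data it was built from instead of starting from scratch.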
