Data Lakes

Perhaps one of the most interesting data warehouse developments of the last decade has been the introduction of Hadoop and its Hadoop Distributed File System (HDFS). At its core, Hadoop is a distributed data store that provides a platform for implementing powerful parallel processing frameworks. The reliability of this data store when storing massive volumes of data, coupled with its flexibility in running multiple processing frameworks, makes it an ideal choice as the hub for all of your data. This characteristic of Hadoop means that you can store any type of data as-is, without placing any constraints on how that data is processed. Hadoop is a schema-on-read data warehouse: raw, unprocessed data can be loaded into it, with structure imposed at processing time based on the requirements of the processing application.
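The schema-on-read idea can be sketched outside Hadoop with plain Python: raw records are stored exactly as they arrive, and each consumer imposes its own structure at read time. The file layout and field names below are illustrative assumptions, not part of any Hadoop API.

```python
import json
import tempfile

# Write raw, heterogeneous records as-is -- no schema is enforced at load time.
raw_records = [
    '{"user": "alice", "amount": 12.5, "note": "refund"}',
    '{"user": "bob", "amount": 7}',          # missing "note" field
    '{"user": "carol", "extra": true}',      # unexpected field, no "amount"
]

with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write("\n".join(raw_records))
    path = f.name

def read_with_schema(path):
    """A consumer imposes its own schema at read time: it decides which
    fields matter and how to handle records that do not fit."""
    rows = []
    with open(path) as src:
        for line in src:
            rec = json.loads(line)
            rows.append({"user": rec.get("user"),
                         "amount": float(rec.get("amount", 0.0))})
    return rows

rows = read_with_schema(path)
print(rows[1]["amount"])  # 7.0 -- structure applied on read, not on write
```

A second consumer could read the same raw file with an entirely different schema, which is the flexibility the schema-on-read model is meant to preserve.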

Although the ability to store all of a company’s raw data in a Hadoop DW is a powerful option, there are still many factors that should be considered before putting this method into practice. As Grover et al. point out in their book Hadoop Application Architectures: Designing Real-World Big Data Applications, the following factors must be considered:

  • How the data is stored: A number of different file formats and compression formats are supported on Hadoop. Each has particular strengths and weaknesses that make it better suited to specific applications. Additionally, although Hadoop provides HDFS for storing data, several other commonly used systems implemented on top of HDFS provide additional functionality, and these should also be taken into consideration.
  • Multi-tenancy: It’s common for clusters to host multiple users, groups, and application types, which affects how the management and storage of data should be planned.
  • Schema design: Despite Hadoop being schema-less, there are still important decisions to make when devising the structure of data stored in Hadoop, including directory structures for data loaded into HDFS as well as the output of data processing and analysis.
  • Metadata: As with any data management system, cataloging and storing the metadata related to the stored data is as important as cataloging and storing the data itself.
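One common schema-design convention is to encode the processing stage and ingest date into the HDFS directory path, so that downstream jobs can locate and prune data by partition. A minimal sketch of that convention, using made-up zone and dataset names and plain path strings rather than a real HDFS client:

```python
from datetime import date

def partition_path(zone, source, dataset, day):
    """Build a partitioned data-lake path.

    zone:    processing stage, e.g. "raw", "staged", or "processed"
    source:  originating system the data was ingested from
    dataset: logical table or feed name
    day:     ingest date, encoded as year=/month=/day= partitions
    """
    return (f"/data/{zone}/{source}/{dataset}/"
            f"year={day.year}/month={day.month:02d}/day={day.day:02d}")

p = partition_path("raw", "crm", "orders", date(2018, 3, 15))
print(p)  # /data/raw/crm/orders/year=2018/month=03/day=15
```

Keeping raw, staged, and processed data in separate zones like this also makes it easier to apply different retention and access policies per stage.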

As Grover et al. point out, “One of the most fundamental decisions to make when architecting a solution on Hadoop is determining how data will be stored in Hadoop. There is no such thing as a standard data storage format in Hadoop.” They further note that “Hadoop allows for storage of data in any format, whether it’s text, binary, images, etc. Hadoop also provides built-in support for a number of formats optimized for Hadoop storage and processing.” This gives users complete control over their source data, and there are a number of options for how that data can be stored: not just the raw data being ingested, but also the intermediate data generated during processing and the results of that processing. The complexity of the choice grows quickly as one works through the process. Major considerations for Hadoop data storage include:

  • File format: These include plain text and Hadoop-specific formats such as SequenceFile. There are also more complex but functionally richer options such as Avro and Parquet. Each format comes with its own strengths and weaknesses, making it more or less suitable depending on the application and the source data types being ingested. As Hadoop is customizable, it is also possible to create one’s own file format.
  • Compression: Although this choice is more straightforward than selecting a file format, the compression codecs commonly used with Hadoop have their own characteristics: some compress and decompress faster but don’t compress as aggressively; others create smaller files but take longer to compress and decompress and, not surprisingly, require more CPU power. Whether a compressed file can be split is also a very important consideration when working with data stored in Hadoop.
  • Data storage: Although Hadoop data is stored in HDFS, there are decisions to make about the underlying storage manager, i.e., whether to use HBase or HDFS directly to store the data.
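The compression trade-offs above can be observed directly with codecs from the Python standard library. Here gzip, bz2, and lzma stand in for the Hadoop-side codecs such as Snappy or LZO (which are not in the stdlib); the point is only that different codecs trade compression ratio against CPU cost on the same input.

```python
import bz2
import gzip
import lzma

# Repetitive sample data, roughly like log lines -- it compresses well.
data = b"2018-03-15 INFO request served in 12ms\n" * 5000

codecs = {"gzip": gzip.compress, "bz2": bz2.compress, "lzma": lzma.compress}
sizes = {name: fn(data).__len__() for name, fn in codecs.items()}

for name, size in sizes.items():
    print(f"{name}: {len(data)} -> {size} bytes")

# Every codec shrinks this input, but by different amounts; in general the
# faster codecs produce the larger outputs, and the slower ones the smaller.
```

Timing each call with `time.perf_counter` on larger inputs makes the speed side of the trade-off visible as well; splittability, by contrast, is a property of the container format and codec together and cannot be shown this simply.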

 

© 2017-2018 Intelligencia Limited. All Rights Reserved.

Contact

MACAU:

Rua da Estrela, No. 8, Macau

Macau: +853 6616 1033

 

HONG KONG:

505 Hennessy Road, #613, Causeway Bay

HK: +852 5196 1277

andrew.pearson@intelligencia.co