Hadoop

Perhaps one of the most interesting data warehouse developments of the last decade has been the introduction of Hadoop and its Hadoop Distributed File System (HDFS). At its core, Hadoop is a distributed data store that provides a platform for implementing powerful parallel processing frameworks. The reliability of this data store for massive data volumes, coupled with its flexibility in running multiple processing frameworks, makes it an ideal choice as the hub for all of your data. This characteristic of Hadoop means that you can store any type of data as-is, without placing any constraints on how that data is processed.

Hadoop is a Schema-on-Read data warehouse, meaning raw, unprocessed data can be loaded into it, with the structure imposed at processing time based on the requirements of the processing application (a brief sketch of this follows the list below). Although the ability to store all of a company’s raw data in a Hadoop DW is a powerful option, there are still many factors that should be considered before putting this method into practice, including:

  • How the data is stored: There are a number of different file formats and compression formats supported on Hadoop. Each of these has particular strengths and weaknesses that make it better suited to specific applications. Additionally, although Hadoop provides HDFS for storing data, there are several other commonly used systems implemented on top of HDFS that provide additional functionality, and these systems should also be taken into consideration.
  • Multi-tenancy: It’s common for clusters to host multiple users, groups, and application types, so the management and storage of data should be planned with all of these tenants in mind.
  • Schema design: Despite Hadoop being schema-less, there are still important decisions to make when devising the structure of data stored in Hadoop, including directory structures for data loaded into HDFS as well as for the output of data processing and analysis.
  • Metadata: As with any data management system, cataloging and storing the metadata related to the stored data is as important as cataloging and storing the data itself.
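
To make the Schema-on-Read idea concrete, the following is a minimal sketch using PySpark; the HDFS paths, column names, and file contents are hypothetical assumptions for illustration. The raw file is stored as-is, and the structure is supplied only by the application that reads it:

```python
# Schema-on-Read sketch: the store holds raw text; the reader imposes structure.
# Assumptions: a reachable HDFS cluster and a raw, comma-delimited file already
# copied unchanged to the hypothetical path below.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# The schema is declared by the reading application, not by the data store,
# so another team could read the same file with a different schema.
schema = StructType([
    StructField("user_id",  LongType(),      True),
    StructField("url",      StringType(),    True),
    StructField("event_ts", TimestampType(), True),
])

events = (spark.read
          .schema(schema)
          .option("header", "false")
          .csv("hdfs:///data/raw/clickstream/events.csv"))

events.groupBy("url").count().show()
```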

As Grover et al. point out, “One of the most fundamental decisions to make when architecting a solution on Hadoop is determining how data will be stored in Hadoop. There is no such thing as a standard data storage format in Hadoop.”236 “Hadoop allows for storage of data in any format, whether it’s text, binary, images, etc. Hadoop also provides built-in support for a number of formats optimized for Hadoop storage and processing,” Grover et al. note.236 This gives users complete control over their source data, and there are a number of options for how that data can be stored: not just the raw data being ingested, but also the intermediate data generated during processing and the final results of that processing.236 The complexity of these choices multiplies as one works through the process. The major considerations for Hadoop data storage, each discussed in more detail below, are the file format, the compression codec, and the underlying data storage system.236
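
As a rough, hedged illustration of those choices, the PySpark sketch below writes the same data set as plain text with one codec and as Parquet with another; the paths and codec pairings are assumptions for illustration rather than recommendations:

```python
# Sketch: one data set persisted in two formats with different codecs.
# Assumptions: hypothetical HDFS paths and a simple two-column schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-format-sketch").getOrCreate()
df = spark.read.schema("user_id LONG, url STRING").csv("hdfs:///data/raw/clickstream/")

# Plain text (CSV) with gzip: simple and portable, but gzip output is not splittable.
(df.write.mode("overwrite")
   .option("compression", "gzip")
   .csv("hdfs:///data/processed/clickstream_csv_gzip/"))

# Parquet with Snappy: a columnar, splittable format with fast decompression,
# often a better fit for data that will be queried repeatedly.
(df.write.mode("overwrite")
   .option("compression", "snappy")
   .parquet("hdfs:///data/processed/clickstream_parquet/"))
```

Intermediate or working data might favor a faster codec over a smaller file, while long-lived results might justify heavier compression.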

Hadoop’s Schema-on-Read model does not impose any requirements when loading data into Hadoop. While many people use Hadoop for storing and processing unstructured data, some order is still desirable, especially since Hadoop often serves as a data hub for an entire organization, where the stored data is intended to be shared among many departments and teams.295 It is important to create a carefully structured and organized repository of your data (a minimal staging sketch follows this list) for the following reasons:

  • “Standard directory structure makes it easier to share data between teams working with the same data sets.
  • Often times, you’d want to ‘stage’ data in a separate location before all of it is ready to be processed. Conventions regarding staging data will help ensure that partially-loaded data will not get accidentally processed as if it was complete.
  • Standardized organization of data will allow reusing code that processes it.
  • Standardized locations also allow enforcing access and quota controls to prevent accidental deletion or corruption.
  • Some tools in the Hadoop ecosystem sometimes make assumptions regarding the placement of data. It is often simpler to match those assumptions when initially loading data into Hadoop.”295
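
As one hedged illustration of such conventions, the sketch below drives the standard hdfs dfs command-line tool from Python to stage an incoming file and move it into the shared location only once the load is complete; the directory names are hypothetical conventions, not prescriptions from the source:

```python
# Sketch of an HDFS staging convention (all directory names are hypothetical).
# Data lands in a staging area first and is renamed into the final location
# only when complete, so downstream jobs never process a partial load.
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' subcommand and fail loudly on errors."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

landing_dir = "/data/staging/sales/2018-06-01"
final_dir   = "/data/sales/year=2018/month=06/day=01"

# 1) Stage the incoming file where shared processing jobs do not look.
hdfs("-mkdir", "-p", landing_dir)
hdfs("-put", "/tmp/sales_2018-06-01.csv", landing_dir)

# 2) After the load is verified, rename it into the standardized location;
#    a rename within HDFS is a cheap metadata-only operation.
hdfs("-mkdir", "-p", "/data/sales/year=2018/month=06")
hdfs("-mv", landing_dir, final_dir)
```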

“The details of the data model will be highly dependent on the specific use case,” explain Grover et al.295 “For example, data warehouse implementations and other event stores are likely to use a schema similar to the traditional star schema, including structured fact and dimension tables. Unstructured and semi-structured data, on the other hand, are likely to focus more on directory placement and metadata management,” they note.295

When mapping out the schema design, the following specifics should be kept in mind (a partitioning sketch follows this list):

  • “Develop standard practices and enforce them, especially when multiple teams are sharing the data.
  • Make sure your design will work well with the tools you are planning to use. For example, the version of Hive you are planning to use may only support table partitions on directories that are named a certain way. This will impact the schema design in general and how you name your table subdirectories, in particular.
  • Keep usage patterns in mind when designing a schema. Different data processing and querying patterns work better with different schema designs. Knowing in advance the main use cases and data retrieval requirements will result in schema that will be easier to maintain and support in the long term as well as improve data processing performance.”294
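
To illustrate the point about tooling assumptions and usage patterns, the sketch below (hypothetical table, path, and column names) writes output with Hive-style column=value partition directories, the naming convention Hive expects when it prunes partitions:

```python
# Sketch: partitioned output whose directory names follow the Hive
# 'column=value' convention (table, path, and column names are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-layout-sketch").getOrCreate()

orders = spark.read.parquet("hdfs:///data/processed/orders/")

# partitionBy produces directories such as
#   /data/warehouse/orders/order_date=2018-06-01/part-....parquet
# so queries filtering on order_date can skip irrelevant directories entirely.
(orders.write
   .mode("overwrite")
   .partitionBy("order_date")
   .parquet("hdfs:///data/warehouse/orders/"))
```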

Returning to those storage considerations, the major decisions include236:

  • File format: Options include plain text, Hadoop-specific formats such as SequenceFile, and more complex but more functionally rich formats such as Avro and Parquet. Each format comes with its own strengths and weaknesses, making it more or less suitable depending on the application and the source data types being ingested.236 Because Hadoop is customizable, it is also possible to create one’s own unique file format.236
  • Compression: Although this choice is more straightforward than selecting a file format, the compression codecs commonly used with Hadoop have their own distinct characteristics: some compress and decompress quickly but do not compress as aggressively, while others create smaller files but take longer to compress and decompress and, not surprisingly, require more CPU.236 Whether a compressed file can be split for parallel processing is also a very important consideration when working with data stored in Hadoop.
  • Data storage: Although Hadoop data is stored in HDFS, there are decisions to make about the underlying storage manager, i.e., whether to use HBase or HDFS directly to store the data.236

On that last point, the first thing to understand about HBase is that it is not a Relational Database Management System (RDBMS); it is more like a huge hash table, i.e., a data structure that implements an associative array abstract data type or, put more simply, a structure that maps keys to values. “Just like a hash table, you can associate values with keys and perform fast lookups of the values based on a given key,” Grover et al. explain. HBase’s value proposition lies in its scalability and flexibility, and it works best for problems like fraud detection, problems that can be solved in a few get and put requests.
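
A minimal sketch of that get/put access pattern, using the third-party happybase client through an HBase Thrift gateway; the host, table, and column names are illustrative assumptions, not details from the source:

```python
# Sketch: the simple get/put access pattern HBase serves well.
# Assumptions: a running HBase Thrift server at the hypothetical host below,
# an existing 'transactions' table with a 'profile' column family, and the
# third-party 'happybase' package installed.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("transactions")

# put: store the latest activity profile under a row key the application
# can reconstruct directly, e.g. the account ID.
table.put(b"account-42", {
    b"profile:last_amount":  b"129.95",
    b"profile:last_country": b"MO",
})

# get: a fraud check fetches the profile for one key in a single fast lookup.
row = table.row(b"account-42")
print(row[b"profile:last_country"])

connection.close()
```

Because each check touches only a single row key, this pattern scales with the cluster rather than with the size of the table.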

[i] Grover, Mark; Malaska, Ted; Seidman, Jonathan; Shapira, Gwen. Hadoop Application Architectures: Designing Real-World Big Data Applications. O’Reilly Media, July 2015.
