Friday, June 28, 2013

Hive - A Warehousing Solution Over a Map-Reduce Framework

Hadoop is a popular open-source map-reduce implementation that is used to store and process extremely large data sets on commodity hardware.
However, the map-reduce programming model is very low-level and requires developers to write custom programs that are hard to maintain and reuse.

Hive is an open-source data warehousing solution built on top of Hadoop.
Hive supports queries expressed in HiveQL, a SQL-like declarative language. These queries are compiled into map-reduce jobs that are executed on Hadoop.

Data Model

Data in Hive is organized into:
  1. Tables - These are analogous to tables in relational databases. Each table has a corresponding HDFS directory.
    • The data in a table is serialized and stored in files within that directory.
  2. Partitions - Each table can have one or more partitions which determine the distribution of data within sub-directories of the table directory.
  3. Buckets - Data in each partition may in turn be divided into buckets based on the hash of a column in the table.
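To make the table / partition / bucket hierarchy concrete, here is a minimal sketch of how a row could be mapped to a storage location. The function names, the `bucket_` file naming and the CRC32 hash are assumptions made for illustration; Hive's own hash function and file names differ.

```python
import zlib
from posixpath import join


def bucket_for(value: str, num_buckets: int) -> int:
    """Pick a bucket from the hash of the bucketing column's value.

    CRC32 is only a deterministic stand-in; it is not the hash Hive uses.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def storage_path(table_dir, partition_spec, bucket_value, num_buckets):
    """Build <table dir>/<col=value>/.../<bucket file> for one row.

    partition_spec is an ordered list of (partition column, value) pairs;
    each pair becomes one sub-directory level under the table directory.
    """
    partition_dirs = [f"{col}={val}" for col, val in partition_spec]
    bucket = bucket_for(bucket_value, num_buckets)
    # The bucket file name is hypothetical, chosen for readability.
    return join(table_dir, *partition_dirs, f"bucket_{bucket:05d}")
```

For example, `storage_path("/wh/page_views", [("ds", "2013-06-28"), ("country", "US")], "user42", 32)` yields a path under `/wh/page_views/ds=2013-06-28/country=US/`, so a query restricted to one partition only needs to read that sub-directory.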

Query Language

Hive provides a SQL-like query language called HiveQL, which supports select, project, join, aggregate, UNION ALL and sub-queries in the FROM clause.
HiveQL is also very extensible. It supports user-defined column transformation (UDF) and aggregation (UDAF) functions implemented in Java.
Users can also embed custom map-reduce scripts written in any language using a simple row-based streaming interface.
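As an illustration of the row-based streaming interface, here is a minimal sketch of a custom script in Python. It assumes that rows arrive on standard input as tab-separated lines (Hive's default for streamed rows) and that each row holds a user id followed by a URL; the column layout and the function names are hypothetical.

```python
import sys


def transform_rows(lines):
    """Turn each input row (user_id, url) into (user_id, lower-cased url)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines rather than failing on them
        fields = line.split("\t")
        user_id, url = fields[0], fields[1]
        yield f"{user_id}\t{url.lower()}"


def run(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin and write one transformed row per line to stdout."""
    for row in transform_rows(stdin):
        stdout.write(row + "\n")
```

A script file would call `run()` at the bottom and be referenced from a query, typically through HiveQL's `TRANSFORM ... USING` clause. Keeping the logic in `transform_rows` makes it easy to test without a cluster.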

Hive Architecture

  • External Interfaces - Hive provides both user interfaces, such as the command line (CLI) and a web UI, and application programming interfaces (APIs).
  • The Metastore is the system catalog, which holds the metadata about databases, tables and partitions.
  • The Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution.
  • The Compiler is invoked by the driver upon receiving a HiveQL statement.

Metastore

  • Database - A namespace for tables.
  • Table - The metadata for a table contains the list of columns and their types, the owner, and the storage and SerDe (serializer/deserializer) information.
  • Partition - Each partition can have its own columns, SerDe and storage information.
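The three kinds of metastore objects listed above can be pictured as nested records. The sketch below is only a mental model: the class and field names, and the `ExampleSerDe` value in the usage note, are hypothetical and do not reflect the metastore's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Column = Tuple[str, str]  # (column name, column type)


@dataclass
class StorageInfo:
    location: str  # directory holding the data files
    serde: str     # name of the SerDe used to read and write rows


@dataclass
class Partition:
    values: Dict[str, str]  # partition column -> value, e.g. {"ds": "2013-06-28"}
    columns: List[Column]   # a partition can carry its own column list
    storage: StorageInfo    # ... and its own storage and SerDe information


@dataclass
class Table:
    name: str
    owner: str
    columns: List[Column]
    storage: StorageInfo
    partitions: List[Partition] = field(default_factory=list)


@dataclass
class Database:
    name: str  # the namespace
    tables: Dict[str, Table] = field(default_factory=dict)

    def add_table(self, table: Table) -> None:
        """Register a table under this database's namespace."""
        self.tables[table.name] = table
```

A table would then be registered with something like `db.add_table(Table("page_views", "alice", [("user_id", "string")], StorageInfo("/wh/page_views", "ExampleSerDe")))`, after which `db.tables["page_views"]` returns its metadata.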

Compiler

  • The Parser transforms a query string to a parse tree representation.
  • The Logical Plan Generator converts the internal query representation to a logical plan, which consists of a tree of logical operators.
  • The Optimizer performs multiple passes over the logical plan and rewrites it.
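The three compiler stages above can be sketched as a toy pipeline. This is not Hive's compiler: the tiny grammar (`SELECT ... FROM ... [WHERE ...]`), the operator names and the two rewrite passes are assumptions chosen to show the shape of the flow from query string to parse tree to optimized logical plan.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Op:
    """A logical operator; `child` is the operator that feeds rows into it."""
    name: str
    args: tuple
    child: Optional["Op"] = None


def parse(query: str) -> dict:
    """Parser: query string -> parse tree (here simply a dict)."""
    m = re.fullmatch(
        r"\s*SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<table>\w+)"
        r"(?:\s+WHERE\s+(?P<pred>.+?))?\s*",
        query, flags=re.IGNORECASE)
    if m is None:
        raise ValueError(f"unsupported query: {query!r}")
    cols = [c.strip() for c in m.group("cols").split(",")]
    return {"select": cols, "from": m.group("table"), "where": m.group("pred")}


def generate_plan(tree: dict) -> Op:
    """Logical plan generator: parse tree -> chain of logical operators."""
    plan = Op("TableScan", (tree["from"],))
    if tree["where"]:
        plan = Op("Filter", (tree["where"],), plan)
    return Op("Select", tuple(tree["select"]), plan)


def drop_select_star(op: Optional[Op]) -> Optional[Op]:
    """Pass 1: a projection of `*` changes nothing, so remove it."""
    if op is None:
        return None
    child = drop_select_star(op.child)
    if op.name == "Select" and op.args == ("*",):
        return child
    return Op(op.name, op.args, child)


def push_filter_into_scan(op: Optional[Op]) -> Optional[Op]:
    """Pass 2: fold a filter sitting directly on a scan into the scan."""
    if op is None:
        return None
    child = push_filter_into_scan(op.child)
    if op.name == "Filter" and child is not None and child.name == "TableScan":
        return Op("TableScan", child.args + op.args)
    return Op(op.name, op.args, child)


def optimize(plan: Op, passes) -> Op:
    """Optimizer: run each rewrite pass over the whole plan, in order."""
    for rewrite in passes:
        plan = rewrite(plan)
    return plan


def compile_query(query: str) -> Op:
    plan = generate_plan(parse(query))
    return optimize(plan, [drop_select_star, push_filter_into_scan])
```

With these passes, `compile_query("SELECT * FROM logs WHERE status = 404")` collapses to a single `TableScan` operator that carries the predicate, while a query with an explicit column list keeps its `Select` operator on top of the scan.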
