AMMAI SUMMARY: Hive - A Warehousing Solution Over a Map-Reduce Framework

Hadoop is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware.
However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse.

Hive is an open-source data warehousing solution built on top of Hadoop.
Hive supports queries expressed in HiveQL, a SQL-like declarative language; these queries are compiled into map-reduce jobs that are executed on Hadoop.

Data Model

Data in Hive is organized into:

Tables - These are analogous to tables in relational databases. Each table has a corresponding HDFS directory. The data in a table is serialized and stored in files within that directory.
Partitions - Each table can have one or more partitions which determine the distribution of data within sub-directories of the table directory.
Buckets - Data in each partition may in turn be divided into buckets based on the hash of a column in the table.
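
As a rough illustration of how the three levels nest, the sketch below builds the storage path for one row. It is only a sketch under stated assumptions: the warehouse root `/wh`, the table name `page_views`, the `bucket_NNNNN` file name and the use of CRC32 as the hash are all hypothetical stand-ins (Hive uses its own hash function and file naming). The `key=value` sub-directory naming follows the convention Hive commonly uses for partitions.

```python
import zlib
from typing import Dict


def bucket_for(value: str, num_buckets: int) -> int:
    """Pick a bucket from the hash of a column value.

    CRC32 is a stand-in chosen because it is stable across runs;
    it is NOT the hash function Hive itself uses.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def storage_path(warehouse_root: str, table: str,
                 partition: Dict[str, str],
                 bucket_value: str, num_buckets: int) -> str:
    """Table directory -> partition sub-directories -> bucket file."""
    parts = [warehouse_root.rstrip("/"), table]
    # One sub-directory per partition column, in declaration order.
    parts += [f"{key}={value}" for key, value in partition.items()]
    # Hypothetical bucket file name; one file per bucket.
    parts.append(f"bucket_{bucket_for(bucket_value, num_buckets):05d}")
    return "/".join(parts)


example = storage_path("/wh", "page_views",
                       {"ds": "2009-01-01", "hr": "12"},
                       bucket_value="user42", num_buckets=32)
```

Because the partition values are encoded in the directory names, a query that filters on `ds` only needs to read the matching sub-directories rather than the whole table directory.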

Query Language

Hive provides a SQL-like query language called HiveQL which supports select, project, join, aggregate, union all and sub-queries in the from clause.

HiveQL is also very extensible. It supports user-defined column transformation (UDF) and aggregation (UDAF) functions implemented in Java.

Users can embed custom map-reduce scripts written in any language using a simple row-based streaming interface.
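
To make the row-based streaming idea concrete, here is a minimal sketch of what such a custom map script might look like in Python. It assumes rows arrive one per line and that output columns are separated by tabs, which is the usual default for Hive's streaming interface. The word-count logic and the names `map_rows` and `run` are illustrative, not something the paper prescribes.

```python
import sys
from typing import Iterable, Iterator, TextIO


def map_rows(lines: Iterable[str]) -> Iterator[str]:
    """Turn each input row of text into (word, 1) output rows."""
    for line in lines:
        for word in line.rstrip("\n").split():
            # Columns of an output row are tab-separated (assumed default).
            yield f"{word.lower()}\t1"


def run(inp: TextIO, out: TextIO) -> None:
    """Stream rows from `inp` to `out`, one output row per line."""
    for row in map_rows(inp):
        out.write(row + "\n")


sample = list(map_rows(["Hello hive\n", "hello\n"]))
```

Calling `run(sys.stdin, sys.stdout)` at the bottom of the file would turn it into a standalone script. In HiveQL such a script would be referenced from a `MAP ... USING` or `TRANSFORM ... USING` clause; the script file name used there would be up to the user.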

Hive Architecture

External Interfaces - Hive provides both user interfaces, such as a command line (CLI) and a web UI, and application programming interfaces (APIs).
The Metastore is the system catalog.
The Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution.
The Compiler is invoked by the driver upon receiving a HiveQL statement.

Metastore

Database - A namespace for tables.
Table - The metadata for a table contains the list of columns and their types, the owner, and storage and SerDe information.
Partition - Each partition can have its own columns, SerDe and storage information.
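
The three kinds of metastore objects can be pictured as plain records. The sketch below is a hypothetical in-memory model, not the metastore's real schema: the class names, field names, the SerDe names and the rule that a partition without its own columns falls back to the table's columns are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Column = Tuple[str, str]  # (column name, column type)


@dataclass
class StorageInfo:
    location: str  # where the data lives, e.g. a directory path
    serde: str     # name of the serializer/deserializer to use


@dataclass
class Partition:
    values: Dict[str, str]                  # partition column -> value
    storage: StorageInfo                    # partition-level storage and SerDe
    columns: Optional[List[Column]] = None  # None: reuse the table's columns (assumed)


@dataclass
class Table:
    name: str
    owner: str
    columns: List[Column]
    storage: StorageInfo
    partitions: List[Partition] = field(default_factory=list)

    def columns_for(self, partition: Partition) -> List[Column]:
        """Columns that apply when reading the given partition."""
        return partition.columns if partition.columns is not None else self.columns


@dataclass
class Database:
    name: str  # the namespace
    tables: Dict[str, Table] = field(default_factory=dict)


# Hypothetical example: one table with an old and a new partition.
views = Table("page_views", "alice", [("user_id", "string"), ("url", "string")],
              StorageInfo("/wh/page_views", "ExampleTextSerDe"))
old = Partition({"ds": "2009-01-01"},
                StorageInfo("/wh/page_views/ds=2009-01-01", "ExampleLegacySerDe"),
                columns=[("user_id", "string")])
new = Partition({"ds": "2009-01-02"},
                StorageInfo("/wh/page_views/ds=2009-01-02", "ExampleTextSerDe"))
views.partitions += [old, new]
db = Database("default", {views.name: views})
```

Keeping columns and SerDe at the partition level means an older partition can still be read with the layout it was written in, even after the table definition changes.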

Compiler

The Parser transforms a query string to a parse tree representation.
The Logical Plan Generator converts the internal query representation, which is derived from the parse tree, to a logical plan consisting of a tree of logical operators.
The Optimizer performs multiple passes over the logical plan and rewrites it.
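
A toy version of "a tree of logical operators rewritten by several passes" might look like the sketch below. Everything in it is illustrative: the `Op` class and the two passes (dropping always-true filters and merging stacked filters) are hypothetical examples of rewrite rules, not the rules Hive's optimizer actually applies.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Op:
    kind: str                      # e.g. "Select", "Filter", "TableScan"
    detail: str = ""               # predicate, column list or table name
    children: List["Op"] = field(default_factory=list)


Pass = Callable[[Op], Op]


def drop_true_filters(op: Op) -> Op:
    """Remove Filter operators whose predicate is the literal TRUE."""
    children = [drop_true_filters(child) for child in op.children]
    if op.kind == "Filter" and op.detail == "TRUE" and len(children) == 1:
        return children[0]
    return Op(op.kind, op.detail, children)


def merge_filters(op: Op) -> Op:
    """Collapse a Filter directly above another Filter into one operator."""
    children = [merge_filters(child) for child in op.children]
    if op.kind == "Filter" and len(children) == 1 and children[0].kind == "Filter":
        inner = children[0]
        return Op("Filter", f"({op.detail}) AND ({inner.detail})", inner.children)
    return Op(op.kind, op.detail, children)


def optimize(plan: Op, passes: List[Pass]) -> Op:
    """Run each pass over the whole plan, feeding its output to the next."""
    for rewrite in passes:
        plan = rewrite(plan)
    return plan


plan = Op("Select", "a", [
    Op("Filter", "TRUE", [
        Op("Filter", "a > 1", [
            Op("Filter", "b < 5", [Op("TableScan", "t")])])])])

optimized = optimize(plan, [drop_true_filters, merge_filters])
```

Each pass returns a new tree instead of mutating its input, so passes can be reordered or added without interfering with one another.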

Friday, June 28, 2013
