diff options
| author | msabhi <abhi.is2006@gmail.com> | 2016-12-07 10:00:31 -0500 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2016-12-07 10:00:31 -0500 |
| commit | e4a447b9a7c357696d6a1bd5e71d99043b3c57ab (patch) | |
| tree | 40e1c374f25bc444ed43002cf18008a8518624a1 /chapter | |
| parent | 8e36d96069648c019d1e59fea0b0844cf43e3862 (diff) | |
Added Hive model and architecture
Diffstat (limited to 'chapter')
| -rw-r--r-- | chapter/8/big-data.md | 53 |
1 file changed, 53 insertions, 0 deletions
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index 84b597c..a75f9fc 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md
@@ -259,6 +259,30 @@ Other than standard data-parallel operators like filter, map, leftJoin, and redu
 ### 1.3 Querying
+
+Hive is a data-warehousing infrastructure built on top of Hadoop, the MapReduce framework. Its primary responsibility is to provide data summarization, querying, and analysis of large datasets stored in Hadoop's HDFS. It offers SQL-like access to structured data through a language known as HiveQL (or HQL), as well as big-data analysis with the help of MapReduce: HiveQL queries are compiled into MapReduce jobs that are executed on Hadoop. This drastically brings down the time spent writing and maintaining Hadoop jobs.
+
+Data in Hive is organized into three units:
+
+`Tables`: Like RDBMS tables, Hive tables contain rows and columns, and every table maps to an HDFS directory. All the data in a table is serialized and stored in files under the corresponding directory. Hive is extensible: it accepts user-defined data formats and custom serialization and deserialization methods. It also supports external tables stored in HDFS, NFS, or local directories.
+
+`Partitions`: One or more partition columns determine how a table's data is distributed into subdirectories of the table directory.
+`Buckets`: Data in each partition can be further divided into buckets based on the hash of a column of the table. Each bucket is stored as a file in the partition directory.
+
+***HiveQL***:
+
+The Hive query language (HiveQL) consists of a subset of SQL along with some extensions. The language is very SQL-like and supports features such as subqueries, joins, cartesian products, group by, aggregation, describe, and more. MapReduce programs can also be used in Hive queries.
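As a sketch of the layout described above, a table partitioned by date and bucketed on a user-id column might be declared as follows (the table and column names are illustrative, not part of the chapter):

```sql
-- Illustrative table: one HDFS subdirectory per dt partition,
-- 32 bucket files per partition, assigned by hashing userid.
CREATE TABLE page_views (
  userid   BIGINT,
  page_url STRING,
  ip       STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;
```

Loading data into the partition `dt='2016-12-07'` then materializes a `dt=2016-12-07` subdirectory under the table's directory, with the bucket files inside it.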
+A sample query using MapReduce would look like this:
+
+FROM (
+  MAP inputdata USING 'python mapper.py' AS (word, count)
+  FROM inputtable
+  CLUSTER BY word
+)
+REDUCE word, count USING 'python reduce.py';
+
+This query uses mapper.py to transform inputdata into (word, count) pairs, distributes the data to the reducers by hashing on the word column (as directed by CLUSTER BY), and aggregates the pairs with reduce.py.
+INSERT INTO, UPDATE, and DELETE are not supported, which makes it easier to handle reader and writer concurrency.
+
+***Serialization/Deserialization***
+Hive implements LazySerDe as the default SerDe. It deserializes rows into internal objects lazily, so the cost of deserializing a column is incurred only when that column is needed. Hive also provides a RegexSerDe, which allows regular expressions to be used to parse columns out of a row. Hive supports various file formats such as TextInputFormat, SequenceFileInputFormat, and RCFileInputFormat.
 #### SparkSQL - Where Relational meets Procedural :
 Relational interface to big data is good, however, it doesn’t cater to users who want to perform
@@ -309,6 +333,35 @@ At high level, when the user program calls *MapReduce* function, the input files
 </figure>
+
+**Hive execution model**
+
+<figure class="main-container">
+  <img src="./Hive-architecture.png" alt="Hive architecture" />
+</figure>
+
+The query is submitted via the CLI, the web UI, or any other interface. It goes to the compiler, where it undergoes parsing, type checking, and semantic analysis using metadata from the Metastore. The compiler generates a logical plan, the rule-based optimizer optimizes it, and an optimized plan in the form of a DAG of MapReduce and HDFS tasks is produced. The execution engine then executes these tasks in the correct order using Hadoop.
+
+***Metastore***
+The Metastore stores all information about the tables: their partitions, schemas, columns and their types, and so on.
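That metadata is what HiveQL's catalog statements read. As an illustrative sketch, for a hypothetical table `page_views`:

```sql
-- Both statements are answered from the Metastore,
-- without touching the data files in HDFS.
DESCRIBE EXTENDED page_views;   -- columns, types, storage descriptor
SHOW PARTITIONS page_views;     -- one line per partition directory
```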
+The Metastore runs on a traditional RDBMS (so that the latency of metadata queries is very small) and uses an open-source ORM layer called DataNucleus. The Metastore is backed up regularly. To make sure the system scales with the number of queries, no metadata queries are made from the mappers or reducers of a job; any metadata a mapper or reducer needs is passed through XML plan files generated by the compiler.
+
+***Query Compiler***
+The Hive query compiler works much like a traditional database compiler. Antlr is used to generate the abstract syntax tree (AST) of the query, and a logical plan is created using information from the Metastore. An intermediate representation called the query block (QB) tree is used when transforming the AST into an operator DAG; nested queries define the parent-child relationships in the QB tree.
+The optimization logic consists of a chain of transformations in which the output of one transformation is the input to the next. Each transformation is a walk over the operator DAG: every visited node is tested against a set of rules, and when a rule is satisfied its corresponding processor is invoked. A Dispatcher maintains the mapping from rules to processors and performs the rule matching, while a GraphWalker manages the overall traversal. The logical plan generated in this step is then split into multiple MapReduce and HDFS tasks. Nodes in the plan correspond to physical operators, and edges represent the flow of data between operators.
+
+***Optimisations in Hive:***
+
+- Column Pruning - Only the columns needed in query processing are projected.
+- Predicate Pushdown - Predicates are pushed down to the scan so that rows are filtered as early as possible.
+- Partition Pruning - Predicates on partitioned columns are used to prune out files of partitions that do not satisfy the predicate.
+- Map Side Joins - When the other tables involved in a join are very small, those tables are replicated in all of the mappers and the join is performed in the map phase.
+- Join Reordering - Larger tables are streamed, rather than materialized in memory, in the reducer to reduce memory requirements.
+
+Some optimizations are not enabled by default but can be activated by setting certain flags. These include:
+- Repartitioning data to handle skew in GROUP BY processing. This is achieved by performing the GROUP BY in two MapReduce stages: in the first, data is distributed randomly to the reducers and partial aggregates are computed; in the second, the partial aggregates are distributed on the GROUP BY columns to the reducers.
+- Hash-based partial aggregation in the mappers, which reduces the data the mappers send to the reducers and thereby the time spent sorting and merging it.
+
+***Execution Engine***
+
+The execution engine executes the tasks in the order of their dependencies. A MapReduce task first serializes its part of the plan into a plan.xml file. This file is added to the job cache, and mappers and reducers are spawned to execute the relevant sections of the operator DAG. The final results are stored in a temporary location and then moved to the final destination (in the case of, say, an INSERT INTO query).

**Pregel**
