| author | msabhi <abhi.is2006@gmail.com> | 2016-12-12 01:03:14 -0500 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2016-12-12 01:03:14 -0500 |
| commit | 5335e87060a44f9f6fadd3280c77a6ead384c2ad (patch) | |
| tree | c841a317292c42d40e711569c34364006db3dc40 | |
| parent | 65f5401fd11bb8d02e5e32800f8fb2e99254b123 (diff) | |
Reordering Hive execution model and adding more information
Still a bit more to come
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | chapter/8/big-data.md | 38 |

1 file changed, 23 insertions, 15 deletions
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index 2d96923..8d0407a 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md
@@ -368,26 +368,34 @@ Persistent RDDs are stored in memory as java objects (for performance) or in mem
 The query is submitted via the CLI, web UI or any other interface. The query goes to the compiler and undergoes parse, type-check and semantic-analysis phases using the metadata from the Metastore. The compiler generates a logical plan, which is optimized by the rule-based optimizer, and an optimized plan in the form of a DAG of MapReduce and HDFS tasks is generated. The execution engine executes these tasks in the correct order using Hadoop.
-***Metastore***
-It stores all information about the tables, their partitions, schemas, columns and their types, etc. The Metastore runs on a traditional RDBMS (so that the latency of metadata queries is very small) and uses an open-source ORM layer called DataNucleus. The Metastore is backed up regularly. To make sure that the system scales with the number of queries, no metadata queries are made by the mapper/reducer of a job. Any metadata needed by the mapper or the reducer is passed through XML plan files that are generated by the compiler.
+The Hive execution model shown above comprises the following important components:
-***Query Compiler***
-The Hive query compiler works similarly to traditional database compilers. Antlr is used to generate the abstract syntax tree (AST) of the query. A logical plan is created using information from the Metastore. An intermediate representation called the query block (QB) tree is used when transforming the AST into the operator DAG. Nested queries define the parent-child relationships in the QB tree.
-Optimization logic consists of a chain of transformation operations such that the output of one operation is the input to the next. Each transformation comprises a walk over the operator DAG. Each visited node in the DAG is tested against different rules; if any rule is satisfied, its corresponding processor is invoked. The Dispatcher maintains a mapping between rules and their processors and performs rule matching. The GraphWalker manages the overall traversal. The logical plan generated in the previous step is split into multiple MapReduce and HDFS tasks. Nodes in the plan correspond to physical operators, and edges represent the flow of data between operators.
+- Driver – Similar to the drivers of Spark/MapReduce applications, the driver in Hive handles query submission and its flow across the system. It also manages the session and its statistics.
-***Optimisations of Hive:***
+- Metastore – The Hive Metastore stores all information about the tables, their partitions, schemas, columns and their types, etc., making the data format and its storage transparent to users. This in turn helps in data exploration, query compilation and optimization. Because the Metastore is critical for managing the structure of Hadoop files, it must be backed up regularly.
-- Column Pruning - Only the columns needed in the query processing are projected.
-- Predicate Pushdown - Predicates are pushed down to the scan so that rows are filtered as early as possible.
-- Partition Pruning - Predicates on partitioned columns are used to prune out files of partitions that do not satisfy the predicate.
-- Map Side Joins - In case the tables involved in the join are very small, the tables are replicated in all the mappers and the reducers.
-- Join Reordering - Large tables are streamed and not materialized in-memory in the reducer to reduce memory requirements.
-Some optimizations are not enabled by default but can be activated by setting certain flags:
-- Repartitioning data to handle skew in GROUP BY processing. This is achieved by performing the GROUP BY in two MapReduce stages: in the first, data is distributed randomly to the reducers and partial aggregation is performed; in the second, these partial aggregations are distributed on the GROUP BY columns to different reducers.
-- Hash-based partial aggregation in the mappers to reduce the data that is sent by the mappers to the reducers, which helps in reducing the amount of time spent sorting and merging the resulting data.
+- Query Compiler – The Hive query compiler is similar to traditional database compilers. It processes the query in three steps:
+  - Parse – In this phase it uses Antlr (a parser generator tool) to generate the abstract syntax tree (AST) of the query.
+  - Transformation of the AST to a DAG (directed acyclic graph) – In this phase it generates the logical plan and performs compile-time type checking. The logical plan is generated using the metadata of the required tables (stored in the Metastore), and errors are flagged if any issues are found during type checking.
-***Execution Engine***
+  - Optimization – Optimization forms the core of any declarative interface. In the case of Hive, optimization happens through chains of transformations of the DAG. A transformation can even include a user-defined optimization, and it applies an action on the DAG only if a rule is satisfied. Every node in the DAG implements a special Node interface, which makes it easy to manipulate the operator DAG through other interfaces such as GraphWalker, Dispatcher, Rule and Processor. Hence, by a transformation we mean walking the DAG and, for every node we encounter, performing a rule-satisfiability check; if a rule is satisfied, its corresponding processor is invoked. A Dispatcher maintains the list of Rule-to-Processor mappings.
-Execution Engine executes the tasks in order of their dependencies. A MapReduce task first serializes its part of the plan into a plan.xml file. This file is then added to the job cache, and mappers and reducers are spawned to execute the relevant sections of the operator DAG. The final results are stored in a temporary location and then moved to the final destination (in the case of, say, an INSERT INTO query).
+<figure class="main-container">
+  <img src="./Hive-transformation.jpeg" alt="Hive transformation" />
+</figure>
+
+  Some of the important transformations are:
+
+  - Column Pruning – Only the columns required by the query are considered for projection.
+  - Predicate Pushdown – Rows are filtered as early as possible by pushing the predicates down to the scan.
+  - Partition Pruning – Predicates on partitioned columns are used to prune out files of partitions that do not satisfy the predicate.
+  - Map Side Joins – In case the tables involved in the join are very small, the tables are replicated in all the mappers and the reducers.
+  - Join Reordering – Large tables are streamed and not materialized in-memory in the reducer to reduce memory requirements.
+
+  Some optimizations are not enabled by default but can be activated by setting certain flags:
+
+  - Repartitioning data to handle skew in GROUP BY processing. This is achieved by performing the GROUP BY in two MapReduce stages: in the first, data is distributed randomly to the reducers and partial aggregation is performed; in the second, these partial aggregations are distributed on the GROUP BY columns to different reducers.
+  - Hash-based partial aggregation in the mappers to reduce the data that is sent by the mappers to the reducers, which helps in reducing the amount of time spent sorting and merging the resulting data.
+
+- Execution Engine – The execution engine executes the tasks in order of their dependencies. A MapReduce task first serializes its part of the plan into a plan.xml file. This file is then added to the job cache, and mappers and reducers are spawned to execute the relevant sections of the operator DAG. The final results are stored in a temporary location and then moved to the final destination (in the case of, say, an INSERT INTO query).
 
 ### 2.4 SparkSQL execution model
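The flag-gated optimizations in the added list map onto real Hive settings. Below is a minimal HiveQL sketch of how they surface to a user; the tables page_views (partitioned by ds) and dim_pages and their columns are hypothetical, while hive.map.aggr and hive.groupby.skewindata are the standard Hive properties for map-side hash aggregation and skew-tolerant GROUP BY.

```sql
-- Hypothetical partitioned fact table and small dimension table.
CREATE TABLE page_views (uid BIGINT, page_id INT) PARTITIONED BY (ds STRING);
CREATE TABLE dim_pages (page_id INT, page_title STRING);

-- Partition pruning + predicate pushdown: the predicate on the partition
-- column ds restricts the scan to one partition's files, and the uid
-- filter is applied as early as possible.
SELECT uid, page_id
FROM page_views
WHERE ds = '2016-12-01' AND uid IS NOT NULL;

-- Map-side join: the MAPJOIN hint asks Hive to replicate the small
-- dimension table to every mapper, so the join needs no reduce phase.
SELECT /*+ MAPJOIN(p) */ v.uid, p.page_title
FROM page_views v JOIN dim_pages p ON (v.page_id = p.page_id)
WHERE v.ds = '2016-12-01';

-- Flag-gated optimizations: hash-based partial aggregation in the mappers,
-- and the two-stage GROUP BY that repartitions data to handle skewed keys.
SET hive.map.aggr=true;
SET hive.groupby.skewindata=true;
SELECT page_id, COUNT(1) AS views
FROM page_views
WHERE ds = '2016-12-01'
GROUP BY page_id;
```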

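The plan that the compiler hands to the execution engine (the dependency-ordered stages and the operator DAG inside each stage) can be inspected with Hive's EXPLAIN statement. A short sketch, reusing the hypothetical page_views table from above:

```sql
-- EXPLAIN prints the compiler's output instead of running the query:
-- the stage dependency graph (MapReduce/HDFS tasks) and, per stage,
-- the operator tree (table scans, filters, group-bys, file sinks, ...).
EXPLAIN
SELECT page_id, COUNT(1) AS views
FROM page_views
WHERE ds = '2016-12-01'
GROUP BY page_id;
```

In the printed plan, the predicate on ds should show up as pruning at the table scan rather than as a filter in the reducers, which is the partition-pruning transformation described above.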