Diffstat (limited to 'chapter/8')
| -rw-r--r-- | chapter/8/big-data.md | 12 |
1 file changed, 6 insertions, 6 deletions
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index 7025416..aaecd94 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md
@@ -218,7 +218,6 @@ FROM (
 REDUCE word, count USING 'python reduce.py';
 ```
 This query uses mapper.py for transforming inputdata into (word, count) pair, distributes data to reducers by hashing on word column (given by CLUSTER) and uses reduce.py.
-INSERT INTO, UPDATE, and DELETE are not supported which makes it easier to handle reader and writer concurrency.
 
 ***Serialization/Deserialization***
@@ -388,13 +387,13 @@ Overall, the performance is very good for conceptually unrelated computations.
 The Spark driver defines SparkContext which is the entry point for any job that defines the environment/configuration and the dependencies of the submitted job. It connects to the cluster manager and requests resources for further execution of the jobs. The cluster manager manages and allocates the required system resources to the Spark jobs. Furthermore, it coordinates and keeps track of the live/dead nodes in a cluster. It enables the execution of jobs submitted by the driver on the worker nodes (also called Spark workers) and finally tracks and shows the status of various jobs running by the worker nodes.
-A Spark worker executes the business logic submitted by the Spark driver. Spark workers are abstracted and are allocated dynamically by the cluster manager to the Spark driver for the execution of submitted jobs. The driver will listen for and accept incoming connections from its executors throughout its lifetime.
+A Spark worker executes the business logic submitted by the user by way of the Spark driver. Spark workers are abstracted and are allocated dynamically by the cluster manager to the Spark driver for the execution of submitted jobs. The driver will listen for and accept incoming connections from its executors throughout its lifetime.
 
 ***Job scheduler optimization :*** Spark’s job scheduler tracks the persistent RDD’s saved in memory. When an action (count or collect) is performed on a RDD, the scheduler first analyzes the lineage graph to build a DAG of stages to execute. These stages only contain the transformations having narrow dependencies. Outside these stages are the wider dependencies for which the scheduler has to fetch the missing partitions from other workers in order to build the target RDD. The job scheduler is highly performant. It assigns tasks to machines based on data locality or to the preferred machines in the contained RDD. If a task fails, the scheduler re-runs it on another node and also recomputes the stage’s parent is missing.
 
 ***How are persistent RDD’s memory managed ?***
-Persistent RDDs are stored in memory as java objects (for performance) or in memory as serialized data (for less memory usage at cost of performance) or on disk. If the worker runs out of memory upon creation of a new RDD, LRU policy is applied to evict the least recently accessed RDD unless its same as the new RDD. In that case, the old RDD is excluded from eviction given the fact that it may be reused again in future. Long lineage chains involving wide dependencies are checkpointed to reduce the time in recovering a RDD. However, since RDDs are read-only, checkpointing is still ok since consistency is not a concern and there is no overhead to manage the consistency as is seen in distributed shared memory.
+Persistent RDDs are stored in memory as java objects (for performance) or in memory as serialized data (for less memory usage at cost of performance) or on disk. If the worker runs out of memory upon creation of a new RDD, Least Recently Used(LRU) policy is applied to evict the least recently accessed RDD unless its same as the new RDD. In that case, the old RDD is excluded from eviction given the fact that it may be reused again in future. Long lineage chains involving wide dependencies are checkpointed to reduce the time in recovering a RDD. However, since RDDs are read-only, checkpointing is still ok since consistency is not a concern and there is no overhead to manage the consistency as is seen in distributed shared memory.
 
 ### 2.3 Hive execution model
@@ -414,9 +413,10 @@ The Hive execution model composes of the below important components (and as show
 <figure class="main-container">
 <img src="./Hive-transformation.png" alt="Hive transformation" />
 </figure>
-*Figure to depict the transformation flow during optimization, from:* %cite thusoo2010hive --file big-data %}
- Some of the important transformations are :
+*Figure to depict the transformation flow during optimization, from:* {%cite thusoo2010hive --file big-data %}
+
+ Some of the important opimization techniques are :
 
 - Column Pruning - Consider only the required columns needed in the query processing for projection.
 - Predicate Pushdown - Filter the rows as early as possible by pushing down the predicates. Its important that unnecessary records are filtered first and transformations are applied on only the needed ones.
@@ -452,7 +452,7 @@ Hence, in Spark SQL, transformation of user queries happens in four phases :
 <figure class="main-container">
 <img src="./sparksql-data-flow.jpg" alt="SparkSQL optimization plan Overview" />
 </figure>
-
+*Figure from : {%cite zaharia2010spark --file big-data %}*
 
 ***Analyzing a logical plan to resolve references :*** In the analysis phase a relation either from the abstract syntax tree (AST) returned by the SQL parser or from a DataFrame is analyzed to create a logical plan out of it, which is still unresolved (the columns referred may not exist or may be of wrong datatype). The logical plan is resolved using using the Catalyst’s Catalog object(tracks the table from all data sources) by mapping the named attributes to the input provided, looking up the relations by name from catalog, by propagating and coercing types through expressions.
 
 ***Logical plan optimization :*** In this phase, several of the rules like constant folding, predicate push down, projection pruning, null propagation, boolean expression simplification are applied on the logical plan.
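The hunk around the HiveQL MAP/REDUCE example refers to `mapper.py` and `reduce.py`, whose contents are not part of this diff. The pair below is only a hypothetical word-count sketch of what such scripts could look like under Hive's streaming TRANSFORM convention, where rows arrive on stdin and are emitted on stdout as tab-separated columns.

```python
#!/usr/bin/env python
# mapper.py (hypothetical): emit a (word, 1) pair for every word in each input row.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reduce.py (hypothetical): CLUSTER BY word hashes rows to a reducer and sorts them,
# so equal words arrive adjacent and a running sum per key is enough.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```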
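The hunk that rewords the persistent-RDD paragraph describes storage levels, LRU eviction, and checkpointing of long lineage chains. A minimal PySpark sketch of that behaviour follows; the application name, input path, and checkpoint directory are placeholders, not values from the chapter.

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="rdd-persistence-sketch")   # placeholder app name
sc.setCheckpointDir("hdfs:///tmp/checkpoints")        # placeholder directory

lines = sc.textFile("hdfs:///data/pages.txt")         # placeholder input path
pairs = lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Keep the RDD in memory as deserialized objects (fastest, most memory);
# MEMORY_AND_DISK or DISK_ONLY trade speed for space. When memory runs out,
# Spark evicts partitions of the least recently used RDDs first.
counts.persist(StorageLevel.MEMORY_ONLY)

# Checkpointing writes the RDD out and truncates its lineage, so recovery
# does not have to replay a long chain of wide dependencies.
counts.checkpoint()

print(counts.count())   # the first action materializes and persists the RDD;
                        # the checkpoint is written once it has been computed
print(counts.take(5))   # later actions reuse the persisted partitions
```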
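The hunk that adds the Spark SQL figure caption sits next to the discussion of query transformation phases (analysis, logical optimization, physical planning). A small PySpark sketch can make those phases visible: `explain(extended=True)` prints the parsed, analyzed, and optimized logical plans together with the selected physical plan. The temporary view and sample rows here are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-sketch").getOrCreate()  # placeholder name

# Made-up sample data standing in for a real Hive table or other data source.
df = spark.createDataFrame(
    [("spark", 10), ("hive", 7), ("hadoop", 3)], ["word", "cnt"])
df.createOrReplaceTempView("word_counts")

# The query only needs `word` and filters on `cnt`, giving the optimizer a
# chance to apply projection pruning and predicate pushdown.
result = spark.sql("SELECT word FROM word_counts WHERE cnt > 5")

# Prints the parsed (unresolved) logical plan, the analyzed logical plan,
# the optimized logical plan, and the chosen physical plan.
result.explain(extended=True)
```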
