| author | msabhi <abhi.is2006@gmail.com> | 2016-12-05 07:41:17 -0500 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2016-12-05 07:41:17 -0500 |
| commit | 5ae1c5f5a1612f6f042f75941115e4269581d371 (patch) | |
| tree | 0715be52f4d0d1cf16e55647ff29682bf63c590e | |
| parent | b3e083c9f5f3595b79e76ae5130b8c71ae022e02 (diff) | |
Update big-data.md
| -rw-r--r-- | chapter/8/big-data.md | 1 |
1 file changed, 1 insertion(+), 0 deletions(-)
```diff
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index 1b8f925..11b047c 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md
@@ -175,6 +175,7 @@ The cluster manager manages and allocates the required system resources to the S
 A Spark worker executes the business logic submitted by the Spark driver. Spark workers are abstracted and are allocated dynamically by the cluster manager to the Spark driver for the execution of submitted jobs. The driver listens for and accepts incoming connections from its executors throughout its lifetime.
 
 ***Job scheduler optimization:*** Spark's job scheduler tracks the persistent RDDs saved in memory. When an action (e.g. `count` or `collect`) is performed on an RDD, the scheduler first analyzes the lineage graph to build a DAG of stages to execute. Each stage contains only transformations with narrow dependencies; stage boundaries fall at the wide dependencies, for which the scheduler has to fetch the missing partitions from other workers in order to build the target RDD. The scheduler assigns tasks to machines based on data locality, or to the machines preferred by the RDD containing the data. If a task fails, the scheduler re-runs it on another node, and also recomputes the parent stages if their outputs are missing.
+***How is the memory for persistent RDDs managed?*** Persistent RDDs are stored in memory as Java objects (for performance), in memory as serialized data (for lower memory usage at the cost of performance), or on disk. If a worker runs out of memory when materializing a new RDD, an LRU policy evicts the least recently accessed RDD, unless it is the same as the new RDD; in that case the old RDD is excluded from eviction, since it is likely to be reused again. Long lineage chains involving wide dependencies are checkpointed to reduce the time needed to recover an RDD.
 However, since RDDs are read-only, checkpointing is straightforward: consistency is not a concern, and there is none of the overhead of managing consistency seen in distributed shared memory.
```
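The LRU eviction rule described in the added paragraph can be sketched in plain Python. This is an illustrative toy, not Spark's actual block management; `RDDCache` and its methods are hypothetical names:

```python
from collections import OrderedDict

class RDDCache:
    """Toy LRU store for cached RDD partitions: the least recently accessed
    partition is evicted first, but partitions belonging to the RDD currently
    being materialized are spared, since they may be reused immediately."""

    def __init__(self, capacity):
        self.capacity = capacity    # max number of cached partitions
        self.store = OrderedDict()  # (rdd_id, partition) -> data, in LRU order

    def get(self, rdd_id, part):
        data = self.store.pop((rdd_id, part))
        self.store[(rdd_id, part)] = data  # move to most-recently-used end
        return data

    def put(self, rdd_id, part, data):
        while len(self.store) >= self.capacity:
            # Evict the least recently used partition that does NOT belong
            # to the RDD being written.
            victim = next((k for k in self.store if k[0] != rdd_id), None)
            if victim is None:  # every cached partition belongs to this RDD
                break           # over-commit rather than evict and thrash
            self.store.pop(victim)
        self.store[(rdd_id, part)] = data
```

For example, with a capacity of two partitions, caching RDDs `a` and `b`, touching `a`, and then caching `c` evicts `b`: it is the least recently accessed entry that does not belong to the RDD being written.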

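The point about checkpointing long lineages can be illustrated with a small plain-Python simulation (hypothetical helper names, no Spark involved): recovering a lost partition without a checkpoint replays the entire transformation chain, while a checkpoint lets recovery resume partway through, and because RDDs are read-only the saved value can never go stale:

```python
def apply_chain(value, fns):
    """Replay a lineage: apply each transformation to the value in order."""
    for fn in fns:
        value = fn(value)
    return value

# A lineage of 100 simple map-like transformations over a source value of 0.
lineage = [(lambda x, i=i: x + i) for i in range(100)]

# Recovery without a checkpoint: replay all 100 transformations.
full = apply_chain(0, lineage)  # 0 + 0 + 1 + ... + 99 = 4950

# Recovery with a checkpoint taken after step 90: replay only the last 10.
checkpoint = apply_chain(0, lineage[:90])
recovered = apply_chain(checkpoint, lineage[90:])
assert recovered == full  # read-only data: the snapshot is never inconsistent
```

The final assertion is the consistency argument from the diff in miniature: because no transformation mutates earlier values, a checkpoint is just a prefix of the computation, and resuming from it always yields the same result as a full recomputation.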