| author | msabhi <abhi.is2006@gmail.com> | 2016-12-13 23:46:17 -0500 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2016-12-13 23:46:17 -0500 |
| commit | 20ce102f8c508d9ed55b1afd66f72558055350cc (patch) | |
| tree | ded66c16373f47f0b6411bb11aa62a45005c7e9f /chapter | |
| parent | d481dd67059324d25a2af04214905d2bbac55995 (diff) | |
Fixing review comments
Diffstat (limited to 'chapter')
| -rw-r--r-- | chapter/8/big-data.md | 14 |
1 file changed, 6 insertions, 8 deletions
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index f51198f..19dc823 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md
@@ -22,10 +22,7 @@ An alternative approach to data parallelism is to construct complex, multi-step d
 Microsoft **Dryad** {% cite isard2007dryad --file big-data %} abstracts individual computational tasks as vertices and constructs a communication graph between those vertices. All programmers need to do is describe this DAG and let the Dryad execution engine construct the execution plan and manage scheduling and optimization. One advantage of Dryad over MapReduce is that Dryad vertices can process an arbitrary number of inputs and outputs, while MapReduce only supports a single input and a single output for each vertex. Besides this flexibility of computation, Dryad also supports different types of communication channel: files, TCP pipes and shared-memory FIFOs.
-Dryad expresses computation as acyclic data flows, which might be too expensive for some complex applications, e.g. iterative machine learning algorithms. **Spark** {% cite zaharia2010spark --file big-data%} is a framework that uses functional programming and pipelining to provide such support. It is largely inspired by MapReduce's model and builds upon the ideas behind DAG, lazy evaluation of DryadLinq. Instead of writing data to disk for each job as MapReduce does Spark can cache the results across jobs. Spark explicitly caches computational data in memory thorugh specialized immutable datasets named Resilient Distributed Sets(RDD) and reuse the same dataset across multiple parallel operations. The Spark builds upon RDD to achieve fault tolerance by reusing the lineage information of the lost RDD. This results in lesser overhead than what is seen in fault tolerance achieved by checkpoint in Distribtued Shared Memory systems. Moreover, Spark is the underlying framework upon which many very different systems are built, e.g., Spark SQL & DataFrames, GraphX, Streaming Spark, which makes it easy to mix and match the use of these systems all in the same application. These feature makes Spark the best fit for iterative jobs and interactive analytics and also helps it in providing better performance.
-{% comment %}
-Above all, any system can be easily expressed by Spark enabling other models to leverage the specific advantages of Spark systems and still retain the process of computation without any changes to Spark system[ref].
-{% endcomment %}
+Dryad expresses computation as acyclic data flows, which might be too expensive for some complex applications, e.g. iterative machine learning algorithms. **Spark** {% cite zaharia2010spark --file big-data %} is a framework that uses functional programming and pipelining to provide such support. It is largely inspired by MapReduce's model and builds upon the DAG and lazy-evaluation ideas of DryadLINQ. Instead of writing data to disk for each job as MapReduce does, Spark can cache results across jobs. Spark explicitly caches computational data in memory through specialized immutable datasets named Resilient Distributed Datasets (RDDs) and reuses the same dataset across multiple parallel operations. Spark builds on RDDs to achieve fault tolerance by replaying the lineage information of a lost RDD, which incurs less overhead than the checkpoint-based fault tolerance of Distributed Shared Memory systems. Moreover, Spark is the underlying framework upon which many very different systems are built, e.g., Spark SQL & DataFrames, GraphX, and Spark Streaming, which makes it easy to mix and match these systems in the same application. These features make Spark a good fit for iterative jobs and interactive analytics, and also help it provide better performance.
 The following four sections discuss the programming models of MapReduce, FlumeJava, Dryad and Spark.
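To make the in-memory reuse described in the added paragraph above concrete, here is a minimal Spark (Scala) sketch of caching one RDD and reusing it across several parallel operations. The input file name and the filter predicates are invented for illustration and are not taken from the chapter.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddReuseSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-reuse").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Build an RDD from stable storage and keep it in memory after the
    // first computation, instead of re-reading the input for every job.
    val errors = sc.textFile("access.log")          // hypothetical input file
      .filter(_.contains("ERROR"))
      .cache()

    // Both actions reuse the cached partitions; only the first scans the file.
    val total    = errors.count()
    val timeouts = errors.filter(_.contains("timeout")).count()

    println(s"errors=$total, timeouts=$timeouts")
    sc.stop()
  }
}
```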
@@ -127,7 +124,7 @@ Spark {%cite zaharia2010spark --file big-data %} is a fast, in-memory data pro
 *Distributed in-memory storage - Resilient Distributed Datasets:*
-RDD is a partitioned, read only collection of objects which can be created from data in stable storage or by transforming other RDD. It can be distributed across multiple nodes (parallelize) in a cluster and is fault tolerant(Resilient). If a node fails, a RDD can always be recovered using its lineage graph (information on how it was derived from dataset). A RDD is stored in memory (as much as it can fit and rest is spilled to disk) and is immutable - It can only be transformed to a new RDD. These are the lazy transformations which are applied only if any action is performed on the RDD. Hence, RDD need not be materialized at all times.
+An RDD is a partitioned, read-only collection of objects that can be created from data in stable storage or by transforming another RDD. It can be distributed across multiple nodes of a cluster (parallelized) and is fault tolerant (resilient): if a node fails, an RDD can always be recovered from its lineage, the DAG of computations that derived it from the source dataset. An RDD is stored in memory (as much as fits; the rest is spilled to disk) and is immutable - it can only be transformed into a new RDD. These transformations are deferred; they are built up and staged, and are not actually applied until an action is performed on an RDD. Thus, while one may have applied many transformations to a given RDD, the resulting transformed RDD may not be materialized even though one holds a reference to it.
 The properties that power an RDD with the above-mentioned features:
 - A list of dependencies on other RDDs.
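A small sketch of the deferred execution described in the added line above may help: transformations are only recorded, and nothing is computed until an action runs. The numbers and operations here are arbitrary illustrations, not examples from the paper.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-rdds").setMaster("local[*]"))

    val nums    = sc.parallelize(1 to 1000000)    // partitioned across workers
    val squares = nums.map(n => n.toLong * n)     // transformation: staged only
    val evens   = squares.filter(_ % 2 == 0)      // still nothing has run

    // Only this action triggers execution; the chain above is pipelined,
    // and `squares` is never materialized as a standalone dataset.
    val total = evens.reduce(_ + _)
    println(total)

    sc.stop()
  }
}
```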
@@ -162,9 +159,10 @@ Spark API provides two kinds of operations on an RDD:
 RDDs by default are discarded after use. However, Spark provides two explicit operations, persist() and cache(), to ensure RDDs are kept in memory once they have been computed for the first time.
 *Why RDD over Distributed Shared Memory (DSM)?*
-RDDs are immutable and can only be created through coarse grained transformation while DSM allows fine grained read and write operations to each memory location. Hence RDDs do not incur the overhead of checkpointing thats present in DSM and can be recovered using their lineages.
-RDDs are immutable and hence a straggler(slow node) can be replaced with backup copy as in Map reduce. This is hard to implement in DSM as two copies point to the same location and can interfere in each other’s update.
-Other benefits include the scheduling of tasks based on data locality to improve performance and the ability of the RDDs to degrade gracefully incase of memory shortage. Partitions that do not fit in RAM gets spilled to the disk (performance will then be equal to that of any data parallel system).
+RDDs are immutable and can only be created through coarse-grained transformations, while DSM allows fine-grained read and write operations to each memory location. Since RDDs are immutable, they do not require checkpointing at all and can be re-derived from their lineage; hence RDDs do not incur the checkpointing overhead present in DSM.
+Also, in DSM any failure requires the whole program to be restored, whereas with RDDs only the lost partitions need to be recovered, and this recovery happens in parallel on the affected nodes.
+RDDs are immutable, and hence a straggler (slow node) can be replaced with a backup copy as in MapReduce. This is hard to implement in DSM, as the two copies point to the same memory location and can interfere with each other's updates.
+
 ***Challenges in Spark***
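As an illustration of the persist()/cache() calls and the lineage-based recovery contrasted with DSM above, here is a hedged Scala sketch that persists a word-count RDD and prints its lineage; the input file and storage level are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage").setMaster("local[*]"))

    val counts = sc.textFile("corpus.txt")              // hypothetical input file
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .persist(StorageLevel.MEMORY_ONLY)                // equivalent to cache()

    counts.count()                                       // first action materializes and caches

    // The lineage (dependency chain) Spark keeps for this RDD; if a cached
    // partition is lost, only that partition is recomputed from this chain.
    println(counts.toDebugString)

    sc.stop()
  }
}
```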
