| author | msabhi <abhi.is2006@gmail.com> | 2016-12-04 16:16:06 -0500 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2016-12-04 16:16:06 -0500 |
| commit | 07de13e393d8d69e2e421df726435f0d2e465a67 (patch) | |
| tree | c4dba44bcfa14839abaa14cef6cedc8cf5a51c31 /chapter/8 | |
| parent | 538dc06632cfd59654760392be66372112c1839e (diff) | |
Update big-data.md
Diffstat (limited to 'chapter/8')
| -rw-r--r-- | chapter/8/big-data.md | 60 |
1 file changed, 32 insertions, 28 deletions
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index 884dead..7727026 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md

@@ -66,36 +66,13 @@ The input keys and values are drawn from a different domain than the output keys

**Dryad/DryadLINQ**
Dryad is a more general and flexible execution engine that executes subroutines at specified graph vertices. Developers specify an arbitrary directed acyclic graph that combines computational "vertices" with communication channels (files, TCP pipes, shared-memory FIFOs) into a dataflow graph. Compared with MapReduce, a Dryad job can be an arbitrary DAG with any number of inputs and outputs and multiple stages, and its wider choice of channels boosts performance when TCP pipes or shared memory are used. But, like writing a pipeline of MapReduce jobs, Dryad is a low-level programming model that is hard for users to program, so a more declarative model, DryadLINQ, was created to fill the gap. DryadLINQ exploits LINQ, a query language in .NET, and automatically translates the data-parallel parts of a program into an execution plan that is passed to the Dryad execution engine.
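To make the gap concrete, here is a hedged sketch in Scala - standing in for LINQ and .NET, so the names below are illustrative rather than Dryad's actual API - of the kind of declarative pipeline DryadLINQ compiles into a Dryad graph:

```scala
// Word count written declaratively, as a single operator pipeline.
// A DryadLINQ-like engine would compile each operator (flatMap, groupBy,
// map) into vertices and channels of a Dryad DAG; plain Scala collections
// simply evaluate the same pipeline locally.
object DeclarativeWordCount {
  def main(args: Array[String]): Unit = {
    val lines = Seq("the quick fox", "the lazy dog")

    val counts: Map[String, Int] =
      lines
        .flatMap(_.split("\\s+"))             // tokenize every line
        .groupBy(identity)                    // shuffle-like regrouping by word
        .map { case (w, ws) => (w, ws.size) } // aggregate each group

    println(counts) // Map(the -> 2, quick -> 1, fox -> 1, lazy -> 1, dog -> 1)
  }
}
```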
**Spark**
Spark is a fast, in-memory data processing engine with an elegant and expressive development interface that lets developers efficiently execute machine learning, SQL, or streaming workloads requiring fast iterative access to datasets. It offers a functional-style programming model (similar to DryadLINQ) in which a developer creates acyclic dataflow graphs and transforms a set of input data through map- and reduce-like operators. Spark provides two main abstractions - distributed in-memory storage (RDDs) and parallel operations on datasets (modelled on Scala's collection API) - for high-performance processing, scalability, and fault tolerance.

***Distributed in-memory storage - Resilient Distributed Datasets:***
An RDD is a partitioned, read-only collection of objects that can be created from data in stable storage or by transforming another RDD. It can be distributed (parallelized) across multiple nodes in a cluster and is fault tolerant (hence "resilient"): if a node fails, an RDD can always be recovered from its lineage graph, the record of how it was derived from its source dataset. An RDD is stored in memory (as much as fits, with the rest spilled to disk) and is immutable - it can only be transformed into a new RDD. Transformations are lazy, applied only when an action is performed on the RDD, so an RDD need not be materialized at all times.

The properties that give an RDD the features mentioned above:
- A list of dependencies on other RDDs.

@@ -127,13 +104,40 @@ Spark API provides two kinds of operations on a RDD:
- `collect()` : Return an Array[T] containing all elements
- `count()` : Return the number of elements

By default an RDD is discarded after use. However, Spark provides two explicit operations, persist() and cache(), to keep an RDD in memory once it has been computed for the first time.
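A minimal sketch of this lifecycle against the core Spark API, assuming a local master and a hypothetical input path: the transformations only record lineage, `cache()` marks the result for reuse, and the actions at the end trigger the actual computation.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddLifecycle {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-lifecycle").setMaster("local[*]"))

    // Transformations are lazy: they only extend the lineage graph.
    val lines  = sc.textFile("/data/app.log")      // hypothetical input path
    val errors = lines.filter(_.contains("ERROR")) // still nothing computed

    errors.cache() // keep the partitions in memory once they are computed

    // Actions force evaluation of the lineage.
    val total    = errors.count() // computes `errors` and fills the cache
    val firstFew = errors.take(5) // served from the cached partitions

    println(s"$total error lines; sample: ${firstFew.mkString(" | ")}")
    sc.stop()
  }
}
```

If a cached partition of `errors` is later lost, Spark reapplies the `filter` to the corresponding partition of `lines` rather than restoring a checkpoint - the lineage-based recovery described above.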
***Why RDD over Distributed Shared Memory (DSM)?***
RDDs are immutable and can only be created through coarse-grained transformations, while DSM allows fine-grained reads and writes to each memory location. Hence RDDs do not incur the checkpointing overhead that DSM does, and can be recovered using their lineages. Immutability also means a straggler (slow node) can be replaced by a backup copy running the same task, as in MapReduce; this is hard in DSM, where the two copies point to the same locations and would interfere with each other's updates. Other benefits include scheduling tasks based on data locality to improve performance, and graceful degradation in case of memory shortage: partitions that do not fit in RAM are spilled to disk, at which point performance equals that of any data-parallel system.

### Large-scale Parallelism on Graphs
MapReduce does not scale easily and is highly inefficient for iterative and graph algorithms such as PageRank and many machine learning algorithms. Iterative algorithms require the programmer to explicitly handle intermediate results by writing them to disk, so every iteration reads its input from disk and writes its results back; this disk I/O is a performance bottleneck for any batch processing system.

Graph algorithms also require the exchange of messages between vertices. In PageRank, every vertex needs the contributions of all its adjacent vertices to calculate its score. MapReduce lacks this message-passing model, which makes graph algorithms complex to reason about. One model commonly employed for distributed graph processing is the Bulk Synchronous Parallel (BSP) model.

The BSP model was introduced in the 1980s to capture the hardware design features of parallel computers. It gained popularity as an alternative to MapReduce because it addresses the issues mentioned above. BSP is a synchronous message-passing model in which:

- Computation consists of a sequence of steps called supersteps.
- The processors have their own local memory, and every processor is connected to the others via point-to-point communication.
- In every superstep, a processor receives its input at the beginning, performs computation, and produces its output at the end.
- A processor in superstep S can send messages to processors in superstep S+1 and receive messages sent in superstep S-1.
- Barrier synchronization synchronizes all processors at the end of every superstep.

A notable feature of the model is its complete control over data, through the communication between processors at every superstep. Though similar to the MapReduce model, BSP preserves data in memory across supersteps, which makes iterative graph algorithms easier to reason about.
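To make the superstep discipline concrete, here is a hedged, single-machine simulation of BSP-style PageRank (the toy graph, damping factor, and message bookkeeping are invented for illustration; a real BSP system such as Pregel distributes the vertices and enforces the barrier across machines):

```scala
// Each superstep: consume messages sent in superstep S-1, update local
// state, send messages readable only in superstep S+1. The end of the
// loop body plays the role of the barrier.
object BspPageRank {
  def main(args: Array[String]): Unit = {
    val edges = Map(1 -> Seq(2, 3), 2 -> Seq(3), 3 -> Seq(1)) // toy graph
    val d     = 0.85                                          // damping factor

    var ranks = edges.keys.map(_ -> 1.0).toMap
    var inbox = Map.empty[Int, Seq[Double]] // messages awaiting the next superstep

    for (superstep <- 0 until 20) {
      // Compute phase: fold in the contributions received from superstep S-1.
      if (superstep > 0)
        ranks = ranks.map { case (v, _) =>
          v -> ((1 - d) + d * inbox.getOrElse(v, Nil).sum)
        }
      // Communication phase: send rank/outDegree to each neighbour (for S+1).
      inbox = edges.toSeq
        .flatMap { case (v, outs) => outs.map(u => u -> ranks(v) / outs.size) }
        .groupBy(_._1)
        .map { case (u, msgs) => u -> msgs.map(_._2) }
      // Implicit barrier: the next loop iteration is the next superstep.
    }

    ranks.toSeq.sortBy(_._1).foreach { case (v, r) => println(f"vertex $v: $r%.4f") }
  }
}
```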
### Querying

- Pig/HiveQL/SparkSQL - limitations?

## Execution Models
**MapReduce**: as mentioned in the programming model section, the notable aspect of the execution model is that all intermediate key/value pairs are written to and read from disk. The output of the distributed computation should be the same as that of a non-faulting sequential execution of the entire program, and the model relies on atomic commits of map and reduce task outputs to achieve this. The basic idea is to create private temporary files and rename them only when the task has finished. This makes fault tolerance easy - if a worker fails, one can simply start another - but it is also the bottleneck when running multiple stages.
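A hedged sketch of that commit protocol (the task abstraction and file names are invented for illustration; Hadoop's real OutputCommitter is more involved): work goes into a private temporary file, and only a finished task promotes it with an atomic rename, so a failed or restarted worker can never publish partial output.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

object AtomicCommit {
  // Run one task: write its output to a private temp file, then commit
  // by renaming. Readers either see no file or the complete file.
  def runTask(taskId: String, outDir: Path)(work: => String): Unit = {
    val tmp       = outDir.resolve(s".tmp-$taskId")  // private to this attempt
    val committed = outDir.resolve(s"part-$taskId")  // visible only after commit
    Files.writeString(tmp, work)                     // the actual map/reduce work
    Files.move(tmp, committed, StandardCopyOption.ATOMIC_MOVE)
  }

  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("mr-out")
    runTask("00000", dir) { "k1\t3\nk2\t7\n" } // a toy reduce output
    println(s"committed: ${Files.list(dir).toArray.mkString(", ")}")
  }
}
```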