diff options
| author | msabhi <abhi.is2006@gmail.com> | 2016-11-17 16:47:58 -0500 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2016-11-17 16:47:58 -0500 |
| commit | 59d3351d76963bf6da6489233a4f7adc098382d0 (patch) | |
| tree | c5266dcbb8d9066b4780773e746ed052511abc8d | |
| parent | 8acda3d0cfc366f2c6e52836635c347c312ba7c2 (diff) | |
Update big-data.md
| -rw-r--r-- | chapter/8/big-data.md | 4 |
1 files changed, 2 insertions, 2 deletions
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md index ec39127..e2ff3e3 100644 --- a/chapter/8/big-data.md +++ b/chapter/8/big-data.md @@ -15,8 +15,7 @@ In BSP model + At every superstep, a processor receives input at the beginning, performs computation and outputs at the end. + Barrier synchronization synchs all the processors at the end of every superstep.<br /> -A notable feature of the model is the complete control on data through communication between every processor at every superstep. <br /> -Though similar to map reduce model, BSP preserves data in memory across supersteps and helps in reasoning iterative graph algorithms.<br /> +A notable feature of the model is the complete control on data through communication between every processor at every superstep. BSP preserves data in memory across supersteps and helps in reasoning iterative graph algorithms.<br /> `Pregel` is an implementation of classic BSP model by Google (PageRank) to analyze large graphs exclusively. It was followed by open source implementations - Apache’s Giraph and Hama; which were BSP models built on top of Hadoop. Pregel is highly scalable, fault-tolerant and can successfully represent larger complex graphs. Google claims the API becomes easy once a developer adopts “think like a vertex” mode. Pregel’s computation system is iterative and every iteration is called as superstep. The system takes a directed graph as input with properties assigned to both vertices and graph. At each superstep, all vertices executes in parallel, a user-defined function which represents the behavior of the vertex. The function has access to message sent to its vertex from the previous superstep S-1 and can update the state of the vertex, its edges, the graph and even send messages to other vertices which would receive in the next superstep S+1. The synchronization happens only between two supersteps. Every vertex is either active or inactive at any superstep. The iteration stops when all the vertices are inactive. A vertex can deactivate itself by voting for it and gets active if it receives a message. This asynchronous message passing feature eliminates the shared memory, remote reads and latency of Map reduce model.<br /> @@ -27,6 +26,7 @@ Pregel’s API provides <br /> + Aggregators capture the global state of the graph. A reduce operation combines the value given by every vertex to the aggregator. The combined/aggregated value is passed onto to all the vertices in the next superstep. + Fault tolerance is achieved through checkpointing and instructing the workers to save the state of nodes to a persistent storage. When a machine fails, all workers restart the execution with state of their recent checkpoint. + Master and worker implementation : The master partitions graph into set of vertices (hash on vertex ID mod number of partitions) and outgoing edges per partition. Each partition is assigned to a worker who manages the state of all its vertices by executing compute() method and coordinating the message communication. The workers also notifies the master of the vertices that are active for the next superstep.<br/> + Pregel works good for sparse graphs. However, dense graph could cause communication overhead resulting in system to break. Also, the entire computation state resides in the main memory. Apache Giraph is an open source implementation of Pregel in which new features like master computation, sharded aggregators, edge-oriented input, out-of-core computation are added making it more efficient. The most high performance graph processing framework is GraphLab which is developed at Carnegie Melon University and uses the BSP model and executes on MPI. |
