### 1.3 Large-scale Parallelism on Graphs

Map Reduce doesn’t scale easily for iterative / graph algorithms such as PageRank and machine learning algorithms. Iterative algorithms require a programmer to explicitly handle the intermediate results (writing them to disk), resulting in a lot of boilerplate code. Hence, every iteration requires reading the input from disk and writing the results back, and the resulting disk I/O is a performance bottleneck for any batch processing system.

Also, graph algorithms require an exchange of messages between vertices. In the case of PageRank, every vertex requires the contributions from all its adjacent nodes to calculate its score. Map reduce currently lacks this model of message passing, which makes it complex to reason about graph algorithms. One model that is commonly employed for implementing distributed graph processing is the graph-parallel model.

In the graph-parallel abstraction, a user-defined vertex program is instantiated concurrently for each vertex and interacts with adjacent vertex programs through messages or shared state. Each vertex program can read and modify its vertex property and, in some cases, adjacent vertex properties. When all vertex programs vote to halt, the program terminates. The bulk-synchronous parallel (BSP) model {% cite valiant1990bridging --file big-data%} is one of the most commonly used graph-parallel models.

BSP was introduced in 1990 to represent the hardware design features of parallel computers. It gained popularity as an alternative to map reduce since it addressed the above-mentioned issues. The BSP model is a synchronous message-passing model where:

- Computation consists of several steps called supersteps.
- The processors involved have their own local memory, and every processor is connected to the others via point-to-point communication.
- At every superstep, a processor receives input at the beginning, performs local computation, and produces output at the end.
- A processor at superstep S can send a message to another processor at superstep S+1 and can likewise receive a message from superstep S-1.
- Barrier synchronization syncs all the processors at the end of every superstep.
- A notable feature of the model is the complete control of data through communication between every processor at every superstep. Though similar to the map reduce model, BSP preserves data in memory across supersteps, which helps in reasoning about iterative graph algorithms (see the sketch below).
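To make the superstep structure concrete, here is a minimal single-process sketch of the model (an illustration only, not any particular system’s API; the graph and constants are made up): each “processor” owns one vertex, and PageRank contributions sent in superstep S are only delivered at the barrier, before superstep S+1.

```scala
object BspSketch {
  def main(args: Array[String]): Unit = {
    // Adjacency list: vertex -> outgoing neighbors (a tiny example graph).
    val out = Map(1 -> List(2, 3), 2 -> List(3), 3 -> List(1))
    val n = out.size
    var rank = out.keys.map(v => v -> 1.0 / n).toMap

    for (superstep <- 1 to 10) {
      // Local computation: each vertex emits rank / outDegree to its neighbors.
      val messages = for {
        (v, nbrs) <- out.toList
        nbr       <- nbrs
      } yield nbr -> rank(v) / nbrs.size

      // Barrier: all messages from superstep S are delivered together,
      // then every vertex recomputes its rank for superstep S+1.
      val inbox = messages.groupBy(_._1).map { case (v, ms) => v -> ms.map(_._2).sum }
      rank = rank.map { case (v, _) => v -> (0.15 / n + 0.85 * inbox.getOrElse(v, 0.0)) }
    }
    rank.foreach { case (v, r) => println(f"vertex $v -> rank $r%.4f") }
  }
}
```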
The graph-parallel abstractions allow users to succinctly describe graph algorithms and provide a runtime engine to execute these algorithms in a distributed manner. They simplify the design, implementation, and application of sophisticated graph algorithms to large-scale real-world problems. Each of these frameworks presents a different view of graph computation, tailored to an originating domain or family of graph algorithms. However, these frameworks fail to address the problems of data preprocessing and graph construction, favor snapshot recovery over fault tolerance, and lack support from distributed data flow frameworks. The data-parallel systems are well suited to the task of graph construction and are highly scalable, but they suffer from the very problems for which the graph-parallel systems came into existence. GraphX {%cite xin2013graphx --file big-data%} is a new computation system which builds upon Spark’s Resilient Distributed Dataset (RDD) to form a new abstraction, the Resilient Distributed Graph (RDG), which represents records and their relations as vertices and edges respectively. RDGs leverage the RDD’s fault tolerance mechanism and expressivity.
How does GraphX improve over the existing graph-parallel and data flow models?

Similar to the data flow model, GraphX moves away from the vertex-centric view and adopts transformations on graphs, each yielding a new graph. The RDGs in GraphX provide a set of elegant and expressive computational primitives that support graph transformations and enable many graph-parallel systems, such as Pregel and PowerGraph, to be expressed with minimal code changes to Spark. GraphX simplifies the process of graph ETL and analysis through new operations like filter, view, etc. It minimizes communication and storage overhead across the system by adopting vertex-cuts for effective partitioning.

**GraphX**

GraphX models graphs as property graphs, where vertices and edges can have properties. Property graphs are directed multigraphs that allow multiple parallel edges with the same source and destination, to realize scenarios where multiple relationships exist between two vertices. For example, in a social graph where every vertex represents a person, two people could be co-workers and friends at the same time. A vertex is keyed by a unique 64-bit long identifier (Vertex ID), while edges contain the corresponding source and destination vertex identifiers.
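The following sketch shows how such a property graph could be built with Spark’s GraphX (illustrative code; the vertex and edge data are made up for this example):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object PropertyGraphExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("property-graph").setMaster("local[*]"))

    // Vertices are keyed by a 64-bit VertexId and carry a property (here, a name).
    val people: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))

    // Parallel edges with the same source and destination capture multiple
    // relationships between the same pair of people.
    val relations: RDD[Edge[String]] =
      sc.parallelize(Seq(
        Edge(1L, 2L, "friend"),
        Edge(1L, 2L, "co-worker"), // parallel edge: alice and bob are both
        Edge(2L, 3L, "friend")))   // friends and co-workers

    val graph: Graph[String, String] = Graph(people, relations)
    println(s"${graph.vertices.count()} vertices, ${graph.edges.count()} edges")
    sc.stop()
  }
}
```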
The GraphX API provides the following primitives for graph transformations (from the GraphX programming guide: https://spark.apache.org/docs/2.0.0-preview/graphx-programming-guide.html):

- `graph` - constructs a property graph given a collection of edges and vertices.
- `vertices: VertexRDD[VD]`, `edges: EdgeRDD[ED]` - decompose the graph into a collection of vertices or edges by extracting the vertex or edge RDDs.
- `mapVertices(map: (Id,V)=>(Id,V2)) => Graph[V2, E]` - transforms the vertex collection.
- `mapEdges(map: (Id, Id, E)=>(Id, Id, E2))` - transforms the edge collection.
- `triplets: RDD[EdgeTriplet[VD, ED]]` - returns a collection of the form ((i, j), (PV(i), PE(i, j), PV(j))). The operator essentially requires a multiway join between the vertex and edge RDDs. This operation is optimized by shifting the site of the joins to the edges, using a routing table, so that only vertex data needs to be shuffled.
- `leftJoin` - given a collection of vertices and a graph, returns a new graph which incorporates the properties of matching vertices from the given collection into the given graph without changing the underlying graph structure.
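Continuing with the `graph` built in the earlier sketch, a short illustration of how some of these primitives compose (the signatures in the list are the paper’s notation; the concrete Spark calls below pass functions over vertex and edge values):

```scala
// mapVertices: derive a new graph whose vertex properties are upper-cased names.
val shouting: Graph[String, String] =
  graph.mapVertices((id, name) => name.toUpperCase)

// mapEdges: derive a new graph whose edge property also records its length.
val tagged: Graph[String, (String, Int)] =
  graph.mapEdges(e => (e.attr, e.attr.length))

// triplets: each record joins the source property, edge property, and
// destination property -- ((i, j), (PV(i), PE(i, j), PV(j))).
graph.triplets
  .map(t => s"${t.srcAttr} is a ${t.attr} of ${t.dstAttr}")
  .collect()
  .foreach(println)
```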
Graph-parallel computation requires every vertex or edge to be processed in the context of its neighborhood. Each transformation depends on the result of distributed joins between vertices and edges. This means that graph computation systems rely on graph partitioning (edge-cuts in most systems) and efficient storage to minimize communication and storage overhead and ensure balanced computation.

<figure class="main-container">
  <img src="./edge-cut.png" alt="edge cuts" />
</figure>
*Figure from {%cite xin2013graphx --file big-data%}*

***Why Edge-cuts are expensive?***

Edge-cut partitioning randomly assigns vertices (and, with them, edges) across all the machines. The communication and storage overhead is proportional to the number of edges cut, which makes balancing the number of cuts a priority. For most real-world graphs, constructing an optimal edge-cut is cost prohibitive, and most systems use random edge-cuts, which achieve appropriate work balance but nearly worst-case communication overhead.

<figure class="main-container">
  <img src="./vertex-cut.png" alt="Vertex cuts" />
</figure>
*Figure from {%cite xin2013graphx --file big-data%}*

***Vertex-cuts - GraphX’s solution to effective partitioning*** : An alternative approach does the opposite of an edge-cut: it evenly assigns edges to machines but allows vertices to span multiple machines. The communication and storage overhead of a vertex-cut is directly proportional to the sum of the number of machines spanned by each vertex. Therefore, we can reduce communication overhead and ensure balanced computation by evenly assigning edges to machines in a way that minimizes the number of machines spanned by each vertex.

***Implementation of Vertex-cut***
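In the Spark API, a user can opt into vertex-cut partitioning by re-partitioning the edges of an existing graph with `Graph.partitionBy` and one of the built-in `PartitionStrategy` values; a brief, illustrative sketch (reusing the `graph` from the earlier examples):

```scala
import org.apache.spark.graphx.PartitionStrategy

// EdgePartition2D lays edges out in a sparse adjacency matrix and assigns
// them to a grid of partitions, bounding the number of partitions any one
// vertex can span by roughly 2 * sqrt(number of partitions).
val partitioned: Graph[String, String] =
  graph.partitionBy(PartitionStrategy.EdgePartition2D, 8)
```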
