From b82f8af807915aaa0020a39d1b1a61f5d23ca2ff Mon Sep 17 00:00:00 2001
From: cnnrznn
Date: Wed, 16 Nov 2016 12:14:42 -0500
Subject: Update dist-langs.md

---
 chapter/4/dist-langs.md | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)
(limited to 'chapter')

diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md
index 9c8a8c9..d8201fa 100644
--- a/chapter/4/dist-langs.md
+++ b/chapter/4/dist-langs.md
@@ -1,11 +1,23 @@
 ---
 layout: page
 title: "Distributed Programming Languages"
-by: "Joe Schmoe and Mary Jane"
+by: "A Systems Person"
 ---
-Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. {% cite Uniqueness --file dist-langs %}
+Distributed programming is hard because of:
+
+* Network partitions
+* Node failures
+* Efficiency / Communication
+* Data distribution / locality
+
+Approaches:
+
+* Message-passing
+* RPC
+* Actors
+* Coordination (Linda)

 ## References

-{% bibliography --file dist-langs %}
\ No newline at end of file
+{% bibliography --file dist-langs %}
-- 
cgit v1.2.3


From c59d17f62954a29c3556d4211680bdebe6842af6 Mon Sep 17 00:00:00 2001
From: cnnrznn
Date: Wed, 16 Nov 2016 13:03:00 -0500
Subject: Update dist-langs.md bullets

---
 chapter/4/dist-langs.md | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)
(limited to 'chapter')

diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md
index d8201fa..fd3f053 100644
--- a/chapter/4/dist-langs.md
+++ b/chapter/4/dist-langs.md
@@ -11,12 +11,28 @@ Distributed programming is hard because of:
 * Efficiency / Communication
 * Data distribution / locality

-Approaches:
+### Two major, orthogonal approaches to distributed languages:

+#### Actor / Object model
+
+* Erlang
+* Cloud Haskell
+
+#### Dataflow model
+
+The dataflow model has its roots in functional programming.
+Some languages that use this model are:
+
+* Multilisp
+* MapReduce (Spark, Hadoop, etc.)
+
+### Why GPLs, not DSLs?
+
+* problem of domain-composition
+* problem of abstraction
+* problem of ecosystem
+* problem of tumultuous architecture
+* "any GPL + library can act as a DSL" - Mernik

 ## References
-- 
cgit v1.2.3


From f832573aad966fa9f600f1707eb709f7e89814c3 Mon Sep 17 00:00:00 2001
From: cnnrznn
Date: Wed, 16 Nov 2016 13:03:41 -0500
Subject: Update dist-langs.md

---
 chapter/4/dist-langs.md | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)
(limited to 'chapter')

diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md
index fd3f053..d268c3a 100644
--- a/chapter/4/dist-langs.md
+++ b/chapter/4/dist-langs.md
@@ -4,19 +4,12 @@ title: "Distributed Programming Languages"
 by: "A Systems Person"
 ---

-Distributed programming is hard because of:
-
-* Network partitions
-* Node failures
-* Efficiency / Communication
-* Data distribution / locality
-
 ### Two major, orthogonal approaches to distributed languages:

 #### Actor / Object model

 * Erlang
-* Cloud Haskell
+* Cloud Haskell (I know, right? Why?)
 #### Dataflow model
-- 
cgit v1.2.3


From 1818c8eabf2cb1c65019bafd57198eaade8af9c0 Mon Sep 17 00:00:00 2001
From: cnnrznn
Date: Wed, 16 Nov 2016 13:51:13 -0500
Subject: Update dist-langs.md

---
 chapter/4/dist-langs.md | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)
(limited to 'chapter')

diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md
index d268c3a..9f3a91a 100644
--- a/chapter/4/dist-langs.md
+++ b/chapter/4/dist-langs.md
@@ -8,10 +8,13 @@ by: "A Systems Person"

 #### Actor / Object model

+The actor model has its roots in procedural programming.
+This model maps in a straightforward way to a distributed environment.
+
 * Erlang
 * Cloud Haskell (I know, right? Why?)

-#### Dataflow model
+#### Dataflow model (static and stream)

 The dataflow model has its roots in functional programming.
 Some languages that use this model are:

@@ -27,6 +30,20 @@ Some languages that use this model are:
 * problem of tumultuous architecture
 * "any GPL + library can act as a DSL" - Mernik

+#### Erlang vs C: A Tar and Feathering
+
+[citation erlang paper]
+
+Erlang has only one clear benefit over C, which is dynamic code upgrading.
+However, there are ways of making C behave in a similar fashion with minimal downtime.
+Shuffler [citation] is a system for continuous randomization of code.
+Using techniques discussed in that paper, one could dynamically replace sections of a binary.
+Another, slightly hack-ish workaround would be to receive the upgrade, serialize the current state, and finally run the new binary based on the serialized state.
+
+Other than dynamic code swapping and poor error detection, Erlang does not offer anything that is not offered by a traditional OS.
+Isolation, concurrency, and message passing can all be accomplished with unix-style system calls.
+Why is this language not considered redundant?
+
 ## References

 {% bibliography --file dist-langs %}
-- 
cgit v1.2.3


From 53e58a99885ddcf08fa5a352a917a9e6100e093a Mon Sep 17 00:00:00 2001
From: cnnrznn
Date: Wed, 16 Nov 2016 13:58:53 -0500
Subject: Update dist-langs.md

---
 chapter/4/dist-langs.md | 2 ++
 1 file changed, 2 insertions(+)
(limited to 'chapter')

diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md
index 9f3a91a..8745be9 100644
--- a/chapter/4/dist-langs.md
+++ b/chapter/4/dist-langs.md
@@ -39,6 +39,8 @@ However, there are ways of making C behave in a similar fashion with minimal dow
 Shuffler [citation] is a system for continuous randomization of code.
 Using techniques discussed in that paper, one could dynamically replace sections of a binary.
 Another, slightly hack-ish workaround would be to receive the upgrade, serialize the current state, and finally run the new binary based on the serialized state.
+A third way of circumventing this problem would be to encapsulate the code in a shared library, and have logic in the program to unmap the old code, replace the library, and remap it; a minimal sketch of this shared-library approach appears below.
+This approach is analogous to Erlang's approach.

 Other than dynamic code swapping and poor error detection, Erlang does not offer anything that is not offered by a traditional OS.
 Isolation, concurrency, and message passing can all be accomplished with unix-style system calls.
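To make the shared-library approach concrete, here is a minimal C++ sketch of swapping code in a running process with `dlopen`/`dlsym`/`dlclose`. The library name `./libhandler.so` and the `handler()` signature are assumptions for the illustration, not anything from the Erlang paper.

```cpp
// Minimal sketch of dynamic code swapping via dlopen (link with -ldl).
// Assumes a hypothetical ./libhandler.so exporting: extern "C" int handler(int).
#include <dlfcn.h>
#include <cstdio>

using handler_fn = int (*)(int);

int main() {
  for (int round = 0; round < 2; ++round) {
    void* lib = dlopen("./libhandler.so", RTLD_NOW);   // map the current code
    if (!lib) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

    auto handler = reinterpret_cast<handler_fn>(dlsym(lib, "handler"));
    if (handler) std::printf("handler(7) = %d\n", handler(7));

    dlclose(lib);  // unmap the old code; an upgraded libhandler.so dropped in
                   // its place is picked up by the next dlopen(), no restart
  }
  return 0;
}
```

The process keeps running across the swap; only the mapped code changes, which is essentially the property Erlang's code upgrading provides.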
-- 
cgit v1.2.3


From 82620a042d98e2d3c1f2e87be17c87cd329ccca3 Mon Sep 17 00:00:00 2001
From: cnnrznn
Date: Wed, 16 Nov 2016 14:27:18 -0500
Subject: Update dist-langs.md

---
 chapter/4/dist-langs.md | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)
(limited to 'chapter')

diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md
index 8745be9..a097f51 100644
--- a/chapter/4/dist-langs.md
+++ b/chapter/4/dist-langs.md
@@ -32,18 +32,15 @@ Some languages that use this model are:

 #### Erlang vs C: A Tar and Feathering

-[citation erlang paper]
-
-Erlang has only one clear benefit over C, which is dynamic code upgrading.
-However, there are ways of making C behave in a similar fashion with minimal downtime.
-Shuffler [citation] is a system for continuous randomization of code.
-Using techniques discussed in that paper, one could dynamically replace sections of a binary.
-Another, slightly hack-ish workaround would be to receive the upgrade, serialize the current state, and finally run the new binary based on the serialized state.
-A third way of circumventing this problem would be to encapsulate the code in a shared library, and have logic in the program to unmap the old code, replace the library, and remap it; a minimal sketch of this shared-library approach appears below.
-This approach is analogous to Erlang's approach.
-
-Other than dynamic code swapping and poor error detection, Erlang does not offer anything that is not offered by a traditional OS.
-Isolation, concurrency, and message passing can all be accomplished with unix-style system calls.
+{% cite Armstrong2010 --file dist-langs %}
+
+Erlang offers nothing that is unavailable in C.
+
+For example, dynamic code swapping is one of Erlang's major selling points.
+However, code swapping can easily be achieved in C with dynamic linking.
+This approach is analogous to the example offered in the Erlang paper.
+
+Other selling points, such as isolation, concurrency, and message passing, can all be accomplished with unix-style system calls.
 Why is this language not considered redundant?

 ## References
-- 
cgit v1.2.3


From 21ef2e4488013769d08a27765b21017e7713a91f Mon Sep 17 00:00:00 2001
From: cnnrznn
Date: Wed, 16 Nov 2016 14:34:56 -0500
Subject: Update dist-langs.md

---
 chapter/4/dist-langs.md | 6 ++++++
 1 file changed, 6 insertions(+)
(limited to 'chapter')

diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md
index a097f51..9f04232 100644
--- a/chapter/4/dist-langs.md
+++ b/chapter/4/dist-langs.md
@@ -43,6 +43,12 @@ This approach is analogous to the example offered in the Erlang paper.
 Other selling points, such as isolation, concurrency, and message passing, can all be accomplished with unix-style system calls.
 Why is this language not considered redundant?

+#### MapReduce: A New Hope
+
+Unlike Erlang, MapReduce and the DSLs that implement the paradigm are "all the rage."
+MapReduce has seen wide adoption because it offers a true abstraction of the problems of distributed computing.
+Erlang only provided a way of detecting a process failure; it did not consider machine or network failures.
+ ## References {% bibliography --file dist-langs %} -- cgit v1.2.3 From 8fdcf6b4a99834901812f5d1c41e596ecc370647 Mon Sep 17 00:00:00 2001 From: msabhi Date: Thu, 17 Nov 2016 16:38:52 -0500 Subject: Update big-data.md --- chapter/8/big-data.md | 27 +++++++++++++++++++++++++-- 1 file changed, 25 insertions(+), 2 deletions(-) (limited to 'chapter') diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md index bfd3e7b..b08ae70 100644 --- a/chapter/8/big-data.md +++ b/chapter/8/big-data.md @@ -4,8 +4,31 @@ title: "Large Scale Parallel Data Processing" by: "Joe Schmoe and Mary Jane" --- -Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. {% cite Uniqueness --file big-data %} +Though highly efficient and one of the first major programming models for distributed batch processing, it too has a few limitations.
+MapReduce does not scale easily and is highly inefficient for iterative and graph algorithms such as PageRank and many machine learning algorithms. Iterative algorithms require the programmer to explicitly handle intermediate results by writing them to disk. Hence, every iteration rereads the input and writes its results back to disk; this heavy disk I/O is a performance bottleneck for any batch processing system.
+Graph algorithms also require the exchange of messages between vertices. In the case of PageRank, every vertex needs the contributions of all its adjacent nodes to calculate its score. MapReduce lacks this message-passing model, which makes graph algorithms complex to reason about.
+The `bulk synchronous parallel` (BSP) model was introduced by Leslie Valiant in the late 1980s to model the hardware design of parallel computers. It has gained popularity as an alternative to MapReduce since it addresses the above-mentioned issues to an extent.
+In BSP model
++ Computation consists of several steps called supersteps.
+2> The processors involved have their own local memory, and every processor is connected to the others via point-to-point communication.
+3> At every superstep, a processor receives input at the beginning, performs computation, and produces output at the end.
+4> Barrier synchronization synchronizes all the processors at the end of every superstep.
+A notable feature of the model is the complete control over data through communication between processors at every superstep.
+Though similar to the MapReduce model, BSP preserves data in memory across supersteps, which helps in reasoning about iterative graph algorithms.
+Pregel is Google's implementation of the classic BSP model, built exclusively to analyze large graphs (PageRank being the canonical example). It was followed by open-source implementations - Apache's Giraph and Hama - which are BSP models built on top of Hadoop.
+Pregel is highly scalable and fault-tolerant, and it can successfully represent large, complex graphs. Google claims the API becomes easy once a developer adopts the "think like a vertex" mode.
+Pregel's computation is iterative, and every iteration is called a superstep. The system takes as input a directed graph with properties assigned to both vertices and edges. At each superstep, all vertices execute, in parallel, a user-defined function that represents the behavior of a vertex. The function has access to the messages sent to its vertex in the previous superstep S-1; it can update the state of the vertex and its edges, and it can send messages to other vertices, which will receive them in the next superstep S+1. Synchronization happens only between supersteps. Every vertex is either active or inactive at any superstep, and the iteration stops when all vertices are inactive. A vertex can deactivate itself by voting to halt, and it becomes active again if it receives a message. This message-passing design eliminates the shared memory, remote reads, and latency of the MapReduce model.
+Pregel's API provides
+1> compute(), a method in which the user implements the logic to change the state of the graph/vertex at every superstep. Message delivery is guaranteed through an iterator at every superstep.
+2> User-defined handlers for handling issues like a missing destination vertex.
+3> Combiners, which reduce the number of messages passed from multiple vertices to the same destination vertex.
+4> Aggregators, which capture the global state of the graph. A reduce operation combines the value given by every vertex to the aggregator, and the aggregated value is passed on to all the vertices in the next superstep.
+5> Fault tolerance, achieved through checkpointing and instructing the workers to save the state of their nodes to persistent storage. When a machine fails, all workers restart execution from the state of their most recent checkpoint.
+6> Master and worker implementation: the master partitions the graph into sets of vertices (hash on vertex ID mod the number of partitions) and outgoing edges per partition. Each partition is assigned to a worker, which manages the state of all its vertices by executing the compute() method and coordinating message communication. The workers also notify the master of the vertices that are active for the next superstep.
+Pregel works well for sparse graphs. However, dense graphs can cause communication overhead that overwhelms the system. Also, the entire computation state resides in main memory.
+Apache Giraph is an open-source implementation of Pregel that adds new features - master computation, sharded aggregators, edge-oriented input, out-of-core computation - making it more efficient. Another high-performance graph processing framework is GraphLab, developed at Carnegie Mellon University, which executes on top of MPI.
+
 ## References

-{% bibliography --file big-data %}
\ No newline at end of file
+{% bibliography --file big-data %}
-- 
cgit v1.2.3


From 99b7aadcaf75b04c9d7a31d6d9671d13f609db36 Mon Sep 17 00:00:00 2001
From: msabhi
Date: Thu, 17 Nov 2016 16:40:46 -0500
Subject: Update big-data.md

---
 chapter/8/big-data.md | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)
(limited to 'chapter')

diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index b08ae70..9dfd1d6 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md
@@ -7,24 +7,25 @@ by: "Joe Schmoe and Mary Jane"
 Though highly efficient and one of the first major programming models for distributed batch processing, MapReduce has a few limitations.
MapReduce does not scale easily and is highly inefficient for iterative and graph algorithms such as PageRank and many machine learning algorithms. Iterative algorithms require the programmer to explicitly handle intermediate results by writing them to disk. Hence, every iteration rereads the input and writes its results back to disk; this heavy disk I/O is a performance bottleneck for any batch processing system.
Graph algorithms also require the exchange of messages between vertices. In the case of PageRank, every vertex needs the contributions of all its adjacent nodes to calculate its score. MapReduce lacks this message-passing model, which makes graph algorithms complex to reason about.
-The `bulk synchronous parallel` (BSP) model was introduced by Leslie Valiant in the late 1980s to model the hardware design of parallel computers. It has gained popularity as an alternative to MapReduce since it addresses the above-mentioned issues to an extent.
+### Bulk synchronous parallel model
+This model was introduced by Leslie Valiant in the late 1980s to model the hardware design of parallel computers. It has gained popularity as an alternative to MapReduce since it addresses the above-mentioned issues to an extent.
 In BSP model
 + Computation consists of several steps called supersteps.
++ The processors involved have their own local memory, and every processor is connected to the others via point-to-point communication.
++ At every superstep, a processor receives input at the beginning, performs computation, and produces output at the end.
++ Barrier synchronization synchronizes all the processors at the end of every superstep.
+A notable feature of the model is the complete control over data through communication between processors at every superstep.
+Though similar to map reduce model, BSP preserves data in memory across supersteps and helps in reasoning iterative graph algorithms.
`Pregel` is Google's implementation of the classic BSP model, built exclusively to analyze large graphs (PageRank being the canonical example). It was followed by open-source implementations - Apache's Giraph and Hama - which are BSP models built on top of Hadoop.
Pregel is highly scalable and fault-tolerant, and it can successfully represent large, complex graphs. Google claims the API becomes easy once a developer adopts the "think like a vertex" mode.
Pregel's computation is iterative, and every iteration is called a superstep. The system takes as input a directed graph with properties assigned to both vertices and edges. At each superstep, all vertices execute, in parallel, a user-defined function that represents the behavior of a vertex. The function has access to the messages sent to its vertex in the previous superstep S-1; it can update the state of the vertex and its edges, and it can send messages to other vertices, which will receive them in the next superstep S+1. Synchronization happens only between supersteps. Every vertex is either active or inactive at any superstep, and the iteration stops when all vertices are inactive. A vertex can deactivate itself by voting to halt, and it becomes active again if it receives a message. This message-passing design eliminates the shared memory, remote reads, and latency of the MapReduce model.
#### Pregel's API provides
+ compute(), a method in which the user implements the logic to change the state of the graph/vertex at every superstep. Message delivery is guaranteed through an iterator at every superstep.
+ User-defined handlers for handling issues like a missing destination vertex.
+ Combiners, which reduce the number of messages passed from multiple vertices to the same destination vertex.
+ Aggregators, which capture the global state of the graph. A reduce operation combines the value given by every vertex to the aggregator, and the aggregated value is passed on to all the vertices in the next superstep.
+ Fault tolerance, achieved through checkpointing and instructing the workers to save the state of their nodes to persistent storage. When a machine fails, all workers restart execution from the state of their most recent checkpoint.
+ Master and worker implementation: the master partitions the graph into sets of vertices (hash on vertex ID mod the number of partitions) and outgoing edges per partition. Each partition is assigned to a worker, which manages the state of all its vertices by executing the compute() method and coordinating message communication. The workers also notify the master of the vertices that are active for the next superstep.
Pregel works well for sparse graphs. However, dense graphs can cause communication overhead that overwhelms the system. Also, the entire computation state resides in main memory.
Apache Giraph is an open-source implementation of Pregel that adds new features - master computation, sharded aggregators, edge-oriented input, out-of-core computation - making it more efficient. Another high-performance graph processing framework is GraphLab, developed at Carnegie Mellon University, which executes on top of MPI.
-- 
cgit v1.2.3


From d63ceb8aa8c280d789e3163426ea86bff16997e0 Mon Sep 17 00:00:00 2001
From: msabhi
Date: Thu, 17 Nov 2016 16:41:41 -0500
Subject: Update big-data.md

---
 chapter/8/big-data.md | 1 +
 1 file changed, 1 insertion(+)
(limited to 'chapter')

diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index 9dfd1d6..6a3edd3 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md
@@ -14,6 +14,7 @@ In BSP model
 + The processors involved have their own local memory, and every processor is connected to the others via point-to-point communication.
 + At every superstep, a processor receives input at the beginning, performs computation, and produces output at the end.
 + Barrier synchronization synchronizes all the processors at the end of every superstep.
++ A notable feature of the model is the complete control over data through communication between processors at every superstep.
Though similar to the MapReduce model, BSP preserves data in memory across supersteps, which helps in reasoning about iterative graph algorithms.
`Pregel` is an implementation of classic BSP model by Google (PageRank) to analyze large graphs exclusively. It was followed by open source implementations - Apache’s Giraph and Hama; which were BSP models built on top of Hadoop. -- cgit v1.2.3 From ce6727b47cbd0d4f0ac3407b498982a09c4b3e50 Mon Sep 17 00:00:00 2001 From: msabhi Date: Thu, 17 Nov 2016 16:43:04 -0500 Subject: Update big-data.md --- chapter/8/big-data.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'chapter') diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md index 6a3edd3..eef8a8f 100644 --- a/chapter/8/big-data.md +++ b/chapter/8/big-data.md @@ -15,7 +15,7 @@ In BSP model + At every superstep, a processor receives input at the beginning, performs computation and outputs at the end. + Barrier synchronization synchs all the processors at the end of every superstep. + -A notable feature of the model is the complete control on data through communication between every processor at every superstep.
++A notable feature of the model is the complete control on data through communication between every processor at every superstep.
Though similar to map reduce model, BSP preserves data in memory across supersteps and helps in reasoning iterative graph algorithms.
`Pregel` is an implementation of classic BSP model by Google (PageRank) to analyze large graphs exclusively. It was followed by open source implementations - Apache’s Giraph and Hama; which were BSP models built on top of Hadoop. Pregel is highly scalable, fault-tolerant and can successfully represent larger complex graphs. Google claims the API becomes easy once a developer adopts “think like a vertex” mode. -- cgit v1.2.3 From 7ad2750c2af8b62717eef017cbdb4a370fbca2e5 Mon Sep 17 00:00:00 2001 From: msabhi Date: Thu, 17 Nov 2016 16:44:00 -0500 Subject: Update big-data.md --- chapter/8/big-data.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'chapter') diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md index eef8a8f..d17d2b1 100644 --- a/chapter/8/big-data.md +++ b/chapter/8/big-data.md @@ -19,8 +19,8 @@ In BSP model Though similar to map reduce model, BSP preserves data in memory across supersteps and helps in reasoning iterative graph algorithms.
`Pregel` is an implementation of classic BSP model by Google (PageRank) to analyze large graphs exclusively. It was followed by open source implementations - Apache’s Giraph and Hama; which were BSP models built on top of Hadoop. Pregel is highly scalable, fault-tolerant and can successfully represent larger complex graphs. Google claims the API becomes easy once a developer adopts “think like a vertex” mode. -Pregel’s computation system is iterative and every iteration is called as superstep. The system takes a directed graph as input with properties assigned to both vertices and graph. At each superstep, all vertices executes in parallel, a user-defined function which represents the behavior of the vertex. The function has access to message sent to its vertex from the previous superstep S-1 and can update the state of the vertex, its edges, the graph and even send messages to other vertices which would receive in the next superstep S+1. The synchronization happens only between two supersteps. Every vertex is either active or inactive at any superstep. The iteration stops when all the vertices are inactive. A vertex can deactivate itself by voting for it and gets active if it receives a message. This asynchronous message passing feature eliminates the shared memory, remote reads and latency of Map reduce model. -#### Pregel’s API provides +Pregel’s computation system is iterative and every iteration is called as superstep. The system takes a directed graph as input with properties assigned to both vertices and graph. At each superstep, all vertices executes in parallel, a user-defined function which represents the behavior of the vertex. The function has access to message sent to its vertex from the previous superstep S-1 and can update the state of the vertex, its edges, the graph and even send messages to other vertices which would receive in the next superstep S+1. The synchronization happens only between two supersteps. Every vertex is either active or inactive at any superstep. The iteration stops when all the vertices are inactive. A vertex can deactivate itself by voting for it and gets active if it receives a message. This asynchronous message passing feature eliminates the shared memory, remote reads and latency of Map reduce model.
+Pregel’s API provides
+ compute() method for the user to implement the logic to change the state of the graph/vertex at every superstep. It guarantees message delivery through an iterator at every superstep. + User defined handler for handling issues like missing destination vertex etc. + Combiners reduce the amount of messages passed from multiple vertices to the same destination vertex. -- cgit v1.2.3 From c4ffeb4613d4cc96d4f520d4f57be9b22d951c1f Mon Sep 17 00:00:00 2001 From: msabhi Date: Thu, 17 Nov 2016 16:44:30 -0500 Subject: Update big-data.md --- chapter/8/big-data.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'chapter') diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md index d17d2b1..89e3d6a 100644 --- a/chapter/8/big-data.md +++ b/chapter/8/big-data.md @@ -13,9 +13,9 @@ In BSP model + Computation consists of several steps called as supersets. + The processors involved have their own local memory and every processor is connected to other via a point-to-point communication. + At every superstep, a processor receives input at the beginning, performs computation and outputs at the end. -+ Barrier synchronization synchs all the processors at the end of every superstep. -+ -+A notable feature of the model is the complete control on data through communication between every processor at every superstep.
++ Barrier synchronization synchs all the processors at the end of every superstep.
+ +A notable feature of the model is the complete control on data through communication between every processor at every superstep.
Though similar to map reduce model, BSP preserves data in memory across supersteps and helps in reasoning iterative graph algorithms.
`Pregel` is an implementation of classic BSP model by Google (PageRank) to analyze large graphs exclusively. It was followed by open source implementations - Apache’s Giraph and Hama; which were BSP models built on top of Hadoop. Pregel is highly scalable, fault-tolerant and can successfully represent larger complex graphs. Google claims the API becomes easy once a developer adopts “think like a vertex” mode. -- cgit v1.2.3 From 5b5b7221244988db67903836ebe6d76194001f23 Mon Sep 17 00:00:00 2001 From: msabhi Date: Thu, 17 Nov 2016 16:45:45 -0500 Subject: Update big-data.md --- chapter/8/big-data.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'chapter') diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md index 89e3d6a..0bafaed 100644 --- a/chapter/8/big-data.md +++ b/chapter/8/big-data.md @@ -26,7 +26,7 @@ Pregel’s API provides
+ Combiners reduce the amount of messages passed from multiple vertices to the same destination vertex. + Aggregators capture the global state of the graph. A reduce operation combines the value given by every vertex to the aggregator. The combined/aggregated value is passed onto to all the vertices in the next superstep. + Fault tolerance is achieved through checkpointing and instructing the workers to save the state of nodes to a persistent storage. When a machine fails, all workers restart the execution with state of their recent checkpoint. -+ Master and worker implementation : The master partitions graph into set of vertices (hash on vertex ID mod number of partitions) and outgoing edges per partition. Each partition is assigned to a worker who manages the state of all its vertices by executing compute() method and coordinating the message communication. The workers also notifies the master of the vertices that are active for the next superstep. ++ Master and worker implementation : The master partitions graph into set of vertices (hash on vertex ID mod number of partitions) and outgoing edges per partition. Each partition is assigned to a worker who manages the state of all its vertices by executing compute() method and coordinating the message communication. The workers also notifies the master of the vertices that are active for the next superstep.
Pregel works good for sparse graphs. However, dense graph could cause communication overhead resulting in system to break. Also, the entire computation state resides in the main memory. Apache Giraph is an open source implementation of Pregel in which new features like master computation, sharded aggregators, edge-oriented input, out-of-core computation are added making it more efficient. The most high performance graph processing framework is GraphLab which is developed at Carnegie Melon University and uses the BSP model and executes on MPI. -- cgit v1.2.3 From 8acda3d0cfc366f2c6e52836635c347c312ba7c2 Mon Sep 17 00:00:00 2001 From: msabhi Date: Thu, 17 Nov 2016 16:46:45 -0500 Subject: Update big-data.md --- chapter/8/big-data.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'chapter') diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md index 0bafaed..ec39127 100644 --- a/chapter/8/big-data.md +++ b/chapter/8/big-data.md @@ -26,7 +26,7 @@ Pregel’s API provides
+ Combiners reduce the amount of messages passed from multiple vertices to the same destination vertex. + Aggregators capture the global state of the graph. A reduce operation combines the value given by every vertex to the aggregator. The combined/aggregated value is passed onto to all the vertices in the next superstep. + Fault tolerance is achieved through checkpointing and instructing the workers to save the state of nodes to a persistent storage. When a machine fails, all workers restart the execution with state of their recent checkpoint. -+ Master and worker implementation : The master partitions graph into set of vertices (hash on vertex ID mod number of partitions) and outgoing edges per partition. Each partition is assigned to a worker who manages the state of all its vertices by executing compute() method and coordinating the message communication. The workers also notifies the master of the vertices that are active for the next superstep.
++ Master and worker implementation : The master partitions graph into set of vertices (hash on vertex ID mod number of partitions) and outgoing edges per partition. Each partition is assigned to a worker who manages the state of all its vertices by executing compute() method and coordinating the message communication. The workers also notifies the master of the vertices that are active for the next superstep.
Pregel works good for sparse graphs. However, dense graph could cause communication overhead resulting in system to break. Also, the entire computation state resides in the main memory. Apache Giraph is an open source implementation of Pregel in which new features like master computation, sharded aggregators, edge-oriented input, out-of-core computation are added making it more efficient. The most high performance graph processing framework is GraphLab which is developed at Carnegie Melon University and uses the BSP model and executes on MPI. -- cgit v1.2.3 From 59d3351d76963bf6da6489233a4f7adc098382d0 Mon Sep 17 00:00:00 2001 From: msabhi Date: Thu, 17 Nov 2016 16:47:58 -0500 Subject: Update big-data.md --- chapter/8/big-data.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'chapter') diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md index ec39127..e2ff3e3 100644 --- a/chapter/8/big-data.md +++ b/chapter/8/big-data.md @@ -15,8 +15,7 @@ In BSP model + At every superstep, a processor receives input at the beginning, performs computation and outputs at the end. + Barrier synchronization synchs all the processors at the end of every superstep.
-A notable feature of the model is the complete control on data through communication between every processor at every superstep.
-Though similar to map reduce model, BSP preserves data in memory across supersteps and helps in reasoning iterative graph algorithms.
+A notable feature of the model is the complete control on data through communication between every processor at every superstep. BSP preserves data in memory across supersteps and helps in reasoning iterative graph algorithms.
`Pregel` is an implementation of classic BSP model by Google (PageRank) to analyze large graphs exclusively. It was followed by open source implementations - Apache’s Giraph and Hama; which were BSP models built on top of Hadoop. Pregel is highly scalable, fault-tolerant and can successfully represent larger complex graphs. Google claims the API becomes easy once a developer adopts “think like a vertex” mode. Pregel’s computation system is iterative and every iteration is called as superstep. The system takes a directed graph as input with properties assigned to both vertices and graph. At each superstep, all vertices executes in parallel, a user-defined function which represents the behavior of the vertex. The function has access to message sent to its vertex from the previous superstep S-1 and can update the state of the vertex, its edges, the graph and even send messages to other vertices which would receive in the next superstep S+1. The synchronization happens only between two supersteps. Every vertex is either active or inactive at any superstep. The iteration stops when all the vertices are inactive. A vertex can deactivate itself by voting for it and gets active if it receives a message. This asynchronous message passing feature eliminates the shared memory, remote reads and latency of Map reduce model.
@@ -27,6 +26,7 @@ Pregel’s API provides
+ Aggregators capture the global state of the graph. A reduce operation combines the value given by every vertex to the aggregator. The combined/aggregated value is passed onto to all the vertices in the next superstep. + Fault tolerance is achieved through checkpointing and instructing the workers to save the state of nodes to a persistent storage. When a machine fails, all workers restart the execution with state of their recent checkpoint. + Master and worker implementation : The master partitions graph into set of vertices (hash on vertex ID mod number of partitions) and outgoing edges per partition. Each partition is assigned to a worker who manages the state of all its vertices by executing compute() method and coordinating the message communication. The workers also notifies the master of the vertices that are active for the next superstep.
+ Pregel works good for sparse graphs. However, dense graph could cause communication overhead resulting in system to break. Also, the entire computation state resides in the main memory. Apache Giraph is an open source implementation of Pregel in which new features like master computation, sharded aggregators, edge-oriented input, out-of-core computation are added making it more efficient. The most high performance graph processing framework is GraphLab which is developed at Carnegie Melon University and uses the BSP model and executes on MPI. -- cgit v1.2.3 From 607c2f97c8c032b912bc64c553b43b694f10f693 Mon Sep 17 00:00:00 2001 From: msabhi Date: Thu, 17 Nov 2016 16:59:19 -0500 Subject: Update big-data.md --- chapter/8/big-data.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'chapter') diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md index e2ff3e3..cf13efa 100644 --- a/chapter/8/big-data.md +++ b/chapter/8/big-data.md @@ -1,7 +1,7 @@ --- layout: page title: "Large Scale Parallel Data Processing" -by: "Joe Schmoe and Mary Jane" +by: "JingJing and Abhilash" --- Though highly efficient and one of the first major programming models for distributed batch processing, it too has a few limitations.
@@ -32,5 +32,7 @@ Apache Giraph is an open source implementation of Pregel in which new features l ## References +"Bulk synchronous model" http://www.cse.unt.edu/~tarau/teaching/parpro/papers/Bulk%20synchronous%20parallel.pdf. +"Pregel: A System for Large-Scale Graph Processing." +"One trillion edges: graph processing at Facebook-scale" -{% bibliography --file big-data %} -- cgit v1.2.3 From 37e2fe6098829d50679546be27d744487918d488 Mon Sep 17 00:00:00 2001 From: msabhi Date: Thu, 17 Nov 2016 16:59:38 -0500 Subject: Update big-data.md --- chapter/8/big-data.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'chapter') diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md index cf13efa..f1e53e0 100644 --- a/chapter/8/big-data.md +++ b/chapter/8/big-data.md @@ -33,6 +33,6 @@ Apache Giraph is an open source implementation of Pregel in which new features l ## References "Bulk synchronous model" http://www.cse.unt.edu/~tarau/teaching/parpro/papers/Bulk%20synchronous%20parallel.pdf. -"Pregel: A System for Large-Scale Graph Processing." +"Pregel: A System for Large-Scale Graph Processing."
"One trillion edges: graph processing at Facebook-scale" -- cgit v1.2.3 From 3fc056ab35031b0c47df3a52c65a812428383250 Mon Sep 17 00:00:00 2001 From: msabhi Date: Thu, 17 Nov 2016 17:01:47 -0500 Subject: Update big-data.md --- chapter/8/big-data.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'chapter') diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md index f1e53e0..4c1f060 100644 --- a/chapter/8/big-data.md +++ b/chapter/8/big-data.md @@ -34,5 +34,5 @@ Apache Giraph is an open source implementation of Pregel in which new features l ## References "Bulk synchronous model" http://www.cse.unt.edu/~tarau/teaching/parpro/papers/Bulk%20synchronous%20parallel.pdf. "Pregel: A System for Large-Scale Graph Processing."
-"One trillion edges: graph processing at Facebook-scale" +"One Trillion Edges: Graph Processing at Facebook-Scale." Accessed November 17, 2016. http://www.vldb.org/pvldb/vol8/p1804-ching.pdf. -- cgit v1.2.3 From 74473b82407edd9bc5f442103715985e1adc5859 Mon Sep 17 00:00:00 2001 From: Jingjing Ren Date: Thu, 24 Nov 2016 22:15:48 -0500 Subject: add mapreduce+flumejava+skeleton --- chapter/8/big-data.md | 113 ++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 106 insertions(+), 7 deletions(-) (limited to 'chapter') diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md index 4c1f060..34a14f1 100644 --- a/chapter/8/big-data.md +++ b/chapter/8/big-data.md @@ -1,18 +1,116 @@ --- layout: page title: "Large Scale Parallel Data Processing" -by: "JingJing and Abhilash" +by: "Jingjing and Abhilash" --- +## Introduction +`JJ: Placeholder for introduction` The booming Internet has generated big data... + + +This chapter is organized in JJ: need to fill in more stuff + +- **Data paralleling**: + - MapReduce {% cite dean2008mapreduce --file big-data %} + - FlumeJava {% cite chambers2010flumejava --file big-data %} + - ... +- **Graph paralleling**: + - Pregel + - ... + +For each programming model, we will discuss the motivation, basic model, execution model, fault-tolerance and performance. + + +Ideas: get a table of what to include in the context +Idea: instead of data/graph, maybe add one more layer (unstructured vs. structured) + +# Data paralleling + +## MapReduce (2004) +MapReduce {% cite dean2008mapreduce --file big-data %} is a programming model that allows programmers to express the simple computations for terabytes data on thousands of commodity machines. + +**Basic & Examples** +This model applies to computations that are usually parallelizable: A `map` function can operate on each logical "record", this generates a set of intermediate key/value pairs, and then a `reduce` function applies on all values that share the same key and generate one or zero output value. + +Concretely, considering the problem of counting the number of occurrence of each word in a large collection of documents: each time, a `map` function that emits a word plus its count 1; a `reduce` function sums together all counts emitted for the same word + +``` +map(String key, String value): + // key: document name + // value: document contents + for each word w in value: + EmitIntermediate(w, "1"); + +reduce(String key, Iterator values): + // key: a word + // values: a list of counts + int result = 0; + for each v in values: + result += ParseInt(v); + Emit(AsString(result)); +``` + +Conceptually, the map and reduction functions have associated **types**: +``` +map (k1,v1) -> → list(k2,v2) +reduce (k2,list(v2)) -> list(v2) +``` +The input keys and values are drawn from a different domain than the output keys and values. The intermediate keys and values are from the same domain as the output keys and values. The implementation given by the authors essentially pass strings and it is users' responsibility to convert between strings and appropriate types. + +More formalized descriptions about the `map` and `reduce` function can be found in the original paper {% cite dean2008mapreduce --file big-data %}. 
+**Execution**
+At a high level, when the user program calls the *MapReduce* function, the input files are split into *M* pieces and the *map* function runs on the corresponding splits; the intermediate key space is then partitioned into *R* pieces using a partitioning function; after the reduce functions all successfully complete, the output is available in *R* files. The sequence of actions {% cite dean2008mapreduce --file big-data %} is shown in the figure below. We can see from labels (4) and (5) that the intermediate key/value pairs are written to and read from disk; this is key to fault tolerance in the MapReduce model, and also a bottleneck for more complex computation algorithms.
+ MapReduce Execution Overview +
+**Fault Tolerance**
+In this model, there are two parts that can fail: the master and the workers.
+- Worker failure: The master pings every worker periodically, and if a worker does not respond within a certain amount of time, the master marks it as failed and re-assigns its task to an idle worker.
+- Master failure: If the master fails, the MapReduce function fails. The model assumes that the master won't fail; there are separate mechanisms to back up the master, which are out of the scope of our discussion.
+
+The output of the distributed computation should be the same as that of a non-faulting sequential execution of the entire program. The model relies on the atomic commits of map and reduce task outputs to achieve this. The basic idea is to create private temporary files and rename them only when the task has finished.
+
+There are some practices in this paper that make the model work very well at Google; one of them is **backup tasks**: when a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks ("stragglers"). The task is marked as completed whenever either the primary or the backup execution completes.
+
+`JJ: what about other refinement: `
+
+**Performance**
+In the paper, the authors measure the performance of MapReduce with two computations running on a large cluster of machines. One computation *greps* through approximately 1TB of data. The other *sorts* approximately 1TB of data. Both computations take on the order of a hundred seconds. In addition, backup tasks do help reduce execution time substantially. In an experiment where 200 out of 1746 tasks were intentionally killed, the scheduler was able to recover quickly and finish the whole computation in only 5% more time.
+Overall, the performance is very good for such conceptually simple computations.
+
+
+## FlumeJava (2010)
+Many real-world computations involve a pipeline of MapReduces, and this motivates additional management to chain those separate MapReduce stages together in an efficient way. FlumeJava {% cite chambers2010flumejava --file big-data %} helps build those pipelines and keeps computations modular. At its core, FlumeJava is a couple of classes that represent immutable parallel collections. It defers evaluation and optimization by internally constructing an execution-plan dataflow graph.

+**Core Abstraction**
+
+- `PCollection<T>`, an immutable bag of elements of type `T`
+- `recordOf(...)`, which specifies the encoding of an instance
+- `PTable<K,V>`, a subclass of `PCollection<Pair<K,V>>`, an immutable multi-map with keys of type `K` and values of type `V`
+- `parallelDo()`, which can express both the map and reduce parts of MapReduce
+- `groupByKey()`, the same as the shuffle step of MapReduce `JJ: clear this in MapReduce`
+- `combineValues()`, semantically a special case of `parallelDo()`; a combination of a MapReduce combiner and a MapReduce reducer, which is more efficient than doing all the combining in the reducer
+
+**Deferred Evaluation**
+`(JJ: placeholder) join, deferred/materialized; execution plan; figure 1 initial execution plan`
+
+**Optimizer**
+`(JJ: placeholder) parallelDo Fusion; MSCR; overall goal to produce the fewest, most efficient MSCR operations in the final optimized plan`
+
+# Graph paralleling
 Though highly efficient and one of the first major programming models for distributed batch processing, MapReduce has a few limitations.
MapReduce does not scale easily and is highly inefficient for iterative and graph algorithms such as PageRank and many machine learning algorithms. Iterative algorithms require the programmer to explicitly handle intermediate results by writing them to disk. Hence, every iteration rereads the input and writes its results back to disk; this heavy disk I/O is a performance bottleneck for any batch processing system.
Graph algorithms also require the exchange of messages between vertices. In the case of PageRank, every vertex needs the contributions of all its adjacent nodes to calculate its score. MapReduce lacks this message-passing model, which makes graph algorithms complex to reason about.
## Bulk synchronous parallel model
This model was introduced by Leslie Valiant in the late 1980s to model the hardware design of parallel computers. It has gained popularity as an alternative to MapReduce since it addresses the above-mentioned issues to an extent.
In BSP model
+ Computation consists of several steps called supersteps.
+ The processors involved have their own local memory, and every processor is connected to the others via point-to-point communication.
+ At every superstep, a processor receives input at the beginning, performs computation, and produces output at the end.
+ Barrier synchronization synchronizes all the processors at the end of every superstep.
A notable feature of the model is the complete control over data through communication between processors at every superstep. BSP preserves data in memory across supersteps, which helps in reasoning about iterative graph algorithms. A schematic sketch of a single BSP worker loop appears below.
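As a schematic illustration of these points - and not the API of any particular BSP system - a worker's life can be sketched as a loop of local computation, point-to-point message exchange, and a barrier. The `exchange()` and `barrier()` helpers here are single-process stand-ins for a real runtime's communication layer:

```cpp
// Schematic single-process sketch of a BSP worker loop. exchange() simply
// hands the outbox back as the next superstep's inbox, and barrier() is a
// no-op; in a real system these would talk to the other workers.
#include <functional>
#include <utility>
#include <vector>

struct Message { int dst; double value; };

std::vector<Message> exchange(std::vector<Message> outbox) { return outbox; }
void barrier() {}

void run_bsp(const std::function<std::vector<Message>(const std::vector<Message>&)>& step,
             int supersteps) {
  std::vector<Message> inbox;                   // messages produced in superstep S-1
  for (int s = 0; s < supersteps; ++s) {
    std::vector<Message> outbox = step(inbox);  // compute on local memory
    inbox = exchange(std::move(outbox));        // point-to-point communication
    barrier();                                  // all workers sync before S+1
  }
}

int main() {
  run_bsp([](const std::vector<Message>& in) {
    return std::vector<Message>{{0, in.size() + 1.0}};  // toy step function
  }, 3);
}
```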
@@ -20,7 +118,7 @@ A notable feature of the model is the complete control on data through communica Pregel is highly scalable, fault-tolerant and can successfully represent larger complex graphs. Google claims the API becomes easy once a developer adopts “think like a vertex” mode. Pregel’s computation system is iterative and every iteration is called as superstep. The system takes a directed graph as input with properties assigned to both vertices and graph. At each superstep, all vertices executes in parallel, a user-defined function which represents the behavior of the vertex. The function has access to message sent to its vertex from the previous superstep S-1 and can update the state of the vertex, its edges, the graph and even send messages to other vertices which would receive in the next superstep S+1. The synchronization happens only between two supersteps. Every vertex is either active or inactive at any superstep. The iteration stops when all the vertices are inactive. A vertex can deactivate itself by voting for it and gets active if it receives a message. This asynchronous message passing feature eliminates the shared memory, remote reads and latency of Map reduce model.
Pregel's API provides:
+ compute(), a method in which the user implements the logic to change the state of the graph/vertex at every superstep. Message delivery is guaranteed through an iterator at every superstep. (A sketch of a user-defined compute() appears after this list.)
+ User-defined handlers for handling issues like a missing destination vertex.
+ Combiners, which reduce the number of messages passed from multiple vertices to the same destination vertex.
+ Aggregators, which capture the global state of the graph. A reduce operation combines the value given by every vertex to the aggregator, and the aggregated value is passed on to all the vertices in the next superstep.
+ Fault tolerance, achieved through checkpointing and instructing the workers to save the state of their nodes to persistent storage. When a machine fails, all workers restart execution from the state of their most recent checkpoint.
+ Master and worker implementation: the master partitions the graph into sets of vertices (hash on vertex ID mod the number of partitions) and outgoing edges per partition. Each partition is assigned to a worker, which manages the state of all its vertices by executing the compute() method and coordinating message communication. The workers also notify the master of the vertices that are active for the next superstep.
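For concreteness, here is what a user-defined compute() can look like for PageRank, written in the spirit of the C++ API in the Pregel paper. The `MessageIterator` and `VertexBase` classes below are simplified stand-ins invented for this sketch; the real interface is richer:

```cpp
// PageRank in a Pregel-style vertex API. The base class and iterator are
// minimal stand-ins so the sketch is self-contained.
#include <cstddef>
#include <cstdint>
#include <vector>

struct MessageIterator {
  std::vector<double> msgs;
  std::size_t i = 0;
  bool Done() const { return i >= msgs.size(); }
  void Next() { ++i; }
  double Value() const { return msgs[i]; }
};

class VertexBase {
 public:
  virtual ~VertexBase() = default;
  virtual void Compute(MessageIterator* msgs) = 0;
 protected:
  int64_t superstep_ = 0; double value_ = 1.0;
  int64_t num_vertices_ = 1; int64_t out_degree_ = 1;
  int64_t superstep() const { return superstep_; }
  double* MutableValue() { return &value_; }
  double GetValue() const { return value_; }
  int64_t NumVertices() const { return num_vertices_; }
  void SendMessageToAllNeighbors(double) { /* runtime delivers at S+1 */ }
  void VoteToHalt() { /* deactivate until a message arrives */ }
};

class PageRankVertex : public VertexBase {
 public:
  void Compute(MessageIterator* msgs) override {
    if (superstep() >= 1) {
      double sum = 0;
      for (; !msgs->Done(); msgs->Next()) sum += msgs->Value();  // neighbor contributions
      *MutableValue() = 0.15 / NumVertices() + 0.85 * sum;
    }
    if (superstep() < 30) {
      SendMessageToAllNeighbors(GetValue() / out_degree_);  // rank split over out-edges
    } else {
      VoteToHalt();  // inactive until an incoming message reactivates us
    }
  }
};
```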
"One Trillion Edges: Graph Processing at Facebook-Scale." Accessed November 17, 2016. http://www.vldb.org/pvldb/vol8/p1804-ching.pdf. - -- cgit v1.2.3 From e2e0995491d8f3588d6214a2b21351063f17e9e3 Mon Sep 17 00:00:00 2001 From: Jingjing Ren Date: Thu, 24 Nov 2016 22:28:51 -0500 Subject: mv ref to .bib --- chapter/8/big-data.md | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) (limited to 'chapter') diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md index 34a14f1..d49d5a1 100644 --- a/chapter/8/big-data.md +++ b/chapter/8/big-data.md @@ -14,7 +14,7 @@ This chapter is organized in