Update big-data.md

author: msabhi <abhi.is2006@gmail.com> 2016-12-04 15:25:59 -0500
committer: GitHub <noreply@github.com> 2016-12-04 15:25:59 -0500
commit: 538dc06632cfd59654760392be66372112c1839e (patch)
tree: 7bd12a09a052d0aa0c21e4cae3f7dcba3ece9399 /chapter
parent: daed05ae775538ad2edabe0693da3fb832c721e6 (diff)
1 files changed, 5 insertions, 3 deletions
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index ba9affe..884dead 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md
@@ -12,6 +12,7 @@ by: "Jingjing and Abhilash"
     - Spark: what is Spark? how is it different from map reduce? (RDD/lineage: can support iterative algorithm, interactive analytics;) what is pipelining? why is Spark so powerful - RDD and API? What is a RDD and why is it so efficient? properties of a RDD? why is RDD better than DSM? What are the transformations and actions available in Spark ?
   - Large-scale Parallelism on Graphs
     - Why a separate graph processing model? What is BSP, and how does it work? Keep this brief, since it is not exactly a MapReduce world.
+    - GraphX programming model (working on this)
   - Querying: more declarative
     - DryadLINQ: SQL-like, uses Dryad as execution engine;   
     `Suggestion: Merge this with Dryad above?`
@@ -24,10 +25,11 @@ by: "Jingjing and Abhilash"
 - Execution Models
   - MapReduce (intermediate writes to disk): What is the sequence of actions when a MapReduce function is called? How is write-to-disk good/bad (fault-tolerant/slow)? How is data transmitted across the cluster efficiently (store locally)? To shorten the total time for MapReduce operations, it uses backup tasks. When MapReduce jobs are pipelined, what optimizations can FlumeJava perform? In spite of optimizations and pipelining, what is the inherent limitation (no support for iterative algorithms?)
   - Spark (all in memory): introduce spark architecture, different layers, what happens when a spark job is executed? what is the role of a driver/master/worker, how does a scheduler schedule the tasks and what performance measures are considered while scheduling? how does a scheduler manage node failures and missing partitions? how are the user defined transformations passed to the workers? how are the RDDs stored and memory management measures on workers? do we need checkpointing at all given RDDs leverage lineage for recovery? if so why ?
-  - Pregel
-    Overview of Pregel. Its implementation and working. its limitations. Do not  stress more since we have a better model GraphX to explain a lot.
+  - Graphs : 
+    - Pregel: Overview of Pregel, its implementation and working, and its limitations. Keep this brief, since GraphX is a better model for explaining most of this.
+    - GraphX : Working on this.
  - SparkSQL Catalyst & Spark execution model : Discuss Parser, LogicalPlan, Optimizer, PhysicalPlan, Execution Plan. Why catalyst? how catalyst helps in SparkSQL , data flow from sql-core-> catalyst->spark-core
- 
+
 - Evaluation: Given the same algorithm, what are the performance differences between Hadoop, Spark, and Dryad? There is no direct comparison across all of these models, so we may want to compare them separately:
   - Hadoop vs. Spark
   - Spark vs. SparkSQL from SparkSQL paper