| field | value | date |
|---|---|---|
| author | msabhi <abhi.is2006@gmail.com> | 2016-12-04 15:25:59 -0500 |
| committer | GitHub <noreply@github.com> | 2016-12-04 15:25:59 -0500 |
| commit | 538dc06632cfd59654760392be66372112c1839e (patch) | |
| tree | 7bd12a09a052d0aa0c21e4cae3f7dcba3ece9399 /chapter | |
| parent | daed05ae775538ad2edabe0693da3fb832c721e6 (diff) | |
Update big-data.md
Diffstat (limited to 'chapter')
| -rw-r--r-- | chapter/8/big-data.md | 8 |
1 files changed, 5 insertions, 3 deletions
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index ba9affe..884dead 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md
@@ -12,6 +12,7 @@ by: "Jingjing and Abhilash"
 - Spark: what is Spark? how is it different from map reduce? (RDD/lineage: can support iterative algorithm, interactive analytics;) what is pipelining? why is Spark so powerful
 - RDD and API? What is a RDD and why is it so efficient? properties of a RDD? why is RDD better than DSM? What are the transformations and actions available in Spark ?
 - Large-scale Parallelism on Graphs
 - Why a separate graph processing model? what is a BSP? working of BSP? Do not stress more since its not a map reduce world exactly.
+ - GraphX programming model (working on this)
 - Querying: more declarative
 - DryadLINQ: SQL-like, uses Dryad as execution engine; `Suggestion: Merge this with Dryad above?`
@@ -24,10 +25,11 @@ by: "Jingjing and Abhilash"
 - Execution Models
 - MapReduce (intermediate writes to disk): What is the sequence of actions when a MapReduce functions are called? How is write-to-disk good/bad (fault-tolerant/slow)? How does the data are transmitted across clusters efficiently (store locally)? To shorten the total time for MP operations, it uses backup tasks. When MP jobs are pipelined, what optimizations can be performed by FlumeJava? In spite of optimizations and pipelining, what is the inherent limitation (not support iterative algorithm?)
 - Spark (all in memory): introduce spark architecture, different layers, what happens when a spark job is executed? what is the role of a driver/master/worker, how does a scheduler schedule the tasks and what performance measures are considered while scheduling? how does a scheduler manage node failures and missing partitions? how are the user defined transformations passed to the workers? how are the RDDs stored and memory management measures on workers? do we need checkpointing at all given RDDs leverage lineage for recovery? if so why ?
- - Pregel - Overview of Pregel. Its implementation and working. its limitations. Do not stress more since we have a better model GraphX to explain a lot.
+ - Graphs :
+ - Pregel :Overview of Pregel. Its implementation and working. its limitations. Do not stress more since we have a better model GraphX to explain a lot.
+ - GraphX : Working on this.
 - SparkSQL Catalyst & Spark execution model : Discuss Parser, LogicalPlan, Optimizer, PhysicalPlan, Execution Plan. Why catalyst? how catalyst helps in SparkSQL , data flow from sql-core-> catalyst->spark-core
-
+
 - Evaluation: Given same algorithm, what is the performance differences between Hadoop, Spark, Dryad? There are no direct comparison for all those models, so we may want to compare separately:
 - Hadoop vs. Spark
 - Spark vs. SparkSQL from SparkSQL paper
