| author | Jingjing Ren <renjj@ccs.neu.edu> | 2016-12-05 10:56:29 -0500 |
|---|---|---|
| committer | Jingjing Ren <renjj@ccs.neu.edu> | 2016-12-05 10:56:29 -0500 |
| commit | 09ae3171dcc60933ed9a1bc3ebf27e6611423626 | |
| tree | 1a50f6a4f03f476f18287760ae4ed49e5bc2a6c6 | |
| parent | d64b5eea953b10e02e0c9bc232a7b2a803addbdd | |
update outline
| mode | file | changes |
|---|---|---|
| -rw-r--r-- | chapter/8/big-data.md | 4 |

1 file changed, 2 insertions, 2 deletions
````diff
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index 54dde79..608341e 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md
@@ -30,7 +30,7 @@ by: "Jingjing and Abhilash"
 - Graphs :
 - Pregel :Overview of Pregel. Its implementation and working. its limitations. Do not stress more since we have a better model GraphX to explain a lot.
 - GraphX : Working on this.
- - SparkSQL Catalyst & Spark execution model : Discuss Parser, LogicalPlan, Optimizer, PhysicalPlan, Execution Plan. Why catalyst? how catalyst helps in SparkSQL , data flow from sql-core-> catalyst->spark-core
+ - SparkSQL Catalyst & Spark execution model : Discuss Parser, LogicalPlan, Optimizer, PhysicalPlan, Execution Plan. Why catalyst? how catalyst helps in SparkSQL , data flow from sql-core-> catalyst->spark-core
 - Evaluation: Given same algorithm, what is the performance differences between Hadoop, Spark, Dryad? There are no direct comparison for all those models, so we may want to compare separately:
 - Hadoop vs. Spark
@@ -77,7 +77,7 @@ reduce(String key, Iterator values):
     Emit(AsString(result));
 ```
-*Execution*
+*Execution* `TODO: move this to execution and talk about fault-tolerance instead`
 At high level, when the user program calls *MapReduce* function, the input files are split into *M* pieces and it runs *map* function on corresponding splits; then intermediate key space are partitioned into *R* pieces using a partitioning function; After the reduce functions all successfully complete, the output is available in *R* files. The sequences of actions are shown in the figure below. We can see from label (4) and (5) that the intermediate key/value pairs are written/read into disks, this is a key to fault-tolerance in MapReduce model and also a bottleneck for more complex computation algorithms.
 <figure class="main-container">
````
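The execution flow described in the changed paragraph (split the input into *M* pieces, map each split, partition the intermediate key space into *R* pieces, then reduce) can be sketched as a toy single-process word count. This is purely illustrative: the function and variable names (`map_fn`, `reduce_fn`, `mapreduce`) are made up for this sketch, and a real MapReduce system runs these steps on distributed workers with the intermediate buckets written to local disk.

```python
# Toy single-process sketch of the MapReduce data flow: split input into M
# pieces, map each split, partition intermediate pairs into R buckets by key
# hash, then reduce each bucket. Word count is the example job.
from collections import defaultdict

def map_fn(text):
    # Emit an intermediate (word, 1) pair for every word in one input split.
    for word in text.split():
        yield word, 1

def reduce_fn(key, values):
    # Sum all counts observed for one word.
    return key, sum(values)

def mapreduce(inputs, M, R):
    # Split the input into M pieces, one per map task.
    splits = [" ".join(inputs[i::M]) for i in range(M)]
    # Partition intermediate key/value pairs into R buckets. In a real
    # implementation these buckets are written to local disk (labels (4)/(5)
    # in the figure), which is what allows failed tasks to be re-executed.
    buckets = [defaultdict(list) for _ in range(R)]
    for split in splits:
        for key, value in map_fn(split):
            buckets[hash(key) % R][key].append(value)
    # Each reduce task consumes one bucket and produces one output "file".
    return [dict(reduce_fn(k, vs) for k, vs in b.items()) for b in buckets]

outputs = mapreduce(["the cat", "the dog", "the cat sat"], M=2, R=2)
total = {k: v for part in outputs for k, v in part.items()}
```

Here `outputs` plays the role of the *R* output files; merging them gives the full word count, and the disk round-trip between map and reduce is exactly the fault-tolerance mechanism (and bottleneck) the paragraph points at.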
