Diffstat (limited to 'chapter/8')
-rw-r--r-- chapter/8/big-data.md | 4
1 files changed, 2 insertions, 2 deletions
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index 54dde79..608341e 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md
@@ -30,7 +30,7 @@ by: "Jingjing and Abhilash"
- Graphs :
- Pregel :Overview of Pregel. Its implementation and working. its limitations. Do not stress more since we have a better model GraphX to explain a lot.
- GraphX : Working on this.
- - SparkSQL Catalyst & Spark execution model : Discuss Parser, LogicalPlan, Optimizer, PhysicalPlan, Execution Plan. Why catalyst? how catalyst helps in SparkSQL , data flow from sql-core-> catalyst->spark-core
+ - SparkSQL Catalyst & Spark execution model : Discuss Parser, LogicalPlan, Optimizer, PhysicalPlan, Execution Plan. Why catalyst? how catalyst helps in SparkSQL , data flow from sql-core-> catalyst->spark-core
- Evaluation: Given same algorithm, what is the performance differences between Hadoop, Spark, Dryad? There are no direct comparison for all those models, so we may want to compare separately:
- Hadoop vs. Spark
@@ -77,7 +77,7 @@ reduce(String key, Iterator values):
Emit(AsString(result));
```
-*Execution*
+*Execution* `TODO: move this to execution and talk about fault-tolerance instead`
At high level, when the user program calls *MapReduce* function, the input files are split into *M* pieces and it runs *map* function on corresponding splits; then intermediate key space are partitioned into *R* pieces using a partitioning function; After the reduce functions all successfully complete, the output is available in *R* files. The sequences of actions are shown in the figure below. We can see from label (4) and (5) that the intermediate key/value pairs are written/read into disks, this is a key to fault-tolerance in MapReduce model and also a bottleneck for more complex computation algorithms.
<figure class="main-container">