author    Jingjing Ren <renjj@ccs.neu.edu>  2016-12-04 12:26:10 -0500
committer Jingjing Ren <renjj@ccs.neu.edu>  2016-12-04 12:26:10 -0500
commit    e93d770beedd5addcaf886e38f50f62e0d3eac14 (patch)
tree      ef24b154e9fe30c89c6b0f1efb05776d1bfa9092
parent    54aa9be71a9a013ab0a25411eba78b1d29597787 (diff)
minor
-rw-r--r--  chapter/8/big-data.md  8
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index b833528..29237f5 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md
@@ -3,11 +3,7 @@ layout: page
title: "Large Scale Parallel Data Processing"
by: "Jingjing and Abhilash"
---
-## Introduction
-`JJ: Placeholder for introduction` The booming Internet has generated big data...
-
-This chapter is organized in
-
+## Outline
- Programming Models
- Data parallelism (most popular, standard map/reduce/functional pipelining)
  - PM of MapReduce: What is the motivation for MapReduce? How does the abstraction capture the problem in an easy way? What are the map and reduce functions? What are the limitations of this model? In real-world applications we want to do pipelining, which brings many management issues; this motivates FlumeJava.
@@ -35,7 +31,7 @@ This chapter is organized in
- Things people are building on top of MapReduce/Spark
  - Ecosystem: everything interoperates with GFS or HDFS, or builds on common formats like protocol buffers, so systems like Pregel and MapReduce and even MillWheel...
-## Programming Model
+## Programming Models
### Data parallelism
The motivation for MapReduce {% cite dean2008mapreduce --file big-data %} is that we want to use hundreds or thousands of machines to process data in parallel without dealing with low-level management. MapReduce achieves this by abstracting the computing logic into simple map and reduce functions and letting the execution model handle parallelization and distribution, provide fault tolerance, manage I/O scheduling, and report status updates. The solution in the MapReduce paper is simple and powerful because it separates the programming model from the execution model. The model applies to computations that are naturally parallelizable: a `map` function operates on each logical "record" and generates a set of intermediate key/value pairs, and a `reduce` function is then applied to all values that share the same key, producing one or zero output values per key. Conceptually, the map and reduce functions have associated **types**:
```