diff options
Diffstat (limited to 'chapter/8')
| -rw-r--r-- | chapter/8/big-data.md | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md index ab72fa0..e800308 100644 --- a/chapter/8/big-data.md +++ b/chapter/8/big-data.md @@ -4,7 +4,7 @@ title: "Large Scale Parallel Data Processing" by: "Jingjing and Abhilash" --- ## Introduction - +The growth of Internet has generated the so-called big data(terabytes or petabytes). It is not possible to fit them into a single machine or process them with one single program. Often the computation has to be done fast enough to provide practical services. A common approach taken by tech giants like Google, Yahoo, Facebook is to process big data across clusters of commodity machines. Many of the computations are conceptually straightforward, and Google proposed the MapReduce model to abstract the logic and proved to be simple and powerful. From then on, the idea inspired lots of other programming models. In this chapter, we will present how programming models evolve over time, why their execution engines are designed in certain ways, and underlying ecosystem that supports each developing thread. ## 1 Programming Models ### 1.1 Data parallelism *Data parallelism* is to run a single operation on different pieces of the data on different machines in parallel. Comparably, a sequential computation looks like *"for all elements in the dataset, do operation A"*, where dataset could be in the order of terabytes or petabytes aka. big data and one wants to scale up the processing. The challenges to do this sequential computation in a parallelized manner include how to abstract the different types of computations in a simple and correct way, how to distribute the data to hundreds/thousands of machines, how to handle failures and so on. |
