| author | Jingjing Ren <renjj@ccs.neu.edu> | 2016-12-13 15:36:23 -0500 |
|---|---|---|
| committer | Jingjing Ren <renjj@ccs.neu.edu> | 2016-12-13 15:36:23 -0500 |
| commit | b214b7afb85a61ea6932bdf235062e8f784cc0df (patch) | |
| tree | 639ab2f420124609f7ec5a70555d9ab709129283 | |
| parent | 68eea603c1dd2a4997e410b7e37a1de20291e2c7 (diff) | |
flumejava 2
| -rw-r--r-- | chapter/8/big-data.md | 16 |
1 files changed, 12 insertions, 4 deletions
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index b8e1afb..111b3a8 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md
@@ -101,13 +101,21 @@ PTable<String,Integer> wordCounts =
 ```
 *Deferred Evaluation & Optimizer*
-One of the merits of using FlumeJava to pipeline MapReduce jobs is that it enables optimization automatically, by executing parallel operations lazily using *deferred evaluation*. The state of each `PCollection` object is either *deferred* (not yet computed) and *materialized* (computed). When the program invokes a *parallelDo()*, it creates an operation pointer to the actual deferred operation object. These operations form a directed acyclic graph called execution plan. The execution plan doesn't get evaluated until *run()* is called. This will cause optimization of the execution plan and evaluation in forward topological order. These optimization for transferring the modular execution plan into an efficient one include:
-- parallelDo Fusion: $$f(g(x)) => f \circ g(x)$$. This can reduce steps
-- MapShuffleCombineReduce (MSCR) Operation: combination of ParallelDo, GroupByKey, CombineValues and Flatten into one MapReduce job. This extends MapReduce to accept multiple inputs and multiple outputs. `todo: example of Figure 3`
+One of the merits of using FlumeJava to pipeline MapReduce jobs is that it optimizes automatically, by executing parallel operations lazily using *deferred evaluation*. The state of each `PCollection` object is either *deferred* (not yet computed) or *materialized* (computed). When the program invokes *parallelDo()*, it creates a deferred operation object and returns a new deferred `PCollection` that points to it. These deferred operations form a directed acyclic graph called the execution plan. The execution plan is not evaluated until *run()* is called, which first optimizes the plan and then evaluates it in forward topological order. The optimization strategies for transforming the modular execution plan into an efficient one include:
+- Fusion: $$f(g(x)) \Rightarrow (f \circ g)(x)$$, which is essentially function composition. This usually helps reduce the number of steps.
+- MapShuffleCombineReduce (MSCR) operation: a combination of ParallelDo, GroupByKey, CombineValues, and Flatten into one MapReduce job. This extends MapReduce to accept multiple inputs and produce multiple outputs. The following figure illustrates an MSCR operation with 3 input channels, 2 grouping (GroupByKey) output channels, and 1 pass-through output channel.
 <figure class="main-container">
 <img src="{{ site.baseurl }}/resources/img/mscr.png" alt="A MapShuffleCombineReduce operation with 3 input channels" />
-</figure>
+ </figure>
+
+The overall optimizer strategy involves a sequence of optimization passes, with the ultimate goal of producing the fewest, most efficient MSCR operations:
+1. Sink Flattens: $$h(f(a) + g(b)) \rightarrow h(f(a)) + h(g(b))$$, i.e., push a *Flatten* operation below the *ParallelDo* operations that consume it, creating opportunities for fusion.
+2. Lift CombineValues operations: if a *CombineValues* operation immediately follows a *GroupByKey* operation, the *GroupByKey* records that fact and the original *CombineValues* is left in place; it can then be treated as a normal *ParallelDo* operation and is subject to ParallelDo fusion.
+3. Insert fusion blocks: where a chain of *ParallelDo* operations connects two *GroupByKey* operations, mark which ParallelDos to fuse up into the earlier GroupByKey's output channel and which to fuse down into the later one's input channel.
+4. Fuse ParallelDos.
+5. Fuse MSCRs: create MSCR operations, and convert any remaining unfused ParallelDo operations into trivial MSCRs.
+The SiteData example{%cite chambers2010flumejava --file big-data %} shows that 16 data-parallel operations can be optimized into two MSCR operations in the final execution plan (refer to Figure 5 in the original paper). One limitation of the optimizer is that all these optimizations are based on the structure of the execution plan; FlumeJava doesn't analyze user-defined functions.
 ### 1.1.3 Dryad
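The deferred-evaluation and fusion idea in this diff can be sketched in plain Java. This is an illustrative toy, not FlumeJava's actual API: `DeferredPlan`, and its `parallelDo` and `run` methods, are invented names for this sketch. Each `parallelDo` call only records an operation; `run()` first "optimizes" by fusing the recorded operations into a single composed function, then makes one pass over the data instead of one pass per operation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy deferred pipeline (hypothetical, not FlumeJava's real API).
public class DeferredPlan {
    // Deferred operations: recorded here, not executed yet.
    private final List<Function<Integer, Integer>> ops = new ArrayList<>();

    // parallelDo just records the operation and returns the plan,
    // mirroring how a deferred PCollection points at a deferred op.
    public DeferredPlan parallelDo(Function<Integer, Integer> fn) {
        ops.add(fn);
        return this;
    }

    // run() fuses the whole chain into one function
    // (f(g(x)) => (f . g)(x)), then evaluates once per element.
    public List<Integer> run(List<Integer> input) {
        Function<Integer, Integer> fused = Function.identity();
        for (Function<Integer, Integer> op : ops) {
            fused = fused.andThen(op); // compose in topological order
        }
        List<Integer> out = new ArrayList<>();
        for (Integer x : input) {
            out.add(fused.apply(x));   // single pass over the data
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> result = new DeferredPlan()
            .parallelDo(x -> x + 1)    // deferred, nothing runs here
            .parallelDo(x -> x * 10)   // deferred, nothing runs here
            .run(List.of(1, 2, 3));    // fused into x -> (x + 1) * 10
        System.out.println(result);    // prints [20, 30, 40]
    }
}
```

The real optimizer works on a DAG rather than a linear chain, and its payoff is fewer MapReduce jobs rather than fewer function calls, but the shape is the same: build the plan lazily, rewrite it, then evaluate.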
