| author | msabhi <abhi.is2006@gmail.com> | 2016-12-02 05:59:18 -0500 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2016-12-02 05:59:18 -0500 |
| commit | 84e84e420de861d89c43093947ed9be6bc51eff6 (patch) | |
| tree | fc0bdfd2b7a987829d877b24d345dc40a78b547d /chapter/8 | |
| parent | d8e57b0148773e05d0d21716833a12db031d3de5 (diff) | |
Update big-data.md
Diffstat (limited to 'chapter/8')
| -rw-r--r-- | chapter/8/big-data.md | 9 |
1 files changed, 5 insertions, 4 deletions
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index 55d6b12..7765cd7 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md
@@ -94,15 +94,16 @@ Many a analytics workloads like K-means, logistic regression, graph processing a
 **Dryad/DryadLinq** :
-**Spark (big one)** : content is ready, need to format a bit and paste
+**Spark (big one)** :
 ## Declarative interfaces for the Map Reduce framework
 Map reduce provides only two high level primitives - map and reduce; that the programmers have to worry about. Map reduce takes care of all the processing over a cluster, failure and recovery, data partitioning etc. However, the framework still suffers from rigidity with respect to its one-input data format (key/value pair) and two-stage data flow. Several important patterns like joins (which could be highly complex depending on the data) are extremely hard to implement and reason about for a programmer. Sometimes the code could be become repetitive when the programmer wants to implement most common operations like projection, filtering etc. Non-programmers like data scientists would highly prefer SQL like interface over a cumbersome and rigid framework. Such a high level declarative language can easily express their task while leaving all of the execution optimization details to the backend engine. Also, these kind of abstractions provide ample opportunities for query optimizations.
-Introduce Sazwal (its now no more used but one of the first ideas) : Parallel analysis with Sawzall. Scientific Programming, 13(4):277–298, 2005
+**Introduce Sazwal** (its now no more used but one of the first ideas) : Parallel analysis with Sawzall. Scientific Programming, 13(4):277–298, 2005
 ** FlumeJava (2010) **
+
 Many real-world computations involves a pipeline of MapReduces, and this motivates additional management to chain together those separate MapReduce stages in an efficient way.
 FlumeJava {% cite chambers2010flumejava --file big-data %} can help build those pipelines and keep computations modular. At core, FlumeJava are a couple of classes that represent immutable parallel collections. It defers evaluation and optimization by internally constructing an execution plan dataflow graph.
 ***Core Abstraction***
@@ -155,7 +156,7 @@ MORE EXPLANATION NEEDED...
-## Optimizers are the way to go
+## Optimizers are the way to go (still thinking of a better heading..)
 It is tough to understand the internals of a framework like Spark for any developer who has just started to program a Spark application. Also, with the advent of relational code, it becomes still more challenging when one has to program keeping in mind the rules for an efficient query - rightly ordered joins, early filtering of data or usage of available indexes. Even if the programmer is aware of such rules, it is still prone to human errors which can potentially lead to longer runtime applications. Query optimizers for map reduce frameworks can greatly improve performance of the queries developers write and also significantly reduce the development time. A good query optimizer should be able to optimize such user queries, extensible for user to provide information about the data and even dynamically include developer defined specific rules. Catalyst is one such framework which leverages the Scala’s functional language features like pattern matching and runtime meta programming to allow developers to concisely specify complex relational optimizations. Most of the power of Spark SQL comes due to this optimizer.
@@ -219,7 +220,7 @@ Apache Giraph is an open source implementation of Pregel in which new features l
 ## Future and Discussion
 - Current leader in distributed processing - Spark, Google's cloud dataflow
-- Current challenges and upcoming improvements ??
+- Current challenges and upcoming improvements ??
 - Apache thunder and any others?
 ## Conclusion
