| field | value | date |
|---|---|---|
| author | msabhi <abhi.is2006@gmail.com> | 2016-12-15 02:59:39 -0500 |
| committer | GitHub <noreply@github.com> | 2016-12-15 02:59:39 -0500 |
| commit | e2fcdd9405ce8c4aadb67b99b02142eb55ef6836 (patch) | |
| tree | e8957e38030904a038be7a8bbf39523957ca1bf1 /chapter/8/big-data.md | |
| parent | 72230c4f9d3ff0bd0a0385662a3a6338e6f241c5 (diff) | |
Adding word count example to SparkSQL
Diffstat (limited to 'chapter/8/big-data.md')
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | chapter/8/big-data.md | 16 |

1 file changed, 15 insertions, 1 deletion
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index 252f008..9f8a9b2 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md
@@ -280,7 +280,7 @@ RDDs are immutable and hence a straggler (slow node) can be replaced with a back
 - `Debugging and profiling` : There is no availability of debugging tools and developers find it hard to realize if a computation is happening more on a single machine or if the data-structure they used were inefficient.
 
 ### 1.2 Querying: declarative interfaces
-MapReduce takes care of all the processing over a cluster, failure and recovery, data partitioning etc. However, the framework suffers from rigidity with respect to its one-input data format (key/value pair) and two-stage data flow. Several important patterns like equi-joins and theta-joins [http://www.ccs.neu.edu/home/mirek/papers/2011-SIGMOD-ParallelJoins.pdf] which could be highly complex depending on the data, require programmers to implement by hand. Hence, map reduce lacks many such high level abstractions requiring programmers to be well versed with several of the design patterns like map-side joins, reduce-side equi-join etc. Also, java based code ( like in Hadoop framework) in map-reduce can sometimes become repetitive when the programmer wants to implement most common operations like projection, filtering etc. A simple word count program as shown in Figure X, can span up to 200 lines.
+MapReduce takes care of all the processing over a cluster, failure and recovery, data partitioning etc. However, the framework suffers from rigidity with respect to its one-input data format (key/value pair) and two-stage data flow. Several important patterns like equi-joins and theta-joins [http://www.ccs.neu.edu/home/mirek/papers/2011-SIGMOD-ParallelJoins.pdf] which could be highly complex depending on the data, require programmers to implement by hand. Hence, map reduce lacks many such high level abstractions requiring programmers to be well versed with several of the design patterns like map-side joins, reduce-side equi-join etc. Also, java based code ( like in Hadoop framework) in map-reduce can sometimes become repetitive when the programmer wants to implement most common operations like projection, filtering etc. A simple word count program as shown in Figure X, can span up to 63 lines.
 
 *Why SQL over map reduce ?*
 
@@ -417,6 +417,20 @@ Winding up - we can compare SQL vs Dataframe vs Dataset as below :
 </figure>
 
 *Figure from the website :* https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
+
+*Word count example in SparkSQL*
+
+```
+val ds = sqlContext.read.text("input_file").as[String]
+val result = ds
+  .flatMap(_.split(" "))
+  .filter(_ != "")
+  .toDF()
+  .groupBy($"value")
+  .agg(count("*") as "count")
+  .orderBy($"count" desc)
+```
+
 ### 1.3 Large-scale Parallelism on Graphs
 Map Reduce doesn’t scale easily and is highly inefficient for iterative / graph algorithms like page rank and machine learning algorithms. Iterative algorithms requires programmer to explicitly handle the intermediate results (writing to disks). Hence, every iteration requires reading the input file and writing the results to the disk resulting in high disk I/O which is a performance bottleneck for any batch processing system.
 
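For readers who want to run the query added in this commit on its own, here is a minimal self-contained sketch. It assumes Spark 2.x, where `SparkSession` replaces the `sqlContext` used in the diff; the object name `WordCountSQL`, the `local[*]` master, and the `input_file` path are illustrative placeholders rather than part of the chapter.

```scala
// Minimal, self-contained sketch of the word-count query added above.
// Assumptions: Spark 2.x (SparkSession instead of sqlContext); "input_file" is a placeholder path.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

object WordCountSQL {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCountSQL")
      .master("local[*]")          // local run for illustration; omit when submitting to a cluster
      .getOrCreate()
    import spark.implicits._       // enables Dataset transformations and the $"..." column syntax

    // Read each line of the input file as a Dataset[String]
    val ds = spark.read.textFile("input_file")

    // Split lines into words, drop empty tokens, count occurrences of each word, sort by frequency
    val result = ds
      .flatMap(_.split(" "))
      .filter(_ != "")
      .toDF("value")
      .groupBy($"value")
      .agg(count("*").as("count"))
      .orderBy($"count".desc)

    result.show()
    spark.stop()
  }
}
```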

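The section 1.3 context lines above note that iterative algorithms such as PageRank force MapReduce to re-read input and re-write results on every pass. As a hedged illustration of the alternative the chapter is pointing toward, the sketch below keeps the link structure cached in memory across iterations using Spark's RDD API; the file name `edges.txt`, its whitespace-separated `src dst` line format, the 10-iteration count, and the 0.15/0.85 damping constants are assumptions made only for this example.

```scala
// Illustrative sketch (not from the chapter): a PageRank-style loop where the graph
// is loaded once and cached, so later iterations avoid the per-pass disk I/O that
// the text attributes to MapReduce.
// Assumptions: "edges.txt" holds one "src dst" pair per line; 10 iterations; 0.15/0.85 damping.
import org.apache.spark.sql.SparkSession

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("IterativeSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Load the graph once and keep it in memory for every iteration.
    val links = sc.textFile("edges.txt")
      .map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
      .groupByKey()
      .cache()

    var ranks = links.mapValues(_ => 1.0)

    for (_ <- 1 to 10) {
      // Each node splits its current rank evenly across its outgoing links.
      val contribs = links.join(ranks).values.flatMap {
        case (dests, rank) => dests.map(dest => (dest, rank / dests.size))
      }
      // Aggregate contributions into new ranks; no intermediate files are written to disk.
      ranks = contribs.reduceByKey(_ + _).mapValues(sum => 0.15 + 0.85 * sum)
    }

    ranks.take(10).foreach(println)
    spark.stop()
  }
}
```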