author: msabhi <abhi.is2006@gmail.com> 2016-12-11 00:05:46 -0500
committer: GitHub <noreply@github.com> 2016-12-11 00:05:46 -0500
commit: 1181d1ca440e5f74c87193373ec733cac02cdf5c
tree: 87585dfad0ff20cc38885b83f27960b3133098d0
parent: 9baf00cc2472ecea464a6f2003d34dadb6e73e9a
Adding missing references
-rw-r--r-- chapter/8/big-data.md | 1
1 files changed, 1 insertions, 0 deletions
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index 49a4a0d..38b0691 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md
@@ -250,6 +250,7 @@ Winding up - we can compare SQL vs Dataframe vs Dataset as below :
<figure class="main-container">
<img src="./sql-vs-dataframes-vs-datasets.png" alt="SQL vs Dataframe vs Dataset" />
</figure>
+*Figure from the website :* https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
### 1.3 Large-scale Parallelism on Graphs
MapReduce doesn’t scale easily and is highly inefficient for iterative and graph algorithms such as PageRank and machine learning algorithms. Iterative algorithms require the programmer to explicitly handle the intermediate results (writing them to disk). Hence, every iteration requires reading the input file and writing the results to disk, resulting in high disk I/O, which is a performance bottleneck for any batch processing system.