| author | msabhi <abhi.is2006@gmail.com> | 2016-12-04 07:16:33 -0500 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2016-12-04 07:16:33 -0500 |
| commit | 788a878d8e72e5d3823e19c2dedf68ad15d21bd3 (patch) | |
| tree | 6d33d9b7573a7082ff410b272d511fedfa576156 /chapter/8/big-data.md | |
| parent | 8b888e6698b98db0d3d42933d6ba3c43acdcb9e0 (diff) | |
Update big-data.md
Diffstat (limited to 'chapter/8/big-data.md')
| -rw-r--r-- | chapter/8/big-data.md | 35 |
1 files changed, 34 insertions, 1 deletions
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index baa787a..6a4f7c7 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md
@@ -85,7 +85,40 @@ A notable feature of the model is the complete control on data through communica
 **MapReduce**, as mentioned in the programming model section, has an interesting execution model in that all intermediate key/value pairs are written to and read from disk. The output of the distributed computation should be the same as that of a non-faulting sequential execution of the entire program, and the model relies on atomic commits of map and reduce task outputs to achieve this. The basic idea is to create private temporary files and rename them only when the task has finished. This makes fault tolerance easy: one can simply start another task if a worker fails. But it is also the bottleneck when running multiple stages.
 **Spark**
-Apache Spark is a fast, in-memory data processing engine with elegant and expressive development interface to allow developers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. Spark takes advantage of the distributed in-memory storage (RDD) and Scala’s collection API as well as functional style for high performance processing.
+Spark is a fast, in-memory data processing engine with an elegant and expressive development interface that enables developers to efficiently execute machine learning, SQL or streaming workloads that require fast iterative access to datasets. Spark takes advantage of distributed in-memory storage (RDDs), Scala’s collection API and a functional style for high-performance processing.
+
+Distributed in-memory storage - Resilient Distributed Datasets (RDDs):
+An RDD is a partitioned, read-only collection of objects that can be created from data in stable storage or by transforming another RDD. It can be distributed across multiple nodes in a cluster and is fault tolerant (resilient): if a node fails, an RDD can always be recovered using its lineage graph (a record of how it was derived from other datasets). An RDD is stored in memory (as much as fits; the rest is spilled to disk) and is immutable - it can only be transformed into a new RDD. Transformations are lazy and are applied only when an action is performed on the RDD, so an RDD need not be materialized at all times. This lazy evaluation also exists in DryadLINQ.
+
+The properties that give an RDD the features mentioned above:
+ • A list of dependencies on other RDDs.
+ • An array of partitions that the dataset is divided into.
+ • A compute function that computes each partition.
+ • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned).
+ • Optional preferred locations (aka locality info), e.g. block locations for an HDFS file.
+
+The Spark API provides two kinds of operations on an RDD:
+Transformations - lazy operations that return another RDD.
+`map(f: T ⇒ U) : RDD[T] ⇒ RDD[U]` : Return a MappedRDD[U] by applying function f to each element.
+`flatMap(f: T ⇒ Seq[U]) : RDD[T] ⇒ RDD[U]` : Return a new FlatMappedRDD[U] by first applying a function to all elements and then flattening the results.
+`filter(f: T ⇒ Bool) : RDD[T] ⇒ RDD[T]` : Return a FilteredRDD[T] containing the elements for which f returns true.
+`groupByKey() : RDD[(K, V)] ⇒ RDD[(K, Iterable[V])]` : When called on a (K, V) RDD, return a new RDD[(K, Iterable[V])].
+`reduceByKey(f: (V, V) ⇒ V) : RDD[(K, V)] ⇒ RDD[(K, V)]` : When called on a (K, V) RDD, return a new RDD[(K, V)] by aggregating the values of each key, e.g. reduceByKey(_ + _).
+`join : (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (V, W))]` : When called on a (K, V) RDD with a (K, W) RDD, return a new RDD[(K, (V, W))] by joining them on the key K.
+
+Actions - operations that trigger computation on an RDD and return values.
+`reduce(f: (T, T) ⇒ T) : RDD[T] ⇒ T` : Return a T by reducing the elements using the specified commutative and associative binary operator.
+`collect()` : Return an Array[T] containing all elements.
+`count()` : Return the number of elements.
+
+Why RDDs over Distributed Shared Memory (DSM)?
+RDDs are immutable and can only be created through coarse-grained transformations, while DSM allows fine-grained reads and writes to each memory location. Hence RDDs do not incur the checkpointing overhead present in DSM and can be recovered using their lineage.
+Because RDDs are immutable, a straggler (slow node) can be replaced with a backup copy, as in MapReduce. This is hard to implement in DSM, since the two copies point to the same memory locations and can interfere with each other's updates.
+Other benefits include scheduling tasks based on data locality to improve performance and the ability of RDDs to degrade gracefully in case of memory shortage: partitions that do not fit in RAM are spilled to disk (performance is then comparable to that of existing data-parallel systems).
+
 - Pig/HiveQL/SparkSQL - Limitations ?
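To make the transformation/action split in the added text concrete, here is a minimal word-count sketch in Scala against the standard RDD API. The `local[*]` master and the `input.txt` path are illustrative assumptions, not part of the diff; the point is that the flatMap/map/reduceByKey chain only records lineage, and nothing executes until the `collect()` action.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Assumed local setup, for illustration only.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Transformations are lazy: this only builds the lineage graph
    // textFile -> flatMap -> map -> reduceByKey; nothing executes yet.
    val counts = sc.textFile("input.txt")               // RDD[String]
      .flatMap(line => line.split("\\s+"))              // RDD[String], one word per element
      .map(word => (word, 1))                           // RDD[(String, Int)]
      .reduceByKey(_ + _)                               // RDD[(String, Int)], aggregated per key

    // collect() is an action: it triggers evaluation of the whole lineage
    // and returns the result to the driver as an Array[(String, Int)].
    counts.collect().foreach { case (word, n) => println(s"$word -> $n") }

    sc.stop()
  }
}
```

If a partition of `counts` is lost, Spark recomputes only that partition by replaying the recorded lineage, which is the fault-tolerance behaviour described above.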
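The optional Partitioner property and the key-based operations (`groupByKey`, `join`) can be illustrated with a small pair-RDD sketch. The data, the choice of four partitions, and the existing SparkContext `sc` are assumptions for the example, not part of the original text.

```scala
import org.apache.spark.HashPartitioner

// Two small key-value RDDs (illustrative data).
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
val orders = sc.parallelize(Seq((1, 9.99), (1, 4.50), (3, 12.00)))

// Hash-partition the users RDD into 4 partitions and keep it in memory;
// key-based operations on this side can then reuse the partitioning
// instead of reshuffling it.
val partitionedUsers = users.partitionBy(new HashPartitioner(4)).persist()

// join: (RDD[(K, V)], RDD[(K, W)]) => RDD[(K, (V, W))]
val joined = partitionedUsers.join(orders)        // RDD[(Int, (String, Double))]

// groupByKey: RDD[(K, V)] => RDD[(K, Iterable[V])]
val ordersByUser = orders.groupByKey()

println(joined.collect().mkString(", "))
println(ordersByUser.collect().mkString(", "))
```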
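As a sketch of the graceful-degradation point: a persisted RDD can be given a storage level that spills partitions which do not fit in memory to disk rather than failing. The RDD `counts` below is the hypothetical one from the word-count sketch; any RDD behaves the same way.

```scala
import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK: keep partitions in memory where possible and spill the
// remainder to disk when there is not enough RAM.
counts.persist(StorageLevel.MEMORY_AND_DISK)

// The first action materializes the RDD under this storage level;
// later actions reuse the cached or spilled partitions.
println(counts.count())
```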
