|           |                                                  |                           |
|-----------|--------------------------------------------------|---------------------------|
| author    | msabhi <abhi.is2006@gmail.com>                   | 2016-12-10 02:26:27 -0500 |
| committer | GitHub <noreply@github.com>                      | 2016-12-10 02:26:27 -0500 |
| commit    | 2114df37dfd469e992b876b560ef5e1a69542591 (patch) |                           |
| tree      | 7a91ab43820810bdd2d5a77d366c15d3d79167a5         |                           |
| parent    | 4c2ff735326ce7686844c5738bc130bf78f5b9a8 (diff)  |                           |
Update big-data.md
| -rw-r--r-- | chapter/8/big-data.md | 6 |
1 file changed, 3 insertions, 3 deletions
```diff
diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md
index 209a3ad..e24dfed 100644
--- a/chapter/8/big-data.md
+++ b/chapter/8/big-data.md
@@ -190,7 +190,7 @@ INSERT INTO, UPDATE, and DELETE are not supported which makes it easier to handl
 
 ***Serialization/Deserialization***
 
-Hive implements the LazySerDe as the default SerDe. It deserializes rows into internal objects lazily so that the cost of Deserialization of a column is incurred only when it is needed. Hive also provides a RegexSerDe which allows the use of regular expressions to parse columns out from a row. Hive also supports various formats like TextInputFormat, SequenceFileInputFormat and RCFileInputFormat.
+Hive implements the LazySerDe as the default SerDe. A SerDe combines a Serializer and a Deserializer and tells Hive how the records of a table should be read and written. The Deserializer interface translates rows into internal objects lazily, so the cost of deserializing a column is incurred only when that column is needed. The Serializer, in turn, converts a Java object into a format that Hive can write to HDFS or another supported system. Hive also provides a RegexSerDe, which allows the use of regular expressions to parse columns out of a row.
 
 ### 1.2.2 Pig Latin
 The goal of Pig Latin {% cite olston2008pig --file big-data%} is to attract experienced programmers to perform ad-hoc analysis on big data. Parallel database products provide a simple SQL query interface, which suits non-programmers and simple tasks, but not the way experienced programmers approach a problem. Such programmers prefer to specify the computation as a sequence of individual steps.
@@ -293,7 +293,7 @@ Edge-cuts for partitioning requires random assignment of vertices and edges acro
 
 ***Vertex-cuts - GraphX’s solution to effective partitioning*** : An alternative approach which does the opposite of an edge-cut: evenly assign edges to machines, but allow vertices to span multiple machines. The communication and storage overhead of a vertex-cut is directly proportional to the sum of the number of machines spanned by each vertex. Therefore, we can reduce communication overhead and ensure balanced computation by evenly assigning edges to machines in a way that minimizes the number of machines spanned by each vertex.
 
-The GraphX RDG structure implements a vertex-cut representation of a graph using three unordered horizontally partitioned RDD tables. These three tables are gone into in more detail in the paper, but the general purposes are as follows:
+The GraphX RDG structure implements a vertex-cut representation of a graph using three unordered, horizontally partitioned RDD tables. These three tables are as follows:
 
 - `EdgeTable(pid, src, dst, data)`: Stores adjacency structure and edge data.
 - `VertexDataTable(id, data)`: Stores vertex data. Contains state associated with vertices that changes in the course of graph computation.
@@ -309,7 +309,7 @@ Other than standard data-parallel operators like filter, map, leftJoin, and redu
 - mapV, mapE - transform the vertex or edge collection.
 - triplets - returns a collection of the form ((i, j), (PV(i), PE(i, j), PV(j))). The operator essentially requires a multiway join between the vertex and edge RDDs. This operation is optimized by shifting the site of the joins to the edges, using the routing table, so that only vertex data needs to be shuffled.
 - leftJoin - given a collection of vertices and a graph, returns a new graph which incorporates the properties of matching vertices from the given collection into the given graph without changing the underlying graph structure.
-- subgraph - returns a subgraph of the original graph by applying predicates on edges and vertices
+- subgraph - applies vertex and edge predicates and returns the subgraph of the original graph containing only the vertices and edges that satisfy them.
 - mrTriplets (MapReduce triplet) - logical composition of triplets followed by map and reduceByKey. It is the building block of graph-parallel algorithms.
 
 ## 2 Execution Models
```
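As a rough illustration of the lazy-deserialization idea described in the new SerDe paragraph, the sketch below keeps a row as raw text and splits it into columns only when a column is first requested. This is a conceptual Scala sketch, not Hive's actual LazySerDe code; the Ctrl-A field delimiter and string-only columns are assumptions made for the example.

```scala
// Conceptual sketch of lazy row deserialization (not Hive's LazySerDe code).
// The Ctrl-A ('\u0001') delimiter and String-only columns are assumptions.
final class LazyRow(raw: String, delimiter: Char = '\u0001') {
  // The cost of breaking the row into columns is paid only on the first
  // column access, not when the row object is created.
  private lazy val fields: Array[String] = raw.split(delimiter)

  /** Return column `i`; the row is deserialized only if some column is needed. */
  def column(i: Int): String = fields(i)
}

object LazyRowDemo extends App {
  val row = new LazyRow("alice\u000130\u0001nyc")
  // Only now is the row actually split into columns.
  println(row.column(0)) // alice
  println(row.column(2)) // nyc
}
```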

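To make the operator list in the last hunk more concrete, here is a small sketch using Apache Spark's GraphX API. The toy graph, the edge weights, and the 0.5 threshold are invented purely for illustration; `aggregateMessages` is used here to play the role of the paper's `mrTriplets` (map over each triplet, then reduce the messages arriving at each vertex).

```scala
// Sketch of the subgraph and mrTriplets-style operators using Spark GraphX.
// The toy graph, weights, and threshold below are illustrative assumptions.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXOpsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx-ops").setMaster("local[*]"))

    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 0.2), Edge(1L, 3L, 0.7)))
    val graph    = Graph(vertices, edges)

    // subgraph: keep only edges whose weight satisfies the edge predicate
    // (all vertices satisfy the default vertex predicate).
    val heavy = graph.subgraph(epred = triplet => triplet.attr > 0.5)

    // aggregateMessages plays the role of mrTriplets: map over each triplet
    // to send messages, then reduce the messages arriving at each vertex.
    val incomingWeight = heavy.aggregateMessages[Double](
      ctx => ctx.sendToDst(ctx.attr), // map: emit the edge weight to the destination
      _ + _                           // reduce: sum the weights per vertex
    )

    incomingWeight.collect().foreach { case (id, w) => println(s"vertex $id: $w") }
    sc.stop()
  }
}
```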