| -rw-r--r-- | chapter/1/rpc.md | 230 |
| -rw-r--r-- | chapter/2/futures.md | 302 |
| -rw-r--r-- | chapter/4/dist-langs.md | 18 |
| -rw-r--r-- | chapter/5/langs-extended-for-dist.md | 209 |
| -rw-r--r-- | chapter/6/acidic-to-basic-how-the-database-ph-has-changed.md | 96 |
| -rw-r--r-- | chapter/6/being-consistent.md | 34 |
| -rw-r--r-- | chapter/6/counters.md | 267 |
| -rw-r--r-- | chapter/7/langs-consistency.md | 56 |
| -rw-r--r-- | chapter/8/big-data.md | 397 |
| -rw-r--r-- | chapter/9/streaming.md | 117 |
10 files changed, 813 insertions, 913 deletions
diff --git a/chapter/1/rpc.md b/chapter/1/rpc.md index d56b13f..504069d 100644 --- a/chapter/1/rpc.md +++ b/chapter/1/rpc.md @@ -4,17 +4,17 @@ title: "RPC is Not Dead: Rise, Fall and the Rise of Remote Procedure Calls" by: "Muzammil Abdul Rehman and Paul Grosu" --- -## Introduction: +## Introduction -*Remote Procedure Call* (RPC) is a design *paradigm* that allow two entities to communicate over a communication channel in a general request-response mechanism. The definition of RPC has mutated and evolved significantly over the past three decades, and therefore RPC *paradigm* is a generic, broadly classifying term to refer to all RPC-esque systems that have arisen over the past four decades. The *definition* of RPC has evolved over the decades. It has moved on from a simple *client-server* design to a group of inter-connected *services*. While the initial RPC *implementations* were designed as tools for outsourcing computation to a server in a distributed system, however, RPC has evolved over the years to build language-agnostic ecosystem of applications. This RPC *paradigm* has been part of the driving force in creating truly revolutionizing distributed systems and giving rise to various communication schemes and protocols between diverse systems. +*Remote Procedure Call* (RPC) is a design *paradigm* that allows two entities to communicate over a communication channel in a general request-response mechanism. The definition of RPC has mutated and evolved significantly over the past three decades, and therefore the RPC *paradigm* is a broad classifying term which refers to all RPC-esque systems that have arisen over the past four decades. The *definition* of RPC has evolved over the decades. It has moved on from a simple *client-server* design to a group of inter-connected *services*. While the initial RPC *implementations* were designed as tools for outsourcing computation to a server in a distributed system, RPC has evolved over the years to build a language-agnostic ecosystem of applications. The RPC *paradigm* has been part of the driving force in creating truly revolutionary distributed systems and has given rise to various communication schemes and protocols between diverse systems. -RPC *paradigm* has been used to implement our every-day systems. From lower level applications like Network File Systems{% cite sunnfs --file rpc %} and Remote Direct Memory Access{% cite rpcoverrdma --file rpc %} to access protocols to developing an ecosystem of microservices, RPC has been used everywhere. RPC has a diverse variety of applications -- SunNFS{% cite sunnfs --file rpc %}, Twitter's Finagle{% cite finagle --file rpc %}, Apache Thrift{% cite thrift --file rpc %}, Java RMI{% cite rmipaper --file rpc %}, SOAP, CORBA{% cite corba --file rpc %} and Google's gRPC{% cite grpc --file rpc %} to name a few. +The RPC *paradigm* has been used to implement our every-day systems. From lower level applications like Network File Systems {% cite sunnfs --file rpc %} and Remote Direct Memory Access {% cite rpcoverrdma --file rpc %} to access protocols to developing an ecosystem of microservices, RPC has been used everywhere. RPC has a diversity of applications -- SunNFS {% cite sunnfs --file rpc %}, Twitter's Finagle {% cite finagle --file rpc %}, Apache Thrift {% cite thrift --file rpc %}, Java RMI {% cite rmipaper --file rpc %}, SOAP, CORBA {% cite corba --file rpc %} and Google's gRPC {% cite grpc --file rpc %} to name a few. -RPC has evolved over the years.
Starting off as a synchronous, insecure, request-response system, RPC has evolved into a secure, asynchronous, resilient *paradigm* that has influenced protocols and programming designs, like, HTTP, REST, and just about anything with a request-response system. It has transitioned to an asynchronous, bidirectional, communication mechanism for connecting services and devices across the internet. While the initial RPC implementations mainly focused on a local, private network with multiple clients communicating with a server and synchronously waiting for the response from the server, modern RPC systems have *endpoints* communicating with each other, asynchronously passing arguments and processing responses, as well having two-way request-response streams(from client to server, and also from server to client). RPC has influenced various design paradigms and communication protocols. +RPC has evolved over the years. Starting off as a synchronous, insecure, request-response system, RPC has evolved into a secure, asynchronous, resilient *paradigm* that has influenced protocols and programming designs like HTTP, REST, and just about anything with a request-response system. It has transitioned to an asynchronous, bidirectional, communication mechanism for connecting services and devices across the internet. The initial RPC implementations mainly focused on a local, private network with multiple clients communicating with a server, synchronously waiting for the response from the server. Modern RPC systems have *endpoints* communicating with each other, asynchronously passing arguments and processing responses, as well as having two-way request-response streams (from client to server, and also from server to client). RPC has influenced various design paradigms and communication protocols. -## Remote Procedure Calls: +## Remote Procedure Calls -The *Remote Procedure Call paradigm* can be defined, at a high level, as a set of two communication *endpoints* connected over a network with one endpoint sending a request and the other endpoint generating a response based on that request. In the simplest terms, it's a request-response paradigm where the two *endpoints*/hosts have different *address space*. The host that requests a remote procedure can be referred to as *caller* and the host that responds to this can be referred to as *callee*. +The *Remote Procedure Call paradigm* can be defined, at a high level, as a set of two communication *endpoints* connected over a network with one endpoint sending a request and the other endpoint generating a response based on that request. In the simplest terms, it's a request-response paradigm where the two *endpoints*/hosts have different *address spaces*. The endpoint that requests a remote procedure can be referred to as *caller* and the endpoint that responds to this can be referred to as *callee*. The *endpoints* in the RPC can either be a client and a server, two nodes in a peer-to-peer network, two hosts in a grid computation system, or even two microservices. The RPC communication is not limited to two hosts, rather could have multiple hosts or *endpoints* involved {% cite anycastrpc --file rpc %}. @@ -23,32 +23,33 @@ The *endpoints* in the RPC can either be a client and a server, two nodes in a p </p> <figure> <img src="{{ site.baseurl }}/resources/img/rpc_chapter_1_ycog_10_steps.png" alt="RPC in 10 Steps." /> -<p>Fig1. - Remote Procedure Call.</p> +<p>Figure 1 - Remote Procedure Call.</p> </figure> -The simplest RPC implementation looks like Fig1. 
In this case, the *client*(or *caller*) and the *server*(or *callee*) are separated by a physical network. The main components of the system are the client routine/program, the client stub, the server routine/program, the server stub, and the network routines. A *stub* is a small program that is generally used as a stand-in(or an interface) for a larger program{% cite stubrpc --file rpc %}. A *client stub* exposes the functionality provided by the server routine to the client routine while the server stub provides a client-like program to the server routine{% cite rpcimage --file rpc %}. The client stub takes the input arguments from the client program and returns the result, while the server stub provides input arguments to the server program and gets the results. The client program can only interact with the client stub that provides the interface of the remote server to the client. This stub also provides marshalling/pickling/serialization of the input arguments sent to the stub by the client routine. Similarly, the server stub provides a client interface to the server routines as well as the marshalling services. +The simplest RPC implementation looks like Figure 1. In this case, the *client* (or *caller*) and the *server* (or *callee*) are separated by a physical network. The main components of the system are the client routine/program, the client stub, the server routine/program, the server stub, and the network routines. A *stub* is a small program that is generally used as a stand-in (or an interface) for a larger program {% cite stubrpc --file rpc %}. A *client stub* exposes the functionality provided by the server routine to the client routine while the server stub provides a client-like program to the server routine {% cite rpcimage --file rpc %}. The client stub takes the input arguments from the client program and returns the result, while the server stub provides input arguments to the server program and gets the results. The client program can only interact with the client stub that provides the interface of the remote server to the client. This stub also serializes input arguments sent to the stub by the client routine. Similarly, the server stub provides a client interface to the server routines as well as handling serialization of data sent to the client. -When a client routine performs a *remote procedure*, it calls the client stub, which serializes the input argument. This serialized data is sent to the server using OS network routines (TCP/IP){% cite rpcimage --file rpc %}. The data is serialized by the server stub, present to the server routines for the given arguments. The return value from the server routines is serialized again and sent over the network back to the client where it's deserialized by the client stub and presented to the client routine. This *remote procedure* is generally hidden from the client routine and it appears as a *local procedure* to the client. RPC services also require a discovery service/host-resolution mechanism to bootstrap the communication between the client and the server. +When a client routine performs a *remote procedure*, it calls the client stub, which serializes the input arguments. This serialized data is sent to the server using OS network routines (TCP/IP) {% cite rpcimage --file rpc %}. The data is then deserialized by the server stub and presented to the server routines with the given arguments.
The return value from the server routines is serialized again and sent over the network back to the client where it's deserialized by the client stub and presented to the client routine. This *remote procedure* is generally hidden from the client routine and it appears as a *local procedure* to the client. RPC services also require a discovery service/host-resolution mechanism to bootstrap the communication between the client and the server. -One important feature of RPC is different *address space* {% cite implementingrpc --file rpc %} for all the endpoints, however, passing the locations to a global storage(Amazon S3, Microsoft Azure, Google Cloud Store) is not impossible. In RPC, all the hosts have separate *address spaces*. They can't share pointers or references to a memory location in one host. This *address space* isolation means that all the information is passed in the messages between the host communicating as a value (objects or variables) but not by reference. Since RPC is a *remote* procedure call, the values sent to the *remote* host cannot be pointers or references to a *local* memory. However, passing links to a global shared memory location is not impossible but rather dependent on the type of system (see *Applications* section for detail). +One important feature of RPC is a separate *address space* {% cite implementingrpc --file rpc %} for each endpoint; however, passing locations in a global storage service (Amazon S3, Microsoft Azure, Google Cloud Store) is not impossible. In RPC, all the hosts have separate *address spaces*. They can't share pointers or references to a memory location in one host. This *address space* isolation means that all the information is passed in the messages between the communicating hosts as values (objects or variables), not by reference. Since RPC is a *remote* procedure call, the values sent to the *remote* host cannot be pointers or references to *local* memory. However, passing links to a global shared memory location is not impossible but rather dependent on the type of system (see *Applications* section for detail). The short sketch below makes this by-value behavior concrete. -Originally, RPC was developed as a synchronous request-response mechanism, tied to a specific programming language implementation, with a custom network protocol to outsource computation {% cite implementingrpc --file rpc %}. It had registry system to register all the servers. One of the earliest RPC-based system {% cite implementingrpc --file rpc %} was implemented in the Cedar programming language in early 1980's. The goal of this system was to provide similar programming semantics as local procedure calls. Developed for a LAN network with an inefficient network protocol and a *serialization* scheme to transfer information using the said network protocol, this system aimed at executing a *procedure*(also referred as *method* or a *function*) in a remote *address space*. The single-thread synchronous client and the server were written in an old *Cedar* programming language with a registry system used by the servers to *bind*(or register) their procedures. The clients used this registry system to find a specific server to execute their *remote* procedures. This RPC implementation {% cite implementingrpc --file rpc %} had a very specific use-case. It was built specifically for outsourcing computation between a "Xerox research internetwork", a small, closed, ethernet network with 16-bit addresses{% cite implementingrpc --file rpc %}.
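To make the by-value semantics concrete, here is a minimal sketch using Python's built-in `xmlrpc` module (the same module used in the examples later in this chapter). The procedure name `append_item` and the port are made up for illustration; the point is that the server only ever receives a serialized copy of the caller's argument.

```python
from xmlrpc.server import SimpleXMLRPCServer

def append_item(items):
    # 'items' is a deserialized copy living in the server's address space;
    # mutating it cannot affect the client's original list.
    items.append("added on the server")
    return items

server = SimpleXMLRPCServer(("localhost", 8080))
server.register_function(append_item, "append_item")
server.serve_forever()
```

A client calling this procedure gets a new copy back, while its own list is untouched:

```python
import xmlrpc.client

local_items = ["created on the client"]
with xmlrpc.client.ServerProxy("http://localhost:8080/") as proxy:
    returned = proxy.append_item(local_items)

print(local_items)  # ['created on the client'] -- unchanged, no reference was shared
print(returned)     # ['created on the client', 'added on the server'] -- a new copy
```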
+Originally, RPC was developed as a synchronous request-response mechanism, tied to a specific programming language implementation, with a custom network protocol to outsource computation {% cite implementingrpc --file rpc %}. It had a registry system to register all the servers. One of the earliest RPC-based systems {% cite implementingrpc --file rpc %} was implemented in the Cedar programming language in the early 1980's. The goal of this system was to provide similar programming semantics to local procedure calls. It was developed for a LAN network, with an inefficient network protocol and a *serialization* scheme to transfer information over that protocol, and it aimed at executing a *procedure* (also referred to as a *method* or a *function*) in a remote *address space*. The single-threaded synchronous client and the server were accompanied by a registry system used by the servers to *bind* (or register) their procedures. Clients used this registry system to find a specific server to execute their *remote* procedures on. This RPC implementation {% cite implementingrpc --file rpc %} had a very specific use-case. It was built specifically for outsourcing computation between a "Xerox research internetwork", which was a small, closed, ethernet network with 16-bit addresses {% cite implementingrpc --file rpc %}. Modern RPC-based systems are language-agnostic, asynchronous, load-balanced systems. Authentication and authorization to these systems have been added as needed along with other security features. Most of these systems have fault-handling built into them as modules and the systems are generally spread all across the internet. -RPC programs have a network (or a communication channel), therefore, they need to handle remote errors and be able to communicate information successfully. Error handling generally varies and is categorized as *remote-host* or *network* failure handling. Depending on the type of the system, and the error, the caller (or the callee) return an error and these errors can be handled accordingly. For asynchronous RPC calls, it's possible to specify events to ensure progress. +RPC programs operate across a network (or communication channel), therefore they need to handle remote errors and be able to communicate information successfully. Error handling generally varies and is categorized as *remote-host* or *network* failure handling. Depending on the type of the system, and the error, the caller (or the callee) returns an error and these errors can be handled accordingly. For asynchronous RPC calls, it's possible to specify events to ensure progress. -RPC implementations use a *serialization*(also referred to as *marshalling* or *pickling*) scheme on top of an underlying communication protocol (traditionally TCP over IP). These *serialization* schemes allow both the caller *caller* and *callee* to become language agnostic allowing both these systems to be developed in parallel without any language restrictions. Some examples of serialization schemes are JSON, XML, or Protocol Buffers {% cite grpc --file rpc %}. +RPC implementations use a *serialization* (also referred to as *marshalling* or *pickling*) scheme on top of an underlying communication protocol (traditionally TCP over IP). These *serialization* schemes allow both the *caller* and *callee* to become language agnostic, allowing both these systems to be developed in parallel without any language restrictions.
Some examples of serialization schemes are JSON, XML, or Protocol Buffers {% cite grpc --file rpc %}. Modern RPC systems allow different components of a larger system to be developed independently of one another. The language-agnostic nature combined with a decoupling of some parts of the system allows the two components (caller and callee) to scale separately and add new functionalities. This independent scaling of the system might lead to a mesh of interconnected RPC *services* facilitating one another. ### Examples of RPC -RPC has become very predominant in modern systems. Google even performs orders of 10^10^ RPC calls per second {% cite grpcpersec --file rpc %}. That's *tens of trillions* of RPC calls *every second*. It's more than the *annual GDP of United States* {%cite usgdp --file rpc%}. +RPC has become very predominant in modern systems. Google even performs on the order of 10^10 RPC calls per second {% cite grpcpersec --file rpc %}. That's *tens of billions* of RPC calls *every second* -- more calls in a single day than the *annual GDP of the United States* in dollars {%cite usgdp --file rpc%}. In the simplest RPC systems, a client connects to a server over a network connection and performs a *procedure*. This procedure could be as simple as `return "Hello World"` in your favorite programming language. However, the complexity of this remote procedure has no upper bound. -Here's the code of this simple RPC server, written in Python3. +An example of a simple RPC server, written in Python 3, is shown below. + ```python from xmlrpc.server import SimpleXMLRPCServer @@ -62,7 +63,7 @@ server.register_function(remote_procedure, "remote_procedure") server.serve_forever() ``` -This code for a simple RPC client for the above server, written in Python3, is as follows. +The code for a simple RPC client for the above server, written in Python 3, is as follows. ```python import xmlrpc.client @@ -71,37 +72,37 @@ with xmlrpc.client.ServerProxy("http://localhost:8080/") as proxy: print(proxy.remote_procedure()) ``` -In the above example, we create a simple function called `remote_procedure` and *bind* it to port *8080* on *localhost*. The RPC client then connects to the server and *request* the `remote_procedure` with no input arguments. The server then *responds* with a return value of the `remote_procedure`. +In the above example, we create a simple function called `remote_procedure` and *bind* it to port `8080` on `localhost`. The RPC client then connects to the server and *requests* the `remote_procedure` with no input arguments. The server then *responds* with the value returned by `remote_procedure`. -One can even view the *three-way handshake* as an example of RPC paradigm. The *three-way handshake* is most commonly used in establishing a TCP connection. Here, a server-side application *binds* to a port on the server, and adds a hostname resolution entry is added to a DNS server(can be seen as a *registry* in RPC). Now, when the client has to connect to the server, it requests a DNS server to resolve the hostname to an IP address and the client sends a SYN packet. This SYN packet can be seen as a *request* to another *address space*. The server, upon receiving this, returns a SYN-ACK packet. This SYN-ACK packet from the server can be seen as *response* from the server, as well as a *request* to establish the connection. The client then *responds* with an ACK packet. +One can even view the *three-way handshake* as an example of the RPC paradigm.
The *three-way handshake* is most commonly used in establishing a TCP connection. Here, a server-side application *binds* to a port on the server, and a hostname resolution entry is added to a DNS server (which can be seen as a *registry* in RPC). Now, when the client has to connect to the server, it requests a DNS server to resolve the hostname to an IP address and the client sends a SYN packet. This SYN packet can be seen as a *request* to another *address space*. The server, upon receiving this, returns a SYN-ACK packet. This SYN-ACK packet from the server can be seen as a *response* from the server, as well as a *request* to establish the connection. The client then *responds* with an ACK packet. -## Evolution of RPC: +## Evolution of RPC -RPC paradigm was first proposed in 1980’s and still continues as a relevant model of performing distributed computation, which initially was developed for a LAN and now can be implemented on open networks, as web services across the internet. It has had a long and arduous journey to its current state. Here are the three main(overlapping) stages that RPC went through. +The RPC paradigm was first proposed in the 1980’s and still continues as a relevant model for performing distributed computation; initially developed for a LAN, it can now be implemented on open networks, as web services across the internet. It has had a long and arduous journey to its current state. Here are the three main (overlapping) stages that RPC went through. -### The Rise: All Hail RPC(Early 1970's - Mid 1980's) +### The Rise: All Hail RPC (Early 1970's - Mid 1980's) -RPC started off strong. With RFC 674{% cite rfc674 --file rpc %} and RFC 707{% cite rfc674 rfc707 --file rpc %} coming out and specifying the design of Remote Procedure Calls, followed by Nelson et. al{% cite implementingrpc --file rpc %} coming up with a first RPC implementation for the Cedar programming language, RPC revolutionized systems in general and gave rise to one of the earliest distributed systems(apart from the internet, of course). +RPC started off strong. With RFC 674 {% cite rfc674 --file rpc %} and RFC 707 {% cite rfc674 rfc707 --file rpc %} coming out and specifying the design of Remote Procedure Calls, followed by Nelson et al. {% cite implementingrpc --file rpc %} coming up with a first RPC implementation for the Cedar programming language, RPC revolutionized systems in general and gave rise to one of the earliest distributed systems. With these early achievements, people started using RPC as the de facto design choice. It became a Holy Grail in the systems community for a few years after the first implementation. -### The Fall: RPC is Dead(Late 1970's - Late 1990's) +### The Fall: RPC is Dead (Late 1970's - Late 1990's) -RPC, despite being an initial success, wasn't without flaws. Within a year of its inception, the limitation of the RPC started to catch up with it. RFC 684 criticized RPC for latency, failures, and the cost. It also focussed on message-passing systems as an alternative to RPC design. Similarly, a few years down the road, in 1988, Tenenbaum et.~al presented similar concerns against RPC {%cite critiqueofrpc --file rpc %}. It talked about problems heterogeneous devices, message passing as an alternative, packet loss, network failure, RPC's synchronous nature, and highlighted that RPC is not a one-size-fits-all model. +RPC, despite its initial success, wasn't without flaws. Within a year of its inception, the limitations of RPC started to catch up to it.
RFC 684 criticized RPC for latency, failures, and overhead cost. It also focussed on message-passing systems as an alternative to RPC design. In 1988, Tenenbaum et. al presented similar concerns with RPC {%cite critiqueofrpc --file rpc %}. This talked about problems with heterogeneous devices, message passing as an alternative, packet loss, network failure, RPC's synchronous nature, and highlighted that RPC is not a one-size-fits-all model. -In 1994, *A Note on Distributed Computing* was published. This paper claimed RPC to be "fundamentally flawed" {%cite notedistributed --file rpc %}. It talked about a unified object view and cited four main problems with dividing these objects for distributed computing in RPC: communication latency, address space separation, partial failures and concurrency issues(resulting from accessing same remote object by two concurrent client requests). Although most of these problems(except partial failures) were inherently associated with distributed computing itself but partial failures for RPC systems meant that progress might not always be possible in an RPC system. +In 1994, *A Note on Distributed Computing* was published. This paper claimed RPC to be "fundamentally flawed" {%cite notedistributed --file rpc %}. It talked about a unified object view and cited four main problems with dividing these objects for distributed computing in RPC: communication latency, address space separation, partial failures and concurrency issues (resulting from accessing same remote object by two concurrent client requests). Most of these problems (except partial failures) were inherently associated with distributed computing itself, but partial failures for RPC systems meant that progress might not always be possible in an RPC system. -This era wasn't a dead end for RPC, though. Some of the preliminary designs for modern RPC systems were introduced in this era. Perhaps, the earliest system in this era was SunRPC {% cite sunnfs --file rpc %} used for the Sun Network File System(NFS). Soon to follow SunRPC was CORBA{% cite corba --file rpc %} which was followed by Java RMI{% cite rmipaper --file rpc %}. +This era wasn't a dead end for RPC, though. Some of the preliminary designs for modern RPC systems were introduced in this era. Perhaps, the earliest system in this era was SunRPC {% cite sunnfs --file rpc %} used for the Sun Network File System (NFS). Soon to follow SunRPC was CORBA {% cite corba --file rpc %} which was followed by Java RMI {% cite rmipaper --file rpc %}. -However, the initial implementations of these systems were riddled with various issues and design flaws. For instance, Java RMI didn't handle network failures and assumed a reliable network with zero-latency{% cite rmipaper --file rpc %}. +However, the initial implementations of these systems were riddled with various issues and design flaws. For instance, Java RMI didn't handle network failures and assumed a reliable network with zero-latency {% cite rmipaper --file rpc %}. -### The Rise, Again: Long Live RPC(Late 1990's - Today) +### The Rise, Again: Long Live RPC (Late 1990's - Today) -Despite facing problems in its early days, RPC withstood the test of time. Researchers realized the limitations of RPC and focussed on rectifying and instead of enforcing RPC, they started to use RPC in applications where it was needed. The designer started adding exception-handling, async, network failure handling and heterogeneity between different languages/devices to RPC. 
+Despite facing problems in its early days, RPC withstood the test of time. Researchers realized the limitations of RPC and focused on rectifying them; instead of forcing RPC everywhere, they started to use RPC in applications where it was needed. Implementers of RPC systems started adding exception-handling, asynchronous processing, network failure handling, and support for heterogeneity between different languages/devices to RPC. -In this era, SunRPC went through various additions and became came to be known as Open Network Computing RPC(ONC RPC). CORBA and RMI have also undergone various modifications as internet standards were set. +In this era, SunRPC went through various additions and came to be known as Open Network Computing RPC (ONC RPC). CORBA and RMI have also undergone various modifications as internet standards were set. -A new breed of RPC also started in this era, Async(asynchronous) RPC, giving rise to systems that use *futures* and *promises*, like Finagle{% cite finagle --file rpc %} and Cap'n Proto(post-2010). +A new breed of RPC also started in this era, Async (asynchronous) RPC, giving rise to systems that use *futures* and *promises*, like Finagle {% cite finagle --file rpc %} and Cap'n Proto (post-2010). <p align="center"> @@ -109,7 +110,7 @@ A new breed of RPC also started in this era, Async(asynchronous) RPC, giving ris </p> <figure> <img src="{{ site.baseurl }}/resources/img/rpc_chapter_1_syncrpc.jpg" alt="RPC in 10 Steps." /> -<p>Fig2. - Synchronous RPC.</p> +<p>Figure 2 - Synchronous RPC.</p> </figure> @@ -118,37 +119,39 @@ A new breed of RPC also started in this era, Async(asynchronous) RPC, giving ris </p> <figure> <img src="{{ site.baseurl }}/resources/img/rpc_chapter_1_asyncrpc.jpg" alt="RPC in 10 Steps." /> -<p>Fig3. - Asynchronous RPC.</p> +<p>Figure 3 - Asynchronous RPC.</p> </figure> -A traditional, synchronous RPC is a *blocking* operation while an asynchronous RPC is a *non-blocking* operation{%cite dewan --file rpc %}. Fig2. shows a synchronous RPC call while Fig3. shows an asynchronous RPC call. In synchronous RPC, the client sends a request to the server and blocks and waits for the server to perform its computation and return the result. Only after getting the result from the server, the client proceeds onwards. In an asynchronous RPC, the client performs a request to the server and waits only for the acknowledgment of the delivery of input parameters/arguments. After this, the client proceeds onwards and when the server is finished processing, it sends an interrupt to the client. The client receives this message from the server, receives the results, and continues. +A traditional, synchronous RPC is a *blocking* operation while an asynchronous RPC is a *non-blocking* operation {%cite dewan --file rpc %}. Figure 2 shows a synchronous RPC call while Figure 3 shows an asynchronous RPC call. In synchronous RPC, the client sends a request to the server and then blocks waiting for the server to perform its computation and return the result. The client is only able to proceed after getting the result from the server. In an asynchronous RPC, the client performs a request to the server and waits only for the acknowledgment of the delivery of input parameters/arguments. After this, the client proceeds onwards and when the server is finished processing, it sends an interrupt to the client. The client receives this message from the server, receives the results, and continues.
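To see the difference from the client's side, here is a minimal sketch reusing the Python `remote_procedure` server from the example above. The thread-pool future only approximates true asynchronous RPC, which would typically deliver the result via a callback or interrupt rather than an explicit `result()` rendezvous.

```python
import xmlrpc.client
from concurrent.futures import ThreadPoolExecutor

def call_remote():
    # The XML-RPC call itself blocks until the server sends back the result.
    with xmlrpc.client.ServerProxy("http://localhost:8080/") as proxy:
        return proxy.remote_procedure()

# Synchronous style: the client is blocked for the whole round-trip.
print(call_remote())

# Asynchronous style: the call runs on a worker thread and the client
# gets a future immediately, continuing with other work in the meantime.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(call_remote)
    print("doing other work while the RPC is in flight")
    print(future.result())  # rendezvous with the response when it's needed
```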
-Asynchronous RPC makes it possible to separate the remote call from the return value making it possible to write a single-threaded client to handle multiple RPC calls at the specific intervals it needs to process{%cite async --file rpc%}. It also allows easier handling of slow clients/servers as well as transferring large data easily(due to their incremental nature){%cite async --file rpc%}. +Asynchronous RPC makes it possible to separate the remote call from the return value, making it possible to write a single-threaded client to handle multiple RPC calls at the specific intervals it needs to process {%cite async --file rpc%}. It also allows for easier handling of slow clients/servers as well as transferring large data easily (due to their incremental nature) {%cite async --file rpc%}. -In the post-2000 era, MAUI{% cite maui --file rpc %}, Cap'n Proto{% cite capnprotosecure --file rpc %}, gRPC{% cite grpc --file rpc %}, Thrift{% cite thrift --file rpc %} and Finagle{% cite finagle --file rpc %} have been released, which have significantly boosted the widespread use of RPC. +In the post-2000 era, MAUI {% cite maui --file rpc %}, Cap'n Proto {% cite capnprotosecure --file rpc %}, gRPC {% cite grpc --file rpc %}, Thrift {% cite thrift --file rpc %} and Finagle {% cite finagle --file rpc %} have been released, which have significantly boosted the widespread use of RPC. -Most of these newer systems came up with their Interface Description Languages(IDLs). These IDLs specified the common protocols and interfacing language that could be used to transfer information clients and servers written in different programming languages, making these RPC implementations language-agnostic. Some of the most common IDLs are JSON, XML, and ProtoBufs. +Most of these newer systems include Interface Description Languages (IDLs). These IDLs specify the common protocols and interfacing languages that can be used to transfer information between clients and servers written in different programming languages, making these RPC implementations language-agnostic. Some of the most common IDLs are JSON, XML, and ProtoBufs. -A high-level overview of some of the most important RPC implementation is as follows. +A high-level overview of some of the most important RPC implementation is as follows. #### Java Remote Method Invocation -Java RMI (Java Remote Method Invocation){% cite rmibook --file rpc %} is a Java implementation for performing RPC (Remote Procedure Calls) between a client and a server. The client using a stub passes via a socket connection the information over the network to the server that contains remote objects. The Remote Object Registry (ROR){% cite rmipaper --file rpc %} on the server contains the references to objects that can be accessed remotely and through which the client will connect to. The client then can request the invocation of methods on the server for processing the requested call and then responds with the answer. -RMI provides some security by being encoded but not encrypted, though that can be augmented by tunneling over a secure connection or other methods. Moreover, RMI is very specific to Java. It cannot be used to take advantage of the language-independence feature that is inherent to most RPC implementations. Perhaps the main problem with RMI is that it doesn't provide *access transparency*. 
This means that a programmer(not the client program) cannot distinguish between the local objects or the remote objects making it relatively difficult handle partial failures in the network{%cite roi --file rpc %}. +Java RMI (Java Remote Method Invocation) {% cite rmibook --file rpc %} is a Java implementation for performing RPC (Remote Procedure Calls) between a client and a server. The client uses a stub to pass the information over a socket connection to the server, which contains the remote objects. The Remote Object Registry (ROR) {% cite rmipaper --file rpc %} on the server contains the references to the objects that can be accessed remotely and through which the client connects. The client can then request the invocation of methods on the server which responds with an answer. + +RMI provides some security by being encoded but not encrypted, though that can be augmented by tunneling over a secure connection or other methods. Moreover, RMI is very specific to Java. It cannot be used to take advantage of the language-independence feature that is inherent to most RPC implementations. Perhaps the main problem with RMI is that it doesn't provide *access transparency*. This means that a programmer (not the client program) cannot distinguish between local objects and remote objects, making it relatively difficult to handle partial failures in the network {%cite roi --file rpc %}. #### CORBA -CORBA (Common Object Request Broker Architecture){% cite corba --file rpc %} was created by the Object Management Group {% cite corbasite --file rpc %} to allow for language-agnostic communication among multiple computers. It is an object-oriented model defined via an Interface Definition Language (IDL) and the communication is managed through an Object Request Broker (ORB). This ORB acts as a broker for objects. CORBA can be viewed as a language-independent RMI system where each client and server have an ORB by which they communicate. The benefits of CORBA is that it allows for multi-language implementations that can communicate with each other, but much of the criticism around CORBA relates to poor consistency among implementations and it's relatively outdated by now. Moreover, CORBA suffers from same access transparency issues as Java RMI. + +CORBA (Common Object Request Broker Architecture) {% cite corba --file rpc %} was created by the Object Management Group {% cite corbasite --file rpc %} to allow for language-agnostic communication among multiple computers. It is an object-oriented model defined via an Interface Definition Language (IDL) and the communication is managed through an Object Request Broker (ORB). This ORB acts as a broker for objects. CORBA can be viewed as a language-independent RMI system where each client and server have an ORB by which they communicate. The benefit of CORBA is that it allows for multi-language implementations that can communicate with each other, but much of the criticism around CORBA relates to poor consistency among implementations and it's relatively outdated by now. Moreover, CORBA suffers from the same access transparency issues as Java RMI. #### XML-RPC and SOAP + The XML-RPC specification {% cite Wiener --file rpc%} performs an HTTP POST request to a server, formatted as XML and composed of a *header* and a *payload*, that calls only one method. It was originally released in the late 1990's and unlike RMI, it provides transparency by using HTTP as a transparent mechanism.
The header has to provide the basic information, like user agent and the size of the payload. The payload has to initiate a `methodCall` structure by specifying the name via `methodName` and associated parameter values. Parameters for the method can be scalar, structures or (recursive) arrays. The types of scalar can be one of `i4`, `int`, `boolean`, `string`, `double`, `dateTime.iso8601` or `base64`. The scalars are used to create more complex structures and arrays. -Below is an example as provided by the XML-RPC documentation{% cite Wiener --file rpc%}: +Below is an example as provided by the XML-RPC documentation {% cite Wiener --file rpc%}: ```XML - POST /RPC2 HTTP/1.0 User-Agent: Frontier/5.1.2 (WinNT) Host: betty.userland.com @@ -157,16 +160,16 @@ Content-length: 181 <?xml version="1.0"?> <methodCall> - <methodName>examples.getStateName</methodName> - <params> - <param> - <value><i4>41</i4></value> - </param> - </params> - </methodCall> + <methodName>examples.getStateName</methodName> + <params> + <param> + <value><i4>41</i4></value> + </param> + </params> +</methodCall> ``` -The response to a request will have the `methodResponse` with `params` and values, or a `fault` with the associated `faultCode` in case of an error {% cite Wiener --file rpc %}: +The response to a request will have the `methodResponse` with `params` and values, or a `fault` with the associated `faultCode` in the case of an error {% cite Wiener --file rpc %}: ```XML HTTP/1.1 200 OK @@ -178,33 +181,35 @@ Server: UserLand Frontier/5.1.2-WinNT <?xml version="1.0"?> <methodResponse> - <params> - <param> - <value><string>South Dakota</string></value> - </param> - </params> - </methodResponse> + <params> + <param> + <value><string>South Dakota</string></value> + </param> + </params> +</methodResponse> ``` -SOAP (Simple Object Access Protocol) is a successor of XML-RPC as a web-services protocol for communicating between a client and server. It was initially designed by a group at Microsoft {% cite soaparticle1 --file rpc %}. The SOAP message is an XML-formatted message composed of an envelope inside which a header and a payload are provided(just like XML-RPC). The payload of the message contains the request and response of the message, which is transmitted over HTTP or SMTP(unlike XML-RPC). +SOAP (Simple Object Access Protocol) is a successor of XML-RPC as a web-services protocol for communicating between a client and server. It was initially designed by a group at Microsoft {% cite soaparticle1 --file rpc %}. The SOAP message is an XML-formatted message composed of an envelope inside which a header and a payload are provided(just like XML-RPC). The payload of the message contains the request and response of the message, which is transmitted over HTTP or SMTP (unlike XML-RPC). -SOAP can be viewed as the superset of XML-RPC that provides support for more complex authentication schemes{%cite soapvsxml --file rpc %} as well as its support for WSDL(Web Services Description Language), allowing easier discovery and integration with remote web services{%cite soapvsxml --file rpc %}. +SOAP can be viewed as the superset of XML-RPC that provides support for more complex authentication schemes {% cite soapvsxml --file rpc %} as well as its support for WSDL (Web Services Description Language), allowing easier discovery and integration with remote web services {% cite soapvsxml --file rpc %}. -The benefit of SOAP is that it provides the flexibility for transmission over multiple transport protocol. 
The XML-based messages allow SOAP to become language agnostic, though parsing such messages could become a bottleneck. +The benefit of SOAP is that it provides the flexibility for transmission over multiple transport protocols. The XML-based messages allow SOAP to become language agnostic, though parsing such messages could become a bottleneck. #### Thrift -Thrift is an *asynchronous* RPC system created by Facebook and now part of the Apache Foundation {% cite thrift --file rpc %}. It is a language-agnostic Interface Description Language(IDL) by which one generates the code for the client and server. It provides the opportunity for compressed serialization by customizing the protocol and the transport after the description file has been processed. -Perhaps, the biggest advantage of Thrift is that its binary data format has a very low overhead. It has a relatively lower transmission cost(as compared to other alternatives like SOAP){%cite thrifttut --file rpc %} making it very efficient for large amounts of data transfer. +Thrift is an *asynchronous* RPC system created by Facebook which is now part of the Apache Foundation {% cite thrift --file rpc %}. It is a language-agnostic Interface Description Language (IDL) by which one generates the code for the client and server. It provides the opportunity for compressed serialization by customizing the protocol and the transport after the description file has been processed. + +Perhaps the biggest advantage of Thrift is that its binary data format has a very low overhead. It has a relatively lower transmission cost (compared to other alternatives like SOAP) {%cite thrifttut --file rpc %} making it very efficient for large amounts of data transfer. #### Finagle -Finagle is a fault-tolerant, protocol-agnostic runtime for doing RPC and high-level API for composing futures(see Async RPC section), with RPC calls generated under the hood. It was created by Twitter and is written in Scala to run on a JVM. It is based on three object types: Service objects, Filter objects and Future objects {% cite finagle --file rpc %}. -The Future objects act by asynchronously being requested for a computation that would return a response at some time in the future. These Future objects are the main communication mechanism in Finagle. All the inputs and the output are represented as Future objects. +Finagle is a fault-tolerant, protocol-agnostic runtime for doing RPC and a high-level API for composing futures (see Async RPC section), with RPC calls generated under the hood. It was created by Twitter and is written in Scala to run on the JVM. It is based on three object types: Service objects, Filter objects, and Future objects {% cite finagle --file rpc %}. +Future objects act by asynchronously requesting computation that would return a response at some time in the future. These Future objects are the main communication mechanism in Finagle. All the inputs and the output are represented as Future objects. -The Service objects are an endpoint that will return a Future upon processing a request. These Service objects can be viewed as the interfaces used to implement a client or a server. +Service objects are endpoints that return a Future upon processing a request. These Service objects can be viewed as the interfaces used to implement a client or a server.
+ +A sample Finagle server that reads a request and returns the version of the request is shown below. This example is taken from the Finagle documentation {% cite finagletut --file rpc %} ```Scala import com.twitter.finagle.{Http, Service} @@ -223,9 +228,9 @@ object Server extends App { } ``` -A Filter object transforms requests for further processing in case additional customization is required from a request. These provide program-independent operations like, timeouts, etc. They take in a Service and provide a new Service object with the applied Filter. Aggregating multiple Filters is alos possible in Finagle. +A Filter object transforms requests for further processing in case additional customization is required from a request. These provide program-independent operations like, timeouts, etc. They take in a Service and provide a new Service object with the applied Filter. Aggregating multiple Filters is also possible in Finagle. -A sample timeout Filter that takes in a service and creates a new service with timeouts is shown below. This example is taken from Finagle documentation{% cite finagletut --file rpc %} +A sample timeout Filter that takes in a service and creates a new service with timeouts is shown below. This example is taken from the Finagle documentation {% cite finagletut --file rpc %} ```Scala import com.twitter.finagle.{Service, SimpleFilter} @@ -241,54 +246,55 @@ class TimeoutFilter[Req, Rep](timeout: Duration, timer: Timer) } ``` -#### Open Network Computing RPC(ONC RPC) -ONC was originally introduced as SunRPC {%cite sunnfs --file rpc %} for the Sun NFS. The Sun NFS system had a stateless server, with client-side caching, unique file handlers, and supported NFS read, write, truncate, unlink, etc operations. However, SunRPC was later revised as ONC in 1995 {%cite rfc1831 --file rpc %} and then in 2009 {%cite rfc5531 --file rpc %}. The IDL used in ONC(and SunRPC) is External Data Representation (XDR), a serialization mechanism specific to networks communication and therefore, ONC is limited to applications like Network File Systems. +#### Open Network Computing RPC (ONC RPC) + +ONC was originally introduced as SunRPC {% cite sunnfs --file rpc %} for the Sun NFS. The Sun NFS system had a stateless server, with client-side caching, unique file handlers, and supported NFS read, write, truncate, unlink, etc operations. However, SunRPC was later revised as ONC in 1995 {% cite rfc1831 --file rpc %} and then in 2009 {% cite rfc5531 --file rpc %}. The IDL used in ONC (and SunRPC) is External Data Representation (XDR), a serialization mechanism specific to networks communication and therefore, ONC is limited to applications like Network File Systems. + +#### Mobile Assistance Using Infrastructure (MAUI) -#### Mobile Assistance Using Infrastructure(MAUI) -The MAUI project {% cite maui --file rpc %}, developed by Microsoft is a computation offloading system for mobile systems. It's an automated system that offloads a mobile code to a dedicated infrastructure in order to increase the battery life of the mobile, minimize the load on the programmer and perform complex computations offsite. MAUI uses RPC as the communication protocol between the mobile and the infrastructure. +The MAUI project {% cite maui --file rpc %}, developed by Microsoft is a computation offloading system for mobile systems. 
It's an automated system that offloads mobile code to a dedicated infrastructure in order to increase the battery life of the mobile device, minimize the load on the programmer, and perform any complex computations offsite. MAUI uses RPC as the communication protocol between the mobile and the infrastructure. #### gRPC -gRPC is a multiplexed, bi-directional streaming RPC protocol developed Google and Square. The IDL for gRPC is Protocol Buffers(also referred as ProtoBuf) and is meant as a public replacement of Stubby, ARCWire, and Sake {% cite Apigee --file rpc %}. More details on Protocol Buffers, Stubby, ARCWire, and Sake are available in our gRPC chapter{% cite grpcchapter --file rpc %}. +gRPC is a multiplexed, bi-directional streaming RPC protocol developed by Google and Square. The IDL for gRPC is Protocol Buffers (also referred to as ProtoBuf) and is meant as a public replacement for Stubby, ARCWire, and Sake {% cite Apigee --file rpc %}. More details on Protocol Buffers, Stubby, ARCWire, and Sake are available in our gRPC chapter {% cite grpcchapter --file rpc %}. -gRPC provides a platform for scalable, bi-directional streaming using both synchronized and asynchronous communication. +gRPC provides a platform for scalable, bi-directional streaming using both synchronous and asynchronous communication. -In a general RPC mechanism, the client initiates a connection to the server and only the client can *request* while the server can only *respond* to the incoming requests. However, in bi-directional gRPC streams, although the initial connection is initiated by the client(call it *endpoint 1*), once the connection is established, both the server(call it *endpoint 2*) and the *endpoint 1* can send *requests* and receive *responses*. This significantly eases the development where both *endpoints* are communicating with each other(like, grid computing). It also saves the hassle of creating two separate connections between the endpoints (one from *endpoint 1* to *endpoint 2* and another from *endpoint 2* to *endpoint 1*) since both streams are independent. +In a general RPC mechanism, the client initiates a connection to the server and only the client can *request* while the server can only *respond* to the incoming requests. However, in bi-directional gRPC streams, although the initial connection is initiated by the client (call it *endpoint 1*), once the connection is established, both the server (call it *endpoint 2*) and *endpoint 1* can send *requests* and receive *responses*. This significantly eases development where both *endpoints* are communicating with each other (as in grid computing). It also saves the hassle of creating two separate connections between the endpoints (one from *endpoint 1* to *endpoint 2* and another from *endpoint 2* to *endpoint 1*) since both streams are independent. -It multiplexes the requests over a single connection using header compression. This makes it possible for gRPC to be used for mobile clients where battery life and data usage are important. -The core library is in C -- except for Java and GO -- and surface APIs are implemented for all the other languages connecting through it{% cite CoreSurfaceAPIs --file rpc %}. +gRPC multiplexes the requests over a single connection using header compression. This makes it possible for gRPC to be used for mobile clients where battery life and data usage are important.
+The core library is in C -- except for Java and Go -- and surface APIs are implemented for all the other languages connecting through it {% cite CoreSurfaceAPIs --file rpc %}. -Since Protocol Buffers has been utilized by many individuals and companies, gRPC makes it natural to extend their RPC ecosystems via gRPC. Companies like Cisco, Juniper and Netflix {% cite gRPCCompanies --file rpc %} have found it practical to adopt it. -A majority of the Google Public APIs, like their places and maps APIs, have been ported to gRPC ProtoBuf {% cite gRPCProtos --file rpc %} as well. +Since Protocol Buffers have been utilized by many individuals and companies, it is natural for them to extend their RPC ecosystems via gRPC. Companies like Cisco, Juniper, and Netflix {% cite gRPCCompanies --file rpc %} have found it practical to adopt it. A majority of the Google Public APIs, like their places and maps APIs, have been ported to gRPC ProtoBuf {% cite gRPCProtos --file rpc %} as well. More details about gRPC and bi-directional streaming can be found in our gRPC chapter {% cite grpcchapter --file rpc %}. #### Cap'n Proto -CapnProto{% cite capnprotosecure --file rpc %} is a data interchange RPC system that bypasses data-encoding step(like JSON or ProtoBuf) to significantly improve the performance. It's developed by the original author of gRPC's ProtoBuf, but since it uses bytes(binary data) for encoding/decoding, it outperforms gRPC's ProtoBuf. It uses futures and promises to combine various remote operations into a single operation to save the transportation round-trips. This means if an client calls a function `foo` and then calls another function `bar` on the output of `foo`, Cap'n Proto will aggregate these two operations into a single `bar(foo(x))` where `x` is the input to the function `foo` {% cite capnprotosecure --file rpc %}. This saves multiple roundtrips, especially in object-oriented programs. + +CapnProto {% cite capnprotosecure --file rpc %} is a data interchange RPC system that bypasses the data-encoding step to significantly improve the performance of calls. It is developed by the original author of gRPC's ProtoBuf, but since it uses bytes (binary data) for encoding/decoding, it outperforms gRPC's ProtoBuf. It uses futures and promises to combine various remote operations into a single operation to save on transportation round-trips. This means that if a client calls a function `foo` and then calls another function `bar` on the output of `foo`, Cap'n Proto will aggregate these two operations into a single `bar(foo(x))` where `x` is the input to the function `foo` {% cite capnprotosecure --file rpc %}. This saves multiple roundtrips, especially in object-oriented programs. ### The Heir to the Throne: gRPC or Thrift -Although there are many candidates to be considered as top contenders for RPC throne, most of these are targeted for a specific type of application.
ONC is generally specific to the Network File System (though it's being pushed as a standard), Cap'n Proto is relatively new and untested, MAUI is specific to mobile systems, the open-source Finagle is primarily being used at Twitter (not widespread), and the Java RMI simply doesn't even come close due to its transparency issues. -Probably, the most powerful, and practical systems out there are Apache Thrift and Google's gRPC, primarily because these two systems cater to a large number of programming languages, have a significant performance benefit over other techniques and are being actively developed. +Probably the most powerful and practical systems out there are Apache Thrift and Google's gRPC, primarily because these two systems cater to a large number of programming languages, have a significant performance benefit over other techniques, and are being actively developed. -Thrift was actually released a few years ago, while the first stable release for gRPC came out in August 2016. However, despite being 'out there', Thrift is currently less popular than gRPC {%cite trendrpcthrift --file rpc %}. +Thrift was released years earlier, while the first stable release for gRPC came out in August 2016. However, despite being released for some time, Thrift is currently less popular than gRPC {% cite trendrpcthrift --file rpc %}. gRPC {% cite gRPCLanguages --file rpc %} and Thrift both support most of the popular languages, including Java, C/C++, and Python. Thrift supports other languages, like Ruby, Erlang, Perl, Javascript, Node.js and OCaml while gRPC currently supports Node.js and Go. -The gRPC core is written in C(with the exception of Java and Go) and wrappers are written in other languages to communicate with the core, while the Thrift core is written in C++. +The gRPC core is written in C (with the exception of Java and Go) and wrappers are written in other languages to communicate with the core, while the Thrift core is written in C++. -gRPC also provides easier bidrectional streaming communication between the caller and callee. The client generally initiates the communication {% cite gRPCLanguages --file rpc %} and once the connection is established the client and the server can perform reads and writes independently of each other. However, bi-directional streaming in Thrift might be a little difficult to handle, since it focuses explicitly on a client-server model. To enable bidirectional, async streaming, one may have to run two separate systems {%cite grpcbetter --file rpc%}. +gRPC also provides easier bidirectional streaming communication between the caller and callee. The client generally initiates the communication {% cite gRPCLanguages --file rpc %} and once the connection is established the client and the server can perform reads and writes independently of each other. However, bi-directional streaming in Thrift might be a little difficult to handle, since it focuses explicitly on a client-server model. To enable bidirectional, async streaming, one may have to run two separate systems {% cite grpcbetter --file rpc %}. Thrift provides exception-handling as a message while the programmer has to handle exceptions in gRPC. In Thrift, exceptions can be returned built into the message, while in gRPC, the programmer explicitly defines this behavior. This Thrift exception-handling makes it easier to write client-side applications.
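As an illustration of what this explicit handling looks like on the gRPC side, here is a minimal Python sketch; the `Echo` service, its generated modules `echo_pb2`/`echo_pb2_grpc`, and the address are made up for the example.

```python
import grpc
import echo_pb2       # hypothetical module generated from an Echo .proto file
import echo_pb2_grpc  # hypothetical generated stub module

channel = grpc.insecure_channel("localhost:50051")
stub = echo_pb2_grpc.EchoStub(channel)

try:
    reply = stub.Say(echo_pb2.EchoRequest(message="hello"))
    print(reply.message)
except grpc.RpcError as err:
    # gRPC surfaces failures (network errors, server-side errors, timeouts)
    # as an RpcError carrying a status code and details; deciding how to
    # retry or recover is left to the application programmer.
    print("RPC failed:", err.code(), err.details())
```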
-Although custom authentication mechanisms can be implemented in both these system, gRPC come with a Google-backed authentication using SSL/TLS and Google Tokens {% cite grpcauth --file rpc %}. +Although custom authentication mechanisms can be implemented in both of these systems, gRPC comes with Google-backed authentication using SSL/TLS and Google Tokens {% cite grpcauth --file rpc %}. -Moreover, gRPC-based network communication is done using HTTP/2. HTTP/2 makes it feasible for communicating parties to multiplex network connections using the same port. This is more efficient(in terms of memory usage) as compared to HTTP/1.1. Since gRPC communication is done HTTP/2, it means that gRPC can easily multiplex different services. As for Thrift, multiplexing services is possible, however, due to lack of support from underlying transport protocol, it is performed using a `TMulitplexingProcessor` class(in code) {% cite multiplexingthrift --file rpc %}. +Moreover, gRPC-based network communication is done using HTTP/2. HTTP/2 makes it feasible for communicating parties to multiplex network connections using the same port. This is more efficient (in terms of memory usage) as compared to HTTP/1.1. Since gRPC communication is done over HTTP/2, gRPC can easily multiplex different services. As for Thrift, multiplexing services is possible; however, due to a lack of support from the underlying transport protocol, it is performed using a `TMultiplexingProcessor` class (in code) {% cite multiplexingthrift --file rpc %}. However, both gRPC and Thrift allow async RPC calls. This means that a client can send a request to the server and continue with its execution, and the response from the server is processed when it arrives. - The major comparison between gRPC and Thrift can be summed up in this table. | Comparison | Thrift | gRPC | @@ -300,73 +306,73 @@ The major comparison between gRPC and Thrift can be summed in this table. | Exceptions | Allows being built in the message | Implemented by the programmer | | Authentication | Custom | Custom + Google Tokens | | Bi-Directionality | Not straightforward | Straightforward | -| Multiplexing | Possible via `TMulitplexingProcessor` class | Possible via HTTP/2 | +| Multiplexing | Possible via `TMultiplexingProcessor` class | Possible via HTTP/2 | -Although, it's difficult to specifically choose one over the other, however, with increasing popularity of gRPC, and the fact that it's still in early stages of development, the general trend{%cite trendrpcthrift --file rpc %} over the past year has started to shift in favor of gRPC and it's giving Thrift a run for its money. Although, it may not be considered as a metric, but the gRPC was searched, on average, three times more as compared to Thrift{%cite trendrpcthrift --file rpc %}. +Although it's difficult to choose one over the other, with the increasing popularity of gRPC, and the fact that it's still in the early stages of development, the general trend {% cite trendrpcthrift --file rpc %} over the past year has started to shift in favor of gRPC, and it's giving Thrift a run for its money. It may not be a useful metric, but on average gRPC was searched for three times more frequently than Thrift {% cite trendrpcthrift --file rpc %}. -**Note:** This comparison is performed in December 2016 so the results are expected to change with time. +**Note:** This comparison was performed in December 2016, so the results are expected to change over time. 
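As a rough illustration of the asynchronous calls compared above, the sketch below uses Scala's standard futures; `UserServiceStub` and `lookupUser` are hypothetical stand-ins for the future-returning methods that gRPC or Thrift code generators can produce, not actual framework APIs.

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.util.{Failure, Success}

// Hypothetical stub standing in for generated async gRPC/Thrift client code.
trait UserServiceStub {
  def lookupUser(id: Long): Future[String]
}

def greet(stub: UserServiceStub): Unit = {
  val reply = stub.lookupUser(42L)   // the request goes out; the caller is not blocked
  println("doing other work ...")    // the client continues with its own execution
  reply.onComplete {                 // the response is handled whenever it arrives
    case Success(name) => println(s"hello, $name")
    case Failure(err)  => println(s"rpc failed: ${err.getMessage}")
  }
}
```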
-## Applications: +## Applications -Since its inception, various papers have been published in applying RPC paradigm to different domains, as well as using RPC implementations to create new systems. Here are some of applications and systems that incorporated RPC. +Since its inception, various papers have been published on applying the RPC paradigm to different domains, as well as on using RPC implementations to create new systems. Here are some of the applications and systems that incorporate RPC. #### Shared State and Persistence Layer -One major limitation(and the advantage) of RPC is considered the separate *address space* of all the machines in the network. This means that *pointers* or *references* to a data object cannot be passed between the caller and the callee. Therefore, Interweave {% cite interweave2 interweave1 interweave3 --file rpc %} is a *middleware* system that allows scalable sharing of arbitrary datatypes and language-independent processes running on heterogeneous hardware. Interweave is specifically designed and is compatible with RPC-based systems and allows easier access to the shared resources between different applications using memory blocks and locks. +One major limitation (and advantage) of RPC is the separate *address space* of all the machines in the network. This means that *pointers* or *references* to a data object cannot be passed between the caller and the callee. To address this, Interweave {% cite interweave2 interweave1 interweave3 --file rpc %} is a *middleware* system that allows scalable sharing of arbitrary datatypes between language-independent processes running on heterogeneous hardware. Interweave is specifically designed to be compatible with RPC-based systems and allows easier access to shared resources between different applications using memory blocks and locks. -Although research has been done in order to ensure a global shared state for an RPC-based system, However, these systems tend to take away the sense of independence and modularity between the *caller* and the *callee* by using a shared storage instead of a separate *address space*. +Although research has been done in order to ensure a global shared state for an RPC-based system, these systems tend to take away the sense of independence and modularity between the *caller* and the *callee* by using shared storage instead of a separate *address space*. #### GridRPC -Grid computing is one of the most widely used applications of RPC paradigm. At a high level, it can be seen as a mesh (or a network) of computers connected with each other to for *grid* such each system can leverage resources from any other system in the network. +Grid computing is one of the most widely used applications of the RPC paradigm. At a high level, it can be seen as a mesh (or a network) of computers connected with each other to form a *grid* such that each system can leverage resources from any other system in the network. In the GridRPC paradigm, each computer in the network can act as the *caller* or the *callee* depending on the amount of resources required {% cite grid1 --file rpc %}. It's also possible for the same computer to act as the *caller* as well as the *callee* for *different* computations. -Some of the most popular implementations that allow one to have GridRPC-compliant middleware are GridSolve{% cite gridsolve1 gridsolve2 --file rpc %} and Ninf-G{% cite ninf --file rpc %}. Ninf is relatively older than GridSolve and was first published in the late 1990's. 
It's a simple RPC layer that also provides authentication and secure communication between the two parties. GridSolve, on the other hand, is relatively complex and provides a middleware for the communications using a client-agent-server model. +Some of the most popular implementations that allow one to have GridRPC-compliant middleware are GridSolve {% cite gridsolve1 gridsolve2 --file rpc %} and Ninf-G {% cite ninf --file rpc %}. Ninf is relatively older than GridSolve and was first published in the late 1990's. It's a simple RPC layer that also provides authentication and secure communication between the two parties. GridSolve, on the other hand, is relatively complex and provides a middleware for the communications using a client-agent-server model. #### Mobile Systems and Computation Offloading -Mobile systems have become very powerful these days. With multi-core processors and gigabytes of RAM, they can undertake relatively complex computations without a hassle. Due to this advancement, they consume a larger amount of energy and hence, their batteries, despite becoming larger, drain quickly with usage. Moreover, mobile data (network bandwidth) is still limited and expensive. Due to these requirements, it's better to offload mobile computations from mobile systems when possible. RPC plays an important role in the communication for this *computation offloading*. Some of these services use Grid RPC technologies to offload this computation. Whereas, other technologies use an RMI(Remote Method Invocation) system for this. +Mobile systems have become very powerful these days. With multi-core processors and gigabytes of RAM, they can undertake relatively complex computations without a hassle. Due to this advancement, they consume a larger amount of energy and hence their batteries, despite becoming larger, drain quickly with usage. Moreover, mobile data (network bandwidth) is still limited and expensive. Due to these constraints, it's better to offload computations from mobile systems when possible. RPC plays an important role in the communication for this *computation offloading*. Some of these services use Grid RPC technologies to offload this computation, whereas others use an RMI (Remote Method Invocation) system. -The Ibis Project {% cite ibis --file rpc %} builds an RMI(similar to JavaRMI) and GMI (Group Method Invocation) model to facilitate outsourcing computation. Cuckoo {% cite cuckoo --file rpc %} uses this Ibis communication middleware to offload computation from applications(built using Cuckoo) running on Android smartphones to remote Cuckoo servers. +The Ibis Project {% cite ibis --file rpc %} builds an RMI (similar to JavaRMI) and GMI (Group Method Invocation) model to facilitate outsourcing computation. Cuckoo {% cite cuckoo --file rpc %} uses this Ibis communication middleware to offload computation from applications (built using Cuckoo) running on Android smartphones to remote Cuckoo servers. Microsoft's MAUI Project {% cite maui --file rpc %} uses RPC communication and allows partitioning of .NET applications and "fine-grained code offload to maximize energy savings with minimal burden on the programmer". MAUI decides the methods to offload to the external MAUI server at runtime. #### Async RPC, Futures and Promises -Remote Procedure Calls can be asynchronous. Not only that but these async RPCs play in integral role in the *futures* and *promises*. 
*Future* and *promises* are programming constructs that where a *future* is seen as variable/data/return type/error while a *promise* is seen as a *future* that doesn't have a value, yet. We follow Finagle's {% cite finagle --file rpc %} definition of *futures* and *promises*, where the *promise* of a *future*(an empty *future*) is considered as a *request* while the async fulfillment of this *promise* by a *future* is seen as the *response*. This construct is primarily used for asynchronous programming. +Remote Procedure Calls can be asynchronous. Not only that, but these async RPCs play an integral role in *futures* and *promises*. *Futures* and *promises* are programming constructs where a *future* is seen as a variable/data/return type/error while a *promise* is seen as a *future* that doesn't have a value yet. We follow Finagle's {% cite finagle --file rpc %} definition of *futures* and *promises*, where the *promise* of a *future* (an empty *future*) is considered as a *request* while the async fulfillment of this *promise* by a *future* is seen as the *response*. This construct is primarily used for asynchronous programming. -Perhaps the most renowned systems using this type of RPC model are Twitter's Finagle{% cite finagle --file rpc %} and Cap'n Proto{% cite capnprotosecure --file rpc %}. +Perhaps the most renowned systems using this type of RPC model are Twitter's Finagle {% cite finagle --file rpc %} and Cap'n Proto {% cite capnprotosecure --file rpc %}. -#### RPC in Microservices Ecosystem: +#### RPC in Microservices Ecosystem -RPC implementations have moved from a one-server model to multiple servers and on to dynamically-created, load-balanced microservices. RPC started as separate implementations of REST, Streaming RPC, MAUI, gRPC, Cap'n Proto, and has now made it possible for integration of all these implementations as a single abstraction as a user *endpoint*. The endpoints are the building blocks of *microservices*. A *microservice* is usually *service* with a very simple, well-defined purpose, written in almost any language that interacts with other microservices to give the feel of one large monolithic *service*. These microservices are language-agnostic. One *microservice* for airline tickets written in C/C\++, might be communicating with a number of other microservices for individual airlines written in different languages(Python, C\++, Java, Node.js) using a language-agnostic, asynchronous, RPC framework like gRPC{%cite grpc --file rpc %} or Thrift{%cite thrift --file rpc %}. +RPC implementations have moved from a one-server model to multiple servers and on to dynamically-created, load-balanced microservices. RPC started as separate implementations -- REST, Streaming RPC, MAUI, gRPC, Cap'n Proto -- and it is now possible to integrate all of these implementations behind a single abstraction, the user *endpoint*. The endpoints are the building blocks of *microservices*. A *microservice* is usually a *service* with a very simple, well-defined purpose, written in almost any language, that interacts with other microservices to give the feel of one large monolithic *service*. These microservices are language-agnostic. One *microservice* for airline tickets, written in C/C\++, might be communicating with a number of other microservices for individual airlines written in different languages (Python, C\++, Java, Node.js) using a language-agnostic, asynchronous RPC framework like gRPC {% cite grpc --file rpc %} or Thrift {% cite thrift --file rpc %}. 
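As a sketch of what such future-based, language-agnostic communication can look like from the Scala side, consider the fragment below; `AirlineClient` and `quote` are hypothetical names standing in for the client code an RPC framework would generate, not actual gRPC or Thrift APIs.

```scala
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical stub standing in for a generated, future-returning RPC client.
trait AirlineClient {
  def quote(flight: String): Future[BigDecimal]
}

// The ticketing microservice fans out to several airline microservices in
// parallel and aggregates their asynchronous responses into one result.
def cheapestFare(airlines: Seq[AirlineClient], flight: String)
                (implicit ec: ExecutionContext): Future[BigDecimal] =
  Future.sequence(airlines.map(_.quote(flight))).map(_.min)
```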
The use of RPC has allowed us to create new microservices on-the-fly. The microservices can not only be created and bootstrapped at runtime, but can also have inherent features like load-balancing and failure-recovery. This bootstrapping might occur on the same machine, inside a Docker container {% cite docker --file rpc %}, or across a network (using any combination of DNS, NATs or other mechanisms). -RPC can be defined as the "glue" that holds all the microservices together{% cite microservices1rpc --file rpc %}. This means that RPC is one of the primary communication mechanism between different microservices running on different systems. A microservice requests another microservice to perform an operation/query. The other microservice, upon receiving such request, performs an operation and returns a response. This operation could vary from a simple computation to invoking another microservice creating a series of RPC events to creating new microservices on the fly to dynamically load balance the microservices system. These microservices are language-agnostic. One *microservice* could be written in C/C++, another one could be in different languages(Python, C++, Java, Node.js) and they all might be communicating with each other using a language-agnostic, asynchronous, performant RPC framework like gRPC{%cite grpc --file rpc %} or Thrift{%cite thrift --file rpc %}. +RPC can be defined as the "glue" that holds all the microservices together {% cite microservices1rpc --file rpc %}. This means that RPC is one of the primary communication mechanisms between different microservices running on different systems. A microservice requests another microservice to perform an operation/query. The other microservice, upon receiving such a request, performs an operation and returns a response. This operation could vary from a simple computation, to invoking another microservice (creating a series of RPC events), to creating new microservices on the fly to dynamically load-balance the system. These microservices are language-agnostic. One *microservice* could be written in C/C++, another could be written in a different language (Python, C++, Java, Node.js), and they all might be communicating with each other using a language-agnostic, asynchronous, performant RPC framework like gRPC {% cite grpc --file rpc %} or Thrift {% cite thrift --file rpc %}. -An example of a microservices ecosystem that uses futures/promises is Finagle{%cite finagle --file rpc %} at Twitter. +An example of a microservices ecosystem that uses futures/promises is Finagle {% cite finagle --file rpc %} at Twitter. -## Security in RPC: +## Security in RPC The initial RPC implementation {% cite implementingrpc --file rpc %} was developed for an isolated LAN network and didn't focus much on security. There are various attack surfaces in that model, from a malicious registry, to a malicious server, to a client mounting a Denial-of-Service attack, to a Man-in-the-Middle attack between client and server. As time progressed and the internet evolved, new standards came along, and RPC implementations became much more secure. Security, in RPC, is generally added as a *module* or a *package*. These modules have libraries for authentication and authorization of the communication services (caller and callee). These modules are not always bug-free and it's possible to gain unauthorized access to the system. Efforts are being made by the security community to rectify these situations, using code inspection and bug bounty programs to catch these bugs beforehand. 
However, with time new bugs arise and this cycle continues. It's a vicious cycle between attackers and security experts, each trying to outdo the other. -For example, the Oracle Network File System uses a *Secure RPC*{% cite oraclenfs --file rpc %} to perform authentication in the NFS. This *Secure RPC* uses Diffie-Hellman authentication mechanism with DES encryption to allow only authorized users to access the NFS. Similarly, Cap'n Proto {% cite capnprotosecure --file rpc %} claims that it is resilient to memory leaks, segfaults, and malicious inputs and can be used between mutually untrusting parties. However, in Cap'n Proto "the RPC layer is not robust against resource exhaustion attacks, possibly allowing denials of service", nor has it undergone any formal verification {% cite capnprotosecure --file rpc %}. +For example, the Oracle Network File System uses *Secure RPC* {% cite oraclenfs --file rpc %} to perform authentication in the NFS. This *Secure RPC* uses a Diffie-Hellman authentication mechanism with DES encryption to allow only authorized users to access the NFS. Similarly, Cap'n Proto {% cite capnprotosecure --file rpc %} claims that it is resilient to memory leaks, segfaults, and malicious inputs and can be used between mutually untrusting parties. However, in Cap'n Proto "the RPC layer is not robust against resource exhaustion attacks, possibly allowing denials of service", nor has it undergone any formal verification {% cite capnprotosecure --file rpc %}. -Although, it's possible to come up with a *Threat Model* that would make an RPC implementation insecure to use, however, one has to understand that using any distributed system increases the attack surface anyways and claiming one *paradigm* to be more secure than another would be a biased statement, since *paradigms* are generally an idea and it depends on different system designers to use these *paradigms* to build their systems and take care of features specific to real systems, like security and load-balancing. There's always a possibility of rerouting a request to a malicious server(if the registry gets hacked), or there's no trust between the *caller* and *callee*. However, we maintain that RPC *paradigm* is not secure or insecure(for that matter), and that the most secure systems are the ones that are in an isolated environment, disconnected from the public internet with a self-destruct mechanism{% cite selfdest --file rpc %} in place, in an impenetrable bunker, and guarded by the Knights Templar(*they don't exist! Well, maybe Fort Meade comes close*). +Although it's possible to come up with a *Threat Model* that would make an RPC implementation insecure to use, one has to understand that using any distributed system increases the attack surface anyway. Claiming one *paradigm* to be more secure than another would be a biased statement, since *paradigms* are generally an idea; it is up to the system designers who use these *paradigms* to build their systems to take care of features specific to real systems, like security and load-balancing. There's always a possibility of rerouting a request to a malicious server (if the registry gets hacked), or of there being no trust between the *caller* and *callee*. 
However, we maintain that the RPC *paradigm* is neither secure nor insecure, and that the most secure systems are the ones that are in an isolated environment, disconnected from the public internet with a self-destruct mechanism {% cite selfdest --file rpc %} in place, in an impenetrable bunker, and guarded by the Knights Templar (*they don't exist! Well, maybe Fort Meade comes close*). -## Discussion: +## Discussion -RPC *paradigm* shines the most in *request-response* mechanisms. Futures and Promises also appear to a new breed of RPC. This leads one to question, as to whether every *request-response* system is a modified implementation to of the RPC *paradigm*, or does it actually bring anything new to the table? These modern communication protocols, like HTTP and REST, might just be a different flavor of RPC. In HTTP, a client *requests* a web page(or some other content), the server then *responds* with the required content. The dynamics of this communication might be slightly different from your traditional RPC, however, an HTTP Stateless server adheres to most of the concepts behind RPC *paradigm*. Similarly, consider sending a request to your favorite Google API. Say, you want to translate your latitude/longitude to an address using their Reverse Geocoding API, or maybe want to find out a good restaurant in your vicinity using their Places API, you'll send a *request* to their server to perform a *procedure* that would take a few input arguments, like the coordinates, and return the result. Even though these APIs follow a RESTful design, it appears to be an extension to the RPC *paradigm*. +The RPC *paradigm* shines the most in *request-response* mechanisms. Futures and Promises also appear to be a new breed of RPC. This leads one to question whether every *request-response* system is a modified implementation of the RPC *paradigm*, or whether it actually brings anything new to the table. These modern communication protocols, like HTTP and REST, might just be a different flavor of RPC. In HTTP, a client *requests* a web page (or some other content), and the server then *responds* with the required content. The dynamics of this communication might be slightly different from your traditional RPC; however, an HTTP stateless server adheres to most of the concepts behind the RPC *paradigm*. Similarly, consider sending a request to your favorite Google API. Say you want to translate your latitude/longitude to an address using their Reverse Geocoding API, or maybe find a good restaurant in your vicinity using their Places API: you'll send a *request* to their server to perform a *procedure* that takes a few input arguments, like the coordinates, and returns the result. Even though these APIs follow a RESTful design, they appear to be an extension of the RPC *paradigm*. -RPC paradigm has evolved over time. It has evolved to the extent that, currently, it's become very difficult differentiate RPC from non-RPC. With each passing year, the restrictions and limitations of RPC evolve. Current RPC implementations even have the support for the server to *request* information from the client to *respond* to these requests and vice versa (bidirectionality). This *bidirectional* nature of RPCs have transitioned RPC from simple *client-server* model to a set of *endpoints* communicating with each other. +The RPC paradigm has evolved over time. It has evolved to the extent that, currently, it has become very difficult to differentiate RPC from non-RPC. 
With each passing year, the restrictions and limitations of RPC evolve. Current RPC implementations even have support for the server to *request* information from the client, which then *responds* to these requests, and vice versa (bidirectionality). This *bidirectional* nature of RPCs has transitioned RPC from a simple *client-server* model to a set of *endpoints* communicating with each other. For the past four decades, researchers and industry leaders have tried to come up with *their* definition of RPC. The proponents of the RPC paradigm view every *request-response* communication as an implementation of the RPC paradigm, while those against RPC try to explicitly enumerate the limitations of RPC. These limitations, however, seem to slowly vanish as new RPC models are introduced with time. RPC supporters consider it the Holy Grail of distributed systems. They view it as the foundation of modern distributed communication. From Apache Thrift and ONC to HTTP and REST, they advocate it all as RPC, while REST developers have strong opinions against RPC. -Moreover, with modern global storage mechanisms, the need for RPC systems to have a separate *address space* seems to be slowly dissolving and disappearing into thin air. So, the question remains what *is* RPC and what * is not* RPC? This is an open-ended question. There is no unanimous agreement about what RPC should look like, except that it has communication between two *endpoints*. What we think of RPC is: +Moreover, with modern global storage mechanisms, the need for RPC systems to have a separate *address space* seems to be slowly dissolving and disappearing into thin air. So, the question remains: what *is* RPC and what *is not* RPC? This is an open-ended question. There is no unanimous agreement about what RPC should look like, except that it involves communication between two *endpoints*. What we think of RPC is: *In the world of distributed systems, where every individual component of a system, be it a hard disk, a multi-core processor, or a microservice, is an extension of the RPC, it's difficult to come up with a concrete definition of the RPC paradigm. Therefore, anything loosely associated with a request-response mechanism can be considered as RPC.* diff --git a/chapter/2/futures.md b/chapter/2/futures.md index 9c738d6..7fd9fb2 100644 --- a/chapter/2/futures.md +++ b/chapter/2/futures.md @@ -10,12 +10,12 @@ by: "Kisalaya Prasad, Avanti Patil, and Heather Miller" As human beings we have the ability to multitask _i.e._ we can walk, talk, and eat at the same time except when sneezing. Sneezing is a blocking activity because it forces you to stop what you're doing for a brief moment, and then you resume where you left off. One can think of the human sense of multitasking as multithreading in the context of computers. -Consider for a moment a simple computer processor; no parallelism, just the ability to complete one task or process at a time. In this scenario, sometimes the processor is blocked when some blocking operation is called. 
Such blocking calls can include I/O operations like reading/writing to disk, or sending or receiving packets over the network. And as programmers, we know that blocking calls like I/O can take disproportionately more time than a typical CPU-bound task, like iterating over a list. The processor can handle blocking calls in two ways: - **Synchronously**: the processor waits until the blocking call completes its task and returns the result. Afterwards, the processor will move on to processing the next task. _This can oftentimes be problematic because the CPU may not be utilized in an efficient manner; it may wait for long periods of time._ -- **Asynchronously**: When tasks are processed asynchronously, CPU time spent waiting in the synchronous case is instead spent processing some other task using a preemptive time sharing algorithm. That is, rather than wait, process some other task instead. Thus, the processor is never left waiting at any time. +- **Asynchronously**: When tasks are processed asynchronously, CPU time spent waiting in the synchronous case is instead spent processing some other task using a preemptive time sharing algorithm. That is, rather than wait, process some other task instead. Thus, the processor is never left waiting as long as there is more work that can be done. In the world of programming, many constructs have been introduced in order to help programmers reach ideal levels of resource utilization. Arguably one of the most widely-used of which are futures and/or promises. @@ -45,7 +45,7 @@ Importantly, futures/promises typically enable some degree of concurrency. That <footer>{% cite Multilisp --file futures%}</footer> </blockquote> -Some interpretations of futures/promises have a type associated with them, others not. Typically a future/promise is single-assignment; that is, you it can only be written to once. Some interpretations are blocking (synchronous), others are completely non-blocking (asynchronous). Some interpretations must be explicitly _kicked off_ (i.e., manually started), while in other interpretations, computation is started implicitly. +Some interpretations of futures/promises have a type associated with them, others not. Typically a future/promise is single-assignment; that is, it can only be written to once. Some interpretations are blocking (synchronous), others are completely non-blocking (asynchronous). Some interpretations must be explicitly _kicked off_ (i.e. manually started), while in other interpretations, computation is started implicitly. Inspired by functional programming, one of the major distinctions between different interpretations of this construct have to do with _pipelineing_ or _composition_. Some of the more popular interpretations of futures/promises make it possible to _chain_ operations, or define a pipeline of operations to be invoked upon completion of the computation represented by the future/promise. This is in contrast to callback-heavy or more imperative direct blocking approaches. @@ -53,7 +53,7 @@ Inspired by functional programming, one of the major distinctions between differ The rise of promises and futures as a topic of relevance has for the most part occurred alongside of the rise of parallel and concurrent programming and distributed systems. 
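As one concrete illustration of the single-assignment property, Scala's standard-library `Promise` (discussed in more detail below) can only be completed once; later writes are simply rejected. A minimal sketch:

```scala
import scala.concurrent.Promise

val p = Promise[Int]()
p.success(1)                 // the first write completes the promise
println(p.future.value)      // Some(Success(1))
println(p.trySuccess(2))     // false: a completed promise cannot be written again
println(p.future.value)      // still Some(Success(1))
```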
This follows somewhat naturally, since, as an abstraction which encodes time, futures/promises introduce a nice way to reason about state changes when latency becomes an issue; a common concern faced by programmers when a node must communicate with another node in a distributed system. -However promises and futures are considered useful in a number of contexts as well, both distributed and not. Some such contexts include: +However promises and futures are considered useful in a number of other contexts as well, both distributed and not. Some such contexts include: - **Request-Response Patterns**, such as web service calls over HTTP. A future may be used to represent the value of the response of the HTTP request. - **Input/Output**, such as UI dialogs requiring user input, or operations such as reading large files from disk. A future may be used to represent the IO call and the resulting value of the IO (e.g., terminal input, array of bytes of a file that was read). @@ -64,7 +64,7 @@ However promises and futures are considered useful in a number of contexts as we - **Timeouts**, such as managing timeouts in a web service. A future representing a timeout could simply return no result or some kind of empty result like the `Unit` type in typed programming languages. -Many real world services and systems today make heavy use of futures/promises in popular contexts such as these, thanks to the notion of a future or a promise having been introduced in popular languages and frameworks such as JavaScript, NodeJS, Scala, Java, C++, amongst many others. As we will see in further sections, this proliferation of futures/promises has resulted in futures/promises changing meanings and names over time and across languages. +Many real world services and systems today make heavy use of futures/promises in popular contexts such as these, thanks to the notion of a future or a promise having been introduced in popular languages and frameworks such as JavaScript, Node.js, Scala, Java, C++, amongst many others. As we will see in further sections, this proliferation of futures/promises has resulted in futures/promises changing meanings and names over time and across languages. ## Diverging Terminology @@ -76,7 +76,7 @@ Sometimes, a language may have _one_ construct named future, promise, delay, def However, in other cases, a language may have _two_ constructs, typically referred to as futures and promises. Languages like Scala, Java, and Dart fall into this category. In this case, - A `Future` is a read-only reference to a yet-to-be-computed value. -- A `Promise` (or a `CompletableFuture`/`Completer`/etc) is a single-assignment variable which the `Future` refers to. +- A `Promise` (or a `CompletableFuture`/`Completer`/etc.) is a single-assignment variable which the `Future` refers to. In other words, a future is a read-only window to a value written into a promise. You can get the `Future` associated with a `Promise` by calling the `future` method on it, but conversion in the other direction is not possible. Another way to look at it would be, if you _promise_ something to someone, you are responsible for keeping it, but if someone else makes a _promise_ to you, you expect them to honor it in the _future_. @@ -91,7 +91,7 @@ In Scala, they are defined as follows: An important difference between Scala and Java (6) futures is that Scala futures are asynchronous in nature. Java's future, at least till Java 6, were blocking. Java 7 introduced asynchronous futures to great fanfare. 
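A small sketch of that difference, using Scala's standard library: registering a callback is non-blocking, while an explicit `Await` reproduces the blocking behaviour of a Java 6-era future.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val f = Future { 21 * 2 }

f.foreach(result => println(s"non-blocking: $result"))  // runs whenever the future completes

val result = Await.result(f, 1.second)                  // explicit, Java-6-style blocking
println(s"blocking: $result")
```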
-In Java 8, the `Future<T>` interface has methods to check if the computation is complete, to wait for its completion, and to retrieve the result of the computation when it is complete. `CompletableFutures` can be thought of as something of a promise, since their value can be explicitly set. However, `CompletableFuture` also implements the `Future` interface allowing it to be used as a `Future` as well. Promises can be thought of as a future with a public set method which the caller (or anybody else) can use to set the value of the future. +In Java 8, the `Future<T>` interface has methods to check if the computation is complete, to wait for its completion, and to retrieve the result of the computation when it is complete. `CompletableFutures` can be thought of as a promise, since their value can be explicitly set. However, `CompletableFuture` also implements the `Future` interface allowing it to be used as a `Future` as well. Promises can be thought of as a future with a public set method which the caller (or anybody else) can use to set the value of the future. In the JavaScript world, JQuery introduces a notion of `Deferred` objects which are used to represent a unit of work which is not yet finished. The `Deferred` object contains a promise object which represents the result of that unit of work. Promises are values returned by a function. The deferred object can also be canceled by its caller. @@ -99,7 +99,7 @@ Like Scala and Java, C# also makes the distinction between the future and promis And confusingly, the JavaScript community has standardized on a single construct known as a `Promise` which can be used like other languages' notions of futures. The Promises specification {% cite PromisesAPlus --file futures %} defines only a single interface and leaves the details of completing (or _fulfilling_) the promise to the implementer of the spec. Promises in JavaScript are also asynchronous and able to be pipelined. JavaScript promises are enabled by default in browsers that support ECMAScript 6 (EC6), or are available in a number of libraries such as [Bluebird](http://bluebirdjs.com/docs/getting-started.html) and [Q](https://github.com/kriskowal/q). -As we can see, concepts, semantics, and terminology seem to be a bit mixed up between languages and library implementations of futures/promises. These differences in terminology and semantics arise from the long history and independent language communities that have proliferated the use of futures/promises. +As we can see, concepts, semantics, and terminology seem to differ between languages and library implementations of futures/promises. These differences in terminology and semantics arise from the long history and independent language communities that have proliferated the use of futures/promises. ## Brief History @@ -107,7 +107,7 @@ As we can see, concepts, semantics, and terminology seem to be a bit mixed up be Here's a brief glimpse at a timeline spanning the history of futures and promises as we know them today: -<figure style="margin-left: 0px; width: 110%;"> +<figure style="margin-left: 0px; width: 105%;"> <img src="./images/1.png" alt="timeline" /> </figure> @@ -120,7 +120,7 @@ The first concept which eventually led to futures/promises appeared in 1961, wit Thunks were designed as a way of binding actual parameters to their formal definitions in Algol-60 procedure calls. 
If a procedure is called with an expression in the place of a formal parameter, the compiler generates a thunk which computes the expression and leaves the address of the result in some standard location. Think of a thunk as a continuation or a function that was intended to be evaluated in a single-threaded environment. -The first mention of Futures was by Baker and Hewitt in a paper on Incremental Garbage Collection of Processes {% cite Hewitt77 --file futures %}. They coined the term, _call-by-futures_, to describe a calling convention in which each formal parameter to a method is bound to a process which evaluates the expression in the parameter in parallel with other parameters. Before this paper, Algol 68 **{% cite missingref --file futures%}** also presented a way to make this kind of concurrent parameter evaluation possible, using the collateral clauses and parallel clauses for parameter binding. +The first mention of Futures was by Baker and Hewitt in a paper on Incremental Garbage Collection of Processes {% cite Hewitt77 --file futures %}. They coined the term, _call-by-futures_, to describe a calling convention in which each formal parameter to a method is bound to a process which evaluates the expression in the parameter in parallel with other parameters. Before this paper, Algol 68 **{% cite missingref --file futures %}** also presented a way to make this kind of concurrent parameter evaluation possible, using the collateral clauses and parallel clauses for parameter binding. In their paper, Baker and Hewitt introduced a notion of Futures as a 3-tuple representing an expression `E` consisting of: @@ -142,7 +142,7 @@ Futures and promises remained primarily an academic fascination until the early Among the mainstream programming languages, Python was perhaps the first, in 2002, to get a library which introduced a construct along the same lines as E’s promises in the Twisted library {% cite Twisted --file futures %}. Twisted introduced the notion of _Deferred_ objects, are used to receive the result of an operation not yet completed. In Twisted, deferred objects are just like normal first-class objects; they can be passed along anywhere a normal object can, the only difference is that deferred objects don't have values. Deferred objects support callbacks, which are called once the result of the operation is complete. -Perhaps most famous in recent memory is that of promises in JavaScript. In 2007, inspired by Python’s Twisted library, the authors of the Dojo Toolkit came up a JavaScript implementation of Twisted's deferred objects, known as `dojo.Deferred`. This in turn inspired Kris Zyp to propose the CommonJS Promises/A spec in 2009 {% cite PromisesA --file futures %}. The same year, Ryan Dahl introduced NodeJS. In it’s early versions, Node used promises in its non-blocking API. However, when NodeJS moved away from promises to its now familiar error-first callback API (the first argument for the callback should be an error object), it left a void to fill for a promises API. [Q.js](https://github.com/kriskowal/q) is an implementation of Promises/A spec by Kris Kowal around this time {% cite Qjs --file futures %}. The [FuturesJS](https://github.com/FuturesJS/FuturesJS) library by AJ O'Neal was another library which aimed to solve flow-control problems without using Promises in the strictest of senses. In 2011, JQuery v1.5 introduced Promises to its wider and ever-growing audience. 
However, JQuery's promises API was subtly different than the Promises/A spec {% cite JQueryPromises --file futures %}. With the rise of HTML5 and different APIs, there came a problem of different and messy interfaces which added to the already infamous callback hell. The Promises/A+ spec {% cite PromisesAPlus --file futures %} aimed to solve this problem. Following the broad community acceptance of the Promises/A+ spec, promises were finally made a part of the ECMAScript® 2015 Language Specification {% cite Ecmascript15 --file futures %}. However, a lack of backward compatibility and additional features missing in the Promises/A+ spec means that libraries like [BlueBird](http://bluebirdjs.com/docs/getting-started.html) and [Q.js](https://github.com/kriskowal/q) still have a place in the JavaScript ecosystem. +Perhaps most famous in recent memory is that of promises in JavaScript. In 2007, inspired by Python's Twisted library, the authors of the Dojo Toolkit came up with a JavaScript implementation of Twisted's deferred objects, known as `dojo.Deferred`. This in turn inspired Kris Zyp to propose the CommonJS Promises/A spec in 2009 {% cite PromisesA --file futures %}. The same year, Ryan Dahl introduced Node.js. In its early versions, Node.js used promises in its non-blocking API. However, when Node.js moved away from promises to its now familiar error-first callback API (the first argument for the callback should be an error object), it left a void to fill for a promises API. [Q.js](https://github.com/kriskowal/q) is an implementation of the Promises/A spec by Kris Kowal from around this time {% cite Qjs --file futures %}. The [FuturesJS](https://github.com/FuturesJS/FuturesJS) library by AJ O'Neal was another library which aimed to solve flow-control problems without using Promises in the strictest of senses. In 2011, JQuery v1.5 introduced Promises to its wider and ever-growing audience. However, JQuery's promises API was subtly different than the Promises/A spec {% cite JQueryPromises --file futures %}. With the rise of HTML5 and different APIs, there came a problem of different and messy interfaces which added to the already infamous callback hell. The Promises/A+ spec {% cite PromisesAPlus --file futures %} aimed to solve this problem. Following the broad community acceptance of the Promises/A+ spec, promises were finally made a part of the ECMAScript® 2015 Language Specification {% cite Ecmascript15 --file futures %}. However, a lack of backward compatibility and additional features missing in the Promises/A+ spec means that libraries like [BlueBird](http://bluebirdjs.com/docs/getting-started.html) and [Q.js](https://github.com/kriskowal/q) still have a place in the JavaScript ecosystem. ## Semantics of Execution @@ -185,29 +185,26 @@ f.onComplete { } ``` -In this example, we first create a future `f`, and when it completes, we provide two possible expressions that can be invoked depending on whether the future was executed successfully or if there was an error. In this case, if successful, we get the result of the computation an HTTP string, and we print it. Else, if an exception was thrown, we get the message string contained within the exception and we print that. +In this example, we first create a future `f`, and when it completes, we provide two possible expressions that can be invoked depending on whether the future was executed successfully or if there was an error. In this case, if successful, we get the result of the computation, an HTTP string, and we print it. 
If an exception was thrown, we get the message string contained within the exception and we print that. So, how does it all work together? -As we mentioned, Futures require an ExecutionContext, which is an implicit parameter to virtually all of the futures API. This ExecutionContext is used to execute the future. Scala is flexible enough to let users implement their own Execution Contexts, but let's talk about the default ExecutionContext, which is a ForkJoinPool. +As we mentioned, Futures require an `ExecutionContext`, which is an implicit parameter to virtually all of the futures API. This `ExecutionContext` is used to execute the future. Scala is flexible enough to let users implement their own Execution Contexts, but let's talk about the default `ExecutionContext`, which is a `ForkJoinPool`. - -ForkJoinPool is ideal for many small computations that spawn off and then come back together. Scala's ForkJoinPool requires the tasks submitted to it to be a ForkJoinTask. The tasks submitted to the global ExecutionContext is quietly wrapped inside a ForkJoinTask and then executed. ForkJoinPool also supports a possibly blocking task, using ManagedBlock method which creates a spare thread if required to ensure that there is sufficient parallelism if the current thread is blocked. To summarize, ForkJoinPool is an really good general purpose ExecutionContext, which works really well in most of the scenarios. +`ForkJoinPool` is ideal for many small computations that spawn off and then come back together. Scala's `ForkJoinPool` requires the tasks submitted to it to be a `ForkJoinTask`. The tasks submitted to the global `ExecutionContext` are quietly wrapped inside a `ForkJoinTask` and then executed. `ForkJoinPool` also supports possibly blocking tasks, using the `managedBlock` method, which creates a spare thread if required to ensure that there is sufficient parallelism if the current thread is blocked. To summarize, `ForkJoinPool` is a really good general-purpose `ExecutionContext`, which works well in most scenarios. ### Event Loops -Modern platforms and runtimes typically rely on many underlying system layers to operate. For example, there's an underlying file system, a database system, and other web services that may be relied on by a given language implementation, library, or framework. Interaction with these components typically involves a period where we're doing nothing but waiting for the response. This is single largest waste of computing resources. -Javascript is a single threaded asynchronous runtime. Now, conventionally async programming is generally associated with multi-threading, but we're not allowed to create new threads in Javascript. Instead, asynchronicity in Javascript is achieved using an event-loop mechanism. -Javascript has historically been used to interact with the DOM and user interactions in the browser, and thus an event-driven programming model was a natural fit for the language. This has scaled up surprisingly well in high throughput scenarios in NodeJS. +Modern platforms and runtimes typically rely on many underlying system layers to operate. For example, there's an underlying file system, a database system, and other web services that may be relied on by a given language implementation, library, or framework. Interaction with these components typically involves a period where we're doing nothing but waiting for the response. This can be a very large waste of computing resources. +JavaScript is a single threaded asynchronous runtime. 
Now, conventionally async programming is generally associated with multi-threading, but we’re not allowed to create new threads in JavaScript. Instead, asynchronicity in JavaScript is achieved using an event-loop mechanism. -The general idea behind event-driven programming model is that the logic flow control is determined by the order in which events are processed. This is underpinned by a mechanism which is constantly listening for events and fires a callback when it is detected. This is the Javascript’s event loop in a nutshell. +JavaScript has historically been used to interact with the DOM and user interactions in the browser, and thus an event-driven programming model was a natural fit for the language. This has scaled up surprisingly well in high throughput scenarios in Node.js. +The general idea behind event-driven programming model is that the logic flow control is determined by the order in which events are processed. This is underpinned by a mechanism which is constantly listening for events and fires a callback when it is detected. This is the JavaScript’s event loop in a nutshell. -A typical Javascript engine has a few basic components. They are : +A typical JavaScript engine has a few basic components. They are : - **Heap** Used to allocate memory for objects - **Stack** @@ -215,15 +212,12 @@ Function call frames go into a stack from where they’re picked up from top to - **Queue** A message queue holds the messages to be processed. - Each message has a callback function which is fired when the message is processed. These messages can be generated by user actions like button clicks or scrolling, or by actions like HTTP requests, request to a database to fetch records or reading/writing to a file. - Separating when a message is queued from when it is executed means the single thread doesn’t have to wait for an action to complete before moving on to another. We attach a callback to the action we want to do, and when the time comes, the callback is run with the result of our action. Callbacks work good in isolation, but they force us into a continuation passing style of execution, what is otherwise known as Callback hell. ```javascript - getData = function(param, callback){ $.get('http://example.com/get/'+param, function(responseText){ @@ -232,23 +226,21 @@ getData = function(param, callback){ } getData(0, function(a){ - getData(a, function(b){ - getData(b, function(c){ - getData(c, function(d){ - getData(d, function(e){ - - }); - }); + getData(a, function(b){ + getData(b, function(c){ + getData(c, function(d){ + getData(d, function(e){ + // ... }); + }); }); + }); }); - ``` <center><h4> VS </h4></center> ```javascript - getData = function(param, callback){ return new Promise(function(resolve, reject) { $.get('http://example.com/get/'+param, @@ -259,30 +251,27 @@ getData = function(param, callback){ } getData(0).then(getData) - .then(getData). - then(getData). - then(getData); - - + .then(getData) + .then(getData) + .then(getData); ``` > **Programs must be written for people to read, and only incidentally for machines to execute.** - *Harold Abelson and Gerald Jay Sussman* -Promises are an abstraction which make working with async operations in javascript much more fun. Callbacks lead to inversion of control, which is difficult to reason about at scale. Moving on from a continuation passing style, where you specify what needs to be done once the action is done, the callee simply returns a Promise object. 
This inverts the chain of responsibility, as now the caller is responsible for handling the result of the promise when it is settled. +Promises are an abstraction which make working with async operations in JavaScript much more fun. Callbacks lead to inversion of control, which is difficult to reason about at scale. Moving on from a continuation passing style, where you specify what needs to be done once the action is done, the callee simply returns a Promise object. This inverts the chain of responsibility, as now the caller is responsible for handling the result of the promise when it is settled. The ES2015 spec specifies that “promises must not fire their resolution/rejection function on the same turn of the event loop that they are created on.” This is an important property because it ensures deterministic order of execution. Also, once a promise is fulfilled or failed, the promise’s value MUST not be changed. This ensures that a promise cannot be resolved more than once. -Let’s take an example to understand the promise resolution workflow as it happens inside the Javascript Engine. - -Suppose we execute a function, here g() which in turn, calls function f(). Function f returns a promise, which, after counting down for 1000 ms, resolves the promise with a single value, true. Once f gets resolved, a value true or false is alerted based on the value of the promise. +Let’s take an example to understand the promise resolution workflow as it happens inside the JavaScript Engine. +Suppose we execute a function, here `g()` which in turn, calls another function `f()`. Function `f` returns a promise, which, after counting down for 1000 ms, resolves the promise with a single value, true. Once `f` gets resolved, a value `true` or `false` is alerted based on the value of the promise. <figure> <img src="./images/5.png" alt="timeline" /> </figure> -Now, javascript’s runtime is single threaded. This statement is true, and not true. The thread which executes the user code is single threaded. It executes what is on top of the stack, runs it to completion, and then moves onto what is next on the stack. But, there are also a number of helper threads which handle things like network or timer/settimeout type events. This timing thread handles the counter for setTimeout. +Now, JavaScript’s runtime is single threaded. This statement is true, and not true. The thread which executes the user code is single threaded. It executes what is on top of the stack, runs it to completion, and then moves onto what is next on the stack. But, there are also a number of helper threads which handle things like network or timer/settimeout type events. This timing thread handles the counter for setTimeout. <figure> <img src="./images/6.png" alt="timeline" /> @@ -294,59 +283,50 @@ Once the timer expires, the timer thread puts a message on the message queue. Th <img src="./images/7.png" alt="timeline" /> </figure> -Here, since the future is resolved with a value of true, we are alerted with a value true when the callback is picked up for execution. +Here, since the future is resolved with a value of `true`, we are alerted with a value `true` when the callback is picked up for execution. <figure> <img src="./images/8.png" alt="timeline" /> </figure> -Some finer details : We’ve ignored the heap here, but all the functions, variables and callbacks are stored on heap. 
-As we’ve seen here, even though Javascript is said to be single threaded, there are number of helper threads to help main thread do things like timeout, UI, network operations, file operations etc. -Run-to-completion helps us reason about the code in a nice way. Whenever a function starts, it needs to finish before yielding the main thread. The data it accesses cannot be modified by someone else. This also means every function needs to finish in a reasonable amount of time, otherwise the program seems hung. This makes Javascript well suited for I/O tasks which are queued up and then picked up when finished, but not for data processing intensive tasks which generally take long time to finish. -We haven’t talked about error handling, but it gets handled the same exact way, with the error callback being called with the error object the promise is rejected with. +As we’ve seen here, even though JavaScript is said to be single threaded, there are number of helper threads to help main thread do things like timeout, UI, network operations, file operations etc. +Run-to-completion helps us reason about the code in a nice way. Whenever a function starts, it needs to finish before yielding the main thread. The data it accesses cannot be modified by someone else. This also means every function needs to finish in a reasonable amount of time, otherwise the program seems hung. This makes JavaScript well suited for I/O tasks which are queued up and then picked up when finished, but not for data processing intensive tasks which generally take long time to finish. -Event loops have proven to be surprisingly performant. When network servers are designed around multithreading, as soon as you end up with a few hundred concurrent connections, the CPU spends so much of its time task switching that you start to lose overall performance. Switching from one thread to another has overhead which can add up significantly at scale. Apache used to choke even as low as a few hundred concurrent users when using a thread per connection while Node can scale up to a 100,000 concurrent connections based on event loops and asynchronous IO. +We haven’t talked about error handling, but it gets handled the same exact way, with the error callback being called with the error object the promise is rejected with. +Event loops have proven to be surprisingly performant. When network servers are designed around multithreading, as soon as you end up with a few hundred concurrent connections, the CPU spends so much of its time task switching that you start to lose overall performance. Switching from one thread to another has overhead which can add up significantly at scale. Apache used to choke even as low as a few hundred concurrent users when using a thread per connection while Node.js can scale up to a 100,000 concurrent connections based on event loops and asynchronous IO. ### Thread Model - -Oz programming language introduced an idea of dataflow concurrency model. In Oz, whenever the program comes across an unbound variable, it waits for it to be resolved. This dataflow property of variables helps us write threads in Oz that communicate through streams in a producer-consumer pattern. The major benefit of dataflow based concurrency model is that it’s deterministic - same operation called with same parameters always produces the same result. It makes it a lot easier to reason about concurrent programs, if the code is side-effect free. - +The Oz programming language introduced an idea of dataflow concurrency model. 
In Oz, whenever the program comes across an unbound variable, it waits for it to be resolved. This dataflow property of variables helps us write threads in Oz that communicate through streams in a producer-consumer pattern. The major benefit of a dataflow-based concurrency model is that it is deterministic - the same operation called with the same parameters always produces the same result. This makes it a lot easier to reason about concurrent programs, provided the code is side-effect free.

Alice ML is a dialect of Standard ML with support for lazy evaluation, concurrent, distributed, and constraint programming. The early aim of the Alice project was to reconstruct the functionality of the Oz programming language on top of a typed programming language. Building on the Standard ML dialect, Alice also provides concurrency features as part of the language through the use of a future type. Futures in Alice represent an undetermined result of a concurrent operation. Promises in Alice ML are explicit handles for futures.
-
Any expression in Alice can be evaluated in its own thread using the spawn keyword. Spawn always returns a future which acts as a placeholder for the result of the operation. Futures in Alice ML can be thought of as functional threads, in the sense that threads in Alice always have a result. A thread is said to be touching a future if it performs an operation that requires the value the future is a placeholder for. All threads touching a future are blocked until the future is resolved. If a thread raises an exception, the future is failed and this exception is re-raised in the threads touching it. Futures can also be passed along as values. This helps us achieve the dataflow model of concurrency in Alice.
-
Alice also allows for lazy evaluation of expressions. Expressions preceded with the lazy keyword are evaluated to a lazy future. The lazy future is evaluated when it is needed. If the computation associated with a concurrent or lazy future ends with an exception, it results in a failed future. Requesting a failed future does not block; it simply raises the exception that was the cause of the failure.

## Implicit vs. Explicit Promises
-
-We define Implicit promises as ones where we don’t have to manually trigger the computation vs Explicit promises where we have to trigger the resolution of future manually, either by calling a start function or by requiring the value. This distinction can be understood in terms of what triggers the calculation : With Implicit promises, the creation of a promise also triggers the computation, while with Explicit futures, one needs to triggers the resolution of a promise. This trigger can in turn be explicit, like calling a start method, or implicit, like lazy evaluation where the first use of a promise’s value triggers its evaluation.
-
+We define implicit promises as ones where we don’t have to manually trigger the computation, versus explicit promises where we have to trigger the resolution of the future manually, either by calling a start function or by requiring the value. This distinction can be understood in terms of what triggers the calculation: with implicit promises, the creation of a promise also triggers the computation, while with explicit futures, one needs to trigger the resolution of a promise. This trigger can in turn be explicit, like calling a start method, or implicit, like lazy evaluation where the first use of a promise’s value triggers its evaluation.

The idea of explicit futures was introduced in the Baker and Hewitt paper.
They’re a little trickier to implement, and require some support from the underlying language, and as such they aren’t that common. The Baker and Hewitt paper talked about using futures as placeholders for arguments to a function, which get evaluated in parallel, but only when they’re needed. MultiLisp also had a mechanism to delay the evaluation of a future to the time when its value is first used, using the defer construct. Lazy futures in Alice ML have a similar explicit invocation mechanism: the first thread touching a future triggers its evaluation.

-An example for Explicit Futures would be (from AliceML):
+An example of explicit futures would be (from AliceML):

```
fun enum n = lazy n :: enum (n+1)
-
```

This example generates an infinite stream of integers and, if started when it is created, will compete for the system's resources.

-Implicit futures were introduced originally by Friedman and Wise in a paper in 1978. The ideas presented in that paper inspired the design of promises in MultiLisp. Futures are also implicit in Scala and Javascript, where they’re supported as libraries on top of the core languages. Implicit futures can be implemented this way as they don’t require support from language itself. Alice ML’s concurrent futures are also an example of implicit invocation.
+Implicit futures were introduced originally by Friedman and Wise in a paper in 1978. The ideas presented in that paper inspired the design of promises in MultiLisp. Futures are also implicit in Scala and JavaScript, where they’re supported as libraries on top of the core languages. Implicit futures can be implemented this way as they don’t require support from the language itself. Alice ML’s concurrent futures are also an example of implicit invocation.

-For example
+In Scala, we can see an example of an implicit future when making an HTTP request.

```scala
-
val f = Future {
  Http("http://api.fixer.io/latest?base=USD").asString
}

@@ -355,29 +335,22 @@ f onComplete {
  case Success(response) => println(response.body)
  case Failure(t) => println(t)
}
-
```
-
This sends the HTTP call as soon as the Future is created. In Scala, although futures are implicit, Promises can be used to have explicit-like behavior. This is useful in a scenario where we need to stack up some computations and then resolve the Promise.

-An Example :
-
```scala
-
val p = Promise[Foo]()
p.future.map( ... ).filter( ... ) foreach println
p.complete(new Foo)
-
```

-Here, we create a Promise, and complete it later. In between we stack up a set of computations which get executed once the promise is completed.
-
+Here, we create a Promise, and complete it later. Between creation and completion we stack up a set of computations which then get executed once the promise is completed.

## Promise Pipelining

-One of the criticism of traditional RPC systems would be that they’re blocking. Imagine a scenario where you need to call an API ‘a’ and another API ‘b’, then aggregate the results of both the calls and use that result as a parameter to another API ‘c’. Now, the logical way to go about doing this would be to call A and B in parallel, then once both finish, aggregate the result and call C. Unfortunately, in a blocking system, the way to go about is call a, wait for it to finish, call b, wait, then aggregate and call c. This seems like a waste of time, but in absence of asynchronicity, it is impossible. Even with asynchronicity, it gets a little difficult to manage or scale up the system linearly. Fortunately, we have promises.
+One of the criticisms of traditional RPC systems is that they’re blocking. Imagine a scenario where you need to call an API ‘A’ and another API ‘B’, then aggregate the results of both calls and use that result as a parameter to another API ‘C’. The logical way to do this would be to call A and B in parallel and, once both finish, aggregate the results and call C. Unfortunately, in a blocking system, the way to go about it is to call A, wait for it to finish, call B, wait, then aggregate and call C. This seems like a waste of time, but in the absence of asynchronicity it is unavoidable. Even with asynchronicity, it gets a little difficult to manage or scale the system up linearly. Fortunately, we have promises.

<figure>
  <img src="./images/p-1.png" alt="timeline" />
</figure>

@@ -388,32 +361,27 @@ One of the criticism of traditional RPC systems would be that they’re blocking
  <img src="./images/p-2.png" alt="timeline" />
</figure>

-Futures/Promises can be passed along, waited upon, or chained and joined together. These properties helps make life easier for the programmers working with them. This also reduces the latency associated with distributed computing. Promises enable dataflow concurrency, which is also deterministic, and easier to reason.
+Futures/Promises can be passed along, waited upon, or chained and joined together. These properties help make life easier for the programmers working with them. They also reduce the latency associated with distributed computing. Promises enable dataflow concurrency, which is deterministic and easier to reason about.

-The history of promise pipelining can be traced back to the call-streams in Argus. In Argus, Call streams are a mechanism for communication between distributed components. The communicating entities, a sender and a receiver are connected by a stream, and sender can make calls to receiver over it. Streams can be thought of as RPC, except that these allow callers to run in parallel with the receiver while processing the call. When making a call in Argus, the caller receives a promise for the result. In the paper on Promises by Liskov and Shrira, they mention that having integrated Promises into call streams, next logical step would be to talk about stream composition. This means arranging streams into pipelines where output of one stream can be used as input of the next stream. They talk about composing streams using fork and coenter.
+The history of promise pipelining can be traced back to the call-streams in Argus. In Argus, call-streams are a mechanism for communication between distributed components. The communicating entities, a sender and a receiver, are connected by a stream, and the sender can make calls to the receiver over it. Streams can be thought of as RPC, except that they allow the caller to run in parallel with the receiver while it processes the call. When making a call in Argus, the caller receives a promise for the result. In the paper on Promises, Liskov and Shrira mention that, having integrated promises into call streams, the next logical step is stream composition, that is, arranging streams into pipelines where the output of one stream can be used as the input of the next. They talk about composing streams using fork and coenter.

Channels in Joule were a similar idea, providing a channel which connects an acceptor and a distributor. Joule was a direct ancestor of the E language, which developed these ideas in more detail.
```
-
t3 := (x <- a()) <- c(y <- b())
t1 := x <- a()
t2 := y <- b()
t3 := t1 <- c(t2)
-
```

-Without pipelining in E, this call will require three round trips. First to send a() to x, then b() to y then finally c to the result t1 with t2 as an argument. But with pipelining, the later messages can be sent with promises as result of earlier messages as argument. This allowed sending all the messages together, thereby saving the costly round trips. This is assuming x and y are on the same remote machine, otherwise we can still evaluate t1 and t2 parallely.
-
-
-Notice that this pipelining mechanism is different from asynchronous message passing, as in asynchronous message passing, even if t1 and t2 get evaluated in parallel, to resolve t3 we still wait for t1 and t2 to be resolved, and send it again in another call to the remote machine.
+Without pipelining in E, this call would require three round trips: first to send `a()` to `x`, then `b()` to `y`, and finally `c` to the result `t1` with `t2` as an argument. With pipelining, later messages can be sent with the promises returned by earlier messages as arguments. This allows all the messages to be sent together, thereby saving the costly round trips. This assumes `x` and `y` are on the same remote machine; otherwise we can still evaluate `t1` and `t2` in parallel.
+Notice that this pipelining mechanism is different from asynchronous message passing: with asynchronous message passing, even if `t1` and `t2` are evaluated in parallel, resolving `t3` still requires waiting for `t1` and `t2` to be resolved and then sending another call to the remote machine.

-Modern promise specifications, like one in Javascript comes with methods which help working with promise pipelining easier. In javascript, a Promises.all method is provided, which takes in an iterable and returns a new Promise which gets resolved when all the promises in the iterable get resolved. There’s also a race method, which returns a promise which is resolved when the first promise in the iterable gets resolved.
+Modern promise specifications, like the one in JavaScript, come with methods which make working with promise pipelining easier. In JavaScript, a `Promise.all` method is provided, which takes in an iterable and returns a new `Promise` which is resolved when all the promises in the iterable are resolved. There’s also a `Promise.race` method, which returns a promise which is resolved as soon as the first promise in the iterable is resolved.

Examples using these methods are shown below.

```javascript
-
var a = Promise.resolve(1);
var b = new Promise(function (resolve, reject) {
  setTimeout(resolve, 100, 2);
@@ -426,49 +394,39 @@ Promise.all([p1, p2]).then(values => {
Promise.race([p1, p2]).then(function(value) {
  console.log(value); // 1
});
-
```

-In Scala, futures have a onSuccess method which acts as a callback to when the future is complete. This callback itself can be used to sequentially chain futures together. But this results in bulkier code. Fortunately, Scala api comes with combinators which allow for easier combination of results from futures. Examples of combinators are map, flatmap, filter, withFilter.
-
+In Scala, futures have an `onSuccess` method which acts as a callback for when the future is complete. This callback itself can be used to sequentially chain futures together, but this results in bulkier code. Fortunately, the Scala API has combinators which allow for easier combination of results from futures.
Examples of combinators are `map`, `flatMap`, `filter`, `withFilter`. ## Handling Errors -If world would have run without errors we would rejoice in unison, but it is not the case in programming world as well. When you run a program you either receive an expected output or an error. Error can be defined as wrong output or an exception. In a synchronous programming model, the most logical way of handling errors is a try...catch block. +If the world ran without errors we would rejoice in unison, but this is not the case in the programming world. When you run a program you either receive an expected output or an error. Error can be defined as wrong output or an exception. In a synchronous programming model, the most logical way of handling errors is a `try...catch` block. ```javascript - -try{ - do something1; - do something2; - do something3; - ... -} catch ( exception ){ - HandleException; +try { + do something1; + do something2; + do something3; + // ... +} catch (exception) { + HandleException; } - ``` Unfortunately, the same thing doesn’t directly translate to asynchronous code. - ```javascript - foo = doSomethingAsync(); -try{ - foo(); - // This doesn’t work as the error might not have been thrown yet -} catch ( exception ){ - handleException; +try { + foo(); + // This doesn’t work as the error might not have been thrown yet +} catch (exception) { + handleException; } - - ``` - - -Although most of the earlier papers did not talk about error handling, the Promises paper by Liskov and Shrira did acknowledge the possibility of failure in a distributed environment. To put this in Argus's perspective, the 'claim' operation waits until the promise is ready. Then it returns normally if the call terminated normally, and otherwise it signals the appropriate 'exception', e.g., +Although most of the earlier papers did not talk about error handling, the Promises paper by Liskov and Shrira did acknowledge the possibility of failure in a distributed environment. To put this in Argus's perspective, the 'claim' operation waits until the promise is ready. Then it returns normally if the call terminated normally, and otherwise it signals the appropriate 'exception', e.g. ``` y: real := pt$claim(x) @@ -476,27 +434,29 @@ y: real := pt$claim(x) when unavailable(s: string): . when failure(s: string): . . end - ``` + Here x is a promise object of type pt; the form pi$claim illustrates the way Argus identifies an operation of a type by concatenating the type name with the operation name. When there are communication problems, RPCs in Argus terminate either with the 'unavailable' exception or the 'failure' exception. -'Unavailable' - means that the problem is temporary, e.g., communication is impossible right now. -'Failure' - means that the problem is permanent, e.g., the handler’s guardian does not exist. -Thus stream calls (and sends) whose replies are lost because of broken streams will terminate with one of these exceptions. Both exceptions have a string argument that explains the reason for the failure, e.g., future(“handler does not exist”), or unavailable(“cannot communicate”). Since any call can fail, every handler can raise the exceptions failure and unavailable. In this paper they also talked about propagation of exceptions from the called procedure to the caller. In paper about E language they talk about broken promises and setting a promise to the exception of broken references. + +* **Unavailable** means that the problem is temporary e.g. communication is impossible right now. 
+* **Failure** means that the problem is permanent e.g. the handler’s guardian does not exist. + +Thus stream calls (and sends) whose replies are lost because of broken streams will terminate with one of these exceptions. Both exceptions have a string argument that explains the reason for the failure, e.g., `future("handler does not exist")` or `unavailable("cannot communicate")`. Since any call can fail, every handler can raise the exceptions failure and unavailable. In this paper they also talked about propagation of exceptions from the called procedure to the caller. In the paper about the E language they talk about broken promises and setting a promise to the exception of broken references. + +### Modern Languages In modern languages like Scala, Promises generally come with two callbacks. One to handle the success case and other to handle the failure. e.g. ```scala - f onComplete { - case Success(data) => handleSuccess(data) - case Failure(e) => handleFailure(e) + case Success(data) => handleSuccess(data) + case Failure(e) => handleFailure(e) } ``` -In Scala, the Try type represents a computation that may either result in an exception, or return a successfully computed value. For example, Try[Int] represents a computation which can either result in Int if it's successful, or return a Throwable if something is wrong. +In Scala, the `Try` type represents a computation that may either result in an exception, or return a successfully computed value. For example, `Try[Int]` represents a computation which can either result in `Int` if it's successful, or return a `Throwable` if something is wrong. ```scala - val a: Int = 100 val b: Int = 0 def divide: Try[Int] = Try(a/b) @@ -507,128 +467,115 @@ divide match { case Failure(e) => println(e) // java.lang.ArithmeticException: / by zero } - ``` -Try type can be pipelined, allowing for catching exceptions and recovering from them along the way. +The `Try` type can be pipelined, allowing for catching exceptions and recovering from them along the way. -#### In Javascript -```javascript +A similar pattern for handling exceptions can be seen in JavaScript. +```javascript promise.then(function (data) { - // success callback - console.log(data); + // success callback + console.log(data); }, function (error) { - // failure callback - console.error(error); + // failure callback + console.error(error); }); - ``` + Scala futures exception handling: -When asynchronous computations throw unhandled exceptions, futures associated with those computations fail. Failed futures store an instance of Throwable instead of the result value. Futures provide the onFailure callback method, which accepts a PartialFunction to be applied to a Throwable. TimeoutException, scala.runtime.NonLocalReturnControl[] and ExecutionException exceptions are treated differently +When asynchronous computations throw unhandled exceptions, futures associated with those computations fail. Failed futures store an instance of `Throwable` instead of the result value. Futures provide the onFailure callback method, which accepts a PartialFunction to be applied to a `Throwable`. `TimeoutException`, `scala.runtime.NonLocalReturnControl[]` and `ExecutionException` exceptions are treated differently Scala promises exception handling: -When failing a promise with an exception, three subtypes of Throwables are handled specially. If the Throwable used to break the promise is a scala.runtime.NonLocalReturnControl, then the promise is completed with the corresponding value. 
If the Throwable used to break the promise is an instance of Error, InterruptedException, or scala.util.control.ControlThrowable, the Throwable is wrapped as the cause of a new ExecutionException which, in turn, is failing the promise.
-
+When failing a promise with an exception, three subtypes of `Throwable` are handled specially. If the `Throwable` used to break the promise is a `scala.runtime.NonLocalReturnControl`, then the promise is completed with the corresponding value. If the `Throwable` used to break the promise is an instance of `Error`, `InterruptedException`, or `scala.util.control.ControlThrowable`, the `Throwable` is wrapped as the cause of a new `ExecutionException` which, in turn, fails the promise.

-To handle errors with asynchronous methods and callbacks, the error-first callback style ( which we've seen before, also adopted by Node) is the most common convention. Although this works, but it is not very composable, and eventually takes us back to what is called callback hell. Fortunately, Promises allow asynchronous code to apply structured error handling. Promises .then method takes in two callbacks, a onFulfilled to handle when a promise is resolved successfully and a onRejected to handle if the promise is rejected.
+To handle errors with asynchronous methods and callbacks, the error-first callback style (which we've seen before, and has also been adopted by Node.js) is the most common convention. Although this works, it is not very composable, and eventually takes us back to what is called callback hell. Fortunately, Promises allow asynchronous code to apply structured error handling. A promise's `then` method takes in two callbacks: an `onFulfilled` to handle when the promise is resolved successfully and an `onRejected` to handle when the promise is rejected.

```javascript
-
-var p = new Promise(function(resolve, reject){
+var p = new Promise(function(resolve, reject) {
  resolve(100);
});

-p.then(function(data){
+p.then(function(data) {
  console.log(data); // 100
-},function(error){
+}, function(error) {
  console.error(error);
});

-var q = new Promise(function(resolve, reject){
+var q = new Promise(function(resolve, reject) {
  reject(new Error( {'message':'Divide by zero'} ));
});

-q.then(function(data){
+q.then(function(data) {
  console.log(data);
-},function(error){
-  console.err(error);// {'message':'Divide by zero'}
+}, function(error) {
+  console.error(error); // {'message':'Divide by zero'}
});
```
-
Promises also have a `catch` method, which works the same way as the onFailure callback, but also helps deal with errors in a composition. Exceptions in promises behave the same way as they do in a synchronous block of code: they jump to the nearest exception handler.

```javascript
function work(data) {
-  return Promise.resolve(data+"1");
+  return Promise.resolve(data + "1");
}

function error(data) {
-  return Promise.reject(data+"2");
+  return Promise.reject(data + "2");
}

function handleError(error) {
-  return error +"3";
+  return error + "3";
}
-
work("")
-.then(work)
-.then(error)
-.then(work) // this will be skipped
-.then(work, handleError)
-.then(check);
+  .then(work)
+  .then(error)
+  .then(work) // this will be skipped
+  .then(work, handleError)
+  .then(check);

function check(data) {
-  console.log(data == "1123");
-  return Promise.resolve();
+  console.log(data == "1123");
+  return Promise.resolve();
}
-
```

The same behavior can be written using a catch block.
```javascript - work("") -.then(work) -.then(error) -.then(work) -.catch(handleError) -.then(check); + .then(work) + .then(error) + .then(work) + .catch(handleError) + .then(check); function check(data) { - console.log(data == "1123"); - return Promise.resolve(); + console.log(data == "1123"); + return Promise.resolve(); } - ``` - ## Futures and Promises in Action - ### Twitter Finagle +Finagle is a protocol-agnostic, asynchronous RPC system for the JVM that makes it easy to build robust clients and servers in Java, Scala, or any other JVM language. It uses Futures to encapsulate concurrent tasks. Finagle introduces two other abstractions built on top of Futures to reason about distributed software: -Finagle is a protocol-agnostic, asynchronous RPC system for the JVM that makes it easy to build robust clients and servers in Java, Scala, or any JVM-hosted language. It uses Futures to encapsulate concurrent tasks. Finagle -introduces two other abstractions built on top of Futures to reason about distributed software : - -- **Services** are asynchronous functions which represent system boundaries. +- **Services** are asynchronous functions which represent system boundaries. - **Filters** are application-independent blocks of logic like handling timeouts and authentication. In Finagle, operations describe what needs to be done, while the actual execution is left to be handled by the runtime. The runtime comes with a robust implementation of connection pooling, failure detection and recovery and load balancers. -Example of a Service: - +An example of a `Service`: ```scala @@ -636,20 +583,19 @@ val service = new Service[HttpRequest, HttpResponse] { def apply(request: HttpRequest) = Future(new DefaultHttpResponse(HTTP_1_1, OK)) } - ``` -A timeout filter can be implemented as : - -```scala -def timeoutFilter(d: Duration) = - { (req, service) => service(req).within(d) } +A timeout filter can be implemented as: +```scala +def timeoutFilter(d: Duration) = { + (req, service) => service(req).within(d) +} ``` - ### Correctables -Correctables were introduced by Rachid Guerraoui, Matej Pavlovic, and Dragos-Adrian Seredinschi at OSDI ‘16, in a paper titled Incremental Consistency Guarantees for Replicated Objects. As the title suggests, Correctables aim to solve the problems with consistency in replicated objects. They provide incremental consistency guarantees by capturing successive changes to the value of a replicated object. Applications can opt to receive a fast but possibly inconsistent result if eventual consistency is acceptable, or to wait for a strongly consistent result. Correctables API draws inspiration from, and builds on the API of Promises. Promises have a two state model to represent an asynchronous task, it starts in blocked state and proceeds to a ready state when the value is available. This cannot represent the incremental nature of correctables. Instead, Correctables have a updating state when it starts. From there on, it remains in updating state during intermediate updates, and when the final result is available, it transitions to final state. If an error occurs in between, it moves into an error state. Each state change triggers a callback. + +Correctables were introduced by Rachid Guerraoui, Matej Pavlovic, and Dragos-Adrian Seredinschi at OSDI ‘16, in a paper titled Incremental Consistency Guarantees for Replicated Objects. As the title suggests, Correctables aim to solve the problems with consistency in replicated objects. 
They provide incremental consistency guarantees by capturing successive changes to the value of a replicated object. Applications can opt to receive a fast but possibly inconsistent result if eventual consistency is acceptable, or to wait for a strongly consistent result. Correctables API draws inspiration from, and builds on the API of Promises. Promises have a two state model to represent an asynchronous task, it starts in blocked state and proceeds to a ready state when the value is available. This cannot represent the incremental nature of correctables. Instead, Correctables have a updating state when they start. From there on, they remain in an updating state during intermediate updates, and when the final result is available, they transition to final state. If an error occurs in between, they move into an error state. Each state change triggers a callback. <figure> <img src="./images/15.png" alt="timeline" /> @@ -657,12 +603,12 @@ Correctables were introduced by Rachid Guerraoui, Matej Pavlovic, and Dragos-Adr ### Folly Futures -Folly is a library by Facebook for asynchronous C++ inspired by the implementation of Futures by Twitter for Scala. It builds upon the Futures in the C++11 Standard. Like Scala’s futures, they also allow for implementing a custom executor which provides different ways of running a Future (thread pool, event loop etc). +Folly is a library by Facebook for asynchronous C++ inspired by the implementation of Futures by Twitter for Scala. It builds upon the Futures in the C++11 Standard. Like Scala’s futures, they also allow for implementing a custom executor which provides different ways of running a Future (thread pool, event loop etc). -### NodeJS Fiber -Fibers provide coroutine support for v8 and node. Applications can use Fibers to allow users to write code without using a ton of callbacks, without sacrificing the performance benefits of asynchronous IO. Think of fibers as light-weight threads for NodeJs where the scheduling is in the hands of the programmer. The node-fibers library doesn’t recommend using raw API and code together without any abstractions, and provides a Futures implementation which is ‘fiber-aware’. +### Node.js Fibers +Fibers provide coroutine support for V8 and Node.js. Applications can use Fibers to allow users to write code without using a ton of callbacks, without sacrificing the performance benefits of asynchronous IO. Think of fibers as light-weight threads for Node.js where the scheduling is in the hands of the programmer. The node-fibers library doesn’t recommend using raw API and code together without any abstractions, and provides a Futures implementation which is ‘fiber-aware’. ## References diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 295a307..a6ec35f 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -1,10 +1,9 @@ --- layout: page title: "Distributed Programming Languages" -by: "Joe Schmoe and Mary Jane" +by: "Connor Zanin" --- - ## Problems of Distributed Programming There are problems that exist in distributed system environments that do not exist in single-machine environments. 
@@ -14,7 +13,7 @@ In this section, we present an overview of these three problems and their impact ### Partial Failure -In the case of a crash on a local environment, either the machine has failed (total failure), or the source of the crash can be learned from a central resource manager such as the operating system {% cite waldo1997 --file dist-langs.bib %} +In the case of a crash on a local environment, either the machine has failed (total failure), or the source of the crash can be learned from a central resource manager such as the operating system. {% cite waldo1997 --file dist-langs.bib %} If an application consists of multiple communicating processes partial failure is possible, however because the cause of the partial failure can be determined, this kind of partial failure can be repaired given the operating system's knowledge. For example, a process can be restored based on a checkpoint, another process in the application can query the operating system about the failed process' state, etc. @@ -23,7 +22,7 @@ In a distributed environment, there is no (reliable) central manager that can re Further, due to the inherent concurrency in a distributed system, nondeterminism is a problem that must be considered when designing distributed models, languages, and systems. Communication is perhaps the most obvious example of this; messages may be lost or arrive out-of-order. Finally, unlike in a local environment where failure returns control to the caller, failure may not be reported or the response may simply vanish. -Because of this, distributed communication must be designed expecting partial failure, and be able to "fail gracefully." +Because of this, distributed communication must be designed expecting partial failure, and be able to "fail gracefully". Several methods have been developed to deal with the problem of partial failure. One method, made popular with batch processing and MapReduce style frameworks, is to remember the series of computations needed to obtain a result and recompute the result in the case of failure. @@ -43,7 +42,7 @@ In other systems, like Argus, objects are reconstructed from state that is autom ### Consistency (Concurrency) -If computing on shared data can be avoided, parallel computations would not be bottlenecked by serialized accesses. +If computing on shared data can be avoided, parallel computations will not be bottlenecked by serialized accesses. Unfortunately, there are many instances where operating on shared data is necessary. While problems with shared data can be dealt with fairly simply in the local case, distribution introduces problems that make consistency more complex. @@ -87,7 +86,7 @@ Although their may be subtle timing differences that arise from contention from As well, machine hardware is constant. There are no changes to the latency of communication channels on a single machine. -Distribution introduces network topology. +Distribution of computation introduces a network topology which changes many of the underlying assumptions around latency. This topology significantly (orders of magnitude) increases the latency of communication, as well as introduces a source of nondeterminism. At any time, routing protocols or hardware changes (or both) may cause the latency between two machines to change. Therefore, distributed applications may not rely on specific timings of communication in order to function. 
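To make this point concrete, here is a small, hypothetical Scala sketch (the `fetchRemote` call and the two-second bound are illustrative assumptions, not part of any system discussed in this chapter). Instead of assuming the network answers within some "typical" time, the caller bounds the wait explicitly and degrades gracefully when the bound is exceeded.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

// Hypothetical stand-in for any remote call whose latency may change at any time.
def fetchRemote(): String = { Thread.sleep(50); "payload" }

// Bound the wait explicitly rather than relying on an assumed network latency.
val reply: Try[String] = Try(Await.result(Future(fetchRemote()), 2.seconds))

reply.fold(
  err  => println(s"falling back, remote did not answer in time: $err"),
  data => println(s"received: $data")
)
```

The point of the sketch is only that the timeout is a design decision made by the application, not a property it can assume of the network.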
@@ -472,12 +471,12 @@ Another definition is offered (and commonly cited) by {% cite Deursen2000 --file Generally, I would refer to a domain-specific language (DSL) as a *system*, be it a standalone language, compiler extension, library, set of macros, etc., that is designed for a set of cohesive operations to be easily expressed. -For example, the python twitter library is designed for easily expressing operations that manage a twitter account. +For example, the Python Twitter library is designed for easily expressing operations that manage a Twitter account. The problem in defining this term (I believe) is the the vagueness of the components *domain* and *language*. Depending on the classification, a set of problems designated in a certain domain may span a "wide" or "narrow" scope. -For example, does "tweeting" qualify as a domain (within the twitter library)? -Would "social media sharing" qualify as a domain (containing the twitter library)? +For example, does "tweeting" qualify as a domain (within the Twitter library)? +Would "social media sharing" qualify as a domain (containing the Twitter library)? For my purposes I will accept the definition of a domain as a "well-defined, cohesive set of operations." It is also difficult to come up with a definition for a language. @@ -513,7 +512,6 @@ Another problem that drives DSL development towards libraries is ecosystem. In order for a DSL to be adopted, there must be a body of developers that can incorporate the DSL into existing systems. If either (1) the DSL does not incorporate well with existing code bases or (2) the DSL requires significant investment to learn, adoption will be less likely. - ## References {% bibliography --file dist-langs %} diff --git a/chapter/5/langs-extended-for-dist.md b/chapter/5/langs-extended-for-dist.md index 70a33fc..6d06b11 100644 --- a/chapter/5/langs-extended-for-dist.md +++ b/chapter/5/langs-extended-for-dist.md @@ -1,25 +1,16 @@ --- layout: page -title: "General Purpose Languages Extended for Distribution" +title: "General Purpose Languages Extended for Distribution" by: "Sam Caldwell" --- ## Introduction -In very general terms, a distributed system is comprised of nodes -that internally perform computation and communicate with each other. -Therefore programming a distributed system requires two distinct -models: one for the computation on each node and one to model the -network of communications between nodes. +In very general terms, a distributed system is comprised of nodes that internally perform computation and communicate with each other. Therefore programming a distributed system requires two distinct models: one for the computation on each node and one to model the network of communications between nodes. A slightly-secondary concern is fault-tolerance. Failure of nodes and communication is one of the defining aspects of distributed computing {% cite note-on-dc --file langs-extended-for-dist %}. A programming model for distributed systems then must either implement a strategy for handling failures or equip programmers to design their own. -Nodes can perform computation in any of the various paradigms -(imperative, functional, object-oriented, relational, and so on). The -models are not completely orthogonal. Some matters necessarily -concern both. Serialization is a communication concern that is greatly -influenced by the design of the computation language. Similar concerns -affect the means of deployment and updating systems. 
+Nodes can perform computation in any of the various paradigms (imperative, functional, object-oriented, relational, and so on). The models are not completely orthogonal. Some matters necessarily concern both. Serialization is a communication concern that is greatly influenced by the design of the computation language. Similar concerns affect the means of deployment and updating systems. Early designs of distributed programming models in the late 1970s focused on *building novel programming languages* (Eden, Argus, Emerald). As time has gone on researchers shifted towards *extending existing languages* with facilities for programming distributed system. This article explores the history of these designs and the tradeoffs to be made between the two approaches. @@ -28,80 +19,48 @@ Early designs of distributed programming models in the late 1970s focused on *bu The different approaches to implementing a distributed programming model differ on the tradeoffs they offer to both language designers and users. ### Clean-slate language implementations -A straightforward approach to implementing a new language is to start from scratch. Beginning with a clean slate offers several advantages. First, the implementor has complete control (up to what they are willing and able to implement) over every aspect of the design. Second, some elements of language design are difficult to integrate with other languages. Type systems are a good example of the problem. How to combine the designs of two type systems into one that has the properties of both is an open problem. If types play a prominent role in the language than starting anew may be the only option. However, as explained below, this strategy has apparent drawbacks not only in term of implementation effort for the creator(s) but also for users. + +A straightforward approach to implementing a new language is to start from scratch. Beginning with a clean slate offers several advantages. First, the implementor has complete control (up to what they are willing and able to implement) over every aspect of the design. Second, some elements of language design are difficult to integrate with other languages. Type systems are a good example of the problem. How to combine the designs of two type systems into one that has the properties of both is an open problem. If types play a prominent role in the language then starting anew may be the only option. However, this strategy has apparent drawbacks not only in term of implementation effort for the creator(s) but also for users. ### Extension to an existing language + Another option is to use another language as a starting point and modify that language’s implementation for your own purposes. Think of it as forking a compiler and/or runtime and then making your own modifications. An advantage of this approach is the savings in implementation effort. If the computation models coincide then the language designer only has to implement the communication model and fault-tolerance strategy. Another plus is that users of the existing language may be more likely to consider trying or adopting the new language since it is only a step away from what they already know. A downside for maintaining any fork is keeping up to date with upstream changes. ### Library + The final approach to consider is similar to extending a language, but this time doing so only by writing code in that language to create new abstractions implementing a language model. 
This strategy offers similar savings in implementation effort to extending a language (perhaps more so). But a major benefit is that is significantly easier for users to use and adopt; the only new concepts they must learn are the specifics of the programming model, as opposed to other concerns such as syntax. ## Roadmap + Language designers have explored a broad spectrum of designs for distributed programming models. ### Actors -One common sense and popular approach to extending a language for -distribution is to implement the constructs of the Actor model (discussed in chapter 3). This -is the approach taken by Termite Scheme, CloudHaskell, and Scala -Actors to name a few. Doing so requires two steps: first, implementing facilities -for spawning and linking actors and sending and receiving messages; and second, fitting actor-style concurrency on top of the language’s concurrency model. - - -Erlang/OTP {% cite Erlang --file langs-extended-for-dist %} is a language designed with the intention of -building resilient distributed applications for operating telephony -switches. Erlang provides a simple base language and an extensive -library of facilities for distribution. Erlang/OTP fits into the -Actor model of distribution: the basic agents in a system are -processes which communicate by sending messages to each other. -Erlang/OTP stands out for the extent to which fault tolerance is -considered. Erlang programmers are encouraged to think about failure -as a routine occurrence and provides libraries for describing policies -on how to handle such failures. Chapter 3 includes a more detailed overview of Erlang. - -TermiteScheme {% cite TermiteScheme --file langs-extended-for-dist %} is an extension to a Scheme -language with constructs for Erlang-style distribution. The primary -innovations of Termite are found by leveraging features of the host -language. In particular, the features of interest are macros and -continuations. - -Macros allow users or library developers to design higher-level abstractions than -the actor model and treat them as first-class in the language. Macros are a very powerful tool for abstraction. They allow library authors to elevate many patterns, such as patterns of communication, to first-class constructs. A simple example is a construct for RPC implemented by TermiteScheme as a macro expanding to a send followed by a receive. - -A continuation is a concrete representation of how to finish computation of a program. A helpful analogy is to the call-stack in a procedural language. The call-stack tells you, once you’ve finished the current procedure, what to do next. Likewise, a continuation tells you, once you’ve finished evaluating the current expression, what to do next. Languages with first-class continuations have a way of reifying this concept, historically named `call/cc`, an abbreviation for `call-with-current-continuation`. First-class continuations allow for simple process migration; a process can capture its continuation and use that as the behavior of a spawned -actor on another node. - -In addition to supporting the classic actor style which places few if -any constraints on what messages may be sent, Haskell and Scala based -implementations leverage the host language's powerful type system to -create more structured communication patterns. Both CloudHaskell and -Akka, the successor to Scala Actors, provide *typed - channels* between actors. 
For example, the type -checker will reject a program where an actor expecting to receive -numbers is instead sent a string. Anecdotally, typed actors in Akka are not commonly used. This might suggest that errors due to incorrect message types are not that serious of a concern for users. + +One common sense and popular approach to extending a language for distribution is to implement the constructs of the Actor model (discussed in chapter 3). This is the approach taken by Termite Scheme, CloudHaskell, and Scala Actors to name a few. Doing so requires two steps: first, implementing facilities for spawning and linking actors and sending and receiving messages; and second, fitting actor-style concurrency on top of the language’s concurrency model. + +Erlang/OTP {% cite Erlang --file langs-extended-for-dist %} is a language designed with the intention of building resilient distributed applications for operating telephony switches. Erlang provides a simple base language and an extensive library of facilities for distribution. Erlang/OTP fits into the Actor model of distribution: the basic agents in a system are processes which communicate by sending messages to each other. Erlang/OTP stands out for the extent to which fault tolerance is considered. Erlang programmers are encouraged to think about failure as a routine occurrence and provides libraries for describing policies on how to handle such failures. Chapter 3 includes a more detailed overview of Erlang. + +TermiteScheme {% cite TermiteScheme --file langs-extended-for-dist %} is an extension to a Scheme language with constructs for Erlang-style distribution. The primary innovations of Termite are found by leveraging features of the host language. In particular, the features of interest are macros and continuations. + +Macros allow users or library developers to design higher-level abstractions than the actor model and treat them as first-class in the language. Macros are a very powerful tool for abstraction. They allow library authors to elevate many patterns, such as patterns of communication, to first-class constructs. A simple example is a construct for RPC implemented by TermiteScheme as a macro expanding to a send followed by a receive. + +A continuation is a concrete representation of how to finish computation of a program. A helpful analogy is to the call-stack in a procedural language. The call-stack tells you, once you’ve finished the current procedure, what to do next. Likewise, a continuation tells you, once you’ve finished evaluating the current expression, what to do next. Languages with first-class continuations have a way of reifying this concept, historically named `call/cc`, an abbreviation for `call-with-current-continuation`. First-class continuations allow for simple process migration; a process can capture its continuation and use that as the behavior of a spawned actor on another node. + +In addition to supporting the classic actor style which places few if any constraints on what messages may be sent, Haskell and Scala based implementations leverage the host language's powerful type system to create more structured communication patterns. Both CloudHaskell and Akka, the successor to Scala Actors, provide *typed channels* between actors. For example, the type checker will reject a program where an actor expecting to receive numbers is instead sent a string. Anecdotally, typed actors in Akka are not commonly used. This might suggest that errors due to incorrect message types are not that serious of a concern for users. 
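To make the idea of typed channels concrete, here is a minimal sketch using the Akka Typed API (an assumption on our part; it is the `akka-actor-typed` module rather than the exact typed-channel facilities the text refers to). The actor's protocol is fixed by the `Command` type, so sending a message of the wrong type is rejected at compile time instead of failing at runtime.

```scala
import akka.actor.typed.{ActorRef, ActorSystem, Behavior}
import akka.actor.typed.scaladsl.Behaviors

// The protocol this actor understands; nothing outside `Command` can be sent to it.
sealed trait Command
final case class Increment(amount: Int) extends Command
final case class GetValue(replyTo: ActorRef[Int]) extends Command

def counter(value: Int): Behavior[Command] =
  Behaviors.receiveMessage {
    case Increment(amount) => counter(value + amount)
    case GetValue(replyTo) =>
      replyTo ! value
      Behaviors.same
  }

val system: ActorSystem[Command] = ActorSystem(counter(0), "typed-counter")
system ! Increment(2)
// system ! "two"   // does not compile: String is not a Command
```

Note that the request/response pattern is expressed by the `GetValue` message carrying its own typed reply channel (`ActorRef[Int]`), which is the usual style once channels between actors are typed.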
One disadvantage of this approach of implementing the simple actor model of Erlang is that Erlang also provides an extensive support platform for creating and deploying distributed systems. Even after the Erlang model has been implemented in Scheme, it is a long road to feature-parity. ### Types -Several efforts have explored the interplay between statically typed -languages and distributed computing. Research is focused on extending functional languages like SML, Haskell, and Scala which feature relatively advanced type systems. Areas of investigation include how to integrate an existing model of distribution with a type system and how to take advantage of types to help programmers. +Several efforts have explored the interplay between statically typed languages and distributed computing. Research is focused on extending functional languages like SML, Haskell, and Scala which feature relatively advanced type systems. Areas of investigation include how to integrate an existing model of distribution with a type system and how to take advantage of types to help programmers. -CloudHaskell {% cite CloudHaskell --file langs-extended-for-dist %} is an extension to Haskell. -CloudHaskell is largely designed in the mold of Erlang -style process-based actors. It offers a novel typeclass based approach for -ensuring safe serialization (as in only serializing things known to be -serializable). Typeclasses {% cite typeclasses --file langs-extended-for-dist %} are a novel feature of Haskell’s language system that enable flexible overloading. Typeclass *instances* can be *derived automatically* {% cite deriving-typeclasses --file langs-extended-for-dist %} to reduce boilerplate. For example, the code to perform serialization and deserialization can be automatically generated for each kind of message using its definition. CloudHaskell takes the stance that the cost of communication operations should be readily apparent to the programmer. A consequence of this is that calls to the (automatically generated) serialization functions for messages must be explicitly inserted into the code. For example, in Erlang where we might write `send msg` in CloudHaskell we write `send (serialize msg)`. CloudHaskell is implemented as an extension to GHC Haskell (actually it was first implemented as a library then later on the features they needed were added as extensions to GHC). +CloudHaskell {% cite CloudHaskell --file langs-extended-for-dist %} is an extension to Haskell. CloudHaskell is largely designed in the mold of Erlang style process-based actors. It offers a novel typeclass based approach for ensuring safe serialization (as in only serializing things known to be serializable). Typeclasses {% cite typeclasses --file langs-extended-for-dist %} are a novel feature of Haskell’s language system that enable flexible overloading. Typeclass *instances* can be *derived automatically* {% cite deriving-typeclasses --file langs-extended-for-dist %} to reduce boilerplate. For example, the code to perform serialization and deserialization can be automatically generated for each kind of message using its definition. CloudHaskell takes the stance that the cost of communication operations should be readily apparent to the programmer. A consequence of this is that calls to the (automatically generated) serialization functions for messages must be explicitly inserted into the code. For example, in Erlang where we might write `send msg` in CloudHaskell we write `send (serialize msg)`. 
CloudHaskell is implemented as an extension to GHC Haskell (actually it was first implemented as a library then later on the features they needed were added as extensions to GHC). The fact that CloudHaskell allows concurrency within an actor demonstrates the separation between computation and communication models. That is, in CloudHaskell a single process might internally be implemented by a group of threads. To the rest of the system this fact is completely opaque (a good thing). CloudHaskell also takes a stance against “everything is serializable”, for example mutable reference cells should not be. This makes serializing functions (closures) tricky because it is not apparent how to tell if a closure’s environment is serializable. To solve this, CloudHaskell uses *static pointers* and forces closures to use pre-serialized environments. Pre-serialization means that a closure is a pair of a closed expression and a byte string representing the environment; presumably the first code the closure will execute when invoked is to deserialize the environment to retrieve the proper information. -ML5 {% cite ML5 --file langs-extended-for-dist %} is an extension to -the SML {% cite sml --file langs-extended-for-dist %} programming language. ML5 uses the notion of -*location* and *movability*. The type system keeps track -of the locations (nodes) in the system, which location each piece of -the program is, and what can and cannot be moved between locations. As -a result, the system is able to detect unpermitted attempts to access a resource. The ML5 type system cleanly integrates with the ML type system. ML5 supports type inference, a tricky but fundamental feature of SML. +ML5 {% cite ML5 --file langs-extended-for-dist %} is an extension to the SML {% cite sml --file langs-extended-for-dist %} programming language. ML5 uses the notion of *location* and *movability*. The type system keeps track of the locations (nodes) in the system, which location each piece of the program is, and what can and cannot be moved between locations. As a result, the system is able to detect unpermitted attempts to access a resource. The ML5 type system cleanly integrates with the ML type system. ML5 supports type inference, a tricky but fundamental feature of SML. Though applicable to distributed systems in general, the presentation of ML5 focuses in particular on client-server interactions, and more specifically on those between a web front-end and back-end. The ML5 compiler is able to generate Javascript code that executes in-browser. @@ -113,9 +72,10 @@ extern val alert : string -> unit @ home do alert [Hello, home!] ``` -This snippet declares a location in the system called `home` that can execute Javascript code. Additionally, there is a function named `alert` that can be used from the location `home` (defined through an interface to plain Javascript). The last line is an example of calling the `alert` function with the string `”Hello, home!”` (yes, strings are in brackets). +This snippet declares a location in the system called `home` that can execute Javascript code. Additionally, there is a function named `alert` that can be used from the location `home` (defined through an interface to plain Javascript). The last line is an example of calling the `alert` function with the String `"Hello, home!"`. 
(In ML5 the String type is indicated by square brackets) A slightly more complex example (again from {% cite ML5 --file langs-extended-for-dist %}) demonstrates movement between locations: + ``` extern bytecode world server extern val server : server addr @ home @@ -123,62 +83,43 @@ extern val version : unit -> string @ server extern val alert : string -> unit @ home do alert (from server get version ()) ``` + This example features two locations (worlds), the default `home` location and one named `server`. As before, we can call `alert` from `home`, but now we can also call `version` on `server`. Locations access one another using addresses; the declaration of `server` as a `server addr @ home` says that the home location knows the address of the `server` location and can therefore access `version`. -ML5 does not consider fault-tolerance. For the web front/back-end examples this is perhaps not too big of an issue. Still, the absence of a compelling strategy for handling failures is a shortcoming of the design. Since `home` is the default location, the same way that `main` is the default entry point for C programs, this program typechecks. An error would be raised if the `alert` function was called from a different location. +ML5 does not consider fault-tolerance. For the web front/back-end examples this is perhaps not too big of an issue, but the absence of a compelling strategy for handling failures is a shortcoming of the design. Since `home` is the default location, the same way that `main` is the default entry point for C programs, this program typechecks. An error would be raised if the `alert` function was called from a different location. -AliceML {% cite Alice --file langs-extended-for-dist %} is an -example of another extension to SML. AliceML leverages SML's advanced module -system to explore building *open* distributed systems. Open -systems are those where each node is only loosely coupled to its -peers, meaning it allows for dynamic update and replacements. AliceML -enables this by forcing components to program to interfaces and -dynamically type-checking components as they enter the system. +AliceML {% cite Alice --file langs-extended-for-dist %} is an example of another extension to SML. AliceML leverages SML's advanced module system to explore building *open* distributed systems. Open systems are those where each node is only loosely coupled to its peers, meaning it allows for dynamic update and replacements. AliceML enables this by forcing components to program to interfaces and dynamically type-checking components as they enter the system. ### Objects -Object-Oriented languages have been extended for distribution in -various fashions. Objects are an appealing model for agents in a -distributed system. Indeed, Alan Kay's metaphor of objects as -miniature computers and method invocation as message passing -{% cite smalltalkHistory --file langs-extended-for-dist %} seems directly evocative of distributed -systems. - -Eden {% cite Eden --file langs-extended-for-dist %} was an early pioneer (the project began in 1979) in -both distributed and object-oriented programming languages. In fact, -Eden can be classified as a language extended for distribution: Eden -applications “were written in the Eden Programming Language (EPL) — a -version of Concurrent Euclid to which the Eden team had added -support for remote object invocation” {% cite bhjl07 --file langs-extended-for-dist %}. 
However, Eden
-did not take full advantage of the overlap (namely, objects) between
-the computation and distribution models.
+
+Object-Oriented languages have been extended for distribution in various fashions. Objects are an appealing model for agents in a distributed system. Alan Kay's metaphor of objects as miniature computers and method invocation as message passing {% cite smalltalkHistory --file langs-extended-for-dist %} seems directly evocative of distributed systems.
+
+Eden {% cite Eden --file langs-extended-for-dist %} was an early pioneer (the project began in 1979) in both distributed and object-oriented programming languages. Eden can be classified as a language extended for distribution: Eden applications “were written in the Eden Programming Language (EPL) — a version of Concurrent Euclid to which the Eden team had added support for remote object invocation” {% cite bhjl07 --file langs-extended-for-dist %}. However, Eden did not take full advantage of the overlap (namely, objects) between the computation and distribution models.

See Chapter 4 for a more extensive overview of the languages using an object-based model for distribution, such as Emerald, Argus, and E.

### Batch Processing
-Another common extension is in the domain of batch processing
-large-scale data. MapReduce (and correspondingly Hadoop) is the landmark example of a programming model for distributed batch processing. Essentially, batch processing systems present a restricted programming model. For example, MapReduce programs must fit into the rigid model of two-step produce then aggregate. A restricted model alllows the system to make more guarantees, such as for fault tolerance. Subsequently language designers have sought to increase the expressiveness of the programming models and to boost performance. Chapter 8 covers batch processing languages in more detail.
+
+Another common extension is in the domain of batch processing large-scale data. MapReduce (and correspondingly Hadoop) is the landmark example of a programming model for distributed batch processing. Essentially, batch processing systems present a restricted programming model. For example, MapReduce programs must fit into the rigid model of two-step produce then aggregate. A restricted model allows the system to make more guarantees, such as for fault tolerance. Subsequently, language designers have sought to increase the expressiveness of the programming models and to boost performance. Chapter 8 covers batch processing languages in more detail.

MBrace {% cite MBrace --file langs-extended-for-dist %} is an extension to the F# programming language for writing computations for clusters. MBrace provides a *monadic* interface which allows for building cluster jobs out of smaller jobs while still exploiting available parallelism. MBrace is a framework that features its own runtime for provisioning, scheduling, and monitoring nodes in a cluster.

-Batch processing systems offer an interesting take on how to handle
-fault tolerance. The common approach taken is to use a central
-coordinator (for example, on the machine that initiated the job) to
-detect the failure of nodes. By tracking what each other node in the
-system is doing, the coordinator can restart a task on failure.
+Batch processing systems offer an interesting take on how to handle fault tolerance. The common approach taken is to use a central coordinator (for example, on the machine that initiated the job) to detect the failure of nodes.
By tracking what each other node in the system is doing, the coordinator can restart a task on failure. ### Consistency Several languages explore designs for ensuring the *consistency* of distributed applications, or the requirement that each node agree on some state. CRDTs are a family of distributed data structures that maintain consistency. Chapter 6 explains consistency and CRDTS in further detail while chapter 7 takes a closer look at the following languages. -Dedalus {% cite Dedalus --file langs-extended-for-dist %} and Bloom {% cite Bloom --file langs-extended-for-dist %} both represent uses of a logic-programming based attempt to design languages providing consistency guarantees. Dedalus is meant to provide the underlying model of distribution while Bloom provides a high-level language intended to be usable for creating applications. Logic-programming is an attractive model because execution is order-independent. Using Datalog, a logic-programming language, as the computation model encourages programmers to develop applications that are agnostic to message-reorderings. Bloom is implemented as a Ruby library BUD. +Dedalus {% cite Dedalus --file langs-extended-for-dist %} and Bloom {% cite Bloom --file langs-extended-for-dist %} both represent uses of a logic-programming based attempt to design languages providing consistency guarantees. Dedalus is meant to provide the underlying model of distribution while Bloom provides a high-level language intended to be usable for creating applications. Logic-programming is an attractive model because execution is order-independent. Using Datalog, a logic-programming language, as the computation model encourages programmers to develop applications that are agnostic to message-reorderings. Bloom is implemented as a Ruby library called Bud (Bloom Under Development). Lasp {% cite Lasp --file langs-extended-for-dist %} is a language model where CRDTs are the only distributed data. CRDT sets are the primary data structure and Lasp programs are built by composing set transformations such as maps, filters, and folds. Lasp is implemented as an Erlang library. In practice, applications using Lasp are free to use other forms of distributed data, but consistency is only promised for the library-provided tools. Lasp nodes then share the same fault-tolerance platform as Erlang applications. -Lasp’s programming model is very restrictive; sets are the primary data structure and folding operations. Future work may show how to enrich the programming model while still making the same strong consistency guarantees. +Lasp’s programming model is very restrictive; sets are the primary data structure and folding operations are the primary operations provided. Future work may show how to enrich the programming model while still making the same strong consistency guarantees. ### Tuplespaces + Linda {% cite Linda --file langs-extended-for-dist %} is a distributed programming model more in the vein of shared memory than message passing. Linda programs are a collection of processes and a shared *tuplespace*. The tuplespace holds an unordered collection of tuples. A process adds data to the tuplespace using the `out` form. The `in` form takes a *pattern* and removes and returns a tuple matching the pattern from the tuplespace, blocking until one such appears. 
A simple example demonstrating communication in Linda:

@@ -186,17 +127,17 @@ A simple example demonstrating communication in Linda:

```
;; INITIALLY the tuplespace is empty
;; Process A
-out (“Hello”, “from process A”);
+out ("Hello", "from process A");

-;; after Process A executes the tuplespace contains one record, (“Hello”, “from process A”)
+;; after Process A executes the tuplespace contains one record, ("Hello", "from process A")

;; Process B
-in(“Hello”, x);
-;; x = “from process A”
+in("Hello", x);
+;; x = "from process A"
;; the tuplespace is empty again
```

-If the tuplespace was not empty but instead contained tuples of some for *besides* 2-entry tuples where the first element was the string `”Hello”` the above interaction would remain unchanged. However, if some process C had entered the tuple `(“Hello”, “I am process C”)` then
+If the tuplespace was not empty but instead contained tuples of some form *besides* 2-entry tuples where the first element was the string `"Hello"`, the above interaction would remain unchanged. However, if some process C had entered the tuple `("Hello", "I am process C")` then
B would receive either A’s or C’s tuple, with the other tuple remaining in the tuplespace.

The Linda model of distribution has several advantages. Linda processes are not only spatially decoupled (able to execute on different machines, as in actor implementations) but *temporally* decoupled as well. That is, a Linda process can communicate with other processes even after it exits! Tuples remain in the shared space even after the process that created them exits. Therefore a different process can receive those tuples at any point in the future. Another point is that Linda processes don’t need to know the identities of the other processes they communicate with. Unlike actors, which in order to send a message need to know the address of the other actor’s mailbox, Linda processes operate entirely through their connection to the tuplespace. Processes need to “know exactly as much about one another as is appropriate for the programming situation at hand” {% cite Linda --file langs-extended-for-dist %}.

@@ -207,41 +148,19 @@ Maintaining the consistency of the tuplespace is of paramount importance for Lin

## Discussion

-The line between a language and a library can be extremely blurry
-{% cite LanguagesAsLibraries --file langs-extended-for-dist %}. A language provides some building blocks
-and forms for combining and composing {% cite 700pl --file langs-extended-for-dist %}. For example, numbers, strings, and Booleans are common primitive building blocks. Operations like addition, concatenation, and negation offer ways of combining such blocks. More interesting operations allow *abstraction* over parts of the language: function (or λ) abstraction allows for creating operations over all numbers. Objects provide a way of combining data and functions in a way that abstracts over particular implementation details. Most importantly, a language
-defines the *semantics*, or meaning, of such operations.
-
-Many libraries can
-be described in the same way. The first-order values provided by a
-library are the primitives while the provided functions, classes, or objects perform
-combination and composition. A library can implement the primitives
-and operations such that they are in correspondence with the forms and semantics defined by a language.
-
-A theme in the literature is presenting an idea as a language and
-implementing it as a library or extension to an existing language.
-Doing so allows the authors to analyze a minimal presentation of their -idea, but enjoy the benefits of the library/extension approach. Both Linda and Lasp take this approach, for example. - -Language models benefit from implementations {% cite Redex --file langs-extended-for-dist %}. The -reasons to implement a new distributed language model on top of -existing work are abundant: - -* As mentioned above, a distributed language includes both a model - for single-node computation as well as inter-node communication. - Building a distributed language from scratch also means - re-implementing one of the paradigms of computing, and we would rather re-use existing language implementations - -* Implementing real-world applications calls for models of many - domains besides computation and communication, such as persistence - and parsing. The availability of a rich repository of libraries is - an important concern for many users in adopting a new technology - {% cite socioPLT --file langs-extended-for-dist %}. - -* Many applications of distributed systems are extremely - performance sensitive. Performance engineering and fine-tuning are - time and labor intensive endeavors. It makes sense to re-use as much - of the work in existing language ecosystems as possible. +The line between a language and a library can be extremely blurry {% cite LanguagesAsLibraries --file langs-extended-for-dist %}. A language provides some building blocks and forms for combining and composing. {% cite 700pl --file langs-extended-for-dist %} For example, numbers, strings, and Booleans are common primitive building blocks. Operations like addition, concatenation, and negation offer ways of combining such blocks. More interesting operations allow *abstraction* over parts of the language: function (or λ) abstraction allows for creating operations over all numbers. Objects provide a way of combining data and functions in a way that abstracts over particular implementation details. Most importantly, a language defines the *semantics*, or meaning, of such operations. + +Many libraries can be described in the same way. The first-order values provided by a library are the primitives while the provided functions, classes, or objects perform combination and composition. A library can implement the primitives and operations such that they are in correspondence with the forms and semantics defined by a language. + +A theme in the literature is presenting an idea as a language and implementing it as a library or extension to an existing language. Doing so allows the authors to analyze a minimal presentation of their idea, but enjoy the benefits of the library/extension approach. Both Linda and Lasp take this approach, for example. + +Language models benefit from implementations {% cite Redex --file langs-extended-for-dist %}. The reasons to implement a new distributed language model on top of existing work are abundant: + +* As mentioned above, a distributed language includes both a model for single-node computation as well as inter-node communication. Building a distributed language from scratch also means re-implementing one of the paradigms of computing, and we would rather re-use existing language implementations + +* Implementing real-world applications calls for models of many domains besides computation and communication, such as persistence and parsing. The availability of a rich repository of libraries is an important concern for many users in adopting a new technology {% cite socioPLT --file langs-extended-for-dist %}. 
+ +* Many applications of distributed systems are extremely performance sensitive. Performance engineering and fine-tuning are time and labor intensive endeavors. It makes sense to re-use as much of the work in existing language ecosystems as possible. These reasons form the foundation of an argument in favor of general-purpose programming languages that allow for the creation of rich abstractions. Such a language can be adapted for different models of computation. (Although, most general-purpose languages have a fixed computation model such as functional or object-oriented). When models are implemented as libraries in a common language, users can mix-and-match different models to get exactly the right behavior where it is needed by the application. @@ -251,22 +170,12 @@ That being said, there are advantages to clean slate languages. A fresh start gr Erik Meijer points out in {% cite salesman --file langs-extended-for-dist %} that programmers don't jump to new technologies to access new features. Rather, in time those features make their way to old technologies. Examples abound, such as the arrival of λ in Java and of course all of Meijer's work in C# and Visual Basic. -Ideas from distributed programming models have influenced more than just the languages used in practice. Microservices {% cite microservices --file langs-extended-for-dist %} are a popular technique for architecting large systems as a collection of manageable components. Microservices apply the lessons learned from organizing programs -around actors to organizing the entire system. Actor programming uses -shared-nothing concurrent agents and emphasizes fault-tolerance. -Building a system using microservices means treating an entire block -of functionality as an actor: it can exchange messages with the other -components, but it can also go on and offline (or crash) independent -from the rest of the system. The result is more resilient and modular. +Ideas from distributed programming models have influenced more than just the languages used in practice. Microservices {% cite microservices --file langs-extended-for-dist %} are a popular technique for architecting large systems as a collection of manageable components. Microservices apply the lessons learned from organizing programs around actors to organizing the entire system. Actor programming uses shared-nothing concurrent agents and emphasizes fault-tolerance. Building a system using microservices means treating an entire block of functionality as an actor: it can exchange messages with the other components, but it can also go on and offline (or crash) independent from the rest of the system. The result is more resilient and modular. ## Conclusion -All design lies on a spectrum of tradeoffs; to gain convenience in one -area means to sacrifice in another. Distributed systems come with a -famous trade off: the CAP theorem. Building new language models on top -of an existing, flexible general-purpose language is an attractive -option for both implementors and users. +All design lies on a spectrum of tradeoffs; to gain convenience in one area means to sacrifice in another. Distributed systems come with a famous trade off: the CAP theorem. Building new language models on top of an existing, flexible general-purpose language is an attractive option for both implementors and users. ## References -{% bibliography --file langs-extended-for-dist %}
\ No newline at end of file +{% bibliography --file langs-extended-for-dist %} diff --git a/chapter/6/acidic-to-basic-how-the-database-ph-has-changed.md b/chapter/6/acidic-to-basic-how-the-database-ph-has-changed.md index c140ee0..9515384 100644 --- a/chapter/6/acidic-to-basic-how-the-database-ph-has-changed.md +++ b/chapter/6/acidic-to-basic-how-the-database-ph-has-changed.md @@ -4,7 +4,7 @@ title: "ACIDic to BASEic: How the database pH has changed" by: "Aviral Goel" --- -## 1. The **ACID**ic Database Systems +## The **ACID**ic Database Systems Relational Database Management Systems are the most ubiquitous database systems for persisting state. Their properties are defined in terms of transactions on their data. A database transaction can be either a single operation or a sequence of operations, but is treated as a single logical operation on the data by the database. The properties of these transactions provide certain guarantees to the application developer. The acronym **ACID** was coined by Andreas Reuter and Theo Härder in 1983 to describe them. @@ -16,19 +16,21 @@ Relational Database Management Systems are the most ubiquitous database systems * **Durability** guarantees that upon the completion of a transaction, the effects are applied permanently on the database and cannot be undone. They remain visible even in the event of power failures or crashes. This is done by ensuring that the changes are committed to disk (non-volatile memory). -<blockquote><p><b>ACID</b>ity implies that if a transaction is complete, the database state is structurally consistent (adhering to the rules of the schema) and stored on disk to prevent any loss.</p></blockquote> +<blockquote> + <p><b>ACID</b>ity implies that if a transaction is complete, the database state is structurally consistent (adhering to the rules of the schema) and stored on disk to prevent any loss.</p> +</blockquote> Because of the strong guarantees this model simplifies the life of the developer and has been traditionally the go to approach in application development. It is instructive to examine how these properties are enforced. -Single node databases can simply rely upon locking to ensure *ACID*ity. Each transaction marks the data it operates upon, thus enabling the database to block other concurrent transactions from modifying the same data. The lock has to be acquired both while reading and writing data. The locking mechanism enforces a strict linearizable consistency, i.e., all transactions are performed in a particular sequence and invariants are always maintained by them. An alternative, *multiversioning* allows a read and write operation to execute in parallel. Each transaction which reads data from the database is provided the earlier unmodified version of the data that is being modified by a write operation. This means that read operations don't have to acquire locks on the database. This enables read operations to execute without blocking write operations and write operations to execute without blocking read operations. +Single node databases can simply rely upon locking to ensure *ACID*ity. Each transaction marks the data it operates upon, thus enabling the database to block other concurrent transactions from modifying the same data. The lock has to be acquired both while reading and writing data. The locking mechanism enforces a strict linearizable consistency, i.e. all transactions are performed in a particular sequence and invariants are always maintained by them. 
An alternative, *multiversioning* allows a read and write operation to execute in parallel. Each transaction which reads data from the database is provided the earlier unmodified version of the data that is being modified by a write operation. This means that read operations don't have to acquire locks on the database. This enables read operations to execute without blocking write operations and write operations to execute without blocking read operations. This model works well on a single node. But it exposes a serious limitation when too many concurrent transactions are performed. A single node database server will only be able to process so many concurrent read operations. The situation worsens when many concurrent write operations are performed. To guarantee *ACID*ity, the write operations will be performed in sequence. The last write request will have to wait for an arbitrary amount of time, a totally unacceptable situation for many real time systems. This requires the application developer to decide on a **Scaling** strategy. -### 1.2. Scaling transaction volume +### Scaling transaction volume -To increase the volume of transactions against a database, two scaling strategies can be considered +To increase the volume of transactions against a database, two scaling strategies can be considered: -* **Vertical Scaling** is the easiest approach to scale a relational database. The database is simply moved to a larger computer which provides more transactional capacity. Unfortunately, its far too easy to outgrow the capacity of the largest system available and it is costly to purchase a bigger system each time that happens. Since its specialized hardware, vendor lock-in will add to further costs. +* **Vertical Scaling** is the easiest approach to scale a relational database. The database is simply moved to a larger computer which provides more transactional capacity. Unfortunately, its far too easy to outgrow the capacity of the largest system available and it is costly to purchase a bigger system each time that happens. * **Horizontal Scaling** is a more viable option and can be implemented in two ways. Data can be segregated into functional groups spread across databases. This is called *Functional Scaling*. Data within a functional group can be further split across multiple databases, enabling functional areas to be scaled independently of one another for even more transactional capacity. This is called *sharding*. @@ -40,59 +42,58 @@ Horizontal Scaling through functional partitioning enables high degree of scalab 2PC is a blocking protocol and updates can take from a few milliseconds up to a few minutes to commit. This means that while a transaction is being processed, other transactions will be blocked. So the application that initiated the transaction will be blocked. Another option is to handle the consistency across databases at the application level. This only complicates the situation for the application developer who is likely to implement a similar strategy if *ACID*ity is to be maintained. -## 2. The Distributed Concoction +## The Distributed Concoction A distributed application is expected to have the following three desirable properties: 1. **Consistency** - This is the guarantee of total ordering of all operations on a data object such that each operation appears indivisible. This means that any read operation must return the most recently written value. This provides a very convenient invariant to the client application. 
This definition of consistency is the same as the **Atomic**ity guarantee provided by relational database transactions. -2. **Availability** - Every request to a distributed system must result in a response. However, this is too vague a definition. Whether a node failed in the process of responding or it ran a really long computation to generate a response or whether the request or the response got lost due to network issues is generally impossible to determine by the client and willHence, for all practical purposes, availability can be defined as the service responding to a request in a timely fashion, the amount of delay an application can bear depends on the application domain. +2. **Availability** - Every request to a distributed system must result in a response. However, this is too vague a definition. Determining whether a node failed in the process of responding, ran a really long computation to generate a response, or whether the request or response got lost due to network issues is generally impossible from the client's perspective. This problem means that for all practical purposes, availability can be defined as the service responding to a request in a timely fashion, where the amount of delay an application can bear depends on the application domain. 3. **Partition Tolerance** - Partitioning is the loss of messages between the nodes of a distributed system. During a network partition, the system can lose arbitrary number of messages between nodes. A partition tolerant system will always respond correctly unless a total network failure happens. -Consistency requirement implies that every request will be treated atomically by the system even if the nodes lose messages due to network partitions. -Availability requirement implies that every request should receive a response even if a partition causes messages to be lost arbitrarily. +The consistency requirement implies that every request will be treated atomically by the system even if the nodes lose messages due to network partitions. The availability requirement implies that every request should receive a response even if a partition causes messages to be lost arbitrarily. -## 3. The CAP Theorem +## The CAP Theorem <figure class="main-container"> <img src="./resources/images/partitioned-network.jpg" alt="A partitioned network" /> <footer>A partitioned network</footer> </figure> -In the network above, all messages between the node set M and N are lost due to a network issue. The system as a whole detects this situation. There are two options - +In the network above, all messages between the node set M and N are lost due to a network issue. The system as a whole detects this situation. There are two options: 1. **Availability first** - The system allows any application to read and write to data objects on these nodes independently even though they are not able to communicate. The application writes to a data object on node M. Due to **network partition**, this change is not propagated to replicas of the data object in N. Subsequently, the application tries to read the value of that data object and the read operation executes in one of the nodes of N. The read operation returns the older value of the data object, thus making the application state not **consistent**. -2. **Consistency first** - The system does not allow any application to write to data objects as it cannot ensure **consistency** of replica states. This means that the system is perceived to be **unavailable** by the applications. +2. 
**Consistency first** - The system does not allow any application to write to data objects as it cannot ensure **consistency** of replica states. This means that the system is perceived to be **unavailable** by the applications. -If there are no partitions, clearly both consistency and availability can be guaranteed by the system. This observation led Eric Brewer to conjecture in an invited talk at PODC 2000- +If there are no partitions, clearly both consistency and availability can be guaranteed by the system. This observation led Eric Brewer to conjecture the CAP Theorem in an invited talk at PODC 2000. The CAP Theorem states: -<blockquote>It is impossible for a web service to provide the following three guarantees: -Consistency -Availability -Partition Tolerance</blockquote> +<blockquote> +It is impossible for a web service to provide the following three guarantees: Consistency, Availability, and Partition Tolerance +</blockquote> -This is called the CAP theorem. +It is clear that the prime culprit here is network partitions. If there are no network partitions, any distributed service will be both highly available and provide strong consistency of shared data objects. Unfortunately, network partitions cannot be remedied in a distributed system. -It is clear that the prime culprit here is network partition. If there are no network partitions, any distributed service will be both highly available and provide strong consistency of shared data objects. Unfortunately, network partitions cannot be remedied in a distributed system. - -## 4. Two of Three - Exploring the CAP Theorem +## Two of Three - Exploring the CAP Theorem The CAP theorem dictates that the three desirable properties, consistency, availability and partition tolerance cannot be offered simultaneously. Let's study if its possible to achieve two of these three properties. ### Consistency and Availability -If there are no network partitions, then there is no loss of messages and all requests receive a response within the stipulated time. It is clearly possible to achieve both consistency and availability. Distributed systems over intranet are an example of such systems. + +If there are no network partitions, then there is no loss of messages and all requests receive a response within the stipulated time. It is clearly possible to achieve both consistency and availability. Distributed systems over an intranet are an example of such systems. ### Consistency and Partition Tolerance + Without availability, both of these properties can be achieved easily. A centralized system can provide these guarantees. The state of the application is maintained on a single designated node. All updates from the client are forwarded by the nodes to this designated node. It updates the state and sends the response. When a failure happens, then the system does not respond and is perceived as unavailable by the client. Distributed locking algorithms in databases also provide these guarantees. ### Availability and Partition Tolerance + Without atomic consistency, it is very easy to achieve availability even in the face of partitions. Even if nodes fail to communicate with each other, they can individually handle query and update requests issued by the client. The same data object will have different states on different nodes as the nodes progress independently. This weak consistency model is exhibited by web caches. -Its clear that two of these three properties are easy to achieve in any distributed system. 
Since large scale distributed systems have to take partitions into account, will they have to sacrifice availability for consistency or consistency for availability? Clearly giving up either consistency or availability is too big a sacrifice. +Its clear that two of these three properties are easy to achieve in any distributed system. Since large scale distributed systems have to take partitions into account, will they have to sacrifice availability for consistency or consistency for availability? Clearly totally giving up either consistency or availability is too big a sacrifice. -## 5. The **BASE**ic distributed state +## The **BASE**ic distributed state When viewed through the lens of CAP theorem and its consequences on distributed application design, we realize that we cannot commit to perfect availability and strong consistency. But surely we can explore the middle ground. We can guarantee availability most of the time with occasional inconsistent view of the data. The consistency is eventually achieved when the communication between the nodes resumes. This leads to the following properties of the current distributed applications, referred to by the acronym BASE. @@ -102,41 +103,53 @@ When viewed through the lens of CAP theorem and its consequences on distributed * **Eventually Consistent** services try to make application state consistent whenever possible. -## 6. Partitions and latency -Any large scale distributed system has to deal with latency issue. In fact, network partitions and latency are fundamentally related. Once a request is made and no response is received within some duration, the sender node has to assume that a partition has happened. The sender node can take one of the following steps: +## Partitions and latency + +Any large scale distributed system has to deal with latency issues. In fact, network partitions and latency are fundamentally related. Once a request is made and no response is received within some duration, the sender node has to assume that a partition has happened. The sender node can take one of the following steps: -1) Cancel the operation as a whole. In doing so, the system is choosing consistency over availability. -2) Proceed with the rest of the operation. This can lead to inconsistency but makes the system highly available. -3) Retry the operation until it succeeds. This means that the system is trying to ensure consistency and reducing availability. +* Cancel the operation as a whole. In doing so, the system is choosing consistency over availability. +* Proceed with the rest of the operation. This can lead to inconsistency but makes the system highly available. +* Retry the operation until it succeeds. This means that the system is trying to ensure consistency and reducing availability. Essentially, a partition is an upper bound on the time spent waiting for a response. Whenever this upper bound is exceeded, the system chooses C over A or A over C. Also, the partition may be perceived only by two nodes of a system as opposed to all of them. This means that partitions are a local occurrence. -## 7. Handling Partitions +## Handling Partitions + Once a partition has happened, it has to be handled explicitly. The designer has to decide which operations will be functional during partitions. The partitioned nodes will continue their attempts at communication. When the nodes are able to establish communication, the system has to take steps to recover from the partitions. -### 7.1. 
Partition mode functionality
+### Partition mode functionality
+
When at least one side of the system has entered into partition mode, the system has to decide which functionality to support. Deciding this depends on the invariants that the system must maintain. Depending on the nature of the problem, the designer may choose to compromise on certain invariants by allowing the partitioned system to provide functionality which might violate them. This means the designer is choosing availability over consistency. Certain invariants may have to be maintained, and operations that would violate them will either have to be modified or prohibited. This means the designer is choosing consistency over availability.
+
Deciding which operations to prohibit, modify or delay also depends on other factors, such as the node involved. If the data is stored on the same node, then operations on that data can typically proceed on that node but not on other nodes.
-In any event, the bottomline is that if the designer wishes for the system to be available, certain operations have to be allowed. The node has to maintain a history of these operations so that it can be merged with the rest of the system when it is able to reconnect.
-Since the operations can happen simultaneously on multiple disconnected nodes, all sides will maintain this history. One way to maintain this information is through version vectors.
+
+In any event, the bottom line is that if the designer wishes for the system to be available, certain operations have to be allowed. The node has to maintain a history of these operations so that it can be merged with the rest of the system when it is able to reconnect. Since the operations can happen simultaneously on multiple disconnected nodes, all sides will maintain this history. One way to maintain this information is through version vectors.
+
Another interesting problem is to communicate the progress of these operations to the user. Until the system gets out of partition mode, the operations cannot be committed completely. Until then, the user interface has to faithfully represent their incomplete or in-progress status to the user.

-### 7.2. Partition Recovery
-When the partitioned nodes are able to communicate, they have to exchange information to maintain consistency. Both sides continued in their independent direction but now the delayed operations on either side have to be performed and violated invariants have to be fixed. Given the state and history of both sides, the system has to accomplish the following tasks.
+### Partition Recovery
+
+When the partitioned nodes are able to communicate, they have to exchange information to maintain consistency. During the partition, both sides continued processing independently, but now the delayed operations on either side have to be performed and violated invariants have to be fixed. Given the state and history of both sides, the system has to accomplish the following tasks.
+
+#### Consistency
+
+During recovery, the system has to reconcile the inconsistency in the state of both nodes. This is relatively straightforward to accomplish. One approach is to start from the state at the time of partition and apply operations of both sides in an appropriate manner, ensuring that the invariants are maintained. Depending on operations allowed during the partition phase, this process may or may not be possible. The general problem of conflict resolution is not solvable but a restricted set of operations may ensure that the system can always merge conflicts.
For example, Google Docs limits operations to style and text editing. But source-code control systems such as Concurrent Versioning System (CVS) may encounter conflicts which require manual resolution.

-#### 7.2.1. Consistency
-During recovery, the system has to reconcile the inconsistency in state of both nodes. This is relatively straightforward to accomplish. One approach is to start from the state at the time of partition and apply operations of both sides in an appropriate manner, ensuring that the invariants are maintained. Depending on operations allowed during the partition phase, this process may or may not be possible. The general problem of conflict resolution is not solvable but a restricted set of operations may ensure that the system can always always merge conflicts. For example, Google Docs limits operations to style and text editing. But source-code control systems such as Concurrent Versioning System (CVS) may encounter conflict which require manual resolution. Research has been done on techniques for automatic state convergence. Using commutative operations allows the system to sort the operations in a consistent global order and execute them. Though all operations can't be commutative, for example - addition with bounds checking is not commutative. Mark Shapiro and his colleagues at INRIA have developed *commutative replicated data types (CRDTs)* that provably converge as operations are performed. By implementing state through CRDTs, we can ensure Availability and automatic state convergence after partitions.

+Research has been done on techniques for automatic state convergence. Using commutative operations allows the system to sort the operations in a consistent global order and execute them. Though not all operations can be commutative (for example, addition with bounds checking is not commutative), Marc Shapiro and his colleagues at INRIA have developed *commutative replicated data types (CRDTs)* that provably converge as operations are performed. By implementing state through CRDTs, we can ensure Availability and automatic state convergence after partitions.

+#### Compensation

-#### 7.2.2. Compensation
During a partition, it's possible for both sides to perform a series of actions which are externalized, i.e. their effects are visible outside the system. To compensate for these actions, the partitioned nodes have to maintain a history.
+
For example, consider a system in which both sides have executed the same order during a partition. During the recovery phase, the system has to detect this and distinguish it from two intentional orders. Once detected, the duplicate order has to be rolled back. If the order has been committed successfully then the problem has been externalized. The user will see twice the amount deducted from his account for a single purchase. Now, the system has to credit the appropriate amount to the user's account and possibly send an email explaining the entire debacle. All this depends on the system maintaining the history during partition. If the history is not present, then duplicate orders cannot be detected and the user will have to catch the mistake and ask for compensation.
+
It would have been great if the duplicate order was not issued by the system in the first place. But the requirement to maintain system availability trumps consistency. Mistakes in such cases cannot always be corrected internally. But by admitting them and compensating for them, the system arguably exhibits equivalent behavior.

-### 8.
What's the right pH for my distributed solution? +## What's the right pH for my distributed solution? Whether an application chooses to be an *ACID*ic or *BASE*ic service depends on the domain. An application developer has to consider the consistency-availability tradeoff on a case by case basis. *ACID*ic databases provide a very simple and strong consistency model making application development easy for domains where data inconsistency cannot be tolerated. *BASE*ic systems provide a very loose consistency model, placing more burden on the application developer to understand the invariants and manage them carefully during partitions by appropriately limiting or modifying the operations. -## 9. References +## References https://neo4j.com/blog/acid-vs-base-consistency-models-explained/ https://en.wikipedia.org/wiki/Eventual_consistency/ @@ -182,4 +195,5 @@ https://en.wikipedia.org/wiki/Distributed_database https://en.wikipedia.org/wiki/ACID http://searchstorage.techtarget.com/definition/data-availability https://datatechnologytoday.wordpress.com/2013/06/24/defining-database-availability/ + {% bibliography --file rpc %} diff --git a/chapter/6/being-consistent.md b/chapter/6/being-consistent.md index 233d987..bc7a55b 100644 --- a/chapter/6/being-consistent.md +++ b/chapter/6/being-consistent.md @@ -5,6 +5,7 @@ by: "Aviral Goel" --- ## Replication and Consistency + Availability and Consistency are the defining characteristics of any distributed system. As dictated by the CAP theorem, accommodating network partitions requires a trade off between the two properties. Modern day large scale internet based distributed systems have to be highly available. To manage huge volumes of data (big data) and to reduce access latency for geographically diverse user base, their data centers also have to be geographically spread out. Network partitions which would otherwise happen with a low probability on a local network become certain events in such systems. To ensure availability in the event of partitions, these systems have to replicate data objects. This begs the question, how to ensure consistency of these replicas? It turns out there are different notions of consistency which the system can adhere to. * **Strong Consistency** implies linearizability of updates, i.e., all updates applied to a replicated data type are serialized in a global total order. This means that any update will have to be simultaneously applied to all other replicas. Its obvious that this notion of consistency is too restrictive. A single unavailable node will violate this condition. Forcing all updates to happen synchronously will impact system availability negatively. This notion clearly does not fit the requirements of highly available fault tolerant systems. @@ -16,26 +17,27 @@ Most large scale distributed systems try to be **Eventually Consistent** to ensu ## A Distributed Setting ### TODO need to write pseudocode. Will finish this part with the detailed explanation of CRDTs in the next chapter. -Consider a replicated counter. Each node can increment the value of its local copy. The figure below shows three nodes which increment their local copies at arbitrary time points and each replica sends its value asynchronously to the other two replicas. Whenever it recieves the value of its replica, it adds it to its current value. If two values are received concurrently, both will be added together to its current value. So merging replicas in this example becomes trivial. 
-Let's take a look at another interesting generalization of this. Integer Vector +Consider a replicated counter. Each node can increment the value of its local copy. The figure below shows three nodes which increment their local copies at arbitrary time points and each replica sends its value asynchronously to the other two replicas. Whenever it receives the value of its replica, it adds it to its current value. If two values are received concurrently, both will be added together to its current value. So merging replicas in this example becomes trivial. +Let's take a look at another interesting generalization of this. Integer Vector We can make an interesting observation from the previous examples: - + __*All distributed data structures don't need conflict resolution*__ This raises the following question: - + __*How can we design a distributed structure such that we don't need conflict resolution?*__ The answer to this question lies in an algebraic structure called the **join semilattice**. ## Join Semilattice + A join-semilattice or upper semilattice is a *partial order* `≤` with a *least upper bound* (LUB) `⊔` for all pairs. -`m = x ⊔ y` is a Least Upper Bound of `{` `x` `,` `y` `}` under `≤` iff `∀m′, x ≤ m′ ∧ y ≤ m′ ⇒ x ≤ m ∧ y ≤ m ∧ m ≤ m′`. +`m = x ⊔ y` is a Least Upper Bound of `{ x , y }` under `≤` iff `∀m′, x ≤ m′ ∧ y ≤ m′ ⇒ x ≤ m ∧ y ≤ m ∧ m ≤ m′`. -`⊔` is: +Here the least upper bound `⊔` has the follow properties: **Associative** @@ -50,31 +52,39 @@ A join-semilattice or upper semilattice is a *partial order* `≤` with a *least `x ⊔ x = x` The examples we saw earlier were of structures that could be modeled as join semilattices. The merge operation for the increment only counter is the summation function and for the integer vector it is the per-index maximum of the vectors being merged. + So, if we can model the state of the data structure as a partially ordered set and design the merge operation to always compute the "larger" of the two states, its replicas will never need consensus. They will always converge as execution proceeds. Such data structures are called CRDTs (Conflict-free Replicated Data Type). But what about consistency of these replicas? ## Strong Eventual Consistency (SEC) + We discussed a notion of consistency, *Eventual Consistency*, in which replicas eventually become consistent if there are no more updates to be merged. But the update operation is left unspecified. Its possible for an update to render the replica in a state that causes it to conflict with a later update. In this case the replica may have to roll back and use consensus to ensure that all replicas do the same to ensure consistency. This is complicated and wasteful. But if replicas are modeled as CRDTs, the updates never conflict. Regardless of the order in which the updates are applied, all replicas will eventually have equivalent state. Note that no conflict arbitration is necessary. This kind of Eventual Consistency is a stronger notion of consistency than the one that requires conflict arbitration and hence is called *Strong Eventual Consistency*. ### Strong Eventual Consistency and CAP Theorem Let's study SEC data objects from the perspective of CAP theorem. -#### 1. Consistency and Network Partition +#### Consistency and Network Partition + Each distributed replica will communicate asynchronously with other reachable replicas. These replicas will eventually converge to the same value. 
There is no consistency guarantee on the value of replicas not reachable due to network conditions and hence this condition is strictly weaker than strong consistency. But as soon as those replicas can be reached, they will also converge in a self-stabilizing manner. -#### 2. Availability and Network Partition -Each distributed replica will always be available for local reads and writes regardless of network partitions. In fact, if there are n replicas, a single replica will function even if the remaining n - 1 replicas crash simultaneously. This **provides an extreme form of availability**. +#### Availability and Network Partition + +Each distributed replica will always be available for local reads and writes regardless of network partitions. In fact, if there are n replicas, a single replica will function even if the remaining n - 1 replicas crash simultaneously. This **provides an extreme form of availability**. SEC facilitates maximum consistency and availability in the event of network partitions by relaxing the requirement of global consistency. Note that this is achieved by virtue of modeling the data objects as join semilattices. #### Strong Eventual Consistency and Linearizability -In a distributed setting, a replica has to handle concurrent updates. In addition to its sequential behavior, a CRDT also has to ensure that its concurrent behavior also ensures strong eventual consistency. This makes it possible for CRDTs to exhibit behavior that is simply not possible for sequentially consistent objects. + +In a distributed setting, a replica has to handle concurrent updates. In addition to its sequential behavior, a CRDT also has to ensure that its concurrent behavior also ensures strong eventual consistency. This makes it possible for CRDTs to exhibit behavior that is simply not possible for sequentially consistent objects. + Consider a set CRDT used in a distributed setting. One of the replicas p<sub>i</sub> executes the sequence `add(a); remove(b)`. Another replica p<sub>j</sub> executes the sequence `add(b); remove(a)`. Now both send their states asynchronously to another replica p<sub>k</sub> which has to merge them concurrently. Same element exists in one of the sets and does not exist in the other set. There are multiple choices that the CRDT designer can make. Let's assume that the implementation always prefers inclusion over exclusion. So in this case, p<sub>k</sub> will include both `a` and `b`. -Now consider a sequential execution of the two sequences on set data structure. The order of execution will be either `add(a); remove(b); add(b); remove(a)` or `add(b); remove(a); add(a); remove(b)`. In both cases one of the elements is excluded. This is different from the state of the CRDT set implementation. -Thus, strong eventually consistent data structures can be sequentially inconsistent. + +Now consider a sequential execution of the two sequences on set data structure. The order of execution will be either `add(a); remove(b); add(b); remove(a)` or `add(b); remove(a); add(a); remove(b)`. In both cases one of the elements is excluded. This is different from the state of the CRDT set implementation. Thus, strong eventually consistent data structures can be sequentially inconsistent. + Similarly, if there are `n` sequentially consistent replicas, then they would need consensus to ensure a single order of execution of operations across all replicas. But if `n - 1` replicas crash, then consensus cannot happen. 
This makes the idea of sequential consistency incomparable to that of strong eventual consistency. ## What Next? + This chapter introduced Strong Eventual Consistency and the formalism behind CRDTs, join semilattices, which enables CRDTs to exhibit strong eventual consistency. The discussion however does not answer an important question: __*Can all standard data structures be designed as CRDTs?*__ diff --git a/chapter/6/counters.md b/chapter/6/counters.md index 67e822a..23dccde 100644 --- a/chapter/6/counters.md +++ b/chapter/6/counters.md @@ -6,29 +6,28 @@ by: "Aviral Goel" Counters are replicated integers. They are the most basic distributed object. This chapter describes the different variations of counter CRDT in both state based and operation based form. -## 1. G-counter - Increment only counter +## G-counter - Increment only counter As the name suggests, these counters only support increment operation. They can be used to implement the `like` button functionality of social media websites. -### 1.1. CmRDT: Operation based design +### CmRDT: Operation based design In the operation based implementation, the increment operation is transmitted to all other replicas. This is straightforward to implement as there is only one update operation. -<pre style="background:#fff;color:#000"><span style="color:#ff5600">class</span> <span style="color:#21439c">Counter</span>(CmRDT): +```python +class Counter(CmRDT): + def __init__(self): + self._count = 0 - <span style="color:#ff5600">def</span> <span style="color:#21439c"><span style="color:#a535ae">__init__</span></span>(self): <span style="color:#919191"># constructor function</span> - self._count <span style="color:#ff5600">=</span> 0 + def value(self): + return self._count - <span style="color:#ff5600">def</span> <span style="color:#21439c">value</span>(self): <span style="color:#919191"># query function</span> - <span style="color:#ff5600">return</span> self._count - - <span style="color:#ff5600">def</span> <span style="color:#21439c">increment</span>(self): <span style="color:#919191"># update function</span> - self._count <span style="color:#ff5600">+=</span> 1 - <span style="color:#ff5600">for</span> replica <span style="color:#ff5600">in</span> self.replicas(): - self.transmit(<span style="color:#00a33f">"increment"</span>, replica) - -</pre> + def increment(self): + self._count += 1 + for replica in self.replicas(): + self.transmit("increment", replica) +``` Let's try to understand how it works through an example. The figure below shows an execution trace of three replicas confirming to this specification. In accordance with the specification, each increment request increments the counter locally by one unit and the operation is transmitted by the replica to all other replicas of the system. @@ -44,7 +43,7 @@ We can make the following observations: * **Concurrent operations** - A replica may have to handle concurrent operations. For example, at *t<sub>19</sub>*, <b>c<sub>1</sub></b> encounters two increment operations. Since its the same operation, there is only one way to handle the situation. There is no conflict. -### 1.2. CvRDT: State based design +### CvRDT: State based design In the state based implementation, the counter state is transmitted to all other replicas. But how do we model the state? Of course, the counter's count is its state. @@ -52,24 +51,23 @@ Since the count always increases, modeling the state as count automatically make The code below provides the specification of this counter. 
-<pre style="background:#fff;color:#000"><span style="color:#ff5600">class</span> <span style="color:#21439c">Counter</span>(CvRDT): - - <span style="color:#ff5600">def</span> <span style="color:#21439c"><span style="color:#a535ae">__init__</span></span>(self, count <span style="color:#ff5600">=</span> 0): <span style="color:#919191"># constructor function</span> - self._count <span style="color:#ff5600">=</span> count +```python +class Counter(CvRDT): + def __init__(self, count = 0): + self._count = count - <span style="color:#ff5600">def</span> <span style="color:#21439c">value</span>(self): <span style="color:#919191"># query function</span> - <span style="color:#ff5600">return</span> self._count + def value(self): + return self._count - <span style="color:#ff5600">def</span> <span style="color:#21439c">increment</span>(self): <span style="color:#919191"># update function</span> - self._count <span style="color:#ff5600">+=</span> 1 + def increment(self): + self._count += 1 - <span style="color:#ff5600">def</span> <span style="color:#21439c">compare</span>(self, other): <span style="color:#919191"># comparison function</span> - <span style="color:#ff5600">return</span> self.value() <span style="color:#ff5600"><=</span> other.value() + def compare(self, other): + return self.value() <= other.value() - <span style="color:#ff5600">def</span> <span style="color:#21439c">merge</span>(self, other): <span style="color:#919191"># merge function</span> - <span style="color:#ff5600">return</span> Counter(<span style="color:#a535ae">max</span>(self.value(), other.value())) - -</pre> + def merge(self, other): + return Counter(max(self.value(), other.value())) +``` Let's try to understand how it works through an example. The figure below shows an execution trace of three replicas confirming to this specification. The replicas keep transmitting their state at random times to other randomly chosen replicas. @@ -92,33 +90,33 @@ Let's observe the problem again. Our merge method only returns the state of that The specification below shows how this can be implemented. 
-<pre style="background:#fff;color:#000"><span style="color:#ff5600">class</span> <span style="color:#21439c">Counter</span>(CvRDT): - - <span style="color:#ff5600">def</span> <span style="color:#21439c"><span style="color:#a535ae">__init__</span></span>(self, counts <span style="color:#ff5600">=</span> <span style="color:#a535ae">None</span>): <span style="color:#919191"># constructor function</span> - <span style="color:#ff5600">if</span> counts <span style="color:#ff5600">is</span> <span style="color:#a535ae">None</span>: - self._counts <span style="color:#ff5600">=</span> [0] <span style="color:#ff5600">*</span> length(self.replicas()) - <span style="color:#ff5600">else</span>: - self._counts <span style="color:#ff5600">=</span> counts +```python +class Counter(CvRDT): + def __init__(self, counts = None): + if counts is None: + self._counts = [0] * length(self.replicas()) + else: + self._counts = counts - <span style="color:#ff5600">def</span> <span style="color:#21439c">value</span>(self): <span style="color:#919191"># query function</span> - <span style="color:#ff5600">return</span> <span style="color:#a535ae">sum</span>(self._counts) + def value(self): + return sum(self._counts) - <span style="color:#ff5600">def</span> <span style="color:#21439c">counts</span>(self): <span style="color:#919191"># query function</span> - <span style="color:#ff5600">return</span> <span style="color:#a535ae">list</span>(self._counts) <span style="color:#919191"># return a clone</span> + def counts(self): + # return a copy of the counts + return list(self._counts) - <span style="color:#ff5600">def</span> <span style="color:#21439c">increment</span>(self): <span style="color:#919191"># update function</span> - self._counts[self.replicaId()] <span style="color:#ff5600">+=</span> 1 + def increment(self): + self._counts[self.replicaId()] += 1 - <span style="color:#ff5600">def</span> <span style="color:#21439c">compare</span>(self, other): <span style="color:#919191"># comparison function</span> - <span style="color:#ff5600">return</span> <span style="color:#a535ae">all</span>(v1 <span style="color:#ff5600"><=</span> v2 <span style="color:#ff5600">for</span> (v1, v2) <span style="color:#ff5600">in</span> - <span style="color:#a535ae">zip</span>(self.counts(), + def compare(self, other): + return all(v1 <= v2 for (v1, v2) in + zip(self.counts(), other.counts())) - <span style="color:#ff5600">def</span> <span style="color:#21439c">merge</span>(self, other): <span style="color:#919191"># merge function</span> - <span style="color:#ff5600">return</span> Counter(<span style="color:#a535ae">map</span>(<span style="color:#a535ae">max</span>, <span style="color:#a535ae">zip</span>(self.counts(), + def merge(self, other): + return Counter(map(max, zip(self.counts(), other.counts()))) - -</pre> +``` The figure below shows an execution trace of three replicas confirming to this specification. @@ -128,34 +126,32 @@ The figure below shows an execution trace of three replicas confirming to this s This design converges to the correct value. This provides us an eventually consistent state based increment only counter. -## 2. PN-counter - Increment and Decrement counter +## PN-counter - Increment and Decrement counter A PN-counter can be incremented and decremented. These can serve as a general purpose counters, as they also provide a decrement operation. For example - counting the number of users active one a social media website at any point. Note that the users can go offline and the counter will have to decremented. 
This can't be done with an increment only counter. -### 2.1. CmRDT: Operation based design +### CmRDT: Operation based design The code below provides the specification of an operation based increment and decrement counter. -<pre style="background:#fff;color:#000"> - -<span style="color:#ff5600">class</span> <span style="color:#21439c">Counter</span>(CmRDT): +```python +class Counter(CmRDT): + def __init__(self): + self._count = 0 - <span style="color:#ff5600">def</span> <span style="color:#21439c"><span style="color:#a535ae">__init__</span></span>(self): <span style="color:#919191"># constructor function</span> - self._count <span style="color:#ff5600">=</span> 0 + def value(self): + return self._count - <span style="color:#ff5600">def</span> <span style="color:#21439c">value</span>(self): <span style="color:#919191"># query function</span> - <span style="color:#ff5600">return</span> self._count + def increment(self): + self._count += 1 + for replica in self.replicas(): + self.transmit("increment", replica) - <span style="color:#ff5600">def</span> <span style="color:#21439c">increment</span>(self): <span style="color:#919191"># update function</span> - self._count <span style="color:#ff5600">+=</span> 1 - <span style="color:#ff5600">for</span> replica <span style="color:#ff5600">in</span> self.replicas(): - self.transmit(<span style="color:#00a33f">"increment"</span>, replica) - - <span style="color:#ff5600">def</span> <span style="color:#21439c">decrement</span>(self): <span style="color:#919191"># update function</span> - self._count <span style="color:#ff5600">-=</span> 1 - <span style="color:#ff5600">for</span> replica <span style="color:#ff5600">in</span> self.replicas(): - self.transmit(<span style="color:#00a33f">"decrement"</span>, replica) -</pre> + def decrement(self): + self._count -= 1 + for replica in self.replicas(): + self.transmit("decrement", replica) +``` Let's try to understand how it works through an example. The figure below shows an execution trace of three replicas confirming to this specification. In accordance with the specification, each increment request increments the counter locally by one unit and each decrement request decrements the counter locally by one unit. The corresponding operation is transmitted by the replica to all other replicas of the system. @@ -177,40 +173,41 @@ We can make the the following observations: In both cases the result is same because the two operations commute, i.e. - the order in which they are executed by the replicas does not matter. Both replicas perform them in different orders and still achieve consistency, eventually. -### 2.2. CvRDT: State based design +### CvRDT: State based design The code below provides the specification of a state based increment and decrement counter. We take inspiration from the design of state based increment only counter and model the state of this counter in the same way. 
-<pre style="background:#fff;color:#000"><span style="color:#ff5600">class</span> <span style="color:#21439c">Counter</span>(CvRDT): +```python +class Counter(CvRDT): - <span style="color:#ff5600">def</span> <span style="color:#21439c"><span style="color:#a535ae">__init__</span></span>(self, counts <span style="color:#ff5600">=</span> <span style="color:#a535ae">None</span>): <span style="color:#919191"># constructor function</span> - <span style="color:#ff5600">if</span> counts <span style="color:#ff5600">is</span> <span style="color:#a535ae">None</span>: - self._counts <span style="color:#ff5600">=</span> [0] <span style="color:#ff5600">*</span> length(self.replicas()) - <span style="color:#ff5600">else</span>: - self._counts <span style="color:#ff5600">=</span> counts + def __init__(self, counts = None): + if counts is None: + self._counts = [0] * length(self.replicas()) + else: + self._counts = counts - <span style="color:#ff5600">def</span> <span style="color:#21439c">value</span>(self): <span style="color:#919191"># query function</span> - <span style="color:#ff5600">return</span> <span style="color:#a535ae">sum</span>(self._counts) + def value(self): + return sum(self._counts) - <span style="color:#ff5600">def</span> <span style="color:#21439c">counts</span>(self): <span style="color:#919191"># query function</span> - <span style="color:#ff5600">return</span> <span style="color:#a535ae">list</span>(self._counts) <span style="color:#919191"># return a clone</span> + def counts(self): + # return a copy of the counts + return list(self._counts) - <span style="color:#ff5600">def</span> <span style="color:#21439c">increment</span>(self): <span style="color:#919191"># update function</span> - self._counts[self.replicaId()] <span style="color:#ff5600">+=</span> 1 + def increment(self): + self._counts[self.replicaId()] += 1 - <span style="color:#ff5600">def</span> <span style="color:#21439c">decrement</span>(self): <span style="color:#919191"># update function</span> - self._counts[self.replicaId()] <span style="color:#ff5600">-=</span> 1 + def decrement(self): + self._counts[self.replicaId()] -= 1 - <span style="color:#ff5600">def</span> <span style="color:#21439c">compare</span>(self, other): <span style="color:#919191"># comparison function</span> - <span style="color:#ff5600">return</span> <span style="color:#a535ae">all</span>(v1 <span style="color:#ff5600"><=</span> v2 <span style="color:#ff5600">for</span> (v1, v2) <span style="color:#ff5600">in</span> - <span style="color:#a535ae">zip</span>(self.counts(), + def compare(self, other): + return all(v1 <= v2 for (v1, v2) in + zip(self.counts(), other.counts())) - <span style="color:#ff5600">def</span> <span style="color:#21439c">merge</span>(self, other): <span style="color:#919191"># merge function</span> - <span style="color:#ff5600">return</span> Counter(<span style="color:#a535ae">map</span>(<span style="color:#a535ae">max</span>, <span style="color:#a535ae">zip</span>(self.counts(), + def merge(self, other): + return Counter(map(max, zip(self.counts(), other.counts()))) - -</pre> +``` Let's try to understand how it works through an example. The figure below shows an execution trace of three replicas confirming to this specification. The replicas keep transmitting their state at random times to other randomly chosen replicas. @@ -246,52 +243,54 @@ But we do gain two valuable insights from this design- Let's try to correct this problem. We need a way to count the decrement operations without losing monotonicity. 
One solution is to model this counter using two increment only counters. The first counter counts the increment operations and the second one counts the decrement operations. The value of the actual counter is the difference between the two corresponding counters. The specification below shows how this can be implemented. -<pre style="background:#fff;color:#000"><span style="color:#ff5600">class</span> <span style="color:#21439c">Counter</span>(CvRDT): - - <span style="color:#ff5600">def</span> <span style="color:#21439c"><span style="color:#a535ae">__init__</span></span>(self, - increments <span style="color:#ff5600">=</span> <span style="color:#a535ae">None</span>, - decrements <span style="color:#ff5600">=</span> <span style="color:#a535ae">None</span>): <span style="color:#919191"># constructor function</span> - <span style="color:#ff5600">if</span> increments <span style="color:#ff5600">is</span> <span style="color:#a535ae">None</span>: - self._increments <span style="color:#ff5600">=</span> [0] <span style="color:#ff5600">*</span> length(replicas()) - <span style="color:#ff5600">else</span>: - self._increments <span style="color:#ff5600">=</span> increments - <span style="color:#ff5600">if</span> decrements <span style="color:#ff5600">is</span> <span style="color:#a535ae">None</span>: - self._decrements <span style="color:#ff5600">=</span> [0] <span style="color:#ff5600">*</span> length(replicas()) - <span style="color:#ff5600">else</span>: - self._decrements <span style="color:#ff5600">=</span> decrements - - <span style="color:#ff5600">def</span> <span style="color:#21439c">increments</span>(self): <span style="color:#919191"># query function</span> - <span style="color:#ff5600">return</span> <span style="color:#a535ae">list</span>(self._increments) <span style="color:#919191"># return a clone</span> - - <span style="color:#ff5600">def</span> <span style="color:#21439c">decrements</span>(self): <span style="color:#919191"># query function</span> - <span style="color:#ff5600">return</span> <span style="color:#a535ae">list</span>(self._decrements) <span style="color:#919191"># return a clone</span> - - <span style="color:#ff5600">def</span> <span style="color:#21439c">value</span>(self): <span style="color:#919191"># query function</span> - <span style="color:#ff5600">return</span> (<span style="color:#a535ae">sum</span>(self.increments()) <span style="color:#ff5600">-</span> - <span style="color:#a535ae">sum</span>(self.decrements())) - - <span style="color:#ff5600">def</span> <span style="color:#21439c">increment</span>(self): <span style="color:#919191"># update function</span> - self._increments[self.replicaId()] <span style="color:#ff5600">+=</span> 1 - - <span style="color:#ff5600">def</span> <span style="color:#21439c">decrement</span>(self): <span style="color:#919191"># update function</span> - self._decrements[self.replicaId()] <span style="color:#ff5600">+=</span> 1 - - <span style="color:#ff5600">def</span> <span style="color:#21439c">compare</span>(self, other): <span style="color:#919191"># comparison function</span> - <span style="color:#ff5600">return</span> (<span style="color:#a535ae">all</span>(v1 <span style="color:#ff5600"><=</span> v2 <span style="color:#ff5600">for</span> (v1, v2) <span style="color:#ff5600">in</span> - <span style="color:#a535ae">zip</span>(self.increments(), +```python +class Counter(CvRDT): + def __init__(self, + increments = None, + decrements = None): + if increments is None: + self._increments = [0] * length(replicas()) + else: + 
self._increments = increments + if decrements is None: + self._decrements = [0] * length(replicas()) + else: + self._decrements = decrements + + def increments(self): + # return a copy of the increments + return list(self._increments) + + def decrements(self): + # return a copy of the decrements + return list(self._decrements) + + def value(self): + return (sum(self.increments()) - + sum(self.decrements())) + + def increment(self): + self._increments[self.replicaId()] += 1 + + def decrement(self): + self._decrements[self.replicaId()] += 1 + + def compare(self, other): + return (all(v1 <= v2 for (v1, v2) in + zip(self.increments(), other.increments())) - <span style="color:#ff5600">and</span> - <span style="color:#a535ae">all</span>(v1 <span style="color:#ff5600"><=</span> v2 <span style="color:#ff5600">for</span> (v1, v2) <span style="color:#ff5600">in</span> - <span style="color:#a535ae">zip</span>(self.decrements(), + and + all(v1 <= v2 for (v1, v2) in + zip(self.decrements(), other.decrements()))) - <span style="color:#ff5600">def</span> <span style="color:#21439c">merge</span>(self, other): <span style="color:#919191"># merge function</span> - <span style="color:#ff5600">return</span> Counter(increments <span style="color:#ff5600">=</span> <span style="color:#a535ae">map</span>(<span style="color:#a535ae">max</span>, <span style="color:#a535ae">zip</span>(self.increments(), - other.increments())), - decrements <span style="color:#ff5600">=</span> <span style="color:#a535ae">map</span>(<span style="color:#a535ae">max</span>, <span style="color:#a535ae">zip</span>(self.decrements(), - other.decrements()))) - -</pre> + def merge(self, other): + return Counter( + increments = map(max, zip(self.increments(), + other.increments())), + decrements = map(max, zip(self.decrements(), + other.decrements()))) +``` The figure below shows an execution trace of three replicas confirming to this specification. @@ -305,7 +304,7 @@ This design converges to the correct value. This provides us an eventually consi <img src="{{ site.baseurl }}/chapter/6/resources/images/counters/state-based-increment-and-decrement-counter-correct-lattice.png" alt="State based increment and decrement counter(correct) lattice"/> </figure> -## 3. References +## References {% bibliography --file counters %} diff --git a/chapter/7/langs-consistency.md b/chapter/7/langs-consistency.md index 114f6f0..186fd76 100644 --- a/chapter/7/langs-consistency.md +++ b/chapter/7/langs-consistency.md @@ -3,20 +3,20 @@ layout: page title: "Formal, Yet Relaxed: Models for Consistency" by: "James Larisch" --- -# Formal, Yet Relaxed: Models for Consistency ## What's the problem? - In many ways, web developers deal with distributed systems problems every day: your client and your server are in two different geographical locations, and thus, some coordination between computers is required. +In many ways, web developers deal with distributed systems problems every day: your client and your server are in two different geographical locations, and thus, some coordination between computers is required. - As Aviral discussed in the previous section, many computer scientists have done a lot of thinking about the nature of distributed systems problems. As such, we realize that it's impossible to completely emulate the behavior of a single computational machine using multiple machines. For example, the network is simply not as reliable as, say, memory - and waiting for responses can result in untimeliness for the application's user base. 
After discussing the Consistency/Availability/Partition-tolerance theorem, Section 6 discussed how we can drill down into the CAP pyramid and choose the necessary and unnecessary properties of our systems. As stated, we can't perfectly emulate a single computer using multiple machines, but once we accept that fact and learn to work with it, there are plenty of things we *can* do! +As Aviral discussed in the previous section, many computer scientists have done a lot of thinking about the nature of distributed systems problems. As such, we realize that it's impossible to completely emulate the behavior of a single computational machine using multiple machines. For example, the network is simply not as reliable as, say, memory - and waiting for responses can result in untimeliness for the application's user base. After discussing the Consistency/Availability/Partition-tolerance theorem, Section 6 discussed how we can drill down into the CAP pyramid and choose the necessary and unnecessary properties of our systems. As stated, we can't perfectly emulate a single computer using multiple machines, but once we accept that fact and learn to work with it, there are plenty of things we *can* do! ## The Shopping Cart - Let's bring all this theorem talk back to reality. Let's say you're working at a new e-commerce startup, and you'd like to revolutionize the electronic shopping cart. You'd like to give the customer the ability to do the following: - 1. Log in to the site and add a candle to the cart while traveling in Beijing. - 1. Take a HyperLoop (3 hours) from Beijing to Los Angeles. - 1. Log back in, remove the candle from the cart, and add a skateboard. - 1. Take another HyperLoop train from Los Angeles to Paris (5 hours). - 1. Log back into the site, add another skateboard, and checkout. +Let's bring all this theorem talk back to reality. Let's say you're working at a new e-commerce startup, and you'd like to revolutionize the electronic shopping cart. You'd like to give the customer the ability to do the following: + +1. Log in to the site and add a candle to the cart while traveling in Beijing. +2. Take a HyperLoop (3 hours) from Beijing to Los Angeles. +3. Log back in, remove the candle from the cart, and add a skateboard. +4. Take another HyperLoop train from Los Angeles to Paris (5 hours). +5. Log back into the site, add another skateboard, and checkout. Let's assume you have a server in every single country, and customers connect to the geographically closest server. @@ -33,6 +33,7 @@ This is an important moment. By thinking about our specific problem, we've reali Turns out there's a company out there called Amazon.com - and they've been having a similar problem. Amazon sells things on their website too, and users can add and remove things from their cart. Amazon has lots of servers spread out across the world. They also have quite a few customers. They need to ensure their customers' carts are robust: if/when servers fail or lose communication with one another, a "best-effort" should be made to display the customer's cart. Amazon acknowledges that failure, latency, or HyperLoop-traveling users can cause inconsistent cart data, depending on which server you ask. How does Amazon resolve these issues? ## Dynamo + Amazon built DynamoDB {% cite Dynamo --file langs-consistency %}, which is basically a big distributed hash table. I won't go into the details of DHTs, but let's imagine Dynamo as a hashmap, replicated across multiple servers. 
A user's cart is stored as a value under the user's username as the key. (`{'james': ['candle', 'skateboard']}`) When a user adds a new item to her cart, either the entire cart or this update is sent to every other server (or, replica). Since, say, a network cable can fail, one replica may have *inconsistent state*: a different view of the universe (a shopping cart, in this case) than every other server. Dynamo has a rather unique way of dealing with these types of inconsistencies. Since Dynamo always wants to be available for both writes and reads (add/removes, viewing/checkouts, resp) it must have a way of combining inconsistent data. Dynamo chooses to perform this resolution at read time. When a client performs a `get()` on the user's cart, Dynamo will query multiple servers for the cart data, for redunancy's sake. Dynamo recognizes the inconsistent state and will take the multiple conflicting carts and push them up to the application! Huh? I thought Dynamo resolves this for the programmer!? Actually, Dynamo is a rather unopinionated key-value store. It detects inconsistencies in the data - but once it does, it simply tells the application (in this case the application is the shopping cart code) that there are some conflicts. The application (shopping cart, in this case) is free to resolve these inconsistencies as it pleases. @@ -59,6 +60,7 @@ Green Umbrella Dynamo has multiple machines in charge of storing the contents of your cart. When you add something to your cart, Dynamo specifies a minimum number of nodes that must receive the new data before the write is considered complete. The same thing goes for reading the contents of your cart: Dynamo requires a minimum number of healthy, responsive nodes to return cart data before relaying this data to the user. Nodes periodically gossip their local state to their neighbors to ensure that any updates, which occurred while the node may have been offline, are eventually delivered. However, Dynamo sends updates to your carts asynchronously to all replicas. This means when you read the contents of your cart, it's possible to receive different results from different replicas. ## Dynamo Simplification + What do we love about Dynamo? It's a highly available key-value store. It replicates data well, and according to the paper, has high uptime and low latency. We love that it's *eventually consistent*. Nodes are constantly gossiping and `put`s are asynchronously propagated, so given enough time (and assuming failures are resolved), nodes' states will eventually converge. However, this property is *weak*. It's weak because when failures & conflicts occur, and [and they will occur](https://www.youtube.com/watch?v=JG2ESDGwHHY), it's up to the application developer to figure out how to handle it. Given a conflict, there isn't a one-size-fits-all solution for resolving them. In the case of the shopping cart, it's relatively trivial: our resolution strategy errs on the side of inclusion. But as a programmer, every time you use DynamoDB for a different purpose you need to consider your resolution strategy. The database doesn't provide a general solution. Instead of constructing an all-purpose database and forcing the burden of resolution on programmers, what if we constructed multi-purpose (read: multi, not *all*) data structures that required no manual resolution? These data structures would resolve conflicts inherently, themselves, and depending on your application you could choose which data structure works best for you. 
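To make that last idea concrete, here is a minimal sketch, in Python with hypothetical names, of the kind of structure being described: a cart state whose conflict resolution is built in, because replicas merge by set union. The Javascript shopping cart discussed next takes the same approach.

```python
class MergeableCart:
    """Toy add-only cart state whose conflict resolution is built in."""

    def __init__(self, items=None):
        self._items = set(items or [])

    def add(self, item):
        self._items.add(item)

    def merge(self, other):
        # Union is commutative, associative, and idempotent, so replicas
        # can exchange and combine states in any order and still converge.
        return MergeableCart(self._items | other._items)

    def items(self):
        return set(self._items)


# Two replicas diverge, then reconcile without asking the application how.
beijing, paris = MergeableCart({"candle"}), MergeableCart({"skateboard"})
assert beijing.merge(paris).items() == paris.merge(beijing).items()
```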
@@ -123,19 +125,21 @@ cart.items // the cart's items ``` Here is an (almost) fully functional shopping cart program. You can imagine this code running across multiple nodes scattered over the world. The meat of the program lies in the `run()` method. Let's walk through that: - 1. Program receives an addition to the cart from the user. - 2. Program adds that item to the current local state if it exists. - 3. Program checks its UDP socket for any messages. - 4. If it received one, it's means another instance of this program has sent us its state. What is state in this case? Simply a set of cart items. Let's handle this set of items by unioning it with our current set. - 5. Synchronize our current state by sending our state to every peer that we know about. - 6. Sleep for 10 seconds. - 7. Repeat! + +1. Program receives an addition to the cart from the user. +2. Program adds that item to the current local state if it exists. +3. Program checks its UDP socket for any messages. +4. If it received one, it's means another instance of this program has sent us its state. What is state in this case? Simply a set of cart items. Let's handle this set of items by unioning it with our current set. +5. Synchronize our current state by sending our state to every peer that we know about. +6. Sleep for 10 seconds. +7. Repeat! Hopefully it's clear that if a client adds an item to her cart in Beijing and then 10 seconds later checks her cart in Paris, she should see the same thing. Well, not exactly - remember, the network is unreliable, and Beijing's `synchronize` messages might have been dropped, delayed, or reordered. But no worries! Beijing is `synchronizing` again in another 10 seconds. This should remind you of Dynamo's gossip and propagation: nodes are constantly attempting to converge. Both systems are eventually consistent - the difference here is our Javascript shopping cart displays *strong* eventual consistency. It's strong because the resolution strategy is built in. In order words, the carts know *how to handle inconsistency*, rather than simply asking the programmer what to do. When a node transmits its state to another node, there's absolutely no question about how to integrate that state into the current one. There's no conflict. This is certainly an improvement from Dynamo. ## The Intern: A Lack of Guarantees + Unfortunately Jerry, the intern, has found your code. He'd like to add `remove` functionality to the cart. So he makes the following changes: ```javascript @@ -195,9 +199,9 @@ cart.items // the cart's items Uh-oh. Can you spot the problem? Let's break it down. In the original code, the current node's cart items were *unioned* with the communicating node's cart. Since there was no deletion, carts could only ever expand. Here was Jerry's plan: -``` > I want to delete things. If you delete something from node 1, and intersect it's state from node 2, the item will be deleted from node 2 as well. +``` Node1: { A, B } Node2: { A, B } @@ -234,6 +238,7 @@ Node 2: { } This is pretty nasty. Jerry has come along and with a few lines of code he's obliterated our nice strong eventually consistent code. Surely there's a better way. ## Logical Monotonicity + The original Javascript we wrote down exhibits the property from Section 6 known as logical *monotonicity*. The union operation ensures that a given node's state is always "greater than or equal to" the states of the other nodes. However, how can we be *sure* that this property is maintained throughout the development of this program? 
As we've seen, there's nothing stopping an intern from coming along, making a mindless change, and destroying this wonderful property. Ideally, we want to make it impossible (or at least very difficult) to write programs that violate this property. Or, at the very least, we want to make it very easy to write programs that maintain these types of properties. But where should these guarantees live? In the above Javascript example, the guarantees aren't guarantees at all, really. There's no restriction on what the programmer is allowed to do - the programmer has simply constructed a program that mirrors guarantees that she has modeled in her brain. In order to maintain properties such as *monotonicity*, she must constantly check the model in her brain against the code. We haven't really helped the programmer out that much - she has a lot of thinking to do. @@ -245,9 +250,11 @@ Why not push the consistency guarantees in between the IO-level and the applicat Wouldn't it be great if tools like this existed? ## Bloom + Before talking about such tools, I'd like you to forget almost everything you know about programming for a second (unless of course you've never programmed in a Von Neumann-based language in which you sequentially update pieces of memory). Imagine the following scenario: you are "programming" a node in a cluster of computers. All of the other computers work as expected. As a node in this cluser, when you receive a message (all messages will include an integer), your task is to save the message, increment the integer, and resend the new message with the incremented integer back to its originator. You must also send any new messages you've received from `stdin`. Unfortunately, the programming environment is a little strange. + You have access to five sets: * Messages you have received in the last 5 seconds (read) * Inputs you've received from `stdin` in the last 5 seconds (read) @@ -342,6 +349,7 @@ After this block/callback is called, the system automatically flushes & routes m Bloom {% cite Bloom --file langs-consistency %}, a research language developed at UC Berkeley, has a similar programming model to the one described above. Execution is broken up into a series of "timesteps". In the above example, one "timestemp" would be the execution of one `on_five_second_interval` function. Bloom, like the theoretical system above, automatically flushes and populates certain sets before and after each timestep. In the above example, 5 seconds was an arbitrary amount of time. In Bloom, timesteps (rounds of evaluation) are logical tools - they may happen every second, 10 seconds, etc. Logically, it shouldn't affect how your program executes. In reality, Bud's timesteps correspond to evaluation iterations. Your code is evaluated, executed, and the process repeats. So what does a Bloom program look like? Bloom's prototypal implementation is called Bud and is implemented in Ruby. There are two main parts to a Bloom program: + 1. User defined sets: rather than the four sets I gave you above, Bloom users can define their own sets. There are different types of sets depending on the behavior you desire. Bloom refers to these sets as 'collections': * `channel`: Above, `recv_set` and `send_set` would be considered channels. They facilitate sending network messages to and from other nodes. Like the messages above, messages sent into these channels contain a "location-specifier", which tells Bloom where the message should be sent. 
If you wanted to send a message to `A`, you could push the message `(@A, 10)` into your send set (in Ruby, `["@A", 10]`). The `@` denotes the location-specifier. At the end of the timestep (or callback execution in the above example), these set are flushed. * `table`: Above, `saved_set` would be considered a table. The contents of tables persist across timesteps, which means tables are never flushed. @@ -399,25 +407,28 @@ end ``` Don't panic. Remember - the output of this program is identical to our Ruby callback program from earlier. Let's walk through it step by step. + ```ruby declare def increment_messages network_channel <~ network_channel.map { |x| [x.src, x.dst, x.integer] } end ``` + Here, we take messages we've received from the network channel and send them back into the network channel. The `<~` operator says "copy all of the elements in the right-hand-side and eventually send them off onto the network in the channel on the left-hand-side". So, we map over the contents of network channel *in the current timestep*: switching the `src` and `dst` fields, and incrementing the integer. This mapped collection is passed back into the network channel. Bud will ensure that those messages are sent off at some point. -``` +```ruby declare def save_messages saved_set <= network_channel end ``` + In `save_messages`, we use the `<=` operator. `<=` says "copy all of the elements in the right-hand-side and add them to the table on the left-hand-side." It's important to note that this movement occurs *within the current timestep*. This means if `saved_set` is referenced elsewhere in the code, it will include the contents of `network_channel`. If we had used the `<+` operator instead, the contents of `network_channel` would show up in `saved_set` in the *next* timestep. The latter is useful if you'd like to operate on the current contents of `saved_set` in the current timestep but want to specify how `saved_set` should be updated for the next timestep. Remember, all of this code is executed in *each* timestep - the separation of code into separate methods is merely for readability. -``` +```ruby declare def send_messages network_channel <~ stdin_input_set @@ -493,6 +504,7 @@ Functionally, this code is equivalent to our working Javascript shopping cart im Bloom has leveraged the montononic, add-only set and constructed a declarative programming model based around these sets. When you treat everything as sets (not unlike SQL) and you introduce the notion of "timestemps", you can express programs as descriptions of state rather than an order of operations. Besides being a rather unique model, Bloom presents an accessible and (perhaps...) safe model for programming eventually consistent programs. ### Sets only? + Bloom's programming model is built around the set. As Aviral discussed in the previous chapter, however, sets are not the only monotonic data structures. Other CRDTs are incredibly useful for programming eventually consistent distributed programs. Recall that a *bounded join semilattice* (CRDT) can be represented as a 3-tuple: `(S, U, ⊥)`. `S` is the set of all elements within the semilattice. `U` is the `least-upper bound` operation. `⊥` is the "least" element within the set. For example, for add-only sets, `S = the set of all sets`, `U = union` and `⊥ = {}`. Elements of these semilattices, when `U` is applied, can only "stay the same or get larger". Sets can only stay the same size or get larger - they can never rollback. For some element `e` in `S`, `e U ⊥` must equal `e`. 
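As a rough illustration of that 3-tuple (a sketch only, with hypothetical names, not Bloom code), here are two such lattices in Python: an add-only set and a max-of-integers value, each with its least element and its least-upper-bound operation.

```python
class GSet:
    """Add-only set lattice: S = all sets, U = union, bottom = the empty set."""
    bottom = frozenset()

    @staticmethod
    def join(a, b):
        return frozenset(a) | frozenset(b)


class MaxInt:
    """Max lattice over non-negative integers: U = max, bottom = 0."""
    bottom = 0

    @staticmethod
    def join(a, b):
        return max(a, b)


# Joining with bottom leaves an element unchanged, and join is idempotent,
# commutative, and associative; states only "stay the same or get larger".
e = frozenset({"candle"})
assert GSet.join(e, GSet.bottom) == e
assert GSet.join(e, e) == e
assert MaxInt.join(3, 7) == MaxInt.join(7, 3) == 7
```

Bloom<sup>L</sup>'s built-in `lmax` lattice, mentioned below, corresponds to the max-of-integers sketch here.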
@@ -502,7 +514,7 @@ These semilattices (and many more!) can be used to program other types of distri Unfortunately, Bloom does not provide support for other CRDTs. In fact, you cannot define your own datatypes at all. You are bound by the collections described. -Bloom<sup>L</sup>{% cite BloomL --file langs-consistency %}, an addendum to the Bloom language, provides support for these types of data structures. Specifically, Bloom<sup>L</sup> does two things: +Bloom<sup>L</sup> {% cite BloomL --file langs-consistency %}, an addendum to the Bloom language, provides support for these types of data structures. Specifically, Bloom<sup>L</sup> does two things: * Adds a number of built-in lattices such as `lmax` (`integerMax`), `lmin`, etc. * Adds an "interface" for lattices: the user can define lattices that "implement" this interface. @@ -544,11 +556,13 @@ Currently Bloom exists as a Ruby prototype: Bud. Hypothetically speaking, there' All in all, Bloom provides programmers with a new model for writing distributed programs. If the user desires monotonic data structures and operations, it's relatively easy to use and reason about. Rather than blindly destroying the properties of your system, you will know exactly when you introduce a possible point of order into your program. It's up to you to decide whether or not you need to introduce coordination. ## Lasp -Lasp {% cite Lasp --file langs-consistency %}is an Erlang library which aims to facilitate this type of "disorderly" programming. + +Lasp {% cite Lasp --file langs-consistency %} is an Erlang library which aims to facilitate this type of "disorderly" programming. Lasp provides access to myriad of CRDTs. The programmer can have confidence that the CRDTs obey the lattice formal requirements. Like Bloom<sup>L</sup>, if the user desires a new lattice he or she may implement it using an interface. A Simple Lasp Program is defined as either a: + * Single CRDT instance * A "Lasp process" with *m* inputs, all Simple Lasp Programs, and one output CRDT instance diff --git a/chapter/8/big-data.md b/chapter/8/big-data.md index a04c72a..4f07722 100644 --- a/chapter/8/big-data.md +++ b/chapter/8/big-data.md @@ -4,9 +4,10 @@ title: "Large Scale Parallel Data Processing" by: "Jingjing and Abhilash Mysore Somashekar" --- ## Introduction -The growth of Internet has generated the so-called big data(terabytes or petabytes). It is not possible to fit them into a single machine or process them with one single program. Often the computation has to be done fast enough to provide practical services. A common approach taken by tech giants like Google, Yahoo, Facebook is to process big data across clusters of commodity machines. Many of the computations are conceptually straightforward, and Google proposed the MapReduce framework, which separates the programming logic and underlying execution details(data distribution, fault tolerance and scheduling). The model has been proved to be simple and powerful, and from then on, the idea inspired many other programming models. -This chapter covers the original idea of MapReduce framework, split into two sections: programming model and execution model. For each section, we first introduce the original design for MapReduce and its limitations. Then we present follow-up models(e.g. FlumeJava) to either work around these limitations or other models (e.g. Dryad, Spark) that take alternative designs to circumvent inabilities of MapReduce. 
We also review declarative programming interfaces(Pig, Hive, SparkSQL) built on top of MapReduce frameworks to provide programming efficiency and optimization benefits. In the last section, we briefly outline the ecosystem of Hadoop and Spark. +The growth of the Internet has generated so-called big data (terabytes or petabytes). It is not possible to fit this data onto a single machine or process it with one single program in a reasonable timeframe. Often the computation has to be done fast enough to provide practical services. A common approach taken by tech giants like Google, Yahoo, and Facebook is to process big data across clusters of commodity machines. Many of the computations are conceptually straightforward, and Google proposed the MapReduce framework, which separates the programming logic from the underlying execution details (data distribution, fault tolerance, and scheduling). The model has proved to be simple and powerful, and the idea has since inspired many other programming models. + +This chapter covers the original idea of the MapReduce framework, split into two sections: the programming model and the execution model. For each section, we first introduce the original design for MapReduce and its limitations. Then we present follow-up models (e.g. FlumeJava) that work around these limitations, as well as other models (e.g. Dryad, Spark) which take alternative designs to circumvent the shortcomings of MapReduce. We also review declarative programming interfaces (Pig, Hive, and SparkSQL) built on top of MapReduce frameworks to provide programming efficiency and optimization benefits. In the last section, we briefly outline the ecosystem of Hadoop and Spark. Outline 1. Programming Models @@ -22,9 +23,11 @@ Outline - 3.1 Hadoop ecosystem - 3.2 Spark ecosystem -## 1 Programming Models -### 1.1 Data parallelism -*Data parallelism* is, given a dataset, the simultaneous execution on multiple machines or threads of the same function across groups of elements of a dataset. Data parallelism can also be thought of as a subset of SIMD ("single instruction, multiple data") execution, a class of parallel execution in Flynn's taxonomy. Comparably, one could think a sequential computation as *"for all elements in the dataset, do operation A"* on a single big dataset, whose size can reach to terabytes or petabytes. The challenges to doing this sequential computation in a parallelized manner include how to abstract the different types of computations in a simple and correct way, how to distribute the data to hundreds/thousands of machines or clusters, how to schedule tasks and handle failures and so on. +## Programming Models + +### Data parallelism + +*Data parallelism* is, given a dataset, the simultaneous execution on multiple machines or threads of the same function across groups of elements of a dataset. Data parallelism can also be thought of as a subset of SIMD ("single instruction, multiple data") execution, a class of parallel execution in Flynn's taxonomy. By comparison, one could think of a sequential computation as *"for all elements in the dataset, do operation A"* on a single big dataset, whose size can reach terabytes or petabytes. The challenges to doing this sequential computation in a parallelized manner include how to abstract the different types of computations in a simple and correct way, how to distribute the data to hundreds or thousands of machines or clusters, and how to schedule tasks and handle failures.
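As a small, single-machine illustration of the idea (a sketch only, not how any of the frameworks below are implemented), the same function can be applied to chunks of a dataset on a pool of workers and the partial results combined afterwards; the example here is a word count, the same problem used to introduce MapReduce later in this section.

```python
from multiprocessing import Pool

def count_words(chunk):
    # The "same function" applied independently to each group of elements.
    counts = {}
    for line in chunk:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def parallel_word_count(lines, workers=4):
    # Split the dataset into roughly equal chunks, one per worker.
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with Pool(workers) as pool:
        partials = pool.map(count_words, chunks)
    # Combine the partial results into a single answer.
    totals = {}
    for partial in partials:
        for word, count in partial.items():
            totals[word] = totals.get(word, 0) + count
    return totals

if __name__ == "__main__":
    print(parallel_word_count(["to be or not to be", "to do is to be"]))
```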
<figure class="main-container"> <img src="{{ site.baseurl }}/resources/img/data-parallelism.png" alt="Data Parallelism" /> @@ -32,112 +35,119 @@ Outline **MapReduce** {% cite dean2008mapreduce --file big-data %} is a programming model proposed by Google to initially satisfy their demand of large-scale indexing for web search service. It provides a simple user program interface: *map* and *reduce* functions and automatically handles the parallelization and distribution. The underlying execution systems can provide fault tolerance and scheduling. -The MapReduce model is simple and powerful and quickly becomes very popular among developers. However, when developers start writing real-world applications, they often end up writing many boilerplates and chaining together these stages. Moreover, The pipeline of MapReduce forces them to write additional coordinating codes, i.e., the development style goes backward from simple logic computation abstraction to lower-level coordination management. As we will discuss in *section 2 execution model*, MapReduce writes all data into disk after each stage, which causes severe delays. Programmers need to do manual optimizations for targeted performance, and this again requires them to understand the underlying execution model. The whole process soon becomes cumbersome. **FlumeJava** {%cite chambers2010flumejava --file big-data%} library intends to provide support for developing data-parallel pipelines by abstracting away the complexity involved in data representation and implicitly handling the optimizations. It defers the evaluation, constructs an execution plan from parallel collections, optimizes the plan, and then executes underlying MR primitives. The optimized execution is comparable with hand-optimized pipelines, thus there is no much need to write raw MR programs directly. +The MapReduce model is simple and powerful, and it quickly became very popular among developers. However, when developers start writing real-world applications, they often end up writing many pieces of boilerplate code and chaining together multiple stages. Moreover, the MapReduce pipeline forces them to write additional coordination code, i.e., the development style regresses from a simple, logical computation abstraction to lower-level coordination management. As we will discuss in the execution model section, MapReduce writes all data to disk after each stage, which causes severe delays. Programmers need to perform manual optimizations to reach their performance targets, and this again requires them to understand the underlying execution model. The whole process soon becomes cumbersome. The **FlumeJava** {% cite chambers2010flumejava --file big-data %} library intends to provide support for developing data-parallel pipelines by abstracting away the complexity involved in data representation and implicitly handling the optimizations. It defers evaluation, constructs an execution plan from parallel collections, optimizes the plan, and then executes the underlying MR primitives. The optimized execution is comparable to hand-optimized pipelines, so there is little need to write raw MR programs directly. -After MapReduce, Microsoft proposed their counterpart data parallelism model: **Dryad** {% cite isard2007dryad --file big-data %}, which abstracts individual computational tasks as vertices, and constructs a communication graph between those vertices.
What programmers need to do is to describe this DAG graph and let Dryad execution engine construct the execution plan and manage scheduling and optimization. One of the advantages of Dryad over MapReduce is that Dryad vertices can process an arbitrary number of inputs and outputs, while MR only supports a single input and a single output for each vertex. Besides the flexibility of computations, Dryad also supports different types of communication channel: file, TCP pipe, and shared-memory FIFO. The programming model is less elegant than MapReduce, programmers are not meant to interact with them directly. Instead, they are expected to use the high-level programming interfaces DryadLinq {% cite yu2008dryadlinq --file big-data %}, which more expressive and well embedded with .NET framework. We can see some examples in the end of *section 1.1.3 Dryad*. +After MapReduce, Microsoft proposed their counterpart data parallelism model, **Dryad** {% cite isard2007dryad --file big-data %}, which abstracts individual computational tasks as vertices, and constructs a communication graph between those vertices. What programmers need to do is to describe this DAG and let the Dryad execution engine construct the execution plan and manage scheduling and optimization. One of the advantages of Dryad over MapReduce is that Dryad vertices can process an arbitrary number of inputs and outputs, while MR only supports a single input and a single output for each vertex. Besides the flexibility of computations, Dryad also supports different types of communication channel: file, TCP pipe, and shared-memory FIFO. The programming model is less elegant than MapReduce, as programmers are not meant to interact with it directly. Instead, they are expected to use the high-level programming interface DryadLinq {% cite yu2008dryadlinq --file big-data %}, which is more expressive and better embedded within the .NET framework. We can see some examples later in the section devote to Dryad. -Dryad expresses computation as acyclic data flows, which might be too expensive for some complex applications, e.g. iterative machine learning algorithms. **Spark** {% cite zaharia2010spark --file big-data%} is a framework that uses functional programming and pipelining to provide such support. It is largely inspired by MapReduce's model and builds upon the ideas behind DAG, lazy evaluation of DryadLinq. Instead of writing data to disk for each job as MapReduce does Spark can cache the results across jobs. Spark explicitly caches computational data in memory through specialized immutable data structure named Resilient Distributed Sets(RDD) and reuse the same dataset across multiple parallel operations. The Spark builds upon RDD to achieve fault tolerance by reusing the lineage information of the lost RDD. This results in lesser overhead than what is seen in fault tolerance achieved by the checkpoint in Distributed Shared Memory systems. Moreover, Spark is the underlying framework upon which many very different systems are built, e.g., Spark SQL & DataFrames, GraphX, Streaming Spark, which makes it easy to mix and match the use of these systems all in the same application.These feature makes Spark the best fit for iterative jobs and interactive analytics and also helps it in providing better performance. +Dryad expresses computation as acyclic data flows, which might be too expensive for some complex applications e.g. iterative machine learning algorithms. 
**Spark** {% cite zaharia2010spark --file big-data %} is a framework that uses functional programming and pipelining to provide such support. Spark is largely inspired by MapReduce's model and builds upon the ideas behind using DAGs and lazy evaluation found in DryadLinq. Instead of writing data to disk for each job as MapReduce does Spark can cache the results across jobs. Spark explicitly caches computational data in memory through specialized immutable data structure named Resilient Distributed Sets (RDD) and reuse the same dataset across multiple parallel operations. The Spark builds upon RDDs to achieve fault tolerance by reusing the lineage information of the lost RDD. This results in less overhead than what is seen in fault tolerance achieved by using checkpoints in Distributed Shared Memory systems. Moreover, Spark is the underlying framework upon which many very different systems are built e.g. Spark SQL & DataFrames, GraphX, Streaming Spark, which makes it easy to mix and match the use of these systems all in the same application. These features make Spark the best fit for iterative jobs and interactive analytics and also help it to provide better performance. -Following four sections discuss the programming models of MapReduce, FlumeJava, Dryad, and Spark. +The following four sections discuss the programming models of MapReduce, FlumeJava, Dryad, and Spark. -### 1.1.1 MapReduce +#### MapReduce In this model, parallelizable computations are abstracted into map and reduce functions. The computation accepts a set of key/value pairs as input and produces a set of key/value pairs as output. The process involves two phases: -- *Map*, written by the user, accepts a set of key/value pairs("record") as input, applies *map* operation on each record, then it computes a set of intermediate key/value pairs as output. -- *Reduce*, also written by the user, accepts an intermediate key and a set of values associated with that key, operate on them, produces zero or one output value. - Note: there is a *Shuffle* phase between *map* and *reduce*, provided by MapReduce library, groups the all the intermediate values of the same key together and pass to *Reduce* function. We will discuss more in Section 2 Execution Models. +- **Map**, written by the user, accepts a set of key/value pairs (a "record") as input, applies *map* operation on each record, and computes a set of intermediate key/value pairs as output. +- **Reduce**, also written by the user, accepts an intermediate key and a set of values associated with that key, operates on them, and produces zero or one output value. -Conceptually, the map and reduction functions have associated **types**: +Note: there is also a *Shuffle* phase between *Map* and *Reduce*, provided by MapReduce library, which groups all the intermediate values of the same key together and passes them to the *Reduce* function. This is discussed more in the section on Execution Models. + +Conceptually, the map and reduce functions have associated **types**: \\[map (k1,v1) \rightarrow list(k2,v2)\\] \\[reduce (k2,list(v2)) \rightarrow list(v2)\\] - The input keys and values are drawn from a different domain than the output keys and values. The intermediate keys and values are from the same domain as the output keys and values. 
- -Concretely, considering the problem of counting the number of occurrence of each word in a large collection of documents: each time, a `map` function that emits a word plus its count 1; a `reduce` function sums together all counts emitted for the same word +As an example, we can consider the problem of counting the number of occurrences of each word in a large collection of documents. This could be modeled as a `map` function that emits a word plus its count `1` and a `reduce` function which sums together all the counts emitted for the same word. Pseudocode of the map and reduce phases to solve this problem is given below. ``` map(String key, String value): // key: document name // value: document contents - for each word w in value: - EmitIntermediate(w, "1"); + for each word in value: + EmitIntermediate(word, "1"); reduce(String key, Iterator values): // key: a word // values: a list of counts int result = 0; - for each v in values: - result += ParseInt(v); + for each value in values: + result += ParseInt(value); Emit(AsString(result)); ``` -During executing, the MapReduce library assigns a master node to manage data partition and scheduling, other nodes can serve as workers to run either *map* or *reduce* operations on demands. More details of the execution model are discussed later. Here, it's worth mentioning that the intermediate results are written into disks and reduce operation will read from disk. This is crucial for fault tolerance. +During execution, the MapReduce library assigns a master node to manage data partition and scheduling. Other nodes can serve as workers to run either *map* or *reduce* operations on demand. More details of the execution model are discussed later. Here, it's worth mentioning that the intermediate results output by the map phase are written to disk and the reduce operation then reads its input from disk. This is crucial for fault tolerance. + +*Fault Tolerance* + +MapReduce runs on hundreds or thousands of unreliable commodity machines, so the library must provide fault tolerance. The library assumes that master node will not fail, and it monitors worker failures. If no status update is received from a worker on timeout, the master will mark it as failed. Then the master may schedule the associated task to other workers depending on task type and status. The commits of *map* and *reduce* task outputs are atomic, where the in-progress task writes data into private temporary files, and then once the task succeeds, it negotiates with the master and renames files to complete the task. In the case of failure, the worker discards those temporary files. This guarantees that if the computation is deterministic, the distributed implementation should produce the same outputs as a failure-free sequential execution. + +*Limitations* -*Fault Tolerance* -MapReduce runs on hundreds or thousands of unreliable commodity machines, so the library must provide fault tolerance. The library assumes that master node would not fail, and it monitors worker failures. If no status update is received from a worker on timeout, the master will mark it as failed. Then the master may schedule the associated task to other workers depending on task type and status. The commits of *map* and *reduce* task outputs are atomic, where the in-progress task writes data into private temporary files, once the task succeeds, it negotiate with the master and rename files to complete the task. In the case of failure, the worker discards those temporary files. 
This guarantees that if the computation is deterministic, the distribution implementation should produce same outputs as non-faulting sequential execution. +Many analytics workloads like K-means, logistic regression, graph processing applications like PageRank, and shortest path using parallel breadth-first search require multiple stages of MapReduce jobs. In a regular MapReduce framework like Hadoop, this requires the developer to manually handle the iterations in the driver code. At every iteration, the result of each stage T is written to HDFS and loaded back again at stage T+1 causing a performance bottleneck. The reason for this is waste of network bandwidth, CPU resources, and primarily inherently slow disk I/O operations. In order to address such challenges in iterative workloads on MapReduce, frameworks like Haloop {% cite bu2010haloop --file big-data %}, Twister {% cite ekanayake2010twister --file big-data %}, and iMapReduce {% cite zhang2012imapreduce --file big-data %} adopt special techniques like caching the data between iterations and keeping the mapper and reducer alive across the iterations. -*Limitations* -Many analytics workloads like K-means, logistic regression, graph processing applications like PageRank, shortest path using parallel breadth-first search require multiple stages of MapReduce jobs. In regular MapReduce framework like Hadoop, this requires the developer to manually handle the iterations in the driver code. At every iteration, the result of each stage T is written to HDFS and loaded back again at stage T+1 causing a performance bottleneck. The reason being wastage of network bandwidth, CPU resources, and mainly the disk I/O operations which are inherently slow. In order to address such challenges in iterative workloads on MapReduce, frameworks like Haloop {% cite bu2010haloop --file big-data %}, Twister {% cite ekanayake2010twister --file big-data %} and iMapReduce {% cite zhang2012imapreduce --file big-data %} adopt special techniques like caching the data between iterations and keeping the mapper and reducer alive across the iterations. +#### FlumeJava +FlumeJava {% cite chambers2010flumejava --file big-data %} was introduced to make it easy to develop, test, and run efficient data-parallel pipelines. FlumeJava represents each dataset as an object and transformation is invoked by applying methods on these objects. It constructs an efficient internal execution plan from a pipeline of MapReduce jobs, uses deferred evaluation and optimizes based on plan structures. The debugging ability allows programmers to run on their local machine first and then deploy their job to large clusters. -### 1.1.2 FlumeJava -FlumeJava {%cite chambers2010flumejava --file big-data %}was introduced to make it easy to develop, test, and run efficient data-parallel pipelines. FlumeJava represents each dataset as an object and transformation is invoked by applying methods on these objects. It constructs an efficient internal execution plan from a pipeline of MapReduce jobs, uses deferred evaluation and optimizes based on plan structures. The debugging ability allows programmers to run on the local machine first and then deploy to large clusters. +*Core Abstractions* -*Core Abstraction* -- `PCollection<T>`, a immutable bag of elements of type `T`, it can be created from in-memory Java `Collection<T>` or from reading a file with encoding specified by `recordOf`. 
+- `PCollection<T>`, a immutable bag of elements of type `T`, which can be created from an in-memory Java `Collection<T>` or from reading a file with encoding specified by `recordOf`. - `recordOf(...)`, specifies the encoding of the instance -- `PTable<K, V>`, a subclass of `PCollection<Pair<K,V>>`, a immutable multi-map with keys of type `K` and values of type `V` -- `parallelDo()`, can be expressed both the map and reduce parts of MapReduce +- `PTable<K, V>`, a subclass of `PCollection<Pair<K,V>>`, an immutable multi-map with keys of type `K` and values of type `V` +- `parallelDo()`, can express both the map and reduce parts of MapReduce - `groupByKey()`, same as shuffle step of MapReduce - `combineValues()`, semantically a special case of `parallelDo()`, a combination of a MapReduce combiner and a MapReduce reducer, which is more efficient than doing all the combining in the reducer. -- `flatten`, takes a list of `PCollection<T>`s and returns a single logic `PCollection<T>`. +- `flatten`, takes a list of `PCollection<T>`s and returns a single `PCollection<T>`. An example implemented using FlumeJava: + ```java -PTable<String,Integer> wordsWithOnes = +PTable<String, Integer> wordsWithOnes = words.parallelDo( - new DoFn<String, Pair<String,Integer>>() { - void process(String word, - EmitFn<Pair<String,Integer>> emitFn) { - emitFn.emit(Pair.of(word, 1)); - } - }, tableOf(strings(), ints())); -PTable<String,Collection<Integer>> + new DoFn<String, Pair<String, Integer>> () { + void process(String word, + EmitFn<Pair<String, Integer>> emitFn) { + emitFn.emit(Pair.of(word, 1)); + } + }, tableOf(strings(), ints())); +PTable<String, Collection<Integer>> groupedWordsWithOnes = wordsWithOnes.groupByKey(); -PTable<String,Integer> wordCounts = +PTable<String, Integer> wordCounts = groupedWordsWithOnes.combineValues(SUM_INTS); ``` -*Deferred Evaluation & Optimizer* -One of the merits of using FlumeJava to pipeline MapReduce jobs is that it enables optimization automatically, by executing parallel operations lazily using *deferred evaluation*. The state of each `PCollection` object is either *deferred* (not yet computed) and *materialized* (computed). When the program invokes a *parallelDo()*, it creates an operation pointer to the actual deferred operation object. These operations form a directed acyclic graph called execution plan. The execution plan doesn't get evaluated until *run()* is called. This will cause optimization of the execution plan and evaluation in forward topological order. These optimization strategies for transferring the modular execution plan into an efficient one include: -- Fusion: $$f(g(x)) => g \circ f(x)$$, which is essentially function composition. This usually help reduce steps. -- MapShuffleCombineReduce (MSCR) Operation: combination of ParallelDo, GroupByKey, CombineValues and Flatten into one MapReduce job. This extends MapReduce to accept multiple inputs and multiple outputs. Following figure illustrates the case a MSCR operation with 3 input channels, 2 grouping(GroupByKey) output channels and 1 pass-through output channel. - <figure class="main-container"> +*Deferred Evaluation & Optimization* + +One of the merits of using FlumeJava to pipeline MapReduce jobs is that it enables optimization automatically, by executing parallel operations lazily using *deferred evaluation*. The state of each `PCollection` object is either *deferred* (not yet computed) and *materialized* (computed). 
When the program invokes a `parallelDo()`, it creates an operation pointer to the actual deferred operation object. These operations form a directed acyclic graph called the execution plan. The execution plan doesn't get evaluated until `run()` is called. This will cause optimization of the execution plan and evaluation in forward topological order. These optimization strategies for transforming the modular execution plan into an efficient one include:
+
+- **Fusion**: $$g(f(x)) \rightarrow (g \circ f)(x)$$, which is essentially function composition. This usually helps reduce the number of steps needed for a given job by combining multiple composable steps into one.
+- **MapShuffleCombineReduce (MSCR) Operation**: a combination of `ParallelDo`, `GroupByKey`, `CombineValues` and `Flatten` into one MapReduce job. This extends MapReduce to accept multiple inputs and multiple outputs. The following figure illustrates an MSCR operation with 3 input channels, 2 grouping (`GroupByKey`) output channels and 1 pass-through output channel.
+
+<figure class="main-container">
 <img src="{{ site.baseurl }}/resources/img/mscr.png" alt="A MapShuffleCombineReduce operation with 3 input channels" />
- </figure>
+</figure>

An overall optimizer strategy involves a sequence of optimization actions with the ultimate goal to produce the fewest, most efficient MSCR operations:
+
1. Sink Flatten: $$h(f(a)+g(b)) \rightarrow h(f(a)) + h(g(b))$$
-2. Lift combineValues operations: If *CombineValues* operation immediately follows a *GroupByKey* operation, the GroupByKey records the fact and original *CombineValues* is left in place, which can be treated as normal *ParallelDo* operation and subject to ParallelDo fusions.
+2. Lift combineValues operations: If a `CombineValues` operation immediately follows a `GroupByKey` operation, the `GroupByKey` records the fact and the original `CombineValues` is left in place, where it can be treated as a normal `ParallelDo` operation and is subject to ParallelDo fusion.
3. Insert fusion blocks:
-4. Fuse ParallelDos
+4. Fuse `ParallelDo`s
5. Fuse MSCRs: create MSCR operations, and convert any remaining unfused ParallelDo operations into trivial MSCRs.

-The SiteData example{%cite chambers2010flumejava --file big-data %} shows that 16 data-parallel operations can be optimized into two MSCR operations in the final execution plan (refer to Figure 5 in the original paper). One limitation of the optimizer is that all these optimizations are based on the structures of the execution plan, FlumeJava doesn't analyze user-defined functions.
+The SiteData example {% cite chambers2010flumejava --file big-data %} shows that 16 data-parallel operations can be optimized into two MSCR operations in the final execution plan (refer to Figure 5 in the original paper). One limitation of the optimizer is that all these optimizations are based on the structure of the execution plan; FlumeJava doesn't analyze the contents of user-defined functions.

-### 1.1.3 Dryad
-Dryad is a general-purpose data-parallel execution engine that allows developers to *explicitly* specify an arbitrary directed acyclic graph (DAG) for computations, where each vertex is a computation task and the edges represent communication channels(file, TCP pipe, or shared-memory FIFI) between tasks.
+#### Dryad

-A Dryad job is a logic computation graph that is automatically mapped to physical resources at runtime. 
From programmers' point of view, the channels produce or consume heap objects and the type of data channel makes no difference to reading or writing these objects. In Dryad system, a process called "job manager" connects to the cluster network and is responsible for scheduling jobs by consulting the name server (NS) and delegating commands to the daemon (D) running on each computer in the cluster. +Dryad is a general-purpose data-parallel execution engine that allows developers to *explicitly* specify an arbitrary directed acyclic graph (DAG) for computations, where each vertex is a computation task and the edges represent communication channels (file, TCP pipe, or shared-memory FIFI) between tasks. +A Dryad job is a logic computation graph that is automatically mapped to physical resources at runtime. From a programmer's point of view, the channels produce or consume heap objects and the type of data channel makes no difference when reading or writing these objects. In the Dryad system, a process called a "job manager" connects to the cluster network and is responsible for scheduling jobs by consulting the name server (NS) and delegating commands to the daemon (D) running on each computer in the cluster. -*Writing program* +*Writing a program* The Dryad library is written in C++ and it uses a mixture of method calls and operator overloading. It describes a Dryad graph as $$G=\langle V_G, E_G, I_G, O_G \rangle$$, where $$V_G$$ is a sequences of vertices, $$E_G$$ is a set of directed edges, $$I_G$$ and $$O_G$$ represent vertices for *inputs* and *outputs*. @@ -171,10 +181,10 @@ GraphBuilder HOutputs = HSet >= output; GraphBuilder final = XInputs || YInputs || XToY || YToH || HOutputs; ``` -In fact, developers are not expected to write raw Dryad programs as complex as above. Instead, Microsoft introduced a querying model DryadLINQ {% cite yu2008dryadlinq --file big-data %} which is more declarative. We will discuss querying models and their power to express complex operations like join in *section 1.2 Querying*. Here we just show a glimpse of querying example in DryadLINQ (who is compiled into Dryad jobs and executed in Dryad execution engine): +Developers are not expected to write raw Dryad programs as complex as the one shown above. Instead, Microsoft introduced a querying model called DryadLINQ {% cite yu2008dryadlinq --file big-data %} which is more declarative. We will discuss querying models and their power to express complex operations like join in *section 1.2 Querying*. Here we just show a glimpse of querying example in DryadLINQ (who is compiled into Dryad jobs and executed in Dryad execution engine): ```c# -//SQL-style syntax to join two input sets: +// SQL-style syntax to join two input sets: // scoreTriples and staticRank var adjustedScoreTriples = from d in scoreTriples @@ -198,19 +208,20 @@ var rankedQueries = ``` *Fault tolerance policy* + The communication graph is acyclic, so if given immutable inputs, the computation result should remain same regardless of the sequence of failures. When a vertex fails, the job manager will either get notified or receive a heartbeat timeout and then the job manager will immediately schedule to re-execute the vertex. *Comparison with FlumeJava* -Both support multiple inputs/outputs for the computation nodes. The big difference is that FlumeJava still exploits the MapReduce approach to reading from/writing to disks between stages, where Dryad has the option to do in-memory transmission. 
This leaves Dryad a good position to do optimization like re-using in-memory data. In the other hand, Dryad has no optimizations on the graph itself. +Both support multiple inputs/outputs for the computation nodes. The big difference is that FlumeJava still exploits the MapReduce approach to reading from/writing to disks between stages, where Dryad has the option to do in-memory transmission. This leaves Dryad a good position to do optimization like re-using in-memory data. In the other hand, Dryad has no optimizations on the graph itself. -### 1.1.4 Spark +#### Spark -Spark {%cite zaharia2010spark --file big-data %} is a fast, in-memory data processing engine with an elegant and expressive development interface which enables developers to efficiently execute machine learning, SQL or streaming workloads that require fast iterative access to datasets. It is a functional style programming model (similar to DryadLINQ) where a developer can create acyclic data flow graphs and transform a set of input data through a map - reduce like operators. Spark provides two main abstractions: distributed in-memory storage (RDD) and parallel operations (based on Scala’s collection API) on data sets with high-performance processing, scalability, and fault tolerance. +Spark {% cite zaharia2010spark --file big-data %} is a fast, in-memory data processing engine with an elegant and expressive development interface which enables developers to efficiently execute machine learning, SQL or streaming workloads that require fast iterative access to datasets. It is a functional style programming model (similar to DryadLINQ) where a developer can create acyclic data flow graphs and transform a set of input data through a map - reduce like operators. Spark provides two main abstractions: distributed in-memory storage (RDD) and parallel operations (based on Scala’s collection API) on data sets with high-performance processing, scalability, and fault tolerance. *Distributed in-memory storage - Resilient Distributed Data sets* -RDD is a partitioned, read-only collection of objects which can be created from data in stable storage or by transforming other RDD. It can be distributed across multiple nodes (parallelize) in a cluster and is fault tolerant(resilient). If a node fails, an RDD can always be recovered using its lineage; the DAG of computations performed on the source dataset. An RDD is stored in memory (as much as it can fit and rest is spilled to disk) and is immutable - It can only be transformed to a new RDD. These transformations are deferred; that means they are built up and staged and are not actually applied until an action is performed on an RDD. Thus, it is important to note that while one might have applied many transformations to a given RDD, some resulting transformed RDD may not be materialized even though one may hold a reference to it. +A RDD is a partitioned, read-only collection of objects which can be created from data in stable storage or by transforming other RDD. It can be distributed across multiple nodes (parallelize) in a cluster and is fault tolerant(resilient). If a node fails, an RDD can always be recovered using its lineage; the DAG of computations performed on the source dataset. An RDD is stored in memory (as much as it can fit and rest is spilled to disk) and is immutable - It can only be transformed to a new RDD. These transformations are deferred; that means they are built up and staged and are not actually applied until an action is performed on an RDD. 
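As a minimal sketch of this deferred behaviour (assuming an existing `SparkContext` named `sc` and a hypothetical log-file path):

```scala
// Each transformation only records a step in the lineage; no data is read yet.
val lines  = sc.textFile("hdfs:///logs/app.log")   // deferred
val errors = lines.filter(_.contains("ERROR"))     // deferred
val codes  = errors.map(_.split(" ")(0))           // deferred

// Only an action forces the staged transformations to actually run.
val numErrors = errors.count()
```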
Thus, it is important to note that while one might have applied many transformations to a given RDD, some resulting transformed RDD may not be materialized even though one may hold a reference to it. The properties that power RDD with the above-mentioned features: - A list of dependencies on other RDD’s. @@ -224,7 +235,6 @@ The properties that power RDD with the above-mentioned features: <img src="./spark_pipeline.png" alt="Spark pipeline" /> </figure> - Spark API provide two kinds of operations on an RDD: - Transformations - lazy operations that return another RDD. @@ -242,22 +252,21 @@ Spark API provide two kinds of operations on an RDD: - `collect()` Return an Array[T] containing all elements - `count()` Return the number of elements -RDDs by default are discarded after use. However, Spark provides two explicit operations: persist() and cache() to ensure RDDs are persisted in memory once the RDD has been computed for the first time. +RDDs by default are discarded after use. However, Spark provides two explicit operations: `persist()` and `cache()` to ensure RDDs are persisted in memory once the RDD has been computed for the first time. *Why RDD, not Distributed Shared memory (DSM) ?* RDDs are immutable and can only be created through coarse-grained transformations while DSM allows fine-grained read and write operations to each memory location. Since RDDs are immutable and can be derived from their lineages, they do not require checkpointing at all. Hence RDDs do not incur the overhead of checkpointing as DSM does. Additionally, in DSM, any failure requires the whole program to be restored. In the case of RDDs, only the lost RDD partitions need to be recovered. This recovery happens parallelly on the affected nodes. RDDs are immutable and hence a straggler (slow node) can be replaced with a backup copy as in MapReduce. This is hard to implement in DSM as two copies point to the same location and can interfere the update with one another. - - -***Challenges in Spark*** {% cite armbrust2015scaling --file big-data%} +***Challenges in Spark*** {% cite armbrust2015scaling --file big-data %} - *Functional API semantics* The `GroupByKey` operator is costly in terms of performance. In that it returns a distributed collection of (key, list of value) pairs to a single machine and then an aggregation on individual keys is performed on the same machine resulting in computation overhead. Spark does provide `reduceByKey` operator which does a partial aggregation on individual worker nodes before returning the distributed collection. However, developers who are not aware of such a functionality can unintentionally choose `groupByKey`. The reason being functional programmers (Scala developers) tend to think more declaratively about the problem and only see the end result of the `groupByKey` operator. They may not be necessarily trained on how `groupByKey` is implemented atop of the cluster. Therefore, to use Spark, unlike functional programming languages, one needs to understand how the underlying cluster is going to execute the code. The burden of saving performance is then left to the programmer, who is expected to understand the underlying execution model of Spark, and to know when to use `reduceByKey` over `groupByKey`. - *Debugging and profiling* There is no availability of debugging tools and developers find it hard to realize if a computation is happening more on a single machine or if the data-structure they used were inefficient. 
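To make the first of these challenges concrete, here is a sketch of word counting written both ways (assuming `words` is an existing `RDD[String]`). The two versions return the same result, but the second pre-aggregates on each worker before shuffling:

```scala
// groupByKey ships every (word, 1) pair across the network and only then sums per key.
val countsViaGroup = words.map(w => (w, 1)).groupByKey().mapValues(_.sum)

// reduceByKey combines counts locally on each partition first, so far less data is shuffled.
val countsViaReduce = words.map(w => (w, 1)).reduceByKey(_ + _)
```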
-### 1.2 Querying: declarative interfaces -MapReduce takes care of all the processing over a cluster, failure and recovery, data partitioning etc. However, the framework suffers from rigidity with respect to its one-input data format (key/value pair) and two-stage data flow. Several important patterns like equi-joins and theta-joins {% cite okcan2011processing --file big-data%} which could be highly complex depending on the data, require programmers to implement by hand. Hence, MapReduce lacks many such high level abstractions requiring programmers to be well versed with several of the design patterns like map-side joins, reduce-side equi-join etc. Also, java based code (like in Hadoop framework) in MapReduce can sometimes become repetitive when the programmer wants to implement most common operations like projection, filtering etc. A simple word count program as shown below, can span up to 63 lines. +### Querying: declarative interfaces + +MapReduce takes care of all the processing over a cluster, failure and recovery, data partitioning etc. However, the framework suffers from rigidity with respect to its one-input data format (key/value pair) and two-stage data flow. Several important patterns like equi-joins and theta-joins {% cite okcan2011processing --file big-data %} which could be highly complex depending on the data, require programmers to implement by hand. Hence, MapReduce lacks many such high level abstractions requiring programmers to be well versed with several of the design patterns like map-side joins, reduce-side equi-join etc. Also, java based code (like in Hadoop framework) in MapReduce can sometimes become repetitive when the programmer wants to implement most common operations like projection, filtering etc. A simple word count program as shown below, can span up to 63 lines. *Complete code for Word count in Hadoop (Java based implementation of MapReduce)* @@ -325,70 +334,71 @@ public class WordCount *Why SQL over MapReduce ?* SQL already provides several operations like join, group by, sort which can be mapped to the above mentioned MapReduce operations. Also, by leveraging SQL like interface, it becomes easy for non MapReduce experts/non-programmers like data scientists to focus more on logic than hand coding complex operations {% cite armbrust2015scaling --file big-data%}. Such an high level declarative language can easily express their task while leaving all of the execution optimization details to the backend engine. + SQL also lessens the amount of code (code examples can be seen in individual model’s section) and significantly reduces the development time. + Most importantly, as you will read further in this section, frameworks like Pig, Hive, Spark SQL take advantage of these declarative queries by realizing them as a DAG upon which the compiler can apply transformation if an optimization rule is satisfied. Spark which does provide high level abstraction unlike MapReduce, lacks this very optimization resulting in several human errors as discussed in the Spark’s data-parallel section. -Sawzall {% cite pike2005interpreting --file big-data%} is a programming language built on top of MapReduce. It consists of a *filter* phase (map) and an *aggregation* phase (reduce). User program only need to specify the filter function, and emit the intermediate pairs to external pre-built aggregators. 
This largely eliminates the trouble for programmers put into having to write reducers, just the following example shows, programmers can use built-in reducer supports to do the a reducing job. The serialization of the data uses Google's *protocol buffers*, which can produce *meta-data* file for the declared scheme, but the scheme is not used for any optimization purpose per se. Sawzall is good for most of the straightforward processing on large dataset, but it does not support more complex and still common operations like *join*. The pre-built aggregators are limited and it is non-trivial to add more supports.
+Sawzall {% cite pike2005interpreting --file big-data %} is a programming language built on top of MapReduce. It consists of a *filter* phase (map) and an *aggregation* phase (reduce). The user program only needs to specify the filter function and emit the intermediate pairs to external pre-built aggregators. This largely eliminates the trouble programmers would otherwise put into writing reducers; as the following example shows, programmers can use the built-in aggregators to do the reducing. The serialization of the data uses Google's *protocol buffers*, which can produce a *meta-data* file for the declared schema, but the schema is not used for any optimization purpose per se. Sawzall is good for most straightforward processing on large datasets, but it does not support more complex and still common operations like *join*. The pre-built aggregators are limited and it is non-trivial to add new ones.

- *Word count implementation in Sawzall*
- ```
- result: table sum of int;
- total: table sum of float;
- x: float = input;
- emit count <- 1;
- emit total <- x;
- ```
+```
+count: table sum of int;
+total: table sum of float;
+x: float = input;
+emit count <- 1;
+emit total <- x;
+```

-Apart from Sawzall, Pig {%cite olston2008pig --file big-data %} and Hive {%cite thusoo2009hive --file big-data %} are the other major components that sit on top of Hadoop framework for processing large data sets without the users having to write Java based MapReduce code. Both support more complex operations than Sawzall: e.g. database join.
+Apart from Sawzall, Pig {% cite olston2008pig --file big-data %} and Hive {% cite thusoo2009hive --file big-data %} are the other major components that sit on top of the Hadoop framework for processing large data sets without the users having to write Java based MapReduce code. Both support more complex operations than Sawzall, e.g. database joins.

-Hive is built by Facebook to organize dataset in structured formats and still utilize the benefit of MapReduce framework. It has its own SQL-like language: HiveQL {%cite thusoo2010hive --file big-data %} which is easy for anyone who understands SQL. Hive reduces code complexity and eliminates lots of boiler plate that would otherwise be an overhead with Java based MapReduce approach.
+Hive was built by Facebook to organize datasets in structured formats while still utilizing the benefits of the MapReduce framework. It has its own SQL-like language, HiveQL {% cite thusoo2010hive --file big-data %}, which is easy for anyone who understands SQL. Hive reduces code complexity and eliminates lots of boilerplate that would otherwise be an overhead with the Java based MapReduce approach. 
- *Word count implementation in Hive* - ``` - CREATE TABLE docs (line STRING); - LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; - CREATE TABLE word_counts AS - SELECT word, count(1) AS count FROM - (SELECT explode(split(line, '\\s')) AS word FROM docs) w - GROUP BY word - ORDER BY word; - ``` +``` +CREATE TABLE docs (line STRING); +LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; +CREATE TABLE word_counts AS +SELECT word, count(1) AS count FROM +(SELECT explode(split(line, '\\s')) AS word FROM docs) w +GROUP BY word +ORDER BY word; +``` Pig Latin by Yahoo aims at a sweet spot between declarative and procedural programming. For advanced programmers, SQL is unnatural to implement program logic and Pig Latin wants to dissemble the set of data transformation into a sequence of steps. This makes Pig more verbose than Hive. Unlike Hive, Pig Latin does not persist metadata, instead it has better interoperability to work with other applications in Yahoo's data ecosystem. - *Word count implementation in PIG* - ``` - lines = LOAD 'input_fule.txt' AS (line:chararray); - words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word; - grouped = GROUP words BY word; - wordcount = FOREACH grouped GENERATE group, COUNT(words); - DUMP wordcount; - ``` +``` +lines = LOAD 'input_fule.txt' AS (line:chararray); +words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word; +grouped = GROUP words BY word; +wordcount = FOREACH grouped GENERATE group, COUNT(words); +DUMP wordcount; +``` SparkSQL though has the same goals as that of Pig, is better given the Spark exeuction engine, efficient fault tolerance mechanism of Spark and specialized data structure called Dataset. - *Word count example in SparkSQL* - ``` - val ds = sqlContext.read.text("input_file").as[String] - val result = ds - .flatMap(_.split(" ")) - .filter(_ != "") - .toDF() - .groupBy($"value") - .agg(count("*") as "count") - .orderBy($"count" desc) - ``` +```scala +val ds = sqlContext.read.text("input_file").as[String] +val result = ds + .flatMap(_.split(" ")) + .filter(_ != "") + .toDF() + .groupBy($"value") + .agg(count("*") as "count") + .orderBy($"count" desc) +``` The following subsections will discuss Hive, Pig Latin, SparkSQL in details. +#### Hive/HiveQL -### 1.2.1 Hive/HiveQL - -Hive {% cite thusoo2010hive --file big-data%} is a data-warehousing infrastructure built on top of the MapReduce framework - Hadoop. The primary responsibility of Hive is to provide data summarization, query, and analysis. It supports analysis of large datasets stored in Hadoop’s HDFS {% cite shvachko2010hadoop --file big-data%}. It supports SQL-Like access to structured data which is known as HiveQL (or HQL) as well as big data analysis with the help of MapReduce. These SQL queries can be compiled into MapReduce jobs that can be executed be executed on Hadoop. It drastically brings down the development time in writing and maintaining Hadoop jobs. +Hive {% cite thusoo2010hive --file big-data %} is a data-warehousing infrastructure built on top of the MapReduce framework - Hadoop. The primary responsibility of Hive is to provide data summarization, query, and analysis. It supports analysis of large datasets stored in Hadoop’s HDFS {% cite shvachko2010hadoop --file big-data %}. It supports SQL-Like access to structured data which is known as HiveQL (or HQL) as well as big data analysis with the help of MapReduce. These SQL queries can be compiled into MapReduce jobs that can be executed be executed on Hadoop. 
It drastically brings down the development time in writing and maintaining Hadoop jobs. Data in Hive is organized into three different formats: @@ -398,7 +408,8 @@ Data in Hive is organized into three different formats: `Buckets`: Data in each partition can be further divided into buckets on the basis on hash of a column in a table. Each bucket is stored as a file in the partition directory. -***HiveSQL***: Hive query language consists of a subset of SQL along with some extensions. The language is very SQL-like and supports features like subqueries, joins, cartesian product, group by, aggregation, describe and more. MapReduce programs can also be used in Hive queries. A sample query using MapReduce would look like this: +***HiveSQL***: The Hive query language consists of a subset of SQL along with some extensions. The language is very SQL-like and supports features like subqueries, joins, cartesian product, group by, aggregation, describe and more. MapReduce programs can also be used in Hive queries. A sample query using MapReduce would look like this: + ``` FROM ( MAP inputdata USING 'python mapper.py' AS (word, count) @@ -407,18 +418,18 @@ FROM ( ) REDUCE word, count USING 'python reduce.py'; ``` -*Example from {% cite thusoo2010hive --file big-data%}* -This query uses mapper.py for transforming inputdata into (word, count) pair, distributes data to reducers by hashing on word column (given by CLUSTER) and uses reduce.py. +*Example from {% cite thusoo2010hive --file big-data %}* +This query uses mapper.py for transforming inputdata into (word, count) pair, distributes data to reducers by hashing on word column (given by CLUSTER) and uses reduce.py. ***Serialization/Deserialization*** -Hive implements the LazySerDe as the default SerDe interface. A SerDe is a combination of serialization and deserialization which helps developers instruct Hive on how their records should be processed. The Deserializer interface translates rows into internal objects lazily so that the cost of Deserialization of a column is incurred only when it is needed. The Serializer, however, converts a Java object into a format that Hive can write to HDFS or another supported system. Hive also provides a RegexSerDe which allows the use of regular expressions to parse columns out from a row. +Hive implements the LazySerDe as the default SerDe interface. A SerDe is a combination of serialization and deserialization which helps developers instruct Hive on how their records should be processed. The Deserializer interface translates rows into internal objects lazily so that the cost of Deserialization of a column is incurred only when it is needed. The Serializer, however, converts a Java object into a format that Hive can write to HDFS or another supported system. Hive also provides a RegexSerDe which allows the use of regular expressions to parse columns out from a row. +#### Pig Latin -### 1.2.2 Pig Latin -Pig Latin {% cite olston2008pig --file big-data%} is a programming model built on top of MapReduce to provide a declarative description. Different from Hive, who has SQL-like syntax, the goal of Pig Latin is to attract experienced programmers to perform ad-hoc analysis on big data and allow programmers to write execution logic by a sequence of steps. For example, suppose we have a table URLs: `(url, category, pagerank)`. The following is a simple SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank URLs in that category. 
+Pig Latin {% cite olston2008pig --file big-data %} is a programming model built on top of MapReduce to provide a declarative description. Different from Hive, who has SQL-like syntax, the goal of Pig Latin is to attract experienced programmers to perform ad-hoc analysis on big data and allow programmers to write execution logic by a sequence of steps. For example, suppose we have a table URLs: `(url, category, pagerank)`. The following is a simple SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank URLs in that category. ``` SELECT category, AVG(pagerank) @@ -447,19 +458,17 @@ output = FOREACH big_groups GENERATE *Limitations* The procedural design gives users more control over execution, but at same time the data schema is not enforced explicitly, so it much harder to utilize database-style optimization. Pig Latin has no control structures like loop or conditions, if needed, one has to embed it in Java like JDBC style, but this can easily fail without static syntax checking. It is also not easy to debug. +#### SparkSQL - - -### 1.2.3 SparkSQL - -The major contributions of Spark SQL {% cite armbrust2015spark --file big-data%} are the Dataframe API and the Catalyst. Spark SQL intends to provide relational processing over native RDDs and on several external data sources, through a programmer friendly API, high performance through DBMS techniques, support semi-structured data and external databases, support for advanced analytical processing like machine learning algorithms and graph processing. +The major contributions of Spark SQL {% cite armbrust2015spark --file big-data %} are the Dataframe API and the Catalyst. Spark SQL intends to provide relational processing over native RDDs and on several external data sources, through a programmer friendly API, high performance through DBMS techniques, support semi-structured data and external databases, support for advanced analytical processing like machine learning algorithms and graph processing. ***Programming API*** Spark SQL runs on the top of Spark providing SQL interfaces. A user can interact with this interface through JDBC/ODBC, command line or Dataframe API. -A Dataframe API lets users to intermix both relational and procedural code with ease. Dataframe is a collection of schema based rows of data and named columns on which relational operations can be performed with optimized execution. Unlike an RDD, Dataframe allows developers to define the structure for the data and can be related to tables in a relational database or R/Python’s Dataframe. Dataframe can be constructed from tables of external sources or existing native RDD’s. Dataframe is lazy and each object in it represents a logical plan which is not executed until an output operation like save or count is performed. -Spark SQL supports all the major SQL data types including complex data types like arrays, maps, and unions. -Some of the Dataframe operations include projection (select), filter(where), join and aggregations(groupBy). + +The Dataframe API lets users to intermix both relational and procedural code with ease. Dataframe is a collection of schema based rows of data and named columns on which relational operations can be performed with optimized execution. Unlike an RDD, Dataframe allows developers to define the structure for the data and can be related to tables in a relational database or R/Python’s Dataframe. Dataframe can be constructed from tables of external sources or existing native RDD’s. 
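For instance, a sketch of both construction paths (assuming a Spark 1.x-style `sqlContext`, an existing `SparkContext` named `sc`, and a hypothetical JSON file):

```scala
case class Employee(name: String, deptId: Int, gender: String)

// From an external, self-describing data source.
val fromJson = sqlContext.read.json("hdfs:///data/employees.json")

// From an existing RDD of case-class objects; the schema is inferred by reflection.
val fromRdd = sqlContext.createDataFrame(
  sc.parallelize(Seq(Employee("ada", 1, "female"), Employee("bob", 2, "male"))))
```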
Dataframe is lazy and each object in it represents a logical plan which is not executed until an output operation like save or count is performed. +Spark SQL supports all the major SQL data types including complex data types like arrays, maps, and unions. Some of the Dataframe operations include projection (select), filter(where), join and aggregations(groupBy). + Illustrated below is an example of relational operations on employees data frame to compute the number of female employees in each department. ```scala @@ -468,7 +477,8 @@ employees.join(dept, employees("deptId") === dept("id")) .groupBy(dept("id"), dept("name")) .agg(count("name")) ``` -Several of these operators like $$===$$ for equality test, $$>$$ for greater than, arithmetic ones ($$+$$, $$-$$, etc) and aggregators transforms to an abstract syntax tree of the expression which can be passed to Catalyst for optimization. +Several of these operators like `===` for equality test, `>` for greater than, arithmetic ones (`+`, `-`, etc) and aggregators transforms to an abstract syntax tree of the expression which can be passed to Catalyst for optimization. + A `cache()` operation on the data frame helps Spark SQL store the data in memory so it can be used in iterative algorithms and for interactive queries. In the case of Spark SQL, memory footprint is considerably less as it applies columnar compression schemes like dictionary encoding / run-length encoding. The DataFrame API also supports inline UDF definitions without complicated packaging and registration. Because UDFs and queries are both expressed in the same general purpose language (Python or Scala), users can use standard debugging tools. @@ -479,22 +489,21 @@ Also, Dataframe is both very brittle and very verbose as well, because the user Spark introduced an extension to Dataframe called ***Dataset*** to provide this compile type safety. It embraces object-oriented style for programming and has an additional feature termed Encoders. Encoders translate between JVM representations (objects) and Spark’s internal binary format. Spark has built-in encoders which are very advanced in that they generate bytecode to interact with off-heap data and provide on-demand access to individual attributes without having to de-serialize an entire object - -Winding up - we can compare SQL vs Dataframe vs Dataset as below : +Winding up - we can compare SQL vs Dataframe vs Dataset as below: <figure class="main-container"> <img src="./sql-vs-dataframes-vs-datasets.png" alt="SQL vs Dataframe vs Dataset" /> </figure> -*Figure from the website :* https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html +*Figure from the website :* https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html +### Large-scale parallelism on graphs -### 1.3 Large-scale parallelism on graphs MapReduce doesn’t scale easily for iterative / graph algorithms like page rank and machine learning algorithms. Iterative algorithms require a programmer to explicitly handle the intermediate results (writing to disks) resulting in a lot of boilerplate code. Hence, every iteration requires reading the input file and writing the results to the disk resulting in high disk I/O which is a performance bottleneck for any batch processing system. Also, graph algorithms require an exchange of messages between vertices. In a case of PageRank, every vertex requires the contributions from all its adjacent nodes to calculate its score. 
MapReduce currently lacks this model of message passing which makes it complex to reason about graph algorithms. One model that is commonly employed for implementing distributed graph processing is the graph parallel model. -In the graph-parallel abstraction, a user-defined vertex program is instantiated concurrently for each vertex and interacts with adjacent vertex programs through messages or shared state. Each vertex program can read and modify its vertex property and in some cases adjacent vertex properties. When all vertex programs vote to halt the program terminates. The bulk-synchronous parallel (BSP) model {% cite valiant1990bridging --file big-data%} is one of the most commonly used graph-parallel model. +In the graph-parallel abstraction, a user-defined vertex program is instantiated concurrently for each vertex and interacts with adjacent vertex programs through messages or shared state. Each vertex program can read and modify its vertex property and in some cases adjacent vertex properties. When all vertex programs vote to halt the program terminates. The bulk-synchronous parallel (BSP) model {% cite valiant1990bridging --file big-data %} is one of the most commonly used graph-parallel model. BSP was introduced in 1980 to represent the hardware design features of parallel computers. It gained popularity as an alternative for MapReduce since it addressed the above-mentioned issues with MapReduce BSP model is a message passing synchronous model where - @@ -506,17 +515,17 @@ BSP model is a message passing synchronous model where - - Barrier synchronization syncs all the processors at the end of every super step. - A notable feature of the model is the complete control of data through communication between every processor at every super step. Though similar to MapReduce model, BSP preserves data in memory across super steps and helps in reasoning iterative graph algorithms. -The graph-parallel abstractions allow users to succinctly describe graph algorithms, and provide a runtime engine to execute these algorithms in a distributed nature. They simplify the design, implementation, and application of sophisticated graph algorithms to large-scale real-world problems. Each of these frameworks presents a different view of graph computation, tailored to an originating domain or family of graph algorithms. However, these frameworks fail to address the problems of data preprocessing and construction, favor snapshot recovery over fault tolerance and lack support from distributed data flow frameworks. The data-parallel systems are well suited to the task of graph construction and are highly scalable. However, suffer from the very problems mentioned before for which the graph-parallel systems came into existence. GraphX {%cite xin2013graphx --file big-data%} is a new computation system which builds upon the Spark’s Resilient Distributed Dataset (RDD) to form a new abstraction Resilient Distributed Graph (RDG) to represent records and their relations as vertices and edges respectively. RDG’s leverage the RDD’s fault tolerance mechanism and expressivity. +The graph-parallel abstractions allow users to succinctly describe graph algorithms, and provide a runtime engine to execute these algorithms in a distributed nature. They simplify the design, implementation, and application of sophisticated graph algorithms to large-scale real-world problems. Each of these frameworks presents a different view of graph computation, tailored to an originating domain or family of graph algorithms. 
However, these frameworks fail to address the problems of data preprocessing and construction, favor snapshot recovery over fault tolerance and lack support from distributed data flow frameworks. The data-parallel systems are well suited to the task of graph construction and are highly scalable. However, suffer from the very problems mentioned before for which the graph-parallel systems came into existence. GraphX {%cite xin2013graphx --file big-data %} is a new computation system which builds upon the Spark’s Resilient Distributed Dataset (RDD) to form a new abstraction Resilient Distributed Graph (RDG) to represent records and their relations as vertices and edges respectively. RDG’s leverage the RDD’s fault tolerance mechanism and expressivity. ***How does GraphX improve over the existing graph-parallel and data flow models?*** -Similar to the data flow model, GraphX moves away from the vertex-centric view and adopts transformations on graphs yielding a new graph. The RDGs in GraphX provides a set of elegant and expressive computational primitives to support graph transformations as well as enable many graph-parallel systems like Pregel {%cite malewicz2010pregel --file big-data%}, PowerGraph {%cite gonzalez2012powergraph --file big-data%} to be easily expressed with minimal lines of code changes to Spark. GraphX simplifies the process of graph ETL and analysis through new operations like filter, view etc. It minimizes communication and storage overhead across the system by adopting vertex-cuts for effective partitioning. +Similar to the data flow model, GraphX moves away from the vertex-centric view and adopts transformations on graphs yielding a new graph. The RDGs in GraphX provides a set of elegant and expressive computational primitives to support graph transformations as well as enable many graph-parallel systems like Pregel {% cite malewicz2010pregel --file big-data %}, PowerGraph {% cite gonzalez2012powergraph --file big-data %} to be easily expressed with minimal lines of code changes to Spark. GraphX simplifies the process of graph ETL and analysis through new operations like filter, view etc. It minimizes communication and storage overhead across the system by adopting vertex-cuts for effective partitioning. **GraphX** GraphX models graph as property graphs where vertices and edges can have properties. Property graphs are directed multigraph having multiple parallel edges with same source and destination to realize scenarios where multiple relationships could exist between two vertices. For example, in a social graph where every vertex represents a person, there could be a scenario where two people are both co-workers and a friend at the same time. A vertex is keyed by a unique 64-bit long identifier (Vertex ID) while edges contain the corresponding source and destination vertex identifiers. -GraphX API provides the below primitives for graph transformations (From the website : https://spark.apache.org/docs/2.0.0-preview/graphx-programming-guide.html): +The GraphX API provides the below primitives for graph transformations (from: https://spark.apache.org/docs/2.0.0-preview/graphx-programming-guide.html): - `graph` - constructs property graph given a collection of edges and vertices. - `vertices: VertexRDD[VD]`, `edges: EdgeRDD[ED]`- decompose the graph into a collection of vertices or edges by extracting vertex or edge RDDs. 
@@ -531,18 +540,22 @@ GraphX API provides the below primitives for graph transformations (From the web ***Why partitioning is important in graph computation systems ?*** + Graph-parallel computation requires every vertex or edge to be processed in the context of its neighborhood. Each transformation depends on the result of distributed joins between vertices and edges. This means that graph computation systems rely on graph partitioning (edge-cuts in most of the systems) and efficient storage to minimize communication and storage overhead and ensure balanced computation. <figure class="main-container"> <img src="./edge-cut.png" alt="edge cuts" /> </figure> -*Figure from {%cite xin2013graphx --file big-data%}* +*Figure from {% cite xin2013graphx --file big-data %}* + +***Why are Edge-cuts expensive ?*** -***Why Edge-cuts are expensive ?*** Edge-cuts for partitioning requires random assignment of vertices and edges across all the machines. Thus the communication and storage overhead is proportional to the number of edges cut, and this makes balancing the number of cuts a priority. For most real-world graphs, constructing an optimal edge-cut is cost prohibitive, and most systems use random edge-cuts which achieve appropriate work balance, but nearly worst-case communication overhead. -***Vertex-cuts - GraphX’s solution to effective partitioning*** : An alternative approach which does the opposite of edge-cut — evenly assign edges to machines, but allow vertices to span multiple machines. The communication and storage overhead of a vertex-cut is directly proportional to the sum of the number of machines spanned by each vertex. Therefore, we can reduce communication overhead and ensure balanced computation by evenly assigning edges to machines in a way that minimizes the number of machines spanned by each vertex. +***Vertex-cuts - GraphX’s solution to effective partitioning*** + +An alternative approach which does the opposite of edge-cut — evenly assign edges to machines, but allow vertices to span multiple machines. The communication and storage overhead of a vertex-cut is directly proportional to the sum of the number of machines spanned by each vertex. Therefore, we can reduce communication overhead and ensure balanced computation by evenly assigning edges to machines in a way that minimizes the number of machines spanned by each vertex. ***Implementation of Vertex-cut*** @@ -550,24 +563,24 @@ Edge-cuts for partitioning requires random assignment of vertices and edges acro <img src="./vertex-cut-datastructure.png" alt="vertex-cut-implementation" /> </figure> -*Figure from the website : https://spark.apache.org/docs/2.0.0-preview/graphx-programming-guide.html* +*Figure from: https://spark.apache.org/docs/2.0.0-preview/graphx-programming-guide.html* The GraphX RDG structure implements a vertex-cut representation of a graph using three unordered horizontally partitioned RDD tables. These three tables are as follows: - `EdgeTable(pid, src, dst, data)`: Stores adjacency structure and edge data. -- `VertexDataTable(id, data)`: Stores vertex data. Contains states associated with vertices that are changing in the course of graph computation +- `VertexDataTable(id, data)`: Stores vertex data. Contains states associated with vertices that are changing in the course of graph computation - `VertexMap/Routing Table(id, pid)`: Maps vertex ids to the partitions that contain their adjacent edges. Remains static as long as the graph structure doesn’t change. 
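As a rough sketch of what this looks like from the developer's side, here is a small property graph built and transformed with the Apache Spark GraphX API (which grew out of the RDG abstraction described above; an existing `SparkContext` named `sc` is assumed):

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Vertices carry a user name; edges carry the relationship between two people.
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val relations: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "friend"), Edge(2L, 3L, "co-worker"), Edge(1L, 3L, "friend")))
val graph: Graph[String, String] = Graph(users, relations)

// Transformations yield new graphs: keep only "friend" edges,
// then count each person's friends by aggregating messages from adjacent edges.
val friendGraph = graph.subgraph(epred = triplet => triplet.attr == "friend")
val friendCounts = friendGraph.aggregateMessages[Int](
  ctx => { ctx.sendToSrc(1); ctx.sendToDst(1) },
  _ + _)
```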
+## Execution Models +There are many possible implementations for those programming models. In this section, we will discuss a few different execution models, how the above programming interfaces exploit them, the benefits and limitations of each design and so on. At a very high level, MapReduce, its variants, and Spark all adopt the master/workers model, where the master (or driver in Spark) is responsible for managing data and dynamically scheduling tasks to workers. The master monitors workers' status, and when failure happens, the master will reschedule the task to another idle worker. However, data in MapReduce (section 2.1) is distributed over clusters and needs to be moved in and out of the disk, and Spark (section 2.2) takes the in-memory processing approach. This practice saves significant I/O operations and thus is much faster than MapReduce. As for fault tolerance, MapReduce uses data persistence and Spark achieves it by using lineage (recomputation for failed task). -## 2 Execution Models -There are many possible implementations for those programming models. In this section, we will discuss a few different execution models, how the above programming interfaces exploit them, the benefits and limitations of each design and so on. At a very high level, MapReduce, its variants, and Spark all adopt the master/workers model, where the master(or driver in Spark) is responsible for managing data and dynamically scheduling tasks to workers. The master monitors workers' status, and when failure happens, the master will reschedule the task to another idle worker. However, data in MapReduce(section 2.1) is distributed over clusters and needs to be moved in and out of the disk, and Spark(section 2.2) takes the in-memory processing approach. This practice saves significant I/O operations and thus is much faster than MapReduce. As for fault tolerance, MapReduce uses data persistence and Spark achieves it by using lineage(recomputation for failed task). - -As for more declarative querying models, the execution engine needs to take care of query compilation and in the meantime has the opportunity of optimizations. For example, Hive(section 2.3) not only needs a driver as the way MapReduce and Spark do but also has to manage the meta store as well as to take advantage of optimization gain from traditional database like design. SparkSQL(section 2.4) adopts Catalyst framework for SQL optimization: rule-based and cost-based. +As for more declarative querying models, the execution engine needs to take care of query compilation and in the meantime has the opportunity of optimizations. For example, Hive (section 2.3) not only needs a driver the way MapReduce and Spark do, but also has to manage the meta store as well to take advantage of optimization gains from a traditional database like design. SparkSQL (section 2.4) adopts Catalyst framework for SQL optimization which is rule-based and cost-based. +### MapReduce execution model -### 2.1 MapReduce execution model The original MapReduce model is implemented and deployed in Google infrastructure. As described in section 1.1.1, user program defines map and reduce functions and the underlying system manages data partition and schedules jobs across different nodes. Figure 2.1.1 shows the overall flow when the user program calls MapReduce function: + 1. Split data. The input files are split into *M* pieces; 2. Copy processes. The user program creates a master process and the workers. The master picks idle workers to do either map or reduce task; 3. Map. 
The map worker reads corresponding splits and passes to the map function. The generated intermediate key/value pairs are buffered in memory; @@ -578,32 +591,32 @@ The original MapReduce model is implemented and deployed in Google infrastructur <figure class="fullwidth"> <img src="{{ site.baseurl }}/resources/img/mapreduce-execution.png" alt="MapReduce Execution Overview" /> + <footer>Figure 2.1.1 Execution overview from original MapReduce paper {% cite dean2008mapreduce --file big-data %}</footer> </figure> -<p>Figure 2.1.1 Execution overview<label for="sn-proprietary-monotype-bembo" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-proprietary-monotype-bembo" class="margin-toggle"/><span class="sidenote">from original MapReduce paper {%cite dean2008mapreduce --file big-data%}</span></p> At step 4 and 5, the intermediate dataset is written to the disk by map worker and then read from the disk by reducing worker. Transferring big data chunks over the network is expensive, so the data is stored on local disks of the cluster and the master tries to schedule the map task on the machine that contains the dataset or a nearby machine to minimize the network operation. - -### 2.2 Spark execution model +### Spark execution model <figure class="main-container"> <img src="./cluster-overview.png" alt="MapReduce Execution Overview" /> </figure> -*Figure & information (this section) from the website: http://spark.apache.org/docs/latest/cluster-overview.html* -The Spark driver defines SparkContext which is the entry point for any job that defines the environment/configuration and the dependencies of the submitted job. It connects to the cluster manager and requests resources for further execution of the jobs. -The cluster manager manages and allocates the required system resources to the Spark jobs. Furthermore, it coordinates and keeps track of the live/dead nodes in a cluster. It enables the execution of jobs submitted by the driver on the worker nodes (also called Spark workers) and finally tracks and shows the status of various jobs running on the worker nodes. -A Spark worker executes the business logic submitted by the user by way of the Spark driver. Spark workers are abstracted and are allocated dynamically by the cluster manager to the Spark driver for the execution of submitted jobs. The driver will listen for and accept incoming connections from its executors throughout its lifetime. +*Figure & information (this section) from: http://spark.apache.org/docs/latest/cluster-overview.html* -***Job scheduler optimization:*** Spark’s job scheduler tracks the persistent RDD’s saved in memory. When an action (count or collect) is performed on an RDD, the scheduler first analyzes the lineage graph to build a DAG of stages to execute. These stages only contain the transformations having narrow dependencies. Outside these stages are the wider dependencies for which the scheduler has to fetch the missing partitions from other workers in order to build the target RDD. The job scheduler is highly performant. It assigns tasks to machines based on data locality or to the preferred machines in the contained RDD. If a task fails, the scheduler re-runs it on another node and also recomputes the stage’s parent is missing. +The Spark driver defines SparkContext which is the entry point for any job that defines the environment/configuration and the dependencies of the submitted job. It connects to the cluster manager and requests resources for further execution of the jobs. 
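A minimal driver sketch (the application name, master URL, and input size are illustrative assumptions, not values from the text):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The SparkConf describes the application and which cluster manager to contact;
// the SparkContext is the driver's entry point that requests executors and schedules jobs.
val conf = new SparkConf()
  .setAppName("ExampleJob")
  .setMaster("spark://master-host:7077")
val sc = new SparkContext(conf)

val total = sc.parallelize(1 to 1000000).sum()   // the work is farmed out to the executors
sc.stop()
```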
The cluster manager manages and allocates the required system resources to the Spark jobs. Furthermore, it coordinates and keeps track of the live/dead nodes in a cluster. It enables the execution of jobs submitted by the driver on the worker nodes (also called Spark workers) and finally tracks and shows the status of various jobs running on the worker nodes. A Spark worker executes the business logic submitted by the user by way of the Spark driver. Spark workers are abstracted and are allocated dynamically by the cluster manager to the Spark driver for the execution of submitted jobs. The driver will listen for and accept incoming connections from its executors throughout its lifetime.
+
+***Job scheduler optimization***
+
+Spark’s job scheduler tracks the persistent RDDs saved in memory. When an action (count or collect) is performed on an RDD, the scheduler first analyzes the lineage graph to build a DAG of stages to execute. These stages only contain the transformations having narrow dependencies. Outside these stages are the wider dependencies for which the scheduler has to fetch the missing partitions from other workers in order to build the target RDD. The job scheduler is highly performant. It assigns tasks to machines based on data locality or to the preferred machines in the contained RDD. If a task fails, the scheduler re-runs it on another node and also recomputes the stage’s parent if it is missing.

***How are persistent RDD’s memory managed ?***

Persistent RDDs are stored in memory as java objects (for performance) or in memory as serialized data (for less memory usage at cost of performance) or on disk. If the worker runs out of memory upon creation of a new RDD, Least Recently Used(LRU) policy is applied to evict the least recently accessed RDD unless its same as the new RDD. In that case, the old RDD is excluded from eviction given the fact that it may be reused again in future. Long lineage chains involving wide dependencies are checkpointed to reduce the time in recovering an RDD. However, since RDDs are read-only, checkpointing is still ok since consistency is not a concern and there is no overhead to manage the consistency as is seen in distributed shared memory.

-### 2.3 Hive execution model
+### Hive execution model

-The Hive execution model {% cite thusoo2010hive --file big-data%} composes of the below important components (and as shown in the below Hive architecutre diagram below):
+The Hive execution model {% cite thusoo2010hive --file big-data %} is composed of the following important components (as shown in the Hive architecture diagram below):

- Driver: Similar to the Drivers of Spark/MapReduce application, the driver in Hive handles query submission & its flow across the system. It also manages the session and its statistics.

@@ -615,106 +628,96 @@ The Hive execution model {% cite thusoo2010hive --file big-data%} composes of th

- Optimization: Optimization forms the core of any declarative interface. In the case of Hive, optimization happens through chains of transformation of DAG. A transformation could include even a user defined optimization and it applies an action on the DAG only if a rule is satisfied. Every node in the DAG implements a special interface called as Node interface which makes it easy for the manipulation of the operator DAG using other interfaces like GraphWalker, Dispatcher, Rule, and Processor. Hence, by transformation, we mean walking through a DAG and for every Node we encounter we perform a Rule satisfiability check. 
If a Rule is satisfied, a corresponding processor is invoked. A Dispatcher maintains a list of Rule to Processor mappings. - <figure class="main-container" align="center"> - <img src="./Hive-transformation.png" alt="Hive transformation" /> - </figure> - -*Figure to depict the transformation flow during optimization, from:* {%cite thusoo2010hive --file big-data %} +<figure class="main-container" align="center"> + <img src="./Hive-transformation.png" alt="Hive transformation" /> +</figure> -- Execution Engine: Execution Engine finally executes the tasks in order of their dependencies. A MapReduce task first serializes its part of the plan into a plan.XML file. This file is then added to the job cache and mappers and reducers are spawned to execute relevant sections of the operator DAG. The final results are stored to a temporary location and then moved to the final destination (in the case of say INSERT INTO query). +*Figure to depict the transformation flow during optimization, from:* {% cite thusoo2010hive --file big-data %} +- Execution Engine: Execution Engine finally executes the tasks in order of their dependencies. A MapReduce task first serializes its part of the plan into a `plan.xml` file. This file is then added to the job cache and mappers and reducers are spawned to execute relevant sections of the operator DAG. The final results are stored to a temporary location and then moved to the final destination (in the case of say an `INSERT INTO` query). ***Summarizing the flow*** *Hive architecture diagram* + <figure class="main-container"> <img src="./Hive-architecture.png" alt="Hive architecture" /> </figure> - The query is first submitted via CLI/the web UI/any other interface. The query undergoes all the compiler phases as explained above to form an optimized DAG of MapReduce and its tasks which the execution engine executes in its correct order using Hadoop. - Some of the important optimization techniques in Hive are: - - Column Pruning - Consider only the required columns needed in the query processing for projection. - - Predicate Pushdown - Filter the rows as early as possible by pushing down the predicates. It is important that unnecessary records are filtered first and transformations are applied to only the needed ones. - - Partition Pruning - Predicates on partitioned columns are used to prune out files of partitions that do not satisfy the predicate. - - Map Side Joins - Smaller tables in the join operation can be replicated in all the mappers and the reducers. - - Join Reordering - Reduce "reducer side" join operation memory by keeping only smaller tables in memory. Larger tables need not be kept in memory. - - Repartitioning data to handle skew in GROUP BY processing can be achieved by performing GROUP BY in two MapReduce stages. In first stage data is distributed randomly to the reducers and partial aggregation is performed. In the second stage, these partial aggregations are distributed on GROUP BY columns to different reducers. - - Similar to combiners in MapReduce, hash based partial aggregations in the mappers can be performed to reduce the data that is sent by the mappers to the reducers. This helps in reducing the amount of time spent in sorting and merging the resulting data. - - +- Column Pruning - Consider only the required columns needed in the query processing for projection. +- Predicate Pushdown - Filter the rows as early as possible by pushing down the predicates. 
It is important that unnecessary records are filtered first and transformations are applied to only the needed ones. +- Partition Pruning - Predicates on partitioned columns are used to prune out files of partitions that do not satisfy the predicate. +- Map Side Joins - Smaller tables in the join operation can be replicated in all the mappers and the reducers. +- Join Reordering - Reduce "reducer side" join operation memory by keeping only smaller tables in memory. Larger tables need not be kept in memory. +- Repartitioning data to handle skew in GROUP BY processing can be achieved by performing GROUP BY in two MapReduce stages. In the first stage, data is distributed randomly to the reducers and partial aggregation is performed. In the second stage, these partial aggregations are distributed on GROUP BY columns to different reducers. +- Similar to combiners in MapReduce, hash-based partial aggregations in the mappers can be performed to reduce the data that is sent by the mappers to the reducers. This helps in reducing the amount of time spent in sorting and merging the resulting data. +### SparkSQL execution model -### 2.4 SparkSQL execution model - -SparkSQL {% cite armbrust2015spark --file big-data%} execution model leverages Catalyst framework for optimizing the SQL before submitting it to the Spark Core engine for scheduling the job. -A Catalyst is a query optimizer. Query optimizers for MapReduce frameworks can greatly improve performance of the queries developers write and also significantly reduce the development time. A good query optimizer should be able to optimize user queries, extensible for user to provide information about the data and even dynamically include developer defined specific rules. +The SparkSQL {% cite armbrust2015spark --file big-data %} execution model leverages the Catalyst framework for optimizing the SQL before submitting it to the Spark Core engine for scheduling the job. Catalyst is a query optimizer. Query optimizers for MapReduce frameworks can greatly improve the performance of the queries developers write and also significantly reduce development time. A good query optimizer should be able to optimize user queries, be extensible so that users can provide information about the data, and even dynamically include developer-defined rules. Catalyst leverages Scala’s functional language features like pattern matching and runtime metaprogramming to allow developers to concisely specify complex relational optimizations. -Catalyst includes both rule-based and cost-based optimization. It is extensible to include new optimization techniques and features to Spark SQL and also let developers provide data source specific rules. -Catalyst executes the rules on its data type Tree - a composition of node objects where each node has a node type (subclasses of TreeNode class in Scala) and zero or more children. Node objects are immutable and can be manipulated. The transform method of a Tree applies pattern matching to match a subset of all possible input trees on which the optimization rules needs to be applied. +Catalyst includes both rule-based and cost-based optimization. It is extensible to include new optimization techniques and features in Spark SQL and also lets developers provide data-source-specific rules. Catalyst executes the rules on its data type Tree - a composition of node objects where each node has a node type (subclasses of the TreeNode class in Scala) and zero or more children. Node objects are immutable and can be manipulated.
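As a loose illustration of this style (simplified stand-ins rather than Catalyst's real `TreeNode` classes), a constant-folding rule over such an immutable tree can be written with Scala pattern matching:

```scala
// Simplified stand-ins for Catalyst-style expression trees (not the real Spark classes).
sealed trait Expr
case class Literal(value: Int)          extends Expr
case class Attribute(name: String)      extends Expr
case class Add(left: Expr, right: Expr) extends Expr

object ConstantFolding {
  // A "rule": rebuild the tree bottom-up, collapsing Add(Literal, Literal) nodes.
  def apply(e: Expr): Expr = e match {
    case Add(l, r) =>
      (apply(l), apply(r)) match {
        case (Literal(a), Literal(b)) => Literal(a + b) // rule fires
        case (fl, fr)                 => Add(fl, fr)    // rule does not apply; keep structure
      }
    case other => other // literals and attributes are left unchanged
  }
}

// Example: x + (1 + 2) folds to x + 3
// ConstantFolding(Add(Attribute("x"), Add(Literal(1), Literal(2))))
```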
The transform method of a Tree applies pattern matching to match a subset of all possible input trees on which the optimization rules need to be applied. -Hence, in Spark SQL, transformation of user queries happens in four phases : +Hence, in Spark SQL, transformation of user queries happens in four phases: <figure class="main-container"> <img src="./sparksql-data-flow.jpg" alt="SparkSQL optimization plan Overview" /> </figure> -*Figure from : {% cite armbrust2015spark --file big-data%}* -***Analyzing a logical plan to resolve references :*** In the analysis phase a relation either from the abstract syntax tree (AST) returned by the SQL parser or from a DataFrame is analyzed to create a logical plan out of it, which is still unresolved (the columns referred may not exist or may be of wrong datatype). The logical plan is resolved using using the Catalyst’s Catalog object(tracks the table from all data sources) by mapping the named attributes to the input provided, looking up the relations by name from catalog, by propagating and coercing types through expressions. +*Figure from: {% cite armbrust2015spark --file big-data%}* + +***Analyzing a logical plan to resolve references:*** In the analysis phase a relation either from the abstract syntax tree (AST) returned by the SQL parser or from a DataFrame is analyzed to create a logical plan out of it, which is still unresolved (the columns referred to may not exist or may be of the wrong datatype). The logical plan is resolved using Catalyst’s Catalog object (which tracks the tables from all data sources) by mapping the named attributes to the input provided, looking up the relations by name from the catalog, and by propagating and coercing types through expressions. -***Logical plan optimization :*** In this phase, several of the rules like constant folding, predicate push down, projection pruning, null propagation, boolean expression simplification are applied on the logical plan. +***Logical plan optimization:*** In this phase, several rules like constant folding, predicate pushdown, projection pruning, null propagation, and boolean expression simplification are applied on the logical plan. -***Physical planning :*** In this phase, Spark generates multiples physical plans out of the input logical plan and chooses the plan based on a cost model. The physical planner also performs rule-based physical optimizations, such as pipelining projections or filters into one Spark map operation. In addition, it can push operations from the logical plan into data sources that support predicate or projection pushdown. +***Physical planning:*** In this phase, Spark generates multiple physical plans out of the input logical plan and chooses one plan based on a cost model. The physical planner also performs rule-based physical optimizations, such as pipelining projections or filters into one Spark map operation. In addition, it can push operations from the logical plan into data sources that support predicate or projection pushdown. -***Code Generation :*** The final phase generates the Java byte code that should run on each machine.Catalyst transforms the Tree which is an expression in SQL to an AST for Scala code to evaluate, compile and run the generated code. A special scala feature namely quasiquotes aid in the construction of abstract syntax tree(AST). +***Code Generation:*** The final phase generates the Java bytecode that should run on each machine.
Catalyst transforms the Tree, which is an expression in SQL, to an AST for Scala code to evaluate, compile and run the generated code. A special Scala feature, quasiquotes, aids in the construction of the abstract syntax tree (AST). +## Big Data Ecosystem -## 3. Big Data Ecosystem -*3.1 Hadoop ecosystem* +### Hadoop Ecosystem Apache Hadoop is an open-sourced framework that supports distributed processing of large datasets. It involves dozens of projects, all of which are listed [here](https://hadoopecosystemtable.github.io/). In this section, it is also important to understand the two key players in the system: the Hadoop Distributed File System (HDFS) and the open-sourced implementation of the MapReduce model - Hadoop. <figure class="main-container"> <img src="./hadoop-ecosystem.jpg" alt="Hadoop Ecosystem" /> </figure> -*Figure is from http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview* +*Figure from http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview* -HDFS forms the data management layer, which is a distributed file system designed to provide reliable, scalable storage across large clusters of unreliable commodity machines. The idea was inspired by GFS{%cite ghemawat2003google --file big-data%}. Unlike closed GFS, HDFS is open-sourced and provides various libraries and interfaces to support different file systems, like S3, KFS etc. +HDFS forms the data management layer, which is a distributed file system designed to provide reliable, scalable storage across large clusters of unreliable commodity machines. The idea was inspired by GFS {% cite ghemawat2003google --file big-data %}. Unlike the closed-source GFS, HDFS is open-sourced and provides various libraries and interfaces to support different file systems, like S3, KFS etc. To satisfy different needs, big companies like Facebook and Yahoo developed additional tools. Facebook's Hive, as a warehouse system, can provide a more declarative programming interface and translate queries to Hadoop jobs. Yahoo's Pig platform is an ad-hoc analysis tool that can structure HDFS objects and support operations like grouping, joining and filtering. +### Spark ecosystem -*3.2 Spark ecosystem* -Apache Spark's rich-ecosystem constitutes of third party libraries like Mesos{%cite hindman2011mesos --file big-data%}/Yarn{%cite vavilapalli2013apache --file big-data%} and several major components that have been already discussed in this article like Spark-core, SparkSQL, GraphX. -In this section we will discuss the remaining yet very important components/libraries which help Spark deliver high performance. +Apache Spark's rich ecosystem benefits from resource management systems like Mesos {% cite hindman2011mesos --file big-data %} and Yarn {% cite vavilapalli2013apache --file big-data %}, and from several major components that have already been discussed in this article like Spark Core, SparkSQL, and GraphX. +In this section we will discuss the remaining important libraries and systems which help Spark deliver high performance. <figure class="main-container"> <img src="./spark-ecosystem.png" alt="Spark ecosystem" /> </figure> -*Spark Streaming - A Spark component for streaming workloads* - -Spark achieves fault tolerant, high throughput data streaming workloads in real-time through a light weight Spark Streaming API. Spark streaming is based on Discretized Streams model{% cite zaharia2012discretized --file big-data%}.
Spark Streaming processes streaming workloads as a series of small batch workloads by leveraging the fast scheduling capacity of Apache Spark Core and fault tolerance capabilities of a RDD. A RDD in here represents each batch of streaming data and transformations are applied on the same. Data source in Spark Streaming could be from many a live streams like Twitter, Apache Kafka {% cite kreps2011kafka --file big-data%}, Akka Actors (http://doc.akka.io/docs/akka/2.4.1/scala/actors.html), IoT Sensors, Apache Flume(https://flume.apache.org/FlumeUserGuide.html), etc. Spark streaming also enables unification of batch and streaming workloads and hence developers can use the same code for both batch and streaming workloads. It supports the integration of streaming data with historical data. - +#### Spark Streaming - A Spark component for streaming workloads -*Apache Mesos* +Spark achieves fault tolerant, high throughput data streaming workloads in real-time through a light weight Spark Streaming API. Spark streaming is based on Discretized Streams model. {% cite zaharia2012discretized --file big-data %} Spark Streaming processes streaming workloads as a series of small batch workloads by leveraging the fast scheduling capacity of Apache Spark Core and fault tolerance capabilities of a RDD. A RDD in here represents each batch of streaming data and transformations are applied on the same. Data source in Spark Streaming could be from many a live streams like Twitter, Apache Kafka {% cite kreps2011kafka --file big-data %}, [Akka Actors](http://doc.akka.io/docs/akka/2.4.1/scala/actors.html), IoT Sensors, Apache [Flume](https://flume.apache.org/FlumeUserGuide.html), etc. Spark streaming also enables unification of batch and streaming workloads and hence developers can use the same code for both batch and streaming workloads. It supports the integration of streaming data with historical data. +#### Apache Mesos -Apache Mesos{%cite hindman2011mesos --file big-data%} is an open source heterogenous cluster/resource manager developed at the University of California, Berkley and used by companies such as Twitter, Airbnb, Netflix etc. for handling workloads in a distributed environment through dynamic resource sharing and isolation. It aids in the deployment and management of applications in large-scale clustered environments. Mesos abstracts node allocation by combining the existing resources of the machines/nodes in a cluster into a single pool and enabling fault-tolerant elastic distributed systems. Variety of workloads can utilize the nodes from this single pool voiding the need of allocating specific machines for different workloads. Mesos is highly scalable, achieves fault tolerance through Apache Zookeeper {%cite hunt2010zookeeper --file big-data%} and is a efficient CPU and memory-aware resource scheduler. - -*Alluxio/Tachyon* - -Alluxio/Tachyon{% cite li2014tachyon --file big-data%} is an open source memory-centric distributed storage system that provides high throughput writes and reads enabling reliable data sharing at memory-speed across cluster jobs. Tachyon can integrate with different computation frameworks, such as Apache Spark and Apache MapReduce. In the big data ecosystem, Tachyon fits between computation frameworks or jobs like spark or mapreducce and various kinds of storage systems, such as Amazon S3, OpenStack Swift, GlusterFS, HDFS, or Ceph. It caches the frequently read datasets in memory, thereby avoiding going to disk to load every dataset. 
In Spark RDDs can automatically be stored inside Tachyon to make Spark more resilient and avoid GC overheads. - +Apache Mesos {% cite hindman2011mesos --file big-data %} is an open source heterogeneous cluster/resource manager developed at the University of California, Berkeley and used by companies such as Twitter, Airbnb, Netflix etc. for handling workloads in a distributed environment through dynamic resource sharing and isolation. It aids in the deployment and management of applications in large-scale clustered environments. Mesos abstracts node allocation by combining the existing resources of the machines/nodes in a cluster into a single pool and enabling fault-tolerant elastic distributed systems. A variety of workloads can utilize the nodes from this single pool, avoiding the need to allocate specific machines for different workloads. Mesos is highly scalable, achieves fault tolerance through Apache Zookeeper {% cite hunt2010zookeeper --file big-data %} and is an efficient CPU- and memory-aware resource scheduler. +#### Alluxio/Tachyon +Alluxio/Tachyon {% cite li2014tachyon --file big-data %} is an open source memory-centric distributed storage system that provides high throughput writes and reads enabling reliable data sharing at memory-speed across cluster jobs. Tachyon can integrate with different computation frameworks, such as Apache Spark and Apache MapReduce. In the big data ecosystem, Tachyon fits between computation frameworks or jobs like Spark or MapReduce and various kinds of storage systems, such as Amazon S3, OpenStack Swift, GlusterFS, HDFS, or Ceph. It caches the frequently read datasets in memory, thereby avoiding going to disk to load every dataset. In Spark, RDDs can automatically be stored inside Tachyon to make Spark more resilient and avoid GC overheads. ## References + {% bibliography --file big-data %} diff --git a/chapter/9/streaming.md b/chapter/9/streaming.md index 13abea2..7fe0734 100644 --- a/chapter/9/streaming.md +++ b/chapter/9/streaming.md @@ -1,43 +1,47 @@ --- layout: page -title: "Large Scale Streaming Processing" +title: "Large Scale Streaming Processing" by: "Fangfan Li" --- -The previous chapter discusses the large scale batch processing system, where the computation involves the pieces of data stored across the distributed file system. Those systems satisfy the requirements such as scalability and fault-tolerance for applications that deal with 'big data' stored in a distributed way. The batch processing systems are suitable for processing *static* datasets, where the input data do not change overtime during the whole process, thus the system can distribute the computation and perform synchronization assuming the inputs would stay the same during the whole computation. In such *static* model, the processing system can first *pull* data from the disk, and then perform the computation over the pulled data. However, a large number of networking applications are not *static*, instead, the data is constantly in motion, and the inputs would be provided as *stream*, as new data constantly arrives. In the *stream* model, data is *pushed* to the processor. This fundamental difference makes the traditional batch processing system un-suitable for streaming applications, as even the slightest change in the dataset would require the batch processor to *pull* the whole dataset and perform the computation again. Thus in this chapter, we would introduce the history and systems that are created for the streaming processing.
+The previous chapter discusses the large scale batch processing system, where the computation involves the pieces of data stored across the distributed file system. Those systems satisfy requirements such as scalability and fault-tolerance for applications that deal with "big data" stored in a distributed way. Batch processing systems are suitable for processing *static* datasets, where the input data does not change over time during the whole process, thus the system can distribute the computation and perform synchronization assuming the inputs would stay the same during the whole computation. In such a *static* model, the processing system can first *pull* data from the disk, and then perform the computation over the pulled data. However, a large number of networking applications are not *static*, instead, the data is constantly in motion, and the inputs would be provided as a *stream*, as new data constantly arrives. In the *stream* model, data is *pushed* to the processor. This fundamental difference makes the traditional batch processing system unsuitable for streaming applications, as even the slightest change in the dataset would require the batch processor to *pull* the whole dataset and perform the computation again. Thus in this chapter, we would introduce the history of and the systems created for stream processing. There are many challenges in implementing a large scale stream processing system. Similar to large scale batch processing systems, large scale streaming systems also have to deal with consistency and fault-tolerance due to the distributed nature of those systems. Moreover, latency at the scale of several minutes is at most a nuisance in batch processing, while such latency is much less tolerable in large scale stream processing. -Despite those challenges, there are many active research and productions in the stream processing area, and we want to answer the following questions in this article: 1) what are the earliest ideas of stream processing, why would people want to analyze a stream of data 2) what exactly is a stream, how is it implemented in real system 3) what are the systems that are built for large scale stream processing, and what are the differences between them 4) what are the systems that are being used by companies for their applications, do they build their own system or they would use the existing systems. +Despite those challenges, there is much active research and production work in the stream processing area, and we want to answer the following questions in this article: +1. What are the earliest ideas of stream processing? Why would people want to analyze a stream of data? +2. What exactly is a stream? How is it implemented in real systems? +3. What are the systems that are built for large scale stream processing, and what are the differences between them? +4. What are the systems that are being used by companies for their applications? Do they build their own system or do they use existing systems? ## Data in constant motion -Computing data stream has long been studied in the area of Theory of Computing. Assume we have a sequence of elements, and we want to compute the frequency moments of the data (i.e., count how many of each of the distinct data appear in the sequence). To do that, we could maintain a full histogram on the data, a counter for each data value.
However, the memory that we have is not unlimited, thus we can not gather every data, we can then use randomized algorithms for approximating the frequency moments with limited resource{% cite alon1996space --file streaming %}. Thus analyzing the stream using random algorithm was because of the lack of computation resources. +Computing data stream has long been studied in the area of Theory of Computing. Assume we have a sequence of elements, and we want to compute the frequency moments of the data (i.e., count how many of each of the distinct data appear in the sequence). To do that, we could maintain a full histogram on the data, a counter for each data value. However, the memory that we have is not unlimited, thus we can not gather every data, we can then use randomized algorithms for approximating the frequency moments with limited resource {% cite alon1996space --file streaming %}. Thus analyzing the stream using random algorithm was because of the lack of computation resources. -Besides randomized processing on the data sequence, systems were also being developed to deal with the input data that is not static and predictable. Instead of dealing with the lack of resources, those projects were mostly motivated by the fact that in emerging networked environments, the value of the ever increasing amount of data is realized only within the time that it is needed. TelegraphCQ {% cite chandrasekaran2003telegraphcq --file streaming %} is one example among those earliest such systems, which aims at meeting the challenges that arise in handling large streams of continuous queries over high-volume, high-variable data. In contrast to traditional view that data can be found statically in known locations, the authors of TelegraphCQ realized that data becomes fluid and being constantly moving and changing, thus the traditional database can 'pull' data from the storage while data is being 'pushed' into the query processor in case of processing stream. The examples of applications that use this *data in motion* include: event-based processing where the system would react to some special data received or when some event happens (e.g., at a certain time), and query processing over streaming data sources such as network monitoring. TelegraphCQ is one example of the systems that can query processing over data stream. +Besides randomized processing on the data sequence, systems were also being developed to deal with the input data that is not static and predictable. Instead of dealing with the lack of resources, those projects were mostly motivated by the fact that in emerging networked environments, the value of the ever increasing amount of data is realized only within the time that it is needed. TelegraphCQ {% cite chandrasekaran2003telegraphcq --file streaming %} is one example among those earliest such systems, which aims at meeting the challenges that arise in handling large streams of continuous queries over high-volume, high-variable data. In contrast to traditional view that data can be found statically in known locations, the authors of TelegraphCQ realized that data becomes fluid and being constantly moving and changing, thus the traditional database can "pull" data from the storage while data is being "pushed" into the query processor in case of processing stream. 
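The pull/push contrast can be sketched in a few lines (hypothetical interfaces, not TelegraphCQ's actual API): a pull-based processor decides when to read a finite dataset, while a push-based processor registers a long-lived query and is driven by whatever arrives.

```scala
object PullVsPush {
  // Pull: the processor decides when to read a static dataset from storage.
  trait Storage { def readAll(): Seq[String] }
  def pullAndProcess(storage: Storage): Unit =
    storage.readAll().foreach(record => println(s"processed $record"))

  // Push: the source drives the computation, so the query must already be
  // registered and continuously running when data arrives.
  final class PushSource {
    private var queries: List[String => Unit] = Nil
    def register(query: String => Unit): Unit = queries = query :: queries
    def arrive(record: String): Unit = queries.foreach(q => q(record))
  }

  def main(args: Array[String]): Unit = {
    val source = new PushSource
    source.register(record => println(s"continuous query saw $record"))
    source.arrive("sensor-reading-42") // new data is pushed into the registered query
  }
}
```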
The examples of applications that use this *data in motion* include: event-based processing where the system would react to some special data received or when some event happens (e.g., at a certain time), and query processing over streaming data sources such as network monitoring. TelegraphCQ is one example of the systems that can query processing over data stream. -The fundamental difference between TelegraphCQ to other traditional query system is the view of input data, instead of handling a query with detailed static data, TelegraphCQ has to react to the newly arrived data and process the queries *on-the-fly*. In order to always react, the query need to be alway running, so TelegraphCQ runs *continuous queries*, where the queries are constantly running and as new data arrives, the processor would route it to the set of active queries that are listening. TelegraphCQ also uses *shared processing* to avoid the overhead of processing each query individually, in order to avoid blocking and having to interrupt the dataflow, data should be processed simultaneously by all the queries that require this dataflow. In TelegraphCQ, those queries with such commonality can be combined together to improve the performance. +The fundamental difference between TelegraphCQ and other traditional query system is its view of input data. Instead of handling a query with detailed static data, TelegraphCQ has to react to the newly arrived data and process the queries *on-the-fly*. In order to always react, the query needs to be always running, so TelegraphCQ runs *continuous queries*, where the queries are constantly running and as new data arrives, the processor routes it to the set of active queries that are listening. TelegraphCQ also uses *shared processing* to avoid the overhead of processing each query individually, in order to avoid blocking and having to interrupt the dataflow, data should be processed simultaneously by all the queries that require this dataflow. In TelegraphCQ, those queries with such commonality can be combined together to improve the performance. TelegraphCQ shows the importance of modeling data as stream and how can we process such data stream, however it was only implemented in a non-distributed prototype. -Beyond TelegraphCQ, there are systems that were built for continuously querying on large scale streaming data. For example, PipelineDB{% cite pipelinedb --file streaming %} is a system that was designed to run SQL queries continuously on streaming data, where the output of those continuous queries is stored in regular tables which can be queried like any other table. PipelineDB can reduce the cardinality of its input streams by performing different filtering or aggregations on stream once the continuous queries read the raw data, and only the needed information would then be persisted to disk (i.e., the raw data is then discarded). By doing this, PipelineDB can process large volumes of data very efficiently using relatively small number of resources. +Beyond TelegraphCQ, there are systems that were built for continuously querying on large scale streaming data. For example, PipelineDB {% cite pipelinedb --file streaming %} is a system that was designed to run SQL queries continuously on streaming data, where the output of those continuous queries is stored in regular tables which can be queried like any other table. 
PipelineDB can reduce the cardinality of its input streams by performing different filtering or aggregations on the stream once the continuous queries read the raw data, and only the needed information would then be persisted to disk (i.e., the raw data is then discarded). By doing this, PipelineDB can process large volumes of data very efficiently using a relatively small number of resources. As we described before, stream processing is not only query processing. Apache Flink {% cite apacheflink --file streaming %} is a system that supports both event-based processing and query processing. Each program in Flink is a streaming dataflow consisting of streams and transformation operators; the stream of data in a streaming dataflow can come from multiple sources (i.e., producers) and travel to one or more sinks (i.e., consumers). The stream of data would get transformed when traveling through the operators, where the computations happen. In order to distribute the work, streams are split into stream partitions and operators are split into operator subtasks in Flink where each subtask executes independently. -What is event-based processing in Flink then? Unlike batch processing, to aggregate a event is more subtle in stream processing, for example we can not count the element in a stream since it is generally unbounded. Instead, Flink enable event-based processing with the notion of time and windows, for example, we can specify something like 'count over 5 minutes window'. Besides time-based window, Flink also supports count windows, and such event would be 'do something when the 100th elements arrive'. Flink has different notions of time such as event time when an event was created and processing time which is when the operator performs a time-based operation. The time are then used internally to keep the order and state for each event and also used by the windowing logic. The flexible streaming windows can then be transformed to flexible triggering condition which makes event-based processing possible in Flink. +What is event-based processing in Flink then? Unlike batch processing, to aggregate an event is more subtle in stream processing; for example, we cannot count the elements in a stream since it is generally unbounded. Instead, Flink enables event-based processing with the notion of time and windows. For example, we can specify something like 'count over a 5 minute window'. Besides time-based windows, Flink also supports count-based windows, and operations like 'do something when the 100th element arrives'. Flink has different notions of time, such as event time (when an event was created) and processing time (when the operator performs a time-based operation). These times are then used internally to keep the order and state for each event and are also used by the windowing logic. The flexible streaming windows can then be transformed into flexible triggering conditions, which makes event-based processing possible in Flink. We just very briefly introduced PipelineDB and Apache Flink here, and there are many other systems that can perform stream processing at large scale, and we would look into a few examples in detail in section 3. ## How to represent data stream -Why would we need to process data stream in a large scale? I will use an example to illustrate the idea.
For example, assume you are Twitter, and you have a constant feed of user's comments and posts, you want to find out what is the most *trending* topic right now that people are talking about, and your advertisement team want to follow on that. You can store all the posts that happened during the day from 12:01 a.m to 11:59 p.m in a large file system and then run a batch *Spark* {% cite zaharia2012resilient --file streaming %} job to analyze them. The *Spark* job itself may again probably take several hours, but after all these works, the *trending* topic comes out from your analysis might be useless since it might not be hot anymore. Thus we want a stream processing system that can take the constant stream of posts from all different sources as input and output the result with low latency (i.e., before it becomes useless). +Why would we need to process data stream in a large scale? I will use an example to illustrate the idea. For example, assume you are Twitter, and you have a constant feed of user's comments and posts, you want to find out what is the most *trending* topic right now that people are talking about, and your advertisement team want to follow on that. You can store all the posts that happened during the day from 12:01 a.m to 11:59 p.m in a large file system and then run a batch *Spark* {% cite zaharia2012resilient --file streaming %} job to analyze them. The *Spark* job itself may again probably take several hours, but after all these works, the *trending* topic comes out from your analysis might be useless since it might not be hot anymore. Thus we want a stream processing system that can take the constant stream of posts from all different sources as input and output the result with low latency (i.e. before it becomes useless). Before dive into the details of the large scale processing, we would first introduce a few concepts: producer, processor and consumer based on the example. -- The producer is where the data stream comes from, it would be a user who are tweeting. +- The producer is where the data stream comes from, it would be a user who is tweeting. - The consumer is where the results are needed, the advertisement team would be the consumer. -- The processor is then the *magical* component that takes the stream and produce the results. +- The processor is then the *magical* component that takes the stream and produces the results. <figure class="fullwidth"> <img src="{{ site.baseurl }}/chapter/9/DiagramStream.jpg" alt="An example of a stream processing system" /> @@ -45,19 +49,19 @@ Before dive into the details of the large scale processing, we would first intro The producers and consumers are fairly straight forward, it is the processor that are being discussed in this chapter. -In this section, we would first illustrate what is the *stream* (i.e., the tuples between components) that the producers are giving to the processor, which is the component between producers and processors-the data stream. +In this section, we would first illustrate what is the *stream* (i.e. the tuples between components) that the producers are giving to the processor, which is the component between producers and processors-the data stream. We have been talking about the stream of data, but this is a bit under-specified, since the data can be collected from many producers (i.e. different users), how do we combine those data into actual streams and send the them to the processors? What does a data stream really look like? 
A natural view of a data stream can be an infinite sequence of tuples reading from a queue. However, a traditional queue would not be sufficient in large scale system since the consumed tuple might got lost or the consumer might fail thus it might request the previous tuple after a restart. Furthermore, since the processing power for a single machine is limited, we want several machines to be able to read from the same queue, thus they can work on the stream in parallel. The alternative queue design is then a multi-consumer queue, where a pool of readers may read from a single queue and each record goes to one of them. In a traditional multi-consumer queue, once a consumer reads the data out, it is gone. This would be problematic in a large stream processing system, since the messages are more likely to be lost during transmission, and we want to keep track of what are the data that are successfully being consumed and what are the data that might be lost on their way towards the consumer. Thus we need a little fancier queue to keep track of *what* has been consumed, in order to be resilient in the face of packet loss or network failure. -A naïve approach to attempting to handle lost messages or failures could be to record the message upon sending it, and to wait for the acknowledgement from the receiver. This simple method is a pragmatic choice since the storage in many messaging systems are scarce resources, the system want to free the data immediately once it knows it is consumed successfully thus to keep the queue small. However, getting the two ends to come into agreement about what has been consumed in not a trivial problem. Acknowledgement fixes the problem of losing messages, because if a message is lost, it would not be acknowledged thus the data is still in the queue and can be sent again, this would ensure that each message is processed at least once, however, it also creates new problems. First problem is the receiver might successfully consumed the message *m1* but fail to send the acknowledgment, thus the sender would send *m1* again and the receiver would process the same data twice. Another problem is memory consumption, since the sender has now to keep track of every single messages being sent out with multiple stages, and only free them when acknowledged. +A naive approach to attempting to handle lost messages or failures could be to record the message upon sending it, and to wait for the acknowledgement from the receiver. This simple method is a pragmatic choice since the storage in many messaging systems are scarce resources, the system want to free the data immediately once it knows it is consumed successfully thus to keep the queue small. However, getting the two ends to come into agreement about what has been consumed in not a trivial problem. Acknowledgement fixes the problem of losing messages, because if a message is lost, it would not be acknowledged thus the data is still in the queue and can be sent again, this would ensure that each message is processed at least once, however, it also creates new problems. First problem is the receiver might successfully consumed the message *m1* but fail to send the acknowledgment, thus the sender would send *m1* again and the receiver would process the same data twice. Another problem is memory consumption, since the sender has now to keep track of every single messages being sent out with multiple stages, and only free them when acknowledged. 
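A tiny sketch of this naive acknowledgement bookkeeping (hypothetical types, not any particular messaging library) makes both problems visible: the pending map only shrinks when acks arrive, and a lost ack leads to a resend and therefore a duplicate.

```scala
import scala.collection.mutable

// Naive at-least-once sender: remember every message until it is acknowledged.
class NaiveSender(transmit: (Long, String) => Unit) {
  private val pending = mutable.Map.empty[Long, String] // grows until acks arrive: the memory cost

  def publish(id: Long, msg: String): Unit = {
    pending(id) = msg      // record the message before sending it
    transmit(id, msg)
  }

  def onAck(id: Long): Unit =
    pending.remove(id)     // only now can the sender free its copy

  def onTimeout(id: Long): Unit =
    // If the ack (rather than the message) was lost, this resends something the
    // receiver already processed, so the receiver sees it twice: at-least-once.
    pending.get(id).foreach(msg => transmit(id, msg))
}
```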
<figure class="fullwidth"> <img src="{{ site.baseurl }}/chapter/9/Kafka.jpg" alt="An example of a stream processing system" /> </figure> -Apache Kafka {% cite apachekafka --file streaming %} handles this differently to achieve better performance. Apache Kafka is a distributed streaming platform, where the producer, processor and consumers can all subscribe to, and create/read the stream they need from, one can think of Kafka as the stream between all components in a stream processing system. Records in Kafka are grouped in topics, where each topic is a category to which this record is published. Each topic is then divided into several partitions, where one topic can always have multi-subscriber and each partition has one reader at a time. Each record is assigned with a offset that uniquely identifies it in that partition. By doing this Kafka can ensure that the only reader of that partition and consumes the data in order. Since there are many partitions of each topic, Kafka balances the load over many consumer instances by assigning different partitions to them. This makes the state about what has been consumed very small, just one number (i.e., the offset) for each partition, and by periodically checkpointing, the equivalent of message acknowledgements becomes very cheap. Kafka retains all published records whether they have been consumed or not during their configurable retention period, this also allows consumers to rewind the stream and replay everything from the point of interest by going back to the specific offset. For example, if the user code has a bug which is discovered later, the user can re-consume those messages from the previous offset once the bug is fixed while ensuring that the processed events are in the order of their origination, or the user can simply start computing with the latest records from "now". +Apache Kafka {% cite apachekafka --file streaming %} handles this differently to achieve better performance. Apache Kafka is a distributed streaming platform, where the producer, processor and consumers can all subscribe to, and create/read the stream they need from, one can think of Kafka as the stream between all components in a stream processing system. Records in Kafka are grouped in topics, where each topic is a category to which this record is published. Each topic is then divided into several partitions, where one topic can always have multi-subscriber and each partition has one reader at a time. Each record is assigned with a offset that uniquely identifies it in that partition. By doing this Kafka can ensure that the only reader of that partition and consumes the data in order. Since there are many partitions of each topic, Kafka balances the load over many consumer instances by assigning different partitions to them. This makes the state about what has been consumed very small, just one number (i.e. the offset) for each partition, and by periodically checkpointing, the equivalent of message acknowledgements becomes very cheap. Kafka retains all published records whether they have been consumed or not during their configurable retention period, this also allows consumers to rewind the stream and replay everything from the point of interest by going back to the specific offset. 
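As an illustration of offsets in practice, the sketch below uses Kafka's standard consumer API from Scala to pin a reader to a single partition and rewind it to an earlier offset (the topic name, partition, and offset are made up):

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

object RewindSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer  = new KafkaConsumer[String, String](props)
    val partition = new TopicPartition("posts", 0)        // hypothetical topic and partition
    consumer.assign(Collections.singletonList(partition)) // this reader owns partition 0 exclusively
    consumer.seek(partition, 12345L)                      // rewind: replay records from a chosen offset

    // Records come back in offset order within the partition.
    for (record <- consumer.poll(Duration.ofMillis(500)).asScala)
      println(s"offset=${record.offset} value=${record.value}")

    consumer.close()
  }
}
```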
For example, if the user code has a bug which is discovered later, the user can re-consume those messages from the previous offset once the bug is fixed while ensuring that the processed events are in the order of their origination, or the user can simply start computing with the latest records from "now". With the notions of topics and partitions, Kafka guarantees that the total order over records within a partition, and multiple consumers can subscribe to a single topic which would increase the throughput. If a strong guarantee on the ordering of all records in a topic is needed, the user can simply put all records in this topic into one partition. @@ -65,11 +69,11 @@ Those features of Apache Kafka make it a very popular platform used by many stre ## How to process data stream -Now we know what the stream looks like and how do we ensure that the data in the stream are successfully processed. We would then talk about the processors that consume the data stream. There are two main approaches in processing data stream. The first approach is the continuous queries model, similar to TelegraphCQ, where the queries keep running and the arrival of data initiates the processing. Another approach is micro-batching, where the streaming computation becomes a series of stateless, deterministic batch computations on batch of stream, where certain timer would trigger the processing on the batch in those systems. We would discuss Apache Storm as one example for the fist design and Spark Streaming, Naiad and Google Dataflow are examples of the second approach. These systems not only differ in the way how they process stream, but also how they ensure fault-tolerance which is one of the most important aspects of large scale distributed system. +Now we know what the stream looks like and how do we ensure that the data in the stream are successfully processed. We would then talk about the processors that consume the data stream. There are two main approaches in processing data stream. The first approach is the continuous queries model, similar to TelegraphCQ, where the queries keep running and the arrival of data initiates the processing. Another approach is micro-batching, where the streaming computation becomes a series of stateless, deterministic batch computations on batch of stream, where certain timer would trigger the processing on the batch in those systems. We would discuss Apache Storm as one example for the fist design and Spark Streaming, Naiad, and Google Dataflow as examples of the second approach. These systems not only differ in how they process stream, but also how they ensure fault-tolerance which is one of the most important aspects of large scale distributed system. -### a) Continuous queries (operators) on each tuple +### Continuous queries (operators) on each tuple -- Apache Storm +#### Apache Storm After MapReduce, Hadoop, and the related batch processing system came out, the data can be processed at scales previously unthinkable. However, as we stated before, large scale stream processing becomes more and more important for many businesses. *Apache Storm* {% cite apachestorm --file streaming %} is actually one of the first system that can be described as "Hadoop of stream processing" that feed the needs. Users can process messages in a way that doesn't lose data and also scalable with the primitives provided by *Storm*. 
@@ -83,7 +87,6 @@ A bolt is where the processing really take place, it can take multiple streams a A topology is then arbitrary combination of the three components, where spouts and bolts are the vertices and streams are the edges in the topology. - ```ruby TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("words", new TestWordSpout(), 10); @@ -99,25 +102,25 @@ builder.setBolt("exclaim2", new ExclamationBolt(), 5) <img src="{{ site.baseurl }}/chapter/9/Topology.jpg" alt="The topology created by the example code" /> </figure> -Here is how we can build a simple topology which contains a spout and two bolts, where the spout emits words and each bolt would append exclamation '!' to its input. The exclaim1 bolt is connected to the spout while the exclaim2 bolt is connected to both the spout and exclaim2 specified by 'Grouping', and we will show what 'shuffle grouping' means in the next paragraph. The nodes are arranged as shown in the graph. For example if the bolt emits the tuple ["Hi"], if it travels from exclaim1 to exclaim2, then exclaim2 would emit the word ["Hi!!"]. +Here is how we can build a simple topology which contains a spout and two bolts, where the spout emits words and each bolt would append exclamation `"!"` to its input. The `exclaim1` bolt is connected to the spout while the `exclaim2` bolt is connected to both the spout and `exclaim2` specified by 'Grouping', and we will show what 'shuffle grouping' means in the next paragraph. The nodes are arranged as shown in the graph. For example if the bolt emits the tuple `["Hi"]`, if it travels from `exclaim1` to `exclaim2`, then `exclaim2` would emit the word `["Hi!!"]`. -Since all the works are distributed, any given vertex is not necessarily running on a single machine, instead they can be spread on different workers in the cluster. The parameter 10, 3 and 5 in the example code actually specify the amount of parallelism the user wants. *Storm* also provides different *stream grouping* schemes for users to determine which vertex should be consuming the output stream from a given vertex. The grouping method can be shuffle grouping as shown in our example, where the tuples from the output stream will be randomly distributed across this bolt's consumers in a way such that each consumer is guaranteed to get an equal number of tuples. Another example would be fields grouping, where the tuples of the stream is partitioned by the fields specified in the grouping, the tuples with the same value in that field would always go to the same bolt. +Since all the works are distributed, any given vertex is not necessarily running on a single machine, instead they can be spread on different workers in the cluster. The parameters `10`, `3` and `5` in the example code actually specify the amount of parallelism the user wants. *Storm* also provides different *stream grouping* schemes for users to determine which vertex should be consuming the output stream from a given vertex. The grouping method can be shuffle grouping as shown in our example, where the tuples from the output stream will be randomly distributed across this bolt's consumers in a way such that each consumer is guaranteed to get an equal number of tuples. Another example would be fields grouping, where the tuples of the stream is partitioned by the fields specified in the grouping, the tuples with the same value in that field would always go to the same bolt. -A natural question to ask here is what if something goes wrong for example a single tuple get lost. 
One might think that *Storm* maintains a queue similar to what we discussed before to ensure that every tuple is processed at least once. In fact, *Storm* does not keep such queues internally, the reason might be that there would be so many states to maintain if it needs to construct such queue for every edge. In stead, *Storm* maintains a directed acyclic graph (DAG) for every single tuple, where each DAG contains the information of this tuple as how the original tuple is split among different workers. *Storm* uses the DAG to track each tuple, if the tuple fails to be processed, then the system would retry the tuple from the spout again. +A natural question to ask here is what if something goes wrong for example: what if a single tuple gets lost? One might think that *Storm* maintains a queue similar to what we discussed before to ensure that every tuple is processed at least once. In fact, *Storm* does not keep such queues internally, the reason might be that there would be so many states to maintain if it needs to construct such queue for every edge. In stead, *Storm* maintains a directed acyclic graph (DAG) for every single tuple, where each DAG contains the information of this tuple as how the original tuple is split among different workers. *Storm* uses the DAG to track each tuple, if the tuple fails to be processed, then the system would retry the tuple from the spout again. <figure class="fullwidth"> <img src="{{ site.baseurl }}/chapter/9/DAG.jpg" alt="The simple tuple DAG" /> </figure> -There might be two concerns here. The first is how can *Storm* track every DAG efficiently and scalably, would it actually use more resources than just maintain the queues? The second concern is starting all the way from spout again instead of the intermediate queue seems taking a step backwards. For the first concern, *Storm* actually uses a very efficient algorithm to create the DAG of each tuple, it would take at mote 20 bytes for any tuple even the DAG contains trillions of tuples in it. For the second concern, if we look at the guarantees provided by both techniques, tracking DAG and intermediate queues, they are actually the same. They both guarantee that each tuple is processed at least once, so there is no fundamental differences between them. +There might be two concerns here. The first is how can *Storm* track every DAG efficiently and scalably, would it actually use more resources than just maintain the queues? The second concern is starting all the way from spout again instead of the intermediate queue seems taking a step backwards. For the first concern, *Storm* actually uses a very efficient algorithm to create the DAG of each tuple, it would take at most 20 bytes for any tuple even if the DAG contains trillions of tuples in it. For the second concern, if we look at the guarantees provided by both techniques, tracking DAG and intermediate queues, they are actually the same. They both guarantee that each tuple is processed at least once, so there are no fundamental differences between them. Thus as shown before, *Storm* can guarantee the primitives, it can process a stream of data, distribute the work among multiple workers and guarantee each tuple in the stream is processed. -### b) Micro-batch +### Micro-batch We have seen *Apache Storm* as a stream processing system that has the guarantees needed by such system. However, the core of *Storm* is to process stream at a granularity of each tuple. 
Sometimes such granularity is unnecessary, for the Twitter example that we had before, maybe we are only interested in the *stream* of tuples that came within a 5 minutes interval, with *Storm*, such specification can only be set on top of the system while one really want a convenient way to express such requirement within the system itself. In the next section, we would introduce several other stream processing systems, all of them can act on data stream in real time at large scale as *Storm*, but they provide more ways for the users to express how they want the tuples in the stream to be grouped and then processed. We refer to grouping the tuples before processing them as putting them into small *micro-batches*, and the processor can then provide results by working on those batches instead of single tuple. -- Spark Streaming +#### Spark Streaming The *Spark* streaming {% cite zaharia2012discretized --file streaming %} system is built upon *Apache Spark*, a system for large-scale parallel batch processing, which uses a data-sharing abstraction called 'Resilient Distributed Datasets' or RDDs to ensure fault-tolerance while achieve extremely low latency. The challenges with 'big data' stream processing were long recovery time when failure happens, and the the stragglers might increase the processing time of the whole system. Spark streaming overcomes those challenges by a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes, and tolerate stragglers. @@ -134,14 +137,15 @@ val words = lines.flatMap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() - ``` Let's look at an example of how we can count the word received from a TCP socket with *Spark streaming*. We first set the processing interval to be 1 second, and we will create a *D-stream* lines that represents the streaming data received from the specific TCP socket. Then we split the lines by space into words, now the stream of words is represented as the words *D-stream*. The words stream is further mapped to a *D-stream* of pairs, which is then reduced to count the number of words in each batch of data. *Spark streaming* handles the slow recovery and straggler issue by dividing stream into small batches on small time intervals and using RDDs to keep track of how the result of certain batched stream is computed. This model makes handling recovery and straggler easier because the computation can be ran in parallel by re-computing the result while RDDs make the process fast. -**Structured Streaming** Besides *Spark streaming*, Apache Spark recently added a new higher-level API, *Structured Streaming*{% cite structuredstreaming --file streaming %}, which is also built on top of the notion of RDDs while makes a strong guarantee that the output of the application is equivalent to executing a batch job on a prefix of data at any time, which is also known as *prefix integrity*. *Structured Streaming* makes sure that the output tables are always consistent with all the records in a prefix of the data, thus the out-of-order data is easy to identify and can simply be used to update its respective row in the table. *Structured Streaming* provides a simple API where the users can just specify the query as if it were a static table, and the systems would automatically convert this query to a stream processing job. 
+##### Structured Streaming
+
+Besides *Spark streaming*, Apache Spark recently added a new higher-level API, *Structured Streaming* {% cite structuredstreaming --file streaming %}, which is also built on top of the notion of RDDs while making a strong guarantee that the output of the application is equivalent to executing a batch job on a prefix of the data at any time, a property also known as *prefix integrity*. *Structured Streaming* makes sure that the output tables are always consistent with all the records in a prefix of the data; thus out-of-order data is easy to identify and can simply be used to update its respective row in the table. *Structured Streaming* provides a simple API where users can just specify the query as if it were over a static table, and the system automatically converts this query to a stream processing job.

```ruby
// Read data continuously from an S3 location
@@ -151,13 +155,13 @@ val inputDF = spark.readStream.json("s3://logs")
inputDF.groupBy($"action", window($"time", "1 hour")).count()
.writeStream.format("jdbc")
.start("jdbc:mysql//...")
- ```
+

The programming model of *Structured Streaming* views the latest data as newly appended rows in an unbounded table; every trigger interval, new rows are added to the existing table, which eventually updates the output table. Event-time then becomes natural in this view, since each event from a producer is a row where the event-time is just a column value in that row, which makes window-based aggregations simply a grouping on the event-time column. Unlike other systems, where users have to specify how to aggregate the records when outputting, *Structured Streaming* takes care of updating the result table when there is new data; users can then just specify different modes to decide what gets written to the external storage. For example, in Complete Mode the entire updated result table is written to external storage, while in Update Mode only the rows that were updated in the result table are written out.

-- Naiad
+#### Naiad

*Naiad* {% cite murray2013naiad --file streaming %} is another distributed system for executing data streams, developed by *Microsoft*. *Naiad* combines the high throughput of batch processors with the low latency of stream processors through its computation model, called *timely dataflow*, which enables dataflow computations with timestamps.

@@ -167,7 +171,6 @@ The *timely dataflow*, like topology described in *Storm*, contains stateful ver
<img src="{{ site.baseurl }}/chapter/9/Naiad.jpg" alt="A simple Timely Dataflow" />
</figure>
-
```ruby
void OnNotify(T time) {
@@ -175,29 +178,27 @@ foreach (var pair in counts[time])
this.SendBy(output, pair, time);
counts.Remove(time);
}
- ```

-In this example, A, B are different processing vertices and each of them has one message being processed, and the OnNotify function is running on node B. For A, the number 2 in its message (e2,2) indicates that this messages is generated in epoch 2, thus a on B counter would increase by 1 if it is counting the number of messages in epoch 1. In the example code, *counts* would be the counter that counts the number of distinct messages received (i.e., in other functions). Once B gets notified that one epoch has ended, the OnNotify function would be triggered, and a count for each distinct input record would then be sent to output.
-
+In this example, A and B are different processing vertices, each with one message being processed, and the OnNotify function is running on node B. For A, the number 2 in its message (e2, 2) indicates that this message was generated in epoch 2, thus a counter on B would increase by 1 if it were counting the number of messages in epoch 1. In the example code, *counts* is the counter that counts the number of distinct messages received (i.e. it is updated in other functions). Once B gets notified that one epoch has ended, the OnNotify function is triggered, and a count for each distinct input record is then sent to the output.

-*Naiad* can also execute cyclic dataflow program. If there is a loop in the data flow graph, for example where the message need to be processed with the processed result of previous message, then each message circulating in the group has another counter associated with it along with the epoch. This loop counter would increase by one whenever it complete a loop once. Thus the epoch and counter can work together for the system to track progress of the whole computation.
+*Naiad* can also execute cyclic dataflow programs. If there is a loop in the dataflow graph, for example where a message needs to be processed together with the processed result of a previous message, then each message circulating in the loop has another counter associated with it along with the epoch. This loop counter increases by one whenever the message completes a loop iteration. Thus the epoch and the loop counter work together to let the system track the progress of the whole computation.

-Tracking process is not a trivial task since there are many messages with different timestamps being sent between nodes. For example, a node *n* is in charge of notifying the end of each epoch and performing a task 'count the number of the event in each epoch'. Then the next question is when can *n* say for sure that a certain epoch has already ended thus the counting job can start. The problematic issue here is that even the node has been receiving messages with epoch *e*, there might still be messages with epoch *e-1* that are still *circulating* (i.e., haven't been consumed) in the dataflow thus if *n* fires the counting right now, it would end up with wrong results since those *circulating* messages are not counted. *Naiad* accomplishes this task by tracking all the messages that being sent and haven't being successfully consumed yet, the system can then compute a *could-result-in* map with those messages. In a *could-result-in* map, a message could lead to a notification of the end of epoch *e* if and only if the messages has timestamp *t* <= *e*, and there is a path from the message to the notification location *n* and all the ingress, egress, feedback vertex on that path satisfies *t* <= *e*. This is guaranteed by that messages are not sent "back in time". Thus the *could-result-in* map can keep track of the epochs, and functions rely on epochs can work correctly.
+Tracking progress is not a trivial task, since there are many messages with different timestamps being sent between nodes. For example, suppose a node *n* is in charge of notifying the end of each epoch and performing the task 'count the number of events in each epoch'. The next question is when *n* can say for sure that a certain epoch has already ended so that the counting job can start. The problematic issue here is that even if the node has been receiving messages with epoch *e*, there might still be messages with epoch *e-1* that are still *circulating* (i.e.
haven't been consumed) in the dataflow; thus if *n* fires the counting right now, it would end up with wrong results, since those *circulating* messages are not counted. *Naiad* accomplishes this task by tracking all the messages that have been sent and haven't been successfully consumed yet; the system can then compute a *could-result-in* map with those messages. In a *could-result-in* map, a message could lead to a notification of the end of epoch *e* if and only if the message has timestamp *t* <= *e*, and there is a path from the message to the notification location *n* on which all the ingress, egress and feedback vertices satisfy *t* <= *e*. This is guaranteed by the fact that messages are not sent "back in time". Thus the *could-result-in* map can keep track of the epochs, and functions relying on epochs can work correctly.

*Naiad* is the implementation of *timely dataflow* in a cluster, where the tracker on each machine broadcasts both the messages that have not been consumed and those that have recently been consumed, so that every tracker maintains a single view of the global *could-result-in* map; thus the progress of the whole computation can be tracked. *Naiad* also optimizes its performance by dealing with micro-stragglers, for example by making changes at the TCP layer to reduce network latency and by customizing garbage collection methods.

Another interesting point about *Naiad* is how it deals with failures. As described before, there are systems that achieve fault-tolerance by replication and systems such as *Storm* that replay the tuple from the beginning. Then we have *Spark streaming*, which keeps the *lineage* of all operations and is able to rebuild the RDDs in parallel. *Naiad* can more or less be seen as an example that takes the replay approach: it checkpoints the computation and can perform potentially more compact checkpointing when requested. When the system periodically checkpoints, all processes pause and finish ongoing work. Then the system performs checkpointing on each vertex and resumes. To recover from a failure, all live processes revert to the last durable checkpoint, and the work from the failed vertex is reassigned to other processes. This method might have higher recovery latency than other approaches, due to both the checkpointing and the resuming.

-In short, *Naiad* allows processing of messages from different epochs and aggregating result from the same epoch by using timestamps on messages. Moreover, by allowing producers to set epoch on messages arbitrarily (i.e., set logical time), *Naiad* provides a powerful way to create batches of streams. However, the computation model of *Naiad* introduce high latency when dealing with failures.
+In short, *Naiad* allows processing messages from different epochs and aggregating results from the same epoch by using timestamps on messages. Moreover, by allowing producers to set the epoch on messages arbitrarily (i.e. to set a logical time), *Naiad* provides a powerful way to create batches of streams. However, the computation model of *Naiad* introduces high latency when dealing with failures.

-- Google Dataflow
+#### Google Dataflow

We have now seen three different systems that can process data streams at large scale; however, each of them is constrained in the way it views the dataset. *Storm* can perform stream processing on each tuple, while *Spark streaming* and *Naiad* have their own ways of grouping tuples together into small batches before processing.
The authors of *Google Dataflow* {% cite akidau2015dataflow --file streaming %} believe that the fundamental problem of those views is that they are limited by the processing engine; for example, if you were to use *Spark streaming* to process the stream, you could only group the tuples into small time intervals. The motivation of *Google Dataflow* is then a general underlying system with which users can express whatever processing model they want. *Google Dataflow* is a system that allows batch, micro-batch and stream processing, where users can choose based on the tradeoffs provided by each processing model, such as latency or resource constraints. *Google Dataflow* implements many features in order to achieve this goal, and we will briefly talk about them.

-*Google Dataflow* provides a windowing model that supports unaligned event-time windows, which helped the users to express how to batch the tuples together in a stream. Windowing slices a dataset into finite chunks for processing as a group, one can think of it as batching as we discussed before. Unaligned windows are the windows that would only be applied to certain tuples during the period, for example, if we have an unaligned window *w[1:00,2:00)(k)*, and only the events with key *k* during the time period [1:00, 2:00) would be grouped by this window. This is powerful since it provides an alternative way of batching tuples other than just time before processing.
+*Google Dataflow* provides a windowing model that supports unaligned event-time windows, which helps users express how to batch the tuples of a stream together. Windowing slices a dataset into finite chunks for processing as a group; one can think of it as the batching we discussed before. Unaligned windows are windows that are only applied to certain tuples during the period; for example, with an unaligned window `w[1:00, 2:00)(k)`, only the events with key `k` during the time period `[1:00, 2:00)` would be grouped by this window. This is powerful since it provides an alternative way, other than just time, of batching tuples before processing.

The next question is then how *Google Dataflow* knows when to emit the results of a certain window; this requires some other signal to show when the window is done. *Google Dataflow* handles this by providing different choices of triggering methods. One example is completion estimation; this is useful when combined with percentile watermarks, where one might only care about processing a minimum percentage of the input data quickly rather than finishing every last piece of it. Another interesting triggering method is responding to data arrival; this is useful for applications that group data based on counts, for example where the processor should be fired once 100 data points are received. These triggering semantics help *Google Dataflow* become a general-purpose processing system: the first method allows users to deal with stragglers, while the second provides a way to support tuple-based windows.

@@ -208,48 +209,47 @@ PCollection<KV<String, Integer>> output = input
.apply(Window.trigger(Repeat(AtPeriod(1, MINUTE)))
.accumulating())
.apply(Sum.integersPerKey());
- ```

-The above example code shows how to apply a trigger that repeatedly fires on one-minute period, where PCollection can be viewed as the data stream abstraction in *Google Dataflow*. The *accumulating* mode is also specified so that the *Sum* can be refined overtime.
+The above example code shows how to apply a trigger that repeatedly fires on a one-minute period, where `PCollection` can be viewed as the data stream abstraction in *Google Dataflow*. The *accumulating* mode is also specified so that the *Sum* can be refined over time.

-*Google Dataflow* also relies on MillWheel{% cite akidau2013millwheel --file streaming %} as the underlying execution engine to achieve exactly-once-delivery of the tuples. MillWheel is a framework for building low-latency data-processing applications used at Google. It achieves exactly-once-delivery by first checking the incoming record and discard duplicated ones, then pending the productions (i.e., produce records to any stream) until the senders are acknowledges, only then the pending productions are sent.
+*Google Dataflow* also relies on MillWheel {% cite akidau2013millwheel --file streaming %} as the underlying execution engine to achieve exactly-once delivery of the tuples. MillWheel is a framework for building low-latency data-processing applications used at Google. It achieves exactly-once delivery by first checking incoming records and discarding duplicated ones, then holding productions (i.e., records produced to any stream) as pending until the senders are acknowledged; only then are the pending productions sent.

In conclusion, one of the most important core principles that drives *Google Dataflow* is to accommodate the diversity of known use cases, and it does so by providing a rich set of abstractions such as windowing, triggering and controlling. Compared to the 'specialized' systems that we discussed above, *Google Dataflow* is a more general system that can fulfill batch, micro-batch, and stream processing requirements.

-
## The systems being used nowadays

-Till now we have talked about what is stream processing and what are the different model/system built for this purpose. As shown before, the systems vary on how they view stream, for example *Storm* can perform operation on the level of each tuple while *Spark streaming* could group tuples into micro-batches and then process on the level of batch. They also differ on how to deal with failures, *Storm* can replay the tuple from spout while *Naiad* would keep checkpointing. Then we introduced *Google Dataflow*, which is seems the most powerful tool so far that allows the users to express how to group and control the tuples in the stream.
+Until now we have talked about what stream processing is and what different models and systems have been built for this purpose. As shown before, the systems vary in how they view the stream; for example, *Storm* performs operations at the level of each tuple, while *Spark streaming* groups tuples into micro-batches and then processes them at the level of a batch. They also differ in how they deal with failures: *Storm* can replay the tuple from the spout, while *Naiad* keeps checkpointing. Then we introduced *Google Dataflow*, which seems like the most powerful tool so far, allowing users to express how to group and control the tuples in the stream.

Despite all the differences among them, they all started with more or less the same goal: to be *the* stream processing system used by companies, and we showed several examples of why companies might need such a system. In this section, we discuss three companies that use stream processing systems at the core of their business: Alibaba, Twitter and Spotify.
## Alibaba

-Alibaba is the largest e-commerce retailer in the world with an annual sales more than eBay and Amazon combined in 2015. Alibaba search is the its personalized search and recommendation platform which uses Apache Flink to power the critical aspects of it{% cite alibabaflink --file streaming %}.
-The processing engine of Alibaba runs on 2 different pipelines: a batch pipeline and a streaming pipeline, where the first one would process all data sources while the latter process updates that occur after the batch job is finished. As we can see the second pipeline is one example of stream processing. One of the example applications for the streaming pipeline is the online machine learning recommendation system. There are special days of the year (i.e., Singles Day in China, which is very similar to Black Friday in the U.S.) where transaction volume is huge and the previously-trained model would not correctly reflect the current trends, thus Alibaba needs a streaming job to take the real-time data into account. There are many reasons that Alibaba chose Flink, for example, Flink is general enough to express both the batch pipeline and the streaming pipeline. Another reason is that the changes to the products must be reflected in the final search result thus at-least-once semantics is needed, while other products in Alibaba might need exactly-once semantics, and Flink provides both semantics.
+Alibaba is the largest e-commerce retailer in the world, with annual sales in 2015 greater than eBay and Amazon combined. Alibaba search is its personalized search and recommendation platform, which uses Apache Flink to power critical aspects of it {% cite alibabaflink --file streaming %}.
+
+The processing engine of Alibaba runs on two different pipelines: a batch pipeline and a streaming pipeline, where the first processes all data sources while the latter processes updates that occur after the batch job has finished. As we can see, the second pipeline is an example of stream processing. One example application for the streaming pipeline is the online machine learning recommendation system. There are special days of the year (e.g. Singles Day in China, which is very similar to Black Friday in the U.S.) where transaction volume is huge and the previously-trained model would not correctly reflect the current trends; thus Alibaba needs a streaming job to take the real-time data into account. There are many reasons why Alibaba chose Flink; for example, Flink is general enough to express both the batch pipeline and the streaming pipeline. Another reason is that changes to the products must be reflected in the final search results, so at-least-once semantics is needed, while other products at Alibaba might need exactly-once semantics, and Flink provides both.

-Alibaba developed a forked version of Flink called Blink to fit some of the unique requirements at Alibaba. One important improvement here is a more robust integration with YARN{% cite hadoopyarn --file streaming %}, where YARN is used as the global resource manager for Flink. YARN requires a job in Flink to grab all required resources up front and can not require or release resources dynamically. As Alibaba search engine is currently running on over 1000 machines, a better resources utilization is critical. Blink improves on this by letting each job has its own JobMaster to request and release resources as the job requires, which optimizes the resources usage.
+Alibaba developed a forked version of Flink, called Blink, to fit some of its unique requirements. One important improvement here is a more robust integration with YARN {% cite hadoopyarn --file streaming %}, where YARN is used as the global resource manager for Flink. YARN requires a Flink job to grab all required resources up front and cannot request or release resources dynamically. As Alibaba's search engine currently runs on over 1000 machines, better utilization of resources is critical. Blink improves on this by letting each job have its own JobMaster to request and release resources as the job requires, which optimizes resource usage.

## Twitter

-Twitter is one of the 'go-to' examples that people would think of when considering large scale stream processing system, since it has a huge amount of data that needed to be processed in real-time. Twitter bought the company that created *Storm* and used *Storm* as its real-time analysis tool for several years {% cite toshniwal2014storm --file streaming %}. However, as the data volume along with the more complex use cases increased, Twitter needed to build a new real-time stream data processing system as *Storm* can no longer satisfies the new requirements. We would talk about how *Storm* was used at Twitter and then the system that they built to replace *Storm*-*Heron*.
+Twitter is one of the 'go-to' examples that people think of when considering large scale stream processing systems, since it has a huge amount of data that needs to be processed in real-time. Twitter bought the company that created *Storm* and used *Storm* as its real-time analysis tool for several years {% cite toshniwal2014storm --file streaming %}. However, as the data volume and the complexity of the use cases increased, Twitter needed to build a new real-time stream data processing system, as *Storm* could no longer satisfy the new requirements. We will talk about how *Storm* was used at Twitter and then about the system that they built to replace *Storm*: *Heron*.

-- Storm@Twitter
+### Storm@Twitter

Twitter requires complex computation on streaming data in real-time, since each interaction with a user requires making a number of complex decisions, often based on data that has just been created, and it uses *Storm* as its real-time distributed stream data processing engine. As we described before, *Storm* represents one of the early open-source and popular stream processing systems in use today; it was developed by Nathan Marz at BackType, which was acquired by Twitter in 2011. After the acquisition, *Storm* was improved and open-sourced by Twitter and then picked up by various other organizations.

-We will first briefly introduce the structure of *Storm* at Twitter. *Storm* runs on a distributed cluster, and clients submit topologies to a master node, which is in charge of distributing and coordinating the execution of the topologies. The actual bolts and spouts are tasks, and multiple tasks are grouped into executor, multiple executors are in turn grouped into a worker. The worker process would then be distributed to an actual worker node (i.e., machine), where there can be multiple worker processes be running on. Each worker node runs a supervisor that communicates with the master node thus the state of the computation can be tracked.
+We will first briefly introduce the structure of *Storm* at Twitter.
*Storm* runs on a distributed cluster, and clients submit topologies to a master node, which is in charge of distributing and coordinating the execution of the topologies. The actual bolts and spouts are tasks; multiple tasks are grouped into an executor, and multiple executors are in turn grouped into a worker. Each worker process is then assigned to an actual worker node (i.e. a machine), which can run multiple worker processes. Each worker node runs a supervisor that communicates with the master node so that the state of the computation can be tracked.

-As shown before, *Storm* can guarantee each tuple is processed 'at least once', however, at Twitter, *Storm* can provide two types of semantic guarantees-'at least once' and 'at most once'. 'At least once' semantic is guaranteed by the directed acyclic graph as we showed before, and 'at most once' semantic is guaranteed by dropping the tuple in case of a failure (e.g., by disabling the acknowledgements of each tuple). Note that for 'at least once' semantic, the coordinators (i.e., Zookeeper) would checkpoint each processed tuple in the topology, and the system can start processing tuples from the last 'checkpoint' that is recorded once recovered from a failure.
+As shown before, *Storm* can guarantee each tuple is processed "at least once"; at Twitter, however, *Storm* can provide two types of semantic guarantees: "at least once" and "at most once". "At least once" semantics is guaranteed by the directed acyclic graph as we showed before, and "at most once" semantics is guaranteed by dropping the tuple in case of a failure (e.g. by disabling the acknowledgements of each tuple). Note that for "at least once" semantics, the coordinators (i.e. Zookeeper) checkpoint each processed tuple in the topology, and the system can start processing tuples from the last recorded "checkpoint" once it has recovered from a failure.

-*Storm* fulfilled many requirements at Twitter with satisfactory performance. *Storm* was running on hundreds of servers and several hundreds of topologies ran on these clusters some of which run on more than a few hundred nodes, terabytes of data flows through the cluster everyday and generated several billions of output tuples. These topologies were used to do both simple tasks such as filtering and aggregating the content of various streams and complex tasks such as machine learning on stream data. *Storm* was resilient to failures and achieved relatively low latency, a machine can be taken down for maintenance without interrupting the topology and the 99% response time for processing a tuple is close to 1ms.
+*Storm* fulfilled many requirements at Twitter with satisfactory performance. *Storm* ran on hundreds of servers, and several hundred topologies ran on these clusters, some of them on more than a few hundred nodes; terabytes of data flowed through the clusters every day and generated billions of output tuples. These topologies were used for both simple tasks, such as filtering and aggregating the content of various streams, and complex tasks, such as machine learning on stream data. *Storm* was resilient to failures and achieved relatively low latency: a machine could be taken down for maintenance without interrupting the topology, and the 99th-percentile response time for processing a tuple was close to 1ms.

In conclusion, *Storm* was a critical infrastructure at Twitter that powered many of the real-time data-driven decisions made at Twitter.
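To make the notion of a topology concrete, here is a rough sketch of how a topology like the ones described above could be wired together and submitted with Storm's `TopologyBuilder`. It is only an illustration under assumptions: the component names, parallelism hints and the `TweetSpout`/`WordCountBolt` classes are hypothetical placeholders (`SplitWordsBolt` is the sketch from earlier), and the exact configuration calls may differ between Storm versions.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // A spout that emits raw tweets, run with 5 executors (hypothetical class).
        builder.setSpout("tweets", new TweetSpout(), 5);

        // Split each tweet into words; shuffle grouping spreads tuples randomly.
        builder.setBolt("split", new SplitWordsBolt(), 8)
               .shuffleGrouping("tweets");

        // Count words; fields grouping routes the same word to the same task.
        builder.setBolt("count", new WordCountBolt(), 12)
               .fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(4); // worker processes to spread across the cluster
        // Disabling ackers (conf.setNumAckers(0)) would trade the at-least-once
        // guarantee for at-most-once semantics, as described above.

        // Submitting hands the topology to the master node, which schedules
        // its tasks onto the worker nodes.
        StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
    }
}
```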
-- Twitter Heron
+### Twitter Heron

-*Storm* has long served as the core of Twitter for real-time analysis, however, as the scale of data being processed has increased, along with the increase in the diversity and the number of use cases, many limitations of *Storm* became apparent {% cite kulkarni2015twitter --file streaming %}.
+*Storm* long served as the core of real-time analysis at Twitter; however, as the scale of the data being processed increased, along with the diversity and number of use cases, many limitations of *Storm* became apparent {% cite kulkarni2015twitter --file streaming %}.

There are several issues with *Storm* that made using it at Twitter challenging. The first challenge is debuggability: there is no clean mapping from the logical units of computation in the topology to each physical process, which makes finding the root cause of misbehavior extremely hard. Another challenge is that, as cluster resources become precious, the need for dedicated cluster resources in *Storm* leads to inefficiency, and it is better to share resources across different types of systems. In addition, Twitter needed a more efficient system; simply because of the increased scale, any improvement in performance can translate into a huge benefit.

@@ -257,23 +257,24 @@ Twitter realized in order to meet all the needs, they needed a new real-time str

A key design goal for Heron is compatibility with the *Storm* API; thus Heron runs topologies, i.e. graphs with spouts and bolts, like Storm. Unlike *Storm*, though, a Heron topology is translated into a physical plan before actual execution, and there are multiple components in the physical plan.

-Each topology is run as an Aurora{% cite apacheaurora --file streaming %} job, instead of using Nimbus{% cite nimbusproject --file streaming %} as scheduler. Nimbus used to be the master node of *Storm* that schedules and manages all running topologies, it delopys topology on *Storm*, and assigns workers to execute the topology where Aurora is also a service scheduler that can manage long-running services. Twitter chose Aurora since it is developed and used by other Twitter projects. Each Aurora job is then consisted of several containers, the first container runs Topology Master, which provides a single point of contact for discovering the status of the topology and also serves as the gateway for the topology metrics through an endpoint. The other containers each run a Stream Manager, a Metrics Manager and a number of Heron Instances. The key functionality for each Stream Manager is to manage the routing of tuples efficiently, all Stream Managers are connected to each other and the tuples from Heron Instances in different containers would be transmitted through their Stream Managers, thus the Stream Managers can be viewed as Super Node for communication. Stream Manager also provides a backpressure mechanism, if the receiver component is unable to handle incoming data/tuples, then the sender can dynamically adjust the rate of the data flows through the network. For example, if the Stream Managers of the bolts are overwhelmed, they would then notice the Stream Managers of the spouts to slow down thus ensure all the data are properly processed. Heron Instance carries out the real work for a spout or a bolt, unlike worker in *Storm*, each Heron Instance runs only a single task as a process, in addition to performing the work, Heron Instance is also responsible for collecting multiple metrics.
The metrics collected by Heron Instances would then be sent to the Metrics Manager in the same container and to the central monitoring system.
+Each topology is run as an Aurora {% cite apacheaurora --file streaming %} job, instead of using Nimbus {% cite nimbusproject --file streaming %} as the scheduler. Nimbus used to be the master node of *Storm* that schedules and manages all running topologies: it deploys a topology on *Storm* and assigns workers to execute it, whereas Aurora is a service scheduler that can manage long-running services. Twitter chose Aurora since it is developed and used by other Twitter projects. Each Aurora job consists of several containers; the first container runs the Topology Master, which provides a single point of contact for discovering the status of the topology and also serves as the gateway for the topology metrics through an endpoint. The other containers each run a Stream Manager, a Metrics Manager and a number of Heron Instances. The key functionality of each Stream Manager is to manage the routing of tuples efficiently; all Stream Managers are connected to each other, and the tuples from Heron Instances in different containers are transmitted through their Stream Managers, so the Stream Managers can be viewed as super nodes for communication. The Stream Manager also provides a backpressure mechanism: if the receiving component is unable to handle incoming data/tuples, then the sender can dynamically adjust the rate at which data flows through the network. For example, if the Stream Managers of the bolts are overwhelmed, they notify the Stream Managers of the spouts to slow down, thus ensuring all the data is properly processed. A Heron Instance carries out the real work for a spout or a bolt; unlike a worker in *Storm*, each Heron Instance runs only a single task as a process. In addition to performing the work, the Heron Instance is also responsible for collecting multiple metrics. The metrics collected by Heron Instances are then sent to the Metrics Manager in the same container and to the central monitoring system.

The components in the Heron topology are clearly separated, so failures at various levels are handled differently. For example, if the Topology Master dies, the container restarts the process, and the stand-by Topology Master takes over as master while the restarted one becomes the stand-by. When a Stream Manager dies, it gets restarted in the same container, and after rediscovering the Topology Master, it fetches and checks whether any changes need to be made to its state. Similarly, all the other failures can be handled gracefully by Heron.

-Heron addresses the challenges of *Storm*. First, each task is performed by a single Heron Instance, and the different functionalities are abstracted into different level, which makes debug clear. Second, the provisioning of resources is abstracted out thus made sharing infrastructure with other systems easier. Third, Heron provides multiple metrics along with the backpressure mechanism, which can be used to precisely reason about and achieve a consistent rate of delivering results.
+Heron addresses the challenges of *Storm*. First, each task is performed by a single Heron Instance, and the different functionalities are abstracted into different levels, which makes debugging clearer. Second, the provisioning of resources is abstracted out, which makes sharing infrastructure with other systems easier.
Third, Heron provides multiple metrics along with the backpressure mechanism, which can be used to reason precisely about, and achieve, a consistent rate of delivering results.

-*Storm* has been decommissioned and Heron is now the de-facto streaming system at Twitter and an interesting note is that after migrating all the topologies to Heron, there was an overall 3X reduction in hardware. Not only Heron reduces the infrastructure needed, it also outperform *Storm* by delivering 6-14X improvements in throughput, and 5-10X reductions in tuple latencies.
+*Storm* has been decommissioned and Heron is now the de-facto streaming system at Twitter; an interesting note is that after migrating all the topologies to Heron, there was an overall 3X reduction in hardware. Not only did Heron reduce the infrastructure needed, it also outperformed *Storm*, delivering 6-14X improvements in throughput and 5-10X reductions in tuple latencies.

## Spotify
+
Another company that deploys large scale distributed systems is Spotify {% cite spotifylabs --file streaming %}. Every small piece of information, such as listening to a song or searching for an artist, is sent to Spotify's servers and processed. There are many features of Spotify that need such a stream processing system, such as music/playlist recommendations. Originally, Spotify would collect all the data generated by client software and store it in their HDFS, and that data would then be processed on an hourly basis by a batch job (i.e., the data collected each hour would be stored and processed together).

-In the original Spotify structure, each job must determine, with high probability, that all data from the hourly bucket has successfully written to a persistent storage before firing the job. Each job were running as a batch job by reading the files from the storage, so late-arriving data for already completed bucket can not be appended since jobs generally only read data once from a hourly bucket, thus each job has to treat late data differently. All late data is written to a currently open hourly bucket then.
+In the original Spotify structure, each job must determine, with high probability, that all data from the hourly bucket was successfully written to persistent storage before the job fires. Each job ran as a batch job reading the files from storage, so late-arriving data for an already completed bucket could not be appended, since jobs generally only read data once from an hourly bucket; thus each job had to treat late data differently. All late data was instead written to a currently open hourly bucket.

-Spotify then decided to use *Google Dataflow*, since the features provided by it is exactly what Spotify wants.
+Spotify then decided to use *Google Dataflow*, since the features it provides are exactly what Spotify wants.
The previous batch jobs can be written as streaming jobs with a one-hour window size, and all the data streams can be grouped based on both window and key, while late-arriving data can be handled gracefully if the controlling is set to *accumulating & retracting*. Also, *Google Dataflow* reduces the export latency of the hourly analysis results, since when assigning windows, Spotify can use an early trigger that is set to emit a pane (i.e. a result) every N tuples until the window is closed.

The worst end-to-end latency observed with the new Spotify system based on *Google Dataflow* is four times lower than with the previous system, and it also comes with much lower operational overhead.

## References

-{% bibliography --file streaming %}
\ No newline at end of file
+{% bibliography --file streaming %}
