From 96cad2ba22898805e10f3e23b0223c27a09bf033 Mon Sep 17 00:00:00 2001 From: cnnrznn Date: Mon, 5 Dec 2016 19:06:37 -0500 Subject: Update dist-langs.md --- chapter/4/dist-langs.md | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 9f04232..199ab6a 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -3,6 +3,9 @@ layout: page title: "Distributed Programming Languages" by: "A Systems Person" --- +## Outline + +* ### Two major major, orthogonal approaches to distributed languages: @@ -49,6 +52,8 @@ Unlike Erlang, MapReduce and DSL's that implement the paradigm are "all the rage Unlike Erlang, MapReduce has experienced adoption because it offers true abstraction of the problems of distributed computing. Erlang only provided a way of detecting a process failure; it did not consider machine or network failures. +* MapReduce is a GPL for the domain of distribution + ## References {% bibliography --file dist-langs %} -- cgit v1.2.3 From 5adc3c9e9bf6998658240b6849ffac281e39199f Mon Sep 17 00:00:00 2001 From: cnnrznn Date: Tue, 6 Dec 2016 13:00:58 -0500 Subject: Update dist-langs.md --- chapter/4/dist-langs.md | 25 ++++++++++++++++++++----- 1 file changed, 20 insertions(+), 5 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 199ab6a..5bfea62 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -3,19 +3,33 @@ layout: page title: "Distributed Programming Languages" by: "A Systems Person" --- -## Outline -* +### Three major approaches to distributed languages: -### Two major major, orthogonal approaches to distributed languages: +#### Shared Memory + +What is it? + +Some examples: + +* Linda +* Orca + +Tries to make many machines look like a single machine. +This is hard because of consistency and partitioning. +The logic of the program is simple, but requiring that the system handle shared memory opens up many opportunities for performance bugs. #### Actor / Object model -The actor model has its roots in procedural programming. -This model maps in a straighforward way to a distributed environment. +The actor model has its roots in procedural and object oriented programming. +Communication through RPC or message-passing. +Actors/Objects are location agnostic, because state is not shared. +The system can decide how to most efficiently place actors. * Erlang * Cloud Haskell (I know, right? Why?) +* Emerald +* Argus #### Dataflow model (static and stream) @@ -24,6 +38,7 @@ Some languages that use this model are: * Multilisp * MapReduce (Spark, Hadoop, etc.) +* Orleans (Wait, what??) ### Why GPL's not DSL's? -- cgit v1.2.3 From f34fe47ae54d8790961d12d571df719bb1535b2d Mon Sep 17 00:00:00 2001 From: cnnrznn Date: Tue, 6 Dec 2016 14:51:02 -0500 Subject: Update dist-langs.md --- chapter/4/dist-langs.md | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 5bfea62..7b1412b 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -4,6 +4,13 @@ title: "Distributed Programming Languages" by: "A Systems Person" --- +### Problems of Distributed Programming + +* Partial failure +* Consistency +* Efficiency +* Scallability + ### Three major approaches to distributed languages: #### Shared Memory @@ -40,6 +47,15 @@ Some languages that use this model are: * MapReduce (Spark, Hadoop, etc.) * Orleans (Wait, what??) +#### Which is best? Why? 
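
To ground the comparison that follows, here is a minimal single-process sketch of the MapReduce programming model in plain Python (not the API of Hadoop, Spark, or any other framework): the developer writes only per-record `map_fn` and per-key `reduce_fn` logic, while partitioning, shuffling, and recovery from worker failure are assumed to be the framework's job.

```python
# A local, single-process sketch of the MapReduce programming model.
# The point is what the *programmer* writes: two pure functions. Everything
# else (partitioning, shuffling, retrying failed workers) belongs to the
# framework. Illustrative only; not the API of Hadoop or Spark.
from collections import defaultdict

def map_fn(document):
    # Emit (word, 1) pairs for every word in one input record.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Combine all values that were shuffled to the same key.
    return (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Stand-in for the framework: map phase, shuffle by key, reduce phase.
    shuffled = defaultdict(list)
    for record in inputs:
        for key, value in map_fn(record):
            shuffled[key].append(value)
    return [reduce_fn(key, values) for key, values in shuffled.items()]

if __name__ == "__main__":
    docs = ["erlang is not mapreduce", "mapreduce is all the rage"]
    print(run_mapreduce(docs, map_fn, reduce_fn))
```

Nothing in `map_fn` or `reduce_fn` mentions machines, networks, or failures, which is the abstraction being credited to MapReduce in the discussion below.
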
+ +Why is MR all-the-rage? + +* Dataflow / MapReduce fundamentally changed the programming style for distributed systems +* Other models (Actor, DSM) tried to mask distribution +* By changing the style, programs need necessarily consider communication patterns (disk, network) +* Although, system may still handle fault tolerance + ### Why GPL's not DSL's? * problem of domain-composition -- cgit v1.2.3 From f4ed5284892178b53150d1a9bf936f56327cce0a Mon Sep 17 00:00:00 2001 From: cnnrznn Date: Tue, 6 Dec 2016 20:53:26 -0500 Subject: Update dist-langs.md --- chapter/4/dist-langs.md | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 7b1412b..a2ea4d3 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -11,6 +11,9 @@ by: "A Systems Person" * Efficiency * Scallability +Languages and systems designed for distribution aim to abstract these problems from the application developer. + + ### Three major approaches to distributed languages: #### Shared Memory @@ -51,11 +54,15 @@ Some languages that use this model are: Why is MR all-the-rage? +* MR is DSL for distribution? (wouldn't use it to develop single-machine app (probably)) + * Dataflow / MapReduce fundamentally changed the programming style for distributed systems * Other models (Actor, DSM) tried to mask distribution * By changing the style, programs need necessarily consider communication patterns (disk, network) * Although, system may still handle fault tolerance +## Maybe use below topics + ### Why GPL's not DSL's? * problem of domain-composition -- cgit v1.2.3 From c05b3f575b10fcce870ddcac6c983b53901ac5a9 Mon Sep 17 00:00:00 2001 From: cnnrznn Date: Wed, 7 Dec 2016 11:40:17 -0500 Subject: Update dist-langs.md --- chapter/4/dist-langs.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index a2ea4d3..7eab0c4 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -7,10 +7,12 @@ by: "A Systems Person" ### Problems of Distributed Programming * Partial failure -* Consistency -* Efficiency +* Consistency (Concurrency) +* Efficiency (Latency) * Scallability +For the above points cite "A Note on Distributed Computing" + Languages and systems designed for distribution aim to abstract these problems from the application developer. -- cgit v1.2.3 From 1e6ce06c7c57eba04315e90353f1cd7ab3fff1a8 Mon Sep 17 00:00:00 2001 From: cnnrznn Date: Wed, 7 Dec 2016 12:00:26 -0500 Subject: Update dist-langs.md --- chapter/4/dist-langs.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 7eab0c4..ac7122d 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -42,6 +42,7 @@ The system can decide how to most efficiently place actors. * Cloud Haskell (I know, right? Why?) * Emerald * Argus +* Orleans #### Dataflow model (static and stream) @@ -50,7 +51,7 @@ Some languages that use this model are: * Multilisp * MapReduce (Spark, Hadoop, etc.) -* Orleans (Wait, what??) + #### Which is best? Why? 
-- cgit v1.2.3 From a61720c405141700b5d533ca27af2921c775864e Mon Sep 17 00:00:00 2001 From: cnnrznn Date: Wed, 7 Dec 2016 14:52:21 -0500 Subject: Update dist-langs.md --- chapter/4/dist-langs.md | 52 ++++++++++++++++++++++++------------------------- 1 file changed, 25 insertions(+), 27 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index ac7122d..d2f425d 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -11,7 +11,7 @@ by: "A Systems Person" * Efficiency (Latency) * Scallability -For the above points cite "A Note on Distributed Computing" +For the above points cite "A Note on Distributed Computing," "Fallacies of Distributed Computing Explained" Languages and systems designed for distribution aim to abstract these problems from the application developer. @@ -26,6 +26,7 @@ Some examples: * Linda * Orca +* RPC ( and why RPC is shared-memory ) Tries to make many machines look like a single machine. This is hard because of consistency and partitioning. @@ -39,7 +40,7 @@ Actors/Objects are location agnostic, because state is not shared. The system can decide how to most efficiently place actors. * Erlang -* Cloud Haskell (I know, right? Why?) +* Cloud Haskell * Emerald * Argus * Orleans @@ -51,49 +52,46 @@ Some languages that use this model are: * Multilisp * MapReduce (Spark, Hadoop, etc.) - +* RDD +* Dryad, DryadLinq #### Which is best? Why? -Why is MR all-the-rage? +MR vs Actors: depends on problem, solution -* MR is DSL for distribution? (wouldn't use it to develop single-machine app (probably)) +How fine grain is your data and logic? +Does your algorithm map to a batch processing job? +MR: + +* MR is DSL for distribution? (wouldn't use it to develop single-machine app (probably)) * Dataflow / MapReduce fundamentally changed the programming style for distributed systems * Other models (Actor, DSM) tried to mask distribution * By changing the style, programs need necessarily consider communication patterns (disk, network) * Although, system may still handle fault tolerance -## Maybe use below topics - -### Why GPL's not DSL's? - -* problem of domain-composition -* problem of abstraction -* problem of ecosystem -* problem of tumultuous architecture -* "any gpl + library can act as a dsl" - mernik" +Actors: -#### Erlang vs C: A Tar and Feathering +* -{% cite Armstrong2010 --file dist-langs %} +### Support for Distribution -Erlang offers nothing that is unavailable in C. +#### Intro -For example, dynamic code swapping is one of Erlang's major selling points. -However, code swapping can easily be achieved in C with dynamic linking. -This approach is analogous to the example offered in the Erlang paper. +* What is a DSL? +> Domain-specific languages are languages tailored to a specific application domain. -Other selling points, such as isolation, concurrency, and message passing can all be accomplished with unix-style system calls. -Why is this language not considered redundant? +> A domain-specific language is a programming language or executable specification language that offer, through appropriate notations and abstractions, expressive power focused on, and usually restricted to, a particular problem domain. -#### MapReduce: A New Hope +#### Where is it in the stack? -Unlike Erlang, MapReduce and DSL's that implement the paradigm are "all the rage." -Unlike Erlang, MapReduce has experienced adoption because it offers true abstraction of the problems of distributed computing. 
-Erlang only provided a way of detecting a process failure; it did not consider machine or network failures. +#### Why GPL's not DSL's? -* MapReduce is a GPL for the domain of distribution +* problem of domain-composition +* problem of abstraction +* problem of ecosystem +* problem of tumultuous architecture +* "any gpl + library can act as a dsl" - mernik" ## References -- cgit v1.2.3 From 7bad353a24563a05c96ad50bc48acdaa28705d58 Mon Sep 17 00:00:00 2001 From: cnnrznn Date: Wed, 7 Dec 2016 14:52:48 -0500 Subject: Update dist-langs.md --- chapter/4/dist-langs.md | 1 + 1 file changed, 1 insertion(+) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index d2f425d..711d9fd 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -79,6 +79,7 @@ Actors: #### Intro * What is a DSL? + > Domain-specific languages are languages tailored to a specific application domain. > A domain-specific language is a programming language or executable specification language that offer, through appropriate notations and abstractions, expressive power focused on, and usually restricted to, a particular problem domain. -- cgit v1.2.3 From 33c900bbf51fcabea6b69287497f52a71c18248d Mon Sep 17 00:00:00 2001 From: cnnrznn Date: Wed, 7 Dec 2016 14:56:22 -0500 Subject: Update dist-langs.md --- chapter/4/dist-langs.md | 2 ++ 1 file changed, 2 insertions(+) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 711d9fd..b49162e 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -82,6 +82,8 @@ Actors: > Domain-specific languages are languages tailored to a specific application domain. +Another definition: + > A domain-specific language is a programming language or executable specification language that offer, through appropriate notations and abstractions, expressive power focused on, and usually restricted to, a particular problem domain. #### Where is it in the stack? -- cgit v1.2.3 From 4753f31ac4a9271e88e5112253fb2e3af2fcc650 Mon Sep 17 00:00:00 2001 From: cnnrznn Date: Wed, 7 Dec 2016 15:33:40 -0500 Subject: Update dist-langs.md --- chapter/4/dist-langs.md | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index b49162e..c74e05d 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -72,7 +72,7 @@ MR: Actors: -* +* Message-passing chapter ### Support for Distribution @@ -84,12 +84,19 @@ Actors: Another definition: -> A domain-specific language is a programming language or executable specification language that offer, through appropriate notations and abstractions, expressive power focused on, and usually restricted to, a particular problem domain. +> A domain-specific language is a programming language or executable specification language that offers, through appropriate notations and abstractions, expressive power focused on, and usually restricted to, a particular problem domain. #### Where is it in the stack? +* Libraries: +* Compiler Extension +* Compiler / Runtime: +* Hardware + #### Why GPL's not DSL's? 
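
One way to make the "any GPL plus a library can act as a DSL" point (attributed to Mernik in the notes below) concrete is an embedded DSL: the host language supplies syntax, tooling, and composition, and the "language" is just a library with a fluent surface. A toy sketch in Python, with invented names (`Pipeline`, `filter`, `map`, `collect`):

```python
# A toy "embedded DSL": ordinary Python objects and methods, but the call
# style reads like a small query language. The names here are made up for
# illustration and do not come from any real system.
class Pipeline:
    def __init__(self, items):
        self.items = list(items)

    def filter(self, predicate):
        return Pipeline(x for x in self.items if predicate(x))

    def map(self, fn):
        return Pipeline(fn(x) for x in self.items)

    def collect(self):
        return self.items

if __name__ == "__main__":
    result = (Pipeline(range(10))
              .filter(lambda x: x % 2 == 0)
              .map(lambda x: x * x)
              .collect())
    print(result)   # [0, 4, 16, 36, 64]
```
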
+Reasons for moving to GPL's as base for DSL's + * problem of domain-composition * problem of abstraction * problem of ecosystem -- cgit v1.2.3 From adaca5d05aac476323c696a424abca2465b60bb0 Mon Sep 17 00:00:00 2001 From: cnnrznn Date: Wed, 7 Dec 2016 16:56:32 -0500 Subject: Update dist-langs.md --- chapter/4/dist-langs.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index c74e05d..1734814 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -4,7 +4,7 @@ title: "Distributed Programming Languages" by: "A Systems Person" --- -### Problems of Distributed Programming +## Problems of Distributed Programming * Partial failure * Consistency (Concurrency) @@ -16,9 +16,9 @@ For the above points cite "A Note on Distributed Computing," "Fallacies of Distr Languages and systems designed for distribution aim to abstract these problems from the application developer. -### Three major approaches to distributed languages: +## Three major approaches to distributed languages: -#### Shared Memory +### Shared Memory What is it? @@ -32,7 +32,7 @@ Tries to make many machines look like a single machine. This is hard because of consistency and partitioning. The logic of the program is simple, but requiring that the system handle shared memory opens up many opportunities for performance bugs. -#### Actor / Object model +### Actor / Object model The actor model has its roots in procedural and object oriented programming. Communication through RPC or message-passing. @@ -45,7 +45,7 @@ The system can decide how to most efficiently place actors. * Argus * Orleans -#### Dataflow model (static and stream) +### Dataflow model (static and stream) The dataflow model has its roots in functional programming. Some languages that use this model are: @@ -55,7 +55,7 @@ Some languages that use this model are: * RDD * Dryad, DryadLinq -#### Which is best? Why? +### Which is best? Why? MR vs Actors: depends on problem, solution @@ -74,9 +74,9 @@ Actors: * Message-passing chapter -### Support for Distribution +## Support for Distribution -#### Intro +### Intro * What is a DSL? @@ -86,14 +86,14 @@ Another definition: > A domain-specific language is a programming language or executable specification language that offers, through appropriate notations and abstractions, expressive power focused on, and usually restricted to, a particular problem domain. -#### Where is it in the stack? +### Where is it in the stack? * Libraries: * Compiler Extension * Compiler / Runtime: * Hardware -#### Why GPL's not DSL's? +### Why GPL's not DSL's? Reasons for moving to GPL's as base for DSL's -- cgit v1.2.3 From d0029e16fa4e4e8b7576f2e6ac2ded39afbf24bd Mon Sep 17 00:00:00 2001 From: cnnrznn Date: Wed, 7 Dec 2016 16:57:52 -0500 Subject: Update dist-langs.md --- chapter/4/dist-langs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 1734814..8bf98f6 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -59,7 +59,7 @@ Some languages that use this model are: MR vs Actors: depends on problem, solution -How fine grain is your data and logic? +How fine grain is your data and logic? (MR job can be built from actor model) Does your algorithm map to a batch processing job? 
MR: -- cgit v1.2.3 From d9299098200fa9271a286806b3ec4d8c0e11d699 Mon Sep 17 00:00:00 2001 From: cnnrznn Date: Wed, 7 Dec 2016 20:16:03 -0500 Subject: Update dist-langs.md --- chapter/4/dist-langs.md | 27 ++++++++++++++++++++++----- 1 file changed, 22 insertions(+), 5 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 8bf98f6..0477199 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -6,10 +6,27 @@ by: "A Systems Person" ## Problems of Distributed Programming -* Partial failure -* Consistency (Concurrency) -* Efficiency (Latency) -* Scallability +There are problems that exist in distributed system environments that do not exist in single-machine environments. +For example, programs running on distributed systems must be resilient to partial failure. +In a single-machine environment, a program is either running or crashed. +When instructions are distributed accross multiple machines, the program can be running, crashed, or partially crashed. +Programs, systems, and languages designed for distribution must be able to tolerate problems such as partial failure. +Furthermore, they must abstract some or all of these problems from the application developer and application logic. + +### Partial Failure + +On a single-machine environment, a crash means that either the machine has failed (total failure), or the source of the crash can be learned from a central resource manager such as the operating system. (// TODO cite "a note on dist. comp.) +If an application consists of multiple communicating processes, it is possible for some components to remain running when others have crashed. + +### Consistency (Concurrency) + + + +### Efficiency (Latency) + + + +### Scallability For the above points cite "A Note on Distributed Computing," "Fallacies of Distributed Computing Explained" @@ -93,7 +110,7 @@ Another definition: * Compiler / Runtime: * Hardware -### Why GPL's not DSL's? +### Why DSL's as Libraries? Reasons for moving to GPL's as base for DSL's -- cgit v1.2.3 From 6605f95f575b009d645c28448dcf374621057bfd Mon Sep 17 00:00:00 2001 From: cnnrznn Date: Thu, 8 Dec 2016 11:06:31 -0500 Subject: Update dist-langs.md --- chapter/4/dist-langs.md | 25 ++++++++++--------------- 1 file changed, 10 insertions(+), 15 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 0477199..202bcfa 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -7,30 +7,25 @@ by: "A Systems Person" ## Problems of Distributed Programming There are problems that exist in distributed system environments that do not exist in single-machine environments. -For example, programs running on distributed systems must be resilient to partial failure. -In a single-machine environment, a program is either running or crashed. -When instructions are distributed accross multiple machines, the program can be running, crashed, or partially crashed. -Programs, systems, and languages designed for distribution must be able to tolerate problems such as partial failure. -Furthermore, they must abstract some or all of these problems from the application developer and application logic. +Partial failure, concurrency, and latency are three problems that make distributed computing fundamentally different from local computing. 
+In order to understand the design decisions behind programming languages and systems for distributed computing, it is necessary to discuss these three problems that make distributed computing unique. +In this section, we present an overview of these three problems and their impact on distributed programming models. ### Partial Failure On a single-machine environment, a crash means that either the machine has failed (total failure), or the source of the crash can be learned from a central resource manager such as the operating system. (// TODO cite "a note on dist. comp.) -If an application consists of multiple communicating processes, it is possible for some components to remain running when others have crashed. - -### Consistency (Concurrency) - - - -### Efficiency (Latency) +If an application consists of multiple communicating processes partial failure is possible, however because the cause of the partial failure can be determined, this kind of partial failure can be repaired given the operating system's knowledge about the failure. +For example, a process can be restored based on a checkpoint, another process in the application can query the operating system about another's state, etc. +Because of the presence of a network, in a distributed computing environment it is not possible to know the source of failure. +Failure in a distributed settings means either the network or the host has failed (or both). +Further, if the failure is network related, it is possible for the network to "come back up" at some future time. +### Consistency (Concurrency) -### Scallability -For the above points cite "A Note on Distributed Computing," "Fallacies of Distributed Computing Explained" -Languages and systems designed for distribution aim to abstract these problems from the application developer. +### Latency ## Three major approaches to distributed languages: -- cgit v1.2.3 From fe791939bf976b8af985833d7df67baba5647d96 Mon Sep 17 00:00:00 2001 From: cnnrznn Date: Thu, 8 Dec 2016 13:01:18 -0500 Subject: Update dist-langs.md --- chapter/4/dist-langs.md | 39 ++++++++++++++++++++++++++++++++++++--- 1 file changed, 36 insertions(+), 3 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 202bcfa..20ae333 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -17,16 +17,49 @@ On a single-machine environment, a crash means that either the machine has faile If an application consists of multiple communicating processes partial failure is possible, however because the cause of the partial failure can be determined, this kind of partial failure can be repaired given the operating system's knowledge about the failure. For example, a process can be restored based on a checkpoint, another process in the application can query the operating system about another's state, etc. -Because of the presence of a network, in a distributed computing environment it is not possible to know the source of failure. -Failure in a distributed settings means either the network or the host has failed (or both). -Further, if the failure is network related, it is possible for the network to "come back up" at some future time. 
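
A small self-contained sketch of how this ambiguity surfaces in code (the `send_request` transport below is hypothetical, not a real RPC library): a timeout tells the caller nothing about whether the host crashed, the network dropped the message, or the operation actually ran on the remote side.

```python
# Sketch of how partial failure leaks into the caller: when a remote request
# times out, the client cannot tell a crashed host from a slow or partitioned
# network, or even whether the operation already executed remotely.
# send_request() is a hypothetical transport call, not a real library API.
import random
import time

class Timeout(Exception):
    pass

def send_request(host, payload, timeout=1.0):
    # Pretend transport: sometimes the reply is simply lost.
    if random.random() < 0.3:
        time.sleep(timeout)
        raise Timeout(f"no reply from {host}")
    return f"ok: {payload}"

def call_with_retries(host, payload, attempts=3):
    for attempt in range(1, attempts + 1):
        try:
            return send_request(host, payload)
        except Timeout:
            # The request may or may not have run remotely, so blindly
            # retrying is only safe if the operation is idempotent.
            print(f"attempt {attempt} timed out; retrying")
    raise Timeout(f"{host} unreachable after {attempts} attempts")

if __name__ == "__main__":
    try:
        print(call_with_retries("node-7", "increment counter 42"))
    except Timeout as exc:
        print("giving up:", exc)
```
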
+* Failure in a single-machine setting +* Failure in a distributed setting + * 2 sources, network and host + * no central manager (no knowledge) + * non-determinism + * consistency (leave until next section) + * control is not returned to the caller, message or response may "vanish" + +* Impact, methods of dealing with partial failure + * recompute, duplicate computation (MR, RDD, Hadoop) + * 2-phase commit (Argus) + * redundancy (MR (spark's duplicate master), Orleans, Argus) + * checkpoint-restore (Naiad, Hadoop) ### Consistency (Concurrency) +* Local + * enforce consistency with locks + * state located under one resource manager (partial local crash) + +* Distributed + * preserve state in instance of failure + +* Methods + * sequencer + * message queues + * read only vs. write ops ### Latency +* process locality +* minimize communication +* + +### The CAP Theorem + +Indeed, these three issues of distributed computing are not disjoint. +A solution designed to solve one problem may emphasize another. + +* Consistency +* Availability +* Partitioning ## Three major approaches to distributed languages: -- cgit v1.2.3 From e136271f9041c8a565383f616a331875b7617dc0 Mon Sep 17 00:00:00 2001 From: cnnrznn Date: Thu, 8 Dec 2016 13:44:25 -0500 Subject: Update dist-langs.md --- chapter/4/dist-langs.md | 22 +++++++++++++++++----- 1 file changed, 17 insertions(+), 5 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 20ae333..098c87b 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -17,7 +17,6 @@ On a single-machine environment, a crash means that either the machine has faile If an application consists of multiple communicating processes partial failure is possible, however because the cause of the partial failure can be determined, this kind of partial failure can be repaired given the operating system's knowledge about the failure. For example, a process can be restored based on a checkpoint, another process in the application can query the operating system about another's state, etc. -* Failure in a single-machine setting * Failure in a distributed setting * 2 sources, network and host * no central manager (no knowledge) @@ -36,10 +35,11 @@ For example, a process can be restored based on a checkpoint, another process in * Local * enforce consistency with locks * state located under one resource manager (partial local crash) - * Distributed * preserve state in instance of failure + * concurrent method invocations / messages + * operations on replicated objects (for scalability) * Methods * sequencer @@ -48,10 +48,22 @@ For example, a process can be restored based on a checkpoint, another process in ### Latency -* process locality -* minimize communication -* +* Local + * operations are relatively fast + * topology doesn't change + +* Distributed + * Network failures and recovery + * changing topology + * efficiency +* Methods + * process locality (static, dynamic analysis) + * minimize communication + * data replication (Orca) + * pipelining (HTTP) + * asynchronous callbacks + ### The CAP Theorem Indeed, these three issues of distributed computing are not disjoint. 
-- cgit v1.2.3 From f9b1bc6f8747e1e73f516cce278cc3a4708d75f0 Mon Sep 17 00:00:00 2001 From: cnnrznn Date: Thu, 8 Dec 2016 13:46:32 -0500 Subject: Update dist-langs.md --- chapter/4/dist-langs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 098c87b..785d2b7 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -13,7 +13,7 @@ In this section, we present an overview of these three problems and their impact ### Partial Failure -On a single-machine environment, a crash means that either the machine has failed (total failure), or the source of the crash can be learned from a central resource manager such as the operating system. (// TODO cite "a note on dist. comp.) +In the case of a crash on a local environment, either the machine has failed (total failure), or the source of the crash can be learned from a central resource manager such as the operating system. (// TODO cite "a note on dist. comp.) If an application consists of multiple communicating processes partial failure is possible, however because the cause of the partial failure can be determined, this kind of partial failure can be repaired given the operating system's knowledge about the failure. For example, a process can be restored based on a checkpoint, another process in the application can query the operating system about another's state, etc. -- cgit v1.2.3 From 7697198aaf2afb40e668f5c86e9c86598e8bffff Mon Sep 17 00:00:00 2001 From: cnnrznn Date: Thu, 8 Dec 2016 14:45:35 -0500 Subject: Update dist-langs.md --- chapter/4/dist-langs.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 785d2b7..d4feaec 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -67,7 +67,7 @@ For example, a process can be restored based on a checkpoint, another process in ### The CAP Theorem Indeed, these three issues of distributed computing are not disjoint. -A solution designed to solve one problem may emphasize another. +A solution designed to solve one problem may exacerbate another. * Consistency * Availability @@ -77,10 +77,9 @@ A solution designed to solve one problem may emphasize another. ### Shared Memory -What is it? - -Some examples: +* Definition +* Mirage * Linda * Orca * RPC ( and why RPC is shared-memory ) -- cgit v1.2.3 From c3094764dd24479be2e801c796c966a646edc234 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Thu, 8 Dec 2016 21:41:24 -0500 Subject: updating the doc --- chapter/4/dist-langs.md | 24 ++++++++++++++++++------ 1 file changed, 18 insertions(+), 6 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index d4feaec..68f3873 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -75,14 +75,26 @@ A solution designed to solve one problem may exacerbate another. ## Three major approaches to distributed languages: -### Shared Memory +### Distributed Shared Memory -* Definition +Virtual memory provides a powerful abstraction for processes. +It allows each program running on a machine to believe it is the sole user of the machine, as well as provide each process with more (or less) memory addresses than may be physically present. +The operating system is responsible for mapping virtual memory addresses to physical ones and swapping addresses to and from disk. 
-* Mirage -* Linda -* Orca -* RPC ( and why RPC is shared-memory ) +Distributed takes the virtual memory abstraction one step further by allowing virtual addresses to be mapped to physical memory regions on remote machines. +Given such an abstraction, programs can communicate simply by reading from and writing to shared memory addresses. +Distributed shared memory is appealing because the programming model is the same for local and distributed systems. +However, it requires an underlying system to function properly. + +#### Mirage + +#### Linda + +#### Orca + + + +#### RPC ( and why RPC is shared-memory ) Tries to make many machines look like a single machine. This is hard because of consistency and partitioning. -- cgit v1.2.3 From 97669daa79447c8747724061f2ce2f0092b29e71 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Fri, 9 Dec 2016 17:53:43 -0500 Subject: edits --- chapter/4/dist-langs.md | 32 ++++++++++++++++++++++++++++---- 1 file changed, 28 insertions(+), 4 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 68f3873..fd029d1 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -78,13 +78,14 @@ A solution designed to solve one problem may exacerbate another. ### Distributed Shared Memory Virtual memory provides a powerful abstraction for processes. -It allows each program running on a machine to believe it is the sole user of the machine, as well as provide each process with more (or less) memory addresses than may be physically present. +It allows each process running on a machine to believe it is the sole user of the machine, as well as provide each process with more (or less) memory addresses than may be physically present. The operating system is responsible for mapping virtual memory addresses to physical ones and swapping addresses to and from disk. -Distributed takes the virtual memory abstraction one step further by allowing virtual addresses to be mapped to physical memory regions on remote machines. +Distributed shared memory (DSM) takes the virtual memory abstraction one step further by allowing virtual addresses to be mapped to physical memory regions on remote machines. Given such an abstraction, programs can communicate simply by reading from and writing to shared memory addresses. -Distributed shared memory is appealing because the programming model is the same for local and distributed systems. +DSM is appealing because the programming model is the same for local and distributed systems. However, it requires an underlying system to function properly. +Mirage, Linda, and Orca are three systems that use distributed shared memory to provide a distributed programming model. #### Mirage @@ -92,7 +93,30 @@ However, it requires an underlying system to function properly. #### Orca - +Orca is a programming language built for distribution and is based on the DSM model. +Orca expresses parallelism explicitly through processes. +Processes in Orca are similar to procedures, but are concurrent instead of serial. +When a process is forked, it can take parameters that are either passed as a copy of the original data, or passed as a *shared data-object*. +Processes can then communicate through these shared objects. + +Shared data objects in Orca are similar to objects in OOP. +An object is defined abstractly by a name and a set of interfaces. +An implementation of the object defines any private data fields as well as the interfaces (methods). 
+Importantly, these interfaces are gauranteed to be indivisible, meaning that simultaneous calls to the same interface are serializable. +Although serializability alone does not eliminate indeterminism from Orca programs, it keeps the model simple while it allows programmers to construct richer, multi-operation locks for arbitrary semantics and logic. + +Another key feature of Orca is the ability to express symbolic data structures as shared data objects. +Because shared data is expressed through data-objects, it is easy to serialize, for instance, operations on a binary tree. + +* processes - for distribution, sharing data + * concurrency is explicit + * control over where processes are located + * invocation ( fork( parameters ) [ on CPU # ]; ) +* abstract data types - shared data objects + * similar to objects in OOP + * interfaces + * operations on objects (methods) are indivisible (serializable) + * talk about their implementation #### RPC ( and why RPC is shared-memory ) -- cgit v1.2.3 From 71bed9a13a03803a5e6145fa0026e38daed30c17 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Fri, 9 Dec 2016 18:19:10 -0500 Subject: . --- chapter/4/dist-langs.md | 18 +----------------- 1 file changed, 1 insertion(+), 17 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index fd029d1..7f9d079 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -97,7 +97,7 @@ Orca is a programming language built for distribution and is based on the DSM mo Orca expresses parallelism explicitly through processes. Processes in Orca are similar to procedures, but are concurrent instead of serial. When a process is forked, it can take parameters that are either passed as a copy of the original data, or passed as a *shared data-object*. -Processes can then communicate through these shared objects. +Processes communicate through these shared objects. Shared data objects in Orca are similar to objects in OOP. An object is defined abstractly by a name and a set of interfaces. @@ -108,22 +108,6 @@ Although serializability alone does not eliminate indeterminism from Orca progra Another key feature of Orca is the ability to express symbolic data structures as shared data objects. Because shared data is expressed through data-objects, it is easy to serialize, for instance, operations on a binary tree. -* processes - for distribution, sharing data - * concurrency is explicit - * control over where processes are located - * invocation ( fork( parameters ) [ on CPU # ]; ) -* abstract data types - shared data objects - * similar to objects in OOP - * interfaces - * operations on objects (methods) are indivisible (serializable) - * talk about their implementation - -#### RPC ( and why RPC is shared-memory ) - -Tries to make many machines look like a single machine. -This is hard because of consistency and partitioning. -The logic of the program is simple, but requiring that the system handle shared memory opens up many opportunities for performance bugs. - ### Actor / Object model The actor model has its roots in procedural and object oriented programming. -- cgit v1.2.3 From 477562da1962045141ce34bafa9722099349901c Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Sun, 11 Dec 2016 21:03:27 -0500 Subject: progress? 
--- chapter/4/dist-langs.md | 62 +++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 52 insertions(+), 10 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 7f9d079..453b36a 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -89,6 +89,33 @@ Mirage, Linda, and Orca are three systems that use distributed shared memory to #### Mirage +Mirage is an OS-level implementation of DSM. +In Mirage, regions of memory known as *segments* are created and indicated as shared. +A segment consists of one or more fixed-size pages. +Other local or remote processes can *attach* segments to arbitrary regions of their own virtual address space. +When a process is finished with a segment, the region can be *detached*. +Requests to shared memory are transparent to user processes. +Faults occur on the page level. + +Operations on shared pages are gauranteed to be coherent; after a process writes to a page, all subsequent reads will observe the results of the write. +To accomplish this, Mirage uses a protocol for requesting read or write access to a page. +Depending on the permissions of the current "owner" of a page, the page may be invalidated on other nodes. +The behavior of the protocol is outlined by the table below. + + + +Crucially, the semantics of this protocol are that at any time there may only be either (1) a single writer or (2) one or more readers of a page. +When a single writer exists, no other copies of the page are present. +When a read request arrives for a page that is being written to, the writer is demoted to a reader. +These two properties ensure coherence. +Many read copies of a page may be present, both to minimize network traffic and provide locality. +When a write request arrives for a page, all read instances of the page are invalidated. + +To ensure fairness, the system associates each page with a timer. +When a request is honored for a page, the timer is reset. +The timer gaurantees that the page will not be invalidated for a minimum period of time. +Future request that result in invalidation or demotion (writer to reader) are only honored if the timer is satisfied. + #### Linda #### Orca @@ -106,20 +133,35 @@ Importantly, these interfaces are gauranteed to be indivisible, meaning that sim Although serializability alone does not eliminate indeterminism from Orca programs, it keeps the model simple while it allows programmers to construct richer, multi-operation locks for arbitrary semantics and logic. Another key feature of Orca is the ability to express symbolic data structures as shared data objects. -Because shared data is expressed through data-objects, it is easy to serialize, for instance, operations on a binary tree. +In Orca, a generic graph type is offered as a first-class data-object. +Because operations are serializable at the method level, the graph data-object can offer methods that span multiple nodes while still retaining serializability. ### Actor / Object model -The actor model has its roots in procedural and object oriented programming. -Communication through RPC or message-passing. -Actors/Objects are location agnostic, because state is not shared. -The system can decide how to most efficiently place actors. +Unlike DSM, communication in the actor model is explicit and exposed through message passing. +Messages can be synchronous or asynchronous, point-to-point or broadcast style. + +In the actor model, concurrent entities do not share state as they do in DSM. 
+Each process, object, actor, etc., has its own address space. +The model maps well to single multicore machines as well as clusters of machines. +Although an underlying system is required to differentiate between local and remote messages, the location of processes, objects, or actors can be transparent to the application programmer. + +#### Erlang + +* functional programming language +* process is the unit of concurrency +* messages are sent to mailboxes + * asynchronous +* functional programming +* variables, atoms, lists + +#### Cloud Haskell + +#### Emerald + +#### Argus -* Erlang -* Cloud Haskell -* Emerald -* Argus -* Orleans +#### Orleans ### Dataflow model (static and stream) -- cgit v1.2.3 From 1bf1926b38feb85aa7c11c9b9d92f9b7134d906c Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Sun, 11 Dec 2016 22:16:14 -0500 Subject: ... --- chapter/4/dist-langs.md | 26 ++++++++++++++++++++------ 1 file changed, 20 insertions(+), 6 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 453b36a..ec9de35 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -148,12 +148,26 @@ Although an underlying system is required to differentiate between local and rem #### Erlang -* functional programming language -* process is the unit of concurrency -* messages are sent to mailboxes - * asynchronous -* functional programming -* variables, atoms, lists +Erlang is a distributed language which combines functional programming with message passing. +Units of distribution in Erlang are processes. +These processes may be colocated on the same node or distributed amongst a cluster. + +Processes in Erlang communicate by message passing. +Specifically, a process can send an asynchronous message to another process' *mailbox*. +At some future time, the receiving process may enter a *receive* clause, which searches the mailbox for the first message that matches a set of patterns. +The branch of code that is executed in response to a message is dependent on the pattern that is matched. + +In general, an application written in Erlang is separated into two broad components: *workers* and *monitors*. +Workers are responsible for application logic. +Erlang offers a special function `link(Pid)` which allows one process to monitor another. +If the process indicated by `Pid` fails, the monitor process will be notified and is expected to handle the error. +Worker processes are "linked" by monitor processes which implement the fault-tolerance logic of the application. + +Erlang, first implemented in Prolog, has the features and styles of a functional programming language. +Variables in Erlang are immutable; once assigned a value, they cannot be changed. +Because of this, loops in Erlang are written as tail recursive function calls. +Although this at first seems like a flawed practice (to traditional procedural programmers), recursive calls do not grow the current stack frame, but instead replace it. +It is worth noting that stack frame growth is still possible, but if the recursive call is "alone" (the result of the inner function is the result of the outer function), the stack will not grow. #### Cloud Haskell -- cgit v1.2.3 From 5d64d3aa62c8e08457c71e1d60cd60433d1234c3 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Sun, 11 Dec 2016 22:20:42 -0500 Subject: . 
--- chapter/4/dist-langs.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index ec9de35..d4fbed6 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -143,7 +143,7 @@ Messages can be synchronous or asynchronous, point-to-point or broadcast style. In the actor model, concurrent entities do not share state as they do in DSM. Each process, object, actor, etc., has its own address space. -The model maps well to single multicore machines as well as clusters of machines. +The model maps well to single multicore machines as well as to clusters of machines. Although an underlying system is required to differentiate between local and remote messages, the location of processes, objects, or actors can be transparent to the application programmer. #### Erlang @@ -154,7 +154,7 @@ These processes may be colocated on the same node or distributed amongst a clust Processes in Erlang communicate by message passing. Specifically, a process can send an asynchronous message to another process' *mailbox*. -At some future time, the receiving process may enter a *receive* clause, which searches the mailbox for the first message that matches a set of patterns. +At some future time, the receiving process may enter a *receive* clause, which searches the mailbox for the first message that matches one of a set of patterns. The branch of code that is executed in response to a message is dependent on the pattern that is matched. In general, an application written in Erlang is separated into two broad components: *workers* and *monitors*. @@ -166,7 +166,7 @@ Worker processes are "linked" by monitor processes which implement the fault-tol Erlang, first implemented in Prolog, has the features and styles of a functional programming language. Variables in Erlang are immutable; once assigned a value, they cannot be changed. Because of this, loops in Erlang are written as tail recursive function calls. -Although this at first seems like a flawed practice (to traditional procedural programmers), recursive calls do not grow the current stack frame, but instead replace it. +Although this at first seems like a flawed practice (to traditional procedural programmers), tail recursive calls do not grow the current stack frame, but instead replace it. It is worth noting that stack frame growth is still possible, but if the recursive call is "alone" (the result of the inner function is the result of the outer function), the stack will not grow. #### Cloud Haskell -- cgit v1.2.3 From 44c0b2611d4ab0dc6b8ab8d93854ff0b982f3e85 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Sun, 11 Dec 2016 22:32:24 -0500 Subject: . --- chapter/4/dist-langs.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index d4fbed6..481a12c 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -102,7 +102,12 @@ To accomplish this, Mirage uses a protocol for requesting read or write access t Depending on the permissions of the current "owner" of a page, the page may be invalidated on other nodes. The behavior of the protocol is outlined by the table below. - +| State of Owner | State of Requester | Clock Check? | Invalidation? 
| +|----------------|--------------------|--------------|-------------------------------------------------------------| +| Reader | Reader | No | No | +| Reader | Writer | Yes | Yes, Requester is possibly sent current version of the page | +| Writer | Reader | Yes | No, Owner is demoted to Reader | +| Writer | Writer | Yes | Yes | Crucially, the semantics of this protocol are that at any time there may only be either (1) a single writer or (2) one or more readers of a page. When a single writer exists, no other copies of the page are present. -- cgit v1.2.3 From 8f2e54fa75e7cc07d6593ef5ebff68e33caaabae Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Sun, 11 Dec 2016 22:41:35 -0500 Subject: . --- chapter/4/dist-langs.md | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 481a12c..f7a91a8 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -73,7 +73,12 @@ A solution designed to solve one problem may exacerbate another. * Availability * Partitioning -## Three major approaches to distributed languages: +## Three Major Approaches to Distributed Languages + +Clearly, there are problems present in distributed programming that prevent traditional local programming models from applying directly to distributed environments. +Languages and systems built for writing distributed applications can be classified into three categories: distributed shared memory, actors, and dataflow. +Each model has strengths and weaknesses. +Here, we describe each model and provide examples of languages and systems that implement them. ### Distributed Shared Memory @@ -151,6 +156,8 @@ Each process, object, actor, etc., has its own address space. The model maps well to single multicore machines as well as to clusters of machines. Although an underlying system is required to differentiate between local and remote messages, the location of processes, objects, or actors can be transparent to the application programmer. +Erlang, Emerald, Argus, and Orleans are just a few of many implementations of the actor model. + #### Erlang Erlang is a distributed language which combines functional programming with message passing. -- cgit v1.2.3 From d7dd751aa356f4f97abde84774ae9be7b11d55e4 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Sun, 11 Dec 2016 22:43:01 -0500 Subject: . --- chapter/4/dist-langs.md | 2 -- 1 file changed, 2 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index f7a91a8..0059171 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -150,12 +150,10 @@ Because operations are serializable at the method level, the graph data-object c Unlike DSM, communication in the actor model is explicit and exposed through message passing. Messages can be synchronous or asynchronous, point-to-point or broadcast style. - In the actor model, concurrent entities do not share state as they do in DSM. Each process, object, actor, etc., has its own address space. The model maps well to single multicore machines as well as to clusters of machines. Although an underlying system is required to differentiate between local and remote messages, the location of processes, objects, or actors can be transparent to the application programmer. - Erlang, Emerald, Argus, and Orleans are just a few of many implementations of the actor model. 
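
Before turning to the individual languages, a minimal sketch of the model in Python, with threads and queues standing in for runtime-managed actors and mailboxes (this mirrors the idea, not the API, of the systems below):

```python
# Minimal actor sketch: each actor owns private state and a mailbox, and the
# only way to affect that state is to send it a message.
import threading
import queue

class CounterActor:
    def __init__(self):
        self.mailbox = queue.Queue()
        self.count = 0                      # private state, never shared
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, message):
        self.mailbox.put(message)           # asynchronous, fire-and-forget

    def _run(self):
        while True:
            kind, reply_to = self.mailbox.get()
            if kind == "increment":
                self.count += 1
            elif kind == "get" and reply_to is not None:
                reply_to.put(self.count)    # reply via the sender's queue

if __name__ == "__main__":
    actor = CounterActor()
    for _ in range(3):
        actor.send(("increment", None))
    reply = queue.Queue()
    actor.send(("get", reply))
    print(reply.get())   # 3
```

Because the only way to observe or change `count` is to send a message, a runtime is free to place the actor on whatever node it likes.
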
#### Erlang -- cgit v1.2.3 From ef6a156c910ef50a40f68a16d11abfa4c8fe48c1 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Mon, 12 Dec 2016 00:51:19 -0500 Subject: . --- chapter/4/dist-langs.md | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 0059171..89eb496 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -183,6 +183,33 @@ It is worth noting that stack frame growth is still possible, but if the recursi #### Emerald +Emerald is a distributed programming language based around a unified object model. +Programs in Emerald consist of collections of Objects. +Critically, Emerald provides the programmer with a unified object model so as to abstract object location from the invocation of methods. +With that in mind, Emerald also provides the developer with the tools to designate explicitly the location of objects. + +Objects in Emerald resemble objects in other OOP languages such as Java. +Emerald objects can only be manipulated by methods that are exposed, and may contain internal state. +However, their are a few key differences between Emerald and Java Objects. +First, objects in Emerald may have an associated process which starts after initialization. +In Emerald, object processes are the basic unit of concurrency. +Additionally, an object may *not* have an associated process. +These objects more closely resemble traditional Java objects, and their code is executed when called by processes. +Second, processes may not touch internal state (members) of other objects. +Unlike Java, all internal state of Emerald objects must be accessed through method calls. +Third, objects in Emerald may contain a special *monitor* section which can contain methods and variables that are accessed atomically. +If multiple processes make simultaneous calls to a "monitored" method, the calls are effectively serialized. + +* single object model + * objects only manipulated by exposed methods + * can be moved + * identity, operations, process (optional) + * process, monitor, initially sections +* abstract types + * interfaces + * allow system upgrades, new implementations + * implementations can have different semantics for an interface + #### Argus #### Orleans -- cgit v1.2.3 From f0703c1f9ae645085c769290dbb8c19c96e4834e Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Mon, 12 Dec 2016 00:53:55 -0500 Subject: . --- chapter/4/dist-langs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 89eb496..70f20ac 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -189,7 +189,7 @@ Critically, Emerald provides the programmer with a unified object model so as to With that in mind, Emerald also provides the developer with the tools to designate explicitly the location of objects. Objects in Emerald resemble objects in other OOP languages such as Java. -Emerald objects can only be manipulated by methods that are exposed, and may contain internal state. +Emerald objects expose methods to execute logic, and may contain internal state. However, their are a few key differences between Emerald and Java Objects. First, objects in Emerald may have an associated process which starts after initialization. In Emerald, object processes are the basic unit of concurrency. 
-- cgit v1.2.3 From c45c364009dd2b4a6e04e572d77fd312fc579b97 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Mon, 12 Dec 2016 00:55:09 -0500 Subject: . --- chapter/4/dist-langs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 70f20ac..456199f 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -189,7 +189,7 @@ Critically, Emerald provides the programmer with a unified object model so as to With that in mind, Emerald also provides the developer with the tools to designate explicitly the location of objects. Objects in Emerald resemble objects in other OOP languages such as Java. -Emerald objects expose methods to execute logic, and may contain internal state. +Emerald objects expose methods to implement logic and provide functionality, and may contain internal state. However, their are a few key differences between Emerald and Java Objects. First, objects in Emerald may have an associated process which starts after initialization. In Emerald, object processes are the basic unit of concurrency. -- cgit v1.2.3 From afa14d68ab973294f09479579f662b9ba622d11e Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Mon, 12 Dec 2016 01:22:13 -0500 Subject: . --- chapter/4/dist-langs.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 456199f..b1e44cf 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -194,12 +194,18 @@ However, their are a few key differences between Emerald and Java Objects. First, objects in Emerald may have an associated process which starts after initialization. In Emerald, object processes are the basic unit of concurrency. Additionally, an object may *not* have an associated process. -These objects more closely resemble traditional Java objects, and their code is executed when called by processes. +Objects that do not have a process are known as passive and more closely resemble traditional Java objects; their code is executed when called by processes belonging to other objects. Second, processes may not touch internal state (members) of other objects. Unlike Java, all internal state of Emerald objects must be accessed through method calls. Third, objects in Emerald may contain a special *monitor* section which can contain methods and variables that are accessed atomically. If multiple processes make simultaneous calls to a "monitored" method, the calls are effectively serialized. +Emerald also takes an OOP approach to system upgrades. +With a large system, it may not be desirable to disable the system, recompile, and relaunch. +Emerald uses abstract types to define sets of interfaces. +Objects that implement such interfaces can be "plugged in" where needed. +Therefore, code may be dynamically upgraded, and different implementations may be provided for semantically similar operations. 
+ * single object model * objects only manipulated by exposed methods * can be moved -- cgit v1.2.3 From 46ae5c101e665662dae7acb52a9d037b68383ac8 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Mon, 12 Dec 2016 11:13:19 -0500 Subject: Argus --- chapter/4/dist-langs.md | 47 ++++++++++++++++++++++++++++++----------------- 1 file changed, 30 insertions(+), 17 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index b1e44cf..9a75153 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -15,7 +15,7 @@ In this section, we present an overview of these three problems and their impact In the case of a crash on a local environment, either the machine has failed (total failure), or the source of the crash can be learned from a central resource manager such as the operating system. (// TODO cite "a note on dist. comp.) If an application consists of multiple communicating processes partial failure is possible, however because the cause of the partial failure can be determined, this kind of partial failure can be repaired given the operating system's knowledge about the failure. -For example, a process can be restored based on a checkpoint, another process in the application can query the operating system about another's state, etc. +For example, a process can be restored based on a checkpoint, another process in the application can query the operating system about the failed process' state state, etc. * Failure in a distributed setting * 2 sources, network and host @@ -61,7 +61,7 @@ For example, a process can be restored based on a checkpoint, another process in * process locality (static, dynamic analysis) * minimize communication * data replication (Orca) - * pipelining (HTTP) + * pipe-lining (HTTP) * asynchronous callbacks ### The CAP Theorem @@ -102,7 +102,7 @@ When a process is finished with a segment, the region can be *detached*. Requests to shared memory are transparent to user processes. Faults occur on the page level. -Operations on shared pages are gauranteed to be coherent; after a process writes to a page, all subsequent reads will observe the results of the write. +Operations on shared pages are guaranteed to be coherent; after a process writes to a page, all subsequent reads will observe the results of the write. To accomplish this, Mirage uses a protocol for requesting read or write access to a page. Depending on the permissions of the current "owner" of a page, the page may be invalidated on other nodes. The behavior of the protocol is outlined by the table below. @@ -123,7 +123,7 @@ When a write request arrives for a page, all read instances of the page are inva To ensure fairness, the system associates each page with a timer. When a request is honored for a page, the timer is reset. -The timer gaurantees that the page will not be invalidated for a minimum period of time. +The timer guarantees that the page will not be invalidated for a minimum period of time. Future request that result in invalidation or demotion (writer to reader) are only honored if the timer is satisfied. #### Linda @@ -139,7 +139,7 @@ Processes communicate through these shared objects. Shared data objects in Orca are similar to objects in OOP. An object is defined abstractly by a name and a set of interfaces. An implementation of the object defines any private data fields as well as the interfaces (methods). -Importantly, these interfaces are gauranteed to be indivisible, meaning that simultaneous calls to the same interface are serializable. 
+Importantly, these interfaces are guaranteed to be indivisible, meaning that simultaneous calls to the same interface are serializable. Although serializability alone does not eliminate indeterminism from Orca programs, it keeps the model simple while it allows programmers to construct richer, multi-operation locks for arbitrary semantics and logic. Another key feature of Orca is the ability to express symbolic data structures as shared data objects. @@ -160,7 +160,7 @@ Erlang, Emerald, Argus, and Orleans are just a few of many implementations of th Erlang is a distributed language which combines functional programming with message passing. Units of distribution in Erlang are processes. -These processes may be colocated on the same node or distributed amongst a cluster. +These processes may be co-located on the same node or distributed amongst a cluster. Processes in Erlang communicate by message passing. Specifically, a process can send an asynchronous message to another process' *mailbox*. @@ -194,7 +194,7 @@ However, their are a few key differences between Emerald and Java Objects. First, objects in Emerald may have an associated process which starts after initialization. In Emerald, object processes are the basic unit of concurrency. Additionally, an object may *not* have an associated process. -Objects that do not have a process are known as passive and more closely resemble traditional Java objects; their code is executed when called by processes belonging to other objects. +Objects that do not have a process are known as *passive* and more closely resemble traditional Java objects; their code is executed when called by processes belonging to other objects. Second, processes may not touch internal state (members) of other objects. Unlike Java, all internal state of Emerald objects must be accessed through method calls. Third, objects in Emerald may contain a special *monitor* section which can contain methods and variables that are accessed atomically. @@ -206,18 +206,31 @@ Emerald uses abstract types to define sets of interfaces. Objects that implement such interfaces can be "plugged in" where needed. Therefore, code may be dynamically upgraded, and different implementations may be provided for semantically similar operations. -* single object model - * objects only manipulated by exposed methods - * can be moved - * identity, operations, process (optional) - * process, monitor, initially sections -* abstract types - * interfaces - * allow system upgrades, new implementations - * implementations can have different semantics for an interface - #### Argus +Argus is a distributed programming language and system. +It uses a special kind of object, called a *guardian*, to create units of distribution and group highly coupled data and logic. +Argus procedures are encapsulated in atomic transactions, or *actions*, which allow operations that encompass multiple guardians to exhibit serializability. +The presence of guardians and actions was motivated by Argus' use as a platform for building distributed, consistent applications. + +An object in Argus is known as a guardian. +Like traditional objects, guardians contain internal data members for maintaining state. +Unique to Argus is the distinction between volatile and stable data members. +To cope with crashes, data members that are stable are periodically serialized to disk. +When a guardian crashes and is restored, this serialized state is used to reconstruct guardians. 
+Like in Emerald, internal data members may not be accessed directly, but rather through handlers. +Guardians are interacted with through methods known as *handlers*. +When a handler is called, a new process is created to handle the operation. +Additionally, guardians may contain a background process for performing continual work. + +Argus encapsulates handler calls in what it calls *actions*. +Actions are designed to solve the problems of consistency, synchronization, and fault tolerance. +To accomplish this, actions are serializable as well as total. +Being serializable means that no actions interfere with one another. +For example, a read operation that spans multiple guardians either "sees" the complete effect of a simultaneous write operation, or it sees nothing. +Being total means that write operations that span multiple guardians either fully complete or fully fail. +This is accomplished by a two-phase commit protocol that serializes the state of *all* guardians involved in an action, or discards partial state changes. + #### Orleans ### Dataflow model (static and stream) -- cgit v1.2.3 From f41103d38b6e91ec95b298073d34c1021aef938f Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Mon, 12 Dec 2016 11:15:58 -0500 Subject: spell checker --- chapter/4/dist-langs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 9a75153..1001f8c 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -201,7 +201,7 @@ Third, objects in Emerald may contain a special *monitor* section which can cont If multiple processes make simultaneous calls to a "monitored" method, the calls are effectively serialized. Emerald also takes an OOP approach to system upgrades. -With a large system, it may not be desirable to disable the system, recompile, and relaunch. +With a large system, it may not be desirable to disable the system, recompile, and re-launch. Emerald uses abstract types to define sets of interfaces. Objects that implement such interfaces can be "plugged in" where needed. Therefore, code may be dynamically upgraded, and different implementations may be provided for semantically similar operations. -- cgit v1.2.3 From c1d49465c5da46b03a8d6f69927d843d7f26e352 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Mon, 12 Dec 2016 13:04:01 -0500 Subject: Linda --- chapter/4/dist-langs.md | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 1001f8c..7ba5db2 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -126,7 +126,25 @@ When a request is honored for a page, the timer is reset. The timer guarantees that the page will not be invalidated for a minimum period of time. Future request that result in invalidation or demotion (writer to reader) are only honored if the timer is satisfied. -#### Linda +#### Linda (1993) + +Linda is a programming model based on DSM. +In Linda, shared memory is known as the *tuple space*; the basic unit of shared data is the tuple. +Instead of processes reading and writing to shared memory addresses, processes can insert, extract, or copy entries from the tuple space. +Under the Linda model, processes communicate and distribute work through tuples. + +A Linda tuple is not a fixed size; it can contain any combination of primitive data types and values. 
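The sketch below mimics the tuple-space interaction that the next few sentences describe in more detail: insertion, non-destructive reads, and destructive extraction against a template whose `None` fields act as formals. It is plain single-process Python, not Linda, and ignores distribution entirely.

```python
class TupleSpace:
    """Toy tuple space: None in a template is a 'formal' to be filled in."""
    def __init__(self):
        self.tuples = []

    def out(self, t):                 # insert a fully evaluated tuple
        self.tuples.append(t)

    def _matches(self, template, t):
        return len(template) == len(t) and all(
            want is None or want == have for want, have in zip(template, t))

    def rd(self, template):           # copy a matching tuple, leave it in place
        for t in self.tuples:
            if self._matches(template, t):
                return t
        return None

    def in_(self, template):          # extract (remove) a matching tuple
        t = self.rd(template)
        if t is not None:
            self.tuples.remove(t)
        return t

ts = TupleSpace()
ts.out(("task", 42, "pending"))
print(ts.rd(("task", None, None)))    # ('task', 42, 'pending'), still in the space
print(ts.in_(("task", None, None)))   # removed this time
```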
+To insert a tuple into the space, the fields of the tuple are fully evaluated before insertion. +A process can decide to evaluate a tuple serially, or spin up a background task to first evaluate the fields, then insert the tuple. +To retrieve a tuple from the space, a *template* tuple is provided that contains a number of fixed fields to match against, as well as *formals* that are "filled in" by the tuple that matches the search. +If many tuples match the template, one is selected arbitrarily. +When retrieving a tuple, the tuple may be left in the tuple space or removed. + +In practice, the tuple space is disjointly distributed among the nodes in the cluster. +The number and type of elements in a tuple defines the tuple's *class*. +All requests made for a particular class of tuple are sent through a *rendezvous point*, which provides a logically central way of performing book keeping about tuples. +The rendezvous services requests for insertion and deletions of all tuples of a class. +In the most basic implementation of Linda, each rendezvous point is located on a single participating node in the cluster. #### Orca -- cgit v1.2.3 From dc7359d589f8d24632d32eb15f47346eb8da7306 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Mon, 12 Dec 2016 13:12:56 -0500 Subject: reorganize --- chapter/4/dist-langs.md | 94 ++++++++++++++++++++++++------------------------- 1 file changed, 46 insertions(+), 48 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 7ba5db2..24c819e 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -90,9 +90,9 @@ Distributed shared memory (DSM) takes the virtual memory abstraction one step fu Given such an abstraction, programs can communicate simply by reading from and writing to shared memory addresses. DSM is appealing because the programming model is the same for local and distributed systems. However, it requires an underlying system to function properly. -Mirage, Linda, and Orca are three systems that use distributed shared memory to provide a distributed programming model. +Mirage, Orca, and Linda are three systems that use distributed shared memory to provide a distributed programming model. -#### Mirage +#### Mirage (1989) Mirage is an OS-level implementation of DSM. In Mirage, regions of memory known as *segments* are created and indicated as shared. @@ -126,6 +126,24 @@ When a request is honored for a page, the timer is reset. The timer guarantees that the page will not be invalidated for a minimum period of time. Future request that result in invalidation or demotion (writer to reader) are only honored if the timer is satisfied. +#### Orca (1992) + +Orca is a programming language built for distribution and is based on the DSM model. +Orca expresses parallelism explicitly through processes. +Processes in Orca are similar to procedures, but are concurrent instead of serial. +When a process is forked, it can take parameters that are either passed as a copy of the original data, or passed as a *shared data-object*. +Processes communicate through these shared objects. + +Shared data objects in Orca are similar to objects in OOP. +An object is defined abstractly by a name and a set of interfaces. +An implementation of the object defines any private data fields as well as the interfaces (methods). +Importantly, these interfaces are guaranteed to be indivisible, meaning that simultaneous calls to the same interface are serializable. 
+Although serializability alone does not eliminate indeterminism from Orca programs, it keeps the model simple while it allows programmers to construct richer, multi-operation locks for arbitrary semantics and logic. + +Another key feature of Orca is the ability to express symbolic data structures as shared data objects. +In Orca, a generic graph type is offered as a first-class data-object. +Because operations are serializable at the method level, the graph data-object can offer methods that span multiple nodes while still retaining serializability. + #### Linda (1993) Linda is a programming model based on DSM. @@ -146,24 +164,6 @@ All requests made for a particular class of tuple are sent through a *rendezvous The rendezvous services requests for insertion and deletions of all tuples of a class. In the most basic implementation of Linda, each rendezvous point is located on a single participating node in the cluster. -#### Orca - -Orca is a programming language built for distribution and is based on the DSM model. -Orca expresses parallelism explicitly through processes. -Processes in Orca are similar to procedures, but are concurrent instead of serial. -When a process is forked, it can take parameters that are either passed as a copy of the original data, or passed as a *shared data-object*. -Processes communicate through these shared objects. - -Shared data objects in Orca are similar to objects in OOP. -An object is defined abstractly by a name and a set of interfaces. -An implementation of the object defines any private data fields as well as the interfaces (methods). -Importantly, these interfaces are guaranteed to be indivisible, meaning that simultaneous calls to the same interface are serializable. -Although serializability alone does not eliminate indeterminism from Orca programs, it keeps the model simple while it allows programmers to construct richer, multi-operation locks for arbitrary semantics and logic. - -Another key feature of Orca is the ability to express symbolic data structures as shared data objects. -In Orca, a generic graph type is offered as a first-class data-object. -Because operations are serializable at the method level, the graph data-object can offer methods that span multiple nodes while still retaining serializability. - ### Actor / Object model Unlike DSM, communication in the actor model is explicit and exposed through message passing. @@ -174,32 +174,7 @@ The model maps well to single multicore machines as well as to clusters of machi Although an underlying system is required to differentiate between local and remote messages, the location of processes, objects, or actors can be transparent to the application programmer. Erlang, Emerald, Argus, and Orleans are just a few of many implementations of the actor model. -#### Erlang - -Erlang is a distributed language which combines functional programming with message passing. -Units of distribution in Erlang are processes. -These processes may be co-located on the same node or distributed amongst a cluster. - -Processes in Erlang communicate by message passing. -Specifically, a process can send an asynchronous message to another process' *mailbox*. -At some future time, the receiving process may enter a *receive* clause, which searches the mailbox for the first message that matches one of a set of patterns. -The branch of code that is executed in response to a message is dependent on the pattern that is matched. 
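In Python terms, the mailbox-and-receive style just described might be sketched as below. It is only an analogue: Erlang's receive scans the mailbox for the first message matching a pattern, while this queue is strictly FIFO, and the "patterns" here are ordinary tuple checks.

```python
import queue
import threading

mailbox = queue.Queue()          # each process owns one mailbox

def worker():
    while True:
        msg = mailbox.get()      # receive: block until a message arrives
        # crude pattern matching on the shape of the message
        if msg == ("stop",):
            break
        elif msg[0] == "add":
            _, a, b = msg
            print("sum:", a + b)
        else:
            print("unhandled message:", msg)

t = threading.Thread(target=worker)
t.start()
mailbox.put(("add", 1, 2))       # asynchronous send to the worker's mailbox
mailbox.put(("stop",))
t.join()
```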
- -In general, an application written in Erlang is separated into two broad components: *workers* and *monitors*. -Workers are responsible for application logic. -Erlang offers a special function `link(Pid)` which allows one process to monitor another. -If the process indicated by `Pid` fails, the monitor process will be notified and is expected to handle the error. -Worker processes are "linked" by monitor processes which implement the fault-tolerance logic of the application. - -Erlang, first implemented in Prolog, has the features and styles of a functional programming language. -Variables in Erlang are immutable; once assigned a value, they cannot be changed. -Because of this, loops in Erlang are written as tail recursive function calls. -Although this at first seems like a flawed practice (to traditional procedural programmers), tail recursive calls do not grow the current stack frame, but instead replace it. -It is worth noting that stack frame growth is still possible, but if the recursive call is "alone" (the result of the inner function is the result of the outer function), the stack will not grow. - -#### Cloud Haskell - -#### Emerald +#### Emerald (1987) Emerald is a distributed programming language based around a unified object model. Programs in Emerald consist of collections of Objects. @@ -224,7 +199,7 @@ Emerald uses abstract types to define sets of interfaces. Objects that implement such interfaces can be "plugged in" where needed. Therefore, code may be dynamically upgraded, and different implementations may be provided for semantically similar operations. -#### Argus +#### Argus (1988) Argus is a distributed programming language and system. It uses a special kind of object, called a *guardian*, to create units of distribution and group highly coupled data and logic. @@ -249,7 +224,30 @@ For example, a read operation that spans multiple guardians either "sees" the co Being total means that write operations that span multiple guardians either fully complete or fully fail. This is accomplished by a two-phase commit protocol that serializes the state of *all* guardians involved in an action, or discards partial state changes. -#### Orleans +#### Erlang (2000) + +Erlang is a distributed language which combines functional programming with message passing. +Units of distribution in Erlang are processes. +These processes may be co-located on the same node or distributed amongst a cluster. + +Processes in Erlang communicate by message passing. +Specifically, a process can send an asynchronous message to another process' *mailbox*. +At some future time, the receiving process may enter a *receive* clause, which searches the mailbox for the first message that matches one of a set of patterns. +The branch of code that is executed in response to a message is dependent on the pattern that is matched. + +In general, an application written in Erlang is separated into two broad components: *workers* and *monitors*. +Workers are responsible for application logic. +Erlang offers a special function `link(Pid)` which allows one process to monitor another. +If the process indicated by `Pid` fails, the monitor process will be notified and is expected to handle the error. +Worker processes are "linked" by monitor processes which implement the fault-tolerance logic of the application. + +Erlang, first implemented in Prolog, has the features and styles of a functional programming language. +Variables in Erlang are immutable; once assigned a value, they cannot be changed. 
+Because of this, loops in Erlang are written as tail recursive function calls. +Although this at first seems like a flawed practice (to traditional procedural programmers), tail recursive calls do not grow the current stack frame, but instead replace it. +It is worth noting that stack frame growth is still possible, but if the recursive call is "alone" (the result of the inner function is the result of the outer function), the stack will not grow. + +#### Orleans (2011) ### Dataflow model (static and stream) -- cgit v1.2.3 From 5bfcd392ebf514e6dbfd1256c3c10811372407e2 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Mon, 12 Dec 2016 14:02:28 -0500 Subject: orleans --- chapter/4/dist-langs.md | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 24c819e..eb7b7a4 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -249,6 +249,32 @@ It is worth noting that stack frame growth is still possible, but if the recursi #### Orleans (2011) +Orleans is a programming model for distributed computing based on actors. +An Orleans program can be conceptualized as a collection of actors. +Because Orleans is intended as a model for building cloud applications, actors do not spawn independent processes as they do in Emerald or Argus. +Rather, Orleans actors are designed to execute only when responding to requests. + +As in Argus, Orleans encapsulates requests (root function calls) within *transactions*. +When processing a transaction, function calls may span many actors. +To ensure consistency in the end result of a transaction, Orleans offers another abstraction, *activations*, to allow each transaction to operate on a consistent state of actors. +An activation is an instance of an actor. +Activations allow (1) consistent access to actors during concurrent transactions, and (2) high throughput when an actor becomes "hot." +Consistency is achieved by only allowing transactions to "touch" activations of actors that are not being used by another transaction. +High throughput is achieved by spawning many activations of an actor for handling concurrent requests. + +For example, suppose there is an actor that represents a specific YouTube video. +This actor will have data fields like `title`, `content`, `num_views`, etc. +Suppose their are concurrent requests (in turn, transactions) for viewing the video. +In the relevent transaction, the `num_views` field is incremented. +Therefore, in order to run the view requests concurrently, two activations (or copies) of the actor are created. + +Because there is concurrency within individual actors, Orleans also supports means of state reconciliation. +When concurrent transactions modify different activations of an actor, state must eventually be reconciled. +In the case of the above example, it may not be necessary to know immediately the exact view count of a video, but we would like to be able to know this value eventually. +To accomplish reconciliation, Orleans provides data structures that can be automatically merged. +As well, the developer can implement arbitrary logic for merging state. +In the case of the YouTube video, we would want logic to determine the delta of views since the start of the activation, and add that to the actors' sum. + ### Dataflow model (static and stream) The dataflow model has its roots in functional programming. 
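The surrounding paragraphs develop a video actor whose views are counted concurrently; the sketch below (invented names, not the actual Orleans API) shows two activations recording view deltas independently and a merge step folding them back into the actor.

```python
class VideoActor:
    """Sketch of an actor whose activations count views independently."""
    def __init__(self, title):
        self.title = title
        self.num_views = 0

class Activation:
    """One activation (copy) of the actor, used by one transaction at a time."""
    def __init__(self, actor):
        self.actor = actor
        self.view_delta = 0

    def view(self):
        self.view_delta += 1          # record views locally in this activation

def reconcile(actor, activations):
    """Merge logic: fold each activation's delta back into the actor."""
    for a in activations:
        actor.num_views += a.view_delta
        a.view_delta = 0

video = VideoActor("a video")
a1, a2 = Activation(video), Activation(video)   # two concurrent transactions
a1.view(); a2.view(); a2.view()
reconcile(video, [a1, a2])
print(video.num_views)                          # 3
```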
-- cgit v1.2.3 From 96cf859dd50e8ea67cda6e10989a99e1a1fac741 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Mon, 12 Dec 2016 14:12:46 -0500 Subject: intro --- chapter/4/dist-langs.md | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index eb7b7a4..21adfa6 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -275,15 +275,17 @@ To accomplish reconciliation, Orleans provides data structures that can be autom As well, the developer can implement arbitrary logic for merging state. In the case of the YouTube video, we would want logic to determine the delta of views since the start of the activation, and add that to the actors' sum. -### Dataflow model (static and stream) +### Dataflow model -The dataflow model has its roots in functional programming. -Some languages that use this model are: +In the dataflow model, programs are expressed as transformations on data. +Given a set of input data, programs are constructed as a series of transformations and reductions. +Computation is data-centric, and expressed easily as a directed acyclic graph (DAG). +Unlike the DSM and actor models, processes are not exposed to the programmer. +Rather, the programmer designs the data transformations, and a system is responsible for initializing processes and distributing work accross a system. -* Multilisp -* MapReduce (Spark, Hadoop, etc.) -* RDD -* Dryad, DryadLinq +#### Multilisp () +#### MapReduce () +#### DryadLINQ () ### Which is best? Why? -- cgit v1.2.3 From 3ceac593715dec5c1dc42357d3a6d28e96eeb722 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Mon, 12 Dec 2016 14:54:07 -0500 Subject: image --- chapter/4/MR.png | Bin 0 -> 83219 bytes chapter/4/dist-langs.md | 16 ++++++++++++++-- 2 files changed, 14 insertions(+), 2 deletions(-) create mode 100644 chapter/4/MR.png (limited to 'chapter/4') diff --git a/chapter/4/MR.png b/chapter/4/MR.png new file mode 100644 index 0000000..54db004 Binary files /dev/null and b/chapter/4/MR.png differ diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 21adfa6..64eccc0 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -283,9 +283,21 @@ Computation is data-centric, and expressed easily as a directed acyclic graph (D Unlike the DSM and actor models, processes are not exposed to the programmer. Rather, the programmer designs the data transformations, and a system is responsible for initializing processes and distributing work accross a system. -#### Multilisp () -#### MapReduce () +#### MapReduce (2004) + +* input key-value pairs -> output key-value pairs +* Map and Reduced chained to create programs +* Map + * input key-value pairs transformed into intermediate key-value pairs +* Reduce + * intermediate keys are aggregated by key + * function performs some action based on all values associated with an intermediate key +* Map and Reduce may emit zero, one, or many key-value pairs per input + +![Alt text] (/MR.png "MapReduce Wordcount Workflow") + #### DryadLINQ () +#### Discretized Streams (2012) ### Which is best? Why? -- cgit v1.2.3 From 8d16b80b97c0c798ad50df52034e94bb583789fc Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Mon, 12 Dec 2016 14:56:23 -0500 Subject: .. 
--- chapter/4/dist-langs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 64eccc0..0816e6a 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -294,7 +294,7 @@ Rather, the programmer designs the data transformations, and a system is respons * function performs some action based on all values associated with an intermediate key * Map and Reduce may emit zero, one, or many key-value pairs per input -![Alt text] (/MR.png "MapReduce Wordcount Workflow") +![Alt text] (./MR.png "MapReduce Wordcount Workflow") #### DryadLINQ () #### Discretized Streams (2012) -- cgit v1.2.3 From ae9c88ec42abba9bdc8f7d49971038344268766f Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Mon, 12 Dec 2016 14:57:32 -0500 Subject: . --- chapter/4/dist-langs.md | 1 + 1 file changed, 1 insertion(+) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 0816e6a..9d3adc8 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -295,6 +295,7 @@ Rather, the programmer designs the data transformations, and a system is respons * Map and Reduce may emit zero, one, or many key-value pairs per input ![Alt text] (./MR.png "MapReduce Wordcount Workflow") +(http://www.milanor.net/blog/an-example-of-mapreduce-with-rmr2/) #### DryadLINQ () #### Discretized Streams (2012) -- cgit v1.2.3 From 891dd757e5d48ba96c9d81fd8f54451726c763bc Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Mon, 12 Dec 2016 17:13:23 -0500 Subject: mapreduce --- chapter/4/dist-langs.md | 34 ++++++++++++++++++++++++---------- 1 file changed, 24 insertions(+), 10 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 9d3adc8..e489d03 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -285,16 +285,30 @@ Rather, the programmer designs the data transformations, and a system is respons #### MapReduce (2004) -* input key-value pairs -> output key-value pairs -* Map and Reduced chained to create programs -* Map - * input key-value pairs transformed into intermediate key-value pairs -* Reduce - * intermediate keys are aggregated by key - * function performs some action based on all values associated with an intermediate key -* Map and Reduce may emit zero, one, or many key-value pairs per input - -![Alt text] (./MR.png "MapReduce Wordcount Workflow") +MapReduce is a model and system for writing distributed programs that is data-centric. +Distributed programs are structed as series of *Map* and *Reduce* data transformations. +These two primitives are borrowed from traditional functional languages, and can be used to express a wide range of logic. +The key strength of this approach is that computations can be reasoned about and expressed easily while an underlying system takes care of the "dirty" aspects of distributed computing such as communication, fault-tolerance, and efficiency. + +A MapReduce program consists of a few key stages. +First, the data is read from a filesystem or other data source as a list of key-value pairs. +These pairs are distributed amongst a set of workers called *Mappers*. +Each mapper processes each element in its partition, and may output zero, one, or many *intermediate* key-value pairs. +Then, intermediate key-value pairs are grouped by key. +Finally, *Reducers* take all values pertaining to an intermediate key and output zero, one, or many output key-value pairs. 
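The stages above can be condensed into a single-machine Python sketch of word count. Only `map_fn` and `reduce_fn` correspond to what a programmer writes in a real framework; the grouping loop stands in for the shuffle, and there is no partitioning, fault tolerance, or cluster here.

```python
from collections import defaultdict

def map_fn(line):                       # one input record -> intermediate pairs
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):            # all values for one key -> output pairs
    return [(word, sum(counts))]

def mapreduce(lines, map_fn, reduce_fn):
    groups = defaultdict(list)
    for line in lines:                  # "map" phase
        for key, value in map_fn(line):
            groups[key].append(value)   # stand-in for the shuffle/group step
    output = []
    for key, values in groups.items():  # "reduce" phase
        output.extend(reduce_fn(key, values))
    return output

print(mapreduce(["a rose is a rose", "is a rose"], map_fn, reduce_fn))
```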
+A MapReduce job may consist of one or many iterations of map and reduce. + +Crucially, for each stage the programmer is only responsible for programming the Map and Reduce logic. +The underlying system (in the case of Google, a C++ library), handles distributing input data and *shuffling* intermediate entries. +Optionally, the user can implement custom logic for formatting input and output data. + +An example program in MapReduce is illustrated below. +First, the input file is partitioned and distributed to a set of worker nodes. +Then, the map function transforms lines of the text file into key-value pairs in the format (\< word \>, 1). +These intermediate pairs are aggregated by key: the word. +In the reduce phase, the list of 1's is summed to compute a wordcount for each word. + +![Alt text] (./MR.png "MapReduce Workflow") (http://www.milanor.net/blog/an-example-of-mapreduce-with-rmr2/) #### DryadLINQ () -- cgit v1.2.3 From c019fd9d7f49168f0bc855de717d710946c032e1 Mon Sep 17 00:00:00 2001 From: Connor Date: Mon, 12 Dec 2016 21:49:01 -0500 Subject: .. --- chapter/4/dist-langs.md | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index e489d03..32d1175 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -311,9 +311,18 @@ In the reduce phase, the list of 1's is summed to compute a wordcount for each w ![Alt text] (./MR.png "MapReduce Workflow") (http://www.milanor.net/blog/an-example-of-mapreduce-with-rmr2/) -#### DryadLINQ () #### Discretized Streams (2012) +#### GraphX (2013) + +Many real world problems are expressed using graphs. +GraphX is a system built on top of the Spark MapReduce framework { // TODO cite RDD } that exposes traditional graph operations while internally representing a graph as a collection of RDD's. +GraphX exposes these operations through what it calls a Resilient Distributed Graph (RDG). +Internally, an RDG is a collection of RDD's that define a vertex split of a graph { // TODO CITE powergraph }. +Because they are built on top of RDD's, RDG's inherit immutability. +When a tranformation is performed, a new graph is created. +In this way, fault tolerance in GraphX can be executed the same way as it is in vanilla Spark; when a fault happens, the series of computations is remembered and re-executed. + ### Which is best? Why? MR vs Actors: depends on problem, solution -- cgit v1.2.3 From bb7b13c11fcfc1f106ad4535a068de7d771627f9 Mon Sep 17 00:00:00 2001 From: Connor Date: Mon, 12 Dec 2016 22:16:06 -0500 Subject: graphx --- chapter/4/dist-langs.md | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 32d1175..4c5b946 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -323,6 +323,13 @@ Because they are built on top of RDD's, RDG's inherit immutability. When a tranformation is performed, a new graph is created. In this way, fault tolerance in GraphX can be executed the same way as it is in vanilla Spark; when a fault happens, the series of computations is remembered and re-executed. +A key feature of GraphX is that it is a DSL library built on top of a GPL library. +Because it uses the general purpose computing framework of Spark, arbitrary MapReduce jobs may be performed in the same program as more specific graph operations. 
+In other graph-processing frameworks, results from a graph query would have to be written to disk to be used as input to a general purpose MapReduce job. + +With GraphX, if you can structure your application logic as a series of graph operations, an implementation may be created on top of RDD's. +Because many real-world applications, like social media "connections," are naturally expressed as graphs, GraphX can be used to create a highly scalable, fault-tolerant implementation. + ### Which is best? Why? MR vs Actors: depends on problem, solution -- cgit v1.2.3 From 5a033e43971aff5f3c9a21e2ec60f4b38e5c3953 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Tue, 13 Dec 2016 12:15:28 -0500 Subject: d-streams --- chapter/4/dist-langs.md | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 4c5b946..84532c7 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -313,6 +313,23 @@ In the reduce phase, the list of 1's is summed to compute a wordcount for each w #### Discretized Streams (2012) +Discretized Streams is a model for processing streams data in realtime based on the traditional dataflow paradigm. +Streams of data are "chunked" discretely based on a time interval. +These chunks are then operated on as normal inputs to DAG-style computations. +Because this model is implemented on top of the MapReduce framework Spark, streaming computations can be flexibly combined with static MapReduce computations as well as live queries. + +Discretized Streams (D-Streams) are represented as a series of RDD's, each spanning a certain time interval. +Like traditional RDD's, D-Streams offer stateless operations such as *map*, *reduce*, *groupByKey*, etc., which can be performed regardless of previous inputs and outputs. +Unlike traditional RDD's, D-Streams offer *statefull* operations. +These stateful operations, such as *runningReduce*, are necessary for producing aggregate results for a *possibly never-ending* stream of input data. + +Because the inputs are not known *a priori*, fault tolerance in streaming systems must behave slightly differently. +For efficiency, the system periodically creates a checkpoint of intermediate data. +When a node fails, the computations performed since the last checkpoint are remembered, and a new node is assigned to recompute the lost partitions from the previous checkpoint. +Two other approaches to fault tolerance in streaming systems are replication and upstream backup. +Replication is not cost effective as every process must be duplicated, and does not cover the case of all replicas failing. +Upstream backup is slow as the system must wait for a backup node to recompute everything in order to recover state. + #### GraphX (2013) Many real world problems are expressed using graphs. -- cgit v1.2.3 From 01ef4e7b3e8f5bbb1bfefdb574ffcd2fe3225ed3 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Tue, 13 Dec 2016 12:18:16 -0500 Subject: spell checks --- chapter/4/dist-langs.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 84532c7..2d8f4af 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -265,7 +265,7 @@ High throughput is achieved by spawning many activations of an actor for handlin For example, suppose there is an actor that represents a specific YouTube video. This actor will have data fields like `title`, `content`, `num_views`, etc. 
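Returning to the Discretized Streams idea described a little earlier: a minimal Python sketch of micro-batching might look like the following, where events are chunked by time interval and a stateful running count is carried across batches. The function names are assumptions; none of this is the Spark Streaming API.

```python
def micro_batches(events, interval=1.0):
    """Group (timestamp, word) events into per-interval batches."""
    batches = {}
    for ts, word in events:
        batches.setdefault(int(ts // interval), []).append(word)
    return [batches[k] for k in sorted(batches)]

running = {}                       # state carried across batches
def running_word_count(batch):
    for word in batch:             # a stateless per-batch count would stop here
        running[word] = running.get(word, 0) + 1
    return dict(running)           # stateful result so far

events = [(0.1, "a"), (0.4, "b"), (1.2, "a"), (2.7, "a")]
for batch in micro_batches(events, interval=1.0):
    print(running_word_count(batch))
```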
Suppose their are concurrent requests (in turn, transactions) for viewing the video. -In the relevent transaction, the `num_views` field is incremented. +In the relevant transaction, the `num_views` field is incremented. Therefore, in order to run the view requests concurrently, two activations (or copies) of the actor are created. Because there is concurrency within individual actors, Orleans also supports means of state reconciliation. @@ -281,12 +281,12 @@ In the dataflow model, programs are expressed as transformations on data. Given a set of input data, programs are constructed as a series of transformations and reductions. Computation is data-centric, and expressed easily as a directed acyclic graph (DAG). Unlike the DSM and actor models, processes are not exposed to the programmer. -Rather, the programmer designs the data transformations, and a system is responsible for initializing processes and distributing work accross a system. +Rather, the programmer designs the data transformations, and a system is responsible for initializing processes and distributing work across a system. #### MapReduce (2004) MapReduce is a model and system for writing distributed programs that is data-centric. -Distributed programs are structed as series of *Map* and *Reduce* data transformations. +Distributed programs are structured as series of *Map* and *Reduce* data transformations. These two primitives are borrowed from traditional functional languages, and can be used to express a wide range of logic. The key strength of this approach is that computations can be reasoned about and expressed easily while an underlying system takes care of the "dirty" aspects of distributed computing such as communication, fault-tolerance, and efficiency. @@ -313,14 +313,14 @@ In the reduce phase, the list of 1's is summed to compute a wordcount for each w #### Discretized Streams (2012) -Discretized Streams is a model for processing streams data in realtime based on the traditional dataflow paradigm. +Discretized Streams is a model for processing streams data in real-time based on the traditional dataflow paradigm. Streams of data are "chunked" discretely based on a time interval. These chunks are then operated on as normal inputs to DAG-style computations. Because this model is implemented on top of the MapReduce framework Spark, streaming computations can be flexibly combined with static MapReduce computations as well as live queries. Discretized Streams (D-Streams) are represented as a series of RDD's, each spanning a certain time interval. Like traditional RDD's, D-Streams offer stateless operations such as *map*, *reduce*, *groupByKey*, etc., which can be performed regardless of previous inputs and outputs. -Unlike traditional RDD's, D-Streams offer *statefull* operations. +Unlike traditional RDD's, D-Streams offer *stateful* operations. These stateful operations, such as *runningReduce*, are necessary for producing aggregate results for a *possibly never-ending* stream of input data. Because the inputs are not known *a priori*, fault tolerance in streaming systems must behave slightly differently. @@ -337,7 +337,7 @@ GraphX is a system built on top of the Spark MapReduce framework { // TODO cite GraphX exposes these operations through what it calls a Resilient Distributed Graph (RDG). Internally, an RDG is a collection of RDD's that define a vertex split of a graph { // TODO CITE powergraph }. Because they are built on top of RDD's, RDG's inherit immutability. 
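The representation idea behind an RDG can be suggested with plain collections, as in the hedged sketch below: vertices and edges are ordinary datasets, a graph query is just a map/group/reduce over them, and every operation returns new collections rather than mutating old ones. This is not the real GraphX API.

```python
# Vertices and edges as plain collections, mirroring the "graph on top of
# datasets" idea; operations build new collections (immutability).
vertices = [(1, "alice"), (2, "bob"), (3, "carol")]
edges = [(1, 2), (1, 3), (2, 3)]          # (src, dst) pairs

def in_degrees(vertices, edges):
    counts = {}
    for _, dst in edges:                  # a map/group/reduce over the edge set
        counts[dst] = counts.get(dst, 0) + 1
    return [(vid, name, counts.get(vid, 0)) for vid, name in vertices]

print(in_degrees(vertices, edges))        # original collections are untouched
```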
-When a tranformation is performed, a new graph is created. +When a transformation is performed, a new graph is created. In this way, fault tolerance in GraphX can be executed the same way as it is in vanilla Spark; when a fault happens, the series of computations is remembered and re-executed. A key feature of GraphX is that it is a DSL library built on top of a GPL library. -- cgit v1.2.3 From 7ea765654bc4716639cfc9600b2139a99f44ee29 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Tue, 13 Dec 2016 13:27:58 -0500 Subject: partial failure --- chapter/4/dist-langs.md | 40 +++++++++++++++++++++++++--------------- 1 file changed, 25 insertions(+), 15 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 2d8f4af..e8174e1 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -14,21 +14,31 @@ In this section, we present an overview of these three problems and their impact ### Partial Failure In the case of a crash on a local environment, either the machine has failed (total failure), or the source of the crash can be learned from a central resource manager such as the operating system. (// TODO cite "a note on dist. comp.) -If an application consists of multiple communicating processes partial failure is possible, however because the cause of the partial failure can be determined, this kind of partial failure can be repaired given the operating system's knowledge about the failure. -For example, a process can be restored based on a checkpoint, another process in the application can query the operating system about the failed process' state state, etc. - -* Failure in a distributed setting - * 2 sources, network and host - * no central manager (no knowledge) - * non-determinism - * consistency (leave until next section) - * control is not returned to the caller, message or response may "vanish" - -* Impact, methods of dealing with partial failure - * recompute, duplicate computation (MR, RDD, Hadoop) - * 2-phase commit (Argus) - * redundancy (MR (spark's duplicate master), Orleans, Argus) - * checkpoint-restore (Naiad, Hadoop) +If an application consists of multiple communicating processes partial failure is possible, however because the cause of the partial failure can be determined, this kind of partial failure can be repaired given the operating system's knowledge. +For example, a process can be restored based on a checkpoint, another process in the application can query the operating system about the failed process' state, etc. + +Because failure in a distributed setting involves another player, the network, it is impossible in most cases to determine the cause of failure. +In a distributed environment, there is no (reliable) central manager that can report on the state of all components. +Further, due to the inherent concurrency in a distributed system, nondeterminism is a problem that must be considered when designing distributed models, languages, and systems. +Communication is perhaps the most obvious example of this; messages may be lost or arrive out-of-order. +Finally, unlike in a local environment where failure returns control to the caller, failure may not be reported or the response may simply vanish. +Because of this, distributed communication must be designed expecting partial failure, and be able to "fail gracefully." + +Several methods have been developed to deal with the problem of partial failure. 
+One method, made popular with batch processing and MapReduce style frameworks, is to remember the series of computations needed to obtain a result and recompute the result in the case of failure. +Systems such as MapReduce, Spark, GraphX, and Spark Streaming use this model, as well as implement optimizations to make it fast. +Another method of dealing with partial failure is the two phase commit. +To perform a change of state across many components, first a logically central "leader" checks to see if all components are ready to perform and action. +If all reply "yes," the action is *committed*. +Otherwise, as in the case of partial failure, no changes are committed. +Two phase commit ensures that state is not changed in a partial manner. +Another solution to partial failure is redundancy, or replication. +If one replica of a computation failes, the others may survive and continue. +Replication can also be used to improve performance, as in MapReduce and Spark Streaming. +Checkpoint and restore has also been implemented as a way to recover from partial failure. +By serializing a recent "snapshot" of state to stable storage, recomputing current state is made cheap. +This is the primary method of partial failure in RDD-based systems. +In other systems, like Argus, objects are reconstructed from state that is automatically or manually serialized to disk. ### Consistency (Concurrency) -- cgit v1.2.3 From 3ad557858a0b6d1164648292f767bc9c7d78b126 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Tue, 13 Dec 2016 13:28:58 -0500 Subject: spell check --- chapter/4/dist-langs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index e8174e1..263a356 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -33,7 +33,7 @@ If all reply "yes," the action is *committed*. Otherwise, as in the case of partial failure, no changes are committed. Two phase commit ensures that state is not changed in a partial manner. Another solution to partial failure is redundancy, or replication. -If one replica of a computation failes, the others may survive and continue. +If one replica of a computation fails, the others may survive and continue. Replication can also be used to improve performance, as in MapReduce and Spark Streaming. Checkpoint and restore has also been implemented as a way to recover from partial failure. By serializing a recent "snapshot" of state to stable storage, recomputing current state is made cheap. -- cgit v1.2.3 From 71de8df5edf1d39ba75a400e46d38fb953012fe9 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Tue, 13 Dec 2016 14:17:55 -0500 Subject: consistency --- chapter/4/dist-langs.md | 45 ++++++++++++++++++++++++++++++++------------- 1 file changed, 32 insertions(+), 13 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 263a356..f3cca62 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -42,19 +42,38 @@ In other systems, like Argus, objects are reconstructed from state that is autom ### Consistency (Concurrency) -* Local - * enforce consistency with locks - * state located under one resource manager (partial local crash) - -* Distributed - * preserve state in instance of failure - * concurrent method invocations / messages - * operations on replicated objects (for scalability) - -* Methods - * sequencer - * message queues - * read only vs. 
write ops +If computing on shared data can be avoided, parallel computations would not be bottlenecked by serialized accesses. +Unfortunately, there are many instances where operating on shared data is necessary. +While problems with shared data can be dealt with fairly simply in the local case, distribution introduces problems that make consistency more complex. + +In local computing, enforcing consistency is fast and straightforward. +Traditionally, a piece of data is protected by another piece of data called a *lock*. +To operate on the data, a concurrent process *acquires* the lock, makes its changes, then *releases* the lock. +Because the data is either located in on-board memory or an on-chip cache, passing the shared data around is relatively fast. +As in the case of partial failure, a central resource manager (the OS) is present and can respond to a failed process that has obtained a lock. + +In a distributed environment, coordination and locking is more difficult. +First, because of the lack of a central resource manager, there needs to be a way of preserving or recovering the state of shared data in the case of failure. +The problem of acquiring locks also becomes harder due to partial failure and higher latency. +Synchronization protocols must expect and be able to handle failure. + +To deal with operations on shared data, there are a few standard techniques. +A *sequencer* can be used to serialize requests to a shared piece of data. +When a process on a machine wants to write to shared data, it sends a request to a logically central process called a sequencer. +The sequencer takes all incoming requests and serializes them, and sends the serialized operations (in order) to all machines with a copy of the data. +The shared data will then undergo the same sequence of transformations on each machine, and therefore be consistent. +A similar method for dealing with consistency is the message queue. +In the actor model, pieces of an application are represented as actors which respond to requests. +Actors may use *message queues* which behave similarly to sequencers. +Incoming method calls or requests are serialized in a queue which the actor can use to process requests one at a time. +Finally, some systems take advantage of the semantics of operations on shared data, distinguishing between read-only operations and write operations. +If an operation is determined to be read-only, the shared data can be distributed and accessed locally. +If an operation writes to shared data, further synchronization is required. + +Unfortunately, none of these techniques can survive a network partition. +Consistency requires communication, and a partitioned network will prevent updates to state on one machine from propagating. +Distributed systems therefore may be forced by other requirements to loosen their requirement of consistency. +Below, the CAP theorem formalizes this idea. ### Latency -- cgit v1.2.3 From 871a97cc40da006c70ff9e13c8cfb4f368243b3a Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Tue, 13 Dec 2016 16:21:37 -0500 Subject: progress --- chapter/4/dist-langs.md | 61 ++++++++++++++++++++++++++++++++----------------- 1 file changed, 40 insertions(+), 21 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index f3cca62..39da2d0 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -77,30 +77,49 @@ Below, the CAP theorem formalizes this idea. 
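A sequencer of the kind described above can be sketched in a few lines of Python (assumed names, no networking or failure handling): every write is stamped with a global sequence number and delivered to all replicas in that order, so the replicas converge to the same state.

```python
import itertools

class Sequencer:
    """Logically central sequencer: assigns a global order to write requests."""
    def __init__(self, replicas):
        self.counter = itertools.count()
        self.replicas = replicas

    def submit(self, op):
        seq = next(self.counter)
        for r in self.replicas:          # deliver in the same order everywhere
            r.apply(seq, op)

class Replica:
    def __init__(self):
        self.value = 0
        self.log = []

    def apply(self, seq, op):
        self.log.append((seq, op))
        self.value = op(self.value)

r1, r2 = Replica(), Replica()
seq = Sequencer([r1, r2])
seq.submit(lambda v: v + 1)
seq.submit(lambda v: v * 10)
assert r1.value == r2.value == 10        # replicas converge to the same state
```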
### Latency -* Local - * operations are relatively fast - * topology doesn't change - -* Distributed - * Network failures and recovery - * changing topology - * efficiency - -* Methods - * process locality (static, dynamic analysis) - * minimize communication - * data replication (Orca) - * pipe-lining (HTTP) - * asynchronous callbacks - +Latency is another major problem that is unique to distributed computing. +Unlike the other problems discussed in this section, latency does not necessarily affect program correctness. +Rather, it is a problem that impacts application performance, and can be a source of nondeterminism. + +In the case of local computing, latency is minimal and fairly constant. +Although their may be subtle timing differences that arise from contention from concurrent processes, these fluctuations are relatively small. +As well, machine hardware is constant. +There are no changes to the latency of communication channels on a single machine. + +Distribution introduces network topology. +This topology significantly (orders of magnitude) increases the latency of communication, as well as introduces a source of nondeterminism. +At any time, routing protocols or hardware changes (or both) may cause the latency between two machines to change. +Therefore, distributed applications may not rely on specific timings of communication in order to function. +Distributed processes may also be more restricted. +Because communication across the network is costly, applications may necessarily be designed to minimize communication. + +A more subtle (and sinister) problem with increased latency and the network is the inability of a program to distinguish between a slow message and a failed message. +This situation is analogous to the halting problem, and forces distributed applications to make decisions about when a message, link, or node has "failed." + +Several methods have been developed to cope with the latency of communication. +Static and dynamic analysis may be performed on communication patterns so that entities that communicate frequently are more proximate than those that communicate infrequently. +Another approach that has been used is data replication. +If physically separate entities all need to perform reads on a piece of data, that data can be replicated and read from local hardware. +Another approach is pipelining; a common example of this is in some flavors of the HTTP protocol. +Pipelining requests allows a process to continue with other work, or issue more requests without blocking for the response of each request. +Pipelining lends itself to an asynchronous style of programming, where a callback can be assigned to handle the results of a request. +*Futures* and *promises* have built on this programming style, allowing computations to be queued, and performed when the value of a future or promise is resolved. + ### The CAP Theorem -Indeed, these three issues of distributed computing are not disjoint. -A solution designed to solve one problem may exacerbate another. +Indeed, the three problems outlined above are not independent, and a solution for one may come at the cost of *amplifying* the effects of another. +For example, let's suppose when a request to our system arrives, a response should be issued as soon as possible. +Here, we want to minimize latency. +Unfortunately, this may come at the cost of consistency. 
+We are forced to either (1) honor latency and send a possibly inconsistent result, or (2) honor consistency and wait for the distributed system to synchronize before replying. + +The CAP theorem { // TODO cite CAP } formalizes this notion. +CAP stands for Consistency, Availability, and tolerance to Partitioning. +The theorem states that a distributed system may only have two of these three properties. -* Consistency -* Availability -* Partitioning +Since its introduction, experience suggests this theorem is not as rigid as was originally proposed { // TODO cite 12 years later }. +In practice, for example, rareness of network partitioning makes satisfaction of all three easier. +As well, advancements in consistency models, such as CRDT's { // TODO cite CRDT's paper }, make balancing consistency and availability flexible to the requirements of the system. ## Three Major Approaches to Distributed Languages -- cgit v1.2.3 From d81f59ce9e3dd25e871c67c29e06f1614b977f25 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Tue, 13 Dec 2016 16:59:39 -0500 Subject: prg --- chapter/4/dist-langs.md | 23 ++++++++++++++++++++--- 1 file changed, 20 insertions(+), 3 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 39da2d0..fd9a7d3 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -414,18 +414,35 @@ Actors: * Message-passing chapter -## Support for Distribution +## Thoughts on System Design ### Intro -* What is a DSL? +The definition of a domain-specific language is a hot topic and there have been several attempts to concretely define what exactly *it* is. + +Here is the definition as given by { // TODO cite when and how }: > Domain-specific languages are languages tailored to a specific application domain. -Another definition: +Another definition is offered (and commonly cited) by { // TODO cite annotated bib }: > A domain-specific language is a programming language or executable specification language that offers, through appropriate notations and abstractions, expressive power focused on, and usually restricted to, a particular problem domain. +Generally, I would refer to a domain-specific language (DSL) as a *system*, be it a standalone language, compiler extension, library, set of macros, etc., that is designed for common operations in a problem domain to be easily expressed. + +For example, the python twitter library is designed for easily expressing operations that manage a twitter account. + +The problem in defining this term (I believe) is the the vagueness of the components *domain* and *language*. +Depending on the classification, a set of problems designated in a certain domain may span a "wide" or "narrow" scope. +For example, does "tweeting" qualify as a domain (within the twitter library)? +Would "social media sharing" qualify as a domain (containing the twitter library)? +For my purposes I will accept the definition of a domain as a "well-defined, highly cohesive set of operations." + +It is also difficult to come up with a definition for a language. +A language may be qualified if it has its own compiler. +An orthogonal definition qualifies a language by its style, as in the case of sets of macros. +This confusion is why I adopt the even more vague term of *system* in my own definition. + ### Where is it in the stack? * Libraries: -- cgit v1.2.3 From d6994f34cd72bb8f5fc871a444d393860a878da6 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Tue, 13 Dec 2016 17:02:08 -0500 Subject: . 
--- chapter/4/dist-langs.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index fd9a7d3..e7463c7 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -428,7 +428,7 @@ Another definition is offered (and commonly cited) by { // TODO cite annotated b > A domain-specific language is a programming language or executable specification language that offers, through appropriate notations and abstractions, expressive power focused on, and usually restricted to, a particular problem domain. -Generally, I would refer to a domain-specific language (DSL) as a *system*, be it a standalone language, compiler extension, library, set of macros, etc., that is designed for common operations in a problem domain to be easily expressed. +Generally, I would refer to a domain-specific language (DSL) as a *system*, be it a standalone language, compiler extension, library, set of macros, etc., that is designed for a set of cohesive operations to be easily expressed. For example, the python twitter library is designed for easily expressing operations that manage a twitter account. @@ -436,7 +436,7 @@ The problem in defining this term (I believe) is the the vagueness of the compon Depending on the classification, a set of problems designated in a certain domain may span a "wide" or "narrow" scope. For example, does "tweeting" qualify as a domain (within the twitter library)? Would "social media sharing" qualify as a domain (containing the twitter library)? -For my purposes I will accept the definition of a domain as a "well-defined, highly cohesive set of operations." +For my purposes I will accept the definition of a domain as a "well-defined, cohesive set of operations." It is also difficult to come up with a definition for a language. A language may be qualified if it has its own compiler. -- cgit v1.2.3 From 8cd17ec52832d1182f98764e8dacf03f03829236 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Tue, 13 Dec 2016 18:15:35 -0500 Subject: . --- chapter/4/dist-langs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index e7463c7..c12ea82 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -441,7 +441,7 @@ For my purposes I will accept the definition of a domain as a "well-defined, coh It is also difficult to come up with a definition for a language. A language may be qualified if it has its own compiler. An orthogonal definition qualifies a language by its style, as in the case of sets of macros. -This confusion is why I adopt the even more vague term of *system* in my own definition. +This confusion is why I adopt the even more vague term *system* in my own definition. ### Where is it in the stack? -- cgit v1.2.3 From 6196544e8d66971cc8e87c123bd12066e5cdec12 Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Wed, 14 Dec 2016 15:32:02 -0500 Subject: . 
--- chapter/4/dist-langs.md | 62 ++++++++++++++++++++++++------------------------- 1 file changed, 31 insertions(+), 31 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index c12ea82..a26c42f 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -395,28 +395,24 @@ In other graph-processing frameworks, results from a graph query would have to b With GraphX, if you can structure your application logic as a series of graph operations, an implementation may be created on top of RDD's. Because many real-world applications, like social media "connections," are naturally expressed as graphs, GraphX can be used to create a highly scalable, fault-tolerant implementation. -### Which is best? Why? - -MR vs Actors: depends on problem, solution - -How fine grain is your data and logic? (MR job can be built from actor model) -Does your algorithm map to a batch processing job? - -MR: - -* MR is DSL for distribution? (wouldn't use it to develop single-machine app (probably)) -* Dataflow / MapReduce fundamentally changed the programming style for distributed systems -* Other models (Actor, DSM) tried to mask distribution -* By changing the style, programs need necessarily consider communication patterns (disk, network) -* Although, system may still handle fault tolerance - -Actors: - -* Message-passing chapter +## Comparing Design + +* level of implementation + * OS + * compiler (language) + * library +* granularity + * logic, state + * course (batch) + * fine (actors) +* abstractions + * location + * scalability + * fault tolerance ## Thoughts on System Design -### Intro +### Domain-Specific Languages The definition of a domain-specific language is a hot topic and there have been several attempts to concretely define what exactly *it* is. @@ -443,22 +439,26 @@ A language may be qualified if it has its own compiler. An orthogonal definition qualifies a language by its style, as in the case of sets of macros. This confusion is why I adopt the even more vague term *system* in my own definition. -### Where is it in the stack? +### Distribution as a Domain + +Given the examples of models and systems, I think it is reasonable to qualify distribution as a domain. +Distributed computing has a unique set of problems, such as fault-tolerance, nondeterminism, network partitioning, and consistency that make it unique from parallel computing on a single machine. +The languages, systems, and models presented here all are built to assist the developer in dealing with these problems. -* Libraries: -* Compiler Extension -* Compiler / Runtime: -* Hardware +### DSL's as Libraries -### Why DSL's as Libraries? +The examples given above demonstrate a trend. +At first, systems designed to tackle the distribution domain were implemented as stand-alone languages. +Later, these systems appear as libraries built on top of existing general-purpose languages. +For many reasons, this style of system development is superior. 
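To make this concrete, consider a minimal sketch of what "a general-purpose language plus a library acting as a DSL" can look like. This is illustrative only: the `Pipeline` class and its methods are invented for this example and do not correspond to any existing library. Because the "DSL" is just ordinary Python, it composes directly with the rest of the host language and its ecosystem (here, the standard `math` and `json` modules), which a standalone language with its own compiler could not do as easily.

```python
import json
import math


class Pipeline:
    """A tiny embedded "DSL" for record transformations, written as a plain library."""

    def __init__(self, records):
        self.records = list(records)

    def map(self, fn):
        return Pipeline(fn(r) for r in self.records)

    def filter(self, pred):
        return Pipeline(r for r in self.records if pred(r))

    def collect(self):
        return self.records


# Because the "DSL" is only a library, it composes freely with the rest of the
# host language's ecosystem (here, the standard math and json modules).
data = [{"value": 1.0}, {"value": 4.0}, {"value": 9.0}]
result = (Pipeline(data)
          .map(lambda r: {"value": math.sqrt(r["value"])})
          .filter(lambda r: r["value"] > 1.0)
          .collect())
print(json.dumps(result))  # [{"value": 2.0}, {"value": 3.0}]
```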
-Reasons for moving to GPL's as base for DSL's
-* problem of domain-composition
-* problem of abstraction
-* problem of ecosystem
-* problem of tumultuous architecture
-* "any gpl + library can act as a dsl" - mernik"
+#### problem of domain-composition
+  * domains may be orthogonal -> can be composed
+#### problem of abstraction
+#### problem of ecosystem
+#### problem of tumultuous architecture
+#### "any gpl + library can act as a dsl" - Mernik

## References
--
cgit v1.2.3


From cf130c134ba025b25ec5af136b114ac6dab4621f Mon Sep 17 00:00:00 2001
From: Connor Zanin
Date: Wed, 14 Dec 2016 16:32:37 -0500
Subject: .

---
 chapter/4/dist-langs.md | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

(limited to 'chapter/4')

diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md
index a26c42f..c445d2e 100644
--- a/chapter/4/dist-langs.md
+++ b/chapter/4/dist-langs.md
@@ -397,15 +397,29 @@ Because many real-world applications, like social media "connections," are natur

## Comparing Design

-* level of implementation
-  * OS
-  * compiler (language)
-  * library
-* granularity
+Here, we present a taxonomy which can be used to classify each of the examples.
+Among these distributed systems there appear to be three major defining characteristics: level of implementation, granularity, and level of abstraction.
+Importantly, these characteristics are orthogonal, and each presents an opportunity for a design decision.
+
+### Level of Implementation
+
+The programming model exposed by each of the examples is implemented at some level in the computer system.
+In Mirage, the memory management needed to enable DSM is implemented within the operating system.
+Mirage's model of DSM lends itself to an OS implementation because data is shared by address.
+In other systems, such as Orca, Argus, and Erlang, the implementation is at the compiler level.
+These are languages that support distribution through syntax and programming style.
+Finally, some systems are implemented as libraries (e.g. Linda, MapReduce).
+In such cases, the underlying language is powerful enough to support the desired operations.
+The library is used to ease programmer burden and supply domain-specific syntax.
+
+### granularity

  * logic, state
  * course (batch)
  * fine (actors)
+
+### Level of Abstraction
+
 * location
 * scalability
 * fault tolerance
--
cgit v1.2.3


From b2c0b9f8b7f0fb612e6da3fa60e00c3d81b8b837 Mon Sep 17 00:00:00 2001
From: Connor Zanin
Date: Wed, 14 Dec 2016 19:21:26 -0500
Subject: .

---
 chapter/4/dist-langs.md | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

(limited to 'chapter/4')

diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md
index c445d2e..53320fc 100644
--- a/chapter/4/dist-langs.md
+++ b/chapter/4/dist-langs.md
@@ -412,11 +412,27 @@ Finally, some systems are implemented as libraries (e.g. Linda, MapReduce).
 In such cases, the underlying language is powerful enough to support the desired operations.
 The library is used to ease programmer burden and supply domain-specific syntax.

-### granularity
+The pros and cons of different implementation strategies are discussed further under *DSL's as Libraries*.

- * logic, state
-  * course (batch)
-  * fine (actors)
+### Granularity
+
+The granularity of logic and state is another major characteristic of these systems.
+Generally, the actor and DSM models can be considered fine grain while the dataflow model is coarse grain.
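The contrast can be sketched roughly as follows, and is expanded on below. This is a toy illustration only: the `transform` function and the `CounterActor` and `LoggerActor` classes are invented for the example and are not the API of any system discussed in this chapter.

```python
# Coarse grain (dataflow style): every worker runs the same transformation
# over a partition of the input; only the data differs between workers.
def transform(partition):
    return [record.upper() for record in partition]

partitions = [["a", "b"], ["c", "d"]]
results = [transform(p) for p in partitions]  # a scheduler would farm these out to machines
print(results)  # [['A', 'B'], ['C', 'D']]


# Fine grain (actor style): each process carries its own logic and state,
# and interacts with the rest of the system only through messages.
class CounterActor:
    def __init__(self):
        self.count = 0

    def receive(self, message):
        if message == "incr":
            self.count += 1


class LoggerActor:
    def __init__(self):
        self.log = []

    def receive(self, message):
        self.log.append(message)


counter, logger = CounterActor(), LoggerActor()
for msg in ("incr", "incr", "reset"):
    counter.receive(msg)  # in a real system these would be asynchronous sends
    logger.receive(msg)
print(counter.count, logger.log)  # 2 ['incr', 'incr', 'reset']
```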
+ +The actor and DSM models can be considered fine grain because they can be used to define the logic and states of individual workers. +Under the DSM model, an application may be composed of many separately compiled programs. +Each of these programs captures a portion of the logic, communication being done through shared memory regions. +Under the actor model, actors or objects may be used to wrap separate logic. +These units of logic communicate through RPC or message-passing. +The benefit to using the actor or DSM model is the ability to wrap unique, cohesive logic in modules. +Unfortunately, this means that the application or library developer is responsible for handling problems such as scaling and process location. + +The dataflow model can be considered course grain because the logic of every worker is defined by a single module. +A program is written as a single set of logic that transforms a dataset. +Workers proceed by receiving a partition of the data and the program, and executing the transformation. +Crucially, each worker operates under the same logic. + +To borrow an idea from traditional parallel programming, the actor/DSM model implements a *multiple instruction multiple data* (MIMD) architecture, whereas the dataflow model implements a *single instruction multiple data* (SIMD) architecture. ### Level of Abstraction -- cgit v1.2.3 From 663da1818525d97e294fb75711bc5bd15704b5cb Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Wed, 14 Dec 2016 19:23:53 -0500 Subject: . --- chapter/4/dist-langs.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 53320fc..9e78e24 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -427,8 +427,7 @@ These units of logic communicate through RPC or message-passing. The benefit to using the actor or DSM model is the ability to wrap unique, cohesive logic in modules. Unfortunately, this means that the application or library developer is responsible for handling problems such as scaling and process location. -The dataflow model can be considered course grain because the logic of every worker is defined by a single module. -A program is written as a single set of logic that transforms a dataset. +The dataflow model can be considered course grain because the logic of every worker is defined by a single set of instructions. Workers proceed by receiving a partition of the data and the program, and executing the transformation. Crucially, each worker operates under the same logic. -- cgit v1.2.3 From 10e7b81dad9ec6e10d5417b5d0977b65583f1d7a Mon Sep 17 00:00:00 2001 From: Connor Zanin Date: Wed, 14 Dec 2016 20:23:40 -0500 Subject: ... --- chapter/4/dist-langs.md | 41 ++++++++++++++++++++++++++++++++--------- 1 file changed, 32 insertions(+), 9 deletions(-) (limited to 'chapter/4') diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md index 9e78e24..edc98aa 100644 --- a/chapter/4/dist-langs.md +++ b/chapter/4/dist-langs.md @@ -435,9 +435,24 @@ To borrow an idea from traditional parallel programming, the actor/DSM model imp ### Level of Abstraction - * location - * scalability - * fault tolerance +Each of the examples of systems for distributed computing offer different levels of abstraction from problems like partial-failure. +Depending on the requirements of the application, it may be sufficient to let the system handle these problems. +In other cases, it may be necessary to be able to define custom logic. 
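As a small, hypothetical sketch of what "defining custom logic" can mean in practice (the `flaky_remote_call` and `call_with_retry` functions are invented for illustration and do not belong to any of the systems above): an application that does not want to rely on the system's built-in fault handling might wrap a remote call in its own timeout-and-retry policy, deciding for itself when the call has "failed." Under a dataflow system, the same concern would instead be left to the framework's automatic re-execution.

```python
import random
import time


class RemoteCallFailed(Exception):
    pass


def flaky_remote_call(x):
    """Stand-in for a remote request that sometimes gets no (timely) response."""
    if random.random() < 0.3:
        raise RemoteCallFailed("no response before timeout")
    return x * 2


def call_with_retry(fn, arg, attempts=5, backoff=0.1):
    """Application-level fault handling: the developer decides when a call has failed."""
    for i in range(attempts):
        try:
            return fn(arg)
        except RemoteCallFailed:
            time.sleep(backoff * (2 ** i))  # back off, then try again
    raise RemoteCallFailed(f"gave up after {attempts} attempts")


print(call_with_retry(flaky_remote_call, 21))  # usually prints 42
```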
+
+In some systems, like Emerald, it is possible to specify the location of processes.
+When the communication patterns of the application are known, this can allow for optimizations on a per-application basis.
+In other systems, the runtime system handles the resource allocation of processes.
+While this may ease development, the developer must trust the system to make the decision.
+
+The actor and DSM models require the application developer to create and destroy individual processes.
+This means that either the application must contain logic for scaling, or a library must be developed to handle scaling.
+Systems that implement the dataflow model handle scaling automatically.
+An exception to this rule is Orleans, which follows the actor model but handles scaling automatically.
+
+Finally, fault tolerance is abstracted to varying degrees in these systems.
+Argus exposes fault tolerance to the programmer; object data members are labeled as volatile or stable.
+Periodically, these stable members are serialized to disk and can be used for object reconstruction.
+Other systems, especially those based on dataflow, fully abstract the problem of partial failure.

## Thoughts on System Design
--
cgit v1.2.3


From c55c6399265a376f726efcd138e79431bc069575 Mon Sep 17 00:00:00 2001
From: Connor Zanin
Date: Wed, 14 Dec 2016 20:25:36 -0500
Subject: .

---
 chapter/4/dist-langs.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'chapter/4')

diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md
index edc98aa..4951526 100644
--- a/chapter/4/dist-langs.md
+++ b/chapter/4/dist-langs.md
@@ -414,7 +414,7 @@ The library is used to ease programmer burden and supply domain-specific syntax.
The pros and cons of different implementation strategies are discussed further under *DSL's as Libraries*.

-### Granularity
+### Granularity of Logic and State

The granularity of logic and state is another major characteristic of these systems.
Generally, the actor and DSM models can be considered fine grain while the dataflow model is coarse grain.
--
cgit v1.2.3


From de49a518d27f60bf560044aa051d3067b211000c Mon Sep 17 00:00:00 2001
From: Connor Zanin
Date: Fri, 16 Dec 2016 16:45:31 -0500
Subject: added references, checked compilation

---
 chapter/4/dist-langs.md | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

(limited to 'chapter/4')

diff --git a/chapter/4/dist-langs.md b/chapter/4/dist-langs.md
index 4951526..9a86a0b 100644
--- a/chapter/4/dist-langs.md
+++ b/chapter/4/dist-langs.md
@@ -13,7 +13,7 @@ In this section, we present an overview of these three problems and their impact

### Partial Failure

-In the case of a crash on a local environment, either the machine has failed (total failure), or the source of the crash can be learned from a central resource manager such as the operating system. (// TODO cite "a note on dist. comp.)
+In the case of a crash in a local environment, either the machine has failed (total failure), or the source of the crash can be learned from a central resource manager such as the operating system {% cite waldo1997 --file dist-langs.bib %}.
If an application consists of multiple communicating processes, partial failure is possible; however, because the cause of the partial failure can be determined, this kind of partial failure can be repaired given the operating system's knowledge.
For example, a process can be restored based on a checkpoint, another process in the application can query the operating system about the failed process' state, etc.

@@ -356,8 +356,9 @@ Then, the map function transforms lines of the text file into key-value pairs in
These intermediate pairs are aggregated by key: the word.
In the reduce phase, the list of 1's is summed to compute a wordcount for each word.
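A minimal, single-machine sketch of the word-count job just described may make the data flow easier to follow. The `map_fn` and `reduce_fn` below are illustrative stand-ins for the user-supplied map and reduce functions, and the driver simply simulates, in one process, the shuffle and reduce steps that a real MapReduce system would distribute across workers.

```python
from collections import defaultdict


def map_fn(line):
    """Map phase: emit a (word, 1) pair for every word in a line of text."""
    for word in line.split():
        yield (word, 1)


def reduce_fn(word, counts):
    """Reduce phase: sum the list of 1's collected for a single word."""
    return (word, sum(counts))


def wordcount(lines):
    # Shuffle step: group the intermediate pairs by key (the word).
    groups = defaultdict(list)
    for line in lines:
        for word, one in map_fn(line):
            groups[word].append(one)
    # Reduce step: collapse each group to a single count.
    return dict(reduce_fn(word, counts) for word, counts in groups.items())


lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(wordcount(lines))  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```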
-![Alt text] (./MR.png "MapReduce Workflow")
-(http://www.milanor.net/blog/an-example-of-mapreduce-with-rmr2/)
+<img src="MR.png" alt="A Sample MapReduce Program" />
#### Discretized Streams (2012) @@ -381,9 +382,9 @@ Upstream backup is slow as the system must wait for a backup node to recompute e #### GraphX (2013) Many real world problems are expressed using graphs. -GraphX is a system built on top of the Spark MapReduce framework { // TODO cite RDD } that exposes traditional graph operations while internally representing a graph as a collection of RDD's. +GraphX is a system built on top of the Spark MapReduce framework {% cite zaharia2012resilient --file dist-langs.bib %} that exposes traditional graph operations while internally representing a graph as a collection of RDD's. GraphX exposes these operations through what it calls a Resilient Distributed Graph (RDG). -Internally, an RDG is a collection of RDD's that define a vertex split of a graph { // TODO CITE powergraph }. +Internally, an RDG is a collection of RDD's that define a vertex split of a graph {% cite gonzalez2012powergraph --file dist-langs.bib %}. Because they are built on top of RDD's, RDG's inherit immutability. When a transformation is performed, a new graph is created. In this way, fault tolerance in GraphX can be executed the same way as it is in vanilla Spark; when a fault happens, the series of computations is remembered and re-executed. @@ -460,11 +461,11 @@ Other systems, especially those based on dataflow, fully abstract the problem of The definition of a domain-specific language is a hot topic and there have been several attempts to concretely define what exactly *it* is. -Here is the definition as given by { // TODO cite when and how }: +Here is the definition as given by {% cite Mernik2005 --file dist-langs.bib %}: > Domain-specific languages are languages tailored to a specific application domain. -Another definition is offered (and commonly cited) by { // TODO cite annotated bib }: +Another definition is offered (and commonly cited) by {% cite Deursen2000 --file dist-langs.bib %}: > A domain-specific language is a programming language or executable specification language that offers, through appropriate notations and abstractions, expressive power focused on, and usually restricted to, a particular problem domain. -- cgit v1.2.3