From 78e594ef879f70f7e41b232c7720376f7c5eb0bf Mon Sep 17 00:00:00 2001 From: Nat Dempkowski Date: Sun, 15 Oct 2017 18:32:11 -0400 Subject: Move from inlined HTML to Markdown code blocks --- chapter/6/counters.md | 267 +++++++++++++++++++++++++------------------------- 1 file changed, 133 insertions(+), 134 deletions(-) (limited to 'chapter/6') diff --git a/chapter/6/counters.md b/chapter/6/counters.md index 67e822a..23dccde 100644 --- a/chapter/6/counters.md +++ b/chapter/6/counters.md @@ -6,29 +6,28 @@ by: "Aviral Goel" Counters are replicated integers. They are the most basic distributed object. This chapter describes the different variations of counter CRDT in both state based and operation based form. -## 1. G-counter - Increment only counter +## G-counter - Increment only counter As the name suggests, these counters only support increment operation. They can be used to implement the `like` button functionality of social media websites. -### 1.1. CmRDT: Operation based design +### CmRDT: Operation based design In the operation based implementation, the increment operation is transmitted to all other replicas. This is straightforward to implement as there is only one update operation. -
class Counter(CmRDT):
+```python
+class Counter(CmRDT):
+    def __init__(self):
+        self._count = 0
 
-    def __init__(self):         # constructor function
-        self._count = 0
+    def value(self):
+        return self._count
 
-    def value(self):            # query function
-        return self._count
-
-    def increment(self):        # update function
-        self._count += 1
-        for replica in self.replicas():
-            self.transmit("increment", replica)
-
-
+    def increment(self):
+        self._count += 1
+        for replica in self.replicas():
+            self.transmit("increment", replica)
+```

Let's try to understand how it works through an example. The figure below shows an execution trace of three replicas conforming to this specification. In accordance with the specification, each increment request increments the counter locally by one unit and the operation is transmitted by the replica to all other replicas of the system.

@@ -44,7 +43,7 @@ We can make the following observations:

* **Concurrent operations** - A replica may have to handle concurrent operations. For example, at *t19*, c1 encounters two increment operations. Since it's the same operation, there is only one way to handle the situation. There is no conflict.

-### 1.2. CvRDT: State based design
+### CvRDT: State based design

In the state based implementation, the counter state is transmitted to all other replicas. But how do we model the state? Of course, the counter's count is its state.

@@ -52,24 +51,23 @@ Since the count always increases, modeling the state as count automatically make

The code below provides the specification of this counter.

-
class Counter(CvRDT):
-
-    def __init__(self, count = 0):  # constructor function
-        self._count = count
+```python
+class Counter(CvRDT):
+    def __init__(self, count = 0):
+        self._count = count
 
-    def value(self):                # query function
-        return self._count
+    def value(self):
+        return self._count
 
-    def increment(self):            # update function
-        self._count += 1
+    def increment(self):
+        self._count += 1
 
-    def compare(self, other):       # comparison function
-        return self.value() <= other.value()
+    def compare(self, other):
+        return self.value() <= other.value()
 
-    def merge(self, other):         # merge function
-        return Counter(max(self.value(), other.value()))
-
-
+    def merge(self, other):
+        return Counter(max(self.value(), other.value()))
+```

Let's try to understand how it works through an example. The figure below shows an execution trace of three replicas conforming to this specification. The replicas keep transmitting their state at random times to other randomly chosen replicas.

@@ -92,33 +90,33 @@ Let's observe the problem again. Our merge method only returns the state of that

The specification below shows how this can be implemented.

-
class Counter(CvRDT):
-
-    def __init__(self, counts = None):  # constructor function
-        if counts is None:
-            self._counts = [0] * length(self.replicas())
-        else:
-            self._counts = counts
+```python
+class Counter(CvRDT):
+    def __init__(self, counts = None):
+        if counts is None:
+            self._counts = [0] * length(self.replicas())
+        else:
+            self._counts = counts
 
-    def value(self):                    # query function
-        return sum(self._counts)
+    def value(self):
+        return sum(self._counts)
 
-    def counts(self):                   # query function
-        return list(self._counts)       # return a clone
+    def counts(self):
+        # return a copy of the counts
+        return list(self._counts)
 
-    def increment(self):                # update function
-        self._counts[self.replicaId()] += 1
+    def increment(self):
+        self._counts[self.replicaId()] += 1
 
-    def compare(self, other):           # comparison function
-        return all(v1 <= v2 for (v1, v2) in
-                   zip(self.counts(),
+    def compare(self, other):
+        return all(v1 <= v2 for (v1, v2) in
+                   zip(self.counts(),
                        other.counts()))
 
-    def merge(self, other):             # merge function
-        return Counter(map(max, zip(self.counts(),
+    def merge(self, other):
+        return Counter(map(max, zip(self.counts(),
                                     other.counts())))
-
-
+``` The figure below shows an execution trace of three replicas confirming to this specification. @@ -128,34 +126,32 @@ The figure below shows an execution trace of three replicas confirming to this s This design converges to the correct value. This provides us an eventually consistent state based increment only counter. -## 2. PN-counter - Increment and Decrement counter +## PN-counter - Increment and Decrement counter A PN-counter can be incremented and decremented. These can serve as a general purpose counters, as they also provide a decrement operation. For example - counting the number of users active one a social media website at any point. Note that the users can go offline and the counter will have to decremented. This can't be done with an increment only counter. -### 2.1. CmRDT: Operation based design +### CmRDT: Operation based design The code below provides the specification of an operation based increment and decrement counter. -
-
-class Counter(CmRDT):
+```python
+class Counter(CmRDT):
+    def __init__(self):
+        self._count = 0
 
-    def __init__(self):         # constructor function
-        self._count = 0
+    def value(self):
+        return self._count
 
-    def value(self):            # query function
-        return self._count
+    def increment(self):
+        self._count += 1
+        for replica in self.replicas():
+            self.transmit("increment", replica)
 
-    def increment(self):        # update function
-        self._count += 1
-        for replica in self.replicas():
-            self.transmit("increment", replica)
-
-    def decrement(self):        # update function
-        self._count -= 1
-        for replica in self.replicas():
-            self.transmit("decrement", replica)
-
-
+    def decrement(self):
+        self._count -= 1
+        for replica in self.replicas():
+            self.transmit("decrement", replica)
+```

Let's try to understand how it works through an example. The figure below shows an execution trace of three replicas conforming to this specification. In accordance with the specification, each increment request increments the counter locally by one unit and each decrement request decrements the counter locally by one unit. The corresponding operation is transmitted by the replica to all other replicas of the system.

@@ -177,40 +173,41 @@ We can make the the following observations:

In both cases the result is the same because the two operations commute, i.e. - the order in which they are executed by the replicas does not matter. Both replicas perform them in different orders and still achieve consistency, eventually.

-### 2.2. CvRDT: State based design
+### CvRDT: State based design

The code below provides the specification of a state based increment and decrement counter. We take inspiration from the design of the state based increment only counter and model the state of this counter in the same way.

-
class Counter(CvRDT):
+```python
+class Counter(CvRDT):
 
-    def __init__(self, counts = None):  # constructor function
-        if counts is None:
-            self._counts = [0] * length(self.replicas())
-        else:
-            self._counts = counts
+    def __init__(self, counts = None):
+        if counts is None:
+            self._counts = [0] * length(self.replicas())
+        else:
+            self._counts = counts
 
-    def value(self):                    # query function
-        return sum(self._counts)
+    def value(self):
+        return sum(self._counts)
 
-    def counts(self):                   # query function
-        return list(self._counts)       # return a clone
+    def counts(self):
+        # return a copy of the counts
+        return list(self._counts)
 
-    def increment(self):                # update function
-        self._counts[self.replicaId()] += 1
+    def increment(self):
+        self._counts[self.replicaId()] += 1
 
-    def decrement(self):                # update function
-        self._counts[self.replicaId()] -= 1
+    def decrement(self):
+        self._counts[self.replicaId()] -= 1
 
-    def compare(self, other):           # comparison function
-        return all(v1 <= v2 for (v1, v2) in
-                   zip(self.counts(),
+    def compare(self, other):
+        return all(v1 <= v2 for (v1, v2) in
+                   zip(self.counts(),
                        other.counts()))
 
-    def merge(self, other):             # merge function
-        return Counter(map(max, zip(self.counts(),
+    def merge(self, other):
+        return Counter(map(max, zip(self.counts(),
                                     other.counts())))
-
-
+``` Let's try to understand how it works through an example. The figure below shows an execution trace of three replicas confirming to this specification. The replicas keep transmitting their state at random times to other randomly chosen replicas. @@ -246,52 +243,54 @@ But we do gain two valuable insights from this design- Let's try to correct this problem. We need a way to count the decrement operations without losing monotonicity. One solution is to model this counter using two increment only counters. The first counter counts the increment operations and the second one counts the decrement operations. The value of the actual counter is the difference between the two corresponding counters. The specification below shows how this can be implemented. -
class Counter(CvRDT):
-
-    def __init__(self,
-                 increments = None,
-                 decrements = None):   # constructor function
-        if increments is None:
-            self._increments = [0] * length(replicas())
-        else:
-            self._increments = increments
-        if decrements is None:
-            self._decrements = [0] * length(replicas())
-        else:
-            self._decrements = decrements
-
-    def increments(self):               # query function
-        return list(self._increments)   # return a clone
-
-    def decrements(self):               # query function
-        return list(self._decrements)   # return a clone
-
-    def value(self):                    # query function
-        return (sum(self.increments()) -
-                sum(self.decrements()))
-
-    def increment(self):                # update function
-        self._increments[self.replicaId()] += 1
-
-    def decrement(self):                # update function
-        self._decrements[self.replicaId()] += 1
-
-    def compare(self, other):           # comparison function
-        return (all(v1 <= v2 for (v1, v2) in
-                    zip(self.increments(),
+```python
+class Counter(CvRDT):
+    def __init__(self,
+                 increments = None,
+                 decrements = None):
+        if increments is None:
+            self._increments = [0] * length(replicas())
+        else:
+            self._increments = increments
+        if decrements is None:
+            self._decrements = [0] * length(replicas())
+        else:
+            self._decrements = decrements
+
+    def increments(self):
+        # return a copy of the increments
+        return list(self._increments)
+
+    def decrements(self):
+        # return a copy of the decrements
+        return list(self._decrements)
+
+    def value(self):
+        return (sum(self.increments()) -
+                sum(self.decrements()))
+
+    def increment(self):
+        self._increments[self.replicaId()] += 1
+
+    def decrement(self):
+        self._decrements[self.replicaId()] += 1
+
+    def compare(self, other):
+        return (all(v1 <= v2 for (v1, v2) in
+                    zip(self.increments(),
                         other.increments()))
-                and
-                all(v1 <= v2 for (v1, v2) in
-                    zip(self.decrements(),
+                and
+                all(v1 <= v2 for (v1, v2) in
+                    zip(self.decrements(),
                         other.decrements())))
 
-    def merge(self, other):             # merge function
-        return Counter(increments = map(max, zip(self.increments(),
-                                                 other.increments())),
-                       decrements = map(max, zip(self.decrements(),
-                                                 other.decrements())))
-
-
+ def merge(self, other): + return Counter( + increments = map(max, zip(self.increments(), + other.increments())), + decrements = map(max, zip(self.decrements(), + other.decrements()))) +``` The figure below shows an execution trace of three replicas confirming to this specification. @@ -305,7 +304,7 @@ This design converges to the correct value. This provides us an eventually consi State based increment and decrement counter(correct) lattice -## 3. References +## References {% bibliography --file counters %} -- cgit v1.2.3 From 08d952d8f5680ff384838f4eef20fa17c561f4c3 Mon Sep 17 00:00:00 2001 From: Nat Dempkowski Date: Sun, 15 Oct 2017 18:55:46 -0400 Subject: Make minor fixes to the CAP, Consistency, & CRDTs chapter --- ...dic-to-basic-how-the-database-ph-has-changed.md | 96 +++++++++++++--------- chapter/6/being-consistent.md | 34 +++++--- 2 files changed, 77 insertions(+), 53 deletions(-) (limited to 'chapter/6') diff --git a/chapter/6/acidic-to-basic-how-the-database-ph-has-changed.md b/chapter/6/acidic-to-basic-how-the-database-ph-has-changed.md index c140ee0..9515384 100644 --- a/chapter/6/acidic-to-basic-how-the-database-ph-has-changed.md +++ b/chapter/6/acidic-to-basic-how-the-database-ph-has-changed.md @@ -4,7 +4,7 @@ title: "ACIDic to BASEic: How the database pH has changed" by: "Aviral Goel" --- -## 1. The **ACID**ic Database Systems +## The **ACID**ic Database Systems Relational Database Management Systems are the most ubiquitous database systems for persisting state. Their properties are defined in terms of transactions on their data. A database transaction can be either a single operation or a sequence of operations, but is treated as a single logical operation on the data by the database. The properties of these transactions provide certain guarantees to the application developer. The acronym **ACID** was coined by Andreas Reuter and Theo Härder in 1983 to describe them. @@ -16,19 +16,21 @@ Relational Database Management Systems are the most ubiquitous database systems * **Durability** guarantees that upon the completion of a transaction, the effects are applied permanently on the database and cannot be undone. They remain visible even in the event of power failures or crashes. This is done by ensuring that the changes are committed to disk (non-volatile memory). -

ACIDity implies that if a transaction is complete, the database state is structurally consistent (adhering to the rules of the schema) and stored on disk to prevent any loss.

+
+

ACIDity implies that if a transaction is complete, the database state is structurally consistent (adhering to the rules of the schema) and stored on disk to prevent any loss.

+
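To make the guarantees just summarized concrete from the application side, here is a minimal illustrative sketch, not taken from this chapter, of what a single ACID transaction looks like in client code, using Python's built-in `sqlite3` module; the table, account names, and amounts are invented for the example.

```python
# Illustrative only: a money transfer as one transaction. Either both UPDATEs
# take effect and are durably committed, or neither does.
import sqlite3

conn = sqlite3.connect("bank.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()        # durability: the transfer survives crashes after this point
except sqlite3.Error:
    conn.rollback()      # atomicity: a failure undoes the partial update
```

The same pair of updates spread across two databases is exactly the situation that forces coordination protocols such as two-phase commit, which the scaling discussion below runs into.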
Because of the strong guarantees this model simplifies the life of the developer and has been traditionally the go to approach in application development. It is instructive to examine how these properties are enforced. -Single node databases can simply rely upon locking to ensure *ACID*ity. Each transaction marks the data it operates upon, thus enabling the database to block other concurrent transactions from modifying the same data. The lock has to be acquired both while reading and writing data. The locking mechanism enforces a strict linearizable consistency, i.e., all transactions are performed in a particular sequence and invariants are always maintained by them. An alternative, *multiversioning* allows a read and write operation to execute in parallel. Each transaction which reads data from the database is provided the earlier unmodified version of the data that is being modified by a write operation. This means that read operations don't have to acquire locks on the database. This enables read operations to execute without blocking write operations and write operations to execute without blocking read operations. +Single node databases can simply rely upon locking to ensure *ACID*ity. Each transaction marks the data it operates upon, thus enabling the database to block other concurrent transactions from modifying the same data. The lock has to be acquired both while reading and writing data. The locking mechanism enforces a strict linearizable consistency, i.e. all transactions are performed in a particular sequence and invariants are always maintained by them. An alternative, *multiversioning* allows a read and write operation to execute in parallel. Each transaction which reads data from the database is provided the earlier unmodified version of the data that is being modified by a write operation. This means that read operations don't have to acquire locks on the database. This enables read operations to execute without blocking write operations and write operations to execute without blocking read operations. This model works well on a single node. But it exposes a serious limitation when too many concurrent transactions are performed. A single node database server will only be able to process so many concurrent read operations. The situation worsens when many concurrent write operations are performed. To guarantee *ACID*ity, the write operations will be performed in sequence. The last write request will have to wait for an arbitrary amount of time, a totally unacceptable situation for many real time systems. This requires the application developer to decide on a **Scaling** strategy. -### 1.2. Scaling transaction volume +### Scaling transaction volume -To increase the volume of transactions against a database, two scaling strategies can be considered +To increase the volume of transactions against a database, two scaling strategies can be considered: -* **Vertical Scaling** is the easiest approach to scale a relational database. The database is simply moved to a larger computer which provides more transactional capacity. Unfortunately, its far too easy to outgrow the capacity of the largest system available and it is costly to purchase a bigger system each time that happens. Since its specialized hardware, vendor lock-in will add to further costs. +* **Vertical Scaling** is the easiest approach to scale a relational database. The database is simply moved to a larger computer which provides more transactional capacity. 
Unfortunately, its far too easy to outgrow the capacity of the largest system available and it is costly to purchase a bigger system each time that happens. * **Horizontal Scaling** is a more viable option and can be implemented in two ways. Data can be segregated into functional groups spread across databases. This is called *Functional Scaling*. Data within a functional group can be further split across multiple databases, enabling functional areas to be scaled independently of one another for even more transactional capacity. This is called *sharding*. @@ -40,59 +42,58 @@ Horizontal Scaling through functional partitioning enables high degree of scalab 2PC is a blocking protocol and updates can take from a few milliseconds up to a few minutes to commit. This means that while a transaction is being processed, other transactions will be blocked. So the application that initiated the transaction will be blocked. Another option is to handle the consistency across databases at the application level. This only complicates the situation for the application developer who is likely to implement a similar strategy if *ACID*ity is to be maintained. -## 2. The Distributed Concoction +## The Distributed Concoction A distributed application is expected to have the following three desirable properties: 1. **Consistency** - This is the guarantee of total ordering of all operations on a data object such that each operation appears indivisible. This means that any read operation must return the most recently written value. This provides a very convenient invariant to the client application. This definition of consistency is the same as the **Atomic**ity guarantee provided by relational database transactions. -2. **Availability** - Every request to a distributed system must result in a response. However, this is too vague a definition. Whether a node failed in the process of responding or it ran a really long computation to generate a response or whether the request or the response got lost due to network issues is generally impossible to determine by the client and willHence, for all practical purposes, availability can be defined as the service responding to a request in a timely fashion, the amount of delay an application can bear depends on the application domain. +2. **Availability** - Every request to a distributed system must result in a response. However, this is too vague a definition. Determining whether a node failed in the process of responding, ran a really long computation to generate a response, or whether the request or response got lost due to network issues is generally impossible from the client's perspective. This problem means that for all practical purposes, availability can be defined as the service responding to a request in a timely fashion, where the amount of delay an application can bear depends on the application domain. 3. **Partition Tolerance** - Partitioning is the loss of messages between the nodes of a distributed system. During a network partition, the system can lose arbitrary number of messages between nodes. A partition tolerant system will always respond correctly unless a total network failure happens. -Consistency requirement implies that every request will be treated atomically by the system even if the nodes lose messages due to network partitions. -Availability requirement implies that every request should receive a response even if a partition causes messages to be lost arbitrarily. 
+The consistency requirement implies that every request will be treated atomically by the system even if the nodes lose messages due to network partitions. The availability requirement implies that every request should receive a response even if a partition causes messages to be lost arbitrarily. -## 3. The CAP Theorem +## The CAP Theorem
A partitioned network
-In the network above, all messages between the node set M and N are lost due to a network issue. The system as a whole detects this situation. There are two options - +In the network above, all messages between the node set M and N are lost due to a network issue. The system as a whole detects this situation. There are two options: 1. **Availability first** - The system allows any application to read and write to data objects on these nodes independently even though they are not able to communicate. The application writes to a data object on node M. Due to **network partition**, this change is not propagated to replicas of the data object in N. Subsequently, the application tries to read the value of that data object and the read operation executes in one of the nodes of N. The read operation returns the older value of the data object, thus making the application state not **consistent**. -2. **Consistency first** - The system does not allow any application to write to data objects as it cannot ensure **consistency** of replica states. This means that the system is perceived to be **unavailable** by the applications. +2. **Consistency first** - The system does not allow any application to write to data objects as it cannot ensure **consistency** of replica states. This means that the system is perceived to be **unavailable** by the applications. -If there are no partitions, clearly both consistency and availability can be guaranteed by the system. This observation led Eric Brewer to conjecture in an invited talk at PODC 2000- +If there are no partitions, clearly both consistency and availability can be guaranteed by the system. This observation led Eric Brewer to conjecture the CAP Theorem in an invited talk at PODC 2000. The CAP Theorem states: -
It is impossible for a web service to provide the following three guarantees: -Consistency -Availability -Partition Tolerance
+
+It is impossible for a web service to provide the following three guarantees: Consistency, Availability, and Partition Tolerance +
-This is called the CAP theorem. +It is clear that the prime culprit here is network partitions. If there are no network partitions, any distributed service will be both highly available and provide strong consistency of shared data objects. Unfortunately, network partitions cannot be remedied in a distributed system. -It is clear that the prime culprit here is network partition. If there are no network partitions, any distributed service will be both highly available and provide strong consistency of shared data objects. Unfortunately, network partitions cannot be remedied in a distributed system. - -## 4. Two of Three - Exploring the CAP Theorem +## Two of Three - Exploring the CAP Theorem The CAP theorem dictates that the three desirable properties, consistency, availability and partition tolerance cannot be offered simultaneously. Let's study if its possible to achieve two of these three properties. ### Consistency and Availability -If there are no network partitions, then there is no loss of messages and all requests receive a response within the stipulated time. It is clearly possible to achieve both consistency and availability. Distributed systems over intranet are an example of such systems. + +If there are no network partitions, then there is no loss of messages and all requests receive a response within the stipulated time. It is clearly possible to achieve both consistency and availability. Distributed systems over an intranet are an example of such systems. ### Consistency and Partition Tolerance + Without availability, both of these properties can be achieved easily. A centralized system can provide these guarantees. The state of the application is maintained on a single designated node. All updates from the client are forwarded by the nodes to this designated node. It updates the state and sends the response. When a failure happens, then the system does not respond and is perceived as unavailable by the client. Distributed locking algorithms in databases also provide these guarantees. ### Availability and Partition Tolerance + Without atomic consistency, it is very easy to achieve availability even in the face of partitions. Even if nodes fail to communicate with each other, they can individually handle query and update requests issued by the client. The same data object will have different states on different nodes as the nodes progress independently. This weak consistency model is exhibited by web caches. -Its clear that two of these three properties are easy to achieve in any distributed system. Since large scale distributed systems have to take partitions into account, will they have to sacrifice availability for consistency or consistency for availability? Clearly giving up either consistency or availability is too big a sacrifice. +Its clear that two of these three properties are easy to achieve in any distributed system. Since large scale distributed systems have to take partitions into account, will they have to sacrifice availability for consistency or consistency for availability? Clearly totally giving up either consistency or availability is too big a sacrifice. -## 5. The **BASE**ic distributed state +## The **BASE**ic distributed state When viewed through the lens of CAP theorem and its consequences on distributed application design, we realize that we cannot commit to perfect availability and strong consistency. But surely we can explore the middle ground. We can guarantee availability most of the time with occasional inconsistent view of the data. 
The consistency is eventually achieved when the communication between the nodes resumes. This leads to the following properties of the current distributed applications, referred to by the acronym BASE. @@ -102,41 +103,53 @@ When viewed through the lens of CAP theorem and its consequences on distributed * **Eventually Consistent** services try to make application state consistent whenever possible. -## 6. Partitions and latency -Any large scale distributed system has to deal with latency issue. In fact, network partitions and latency are fundamentally related. Once a request is made and no response is received within some duration, the sender node has to assume that a partition has happened. The sender node can take one of the following steps: +## Partitions and latency + +Any large scale distributed system has to deal with latency issues. In fact, network partitions and latency are fundamentally related. Once a request is made and no response is received within some duration, the sender node has to assume that a partition has happened. The sender node can take one of the following steps: -1) Cancel the operation as a whole. In doing so, the system is choosing consistency over availability. -2) Proceed with the rest of the operation. This can lead to inconsistency but makes the system highly available. -3) Retry the operation until it succeeds. This means that the system is trying to ensure consistency and reducing availability. +* Cancel the operation as a whole. In doing so, the system is choosing consistency over availability. +* Proceed with the rest of the operation. This can lead to inconsistency but makes the system highly available. +* Retry the operation until it succeeds. This means that the system is trying to ensure consistency and reducing availability. Essentially, a partition is an upper bound on the time spent waiting for a response. Whenever this upper bound is exceeded, the system chooses C over A or A over C. Also, the partition may be perceived only by two nodes of a system as opposed to all of them. This means that partitions are a local occurrence. -## 7. Handling Partitions +## Handling Partitions + Once a partition has happened, it has to be handled explicitly. The designer has to decide which operations will be functional during partitions. The partitioned nodes will continue their attempts at communication. When the nodes are able to establish communication, the system has to take steps to recover from the partitions. -### 7.1. Partition mode functionality +### Partition mode functionality + When at least one side of the system has entered into partition mode, the system has to decide which functionality to support. Deciding this depends on the invariants that the system must maintain. Depending on the nature of problem, the designer may choose to compromise on certain invariants by allowing partitioned system to provide functionality which might violate them. This means the designer is choosing availability over consistency. Certain invariants may have to be maintained and operations that will violate them will either have to be modified or prohibited. This means the designer is choosing consistency over availability. + Deciding which operations to prohibit, modify or delay also depends on other factors such as the node. If the data is stored on the same node, then operations on that data can typically proceed on that node but not on other node. 
-In any event, the bottomline is that if the designer wishes for the system to be available, certain operations have to be allowed. The node has to maintain a history of these operations so that it can be merged with the rest of the system when it is able to reconnect. -Since the operations can happen simultaneously on multiple disconnected nodes, all sides will maintain this history. One way to maintain this information is through version vectors. + +In any event, the bottomline is that if the designer wishes for the system to be available, certain operations have to be allowed. The node has to maintain a history of these operations so that it can be merged with the rest of the system when it is able to reconnect. Since the operations can happen simultaneously on multiple disconnected nodes, all sides will maintain this history. One way to maintain this information is through version vectors. + Another interesting problem is to communicate the progress of these operations to the user. Until the system gets out of partition mode, the operations cannot be committed completely. Till then, the user interface has to faithfully represent their incomplete or in-progress status to the user. -### 7.2. Partition Recovery -When the partitioned nodes are able to communicate, they have to exchange information to maintain consistency. Both sides continued in their independent direction but now the delayed operations on either side have to be performed and violated invariants have to be fixed. Given the state and history of both sides, the system has to accomplish the following tasks. +### Partition Recovery + +When the partitioned nodes are able to communicate, they have to exchange information to maintain consistency. During the partition, both sides continued processing independently, but now the delayed operations on either side have to be performed and violated invariants have to be fixed. Given the state and history of both sides, the system has to accomplish the following tasks. + +#### Consistency + +During recovery, the system has to reconcile the inconsistency in state of both nodes. This is relatively straightforward to accomplish. One approach is to start from the state at the time of partition and apply operations of both sides in an appropriate manner, ensuring that the invariants are maintained. Depending on operations allowed during the partition phase, this process may or may not be possible. The general problem of conflict resolution is not solvable but a restricted set of operations may ensure that the system can always always merge conflicts. For example, Google Docs limits operations to style and text editing. But source-code control systems such as Concurrent Versioning System (CVS) may encounter conflict which require manual resolution. -#### 7.2.1. Consistency -During recovery, the system has to reconcile the inconsistency in state of both nodes. This is relatively straightforward to accomplish. One approach is to start from the state at the time of partition and apply operations of both sides in an appropriate manner, ensuring that the invariants are maintained. Depending on operations allowed during the partition phase, this process may or may not be possible. The general problem of conflict resolution is not solvable but a restricted set of operations may ensure that the system can always always merge conflicts. For example, Google Docs limits operations to style and text editing. 
But source-code control systems such as Concurrent Versioning System (CVS) may encounter conflict which require manual resolution. Research has been done on techniques for automatic state convergence. Using commutative operations allows the system to sort the operations in a consistent global order and execute them. Though all operations can't be commutative, for example - addition with bounds checking is not commutative. Mark Shapiro and his colleagues at INRIA have developed *commutative replicated data types (CRDTs)* that provably converge as operations are performed. By implementing state through CRDTs, we can ensure Availability and automatic state convergence after partitions. +Research has been done on techniques for automatic state convergence. Using commutative operations allows the system to sort the operations in a consistent global order and execute them. Though all operations can't be commutative, for example - addition with bounds checking is not commutative. Mark Shapiro and his colleagues at INRIA have developed *commutative replicated data types (CRDTs)* that provably converge as operations are performed. By implementing state through CRDTs, we can ensure Availability and automatic state convergence after partitions. + +#### Compensation -#### 7.2.2. Compensation During partition, its possible for both sides to perform a series of actions which are externalized, i.e. their effects are visible outside the system. To compensate for these actions, the partitioned nodes have to maintain a history. + For example, consider a system in which both sides have executed the same order during a partition. During the recovery phase, the system has to detect this and distinguish it from two intentional orders. Once detected, the duplicate order has to be rolled back. If the order has been committed successfully then the problem has been externalized. The user will see twice the amount deducted from his account for a single purchase. Now, the system has to credit the appropriate amount to the user's account and possibly send an email explaining the entire debacle. All this depends on the system maintaining the history during partition. If the history is not present, then duplicate orders cannot be detected and the user will have to catch the mistake and ask for compensation. + It would have been great if the duplicate order was not issued by the system in the first place. But the requirement to maintain system availability trumps consistency. Mistakes in such cases cannot always be corrected internally. But by admitting them and compensating for them, the system arguably exhibits equivalent behavior. -### 8. What's the right pH for my distributed solution? +## What's the right pH for my distributed solution? Whether an application chooses to be an *ACID*ic or *BASE*ic service depends on the domain. An application developer has to consider the consistency-availability tradeoff on a case by case basis. *ACID*ic databases provide a very simple and strong consistency model making application development easy for domains where data inconsistency cannot be tolerated. *BASE*ic systems provide a very loose consistency model, placing more burden on the application developer to understand the invariants and manage them carefully during partitions by appropriately limiting or modifying the operations. -## 9. 
References +## References https://neo4j.com/blog/acid-vs-base-consistency-models-explained/ https://en.wikipedia.org/wiki/Eventual_consistency/ @@ -182,4 +195,5 @@ https://en.wikipedia.org/wiki/Distributed_database https://en.wikipedia.org/wiki/ACID http://searchstorage.techtarget.com/definition/data-availability https://datatechnologytoday.wordpress.com/2013/06/24/defining-database-availability/ + {% bibliography --file rpc %} diff --git a/chapter/6/being-consistent.md b/chapter/6/being-consistent.md index 233d987..bc7a55b 100644 --- a/chapter/6/being-consistent.md +++ b/chapter/6/being-consistent.md @@ -5,6 +5,7 @@ by: "Aviral Goel" --- ## Replication and Consistency + Availability and Consistency are the defining characteristics of any distributed system. As dictated by the CAP theorem, accommodating network partitions requires a trade off between the two properties. Modern day large scale internet based distributed systems have to be highly available. To manage huge volumes of data (big data) and to reduce access latency for geographically diverse user base, their data centers also have to be geographically spread out. Network partitions which would otherwise happen with a low probability on a local network become certain events in such systems. To ensure availability in the event of partitions, these systems have to replicate data objects. This begs the question, how to ensure consistency of these replicas? It turns out there are different notions of consistency which the system can adhere to. * **Strong Consistency** implies linearizability of updates, i.e., all updates applied to a replicated data type are serialized in a global total order. This means that any update will have to be simultaneously applied to all other replicas. Its obvious that this notion of consistency is too restrictive. A single unavailable node will violate this condition. Forcing all updates to happen synchronously will impact system availability negatively. This notion clearly does not fit the requirements of highly available fault tolerant systems. @@ -16,26 +17,27 @@ Most large scale distributed systems try to be **Eventually Consistent** to ensu ## A Distributed Setting ### TODO need to write pseudocode. Will finish this part with the detailed explanation of CRDTs in the next chapter. -Consider a replicated counter. Each node can increment the value of its local copy. The figure below shows three nodes which increment their local copies at arbitrary time points and each replica sends its value asynchronously to the other two replicas. Whenever it recieves the value of its replica, it adds it to its current value. If two values are received concurrently, both will be added together to its current value. So merging replicas in this example becomes trivial. -Let's take a look at another interesting generalization of this. Integer Vector +Consider a replicated counter. Each node can increment the value of its local copy. The figure below shows three nodes which increment their local copies at arbitrary time points and each replica sends its value asynchronously to the other two replicas. Whenever it receives the value of its replica, it adds it to its current value. If two values are received concurrently, both will be added together to its current value. So merging replicas in this example becomes trivial. +Let's take a look at another interesting generalization of this. 
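Since this section is flagged above as still needing pseudocode, here is a rough illustrative sketch, with invented class and helper names rather than the book's final code, of the integer-vector generalization referenced just below: each replica only ever increments its own slot, and an incoming state is merged with a per-index maximum, so states can be exchanged and merged in any order.

```python
# Rough sketch (invented names): an integer-vector replica whose merge is the
# per-index maximum, as described in the text.
class VectorReplica:
    def __init__(self, replica_id, n_replicas):
        self.replica_id = replica_id
        self.vector = [0] * n_replicas

    def increment(self):
        # local update: only this replica's slot ever changes here
        self.vector[self.replica_id] += 1

    def merge(self, remote_vector):
        # per-index maximum: the order of merges does not matter
        self.vector = [max(a, b) for a, b in zip(self.vector, remote_vector)]


# toy exchange: two replicas update independently, then swap states
r0, r1 = VectorReplica(0, 2), VectorReplica(1, 2)
r0.increment(); r0.increment()
r1.increment()
r0.merge(r1.vector)
r1.merge(r0.vector)
assert r0.vector == r1.vector == [2, 1]
```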
Integer Vector We can make an interesting observation from the previous examples: - + __*All distributed data structures don't need conflict resolution*__ This raises the following question: - + __*How can we design a distributed structure such that we don't need conflict resolution?*__ The answer to this question lies in an algebraic structure called the **join semilattice**. ## Join Semilattice + A join-semilattice or upper semilattice is a *partial order* `≤` with a *least upper bound* (LUB) `⊔` for all pairs. -`m = x ⊔ y` is a Least Upper Bound of `{` `x` `,` `y` `}` under `≤` iff `∀m′, x ≤ m′ ∧ y ≤ m′ ⇒ x ≤ m ∧ y ≤ m ∧ m ≤ m′`. +`m = x ⊔ y` is a Least Upper Bound of `{ x , y }` under `≤` iff `∀m′, x ≤ m′ ∧ y ≤ m′ ⇒ x ≤ m ∧ y ≤ m ∧ m ≤ m′`. -`⊔` is: +Here the least upper bound `⊔` has the follow properties: **Associative** @@ -50,31 +52,39 @@ A join-semilattice or upper semilattice is a *partial order* `≤` with a *least `x ⊔ x = x` The examples we saw earlier were of structures that could be modeled as join semilattices. The merge operation for the increment only counter is the summation function and for the integer vector it is the per-index maximum of the vectors being merged. + So, if we can model the state of the data structure as a partially ordered set and design the merge operation to always compute the "larger" of the two states, its replicas will never need consensus. They will always converge as execution proceeds. Such data structures are called CRDTs (Conflict-free Replicated Data Type). But what about consistency of these replicas? ## Strong Eventual Consistency (SEC) + We discussed a notion of consistency, *Eventual Consistency*, in which replicas eventually become consistent if there are no more updates to be merged. But the update operation is left unspecified. Its possible for an update to render the replica in a state that causes it to conflict with a later update. In this case the replica may have to roll back and use consensus to ensure that all replicas do the same to ensure consistency. This is complicated and wasteful. But if replicas are modeled as CRDTs, the updates never conflict. Regardless of the order in which the updates are applied, all replicas will eventually have equivalent state. Note that no conflict arbitration is necessary. This kind of Eventual Consistency is a stronger notion of consistency than the one that requires conflict arbitration and hence is called *Strong Eventual Consistency*. ### Strong Eventual Consistency and CAP Theorem Let's study SEC data objects from the perspective of CAP theorem. -#### 1. Consistency and Network Partition +#### Consistency and Network Partition + Each distributed replica will communicate asynchronously with other reachable replicas. These replicas will eventually converge to the same value. There is no consistency guarantee on the value of replicas not reachable due to network conditions and hence this condition is strictly weaker than strong consistency. But as soon as those replicas can be reached, they will also converge in a self-stabilizing manner. -#### 2. Availability and Network Partition -Each distributed replica will always be available for local reads and writes regardless of network partitions. In fact, if there are n replicas, a single replica will function even if the remaining n - 1 replicas crash simultaneously. This **provides an extreme form of availability**. 
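Tying this back to the join-semilattice laws defined above: for the per-index-maximum join used by the integer-vector example, the three laws can be spot-checked directly. This snippet is illustrative only and not part of the chapter.

```python
# Illustrative spot-check of the semilattice laws for element-wise max.
def join(x, y):
    return [max(a, b) for a, b in zip(x, y)]

x, y, z = [3, 0, 1], [1, 2, 0], [0, 5, 2]

assert join(x, join(y, z)) == join(join(x, y), z)   # associative
assert join(x, y) == join(y, x)                     # commutative
assert join(x, x) == x                              # idempotent
```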
+#### Availability and Network Partition + +Each distributed replica will always be available for local reads and writes regardless of network partitions. In fact, if there are n replicas, a single replica will function even if the remaining n - 1 replicas crash simultaneously. This **provides an extreme form of availability**. SEC facilitates maximum consistency and availability in the event of network partitions by relaxing the requirement of global consistency. Note that this is achieved by virtue of modeling the data objects as join semilattices. #### Strong Eventual Consistency and Linearizability -In a distributed setting, a replica has to handle concurrent updates. In addition to its sequential behavior, a CRDT also has to ensure that its concurrent behavior also ensures strong eventual consistency. This makes it possible for CRDTs to exhibit behavior that is simply not possible for sequentially consistent objects. + +In a distributed setting, a replica has to handle concurrent updates. In addition to its sequential behavior, a CRDT also has to ensure that its concurrent behavior also ensures strong eventual consistency. This makes it possible for CRDTs to exhibit behavior that is simply not possible for sequentially consistent objects. + Consider a set CRDT used in a distributed setting. One of the replicas pi executes the sequence `add(a); remove(b)`. Another replica pj executes the sequence `add(b); remove(a)`. Now both send their states asynchronously to another replica pk which has to merge them concurrently. Same element exists in one of the sets and does not exist in the other set. There are multiple choices that the CRDT designer can make. Let's assume that the implementation always prefers inclusion over exclusion. So in this case, pk will include both `a` and `b`. -Now consider a sequential execution of the two sequences on set data structure. The order of execution will be either `add(a); remove(b); add(b); remove(a)` or `add(b); remove(a); add(a); remove(b)`. In both cases one of the elements is excluded. This is different from the state of the CRDT set implementation. -Thus, strong eventually consistent data structures can be sequentially inconsistent. + +Now consider a sequential execution of the two sequences on set data structure. The order of execution will be either `add(a); remove(b); add(b); remove(a)` or `add(b); remove(a); add(a); remove(b)`. In both cases one of the elements is excluded. This is different from the state of the CRDT set implementation. Thus, strong eventually consistent data structures can be sequentially inconsistent. + Similarly, if there are `n` sequentially consistent replicas, then they would need consensus to ensure a single order of execution of operations across all replicas. But if `n - 1` replicas crash, then consensus cannot happen. This makes the idea of sequential consistency incomparable to that of strong eventual consistency. ## What Next? + This chapter introduced Strong Eventual Consistency and the formalism behind CRDTs, join semilattices, which enables CRDTs to exhibit strong eventual consistency. The discussion however does not answer an important question: __*Can all standard data structures be designed as CRDTs?*__ -- cgit v1.2.3