Distributed Systems

A single node is easy to debug and manage. However, a single node suffers from certain problems, such as a lack of separation of concerns, being a single point of failure, and limited computational power. With that in mind, we will look at the following concepts in distributed systems.

List of References

  1. Designing Data-Intensive Applications by Martin Kleppmann
  2. Understanding Distributed Systems by Roberto Vitillo
  3. CAP VS BASE

Important Acronyms

CAP

| Term | Definition |
| --- | --- |
| Consistency | Every read is guaranteed to receive either the most recent write or an error. |
| Availability | Every request is guaranteed a response, but the response is not guaranteed to contain the latest information. |
| Partition Tolerance | Nodes should continue to operate even if the network is arbitrarily partitioned due to network outages. |

BASE

| Term | Definition |
| --- | --- |
| Basically Available | The system guarantees availability. |
| Soft State | The state of the system may change over time, even without input, because of eventual consistency. |
| Eventually Consistent | The system will become consistent over time, provided that it does not receive any further updates. |
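
To make the BASE properties concrete, here is a minimal sketch of a hypothetical last-write-wins register (invented for illustration, not taken from the references above): two replicas temporarily disagree, but converge once they synchronise and no further writes arrive.

```python
import time

class Replica:
    """Toy last-write-wins register: stores a (timestamp, value) pair."""
    def __init__(self):
        self.timestamp = 0.0
        self.value = None

    def write(self, value):
        # A local write is applied immediately; other replicas see it later.
        self.timestamp = time.time()
        self.value = value

    def merge(self, other):
        # Anti-entropy step: keep whichever write is newer (last write wins).
        if other.timestamp > self.timestamp:
            self.timestamp, self.value = other.timestamp, other.value

a, b = Replica(), Replica()
a.write("x=1")     # only replica a has the write so far (soft state)
print(b.value)     # None: a stale read is allowed (basically available)
b.merge(a)         # replicas synchronise once updates stop flowing
print(b.value)     # "x=1": the replicas have converged (eventually consistent)
```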

Replication

Replication means that the same data can be found on different nodes.

We usually replicate data so that there is no single point of failure, the system can withstand high load, and the data is geographically closer to the client.

Replication ensures high availability because there is no single point of failure: if one slave node goes down, we can still read from the other slaves. However, strong consistency is sacrificed, because replication is usually asynchronous, where the master does not wait for the slaves to acknowledge a write before confirming it.
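
As an illustration, here is a minimal sketch of asynchronous master-slave replication (the `Master` and `Slave` classes are toy, in-memory stand-ins, not a real replication protocol): the master confirms a write before the slaves have applied it, so a read from a lagging slave can return stale data.

```python
import random
import threading
import time

class Slave:
    def __init__(self, name):
        self.name, self.log = name, []

    def apply(self, entry):
        time.sleep(random.uniform(0.01, 0.2))  # simulated network delay
        self.log.append(entry)

class Master:
    def __init__(self, slaves):
        self.log, self.slaves = [], slaves

    def write(self, entry):
        self.log.append(entry)  # commit locally first
        # Replicate asynchronously: do not wait for acknowledgements, so a
        # read from a lagging slave may miss this write (replication lag).
        for s in self.slaves:
            threading.Thread(target=s.apply, args=(entry,)).start()

slaves = [Slave("s1"), Slave("s2")]
master = Master(slaves)
master.write("row-42")
print(slaves[0].log)   # probably []: the slave has not caught up yet
time.sleep(0.3)
print(slaves[0].log)   # ['row-42']: the slave has converged
```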

Issues with replication

Troubles with distributed systems

Unreliable Networks

Assuming a shared-nothing architecture, the different nodes do not have access to each other's memory and can communicate only over the network. On an asynchronous network, there are no guarantees about when, or even whether, a message will arrive at the other node.

On the other hand, in a synchronous network such as the telephone network, we set up a circuit and allocate a fixed bandwidth to the nodes. There is no queueing, and the delays are bounded.
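
A small sketch of this uncertainty using a raw UDP datagram (the host and port are hypothetical): the send call returns successfully even though the sender learns nothing about delivery.

```python
import socket

# Fire-and-forget UDP datagram: sendto() returns as soon as the packet is
# handed to the local network stack. There is no acknowledgement, so the
# sender cannot tell whether the packet arrived, was dropped in a queue,
# or whether the receiver crashed. (The address below is made up.)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b"ping", ("10.0.0.42", 9000))
sock.close()
```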

Timeout

It is difficult to choose a suitable timeout duration. If the timeout is too long, we wait a long time before declaring a node dead, so the response time suffers. If the timeout is too short, a node might be falsely declared dead when it is merely slow. This is problematic because the node's responsibilities will be transferred to other nodes, increasing the load on the network and on the remaining nodes. Furthermore, if another node is asked to take over a process that the original node is still executing, the process might end up executing twice.
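
To illustrate the trade-off, here is a minimal sketch of a naive timeout-based failure detector (the `is_alive` helper and its parameters are invented for illustration):

```python
import socket

def is_alive(host, port, timeout_s):
    """Naive failure detector: one TCP connection attempt with a timeout.

    A timeout only proves that no reply arrived in time; the node may still
    be alive but slow, so declaring it dead risks duplicate execution."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers socket.timeout and connection errors
        return False

# A short timeout detects failures quickly but risks false positives on a
# congested network; a long timeout is safer but delays failover.
print(is_alive("example.com", 80, timeout_s=1.0))
```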

Network Congestion

The most frequent cause of delay is network congestion.

  1. If several nodes try to send packets simultaneously to the same node, the network switch must queue them up and feed them in one by one. This is known as network congestion. If the switch queue is full, packets will be dropped.

  2. When a packet reaches the destination machine and the machine is busy, the OS queues the incoming request until the application is ready to handle it.

  3. TCP performs flow control, limiting the sending rate at the sender's side to avoid overloading the network or the receiver.

We do not establish circuits for these networks because we want to optimise them for bursty traffic. If we allocated a fixed bandwidth and under-estimated it, the transfer would be unnecessarily slow and part of the network would go unused; if the guess were too high, the circuit could not be set up at all. Instead, TCP adapts the rate of data transfer to the available network capacity. With the use of Quality of Service and admission control, it is possible to emulate circuit switching and provide statistically bounded delay.
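
To make the queueing behaviour concrete, here is a toy simulation of a switch with a bounded output queue (the `Switch` class is invented for illustration, not a real networking API): a burst larger than the queue leads to drops, which is exactly the signal TCP's congestion control reacts to.

```python
from collections import deque

class Switch:
    """Toy network switch with a bounded output queue for one link."""
    def __init__(self, queue_size):
        self.queue = deque()
        self.queue_size = queue_size
        self.dropped = 0

    def receive(self, packet):
        if len(self.queue) >= self.queue_size:
            self.dropped += 1          # queue full: the packet is dropped
        else:
            self.queue.append(packet)  # queued: adds delay, not loss

    def forward_one(self):
        # The link feeds queued packets into the destination one by one.
        return self.queue.popleft() if self.queue else None

switch = Switch(queue_size=4)
for i in range(10):                    # a burst of 10 simultaneous packets
    switch.receive(f"pkt-{i}")
print(len(switch.queue), switch.dropped)  # 4 queued, 6 dropped
```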

Unreliable Clocks

Knowledge, Truth and Lies