Distributed Systems
A single node is easy to debug and manage. However, a single node has certain limitations, such as a lack of separation of concerns, a single point of failure, and limited computational power. With that in mind, we will look at the following concepts in distributed systems.
List of References
- Designing Data-Intensive Applications by Martin Kleppmann
- Understanding Distributed Systems by Roberto Vitillo
Important Acronyms
CAP
Term | Definition |
---|---|
Consistency | Every read is guaranteed to return either the latest write or an error. |
Availability | Every request is guaranteed a response, but not guaranteed to be the latest information |
Partition Tolerance | Nodes continue to operate even if the network is arbitrarily partitioned due to outages |
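The consistency-versus-availability choice during a partition can be sketched as below. This is an illustrative model, not a real system; the class and mode names ("CP", "AP") are assumptions for the sake of the example.

```python
# Sketch: during a network partition, a replica that cannot reach the
# leader must choose between consistency (refuse with an error) and
# availability (serve possibly stale data).

class PartitionError(Exception):
    pass

class Replica:
    def __init__(self, mode):
        self.mode = mode          # "CP" favours consistency, "AP" availability
        self.value = "v1"         # last value replicated before the partition
        self.partitioned = True   # link to the leader is down

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Consistency: return an error rather than risk stale data.
            raise PartitionError("cannot confirm latest value")
        # Availability: always respond, even if the value may be stale.
        return self.value

cp, ap = Replica("CP"), Replica("AP")
print(ap.read())  # "v1" — a response, but not guaranteed to be the latest
try:
    cp.read()
except PartitionError as e:
    print("CP replica error:", e)
```

Note how each branch of `read` matches one row of the CAP table: the error path is the consistency guarantee, the stale-value path is the availability guarantee.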
BASE
Term | Definition |
---|---|
Basically Available | Guarantees availability |
Soft State | The state of the system may change over time, even without input, because of eventual consistency. |
Eventually Consistent | The system will become consistent over time, provided that it receives no further updates |
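Soft state and eventual consistency can be sketched with a last-write-wins (LWW) register. The class name and the timestamp-based merge rule are illustrative assumptions, not a specific system's protocol.

```python
# Sketch of eventual consistency with last-write-wins (LWW) merging.
# Each replica keeps (timestamp, value); syncing keeps the newer write.

class LWWRegister:
    def __init__(self):
        self.ts, self.value = 0, None

    def write(self, ts, value):
        if ts > self.ts:
            self.ts, self.value = ts, value

    def merge(self, other):
        # Anti-entropy: adopt the other replica's state if it is newer.
        self.write(other.ts, other.value)

a, b = LWWRegister(), LWWRegister()
a.write(1, "x")      # a client writes "x" to replica a
b.write(2, "y")      # a later write of "y" lands on replica b
# Soft state: a and b disagree until they exchange state.
a.merge(b)
b.merge(a)
print(a.value, b.value)  # both converge to "y", the latest write
```

Once updates stop and replicas sync, they converge on the same value, which is exactly the "eventually consistent" guarantee in the table above.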
Replication
Replication means that the same data can be found on different nodes.
We usually replicate data so that there is no single point of failure, the system can withstand high load, and data sits geographically closer to the client.
Replication ensures high availability because there is no single point of failure: if one slave node goes down, we can still read from the other slaves. However, consistency is weakened because replication is usually asynchronous, meaning the master does not wait for slaves to acknowledge a write. A slave that has not yet applied the write may therefore serve stale data.
Issues with replication
Troubles with distributed systems
Unreliable Networks
Assuming a shared-nothing architecture, nodes have no access to each other's memory except through the network. On an asynchronous network, there is no guarantee of when, or whether, a message will arrive at the other node.
On the other hand, in a synchronous network such as the telephone network, we set up a circuit and allocate fixed bandwidth to the nodes. There is no queueing, and the delays are bounded.
Timeout
It is difficult to pick a suitable timeout value. If the timeout is too long, the response time suffers. If it is too short, a node might be falsely declared dead. This is problematic because the node's responsibilities will be transferred to other nodes, increasing load on the network and the remaining nodes. Furthermore, another node might be asked to take over a task that the supposedly dead node is still running, so the task ends up executing twice.
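The duplicate-execution hazard from a too-short timeout can be sketched as below. The times and node names are illustrative assumptions; real failure detectors are more sophisticated.

```python
# Sketch: a timeout shorter than a node's (slow but successful) response
# time falsely declares it dead and hands the task to a backup, so the
# task runs twice.

executions = []

def run_task(node):
    executions.append(node)

response_time = 2.0   # primary is healthy, just slow (e.g. congestion)
timeout = 1.0         # detector gives up after 1 second

if response_time > timeout:   # primary falsely declared dead
    run_task("backup")        # its responsibility is transferred
run_task("primary")           # but the primary was alive and finishes too

print(executions)             # ['backup', 'primary'] — executed twice
```

With `timeout = 5.0` the primary would be (correctly) trusted and the task would run once, at the cost of waiting longer before detecting genuine failures.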
Network Congestion
The most frequent cause of delay is network congestion.
- If several nodes try to send packets simultaneously to the same node, the network switch must queue them up and feed them onto the link one by one. This is known as network congestion. If the switch queue is full, the packet is dropped.
- If the packet reaches the destination machine while the process is busy, the OS queues the packet until the application is ready to handle it.
- TCP performs flow control to limit the sending rate at the sender side.
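The switch-queue behaviour in the first point can be sketched with a bounded FIFO queue that drops packets once full (tail drop). The class name and capacity are illustrative; real switches use more elaborate queueing disciplines.

```python
# Sketch of tail drop at a congested switch port: a bounded FIFO queue
# that drops incoming packets once it is full.

from collections import deque

class SwitchPort:
    def __init__(self, capacity):
        self.queue = deque()
        self.capacity = capacity
        self.dropped = 0

    def enqueue(self, packet):
        if len(self.queue) >= self.capacity:
            self.dropped += 1     # queue full: the packet is dropped
        else:
            self.queue.append(packet)

    def transmit(self):
        # Packets are fed onto the outgoing link one by one.
        return self.queue.popleft() if self.queue else None

port = SwitchPort(capacity=2)
for p in ["p1", "p2", "p3"]:      # three senders burst at the same port
    port.enqueue(p)
print(list(port.queue), port.dropped)  # ['p1', 'p2'] 1
```

TCP's flow and congestion control exist precisely to back off before drops like this become widespread.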
We do not establish circuits for these networks because we want to optimise them for bursty traffic. With a fixed circuit, if we under-estimate the required bandwidth, the transfer is too slow; if we over-estimate it, the reserved capacity goes unused and other circuits cannot be set up. TCP instead adapts the rate of data transfer to the available network capacity. With the use of Quality of Service and admission control, it is possible to emulate circuit switching on packet networks and provide statistically bounded delay.