Database
The most basic form of database is appending records to the end of a log. This makes writes fast (a simple append), but reads are slow because the whole log has to be scanned. To speed up reads, we can add indexes, which speed up reads at the cost of slowing down writes, since every index must be updated on each write.
To avoid the log growing without bound, we break it into segments: when a segment reaches a certain size threshold, we start a new one. From time to time, we perform compaction, which throws away duplicate keys (keeping only the most recent value) and combines segments together.
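Below is a minimal, purely in-memory sketch of this idea (the class, the per-segment record limit, and the tuple record format are illustrative assumptions, not any real engine): writes append to the newest segment, reads scan segments newest-first, and compaction keeps only the latest value per key.

```python
# Illustrative sketch of an append-only key-value log broken into segments.
class AppendOnlyStore:
    def __init__(self, segment_limit=100):
        self.segment_limit = segment_limit   # records per segment before rolling over
        self.segments = [[]]                 # each segment is a list of (key, value) records

    def put(self, key, value):
        # Writing is cheap: just append to the end of the newest segment.
        self.segments[-1].append((key, value))
        if len(self.segments[-1]) >= self.segment_limit:
            self.segments.append([])         # threshold reached: start a new segment

    def get(self, key):
        # Without an index, reading means scanning segments from newest to oldest.
        for segment in reversed(self.segments):
            for k, v in reversed(segment):
                if k == key:
                    return v
        return None

    def compact(self):
        # Compaction keeps only the most recent value per key and merges segments.
        latest = {}
        for segment in self.segments:
            for k, v in segment:
                latest[k] = v
        self.segments = [list(latest.items())]
```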
Indexes
Index 1: Hash Index
We keep an in-memory hashmap that maps each key to the byte offset of its latest record in the log. Writing is faster than with a B-Tree or SSTable because every write is a sequential append, and merging old segments keeps the data from becoming fragmented.
However, range queries are inefficient because the keys are not kept in sorted order, and the in-memory hashmap must fit into memory — if there are too many keys, we run out of space.
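A small sketch of such a hash-indexed log, assuming a single append-only segment file (the file name and the `key,value` line format are made up for illustration):

```python
# Sketch of a hash index over an append-only log file: an in-memory dict maps
# each key to the byte offset of its latest record.
class HashIndexedLog:
    def __init__(self, path="segment.log"):
        self.path = path
        self.index = {}                      # key -> byte offset of latest record
        open(self.path, "a").close()         # make sure the segment file exists

    def put(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.tell()                # byte offset where this record starts
            f.write(record)                  # sequential append: fast writes
        self.index[key] = offset             # the hashmap must fit in memory

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                   # one seek instead of scanning the whole log
            line = f.readline().decode("utf-8").rstrip("\n")
        return line.split(",", 1)[1]
```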
Index 2: Sorted String Table
Similar to the hash-indexed log, except that the keys within each segment are kept in sorted order. To merge segments, compare them side by side and take the smaller key; if the same key appears in both, take the value from the more recent segment. The benefit is that we no longer need to keep every key in memory — a sparse index (one key every few kilobytes) is enough, because the sorted order lets us scan forward from the nearest indexed key.
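A sketch of the side-by-side merge, assuming each SSTable is just a sorted list of (key, value) pairs — an illustrative simplification of the on-disk format:

```python
# Merge two sorted segments; on a duplicate key, the newer segment's value wins.
def merge_sstables(older, newer):
    merged = []
    i = j = 0
    while i < len(older) and j < len(newer):
        if older[i][0] < newer[j][0]:
            merged.append(older[i]); i += 1
        elif older[i][0] > newer[j][0]:
            merged.append(newer[j]); j += 1
        else:
            merged.append(newer[j])          # same key: keep the newer segment's value
            i += 1; j += 1
    merged.extend(older[i:])
    merged.extend(newer[j:])
    return merged

# Example: the duplicate key "b" takes the value from the newer segment.
print(merge_sstables([("a", 1), ("b", 2)], [("b", 9), ("c", 3)]))
# [('a', 1), ('b', 9), ('c', 3)]
```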
Index 3: Log-Structured Merge Tree (LSM Tree)
- Writes go to an in-memory balanced tree (the current instance of the LSM tree).
- When the tree reaches a certain size threshold, write it out to disk as a sorted segment and start a new in-memory instance.
To serve a read request, we search the in-memory tree first, then the most recent on-disk segment, then progressively older segments. This is inefficient for keys that do not exist (every segment has to be checked), so a Bloom filter can be used to skip segments that definitely do not contain the key.
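A sketch of this read/write path, where a plain dict stands in for the in-memory tree and a per-segment key set stands in for the Bloom filter (both are illustrative simplifications):

```python
# Sketch of an LSM-tree write and read path.
class LSMTree:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.segments = []                   # list of (sorted key->value dict, key set)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            flushed = dict(sorted(self.memtable.items()))  # write out as a sorted run
            self.segments.append((flushed, set(flushed)))  # key set stands in for a Bloom filter
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:             # 1. most recent data lives in memory
            return self.memtable[key]
        for segment, maybe_has in reversed(self.segments):  # 2. then newest segment first
            if key not in maybe_has:         # Bloom filter answers "definitely not here"
                continue
            if key in segment:
                return segment[key]
        return None
```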
Size-tiered compaction — Newer and smaller SSTables are successively merged into older and larger SSTables.
Leveled compaction — The key range is split up into smaller SSTables and older data is moved into separate levels, which allows compaction to proceed more incrementally and use less disk space.
Index 4: B-Tree Index
B-trees break the database down into fixed-size blocks or pages and read or write one page at a time. A page can refer to other pages on disk, similar to a pointer, so the pages form a tree.
We start at the root of the B-Tree, which is a page in which each child reference is in charge of a contiguous range of keys. We keep traversing until we reach a page containing individual keys (a leaf page), which holds either the value itself or a reference to where the value can be found.
Branching Factor: Number of references per page
Update the value for an existing key: search for the leaf page containing the key, change the value in that page, and write the whole page back to disk.
Create a new key: find the page whose range encompasses the new key and add it to that page. If there is not enough free space in the page, split it into two half-full pages and update the parent page to account for the new subdivision of key ranges.
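A sketch of the lookup path only (the `Page` layout and the in-memory example tree are illustrative assumptions; a real B-tree stores pages on disk and also implements inserts and page splits):

```python
from bisect import bisect_right

# Each internal page holds separator keys and references to child pages;
# traversal continues until a leaf page holding the values is reached.
class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys                 # separator keys (internal) or stored keys (leaf)
        self.children = children         # child page references; None for a leaf page
        self.values = values             # values; only present on leaf pages

    def is_leaf(self):
        return self.children is None

def lookup(root, key):
    page = root
    while not page.is_leaf():
        # Follow the child whose key range encompasses the search key.
        page = page.children[bisect_right(page.keys, key)]
    if key in page.keys:
        return page.values[page.keys.index(key)]
    return None

# Example with branching factor 3: the root splits the key space at 10 and 20.
leaves = [Page([1, 5], values=["a", "b"]),
          Page([10, 15], values=["c", "d"]),
          Page([20, 25], values=["e", "f"])]
root = Page([10, 20], children=leaves)
print(lookup(root, 15))   # "d"
```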
The basic underlying operation of a B-tree is to overwrite a page on disk in place. Some operations require several pages to be overwritten (e.g. splitting a page and updating its parent), and if the database crashes after only some of those pages have been written, the index can end up corrupted ⇒ Write-Ahead Log (WAL): an append-only file to which every modification is written before it is applied to the pages themselves.
Why LSM | Why B-Tree |
---|---|
In a B-tree, some space within each page goes unused, so disk space may be wasted | Searching for a key in an LSM tree is comparatively inefficient, since the in-memory tree and several segments may need to be checked |
In a B-tree, every piece of data has to be written twice — once to the page and once to the WAL | |
OLTP vs OLAP
Online Transaction Processing | Online Analytical Processing |
---|---|
Mainly used for business transactions | Mainly used for analytical purposes |
Data written from user input | Data arrives via stream processing or bulk import |
Reads a few rows at a time, fetched by key | Aggregates over a large number of rows |
ACID Properties
Acronym | Description |
---|---|
Atomicity | A transaction's writes are applied all together or not at all — on abort, nothing is left partially applied |
Consistency | The application's notion of the data being in a consistent state (invariants hold) is preserved |
Isolation | Concurrently running transactions do not interfere with each other, as if each ran alone |
Durability | Once a transaction has committed, its data is not lost even if there is a hardware fault |
Isolation Levels vs Read Phenomena
Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read |
---|---|---|---|
Read uncommitted | Yes | Yes | Yes |
Read committed | No | Yes | Yes |
Repeatable Read | No | No | Yes |
Serializable | No | No | No |
The higher the isolation level, the stronger the guarantee that transactions see accurate data. However, this incurs a performance cost on the database due to reduced concurrency.
Dirty Reads
Transaction A reads data that Transaction B has written but not yet committed. The problem is that if Transaction B aborts, the data Transaction A read is inaccurate — it never actually took effect.
On the other hand, a dirty write happens when both transactions write to the same data and Transaction A overwrites uncommitted data written by Transaction B. Read committed isolation guarantees that there are no dirty reads and no dirty writes.
Non-Repeatable Read
Consider a scenario where Transaction 2 transfers $100 from account 1 to account 2 while Transaction 1 reads both accounts, each of which starts at $500. If Transaction 1 reads account 2 before the transfer commits ($500) and account 1 after it commits ($400), it sees a total of $900 even though the total will eventually be $1000.
This does not violate read committed, because both values Transaction 1 reads are committed values — account 2's balance as committed before the transfer and account 1's balance as committed after it. We are never reading a partially updated, uncommitted row.
This phenomenon is known as a non-repeatable read. It is especially unacceptable in two situations: (1) taking a backup, because the backup will mix old and new data together, and (2) running analytical queries or integrity checks, because they will return nonsensical results.
Snapshot Isolation
Each transaction reads from a consistent snapshot of the committed data taken at the start of the transaction, and ignores any later writes by other transactions, whether they are ongoing, aborted or committed. Writers still take write locks (to prevent dirty writes), but readers need no locks at all.
Multi-Version Concurrency Control
For each row, there are two additional fields called `created_by` and `deleted_by`. When a transaction inserts a new row, it fills in the `created_by` field with its transaction id; when it deletes a row, it fills in the `deleted_by` field. When it updates a row, it actually marks the old row as deleted and creates a new row.
When it is guaranteed that no transaction can any longer access a deleted row, a garbage-collection process removes it.
Since Transaction 1 takes a snapshot of the committed values before it begins, it will see both accounts as $500, because it ignores the writes made by Transaction 2 after the snapshot.
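A sketch of the MVCC visibility rule under a deliberately simplified model: a version is visible only if its creator had committed when the snapshot was taken and it has not been deleted by such a transaction. The `created_by`/`deleted_by` names follow the text above; everything else is an illustrative assumption.

```python
# Each row version carries created_by / deleted_by transaction ids.
class RowVersion:
    def __init__(self, value, created_by, deleted_by=None):
        self.value = value
        self.created_by = created_by
        self.deleted_by = deleted_by

def visible(version, snapshot_committed):
    """snapshot_committed: set of txids that had committed when the snapshot was taken."""
    if version.created_by not in snapshot_committed:
        return False                      # creator had not committed: invisible
    if version.deleted_by is not None and version.deleted_by in snapshot_committed:
        return False                      # already deleted by a committed transaction
    return True

def read(versions, snapshot_committed):
    for v in reversed(versions):          # newest version first
        if visible(v, snapshot_committed):
            return v.value
    return None

# An update by txn 13 marks the old version deleted and creates a new one.
account1 = [RowVersion(500, created_by=12), RowVersion(400, created_by=13)]
account1[0].deleted_by = 13
print(read(account1, snapshot_committed={12}))      # 500: txn 13 had not committed yet
print(read(account1, snapshot_committed={12, 13}))  # 400: txn 13's write is now visible
```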
Lost Update
A lost update usually occurs in a read-modify-write cycle: between reading a value and writing the modified value back, another transaction may have changed it, so one of the two updates is silently lost.
Solution 1: Atomic Write operations
Take a lock on the object when it is read and only release it once the update has been applied (cursor stability), or force all atomic operations to execute on a single thread.
UPDATE counters SET value = value + 1 WHERE key = 'foo';
Solution 2: Explicit Locking of rows
BEGIN;
SELECT * FROM figures WHERE name = 'robot' AND game_id = 222 FOR UPDATE;
COMMIT;
Solution 3: Automatic Detection
Some databases can detect lost updates automatically; once detected, the affected transaction is aborted and must retry its read-modify-write cycle.
Solution 4: Compare and Set
Allow an update to take effect only if the value has not changed since we last read it. If the current value does not match, the update does nothing and we retry the read-modify-write cycle.
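A sketch of the compare-and-set retry loop in application code; `store` is a hypothetical dict standing in for the database, and in a real database the compare and the set are performed atomically (for example, within a single UPDATE whose WHERE clause checks the old value):

```python
def compare_and_set(store, key, expected, new):
    # In a real database the compare and the set happen atomically;
    # here they are only shown together for illustration.
    if store.get(key) == expected:
        store[key] = new
        return True
    return False

def increment(store, key, max_retries=10):
    for _ in range(max_retries):
        old = store[key]                                  # read
        if compare_and_set(store, key, old, old + 1):     # write only if unchanged
            return old + 1
        # another writer got in first: retry the whole read-modify-write cycle
    raise RuntimeError("too many concurrent updates")

counters = {"foo": 41}
print(increment(counters, "foo"))   # 42
```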
Solution 5: Replication Configuration
- Last Write Wins (LWW)
- Atomic operations
- Allow several conflicting versions to coexist, and use application code or special data structures to resolve and merge the versions later.
Write Skew
Two doctors are on call for a shift and both want to take sick leave, but each may only do so if the other doctor remains on call. If they query the database concurrently, the check returns true for both of them (each sees the other still on call), and since the condition holds for both, they each update their own on-call status to false.
However, this violates the requirement that at least one doctor stays on call. This anomaly is known as write skew. The only reliable way to prevent it is the serialisable isolation level, because we cannot in general guarantee to lock the relevant rows using `FOR UPDATE` — the two transactions write to different rows than the ones their checks depend on.
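A sketch of this race under snapshot isolation (the dict and the snapshot copies are an illustrative stand-in for two transactions reading from the same snapshot): both checks pass, the two writes touch different rows, so no conflict is detected, and the invariant is silently broken.

```python
on_call = {"alice": True, "bob": True}

def can_go_off_call(snapshot):
    # The rule: a doctor may go off call only if someone would still remain on call.
    return sum(snapshot.values()) >= 2

# Both transactions take their snapshot before either write commits.
snapshot_alice = dict(on_call)
snapshot_bob = dict(on_call)

if can_go_off_call(snapshot_alice):
    on_call["alice"] = False     # Alice's transaction updates only her own row
if can_go_off_call(snapshot_bob):
    on_call["bob"] = False       # Bob's transaction updates only his own row

print(on_call)   # {'alice': False, 'bob': False}: nobody is left on call
```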
Serialisation
Two Phase Locking (2PL)
- If a transaction wants to read an object, it must first acquire the lock in shared mode. Several transactions are allowed to hold the lock in shared mode simultaneously, but if another transaction already holds the lock in exclusive mode, they must wait.
- If a transaction wants to write to an object, it must first acquire the lock in exclusive mode. Only one transaction may hold the lock in exclusive mode at a time.
- If a transaction first reads and then writes an object, it may upgrade its shared lock to an exclusive lock.
- The two phases: locks are acquired while the transaction executes (first phase), and all locks are released only at the end of the transaction (second phase).
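A sketch of a 2PL-style lock table under simplifying assumptions: a transaction that would have to wait simply raises an error instead of being queued, and deadlock detection is omitted.

```python
class LockManager:
    def __init__(self):
        self.shared = {}      # object -> set of txids holding a shared lock
        self.exclusive = {}   # object -> txid holding the exclusive lock

    def acquire_shared(self, txid, obj):
        holder = self.exclusive.get(obj)
        if holder is not None and holder != txid:
            raise RuntimeError("would block: exclusive lock held by another transaction")
        self.shared.setdefault(obj, set()).add(txid)

    def acquire_exclusive(self, txid, obj):
        others_shared = self.shared.get(obj, set()) - {txid}
        holder = self.exclusive.get(obj)
        if others_shared or (holder is not None and holder != txid):
            raise RuntimeError("would block: lock held by another transaction")
        self.exclusive[obj] = txid           # also covers upgrading an own shared lock

    def release_all(self, txid):
        # Second phase: every lock the transaction holds is released at commit/abort.
        for holders in self.shared.values():
            holders.discard(txid)
        for obj in [o for o, t in self.exclusive.items() if t == txid]:
            del self.exclusive[obj]
```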