Cloud-Native Architecture & Design: Data Partition & Replication

Shivakumar Goniwada Rudrappa
13 min read · Feb 28, 2021

Introduction

This article describes data partitioning (sharding) and replication strategies for cloud-native applications.

What is Partitioning?

Partitioning allows a table, index, or index-organized table to be subdivided into smaller chunks, where each chunk of such a database object is called a partition. Each partition has its own name.

Data partitioning divides a data set and distributes the data over multiple servers, or shards. Each shard is an independent database, and collectively the shards make up a single logical database. Partitioning helps with manageability, performance, high availability, security, operational flexibility, and scalability, which makes it an ideal fit for microservices data storage.

Data partitioning addresses issues of scale such as:

  • High query rates exhausting the CPU capacity of the server
  • Larger data sets exceeding the storage capacity of a single machine
  • Working set sizes larger than the system’s RAM, thus stressing the I/O capacity of disk drives

You can use the following strategies for database partitioning:

  • Horizontal Partitioning (Sharding) — Each partition is a separate data store, but all partitions share the same schema. Each partition is known as a shard and holds a subset of the data.
  • Vertical Partitioning — Each partition holds a subset of the fields for the items in the data store; the fields are divided according to how the data is accessed.
  • Functional Partitioning — Data is aggregated according to how it is used by each bounded context in the system.

You can combine multiple strategies in an application. For example, you might apply horizontal partitioning for high availability and use vertical partitioning to store fields according to how they are accessed.

Databases, whether RDBMS or NoSQL, provide different criteria for partitioning the data. These criteria are:

  1. Range or interval Partitioning
  2. List Partitioning
  3. Round-Robin Partitioning
  4. Hash Partitioning

Round-robin partitioning distributes the rows of a table among the nodes in a round-robin fashion. For range, list, and hash partitioning, an attribute called the partitioning key must be chosen from among the table attributes; rows are assigned to partitions based on the value of this key.

In range partitioning, a range of values is assigned to each partition, and the data is distributed among the nodes so that each partition contains the rows whose partitioning key falls within its range. The list strategy is similar to range partitioning, but a discrete list of values is assigned to each partition. Hash partitioning assigns rows based on a hash of the partitioning key.

Horizontal Partitioning or Sharding

Enterprise applications require a database to store business data. As the business grows, the data size grows rapidly, and at some point a single database performs poorly: CPU is exhausted, storage capacity is reached, and query throughput drops. You can only increase CPU, memory, and storage so far, so you cannot scale up beyond certain limits.

Sharding is a common idea in database architectures. By sharding tables, you store chunks of data across multiple physical nodes to achieve horizontal scalability. Scaling out horizontally enables a flexible database design that increases performance and the high availability of data.

The figure shows horizontal partitioning, or sharding. In this example, employee details are divided into two shards, HS1 and HS2, based on an ID/key. Each shard holds the data for a contiguous range of shard keys. Sharding spreads the load over more nodes, which reduces contention and improves performance.

The shards don't have to be the same size; it is more important to balance the number of requests. Some shards may be large and others smaller, and you can choose the shard key based on access patterns: smaller shards can serve frequent, fast operations, while larger shards can hold data that is accessed less often.
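To make the idea concrete, here is a minimal sketch of a range-based shard router. The shard names HS1 and HS2 follow the figure; the ID boundaries and the route_to_shard function are illustrative assumptions, not any specific product's API.

```python
# Minimal sketch of a range-based shard router with two shards (HS1, HS2)
# keyed by contiguous employee-ID ranges, as in the figure above.

SHARD_RANGES = [
    ("HS1", 1, 5000),      # employee IDs 1..5000 live on shard HS1
    ("HS2", 5001, 10000),  # employee IDs 5001..10000 live on shard HS2
]

def route_to_shard(employee_id: int) -> str:
    """Return the shard that owns the given employee ID."""
    for shard, low, high in SHARD_RANGES:
        if low <= employee_id <= high:
            return shard
    raise ValueError(f"No shard configured for ID {employee_id}")

print(route_to_shard(42))    # -> HS1
print(route_to_shard(7200))  # -> HS2
```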

Besides helping meet scalability and throughput Service Level Agreements (SLAs), sharding can reduce the impact of unplanned outages, since the nodes collaborate to keep data available. Some database vendors use a master-slave (leader-follower) architecture style for sharding.

Range-Based or Interval Partitioning/Sharding

Range-based sharding separates the data based on ranges of the data value; shard keys that fall within a range go into the same chunk. Each shard preserves the same schema as the master database. Interval partitioning is an extension of range partitioning in which, beyond a point in time, partitions are defined by an interval.

Range-based shards support more efficient range queries. Given a range query on the shard key, the query router can easily determine which chunks overlap that range and route the query only to the shards that contain those values.

Range partitioning creates a dedicated partition for certain values or value ranges in a table. In the example above, the partitioning is based on income: rows with income < $35,000 go into one shard and rows with income > $35,000 into another.

Partitions may be created or dropped as needed, and applications may choose to use range partitioning to manage data at a fine level of detail.

A range partitioning specification usually takes ranges of values to determine each partition, but it is also possible to define a partition for a single value. When a row is inserted or modified, the target partition is determined by the defined ranges. If a value does not fit any of these ranges, an error is raised. To prevent this error, create an additional catch-all partition to accommodate values that fall outside the defined ranges.
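As a sketch of the income example above, the snippet below routes rows to range partitions and falls back to a catch-all partition instead of raising an error. The partition names and bounds are assumptions for illustration, not a specific database's DDL.

```python
# Minimal sketch of range partitioning on an income column, with a catch-all
# partition for values outside every defined range.

RANGE_PARTITIONS = [
    ("P_LOW",  0,      35_000),     # 0 <= income < 35,000
    ("P_HIGH", 35_000, 1_000_000),  # 35,000 <= income < 1,000,000
]

def route_by_income(income: float) -> str:
    for name, low, high in RANGE_PARTITIONS:
        if low <= income < high:
            return name
    # Catch-all partition so out-of-range values do not raise an error
    return "P_OVERFLOW"

print(route_by_income(28_000))     # -> P_LOW
print(route_by_income(90_000))     # -> P_HIGH
print(route_by_income(5_000_000))  # -> P_OVERFLOW
```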

Range-based partitioning can result in an uneven distribution of data, which may negate some of the benefits of sharding.

Consider range or interval partitioning when:

  • Very large tables are frequently scanned by a range predicate on a good partitioning column.
  • You want to maintain a rolling window of data
  • You cannot complete housekeeping activities on large tables within the required time, but you can divide them into smaller logical chunks based on the partitioning column

Hash Partitioning/Sharding

Hash Partitioning is a partitioning technique where a hash key is used to distribute rows evenly across the different partitions.

Hashing is the process of converting a given key into another value; in this context it refers to converting a column's key value into the database page (or partition) on which the row will be stored.

Hash sharding takes a shard key's value and generates a hash value from it; the hash value then determines which shard the data should reside in. With a uniform hashing algorithm such as Ketama (an implementation of consistent hashing, meaning you can add or remove servers from the pool without causing a complete remap of all keys), data with close shard keys is unlikely to be placed on the same shard. In the diagram below, the table is partitioned by applying the hash function to the ID/key column.
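Below is a minimal sketch of hash sharding with a consistent-hash ring, in the spirit of Ketama. The virtual-node count, shard names, and the use of MD5 are illustrative choices, not the Ketama implementation itself.

```python
# Minimal sketch of a consistent-hash ring: each shard is placed on the ring at
# many virtual points, and a key is routed to the first shard clockwise from its hash.
import bisect
import hashlib

class HashRing:
    def __init__(self, shards, vnodes=100):
        self._ring = []  # sorted list of (hash point, shard) entries
        for shard in shards:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{shard}#{i}"), shard))
        self._ring.sort()
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        """Walk clockwise to the first shard at or after the key's hash."""
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["shard-1", "shard-2", "shard-3"])
print(ring.shard_for("employee:1001"))
print(ring.shard_for("employee:1002"))  # adjacent keys often land on different shards
```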

Partitioning by hash is used primarily to ensure an even distribution of data among a predetermined number of partitions; it focuses on data distribution rather than data grouping.

As a rule of thumb, hash partitioning can be used:

  • To enable partial or full parallel partition-wise joins with roughly equal-sized partitions
  • To distribute data evenly among the nodes
  • To randomly distribute data to avoid I/O bottlenecks.

List Partitioning

The list partitioning concept is similar to range partitioning. As detailed above, range partitioning assigns a range of values to each partition; list partitioning assigns a discrete set of values to each partition.

You should use list partitioning when you want to specifically map rows to partitions based on discrete values. For example, all users in Asia and Europe are stored in one partition, and users in America and Africa are stored in another.

List partitioning is useful when a column can contain only a limited set of values. Range partitioning could also be used in this case, but list partitioning lets you distribute the rows evenly by assigning an appropriate set of values to each partition.
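A minimal sketch of the region example above, assuming hypothetical partition names P_EAST and P_WEST:

```python
# Minimal sketch of list partitioning: each partition owns a discrete set of values.

LIST_PARTITIONS = {
    "P_EAST": {"Asia", "Europe"},
    "P_WEST": {"America", "Africa"},
}

def route_by_region(region: str) -> str:
    for partition, regions in LIST_PARTITIONS.items():
        if region in regions:
            return partition
    raise ValueError(f"No partition defined for region {region!r}")

print(route_by_region("Asia"))    # -> P_EAST
print(route_by_region("Africa"))  # -> P_WEST
```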

Round-Robin Partitioning

Round-robin partitioning is used to achieve an equal distribution of rows across partitions. With this technique, new rows are assigned to partitions on a rotating basis. There is no partitioning key; rows are simply spread across all partitions in turn, so load balancing is achieved.
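A minimal sketch of round-robin assignment, with hypothetical partition names and itertools.cycle standing in for the database's internal rotation:

```python
# Minimal sketch of round-robin partitioning: rows go to partitions in rotation,
# with no partitioning key involved.
import itertools

PARTITIONS = ["P0", "P1", "P2"]
_rotation = itertools.cycle(PARTITIONS)

def next_partition() -> str:
    """Assign the next inserted row to the next partition in the rotation."""
    return next(_rotation)

for row_id in range(6):
    print(row_id, "->", next_partition())  # 0->P0, 1->P1, 2->P2, 3->P0, ...
```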

Vertical Partitioning

Vertical partitioning splits the data vertically to reduce the I/O and performance costs associated with fetching the items that are frequently accessed.

In this example, different attributes of an employee are stored in different partitions. VS1 holds the data that is accessed frequently, while VS2 holds the employee type and income, which are accessed only intermittently.

The benefits of vertical partitioning are:

  • Slow-access data can be separated from more dynamic data.
  • Sensitive data can be stored in a separate partition with additional security controls
  • This strategy can reduce the amount of concurrent access
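As a rough illustration of the column-wise split described above, the sketch below divides an employee record into a hot partition (VS1) and a cold partition (VS2). The field names and store layout are assumptions for illustration.

```python
# Minimal sketch of vertical partitioning: one logical row is split into two
# column groups, keeping the id in both so the row can be joined back together.

VS1_FIELDS = {"id", "first_name", "last_name"}   # hot, frequently accessed
VS2_FIELDS = {"id", "employee_type", "income"}   # cold, intermittently accessed

def split_vertically(employee: dict) -> tuple[dict, dict]:
    """Split one logical row into the two vertical partitions."""
    vs1 = {k: v for k, v in employee.items() if k in VS1_FIELDS}
    vs2 = {k: v for k, v in employee.items() if k in VS2_FIELDS}
    return vs1, vs2

row = {"id": 7, "first_name": "Asha", "last_name": "Rao",
       "employee_type": "FTE", "income": 90_000}
print(split_vertically(row))
```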

Leader-Based or Leader-Follower Replication

Replication is the continuous copying of data changes from a primary database to a secondary database. The two databases are generally located on different servers, which allows queries to be load-balanced across them and provides failover capability. This kind of distribution satisfies failover and fault-tolerance requirements.

Replication can serve many non-functional requirements such as:

  • Scalability — Can handle higher query throughput than a single machine can handle
  • High Availability — Keeping the system running even when one or more nodes go down
  • Disconnected Operations — Allowing an application to continue working when there is a network problem
  • Latency — Placing data geographically closer to users, so that users can interact with the data faster

In some cases, replication can provide increased read capacity, as clients can send read operations to different servers. Maintaining copies of data on different nodes and in different data centers increases data locality and the availability of distributed applications. You can also maintain additional copies for dedicated purposes, such as disaster recovery, reporting, or backup.

In leader-based replication, one replica is designated as the leader while the other replicas are followers. Clients always send their write queries to the leader. The leader writes the data to its local storage first and then sends the data change to its followers. When a client wants to read from the database, it can query either the leader or a follower. The leader is responsible for making decisions on behalf of the entire cluster and propagating those decisions to all the nodes in the cluster.

The example above shows a single leader with both synchronous and asynchronous replication. The user sends a request to the leader to update the first name; the leader applies the update locally and replicates the change synchronously to Follower 1 and Follower 2. Once the leader receives an OK response from Follower 1 and Follower 2, it returns a success status to the user. The leader replicates asynchronously to Follower 3 and does not wait for an OK from Follower 3.
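The sequence can be sketched roughly as follows; the Leader and Follower classes, the threading model, and the method names are illustrative assumptions rather than any vendor's replication API.

```python
# Minimal sketch of single-leader replication with two synchronous followers and
# one asynchronous follower, mirroring the figure.
import threading

class Follower:
    def __init__(self, name):
        self.name = name
        self.store = {}

    def apply(self, key, value):
        self.store[key] = value
        return "OK"

class Leader:
    def __init__(self, sync_followers, async_followers):
        self.store = {}
        self.sync_followers = sync_followers
        self.async_followers = async_followers

    def write(self, key, value):
        self.store[key] = value                        # 1. write locally
        for f in self.sync_followers:                  # 2. wait for sync followers' OK
            assert f.apply(key, value) == "OK"
        for f in self.async_followers:                 # 3. fire-and-forget async replication
            threading.Thread(target=f.apply, args=(key, value)).start()
        return "OK"                                    # 4. acknowledge the client

leader = Leader(sync_followers=[Follower("F1"), Follower("F2")],
                async_followers=[Follower("F3")])
print(leader.write("first_name", "Shiva"))  # -> OK once F1 and F2 have acknowledged
```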

In a multi-leader setup, there are two data centers (DCs) or clusters across geographies to provide high availability and lower latency to users in different regions. In this model, each cluster or DC has its own leader and set of followers, each replicating as in the diagram above, but the two clusters need to synchronize and resolve any conflicts or inconsistencies. In this case, the two leaders talk to each other through a conflict-resolution component to stay in sync.

At startup, every server in a cluster or DC looks for an existing leader. If no leader is found, it triggers a leader election. A leader in each cluster is a must; without a leader, no user requests are accepted. Only the leader handles client requests, not the followers; if a request is sent to a follower, the follower forwards it to the leader to take action.

How are leaders selected?

When the existing leader becomes unavailable, an election is conducted to select a new one; the database cluster uses the Raft consensus algorithm to choose the leader.

Raft is a consensus algorithm designed to select a leader by ensuring that each node in the cluster agrees on the same series of state transitions.

The Raft protocol was developed by Diego Ongaro and John Ousterhout (Stanford University) in 2014. Raft was designed to make it easier to understand how consensus can be achieved. Consensus is a method by which multiple servers agree on one value; once they reach a decision on a value, that decision is final.

In Raft, each node in a replicated server cluster is in one of three states: leader, follower, or candidate. During an election to choose a leader, a server asks the other servers to vote for it; servers that have requested votes are called candidates.

The figure above shows the step-by-step process by which servers apply Raft consensus to choose a leader. A leader election is started by a candidate server: it increments the term counter, votes for itself, and sends a vote request to all other nodes. Here, Follower 3 becomes a candidate and sends vote requests to Follower 1 and Follower 2. A server votes only once per term, on a first-come, first-served basis. If a candidate receives a majority of votes, it becomes the new leader; here Follower 3 receives the majority and is elected. Raft uses randomized election timeouts to ensure that split votes are resolved quickly.
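A highly simplified sketch of one election round is shown below. It omits timeouts, heartbeats, and log comparison, so it is a teaching aid rather than a Raft implementation; the Node class and run_election function are assumptions.

```python
# Simplified sketch of a single Raft election round: the candidate increments its
# term, votes for itself, and requests votes from peers, who vote at most once per term.

class Node:
    def __init__(self, name):
        self.name = name
        self.current_term = 0
        self.voted_for = {}   # term -> candidate this node voted for in that term
        self.state = "follower"

    def request_vote(self, term, candidate):
        """Grant the vote if we have not voted for anyone else in this term."""
        if term > self.current_term:
            self.current_term = term
        if self.voted_for.get(term) in (None, candidate):
            self.voted_for[term] = candidate
            return True
        return False

def run_election(candidate: Node, peers: list[Node]) -> bool:
    candidate.state = "candidate"
    candidate.current_term += 1                  # increment the term counter
    term = candidate.current_term
    candidate.voted_for[term] = candidate.name
    votes = 1                                    # the candidate votes for itself
    for peer in peers:
        if peer.request_vote(term, candidate.name):
            votes += 1
    if votes > (len(peers) + 1) // 2:            # majority of the full cluster
        candidate.state = "leader"
        return True
    return False

nodes = [Node("Follower 1"), Node("Follower 2"), Node("Follower 3")]
print(run_election(nodes[2], nodes[:2]))  # Follower 3 wins the election -> True
print(nodes[2].state)                     # -> leader
```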

High availability of the leader is achieved using a failover pattern. Heartbeats with a timeout are used to detect whether a replica is dead or alive. When one or more followers fall behind the leader by a certain configurable amount, this is called replication lag, and it can cause strange side effects. Various consistency models can be used to decide how an application should behave under replication lag.
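A minimal sketch of heartbeat-based failure detection, assuming an arbitrary two-second timeout and hypothetical function names:

```python
# Minimal sketch of heartbeat-based failure detection for failover: a replica is
# considered dead if no heartbeat has arrived within the timeout.
import time

HEARTBEAT_TIMEOUT = 2.0   # seconds without a heartbeat before declaring a node dead
last_heartbeat = {}       # node name -> timestamp of last heartbeat received

def record_heartbeat(node: str) -> None:
    last_heartbeat[node] = time.monotonic()

def is_alive(node: str) -> bool:
    seen = last_heartbeat.get(node)
    return seen is not None and (time.monotonic() - seen) < HEARTBEAT_TIMEOUT

record_heartbeat("leader")
print(is_alive("leader"))      # -> True right after a heartbeat
print(is_alive("follower-2"))  # -> False; no heartbeat seen, trigger failover/election
```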

Quorum-Based Replication

A cluster quorum disk is the storage medium on which the configuration database is stored for a cluster computing network. The cluster configuration database, also called a quorum, informs the cluster which physical server(s) should be active at any given time. The quorum disk is a shared block device that allows concurrent read/write access by all nodes in a cluster.

In this replication model, the client is responsible for copying the data to multiple replicas; the nodes do not actively copy data among each other. The size of the replica group doesn't change even when some replicas are down. The client sends both reads and writes to multiple replicas. A cluster agrees that it has received an update when a majority of the nodes in the cluster have acknowledged it; this number is called the quorum. The quorum size is determined by the following formula.

Quorum size = n/2 + 1 (rounding the division down). If you have 5 nodes in a cluster, then n = 5 and 5/2 + 1 = 3, so a cluster of 5 nodes needs a quorum of 3.

The number of failures that can be tolerated equals the cluster size minus the quorum: with nodes = 5 and quorum = 3, failures tolerated = 5 - 3 = 2, so a cluster of 5 nodes can tolerate two of them failing. Conversely, to tolerate f failures you need 2f + 1 nodes in the cluster (2 × 2 + 1 = 5).
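The arithmetic can be captured in a few lines; the function names are illustrative:

```python
# Minimal sketch of the quorum arithmetic described above.

def quorum_size(n: int) -> int:
    """Majority quorum: n // 2 + 1."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Failures tolerated = cluster size minus quorum."""
    return n - quorum_size(n)

def nodes_needed(f: int) -> int:
    """Nodes needed to tolerate f failures: 2f + 1."""
    return 2 * f + 1

print(quorum_size(5))         # -> 3
print(tolerated_failures(5))  # -> 2
print(nodes_needed(2))        # -> 5
```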

The figure above depicts a quorum-based replication pattern, showing a quorum write, a quorum read, and a read repair after a node (Replica 3) outage. In that case, two acknowledgments are sufficient for the write, so the user receives an OK once two replicas have responded; this satisfies n/2 + 1 = 3/2 + 1 = 2.

If there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. The quorum allows the system to tolerate unavailable nodes as follows:

  • If w < n, we can still process writes if a node is unavailable
  • If r < n, we can still process reads if a node is unavailable
  • With n=3, w=2, r=2, you can tolerate one unavailable node.
  • With n=5, w=3, r=3, you can tolerate two unavailable nodes.
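The following sketch shows quorum writes and reads with n=3, w=2, r=2, using in-memory dictionaries as stand-in replicas and a simple version counter; all names here are illustrative assumptions.

```python
# Minimal sketch of quorum writes and reads over n replicas with write quorum w
# and read quorum r.

N, W, R = 3, 2, 2
replicas = [dict() for _ in range(N)]   # each replica maps key -> (version, value)
version = 0

def quorum_write(key, value):
    global version
    version += 1
    acks = 0
    for rep in replicas:
        rep[key] = (version, value)
        acks += 1
        if acks >= W:       # acknowledge once w replicas confirm; in a real system the
            return "OK"     # remaining replicas are updated asynchronously (may stay stale)
    raise RuntimeError("write quorum not reached")

def quorum_read(key):
    # Query r replicas and return the value with the highest version seen
    responses = [rep.get(key) for rep in replicas[:R]]
    responses = [x for x in responses if x is not None]
    return max(responses)[1] if responses else None

quorum_write("first_name", "Shiva")
print(quorum_read("first_name"))        # -> Shiva
```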

The cluster can function only if the majority of servers are up and running. You need to consider:

  • The throughput of a write operation: Every time data is written to the cluster, it needs to be copied to multiple servers, and every additional node adds overhead to completing the write. Write latency is directly proportional to the number of servers forming the quorum, so increasing the number of nodes impacts throughput.
  • The number of failures that need to be tolerated — the number of failures tolerated depends on the number of nodes in the cluster, but simply adding one more node does not always buy more fault tolerance, much as 100 developers cannot complete in 1 day a project that 5 developers finish in 20 days.

Even if a client always performs quorum reads and writes, conflicts are likely to occur:

  • Two clients may write to the same key at the same time (use concurrency control to manage)
  • If an error occurs during a write, or if a node fails and has to be recreated, a write may be present on fewer than w replicas.

The result is that replicas disagree about what a particular value in the database should be. In such cases, the application must handle the conflict using a concurrency-control algorithm.

Martin Fowler wrote in his blog about how to choose the optimal number of servers in a cluster; he notes that the choice is based on the number of tolerated failures and the approximate impact on throughput. The throughput column in his table shows approximate relative throughput, highlighting how throughput degrades as the number of servers grows; the numbers will vary from system to system. For further reading, refer to the Raft thesis and the ZooKeeper paper.

Quorum writes and reads alone are not sufficient, as some failure scenarios can still cause clients to see inconsistent data. Each individual server has no visibility into the data on other servers, so inconsistencies can be resolved only when data is read from multiple nodes in the cluster and repaired (read repair).
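A minimal sketch of read repair under these assumptions: read from several replicas, keep the value with the highest version, and write it back to any stale replica. The versioned in-memory replicas and function names are stand-ins for real nodes, not a specific product's API.

```python
# Minimal sketch of read repair: the freshest version wins and stale replicas are fixed.

replicas = [
    {"first_name": (2, "Shiva")},   # up to date
    {"first_name": (2, "Shiva")},   # up to date
    {"first_name": (1, "Shv")},     # stale (missed the latest write)
]

def read_with_repair(key):
    responses = [(i, rep.get(key)) for i, rep in enumerate(replicas)]
    latest = max(v for _, v in responses if v is not None)   # highest version wins
    for i, value in responses:
        if value != latest:
            replicas[i][key] = latest                          # repair the stale replica
    return latest[1]

print(read_with_repair("first_name"))  # -> Shiva, and the third replica is repaired
```

Quorum-based stores such as Cassandra perform a similar repair step automatically as part of their read path.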
