![]() |
VOOZH | about |
We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.
Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.
Follow TNS on your favorite social media networks.
Become a TNS follower on LinkedIn.
Check out the latest featured and trending stories while you wait for your first TNS newsletter.
Probability is the branch of mathematics that deals with uncertainty. It helps us understand the likelihood of different outcomes occurring. Below, we consider two alternative architecture options for scaling a database horizontally and employ probability theory to show that one architecture is more reliable than the other by a factor of 60,000.
Application-level sharding uses domain-specific knowledge to partition data into multiple database instances running on multiple servers. Each database instance is isolated, enabling workloads to be scaled. This architecture requires custom logic for routing, rebalancing and handling cross-shard operations.
Distributed SQL provides a single logical database that horizontally scales across multiple servers with built-in replication and quorum-based logic to implement global ACID transactions. Additional servers can be added and integrated into the system, enabling workloads to be scaled. Automatic routing, rebalancing and handling of cross-shard operations simplifies development and speeds up time to market.
For this comparison, we assume both architectures run on Google Cloud Platform using VMs hosted as part of the Compute Engine service. Google Cloud Platform provides a monthly uptime service-level objective of 99.9% for a single VM/instance. We use this SLO in our system availability calculations.
An application-sharded system partitions data across multiple servers that then operate semi-independently.
In probability theory, independent events are events whose outcomes do not affect each other. For example, when throwing four dice, the number displayed on each dice is independent of the other three dice.
Similarly, the availability of each server in a six-node application-sharded cluster is independent of the others. This means that each server has an individual probability of being available or unavailable, and the failure of one server is not affected by the failure or otherwise of other servers in the cluster.
In reality, there may be shared resources or shared infrastructure that links the availability of one server to another. In mathematical terms, this means that the events are dependent. However, we consider the probability of these types of failures to be low, and therefore, we do not take them into account in this analysis.
Mathematically, if two events A and B are independent, then the probability of both A and B happening together is the product of their individual probabilities:
👁 Image
For a six-node database cluster, this would mean:
👁 Image
The six-node sharded architecture, therefore, supports an SLO of 99.4%, which is notably lower than the SLO of the underlying VMs.
A distributed SQL database automatically shards the data of a single logical database across multiple servers. Additionally, for resilience, it maintains replicas for each shard and typically uses a quorum-based algorithm to coordinate updates, ensuring strong consistency for reads and writes.
Each node manages one or more shards of data. Each shard is in a quorum group, with its data replicated on two other nodes. To protect against availability zone (AZ) outages as well as individual node failures, the cluster is typically distributed across three availability zones, and the data distribution algorithm ensures that replicas of a shard are always placed in different availability zones.
In probability theory, the binomial distribution models the number of expected outcomes during a series of trials or tests.
We can use the binomial distribution to calculate the probability of k servers being available in a cluster of n servers:
So, the six-node system is available if:
This means:
The six-node replication factor (RF) three-quorum-based architecture supports an SLO of 99.998%, notably higher than the SLO of the underlying VMs.
To further increase resilience and protect against two simultaneous failures, distributed SQL can be configured to operate with a replication factor RF of five. With this architecture, each node manages one or more shards of data.
Each shard is in a quorum group, with its data replicated on four other nodes. To protect against availability zone (AZ) outages as well as individual node failures, the cluster is typically distributed across five availability zones, and the data distribution algorithm ensures that replicas of a shard are always placed in different availability zones.
The 10-node RF5 quorum-based architecture supports a service-level objective of 99.99999%, which is significantly higher than the SLO of the RF3 cluster.
Traditional architectures are limited by single-node failure risk. Application-level sharding compounds this problem because if any node goes down, its shard and therefore the total system becomes unavailable.
In contrast, distributed databases with quorum-based consensus (like YugabyteDB) provide fault tolerance and scalability, enabling higher resilience and improved availability.
| Architecture | Service Level Objective | |
| Single Node | 99.9% | (Three 9s) |
| 6 Node Application-Level Sharding | 99.4% | (Two 9s) |
| 6 Node RF3 Distributed SQL Cluster | 99.99% | (Four 9s) |
| 10 Node RF5 Distributed SQL Cluster | 99.99999% | (Seven 9s) |
The summary table above shows a far greater likelihood of failure using a six-node application-level sharding architecture than a 10-node RF 5 distributed SQL cluster. Specifically:
Likelihood of failure of six-node application sharded compared with 10-node RF5 =
Enterprises that deliver high-throughput, real-time transaction services, such as payment processors and anti-money laundering solutions, are critically dependent on the resilience of their infrastructure.
Every minute of downtime is lost revenue. It erodes trust and potentially causes churn. A platform handling 10,000 transactions per second at $50 each with a 2% fee would lose $600,000 of revenue per minute, just in fees.
Resilience matters. Well-documented processes and runbooks activated during a failure scenario are not enough. Operational resilience for critical services requires resilient, self-healing architectures like distributed SQL.
Traditional architectures, particularly those using single-node or application-level sharding, are prone to failure and offer limited availability. Distributed SQL databases with quorum-based replication provide significantly higher availability, fault tolerance and resilience.
The difference is not just technical but business-critical. Downtime can result in substantial revenue loss, reputational damage and regulatory risk. As operational demands and regulatory expectations increase, adopting resilient, self-healing architectures is crucial for any enterprise that relies on high-throughput, real-time services.