VOOZH about

URL: https://thenewstack.io/scyllladb-is-moving-to-a-new-replication-algorithm-tablets/

⇱ ScyllaDB Is Moving to a New Replication Algorithm: Tablets - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2023-08-02 06:31:31
ScyllaDB Is Moving to a New Replication Algorithm: Tablets
sponsor-scylladb,sponsored-post-contributed,
Data / Software Development / Storage

ScyllaDB Is Moving to a New Replication Algorithm: Tablets

Support for tablets is in experimental mode using Raft. This ultimately will allow ScyllaDB to scale faster and in parallel.
Aug 2nd, 2023 6:31am by Tomasz Grabiec
👁 Featued image for: ScyllaDB Is Moving to a New Replication Algorithm: Tablets
Image from TippaPatt on Shutterstock
ScyllaDB sponsored this post.

Like Apache Cassandra, ScyllaDB has historically decided on replica sets for each partition using Vnodes. The Vnode-based replication strategy tries to evenly distribute the global token space shared by all tables among nodes and shards. It’s very simplistic. Vnodes (token space split points) are chosen randomly, which may cause an imbalance in the actual load on each node.

Also, the allocation happens only when adding nodes, and it involves moving large amounts of data, which limits its flexibility. Another problem is that the distribution is shared by all tables in a keyspace, which is not efficient for relatively small tables, whose token space is fragmented into many small chunks.

In response to these challenges, ScyllaDB is moving to a new replication algorithm: tablets. Initial support for tablets is now in experimental mode.

Tablets allow each table to be laid out differently across the cluster. With tablets, we start from a different side. We divide the resources of the replica-shard into tablets, with a goal of having a fixed target tablet size, and then assign those tablets to serve fragments of tables (also called tablets).

This will allow us to balance the load in a more flexible manner by moving individual tablets around. Also, unlike with Vnode ranges, tablet replicas live on a particular shard on a given node, which will allow us to bind Raft groups to tablets.

ScyllaDB is engineered to deliver predictable performance at scale. It’s adopted by organizations that need ultra-low latency, even over millions of ops/sec & PBs of data. Our unique architecture leverages the power of modern infrastructure – translating to fewer nodes, less admin & lower costs.
Learn More
The latest from ScyllaDB
Hear more from our sponsor

This new replication algorithm allows each table to make different choices about how it is replicated and for those choices to change dynamically as the table grows and shrinks. It separates the token ownership from servers, ultimately allowing ScyllaDB to scale faster and in parallel.

Tablets require strong consistency from the control plane; this is provided by Raft. We talked about this detail in the ScyllaDB Summit talk below (starting at 17:26).

Raft vs. Lightweight Transactions for Strongly Consistent Tables

ScyllaDB is in the process of bringing the technology of Raft to user tables and allowing users to create strongly consistent tables that are based on Raft. We already provide strong consistency in the form of lightweight transactions, which are Paxos-based, but they have several drawbacks.

Generally, lightweight transactions are slow. They require three rounds to replicas for every request and they have poor scaling if there are conflicts between transactions. If there are concurrent conflicting requests, the protocol will retry due to conflict. As a result, they may not be able to make progress. This will not scale well.

Raft doesn’t suffer from this issue. First and foremost, it requires only one round to replicas per request when you’re on the leader — or even less than one per request because it can batch multiple commands in a single request. It also supports pipelining, meaning that it can keep sending commands without waiting for previous commands to be acknowledged. The pipelining goes down to a single CPU on which every following state machine runs. This leads to high throughput.

👁 Image

But Raft also has drawbacks in this context. Because there is a single leader, Raft tables may experience latency when the leader dies because the leader has to undergo a failover. Most of the delay is actually due to detection latency because Raft doesn’t switch the leader back and forth so easily. It waits for 1 second until it decides to elect a new leader. Lightweight transactions don’t have this, so they are theoretically more highly available.

Another problem with Raft is that you have to have an extra hop to the leader when the request starts executing not on the leader. This can be remedied by improving drivers to make them leader-aware and route requests to the leader directly.

Also, Raft tables require a lot of Raft groups to distribute load among shards evenly. That’s because every request has to go through a single CPU, the leader, and you have to have many such leaders to have even load. Lightweight transactions are much easier to distribute.

👁 Image

Balancing the Data across the Cluster

So let’s take a closer look at this problem. This is how the load is distributed currently using our standard partitioning (the Vnode partitioning), which also applies to tables that use lightweight transactions.

👁 Image

Replication metadata, which is per keyspace, determines the set of replicas for a given key. The request then is routed to every replica. On that replica, there is a sharding function that picks the CPU in which the request is served, which owns the data for a given key. The sharding function makes sure that the keys are evenly distributed among CPUs, and this provides good load distribution.

👁 Image

The story with Raft is a bit different because there is no sharding function applied on the replica. Every request that goes to a given Raft group will go to a fixed set of Raft state machines and Raft leader, and their location of CPUs is fixed.

👁 Image

They have a fixed shard, so the load distribution is not as good as with the sharding function with standard tables.

We could remedy the situation by creating more tokens inside the replication metadata so that we have more ranges and more narrow ranges. However, this creates a lot of Raft groups, which may lead to an explosion of metadata and management overhead because of Raft groups.

👁 Image

The solution to this problem depends on another technology: tablet partitioning.

Tablet Partitioning

In tablet partitioning, replication metadata is not per keyspace. Every table has a separate replication metadata. For every table, the range of keys (as with Vnode partitioning) is divided into ranges, and those ranges are called tablets. Every tablet is replicated according to the replication strategy, and the replicas live on a particular shard on the owning node. Unlike with Vnodes, requests will not be routed to nodes that then independently decide on the assignment of the key to the shard, but will rather be routed to specific shards.

👁 Image

This will give us more control over where data lives, which is managed centrally. This gives us finer control over the distribution of data.

The system will aim to keep the tablets at a manageable size. With too many small tablets, there’s a lot of metadata overhead associated with having tablets. But with too few large tablets, it’s more difficult to balance the load by moving tablets around. A table will start with just a few tablets. For small tables, it may end there. This is a good thing, because unlike with the Vnode partitioning, the data will not be fragmented into many tiny fragments, which adds management overhead and also negatively affects performance. Data will be localized in large chunks that are easy to process efficiently

As tables grow, as they accumulate data, eventually they will hit a threshold, and will have to be split. Or the tablet becomes popular with requests hitting it, and it’s beneficial to split it and redistribute the two parts so the load is more evenly distributed.

👁 Image

The tablet load balancer decides where to move the tablets, either within the same node to balance the shards or across the nodes to balance the global load in the cluster. This will help to relieve overloaded shards and balance utilization in the cluster, something which the current Vvnode partitioner cannot do.

This depends on fast, reliable, fault-tolerant topology changes because this process will be automatic. It works in small increments and can be happening more frequently than current node operations.

👁 Image

Tablets also help us implement Raft tables. Every Raft group will be associated with exactly one tablet, and the Raft servers will be associated with tablet replicas. Moving a tablet replica also moves the associated Raft server.

👁 Image

Additional Benefits: Resharding and Cleanup

Turns out the tablets will also help with other things. For example, resharding will be very cheap. SSTables are split at the tablet boundary, so resharding is only a logical operation that reassigns tablets to shards.

👁 Image

Cleanup is also cheap because cleaning up all data after a topology change is just about deleting the SSTable — there’s no need to rewrite them.

ScyllaDB is engineered to deliver predictable performance at scale. It’s adopted by organizations that need ultra-low latency, even over millions of ops/sec & PBs of data. Our unique architecture leverages the power of modern infrastructure – translating to fewer nodes, less admin & lower costs.
Learn More
The latest from ScyllaDB
Hear more from our sponsor
TRENDING STORIES
Tomasz Grabiec is a software engineer. Prior to joining ScyllaDB he worked for UBS IB and Sabre Holdings on systems built with Java technology. He's been a contributor to the Jato VM project, an open source implementation of the JVM.
Read more from Tomasz Grabiec
ScyllaDB sponsored this post.
SHARE THIS STORY
TRENDING STORIES
TNS owner Insight Partners is an investor in: Pragma.
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.