
The Road To Serverless: Storage Engine

Author: Ed Huang (i@huangdx.net), CTO@PingCAP

In the previous article, we introduced the "Why" behind TiDB Serverless and the goals we hoped to achieve. Looking back now, I believe the biggest architectural change in building TiDB Serverless has been the storage engine, which has also been the key contributor to reducing overall usage costs.

Another reason to emphasize the storage engine is that if the compute layer is well abstracted, it can be made stateless (as with TiDB's SQL layer), and elastic scaling of stateless systems is relatively easy to achieve. The real challenge lies in the storage layer.

In the traditional database context, when we talk about storage engines, we typically think of LSM-Tree, B-Tree, Bitcask, and so on. These storage engines are local, so a distributed database needs a layer of sharding logic on top of the local storage engine to determine how data is distributed across physical nodes. In the cloud, however, especially when building a serverless storage service, the physical distribution of data is no longer critical, because S3 (object storage in general, including but not limited to AWS S3) has already addressed scalability and distributed availability remarkably well (the community also has high-quality open-source object storage implementations, such as MinIO). If we view a new "storage engine" built on top of S3 as a whole, the entire design logic of the distributed database is significantly simplified.
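To make this concrete, here is a minimal sketch of the idea that, once all data files live in a shared object store, "sharding" degenerates into a metadata lookup from key ranges to immutable objects: no node owns physical data, so moving a range between nodes is just reassigning metadata. All the names (`objectStore`, `rangeIndex`, the `sst/...` keys) are illustrative assumptions, not TiDB internals; a real engine would talk to S3 or MinIO via their client SDKs instead of a map.

```go
package main

import (
	"fmt"
	"sort"
)

// objectStore is a hypothetical stand-in for S3: a flat, shared
// namespace of immutable objects visible to every node.
type objectStore map[string][]byte

// rangeIndex maps the start key of each logical range to the object
// holding that range's data. Because every node sees the same object
// store, data placement reduces to this metadata; there is no
// per-node physical sharding to manage.
type rangeIndex struct {
	startKeys []string          // sorted range start keys
	objects   map[string]string // start key -> object name
}

// objectFor returns the object covering key: the last range whose
// start key is <= key.
func (idx *rangeIndex) objectFor(key string) string {
	i := sort.SearchStrings(idx.startKeys, key)
	if i == len(idx.startKeys) || idx.startKeys[i] != key {
		i--
	}
	return idx.objects[idx.startKeys[i]]
}

func main() {
	store := objectStore{
		"sst/0001": []byte("rows for [a, m)"),
		"sst/0002": []byte("rows for [m, z)"),
	}
	idx := &rangeIndex{
		startKeys: []string{"a", "m"},
		objects:   map[string]string{"a": "sst/0001", "m": "sst/0002"},
	}
	fmt.Println(string(store[idx.objectFor("k")])) // falls in [a, m)
	fmt.Println(string(store[idx.objectFor("q")])) // falls in [m, z)
}
```

The point of the sketch is the shape of the design: the hard distributed-systems problems (replication, durability, capacity) are delegated to the object store, and the database keeps only small, cheap-to-move metadata.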

But S3 has a limitation: latency for small I/O is high, so it cannot be placed directly on the primary read-write path of OLTP workloads. For writes, data therefore still needs to go to local disks first. However, as long as we ensure that log writes succeed, implementing distributed, low-latency, highly available log storage is much simpler than building a complete storage engine, because logs are append-only. Log replication thus still needs to be handled separately; options like Raft or Multi-Paxos can serve as the RSM (Replicated State Machine) algorithm, or one can even rely on EBS for a more simplified approach.

As for reads, achieving extremely low latency (<10ms) relies primarily on in-memory caching. The optimization work then focuses on implementing several levels of caching and deciding whether to use local disks, and whether to cache at the page or row level. Furthermore, a distributed environment offers one advantage: since data can be logically partitioned, the cache can also be distributed across multiple machines.
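The read path described above can be sketched as a tiered lookup: memory first, then a local-disk cache, and only on a double miss does the request pay the object store's latency. This is a minimal illustration under assumed names (`tieredReader`, `get`), with all three tiers modeled as maps; it is not TiDB's actual cache implementation, and a real system would also handle eviction, concurrency, and invalidation.

```go
package main

import "fmt"

// tieredReader models the read path: an in-memory cache, a
// local-disk cache, and the object store as the source of truth.
type tieredReader struct {
	mem, disk, s3 map[string][]byte
}

// get returns the value and the tier that served it, filling the
// faster tiers on a miss so repeated reads avoid S3's latency.
func (t *tieredReader) get(key string) ([]byte, string) {
	if v, ok := t.mem[key]; ok {
		return v, "mem" // sub-millisecond
	}
	if v, ok := t.disk[key]; ok {
		t.mem[key] = v // promote to memory for the next read
		return v, "disk" // low single-digit milliseconds
	}
	v := t.s3[key] // cold read: tens to hundreds of milliseconds
	t.disk[key] = v
	t.mem[key] = v
	return v, "s3"
}

func main() {
	r := &tieredReader{
		mem:  map[string][]byte{},
		disk: map[string][]byte{},
		s3:   map[string][]byte{"page:42": []byte("row data")},
	}
	_, tier1 := r.get("page:42")
	_, tier2 := r.get("page:42")
	fmt.Println(tier1, tier2) // cold read hits s3, warm read hits mem
}
```

Whether the cached unit is a page or a row is exactly the design choice mentioned above; the tiering structure stays the same either way, and because data is logically partitioned, each partition's cache can live on a different machine.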

Here are the design goals for the new storage engine for TiDB Serverless:

These, then, are the main design goals and technical choices for the cloud storage engine in TiDB Serverless. The multi-tenancy design of TiDB Serverless also makes heavy use of features of this storage layer, so in the next article I will introduce the implementation of multi-tenancy.