Troubleshooting Guide: Write Stalls in TiKV
Write performance can degrade for many reasons. This troubleshooting guide covers one of them: how RocksDB’s built-in Write Stall feature can cause poor write performance, and how to resolve the issue.
When RocksDB cannot flush and compact promptly, it uses a feature called “Stalls” to slow down the amount of data coming into the engine. A Write Stall either limits the rate of writes (slowdown) or pauses all writes (stop).
RocksDB has an extensive system to slow down writes when flush or compaction can't keep up with the incoming write rate. Without such a system, if users keep writing more than the hardware can handle, the database will:
- Increase space amplification, which could lead to running out of disk space;
- Increase read amplification, significantly degrading read performance.
The idea is to slow down incoming writes to the speed that the database can handle. However, sometimes the database can be too sensitive to a temporary write burst, or underestimate what the hardware can handle, so that you may get unexpected slowness or query timeouts.
TiKV has 2 instances of RocksDB (RaftDB, KVDB). RaftDB has only 1 column family. KVDB has 3 column families (default, write, lock).
User Experiences & Symptoms
When the application or process does a large number of writes, TiDB response time (latency) degrades significantly. This seems to affect all writes and not just writes to a specific table.
A couple of Grafana graphs can help identify whether RocksDB’s Write Stall is the cause and whether further investigation into Write Stalls is warranted.
Use the following Grafana charts, located under [Cluster name] / TiKV-Details / [RocksDB - raft | RocksDB - kv].
In the Write stall duration chart below, Write Stalls spike significantly at about 11:30. No Write Stalls occurred before 11:30 or after 12:15. A healthy system should have no Write Stalls.
With this knowledge, we can dig deeper into Write Stalls.
The following are 3 primary reasons why RocksDB will trigger Write Stalls.
- Too Many Memtables
- Too Many Level 0 SST Files
- Too Many Pending Compaction Bytes
The reasons above map directly to the Write Stall Reason diagram below.
Each of the metrics below is associated with either slowdown or stop. Slowdown means that RocksDB has limited the rate of writes because of that metric. Stop means that RocksDB has completely stopped writes because of that metric.
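To make the slowdown/stop distinction concrete, here is a minimal Python sketch that splits a stall-reason label into its cause and severity. The label strings are assumptions that follow the "cause plus slowdown or stop" pattern described above; check the actual series names on your Write Stall Reason panel before relying on them.

```python
from typing import Tuple

# Hypothetical label prefixes; the exact names on your dashboard may differ.
CAUSES = {
    "memtable_count_limit": "Too Many Memtables",
    "level0_file_limit": "Too Many Level 0 SST Files",
    "pending_compaction_bytes": "Too Many Pending Compaction Bytes",
}


def classify_stall_reason(label: str) -> Tuple[str, str]:
    """Return (cause, severity) for a label such as 'level0_file_limit_stop'."""
    for severity in ("slowdown", "stop"):
        suffix = "_" + severity
        if label.endswith(suffix):
            prefix = label[: -len(suffix)]
            return CAUSES.get(prefix, "unknown"), severity
    # The label does not end in a known severity.
    return "unknown", "unknown"


for label in ("memtable_count_limit_slowdown", "level0_file_limit_stop"):
    print(label, "->", classify_stall_reason(label))
```

A "stop" result is the more urgent of the two, since writes for the affected scope are paused rather than throttled.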
Types of RocksDB Write Stalls
Below we will discuss each of the 3 reasons why RocksDB will trigger Write Stalls.
Too Many Memtables
If too many large MemTables are created, the risk of an Out-Of-Memory (OOM) error increases. RocksDB therefore limits both the number of MemTables and their size.
For each Column Family (default, write, lock, raft), RocksDB first writes the record to the WAL (write-ahead log) on disk and then inserts the data into a MemTable (write buffer). When the MemTable reaches the write-buffer-size limit (default 128MB for CFs default and write, 32MB for CF lock), it becomes read-only and a new MemTable is created to receive new writes. The maximum number of MemTables (default 5) is set by max-write-buffer-number. Once the limit is reached, RocksDB creates no more MemTables and stalls all new writes until the MemTable count drops below max-write-buffer-number.
Background tasks (jobs) are responsible for flushing MemTables to disk and compacting SST files. By default, max-background-jobs is set to 8 or CPU cores - 1, whichever is smaller.
Many MemTables can exist at once: there is at least one MemTable for each Column Family (default, write, lock, raft). As mentioned above, when a MemTable reaches the write-buffer-size limit, it becomes read-only and is flagged to be flushed to disk as an SST file. The number of jobs that can flush MemTables to disk is limited by max-background-flushes. The default is 2 or max-background-jobs / 4, whichever is bigger.
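The rotation and stall behavior described above can be sketched as a toy model. This is a simplified illustration, not RocksDB's implementation: the class name, fields, and the exact point at which the stall is checked are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ColumnFamilyMemtables:
    """Toy model of one column family's MemTables (hypothetical, simplified)."""

    write_buffer_size: int        # bytes; mirrors write-buffer-size
    max_write_buffer_number: int  # mirrors max-write-buffer-number
    active_bytes: int = 0         # bytes in the current writable MemTable
    immutable: int = 0            # read-only MemTables waiting to be flushed

    def total(self) -> int:
        # Immutable MemTables plus the single active one.
        return self.immutable + 1

    def write(self, nbytes: int) -> bool:
        """Apply a write; return False if the write would stall."""
        if self.active_bytes >= self.write_buffer_size:
            # The active MemTable is full and must become read-only.
            if self.total() >= self.max_write_buffer_number:
                return False  # no room for another MemTable: stall
            self.immutable += 1
            self.active_bytes = 0
        self.active_bytes += nbytes
        return True

    def flush_one(self) -> None:
        """A background flush job turns one immutable MemTable into an SST file."""
        if self.immutable > 0:
            self.immutable -= 1


cf = ColumnFamilyMemtables(write_buffer_size=100, max_write_buffer_number=2)
print(cf.write(100), cf.write(100), cf.write(1))  # third write hits the limit
cf.flush_one()
print(cf.write(1))  # a flush frees a slot, so writes resume
```

The model shows why flush throughput matters: writes resume only after a background job removes an immutable MemTable.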
NOTE: In early versions prior to TiKV v4.0.9, max-background-flushes is not available and the number of Background Flush tasks is set to max-background-jobs / 4.
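As a quick check of the two default rules quoted above, the following sketch computes them for a given core count. It follows the wording in this guide only; the floor of 1 job on a single-core machine and the use of integer division are assumptions, and actual TiKV defaults can vary by version.

```python
def default_max_background_jobs(cpu_cores: int) -> int:
    # "8 or CPU cores - 1, whichever is smaller".
    # The floor of 1 is an assumption so a single-core host stays valid.
    return max(1, min(8, cpu_cores - 1))


def default_max_background_flushes(max_background_jobs: int) -> int:
    # "2 or max-background-jobs / 4, whichever is bigger".
    # Integer division is an assumption.
    return max(2, max_background_jobs // 4)


for cores in (4, 8, 16):
    jobs = default_max_background_jobs(cores)
    print(cores, "cores ->", jobs, "jobs,", default_max_background_flushes(jobs), "flushes")
```

Under these rules the flush count stays at 2 until max-background-jobs reaches 12, which is one reason raising max-background-jobs alone may not add flush capacity.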
If the number of MemTables reaches the max-write-buffer-number for a Column Family, then RocksDB will stall all writes for that Column Family on that instance.
- [rocksdb|raftdb].[defaultcf|writecf|lockcf].write-buffer-size (default 128MB for defaultcf and writecf, 32MB for lockcf). If increased, the risk of OOM is higher.
- [rocksdb|raftdb].[defaultcf|writecf|lockcf].max-write-buffer-number (default 5). If increased, the risk of OOM is higher.
- [rocksdb|raftdb].max-background-jobs (TiDB 5: default 8)
- [rocksdb|raftdb].max-background-flushes (default 2 or max-background-jobs / 4, whichever is bigger)
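To see why raising these settings increases OOM risk, here is a rough worst-case estimate of MemTable memory per TiKV instance. It is a sketch that uses the defaults quoted in this guide and assumes every column family fills all of its MemTables at once; the dictionary keys are illustrative names, not configuration paths.

```python
MIB = 1024 * 1024

# (write-buffer-size, max-write-buffer-number) per column family,
# using the defaults quoted in this guide. Keys are illustrative only.
DEFAULTS = {
    "kv.default": (128 * MIB, 5),
    "kv.write": (128 * MIB, 5),
    "kv.lock": (32 * MIB, 5),
    "raft.default": (128 * MIB, 5),
}


def worst_case_memtable_bytes(settings):
    """Sum size * count over all column families."""
    return sum(size * count for size, count in settings.values())


print(worst_case_memtable_bytes(DEFAULTS) // MIB, "MiB with defaults")

# Doubling max-write-buffer-number doubles the worst case.
doubled = {cf: (size, count * 2) for cf, (size, count) in DEFAULTS.items()}
print(worst_case_memtable_bytes(doubled) // MIB, "MiB with max-write-buffer-number = 10")
```

Compare the result against the memory left over after the block cache and other TiKV components before changing either setting.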
Grafana Graphs Memtable
- Immutable mem-table Number
- Memtable Size
Too Many Level 0 SST Files
The size and count of Level 0 SST files can directly affect query read performance. Therefore, RocksDB will attempt to limit the number of Level 0 SST files.
SST files exist for each Column Family (default, write, lock, raft). The Level 0 SST files of RocksDB differ from those at other levels. MemTables are flushed to disk as Level 0 SST files, and the data in Level 0 SST files is arranged in the order in which the MemTables were generated. The key ranges of these SST files can overlap. In other words, data from the same Regions can be spread across different Level 0 SST files, and data with the same key can be in multiple SST files. Therefore, when a read occurs, each Level 0 SST file must be read in order. The more Level 0 SST files there are, the more likely query read performance is to suffer. Write Stalls are triggered when there are too many files in Level 0. It is normal for Level 0 and Level 1 SST files to be uncompressed, in contrast to the other levels, whose SST files are compressed.
By default, when the count of Level 0 SST files reaches 20, RocksDB slows writes for the entire database. When the count of Level 0 SST files reaches 36, RocksDB completely stalls writes for the entire database.
Compaction is triggered when the Level 0 SST file count of a column family (default, write, lock, raft) reaches 4, although the compaction may not run immediately. Compactions on different column families of the database can run in parallel.
- [rocksdb|raftdb].[defaultcf|writecf|lockcf].level0-file-num-compaction-trigger (default 4)
- [rocksdb|raftdb].[defaultcf|writecf|lockcf].level0-slowdown-writes-trigger (default 20) should be set to 32. Too many Level 0 files cause frequent write stalls; increasing this setting reduces them.
- [rocksdb|raftdb].[defaultcf|writecf|lockcf].level0-stop-writes-trigger (default 36) should be set to 64, for the same reason as level0-slowdown-writes-trigger.
- [rocksdb|raftdb].max-background-jobs (default 8) should be set to 12 to address too many pending compaction bytes. Increasing this setting should speed up the consumption of pending bytes.
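The three Level 0 thresholds can be read as a simple ladder. The sketch below is illustrative only: the function and state names are hypothetical, and the defaults are the values quoted in this guide.

```python
def level0_state(file_count: int,
                 compaction_trigger: int = 4,
                 slowdown_trigger: int = 20,
                 stop_trigger: int = 36) -> str:
    """Map a Level 0 SST file count to the behavior described in this guide."""
    if file_count >= stop_trigger:
        return "stop"                 # writes are completely stalled
    if file_count >= slowdown_trigger:
        return "slowdown"             # writes are throttled
    if file_count >= compaction_trigger:
        return "compaction-eligible"  # compaction is triggered, no stall yet
    return "ok"


# The same file count under the default and the suggested thresholds.
print(level0_state(25))
print(level0_state(25, slowdown_trigger=32, stop_trigger=64))
```

Raising the triggers only buys headroom. If compaction cannot keep up with the write rate, the file count will eventually reach the higher thresholds as well.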
Grafana Graphs Level0
- Level0 SST File Number
Too Many Pending Compaction Bytes
Log-Structured Merge-Tree (LSM) consists of multiple levels. Each level can include zero or more Sorted String Table (SST) files. Level 0 is identified as a “Higher Level” and each incrementing level (Level 1, Level 2, …) is identified as a lower level (see diagram below).
Since Level 0 is treated differently from the lower levels, it does not count toward Pending Compaction Bytes. It is included in the diagram below only to give a more complete picture of the levels.
RocksDB uses Level Compaction for Level 1 and lower. By default, Level 1 has a Target Size of 512 MB for column families (CFs) default and write, and 128 MB for CF lock. Each lower level has a Target Size 10 times that of the level above it. For example, if the Level 1 Target Size is 512 MB, the Level 2 Target Size is 5 GB, the Level 3 Target Size is 50 GB, and so on.
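The Target Size progression is a geometric series, which a short sketch makes explicit. The function name and the choice of 6 as the deepest level are assumptions for illustration.

```python
def level_target_sizes_mb(level1_target_mb: int = 512,
                          multiplier: int = 10,
                          max_level: int = 6) -> dict:
    """Target Size in MB for Level 1..max_level (max_level is an assumption)."""
    return {
        level: level1_target_mb * multiplier ** (level - 1)
        for level in range(1, max_level + 1)
    }


for level, size_mb in level_target_sizes_mb().items():
    print(f"Level {level}: {size_mb} MB ({size_mb / 1024:g} GB)")
```

Passing level1_target_mb=128 gives the corresponding series for CF lock.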
When a level exceeds its Target Size, the amount that is exceeded is known as Pending Compaction Bytes. Pending Compaction Bytes for all Levels 1 and lower are aggregated together to calculate the Total Pending Compaction Bytes. Write Stalls are triggered if Total Pending Compaction Bytes exceeds the soft-pending-compaction-bytes-limit or hard-pending-compaction-bytes-limit.
For example, in the diagram below the Total Pending Compaction Bytes is 72 GB. If soft-pending-compaction-bytes-limit is set to 64 GB (default), Write Stalls will be triggered.
(Diagram is not to scale)
- [rocksdb|raftdb].[defaultcf|writecf|lockcf].soft-pending-compaction-bytes-limit (default 64GB)
- [rocksdb|raftdb].[defaultcf|writecf|lockcf].hard-pending-compaction-bytes-limit (default 256GB)
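The calculation described in this section can be sketched as follows. It uses the simplified definition given in this guide (the sum of each level's excess over its Target Size); RocksDB's internal estimate is more involved. The per-level sizes are hypothetical numbers chosen to add up to 72 GB and are not taken from the diagram.

```python
MB_PER_GB = 1024


def total_pending_compaction_mb(level_sizes_mb: dict, level_targets_mb: dict) -> int:
    """Sum the amount by which each level (1 and lower) exceeds its Target Size."""
    return sum(
        max(0, level_sizes_mb.get(level, 0) - target)
        for level, target in level_targets_mb.items()
    )


def pending_state(total_mb: int,
                  soft_limit_mb: int = 64 * MB_PER_GB,
                  hard_limit_mb: int = 256 * MB_PER_GB) -> str:
    if total_mb > hard_limit_mb:
        return "stop"
    if total_mb > soft_limit_mb:
        return "slowdown"
    return "ok"


targets = {1: 512, 2: 5120, 3: 51200}    # MB, from the Target Size rule
sizes = {1: 2560, 2: 25600, 3: 102400}   # MB, hypothetical level sizes

total = total_pending_compaction_mb(sizes, targets)
print(total // MB_PER_GB, "GB pending ->", pending_state(total))
```

With these numbers the excess is 2 GB + 20 GB + 50 GB, which is above the soft limit but well below the hard limit.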
Grafana Graphs Compaction
- Compaction Pending Bytes
Possible Root Causes
- Application workload has changed and there are many more writes
- Infrastructure storage IOPS is not adequate
- Cascading effect of MemTables being full, too many Level 0 SST files, and pending compaction bytes too high
- Number of compaction threads too small
Titan is a high-performance RocksDB plugin for key-value separation. Titan can reduce write amplification in RocksDB when large values are used.
For systems with very high write performance requirements, write stall is a potential issue, and the write, read, and space configurations can only be balanced for a specific scenario. This makes it critical to understand the RocksDB parameters and what each of them means.
General Documentation and Articles
GitHub Facebook - RocksDB - Write Stalls
TiDB Write Flow and Metrics - Write Stalls
RocksDB Overview - WriteStall
GitHub Facebook - RocksDB - Tuning Guide
GitHub Facebook - RocksDB - Rate Limiter
GitHub Facebook - RocksDB - Leveled Compaction
3rd party Article - Tuning RocksDB - Write Stalls
TiKV Sample Config Template .toml
RocksDB.org Blog - Level-based Compaction Changes (2017)
Deck - RocksDB Compaction (2016)
Percona Monitoring and Management (PMM) Graphs Explained: MongoDB with RocksDB
GitHub Facebook - RocksDB - RocksDB Overview (good info on compaction)
Jira #2717 - PayPay - v3.0.12 - TiKV write stall
Jira #1149 - Industrial and Commercial Bank-v2.1.11-tikv is busy, cluster write performance plummets
Jira #1615 - Xiaohongshu-4.0.0-There was a problem with writing, and the delay gradually increased