TiKV Performance Tuning
Overall Resource Usage
Tuning for Writes
Tuning for Reads
TiKV Whole Picture
- Transactions use the Scheduler pool
- Raft uses the Raftstore pool
- RocksDB uses the Apply thread pool
Overall Resource Usage
- TiKV is CPU bound for most workloads.
- Since TiKV does not use async IO (for now), 70%-80% CPU utilization per core usually means the CPU is fully utilized.
- Grafana shows CPU usage as cpu_util * #cpu_cores.
~700% CPU usage on an 8-vCPU TiKV node means TiKV is CPU bound
CPU util of individual thread pools
- Even if the overall CPU is not fully utilized, individual thread pools may be fully utilized.
- Increase #threads in the specific thread pool to match your usage.
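As a sketch, the per-pool thread counts can be raised in the TiKV config. The keys below are standard TiKV options; the values are illustrative only and should be sized to the observed utilization of each pool:

```toml
# Illustrative values; size each pool to its observed CPU utilization.
[readpool.unified]
max-thread-count = 10            # unified read pool (storage + coprocessor reads)

[storage]
scheduler-worker-pool-size = 8   # transaction scheduler pool (default 4)

[raftstore]
store-pool-size = 4              # Raftstore pool (default 2)
apply-pool-size = 4              # apply pool (default 2)
```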
Check if disk latency/IOPS/throughput are high or have spikes.
For cloud disk (AWS EBS, GCP storage disks, etc.), check if IOPS or throughput is hitting the limit.
For some cloud vendors, there’s also IOPS/throughput limit per VM instance.
TiKV Disk IO tasks
The disk performance dashboard only shows average IO latency.
Major TiKV IO tasks
append log: raftdb WAL write / fsync.
apply log: kvdb WAL write.
storage read: kvdb get/scan.
coprocessor read: kvdb scan.
backup, data import, etc.
RocksDB WAL duration
raftdb and kvdb WAL write and fsync durations have a big impact on performance, especially for write-heavy workloads.
Disk performance directly affects WAL write and fsync duration.
raftdb fsyncs after each write.
kvdb fsyncs only on Raftstore admin commands (e.g., the raft log GC command)
TiKV Memory Usage
TiKV memory allocations are all through Jemalloc.
resident: approximation of memory used by Jemalloc.
allocated: memory requested by TiKV.
Difference between “resident” and “allocated” represents memory fragmentation.
RocksDB block cache
Shared by raftdb and kvdb.
Check `storage.block-cache.capacity` config if block cache size is not as expected.
Cache needs to warm up after TiKV starts.
Warmup is slow for write-only workloads.
Block cache sizes are the same under “RocksDB – kv” and “RocksDB – raft” panel, since raftdb and kvdb share block cache.
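If the cache size is not what you expect, `storage.block-cache.capacity` is the single knob for the shared cache. A hedged sketch (the value is illustrative; by default TiKV sizes the cache to roughly 45% of system memory):

```toml
[storage.block-cache]
# Shared by raftdb and kvdb; illustrative value for a 32 GB node.
capacity = "16GB"
```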
Tuning for Writes
TiKV Write Flow
- Process time spent in TiKV (excluding network latency)
TiKV Scheduler Pool
scheduler worker pool
Check whether the scheduler worker pool is almost full.
Storage Async Write Duration
- Is it slow in Raftstore?
TiKV Raftstore Pool
Raftstore Apply Pool and Append Pool
Are they balanced?
Are raftstore CPUs balanced between TiKV instances?
Are apply CPUs balanced between TiKV instances?
Is the raftstore pool or apply pool almost full? Increase if needed.
raftstore.store-pool-size (default: 2)
raftstore.apply-pool-size (default: 2)
Increasing their size sometimes improves performance, even when they are not full.
Hard to reason about; needs some experimentation.
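A minimal sketch of bumping both pools from their defaults; the keys are as named above, the values are illustrative, and the per-thread CPU panels should be checked before and after:

```toml
[raftstore]
store-pool-size = 4   # default 2; handles raft log append
apply-pool-size = 4   # default 2; writes committed entries to kvdb
```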
Enable Hibernate Regions
Raftstore heartbeat can consume a lot of CPU
Especially when the region count per node is huge (e.g., high data volume)
Reduce raftstore heartbeats when a region has no traffic
Enabled by default starting from 5.0.2
Consider enabling it if
heartbeat messages outnumber append messages in raftstore
raftstore per thread CPU util > 80%
Check “Messages” in the “Raft message” panel to see whether heartbeat messages outnumber append messages.
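On versions where it is not yet the default, hibernate regions is a single switch:

```toml
[raftstore]
# Enabled by default since 5.0.2; set explicitly on earlier versions.
hibernate-regions = true
```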
Raftstore Append and Apply Duration
Raftstore flow control
Are log commits slow?
Make flow control more aggressive.
raftstore.raft-max-inflight-msgs and raftstore.raft-max-size-per-msg
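A hedged sketch of making flow control more aggressive by lowering both knobs below their defaults (256 and "1MB"); the values are illustrative:

```toml
[raftstore]
raft-max-inflight-msgs = 128     # default 256; fewer in-flight raft messages
raft-max-size-per-msg = "512KB"  # default "1MB"; smaller raft message batches
```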
TiKV Write Flow to RocksDB
RocksDB Write Duration
- Are RocksDB writes slow?
RocksDB write steps
Enter and wait in write queue
Preprocessing (switch memtable, etc.)
Write WAL and fsync (disk IO)
Postprocessing (update seqno)
High WAL/sync duration means IO is slow
Check disk metrics
Check if RocksDB compaction flow is too high or too spiky
Dealing with Write Stall
Write stall means compaction is not able to catch up
Identify stall reason from metrics
Identify resource limit
reduce CPU consumption
change compression level
reduce disk bandwidth consumption
consider enabling Titan
For temporary write stall, increase stall trigger
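The stall triggers live per column family. The keys below are standard RocksDB options exposed by TiKV; the values are only a sketch of raising them to ride out a temporary compaction backlog:

```toml
[rocksdb.defaultcf]
# Illustrative values; raise the triggers only for temporary stalls.
level0-slowdown-writes-trigger = 30           # default 20
level0-stop-writes-trigger = 60               # default 36
soft-pending-compaction-bytes-limit = "256GB"
hard-pending-compaction-bytes-limit = "512GB"
```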
Throttling Background IO
Reduce impact of background IO (mainly RocksDB compaction)
RocksDB rate limiter
Can only limit compaction IO
rocksdb.rate-bytes-per-sec = "100MB" (throttle compaction to 100MB/s)
auto-tuned rate limit to smooth compaction flow (since 5.0.0)
TiKV rate limiter (will be available in 5.1.0)
Can limit total IO (including Raftstore snapshot, data import/export IO, etc.)
storage.io-rate-limit.max-bytes-per-sec = "100MB"
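Putting the two limiters side by side (keys as named above; values illustrative, and `rate-limiter-auto-tuned` is assumed available on 5.0+):

```toml
[rocksdb]
rate-bytes-per-sec = "100MB"     # throttles compaction IO only
rate-limiter-auto-tuned = true   # smooth compaction flow (since 5.0.0)

[storage.io-rate-limit]
max-bytes-per-sec = "100MB"      # total background IO limit (since 5.1.0)
```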
Separate large value from RocksDB
Large values stored in Titan do not participate in RocksDB compaction.
Reduce disk throughput consumption by compaction.
Stale values are cleaned up by Titan GC.
Consider enabling Titan if …
Workload writes many large values (> 1 KB), and
Disk throughput is the bottleneck, and
There are no scans, or scan performance does not matter.
rocksdb.titan.enabled = true
Has some more tunables.
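A minimal Titan sketch; `min-blob-size` controls which values count as "large" (the 1 KB threshold below mirrors the > 1 KB guidance above, and is illustrative):

```toml
[rocksdb.titan]
enabled = true

[rocksdb.defaultcf.titan]
min-blob-size = "1KB"   # values larger than this go to Titan blob files
```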
Tuning For Reads
TiKV Read Flow
Read Thread Pools
Unified thread pool
Implements multilevel feedback queue
Adaptive thread number
Fairness between small and large tasks
Storage read pool / coprocessor read pool
Simple thread pool with priority
Read thread pools (2)
|                  | Unified thread pool               | Copr read pool / storage read pool |
| ---------------- | --------------------------------- | ---------------------------------- |
| Storage read     | 5.0, if no config specified       | <= 4.0, or config specified        |
| Coprocessor read | 4.0 / 5.0, if no config specified | <= 3.0, or config specified        |
The unified thread pool is enabled by default only when no configuration is specified for “readpool.storage” and “readpool.coprocessor” respectively:
readpool.storage.use-unified-pool = true
readpool.coprocessor.use-unified-pool = true
Adjust Thread Pool Size
Is the thread pool almost full?
Coprocessor Wait Duration
Is the wait duration too high?
Too many large scans
Thread pool is too small.
Wrong query plan
Block cache hit rate
A low filter/index hit rate (< 95%) indicates the block cache is too small.
The data hit rate depends on block cache size and data locality.
Block cache flow
total_written: approximation of disk read IO size (plus page cache hits)
filter_insert / index_insert / data_insert: breakdowns
Block cache size
Increase block cache if
filter/index hit rate is low.
Too much CPU time spent on data decompression.
Decrease block cache if…
IO read latency is high (counterintuitive: a smaller block cache leaves more memory for the OS page cache)
As long as the filter/index hit rate stays high, and the extra CPU usage is acceptable
Hard to reason about. Need some experimentation.
curl the endpoint to obtain the file, then open in browser
Cluster deployed with TiUP
update <deploy_dir>/tikv-/scripts/run_tikv.sh to include
jeprof --pdf <tikv_binary_file> <profile_file>
Hardware: r5b.2xlarge (8 vCPU), EBS gp3 625MB/s, 6K IOPS
Where is the bottleneck?