TiKV Performance Tuning
Outline
TiKV Architecture
Overall Resource Usage
Tuning for Writes
Tuning for Reads
Profiling Tools
Practice Exercises
TiKV Architecture
TiKV Whole Picture
Layered Design
TiKV Module
- Transactions use the Scheduler pool
- Raft uses the Raftstore pool
- RocksDB uses the Apply thread pool
Overall Resource Usage
CPU util
- TiKV is CPU bound for most workloads.
- Since TiKV does not use async IO (for now), 70%-80% CPU util per core usually means the CPU is fully utilized.
- Grafana shows CPU usage as cpu_util * #cpu_cores.
~700% CPU usage on an 8-vCPU TiKV node therefore means TiKV is CPU bound.
CPU util of individual thread pools
- Even if the overall CPU is not fully utilized, individual thread pools may be fully utilized.
- Increase #threads in the specific thread pool to match your usage.
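As a rough sketch, these are the main per-pool size options in the TiKV config file (e.g. tikv.toml); the values below are illustrative only, since defaults vary by TiKV version and CPU count:
# Illustrative values only; size each pool to its observed CPU usage.
readpool.unified.max-thread-count = 8     # unified read pool
storage.scheduler-worker-pool-size = 8    # transaction scheduler pool
raftstore.store-pool-size = 2             # Raftstore (append) pool
raftstore.apply-pool-size = 2             # apply pool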
Disk IO
Check if disk latency/IOPS/throughput are high or have spikes.
For cloud disks (AWS EBS, GCP persistent disks, etc.), check whether IOPS or throughput is hitting the provisioned limit.
Some cloud vendors also impose a per-VM-instance IOPS/throughput limit.
TiKV Disk IO tasks
The disk performance dashboard only shows average IO latency.
Major TiKV IO tasks
Foreground
append log: raftdb WAL write / fsync.
apply log: kvdb WAL write.
storage read: kvdb get/scan.
coprocessor read: kvdb scan.
Background
RocksDB flush/compaction
Raftstore snapshots
backup, data import, etc.
RocksDB WAL duration
raftdb and kvdb WAL write duration and fsync duration have a big impact on performance, especially for write-heavy workloads.
Disk performance directly affects WAL write and fsync duration.
fsync
raftdb fsyncs after each write.
kvdb fsyncs only on Raftstore admin commands (e.g., the raft log GC command)
TiKV Memory Usage
Cluster Memory
TiKV memory allocations are all through Jemalloc.
Jemalloc stats
resident: approximation of memory used by Jemalloc.
allocated: memory requested by TiKV.
Difference between “resident” and “allocated” represents memory fragmentation.
RocksDB block cache
Shared by raftdb and kvdb.
Check `storage.block-cache.capacity` config if block cache size is not as expected.
Cache needs to warm up after TiKV starts.
Warm-up is slow for write-only workloads.
Block cache sizes are the same under the “RocksDB – kv” and “RocksDB – raft” panels, since raftdb and kvdb share the block cache.
Tuning for Writes
TiKV Write Flow
gRPC duration
- Time spent processing inside TiKV (excluding network latency)
TiKV Scheduler Pool
scheduler worker pool
Is the scheduler worker pool almost full?
If so, increase storage.scheduler-worker-pool-size.
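A minimal sketch in tikv.toml (the value is illustrative; raise it only if the scheduler pool is near saturation):
# Transaction scheduler worker threads.
storage.scheduler-worker-pool-size = 8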
Storage Async Write Duration
- Is it slow in Raftstore?
TiKV Raftstore Pool
Raftstore Apply Pool and Append Pool
Are they balanced?
Are raftstore CPUs balanced between TiKV instances?
Are apply CPUs balanced between TiKV instances?
Is the raftstore pool or apply pool almost full? Increase if needed.
raftstore.store-pool-size (default: 2)
raftstore.apply-pool-size (default: 2)
Increasing their size sometimes increases performance, even if they are not full.
Hard to reason about; needs some experimentation.
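For example, in tikv.toml (illustrative values; re-measure after each change since the effect is workload dependent):
# Defaults are 2 for both pools.
raftstore.store-pool-size = 4
raftstore.apply-pool-size = 4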
Enable Hibernate Regions
Raftstore heartbeat can consume a lot of CPU
Especially when the region count per node is huge (e.g., high data volume)
Hibernate region
Reduce raftstore heartbeat if there’s no traffic for a region
Will be enabled by default from 5.0.2
Consider enabling it if
Heartbeat messages outnumber append messages in Raftstore
Raftstore per-thread CPU util > 80%
Check “Messages” in the “Raft message” panel to see whether heartbeat messages outnumber append messages.
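To turn it on explicitly in versions where it is not yet the default, a sketch in tikv.toml (assuming the raftstore.hibernate-regions option name):
# Stop heartbeating regions that have no traffic.
raftstore.hibernate-regions = true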
Raftstore Append and Apply Duration
Raftstore flow control
Are log commits slow?
Make flow control more aggressive.
raftstore.raft-max-inflight-msgs and raftstore.raft-max-size-per-msg
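Both knobs live in tikv.toml; the values below are the commonly cited defaults (verify against your version) and are shown only as a starting point for experimentation:
# Flow-control window for Raft replication: smaller values throttle leaders
# earlier when followers fall behind; larger values allow more data in flight.
raftstore.raft-max-inflight-msgs = 256
raftstore.raft-max-size-per-msg = "1MB"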
TiKV Write Flow to RocksDB
RocksDB Write Duration
- Are RocksDB writes slow?
RocksDB write steps
Enter and wait in write queue
Preprocessing (switch memtable, etc.)
Write WAL and fsync (disk IO)
Write memtable
Post-processing (update seqno)
High WAL/sync duration means IO is slow
Check disk metrics
Check if RocksDB compaction flow is too high or too spiky
Dealing with Write Stall
Write stall means compaction is not able to catch up
Identify stall reason from metrics
Identify resource limit
reduce CPU consumption
change compression level
rocksdb.[cf].compression-per-level
reduce disk bandwidth consumption
consider enabling Titan
For temporary write stalls, increase the stall triggers
rocksdb.[cf].level0-slowdown-writes-trigger
rocksdb.[cf].level0-stop-writes-trigger
rocksdb.[cf].soft-pending-compaction-bytes-limit
rocksdb.[cf].hard-pending-compaction-bytes-limit
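These triggers are set per column family; a sketch for defaultcf in tikv.toml (values are illustrative, not recommendations):
# Loosening the triggers only delays the stall; it does not fix a compaction backlog.
rocksdb.defaultcf.level0-slowdown-writes-trigger = 30
rocksdb.defaultcf.level0-stop-writes-trigger = 40
rocksdb.defaultcf.soft-pending-compaction-bytes-limit = "256GB"
rocksdb.defaultcf.hard-pending-compaction-bytes-limit = "512GB"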
Throttling Background IO
Reduce impact of background IO (mainly RocksDB compaction)
RocksDB rate limiter
Can only limit compaction IO
rocksdb.rate-bytes-per-sec = "100MB" (throttle compaction to 100MB/s)
auto-tuned rate limit to smooth compaction flow (since 5.0.0)
TiKV rate limiter (will be available in 5.1.0)
Can limit total IO (including Raftstore snapshot, data import/export IO, etc.)
storage.io-rate-limit.max-bytes-per-sec = "100MB"
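A combined sketch in tikv.toml (values are illustrative; rate-limiter-auto-tuned and the storage.io-rate-limit option correspond to the 5.0.0 and 5.1.0 features mentioned above):
# RocksDB-level limiter: throttles compaction IO only.
rocksdb.rate-bytes-per-sec = "100MB"
rocksdb.rate-limiter-auto-tuned = true
# TiKV-level limiter (5.1.0+): also covers snapshots, import/export, etc.
storage.io-rate-limit.max-bytes-per-sec = "100MB"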
Titan
Separates large values from RocksDB.
Large values stored in Titan do not participate in RocksDB compaction.
Reduces the disk throughput consumed by compaction.
Stale values are cleaned up by Titan GC.
Consider enabling Titan if …
Workload writes a lot of large values (> 1 KB), and
Disk throughput is the bottleneck, and
There are no scans, or scan performance does not matter.
Enabling Titan
rocksdb.titan.enabled = true
Has some more tunables.
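A minimal sketch in tikv.toml; min-blob-size is one of those tunables (the value is illustrative, values below it stay in the RocksDB LSM tree):
rocksdb.titan.enabled = true
# Only values at or above this size are separated into Titan blob files.
rocksdb.defaultcf.titan.min-blob-size = "1KB"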
Tuning For Reads
TiKV Read Flow
Read Thread Pools
Unified thread pool
Implements a multilevel feedback queue
Adaptive thread count
Fairness between small and large tasks
Storage read pool / coprocessor read pool
Simple thread pool with priority
Read thread pools (2)
| Default value | Storage Read | Coprocessor Read |
| Unified Thread Pool | 5.0, if no config specified | 4.0 / 5.0, if no config specified |
| Copr Read Pool / Storage Read Pool | <= 4.0, or config specified | <= 3.0, or config specified |
The unified thread pool is enabled by default only when no other configuration is specified for “readpool.storage” and “readpool.coprocessor” respectively
Config
readpool.storage.use-unified-pool = true
readpool.coprocessor.use-unified-pool = true
Adjust Thread Pool Size
Is the thread pool almost full?
Increase readpool.unified.max-thread-count
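For example, in tikv.toml (the count is illustrative; keep the total thread count of all pools within the node's vCPU budget):
readpool.unified.max-thread-count = 10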
Coprocessor Wait Duration
Is the wait duration too high?
Too many large scans
Imbalance
Thread pool is too small.
Wrong query plan
Block Cache
Block cache hit rate
A low filter/index hit rate (< 95%) indicates the block cache is too small.
The data hit rate depends on block cache size and data locality.
Block cache flow
total_written: approximation of disk read IO size (plus page cache hits)
filter_insert / index_insert / data_insert: breakdowns
Block cache size
storage.block-cache.capacity
Increase block cache if
filter/index hit rate is low.
Too much CPU time spent on data decompression.
Decrease block cache if…
IO read latency is high (counter-intuitive: a smaller block cache leaves more memory for the OS page cache, which holds compressed blocks and can absorb more reads)
As long as the filter/index hit rate stays high and the extra CPU spent on decompression is acceptable
Hard to reason about. Need some experimentation.
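A sketch of resizing it in tikv.toml (the value is illustrative; the default is roughly 45% of system memory, so adjust from the metrics above):
storage.block-cache.capacity = "16GB"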
Profiling Tools
CPU Profiling
Profiling endpoint
http://<tikv_server>:<status_port>/debug/pprof/profile?seconds=60
curl the endpoint to obtain the profile file, then open it in a browser
Memory profiling
Profiling endpoint
http://<tikv_server>:<status_port>/debug/pprof/heap?seconds=60
Cluster deployed with TiUP
update <deploy_dir>/tikv-/scripts/run_tikv.sh to include
export MALLOC_CONF="prof:true,prof_active:true"
Generating call-graph
jeprof --pdf <tikv_binary_file> <profile_file> > <output_pdf>
Practice Exercises
Scenario 1
TPCC workload
Hardware: r5b.2xlarge (8 vCPU), EBS gp3 625MB/s, 6K IOPS
Where is the bottleneck?