Restarting a Large Cluster that Has Dropped Multiple TiKV Nodes
Occasionally, such as when an AWS availability zone (or even a whole region) goes down, or when the RocksDB component encounters a memory region that is in an inconsistent state, the entire cluster may shut down with one or more errors posted.
When an availability zone goes down, all the cluster nodes in that zone also go down and have to be failed over to another availability zone. If, on the other hand, the issue is with data consistency in the storage layer, then the TiKV instance will go down with an error like the following:
[_timestamp_] [FATAL] [server.rs:708] ["failed to start node: Engine(Other("[components/raftstore/src/store/fsm/store.rs:932\]: Engine Engine("Corruption: block checksum mismatch: expected 417475072, got 2324967102 in /mnt/data/ti/data/tikv-20160/raft/1537737.sst offset 6145532 size 17374")"))"]
In that case, the host may be available, but TiKV will refuse to restart, failing with the same error because of the data corruption. In addition, upon restarting the TiDB service, you may see a failure that mentions low-space-ratio. This means that the service does not think there is sufficient free space in a working volume to accommodate future needs.
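To confirm that a node is in the corruption case described above, it can help to pull the affected SST file paths out of the TiKV log. The following is a minimal sketch, not an official tool: the function name find_corrupt_sst is hypothetical, and it assumes the log contains the same "Corruption: block checksum mismatch" wording as the sample error shown earlier.

```shell
# Hypothetical helper: list the SST files named in RocksDB checksum
# corruption errors found in a TiKV log file.
# Usage: find_corrupt_sst /path/to/tikv.log
find_corrupt_sst() {
  # Keep only lines carrying the corruption message, then extract the
  # absolute path ending in .sst and de-duplicate the results.
  grep -F 'Corruption: block checksum mismatch' "$1" \
    | grep -oE '/[^ ]+\.sst' \
    | sort -u
}
```

If the function prints one or more paths, the node is failing on corrupted data rather than on an availability problem, and restarting it unchanged will produce the same error.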
When you are faced with a situation where one or more TiKV nodes have dropped off, be sure to replace the failing nodes before attempting to restart the cluster. This will avoid repeated failed restart attempts due to under-provisioned TiKV nodes.
Sometimes, even when your hosts have a lot of disk storage, TiKV may fail to start with a message about low-space-ratio in the logs. When this condition occurs, you can correct it by checking the low-space-ratio setting in the PD configuration.
The following is a sample command for obtaining the PD Configuration for a PD host named ‘pd1.pingcap.com’:
> tiup ctl:v5.2.1 pd config show all -u http://pd1.pingcap.com:2379 | jq '.schedule."low-space-ratio"'
You can see that low-space-ratio is set to the default of 0.8. This means that if a store sits on a 3 TB partition, it may have as much as 600 GB of free space and still not meet the free-space requirement.
You can correct this by changing the low-space-ratio setting to 0.9, which should allow TiKV to start with a lot less free storage space (<= 300 GB, in this case), with the following command:
> tiup ctl:v5.2.1 pd config set low-space-ratio 0.9 -u http://pd1.pingcap.com:2379
> Starting component `ctl`: /home/ubuntu/.tiup/components/ctl/v5.2.1/ctl pd config set low-space-ratio 0.9 -u http://pd1.pingcap.com:2379
> Success!
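The arithmetic behind the 3 TB partition example can be checked with a short shell sketch. It assumes that low-space-ratio is the used-space fraction above which a store counts as low on space, and that 3 TB is taken as 3000 GB; the function name min_free_gb is hypothetical and is not part of any TiDB tooling.

```shell
# Hypothetical helper: free space (in GB) a store must keep so that its
# used-space fraction stays at or below the given low-space-ratio.
# Usage: min_free_gb <capacity_gb> <low_space_ratio>
min_free_gb() {
  awk -v cap="$1" -v ratio="$2" \
    'BEGIN { printf "%.0f\n", cap * (1 - ratio) }'
}

min_free_gb 3000 0.8   # prints 600
min_free_gb 3000 0.9   # prints 300
```

Raising the ratio from 0.8 to 0.9 therefore halves the free space required on a volume of this size.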
OnCall-2633: Streak - TiKV corruption/checksum crashes
PD-3552: PD should consider the available storage size to detect low space store
All TiDB releases
#timeout, #low-space-ratio, #stability, #administration