How to Mute Unwanted Alerts in Prometheus
Summary
After wiring up Prometheus alerts to send text messages through PagerDuty, you will discover that our default alert manager configuration sends messages for alerts at all severity levels. The levels are described in TiDB Cluster Alert Rules:
| Severity level | Description |
| --- | --- |
| Emergency-level | The highest severity level at which the service is unavailable. Emergency-level alerts are often caused by a service or node failure. Manual intervention is required immediately. |
| Critical-level | Decreased service availability. For critical-level alerts, a close watch on the abnormal metrics is required. |
| Warning-level | Warning-level alerts are a reminder for an issue or error. |
The problem is that most users do not want to receive pager alerts for Warning-level messages - especially in the middle of the night!
Diagnosis
A typical alert manager configuration is wired as follows. The YAML file is found under [deployment_directory]/deploy/alertmanager-9093/conf/alertmanager.yml; the key fields to notice are the default receiver and the routing settings:
```yaml
route:
  # A default receiver
  receiver: "tidb-pagerduty"
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ["env", "instance", "alertname", "type", "group", "job"]
  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s
  # When the first notification was sent, wait for 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 3m
  # If an alert has successfully been sent, wait for 'repeat_interval' to
  # resend them.
  repeat_interval: 3m

receivers:
  - name: "tidb-pagerduty"
    pagerduty_configs:
      - routing_key: <PagerDuty API token>
```
Note that every alert, regardless of severity level, will be routed to the PagerDuty endpoint. This results in a lot of warning-level messages sending pager alerts to your IT team, which is probably not what you want. Here are some of the most common warning-level alerts:
- tikvclient_backoff_seconds
- tikv_coprocessor_request_error
- tidb_memory_abnormal
- tikv_pending_task
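To see which warning-level alerts are currently firing in your own cluster, you can query Prometheus's built-in ALERTS series through its HTTP API. This is a quick check, assuming Prometheus listens on its default port 9090 and the rules attach a "level" label as shown later in this article:

```shell
# List currently firing warning-level alerts
# (assumes Prometheus on the default port 9090 and a "level" label on the rules).
curl -s -G 'http://127.0.0.1:9090/api/v1/query' \
  --data-urlencode 'query=ALERTS{alertstate="firing", level="warning"}'
```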
Solution #1 - Mute All “Warning” Level Alerts
Our recommendation is to stop sending the "warning" alerts to the on-call PagerDuty endpoint. To accomplish this, you will need to create a route in the alert manager configuration that matches the level of the alert and routes it to some other receiver. The alertmanager.yml that we ship includes a "blackhole" receiver that simply discards the alert. You could also set up a separate, non-emergency PagerDuty or email receiver if you want to be able to review these alerts on a more leisurely schedule. Here's an example of the "route" section of an alertmanager.yml that sends all "warning" level alerts to blackhole:
```yaml
route:
  # A default receiver
  receiver: "tidb-pagerduty"
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ["env", "instance", "alertname", "type", "group", "job"]
  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s
  # When the first notification was sent, wait for 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 3m
  # If an alert has successfully been sent, wait for 'repeat_interval' to
  # resend them.
  repeat_interval: 3m
  # New: send all warning-level alerts to the blackhole receiver instead of PagerDuty.
  routes:
    - match:
        level: warning
      receiver: blackhole
```
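If the alertmanager.yml you are editing does not already define a blackhole receiver, note that in Alertmanager a receiver with no notification integrations simply drops the alerts routed to it. A minimal sketch (the name must match the receiver referenced in the route above):

```yaml
receivers:
  - name: "blackhole"
    # No notifier configs here: alerts routed to this receiver are grouped
    # by Alertmanager but never sent anywhere.
```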
Once you have made these modifications, you can apply them by following the instructions in our Update Configuration documentation. In summary, you will be doing the following:
- Make a copy of the alertmanager.yml file from the deployment directory to somewhere in your home directory:

  ```shell
  cp [tidb-installation-home]/tidb-deploy/alertmanager-9093/conf/alertmanager.yml /home/myusername/.
  ```

- Edit the local alertmanager.yml as described above.
- Edit the cluster configuration to let the server know where the new alertmanager.yml resides:

  ```shell
  tiup cluster edit-config <cluster-name>
  ```

  Then add the config_file field under the alertmanager_servers heading:
  ```yaml
  alertmanager_servers:
    - host: 127.0.0.1
      ssh_port: 22
      web_port: 9093
      cluster_port: 9094
      deploy_dir: /home/tidb/tidb-deploy/alertmanager-9093
      data_dir: /home/tidb/tidb-data/alertmanager-9093
      log_dir: /home/tidb/tidb-deploy/alertmanager-9093/log
      arch: amd64
      os: linux
      config_file: /home/myusername/alertmanager.yml
  ```
Finish up by reloading the alert manager service.
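For a cluster deployed with TiUP, as in the examples throughout this article, the reload command would look something like the following (substitute your own cluster name):

```shell
# Reload only the Alertmanager component so the new alertmanager.yml takes effect
tiup cluster reload -R alertmanager <cluster-name>
```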
And that’s it. You should stop seeing “warning” level alerts once the alert manager server is reloaded.
Solution #2 - Route All Warnings Elsewhere
If you do not want to throw away all “warning” level alerts, you could instead route them to another service, such as email, so that the alerts are retained for later review. Your configuration should look something like this:
```yaml
global:
  # The smarthost and SMTP sender are used for mail notifications.
  # smtp_smarthost: "mail-server-host"
  smtp_from: "alertmanager@example.org"
  smtp_auth_username: "username"
  smtp_auth_password: "password"
  # smtp_require_tls: true

receivers:
  - name: "db-alert-email"
    email_configs:
      - send_resolved: true
        to: "email@your-company.com"
```
Similarly, you can opt to send all “warning” level alerts to a Slack channel, which naturally serves as a storage mechanism and can be reviewed at any time. The Prometheus Alertmanager configuration documentation shows sample configurations for Slack, email, and many other notification channels.
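As an illustration (the webhook URL and channel name below are placeholders, not values shipped with TiDB), a Slack receiver would look roughly like this:

```yaml
receivers:
  - name: "db-alert-slack"
    slack_configs:
      # Incoming-webhook URL generated in your Slack workspace
      - api_url: "https://hooks.slack.com/services/<your-webhook-path>"
        channel: "#db-alerts"
        send_resolved: true
```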
Solution #3 - Change the Rules for a Single Alert
If you just want to mute a single alert, you can do it by changing the rules file for TiDB. We recommend against this, because your private copy of the rules file will shadow any rule updates included in a new version of TiDB. If you do decide to go this route, just remember that your upgrade runbook will have to include an extra step for loading the desired rules file after the upgrade is installed.
For instance, let’s say you have installed as user “tidb” and now want to mute the alert tidb_memory_abnormal. Here are the steps you would follow:
- Copy all the rules files from /home/tidb/tidb-deploy/prometheus-9090/conf/ to a folder in your home directory, then work through the steps below:
```shell
# Create a private rules directory and copy all the rules files from
# the deployed config to this new rules folder.
> cd
> mkdir myrules
> cp /home/tidb/tidb-deploy/prometheus-9090/conf/* /home/myuser/myrules
```
```yaml
# Now, in the myrules folder, comment out the TiDB_memory_abnormal alert:
#  - alert: TiDB_memory_abnormal
#    expr: go_memstats_heap_inuse_bytes{job="tidb"} > 1e+10
#    for: 1m
#    labels:
#      env: tidb-test
#      level: warning
#      expr: go_memstats_heap_inuse_bytes{job="tidb"} > 1e+10
#    annotations:
#      description: 'cluster: tidb-test, instance: {{ $labels.instance }}, values:{{ $value }}'
#      value: '{{ $value }}'
#      summary: TiDB heap memory usage is over 10 GB
# Save these edits and exit your editor.
```
```shell
# Next, run edit-config to add a rule_dir to the monitoring_servers (Prometheus) section as below:
> tiup cluster edit-config <cluster-name>
```
```yaml
monitoring_servers:
  - host: 127.0.0.1
    ssh_port: 22
    port: 9090
    deploy_dir: /home/tidb/tidb-deploy/prometheus-9090
    data_dir: /home/tidb/tidb-data/prometheus-9090
    log_dir: /home/tidb/tidb-deploy/prometheus-9090/log
    external_alertmanagers: []
    arch: amd64
    os: linux
    rule_dir: /home/myuser/myrules
```
```shell
# Once you save these changes, you make them effective
# by reloading the Prometheus server as follows:
> tiup cluster reload -R prometheus <cluster-name>
```
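If you want to confirm that the muted rule is no longer loaded, you can ask Prometheus for its active rules through the standard HTTP API. This check assumes Prometheus is listening on its default port 9090; no output means the rule is gone:

```shell
# The commented-out alert should no longer appear in the loaded rules
curl -s http://127.0.0.1:9090/api/v1/rules | grep -o 'TiDB_memory_abnormal'
```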
You can also silence an alert through the web UI. Silences are configured in the web interface of the Alert Manager (port 9093 by default) and are documented in the Prometheus Alertmanager documentation. As before, you should know that your upgrade runbook will have to include this extra step to be performed post-installation, as an upgrade will overwrite your changes to the Alert Manager.
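Silences can also be scripted against the Alertmanager API instead of the UI. A minimal sketch, assuming the Alert Manager is reachable on its default port 9093 and using the TiDB_memory_abnormal alert from the previous example (adjust the matcher and timestamps to your situation):

```shell
# Create a two-hour silence for a single alert via the Alertmanager v2 API
curl -X POST http://127.0.0.1:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{
        "matchers": [
          {"name": "alertname", "value": "TiDB_memory_abnormal", "isRegex": false}
        ],
        "startsAt": "2024-06-01T00:00:00Z",
        "endsAt": "2024-06-01T02:00:00Z",
        "createdBy": "oncall-dba",
        "comment": "Mute noisy memory warning pending review"
      }'
```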
Related Articles
Zendesk Ticket #226, OnCall-3222: Prometheus alerts triggering (possibly mistuned alert thresholds)
Zendesk Ticket #271: TiKV sporadic latency alerts
Applies To
All TiDB releases