This post covers Apache Flink's built-in metrics system and how to use Prometheus to monitor streaming applications efficiently.

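The setup is based on PushGateway + Prometheus. Flink jobs first push their metrics to PushGateway. Prometheus then scrapes those values from PushGateway, and finally Grafana displays the values stored in Prometheus, as shown in the diagram below.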

graph LR
Flink-->PushGateway
PushGateway-->Prometheus
Prometheus-->Grafana

Environment setup

  • Flink
  • PushGateway
  • Prometheus
  • Grafana

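Since all of these components are already installed, this post does not cover their installation. It only records the changes made on top of the existing environment.

The configuration file flink-conf.yaml needs to be modified: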

metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
# Host name and port of PushGateway
metrics.reporter.promgateway.host: xxx.com
metrics.reporter.promgateway.port: 9091
# Job name (prefix) under which the Flink metrics appear, and whether to append a random suffix
metrics.reporter.promgateway.jobName: flink-metrics-
metrics.reporter.promgateway.randomJobNameSuffix: true
# Delete the metrics from PushGateway when the cluster shuts down
metrics.reporter.promgateway.deleteOnShutdown: true
metrics.reporter.promgateway.interval: 10 SECONDS
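
To check that the reporter is actually pushing after the cluster restarts, it can help to list the job labels PushGateway currently exposes. A minimal sketch, assuming PushGateway's text endpoint /metrics is reachable at the host and port configured above; list_jobs is a hypothetical helper name, not part of Flink or PushGateway:

```shell
#!/bin/bash
# Assumption: same host and port as in flink-conf.yaml
PUSHGATEWAY="${PUSHGATEWAY:-xxx.com:9091}"

# Read Prometheus text-format metrics on stdin and print each distinct job label once
list_jobs() {
  grep -o 'job="[^"]*"' | cut -d'"' -f2 | sort -u
}

# Usage (needs network access to PushGateway):
#   curl -s "http://$PUSHGATEWAY/metrics" | list_jobs
```

With randomJobNameSuffix enabled, each JobManager and TaskManager should show up under its own job name starting with flink-metrics-.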

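deleteOnShutdown has a bug: it only deletes the JobManager's metrics. The workaround is a script, run periodically via crontab, that deletes the stale TaskManager metrics. Crontab entry, run every minute:

* * * * * bash /pushgateway-1.4.1.linux-amd64/cleanup.sh >> /pushgateway-1.4.1.linux-amd64/cleanup.out

Cleanup script: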

#!/bin/bash
# Deletes metric groups from PushGateway whose last push is older than MILLS seconds.

# Maximum allowed age of the last push, in seconds (despite the variable name)
export MILLS=60
# PushGateway host
export ADD='xxx.com'

# Fetch the push_time_seconds object of every metric group
export PUSH_TIME_SECONDS=$(curl -s -X GET http://$ADD:9091/api/v1/metrics | jq --raw-output '.data[].push_time_seconds')

# Parallel arrays: last push timestamp and job label of each group
export PUSH_TIME=($(echo "$PUSH_TIME_SECONDS" | jq --raw-output '.time_stamp'))
export JOB=($(echo "$PUSH_TIME_SECONDS" | jq --raw-output '.metrics[0].labels.job'))

export NUM=${#PUSH_TIME[@]}

CURRENT_TIME=$(date +%s)

for ((i=0;i<NUM;i++))
do
  # Convert the push timestamp to epoch seconds and compute its age
  push_mill=$(date -d "${PUSH_TIME[i]}" +%s)
  let gaps=$CURRENT_TIME-$push_mill

  # Delete the whole group for this job once it is older than the threshold
  if [[ $gaps -ge $MILLS ]];then
    echo $(date +%Y%m%d_%H:%M:%S) gaps: $gaps 'deleting' ${JOB[i]}
    curl -X DELETE http://$ADD:9091/metrics/job/${JOB[i]}
  fi
done
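
The age check the script performs for each group can be tried on its own, for instance to settle on a threshold before enabling the cron job. A minimal sketch, assuming GNU date (as the script itself does); is_stale is a hypothetical helper name:

```shell
#!/bin/bash
# Succeed (exit 0) when push_time is at least max_age seconds older than now.
#   $1: push timestamp in a format GNU date understands (e.g. ISO 8601)
#   $2: current time as epoch seconds
#   $3: maximum allowed age in seconds
is_stale() {
  local push_time="$1" now="$2" max_age="$3" push_epoch
  # Convert the timestamp to epoch seconds; give up if it cannot be parsed
  push_epoch=$(date -d "$push_time" +%s) || return 2
  [ $((now - push_epoch)) -ge "$max_age" ]
}

# Example: a group last pushed in 2021 is stale against a 60 second threshold
if is_stale "2021-03-01T10:00:00+08:00" "$(date +%s)" 60; then
  echo "stale"
fi
```

Since the reporter pushes every 10 seconds in the configuration above, a 60 second threshold leaves room for a few missed pushes before a live TaskManager would be deleted by mistake.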

PushGateway

Installation is not covered here: download the release, unpack it, and create a script that starts pushgateway:

ps -ef | grep pushgateway |grep -v simple | grep -v grep | awk '{print $2}' | xargs kill -9
nohup ./pushgateway  --web.enable-admin-api > pushgateway.out 2>&1 &
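
The --web.enable-admin-api flag turns on PushGateway's admin API, whose wipe endpoint removes every metric group at once. That can serve as a blunt alternative to per-job deletion, for example after a full cluster restart. A sketch, assuming the same xxx.com:9091 address; wipe_pushgateway is a hypothetical helper name:

```shell
#!/bin/bash
# Assumption: PushGateway address used elsewhere in this post
PUSHGATEWAY="${PUSHGATEWAY:-xxx.com:9091}"

# Delete ALL metric groups; only works when PushGateway runs with --web.enable-admin-api
wipe_pushgateway() {
  curl -s -X PUT "http://$1/api/v1/admin/wipe"
}

# Usage (destructive, so left commented out):
#   wipe_pushgateway "$PUSHGATEWAY"
```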

Prometheus

Installation is skipped. Startup command:

./prometheus --config.file=./prometheus.yml --storage.tsdb.retention.time=15d --web.enable-lifecycle > prometheus.out 2>&1 &

Modify the configuration file and add:

scrape_configs:
  - job_name: 'pushgateway'
    static_configs:
      - targets: ['xxx.com:9091']
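
Because Prometheus is started with --web.enable-lifecycle above, the new scrape job can be picked up without a restart by POSTing to the /-/reload endpoint. A sketch, assuming Prometheus listens on its default port 9090 on the local machine (the post does not state the address); reload_prometheus is a hypothetical helper name:

```shell
#!/bin/bash
# Assumption: Prometheus address; 9090 is the default port
PROMETHEUS="${PROMETHEUS:-localhost:9090}"

# Ask Prometheus to re-read its configuration file and print the HTTP status code
reload_prometheus() {
  curl -s -o /dev/null -w '%{http_code}\n' -X POST "http://$1/-/reload"
}

# Usage:
#   reload_prometheus "$PROMETHEUS"
```

A 200 status should indicate that the configuration was reloaded, after which the pushgateway job should appear among the Prometheus targets.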

Grafana

Installation and deployment are skipped; go straight to configuring the dashboards.