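Requirements:
Alert when a container in a pod restarts even once
Alert when a pod is not in the Running state
Alert when a container in a pod is not in the ready (true) state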
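These three requirements overlap somewhat.
While a pod restarts it necessarily passes through a non-Running state, so a restart alert is accompanied by a non-Running alert. And when a pod is not Running, its containers are not ready (true) either, which raises a third alert.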
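The alerts are therefore consolidated as:
Alert as soon as a pod restarts once
Alert when the pod is not Running and its containers are not true (#3) and the pod has not been deleted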
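The two consolidated rules can be sketched as plain shell predicates. This is only an illustration of the boolean logic: the function names `should_alert` and `restart_alert` and the `yes`/`no` deleted flag are assumptions made for the example, and the "#3" value-count qualifier is left to the Zabbix trigger rather than modelled here.

```shell
#!/bin/bash
# should_alert PHASE READY DELETED  (hypothetical helper)
# Succeeds (exit 0) when the pod is not Running, its containers are not
# ready, and the pod has not been deleted.
should_alert() {
    local phase="$1" ready="$2" deleted="$3"
    if [ "$phase" != "Running" ] && [ "$ready" != "true" ] && [ "$deleted" != "yes" ]; then
        return 0
    fi
    return 1
}

# restart_alert PREVIOUS CURRENT  (hypothetical helper)
# Succeeds when the restart count has increased since the last check.
restart_alert() {
    local previous="$1" current="$2"
    [ "$current" -gt "$previous" ]
}
```

For example, `should_alert Pending false no` succeeds, while a deleted pod (`should_alert Pending false yes`) stays silent.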
Create a template on the Zabbix server
Zabbix 3.2 template export, dated 2017-11-23T07:48:53Z (XML markup not preserved):
Template: OC Pods
Host group: OpenShift
Applications: restartCount, RunningStatus
Discovery rule: OC Pods Discover, key oc.pod.discover, update interval 300
Item prototypes:
    Pod {#POD_NAME} Restarts, key oc.pod.status[{#POD_NAME},restarts], update interval 30, application restartCount
    Pod {#POD_NAME} Running, key oc.pod.status[{#POD_NAME},running], update interval 30, application RunningStatus
    Pod {#POD_NAME} Running True, key oc.pod.status[{#POD_NAME},running_true], update interval 30, application RunningStatus
Trigger prototypes:
    Pod {#POD_NAME} No Running
        expression: {OC Pods:oc.pod.status[{#POD_NAME},running].str(Running_true)}=0 and {OC Pods:oc.pod.status[{#POD_NAME},running].str(Pod deleted)}=0 and {OC Pods:oc.pod.status[{#POD_NAME},running_true].last(#5)}=0
    Pod {#POD_NAME} restarted Warning
        expression: {OC Pods:oc.pod.status[{#POD_NAME},restarts].str(Warning)}=1
        recovery expression: {OC Pods:oc.pod.status[{#POD_NAME},restarts].str(Warning,#3)}=0
Create one discovery rule with three item prototypes, one for each of the three requirements above
Zabbix agent
Append the following to the end of the configuration file
# vim zabbix_agentd.conf
UserParameter=oc.pod.discover,/data/app/zabbix/etc/oc_pod_discover.sh
UserParameter=oc.pod.status[*],/data/app/zabbix/etc/oc_pod_monitor.sh $1 $2
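With a flexible key such as `oc.pod.status[*]`, the values inside the square brackets are handed to the script as `$1` and `$2`. The sketch below only mimics that mapping for simple, unquoted parameters so the relationship is easier to see. The function name `split_key_params` and the pod name `my-pod` are made up for the example, and the agent's real parser handles quoting rules that this does not.

```shell
#!/bin/bash
# split_key_params KEY  (hypothetical helper, illustration only)
# Prints each comma-separated parameter of an item key on its own line,
# e.g. oc.pod.status[my-pod,restarts] -> "my-pod" then "restarts".
split_key_params() {
    local key="$1"
    local inside="${key#*\[}"   # drop everything up to the first '['
    inside="${inside%\]}"       # drop the trailing ']'
    local -a params
    local IFS=','
    read -r -a params <<< "$inside"
    printf '%s\n' "${params[@]}"
}
```

After restarting the agent, a key can be checked locally with the agent's test mode, for instance `zabbix_agentd -t 'oc.pod.status[my-pod,restarts]'`.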
Discovery script
# vim oc_pod_discover.sh
#!/bin/bash
TOKEN="123456"
ENDPOINT="www.oc.domain.cn:8443"
WORKSPACE="/data/tmp/oc_monitor"
mkdir -p "$WORKSPACE"

# Fetch all pods and keep only the pod names
curl -k \
    -H "Authorization: Bearer $TOKEN" \
    -H 'Accept: application/json' \
    "https://$ENDPOINT/api/v1/pods" 2>/dev/null > "$WORKSPACE/all_pods.json"
Pod_Name=($(cat "$WORKSPACE/all_pods.json" |jq -r '.items | .[] | .metadata | .name' |grep -v build |grep -v deploy))

# Convert the list to the JSON format used by Zabbix low-level discovery
printf "{\n"
printf '\t"data":[\n'
for ((i=0;i<${#Pod_Name[@]};i++))
do
    printf '\t\t{\n'
    num=$((${#Pod_Name[@]}-1))
    if [ "$i" == "${num}" ]; then
        # Last element: no trailing comma
        printf "\t\t\t\"{#POD_NAME}\":\"${Pod_Name[$i]}\"}\n"
    else
        printf "\t\t\t\"{#POD_NAME}\":\"${Pod_Name[$i]}\"},\n"
    fi
done
printf "\t]\n"
printf "}\n"
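The JSON-building loop in the discovery script can also be written without tracking the last index, by emitting the separator before every element except the first. The following is a minimal sketch of that idea, assuming pod names contain no characters that need JSON escaping. `build_lld_json` is a name chosen for this example and is not part of the original script.

```shell
#!/bin/bash
# build_lld_json NAME...  (hypothetical helper)
# Prints a low-level discovery document of the form
# {"data":[{"{#POD_NAME}":"name"},...]} for the given pod names.
build_lld_json() {
    local sep="" name
    printf '{"data":['
    for name in "$@"; do
        # The separator is empty for the first element and "," afterwards
        printf '%s{"{#POD_NAME}":"%s"}' "$sep" "$name"
        sep=","
    done
    printf ']}\n'
}
```

In the discovery script it could be called as `build_lld_json "${Pod_Name[@]}"`. An empty pod list yields `{"data":[]}`.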
Monitoring script
# vim oc_pod_monitor.sh
#!/bin/bash
TOKEN="123456"
ENDPOINT="www.oc.domain.cn:8443"
POD_NAME="$1"
Monitoring_type="$2"
WORKSPACE="/data/tmp/oc_monitor"
mkdir -p "$WORKSPACE"

# Look up the pod's namespace by pod name (all_pods.json is refreshed every 5 minutes)
NAMESPACE="$(cat "$WORKSPACE/all_pods.json" |jq -r '.items |.[] |.metadata |.name,.namespace' |grep -A1 "$POD_NAME" |grep -v "$POD_NAME")"

# Check whether the pod still exists
if [ -z "$NAMESPACE" ]; then
    if [ "$Monitoring_type" = "running_true" ]; then
        echo "1"
        exit 0
    fi
    echo "Pod deleted"
    exit 0
fi

# The restarts item fetches the pod status and writes it to a file shared by all items.
# This runs before the checks below; otherwise a missing or Pending status file
# would make the script exit early and the file would never be refreshed.
if [ "$Monitoring_type" = "restarts" ]; then
    curl -k \
        -H "Authorization: Bearer $TOKEN" \
        -H 'Accept: application/json' \
        "https://${ENDPOINT}/api/v1/namespaces/$NAMESPACE/pods/$POD_NAME/status" 2>/dev/null > "$WORKSPACE/${POD_NAME}.status"
    # Remove cache files older than three days
    find /data/tmp/oc_monitor/ -type f -mtime +3 -name "*" -exec rm -f {} \;
fi

# Load the cached pod status data
if [ ! -f "$WORKSPACE/${POD_NAME}.status" ]; then
    if [ "$Monitoring_type" = "running_true" ]; then
        echo "1"
        exit 0
    fi
    echo "New Pod"
    exit 0
fi
Pod_Status="$(cat "$WORKSPACE/${POD_NAME}.status")"

# Check whether the pod is in the Pending phase
Pending="$(echo "$Pod_Status" |jq -r '.status |.phase')"
if [ "$Pending" = "Pending" ]; then
    if [ "$Monitoring_type" = "running_true" ]; then
        echo "0"
        exit 0
    fi
    echo "Pending"
    exit 0
fi

# Select which value to return
case $Monitoring_type in
    restarts)
        # Check whether the pod has restarted
        ## Previous value, read from the restartCount file written by the last run
        A_line=$(sed -n 1p "$WORKSPACE/${POD_NAME}.restartCount")
        B_line_null="$(sed -n 2p "$WORKSPACE/${POD_NAME}.restartCount")"
        if [ -z "$B_line_null" ]; then
            # Handle pods that have two restartCount values: default the second to 0 when absent
            B_line="0"
        else
            B_line=$(sed -n 2p "$WORKSPACE/${POD_NAME}.restartCount")
        fi
        Last_state=$(expr $A_line + $B_line)

        ## Current value: keep only restartCount from the pod status
        echo "$Pod_Status" |jq -r '.status |.containerStatuses |.[] |.restartCount' > "$WORKSPACE/${POD_NAME}.restartCount"
        A_line=$(sed -n 1p "$WORKSPACE/${POD_NAME}.restartCount")
        B_line_null="$(sed -n 2p "$WORKSPACE/${POD_NAME}.restartCount")"
        if [ -z "$B_line_null" ]; then
            # Handle pods that have two restartCount values: default the second to 0 when absent
            B_line="0"
        else
            B_line=$(sed -n 2p "$WORKSPACE/${POD_NAME}.restartCount")
        fi
        Current_state=$(expr $A_line + $B_line)

        # Compare the current restartCount with the previous one
        if [ "$Current_state" -gt "$Last_state" ]; then
            Restart_status="Warning restart_count=$Current_state"
        else
            Restart_status="Normal restart_count=$Current_state"
        fi
        echo "$Restart_status"
        ;;
    running)
        # Report the pod phase and container readiness as a string
        if [ -z "$Pod_Status" ]; then
            echo "New Pod"
            exit 0
        fi
        running_status=$(echo "$Pod_Status" |jq -r '.status |.phase')
        Container_status="$(echo "$Pod_Status" |jq -r '.status |.containerStatuses |.[] |.ready' |grep false)"
        if [ -z "$Container_status" ]; then
            Container_status="_true"
        else
            Container_status="_false"
        fi
        echo "${running_status}${Container_status}"
        ;;
    running_true)
        # Report container readiness as a number
        if [ -z "$Pod_Status" ]; then
            echo "New Pod"
            exit 0
        fi
        Container_status="$(echo "$Pod_Status" |jq -r '.status |.containerStatuses |.[] |.ready' |grep false)"
        if [ -z "$Container_status" ]; then
            Container_status="true"
        else
            Container_status="false"
        fi
        if [ "$Container_status" = "true" ]; then
            echo "1"
        else
            echo "0"
        fi
        ;;
    *)
        echo "Error parameters"
        exit 0
        ;;
esac
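The restarts branch above adds only the first two lines of the restartCount file, so a pod with three or more containers would have some counts ignored. One possible alternative, shown here as a sketch rather than as part of the original script, is to sum every line with `awk`. The helper names `sum_restart_counts` and `restart_state` are made up for this example. The output strings follow the `Warning restart_count=N` / `Normal restart_count=N` format that the trigger expressions match on.

```shell
#!/bin/bash
# sum_restart_counts  (hypothetical helper)
# Reads one restartCount per line on stdin and prints the total.
# Empty input prints 0.
sum_restart_counts() {
    awk '{ total += $1 } END { print total + 0 }'
}

# restart_state PREVIOUS CURRENT  (hypothetical helper)
# Prints "Warning ..." when the total increased, otherwise "Normal ...".
restart_state() {
    local previous="${1:-0}" current="${2:-0}"
    if [ "$current" -gt "$previous" ]; then
        echo "Warning restart_count=$current"
    else
        echo "Normal restart_count=$current"
    fi
}
```

Assuming the same file layout as the script, the previous total could be read with `sum_restart_counts < "$WORKSPACE/${POD_NAME}.restartCount"` before the file is overwritten with the current values.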