
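# Monitoring the Monitor: Using Prometheus to Track ELK Log Throughput Metrics

## 1. What to Monitor in an ELK Cluster

### 1.1 Monitoring Dimensions

| Dimension | Key metrics | How it is monitored |
| --- | --- | --- |
| Node health | JVM heap, GC, CPU | Prometheus JMX Exporter |
| Write performance | Indexing rate, bulk latency | Elasticsearch Exporter |
| Query performance | Query latency, queue depth | Elasticsearch Exporter |
| Storage status | Disk usage, segment count | Node Exporter + ES API |
| Collection pipeline | Filebeat throughput, Kafka lag | Filebeat Exporter + KMI |

### 1.2 Prometheus Elasticsearch Exporter

```yaml
# docker-compose.es-monitor.yml
version: "3"
services:
  elasticsearch_exporter:
    image: quay.io/prometheuscommunity/elasticsearch-exporter:v1.6.0
    command:
      # Credentials are embedded in the URI as user:password@host
      - --es.uri=http://elastic:changeme@es-data-01:9200
      - --es.all
      - --es.cluster_settings
      - --es.indices
      - --es.indices_settings
      - --es.shards
      - --es.snapshots
      - --es.timeout=30s
    ports:
      - "9114:9114"
    restart: always
```

## 2. Key Metrics and PromQL

### 2.1 Write Performance

```promql
# 1. Indexing rate (documents written per second)
sum(rate(elasticsearch_indices_indexing_index_total[1m])) by (cluster)

# 2. Average indexing latency in ms per document.
#    This is a mean over the window, not a P99: the exporter exposes only
#    counters here, so a true percentile cannot be derived from them.
rate(elasticsearch_indices_indexing_index_time_seconds_total[5m])
  / rate(elasticsearch_indices_indexing_index_total[5m]) * 1000

# 3. Bulk rejections per second on the write thread pool
rate(elasticsearch_thread_pool_rejected_count{type="write"}[5m])

# 4. Write thread pool queue depth
elasticsearch_thread_pool_queue_count{type="write"}
```

### 2.2 Storage

```promql
# 1. Disk usage (%)
(elasticsearch_filesystem_data_size_bytes - elasticsearch_filesystem_data_free_bytes)
  / elasticsearch_filesystem_data_size_bytes * 100

# 2. Segment count (a high count suggests a force merge is needed)
elasticsearch_indices_segments_count

# 3. Node and shard status
elasticsearch_cluster_health_number_of_nodes
elasticsearch_cluster_health_active_shards_percent_as_number

# 4. Index size
elasticsearch_indices_store_size_bytes
```

### 2.3 JVM

```promql
# 1. Heap usage (%)
elasticsearch_jvm_memory_used_bytes{area="heap"}
  / elasticsearch_jvm_memory_max_bytes{area="heap"} * 100

# 2. GC time (seconds spent per second) and GC count (collections per second)
rate(elasticsearch_jvm_gc_collection_seconds_sum[5m])
rate(elasticsearch_jvm_gc_collection_seconds_count[5m])

# 3. Thread count
elasticsearch_jvm_threads_count
```

### 2.4 Filebeat Collection Pipeline

Filebeat's built-in monitoring module can expose its internal metrics:

```yaml
# filebeat.yml: enable monitoring
monitoring:
  enabled: true
  elasticsearch:
    hosts: ["http://localhost:9200"]
    username: filebeat
    password: ${FILEBEAT_PASSWORD}
```

The queries below assume these metrics reach Prometheus through the Filebeat Exporter listed in section 1.1.

```promql
# 1. Event collection rate
rate(filebeat_input_events_total[1m])

# 2. Output backlog: events handed to the output but not yet acknowledged.
#    This is a count of in-flight events, used here as a proxy for delay.
filebeat_output_events_total - filebeat_output_events_acked_total

# 3. Failed sends
rate(filebeat_output_failed_total[5m])
```

### 2.5 Kafka Lag (Critical)

Kafka consumer lag is one of the most sensitive metrics in the ELK pipeline: lag that keeps growing means Logstash is consuming more slowly than producers are writing.

```promql
# 1. Consumer lag (via Kafka Exporter)
kafka_consumergroup_lag{consumergroup="logstash-prod"}

# 2. Production rate vs. consumption rate
rate(kafka_topic_partition_current_offset[5m])
rate(kafka_consumergroup_current_offset{consumergroup="logstash-prod"}[5m])
```

## 3. Alert Rules

```yaml
# prometheus-rules/elk-alerts.yaml
groups:
  - name: elasticsearch
    rules:
      # Node down
      - alert: ESNodeDown
        expr: elasticsearch_cluster_health_number_of_nodes < 3
        for: 1m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "ES node down"

      # Cluster status red (some shards are unassigned)
      - alert: ESClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ES cluster status is red"

      # Heap usage too high
      - alert: ESHeapHighUsage
        expr: |
          (elasticsearch_jvm_memory_used_bytes{area="heap"}
            / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          # The exporter labels node-level series with "name"
          summary: "Heap usage on ES node {{ $labels.name }} is above 85%"

      # Write rejections
      - alert: ESBulkRejected
        expr: rate(elasticsearch_thread_pool_rejected_count{type="write"}[5m]) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "ES bulk writes are being rejected"

      # Disk usage
      - alert: ESDiskHighUsage
        expr: |
          (elasticsearch_filesystem_data_size_bytes - elasticsearch_filesystem_data_free_bytes)
            / elasticsearch_filesystem_data_size_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ES data disk usage is above 80%"

  - name: logstash
    rules:
      # Logstash process down
      - alert: LogstashDown
        expr: up{job="logstash"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Logstash instance {{ $labels.instance }} is unavailable"

  - name: filebeat
    rules:
      # Filebeat output backlog
      - alert: FilebeatOutputDelay
        expr: |
          (filebeat_output_events_total - filebeat_output_events_acked_total) > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Filebeat backlog exceeds 10000 events"

  - name: kafka
    rules:
      # Consumer lag above threshold
      - alert: KafkaConsumerLagHigh
        expr: |
          kafka_consumergroup_lag{consumergroup="logstash-prod"} > 50000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Kafka consumer lag exceeds 50000"
```

## 4. Grafana Dashboard: ELK Cluster Health Overview

We built an ELK cluster health dashboard with the following rows.

### 4.1 Row 1: Cluster Overview

- ES node count: gauge
- Cluster status: stat (green/yellow/red)
- Active shard percentage: gauge
- Pending tasks: gauge

### 4.2 Row 2: Write Performance

- Indexing rate (docs/s): timeseries × 3 nodes
- Bulk latency P50/P99: timeseries
- Write thread pool queue: timeseries

### 4.3 Row 3: Storage

- Disk usage: gauge × each node
- Top 10 indices by size: bar gauge
- Segment count trend: timeseries

### 4.4 Row 4: JVM

- Heap usage: timeseries × each node
- GC count per minute: timeseries
- Total GC pause time: timeseries

### 4.5 Row 5: Collection Pipeline

- Filebeat collection rate: timeseries
- Kafka topic lag: timeseries
- Logstash throughput: timeseries

## 5. Results in Practice

After this monitoring setup went live, it let us catch several critical problems early:

- **A 4 a.m. GC storm.** JVM heap usage crept from 60% up to 85% and triggered the warning alert. The investigation traced it to an index template deployed the previous day that did not set `refresh_interval`, which put heavy pressure on segment merging.
- **Abnormal Kafka lag growth.** After one major release, Logstash could not keep up with the incoming rate and lag rose from 5000 to 80000. We spotted it in Grafana and scaled out the Logstash consumers in time.
- **Disk growth trend warning.** By forecasting the disk usage trend, we found a capacity bottleneck 3 days ahead and adjusted the index lifecycle policy in time.

## Summary

ELK is the eyes of operations. If the eyes fail, the whole body suffers. The core idea behind building ELK self-monitoring with Prometheus is to monitor every stage of the pipeline: from Filebeat collection, through Kafka buffering and Logstash processing, to ES storage, the performance metrics of each stage must be visible and alertable. Remember: the observability of the observability system itself is the last line of defense for operations.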
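
As a closing illustration of the disk growth forecasting mentioned in section 5, the rule below is one possible sketch using PromQL's `predict_linear`, not the rule the team actually runs. The group name, alert name, 6-hour lookback window, 30-minute `for` duration, and the `name` label in the summary are all assumptions chosen for illustration; only the metric name and the 3-day horizon come from the article.

```yaml
# prometheus-rules/elk-capacity.yaml (hypothetical file name)
groups:
  - name: elasticsearch-capacity   # assumed group name
    rules:
      # Fit a linear trend to the last 6h of free space (assumed window) and
      # fire if the projected free space 3 days from now drops below zero.
      - alert: ESDiskWillFillIn3Days   # assumed alert name
        expr: |
          predict_linear(elasticsearch_filesystem_data_free_bytes[6h], 3 * 24 * 3600) < 0
        for: 30m   # assumed; avoids firing on short write bursts
        labels:
          severity: warning
        annotations:
          summary: "ES data disk on {{ $labels.name }} is projected to fill within 3 days"
```

A longer lookback window smooths out daily write peaks at the cost of reacting more slowly to a sudden change in growth rate, so the window is worth tuning against real disk history.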