This document describes how to connect iobackup-agent to
Prometheus, which metrics are available, and how to build basic
alerts and dashboards.
The agent exports metrics at:
GET /metrics
Example check:
curl -sS http://127.0.0.1:8735/metrics
The response contains:
- Go runtime metrics (go_*);
- process metrics (process_*);
- application metrics (iobackup_*).
Example prometheus.yml:
scrape_configs:
  - job_name: iobackup-agent
    metrics_path: /metrics
    scrape_interval: 15s
    scrape_timeout: 10s
    static_configs:
      - targets:
          - 127.0.0.1:8735
        labels:
          service: iobackup
          env: dev
If the agent is not reachable locally, replace the target with the
address of the node that runs the agent.
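To make the check of the /metrics response concrete, here is a minimal Python sketch that counts sample lines per metric family prefix (go_, process_, iobackup_). The SAMPLE text, the job_id value "nightly" and the helper names are illustrative assumptions, not real agent output; in practice the text would come from the curl command shown earlier.

```python
from collections import Counter

# Hypothetical excerpt of a /metrics response; real agent output will differ.
SAMPLE = """\
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 12
process_cpu_seconds_total 3.5
iobackup_runs_in_progress 0
iobackup_runs_finished_total{job_id="nightly",status="success"} 4
"""


def metric_name(line):
    """Return the metric name from one sample line of the text format."""
    # The name ends at the first '{' (labels) or the first space (value).
    for i, ch in enumerate(line):
        if ch in "{ ":
            return line[:i]
    return line


def count_by_prefix(text):
    """Count sample lines per prefix: go_, process_, iobackup_ or other."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name = metric_name(line)
        for prefix in ("go_", "process_", "iobackup_"):
            if name.startswith(prefix):
                counts[prefix] += 1
                break
        else:
            counts["other"] += 1
    return counts


print(count_by_prefix(SAMPLE))
```

A zero count for iobackup_ would suggest that the endpoint answers but the application metrics are not registered.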
Allowed labels (expected to be low-cardinality and safe):
- agent_id
- job_id
- task_id
- provider_type
- provider_kind
- status
- error_code
- stage
Forbidden / high-risk labels (do not add; they may explode cardinality or leak sensitive information):
- run_id
- task_run_id
- artifact_id
- backup_id
- request_id
- correlation_id
- error_message
- object_key
- path
- webhook_url
- vault_path
Note: the current metrics include the job_id, task_id and webhook
(webhook name) labels. This is acceptable only if the names of
webhooks and job/task configs are bounded by the configuration
(not generated dynamically on every run). If a stricter policy is
needed, the label set will be revisited in 0.17-fix.60+.
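The label policy above can be checked mechanically against a scrape. The following Python sketch is one possible approach, not part of the agent: it extracts label names from a sample line and reports the ones on the forbidden list, plus the ones that appear on neither list and may need review. The function names and the sample lines are hypothetical, and the regular expression is a simplification that can misfire on label values containing the sequence `="`.

```python
import re

# Label policy copied from the lists above.
ALLOWED = {
    "agent_id", "job_id", "task_id", "provider_type",
    "provider_kind", "status", "error_code", "stage",
}
FORBIDDEN = {
    "run_id", "task_run_id", "artifact_id", "backup_id", "request_id",
    "correlation_id", "error_message", "object_key", "path",
    "webhook_url", "vault_path",
}

# Matches a label name immediately followed by =" inside the braces.
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="')


def label_names(sample_line):
    """Return the set of label names used on one sample line."""
    start = sample_line.find("{")
    end = sample_line.rfind("}")
    if start == -1 or end < start:
        return set()  # the sample has no labels
    return set(LABEL_RE.findall(sample_line[start + 1:end]))


def forbidden_labels(sample_line):
    """Labels that the policy explicitly forbids."""
    return label_names(sample_line) & FORBIDDEN


def unlisted_labels(sample_line):
    """Labels that are on neither list and deserve a manual review."""
    return label_names(sample_line) - ALLOWED - FORBIDDEN


# Hypothetical sample line with one policy violation.
print(forbidden_labels('iobackup_demo_total{job_id="nightly",path="/tmp/x"} 1'))
```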
- iobackup_api_requests_total{method,route,status}
  status: HTTP response code.
- iobackup_api_request_duration_seconds{method,route,status}
  histogram (_bucket, _sum, _count).
- iobackup_api_inflight_requests{method,route}
- iobackup_jobs_submitted_total{status}
  status=success|error.
- iobackup_jobs_stored_total
- iobackup_runs_started_total{job_id}
- iobackup_runs_finished_total{job_id,status}
- iobackup_run_duration_seconds{job_id,status}
- iobackup_runs_in_progress
- iobackup_run_task_count{job_id}
- iobackup_last_run_finished_timestamp_seconds{job_id}
- iobackup_last_run_success_timestamp_seconds{job_id}
- iobackup_last_run_failed_timestamp_seconds{job_id}
- iobackup_run_status_transitions_total{status}
Task level:
- iobackup_tasks_started_total{job_id,task_id,source_type,destination_type}
- iobackup_tasks_finished_total{job_id,task_id,source_type,destination_type,status}
- iobackup_task_duration_seconds{job_id,task_id,source_type,destination_type,status}
- iobackup_tasks_in_progress{source_type,destination_type}
- iobackup_provider_operations_total{kind,provider,operation,status}
- iobackup_provider_operation_duration_seconds{kind,provider,operation,status}
Where:
- kind: source|destination|policy|notification;
- provider: for example filesystem, s3, ssh, postgres, mysql, webhook;
- operation: for example backup, put, get, put_manifest, apply, send.

- iobackup_backup_bytes_total{source_type,destination_type}
- iobackup_artifacts_created_total{destination_type}
- iobackup_artifact_downloads_total{destination_type,status}
- iobackup_artifact_deletes_total{status}
- iobackup_retention_cleanup_total{status,destination_type}
  status values (deleted, artifact_delete_failed, manifest_delete_failed).
- iobackup_verifications_total{destination_type,status}
  verification results (success|failed|error) by destination.
- iobackup_verification_duration_seconds{destination_type,status}
- iobackup_verify_after_run_total{status}
  status values (started|error|skipped_disabled|skipped_no_success_tasks).
- iobackup_notifications_total{scope,event,webhook,status}
- iobackup_notification_attempts{scope,event,webhook,status}
- iobackup_build_info{version,agent_id,hostname}=1
- go_*, process_*: standard technical process metrics.

Failed runs over 15 minutes:
sum(increase(iobackup_runs_finished_total{status="failed"}[15m]))
Failed tasks by destination over 15 minutes:
sum by (destination_type) (
  increase(iobackup_tasks_finished_total{status="failed"}[15m])
)
Run success ratio over 1 hour:
sum(increase(iobackup_runs_finished_total{status="success"}[1h]))
/
sum(increase(iobackup_runs_finished_total[1h]))
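The ratio query has an edge case worth keeping in mind: when no runs finished in the window, the denominator is zero (or the series is absent altogether), so the panel shows NaN or no data rather than 0% or 100%. The small Python sketch below mirrors the same arithmetic with an explicit guard; the function name and the input numbers are hypothetical.

```python
def success_ratio(success_increase, total_increase):
    """Return success/total, or None when no runs finished in the window."""
    if total_increase == 0:
        return None  # nothing finished: the ratio is undefined, not 0 or 1
    return success_increase / total_increase


# Hypothetical window: 9 successful runs out of 10 finished runs.
print(success_ratio(9, 10))
# Hypothetical idle window: no runs finished at all.
print(success_ratio(0, 0))
```

For alerting, an idle window usually needs its own rule (for example on the last-success timestamp metric) rather than relying on the ratio.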
Webhook success ratio over 30 minutes:
sum(increase(iobackup_notifications_total{status="success"}[30m]))
/
sum(increase(iobackup_notifications_total[30m]))
P95 API latency (in seconds):
histogram_quantile(
  0.95,
  sum by (le, route, method) (
    rate(iobackup_api_request_duration_seconds_bucket[5m])
  )
)
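To read the P95 value correctly it helps to know that histogram_quantile estimates the quantile by linear interpolation inside the bucket where the target rank falls, so the result depends on the bucket boundaries. The Python sketch below is a simplified re-implementation for illustration only, assuming classic cumulative buckets with a +Inf bucket; the bucket boundaries and counts are made up, and Prometheus itself applies the same idea to per-bucket rates rather than raw counts.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs."""
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    # The highest bucket must be +Inf and must have observed something.
    if not buckets or not math.isinf(buckets[-1][0]) or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls into +Inf: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return bound  # empty bucket, nothing to interpolate
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency buckets in seconds with cumulative counts.
example = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, example))
```

Because of the interpolation, a P95 that lands in a wide bucket (here 0.5s to 1.0s) is only an estimate; the true value can be anywhere inside that bucket.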
P95 task duration by source/destination:
histogram_quantile(
  0.95,
  sum by (le, source_type, destination_type) (
    rate(iobackup_task_duration_seconds_bucket[15m])
  )
)
Backup data write rate (bytes/sec) by destination:
sum by (destination_type) (
  rate(iobackup_backup_bytes_total[5m])
)
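The rate() used above works on a counter, which drops back to zero whenever the agent restarts. The Python sketch below shows, in simplified form, how a per-second rate can be derived from raw counter samples while treating any decrease as a reset. It is an illustration under stated assumptions, not the Prometheus implementation: Prometheus additionally extrapolates to the window boundaries, which is omitted here, and the sample values are invented.

```python
def simple_rate(samples):
    """Per-second rate from (timestamp_seconds, counter_value) samples.

    A drop in the counter value is treated as a reset (process restart),
    in which case the new value counts as the increase since the reset.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            increase += cur  # counter restarted from zero
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


# Hypothetical samples 15s apart; the agent restarts before the third one.
samples = [(0, 1000), (15, 3000), (30, 500), (45, 2500)]
print(simple_rate(samples))
```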
File iobackup-alerts.yml:
groups:
  - name: iobackup.rules
    rules:
      - alert: IobackupRunFailures
        expr: sum(increase(iobackup_runs_finished_total{status="failed"}[10m])) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "There are failed runs in iobackup"
          description: "Failed runs were detected in the last 10m."
      - alert: IobackupTaskFailuresHigh
        expr: sum(increase(iobackup_tasks_finished_total{status="failed"}[15m])) > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High level of task failures"
          description: "More than 3 failed tasks in 15m."
      - alert: IobackupWebhookFailures
        expr: sum(increase(iobackup_notifications_total{status="failed"}[10m])) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Webhook delivery errors"
          description: "There are failed webhook notifications."
      - alert: IobackupAPILatencyP95High
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(iobackup_api_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High API latency"
          description: "P95 API latency has been above 1s for 10m."

- labels (env, cluster, service) in Prometheus.
- job_id/task_id naming pattern;
- curl http://<agent>/metrics.
- scrape_config in Prometheus.
- /targets in the Prometheus UI.
- iobackup_runs_finished_total and iobackup_tasks_finished_total.