
Prometheus FAQ & Answers

66 expert Prometheus answers researched from official documentation. Every answer cites authoritative sources you can verify.


66 questions
A

Prometheus is an open-source CNCF graduated project for monitoring and alerting with custom TSDB optimized for time series data. Core features: (1) Pull-based model scraping HTTP /metrics endpoints at configured intervals. (2) Multi-dimensional data model using labels: metric_name{label1="value1"}. (3) PromQL query language for powerful time series analysis. (4) Local TSDB with 2-hour blocks and WAL for durability. (5) Alertmanager integration for routing, grouping, silencing. (6) Service discovery (Kubernetes, Consul, EC2, DNS, file). (7) 200+ exporters for third-party systems. (8) Grafana integration for visualization. (9) Recording rules for pre-computation. (10) Remote storage support (Thanos, Mimir, VictoriaMetrics). Written in Go, Prometheus is the de facto standard for cloud-native observability, supporting infrastructure monitoring, application metrics, and business KPIs.

99% confidence
A

Prometheus pull model actively scrapes HTTP /metrics endpoints from targets at configured intervals (scrape_interval: 15s to 1m). Targets discovered via static_configs or service discovery (kubernetes_sd_configs, consul_sd_configs, ec2_sd_configs, dns_sd_configs, file_sd_configs). Configuration: scrape_configs: - job_name: 'api' scrape_interval: 30s static_configs: - targets: ['api:8080']. Benefits: (1) Centralized configuration - targets don't need Prometheus knowledge. (2) Target health detection via up metric (1=healthy, 0=down). (3) Easier debugging - manual scrapes with curl http://target:8080/metrics. (4) No firewall changes for targets. (5) Pull scheduling controlled by Prometheus. For short-lived jobs (batch, cron), use Pushgateway as bridge. Pull model simplifies operations and enables automatic service discovery in dynamic environments.
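
Laid out as YAML, the scrape config sketched in this answer looks like the following (the job name, interval, and target are illustrative placeholders):

scrape_configs:
  - job_name: 'api'
    scrape_interval: 30s            # per-job override of the global default
    metrics_path: /metrics          # default path, shown explicitly
    static_configs:
      - targets: ['api:8080']       # Prometheus pulls http://api:8080/metrics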

99% confidence
A

Prometheus metrics are time series identified by metric_name{label1="value1", label2="value2"}. Four types: (1) Counter - monotonically increasing (use rate()): http_requests_total, errors_total. Reset on restart. (2) Gauge - current value (can go up/down): memory_usage_bytes, cpu_temp_celsius, active_connections. (3) Histogram - observations in buckets (server-side quantiles): request_duration_seconds_bucket{le="0.1"}, request_duration_seconds_sum, request_duration_seconds_count. Use histogram_quantile(0.95, ...) for percentiles. (4) Summary - client-side quantiles: request_duration_seconds{quantile="0.95"}. Use Counter for cumulative events, Gauge for current state, Histogram for latency distributions (preferred for aggregation), Summary when exact quantiles needed. Histograms enable aggregation across instances, Summaries provide exact quantiles but cannot aggregate.

99% confidence
A

Metric names must match [a-zA-Z_:][a-zA-Z0-9_:]* with a single-word application prefix in snake_case. Include units and type suffixes: _total (counters), _count/_sum/_bucket (histograms), _seconds, _bytes, _ratio (0-1). Examples: http_requests_total, http_request_duration_seconds, process_cpu_seconds_total, node_memory_MemAvailable_bytes. Use base units: seconds not milliseconds, bytes not kilobytes. Colons are reserved for recording rules only. Label names match [a-zA-Z_][a-zA-Z0-9_]* and should have a finite set of values: {method='GET', status='200', service='api'}. Avoid high-cardinality labels (user_id, ip_address). Prometheus 3.0+ supports UTF-8 characters, but stick to the recommended charset for compatibility. Consistent naming ensures query clarity and cross-team reusability.

99% confidence
A

PromQL queries time series with powerful syntax. Instant vector (single value per series): http_requests_total. Label matchers: http_requests_total{method="GET", status="200"} (exact), http_requests_total{status!="200"} (not equal), http_requests_total{method=~"GET|POST"} (regex), http_requests_total{path!~"/admin.*"} (negative regex). Range vector (time window): http_requests_total[5m]. Operators: arithmetic (+, -, *, /, %), comparison (==, !=, >, <, >=, <=), logical (and, or, unless). Examples: rate(http_requests_total[5m]) (per-second rate), sum(rate(http_requests_total[5m])) by (service) (aggregate by service), rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) (error rate). Always use label matchers to filter specific time series. Best practices: use shorter time ranges for faster queries, test queries incrementally, leverage recording rules for complex queries.

99% confidence
A

Aggregation operators combine time series. Operators: sum (total), avg (average), min/max (extremes), count (series count), stddev (standard deviation), stdvar (variance), count_values (histogram), topk(N, ...) (top N), bottomk(N, ...) (bottom N), quantile(φ, ...) (quantile). Grouping: by (label1, label2) preserves specified labels, without (label1) removes specified labels. Examples: sum(rate(http_requests_total[5m])) by (method, status) (requests per method/status), sum(rate(http_requests_total[5m])) without (instance) (aggregate across instances), topk(5, rate(http_requests_total[5m])) (top 5 highest rate series), avg(rate(container_cpu_usage_seconds_total[5m])) by (namespace) (CPU per namespace). Best practice: use without() to preserve job/cluster labels. Aggregations reduce cardinality and enable cross-instance dashboards/alerts.

99% confidence
A

Essential PromQL functions: (1) rate(counter[5m]) - per-second rate for counters (handles resets). (2) increase(counter[1h]) - total increase over period. (3) irate(counter[5m]) - instant rate (last 2 samples, volatile). (4) delta(gauge[1h]) - change in gauge. (5) histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) - 95th percentile from histogram. (6) avg_over_time(gauge[1h]) - time-based average. (7) max_over_time(gauge[5m]) - maximum over period. (8) predict_linear(gauge[4h], 3600) - linear prediction 1 hour ahead. (9) absent(metric) - returns 1 if metric missing (for alerting). (10) clamp_max/clamp_min - limit values. Example: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05 (error rate >5%). Use rate() for counters, delta() for gauges.

99% confidence
A

Time range selectors define time windows using bracket notation. Syntax: metric[duration] where duration uses s (seconds), m (minutes), h (hours), d (days), w (weeks), y (years). Examples: http_requests_total[5m] (last 5 minutes), node_cpu_seconds_total[1h] (last hour), up[30s] (last 30 seconds). Offset modifier: metric[5m] offset 1h (5-minute window from 1 hour ago), metric offset 1d (instant value from 1 day ago). Subquery syntax: rate(http_requests_total[5m])[1h:30s] (evaluate rate() every 30s over 1 hour). Range vectors required for: rate(), increase(), delta(), avg_over_time(), max_over_time(). Instant vectors for: arithmetic, aggregation. Best practices: match range to scrape_interval (typically 4× scrape_interval for rate()), use longer ranges for low-frequency events. Example: rate(http_requests_total[5m] offset 1w) compares to last week (offset attaches to the selector, not the function result).

99% confidence
A

Exporters are standalone programs exposing a /metrics endpoint in Prometheus format for systems lacking native instrumentation. Workflow: Prometheus scrapes exporter → exporter queries target system → converts to Prometheus metrics → returns via HTTP. Official exporters: node_exporter (Linux/Unix system metrics: CPU, memory, disk, network; Windows hosts use windows_exporter), mysqld_exporter (MySQL metrics), redis_exporter (Redis), postgres_exporter (PostgreSQL), blackbox_exporter (HTTP/ICMP/TCP/DNS probing), elasticsearch_exporter (Elasticsearch). Deploy as sidecar or standalone service. Example scrape config: scrape_configs: - job_name: 'node' static_configs: - targets: ['node-exporter:9100']. Custom exporters use client libraries (prometheus_client for Python, client_golang for Go). Exporters transform proprietary APIs to Prometheus format, essential for monitoring databases, message queues, cloud services, and legacy systems. Over 200 community exporters are available.

99% confidence
A

Static service discovery manually defines scrape targets in prometheus.yml. Configuration: scrape_configs: - job_name: 'api' scrape_interval: 30s static_configs: - targets: ['api-1.example.com:8080', 'api-2.example.com:8080'] labels: env: 'production' region: 'us-east-1' - targets: ['api-3.example.com:8080'] labels: env: 'staging'. Labels attached to all metrics from targets. Use cases: small deployments, static infrastructure, development environments, explicit control over targets. Advantages: simple, no external dependencies, predictable. Disadvantages: manual updates required, not scalable for dynamic environments (containers, autoscaling). Best for: <50 targets, stable infrastructure, or when service discovery unavailable. Reload config without restart: curl -X POST http://localhost:9090/-/reload (requires --web.enable-lifecycle flag).
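
The same static_configs example, formatted as YAML (hosts and label values are the placeholders from the answer):

scrape_configs:
  - job_name: 'api'
    scrape_interval: 30s
    static_configs:
      - targets: ['api-1.example.com:8080', 'api-2.example.com:8080']
        labels:
          env: 'production'
          region: 'us-east-1'
      - targets: ['api-3.example.com:8080']
        labels:
          env: 'staging'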

99% confidence
A

File-based service discovery reads targets from JSON/YAML files that can be updated without restarting Prometheus. Configuration: file_sd_configs: - files: ['/etc/prometheus/targets/*.json', '/etc/prometheus/targets/*.yml'] refresh_interval: 30s. JSON format: [{"targets": ["host1:9100", "host2:9100"], "labels": {"env": "prod", "team": "platform"}}, {"targets": ["host3:9100"], "labels": {"env": "staging"}}]. YAML format: - targets: ["host1:9100"] labels: env: prod. Prometheus watches the files and applies changes as they happen; refresh_interval (default 5m) is a fallback re-read. Use cases: integration with config management (Ansible, Terraform), custom discovery scripts, cloud metadata APIs. Generate files dynamically: Ansible templates, Terraform local_file, a cron script querying cloud APIs. Advantages: simple, no external service dependencies, works with any system that can write files. Ideal bridge between static and dynamic discovery.
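
A sketch of both sides of file-based discovery, assuming an illustrative /etc/prometheus/targets/ directory:

# prometheus.yml
scrape_configs:
  - job_name: 'file-discovered'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 30s

# /etc/prometheus/targets/prod.yml (one possible target file)
- targets: ['host1:9100', 'host2:9100']
  labels:
    env: prod
    team: platform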

99% confidence
A

Kubernetes service discovery automatically finds targets using pod/service/endpoint annotations. Configuration: kubernetes_sd_configs: - role: pod namespaces: names: [monitoring] relabel_configs: - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] action: keep regex: true - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] target_label: __metrics_path__ - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port] regex: ([^:]+)(?::\d+)?;(\d+) replacement: $1:$2 target_label: __address__. Standard annotations: prometheus.io/scrape: "true" (enable), prometheus.io/port: "8080" (metrics port), prometheus.io/path: "/metrics" (custom path), prometheus.io/scheme: "https" (TLS). Roles: pod, service, endpoints, node, ingress. Automatic target updates on pod creation/deletion enable seamless monitoring in dynamic Kubernetes environments.
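
The relabeling pipeline from this answer written out as YAML; the internal labels (__address__, __metrics_path__) are Prometheus' standard ones, while the namespace and annotation names follow the common prometheus.io convention and are illustrative:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [monitoring]
    relabel_configs:
      # keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # honour a custom metrics path annotation, if present
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        regex: (.+)
        target_label: __metrics_path__
      # rewrite the scrape address to use the annotated port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__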

99% confidence
A

Alert rules evaluate PromQL expressions and trigger alerts via Alertmanager. Load in prometheus.yml: rule_files: - 'alert_rules.yml'. Structure: groups: - name: api.rules interval: 30s rules: - alert: HighErrorRate expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) > 0.05 for: 10m labels: severity: warning service: "{{ $labels.service }}" annotations: summary: "High error rate on {{ $labels.service }}" description: "Error rate is {{ printf "%.2f" $value | humanizePercentage }} (threshold: 5%)" runbook_url: "https://runbooks.example.com/high-error-rate". Components: alert (CamelCase name), expr (PromQL), for (duration before firing), labels (routing/grouping), annotations (human-readable details). Test: promtool check rules alert_rules.yml. Reload: curl -X POST http://localhost:9090/-/reload. Best practices: alert on symptoms not causes, use 'for' clause to prevent flapping, include runbook_url.
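
The HighErrorRate rule from this answer as a complete rules-file sketch (group name, threshold, and URLs come from the answer and are illustrative):

groups:
  - name: api.rules
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook_url: "https://runbooks.example.com/high-error-rate"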

99% confidence
A

The 'for' clause defines how long a condition must be true before the alert fires, preventing flapping from transient spikes. Syntax: for: <duration> (e.g., for: 5m, for: 30s, for: 1h). Example: - alert: HighCPU expr: avg(rate(cpu_usage_seconds_total[5m])) by (instance) > 0.80 for: 10m (CPU >80% for 10 consecutive minutes). Without 'for': alert fires immediately on first evaluation. With 'for': alert enters 'pending' state, then 'firing' after duration. Best practices: (1) Critical alerts: 1-5m to minimize response time. (2) Warning alerts: 5-15m to avoid noise. (3) Non-urgent: 30m-1h. (4) Match 'for' to scrape_interval (minimum 2-3× scrape interval). Benefits: eliminates false positives, reduces alert fatigue, focuses on sustained issues. Monitor pending alerts: ALERTS{alertstate="pending"}. Essential for stable production alerting.

99% confidence
A

Alert labels control Alertmanager routing, grouping, and inhibition. Standard labels: severity (critical, warning, info), service (affected component), environment (prod, staging, dev), team (owning team), cluster (Kubernetes cluster). Example: labels: severity: critical service: "{{ $labels.service }}" environment: production team: platform cluster: "{{ $labels.cluster }}" page: "true". Severity routing in Alertmanager: routes: - match: severity: critical receiver: pagerduty - match: severity: warning receiver: slack. Dynamic labels: use "{{ $labels.label_name }}" to copy from metrics. Reserved labels: alertname (auto-generated). Common patterns: page: "true" (requires paging), component: database (system component), priority: P1 (SLA priority). Labels must be consistent across alerts for proper grouping. Avoid high-cardinality labels (instance IDs). Use annotations for variable details (values, runbook links).

99% confidence
A

Annotations provide human-readable alert context without affecting routing (unlike labels). Template syntax: annotations: summary: "High latency on {{ $labels.service }}" description: "P95 latency is {{ $value | humanizeDuration }} (threshold: 500ms)" runbook_url: "https://runbooks.example.com/latency" dashboard: "https://grafana.example.com/d/service?var-service={{ $labels.service }}". Template variables: $labels.label_name (label values), $value (alert expression result), $externalLabels (global labels). Template functions: humanize (format numbers), humanizeDuration (convert seconds to duration), humanizePercentage (format as %), printf (format strings). Example: description: "Error rate is {{ printf "%.2f" $value | humanizePercentage }} on {{ $labels.instance }}". Annotations appear in Alertmanager notifications (email, Slack, PagerDuty). Best practices: include actionable information, link to runbooks/dashboards, explain threshold/impact, use consistent formatting. Annotations don't affect alert identity or routing.

99% confidence
A

Alertmanager handles Prometheus alerts: deduplication, grouping, routing to receivers (email, Slack, PagerDuty), silencing, inhibition. Configuration in alertmanager.yml defines routing trees. Example: route: receiver: default group_by: [alertname, cluster] group_wait: 30s group_interval: 5m repeat_interval: 12h routes: - match: severity: critical receiver: pagerduty group_wait: 10s continue: true - match_re: service: ^(api|web)$ receiver: slack-dev. Root route: default receiver for unmatched alerts. Nested routes: match specific label patterns. continue: true allows multiple receivers. Grouping combines alerts with same group_by labels into single notification. Deduplication prevents duplicate firing alerts. Repeat interval controls notification frequency. Alertmanager essential for production: prevents alert storms, routes to correct teams, provides UI for silence management.
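
The routing tree above, as it would appear in alertmanager.yml (receiver names and matchers are placeholders):

route:
  receiver: default
  group_by: [alertname, cluster]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - match:
        severity: critical
      receiver: pagerduty
      group_wait: 10s
      continue: true              # keep evaluating sibling routes as well
    - match_re:
        service: ^(api|web)$
      receiver: slack-dev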

99% confidence
A

Grouping combines related alerts into single notifications to reduce noise. Configuration: route: group_by: [alertname, cluster, service] group_wait: 30s group_interval: 5m repeat_interval: 4h. group_by: labels for grouping (empty [] groups all). group_wait: wait for more alerts before sending first notification. group_interval: time between notifications for same group. repeat_interval: time before re-sending resolved/firing alert. Inhibition suppresses alerts when other alerts active: inhibit_rules: - source_matchers: [severity="critical", alertname="NodeDown"] target_matchers: [severity="warning"] equal: [instance]. Example: NodeDown critical alert inhibits NodeHighCPU warning on same instance. Use cases: cluster-wide outage inhibits node-level alerts, critical database alert inhibits slow query warnings. Proper grouping/inhibition reduce alert fatigue while maintaining visibility of critical issues.
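
The inhibition rule from this answer as YAML (the matcher values and the equal label follow the NodeDown example):

inhibit_rules:
  - source_matchers:
      - severity="critical"
      - alertname="NodeDown"
    target_matchers:
      - severity="warning"
    equal: [instance]             # inhibit only when both alerts share the same instance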

99% confidence
A

Receivers define notification integrations. Email: receivers: - name: email email_configs: - to: 'oncall@example.com' from: 'alertmanager@example.com' smarthost: smtp.gmail.com:587 auth_username: 'alertmanager@example.com' auth_password: app-password headers: Subject: '[{{ .Status }}] {{ .GroupLabels.alertname }}'. Slack: slack_configs: - api_url: https://hooks.slack.com/services/XXX channel: '#alerts' title: '{{ .GroupLabels.alertname }}' text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'. PagerDuty: pagerduty_configs: - routing_key: <integration-key> severity: '{{ .GroupLabels.severity }}'. Webhook: webhook_configs: - url: http://webhook.example.com/alerts http_config: bearer_token: secret-token. Multiple receivers: receivers: - name: multi email_configs: [...] slack_configs: [...]. Test: amtool alert add alertname=Test severity=warning --alertmanager.url=http://localhost:9093.
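
One receiver written out fully, a Slack sketch (the webhook URL and channel are placeholders):

receivers:
  - name: slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }} {{ end }}'
        send_resolved: true       # also notify when the alert resolves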

99% confidence
A

Silences mute alerts temporarily without modifying rules. Create via UI (http://localhost:9093 → Silences → Create), amtool, or API. amtool: amtool silence add alertname=HighCPU instance=web-1 --duration=2h --comment="Planned maintenance" --author=ops. API: curl -XPOST http://localhost:9093/api/v2/silences -d '{"matchers":[{"name":"alertname","value":"HighCPU","isRegex":false},{"name":"instance","value":"web-1","isRegex":false}],"startsAt":"2025-01-15T10:00:00Z","endsAt":"2025-01-15T12:00:00Z","createdBy":"ops","comment":"Maintenance window"}'. List silences: amtool silence query or GET /api/v2/silences. Delete: amtool silence expire <silence_id> or DELETE /api/v2/silence/<silence_id>. Use cases: planned maintenance, known issues, testing, deployments. Silences match alerts by labels (supports regex). Best practice: always add comments explaining why, include end time, use specific matchers to avoid over-silencing.

99% confidence
A

Prometheus TSDB uses a block-based architecture combining an in-memory Head block and immutable on-disk blocks. The Head block stores recent samples (2 hours default) with a Write-Ahead Log (WAL) for crash recovery. Completed blocks written to disk contain: meta.json (metadata), index (inverted index for labels), chunks/ (compressed time series data). Blocks are compacted into larger blocks (up to 10% of retention or 31 days) following a log-structured merge (LSM) design. Each time series gets a unique ID from its label set. Storage flags: --storage.tsdb.path (default ./data), --storage.tsdb.retention.time (default 15d). Monitor with prometheus_tsdb_size_retained_bytes, prometheus_tsdb_head_series. Calculate storage: bytes_per_sample × series_count × (retention_seconds / scrape_interval). Example: 2 bytes × 1M series × (15d × 86400s / 15s) ≈ 173GB. Understanding TSDB architecture optimizes capacity planning and performance.

99% confidence
A

Recording rules pre-compute expensive queries and store results as new time series. Format: level:metric:operations where level lists preserved labels, metric is unchanged (strip _total from counters), operations lists applied functions. Example: groups: - name: api.rules interval: 30s rules: - record: job:http_requests:rate5m expr: sum without (instance) (rate(http_requests_total[5m])). Load in prometheus.yml: rule_files: - 'recording_rules.yml'. Evaluation interval: global.evaluation_interval (default 1m) or per-group interval. Best for: histogram_quantile() pre-computation, expensive aggregations (sum, avg by), frequently used dashboard queries. Monitor evaluation time: prometheus_rule_evaluation_duration_seconds. Use without() clause to preserve labels like job. Recording rules should have zero or two colons. Dramatically improves dashboard performance without storing unnecessary data.
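
The recording rule from this answer as a rules file (group name and interval are illustrative):

groups:
  - name: api.rules
    interval: 30s                  # overrides global evaluation_interval for this group
    rules:
      - record: job:http_requests:rate5m
        expr: sum without (instance) (rate(http_requests_total[5m]))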

99% confidence
A

Recording rules follow level:metric:operations naming where level lists preserved labels. Examples: job:http_requests:rate5m, instance:node_cpu:usage_avg1m, cluster:container_memory_bytes:sum. Use for: expensive histogram_quantile() operations, multi-service aggregations (sum without(instance)), frequently queried dashboard metrics, alert pre-computation. Avoid: over-recording rarely used queries, duplicating simple metrics, recording without a without() clause (loses labels). Best practices: (1) Always use without() to preserve job/cluster labels. (2) Fold ratios into the metric and operations parts: job:http_request_errors_per_requests:ratio_rate5m. (3) Document with comments. (4) Test with promtool check rules. (5) Monitor evaluation time: prometheus_rule_evaluation_duration_seconds. (6) Group related rules with appropriate interval. (7) Only record queries used 3+ times. Recording rules should have zero or two colons. Proper rules improve dashboard performance without storage bloat.

99% confidence
A

Instrument apps with official client libraries for Go, Python, Java, Node.js, Ruby, .NET. Python: from prometheus_client import Counter, Histogram, Gauge, start_http_server. requests_total = Counter('http_requests_total', 'Total requests', ['method', 'status']). request_duration = Histogram('http_request_duration_seconds', 'Request duration', buckets=[0.1, 0.5, 1.0, 2.0, 5.0]). Inside a request handler (e.g. a Flask @app.route('/api') view): requests_total.labels(method='GET', status='200').inc(), and wrap the handler body in with request_duration.time(): to record its latency. start_http_server(8000) exposes /metrics. Go: import "github.com/prometheus/client_golang/prometheus" "github.com/prometheus/client_golang/prometheus/promhttp". requestsTotal := prometheus.NewCounterVec(prometheus.CounterOpts{Name: "http_requests_total", Help: "Total requests"}, []string{"method", "status"}). prometheus.MustRegister(requestsTotal). http.Handle("/metrics", promhttp.Handler()). Libraries handle: thread-safety, metric registration, text/protobuf exposition, automatic process metrics (CPU, memory).

99% confidence
A

Instrumentation best practices: (1) Metric names: use base units (seconds not ms, bytes not KB), suffix with unit (_seconds, _bytes, _total). (2) Metric types: Counter for cumulative (requests_total), Gauge for current state (memory_usage_bytes), Histogram for distributions (request_duration_seconds with buckets=[0.1, 0.5, 1, 2.5, 5]). (3) Labels: use finite cardinality (method, status, service), avoid user_id, ip_address, timestamps. (4) Initialize metrics at startup: counter.labels(method='GET', status='200').inc(0). (5) Instrument critical paths: request handlers, database queries, external API calls. (6) Export business metrics: orders_total, revenue_dollars, active_users_gauge. (7) Document with Help text. (8) Use consistent label names across services (status not http_status). Good: http_requests_total{method="GET", status="200"}. Bad: requests{user="alice", path="/api/v1/users/123"}. Follow conventions for cross-team querying.

99% confidence
A

High cardinality (>10K unique series) causes memory issues, slow queries, and potential TSDB crashes. Common causes: user_id, ip_address, request_id, email as labels. Detection: topk(10, count by (__name__, job)({__name__=~".+"})) or use mimirtool for unused metrics analysis. Solutions: (1) Drop offending labels at scrape time via metric_relabel_configs with action: labeldrop (see the sketch below). (2) Pre-aggregate with recording rules: record: service:requests:rate5m expr: sum(rate(requests_total[5m])) by (service). (3) Use metric_relabel_configs to hash or rewrite high-cardinality labels. (4) Limit scrape scope with label selectors. (5) Consider VictoriaMetrics for better high-cardinality handling or managed solutions like Levitate. Monitor: prometheus_tsdb_head_series, prometheus_tsdb_symbol_table_size_bytes. Set per-scrape limits with sample_limit and label_limit in scrape_configs. Keep <100K series per metric for optimal Prometheus performance.
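
A metric_relabel_configs sketch for dropping a high-cardinality label at scrape time; the job, target, and the user_id label are illustrative:

scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:8080']
    metric_relabel_configs:
      # remove the user_id label from every scraped series before ingestion
      - action: labeldrop
        regex: user_id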

99% confidence
A

Optimize scrape_interval based on metric volatility and query needs. Recommended: 15s-30s for dynamic metrics (CPU, memory, request rates), 1m-5m for stable metrics (disk usage). Configuration: global: scrape_interval: 15s. Per-job override: scrape_configs: - job_name: 'slow-metrics' scrape_interval: 5m. Retention: --storage.tsdb.retention.time=15d (default) or --storage.tsdb.retention.size=50GB (size-based). Calculate storage: 2 bytes/sample × series_count × (seconds_in_retention / scrape_interval). Example: 1M series × 2 bytes × (15d × 86400s / 15s) = ~173GB. Optimization: (1) Use recording rules for dashboards instead of reducing scrape interval. (2) Remote storage (Thanos, Mimir, VictoriaMetrics) for long-term retention (>30d). (3) Monitor: prometheus_tsdb_size_retained_bytes, prometheus_tsdb_head_series. (4) Set --storage.tsdb.max-block-duration for compaction tuning.
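
The global default plus a per-job override, as YAML (the job name and target are illustrative):

global:
  scrape_interval: 15s             # default for every job
scrape_configs:
  - job_name: 'slow-metrics'
    scrape_interval: 5m            # per-job override for slow-moving metrics
    static_configs:
      - targets: ['exporter:9100']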

99% confidence
A

Federation allows hierarchical Prometheus topologies where one server scrapes selected metrics from another via the /federate endpoint. Configuration: scrape_configs: - job_name: 'federate' honor_labels: true metrics_path: /federate params: 'match[]': - '{job=~"kubernetes-.*"}' - '{__name__=~"job:.*"}' static_configs: - targets: ['source-prometheus:9090']. honor_labels: true preserves original labels from the source. Use match[] to select specific metrics/jobs for federation. Patterns: (1) Hierarchical: regional Prometheus → global aggregation server. (2) Cross-datacenter: isolate prod/staging while centralizing dashboards. Best practices: federate aggregated recording rules (job:-prefixed) rather than raw metrics to reduce cardinality. Alternatives: Thanos (object storage), Mimir (multi-tenant), VictoriaMetrics (better compression) offer better scalability than federation for large-scale deployments.
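
The federation job as YAML, with the match[] selectors written out (the source address and selectors are illustrative):

scrape_configs:
  - job_name: 'federate'
    honor_labels: true               # keep the labels assigned by the source server
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~"kubernetes-.*"}'   # whole jobs
        - '{__name__=~"job:.*"}'     # aggregated recording rules
    static_configs:
      - targets: ['source-prometheus:9090']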

99% confidence
A

Pushgateway acts as intermediary for short-lived jobs that can't be scraped directly. Workflow: job → push metrics → Pushgateway → Prometheus scrapes Pushgateway. Push example: echo "backup_duration_seconds 123.4" | curl --data-binary @- http://pushgateway:9091/metrics/job/backup/instance/db-backup. Python: from prometheus_client import CollectorRegistry, Gauge, push_to_gateway. registry = CollectorRegistry(). g = Gauge('job_duration_seconds', 'Job duration', registry=registry). g.set(123.4). push_to_gateway('pushgateway:9091', job='backup', registry=registry). Scrape config: scrape_configs: - job_name: pushgateway honor_labels: true static_configs: - targets: [pushgateway:9091]. Use cases: batch jobs, cron jobs, CI/CD pipelines, serverless functions. Limitations: stale metrics if job crashes, no automatic expiry, single point of failure. Best practices: delete metrics after the job finishes (DELETE /metrics/job/<job_name>), use unique instance labels, prefer pull model for long-running services.

99% confidence
A

Prometheus has no authentication enabled by default; v2.24+ adds native TLS and basic auth via a web config file (--web.config.file), and older setups rely on a reverse proxy or network controls. TLS config: tls_server_config: cert_file: /etc/prometheus/prometheus.crt key_file: /etc/prometheus/prometheus.key client_auth_type: RequireAndVerifyClientCert client_ca_file: /etc/prometheus/client_ca.crt. Basic auth (v2.24+): basic_auth_users: admin: $2y$10$hashed_password_here. Reverse proxy auth (nginx): location / { auth_basic "Prometheus"; auth_basic_user_file /etc/nginx/.htpasswd; proxy_pass http://localhost:9090; }. OAuth2 proxy: oauth2-proxy --upstream=http://localhost:9090 --provider=google. Scrape target auth: basic_auth: username: scrape_user password_file: /etc/prometheus/scrape_password. bearer_token_file: /var/run/secrets/token. tls_config: ca_file: /etc/prometheus/ca.crt. Network: firewall rules, VPC peering, mTLS between components. Monitor failed auth: prometheus_http_requests_total{code="401"}.
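
A web configuration file sketch combining TLS and basic auth (passed to Prometheus with --web.config.file); certificate paths and the bcrypt hash are placeholders:

tls_server_config:
  cert_file: /etc/prometheus/prometheus.crt
  key_file: /etc/prometheus/prometheus.key
basic_auth_users:
  # value is a bcrypt hash of the password, never the plain text
  admin: $2y$10$hashed_password_here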

99% confidence
A

Anti-patterns to avoid: (1) High-cardinality labels: using user_id, ip_address, request_id as labels causes memory exhaustion. (2) Missing 'for' clause: alerts fire on brief spikes causing alert fatigue. (3) Pushgateway for services: use pull model for long-running apps. (4) Scraping every second: creates unnecessary load; use 15-30s minimum. (5) Ignoring metric naming: inconsistent names (request_count vs requests_total) break queries. (6) Not monitoring Prometheus: monitor prometheus_tsdb_head_series, up{job="prometheus"}. (7) Single Prometheus instance: use federation/remote storage for scale. (8) No retention planning: disk fills unexpectedly. (9) Using gauge for counters: breaks rate() calculations. (10) Complex dashboards without recording rules: slow queries at scale. Best practices: follow naming conventions, use recording rules, implement proper cardinality controls, test alerts, plan capacity.

99% confidence
A

Prometheus Operator manages Prometheus/Alertmanager on Kubernetes via CRDs. CRDs: Prometheus (server config), ServiceMonitor (pod/service scraping), PodMonitor (pod scraping), PrometheusRule (alert/recording rules), Alertmanager (alerting config), ThanosRuler (Thanos ruler), PrometheusAgent (agent mode). ServiceMonitor: apiVersion: monitoring.coreos.com/v1 kind: ServiceMonitor metadata: name: api namespace: monitoring spec: selector: matchLabels: {app: api} endpoints: - port: metrics interval: 30s path: /metrics. PrometheusRule: kind: PrometheusRule spec: groups: - name: api rules: - alert: HighErrorRate expr: rate(errors[5m]) > 0.05. Benefits: declarative GitOps, automatic config reload, native Kubernetes integration, multi-tenancy support. Install: helm install prometheus prometheus-community/kube-prometheus-stack. Part of kube-prometheus providing full monitoring stack (Prometheus, Alertmanager, Grafana, node-exporter).
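
The ServiceMonitor from this answer as a complete manifest (names, namespace, selector, and port are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
    - port: metrics          # name of the Service port that exposes /metrics
      interval: 30s
      path: /metrics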

99% confidence
A

Grafana is standard for Prometheus visualization. Setup: Add datasource → Prometheus → URL: http://prometheus:9090 → Save & Test. Create dashboard: Panel → Query: rate(http_requests_total[5m]) → Visualization: Time series. Variables for dynamic dashboards: $namespace: label_values(kube_pod_info, namespace), $service: label_values(http_requests_total{namespace="$namespace"}, service). Query examples: (1) Request rate: sum(rate(http_requests_total[5m])) by (service). (2) Error rate: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])). (3) P95 latency: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)). Visualization types: Time series (trends), Stat (current value), Gauge (percentage), Heatmap (latency distribution). Best practices: USE method (Utilization, Saturation, Errors), RED method (Rate, Errors, Duration), consistent colors, meaningful thresholds, dashboard folders, annotation queries.

99% confidence