
Overview

Metrics provide numerical measurements of your system’s performance over time. The Observability Bundle includes four components that work together to collect, store, and visualize metrics from your Kubernetes clusters:
  • Prometheus: Collects and stores short-term metrics
  • Grafana Mimir: Provides long-term metrics storage and querying
  • Kube State Metrics: Exposes Kubernetes object state as metrics
  • Prometheus Node Exporter: Exposes node hardware and OS metrics

Prometheus

Prometheus is the industry-standard monitoring system and time-series database for cloud-native environments. It automatically discovers services in Kubernetes and scrapes metrics from them.
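How scrape targets are discovered depends on how the bundle configures Prometheus. A common convention, assuming annotation-based discovery is enabled in your scrape configuration, is to annotate pods so Prometheus knows where to find their metrics. A minimal sketch:
metadata:
  annotations:
    prometheus.io/scrape: "true"   # opt this pod in to scraping (community convention, not a Kubernetes standard)
    prometheus.io/port: "8080"     # port serving the metrics endpoint
    prometheus.io/path: "/metrics" # scrape path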

What Prometheus Monitors

Application Metrics:
  • Request rates, latencies, error rates (RED metrics)
  • Custom business metrics from your applications
  • Metrics from /metrics endpoints
Kubernetes Metrics:
  • Pod CPU and memory usage
  • Container resource consumption
  • Service endpoint availability
Integration:
  • Scrapes metrics every 15-30 seconds (configurable)
  • Stores data locally for 15-30 days (typical configuration)
  • Remote writes to Mimir for long-term retention

PromQL Basics

Prometheus uses PromQL (Prometheus Query Language) for querying metrics. Example queries:
# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Memory usage by namespace
sum(container_memory_usage_bytes) by (namespace)

# Request rate for a service
rate(http_requests_total{job="my-service"}[5m])

# 95th percentile response time (aggregated across series)
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

Configuration

Prometheus is configured via the Observability Bundle’s GitOps workflow. Key configuration options:
enabled: true
valuesObject:
  # Scrape interval
  scrapeInterval: 30s

  # Retention period
  retention: 15d

  # Remote write to Mimir
  remoteWrite:
    - url: http://mimir-gateway/api/v1/push

  # Resource limits
  resources:
    limits:
      memory: 4Gi
      cpu: 2
For detailed Helm chart values, see the Prometheus chart documentation.

Grafana Mimir

Mimir provides long-term, scalable storage for Prometheus metrics. It’s designed to handle billions of active time series across multiple tenants.

Why Mimir?

Long-term Retention:
  • Stores metrics for months or years
  • Prometheus typically keeps 15-30 days locally
  • Historical analysis and capacity planning
Horizontal Scalability:
  • Scales to handle millions of samples per second
  • Distributed architecture handles large metric volumes
  • No single point of failure
Prometheus Compatible:
  • Receives data via Prometheus remote write
  • Queries with PromQL
  • Drop-in replacement for long-term storage

How It Works

Prometheus ──[remote write]──> Mimir ──[PromQL queries]──> Grafana
    │                                                           │
    └─────────────[PromQL queries for recent data]─────────────┘
Grafana can query both Prometheus (recent data) and Mimir (historical data) simultaneously, providing seamless access to both real-time and long-term metrics.
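As an illustration, both backends can be registered in Grafana as Prometheus-type data sources through Grafana's provisioning format. The service names below (prometheus-server and mimir-gateway) and the /prometheus query path on the Mimir gateway are assumptions; adjust them to your deployment:
apiVersion: 1
datasources:
  - name: Prometheus              # recent, high-resolution data
    type: prometheus
    url: http://prometheus-server
    access: proxy
  - name: Mimir                   # long-term data, queried with the same PromQL
    type: prometheus
    url: http://mimir-gateway/prometheus
    access: proxy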

Configuration

enabled: true
valuesObject:
  # Storage backend (S3, GCS, Azure Blob)
  storage:
    backend: s3
    s3:
      endpoint: s3.amazonaws.com
      bucket: my-mimir-metrics

  # Retention period
  limits:
    compactor_blocks_retention_period: 1y

  # Ingestion limits
  ingester:
    ring:
      replication_factor: 3
For detailed configuration, see the Mimir chart documentation.

Kube State Metrics

Kube State Metrics generates metrics about the state of Kubernetes objects. Unlike resource usage metrics (such as CPU and memory figures from cAdvisor or the metrics server), kube-state-metrics focuses on the state of the objects themselves (e.g., deployments, pods, nodes) rather than their resource consumption.

What It Exposes

Deployment Metrics:
  • kube_deployment_status_replicas: Current number of replicas (kube_deployment_spec_replicas reports the desired count)
  • kube_deployment_status_replicas_available: Number of available replicas
  • kube_deployment_status_replicas_unavailable: Number of unavailable replicas
Pod Metrics:
  • kube_pod_status_phase: Pod phase (Pending, Running, Succeeded, Failed, Unknown)
  • kube_pod_status_ready: Whether pod is ready
  • kube_pod_container_status_restarts_total: Container restart count
Node Metrics:
  • kube_node_status_condition: Node conditions (Ready, MemoryPressure, DiskPressure)
  • kube_node_status_allocatable: Allocatable resources per node
  • kube_node_status_capacity: Total capacity per node

Use Cases

Cluster Health Monitoring:
# Pods not in Running state
sum(kube_pod_status_phase{phase!="Running"})

# Nodes with conditions other than Ready
sum(kube_node_status_condition{condition!="Ready",status="true"})
Capacity Planning:
# Available CPU capacity across cluster
sum(kube_node_status_allocatable{resource="cpu"})

# Pod resource requests vs node capacity
sum(kube_pod_container_resource_requests{resource="memory"}) /
sum(kube_node_status_capacity{resource="memory"})
Deployment Monitoring:
# Deployments with unavailable replicas
kube_deployment_status_replicas_unavailable > 0

# Deployments not at desired replica count
kube_deployment_status_replicas != kube_deployment_spec_replicas

Configuration

Kube State Metrics typically requires minimal configuration:
enabled: true
valuesObject:
  # Resource limits
  resources:
    limits:
      memory: 256Mi
      cpu: 100m

  # Which resources to monitor (default: all)
  collectors:
    - deployments
    - pods
    - nodes
    - services
    - configmaps
For more details, see the Kube State Metrics chart.

Prometheus Node Exporter

The Node Exporter exposes hardware and OS metrics from Kubernetes nodes. It runs as a DaemonSet (one pod per node) to collect node-level performance data.

What It Exposes

CPU Metrics:
  • node_cpu_seconds_total: CPU time spent in different modes (user, system, idle)
  • Used to calculate CPU usage percentages
Memory Metrics:
  • node_memory_MemTotal_bytes: Total memory
  • node_memory_MemAvailable_bytes: Available memory
  • node_memory_Cached_bytes: Cached memory
Disk Metrics:
  • node_filesystem_size_bytes: Filesystem size
  • node_filesystem_avail_bytes: Available space
  • node_disk_io_time_seconds_total: Disk I/O time
Network Metrics:
  • node_network_receive_bytes_total: Bytes received
  • node_network_transmit_bytes_total: Bytes transmitted
  • node_network_receive_errors_total: Receive errors

Use Cases

Node Resource Monitoring:
# CPU usage by node
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage by node
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

# Disk usage by mount point
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
Network Monitoring:
# Network traffic by node
rate(node_network_receive_bytes_total[5m])

# Network errors
rate(node_network_receive_errors_total[5m]) > 0
Disk I/O:
# Disk I/O utilization (fraction of time spent on I/O)
rate(node_disk_io_time_seconds_total[5m])

Configuration

Node Exporter runs as a DaemonSet with access to the host network and PID namespace:
enabled: true
valuesObject:
  # Host network access for accurate metrics
  hostNetwork: true
  hostPID: true

  # Resource limits
  resources:
    limits:
      memory: 128Mi
      cpu: 100m
For more configuration options, see the Node Exporter chart.

Visualizing Metrics in Grafana

Once metrics are being collected, Grafana provides powerful visualization capabilities.

Pre-built Dashboards

The Observability Bundle includes dashboards for common monitoring needs:
  • Kubernetes Cluster Monitoring: Overview of cluster health and resource usage
  • Node Metrics: Detailed node performance (from Node Exporter)
  • Pod Metrics: CPU, memory, network by pod
  • Deployment Status: Replica counts, pod states

Creating Custom Dashboards

  1. Navigate to Grafana (access via ingress URL configured during setup)
  2. Click Dashboards → New Dashboard
  3. Add a panel and select Prometheus or Mimir as data source
  4. Write PromQL queries to visualize your metrics
Example panel configuration:
  • Panel Title: “API Request Rate”
  • Data Source: Prometheus
  • Query: rate(http_requests_total{job="api-service"}[5m])
  • Visualization: Time series graph
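If you prefer to keep dashboards in Git rather than building them only in the UI, many Grafana Helm deployments run a dashboard sidecar that loads dashboards from labeled ConfigMaps. Whether this sidecar is enabled, and which label it watches, depends on your bundle configuration; the names below are assumptions:
apiVersion: v1
kind: ConfigMap
metadata:
  name: api-request-rate-dashboard
  namespace: observability
  labels:
    grafana_dashboard: "1"   # label the sidecar watches for (may differ in your setup)
data:
  api-request-rate.json: |
    {
      "title": "API Request Rate",
      "panels": []
    }
In practice you would paste the JSON model exported from an existing dashboard into the data field.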

Alerting

Grafana can alert based on metric thresholds:
  1. Create an alert rule in a dashboard panel
  2. Define alert conditions (e.g., CPU > 80% for 5 minutes; see the example query after this list)
  3. Configure notification channels (email, Slack, PagerDuty)
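As an illustration of step 2, the CPU condition can be expressed with a PromQL query such as the one below (using the Node Exporter metric shown earlier; the "for 5 minutes" part is configured as the rule's pending period rather than in the query):
# Fires when average node CPU usage exceeds 80%
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80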

Accessing Grafana

To access Grafana and view your metrics:
  1. Get the Grafana URL from your Observability Bundle configuration (configured during setup), or use a temporary port-forward as shown after these steps
  2. Log in with the credentials you configured:
    • Username: admin (default)
    • Password: Set during bundle configuration
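If an ingress URL is not available yet, a temporary port-forward also works; the service name and port below are assumptions, so check your deployment:
kubectl port-forward -n observability svc/grafana 3000:80
# Then open http://localhost:3000 in your browser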
If you need to retrieve the password:
kubectl get secret grafana -o jsonpath="{.data.admin-password}" -n observability | base64 --decode
For security, change the default admin password and create additional user accounts as needed through the Grafana UI.

Troubleshooting

Targets Down or Metrics Missing

Check Prometheus targets:
  1. Access Prometheus UI (typically at http://prometheus:9090)
  2. Go to Status → Targets
  3. Look for targets in “down” state
Common issues:
  • Network policies blocking Prometheus from reaching pods (a sample policy follows this list)
  • Service selector not matching pods
  • Pods not exposing a /metrics endpoint
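For the first issue, a NetworkPolicy in the application namespace that admits scrapes from the observability namespace might look like the sketch below (the namespace name and metrics port are assumptions):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
spec:
  podSelector: {}                    # apply to all pods in this namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: observability
      ports:
        - port: 8080                 # port serving /metrics
          protocol: TCP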
Verify metrics endpoint:
kubectl port-forward pod/my-app-pod 8080:8080
curl http://localhost:8080/metrics

High Prometheus Memory Usage

Prometheus memory usage scales with:
  • Number of time series (unique label combinations)
  • Scrape interval (more frequent = more data)
  • Retention period (longer = more data stored)
Solutions:
  • Reduce retention period (keep 7-15 days, use Mimir for long-term)
  • Increase scrape interval (30s → 60s)
  • Drop unnecessary metrics using relabeling (see the sketch after this list)
  • Add more memory to Prometheus pods
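For the relabeling option, Prometheus can drop series at scrape time with metric_relabel_configs inside the relevant scrape config; the metric name pattern below is only an example:
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "go_gc_.*|go_memstats_.*"   # verbose runtime metrics you may not need
    action: drop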
Check cardinality:
# Access Prometheus UI
# Go to Status → TSDB Status
# Look for series with high cardinality
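Cardinality can also be inspected with a query; note that this is expensive, so run it sparingly:
# Top 10 metric names by number of active series
topk(10, count by (__name__)({__name__=~".+"}))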

Remote Write to Mimir Failing

Check Prometheus logs:
kubectl logs -n observability prometheus-server-0 | grep "remote write"
Common issues:
  • Mimir gateway not accessible (network/DNS issues)
  • Authentication failure (check credentials)
  • Rate limiting (ingestion rate too high)
Verify Mimir is running:
kubectl get pods -n observability | grep mimir

Missing Exporter Metrics

Verify kube-state-metrics is running:
kubectl get pods -n observability | grep kube-state-metrics
Verify node-exporter is running on all nodes:
kubectl get pods -n observability -l app=prometheus-node-exporter -o wide
# Should show one pod per node
Check Prometheus is scraping these exporters:
  • Access Prometheus UI → Status → Targets
  • Look for kube-state-metrics and node-exporter targets

Best Practices

Metric Naming

Follow Prometheus naming conventions:
  • Use base unit (seconds, not milliseconds)
  • Append _total for counters
  • Append _bucket for histograms
  • Example: http_request_duration_seconds

Label Usage

Keep cardinality low (an illustrative sample follows this list):
  • Avoid high-cardinality labels (user IDs, timestamps)
  • Use service name, environment, region as labels
  • Don’t create unique label values for every request
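To make the distinction concrete, here is an illustrative exposition-format sample with hypothetical metric and label values:
# Good: bounded label values
http_requests_total{service="api",environment="prod",status="500"} 1027

# Bad: user_id and request_id create a new series for every user and request
http_requests_total{service="api",user_id="84213",request_id="f3a97c1e"} 1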

Retention Strategy

Balance cost and utility:
  • Prometheus: 7-15 days (recent, high-resolution)
  • Mimir: months to years (long-term, downsampled)
  • Adjust based on storage costs and query patterns

Resource Limits

Right-size resource allocations:
  • Prometheus memory scales with active series
  • Monitor and adjust based on actual usage
  • Use horizontal scaling for very large deployments

Next Steps