Overview
Metrics provide numerical measurements of your system’s performance over time. The Observability Bundle includes four components that work together to collect, store, and visualize metrics from your Kubernetes clusters:
- Prometheus: Collects and stores short-term metrics
- Grafana Mimir: Long-term metrics storage and querying
- Kube State Metrics: Exposes Kubernetes object state as metrics
- Prometheus Node Exporter: Exposes node hardware and OS metrics
Prometheus
Prometheus is the industry-standard monitoring system and time-series database for cloud-native environments. It automatically discovers services in Kubernetes and scrapes metrics from them.
What Prometheus Monitors
Application Metrics:
- Request rates, latencies, error rates (RED metrics)
- Custom business metrics from your applications
- Metrics from `/metrics` endpoints
Kubernetes Metrics:
- Pod CPU and memory usage
- Container resource consumption
- Service endpoint availability
How It Works
- Scrapes metrics every 15-30 seconds (configurable)
- Stores data locally for 15-30 days (typical configuration)
- Remote writes to Mimir for long-term retention
PromQL Basics
Prometheus uses PromQL (Prometheus Query Language) for querying metrics. Example queries:
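A few illustrative queries (the application metric names such as `http_requests_total` are examples; your workloads may expose different ones):

```promql
# Per-second HTTP request rate over the last 5 minutes
rate(http_requests_total[5m])

# 95th-percentile request latency, assuming a histogram metric
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Container memory usage per pod in an example namespace
sum by (pod) (container_memory_working_set_bytes{namespace="production"})
```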
Configuration
Prometheus is configured via the Observability Bundle’s GitOps workflow. Key configuration options:
- Scrape interval (how often targets are scraped)
- Local retention period
- Remote write endpoint for Mimir
- Resource requests and limits
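As a rough sketch, assuming the bundle deploys Prometheus through the prometheus-operator (field names follow its Prometheus custom resource; the Mimir URL is a placeholder):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: observability
spec:
  scrapeInterval: 30s      # how often targets are scraped
  retention: 15d           # local, short-term retention
  remoteWrite:
    - url: http://mimir-gateway.mimir.svc/api/v1/push   # placeholder Mimir endpoint
```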
Grafana Mimir
Mimir provides long-term, scalable storage for Prometheus metrics. It’s designed to handle billions of active time series across multiple tenants.
Why Mimir?
Long-term Retention:
- Stores metrics for months or years
- Prometheus typically keeps 15-30 days locally
- Historical analysis and capacity planning
Scalability:
- Scales to handle millions of samples per second
- Distributed architecture handles large metric volumes
- No single point of failure
Prometheus Compatibility:
- Receives data via Prometheus remote write
- Queried with standard PromQL
- Drop-in replacement for long-term storage
How It Works
Prometheus remote writes samples to Mimir as it scrapes them. Mimir stores the data durably (typically in object storage) and exposes the same PromQL query API, so Grafana can query recent data from Prometheus and historical data from Mimir.
Configuration
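Exact settings depend on your installation, but a minimal sketch of the kind of Mimir options involved (object storage backend and retention; bucket and endpoint names are placeholders) looks like:

```yaml
blocks_storage:
  backend: s3
  s3:
    bucket_name: mimir-blocks              # example bucket name
    endpoint: s3.eu-central-1.amazonaws.com
limits:
  compactor_blocks_retention_period: 1y    # keep metrics for a year
```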
Kube State Metrics
Kube State Metrics generates metrics about the state of Kubernetes objects. Unlike metrics from the Kubernetes API server, kube-state-metrics focuses on the state of the objects (e.g., deployments, pods, nodes) rather than their resource consumption.
What It Exposes
Deployment Metrics:
- `kube_deployment_status_replicas`: Number of desired replicas
- `kube_deployment_status_replicas_available`: Number of available replicas
- `kube_deployment_status_replicas_unavailable`: Number of unavailable replicas
Pod Metrics:
- `kube_pod_status_phase`: Pod phase (Pending, Running, Succeeded, Failed)
- `kube_pod_status_ready`: Whether the pod is ready
- `kube_pod_container_status_restarts_total`: Container restart count
Node Metrics:
- `kube_node_status_condition`: Node conditions (Ready, MemoryPressure, DiskPressure)
- `kube_node_status_allocatable`: Allocatable resources per node
- `kube_node_status_capacity`: Total capacity per node
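These metrics can be combined in PromQL to surface unhealthy objects. For example (illustrative queries):

```promql
# Deployments that currently have unavailable replicas
kube_deployment_status_replicas_unavailable > 0

# Pods whose containers restarted in the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 0

# Nodes reporting memory pressure
kube_node_status_condition{condition="MemoryPressure", status="true"} == 1
```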
Use Cases
Cluster Health Monitoring:
- Detect deployments whose available replicas don’t match the desired count
- Spot pods stuck in Pending or restarting repeatedly
- Alert on node conditions such as MemoryPressure or DiskPressure
Configuration
Kube State Metrics typically requires minimal configuration: the Observability Bundle deploys it with sensible defaults, and Prometheus discovers and scrapes it automatically.
Prometheus Node Exporter
The Node Exporter exposes hardware and OS metrics from Kubernetes nodes. It runs as a DaemonSet (one pod per node) to collect node-level performance data.
What It Exposes
CPU Metrics:
- `node_cpu_seconds_total`: CPU time spent in different modes (user, system, idle); used to calculate CPU usage percentages
Memory Metrics:
- `node_memory_MemTotal_bytes`: Total memory
- `node_memory_MemAvailable_bytes`: Available memory
- `node_memory_Cached_bytes`: Cached memory
Disk and Filesystem Metrics:
- `node_filesystem_size_bytes`: Filesystem size
- `node_filesystem_avail_bytes`: Available space
- `node_disk_io_time_seconds_total`: Disk I/O time
Network Metrics:
- `node_network_receive_bytes_total`: Bytes received
- `node_network_transmit_bytes_total`: Bytes transmitted
- `node_network_receive_errors_total`: Receive errors
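A couple of common derived queries built from these metrics (illustrative; label filters may need adjusting for your environment):

```promql
# CPU usage per node: 100% minus the fraction of time spent idle
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Fraction of memory still available per node
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Filesystem usage ratio, excluding tmpfs
1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})
```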
Use Cases
Node Resource Monitoring:
- Track CPU, memory, disk, and network usage per node
- Spot nodes running low on memory or disk space
- Provide the raw data for capacity planning
Configuration
Node Exporter runs as a DaemonSet with host network and PID namespace access:
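As a trimmed sketch of the relevant pod-spec fields (the bundle ships its own manifest, so names, labels, and the image tag here are illustrative; the real manifest also mounts host paths such as /proc and /sys):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true   # expose metrics in the node's own network namespace
      hostPID: true       # required to read OS-level and per-process statistics
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:v1.8.1   # illustrative tag
          ports:
            - name: metrics
              containerPort: 9100
```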
Visualizing Metrics in Grafana
Once metrics are being collected, Grafana provides powerful visualization capabilities.
Pre-built Dashboards
The Observability Bundle includes dashboards for common monitoring needs:
- Kubernetes Cluster Monitoring: Overview of cluster health and resource usage
- Node Metrics: Detailed node performance (from Node Exporter)
- Pod Metrics: CPU, memory, network by pod
- Deployment Status: Replica counts, pod states
Creating Custom Dashboards
- Navigate to Grafana (access via ingress URL configured during setup)
- Click Dashboards → New Dashboard
- Add a panel and select Prometheus or Mimir as data source
- Write PromQL queries to visualize your metrics
Example panel:
- Panel Title: “API Request Rate”
- Data Source: Prometheus
- Query: `rate(http_requests_total{job="api-service"}[5m])`
- Visualization: Time series graph
Alerting
Grafana can alert based on metric thresholds:
- Create an alert rule in a dashboard panel
- Define alert conditions (e.g., CPU > 80% for 5 minutes)
- Configure notification channels (email, Slack, PagerDuty)
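The CPU example above could reuse the node CPU query shown in the Node Exporter section. Another common condition is an error-rate threshold, which could be expressed with a PromQL expression like the following (the metric name is illustrative; Grafana evaluates the query and applies the “for” duration you configure):

```promql
# More than 5% of requests returning 5xx over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
```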
Accessing Grafana
To access Grafana and view your metrics:
- Get the Grafana URL from your Observability Bundle configuration (set during setup)
- Log in with the credentials you configured:
- Username: admin (default)
- Password: Set during bundle configuration
Troubleshooting
Prometheus not scraping metrics
Check Prometheus targets:
- Access the Prometheus UI (typically at http://prometheus:9090)
- Go to Status → Targets
- Look for targets in “down” state
Common causes:
- Network policies blocking Prometheus
- Service selector not matching pods
- Pods don’t expose a `/metrics` endpoint
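One quick way to rule out the last cause is to hit the pod’s metrics port directly; pod, Service, namespace, and port names below are placeholders:

```sh
# Forward the pod's metrics port locally
kubectl -n my-namespace port-forward pod/my-app-7d9f6c5b4-abcde 8080:8080

# In a second terminal, confirm the endpoint returns Prometheus-format metrics
curl -s http://localhost:8080/metrics | head

# Confirm the Service selector actually matches pods (no endpoints = selector mismatch)
kubectl -n my-namespace get endpoints my-app
```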
High memory usage in Prometheus
Prometheus memory usage scales with:
- Number of time series (unique label combinations)
- Scrape interval (more frequent = more data)
- Retention period (longer = more data stored)
To reduce memory usage:
- Reduce retention period (keep 7-15 days, use Mimir for long-term)
- Increase scrape interval (30s → 60s)
- Drop unnecessary metrics using relabeling (see the sketch after this list)
- Add more memory to Prometheus pods
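The relabeling mentioned above could look roughly like this in a plain Prometheus scrape config (if the bundle uses the prometheus-operator, the equivalent field on a ServiceMonitor is `metricRelabelings`); the job name and metric pattern are only examples:

```yaml
scrape_configs:
  - job_name: example-app
    metric_relabel_configs:
      # Drop a high-cardinality histogram series that is never queried
      - source_labels: [__name__]
        regex: "apiserver_request_duration_seconds_bucket"
        action: drop
```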
Mimir remote write failing
Check the Prometheus logs for remote-write errors:
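For example (namespace and label selector are assumptions; adjust to your installation):

```sh
# Show recent Prometheus log lines mentioning remote write
kubectl -n monitoring logs -l app.kubernetes.io/name=prometheus --tail=200 | grep -i remote
```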
Common issues:
- Mimir gateway not accessible (network/DNS issues)
- Authentication failure (check credentials)
- Rate limiting (ingestion rate too high)
Missing node or cluster metrics
Verify that kube-state-metrics is running and that node-exporter is running on every node (example kubectl checks follow the list below), then check that Prometheus is scraping both exporters:
- Access Prometheus UI → Status → Targets
- Look for `kube-state-metrics` and `node-exporter` targets
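The first two checks can be done with kubectl; the namespace and label names below are examples and may differ in your installation:

```sh
# kube-state-metrics runs as a single Deployment
kubectl -n monitoring get pods -l app.kubernetes.io/name=kube-state-metrics

# node-exporter is a DaemonSet and should report one ready pod per node
kubectl -n monitoring get daemonset -l app.kubernetes.io/name=node-exporter
kubectl get nodes
```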
Best Practices
Metric Naming
Follow Prometheus naming conventions:
- Use base unit (seconds, not milliseconds)
- Append `_total` for counters
- Append `_bucket` for histograms
- Example: `http_request_duration_seconds`
Label Usage
Keep cardinality low:
- Avoid high-cardinality labels (user IDs, timestamps)
- Use service name, environment, region as labels
- Don’t create unique label values for every request
Retention Strategy
Balance cost and utility:
- Prometheus: 7-15 days (recent, high-resolution)
- Mimir: months to years (long-term, downsampled)
- Adjust based on storage costs and query patterns
Resource Limits
Right-size resource allocations:
- Prometheus memory scales with active series
- Monitor and adjust based on actual usage
- Use horizontal scaling for very large deployments
Next Steps
- View Logs - Learn about log aggregation with Loki
- Configure Tracing - Set up distributed tracing
- Observability Overview - Return to bundle overview
