
Overview

Logs provide detailed records of events in your applications and infrastructure. The Observability Bundle includes Grafana Loki, a horizontally scalable log aggregation system that makes it easy to explore and analyze logs from all your Kubernetes clusters. Loki is inspired by Prometheus but designed specifically for logs. Instead of indexing log content (which is expensive), Loki indexes only metadata labels, making it cost-effective to store and query large volumes of logs.

Grafana Loki

Loki is a log aggregation system that collects, stores, and indexes logs from your Kubernetes clusters. It integrates seamlessly with Grafana for log visualization and querying.

Key Features

Cost-Effective Storage:
  • Indexes metadata labels, not log content
  • Significantly reduces storage and indexing costs compared to traditional log systems
  • Compresses log data efficiently
Label-Based Querying:
  • Uses the same label-based approach as Prometheus
  • Fast queries using metadata indexes
  • No need for expensive full-text indexes
Native Kubernetes Integration:
  • Automatically discovers pods and services
  • Extracts Kubernetes metadata as labels (namespace, pod, container)
  • Works with standard stdout/stderr container logs
Grafana Integration:
  • Explore logs directly in Grafana dashboards
  • Correlate logs with metrics and traces
  • LogQL query language similar to PromQL

How Loki Works

Container Logs (stdout/stderr)

    └─> Promtail/OTel Collector ──────> Grafana Loki
            (collects logs)               (stores logs)

                                              └──> Grafana
                                                 (visualizes logs)
Log Collection:
  1. Containers write logs to stdout/stderr
  2. Kubernetes stores logs on nodes
  3. Promtail or OpenTelemetry Collector reads log files
  4. Labels are extracted (namespace, pod, container, etc.)
  5. Logs are sent to Loki for storage
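As a concrete illustration (file path and names are hypothetical), the kubelet writes container logs under /var/log/pods/ on each node, and the collector derives Loki labels from that path:
# Log file written by the kubelet on the node
/var/log/pods/production_api-server-7d4f9c6b8-x2lvq_1f2e3d4c-5b6a-7988-9a0b-c1d2e3f4a5b6/api/0.log

# Labels the collector attaches to this log stream
{namespace="production", pod="api-server-7d4f9c6b8-x2lvq", container="api"}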
Log Querying:
  1. User queries logs in Grafana using LogQL
  2. Loki uses label indexes to find relevant log streams
  3. Loki filters and returns matching log lines
  4. Grafana displays results

LogQL Query Language

LogQL (Log Query Language) is used to query logs in Loki. It’s similar to PromQL but designed for log data.

Basic Queries

Select logs by labels:
{namespace="production"}

{app="api-server", env="prod"}

{namespace="production", pod=~"api-.*"}
Filter log content:
# Contains text
{namespace="production"} |= "error"

# Doesn't contain text
{namespace="production"} != "debug"

# Regex match
{app="api"} |~ "error|ERROR|Error"

# Case-insensitive match
{app="api"} |~ "(?i)error"

Advanced Queries

Parse structured logs:
# JSON logs
{app="api"} | json | status_code >= 400

# Logfmt (key=value format)
{app="api"} | logfmt | level="error"

# Pattern matching
{app="api"} | pattern `<method> <path> <status>`
Aggregations:
# Count logs
count_over_time({namespace="production"}[5m])

# Rate of logs
rate({app="api"} |= "error" [5m])

# Sum a parsed field
sum by (status_code) (
  rate({app="api"} | json | __error__="" [5m])
)
Multi-line queries:
# Errors with high status codes
sum by (pod) (
  count_over_time(
    {namespace="production"}
      | json
      | status_code >= 500 [5m]
  )
)

Configuration

Loki is configured through the Observability Bundle’s GitOps workflow.

Basic Configuration

enabled: true
valuesObject:
  # Storage backend
  storage:
    bucketNames:
      chunks: loki-chunks
      ruler: loki-ruler
    type: s3
    s3:
      endpoint: s3.amazonaws.com
      region: us-east-1

  # Retention period
  limits_config:
    retention_period: 30d

  # Ingester chunk settings
  ingester:
    chunk_idle_period: 1h
    max_chunk_age: 2h
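Note that, depending on the Loki version and chart defaults, retention_period is only enforced when retention is also enabled on the compactor. A minimal sketch to merge next to limits_config above:
  # Enforce deletion of expired chunks (assumption: the compactor runs in this deployment mode)
  compactor:
    retention_enabled: true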

Multi-Tenancy

For multi-cluster deployments, configure tenants to separate logs:
valuesObject:
  auth_enabled: true
  distributor:
    ring:
      kvstore:
        store: memberlist
Then send logs with an X-Scope-OrgID header identifying the cluster/tenant. For detailed configuration options, see the Loki Helm chart documentation.
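For example, Promtail can set the tenant per client via tenant_id, and the OTel Collector's Loki exporter can send the header directly. A sketch (the tenant name cluster-1 is illustrative):
# Promtail: per-client tenant
clients:
  - url: http://loki-gateway/loki/api/v1/push
    tenant_id: cluster-1

# OTel Collector: explicit header on the exporter
exporters:
  loki:
    endpoint: http://loki-gateway/loki/api/v1/push
    headers:
      X-Scope-OrgID: cluster-1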

Viewing Logs in Grafana

Accessing Logs

  1. Navigate to Grafana (access URL configured during Observability Bundle setup)
  2. Go to Explore in the left sidebar
  3. Select Loki as the data source
  4. Use the query builder or write LogQL queries

Common Use Cases

View recent logs for a pod:
  1. Select labels: namespace=production, pod=api-server-xyz
  2. Add time range (e.g., last 15 minutes)
  3. Optionally filter by keyword (e.g., “error”)
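The equivalent LogQL query looks like this (the pod name is illustrative):
{namespace="production", pod="api-server-xyz"} |= "error"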
Search for errors across all services:
{namespace="production"} |= "error" |= "ERROR" |= "Error"
View logs around a specific time:
  1. Use time picker to select date/time
  2. Adjust time range to ±5 minutes around the event
  3. Filter by relevant labels

Live Tail

Grafana supports live tailing of logs (similar to kubectl logs -f):
  1. In Explore view, click the Live button
  2. Logs will stream in real-time as they’re collected
  3. Use filters to narrow down which logs to tail
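From the command line, Loki's logcli tool can tail the same streams. A sketch, assuming logcli is installed locally and reusing the port-forward pattern shown later in this page (addresses are illustrative):
kubectl port-forward -n observability svc/loki-gateway 3100:80
export LOKI_ADDR=http://localhost:3100
logcli query --tail '{namespace="production", app="api"}'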

Log Collection

The Observability Bundle supports multiple log collection methods.

OpenTelemetry Collector

The OTel Collector can collect logs and forward them to Loki:
# OTel Collector configuration
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    operators:
      - type: json_parser
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'

exporters:
  loki:
    endpoint: http://loki-gateway/loki/api/v1/push

service:
  pipelines:
    logs:
      # Wire the filelog receiver to the Loki exporter
      receivers: [filelog]
      exporters: [loki]
Benefits:
  • Single agent for metrics, traces, and logs
  • Unified configuration and management
  • Consistent labeling across telemetry types

Promtail (Loki Native)

Promtail is Loki’s purpose-built log collector:
enabled: true
valuesObject:
  config:
    clients:
      - url: http://loki-gateway/loki/api/v1/push
    positions:
      filename: /tmp/positions.yaml
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
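In practice the pod scrape config also needs relabel_configs so that Kubernetes metadata becomes Loki labels; a minimal sketch (the label names follow the common convention and may differ from your chart's defaults):
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
          - source_labels: [__meta_kubernetes_pod_container_name]
            target_label: container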
Benefits:
  • Optimized for Loki
  • Lower resource usage
  • Native Kubernetes integration

Structuring Logs for Observability

Log Levels

Use consistent log levels across applications:
  • DEBUG: Detailed information for diagnosing issues
  • INFO: General informational messages
  • WARN: Warning messages for potentially harmful situations
  • ERROR: Error messages for failure scenarios
  • FATAL: Critical errors that cause application shutdown

Structured Logging

Use structured logging (JSON) rather than plain text.

Good - JSON structured logs:
{
  "level": "error",
  "timestamp": "2025-01-16T10:30:00Z",
  "message": "Database connection failed",
  "error": "connection timeout",
  "service": "api-server",
  "trace_id": "abc123"
}
Bad - Unstructured text:
[ERROR] 2025-01-16 10:30:00 - Database connection failed: connection timeout
Structured logs enable:
  • Filtering by fields (| json | level="error")
  • Aggregations (sum by (status_code))
  • Correlation with traces (trace_id)
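For example, with the JSON log above, queries like the following become possible (label and field names are illustrative):
# Only error-level lines from the api-server
{app="api-server"} | json | level="error"

# Error rate per service over 5 minutes
sum by (service) (rate({namespace="production"} | json | level="error" [5m]))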

Include Context

Add context to log messages:
  • User ID, request ID, trace ID
  • Operation being performed
  • Input parameters (sanitized)
  • Error codes or types

Correlating Logs with Traces

When using distributed tracing, include trace IDs in logs for correlation:
// Application code
logger.info("Processing request", {
  trace_id: span.context().traceId,
  user_id: request.userId,
  operation: "create_order"
});
In Grafana:
  1. View a trace in Tempo
  2. Click “Logs for this trace”
  3. Grafana automatically queries Loki for logs with matching trace ID
  4. See all logs related to that specific request
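The same correlation can also be done manually with a LogQL query that filters on the trace ID field (the trace ID value is illustrative):
{namespace="production"} | json | trace_id="abc123"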

Troubleshooting

Logs Not Appearing

Check the log collection agent:
# If using Promtail
kubectl logs -n observability -l app=promtail

# If using OTel Collector
kubectl logs -n observability -l app=opentelemetry-collector
Verify Loki is running:
kubectl get pods -n observability | grep loki
Check Loki ingester logs:
kubectl logs -n observability -l app=loki -c ingester
Common issues:
  • Network connectivity between collector and Loki
  • Loki ingestion rate limits hit
  • Incorrect labels causing streams to be dropped
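If rate limits are the cause (the collector logs typically show 429 or "rate limited" errors), they can be raised in Loki's limits_config; a sketch with illustrative values, not recommendations:
limits_config:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  per_stream_rate_limit: 5MB
  per_stream_rate_limit_burst: 10MB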
High Memory Usage

Loki memory usage scales with:
  • Number of active streams (unique label combinations)
  • Ingestion rate (logs per second)
  • Query load
Solutions:
  • Reduce label cardinality (avoid high-cardinality labels like request IDs)
  • Decrease retention period
  • Increase chunk idle period to batch more data before flushing
  • Add more Loki ingesters for horizontal scaling
Check stream cardinality:
# Access Loki metrics
kubectl port-forward -n observability svc/loki-gateway 3100:80
curl http://localhost:3100/metrics | grep loki_ingester_streams
Slow Queries

Optimize queries:
  • Use specific labels to narrow search space
  • Avoid querying very long time ranges
  • Use aggregations instead of returning raw logs when possible
Good query (specific labels, queried over a short time range such as 1 hour):
{namespace="production", app="api"} |= "error"
Bad query (too broad; searches all apps over 24 hours):
{namespace="production"}
Note that the time range comes from Grafana's time picker (or a range selector inside an aggregation function), not from appending a duration to a plain log query.
Check query performance:
  • Grafana shows query execution time
  • Loki query stats show chunks scanned
Solutions:
  • Add more specific labels
  • Reduce time range
  • Use log aggregation rules for common queries
Missing Logs for Specific Pods

Check pod labels:
kubectl get pods -n my-namespace --show-labels
Verify log collector is scraping pod:
# Check Promtail targets
kubectl port-forward -n observability svc/promtail 3101:3101
curl http://localhost:3101/targets
Common issues:
  • Pod doesn’t match log collector’s label selectors
  • Pod is in a namespace not being monitored
  • Logs are being written to files instead of stdout/stderr

Best Practices

Use Structured Logging

Always use JSON or structured log formats. This enables filtering, parsing, and aggregation in LogQL queries.

Control Label Cardinality

Keep the number of unique label combinations low. Avoid labels with values like request IDs, timestamps, or user IDs.
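For example, rather than attaching a request ID as a label (which creates a new stream per request), keep it in the log line and filter for it at query time (the field name is illustrative):
# Avoid: one stream per request
{app="api", request_id="8f3a9c"}

# Prefer: filter the field inside a low-cardinality stream
{app="api"} | json | request_id="8f3a9c"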

Include Trace IDs

Add trace IDs to logs for correlation with distributed traces. This enables powerful debugging workflows.

Set Appropriate Retention

Balance storage costs with retention needs. Typical: 7-30 days for logs, longer for critical systems.

Next Steps