
Accessing Cluster View
From the Clusters List, click any cluster name to open its Cluster View.
Health Status
At the top of the page, you’ll see an overall health indicator showing the number of active issues detected in the cluster. Status types:
- Healthy - No critical issues detected
- Warning - Non-critical issues found (e.g., pod restarts, resource pressure)
- Critical - Urgent problems requiring attention (e.g., CrashLoopBackOff, OOM kills)
Cluster Information Panel
Agent Details
- Cluster Connector Version - The version of the Cluster Connector running in this cluster
- API Key - The unique authentication key for this cluster
- Reinstall Button - Click to view the helm upgrade command to update the agent
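The exact command is generated in the UI with your cluster’s values; it follows the standard Helm upgrade shape. A hedged sketch, where the release name, chart, namespace, and flags below are placeholders rather than Skyhook’s actual values:

```shell
# Placeholders throughout -- copy the real command from the Reinstall dialog
helm upgrade skyhook-agent skyhook/cluster-connector \
  --namespace skyhook-system \
  --set apiKey="<your-cluster-api-key>" \
  --version <latest-version>
```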
Networking
Skyhook auto-assigns a unique wildcard DNS name (e.g., *.7xzftmip.mct545.cluster-tests.com) to each load balancer, so you can expose services without manual DNS setup. Just point your custom domain to the load balancer IP.
The panel shows all LoadBalancer services with their external IPs, auto-generated DNS entries, and ingress classes. If you’re running multiple ingress controllers (e.g., nginx and envoy), each gets its own DNS name and IP, which is useful for traffic splitting, canary testing, or migrating between ingress implementations.
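You can cross-check the panel’s contents from a terminal with standard kubectl queries:

```shell
# All LoadBalancer services and their external IPs
kubectl get svc --all-namespaces --field-selector spec.type=LoadBalancer

# Installed ingress classes (e.g., nginx, envoy)
kubectl get ingressclass
```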
Storage Classes
Lists the storage backends available for persistent volumes (e.g., standard-rwo and premium-rwo on GCP; gp2 and gp3 on AWS). The default storage class is marked; that’s what PVCs use when you don’t specify storageClassName. This helps you understand which storage options are available when provisioning volumes.
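For example, a PVC that omits storageClassName binds to the marked default class, while setting the field explicitly selects another backend. An illustrative manifest using the GCP class names above:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-volume
spec:
  storageClassName: premium-rwo   # explicit backend; omit this field to use the default class
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```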
Get Kubeconfig
A pre-filled gcloud or aws command to configure kubectl access to your cluster from your terminal. The command includes your exact cluster name, region/zone, and project/account ID. Click copy when you need to run commands the UI doesn’t support (custom kubectl plugins, helm charts, etc.).
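The copied command is pre-filled with your cluster’s details; the general shape looks like this (all values below are placeholders):

```shell
# GKE (placeholders: cluster name, zone, project)
gcloud container clusters get-credentials <cluster-name> \
  --zone <zone> --project <project-id>

# EKS (placeholders: cluster name, region)
aws eks update-kubeconfig --name <cluster-name> --region <region>
```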
Cluster Resource Viewer
Quick-access icons to jump directly to any resource type: Nodes, Services, Ingresses, Deployments, Pods, ConfigMaps, Service Accounts, Jobs, Custom Resources, and Events. Faster than clicking through tabs when you know what you’re looking for.
Active Problems
The Active Problems panel automatically detects Kubernetes issues across your cluster with 18+ distinct problem types. Each problem is categorized by severity and includes temporal tracking.
Problem Types Detected
Node Problems:
- NodeNotReady (critical) - Nodes not accepting workloads
- MemoryPressure (critical) - Node running low on memory
- DiskPressure (critical) - Node running low on disk space
- PIDPressure (minor) - Node running low on process IDs
- ManyCordonedNodes (major) - Excessive nodes marked unschedulable
Pod Problems:
- CrashLoopBackOff (critical) - Pods repeatedly failing to start
- ImagePullBackOff (major) - Cannot pull container images
- OOMKilled (critical) - Containers terminated due to out-of-memory
- CreateContainerConfigError (minor) - Invalid container configuration
- LivenessProbeFailing (critical) - Liveness health checks failing
- ReadinessProbeFailing (minor) - Readiness health checks failing
- HighRestartCount (major) - Pods restarting >5 times
- PodPending (minor) - Pods stuck in Pending state
- UnschedulablePods (minor) - Pods cannot be scheduled to nodes
- ManyFailedPods (major) - Multiple pods in Failed state
- EvictedPods (major) - Pods evicted due to resource pressure
- TerminatingPodsStuck (major) - Pods stuck terminating >5 minutes
Storage Problems:
- FailedMount (minor) - Volume mount failures
- FailedAttachVolume (major) - Volume attachment failures
Workload Problems:
- DegradedWorkloads (major) - Deployments/StatefulSets with fewer ready replicas than desired
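For the pod-level problems above, the usual first triage steps from a terminal look like this (pod name and namespace are placeholders):

```shell
# Why is the pod crashing? Check state, events, and last exit code
kubectl describe pod <pod-name> -n <namespace>

# Logs from the previous (crashed) container instance
kubectl logs <pod-name> -n <namespace> --previous

# Spot OOMKilled containers via the last termination reason
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
```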
How Problem Tracking Works
Problems are tracked in two states:
- Active Problems: Currently detected in your cluster’s resource state. These affect your cluster health and are shown as colored chips.
- Recently Observed: Detected in cluster events within the last 10 minutes but not currently active. These help you spot intermittent issues or auto-resolved problems.
Each problem entry shows:
- Severity: Critical (red), Major (orange), or Minor (yellow)
- Affected count: How many resources have this problem
- First observed / Last seen: Temporal tracking with duration
- Observation count: How many times this problem was detected
- Remediation guidance: Specific steps to fix the issue
- View Affected Resources button: Directly filters the Resource Viewer to show impacted pods/workloads
Resource Summary Cards
Quick-view cards show high-level metrics for different resource categories:
Nodes
- Total Nodes - Number of nodes in the cluster
- Ready Nodes - Nodes in Ready state and accepting workloads
Workloads
- Healthy Workloads - Deployments, StatefulSets, and DaemonSets meeting their desired state
- Total Workloads - All workload resources in the cluster
A workload counts as healthy when:
- All desired replicas are ready
- No pods in CrashLoopBackOff or Error state
- All containers passing health checks
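The same readiness check can be made from a terminal, where the READY column shows ready versus desired replicas:

```shell
# Workloads whose READY column reads e.g. 2/3 are degraded
kubectl get deployments,statefulsets,daemonsets --all-namespaces
```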
Events
- Warning Events - Events requiring attention
- Normal Events - Informational events
- Total Events - All cluster events in the last hour
Certificates
- Valid Certificates - TLS certificates that haven’t expired
- Total Certificates - All cert-manager Certificate resources
Capacity and Headroom
The capacity section shows cluster-wide resource utilization with visual progress bars and overcommit detection.
Understanding Requests vs Limits
- Requests - Guaranteed resources the cluster must provide (used for scheduling decisions)
- Limits - Maximum resources pods can consume (pods are throttled or killed if exceeded)
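In a pod spec, the two settings sit side by side under resources; the values below are purely illustrative:

```yaml
resources:
  requests:
    cpu: "250m"      # guaranteed; the scheduler reserves this on a node
    memory: "256Mi"
  limits:
    cpu: "500m"      # CPU is throttled above this
    memory: "512Mi"  # container is OOMKilled above this
```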
Reading the Metrics
Each resource shows two progress bars:
- Solid bar (Requests): Guaranteed resources as % of node allocatable capacity
- Light overlay (Limits): Maximum possible usage as % of node allocatable capacity
Example readings:
- ✅ CPU Requests (60%): Healthy headroom for scheduling new workloads
- 🚨 CPU Limits (2558%): Severely overcommitted - if all pods hit their limits simultaneously, nodes will throttle containers
- ⚠️ Memory Requests (43%): Good utilization with room to grow
- 🚨 Memory Limits (261%): Overcommitted - pods at risk of OOMKill if memory usage spikes
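To see why limit percentages above 100% indicate overcommit, here is a small illustrative calculation. The node and pod figures are hypothetical, not how Skyhook computes the bars:

```shell
# Hypothetical cluster: 16 allocatable cores, 9.6 cores of summed pod
# requests, 409.3 cores of summed pod limits
awk 'BEGIN {
  alloc = 16.0; req = 9.6; lim = 409.3
  printf "requests: %d%% of allocatable\n", req / alloc * 100
  printf "limits:   %d%% of allocatable\n", lim / alloc * 100
}'
# requests: 60% of allocatable
# limits:   2558% of allocatable
```

Requests stay under 100% because the scheduler refuses to place pods beyond allocatable capacity; limits can vastly exceed it because they are only enforced at runtime.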
Warning Events
A detailed table of recent warning events, aggregated by type, shows the most recent issues requiring attention. Columns:
- Reason - Event classification with color coding (Failed, Unhealthy, BackOff, etc.)
- Message - Detailed event description (truncated for long messages)
- Affected Resource - The pod, deployment, or resource triggering the event (clickable link)
- Namespace - Where the event occurred
- Count - How many times this event has been observed
- Last Seen - When the event last occurred (e.g., “0m ago” for very recent events)
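The same warning events can be pulled from a terminal:

```shell
# Recent Warning events across all namespaces, newest last
kubectl get events --all-namespaces \
  --field-selector type=Warning --sort-by=.lastTimestamp
```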
Resource Usage by Namespace
A breakdown of resource consumption across namespaces helps you:
- Identify resource-heavy namespaces: Sort by CPU/Memory to find top consumers
- Detect overcommitment: Compare requests vs limits per namespace
- Track team resource usage: If you use namespace-per-team organization
- Plan capacity: See which namespaces need quota adjustments
Table columns:
- Namespace - The namespace name
- Pods - Number of running pods
- CPU (Requests / Limits) - Total requested and max CPU in cores (e.g., “4 / 12”)
- Memory (Requests / Limits) - Total requested and max memory in Gi (e.g., “8Gi / 24Gi”)
- Quota Status - ✅ Within quota or ⚠️ Over quota (if ResourceQuota defined)
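If a namespace repeatedly shows ⚠️ Over quota, or has no quota at all, a ResourceQuota caps its aggregate requests and limits. An illustrative manifest (name, namespace, and values are examples):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: production
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
```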
Finding oversized resource requests
If the production namespace shows 8 cores requested but only 5 pods running with low actual usage, you may have oversized resource requests preventing efficient scheduling.
Identifying runaway workloads
A namespace with 50 pods requesting 20 cores might indicate a scaling issue or resource leak, especially if replicas grew unexpectedly.
Planning capacity additions
If multiple namespaces show high request percentages and you’re adding new services, use this data to justify adding nodes or adjusting quotas.
