Phase 04 · Observability
Grafana — deep dive
ansible/roles/monitoring/  ·  kube-prometheus-stack · dashboards & visualisation
Grafana is the visualisation layer of the observability stack. It connects to Prometheus as a data source, executes PromQL queries, and renders the results as graphs, gauges, tables, and heatmaps. kube-prometheus-stack ships with 30+ pre-built dashboards covering every layer of the cluster — from raw node metrics to Kubernetes workload health.
01 Data Sources & the Dashboard Model
Grafana architecture
──────────────────────────────────────────────────────────
Data Source  →  Grafana knows how to talk to it
  e.g. Prometheus, Loki, InfluxDB, MySQL, Elasticsearch

Dashboard    →  a collection of panels
  → each panel runs a query against a data source
  → query result is rendered as a graph / gauge / table

Panel        →  one visualisation
  → has a PromQL query, a time range, and a display config
  → e.g. "show CPU usage for node {{ instance }} over last 1h"


How kube-prometheus-stack wires everything
──────────────────────────────────────────────────────────
1. Grafana starts with a Prometheus data source pre-configured
   pointing to http://kube-prometheus-stack-prometheus:9090

2. Dashboard ConfigMaps are mounted into Grafana automatically
   via the sidecar — no manual import needed

3. You open http://grafana.lab.local → login → dashboards are there

Grafana itself stores no metrics — it only queries. This means you can point multiple Grafana instances at the same Prometheus, or swap out the data source entirely without rebuilding your dashboards. The dashboards are just JSON that describes which queries to run and how to render them.
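Since a dashboard is just JSON mapping queries to renderings, a minimal hand-trimmed sketch looks like this (field names follow Grafana's dashboard schema; real exports carry many more fields such as schemaVersion, uid, and gridPos — values here are illustrative only):

```json
{
  "title": "Node CPU (example)",
  "panels": [
    {
      "title": "CPU usage %",
      "type": "timeseries",
      "targets": [
        {
          "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
        }
      ]
    }
  ],
  "time": { "from": "now-1h", "to": "now" }
}
```

The `targets[].expr` field is the PromQL query a panel runs; everything else is display configuration — which is why swapping the data source leaves the dashboards usable.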

02 Pre-built Dashboards — What's Included

kube-prometheus-stack ships dashboards for every component it installs. They load automatically — no import step needed. Navigate to Dashboards → Browse in Grafana to see all of them.

  • Node Exporter / Nodes — Per-node CPU usage, memory pressure, disk read/write throughput, network in/out. The most useful dashboard for understanding VM health — use it when pods are running slowly to check whether a node is saturated.
  • Kubernetes / Compute Resources / Cluster — Cluster-wide CPU and memory requests vs limits vs actual usage. Shows which namespaces are consuming the most resources. Essential for spotting over- or under-provisioned workloads.
  • Kubernetes / Compute Resources / Namespace — Same as the cluster view, but scoped to a single namespace. Great for checking the monitoring stack itself or the online-boutique namespace when a service is behaving unexpectedly.
  • Kubernetes / Compute Resources / Pod — CPU throttling, memory usage, and network I/O per container inside a pod. CPU throttling > 25% means the pod is hitting its limit — either raise the limit or fix a CPU-intensive bug.
  • Kubernetes / API server — Request rate, error rate, and latency for the K8s API server. Useful for diagnosing slow kubectl commands or failed Ansible tasks that call the API server.
  • Alertmanager / Overview — Shows how many alerts are currently firing, their severity, and the notification rate. Use this to confirm the alerting pipeline is working end-to-end — you should see the Watchdog alert always firing.
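The 25% throttling rule of thumb from the Pod dashboard can also be checked with a direct query — a sketch using the standard cAdvisor counters (metric names assume the default kubelet/cAdvisor scrape the stack configures):

```promql
# Fraction of CFS periods in which each container was throttled (last 5m)
# A result > 0.25 roughly matches the "25% throttling" threshold above
sum by(namespace, pod, container) (rate(container_cpu_cfs_throttled_periods_total[5m]))
/
sum by(namespace, pod, container) (rate(container_cpu_cfs_periods_total[5m]))
```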
03 PromQL — Writing Your Own Queries
# ── Instant queries (single value at a point in time) ──────────────

# Is every scrape target up?  (1 = up, 0 = down)
up

# Available memory per node in GiB
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024

# CPU usage % per node (averaged over last 5m)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# How many pods are Running per namespace?
# (use sum, not count — kube_pod_status_phase emits a 0/1 series for every
#  phase, so count would also include pods whose Running series is 0)
sum by(namespace) (kube_pod_status_phase{phase="Running"})


# ── Range-vector functions (rate / increase over a time window) ─────

# HTTP requests per second (if your app exposes http_requests_total)
rate(http_requests_total[5m])

# Total restarts per pod in the last hour
increase(kube_pod_container_status_restarts_total[1h])

# Disk writes per second per node
rate(node_disk_written_bytes_total[5m])

Use the Explore view in Grafana (Explore → Select Prometheus) to run ad-hoc PromQL queries without creating a dashboard panel. This is the fastest way to investigate something — type a query, pick a time range, see the result.

  • rate() — converts a counter (always-increasing) into a per-second rate over a time window. Always use rate() with counters like http_requests_total, never query the raw counter.
  • by(label) — splits the result by a label. sum by(pod) gives one line per pod. Without it you get a single aggregate across all pods.
  • Labels — every metric has labels like namespace, pod, container, instance. Use {namespace="monitoring"} to filter to a specific namespace.
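What rate() computes can be sketched numerically — a hedged Python illustration of the arithmetic, not Prometheus code (real rate() also extrapolates to the window edges and handles counter resets):

```python
def per_second_rate(samples):
    """Approximate PromQL rate(): (last - first) / elapsed seconds.

    samples: list of (timestamp_seconds, counter_value) pairs, oldest first.
    Shows only the core idea — Prometheus additionally extrapolates to the
    edges of the range window and compensates for counter resets.
    """
    (t0, v0) = samples[0]
    (t1, v1) = samples[-1]
    return (v1 - v0) / (t1 - t0)

# A counter that climbed from 1200 to 1500 requests over a 60s window
# is serving roughly 5 requests per second.
print(per_second_rate([(0, 1200), (30, 1350), (60, 1500)]))  # → 5.0
```

This is why you query rate(http_requests_total[5m]) rather than the raw counter: the counter's absolute value only tells you the total since process start, while the rate tells you the current load.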
Tip: In the Explore view, switch to Metrics browser to see all available metric names with autocomplete. Type node_ to see all node-exporter metrics, or kube_pod_ for pod-level metrics from kube-state-metrics.
04 Auto-provisioned Data Source & Dashboards
# How the chart auto-wires Prometheus as a data source
# (rendered inside Grafana's provisioning ConfigMap)

apiVersion: 1
datasources:
  - name:      Prometheus
    type:      prometheus
    url:       http://kube-prometheus-stack-prometheus:9090
    isDefault: true
    access:    proxy   # Grafana server makes the request, not the browser

The chart injects a datasource.yaml into Grafana's provisioning directory at startup — so Prometheus is already connected when you first log in. The sidecar container watches for ConfigMaps labelled grafana_dashboard: "1" and mounts them as dashboards automatically.

  • Adding your own dashboard — create a ConfigMap in any namespace with label grafana_dashboard: "1" containing your dashboard JSON. The sidecar picks it up without restarting Grafana.
  • Persistence is disabled — dashboards created manually in the UI do not survive a pod restart. If you create something in the UI you want to keep, export it as JSON and store it in a ConfigMap so the sidecar reloads it.
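The sidecar mechanism described above can be sketched as a manifest — a hypothetical ConfigMap (the name, namespace, and dashboard JSON are illustrative; the grafana_dashboard: "1" label is what the sidecar actually matches on):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-custom-dashboard        # hypothetical name — any name works
  namespace: monitoring
  labels:
    grafana_dashboard: "1"         # the sidecar selects on this label
data:
  my-dashboard.json: |
    {
      "title": "My Custom Dashboard",
      "panels": []
    }
```

Apply it with kubectl apply -f and the sidecar syncs it on its next poll — no Grafana restart or manual import.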
Default login: http://grafana.lab.local  ·  Username: admin  ·  Password: grafana
Change the password via grafana_admin_password in ansible/group_vars/all.yml and re-run ansible-playbook phase-04.yml to apply.
05 Verify — Confirm Grafana is Working
# Check Grafana pod is healthy
kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana

# Check the Grafana service and ingress
kubectl get svc,ingress -n monitoring | grep grafana

# If the UI isn't loading, check Grafana logs
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana -c grafana

# ── In the Grafana UI ───────────────────────────────────────────────
# 1. Go to http://grafana.lab.local
# 2. Login: admin / grafana
# 3. Connections → Data sources → Prometheus → Test  (should show green)
# 4. Dashboards → Browse → open "Node Exporter / Nodes"
#    → should see CPU/RAM/disk graphs for all 3 nodes
# 5. Explore → run: up
#    → should show 1 for every healthy scrape target

Phase 04 is fully operational when: the Prometheus data source test passes in Grafana, the Node Exporter dashboard shows live data for all 3 nodes, and http://prometheus.lab.local/targets shows all targets as UP.