Phase 04 · Observability
Prometheus & Alertmanager — deep dive
ansible/roles/monitoring/  ·  kube-prometheus-stack · metrics + alerting
Prometheus is a pull-based time-series database that scrapes metrics from your cluster every 15 seconds. Alertmanager receives firing alerts from Prometheus and routes them to the right destination. Together they form the data collection and alerting backbone of the observability stack — Grafana sits on top and visualises everything.
01 Pull-based Metrics Model
Push model (Datadog, CloudWatch agent)
─────────────────────────────────────────────────────────
App  →  push metrics  →  central server
  ↑ app must know where to send data
  ↑ agent runs on every host, consumes resources
  ↑ hard to know if an agent silently stopped sending


Pull model (Prometheus)
─────────────────────────────────────────────────────────
Prometheus  →  GET /metrics  →  Target (pod / node / service)
  ↑ Prometheus decides when and what to scrape
  ↑ targets just expose an HTTP endpoint — no agent needed
  ↑ if a target disappears, Prometheus notices at the very next scrape
  ↑ scrape interval is centrally controlled (default: 15s)


Full pipeline in this cluster
─────────────────────────────────────────────────────────
node-exporter        (host CPU, RAM, disk, network)
kube-state-metrics   (pod counts, deployment status, limits)
K8s API server       (etcd, scheduler, controller-manager)
Online Boutique pods (if ServiceMonitor added)
        ↓  GET /metrics every 15s
Prometheus TSDB      (stores time-series on disk)
        ├─  PromQL queries  →  Grafana dashboards  (visualise)
        └─  firing alerts   →  Alertmanager  (route → email / Slack / PagerDuty)

The key advantage of the pull model: Prometheus is always in control. If a pod crashes and stops exposing /metrics, Prometheus marks the target DOWN at the next scrape and can fire an alert. With push-based systems, a dead agent just silently stops sending data — you might not notice for hours.
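What a target actually serves at /metrics is plain text in the Prometheus exposition format — an illustrative sample (metric names are real, values made up):

```text
# HELP http_requests_total Total HTTP requests handled
# TYPE http_requests_total counter
http_requests_total{method="GET",code="200"} 1027
http_requests_total{method="POST",code="500"} 3
# HELP node_memory_MemAvailable_bytes Memory available on the node
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 8.1235968e+09
```

Each scrape, Prometheus parses these lines and appends one sample per series to its TSDB, stamped with the scrape time.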

02 kube-prometheus-stack — What the Chart Bundles

Instead of installing Prometheus, Grafana, Alertmanager, and the operator separately, the kube-prometheus-stack Helm chart deploys and wires them all together in one shot — including pre-built dashboards, alerting rules, and a Prometheus Operator that manages everything declaratively.

  • Prometheus Operator — a Kubernetes controller that watches for Prometheus, Alertmanager, and ServiceMonitor CRDs and manages the actual StatefulSets and config. You never edit Prometheus config files directly.
  • Prometheus — deployed as a StatefulSet (prometheus-kube-prometheus-stack-prometheus). Scrapes all ServiceMonitors in all namespaces. Stores metrics as time-series in a local TSDB with 1-day retention in this cluster.
  • Grafana — deployed as a Deployment. Pre-configured with Prometheus as a data source and 30+ built-in dashboards covering nodes, namespaces, deployments, and K8s API server metrics — all working out of the box.
  • Alertmanager — deployed as a StatefulSet. Receives alerts from Prometheus when rules fire, then deduplicates, groups, and routes them. Configured via alertmanager.yml — can route to Slack, email, PagerDuty.
  • node-exporter — a DaemonSet, one pod per node. Reads host-level metrics directly from /proc and /sys: CPU usage, memory, disk I/O, network traffic. No agent install needed — it runs in a container.
  • kube-state-metrics — talks to the K8s API server and exposes object-level metrics: how many pods are running vs desired, which deployments are unavailable, pod resource requests vs actual limits. Essential for K8s-aware dashboards.
03 Prometheus Operator & ServiceMonitor CRD
# How scraping is configured — declaratively, not via prometheus.yml
# The Operator watches for ServiceMonitor objects and auto-generates
# the Prometheus scrape config from them.

apiVersion: monitoring.coreos.com/v1
kind:       ServiceMonitor
metadata:
  name:      my-app
  namespace: default
spec:
  selector:
    matchLabels:
      app: my-app       # match Services carrying this label
  endpoints:
    - port: metrics      # scrape the port named "metrics"
      interval: 15s

Before the Prometheus Operator, you had to hand-edit prometheus.yml every time you added a new service to scrape. The Operator replaces that with a CRD: drop a ServiceMonitor in any namespace and Prometheus automatically starts scraping the endpoints behind the matching Service — no restarts, no config edits.
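For intuition, here is a simplified sketch of the scrape config the Operator might generate for the ServiceMonitor above — the real output contains many more relabeling rules:

```yaml
# Simplified sketch — not the literal Operator output
scrape_configs:
  - job_name: serviceMonitor/default/my-app/0   # Operator's naming scheme
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: endpoints          # discover Service endpoints via the K8s API
        namespaces:
          names: [default]
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: my-app            # keep only endpoints of matching Services
        action: keep
```

The point is that this file is generated and reloaded automatically — you only ever author the ServiceMonitor.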

Why serviceMonitorSelectorNilUsesHelmValues: false?
By default, Prometheus only picks up ServiceMonitors that have the chart's release label. Setting this to false removes that restriction — any ServiceMonitor in any namespace gets picked up automatically. This means adding monitoring to a new app only requires creating a ServiceMonitor in that app's namespace.
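For pods that aren't fronted by a Service, the sibling PodMonitor CRD matches pod labels directly — a minimal sketch with placeholder names:

```yaml
apiVersion: monitoring.coreos.com/v1
kind:       PodMonitor
metadata:
  name:      my-app-pods
  namespace: default
spec:
  selector:
    matchLabels:
      app: my-app           # matches pod labels directly — no Service needed
  podMetricsEndpoints:
    - port: metrics          # named container port to scrape
      interval: 15s
```

With podMonitorSelectorNilUsesHelmValues: false (set below), this is picked up the same way a ServiceMonitor is.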
04 Alertmanager — Alert Routing Pipeline
Alert lifecycle
───────────────────────────────────────────────────────────
PrometheusRule  (defines when to fire)
  e.g. "if node memory > 85% for 5 minutes → fire HighMemory"
        ↓
Prometheus  evaluates rules every 15s
  → state: Inactive → Pending (condition met but not long enough)
                    → Firing  (condition held for the full duration)
        ↓  POST /api/v2/alerts
Alertmanager
  1. Deduplicate   — same alert firing from 3 replicas → 1 notification
  2. Group         — bundle related alerts into one message
  3. Inhibit       — suppress child alerts when parent fires
  4. Route         — send to the right receiver (Slack, email, PagerDuty)
  5. Silence       — mute alerts during maintenance windows

Alertmanager is intentionally separate from Prometheus. Prometheus decides when something is wrong. Alertmanager decides who to tell and how. This split means you can have multiple Prometheus servers all routing to a single Alertmanager, and you only configure your notification channels in one place.
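A sketch of what that routing looks like in alertmanager.yml — receiver names, addresses, and keys here are placeholders, not this cluster's config:

```yaml
# Routing sketch — all receivers and endpoints are placeholders
route:
  receiver: default-email            # fallback when no child route matches
  group_by: [alertname, namespace]   # one notification per alert name + namespace
  group_wait: 30s                    # short wait so related alerts batch together
  repeat_interval: 4h                # re-notify if still firing
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty            # critical alerts page someone
receivers:
  - name: default-email
    email_configs:
      - to: ops@example.com
  - name: pagerduty
    pagerduty_configs:
      - routing_key: <integration-key>
```

The route tree is evaluated top-down: the first matching child route wins, otherwise the root receiver gets the alert.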

  • PrometheusRule — a CRD that holds alerting rules. The Operator automatically loads them into Prometheus. kube-prometheus-stack ships ~200 default rules covering node health, K8s components, and pod state.
  • Pending → Firing — the for: duration prevents flapping. A brief CPU spike won't page you — the condition must hold for the full duration before an alert fires.
  • Silences — accessible at http://alertmanager.lab.local/#/silences. Create a silence before planned maintenance so you don't get paged for expected downtime.
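The HighMemory example from the lifecycle diagram, written out as a PrometheusRule — name, namespace, and threshold are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind:       PrometheusRule
metadata:
  name:      high-memory
  namespace: monitoring
spec:
  groups:
    - name: node-memory
      rules:
        - alert: HighMemory
          # fraction of memory in use, per node
          expr: 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes > 0.85
          for: 5m                # must hold 5 minutes: Pending → Firing
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.instance }} memory above 85% for 5 minutes"
```

Apply it with kubectl and the Operator loads it into Prometheus automatically — no restart needed.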
05 Helm Values — Key Decisions for This Cluster
# ansible/roles/monitoring/files/kube-prometheus-values.yaml

prometheus:
  prometheusSpec:
    retention: 1d         # keep 1 day of data — lab use only
    storageSpec: {}       # empty = ephemeral, no PVC needed

    # scrape ALL namespaces — no label selector required
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues:     false
    ruleSelectorNilUsesHelmValues:           false

alertmanager:
  alertmanagerSpec:
    storage: {}           # no PVC

grafana:
  adminPassword: "grafana"
  persistence:
    enabled: false        # no PVC — dashboards survive via ConfigMaps

Three decisions that differ from a production setup — each made deliberately for a local VM environment.

  • storageSpec: {} / storage: {} — no PersistentVolumeClaim. Metrics live in an emptyDir volume and disappear if the pod restarts. Acceptable for a lab — in production you'd attach a PVC with enough space for your retention window.
  • retention: 1d — production clusters keep 15–90 days. 1 day keeps disk usage low on local VMs where you only have the OS disk.
  • Selector nil values = false — in production you'd scope which ServiceMonitors Prometheus picks up to avoid scraping everything. Here, scraping everything is the goal.
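For contrast, a sketch of how the same values might look in production — the storage class and sizes are placeholders, not from this repo:

```yaml
# Hypothetical production-leaning values — placeholders, not this repo's config
prometheus:
  prometheusSpec:
    retention: 30d                     # weeks of history, not 1 day
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd   # placeholder class name
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 100Gi           # sized to fit the retention window
```

Metrics then survive pod restarts, at the cost of provisioning and monitoring the PVC itself.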
06 Verify — Check Everything is Running
# All monitoring pods should be Running
kubectl get pods -n monitoring

# Expected pods:
#  kube-prometheus-stack-operator-*           1/1  Running
#  kube-prometheus-stack-grafana-*            3/3  Running
#  kube-prometheus-stack-kube-state-metrics-* 1/1  Running
#  prometheus-kube-prometheus-stack-prometheus-0  2/2  Running
#  alertmanager-kube-prometheus-stack-alertmanager-0  2/2  Running
#  kube-prometheus-stack-prometheus-node-exporter-*  (one per node)  1/1  Running

# Check all 3 ingresses exist
kubectl get ingress -n monitoring

# Quick PromQL to verify scraping is working
# Open http://prometheus.lab.local and run:
#   up                          → 1 for every healthy target
#   node_memory_MemAvailable_bytes  → RAM available per node
#   kube_pod_status_phase           → pod phase counts

At http://prometheus.lab.local/targets you'll see every scrape target Prometheus knows about. All targets should show UP with a recent scrape timestamp. Any DOWN target means either the pod isn't exposing /metrics or the ServiceMonitor selector doesn't match.

Alertmanager UI is at http://alertmanager.lab.local. The default rules from kube-prometheus-stack will likely show some alerts in Firing state — Watchdog (always fires, used to confirm the pipeline is working) and possibly some informational alerts about the cluster. This is normal.