Kubernetes Monitoring with Grafana + Prometheus Metrics and Loki Logs using Grafana Alloy (Helm Setup)

Observability is the cornerstone of running reliable production workloads on Kubernetes. If you can’t see what your pods are doing, you can’t fix them when they crash.

In the past, the standard stack for Kubernetes observability was Prometheus for metrics and Promtail + Loki for logs. However, the ecosystem has evolved. Promtail is now in maintenance mode, and Grafana Alloy (formerly Grafana Agent Flow) is the new, unified collector for metrics, logs, traces, and profiles.

In this guide, we will provision a complete, modern observability stack on Kubernetes using Helm:

  1. Metrics: kube-prometheus-stack (Prometheus Operator + Grafana).
  2. Logs Backend: grafana/loki (Scalable log aggregation).
  3. Log Collector: grafana/alloy (The new standard agent).

We will set this up entirely via Helm and YAML values—no manual clicking in the UI.

Architecture Overview

At a high level: Grafana Alloy runs as a DaemonSet on every node, tails pod log files from the host filesystem, and pushes them to Loki in the logging namespace. A second, single-replica Alloy Deployment collects cluster-wide Kubernetes Events. On the metrics side, the Prometheus Operator manages Prometheus, which scrapes the cluster, and Grafana sits on top querying both Prometheus and Loki.

Prerequisites

  1. A running Kubernetes cluster and kubectl configured to talk to it.
  2. Helm 3 installed.
  3. A StorageClass for persistent volumes (referenced below as <STORAGE_CLASS>).

Step 1: Create Namespaces

Let’s keep our monitoring tools organized.

kubectl create namespace monitoring
kubectl create namespace logging

Step 2: Install kube-prometheus-stack

This chart bundles Prometheus Operator, Grafana, and default alerts.

First, add the repo:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Create a kps-values.yaml file to configure persistence and Grafana admin password:

# kps-values.yaml
grafana:
  adminPassword: "<GRAFANA_ADMIN_PASSWORD>"
  persistence:
    enabled: true
    size: 10Gi
  # Optional: Enable Ingress if you want external access
  ingress:
    enabled: false
    hosts:
      - grafana.<YOUR_DOMAIN>

prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: <STORAGE_CLASS>
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    retention: 10d

Install the chart:

helm install kps prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f kps-values.yaml
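
Before moving on, it’s worth confirming the release came up (a quick sanity check; exact pod names vary by cluster):

kubectl get pods -n monitoring

The Prometheus Operator, Prometheus, Alertmanager, and Grafana pods should all reach Running within a few minutes.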

Step 3: Install Loki

We will install Loki in “Monolithic” mode (all components in one binary), which is perfect for getting started or for small-to-medium clusters. For larger scale, you’d use Loki’s “Simple Scalable” or “Microservices” deployment modes.

Add the Grafana repo:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

Create loki-values.yaml. We enable persistence to ensure logs survive restarts.

# loki-values.yaml
loki:
  commonConfig:
    replication_factor: 1
  storage:
    type: 'filesystem'
  schemaConfig:
    configs:
      - from: 2024-04-01
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: index_
          period: 24h
  pattern_ingester:
    enabled: true
  limits_config:
    allow_structured_metadata: true
    volume_enabled: true
    retention_period: 14d

singleBinary:
  replicas: 1
  persistence:
    enabled: true
    size: 20Gi
    storageClass: <STORAGE_CLASS>

Production Note: For production, replace filesystem storage with object storage (S3, Azure Blob, GCS) for durable, long-term retention at lower cost.

Install Loki:

helm install loki grafana/loki \
  --namespace logging \
  -f loki-values.yaml
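
To confirm Loki is ready before pointing collectors at it, hit its readiness endpoint through a temporary port-forward (the service name loki comes from the Helm release name used above):

kubectl -n logging port-forward svc/loki 3100:3100 &
curl -s http://localhost:3100/ready

Loki answers ready once all components of the single binary have started.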

Step 4: Install Grafana Alloy

This is where the magic happens. We will deploy Alloy as a DaemonSet so it runs on every node, mounting the host’s log directories.

Create alloy-values.yaml. We configure it to:

  1. Run as a DaemonSet.
  2. Mount /var/log and /var/lib/docker/containers (typical paths) from the host.
  3. Inject the HOSTNAME environment variable with the actual Node Name (vital for discovery).
  4. Use an Alloy configuration (the syntax formerly called River) to discover pods on the same node and tail their logs.

# alloy-values.yaml
alloy:
  # Important: Map the Node Name to the HOSTNAME env var so Alloy knows which node it's on.
  extraEnv:
    - name: HOSTNAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName

  configMap:
    content: |
      // 1. Discover pods on the same node
      discovery.kubernetes "pods" {
        role = "pod"
        selectors {
          role  = "pod"
          field = "spec.nodeName=" + sys.env("HOSTNAME")
        }
      }

      // 2. Relabel metadata to create useful labels (namespace, app, etc.)
      discovery.relabel "pod_logs" {
        targets = discovery.kubernetes.pods.targets

        // Label: namespace
        rule {
          source_labels = ["__meta_kubernetes_namespace"]
          target_label  = "namespace"
        }

        // Label: pod
        rule {
          source_labels = ["__meta_kubernetes_pod_name"]
          target_label  = "pod"
        }

        // Label: container
        rule {
          source_labels = ["__meta_kubernetes_pod_container_name"]
          target_label  = "container"
        }

        // Label: app (try standard labels)
        rule {
          source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name", "__meta_kubernetes_pod_label_app", "__meta_kubernetes_pod_label_k8s_app"]
          regex         = "^;*([^;]+)(;.*)?$"
          target_label  = "app"
          replacement   = "$1"
        }

        // Label: job (namespace/app)
        rule {
          source_labels = ["namespace", "app"]
          target_label  = "job"
          separator     = "/"
        }

        // 3. Construct the __path__ to the actual log file on the node
        // Path format: /var/log/pods/<namespace>_<pod_name>_<uid>/<container_name>/*.log
        rule {
          source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_name", "__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
          separator     = ";"
          // Explicit capture groups are required here: the default regex
          // "(.*)" would only populate $1 with the whole joined string.
          regex         = "([^;]+);([^;]+);([^;]+);([^;]+)"
          target_label  = "__path__"
          replacement   = "/var/log/pods/$1_$2_$3/$4/*.log"
        }
      }

      // 4. Expand the glob patterns above into concrete file paths.
      // loki.source.file doesn't do file discovery itself, so we pair it
      // with local.file_match.
      local.file_match "pod_logs" {
        path_targets = discovery.relabel.pod_logs.output
      }

      // 5. Tail the log files found above
      loki.source.file "pod_logs" {
        targets    = local.file_match.pod_logs.targets
        forward_to = [loki.write.default.receiver]
      }

      // 6. Send to Loki
      loki.write "default" {
        endpoint {
          url = "http://loki.logging.svc.cluster.local:3100/loki/api/v1/push"
        }
      }

  # Mount host log paths so Alloy can read them
  # The grafana/alloy chart uses 'alloy.mounts.varlog' as a shortcut,
  # or we can manually define extraVolumes.
  mounts:
    varlog: true
    dockercontainers: true

  # Or if you need custom paths, use extraVolumes/extraVolumeMounts:
  # extraVolumes:
  #   - name: custom-logs
  #     hostPath: { path: /var/custom-logs }
  # extraVolumeMounts:
  #   - name: custom-logs
  #     mountPath: /var/custom-logs
  #     readOnly: true

controller:
  type: daemonset

Install Alloy:

helm install alloy grafana/alloy \
  --namespace logging \
  -f alloy-values.yaml
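
To verify the collector, check that the DaemonSet placed a pod on every node, and optionally open Alloy’s built-in UI, which lists every component and its health (the UI listens on port 12345 by default; the resource name alloy comes from the release name):

kubectl -n logging get daemonset alloy
kubectl -n logging port-forward daemonset/alloy 12345:12345

Then browse to http://localhost:12345 and confirm that discovery.kubernetes, local.file_match, and loki.source.file report discovered targets.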

Step 4.5: Configure Kubernetes Events Collection (Optional)

To collect Kubernetes Events (e.g., FailedScheduling, or the BackOff events behind a CrashLoopBackOff), we need a separate Alloy instance. Why? Because the alloy release above runs as a DaemonSet (one pod per node). If we asked it to collect cluster-wide events, every single node would send the same events to Loki, causing massive duplication.

Instead, we deploy a second Alloy instance as a Deployment with 1 replica to handle cluster-level data.

Create alloy-events-values.yaml:

# alloy-events-values.yaml
alloy:
  configMap:
    content: |
      // 1. Collect Kubernetes Events
      loki.source.kubernetes_events "cluster_events" {
        job_name   = "integrations/kubernetes/eventhandler"
        log_format = "logfmt"
        forward_to = [loki.write.default.receiver]
      }

      // 2. Send to Loki
      loki.write "default" {
        endpoint {
          url = "http://loki.logging.svc.cluster.local:3100/loki/api/v1/push"
        }
      }

# Run as a Deployment (Singleton)
controller:
  type: deployment
  replicas: 1

Install the events collector:

helm install alloy-events grafana/alloy \
  --namespace logging \
  -f alloy-events-values.yaml
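
A quick way to exercise the pipeline is to generate events on purpose, for example with a throwaway pod that references a nonexistent image (the pod name and image here are purely for illustration):

kubectl run event-test --image=registry.invalid/does-not-exist
kubectl delete pod event-test

Within a minute or so the resulting Failed and BackOff events should show up in Loki under the job integrations/kubernetes/eventhandler.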

Step 5: Auto-provision Grafana Datasource

We want Grafana to automatically see Loki as a data source. We can do this by updating our kps-values.yaml (from Step 2): the chart renders additionalDataSources into a provisioning ConfigMap that Grafana’s datasource sidecar loads on startup.

Update kps-values.yaml:

grafana:
  # ... previous config ...
  additionalDataSources:
    - name: Loki
      type: loki
      uid: loki
      access: proxy
      url: http://loki.logging.svc.cluster.local:3100
      # Optional: jsonData.derivedFields can link log lines to traces,
      # but that requires a tracing datasource (e.g., Tempo), which this
      # guide does not install, so we leave it out.

Now apply the update:

helm upgrade kps prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f kps-values.yaml
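
To confirm the datasource was provisioned without opening the UI, you can query Grafana’s HTTP API (substitute the admin password you set in Step 2):

kubectl -n monitoring port-forward svc/kps-grafana 8080:80 &
curl -s -u admin:<GRAFANA_ADMIN_PASSWORD> http://localhost:8080/api/datasources

The response should list the default Prometheus datasource alongside the new Loki entry with uid loki.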

Step 6: Validate Everything

1. Access Grafana

Port-forward Grafana:

kubectl port-forward svc/kps-grafana -n monitoring 8080:80

Open http://localhost:8080, log in (default user admin), and go to Explore.
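
If you’ve misplaced the password, it is stored in the Secret the chart created (the name kps-grafana is derived from the release name):

kubectl -n monitoring get secret kps-grafana -o jsonpath="{.data.admin-password}" | base64 -d; echo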

2. Query Logs (LogQL)

Select Loki from the datasource dropdown and try a simple stream selector such as {namespace="monitoring"} or {job="integrations/kubernetes/eventhandler"}.
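
The same queries can be run from a terminal with logcli, Loki’s command-line client (a sketch assuming logcli is installed locally and the port-forward from the Step 3 readiness check is still active):

export LOKI_ADDR=http://localhost:3100
logcli query '{namespace="monitoring"}' --limit=20
logcli query '{job="integrations/kubernetes/eventhandler"}' --limit=20

The second query returns the Kubernetes Events collected in Step 4.5.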

Troubleshooting

  1. No logs in Loki: Port-forward the Alloy UI (port 12345, as in the Step 4 check) and confirm loki.source.file has targets. If it has none, verify that the HOSTNAME env var maps to spec.nodeName as configured in Step 4.
  2. Alloy can’t read log files: Make sure mounts.varlog (and dockercontainers where relevant) are enabled so the DaemonSet can read /var/log/pods on the host.
  3. Duplicate Kubernetes Events: Events should only be collected by the single-replica alloy-events Deployment, never by the DaemonSet.

Production Best Practices

  1. Object Storage: Do not use filesystem storage for Loki in production. Configure S3/Azure Blob.
  2. Resource Limits: Set CPU/Memory requests/limits for Alloy and Loki to prevent them from consuming all node resources during log spikes.
  3. Retention: Enable retention in the Loki compactor to enforce retention policies; the retention_period (14d) set in Step 3 only takes effect once the compactor runs with retention enabled.
  4. Security: Enable authentication (AuthN/AuthZ) if exposing Loki or Grafana publicly.

Conclusion

You now have a robust, code-defined observability stack.

This setup scales well and adheres to modern GitOps practices by defining everything in YAML values. Happy monitoring!

