Observability is the cornerstone of running reliable production workloads on Kubernetes. If you can’t see what your pods are doing, you can’t fix them when they crash.
In the past, the standard stack for Kubernetes observability was Prometheus for metrics and Promtail + Loki for logs. However, the ecosystem has evolved. Promtail is now in maintenance mode, and Grafana Alloy (formerly Grafana Agent Flow) is the new, unified collector for metrics, logs, traces, and profiles.
In this guide, we will provision a complete, modern observability stack on Kubernetes using Helm:
- Metrics: kube-prometheus-stack (Prometheus Operator + Grafana).
- Logs Backend: grafana/loki (scalable log aggregation).
- Log Collector: grafana/alloy (the new standard agent).
We will set this up entirely via Helm and YAML values—no manual clicking in the UI.
Architecture Overview
- Prometheus: Scrapes metrics from your applications and Kubernetes components.
- Grafana: Visualizes everything (dashboards).
- Loki: “Like Prometheus, but for logs.” It indexes labels, not the full log content, making it extremely cost-effective.
- Grafana Alloy: The agent running on every node (DaemonSet) that tails your container log files (/var/log/pods/...) and pushes them to Loki.
Prerequisites
- A Kubernetes cluster (local, AKS, EKS, GKE, etc.).
- kubectl installed and configured.
- helm installed.
- A default StorageClass for persistence (standard in most managed clusters).
Step 1: Create Namespaces
Let’s keep our monitoring tools organized.
kubectl create namespace monitoring
kubectl create namespace logging
Step 2: Install kube-prometheus-stack
This chart bundles Prometheus Operator, Grafana, and default alerts.
First, add the repo:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Create a kps-values.yaml file to configure persistence and the Grafana admin password:
# kps-values.yaml
grafana:
  adminPassword: "<GRAFANA_ADMIN_PASSWORD>"
  persistence:
    enabled: true
    size: 10Gi
  # Optional: Enable Ingress if you want external access
  ingress:
    enabled: false
    hosts:
      - grafana.<YOUR_DOMAIN>

prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: <STORAGE_CLASS>
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    retention: 10d
Install the chart:
helm install kps prometheus-community/kube-prometheus-stack \
--namespace monitoring \
-f kps-values.yaml
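After a minute or two, the operator, Prometheus, Alertmanager, and Grafana pods should all be running:

kubectl get pods -n monitoring

To have Prometheus scrape your own applications, you typically create a ServiceMonitor. Here is a minimal sketch for a hypothetical my-app Service exposing a port named http-metrics; note that by default the chart's Prometheus only selects ServiceMonitors labeled with the Helm release name (kps in our case):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: default
  labels:
    release: kps   # matched by the chart's default serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http-metrics
      interval: 30s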
Step 3: Install Loki
We will install Loki in “Monolithic” mode (all components running in a single binary), which is ideal for getting started and for small-to-medium clusters. At massive scale, you would use the “Simple Scalable” or fully “Distributed” (microservices) mode instead.
Add the Grafana repo:
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
Create loki-values.yaml. We enable persistence to ensure logs survive restarts.
# loki-values.yaml
loki:
  commonConfig:
    replication_factor: 1
  storage:
    type: 'filesystem'
  schemaConfig:
    configs:
      - from: 2024-04-01
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: index_
          period: 24h
  pattern_ingester:
    enabled: true
  limits_config:
    allow_structured_metadata: true
    volume_enabled: true
    retention_period: 14d

# Run every Loki component in a single binary (monolithic mode).
# Note: depending on the chart version, you may also need to zero out the
# replicas of the simple-scalable components (read/write/backend).
deploymentMode: SingleBinary

singleBinary:
  replicas: 1
  persistence:
    enabled: true
    size: 20Gi
    storageClass: <STORAGE_CLASS>
Production Note: For production, replace filesystem storage with object storage (S3, Azure Blob, GCS) for durable, effectively unlimited retention at lower cost.
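For example, pointing the chart at S3 might look roughly like this. This is a minimal sketch; it assumes a pre-created bucket and AWS credentials supplied via IRSA or node roles, with the bucket name and region as placeholders:

# loki-values.yaml (excerpt)
loki:
  storage:
    type: s3
    bucketNames:
      chunks: <LOKI_BUCKET_NAME>
    s3:
      region: <AWS_REGION>
  schemaConfig:
    configs:
      - from: 2024-04-01
        store: tsdb
        object_store: s3   # store index and chunks in S3 instead of on local disk
        schema: v13
        index:
          prefix: index_
          period: 24h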
Install Loki:
helm install loki grafana/loki \
--namespace logging \
-f loki-values.yaml
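Before wiring up the log collector, confirm Loki is healthy. Assuming the chart created a Service named loki in the logging namespace:

kubectl get pods -n logging
kubectl port-forward -n logging svc/loki 3100:3100
curl http://localhost:3100/ready
# Expected output once startup has finished: ready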
Step 4: Install Grafana Alloy
This is where the magic happens. We will deploy Alloy as a DaemonSet so it runs on every node, mounting the host’s log directories.
Create alloy-values.yaml. We configure it to:
- Run as a DaemonSet.
- Mount /var/log and /var/lib/docker/containers (typical paths) from the host.
- Inject the HOSTNAME environment variable with the node's name (vital for discovery).
- Use an Alloy configuration (formerly known as River) to discover pods on the same node and tail their logs.
# alloy-values.yaml
alloy:
  # Important: Map the Node Name to the HOSTNAME env var so Alloy knows which node it's on.
  extraEnv:
    - name: HOSTNAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName

  configMap:
    content: |
      // 1. Discover pods on the same node
      discovery.kubernetes "pods" {
        role = "pod"
        selectors {
          role  = "pod"
          field = "spec.nodeName=" + sys.env("HOSTNAME")
        }
      }

      // 2. Relabel metadata to create useful labels (namespace, app, etc.)
      discovery.relabel "pod_logs" {
        targets = discovery.kubernetes.pods.targets

        // Label: namespace
        rule {
          source_labels = ["__meta_kubernetes_namespace"]
          target_label  = "namespace"
        }

        // Label: pod
        rule {
          source_labels = ["__meta_kubernetes_pod_name"]
          target_label  = "pod"
        }

        // Label: container
        rule {
          source_labels = ["__meta_kubernetes_pod_container_name"]
          target_label  = "container"
        }

        // Label: app (take the first non-empty value among the standard app labels)
        rule {
          source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name", "__meta_kubernetes_pod_label_app", "__meta_kubernetes_pod_label_k8s_app"]
          regex         = "^;*([^;]+)(;.*)?$"
          target_label  = "app"
          replacement   = "$1"
        }

        // Label: job (namespace/app)
        rule {
          source_labels = ["namespace", "app"]
          separator     = "/"
          target_label  = "job"
        }

        // 3. Construct the __path__ to the actual log file on the node
        // Path format: /var/log/pods/<namespace>_<pod_name>_<uid>/<container_name>/*.log
        // The explicit regex is required so that $1-$4 refer to the four source labels.
        rule {
          source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_name", "__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
          separator     = "_"
          regex         = "([^_]+)_([^_]+)_([^_]+)_([^_]+)"
          target_label  = "__path__"
          replacement   = "/var/log/pods/$1_$2_$3/$4/*.log"
        }
      }

      // 4. Tail the log files found above
      loki.source.file "pod_logs" {
        targets    = discovery.relabel.pod_logs.output
        forward_to = [loki.write.default.receiver]
      }

      // 5. Send to Loki
      loki.write "default" {
        endpoint {
          url = "http://loki.logging.svc.cluster.local:3100/loki/api/v1/push"
        }
      }

  # Mount host log paths so Alloy can read them.
  # The grafana/alloy chart uses 'alloy.mounts.varlog' as a shortcut,
  # or we can manually define extraVolumes.
  mounts:
    varlog: true
    dockercontainers: true
  # Or, if you need custom paths, use extraVolumes/extraVolumeMounts:
  # extraVolumes:
  #   - name: custom-logs
  #     hostPath: { path: /var/custom-logs }
  # extraVolumeMounts:
  #   - name: custom-logs
  #     mountPath: /var/custom-logs
  #     readOnly: true

controller:
  type: daemonset
Install Alloy:
helm install alloy grafana/alloy \
--namespace logging \
-f alloy-values.yaml
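To verify the collector, check that one Alloy pod is running per node, and optionally open Alloy's built-in debugging UI (the chart exposes it through a Service on port 12345 by default):

kubectl get pods -n logging -l app.kubernetes.io/name=alloy -o wide
kubectl port-forward -n logging svc/alloy 12345:12345

Then browse to http://localhost:12345 to inspect the discovery.kubernetes, discovery.relabel, loki.source.file, and loki.write components and the targets they currently see.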
Step 4.5: Configure Kubernetes Events Collection (Optional)
To collect Kubernetes Events (e.g., scheduling failures, image pull errors, crash-loop back-offs), we need a separate Alloy instance. Why? Because the alloy release above runs as a DaemonSet (on every node). If we asked it to collect cluster-wide events, every single node would send the same events to Loki, causing massive duplication.
Instead, we deploy a second Alloy instance as a Deployment with 1 replica to handle cluster-level data.
Create alloy-events-values.yaml:
# alloy-events-values.yaml
alloy:
  configMap:
    content: |
      // 1. Collect Kubernetes Events
      loki.source.kubernetes_events "cluster_events" {
        job_name   = "integrations/kubernetes/eventhandler"
        log_format = "logfmt"
        forward_to = [loki.write.default.receiver]
      }

      // 2. Send to Loki
      loki.write "default" {
        endpoint {
          url = "http://loki.logging.svc.cluster.local:3100/loki/api/v1/push"
        }
      }

# Run as a Deployment (singleton)
controller:
  type: deployment
  replicas: 1
Install the events collector:
helm install alloy-events grafana/alloy \
--namespace logging \
-f alloy-events-values.yaml
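After a minute or so you should see events arriving. In Grafana Explore (once the Loki data source from Step 5 is configured), a query against the job label we set above should return results:

{job="integrations/kubernetes/eventhandler"}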
Step 5: Auto-provision Grafana Datasource
We want Grafana to automatically see Loki as a data source. We can do this by updating our kps-values.yaml (from Step 2) to include a “sidecar” datasource configuration.
Update kps-values.yaml:
grafana:
  # ... previous config ...
  additionalDataSources:
    - name: Loki
      type: loki
      uid: loki
      access: proxy
      url: http://loki.logging.svc.cluster.local:3100
      jsonData:
        # Optional: derived fields link trace IDs found in logs to another
        # data source; point datasourceUid at a tracing backend (e.g., Tempo) if you run one.
        derivedFields:
          - datasourceUid: prometheus
            matcherRegex: "traceID=(\\w+)"
            name: TraceID
            url: "$${__value.raw}"
Now apply the update:
helm upgrade kps prometheus-community/kube-prometheus-stack \
--namespace monitoring \
-f kps-values.yaml
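You can confirm the data source was provisioned either in the Grafana UI (Connections → Data sources) or via the HTTP API. A quick check, assuming the admin password from Step 2 and jq installed locally:

kubectl port-forward svc/kps-grafana -n monitoring 8080:80
curl -s -u admin:<GRAFANA_ADMIN_PASSWORD> http://localhost:8080/api/datasources | jq '.[].name'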
Step 6: Validate Everything
1. Access Grafana
Port-forward Grafana:
kubectl port-forward svc/kps-grafana -n monitoring 8080:80
Open http://localhost:8080, log in (default user admin), and go to Explore.
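If you left the placeholder password in place or have since forgotten it, you can read it back from the Secret the chart creates (named <release>-grafana, so kps-grafana here):

kubectl get secret -n monitoring kps-grafana \
  -o jsonpath="{.data.admin-password}" | base64 -d; echo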
2. Query Logs (LogQL)
Select Loki from the datasource dropdown.
- View all logs in the default namespace: {namespace="default"}
- Search for errors: {namespace="default"} |= "error"
- Correlate Metrics and Logs: Create a dashboard with a CPU usage graph (Prometheus) and a logs panel (Loki). When you see a CPU spike, zoom into that time range, and the logs panel will automatically show what your app was logging at that exact moment.
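LogQL also supports metric queries over log streams, which pairs nicely with the correlation workflow above. For example, to graph the per-pod error rate in the default namespace over 5-minute windows (adjust the selector and filter to your applications):

sum by (pod) (rate({namespace="default"} |= "error" [5m]))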
Troubleshooting
- Alloy not sending logs?
  Check the Alloy logs:
  kubectl logs -n logging -l app.kubernetes.io/name=alloy
  Common issues:
  - "targets": []: The discovery isn't finding pods. Check that the HOSTNAME env var is set correctly (Step 4) and matches spec.nodeName.
  - Permissions: Ensure the DaemonSet has hostPath mount permissions (PSP/PSA).
  - Paths: Verify your node actually stores logs in /var/log/pods. Some K3s/K0s distros or specialized cloud nodes might use different paths.
- Datasource Error?
  Ensure the url in kps-values.yaml matches the Loki Service DNS name (http://loki.logging.svc.cluster.local:3100).
Production Best Practices
- Object Storage: Do not use filesystem storage for Loki in production. Configure S3/Azure Blob.
- Resource Limits: Set CPU/Memory requests/limits for Alloy and Loki to prevent them from consuming all node resources during log spikes.
- Retention: Configure the compactor in Loki to enforce retention policies (e.g., delete logs older than 30 days); see the sketch after this list.
- Security: Enable authentication (AuthN/AuthZ) if exposing Loki or Grafana publicly.
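A minimal retention sketch for loki-values.yaml, assuming the single-binary filesystem setup from Step 3 (with object storage, point delete_request_store at your object store type instead):

# loki-values.yaml (excerpt)
loki:
  compactor:
    working_directory: /var/loki/compactor
    retention_enabled: true
    delete_request_store: filesystem
  limits_config:
    retention_period: 30d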
Conclusion
You now have a robust, code-defined observability stack.
- Prometheus watches your metrics.
- Loki stores your logs efficiently.
- Alloy reliably ships data from your nodes.
This setup scales well and adheres to modern GitOps practices by defining everything in YAML values. Happy monitoring!