Centralizing Kubernetes Logs with the EFK Stack

A practitioner’s walkthrough of building observability for containerized workloads

TL;DR

  • Container logs in Kubernetes are ephemeral by design — centralizing them is non-negotiable in production.
  • Fluentd as a DaemonSet collects logs from every node with no per-app changes required.
  • Kubernetes metadata enrichment is what turns raw text into searchable, useful data.
  • Index Lifecycle Management (ILM) keeps storage costs and query performance under control.
  • The hard parts aren’t the architecture — they’re multiline logs, backpressure, and resource limits.

Introduction

Kubernetes makes a lot of things easier. Logging is not one of them.

When you go from a small fleet of long-running VMs to a cluster of hundreds of pods that come and go, the assumptions baked into traditional logging fall apart. Logs vanish with the pod that wrote them. SSH-ing into a node to grep through files works exactly until you realize the pod you cared about ran on a different node yesterday and got rescheduled three times since. By the time the third on-call ticket of the week mentions “can someone check the logs,” it’s clear the platform needs something better.

This post walks through how I implemented centralized logging on Kubernetes using the EFK stack — Elasticsearch, Fluentd, and Kibana. I’ll cover what the architecture looks like, the configuration that mattered, and the gotchas that took the most time to get right.

Why “Centralized” Isn’t Optional in Kubernetes

Pods are ephemeral. The kubelet rotates container logs by default at around 10 MB and keeps a small number of older files. The moment a pod is deleted — which happens regularly during deployments, scaling events, or even node maintenance — its logs go with it unless something has already shipped them off the node.
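
Those rotation limits are kubelet settings rather than anything the application controls. A rough sketch of where they live, using the common defaults (your distribution may set different values):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: "10Mi"    # rotate a container's log file at roughly 10 MB
containerLogMaxFiles: 5        # keep at most 5 rotated files per container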

A few specific Kubernetes characteristics make this worse:

  • Pods can land on any node, so log location is a moving target.
  • Replicas of the same Deployment write logs that need to be correlated, not searched separately.
  • Init containers and sidecars produce their own log streams that often matter for debugging.
  • Cluster-level events (scheduling failures, OOM kills) live separately from pod logs.

Figure 1 — The cost of relying on node-local logs versus centralizing them.

The EFK Stack at a Glance

The EFK stack is one of the standard answers to this problem. Three components, each doing one job well:

  • Elasticsearch — stores logs in indices optimized for search and time-based queries.
  • Fluentd — collects logs from every node, parses them, adds context, and forwards them to a backend.
  • Kibana — gives engineers a UI to query, filter, and build dashboards on those indices.

A note on alternatives

Fluent Bit (the lighter, faster sibling of Fluentd written in C) has become a common choice for resource-constrained clusters. OpenSearch is the open-source fork of Elasticsearch that emerged after the 2021 license change. The architecture below applies almost identically with those substitutions — the responsibilities don’t change, just the implementations.

Figure 2 — The full pipeline: from pod stdout to Kibana, with metadata enrichment along the way.

Designing the Log Pipeline

The flow we settled on:

  • The application writes to stdout/stderr — 12-factor style, no log files inside the container.
  • The container runtime captures those streams to /var/log/containers/<pod>_<ns>_<container>-<id>.log on the node.
  • The Fluentd DaemonSet — one pod per node — tails those files.
  • The fluent-plugin-kubernetes_metadata_filter enriches each line with namespace, pod name, labels, and container name.
  • Fluentd forwards enriched logs to Elasticsearch over HTTP.
  • Kibana queries Elasticsearch for search and dashboards.

The key insight: applications don’t need to know logging exists. They write to stdout. The platform handles the rest. That separation is what makes centralized logging scalable — it’s a platform concern, not an application concern.
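
Concretely, with the Docker json-file log driver the lines captured in step two are small JSON documents on the node's disk; containerd's CRI logging format is a plain-text variant of the same idea. Roughly:

{"log":"ERROR: connection refused\n","stream":"stderr","time":"2025-05-06T10:14:23.481Z"}

That time field and JSON shape are what the tail source in the Fluentd configuration further down is set up to parse.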

Why DaemonSet (and What to Watch For)

A DaemonSet is the right primitive here. It guarantees one Fluentd pod on every node, including new nodes added by the cluster autoscaler. No manual provisioning, no missed nodes.

Three things matter in practice:

Tolerations

By default, DaemonSets don’t run on tainted nodes (control plane, GPU pools, dedicated workload nodes). If you want logs from those, you need explicit tolerations.

Resource limits

Fluentd is a Ruby process and can spike memory under heavy log volume. Set conservative requests and reasonable limits — too low, and you’ll OOMKill the very component that’s supposed to tell you why the pod next door crashed.

hostPath mounts

Fluentd needs to read /var/log and the symlinked container log directory (/var/lib/docker/containers or /var/log/pods/, depending on the container runtime), so the DaemonSet mounts those paths as hostPath volumes, read-only where possible.

A trimmed DaemonSet manifest looks like this:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels: { app: fluentd }
  template:
    metadata:
      labels: { app: fluentd }
    spec:
      tolerations:
        - operator: Exists   # log even from tainted nodes
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
          resources:
            requests: { cpu: 100m, memory: 200Mi }
            limits:   { cpu: 500m, memory: 500Mi }
          volumeMounts:
            - { name: varlog,    mountPath: /var/log }
            - { name: containers, mountPath: /var/lib/docker/containers, readOnly: true }
      volumes:
        - { name: varlog,    hostPath: { path: /var/log } }
        - { name: containers, hostPath: { path: /var/lib/docker/containers } }

Enriching Logs with Kubernetes Metadata

Raw container logs are nearly useless on their own. A line like ERROR: connection refused tells you nothing about which pod, namespace, or service produced it. The kubernetes_metadata filter fixes this by hitting the Kubernetes API to attach context to every log event:

  • Pod name and namespace
  • Container name
  • Pod labels — the gold for filtering by app, version, environment
  • Node name
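
Because the filter reads pod and namespace objects from the API server, the Fluentd pods need a ServiceAccount with read access to them; the DaemonSet references it via serviceAccountName. A minimal RBAC sketch (names are illustrative):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd-read
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentd-read
subjects:
  - kind: ServiceAccount
    name: fluentd
    namespace: logging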

A minimal Fluentd configuration to do this:

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
    time_key time
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>
 
<filter kubernetes.**>
  @type kubernetes_metadata
  skip_labels false
  skip_container_metadata false
</filter>
 
<match kubernetes.**>
  @type elasticsearch
  host elasticsearch.logging.svc
  port 9200
  logstash_format true
  logstash_prefix logs
  include_tag_key true
  <buffer>
    @type file
    path /var/log/fluentd-buffers/kubernetes.system.buffer
    flush_interval 5s
    chunk_limit_size 8MB
    queue_limit_length 32
    retry_max_interval 30
  </buffer>
</match>

Figure 3 — Each log line is parsed, enriched with cluster context, filtered, and forwarded.
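
The result in Elasticsearch is a document that carries its own context. Roughly what a single enriched event looks like (field names follow the metadata filter's conventions; the pod, node, and label values here are invented for illustration):

{
  "@timestamp": "2025-05-06T10:14:23.481Z",
  "log": "ERROR: connection refused",
  "stream": "stderr",
  "kubernetes": {
    "namespace_name": "payments",
    "pod_name": "payments-api-6d9f7c7b9d-x2k4l",
    "container_name": "api",
    "host": "node-3",
    "labels": { "app": "payments-api", "version": "1.4.2" }
  }
}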

Storage, Indices, and Lifecycle

Elasticsearch happily eats logs at almost any volume. The challenge is keeping it from being eaten by them. Three things prevent runaway storage growth:

Time-based indices

Index names like logs-2025.05.06 (instead of one giant logs index) make rollover and deletion straightforward. Old indices can be deleted as a single operation rather than running expensive delete-by-query.
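
The payoff is operational: with the logstash_prefix used above, dropping a day of logs is one metadata call instead of a scan over every matching document. For example:

# Dropping an entire daily index is a single operation
DELETE /logs-2025.04.06

# versus the expensive alternative against one giant index
POST /logs/_delete_by_query
{ "query": { "range": { "@timestamp": { "lt": "now-30d" } } } }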

Index Lifecycle Management

Elasticsearch can automatically transition indices through hot, warm, cold, and delete phases based on age or size. Hot is for active writes on fast storage; warm is for read-only but still queryable indices; cold sits on cheaper disks; and delete removes the data entirely.

Sensible retention

Most teams don’t need 90 days of debug-level logs in fast storage. A common starting point is 14 to 30 days hot, with longer retention in cold storage if compliance requires it. Adjust based on actual query patterns, not aspirations.
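
A sketch of a policy implementing that starting point (thresholds are illustrative, and some action names vary slightly across Elasticsearch versions):

PUT _ilm/policy/logs-retention
{
  "policy": {
    "phases": {
      "hot": {
        "actions": { "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" } }
      },
      "warm": {
        "min_age": "2d",
        "actions": { "set_priority": { "priority": 50 } }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}

Note that rollover assumes a write alias or data stream; with plain daily indices like the ones above, attaching just the delete phase through an index template is enough to enforce retention.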

Figure 4 — Without ILM, indices grow forever. With it, retention and tiering become a config rather than a chore.

Kibana for the Day-to-Day

Kibana is where most engineers actually live. The search bar accepts KQL queries that compose well with the metadata Fluentd added:

# Find timeout errors in the payments namespace
kubernetes.namespace_name : "payments" and message : "timeout"
 
# Errors only, grouped by app label
kubernetes.labels.app : * and log.level : "ERROR"

The Discover view is for ad-hoc investigation. Dashboards are for known patterns — error rate per service, top noisy pods, message processing latency. A few that consistently earned their keep:

  • Service health overview — errors per service over time
  • Deployment correlation — logs filtered around deployment timestamps
  • Message-flow dashboards — producer and consumer logs side by side for systems like RabbitMQ

This is where MTTR drops the most. Going from “let me ssh to the node” to “type the query” cuts incident investigation time from hours to minutes.

Gotchas I Wish I’d Known Earlier

A few things took longer than expected to get right. They’re worth flagging because they tend to bite in production rather than in test.

Multiline logs

Java stack traces and Python tracebacks span many lines. Without a multiline parser configured, each line becomes its own searchable record, which fragments the trace and breaks correlation. Fluentd has multiline plugins, but the regex needs to match the actual output format of the application.
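
One common approach is the fluent-plugin-concat filter, placed before the metadata filter, which stitches continuation lines back onto the record they belong to. A sketch, assuming each new log record starts with a timestamp (adjust the regex to your applications' actual format):

<filter kubernetes.**>
  @type concat
  key log                                       # the field holding the raw line
  multiline_start_regexp /^\d{4}-\d{2}-\d{2}/   # a new record begins with a date
  separator "\n"
</filter>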

Backpressure

When Elasticsearch is slow or unavailable, Fluentd buffers locally. Without buffer limits, that buffer can fill the node’s disk and impact every workload running there. Configure buffer size, retry settings, and chunk limits explicitly — don’t rely on defaults.
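
In practice that means putting explicit ceilings in the <buffer> section and deciding what happens when they are hit. A sketch of the relevant knobs (values are illustrative, not recommendations):

<buffer>
  @type file
  path /var/log/fluentd-buffers/kubernetes.system.buffer
  chunk_limit_size 8MB
  total_limit_size 2GB                # hard cap on disk used by this buffer
  flush_interval 5s
  overflow_action drop_oldest_chunk   # prefer losing old logs to filling the node's disk
  retry_max_interval 30
  retry_max_times 10                  # stop retrying eventually instead of buffering forever
</buffer>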

Time fields

Container runtimes emit timestamps in different formats. If Fluentd doesn’t parse them correctly, your logs will sort by ingestion time rather than the actual event time. Most of the time this looks fine — until the day you correlate two events and they appear in the wrong order.

Index templates

Define index mappings up front. If you let Elasticsearch auto-detect field types from the first document it sees, you’ll regret it the day a field is sometimes a number and sometimes a string. Mapping conflicts are painful to fix retroactively.
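
With composable index templates that is a small amount of up-front work. A sketch that pins a few of the fields used throughout this post (field names match this pipeline's output; adjust them to your own):

PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "mappings": {
      "properties": {
        "@timestamp":                { "type": "date" },
        "log":                       { "type": "text" },
        "kubernetes.namespace_name": { "type": "keyword" },
        "kubernetes.pod_name":       { "type": "keyword" },
        "kubernetes.labels.app":     { "type": "keyword" }
      }
    }
  }
}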

What Centralized Logging Actually Changed

The before-and-after picture for the team and platform:

Area | Before | After
Log access | SSH into individual nodes, grep through files | Single Kibana query across all pods and nodes
Pod lifecycle | Logs lost when pods are deleted or rescheduled | Logs preserved centrally, independent of pod lifetime
Service correlation | Manually piecing together logs from each replica | Filter by app label, namespace, or trace ID
Incident response | Hours of context-gathering before fixing | Minutes from alert to relevant log line
Compliance / audit | Inconsistent retention, hard to demonstrate | Time-bounded retention via ILM, queryable history
Developer access | Required node-level shell access | Read-only Kibana dashboards by namespace

Lessons Worth Carrying Forward

A few principles emerged from this work that hold up across stacks and clusters:

  • Logs are platform infrastructure, not an application concern.
  • Metadata enrichment is the difference between “here are some logs” and “tell me what happened.”
  • Plan for the day the logging backend goes down — applications shouldn’t notice.
  • ILM is mandatory at scale, not optional.
  • The cheapest log is the one you didn’t ingest. Filter aggressively at the source.

Closing Thoughts

Building observability isn’t about adopting a stack. It’s about giving the team the ability to answer “what just happened?” in seconds instead of hours. The EFK stack is one path to that capability, and a well-proven one. Whatever components you choose, the goal stays the same — turn the noise of a thousand pods into a signal you can act on, and make it cheap enough that everyone reaches for it instead of avoiding it.
