Users should log several components of an Amazon Elastic Kubernetes Service (EKS) cluster to support smooth operations, maintenance, and troubleshooting. An effective logging strategy involves selecting the appropriate tools and validating that they meet user requirements.
This article will discuss how to enable logging for each component of an EKS cluster, what tools are
available, and how they can be implemented to improve the operational readiness of EKS clusters.
Why is logging important for EKS clusters? | Logging is important for production EKS clusters where log data is necessary to aid in troubleshooting problems, analyzing performance, investigating security incidents, and improving operational excellence.
What components of EKS can be logged? | EKS supports exporting logs from the control plane, EC2 worker nodes, Fargate nodes, and pods.
How can I enable EKS control plane logs? | EKS control plane logs must be manually enabled and will stream logs to AWS CloudWatch from master node components like the API Server and Kube Controller Manager.
How can I query EKS control plane logs? | EKS control plane logs are present in AWS CloudWatch and can be queried via the CloudWatch Logs Insights tool.
How can I enable logs for EKS worker nodes and pods? | EKS supports the same logging tools as any other Kubernetes cluster, whether they are open-source, third-party services, or AWS-specific. Users can select any tool that fits their use cases.
How can I enable logging for EKS Fargate? | EKS Fargate supports logging via sidecar containers and via the built-in Fluent Bit log router. Either approach allows users to export log data from their Fargate pods.
How can I optimize EKS logging costs? | Excessive log expenses are mitigated by controlling how many logs are collected, excluding unnecessary log data, reducing log retention time, and optimizing log queries.
Logging is helpful for any environment hosting critical applications. Log data provides insight into the
performance of applications and the underlying infrastructure. The data is essential for analyzing
performance, identifying bottlenecks, troubleshooting bugs and unexpected behavior, detecting security
breaches, maintaining uptime, and monitoring the environment’s health.
In the context of an EKS cluster, users will benefit from logs providing insights into the control plane,
worker nodes, system pods, application pods, and surrounding AWS-related resources. Logging these
components will give users deep insights into how the entire cluster behaves, ensuring that they can
manage their clusters effectively.
EKS clusters contain several different components that support logging. Users operating production
clusters should consider configuring logging for each of these.
As discussed further below, EKS provides a range of log streams generated by master node components (like
the API Server and Kube Scheduler). Enabling these logs using the steps below will provide insight into
cluster operations, including performance and security posture.
The worker nodes in a Kubernetes cluster will generate system logs based on the operating system in use.
Collecting the system logs will provide insight into the host’s performance and data, which is helpful
in troubleshooting host issues.
Kubernetes-specific logs are generated by the kubelet agent,
which is the agent responsible for communicating between the worker node and the control plane. The
kubelet logs are typically stored alongside the operating system logs (e.g., the SystemD Journal on Linux) and should be
collected to enable analysis of the kubelet’s behavior.
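For example, on a systemd-based worker node such as the default Amazon Linux EKS AMI (an assumption; adjust the unit name for other operating systems), the kubelet and other system logs can be read directly from the journal:
# View the last hour of kubelet logs from the systemd journal
journalctl -u kubelet --since "1 hour ago" --no-pager
# View recent error-level messages from the rest of the operating system
journalctl --priority err --since "1 hour ago" --no-pager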
Each pod running in a Kubernetes cluster can produce log output; the contents of the output will depend
on the application containers deployed. The application’s developer controls which log messages are
output to stdout, and
Kubernetes will fetch the logs accordingly when running kubectl logs <pod_name>. Users can collect
these logs to enable application-level troubleshooting and diagnostics capabilities.
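As a quick illustration (the pod name and namespace below are placeholders), kubectl can fetch, stream, and recover logs from a pod’s containers:
# Print the most recent log lines from a pod
kubectl logs my-app --namespace default --tail=100
# Stream new log output as it is produced
kubectl logs my-app --namespace default --follow
# Show logs from the previous container instance after a crash or restart
kubectl logs my-app --namespace default --previous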
EKS clusters run on AWS infrastructure, and most AWS services will generate CloudTrail log event output. CloudTrail provides
insight into API calls made to AWS services (e.g., ec2:RunInstances) and is a valuable tool for
troubleshooting problems, auditing AWS resource access, and verifying the operational stability of AWS
resources. A production EKS setup should include enabling CloudTrail logging to ensure that all AWS
resources implemented for the cluster are logged. The log data is helpful for administrators managing
EKS clusters.
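For example, recent events for a specific API call can be looked up from the CLI; the event name and time window below are illustrative:
# Look up recent CloudTrail events for EC2 instance launches (e.g., worker nodes being created)
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances \
  --start-time 2024-01-01T00:00:00Z \
  --max-results 20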
Maintaining an effective logging setup for every aspect of an EKS cluster may appear daunting and complex
to many users due to the number of components involved. However, implementing an appropriate logging
setup will help improve the operational readiness of an EKS cluster, enable easier troubleshooting for
future issues, allow more straightforward forensic analysis, and provide greater insight into
performance bottlenecks. The payoff for proper logging is significant and will benefit many users.
EKS provides the ability to forward control plane
logs automatically to the CloudWatch Logs service. Logs from the control plane are disabled by
default to mitigate unnecessary costs, but it is highly recommended that users enable the logs for
production clusters where analyzing cluster operations is vital for administrators.
Control plane logs can be enabled via the AWS web console or the AWS CLI. Enabling logs via the CLI is done with the following command:
aws eks update-cluster-config --region region-code --name my-cluster --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'
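After updating the configuration, the currently enabled log types can be confirmed with a quick check (using the same placeholder names as above):
# Show which control plane log types are enabled for the cluster
aws eks describe-cluster --region region-code --name my-cluster --query 'cluster.logging.clusterLogging'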
Users have a choice of which control plane log types to enable. Generally, users will enable all log types in a production environment, but those optimizing for cost may selectively enable only the logs most relevant to their requirements, as shown below.
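As a sketch of that cost-conscious approach, the same command can enable some log types while explicitly disabling the rest; the selection below is only an example:
# Enable only the audit and authenticator logs; disable the remaining log types to reduce ingestion costs
aws eks update-cluster-config --region region-code --name my-cluster --logging '{"clusterLogging":[{"types":["audit","authenticator"],"enabled":true},{"types":["api","controllerManager","scheduler"],"enabled":false}]}'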
The EKS control plane provides log data from several master node components: the API server, the audit log, the authenticator, the controller manager, and the scheduler.
The EKS control plane log data provides extensive insight into the cluster’s activities. The data is
crucial for administrative operations like troubleshooting issues and forensic analysis, so all users
should enable these logs when operating production clusters.
EKS control plane logs are exported to the CloudWatch
Logs service. CloudWatch Logs includes a tool called Logs Insights for querying and analyzing log data, which can be used to query the EKS control plane logs.
To query the control plane logs for a particular EKS cluster, users will need to open the Logs Insights tool in the CloudWatch console and select the cluster’s control plane log group.
Discovering useful log queries is very helpful for investigations, troubleshooting, and diagnostics. Storing successful queries in a document is a common practice for users who use CloudWatch Logs Insights often. Saving important queries saves time in the future and makes it easier to customize existing queries for new purposes.
The CloudWatch Logs Insights query syntax can be found here.
Query which IAM principals have accessed the “kubernetes-admin” RBAC user
This user has unlimited permissions to modify the cluster, and its access is heavily restricted as a best
practice. Monitoring its actions is useful for investigating potential security incidents. The query can
be modified to investigate which other RBAC users are being accessed by IAM principals.
fields @logStream, @timestamp, @message
| sort @timestamp desc
| filter @logStream like /authenticator/
| filter @message like "username=kubernetes-admin"
| limit 50
Query what actions were performed by the “kubernetes-admin” RBAC user
Following on from the above, investigating exactly what actions were performed by this privileged user in
the EKS cluster is useful for securing a cluster.
fields @logStream, @timestamp, @message
| filter @logStream like /^kube-apiserver-audit/
| filter strcontains(user.username,"kubernetes-admin")
| sort @timestamp desc
| limit 50
Query which API Server requests resulted in 5XX errors
Analyzing this data is useful for troubleshooting potential issues occurring in the EKS control plane or
misconfigured requests being performed by clients.
fields @logStream, @timestamp, responseStatus.code, @message
| filter @logStream like /^kube-apiserver-audit/
| filter responseStatus.code >= 500
| limit 50
Query which RBAC user deleted a particular pod
Queries like this are useful for determining which users accessed or modified a particular resource.
Simply modify the “verb” and “requestURI” to perform a range of useful queries related to auditing and
root cause analysis.
fields @logStream, @timestamp, @message
| filter @logStream like /^kube-apiserver-audit/
| filter verb == "delete" and requestURI like "/api/v1/namespaces/default/pods/my-app"
| sort @timestamp desc
| limit 10
There are many more CloudWatch Logs Insights queries available to users. Learning the basic query syntax will yield many benefits for cluster administrators, especially for troubleshooting and analyzing cluster behavior.
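Queries like the ones above can also be run non-interactively with the AWS CLI, which is convenient for scripting saved queries. The sketch below assumes a Linux shell with GNU date and the default control plane log group name, /aws/eks/<cluster-name>/cluster:
# Start a Logs Insights query against the control plane log group for the last hour
QUERY_ID=$(aws logs start-query \
  --log-group-name /aws/eks/my-cluster/cluster \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @logStream, @timestamp, @message | filter @logStream like /authenticator/ | limit 20' \
  --query queryId --output text)
# Fetch the results once the query has finished running
aws logs get-query-results --query-id "$QUERY_ID"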
EKS supports various logging tools provided by AWS, third-party companies, and open-source communities.
These tools typically run as software installed on the worker nodes, allowing them access to collect and
export node and pod logs to the destination log storage service.
EKS supports the open-source tools that enable log collection for any Kubernetes cluster. Examples of open-source logging tools include Grafana Loki, Logstash, and Fluentd, all of which are large projects with substantial user bases and developer communities.
Open-source projects provide a high degree of flexibility and many features at low cost. A key drawback of open-source logging tools, however, is the operational overhead of configuring log storage. Proprietary solutions typically manage the burden of supplying redundant, highly available, and scalable storage backends, whereas open-source projects like Grafana Loki require self-hosting the log storage, which adds operational overhead for the user.
Users will need to evaluate their use cases to determine whether self-hosting their logs is a worthwhile
trade-off for an open-source project’s improved flexibility and community support.
Companies like SumoLogic, DataDog, Splunk, and New Relic provide managed solutions
for log streaming and storage. Users already implementing these types of services in other environments
may choose to adopt the same solutions in their EKS clusters for consistency. Managed solutions will
cost more than open-source equivalents and may provide reduced feature sets. However, they will handle
the operational overhead of log storage, redundancy, and scaling.
The in-house cluster logging solution provided by AWS is called CloudWatch
Container Insights. AWS’s approach to enabling cluster logging involves exporting logs to AWS
CloudWatch, a service allowing the storage and analysis of logs and metrics.
Determining which of the logging solutions above to implement will depend on various factors, such as cost, existing tooling, required features, and the operational capacity to self-host log storage. Weighing these considerations will help users narrow down the ideal logging solution for their use cases. As with all
tooling, the best way to find the appropriate one is to test and validate various options.
Experimentation will provide data on which tools fit requirements and which are inadequate.
AWS users with simple requirements and no exceptional use cases typically default to implementing
CloudWatch Container Insights. This is appropriate for users who are already comfortable using
CloudWatch and may already be using it to store logs/metrics from other AWS services. Container Insights
is a good starting point for a logging solution, and it is easy to migrate away from if a user
eventually decides to switch the logging solution to another provider.
AWS provides a Quickstart Solution that contains all the relevant manifests for enabling Container Insights. The quick start deploys the CloudWatch agent as a DaemonSet for collecting node and pod metrics and a Fluent Bit DaemonSet for collecting container logs and forwarding them to CloudWatch.
The complete installation procedure is located here.
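Once the quick start has been applied, a couple of commands can confirm that log collection is working. The namespace and log group prefix below reflect Container Insights defaults and may differ in customized installations:
# Check that the CloudWatch agent and Fluent Bit pods are running
kubectl get pods -n amazon-cloudwatch
# Confirm that Container Insights log groups are being created for the cluster
aws logs describe-log-groups --log-group-name-prefix /aws/containerinsights/my-cluster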
Users will benefit from experimenting with various logging solutions to gather data on which ones meet
their requirements.
EKS Fargate is a serverless compute option for EKS clusters. Deploying pods to Fargate allows users to delegate the management of worker node compute hosts to AWS, mitigating the operational overhead of managing a fleet of EC2 instances.
Pods deployed to Fargate are still capable of exporting log data. However, since AWS manages the
underlying compute host, there are fewer options available for configuring log streaming.
The two options for enabling logging for EKS Fargate pods are sidecar containers and the Fluent Bit log
router.
A sidecar container is a secondary container defined in the pod schema. A pod can define multiple
containers to run collectively on the same host.
Implementing sidecar containers helps enable additional functionality for the primary application pods.
For example, they can be used to include network proxies, service meshes, and log routers. A typical
pattern for EKS Fargate users is to include a sidecar with a logging solution like Fluent Bit or DataDog
to capture logs from the primary container and forward them to a destination service. This pattern
provides the flexibility of using almost any logging agent (open-source, AWS native, or third-party) for
Fargate pods. However, there is added complexity in modifying every Fargate pod’s schema to include an
additional container with the logging agent. This can result in significant complexity and overhead in
clusters with large numbers of Fargate pods.
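A minimal sketch of the sidecar pattern is shown below. It assumes the application writes log files to a path shared over an emptyDir volume; the image names are placeholders, and the logging agent would still need its own configuration and credentials for the destination service:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  volumes:
    - name: app-logs
      emptyDir: {}        # shared scratch volume for log files
  containers:
    - name: app
      image: my-app-image:latest        # placeholder application image writing logs to /var/log/app
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
    - name: log-forwarder
      image: fluent/fluent-bit:latest   # or another logging agent of your choice
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
          readOnly: true                # sidecar tails the shared files and forwards them to the log backend
EOF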
The sidecar container solution used to be standard for users to implement on Fargate. However, based on
user feedback regarding the complexity of managing sidecar logging containers, AWS provided an
alternative solution to simplify the logging setup for users.
EKS Fargate nodes now include a built-in log router based on the open-source Fluent Bit project. The log
router is transparently installed by default on the underlying Fargate node, so users do not need to
include a sidecar container for log routing. Users let the Fluent Bit router manage log streaming
automatically for their application pods. The Fluent Bit log router deployed by AWS is capable of
streaming Fargate pod logs to several AWS services, including CloudWatch Logs, Amazon OpenSearch Service, Amazon Kinesis Data Streams, and Amazon Kinesis Data Firehose.
Streaming logs to third-party providers like SumoLogic and DataDog is not supported. Use cases requiring
third-party log providers are better suited to EC2 worker nodes than Fargate.
Three key steps are required to enable Fluent Bit logging for Fargate nodes: creating a dedicated aws-observability namespace, creating an aws-logging ConfigMap in that namespace containing the Fluent Bit output configuration, and granting the Fargate pod execution role the IAM permissions needed to write to the chosen log destination.
The setup process for Fargate logging via Fluent Bit can be found here.
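A minimal sketch of that configuration is shown below; the region, log group name, and stream prefix are placeholders, and the Fargate pod execution role must separately be granted permission to write to CloudWatch Logs:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: aws-observability          # dedicated namespace read by the built-in log router
  labels:
    aws-observability: enabled
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-logging                # ConfigMap name expected by the Fargate log router
  namespace: aws-observability
data:
  output.conf: |
    [OUTPUT]
        Name cloudwatch_logs
        Match *
        region us-east-1
        log_group_name my-fargate-app-logs
        log_stream_prefix fargate-
        auto_create_group true
EOF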
Implementing logging for EKS clusters will incur varying costs depending on the configuration. Logs will typically involve expenses for data transfer from the EC2 instances (worker nodes), storage, and log queries. The exact cost will depend on which log service is implemented.
Users can prevent high and unexpected costs by configuring their logging setups appropriately. Log configuration details that have a significant impact on costs include how many logs are collected and from which components, whether unnecessary log data is excluded, how long log data is retained, and how efficiently queries are written. In non-production environments, simply viewing logs on demand with kubectl logs <pod_name> may be enough for testing purposes. For CloudWatch Container Insights, pricing details for log ingestion, storage, and querying can be found here.
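For example, retention on the control plane log group (and any other CloudWatch log groups used by the cluster) can be capped so that old data stops accruing storage charges; the 30-day value below is only illustrative:
# Expire control plane log events after 30 days instead of retaining them indefinitely
aws logs put-retention-policy --log-group-name /aws/eks/my-cluster/cluster --retention-in-days 30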
Configuring a thorough logging setup is crucial to operating a production EKS cluster. Log data enables
users to investigate every aspect of cluster behavior, troubleshoot problems, analyze performance,
diagnose security issues, and optimize operations.
Log data is readily available from every component of an EKS cluster, including the control plane, worker
nodes, pods, and AWS API events. Collecting and storing this data will be beneficial in the long term.
However, the initial configuration will require time and effort to accurately determine what approach
will suit the user’s use case.
Identifying the appropriate logging strategy will require testing and validation. Experimenting with
various tools and services will provide greater confidence in logging strategy choices and help with
validating the selected approach.