Users should log several components of an Elastic Kubernetes Service (EKS) cluster to ensure easy operations, maintenance, and troubleshooting. An effective logging strategy will involve selecting the appropriate tools and validating that they meet user requirements.
This article will discuss how to enable logging for each component of an EKS cluster, what tools are available, and how they can be implemented to improve the operational readiness of EKS clusters.
| Question | Answer |
|---|---|
| Why is logging important for EKS clusters? | Logging is important for production EKS clusters where log data is necessary to aid in troubleshooting problems, analyzing performance, investigating security incidents, and improving operational excellence. |
| What components of EKS can be logged? | EKS supports exporting logs from the control plane, EC2 worker nodes, Fargate nodes, and pods. |
| How can I enable EKS control plane logs? | EKS control plane logs must be manually enabled and will stream logs to AWS CloudWatch from master node components like the API Server and Kube Controller Manager. |
| How can I query EKS control plane logs? | EKS control plane logs are present in AWS CloudWatch and can be queried via the CloudWatch Logs Insights tool. |
| How can I enable logs for EKS worker nodes and pods? | EKS supports the same logging tools as any other Kubernetes cluster, whether they are open-source, third-party services, or AWS-specific. Users can select any tool that fits their use cases. |
| How can I enable logging for EKS Fargate? | EKS Fargate supports logging via sidecar containers and via the built-in Fluent Bit log router. Either approach allows users to export log data from their Fargate pods. |
| How can I optimize EKS logging costs? | Excessive log expenses are mitigated by controlling how many logs are collected, excluding unnecessary log data, reducing log retention time, and optimizing log queries. |
Logging is helpful for any environment hosting critical applications. Log data provides insight into the performance of applications and the underlying infrastructure. The data is essential for analyzing performance, identifying bottlenecks, troubleshooting bugs and unexpected behavior, detecting security breaches, maintaining uptime, and monitoring the environment’s health.
In the context of an EKS cluster, users will benefit from logs providing insights into the control plane, worker nodes, system pods, application pods, and surrounding AWS-related resources. Logging these components will give users deep insights into how the entire cluster behaves, ensuring that they can manage their clusters effectively.
EKS clusters contain several different components that support logging. Users operating production clusters should consider configuring logging for each of these.
EKS provides a range of log streams generated by master node components (like the API Server and Kube Scheduler). Enabling these logs, as described later in this article, provides insight into cluster operations, including performance and security posture.
The worker nodes in a Kubernetes cluster generate system logs based on the operating system in use. Collecting the system logs provides insight into the host’s performance and is helpful when troubleshooting host issues.
Kubernetes-specific logs are generated by the kubelet, the agent responsible for communication between the worker node and the control plane. The kubelet logs are typically stored alongside the operating system logs (e.g., the systemd journal on Linux) and should be collected to enable analysis of the kubelet’s behavior.
Each pod running in a Kubernetes cluster can produce log output; the contents of the output will depend on the application containers deployed. The application’s developer controls which log messages are written to stdout and stderr, and Kubernetes fetches those logs when running kubectl logs <pod_name>. Users can collect these logs to enable application-level troubleshooting and diagnostics capabilities.
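As a minimal sketch of the application side, the snippet below configures Python's standard `logging` module to write one JSON object per line to stdout, which is the stream Kubernetes captures for `kubectl logs`. The logger name `demo-app`, the field names, and the sample messages are illustrative assumptions, not part of any EKS requirement.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(name="demo-app"):
    """Return a logger that writes JSON lines to stdout at INFO and above."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()        # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False       # stop the root logger from emitting a second copy
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("order processed")    # written to stdout, so the container runtime captures it
    log.debug("cache miss")        # below INFO, so it is never emitted
```

Structured, single-line output of this kind tends to be easier for log agents to parse than free-form multi-line text, although any format written to stdout or stderr will be collected.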
EKS clusters run on AWS infrastructure, and most AWS services record their API activity as CloudTrail events. CloudTrail provides insight into API calls made to AWS services (e.g., ec2:RunInstances) and is a valuable tool for troubleshooting problems, auditing AWS resource access, and verifying the operational stability of AWS resources. A production EKS setup should enable CloudTrail logging so that activity on all AWS resources supporting the cluster is recorded for the administrators managing it.
Maintaining an effective logging setup for every aspect of an EKS cluster may appear daunting and complex to many users due to the number of components involved. However, implementing an appropriate logging setup will help improve the operational readiness of an EKS cluster, enable easier troubleshooting for future issues, allow more straightforward forensic analysis, and provide greater insight into performance bottlenecks. The payoff for proper logging is significant and will benefit many users.
EKS provides the ability to forward control plane logs automatically to the CloudWatch Logs service. Logs from the control plane are disabled by default to mitigate unnecessary costs, but it is highly recommended that users enable the logs for production clusters where analyzing cluster operations is vital for administrators.
Control plane logs are enabled via the AWS web console or the AWS CLI tool.
From the CLI, logs are enabled with the aws eks update-cluster-config command and its --logging option, which takes the list of log types to enable.
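The snippet below is a small sketch that assembles the `aws eks update-cluster-config` invocation for enabling control plane logs. It only builds and prints the command rather than running it; the cluster name `my-cluster` and region `us-east-1` are placeholder assumptions.

```python
import json
import shlex

# The five control plane log types that EKS can export to CloudWatch Logs.
CONTROL_PLANE_LOG_TYPES = ("api", "audit", "authenticator", "controllerManager", "scheduler")


def enable_logging_command(cluster, region, types=CONTROL_PLANE_LOG_TYPES):
    """Build the AWS CLI command that enables the given control plane log types."""
    unknown = set(types) - set(CONTROL_PLANE_LOG_TYPES)
    if unknown:
        raise ValueError(f"unknown log types: {sorted(unknown)}")
    logging_config = {"clusterLogging": [{"types": list(types), "enabled": True}]}
    return shlex.join([
        "aws", "eks", "update-cluster-config",
        "--region", region,
        "--name", cluster,
        "--logging", json.dumps(logging_config, separators=(",", ":")),
    ])


if __name__ == "__main__":
    # "my-cluster" and "us-east-1" are placeholders for a real cluster and region.
    print(enable_logging_command("my-cluster", "us-east-1"))
    # Enabling only a subset of log types:
    print(enable_logging_command("my-cluster", "us-east-1", types=("audit", "authenticator")))
```

Passing a subset of types produces a command that enables only those log streams, which may suit clusters where log volume needs to be kept down.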
Users have a choice of which control plane logs to enable. Generally, users will enable all logs in a production environment, but those optimizing for cost may selectively enable the logs most relevant to user requirements.
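The EKS control plane provides five log types from the master node components: API server (api), audit (audit), authenticator (authenticator), controller manager (controllerManager), and scheduler (scheduler).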
The EKS control plane log data provides extensive insight into the cluster’s activities. The data is crucial for administrative operations like troubleshooting issues and forensic analysis, so all users should enable these logs when operating production clusters.
EKS control plane logs are exported to the CloudWatch Logs service. CloudWatch Logs includes a tool called Logs Insights for querying and analyzing log data, including the EKS control plane logs.
To query the control plane logs for a particular EKS cluster, open Logs Insights in the CloudWatch console and select the cluster’s log group, which is named /aws/eks/<cluster-name>/cluster.
Building up a set of useful log queries pays off during investigations, troubleshooting, and diagnostics. Storing successful queries in a document is a common practice for frequent users of CloudWatch Logs Insights. Saved queries save time in the future and are easy to adapt for new purposes.
The query syntax is described in the CloudWatch Logs Insights documentation.
Query which IAM principals have accessed the “kubernetes-admin” RBAC user
This user has unlimited permissions to modify the cluster, and its access is heavily restricted as a best practice. Monitoring its actions is useful for investigating potential security incidents. The query can be modified to investigate which other RBAC users are being accessed by IAM principals.
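A sketch of such a query is shown below as a Python helper that returns the Logs Insights query text, which can be pasted into the console or passed to `aws logs start-query`. It assumes the authenticator log streams contain `authenticator` in their names and that each entry includes a `username=<user>` field next to the IAM principal's ARN; the helper name is hypothetical.

```python
def authenticator_query(rbac_user="kubernetes-admin", limit=50):
    """Return a Logs Insights query for authenticator entries mapped to an RBAC user."""
    return "\n".join([
        "fields @logStream, @timestamp, @message",
        "| filter @logStream like /authenticator/",
        f'| filter @message like "username={rbac_user}"',
        "| sort @timestamp desc",
        f"| limit {limit}",
    ])


if __name__ == "__main__":
    print(authenticator_query())
```

Calling the helper with a different `rbac_user` value adapts the query to other RBAC users.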
Query what actions were performed by the “kubernetes-admin” RBAC user
Following on from the above, investigating exactly what actions were performed by this privileged user in the EKS cluster is useful for securing a cluster.
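One possible form of this query is sketched below as a Python helper returning the query text. It assumes audit logging is enabled and relies on the `verb`, `requestURI`, `user.username`, and `responseStatus.code` fields of Kubernetes audit events; the helper name is hypothetical.

```python
def actions_by_user_query(rbac_user="kubernetes-admin", limit=50):
    """Return a Logs Insights query listing audit events issued by an RBAC user."""
    return "\n".join([
        "fields @timestamp, verb, requestURI, responseStatus.code",
        "| filter @logStream like /^kube-apiserver-audit/",
        f'| filter user.username = "{rbac_user}"',
        "| sort @timestamp desc",
        f"| limit {limit}",
    ])


if __name__ == "__main__":
    print(actions_by_user_query())
```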
Query which API Server requests resulted in 5XX errors
Analyzing this data is useful for troubleshooting potential issues occurring in the EKS control plane or misconfigured requests being performed by clients.
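A sketch of this query, again held in a Python string, is shown below. It assumes audit logging is enabled, since the `responseStatus.code` field comes from Kubernetes audit events; the constant name is hypothetical.

```python
# Logs Insights query for API server requests that returned a 5XX status code.
SERVER_ERRORS_QUERY = "\n".join([
    "fields @logStream, @timestamp, responseStatus.code, requestURI, @message",
    "| filter @logStream like /^kube-apiserver-audit/",
    "| filter responseStatus.code >= 500",
    "| sort @timestamp desc",
    "| limit 50",
])


if __name__ == "__main__":
    print(SERVER_ERRORS_QUERY)
```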
Query which RBAC User deleted a particular pod
Queries like this are useful for determining which users accessed or modified a particular resource. Simply modify the “verb” and “requestURI” to perform a range of useful queries related to auditing and root cause analysis.
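The helper below sketches this kind of query with the verb and request URI as parameters. It assumes audit logging is enabled; the pod name `my-app-7d9c` and the `default` namespace in the usage example are made-up placeholders, and the helper name is hypothetical.

```python
def who_touched_query(verb, request_uri, limit=10):
    """Return a Logs Insights query showing which RBAC users issued a given request."""
    return "\n".join([
        "fields @timestamp, user.username, verb, requestURI",
        "| filter @logStream like /^kube-apiserver-audit/",
        f'| filter verb = "{verb}"',
        f'| filter requestURI like "{request_uri}"',
        "| sort @timestamp desc",
        f"| limit {limit}",
    ])


if __name__ == "__main__":
    # Pod deletions are requests against /api/v1/namespaces/<namespace>/pods/<pod>.
    print(who_touched_query("delete", "/api/v1/namespaces/default/pods/my-app-7d9c"))
```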
Many more types of CloudWatch Logs Insights queries are possible. Learning the basic query syntax will yield many benefits for cluster administrators, especially when troubleshooting and analyzing cluster behavior.
EKS supports various logging tools provided by AWS, third-party companies, and open-source communities. These tools typically run as software installed on the worker nodes, allowing them access to collect and export node and pod logs to the destination log storage service.
All open-source tools enabling log collection for Kubernetes clusters are supported by EKS. Examples of open-source logging tools include Grafana Loki, Logstash, and Fluentd. These are all big projects with large user bases and developer communities.
Open-source projects provide a high degree of flexibility and many features at low cost. A key drawback, however, is the operational overhead of configuring log storage. Proprietary solutions typically take on the burden of supplying redundant, highly available, and scalable storage backends, whereas open-source projects like Grafana Loki require users to host the log storage themselves.
Users will need to evaluate their use cases to determine whether self-hosting their logs is a worthwhile trade-off for an open-source project’s improved flexibility and community support.
Companies like Sumo Logic, Datadog, Splunk, and New Relic provide managed solutions for log streaming and storage. Users already implementing these types of services in other environments may choose to adopt the same solutions in their EKS clusters for consistency. Managed solutions will cost more than open-source equivalents and may provide reduced feature sets. However, they will handle the operational overhead of log storage, redundancy, and scaling.
The in-house cluster logging solution provided by AWS is called CloudWatch Container Insights. AWS’s approach to enabling cluster logging involves exporting logs to AWS CloudWatch, a service allowing the storage and analysis of logs and metrics.
Determining which of the logging solutions above to implement will depend on various factors, such as cost, the operational overhead of hosting log storage, the features required, and which tools are already in use in other environments.
Weighing these factors will help users narrow down the ideal logging solution for their use cases. As with all tooling, the best way to find the appropriate one is to test and validate various options. Experimentation will provide data on which tools fit requirements and which are inadequate.
AWS users with simple requirements and no exceptional use cases typically default to implementing CloudWatch Container Insights. This is appropriate for users who are already comfortable using CloudWatch and may already be using it to store logs/metrics from other AWS services. Container Insights is a good starting point for a logging solution, and it is easy to migrate away from if a user eventually decides to switch the logging solution to another provider.
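AWS provides a Quickstart Solution that contains all the relevant manifests for enabling Container Insights. It deploys the CloudWatch agent as a DaemonSet to collect metrics and Fluent Bit as a DaemonSet to send logs to CloudWatch Logs.
The complete installation procedure is described in the AWS documentation for Container Insights.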
Users will benefit from experimenting with various logging solutions to gather data on which ones meet their requirements.
EKS Fargate is a serverless compute feature available for EKS users. Deploying pods to Fargate allows users to delegate the management of worker node compute hosts to AWS. This enables users to mitigate the operational overhead of managing a fleet of EC2 instances.
Pods deployed to Fargate are still capable of exporting log data. However, since AWS manages the underlying compute host, there are fewer options available for configuring log streaming.
The two options for enabling logging for EKS Fargate pods are sidecar containers and the Fluent Bit log router.
A sidecar container is a secondary container defined in the pod specification. A pod can define multiple containers that run together on the same host.
Sidecar containers add functionality alongside the primary application container. For example, they can provide network proxies, service mesh components, and log routers. A typical pattern for EKS Fargate users is to include a sidecar with a logging solution like Fluent Bit or Datadog to capture logs from the primary container and forward them to a destination service. This pattern provides the flexibility of using almost any logging agent (open-source, AWS native, or third-party) for Fargate pods. However, every Fargate pod’s specification must be modified to include an additional container with the logging agent, which can result in significant complexity and overhead in clusters with large numbers of Fargate pods.
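To illustrate the sidecar pattern, the sketch below builds a pod manifest in which the application writes log files to a shared `emptyDir` volume and a logging agent container reads them. The image names, pod name, and mount path are placeholder assumptions, and the agent's own configuration (where it forwards logs) is deliberately left out.

```python
import json


def pod_with_log_sidecar(name, app_image, agent_image):
    """Build a pod manifest with an app container and a log-forwarding sidecar."""
    shared_logs = {"name": "app-logs", "mountPath": "/var/log/app"}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                # Primary container: writes its log files under /var/log/app.
                {"name": "app", "image": app_image, "volumeMounts": [shared_logs]},
                # Sidecar: mounts the same volume read-only and ships the files elsewhere.
                {"name": "log-agent", "image": agent_image,
                 "volumeMounts": [dict(shared_logs, readOnly=True)]},
            ],
            # emptyDir lives as long as the pod and is visible to both containers.
            "volumes": [{"name": "app-logs", "emptyDir": {}}],
        },
    }


if __name__ == "__main__":
    # Both image references are hypothetical placeholders.
    manifest = pod_with_log_sidecar("demo", "example.com/my-app:1.0", "fluent/fluent-bit:latest")
    print(json.dumps(manifest, indent=2))   # kubectl apply -f accepts JSON as well as YAML
```

The sketch also shows where the overhead comes from: every pod that needs log forwarding has to carry the extra container and volume definitions.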
The sidecar container solution used to be standard for users to implement on Fargate. However, based on user feedback regarding the complexity of managing sidecar logging containers, AWS provided an alternative solution to simplify the logging setup for users.
EKS Fargate nodes now include a built-in log router based on the open-source Fluent Bit project. The log router is transparently installed by default on the underlying Fargate node, so users do not need to include a sidecar container for log routing. Users let the Fluent Bit router manage log streaming automatically for their application pods. The Fluent Bit log router deployed by AWS is capable of streaming Fargate pod logs to a variety of AWS services, including CloudWatch Logs, OpenSearch Service, Kinesis Data Streams, and Kinesis Data Firehose.
Streaming logs to third-party providers like Sumo Logic and Datadog is not supported. Use cases requiring third-party log providers are better suited to EC2 worker nodes than Fargate.
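Three key steps are required to enable Fluent Bit logging for Fargate nodes: create a dedicated Kubernetes namespace named aws-observability, create a ConfigMap named aws-logging in that namespace containing the Fluent Bit configuration, and attach an IAM policy to the Fargate pod execution role that permits writing to the log destination.
The full setup process is described in the AWS documentation for Fargate logging.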
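As a rough sketch of the Kubernetes side of this setup, the snippet below generates the `aws-observability` namespace and the `aws-logging` ConfigMap with a CloudWatch Logs output. The region, log group name, and stream prefix are placeholder assumptions, and the IAM policy that the Fargate pod execution role needs is not covered here.

```python
import json


def fargate_logging_manifests(region, log_group):
    """Build the namespace and ConfigMap that configure the Fargate Fluent Bit router."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": "aws-observability",
                     "labels": {"aws-observability": "enabled"}},
    }
    # Fluent Bit output section that sends all pod logs to CloudWatch Logs.
    output_conf = "\n".join([
        "[OUTPUT]",
        "    Name cloudwatch_logs",
        "    Match *",
        f"    region {region}",
        f"    log_group_name {log_group}",
        "    log_stream_prefix fargate-",
        "    auto_create_group true",
    ])
    config_map = {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": "aws-logging", "namespace": "aws-observability"},
        "data": {"output.conf": output_conf},
    }
    # Wrapping both objects in a List lets them be applied with one kubectl command.
    return {"apiVersion": "v1", "kind": "List", "items": [namespace, config_map]}


if __name__ == "__main__":
    # "us-east-1" and "my-fargate-logs" are placeholders.
    print(json.dumps(fargate_logging_manifests("us-east-1", "my-fargate-logs"), indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.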
Implementing logging for EKS clusters will incur varying costs depending on the configuration setup. Logs will typically involve expenses for data transfer from the EC2 instances (worker nodes), storage, and log queries. The exact cost will depend on what log service is implemented.
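Users can prevent high and unexpected costs by configuring their logging setups appropriately. Log configuration details that will have a significant impact on costs include how many logs are collected, whether unnecessary log data is excluded, how long logs are retained, and how efficiently logs are queried. Not every cluster needs full log collection: for example, running kubectl logs <pod_name> may be enough for testing purposes.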
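The effect of log volume and retention on cost can be sketched with simple arithmetic. The function below is a rough model, not a pricing calculator: it ignores query charges, data transfer, and compression, and the prices in the usage example are illustrative placeholders that should be replaced with the current rates of the chosen log service.

```python
def monthly_log_cost(ingest_gb_per_day, retention_days,
                     ingest_price_per_gb, storage_price_per_gb_month,
                     days_in_month=30):
    """Estimate a steady-state monthly cost from ingestion and storage alone."""
    ingested_gb = ingest_gb_per_day * days_in_month
    # At steady state the store holds roughly one retention window of data.
    stored_gb = ingest_gb_per_day * retention_days
    return ingested_gb * ingest_price_per_gb + stored_gb * storage_price_per_gb_month


if __name__ == "__main__":
    # Placeholder prices: 0.50 per GB ingested, 0.03 per GB-month stored.
    baseline = monthly_log_cost(10, 90, 0.50, 0.03)
    shorter_retention = monthly_log_cost(10, 30, 0.50, 0.03)
    less_volume = monthly_log_cost(5, 90, 0.50, 0.03)
    print(f"baseline: {baseline:.2f}")
    print(f"retention 90 -> 30 days: {shorter_retention:.2f}")
    print(f"volume 10 -> 5 GB/day: {less_volume:.2f}")
```

Under this model, halving the collected volume reduces both terms, while shortening retention reduces only the storage term, so excluding unnecessary log data at the source usually has the larger effect.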
Configuring a thorough logging setup is crucial to operating a production EKS cluster. Log data enables users to investigate every aspect of cluster behavior, troubleshoot problems, analyze performance, diagnose security issues, and optimize operations.
Log data is readily available from every component of an EKS cluster, including the control plane, worker nodes, pods, and AWS API events. Collecting and storing this data will be beneficial in the long term. However, the initial configuration will require time and effort to accurately determine what approach will suit the user’s use case.
Identifying the appropriate logging strategy will require testing and validation. Experimenting with various tools and services will provide greater confidence in logging strategy choices and help with validating the selected approach.