Kubernetes and EKS have matured significantly over the last few years, with many standard practices
developing across the industry based on lessons learned from earlier mistakes. Best practices for EKS
build on both Kubernetes-specific considerations and AWS-related standards. Following these
recommendations helps ensure that clusters are designed according to well-known conventions, reducing
potential problems and improving the cluster management experience.
Best practice | Description |
---|---|
Security | Both Kubernetes and AWS provide many controls for maintaining a strong cluster security posture. |
Scalability and high availability | EKS workloads should be designed to scale based on utilization and to survive partial outages without downtime. |
GitOps and CI/CD | The GitOps model allows deployment to large cluster fleets, tracking of infrastructure revision histories, rollback of changes upon failure, and auditing of changes by personnel. |
Observability | Maintaining insight into cluster and workload behavior is critical to managing a production cluster. |
Cluster version upgrades | Keeping EKS cluster versions in step with the upstream Kubernetes release cycle is a challenge but provides significant benefits for administrators. |
Tenancy | Kubernetes supports multiple tenancy models to allow administrators to effectively manage a wide range of workload types. |
Security-related functionality has matured significantly for Kubernetes in recent years and provides
administrators with many options for securing production EKS clusters.
Security Layers
Security on EKS is a complicated topic due to the number of moving parts involved. Careful consideration
is required when designing and building EKS clusters to ensure that best practices are followed
rigorously.
Security best practices include the following:
Restricting network access to the cluster involves locking down API endpoint access, worker node
security groups, and network ACLs. Preventing unauthorized cluster access is the first line of defense
for EKS. The AWS documentation on EKS infrastructure security best practices is a good starting place.
Least privilege applies to both IAM and Kubernetes RBAC. Minimizing the permissions granted via the
aws-auth ConfigMap and Kubernetes Roles/ClusterRoles decreases the attack surface, following the
“principle of least privilege.” The goal is to ensure that compromised credentials pose only a limited risk.
Pods should have limited privileges to maintain cluster security, which means blocking pods that request
access to the host filesystem, additional kernel capabilities, the root user, and so on. These limits can
be enforced through the Open Policy Agent (OPA) project. OPA is an open-source tool that allows
administrators to enforce constraints on objects being deployed to the cluster by validating each object
against a set of user-defined rules and rejecting those that do not comply. The restrictions enforced by
this tool can help limit a compromised pod’s access to the worker node host.
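The kind of rule described above can be illustrated in plain Python. This is only a sketch of the checks an admission policy might express; it is not OPA or its Rego policy language, and the function name `find_violations` and the sample pod are hypothetical. The manifest fields it inspects (`hostPath`, `securityContext.privileged`, `runAsUser`, `capabilities.add`) are standard Kubernetes pod spec fields.

```python
from typing import Any


def find_violations(pod: dict[str, Any]) -> list[str]:
    """Return the reasons a pod manifest should be rejected (empty if none)."""
    spec = pod.get("spec", {})
    violations = []

    # Host filesystem access is requested through hostPath volumes.
    for volume in spec.get("volumes", []):
        if "hostPath" in volume:
            violations.append(f"volume {volume.get('name')!r} mounts the host filesystem")

    for container in spec.get("containers", []):
        name = container.get("name")
        ctx = container.get("securityContext", {})
        if ctx.get("privileged"):
            violations.append(f"container {name!r} is privileged")
        if ctx.get("runAsUser") == 0:
            violations.append(f"container {name!r} runs as root (UID 0)")
        added = ctx.get("capabilities", {}).get("add", [])
        if added:
            violations.append(f"container {name!r} adds capabilities: {', '.join(added)}")

    return violations


# Hypothetical manifest that requests host access and root privileges.
risky_pod = {
    "metadata": {"name": "debug-shell"},
    "spec": {
        "volumes": [{"name": "host-root", "hostPath": {"path": "/"}}],
        "containers": [
            {"name": "shell", "securityContext": {"privileged": True, "runAsUser": 0}}
        ],
    },
}

for reason in find_violations(risky_pod):
    print("DENY:", reason)
```

In a real cluster, checks like these would run inside an admission controller rather than in a standalone script, so that noncompliant objects are rejected before they are scheduled.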
Securing the underlying AWS infrastructure includes restricting IAM access, enabling EBS volume encryption,
using up-to-date worker node AMIs, using SSM instead of SSH, and enabling VPC flow logs. AWS provides
extensive documentation on security-related options.
Tools like Calico allow admins to limit unwanted communication between pods, restricting the blast
radius of a compromised pod. Administrators define firewall rules for pod-to-pod communication via a
Kubernetes resource type called “Network Policies,” which a network plugin such as Calico enforces. Once
a policy selects a pod, that pod can communicate with other pods only when the traffic is explicitly
allowed; all other network traffic is denied. This ensures that if a pod is compromised, there will
be limits on what other pods it can attack via the cluster’s network.
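As a rough illustration of the allow-list model described above, the following Python sketch builds two `NetworkPolicy` manifests: one that denies all ingress in a namespace and one that allows a single pod-to-pod path. The namespace `shop`, the labels, the port, and the helper function names are hypothetical; the manifest fields follow the `networking.k8s.io/v1` API. The output is JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json


def default_deny_ingress(namespace: str) -> dict:
    """Policy selecting every pod in the namespace and allowing no ingress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        # An empty podSelector matches all pods in the namespace.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }


def allow_ingress(namespace: str, target: dict, source: dict, port: int) -> dict:
    """Policy allowing pods labeled `source` to reach pods labeled `target` on one TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-selected-ingress", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": target},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": source}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


policies = [
    default_deny_ingress("shop"),
    allow_ingress("shop", target={"app": "api"}, source={"app": "frontend"}, port=8080),
]
print(json.dumps(policies, indent=2))
```

Starting from a default-deny policy and adding narrow allow rules keeps the set of permitted paths explicit and reviewable.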
Ensuring that applications are scalable and highly available on EKS is done by leveraging
Kubernetes-native features and tooling. Scalability is an essential aspect of making an application
elastic: able to absorb growth in incoming request volume and to shrink when utilization becomes low.
High availability ensures an application’s ability to continue servicing requests during partial system
failures.
High Availability EKS Cluster
Following the best practices described below will improve the scalability and availability of workloads
on EKS clusters.
Readiness and liveness probes are a built-in Kubernetes feature enabling automatic application health
checking: they verify that containers are responsive, withhold traffic from pods that are not ready, and
restart malfunctioning containers. This auto-healing feature helps to maintain high availability by
recovering automatically from pod-level failures. All pods within a cluster should be configured with
these probes to ensure that the cluster can recover from intermittent failures automatically without
administrator intervention.
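A minimal sketch of what probe configuration looks like on a container spec, expressed here as a Python dictionary. The `/healthz` path, port 8080, image name, timing values, and the helper name `with_health_probes` are assumptions chosen for illustration; the probe field names are standard Kubernetes container fields.

```python
import json


def with_health_probes(container: dict, path: str = "/healthz", port: int = 8080) -> dict:
    """Return a copy of a container spec with HTTP liveness and readiness probes added."""
    http_check = {"httpGet": {"path": path, "port": port}}
    return {
        **container,
        # Liveness: the kubelet restarts the container after repeated failures.
        "livenessProbe": {
            **http_check,
            "initialDelaySeconds": 10,
            "periodSeconds": 10,
            "failureThreshold": 3,
        },
        # Readiness: a failing pod is removed from Service endpoints until it recovers.
        "readinessProbe": {**http_check, "periodSeconds": 5, "failureThreshold": 2},
    }


container = with_health_probes({"name": "web", "image": "example.com/web:1.0"})
print(json.dumps(container, indent=2))
```

Appropriate delays and thresholds depend on how long the application takes to start and how quickly it should be taken out of rotation, so the numbers above should be treated as placeholders.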
Pods can be spread across nodes via the Kubernetes anti-affinity feature, which causes the scheduler to
place matching pods on different worker nodes. Spreading pods across nodes ensures that a single node
failure will not impact all application pods.
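The anti-affinity rule described above might look like the following, sketched as a Python dictionary that would sit under a pod template's `affinity` field. The `app` label key and the helper name `spread_across_nodes` are hypothetical; `kubernetes.io/hostname` is the well-known node label that identifies an individual node.

```python
import json


def spread_across_nodes(app_label: str) -> dict:
    """Build an affinity stanza that keeps pods labeled app=<app_label> on separate nodes."""
    return {
        "podAntiAffinity": {
            # "required" makes this a hard scheduling constraint.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


print(json.dumps(spread_across_nodes("checkout"), indent=2))
```

A hard requirement leaves replicas unschedulable when there are fewer eligible nodes than replicas; the `preferredDuringSchedulingIgnoredDuringExecution` form is a softer alternative when that trade-off is not acceptable.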
Tools like the Horizontal Pod Autoscaler or KEDA can be configured to automatically scale the pod replica
count based on relevant metrics, like CPU usage or application request count. These tools can help
maintain an application’s elasticity, increasing the replica count when more resources are necessary and
downscaling when possible to reduce costs.
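The scaling decision can be made concrete with the formula documented for the Horizontal Pod Autoscaler, `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. The sketch below applies that formula with a tolerance band and replica bounds; the function name, the default bounds, and the sample numbers are illustrative assumptions, and the real controller has additional behavior (stabilization windows, handling of unready pods) that is not modeled here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Apply the HPA scaling formula, clamped to the configured replica bounds."""
    ratio = current_metric / target_metric
    # Skip scaling when usage is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# Four replicas averaging 90% CPU against a 60% target scale out to six.
print(desired_replicas(4, current_metric=90, target_metric=60))
# The same four replicas averaging 20% CPU scale in to two.
print(desired_replicas(4, current_metric=20, target_metric=60))
```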
Applying appropriate resource requests and limits to every pod can be complex without the right
tooling to complement the functionality Kubernetes provides through the Vertical Pod Autoscaler
(VPA).
The complexity with auto-scaling arises in three areas:
Complementing Kubernetes auto-scaling with machine learning technology and a fine-grained analysis of
real-time capacity utilization can address these issues.
Spreading worker nodes across zones ensures that a single zone outage will not cause a complete cluster
outage. This is done by configuring AWS Auto Scaling groups to span multiple Availability Zones.
Git is a version control system that stores change revision histories of primarily text files. It is
useful for storing application source code, Kubernetes resource manifests, and infrastructure
configuration files. GitOps is a model for using Git as a single source of truth for a cluster’s
configuration.
Implementing the GitOps model involves using Git to store all artifacts related to
the EKS cluster and leveraging tools like Argo CD to detect and continuously deploy changes made to the
Git repository. Argo CD is a continuous delivery tool that can be installed into an EKS cluster to poll
the remote Git repository for changes. This combination of tools allows administrators to efficiently
deploy changes to large numbers of EKS clusters by simply pushing application source code, Kubernetes
manifests, or infrastructure configuration details to a Git repository. Changes are then applied to
clusters automatically by Argo CD.
This model allows administrators to deploy new changes quickly, roll out
to large fleets, maintain a simple single source of truth, track revision histories, roll back changes
on failure, audit changes by personnel, and use a single entry point for cluster configuration. GitOps
is the preferred model for modern Kubernetes/EKS cluster setups, especially for administrators managing
large cluster fleets.
Some specific best practices for implementing a GitOps workflow include the following.
An intuitive and logical repository layout ensures that administrators can easily navigate the
repository and make appropriate changes. Dividing the repository into separate sections for application
code, Kubernetes resource manifests, and cloud infrastructure configuration is recommended. For large or
complex setups, admins will benefit from dividing resources across multiple Git repositories.
Defining the infrastructure configuration as code using tools such as CloudFormation or Terraform allows
admins to leverage the benefits of a version control system. These include tracking changes to the
infrastructure over time, reviewing differences between revisions, rolling back to older configuration
versions, and replicating infrastructure consistently to different environments. EKS cluster
configurations can be fully defined via IaC, including the cluster control plane, node groups, AWS VPCs,
and other cloud provider resources.
Automated validation of changes can include checking syntax correctness, checking for security best
practices, checking for deployment conflicts, and testing changes in a staging environment. Automated
testing can help catch problematic cluster changes before they are deployed to production environments.
A benefit of using Git is the ease of implementing automatic tests and validation mechanisms.
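One of the checks mentioned above, detecting deployment conflicts, can be sketched as a small script that a CI job might run against the manifests in a repository. Here a "conflict" is assumed to mean two manifests declaring the same kind, namespace, and name; the function name and sample manifests are hypothetical, and treating a missing namespace as `default` is a simplification that ignores cluster-scoped resources.

```python
from collections import Counter


def find_conflicts(manifests: list[dict]) -> list[tuple[str, str, str]]:
    """Return (kind, namespace, name) keys declared by more than one manifest."""
    keys = Counter(
        (
            manifest.get("kind", ""),
            manifest.get("metadata", {}).get("namespace", "default"),
            manifest.get("metadata", {}).get("name", ""),
        )
        for manifest in manifests
    )
    return sorted(key for key, count in keys.items() if count > 1)


# Hypothetical manifests already parsed from files in the repository.
manifests = [
    {"kind": "Deployment", "metadata": {"name": "web", "namespace": "shop"}},
    {"kind": "Service", "metadata": {"name": "web", "namespace": "shop"}},
    {"kind": "Deployment", "metadata": {"name": "web", "namespace": "shop"}},
]

conflicts = find_conflicts(manifests)
for kind, namespace, name in conflicts:
    print(f"CONFLICT: {kind} {namespace}/{name} is defined more than once")
```

A CI job could fail the build whenever the returned list is non-empty, so the conflict is caught at review time rather than at deploy time.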
Ensuring system-wide visibility of all cluster components is essential for validating production
readiness, identifying bottlenecks, optimizing costs, ensuring security, and performing root cause
analysis. EKS supports a wide range of tooling related to observability, typically categorized
into three key observability pillars: metrics (e.g., Prometheus/Grafana), logging (e.g., CloudWatch and
Fluentd), and tracing (e.g., AWS X-Ray and OpenTelemetry). This tooling provides administrators with
insight into their clusters, which is critical for production workloads.
The benefits of a high-quality observability setup are:
The challenge with cost optimization is that a simple estimate of CPU and memory utilization based on
95th-percentile watermarks ignores the short bursts that cause application performance
bottlenecks and typically leaves out network and I/O usage measurements. Overestimating CPU and memory
requests and limits causes waste at the cluster node level, while underestimating them can cause CPU
throttling or pod termination, among other performance issues. Densify’s capacity optimization
solution leverages advanced algorithms to automatically set container resource requests and
limits based on a comprehensive analysis of all of the cluster resources.
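A small worked example shows why a 95th-percentile watermark can hide short bursts. The sample values are invented for illustration, and the nearest-rank percentile used here is only one of several common percentile definitions.

```python
def percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # integer ceiling of pct% of the sample count
    return ordered[max(rank, 1) - 1]


# Hypothetical CPU usage in millicores: steady at 200m, with three short bursts to 900m.
cpu_millicores = [200] * 97 + [900] * 3

p95 = percentile(cpu_millicores, 95)
peak = max(cpu_millicores)
print(f"p95={p95}m peak={peak}m")  # prints "p95=200m peak=900m"
```

Because the bursts make up only 3% of the samples, they fall entirely above the 95th percentile. A CPU limit sized from the p95 value alone would throttle the container during exactly the periods when it needs the most capacity.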
Maintaining a high-quality observability setup is critical for managing a production cluster.
Observability best practices include the following.
Each of these elements provides a unique view of cluster operations, and all are important to gather.
Deciding on which tools to implement for the observability pillars (metrics, logging, and tracing) will
depend on various factors, such as
To get a simple proof-of-concept of a metrics and logging setup running for testing purposes,
administrators may consider installing Prometheus, Grafana, and CloudWatch Container Insights. These
tools are simple to install and will immediately collect and aggregate cluster logs and metrics. Admins
can compare the usability and features of each tool to determine which fits their use case.
Tracing can be implemented on EKS via the OpenTelemetry standard and AWS X-Ray. This setup enables the
collection of cluster telemetry data, allowing granular insight into application workloads.
There are many alternative tools available for implementing cluster observability. Choosing the ideal
tools will require administrators to clearly understand their use cases and requirements and to test
available options to validate their effectiveness.
Organizing observability data can include practices like applying Prometheus metric labels, grouping
logs and traces by category, building high-quality dashboards, and developing documentation on how the
observability setup works. This approach speeds up research and investigation for the cluster admins
consuming the observability data.
This exercise can also pay off in cost optimization. For instance, Densify
conveniently leverages Prometheus data to analyze resource consumption using advanced
algorithms.
Making observability data highly available means either configuring the relevant tooling to store
redundant data copies or utilizing managed cloud services with built-in high availability. Managed
observability solutions almost always provide high availability by default. Deploying self-hosted
solutions will require additional configuration to ensure that storage is replicated across nodes/zones.
Instrumenting applications can include exporting metrics, streaming application logs, and supporting
request tracing so that the application’s behavior is transparent. Cluster administrators will typically
need to ensure that developers deploying applications to the cluster are following software development
best practices like outputting application logs and exposing metrics. Enforcing these standards ensures
that administrators have the observability data they need to optimize cluster operations.
A new version of the Kubernetes project is released approximately every four months, and the managed EKS
service follows this same cadence. Versions of Kubernetes more than four releases old are considered
deprecated, and EKS installations running older versions will be forcibly upgraded.
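The four-release window described above reduces to simple arithmetic on minor version numbers, sketched below. The function names are hypothetical, the version strings are arbitrary examples, and the window size is taken from the text; the release calendar published by AWS remains the authoritative source for actual support dates.

```python
def minor_releases_behind(cluster_version: str, latest_version: str) -> int:
    """Count how many minor releases separate a cluster from the latest version."""
    cluster_major, cluster_minor = (int(part) for part in cluster_version.split(".")[:2])
    latest_major, latest_minor = (int(part) for part in latest_version.split(".")[:2])
    if cluster_major != latest_major:
        raise ValueError("this sketch only compares versions within one major release")
    return latest_minor - cluster_minor


def needs_upgrade(cluster_version: str, latest_version: str, window: int = 4) -> bool:
    """True when the cluster has fallen outside the assumed support window."""
    return minor_releases_behind(cluster_version, latest_version) > window


print(minor_releases_behind("1.27", "1.31"))  # 4 releases behind, still inside the window
print(needs_upgrade("1.26", "1.31"))          # 5 releases behind, outside the window
```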
This fast-paced release cycle allows Kubernetes to publish new features and extend functionality quickly,
but it also challenges administrators to keep up with the software’s release cycle. Organizations
running production workloads on EKS are typically hesitant about making frequent changes that may impact
the stability of their applications.
Administrators also bear the operational overhead of upgrading EKS cluster control planes, worker nodes,
and installed components, running validation tests, reviewing change release notes, and actioning
roadblocks like version compatibility problems.
The operational overhead involved with upgrading EKS versions results in many organizations falling
behind the recommended versions. This can cause several problems:
Organizations benefit from prioritizing a smooth release pipeline for EKS version upgrades to mitigate
these challenges. The investment in a proper process and tooling will pay off via reduced operational
overhead, access to newer Kubernetes versions and included features, less time spent troubleshooting
compatibility issues, and the mitigated risk of a forced EKS version upgrade.
The best practices for maintaining a solid cluster version upgrade pipeline are as follows.
Whether the components are installed using Helm Charts, Kustomize, or other tools, administrators must
regularly check for updates to maintain cluster version compatibility, access new component features,
and reduce security vulnerabilities. Keeping components up to date is key for mitigating cluster
upgrade-related compatibility issues.
Use the version release calendar provided by EKS to plan for adopting new releases and to view
deprecation timelines.
Ensuring that cluster administrators have adequately planned the engineering resources required to
upgrade regularly will help maintain a consistent cadence.
Ensuring that observability tooling is in place will help quickly diagnose upgrade-related problems.
Observability data can provide greater confidence that an upgrade is working as expected by providing
insight into a cluster’s health.
Configuring deployment pipelines to run workloads on staging clusters running the upgraded versions will
help validate whether workloads are compatible with the latest Kubernetes releases.
Clusters may be utilized by various “tenants,” which are stakeholders (like developers) who deploy
applications to the cluster. When many tenants consume an EKS cluster, a tenancy model should be
implemented to ensure that each tenant is governed properly. Cluster tenancy models describe how
workloads are divided and segregated to enable effective cluster administration and a
high-quality developer experience.
Tenancy Models
There are several standard tenancy models supported by EKS:
Determining which model to implement requires an analysis of user, security, and administration
requirements. Planning tenancy models carefully enables administrators to more efficiently manage
workloads with specific security, operations, or performance requirements. Ensuring the appropriate
solutions are implemented can reduce complexity and operational overhead in the long term.
Best practices regarding tenancy include the following:
Tenants who are utilizing the cluster (engineers, developers, etc.) may have useful feedback to help
determine whether existing cluster controls are sufficient for their requirements. A cluster is
typically designed as a one-size-fits-all solution, and the supplied configuration may not suit every
user. Gathering feedback from all stakeholders is an essential aspect of iteratively improving
the cluster’s design.
Growing clusters will eventually need to be split into multiple clusters to avoid hitting Kubernetes
scalability limits, like the maximum number of supported pods and worker nodes. Planning for
multi-cluster tenancy solutions early on will enable the growth of high-volume cluster workloads.
Differing security requirements indicate that workloads need to be split between distinct worker nodes or
clusters. Mixing workloads with varying security requirements is a poor practice.
Data on workload utilization will indicate when particular workloads should be moved to separate worker
nodes or clusters. For example, workloads that frequently burst to high CPU utilization may be better
served on dedicated worker nodes to avoid “noisy neighbor” problems, in which other workloads have
their access to resources impeded.
Mastering EKS cluster design and operations requires knowledge of many facets of EKS, Kubernetes, and AWS.
To help cluster administrators along with their Kubernetes/EKS journey, we’ve created a set of guides
with useful information on essential topics:
Learn all about EKS Fargate, the AWS serverless Kubernetes solution, including features, getting started,
best practices, limitations, and more.
What is AWS ECS? What is AWS EKS? How are they similar, and where do they differ? Find out the details,
complete with useful recommendations, in our free guide.
Learn all about the architecture of Amazon’s Elastic Kubernetes Service (EKS) in our free, detailed
guide.
eksctl is a powerful tool for managing AWS Elastic Kubernetes Service (EKS) installations. Learn how to use it with our free guide.
Learn how to use the different types of ephemeral and persistent storage options available in EKS in our free guide.
AWS’s Elastic Kubernetes Service (EKS) is a powerful tool for cloud computing. What if you could also use it with your on-premises hardware? You can! Read our free guide to learn more about EKS Anywhere and how it could meet your needs.
Learn how to enable logging for each component of an EKS cluster, use the best tools, and manage costs.
Learn all about the best practices for protecting an EKS cluster, including how to secure worker nodes, pods, container images, as well as securing the overall AWS infrastructure.
Learn all about the EKS control plane, a critical component of overall EKS architecture, including concepts, best practices, and recommendations.
Learn how to deploy EKS clusters and their surrounding infrastructure using preconfigured modules with AWS Blueprints.
Learn how to optimize EKS costs for all resources in the cluster, including worker node, pod, and data transfer costs.
More chapters to come soon.