# AWS EKS Cost Optimization with Karpenter v1.0 in 2026: A Deep Dive


Ahmet Çelik


Most EKS teams rely on the Kubernetes Cluster Autoscaler (CA) for node lifecycle management. But this foundational approach often leads to persistent overprovisioning, suboptimal Spot instance utilization, and significant wasted cloud spend at scale. By 2026, the industry standard for truly agile and cost-effective EKS node management has shifted.


## TL;DR

  • Karpenter v1.0 dynamically provisions EC2 instances for EKS workloads, responding in real-time to pod scheduling demands.
  • It dramatically improves Spot instance adoption and reduces node startup times compared to traditional Cluster Autoscaler.
  • Karpenter’s unified API via `Provisioners` and `EC2NodeClasses` simplifies node group management and enables fine-grained control over instance types.
  • Implement a multi-provisioner strategy to separate critical On-Demand workloads from cost-optimized Spot-tolerant applications for maximum savings.
  • Robust monitoring, alert configuration, and careful planning for Spot interruptions are crucial for production-ready Karpenter deployments.


## The Problem: Legacy EKS Node Management Hinders Agility and Cost Efficiency


Managing compute capacity for AWS EKS clusters presents a persistent challenge for backend and platform engineers. As systems scale and workloads become more diverse, the limitations of traditional node management approaches become acutely apparent. Most teams, aiming for operational simplicity, initially adopt AWS Managed Node Groups or self-managed node groups combined with the Kubernetes Cluster Autoscaler (CA). While functional, this strategy introduces significant friction and unnecessary cost at scale.


The core issue stems from the Cluster Autoscaler's architecture. It operates by interacting with AWS Auto Scaling Groups (ASGs). This layer of indirection means that scaling decisions are constrained by ASG configurations (min/max size, instance types). Consequently, CA cannot provision arbitrary instance types or efficiently combine diverse instance families to match workload requirements precisely. This leads to a common scenario: when a cluster needs more capacity, CA often scales up existing ASGs, potentially overprovisioning or selecting suboptimal, more expensive instance types because its configuration limits its flexibility.


This impedance mismatch frequently results in nodes running at significantly lower utilization rates than desired. Teams commonly report 30-50% wasted compute capacity across their EKS clusters, a direct consequence of rigid ASG configurations and the Cluster Autoscaler's slower reaction to dynamic pod scheduling needs. Furthermore, maximizing Spot Instance utilization — a critical component of cloud cost optimization — becomes cumbersome. CA requires separate ASGs for Spot and On-Demand instances, leading to fragmented capacity management and reduced ability to fully leverage Spot's economic benefits. By 2026, relying solely on `Cluster Autoscaler` for advanced EKS node management is recognized as an inefficient practice that leaves substantial cost savings on the table. The need for a more dynamic, intelligent provisioning solution has become paramount.


## How It Works: Karpenter's Intelligent Node Provisioning


Karpenter, an open-source node provisioning project built by AWS, fundamentally rethinks how Kubernetes nodes are managed. Unlike the Cluster Autoscaler, which interacts with existing Auto Scaling Groups, Karpenter directly interfaces with the AWS EC2 API. This direct integration empowers Karpenter to provision instances with unparalleled speed and flexibility, bypassing the constraints of ASGs entirely.


At its core, Karpenter operates as a Kubernetes controller that watches for unschedulable pods. When a pod cannot find a home on existing cluster nodes, Karpenter steps in. It evaluates the pod's resource requests (CPU, memory, GPU), node selectors, taints, and tolerations. Based on this information, and the definitions provided through Karpenter's Custom Resources, it decides precisely what kind of EC2 instance is needed.


This decision-making process is highly optimized. Karpenter prioritizes the cheapest available instance types that meet the pod's requirements, including Spot Instances when configured. It can provision a wide array of instance families and sizes, selecting the optimal fit rather than being confined to a pre-defined ASG. This flexibility is critical for heterogeneous workloads that might require different compute profiles (e.g., memory-intensive databases, CPU-bound batch jobs, GPU-accelerated ML tasks).
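To make this concrete, here is a hypothetical pod spec (the name, image, and values are illustrative, not from a real workload) showing the scheduling inputs Karpenter reads when deciding what to launch: resource requests, a node selector, and an architecture constraint.

```yaml
# Hypothetical pod: the fields below are what Karpenter evaluates when
# selecting an instance type for an unschedulable pod.
apiVersion: v1
kind: Pod
metadata:
  name: report-generator
spec:
  nodeSelector:
    kubernetes.io/arch: arm64 # Steers Karpenter toward Graviton instance types
  containers:
    - name: worker
      image: busybox
      command: ["sh", "-c", "echo 'generating reports'; sleep 3600"]
      resources:
        requests:
          cpu: "2"      # Karpenter must find an instance with this much free CPU...
          memory: 4Gi   # ...and this much free memory, at the lowest price
```

Given this spec, Karpenter would consider only `arm64` instance types with at least 2 vCPUs and 4GiB of allocatable memory, then pick the cheapest eligible option.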


Once a suitable instance type is identified, Karpenter calls the EC2 API to launch the instance with bootstrap user data, and the new node registers with the EKS cluster as the kubelet and CNI come up. The entire process, from pod pending to node ready, is significantly faster than CA's ASG-driven approach, often cutting node startup times by 30-50%.


### Karpenter's Architecture and Core Principles


Karpenter introduces two primary Custom Resources (CRs) that define its behavior: `Provisioner` and `EC2NodeClass`.


  • `Provisioner`: This CR defines the policies for how Karpenter provisions nodes. It specifies constraints such as instance families, CPU/memory/GPU requirements, Spot/On-Demand preferences, capacity types, launch settings, and consolidation strategies. A `Provisioner` acts as a blueprint, telling Karpenter what kind of nodes it can provision and under what conditions. You can have multiple `Provisioners` in a cluster, allowing for distinct node types and cost strategies for different workloads.

  • `EC2NodeClass`: This CR acts as a templating mechanism for EC2-specific launch configuration details. It defines shared configurations that multiple `Provisioners` can reference, such as AMIs, subnet selections, security groups, IAM instance profiles, and user data scripts. By abstracting these EC2 details, `EC2NodeClass` keeps `Provisioners` focused on the what (node requirements) rather than the how (EC2 configuration). This separation enhances reusability and simplifies management.


The interaction is as follows: When Karpenter needs to provision a node, it looks at the `Provisioner` that best matches the unschedulable pod's requirements. That `Provisioner` then references an `EC2NodeClass` to get the EC2-specific details for launching the instance. This modularity ensures a clean separation of concerns and allows engineers to manage node lifecycle and EC2 infrastructure configuration independently.


## Advanced Provisioner Strategy: Multi-Provisioner for Granular Control


A key differentiator for Karpenter's cost optimization capabilities lies in its `Provisioner` CRD. Instead of a single, monolithic node group configuration, you can define multiple `Provisioners`, each with distinct policies. This enables a sophisticated, granular approach to managing your EKS compute.


Consider a typical production scenario: a mix of mission-critical, low-latency applications, and resilient, batch-oriented workloads. Assigning both to the same `Provisioner` would mean either overspending on On-Demand instances for batch jobs or risking interruptions for critical services on Spot. Karpenter allows you to segment this:


  1. On-Demand Provisioner for Critical Workloads: Define a `Provisioner` that exclusively requests On-Demand instances, potentially with specific, highly available instance types. These nodes can be tainted to ensure only critical workloads schedule on them.

  2. Spot Provisioner for Tolerant Workloads: Create a separate `Provisioner` that aggressively targets Spot Instances across a broad range of instance families. This `Provisioner` would have `spot` as its `capacityType` and include detailed `instanceRequirements` to maximize Spot availability and cost savings. This can be combined with appropriate tolerations on your Spot-tolerant pods.


This multi-provisioner strategy ensures that critical applications always have reliable capacity, while non-critical workloads leverage the significant cost savings of Spot Instances without impacting production stability. Karpenter dynamically chooses the correct `Provisioner` based on pod requirements, applying the most cost-effective solution available.


```yaml
# A base EC2NodeClass for shared configuration
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2 # Or Bottlerocket, Ubuntu, etc.
  role: my-karpenter-node-role # Node IAM role name (not ARN); Karpenter manages the instance profile
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery/my-cluster: "true" # Discovers subnets tagged for this cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery/my-cluster: "true" # Discovers security groups tagged for this cluster
  tags:
    # Karpenter adds karpenter.sh/provisioner-name tags automatically
    environment: production
    owner: ahmet-celik
  userData: |
    #!/bin/bash
    echo "Karpenter node started on $(date)"
    # Add any specific bootstrapping commands here, e.g., for custom metrics agents
```


  • This `EC2NodeClass` named `default` provides the common EC2 launch settings.


```yaml
# A Provisioner targeting On-Demand instances for critical workloads
apiVersion: karpenter.sh/v1beta1
kind: Provisioner
metadata:
  name: on-demand-critical
spec:
  providerRef:
    name: default # Reference the EC2NodeClass
  requirements:
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: ["c", "m", "r"] # Compute, general purpose, and memory optimized families
    - key: karpenter.k8s.aws/instance-cpu
      operator: Gt
      values: ["2"]
    - key: karpenter.k8s.aws/instance-memory
      operator: Gt
      values: ["4096"] # Instance memory is expressed in MiB (i.e., more than 4GiB)
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"] # Exclusively On-Demand
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
  limits:
    resources:
      cpu: "100" # Max 100 CPU cores for this provisioner
  ttlSecondsAfterEmpty: 300 # Terminate nodes after 5 minutes of being empty
  consolidation:
    enabled: true # Enable consolidation for cost efficiency
  weight: 10 # Higher weight means higher priority when multiple provisioners match a pod
  labels:
    purpose: critical-workloads
  taints:
    - key: critical-workloads.example.com/on-demand
      value: "true"
      effect: NoSchedule
```


  • This `Provisioner` named `on-demand-critical` specifies On-Demand instances for CPU/memory intensive workloads, with a taint to ensure only pods tolerating it are scheduled.


```yaml
# A Provisioner targeting Spot instances for cost-optimized workloads
apiVersion: karpenter.sh/v1beta1
kind: Provisioner
metadata:
  name: spot-cost-optimized
spec:
  providerRef:
    name: default # Reference the EC2NodeClass
  requirements:
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: ["c", "m", "r"]
    - key: karpenter.k8s.aws/instance-cpu
      operator: Gt
      values: ["2"]
    - key: karpenter.k8s.aws/instance-memory
      operator: Gt
      values: ["4096"] # Instance memory is expressed in MiB (i.e., more than 4GiB)
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"] # Exclusively Spot
    - key: node.kubernetes.io/instance-type
      operator: NotIn
      values: ["t2.medium", "t3.small"] # Exclude specific instances if they're problematic or too small
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64", "arm64"] # Allow both architectures for broader Spot availability
  limits:
    resources:
      cpu: "200" # Max 200 CPU cores for Spot
  ttlSecondsAfterEmpty: 60 # Aggressively terminate empty Spot nodes after 1 minute
  consolidation:
    enabled: true # Enable consolidation for cost efficiency
  weight: 20 # Higher weight than on-demand-critical, so this provisioner is preferred if a pod matches both
  labels:
    purpose: spot-workloads
  taints:
    - key: spot-workloads.example.com/interruption-tolerant
      value: "true"
      effect: NoSchedule
```


  • This `Provisioner` named `spot-cost-optimized` leverages Spot instances, allowing for both `amd64` and `arm64` architectures for increased availability and potentially lower prices. It also has a taint for segregation.


## Node Consolidation and Deprovisioning


Beyond provisioning, Karpenter excels at cost optimization through its intelligent consolidation feature. Consolidation is the process by which Karpenter identifies opportunities to reduce cluster costs by either terminating underutilized nodes or replacing existing nodes with cheaper alternatives.


Karpenter continuously monitors the cluster's nodes and pods. It operates on two main consolidation strategies:


  1. Empty Node Consolidation: If a node becomes completely empty (no pods scheduled on it), Karpenter will terminate it after a configurable `ttlSecondsAfterEmpty` period. This ensures that unused capacity is quickly removed, preventing idle resource waste.

  2. Cross-Type Consolidation: This is where Karpenter truly shines. It identifies scenarios where existing workloads could be repacked onto fewer, cheaper, or different nodes. For example, if a cluster has two moderately utilized `m5.xlarge` instances, and Karpenter determines that their combined workloads could fit onto a single, cheaper `c6a.2xlarge` Spot instance, it will initiate a consolidation. It achieves this by:

* Provisioning the new, optimal node (e.g., the `c6a.2xlarge` Spot instance).

* Draining and evicting pods from the existing, more expensive nodes.

* Terminating the old nodes once they are empty.


This proactive optimization ensures that your EKS cluster is always running on the most cost-effective compute footprint possible, adapting to changing workload demands and Spot market conditions. For stateful applications or those sensitive to interruption, careful planning with Pod Disruption Budgets (PDBs) and preStop lifecycle hooks is essential to ensure graceful shutdown during consolidation events.
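As a minimal sketch of that planning, a PodDisruptionBudget for a hypothetical `batch-app` Deployment (the name and label are illustrative) caps how many replicas consolidation may evict at once:

```yaml
# A PodDisruptionBudget limiting voluntary disruptions (including
# Karpenter consolidation drains) to one pod at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-app-pdb
spec:
  maxUnavailable: 1 # Consolidation may evict at most one replica at a time
  selector:
    matchLabels:
      app: batch-app
```

Karpenter respects PDBs when draining nodes, so a budget like this throttles repacking rather than blocking it entirely.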


## Step-by-Step Implementation: Deploying Karpenter for Cost-Optimized EKS


This section outlines the process of deploying Karpenter into an existing EKS cluster and configuring it for advanced cost optimization using multiple provisioners. We will focus on a scenario where you want to differentiate between critical On-Demand workloads and Spot-tolerant batch jobs.


### 1. Prerequisites for Karpenter Installation


Before deploying Karpenter, ensure you have an existing EKS cluster and the necessary AWS IAM roles and Kubernetes service accounts configured. Karpenter requires specific permissions to interact with EC2, EKS, IAM, and other AWS services.


  • EKS Cluster: An operational AWS EKS cluster (version 1.23+ recommended for Karpenter v1.0).

  • `kubectl` and `helm`: Configured to interact with your EKS cluster.

  • IAM Role for Karpenter Controller: This role grants Karpenter the permissions it needs.

  • IAM Role for Karpenter-Provisioned Nodes: This role is attached to the EC2 instances Karpenter launches, granting them permissions to join the EKS cluster (e.g., `AmazonEKSWorkerNodePolicy`, `AmazonEKS_CNI_Policy`, `AmazonEC2ContainerRegistryReadOnly`).


Create IAM Role for Karpenter Controller:


First, we need to create an IAM role that Karpenter will assume. This role requires a trust policy allowing the EKS OIDC provider to assume it.

Adjust the cluster name below for your environment; the account ID and OIDC provider are derived automatically.


```shell
$ export CLUSTER_NAME="my-karpenter-cluster-2026"
$ export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
$ export OIDC_PROVIDER=$(aws eks describe-cluster --name "${CLUSTER_NAME}" \
    --query "cluster.identity.oidc.issuer" --output text | sed -e "s|^https://||")

$ cat <<EOF > karpenter-trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:karpenter:karpenter"
        }
      }
    }
  ]
}
EOF

$ aws iam create-role --role-name "KarpenterControllerRole-${CLUSTER_NAME}" \
    --assume-role-policy-document file://karpenter-trust-policy.json

$ KARPENTER_IAM_ROLE_ARN=$(aws iam get-role --role-name "KarpenterControllerRole-${CLUSTER_NAME}" \
    --query Role.Arn --output text)

# Copy the controller policy JSON from the official Karpenter getting-started
# documentation into karpenter-controller-policy.json, then attach it inline.
$ aws iam put-role-policy --role-name "KarpenterControllerRole-${CLUSTER_NAME}" \
    --policy-name KarpenterControllerPolicy \
    --policy-document file://karpenter-controller-policy.json
```

  • This sequence of commands creates the IAM role `KarpenterControllerRole-<cluster>` and attaches the inline policy Karpenter needs to manage EC2 instances, EKS, and other resources. Note: the controller policy document changes between releases — always copy the latest recommended policy from the official Karpenter documentation for the version you install.


Create IAM Instance Profile for Karpenter-Provisioned Nodes:


These roles are for the nodes themselves, allowing them to join the cluster.


```shell
$ cat <<EOF > karpenter-node-trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

$ aws iam create-role --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --assume-role-policy-document file://karpenter-node-trust-policy.json

$ aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
$ aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
$ aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

$ aws iam create-instance-profile --instance-profile-name "KarpenterNodeInstanceProfile-${CLUSTER_NAME}"
$ aws iam add-role-to-instance-profile --instance-profile-name "KarpenterNodeInstanceProfile-${CLUSTER_NAME}" \
    --role-name "KarpenterNodeRole-${CLUSTER_NAME}"
```

  • This creates an IAM role for the nodes and an instance profile, attaching the standard EKS node policies.


### 2. Install Karpenter via Helm


With the IAM roles in place, install Karpenter using its Helm chart. We'll specify the IAM role for the controller and other cluster-specific details.


```shell
# The legacy charts.karpenter.sh Helm repository is deprecated; recent chart
# versions are published to the public ECR OCI registry.
$ helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
    --version "v0.32.0" \
    --namespace karpenter --create-namespace \
    --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=${KARPENTER_IAM_ROLE_ARN} \
    --set settings.clusterName=${CLUSTER_NAME} \
    --set controller.resources.requests.cpu="1" \
    --set controller.resources.requests.memory="1Gi" \
    --wait
# v0.32.x ships the v1beta1 APIs used in this article; pin the chart version
# and upgrade deliberately as the v1.0 charts roll out.
```

  • This command deploys Karpenter to the `karpenter` namespace, configures its service account to assume the `KarpenterControllerRole`, and pins the chart version and controller resource requests.


Expected Output (Helm installation):


```
NAME: karpenter
LAST DEPLOYED: Thu Jan  1 10:00:00 2026
NAMESPACE: karpenter
STATUS: deployed
REVISION: 1
TEST SUITE: None
```


Verify Karpenter Pods:


```shell
$ kubectl get pods -n karpenter
```


Expected Output:


```
NAME                                READY   STATUS    RESTARTS   AGE
karpenter-controller-abcdef-12345   1/1     Running   0          2m
```


### 3. Create an `EC2NodeClass`


Before defining `Provisioners`, create an `EC2NodeClass` that specifies common EC2 configurations for your nodes. This `EC2NodeClass` will be referenced by both your On-Demand and Spot `Provisioners`.


```yaml
# ec2nodeclass-default.yaml
# Note: kubectl does not expand shell variables; substitute ${CLUSTER_NAME}
# (e.g., with envsubst) before applying this manifest.
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default-node-class
spec:
  amiFamily: AL2023 # Use the latest Amazon Linux 2023 AMI
  role: KarpenterNodeRole-${CLUSTER_NAME} # Node IAM role created earlier (name, not ARN)
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery/${CLUSTER_NAME}: "true" # Cluster discovery tag for subnets
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery/${CLUSTER_NAME}: "true" # Cluster discovery tag for security groups
  tags:
    environment: production
    managed-by: karpenter
  userData: |
    #!/bin/bash
    set -eux
    # Karpenter generates the EKS bootstrap for managed AMI families itself;
    # add only supplemental node configuration here.
    echo "Karpenter node bootstrapped for ${CLUSTER_NAME}"
```


  • This `EC2NodeClass` uses Amazon Linux 2023, references the node IAM role created earlier, and discovers subnets/security groups via the cluster discovery tags. The `userData` runs at boot for any supplemental node configuration.


```shell
$ kubectl apply -f ec2nodeclass-default.yaml
```


Expected Output:


```
ec2nodeclass.karpenter.k8s.aws/default-node-class created
```


### 4. Define a Base On-Demand `Provisioner`


This `Provisioner` will be used for critical, latency-sensitive workloads that require guaranteed capacity. It will exclusively provision On-Demand instances.


```yaml
# provisioner-on-demand.yaml
apiVersion: karpenter.sh/v1beta1
kind: Provisioner
metadata:
  name: on-demand-critical
spec:
  providerRef:
    name: default-node-class # Reference the EC2NodeClass defined previously
  requirements:
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: ["c", "m", "r"] # Focus on general purpose and compute/memory optimized
    - key: karpenter.k8s.aws/instance-cpu
      operator: Gt
      values: ["2"] # Require at least 4 vCPUs
    - key: karpenter.k8s.aws/instance-memory
      operator: Gt
      values: ["8192"] # Require more than 8GiB memory (value is in MiB)
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"] # Strictly On-Demand instances
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"] # Specify architecture
    - key: topology.kubernetes.io/zone # Spread nodes across multiple AZs
      operator: Exists
  limits:
    resources:
      cpu: "100" # Limit total CPU capacity for this provisioner to prevent overprovisioning
  ttlSecondsAfterEmpty: 600 # Terminate empty nodes after 10 minutes
  consolidation:
    enabled: true # Enable consolidation for this provisioner
  taints: # Taint nodes so only tolerating pods can schedule
    - key: karpenter.example.com/critical-workload
      value: "true"
      effect: NoSchedule
  labels:
    app-tier: production-critical # karpenter.sh/provisioner-name is applied by Karpenter automatically
```


  • This `Provisioner` is named `on-demand-critical`. It targets specific instance categories, requires a minimum CPU/memory, strictly uses On-Demand capacity, and applies a taint to ensure only critical workloads schedule on its nodes.


```shell
$ kubectl apply -f provisioner-on-demand.yaml
```


Expected Output:


```
provisioner.karpenter.sh/on-demand-critical created
```


### 5. Implement a Spot `Provisioner` for Cost Optimization


This `Provisioner` will aggressively utilize Spot Instances for resilient, batch-oriented workloads, aiming for maximum cost savings.


```yaml
# provisioner-spot.yaml
apiVersion: karpenter.sh/v1beta1
kind: Provisioner
metadata:
  name: spot-batch
spec:
  providerRef:
    name: default-node-class # Reference the EC2NodeClass
  requirements:
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: ["c", "m", "r"] # Broad range of instance families for better Spot availability
    - key: karpenter.k8s.aws/instance-cpu
      operator: Gt
      values: ["1"] # Smaller minimum CPU for flexibility
    - key: karpenter.k8s.aws/instance-memory
      operator: Gt
      values: ["2048"] # Smaller minimum memory (value is in MiB)
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"] # Exclusively Spot instances
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64", "arm64"] # Allow both x86 and ARM for broader Spot availability and better pricing
    - key: topology.kubernetes.io/zone
      operator: Exists
  limits:
    resources:
      cpu: "300" # Higher CPU limit for batch workloads
  ttlSecondsAfterEmpty: 120 # More aggressive termination for empty Spot nodes (2 minutes)
  consolidation:
    enabled: true # Crucial for Spot to continuously optimize
  taints: # Taint nodes to segregate Spot workloads
    - key: karpenter.example.com/spot-tolerant
      value: "true"
      effect: NoSchedule
  labels:
    app-tier: batch-processing # karpenter.sh/provisioner-name is applied by Karpenter automatically
```


  • This `Provisioner` named `spot-batch` targets a wide range of instance types including ARM, exclusively uses Spot, and has a shorter `ttlSecondsAfterEmpty` for faster deprovisioning. It also applies a specific taint.


```shell
$ kubectl apply -f provisioner-spot.yaml
```


Expected Output:


```
provisioner.karpenter.sh/spot-batch created
```


Common mistake: Forgetting to add appropriate `tolerations` to your pods when using taints on `Provisioners`. If a pod requires a node provisioned by `on-demand-critical` but does not tolerate `karpenter.example.com/critical-workload=true:NoSchedule`, it will remain unschedulable.


### 6. Deploy a Sample Workload and Observe Provisioning


Now, deploy a sample application that requests specific resources and tolerations, demonstrating how Karpenter provisions the correct node type.


Deploy a Critical On-Demand Workload:


```yaml
# critical-app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
  labels:
    app: critical-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: critical-app
  template:
    metadata:
      labels:
        app: critical-app
    spec:
      terminationGracePeriodSeconds: 60
      tolerations:
        - key: karpenter.example.com/critical-workload
          operator: Exists
          effect: NoSchedule
      containers:
        - name: nginx
          image: nginx:latest
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              cpu: "2"
              memory: 4Gi
```


  • This deployment requests 1 CPU and 2GiB memory, and crucially, tolerates the `critical-workload` taint, directing Karpenter to use the `on-demand-critical` `Provisioner`.


```shell
$ kubectl apply -f critical-app-deployment.yaml
```


Observe Karpenter in Action (On-Demand):


```shell
$ kubectl get pods -w
```

  • You should see `critical-app` pods initially in `Pending` state, then Karpenter will provision a new node, and the pods will move to `Running`.


```shell
$ kubectl get nodes -L karpenter.sh/provisioner-name
```


Expected Output: (Look for a new node with `on-demand-critical` label)


```
NAME                                         STATUS   ROLES    AGE   VERSION           PROVISIONER
ip-10-0-100-123.eu-west-1.compute.internal   Ready    <none>   2m    v1.27.4-eks-123   on-demand-critical
```


Deploy a Spot-Tolerant Batch Workload:


```yaml
# batch-app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-app
  labels:
    app: batch-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: batch-app
  template:
    metadata:
      labels:
        app: batch-app
    spec:
      terminationGracePeriodSeconds: 90 # Give enough time for graceful shutdown on Spot interruption
      tolerations:
        - key: karpenter.example.com/spot-tolerant
          operator: Exists
          effect: NoSchedule
      containers:
        - name: busybox
          image: busybox
          command: ["sh", "-c", "while true; do echo 'Batch job running...'; sleep 30; done"]
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
```


  • This deployment requests fewer resources per pod and tolerates the `spot-tolerant` taint, directing Karpenter to use the `spot-batch` `Provisioner`.


```shell
$ kubectl apply -f batch-app-deployment.yaml
```


Observe Karpenter in Action (Spot):


```shell
$ kubectl get pods -w
```

  • Again, watch for `Pending` pods, followed by a new node appearing.


```shell
$ kubectl get nodes -L karpenter.sh/provisioner-name
```


Expected Output: (Look for a new node with `spot-batch` label)


```
NAME                                         STATUS   ROLES    AGE   VERSION           PROVISIONER
ip-10-0-101-234.eu-west-1.compute.internal   Ready    <none>   1m    v1.27.4-eks-123   spot-batch
```


### 7. Observe Consolidation


To observe consolidation, scale down one of your deployments or make a node idle. Karpenter will then identify nodes that are underutilized or empty and terminate them according to the `ttlSecondsAfterEmpty` configured in the `Provisioner`.


```shell
$ kubectl scale deployment/batch-app --replicas=0
```


Observe Node Termination:


```shell
$ kubectl get nodes -w
```


  • After the `ttlSecondsAfterEmpty` (e.g., 120 seconds for `spot-batch` provisioner), Karpenter will cordon and drain the node, then terminate the underlying EC2 instance. You will see the node eventually disappear from `kubectl get nodes` output.


Common mistake: Setting `ttlSecondsAfterEmpty` too aggressively for production workloads, potentially causing unnecessary node churn, especially if transient workloads frequently cycle. Balance fast deprovisioning with the stability needs of your application.


## Production Readiness with Karpenter


Deploying Karpenter into production requires more than just installation. A robust strategy encompasses monitoring, alerting, security, and meticulous planning for edge cases to ensure operational stability and continued cost efficiency.


### Monitoring and Alerting


Karpenter exposes a comprehensive set of Prometheus metrics that are invaluable for understanding its behavior and identifying potential issues. These metrics provide insights into:


  • Provisioning activity: Number of nodes launched, types of instances, provisioning latency.

  • Consolidation events: Nodes terminated, consolidation efficiency, pods disrupted.

  • Pod scheduling: Pods pending due to insufficient resources, reasons for failure.


Integrate Karpenter's metrics endpoint with your existing Prometheus and Grafana setup.

```shell
$ kubectl get svc -n karpenter
```

  • Locate the `karpenter` controller service. Its Prometheus metrics are exposed at `/metrics` on the metrics port (8000 on recent chart versions).
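If you run the Prometheus Operator, a `ServiceMonitor` is one way to scrape that endpoint. The label selector and port name below are assumptions drawn from common chart conventions — verify them against the service your chart version actually creates:

```yaml
# Scrape Karpenter controller metrics via the Prometheus Operator.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: karpenter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter # Assumed chart label; check `kubectl get svc -n karpenter --show-labels`
  endpoints:
    - port: http-metrics # Assumed port name for the metrics port
      path: /metrics
```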


Key Metrics to Monitor:


  • `karpenter_nodes_launched_total`: Track node launches. Spikes here outside of expected scaling could indicate misconfigured workloads.

  • `karpenter_nodes_terminated_total`: Monitor node terminations.

  • `karpenter_provisioner_limits_cpu`, `karpenter_provisioner_limits_memory`: Observe if provisioners are hitting their configured resource limits.

  • `karpenter_pods_pending`: Critical for understanding if pods are waiting for capacity. Alert if this metric is consistently high.

  • `karpenter_consolidation_nodes_consolidated_total`: Track consolidation efficiency.


Alerting:


Configure alerts in Prometheus Alertmanager for:


  • High `karpenter_pods_pending`: Indicates a lack of capacity, potentially a misconfigured `Provisioner` or resource starvation.

  • Failed instance launches: Monitor `karpenter_nodes_failed_to_launch_total` for issues with EC2 capacity or IAM permissions.

  • Rapid node churn: If nodes are frequently being launched and terminated without clear reason, it could indicate thrashing or inefficient pod scheduling.

  • Provisioner limits breached: Alert if a `Provisioner` is constantly hitting its CPU/memory limits, suggesting it might need adjustment.
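As a sketch of the first alert, assuming the Prometheus Operator and the `karpenter_pods_pending` metric name used above (confirm both against your deployment), a `PrometheusRule` could look like:

```yaml
# Alert when pods have been waiting for Karpenter capacity for 10 minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-alerts
  namespace: karpenter
spec:
  groups:
    - name: karpenter
      rules:
        - alert: KarpenterPodsPendingHigh
          expr: karpenter_pods_pending > 0 # Metric name assumed; verify on your /metrics endpoint
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: >-
              Pods have been pending capacity for 10 minutes; check Provisioner
              limits, IAM permissions, and EC2 capacity.
```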


Beyond Karpenter's internal metrics, also monitor standard EC2 and EKS CloudWatch metrics. For instance, observe `EC2 Spot Instance Interruptions` for your Spot-provisioned nodes.


Security Considerations


Maintaining a strong security posture with Karpenter involves several layers:


  • IAM Least Privilege: Ensure the Karpenter controller's IAM role (e.g., `KarpenterControllerRole`) has only the permissions strictly necessary to manage EC2 instances, EKS, and other required services. Regularly review and audit these permissions. Similarly, the IAM role for Karpenter-provisioned nodes (`KarpenterNodeRole`) should adhere to the principle of least privilege.

  • IMDSv2 Enforcement: Configure `EC2NodeClasses` to enforce IMDSv2 (Instance Metadata Service Version 2) for all provisioned nodes. This mitigates SSRF (Server-Side Request Forgery) vulnerabilities.

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default-node-class
spec:
  # ... other fields ...
  metadataOptions:
    httpTokens: required
    httpPutResponseHopLimit: 1
```

* This ensures all nodes provisioned by this `EC2NodeClass` require IMDSv2.

  • Pod Security Standards (PSS): Ensure your `Provisioners` and `EC2NodeClasses` are configured to support your cluster's Pod Security Standards. For example, if you enforce `restricted` PSS, ensure that Karpenter-provisioned nodes are compatible and that workload pods adhere to these standards.

  • Supply Chain Security: Regularly update Karpenter to the latest stable versions. Utilize image scanning for the Karpenter controller image and the AMIs used by your `EC2NodeClasses`.

  • Network Security: Ensure security groups and network ACLs are appropriately configured to allow necessary traffic between Karpenter-provisioned nodes, the EKS control plane, and other services, while restricting unnecessary access.


Cost Management Best Practices


Karpenter is inherently a cost optimization tool, but its effectiveness depends on proper configuration:


  • Tagging Strategy: Karpenter automatically tags provisioned EC2 instances with `karpenter.sh/provisioner-name`, `karpenter.sh/nodepool` (for Karpenter v1.0 and above), and `karpenter.sh/capacity-type`. Supplement these with your organization's standard cost allocation tags (e.g., `Owner`, `Environment`, `Project`). This enables granular cost analysis in AWS Cost Explorer and detailed billing reports.

  • Right-Sizing with Requests/Limits: Emphasize accurate pod resource `requests` and `limits`. Karpenter provisions nodes based on `requests`. Under-requesting can lead to performance issues, while over-requesting leads to inflated node sizes and wasted capacity.

  • Consolidation Aggressiveness: Fine-tune `ttlSecondsAfterEmpty` and the overall `consolidation` settings in your `Provisioners`. More aggressive settings reduce idle costs but might increase node churn. Balance this with workload stability.

  • Spot Instance Strategy: Maximize Spot utilization for fault-tolerant workloads. Use a broad `instanceRequirements` selection (e.g., multiple instance categories, architectures like `arm64`) within your Spot `Provisioners` to increase Spot availability and reduce price fluctuations.

  • Cost Visibility: Integrate Karpenter cost data with your cloud cost management platform (e.g., FinOps tools) to provide transparency and track savings.
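Taken together, a Spot-oriented pool implementing these practices might look like the sketch below. It uses the `NodePool` resource that supersedes the `Provisioner` API in Karpenter v1.0; the name, limits, and referenced `EC2NodeClass` are illustrative, and organization-specific cost allocation tags belong on the `EC2NodeClass` itself (`spec.tags`).

```yaml
# Sketch: a cost-optimized Spot NodePool (Karpenter v1.0's successor to
# the Provisioner API). Names, limits, and the referenced EC2NodeClass
# are illustrative assumptions.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-batch
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default-node-class
      requirements:
        # A broad selection across capacity type, architecture, and
        # instance category improves Spot availability.
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
  disruption:
    # Aggressive consolidation trims idle capacity quickly; relax
    # consolidateAfter if node churn becomes disruptive.
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
  limits:
    cpu: "500"
    memory: 1000Gi
```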


Edge Cases and Failure Modes


Anticipating and mitigating edge cases is crucial for production stability:


  • Spot Interruptions: For workloads running on Spot Instances, prepare for 2-minute interruption notices.

* Implement `terminationGracePeriodSeconds` in your pod specifications to allow applications to gracefully shut down.

* Utilize `preStop` lifecycle hooks to perform cleanup tasks or flush in-memory data.

* Ensure your applications are stateless or can recover quickly from interruptions.

* For critical Spot-tolerant workloads, consider Pod Disruption Budgets (PDBs) to maintain a minimum number of available replicas during disruptions. While PDBs don't prevent Spot interruptions, they can help Karpenter manage the draining process more safely.

  • Insufficient EC2 Capacity: Even with broad instance selections, certain regions or availability zones might experience temporary insufficient capacity for specific instance types. Karpenter will retry, but prolonged issues can lead to pending pods.

* Configure `Provisioners` with a diverse range of `instanceRequirements` across multiple availability zones.

* Monitor `karpenter_nodes_failed_to_launch_total` and associated logs for capacity errors.

* Have a fallback `Provisioner` or manual intervention plan for extreme cases.

  • Misconfigured `Provisioners` or `EC2NodeClasses`: Incorrect IAM roles, security group/subnet selections, or `amiFamily` can prevent nodes from launching or joining the cluster.

* Thoroughly test `Provisioner` configurations in non-production environments.

* Check Karpenter controller logs for errors related to EC2 API calls.

* Verify node `kubelet` logs if instances launch but fail to join.

  • Pod Disruption Budgets (PDBs): Karpenter respects PDBs during consolidation. If a PDB prevents Karpenter from draining a node, consolidation might be blocked. Review PDB configurations to ensure they don't overly restrict node movement, especially for Spot workloads.

  • Lifecycle Hooks for Stateful Workloads: For stateful applications that cannot tolerate arbitrary interruptions (even graceful ones), segregate them to dedicated On-Demand `Provisioners` without aggressive consolidation. If using Spot for stateful workloads is a requirement, implement robust backup, restore, and failover mechanisms. Use `preStop` hooks for data synchronization or unmounting volumes.
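The interruption-handling advice above can be sketched as a Deployment paired with a PDB. The image, grace period, and `preStop` command here are placeholders to replace with your application's actual shutdown logic.

```yaml
# Sketch: graceful-shutdown settings for a Spot-tolerant workload.
# Image name, grace period, and preStop command are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spot-worker
spec:
  replicas: 3
  selector:
    matchLabels:
      app: spot-worker
  template:
    metadata:
      labels:
        app: spot-worker
    spec:
      # Must complete within the ~2-minute Spot interruption notice.
      terminationGracePeriodSeconds: 90
      containers:
        - name: worker
          image: registry.example.com/spot-worker:latest  # placeholder image
          lifecycle:
            preStop:
              exec:
                # Placeholder: flush in-flight work before SIGTERM handling.
                command: ["/bin/sh", "-c", "sleep 10"]
---
# A PDB keeps a minimum replica count available while Karpenter drains nodes.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: spot-worker-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: spot-worker
```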


By meticulously addressing these production readiness aspects, teams can harness Karpenter v1.0's full potential for significant cost optimization and enhanced operational agility in their EKS environments by 2026.


Summary & Key Takeaways


Karpenter v1.0 represents a significant leap forward in AWS EKS node management, moving beyond the inherent limitations of traditional Cluster Autoscaler and Auto Scaling Groups. By adopting Karpenter, engineering teams can achieve a truly dynamic, cost-optimized, and resilient compute layer for their Kubernetes workloads.


  • Prioritize Karpenter for EKS Cost Optimization: Karpenter's direct EC2 integration and intelligent consolidation capabilities deliver superior cost savings and faster scaling responses compared to older node management strategies.

  • Design Multi-Provisioner Strategies: Segment your workloads by criticality and interruption tolerance. Implement dedicated `Provisioners` for critical On-Demand applications and highly aggressive Spot-based `Provisioners` for batch or resilient services. Leverage taints and tolerations for effective workload segregation.

  • Implement Robust Monitoring and Alerting: Monitor Karpenter's internal metrics (e.g., pending pods, launch/termination counts, consolidation events) with Prometheus and Grafana. Configure alerts for capacity issues, failed launches, and unexpected node churn to maintain operational visibility.

  • Plan for Spot Interruptions: For Spot-tolerant workloads, bake in `terminationGracePeriodSeconds` and `preStop` lifecycle hooks. Ensure applications are designed for resiliency and fast recovery in the face of potential Spot instance preemption.

  • Avoid Over-Reliance on Default Settings: While Karpenter offers sensible defaults, finely tune `ttlSecondsAfterEmpty`, `consolidation` settings, `instanceRequirements`, and resource `requests`/`limits` in your `Provisioners` to precisely match your cluster's unique workload profile and cost optimization goals.

WRITTEN BY

Ahmet Çelik

Former AWS Solutions Architect, 8 years in cloud and infrastructure. Computer Engineering graduate, Bilkent University. Lead writer for AWS, Terraform, and Kubernetes content.
