GCP Cost Optimization Checklist: Cloud Run & GKE

In this article, we provide a comprehensive GCP cost optimization checklist for Cloud Run and GKE. You will learn how to implement resource right-sizing, leverage advanced autoscaling techniques, and apply intelligent configuration strategies to significantly reduce your cloud infrastructure expenditure while maintaining production reliability in 2026.

Deniz Şahin

11 min read

Most teams prioritize rapid feature delivery and high availability, often deploying resources with generous defaults to mitigate immediate performance risks. But this approach commonly leads to 30–50% over-provisioning at scale, resulting in substantial, unnecessary cloud spend across flexible services like Cloud Run and GKE where defaults are rarely optimal for cost efficiency.


TL;DR


  • Proactive resource right-sizing on both Cloud Run and GKE is critical to avoid hidden costs from over-provisioning.

  • Fine-tune Cloud Run's concurrency, CPU allocation, and `min-instances` to match workload patterns precisely.

  • Leverage GKE's Cluster Autoscaler, Vertical Pod Autoscaler, and Horizontal Pod Autoscaler in concert for dynamic resource scaling.

  • Implement GKE Spot VMs with careful workload affinity and anti-affinity rules for significant compute savings.

  • Establish continuous monitoring with GCP Budgets and custom metrics to detect and address cost anomalies promptly.


The Problem: Unchecked Cloud Sprawl and Escalating Bills


Our team at a growing SaaS company experienced firsthand how quickly cloud costs can spiral without active management. Initially, our focus was solely on rapid deployment and service availability, so we used generous default configurations for our GKE clusters and Cloud Run services. Our monthly GCP bill climbed by an average of 20% quarter-over-quarter, even though user growth was linear, not exponential. Analysis revealed substantial waste: GKE nodes ran at 40–50% average CPU utilization, and Cloud Run services frequently spun up more instances than necessary due to lax concurrency settings. This scenario is common: teams allocate resources for peak theoretical load, leaving significant idle capacity and inefficient spend on infrastructure that is never fully utilized. The challenge is optimizing these resources without hurting reliability or developer velocity.


Optimizing Cloud Run Cost Management


Cloud Run offers a compelling serverless container platform, but its cost efficiency heavily depends on configuration. Understanding how CPU allocation, concurrency, and instance scaling interact is key to significant savings.


CPU Allocation and Concurrency


Cloud Run instances can allocate CPU differently: "CPU is always allocated" or "CPU is only allocated during requests." The latter is often more cost-effective for services with intermittent traffic, as you only pay for CPU when it's actively processing. Concurrency dictates how many simultaneous requests a single Cloud Run instance can handle. A higher concurrency setting can reduce the number of instances needed, directly lowering costs, provided your application can effectively handle multiple requests without significant latency. However, aggressive concurrency settings on CPU-intensive applications can lead to increased latency and timeouts.


Consider a service handling API requests. If each request is lightweight and I/O-bound, a higher concurrency (e.g., 80-100 requests/instance) is suitable. For CPU-bound tasks, a lower concurrency (e.g., 2-10 requests/instance) might be more appropriate, even if it means more instances are provisioned.


# cloudrun-service-optimized.yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: api-service
spec:
  template:
    metadata:
      annotations:
        # "true" throttles CPU outside of requests, so you pay for CPU only
        # while serving traffic — usually the cheaper choice for intermittent
        # workloads. Set to "false" only if you need background processing.
        run.googleapis.com/cpu-throttling: "true"
        autoscaling.knative.dev/minScale: "0"  # Allow scaling down to zero for idle periods
        autoscaling.knative.dev/maxScale: "50" # Cap maximum instances to control cost spikes
    spec:
      containerConcurrency: 80 # Up to 80 concurrent requests per instance
      timeoutSeconds: 300
      serviceAccountName: cloud-run-service-account@your-project.iam.gserviceaccount.com # Replace with your service account
      containers:
      - image: us-docker.pkg.dev/cloudrun/container/hello # Replace with your image
        ports:
        - containerPort: 8080
        resources:
          limits:
            cpu: "1000m"   # Allocate 1 vCPU per instance
            memory: "512Mi" # Allocate 512 MiB memory


Understanding `min-instances` and `max-instances`


The `min-instances` setting keeps a specified number of instances warm, ready to serve requests immediately. While it reduces cold starts, it incurs cost even during idle periods. For services with strict latency requirements, a `min-instances` value of 1 or 2 might be justified, but for batch jobs or services with bursty, non-critical traffic, setting `min-instances` to 0 maximizes cost savings by allowing the service to scale down completely.


`max-instances` sets an upper bound on how many instances Cloud Run will provision. This is a critical cost control mechanism. Without it, a sudden traffic spike could provision hundreds of instances, leading to an unexpected bill. Set `max-instances` based on observed peak traffic, performance requirements, and budget constraints.
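On the underlying Knative-style service spec, these two limits are expressed as autoscaling annotations on the revision template. A minimal fragment (the values shown are illustrative, not recommendations):

```yaml
# Fragment of a Cloud Run service spec: instance limits are set as
# annotations on spec.template.metadata, not as top-level fields.
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "0"  # Scale to zero when idle; no cost for warm instances
        autoscaling.knative.dev/maxScale: "50" # Hard cap on provisioned instances
```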


GKE Resource Efficiency and Cost Control


Optimizing GKE costs involves a multi-faceted approach, combining intelligent autoscaling, leveraging cost-effective VM types, and meticulous resource management within pods.


Advanced Autoscaling with CA, VPA, and HPA


GKE offers three distinct autoscalers:

  • Cluster Autoscaler (CA): Scales the number of nodes in your GKE cluster based on pending pods. It ensures pods have resources to run but does not manage pod resource usage.

  • Horizontal Pod Autoscaler (HPA): Scales the number of pod replicas based on observed CPU utilization or custom metrics. HPA ensures your application can handle varying loads.

  • Vertical Pod Autoscaler (VPA): Recommends or automatically sets resource requests and limits for pods based on historical usage. VPA helps right-size individual pods, preventing over-provisioning at the pod level.


Using these together requires careful configuration. VPA and HPA can conflict if both try to manage CPU/memory for the same pods. For instance, if VPA is in `UpdateMode: Auto`, it will adjust CPU/memory requests, potentially interfering with HPA's CPU-based scaling. A common and recommended strategy in 2026 is to use VPA in `Off` or `Recommender` mode for pods also managed by HPA. VPA provides recommendations, which can then be manually applied to deployment manifests or used by HPA if HPA is scaling on custom metrics rather than CPU/memory.
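A recommendation-only VPA for that pattern might look like the sketch below, targeting a hypothetical Deployment named `web-app-deployment` (the object name and target are assumptions for illustration):

```yaml
# web-app-vpa.yaml — VPA in recommendation-only mode, safe to pair with HPA
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app-deployment
  updatePolicy:
    updateMode: "Off" # Surface recommendations only; never evict or resize pods
```

With `updateMode: "Off"`, recommendations appear in the VPA object's status (`kubectl describe vpa web-app-vpa`) and can be applied to the Deployment manifest during normal release cycles.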


# deployment-vpa-hpa-example.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-deployment
spec:
  replicas: 1 # Start with a minimal replica count
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app
        image: your-gke-image:v1.0.0 # Replace with your image
        resources:
          # Set conservative initial requests. VPA will recommend better values.
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m" # Set a reasonable limit to prevent resource hogging
            memory: "256Mi"
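A companion HPA for this Deployment could be sketched as follows — since the VPA is recommendation-only in this setup, scaling on CPU utilization is safe. The replica bounds and 70% target are illustrative assumptions, not tuned values:

```yaml
# web-app-hpa.yaml — scale replicas on observed CPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app-deployment
  minReplicas: 1   # Matches the Deployment's minimal starting replica count
  maxReplicas: 10  # Upper bound to keep scale-out (and cost) predictable
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70 # Add replicas when average CPU exceeds 70% of requests
```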

WRITTEN BY

Deniz Şahin

GCP Certified Professional with developer relations experience. Electronics and Communication Engineering graduate, Istanbul Technical University. Writes on GCP, Cloud Run, and BigQuery.
