Google Cloud Run Tutorial: Deploy Scalable Services

Master the Google Cloud Run tutorial for deploying scalable, serverless containerized applications. Learn practical steps for production readiness by 2026.

Deniz Şahin

11 min read

Most teams building new services today default to container orchestration platforms like Kubernetes. But this often introduces significant operational overhead and management complexity, leading to slower iteration cycles and higher infrastructure costs for stateless applications at scale.


TL;DR: Google Cloud Run Tutorial for Scalable Deployments

  • Cloud Run provides a fully managed serverless platform for deploying containerized applications, abstracting away infrastructure management.
  • It offers automatic scaling from zero to thousands of instances based on demand, optimizing costs through per-request billing.
  • Leverage traffic splitting for controlled rollouts, A/B testing, and efficient incident mitigation with revision management.
  • Secure your Cloud Run services using IAM, VPC Access Connectors for private network access, and Secret Manager for sensitive data.
  • Implement robust production readiness by integrating Cloud Monitoring and Cloud Logging for comprehensive observability and alerting.


The Problem


Modern backend development prioritizes agility and cost-efficiency. However, deploying and managing stateless microservices on self-managed infrastructure or even managed Kubernetes can still consume substantial engineering resources. Teams commonly report dedicating 20-40% of their operational budget and engineering time to infrastructure provisioning, scaling, and patching for services that primarily handle HTTP requests. This overhead often overshadows the actual business logic development, particularly for smaller, event-driven, or intermittently used services. The challenge lies in achieving granular resource allocation and true "pay-per-use" billing without compromising scalability or developer velocity. This is where a focused Google Cloud Run tutorial becomes essential, addressing how to deploy and manage services efficiently.


How It Works


Cloud Run offers a compelling solution by providing a fully managed environment for stateless containers, automatically scaling them based on request traffic. It abstracts the underlying infrastructure, allowing engineers to focus purely on application logic packaged as a Docker image.


Understanding Cloud Run Deployment Strategies


At its core, Cloud Run operates on services, revisions, and instances. A service is the main deployment entity, representing your application. Each deployment creates an immutable revision, which contains a specific container image, configuration, and environment variables. When traffic arrives, Cloud Run scales up instances of the active revision to handle requests.


This revision-based model is crucial for robust deployment strategies. You can direct traffic to multiple revisions simultaneously, enabling fine-grained control over rollouts. For instance, to test a new feature, you can deploy a new revision and route a small percentage of user traffic to it. If issues arise, rolling back is as straightforward as redirecting all traffic to a previous stable revision. This approach significantly de-risks deployments and facilitates advanced patterns like blue/green deployments and A/B testing.
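Conceptually, a traffic split behaves like weighted random routing across revisions. The simulation below illustrates the idea with hypothetical revision names and a 90/10 canary split; it is not how Cloud Run's router is actually implemented:

```python
import random

def route_request(split, rng):
    """Pick a revision according to a traffic split like {"rev-a": 90, "rev-b": 10}."""
    revisions = list(split.keys())
    weights = list(split.values())
    return rng.choices(revisions, weights=weights, k=1)[0]

# Simulate 10,000 requests against a 90/10 canary split
rng = random.Random(42)  # seeded for reproducibility
split = {"service-00001-abc": 90, "service-00002-def": 10}
hits = {rev: 0 for rev in split}
for _ in range(10_000):
    hits[route_request(split, rng)] += 1

print(hits)  # roughly 9,000 requests to the stable revision, 1,000 to the canary
```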


Serverless Container Scaling Mechanics


Cloud Run's scaling is a key differentiator. It automatically scales instances based on incoming request load, from zero to potentially thousands, within seconds. You configure minimum and maximum instance counts and a concurrency setting, which dictates how many simultaneous requests a single instance can handle. When existing instances approach their concurrency limit, Cloud Run adds instances; as traffic subsides, it removes them, scaling down to zero if `min-instances` allows.


A critical decision point is the CPU allocation setting: "CPU is always allocated" or "CPU is only allocated during request processing." The latter is often more cost-effective for intermittent workloads, as CPU is only billed while requests are being served. Separately, any service that scales to zero incurs cold starts: a new instance must be provisioned and your container booted before the first request is served, adding latency. For services requiring consistent low latency, keeping CPU always allocated (at a higher cost) and setting `min-instances` above zero to prevent scale-to-zero can be a better trade-off.
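The relationship between request rate, latency, and concurrency can be sketched with back-of-the-envelope math: by Little's law, the number of in-flight requests is roughly request rate times average latency, and the instance count Cloud Run needs is that figure divided by per-instance concurrency. This is an illustrative estimate, not Cloud Run's actual autoscaling algorithm:

```python
import math

def estimated_instances(requests_per_second, avg_latency_s, concurrency,
                        min_instances=0, max_instances=1000):
    """Rough instance estimate: in-flight requests (Little's law) / concurrency."""
    in_flight = requests_per_second * avg_latency_s
    needed = math.ceil(in_flight / concurrency) if in_flight > 0 else 0
    # Clamp to the configured scaling bounds
    return max(min_instances, min(needed, max_instances))

# 400 req/s at 200 ms average latency = ~80 concurrent requests.
print(estimated_instances(400, 0.2, 80))  # 1  (concurrency 80: one instance)
print(estimated_instances(400, 0.2, 10))  # 8  (concurrency 10: eight instances)
print(estimated_instances(0, 0.2, 80))    # 0  (no traffic: scale to zero)
```

Lower concurrency isolates requests better (useful for CPU-heavy handlers) but multiplies instance count and cost for the same load.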


Step-by-Step Implementation


This section provides a practical Google Cloud Run tutorial for deploying a simple Python Flask application.


1. Set up your GCP Project and gcloud CLI


First, ensure you have a GCP project configured and the `gcloud` CLI installed and authenticated.


$ gcloud config set project YOURPROJECTID # Replace YOURPROJECTID

$ gcloud auth login

$ gcloud services enable run.googleapis.com \
    artifactregistry.googleapis.com \
    cloudbuild.googleapis.com


Expected output:

Updated property [core/project].

Your active configuration is: [default]

... (browser window opens for login) ...

Operation "operations/system/run.googleapis.com/enable" finished successfully.

Operation "operations/system/artifactregistry.googleapis.com/enable" finished successfully.

Operation "operations/system/cloudbuild.googleapis.com/enable" finished successfully.


2. Create a Simple Python Flask Application


Create `app.py` and `requirements.txt` for a basic "Hello, World!" web service.


  • `app.py`:

from flask import Flask
import os

app = Flask(__name__)

@app.route("/")
def hello_world():
    # Retrieve the revision name from the environment variable Cloud Run injects
    revision = os.environ.get("K_REVISION", "unknown")
    return f"<p>Hello from Cloud Run! Revision: {revision} - Deployed 2026.</p>"

if __name__ == "__main__":
    # Cloud Run provides the PORT env var; default to 8080 for local testing
    port = int(os.environ.get("PORT", 8080))
    app.run(host="0.0.0.0", port=port)


  • `requirements.txt` (gunicorn is included because the Dockerfile runs the app under it):

Flask>=2.0
gunicorn>=20.1


  • `Dockerfile`:

# Use a slim, current Python base image for smaller size
FROM python:3.12-slim

# Set the working directory in the container
WORKDIR /app

# Copy the requirements file first and install dependencies to leverage layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code
COPY . .

# Cloud Run expects the application to listen on the port specified by the
# PORT environment variable. EXPOSE is documentation only; Cloud Run ignores it.
EXPOSE 8080

# Run under gunicorn, a production-ready WSGI server
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 app:app
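Before containerizing, you can sanity-check the handler locally with Flask's built-in test client, without starting a server. The snippet below is a self-contained copy of the minimal handler so it runs on its own:

```python
import os
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello_world():
    revision = os.environ.get("K_REVISION", "unknown")
    return f"<p>Hello from Cloud Run! Revision: {revision} - Deployed 2026.</p>"

# Exercise the route in-process via the test client
with app.test_client() as client:
    response = client.get("/")
    assert response.status_code == 200
    assert b"Hello from Cloud Run!" in response.data
    print(response.data.decode())
```

Locally, `K_REVISION` is unset, so the page reports `Revision: unknown`; on Cloud Run it shows the live revision name.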


3. Build and Push Docker Image to Artifact Registry


Create a Docker repository, then build and push your image. Replace `YOURPROJECTID` and `YOUR_REGION` (e.g., `us-central1`). Note that Artifact Registry images live under the `REGION-docker.pkg.dev` host, not the legacy Container Registry host `gcr.io`.


$ gcloud artifacts repositories create cloud-run-repo \
    --repository-format=docker \
    --location=YOUR_REGION \
    --description="Docker repository for Cloud Run images"

$ gcloud auth configure-docker YOUR_REGION-docker.pkg.dev

$ docker build -t YOUR_REGION-docker.pkg.dev/YOURPROJECTID/cloud-run-repo/hello-cloud-run:v1.0 .

$ docker push YOUR_REGION-docker.pkg.dev/YOURPROJECTID/cloud-run-repo/hello-cloud-run:v1.0


Expected output (after building and pushing):

Created repository [cloud-run-repo].

...

The push refers to repository [YOUR_REGION-docker.pkg.dev/YOURPROJECTID/cloud-run-repo/hello-cloud-run]

v1.0: digest: sha256:a1b2c3d4e5... size: 1234


4. Deploy to Cloud Run


Deploy your container image to Cloud Run. Make sure to specify the region and allow unauthenticated invocations for a public service. For production, restrict access and use IAM.


$ gcloud run deploy hello-cloud-run-service \
    --image YOUR_REGION-docker.pkg.dev/YOURPROJECTID/cloud-run-repo/hello-cloud-run:v1.0 \
    --platform managed \
    --region YOUR_REGION \
    --allow-unauthenticated \
    --min-instances 0 \
    --max-instances 2 \
    --concurrency 80 \
    --memory 256Mi \
    --cpu 1

The `--cpu 1` flag allocates one vCPU per instance. Whether that CPU is always allocated or only allocated during request processing is controlled separately; for example, `--no-cpu-throttling` keeps CPU always allocated.


Common mistake: Forgetting `--allow-unauthenticated` for public services, leading to "permission denied" errors when accessing the URL. Or, for private services, not configuring the correct IAM permissions for the calling identity.


Expected output:

Deploying container to Cloud Run service [hello-cloud-run-service] in project [YOURPROJECTID] region [YOUR_REGION]

...

Service URL: https://hello-cloud-run-service-xyz.run.app


5. Test the Service


Navigate to the `Service URL` provided in the deployment output.


Expected browser output:

Hello from Cloud Run! Revision: hello-cloud-run-service-00001-abc - Deployed 2026.


6. Update and Deploy a New Revision with Traffic Splitting


Modify the greeting in `app.py` to read `Hello from Cloud Run! Updated Revision: {revision} - Deployed 2026.`, then build and push a `v1.1` image.


$ docker build -t YOUR_REGION-docker.pkg.dev/YOURPROJECTID/cloud-run-repo/hello-cloud-run:v1.1 .

$ docker push YOUR_REGION-docker.pkg.dev/YOURPROJECTID/cloud-run-repo/hello-cloud-run:v1.1


$ gcloud run deploy hello-cloud-run-service \
    --image YOUR_REGION-docker.pkg.dev/YOURPROJECTID/cloud-run-repo/hello-cloud-run:v1.1 \
    --platform managed \
    --region YOUR_REGION \
    --no-traffic # Deploy without routing traffic initially


Route 10% of traffic to the new revision (v1.1), keeping 90% on the original:

$ gcloud run services update-traffic hello-cloud-run-service \
    --to-revisions=LATEST=10,hello-cloud-run-service-00001-abc=90 \
    --platform managed \
    --region YOUR_REGION


Now, refresh your service URL multiple times. Approximately 10% of requests will hit the new revision (`v1.1`), and 90% will still go to `v1.0`. This demonstrates a canary deployment.


Production Readiness


Deploying a service is just the first step. For production systems, you must consider observability, cost, and security implications.


Cloud Run Production Readiness and Observability


Monitoring & Alerting: Cloud Run integrates seamlessly with Cloud Monitoring and Cloud Logging. Critical metrics include request count, request latency (p99, p95), CPU utilization, memory utilization, and instance count. Configure dashboards to visualize these metrics. Set up alerts for error rates exceeding a threshold (e.g., 5% 5xx errors over 5 minutes) or for unexpected latency spikes. Log Explorer provides powerful tools for filtering and analyzing application logs, crucial for debugging. For transient issues or cold starts, monitoring the instance lifecycle events in logs can provide deep insights.
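Cloud Logging parses JSON lines written to stdout into structured log entries, and the `severity` field maps to the entry's log level. A minimal structured-logging helper might look like this; field names beyond `severity` and `message` are your own convention, shown here as an illustration:

```python
import json
import sys

def log(severity, message, **fields):
    """Emit one JSON log line; Cloud Logging ingests structured JSON from stdout."""
    entry = {"severity": severity, "message": message, **fields}
    json.dump(entry, sys.stdout)
    sys.stdout.write("\n")

# Example entries: one per line, each independently parseable
log("INFO", "request handled", path="/", latency_ms=42)
log("ERROR", "upstream timeout", upstream="billing-api")
```

Filtering in Log Explorer then becomes a matter of querying fields (e.g., `jsonPayload.latency_ms > 500`) rather than grepping free-form text.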


Cost Management: Cloud Run's pricing model is highly optimized for serverless workloads. You pay per request and per instance usage (CPU, memory, network). Scaling to zero is a significant cost-saving feature for intermittent services. Carefully consider the "CPU is always allocated" setting; while it reduces cold starts, it increases billing. Choosing the correct region also influences cost, as pricing varies. For services with predictable base load, setting `min-instances` can improve latency but means paying for those instances even when idle. Teams commonly observe 30-60% cost savings for intermittent workloads compared to continuously running VMs or Kubernetes deployments.
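The per-request billing model can be sanity-checked with simple arithmetic: compute time (vCPU-seconds and GiB-seconds) plus a per-request fee. The rates below are placeholders passed in as parameters, not current GCP pricing; substitute the published rates for your region and tier:

```python
def monthly_compute_cost(vcpu_seconds, gib_seconds, requests,
                         vcpu_rate, gib_rate, request_rate):
    """Cost = CPU-time + memory-time + per-request fee (rates are caller-supplied)."""
    return (vcpu_seconds * vcpu_rate
            + gib_seconds * gib_rate
            + requests * request_rate)

# Hypothetical workload: 1M requests/month, 100 ms billed time each, 256 MiB memory.
billed_s = 1_000_000 * 0.1     # 100,000 vCPU-seconds
mem_s = billed_s * 0.25        # 256 MiB = 0.25 GiB, billed for the same duration
cost = monthly_compute_cost(billed_s, mem_s, 1_000_000,
                            vcpu_rate=0.000024,              # placeholder $/vCPU-s
                            gib_rate=0.0000025,              # placeholder $/GiB-s
                            request_rate=0.40 / 1_000_000)   # placeholder $/request
print(f"${cost:.2f}")
```

Running the same workload's numbers against an always-on VM or node pool makes the scale-to-zero savings for intermittent traffic concrete.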


Security: Implement a strong security posture using IAM service accounts. Grant your Cloud Run service's identity only the least necessary permissions for accessing other GCP resources (e.g., Cloud Storage, BigQuery, Secret Manager). For services needing to access resources within your Virtual Private Cloud (VPC), like private databases or internal APIs, use a VPC Access Connector. This establishes a fully managed connection between your serverless environment and your VPC, ensuring traffic remains private. Sensitive data, such as API keys and database credentials, should never be hardcoded. Integrate with Google Cloud Secret Manager to securely store and retrieve secrets at runtime.
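At runtime, secrets are addressed by a fully qualified resource name. The helper below only builds that name so it runs without GCP credentials; the commented lines sketch what the actual call with the `google-cloud-secret-manager` client roughly looks like (project and secret IDs are placeholders):

```python
def secret_version_name(project_id, secret_id, version="latest"):
    """Build the fully qualified Secret Manager resource name for a secret version."""
    return f"projects/{project_id}/secrets/{secret_id}/versions/{version}"

# Runtime access would look roughly like this (requires google-cloud-secret-manager,
# and the service's IAM identity needs the Secret Accessor role on the secret):
#   from google.cloud import secretmanager
#   client = secretmanager.SecretManagerServiceClient()
#   response = client.access_secret_version(
#       request={"name": secret_version_name("my-project", "db-password")})
#   password = response.payload.data.decode("utf-8")

print(secret_version_name("my-project", "db-password"))
# projects/my-project/secrets/db-password/versions/latest
```

Pinning a numeric version instead of `latest` makes secret rotations explicit and auditable.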


Edge Cases and Failure Modes:

  • Cold Starts: When a service scales from zero instances or needs to bring up a new instance, there's a latency hit known as a cold start. Mitigate this by setting `min-instances` to 1 or more for critical, low-latency services, or by using a dedicated "CPU is always allocated" setting.

  • Long-running Requests: Cloud Run services have a request timeout (up to 60 minutes). Design your applications to handle requests efficiently and avoid blocking operations that might exceed this limit. For longer background tasks, consider offloading to Cloud Tasks or Cloud Workflows.

  • Connection Draining: When an instance is scaled down or terminated, Cloud Run sends a `SIGTERM` signal. Your application should gracefully handle this signal, completing in-flight requests and cleaning up resources within a short grace period (typically 10 seconds), preventing abrupt disconnections.


Summary & Key Takeaways


Cloud Run empowers engineers to deploy and manage containerized services with unparalleled agility and cost efficiency, especially for stateless HTTP workloads.


  • Do containerize effectively: Build lean Docker images and ensure your application listens on the `PORT` environment variable Cloud Run provides.

  • Do leverage revision management: Utilize traffic splitting for phased rollouts, A/B testing, and rapid rollbacks to maintain service stability.

  • Do optimize for cost and performance: Understand the trade-offs between `min-instances`, `concurrency`, and CPU allocation to balance latency and operational expense.

  • Do prioritize observability: Configure Cloud Monitoring dashboards and alerts for key metrics, and use Log Explorer for efficient debugging.

  • Do secure your deployments: Implement least-privilege IAM, use VPC Access Connectors for private network access, and integrate Secret Manager for credential handling.

  • Avoid ignoring cold starts: For user-facing or latency-sensitive services, anticipate and mitigate cold start impact through configuration.

WRITTEN BY

Deniz Şahin

GCP Certified Professional with developer relations experience. Electronics and Communication Engineering graduate, Istanbul Technical University. Writes on GCP, Cloud Run and BigQuery.
