Data processing pipelines that run on managed infrastructure are a better default than self-managed servers for most teams. You write the processing logic; the platform handles scaling, patching, and availability. Google Cloud Run is a strong choice for this pattern because it runs any container, scales from zero, and integrates cleanly with the rest of Google Cloud.
This post walks through building a serverless data processing pipeline on Cloud Run, covering ingestion, processing, storage, and event-driven triggers.
Why Serverless for Data Processing?
Serverless suits data pipelines because data workloads tend to be bursty: files arrive in batches, events spike during business hours, and some pipelines run once a night. With serverless, you pay for compute only while something is being processed. A pipeline that has scaled to zero incurs no compute charges.
The tradeoff is cold start latency and stateless execution. Cloud Run handles both reasonably well, but steps that need sub-second startup require a minimum instance count, and steps that need persistent state must keep it in an external store rather than in instance memory.
Google Cloud Run
Cloud Run runs stateless containers that respond to HTTP requests. You push a container image, configure memory and concurrency limits, and Google handles the rest. It scales to zero when idle and back up quickly under load. Any language or framework that can run in a container works.
Key capabilities:
- No infrastructure to manage.
- Scales from zero to many instances based on incoming request volume.
- Integrates with Cloud Pub/Sub and Eventarc for event-driven triggers.
- Supports any language or runtime that can be containerized.
Building the Pipeline
Step 1: Setting Up Your Environment
Before starting, you need:
- A Google Cloud account with billing enabled.
- Google Cloud SDK installed.
- The Cloud Run, Cloud Build, Pub/Sub, and Eventarc APIs enabled in your project.
Step 2: Writing the Data Processing Application
A simple Node.js and Express service that receives data via HTTP POST:
app.js:
const express = require('express');

const app = express();
app.use(express.json());

// Return a copy of the input with a processing timestamp added.
// The input object is not mutated.
function processData(data) {
  return { ...data, processedAt: new Date().toISOString() };
}

app.post('/', async (req, res) => {
  try {
    const processedData = processData(req.body);
    // Optionally store or forward processedData here,
    // for example publish to a Pub/Sub topic or write to BigQuery.
    res.status(200).json(processedData);
  } catch (err) {
    console.error('Processing failed:', err);
    // A non-2xx response signals failure so the caller can retry.
    res.status(500).send('Processing failed');
  }
});

const PORT = process.env.PORT || 8080;
app.listen(PORT, () => {
  console.log(`Service listening on port ${PORT}`);
});
Step 3: Containerizing the Application
Dockerfile:
# Use the official Node.js 22 slim image as the base image
# (Node.js 14 has reached end of life)
FROM node:22-slim
# Create and set the working directory
WORKDIR /usr/src/app
# Copy package.json and package-lock.json
COPY package*.json ./
# Install production dependencies only
RUN npm install --omit=dev
# Copy the rest of the application code
COPY . .
# Document the port; Cloud Run sets PORT at runtime
ENV PORT=8080
EXPOSE 8080
# Start the application
CMD [ "node", "app.js" ]
Step 4: Building and Deploying to Cloud Run
# Build the container image
# Replace PROJECT_ID with your Google Cloud project ID
gcloud builds submit --tag gcr.io/PROJECT_ID/data-processor
# Deploy the image to Cloud Run
# Replace REGION with your preferred deployment region
gcloud run deploy data-processor \
  --image gcr.io/PROJECT_ID/data-processor \
  --platform managed \
  --region REGION \
  --no-allow-unauthenticated
The --no-allow-unauthenticated flag keeps the endpoint private, so only identities with the Cloud Run Invoker role can call it. Container Registry (gcr.io) has been deprecated in favor of Artifact Registry; on newer projects, use an image path of the form REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/data-processor.
After deployment, Google provides a URL for your Cloud Run service.
Step 5: Connecting Event Sources
Triggering via Pub/Sub
Create a Pub/Sub topic and a subscription that pushes messages to your Cloud Run service.
gcloud pubsub topics create data-topic
Create a service account for Pub/Sub to use when calling the service, and grant it the Cloud Run Invoker role on that service:
gcloud iam service-accounts create pubsub-invoker \
  --display-name "Pub/Sub Invoker"
gcloud run services add-iam-policy-binding data-processor \
  --region=REGION \
  --member="serviceAccount:pubsub-invoker@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/run.invoker"
Create a push subscription pointed at Cloud Run:
gcloud pubsub subscriptions create data-subscription \
  --topic=data-topic \
  --push-endpoint=YOUR_CLOUD_RUN_URL \
  --push-auth-service-account=pubsub-invoker@PROJECT_ID.iam.gserviceaccount.com
Replace YOUR_CLOUD_RUN_URL with the service URL from Step 4.
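A push subscription does not deliver the raw payload. Pub/Sub wraps it in a JSON envelope whose message.data field is base64-encoded. The sketch below shows one way the handler might unwrap it, assuming publishers send JSON. The function name decodePushEnvelope and the sample values are illustrative, not part of the original service.

```javascript
// Decode the body of a Pub/Sub push request.
// Returns { id, attributes, payload }, or null if the body is not a usable envelope.
function decodePushEnvelope(body) {
  const message = body && body.message;
  if (!message || typeof message.data !== 'string') {
    return null;
  }
  const text = Buffer.from(message.data, 'base64').toString('utf8');
  let payload;
  try {
    payload = JSON.parse(text); // assumption: publishers send JSON payloads
  } catch (err) {
    return null;
  }
  return {
    id: message.messageId,
    attributes: message.attributes || {},
    payload,
  };
}

// Sample envelope shaped like a Pub/Sub push request body (values are made up).
const sampleEnvelope = {
  message: {
    data: Buffer.from(JSON.stringify({ sensor: 'a1', value: 42 })).toString('base64'),
    messageId: '1234567890',
    attributes: { source: 'demo' },
  },
  subscription: 'projects/PROJECT_ID/subscriptions/data-subscription',
};

const decoded = decodePushEnvelope(sampleEnvelope);
console.log(decoded.id, decoded.payload);
```

In the Express handler, this would be called with req.body. Pub/Sub treats a 2xx response as an acknowledgement and redelivers the message on any other status, so the status code the handler returns decides whether a message is retried.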
Cloud Storage Events via Eventarc
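To trigger processing when a file is uploaded to a Cloud Storage bucket:
gcloud eventarc triggers create storage-trigger \
  --location=REGION \
  --destination-run-service=data-processor \
  --destination-run-region=REGION \
  --event-filters="type=google.cloud.storage.object.v1.finalized" \
  --event-filters="bucket=BUCKET_NAME" \
  --service-account=pubsub-invoker@PROJECT_ID.iam.gserviceaccount.com
Replace BUCKET_NAME with the bucket to watch. The trigger location must match the bucket's location, and the trigger's service account needs the Eventarc Event Receiver role (roles/eventarc.eventReceiver), plus the Cloud Run Invoker role if the service requires authentication. Any object finalized in that bucket now triggers your Cloud Run service.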
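Eventarc delivers these events as CloudEvents over HTTP. The sketch below assumes binary content mode, where event attributes arrive in ce-* headers and the JSON body describes the object with bucket and name fields. The function name parseStorageEvent and the sample values are hypothetical; check the payload your trigger actually sends before relying on these field names.

```javascript
const FINALIZED_TYPE = 'google.cloud.storage.object.v1.finalized';

// Extract the object location from an Eventarc Cloud Storage event.
// `headers` uses lowercase keys, as Express provides in req.headers.
// Returns null for other event types or incomplete bodies.
function parseStorageEvent(headers, body) {
  if (headers['ce-type'] !== FINALIZED_TYPE) {
    return null;
  }
  if (!body || !body.bucket || !body.name) {
    return null;
  }
  return {
    eventId: headers['ce-id'],
    bucket: body.bucket,
    name: body.name,
    uri: `gs://${body.bucket}/${body.name}`,
  };
}

// Made-up sample request for illustration.
const sampleHeaders = { 'ce-id': 'evt-001', 'ce-type': FINALIZED_TYPE };
const sampleBody = { bucket: 'my-bucket', name: 'uploads/report.csv' };

const storageEvent = parseStorageEvent(sampleHeaders, sampleBody);
console.log(storageEvent.uri);
```

The event carries only the object's location, not its contents, so the service still has to read the file from Cloud Storage before processing it.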
Step 6: Scaling and Configuration
Set concurrency and memory limits:
gcloud run services update data-processor \
  --concurrency=80 \
  --memory=512Mi
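To reason about what a concurrency setting implies, a rough estimate is that the number of requests in flight equals the arrival rate multiplied by the average latency (Little's law). Dividing that by the per-instance concurrency gives an approximate instance count. The function name and the traffic figures below are illustrative assumptions, not measurements.

```javascript
// Rough estimate of how many instances are needed for a steady load.
// inFlight = requests per second * average seconds per request (Little's law).
function estimateInstances(requestsPerSecond, avgLatencySeconds, concurrency) {
  const inFlight = requestsPerSecond * avgLatencySeconds;
  return Math.ceil(inFlight / concurrency);
}

// Hypothetical load: 400 requests/s at 0.5 s each with concurrency 80.
// 400 * 0.5 = 200 requests in flight; 200 / 80 = 2.5, rounded up to 3.
const needed = estimateInstances(400, 0.5, 80);
console.log(needed);
```

This ignores CPU and memory pressure inside each instance. A CPU-heavy transformation may need a much lower concurrency value than 80 to keep latency stable.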
Set environment variables:
gcloud run services update data-processor \
  --update-env-vars "ENV=production,DEBUG=false"
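Inside the service, these values arrive as strings in process.env. A minimal sketch of reading them with defaults follows. The loadConfig helper and its default values are assumptions for illustration; only ENV, DEBUG, and PORT come from the commands and code above.

```javascript
// Build a typed config object from environment variables.
// Environment values are always strings, so booleans and numbers are parsed explicitly.
function loadConfig(env = process.env) {
  return {
    environment: env.ENV || 'development', // assumed default
    debug: env.DEBUG === 'true',
    port: Number(env.PORT) || 8080,
  };
}

// Example with the values set by the gcloud command above.
const config = loadConfig({ ENV: 'production', DEBUG: 'false' });
console.log(config);
```

Comparing DEBUG against the string 'true' matters here: the string 'false' is truthy in JavaScript, so a plain truthiness check would enable debug mode by mistake.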
Step 7: Monitoring and Logging
Cloud Logging collects application logs automatically. Cloud Monitoring provides dashboards and alerting for CPU utilization, memory usage, and request latency. Set up alerts for error rate spikes and latency increases before they become user-visible problems.
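Logs are easier to filter and alert on when they are structured. Cloud Run forwards stdout and stderr to Cloud Logging, and a line containing a JSON object with severity and message fields is parsed as a structured entry. The helper names below are hypothetical, and the extra fields are examples only.

```javascript
// Format one structured log line. Extra fields become searchable
// attributes on the log entry.
function formatLogEntry(severity, message, fields = {}) {
  return JSON.stringify({ severity, message, ...fields });
}

function logInfo(message, fields) {
  console.log(formatLogEntry('INFO', message, fields));
}

function logError(message, fields) {
  console.error(formatLogEntry('ERROR', message, fields));
}

// Example usage with made-up field values.
logInfo('Record processed', { messageId: '1234567890', durationMs: 12 });
logError('Processing failed', { messageId: '1234567891', reason: 'invalid payload' });
```

With severity set on each line, an alert on error rate can filter on severity rather than matching message text.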
Best Practices
Cold Start Mitigation
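Keep at least one instance running to avoid cold starts for latency-sensitive pipelines. A warm instance is billed even while idle, so this trades some of the scale-to-zero savings for lower latency:
gcloud run services update data-processor \
  --min-instances=1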
Use slim base images to reduce container startup time.
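One commonly suggested complement is to keep startup work small by creating expensive clients lazily and reusing them across requests on a warm instance. The sketch below uses a hypothetical createClient function as a stand-in for a real database or API client. It shows the memoization pattern only and is not a specific library's API.

```javascript
// Hypothetical stand-in for an expensive client setup (database, API client, ...).
let initCount = 0;
function createClient() {
  initCount += 1;
  return Promise.resolve({
    send: async (record) => ({ ok: true, record }),
  });
}

// Module-level cache: created on first use, then reused by later
// requests handled by the same instance.
let clientPromise = null;
function getClient() {
  if (clientPromise === null) {
    clientPromise = createClient();
  }
  return clientPromise;
}

async function handle(record) {
  const client = await getClient();
  return client.send(record);
}

// Two requests share a single initialization.
Promise.all([handle({ id: 1 }), handle({ id: 2 })]).then((results) => {
  console.log(results.length, 'records sent,', initCount, 'client initialization');
});
```

Caching the promise rather than the resolved client means concurrent first requests wait on the same initialization instead of each starting their own.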
Security
Grant service accounts only the permissions they need. Use VPC Service Controls for pipelines that handle sensitive data.
Error Handling and Idempotency
Design processing logic to handle duplicate messages safely. Pub/Sub guarantees at-least-once delivery, so a message can arrive more than once. Retries should produce the same result regardless of how many times they run.
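One way to make a handler tolerate redelivery is to record the ID of each message once it has been processed and skip IDs already seen. The sketch below keeps IDs in an in-memory Set purely for illustration. Cloud Run instances do not share memory and can be replaced at any time, so a real pipeline would need a durable store for this check. The processOnce name is hypothetical.

```javascript
// Assumption: in-memory set for illustration only. It is lost on restart
// and is not shared between instances.
const processedIds = new Set();

// Run `handler` for a message unless its ID has already been processed.
// The ID is recorded only after the handler succeeds, so a failed
// attempt can still be retried.
function processOnce(messageId, payload, handler) {
  if (processedIds.has(messageId)) {
    return { duplicate: true };
  }
  const result = handler(payload);
  processedIds.add(messageId);
  return { duplicate: false, result };
}

// Simulate the same message being delivered twice.
let handlerCalls = 0;
const double = (payload) => {
  handlerCalls += 1;
  return payload.value * 2;
};

const first = processOnce('msg-1', { value: 21 }, double);
const second = processOnce('msg-1', { value: 21 }, double);
console.log(first, second, handlerCalls);
```

An alternative that needs no lookup is to make the write itself idempotent, for example by using the message ID as the key of the stored record so that a repeated write overwrites the same row.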
Real-World Use Cases
Real-time analytics from IoT devices or user events, ETL processes loading into BigQuery, image and video processing triggered by file uploads, and machine learning preprocessing pipelines all map naturally onto this architecture. The pattern is the same in each case: an event triggers a Cloud Run invocation, the service processes the payload, and the result is stored or forwarded.
Conclusion
Cloud Run removes most of the operational burden from running data processing pipelines. You write a container, define the event sources, and let Google handle scaling, availability, and infrastructure. The cost model, paying only for actual compute, makes it practical for pipelines with variable or unpredictable load. The main discipline required is designing processing steps to be stateless and idempotent, which is good practice in any distributed system.