Simple Guide to Autoscaling with KEDA in Kubernetes

In this article, we walk through a simple, hands-on guide to autoscaling with KEDA in Kubernetes.

In data-intensive applications, efficiently managing background tasks like batch processing is critical — especially when demand fluctuates. Instead of provisioning static resources, why not let Kubernetes dynamically scale processing jobs based on workload?

In this article, we build a scalable CSV-processing pipeline where:

  • A Python job generates CSV files with dummy data.
  • Filenames are pushed to a Redis list.
  • A KEDA ScaledJob in Kubernetes watches Redis and spins up multiple worker pods to process those CSVs.
  • Docker Hub hosts our container images.
  • Everything is containerized, deployed, and autoscaled within a Kubernetes cluster.

Let’s walk through the entire process — from structuring the project to deploying it step-by-step.

Directory Structure

.
├── csv-processor.Dockerfile
├── data-generator.Dockerfile
├── k8s
│   ├── csv-processor-scaledjob.yaml
│   ├── data-generator-job.yaml
│   ├── data-pvc.yaml
│   ├── pvc-inspection.yaml
│   ├── redis-deployment.yaml
│   └── redis-service.yaml
├── requirements.txt
└── src
    ├── data_generator.py
    └── process_csv.py

Step-by-Step Process

#1. Install KEDA in your Kubernetes Cluster

Install KEDA in your cluster by applying the official release manifest:

kubectl apply -f https://github.com/kedacore/keda/releases/download/v2.10.0/keda-2.10.0.yaml
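
Once applied, the KEDA operator and metrics server run in the keda namespace. A quick sanity check (assuming the namespace created by the release manifest):

kubectl get pods -n keda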

#2. Python Scripts

data_generator.py:

  • Generates synthetic sales data from Jan to May 2020.
  • Creates one CSV file per day under /data/raw/.
  • Pushes the filename into a Redis list (csvs-to-process).

import os
import csv
import random
from datetime import datetime, timedelta
import redis

def generate_csv_files(start_date, end_date, directory):
    print("Starting generation of csv files")

    redis_host = os.getenv("REDIS_HOST")
    redis_list = os.getenv("REDIS_LIST")
    if not redis_host or not redis_list:
        print(f"Missing environment variables. REDIS_HOST: {redis_host}, REDIS_LIST: {redis_list}")
        exit()

    redis_client = redis.Redis(host=redis_host, port=6379, db=0)

    if not os.path.exists(directory):
        os.makedirs(directory)

    current_date = start_date
    while current_date < end_date:
        filename = current_date.strftime('%Y%m%d') + '.csv'
        data = []
        for _ in range(200):
            timestamp = current_date + timedelta(seconds=random.randint(0, 86400))
            item_id = random.randint(1, 5)
            quantity_sold = random.randint(1, 10)
            data.append([timestamp, item_id, quantity_sold])

        filepath = os.path.join(directory, filename)
        with open(filepath, 'w', newline='') as csv_file:
            writer = csv.writer(csv_file)
            writer.writerow(['timestamp', 'item_id', 'quantity_sold'])
            writer.writerows(data)

        redis_client.lpush(redis_list, filename)  # enqueue the filename for the worker pods
        current_date += timedelta(days=1)

    print("Finished generation of csv files")

if __name__ == '__main__':
    start_date = datetime(2020, 1, 1)
    end_date = datetime(2020, 6, 1)
    directory = os.path.join(os.path.dirname(__file__), '..', 'data/raw')
    generate_csv_files(start_date, end_date, directory)

process_csv.py:

  • Pops a filename from the Redis list.
  • Reads the CSV, aggregates total quantity sold by item ID.
  • Writes the result to /data/processed/ with an _aggregated_sales.csv suffix.

import os
import pandas as pd
import redis
from typing import Tuple

def read_csv_from_redis() -> Tuple[str, pd.DataFrame]:
    redis_host = os.getenv("REDIS_HOST")
    redis_list = os.getenv("REDIS_LIST")

    if not redis_host or not redis_list:
        print(f"Missing environment variables. REDIS_HOST: {redis_host}, REDIS_LIST: {redis_list}")
        exit()

    redis_client = redis.Redis(host=redis_host, port=6379, db=0)
    last_csv_name = redis_client.rpop(redis_list)  # pop the oldest filename (LPUSH + RPOP gives FIFO order)
    if not last_csv_name:
        print("No CSV files to process, exiting.")
        exit()

    last_csv_name = last_csv_name.decode("utf-8")
    full_filename = os.path.join(os.path.dirname(__file__), '..', 'data/raw', last_csv_name)
    df = pd.read_csv(full_filename)
    return last_csv_name, df

def aggregate_csv(filename_raw: str, df: pd.DataFrame) -> None:
    aggregated_data = df.groupby('item_id')['quantity_sold'].sum().reset_index()
    processed_dir = os.path.join(os.path.dirname(__file__), '..', 'data/processed')
    if not os.path.exists(processed_dir):
        os.makedirs(processed_dir)
    output_filename = os.path.join(processed_dir, f"{filename_raw.split('.')[0]}_aggregated_sales.csv")
    aggregated_data.to_csv(output_filename, index=False)
    print(f"Aggregated sales data written to '{output_filename}'.")

def main():
    filename, df = read_csv_from_redis()
    aggregate_csv(filename, df)

if __name__ == '__main__':
    main()

requirements.txt:

numpy==1.24.4
pandas==2.0.3
redis==4.6.0
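
Before containerizing anything, you can smoke-test both scripts locally against a throwaway Redis container. This is an optional local run, not part of the cluster setup:

# start a disposable Redis instance on localhost
docker run -d --name redis-local -p 6379:6379 redis:latest
pip install -r requirements.txt
REDIS_HOST=localhost REDIS_LIST=csvs-to-process python src/data_generator.py
REDIS_HOST=localhost REDIS_LIST=csvs-to-process python src/process_csv.py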

#3. Dockerfiles

data-generator.Dockerfile:

FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/data_generator.py ./src/data_generator.py

ENTRYPOINT ["python", "src/data_generator.py"]

csv-processor.Dockerfile:

FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/process_csv.py ./src/process_csv.py

ENTRYPOINT ["python", "src/process_csv.py"]

#4. Push Docker Images to Docker Hub

Make sure you have a Docker Hub account. Then:

# Login to Docker Hub
docker login

# Build images
docker build -t <dockerhub-username>/datagenerator:v1 -f data-generator.Dockerfile .
docker build -t <dockerhub-username>/csvprocessor:v1 -f csv-processor.Dockerfile .

# Push images
docker push <dockerhub-username>/datagenerator:v1
docker push <dockerhub-username>/csvprocessor:v1

#5. Kubernetes Manifests

redis-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:latest
        ports:
        - containerPort: 6379

redis-service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  selector:
    app: redis
  ports:
  - protocol: TCP
    port: 6379
    targetPort: 6379

kubectl apply -f k8s/redis-deployment.yaml
kubectl apply -f k8s/redis-service.yaml

Note the Redis Service's ClusterIP; you will need it in data-generator-job.yaml and csv-processor-scaledjob.yaml.
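
One way to grab it (assuming the redis Service created above):

kubectl get svc redis -o jsonpath='{.spec.clusterIP}'

Alternatively, pods in the same namespace can reach the Service by its DNS name (redis), which avoids hard-coding the IP.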

data-pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

kubectl apply -f k8s/data-pvc.yaml
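
The claim should eventually report a Bound status (with some storage classes it stays Pending until the first pod mounts it):

kubectl get pvc data-pvc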

pvc-inspection.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: pvc-inspection-pod
spec:
  containers:
  - name: busybox
    image: busybox
    command:
      - sleep
      - "3600"
    volumeMounts:
    - name: data-volume
      mountPath: /data
  volumes:
  - name: data-volume
    persistentVolumeClaim:
      claimName: data-pvc
  restartPolicy: Always

kubectl apply -f k8s/pvc-inspection.yaml

csv-processor-scaledjob.yaml:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: csv-processor
spec:
  jobTargetRef:
    template:
      spec:
        containers:
        - name: csv-processor
          image: <dockerhub-username>/csvprocessor:v1
          env:
          - name: REDIS_HOST
            value: "10.105.171.157"
          - name: REDIS_LIST
            value: csvs-to-process
          resources:
            limits:
              cpu: "0.2"
              memory: "100Mi"
            requests:
              cpu: "0.2"
              memory: "100Mi"
          volumeMounts:
          - name: data-volume
            mountPath: /app/data
        volumes:
        - name: data-volume
          persistentVolumeClaim:
            claimName: data-pvc
        restartPolicy: Never
  pollingInterval: 10
  maxReplicaCount: 5
  triggers:
  - type: redis
    metadata:
      address: 10.105.171.157:6379
      listName: csvs-to-process
      dataType: list

kubectl apply -f k8s/csv-processor-scaledjob.yaml
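
The redis trigger scales on the length of the list. KEDA's Redis lists scaler targets a listLength of 5 by default, so it asks for roughly one job per five queued files, capped at maxReplicaCount. If you want to scale more aggressively, for example one job per queued file, you can set listLength explicitly in the trigger metadata. This is an optional tweak, not part of the manifest above:

triggers:
- type: redis
  metadata:
    address: 10.105.171.157:6379
    listName: csvs-to-process
    dataType: list
    listLength: "1"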

data-generator-job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: data-generator
spec:
  template:
    spec:
      containers:
      - name: data-generator
        image: <dockerhub-username>/datagenerator:v1
        env:
        - name: REDIS_HOST
          value: "10.105.171.157"
        - name: REDIS_LIST
          value: csvs-to-process
        volumeMounts:
        - name: data-volume
          mountPath: /app/data
      restartPolicy: Never
      volumes:
      - name: data-volume
        persistentVolumeClaim:
          claimName: data-pvc

In both data-generator-job.yaml and csv-processor-scaledjob.yaml, make sure to:

  • Replace <dockerhub-username> with your actual Docker Hub username.
  • Replace the Redis ClusterIP (10.105.171.157 in this example) with your own, both in the REDIS_HOST env variable and in the trigger's address.

kubectl apply -f k8s/data-generator-job.yaml
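
Once the generator job completes, all the filenames should be sitting in the Redis list. A quick way to check the queue depth from inside the cluster (assuming the redis Deployment created earlier):

kubectl exec deploy/redis -- redis-cli llen csvs-to-process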

#6. Verify

kubectl get scaledjobs
kubectl get pods

Here you can see the worker pods being created.
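
You can also list the Job objects the ScaledJob creates; each one pops a single filename from the queue, processes it, and completes:

kubectl get jobs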

To check if the csv-processor jobs are actually generating the processed files, let’s use kubectl exec to open a shell inside the running pvc-inspection-pod:

kubectl exec -it pvc-inspection-pod -- /bin/sh

This opens a shell in the pvc-inspection-pod, where the PVC is mounted at /data. Go to the processed folder by running:

cd data/processed
ls

You should see .csv files like 20200101_aggregated_sales.csv if everything went well. To check the contents, use one of the file names from the data/processed folder with the command below.

cat 20200101_aggregated_sales.csv
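
The output contains one aggregated row per item_id. The exact totals will differ on every run since the input data is random, but the shape looks roughly like this (illustrative values):

item_id,quantity_sold
1,437
2,391
3,428
4,405
5,412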

Conclusion:

In this project, we built a fully containerized, event-driven CSV processing pipeline that automatically scales based on workload using KEDA. By combining Python, Redis, Docker, and Kubernetes, we created a clean and efficient system that generates, queues, and processes data — only when needed.

This setup not only saves resources but also keeps your system flexible and responsive. Whether you’re handling CSVs or other types of background jobs, this pattern can be a solid foundation for real-world, scalable applications.

Harish Reddy
