
URL: https://dev.to/uaslimcreate/render-postgresql-backups-to-s3-automating-disaster-recovery-for-multi-tenant-saas-without-vendor-1c71



Render PostgreSQL Backups to S3: Automating Disaster Recovery for Multi-Tenant SaaS Without Vendor Lock-In

I've been building CitizenApp on Render for two years. The platform is solid—zero complaints about uptime or developer experience. But I'll be honest: the first time I read Render's backup documentation, I realized I was entirely dependent on their infrastructure for data recovery.

That terrified me.

Here's the thing: Render's managed PostgreSQL comes with daily backups, but they're stored in their infrastructure. If Render experiences a catastrophic failure (unlikely, but possible), or if they discontinue the service, or if I need to migrate to another provider—I'm at their mercy. For CitizenApp, which handles sensitive tenant data across 200+ organizations, this isn't acceptable.

I'm not paranoid. I'm just someone who's read enough incident reports.

Why You Can't Sleep Well With Platform-Only Backups

Render's backup system is reliable for Render's purposes. They retain backups for 30 days, support point-in-time recovery (PITR), and honestly, their restoration process works smoothly. But there's a fundamental asymmetry: they control the backup, they control the restore, and they decide how long data lives.

Here's what keeps me awake:

  1. Vendor lock-in: Your data is only as portable as Render's export capabilities
  2. Compliance: SOC 2 audits often require evidence that backups exist outside the primary infrastructure
  3. Recovery time objectives (RTO): Platform outages mean you can't restore even if you have backups
  4. Cost surprises: Render might price backup storage differently next year
  5. Multi-region redundancy: A single region failure shouldn't cascade to your recovery capability

For CitizenApp's clients, especially enterprises, I need to answer this question with confidence: "If Render evaporates tomorrow, how fast can we restore your data?" The honest answer with platform-only backups is: "We're probably okay, but I can't guarantee it."

So I automated offsite backups. Here's how.

The Architecture: Dump, Upload, Verify

I prefer a push-based model over trying to hook into Render's backup system directly. Here's why:

  • Simplicity: One scheduled job, no complex WAL archiving setup
  • Portability: Works with any PostgreSQL, not Render-specific
  • Auditability: I can see exactly what backed up and when
  • Cost control: S3/R2 storage is predictable and cheap

The flow looks like this:

PostgreSQL on Render
 ↓
pg_dump (compressed)
 ↓
Encrypt with age
 ↓
Upload to Cloudflare R2 (or AWS S3)
 ↓
Verify checksum & log
 ↓
Alert if failed
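
The "Encrypt with age" stage in the flow above is not part of the script shown later, so here is a minimal sketch of how it could slot in between the dump and the upload. It assumes the age CLI is installed on the runner and that you hold an age public key; the helper names and the `.age` suffix are my own choices for illustration, not an established convention of this setup.

```python
import subprocess
from pathlib import Path


def build_age_command(src: Path, dst: Path, recipient: str) -> list[str]:
    """Build the argv that encrypts src into dst for a single age recipient."""
    return ["age", "--recipient", recipient, "--output", str(dst), str(src)]


def encrypt_with_age(src: Path, recipient: str) -> Path:
    """Encrypt src with the age CLI and return the path of the encrypted file.

    Assumes `age` is on PATH; raises CalledProcessError if encryption fails.
    """
    dst = src.with_name(src.name + ".age")  # hypothetical naming scheme
    subprocess.run(build_age_command(src, dst, recipient), check=True)
    return dst
```

If you adopt something like this, compute the checksum over the encrypted file rather than the plaintext dump, so the uploaded object can be verified without the private key.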

Setting Up Automated Backups

Step 1: Create an S3-Compatible Bucket

I use Cloudflare R2 because egress is free (S3 charges $0.09/GB). For CitizenApp, that's the difference between $10/month and $50/month at our current data size.

# AWS S3 (if you prefer)
aws s3 mb s3://citizenapp-postgres-backups-prod

# Cloudflare R2
# Done via dashboard, bucket: citizenapp-postgres-backups-prod

Create an IAM user with restricted permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::citizenapp-postgres-backups-prod",
        "arn:aws:s3:::citizenapp-postgres-backups-prod/*"
      ]
    }
  ]
}
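
If you manage more than one backup bucket (staging, production), generating the policy avoids copy-paste mistakes in the ARNs. This is a small sketch that reproduces the policy above for any bucket name; the function name is hypothetical.

```python
import json


def backup_bucket_policy(bucket: str) -> dict:
    """Return the restricted backup policy shown above for the given bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # No s3:DeleteObject: the backup key cannot delete objects.
                "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
                # ListBucket applies to the bucket ARN, Put/Get to the objects.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(backup_bucket_policy("citizenapp-postgres-backups-prod"), indent=2))
```

Note that `s3:PutObject` can still overwrite an existing key, so leaving out delete permission only protects old backups if object names never repeat or bucket versioning is enabled.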

Step 2: Python Backup Script

This runs daily via GitHub Actions (or a cron job elsewhere). I prefer GitHub Actions because it's free, version-controlled, and doesn't require managing another server.

# backup_postgres.py
import hashlib
import os
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path

import boto3


def backup_postgres(db_url: str, bucket: str, region: str = "us-east-1") -> bool:
    """
    Back up PostgreSQL to S3-compatible storage.
    Returns True if successful.
    """
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    backup_file = f"/tmp/citizenapp_backup_{timestamp}.sql.gz"
    checksum_file = f"{backup_file}.sha256"
    object_key = f"backups/{timestamp}.sql.gz"

    try:
        # Step 1: Dump with compression. With --format=plain, --compress
        # gzips the entire output stream.
        print(f"[{timestamp}] Starting pg_dump...")
        with open(backup_file, "wb") as dump_out:
            subprocess.run(
                [
                    "pg_dump",
                    "--format=plain",
                    "--compress=9",
                    "--no-password",
                    db_url,
                ],
                stdout=dump_out,
                stderr=subprocess.PIPE,
                check=True,
            )

        file_size_mb = Path(backup_file).stat().st_size / (1024 * 1024)
        print(f"Dump complete: {file_size_mb:.2f} MB")

        # Step 2: Calculate checksum
        print("Calculating checksum...")
        sha256_hash = hashlib.sha256()
        with open(backup_file, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                sha256_hash.update(chunk)

        checksum = sha256_hash.hexdigest()
        # sha256sum format is "<hash>  <filename>" (two spaces). Use the
        # uploaded object's name so `sha256sum -c` works on the downloaded copy.
        with open(checksum_file, "w") as f:
            f.write(f"{checksum}  {os.path.basename(object_key)}\n")

        # Step 3: Upload to S3
        print(f"Uploading to S3 ({bucket})...")
        s3_client = boto3.client(
            "s3",
            region_name=region,
            endpoint_url=os.getenv("S3_ENDPOINT_URL"),  # For R2
            aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
            aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
        )

        s3_client.upload_file(
            backup_file,
            bucket,
            object_key,
            ExtraArgs={
                "Metadata": {
                    "checksum": checksum,
                    "size-mb": str(file_size_mb),
                }
            },
        )

        s3_client.upload_file(
            checksum_file,
            bucket,
            f"{object_key}.sha256",
        )

        print(f"✓ Backup successful: {timestamp}")
        print(f"  Checksum: {checksum}")
        return True

    except subprocess.CalledProcessError as e:
        print(f"✗ pg_dump failed: {e.stderr.decode()}")
        return False
    except Exception as e:
        print(f"✗ Backup failed: {str(e)}")
        return False
    finally:
        # Step 4: Cleanup, whether or not the backup succeeded
        for path in (backup_file, checksum_file):
            Path(path).unlink(missing_ok=True)


if __name__ == "__main__":
    db_url = os.getenv("DATABASE_URL")
    bucket = os.getenv("BACKUP_BUCKET", "citizenapp-postgres-backups-prod")
    region = os.getenv("BACKUP_REGION", "us-east-1")

    if not db_url:
        print("✗ DATABASE_URL is not set")
        sys.exit(1)

    success = backup_postgres(db_url, bucket, region)
    sys.exit(0 if success else 1)
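
The flow earlier lists a "Verify checksum" stage, while the script only computes and uploads the hash. One way to close that loop is to download the object and its `.sha256` sidecar and compare them. This is a sketch under my own assumptions: the function names are hypothetical, and it reads only the first 64 hex characters of the sidecar, so it does not depend on how the filename is written after the hash.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Stream the file through SHA-256 so large dumps never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(backup_path: Path, sidecar_path: Path) -> bool:
    """Return True if the backup's SHA-256 matches the hash in the sidecar."""
    # The hex digest is the first 64 characters of the sidecar's first line.
    expected = sidecar_path.read_text().strip()[:64].lower()
    return len(expected) == 64 and sha256_of(backup_path) == expected
```

Running this against a freshly downloaded copy, rather than the local file the hash was computed from, is what actually exercises the upload.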

Step 3: GitHub Actions Workflow

# .github/workflows/backup-postgres.yml
name: Backup PostgreSQL to S3

on:
  schedule:
    - cron: "0 2 * * *" # 2 AM UTC daily
  workflow_dispatch:

jobs:
  backup:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install boto3 psycopg2-binary

      - name: Run backup
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
          AWS_ACCESS_KEY_ID: ${{ secrets.BACKUP_AWS_ACCESS_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.BACKUP_AWS_SECRET_KEY }}
          S3_ENDPOINT_URL: ${{ secrets.BACKUP_S3_ENDPOINT }}
          BACKUP_BUCKET: ${{ secrets.BACKUP_BUCKET }}
        run: python scripts/backup_postgres.py

      - name: Notify on failure
        if: failure()
        # v7 runs on Node 20, which provides a global fetch
        uses: actions/github-script@v7
        with:
          script: |
            const slack_webhook = "${{ secrets.SLACK_WEBHOOK_BACKUPS }}";
            await fetch(slack_webhook, {
              method: "POST",
              headers: { "Content-Type": "application/json" },
              body: JSON.stringify({
                text: "⚠️ PostgreSQL backup failed for CitizenApp"
              })
            });
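
The failure notification only fires when the job runs and fails. A run that never starts cannot report its own failure, so it can help to check independently that the newest backup is recent. The sketch below assumes the `backups/<YYYYMMDD_HHMMSS>.sql.gz` key layout used by the script; the function names and the 26-hour threshold are my own choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional

KEY_PREFIX = "backups/"
KEY_SUFFIX = ".sql.gz"


def parse_backup_time(key: str) -> Optional[datetime]:
    """Extract the UTC timestamp from a backup key, or None if it is not one."""
    if not (key.startswith(KEY_PREFIX) and key.endswith(KEY_SUFFIX)):
        return None  # e.g. the .sha256 sidecar objects
    stamp = key[len(KEY_PREFIX):-len(KEY_SUFFIX)]
    try:
        return datetime.strptime(stamp, "%Y%m%d_%H%M%S").replace(tzinfo=timezone.utc)
    except ValueError:
        return None


def newest_backup_is_fresh(
    keys: Iterable[str],
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # daily schedule plus some slack
) -> bool:
    """Return True if at least one backup key is no older than max_age."""
    times = [t for t in map(parse_backup_time, keys) if t is not None]
    return bool(times) and now - max(times) <= max_age
```

The keys could come from boto3's `list_objects_v2` with `Prefix="backups/"`, which the `s3:ListBucket` permission in the policy covers. Run the check from somewhere other than the backup workflow itself, so both do not fail silently together.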

Testing Recovery (The Part Everyone Skips)

Here's the critical thing: a backup you've never restored is just hope. I test recovery quarterly.


# Download and verify
aws s3 cp s3://citizenapp-postgres-backups-prod/backups/20240115_020000.sql.gz .
sha256sum -c <(echo "abc123...  20240115_020000.sql.gz")

# Restore to a test database
gunzip -c 20240115_020000.sql.