Deploying a Scraper: Cron, Docker, Lambda

A scraper that works on your laptop is a prototype. Here’s how to get it running on a schedule without babysitting.

Cron on a small VPS — simplest, $5/month, fine for anything under a few minutes of runtime per run:

# crontab -e
15 3 * * * cd /opt/scraper && /opt/scraper/.venv/bin/python run.py >> /var/log/scraper.log 2>&1

Gotchas: cron jobs run with a minimal PATH and no shell profile. Always use absolute paths and invoke the venv's interpreter directly (as above) rather than relying on an activated environment. Redirect stderr too, or errors disappear silently.

Docker + systemd timer — cleaner than cron for anything serious. Define the scrape as a docker run in a systemd service and schedule it with a timer unit. You get logs via journalctl, restart policies, and environment isolation.
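A minimal sketch of the unit pair — the unit names, image tag, and env-file path are assumptions for illustration:

```ini
# /etc/systemd/system/scraper.service
[Unit]
Description=Nightly scrape

[Service]
Type=oneshot
ExecStart=/usr/bin/docker run --rm --env-file /opt/scraper/.env myorg/scraper:latest

# /etc/systemd/system/scraper.timer
[Unit]
Description=Schedule nightly scrape

[Timer]
OnCalendar=*-*-* 03:15:00
Persistent=true

[Install]
WantedBy=timers.target
```

Enable with `systemctl enable --now scraper.timer` and read logs with `journalctl -u scraper.service`. Persistent=true runs a missed scrape on boot if the machine was down at the scheduled time.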

AWS Lambda — great for short scrapes (< 15 min) triggered on a schedule. Package deps with a container image rather than a zip once you depend on lxml or Playwright; otherwise the native-binary pain is real:

FROM public.ecr.aws/lambda/python:3.12

COPY requirements.txt ${LAMBDA_TASK_ROOT}
RUN pip install -r requirements.txt -t ${LAMBDA_TASK_ROOT}

COPY handler.py ${LAMBDA_TASK_ROOT}
CMD ["handler.lambda_handler"]
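A sketch of the handler.py the Dockerfile expects — the target URL and the HTML structure are assumptions; a real scraper would use lxml or BeautifulSoup instead of a regex:

```python
import json
import re
import urllib.request

TARGET_URL = "https://example.com/listings"  # hypothetical target


def extract_titles(html: str) -> list[str]:
    # Naive extraction for illustration; swap in a proper parser in practice.
    return re.findall(r'<h2 class="title">(.*?)</h2>', html)


def lambda_handler(event, context):
    with urllib.request.urlopen(TARGET_URL, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    items = extract_titles(html)
    # Structured log line: CloudWatch can filter on item_count for alerting.
    print(json.dumps({"item_count": len(items)}))
    return {"statusCode": 200, "count": len(items)}
```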

Trigger with an EventBridge schedule.
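Wiring the schedule up from the CLI looks roughly like this — function name, rule name, region, and account ID are placeholders (note EventBridge cron has six fields and runs in UTC):

```shell
aws events put-rule --name nightly-scrape \
  --schedule-expression "cron(15 3 * * ? *)"
aws lambda add-permission --function-name scraper \
  --statement-id eventbridge --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/nightly-scrape
aws events put-targets --rule nightly-scrape \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:scraper"
```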

When to skip Lambda: runs longer than 15 minutes, a persistent browser needed across invocations, or residential proxy IPs required (Lambda egress IPs are very obviously AWS). For those, a small ECS Fargate task on a schedule works better.

One rule regardless of platform: alert on silent failures. A cron that’s been returning 0 items for three weeks because the selector broke is worse than no scraper at all. Log item counts and alert when they drop anomalously.
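A minimal sketch of that anomaly check, assuming you persist a history of daily item counts somewhere — the 50% threshold is an arbitrary starting point, not a recommendation:

```python
import statistics


def count_is_anomalous(history: list[int], today: int,
                       floor_ratio: float = 0.5) -> bool:
    """True if today's count fell below floor_ratio of the recent median.

    history: item counts from previous runs (e.g. the last 14 days).
    """
    if not history:
        # No baseline yet: only a zero-item run is suspicious.
        return today == 0
    baseline = statistics.median(history)
    return today < floor_ratio * baseline
```

Hook the True branch to whatever you already use for paging (SNS, Slack webhook, email). The median resists one-off spikes in a way a plain mean does not.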