Deploying a Scraper: Cron, Docker, Lambda
A scraper that works on your laptop is a prototype. Here’s how to get it running on a schedule without babysitting.
Cron on a small VPS — simplest, $5/month, fine for anything under a few minutes of runtime per run:
# crontab -e
15 3 * * * cd /opt/scraper && /opt/scraper/.venv/bin/python run.py >> /var/log/scraper.log 2>&1
Gotchas: cron jobs run with a minimal PATH and no shell profile. Always use absolute paths and invoke the venv's python directly (as above) rather than counting on activation. Redirect stderr too, or errors disappear silently.
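If the box has a working mail setup, you can also set MAILTO and a sane PATH at the top of the crontab (the values here are illustrative):
# top of crontab
MAILTO=you@example.com
PATH=/usr/local/bin:/usr/bin:/bin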
Docker + systemd timer — cleaner than cron for anything serious. Define the scrape as a docker run in a systemd service and let a timer unit schedule it; a sketch of both units follows. You get logs via journalctl, restart policies, and environment isolation.
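A minimal sketch of the pair of units, assuming an image called myorg/scraper and an env file at /opt/scraper/.env (both placeholders):
# /etc/systemd/system/scraper.service
[Unit]
Description=Run the scraper once

[Service]
Type=oneshot
ExecStart=/usr/bin/docker run --rm --env-file /opt/scraper/.env myorg/scraper:latest

# /etc/systemd/system/scraper.timer
[Unit]
Description=Schedule the nightly scrape

[Timer]
OnCalendar=*-*-* 03:15:00
Persistent=true

[Install]
WantedBy=timers.target
Enable it with systemctl enable --now scraper.timer; journalctl -u scraper.service shows each run's output.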
AWS Lambda — great for short scrapes (under 15 minutes) triggered on a schedule. Package deps with a container image rather than a zip once you depend on lxml or Playwright; otherwise the native-binary pain is real:
# Dockerfile for a container-image Lambda
FROM public.ecr.aws/lambda/python:3.12
# install dependencies into the directory Lambda loads code from
COPY requirements.txt ${LAMBDA_TASK_ROOT}
RUN pip install -r requirements.txt -t ${LAMBDA_TASK_ROOT}
COPY handler.py ${LAMBDA_TASK_ROOT}
# "module.function" that the runtime invokes
CMD ["handler.lambda_handler"]
Trigger with an EventBridge schedule.
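One way to wire that up with the AWS CLI (rule name, function name, region, and account ID below are placeholders; note that EventBridge cron expressions take six fields):
aws events put-rule --name scraper-nightly \
  --schedule-expression "cron(15 3 * * ? *)"
aws lambda add-permission --function-name scraper \
  --statement-id scraper-nightly --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/scraper-nightly
aws events put-targets --rule scraper-nightly \
  --targets 'Id=1,Arn=arn:aws:lambda:us-east-1:123456789012:function:scraper'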
When to skip Lambda: the job runs longer than 15 minutes, needs a persistent browser across invocations, or needs residential proxy IPs (Lambda egress IPs are very obviously AWS). For those, a small ECS Fargate task on a schedule works better.
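The same EventBridge rule pattern works; the target just points at ECS instead. A sketch, where every ARN, subnet, and role is a placeholder and the task definition is assumed to already exist:
aws events put-targets --rule scraper-nightly-fargate --targets '[{
  "Id": "1",
  "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/scrapers",
  "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
  "EcsParameters": {
    "TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/scraper:1",
    "LaunchType": "FARGATE",
    "NetworkConfiguration": {
      "awsvpcConfiguration": {"Subnets": ["subnet-0abc1234"], "AssignPublicIp": "ENABLED"}
    }
  }
}]'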
One rule regardless of platform: alert on silent failures. A cron job that's been returning 0 items for three weeks because a selector broke is worse than no scraper at all. Log item counts per run and alert when they drop anomalously.
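A minimal sketch of that check, assuming each run appends one "<date> <count>" line to counts.log (the file name, webhook URL, and thresholds are all illustrative):
import statistics
import sys
import urllib.request

WEBHOOK_URL = "https://hooks.example.com/alerts"  # placeholder endpoint

def check(log_path="counts.log", window=14, floor=0.5):
    # one count per line; the count is the last whitespace-separated field
    with open(log_path) as f:
        counts = [int(line.split()[-1]) for line in f if line.strip()]
    if len(counts) < window + 1:
        return  # not enough history to judge yet
    baseline = statistics.median(counts[-window - 1:-1])
    if counts[-1] < floor * baseline:  # e.g. under 50% of the recent median
        msg = f"scraper anomaly: {counts[-1]} items vs recent median {baseline}"
        req = urllib.request.Request(
            WEBHOOK_URL, data=msg.encode(), headers={"Content-Type": "text/plain"}
        )
        urllib.request.urlopen(req)  # fire the alert
        sys.exit(1)  # non-zero exit so cron/systemd notice the failure too

if __name__ == "__main__":
    check()
Run it as the last step of each scheduled job so a broken selector pages you instead of rotting quietly.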