Rate Limiting and Being a Polite Scraper
A scraper that hammers a server at 500 requests/sec is a denial-of-service attack with extra steps. Pacing isn’t just ethics; it’s self-interest: gentle scrapers rarely get blocked.
The simplest approach is a fixed delay with jitter, so your requests never land in a perfectly regular, easy-to-fingerprint pattern:
import time, random

def throttled_get(session, url, min_delay=1.0, max_delay=2.5):
    # One request, then a random pause so the traffic never looks metronomic
    resp = session.get(url, timeout=10)
    time.sleep(random.uniform(min_delay, max_delay))
    return resp
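Dropped into a crawl loop it looks like this; the session setup and URL list below are stand-ins for whatever you are actually crawling:

import requests

session = requests.Session()
urls = ["https://example.com/page/1", "https://example.com/page/2"]  # placeholder targets

for url in urls:
    resp = throttled_get(session, url)
    # ...parse resp.text here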
For anything serious, use a token-bucket limiter such as pyrate_limiter:
from pyrate_limiter import Duration, Rate, Limiter

# 30 req/min, hard cap; max_delay makes try_acquire wait for a free slot
# instead of raising (pyrate_limiter v3 API)
limiter = Limiter(Rate(30, Duration.MINUTE), max_delay=Duration.MINUTE)

for url in urls:
    limiter.try_acquire("scrape")  # blocks until the budget has room
    resp = session.get(url, timeout=10)
This holds under concurrency: threads or async tasks that share the same Limiter instance draw from one rate budget.
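A rough sketch of the threaded case, reusing the limiter and urls from above; the fetch helper and worker count are illustrative, and plain requests.get is used per call to keep the example simple:

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    limiter.try_acquire("scrape")         # every worker draws from the same bucket
    return requests.get(url, timeout=10)

with ThreadPoolExecutor(max_workers=4) as pool:
    responses = list(pool.map(fetch, urls))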
Honor Retry-After on 429 and 503 responses. The server is literally telling you how long to wait:
if resp.status_code in (429, 503):
    # Retry-After is typically a number of seconds; default to 60 if it's missing
    wait = int(resp.headers.get("Retry-After", 60))
    time.sleep(wait)
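One wrinkle: Retry-After can also arrive as an HTTP date rather than a number of seconds. A small standard-library helper (the name retry_after_seconds is mine) that copes with both forms might look like:

from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def retry_after_seconds(resp, default=60):
    # Handles both the seconds form ("120") and the HTTP-date form
    value = resp.headers.get("Retry-After")
    if value is None:
        return default
    if value.strip().isdigit():
        return int(value)
    try:
        dt = parsedate_to_datetime(value)
        return max(0, (dt - datetime.now(timezone.utc)).total_seconds())
    except (TypeError, ValueError):
        return default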
And read robots.txt before you start. Not because it’s legally binding (it mostly isn’t), but because it tells you which paths the site considers expensive. Crawl-delay is a hint worth respecting even though requests won’t enforce it for you.
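The standard library will do the reading for you; a minimal sketch, where the user agent string and URLs are placeholders:

import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

agent = "my-scraper"  # placeholder; use whatever User-Agent you actually send
if rp.can_fetch(agent, "https://example.com/some/path"):
    delay = rp.crawl_delay(agent) or 1.0  # honor Crawl-delay, else a modest default
    time.sleep(delay)
    # ...fetch the page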
A scraper that runs slightly slower but finishes the job is infinitely better than a fast one that gets banned after 200 requests.