Async Scraping with httpx and asyncio
For scraping, async is basically a free 10–50× speedup: IO-bound workloads are the textbook use case. httpx has the
same ergonomics as requests, with an async client built in.
import asyncio
import httpx

async def fetch(client, url, sem):
    async with sem:
        r = await client.get(url, timeout=10)
        r.raise_for_status()
        return url, r.text

async def main(urls):
    limits = httpx.Limits(max_connections=20, max_keepalive_connections=20)
    sem = asyncio.Semaphore(10)  # hard cap on concurrency
    # http2=True needs the optional extra: pip install "httpx[http2]"
    async with httpx.AsyncClient(limits=limits, http2=True) as client:
        tasks = [fetch(client, u, sem) for u in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

urls = ["https://example.com/1", "https://example.com/2"]  # whatever you're scraping
results = asyncio.run(main(urls))
Two mistakes people make their first time:

1. Unbounded concurrency. asyncio.gather over 10,000 URLs without a semaphore opens 10,000 sockets and gets you
banned in 30 seconds. Always bound with asyncio.Semaphore or httpx.Limits.

2. Forgetting return_exceptions=True. Without it, one 503 kills the whole gather and you lose the 9,999 successful
results (see the sketch after this list for handling the mixed output).
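With return_exceptions=True, gather hands back a mixed list of values and exception objects, so the failures have to
be filtered out afterwards. A minimal sketch, assuming the results list from the example above:

pages, failures = [], []
for item in results:
    if isinstance(item, Exception):
        failures.append(item)  # e.g. httpx.HTTPStatusError from raise_for_status()
    else:
        pages.append(item)     # fetch() returns (url, text) on success

print(f"{len(pages)} ok, {len(failures)} failed")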
Pair async with a rate limiter (like aiolimiter) when you need pacing on top of concurrency — concurrency controls
how many at once, pacing controls how often. They’re orthogonal.
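A minimal sketch of stacking the two, using aiolimiter's AsyncLimiter; the 5-per-second budget and semaphore size are
assumptions to tune per target:

import asyncio
from aiolimiter import AsyncLimiter

limiter = AsyncLimiter(5, 1)  # pacing: at most 5 new requests per 1-second window
sem = asyncio.Semaphore(10)   # concurrency: at most 10 requests in flight at once

async def fetch(client, url):
    async with sem:          # waits if 10 requests are already running
        async with limiter:  # waits if this second's budget is spent
            r = await client.get(url, timeout=10)
            r.raise_for_status()
            return url, r.text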
For parsing, use selectolax instead of BeautifulSoup in async code — it’s 5–10× faster and releases the GIL, which
actually matters when you’re processing responses as they stream in.
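A minimal selectolax sketch for pulling links out of fetched HTML; the a[href] selector is a placeholder for whatever
your target markup needs:

from selectolax.parser import HTMLParser

def extract_links(html):
    tree = HTMLParser(html)  # parses the full document once, in C
    return [
        (node.text(strip=True), node.attributes.get("href"))
        for node in tree.css("a[href]")  # CSS selectors, like BeautifulSoup's select()
    ]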