Async Scraping with httpx and asyncio

For scraping, async often buys a 10–50× throughput gain for very little extra code: the workload is IO-bound, with most wall-clock time spent waiting on the network, which is the textbook use case. httpx keeps the same ergonomics as requests and has an async client built in.

import asyncio
import httpx

async def fetch(client, url, sem):
    # The semaphore bounds how many requests are in flight at once.
    async with sem:
        r = await client.get(url, timeout=10)
        r.raise_for_status()
        return url, r.text

async def main(urls):
    # Pool limits (sockets) and the semaphore (tasks) are separate knobs.
    limits = httpx.Limits(max_connections=20, max_keepalive_connections=20)
    sem = asyncio.Semaphore(10)   # hard cap on concurrency
    # http2=True needs the optional h2 dependency: pip install 'httpx[http2]'
    async with httpx.AsyncClient(limits=limits, http2=True) as client:
        tasks = [fetch(client, u, sem) for u in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

urls = [f"https://example.com/page/{i}" for i in range(100)]  # placeholder URLs
results = asyncio.run(main(urls))

Two mistakes people make their first time:

Unbounded concurrency. asyncio.gather over 10,000 URLs without a semaphore tries to fire all 10,000 requests at once, which hammers the target hard enough to get you banned in seconds. Always bound concurrency with an asyncio.Semaphore (as fetch does above) and set httpx.Limits on the connection pool.

Forgetting return_exceptions=True. Without it, the first 503 raises out of the gather and you lose the other 9,999 successful results. With it, failures come back inline as exception objects, so you need to separate them from the successes afterwards; see the sketch below.
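
Sorting the mixed results afterwards is a few lines. split_results here is an illustrative helper, not part of any library:

def split_results(results):
    # gather(..., return_exceptions=True) yields a mix of (url, text)
    # tuples and exception instances, in the same order as the input URLs.
    ok, failed = [], []
    for item in results:
        (failed if isinstance(item, Exception) else ok).append(item)
    return ok, failed

pages, errors = split_results(results)
print(f"{len(pages)} pages fetched, {len(errors)} failed")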

Pair async with a rate limiter (like aiolimiter) when you need pacing on top of concurrency: the semaphore controls how many requests run at once, the limiter controls how often new ones start. They're orthogonal, and polite scraping usually needs both, as in the sketch below.
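
A minimal sketch of that combination using aiolimiter's AsyncLimiter, shown as a variant of the fetch above (the 5-requests-per-second rate is an illustrative number, not a recommendation):

from aiolimiter import AsyncLimiter

limiter = AsyncLimiter(5, 1)  # at most 5 acquisitions per 1-second window

async def fetch(client, url, sem):
    async with sem:           # concurrency: how many requests are in flight
        async with limiter:   # pacing: how often new requests may start
            r = await client.get(url, timeout=10)
            r.raise_for_status()
            return url, r.text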

For parsing, use selectolax instead of BeautifulSoup in async code. It's 5–10× faster, and because its C parser releases the GIL you can push parsing into a thread pool without stalling the event loop, which matters when you're processing responses as they stream in.
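
A minimal sketch of that pattern, feeding the (url, text) pairs from the gather above through a thread pool; extract_links is an illustrative helper, and asyncio.to_thread needs Python 3.9+:

import asyncio
from selectolax.parser import HTMLParser

def extract_links(html):
    # Plain CPU work: parse the document and pull out anchor hrefs.
    tree = HTMLParser(html)
    return [a.attributes.get("href") for a in tree.css("a[href]")]

async def parse_all(pages):
    # to_thread keeps the event loop responsive while the C parser runs.
    return await asyncio.gather(
        *(asyncio.to_thread(extract_links, html) for _url, html in pages)
    )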