Scrapy Basics: When to Upgrade from requests
requests + BeautifulSoup is fine until you’re managing queues, retries, deduplication, concurrency, and pipeline
logic by hand. That’s when Scrapy starts earning its keep.
A minimal spider:
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/shop/"]
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,  # be polite: pause between requests
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
        "USER_AGENT": "my-crawler/1.0 (+https://example.com/contact)",
    }

    def parse(self, response):
        # One item per product card on the listing page.
        for card in response.css(".product-card"):
            yield {
                "url": response.urljoin(card.css("a::attr(href)").get()),
                "title": card.css("h3::text").get(),
                "price": card.css(".price::text").get(),
            }
        # Follow pagination; response.follow resolves relative URLs.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Run with scrapy crawl products -o out.jsonl. Note that -o appends to an existing file; on Scrapy 2.4+ you can use -O to overwrite instead.
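If you'd rather not scaffold a full Scrapy project, the same spider can run as a standalone script via CrawlerProcess. A minimal sketch, assuming the ProductsSpider class above lives in the same file:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    # FEEDS plays the role of the -o flag: write items as JSON Lines.
    "FEEDS": {"out.jsonl": {"format": "jsonlines"}},
})
process.crawl(ProductsSpider)
process.start()  # blocks until the crawl finishes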
What you’re getting over hand-rolled requests:
- Concurrency with per-domain throttling: CONCURRENT_REQUESTS_PER_DOMAIN is the setting that matters
- Automatic retries and redirect handling
- Request deduplication: the same URL won't be fetched twice within a crawl
- Item pipelines: cleaning, validation, and storage separated from extraction (see the sketch after this list)
- Telnet console + stats: real observability while the crawl runs
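To make the pipelines point concrete, here's a minimal sketch: a pipeline that drops items missing a title and normalizes the price string into a float. The PricePipeline name and the currency-parsing rule are illustrative, not part of the spider above:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class PricePipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # Validation: refuse items without a title.
        if not adapter.get("title"):
            raise DropItem("missing title")
        # Cleaning: "$1,299.00" -> 1299.0 (assumes a $-prefixed string).
        price = adapter.get("price")
        if price:
            adapter["price"] = float(price.strip().lstrip("$").replace(",", ""))
        return item

Enable it with ITEM_PIPELINES = {"myproject.pipelines.PricePipeline": 300} in settings (the number sets run order across pipelines; lower runs first). The spider's parse method never has to know any of this exists.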
The main reason not to use Scrapy: if you need a browser (JavaScript rendering), the integrations are awkward. Use
Playwright directly, or scrapy-playwright if you’re already invested in the Scrapy ecosystem.
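For reference, wiring scrapy-playwright in looks roughly like this. This is a sketch based on the library's documented setup, so check its current README for the exact settings:

import scrapy

class JsProductsSpider(scrapy.Spider):
    name = "js_products"
    custom_settings = {
        # Route requests through Playwright's browser instead of
        # Scrapy's default downloader.
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        # meta={"playwright": True} tells the handler to render this page.
        yield scrapy.Request(
            "https://example.com/shop/",
            meta={"playwright": True},
        )

    def parse(self, response):
        # response.text now contains the JavaScript-rendered DOM.
        yield {"title": response.css("title::text").get()}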