Scrapy Basics: When to Upgrade from requests

requests + BeautifulSoup is fine until you’re managing queues, retries, deduplication, concurrency, and pipeline logic by hand. That’s when Scrapy starts earning its keep.
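
For a sense of what "by hand" means, here is a sketch of the hand-rolled version; the URL is a placeholder, and extraction, pagination, throttling, and storage would all still need to be written:

import time
from collections import deque

import requests
from bs4 import BeautifulSoup

seen = set()                                  # deduplication by hand
queue = deque(["https://example.com/shop/"])  # queue by hand

while queue:
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)

    for attempt in range(3):                  # retries by hand
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            break
        except requests.RequestException:
            time.sleep(2 ** attempt)          # crude backoff
    else:
        continue                              # all retries failed; skip this URL

    soup = BeautifulSoup(resp.text, "html.parser")
    # ...extraction, pagination, throttling, and storage all live here too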

A minimal spider:

import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/shop/"]

    custom_settings = {
        # Throttle politely and identify the crawler so site operators can reach you
        "DOWNLOAD_DELAY": 1.0,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
        "USER_AGENT": "my-crawler/1.0 (+https://example.com/contact)",
    }

    def parse(self, response):
        # One item per product card; urljoin resolves relative hrefs
        for card in response.css(".product-card"):
            yield {
                "url": response.urljoin(card.css("a::attr(href)").get()),
                "title": card.css("h3::text").get(),
                "price": card.css(".price::text").get(),
            }

        # Follow pagination until there is no "next" link
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run with scrapy crawl products -o out.jsonl (-o appends to the feed file; use -O to overwrite it).

What you’re getting over hand-rolled requests:

  • Concurrency with per-domain throttling: CONCURRENT_REQUESTS_PER_DOMAIN is the setting that matters
  • Automatic retries and redirect handling: retry counts and trigger status codes are configurable (see the settings sketch after this list)
  • Request deduplication: duplicate URLs are filtered by fingerprint within a run, so the same page won't be scraped twice (opt out per request with dont_filter=True)
  • Item pipelines: cleaning, validation, and storage separated from extraction (a minimal pipeline follows this list)
  • Telnet console and crawl stats: real observability while the crawl runs
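
On the retry point: the relevant knobs live in settings. A minimal sketch, assuming a project-level settings.py; the values here are illustrative, not recommendations:

# settings.py (or per-spider via custom_settings)
RETRY_ENABLED = True                            # on by default
RETRY_TIMES = 3                                 # extra attempts after the first request
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]    # which responses trigger a retry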
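
And the pipeline point: extraction yields raw items, and pipelines transform or reject them before they reach storage. A minimal sketch, assuming a project named myproject and the dollar-formatted prices above; the parsing logic is a placeholder for whatever cleaning your data actually needs:

# myproject/pipelines.py
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class PricePipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        price = adapter.get("price")
        if not price:
            raise DropItem("missing price")   # dropped items never reach the feed
        # Placeholder normalization: "$1,299.00" -> 1299.0
        adapter["price"] = float(price.lstrip("$").replace(",", ""))
        return item

Enable it with ITEM_PIPELINES = {"myproject.pipelines.PricePipeline": 300} in settings; the number sets the order when several pipelines run.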

The main reason not to use Scrapy: if you need a browser (JavaScript rendering), the integrations are awkward. Use Playwright directly, or scrapy-playwright if you’re already invested in the Scrapy ecosystem.
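
If rendering is what you need, the direct route is short. A minimal sketch with Playwright's sync API, reusing the placeholder URL and selector from above:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/shop/")
    page.wait_for_selector(".product-card")   # wait for JS-rendered content
    html = page.content()                     # the rendered DOM, ready for parsing
    browser.close()

scrapy-playwright, by contrast, keeps the Scrapy request flow: after registering its download handler in settings, you mark individual requests for browser rendering with meta={"playwright": True}.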