Scrapy Pipelines and Middlewares Explained

Scrapy’s two extension points look similar but operate on opposite sides of the spider. Downloader middleware sits between the engine and the network and sees every request and response; item pipelines sit between the spider and storage and see every scraped item.

Item pipelines process extracted items — cleaning, validating, deduplicating, writing to a database:

# pipelines.py
import sqlite3

from itemadapter import ItemAdapter

class PriceNormalizerPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        raw = adapter["price"]
        # round() avoids float truncation: int(19.99 * 100) would give 1998
        adapter["price_cents"] = round(float(raw.replace("$", "")) * 100)
        return item

class SqlitePipeline:
    def open_spider(self, spider):
        self.db = sqlite3.connect("scrape.db")
        # Create the table on first run; the url primary key makes
        # INSERT OR REPLACE behave as an upsert.
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS products "
            "(url TEXT PRIMARY KEY, title TEXT, price_cents INTEGER)"
        )

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        self.db.execute(
            "INSERT OR REPLACE INTO products VALUES (?, ?, ?)",
            (adapter["url"], adapter["title"], adapter["price_cents"]),
        )
        return item

    def close_spider(self, spider):
        self.db.commit()
        self.db.close()
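
Validation and deduplication follow the same pattern: raising DropItem discards the item and skips every later pipeline. A minimal sketch (the in-memory seen set and the required price field are illustrative, not part of the project above):

from scrapy.exceptions import DropItem
from itemadapter import ItemAdapter

class DedupPipeline:
    def open_spider(self, spider):
        self.seen = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if not adapter.get("price"):
            raise DropItem("missing price")   # validation
        if adapter["url"] in self.seen:
            raise DropItem("duplicate url")   # deduplication
        self.seen.add(adapter["url"])
        return item

Registered in ITEM_PIPELINES with a number below 400, it would run before the database write.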

Enable in settings.py:

ITEM_PIPELINES = {
    "myproject.pipelines.PriceNormalizerPipeline": 300,
    "myproject.pipelines.SqlitePipeline":          400,
}

The numbers set the execution order: lower runs first, and each pipeline receives whatever the previous one's process_item returned, which is why every pipeline ends with return item.

Downloader middleware is the hook for everything that touches HTTP. Its process_request and process_response methods see traffic in both directions, making it the place for proxy rotation, custom retries, caching, and fingerprint spoofing:

import itertools

class RotatingProxyMiddleware:
    # Placeholder pool; load real proxies from settings or a provider in practice.
    proxies = itertools.cycle(["http://proxy1:8080", "http://proxy2:8080"])

    def process_request(self, request, spider):
        # Setting request.meta["proxy"] tells the downloader to route through that proxy.
        request.meta["proxy"] = next(self.proxies)
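
Middleware is registered the same way, via DOWNLOADER_MIDDLEWARES; the module path and priority below are illustrative, assuming the class lives in myproject/middlewares.py:

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RotatingProxyMiddleware": 350,
}

Keeping the number below 750 means it runs before Scrapy's built-in HttpProxyMiddleware, so the proxy is already set when the stock proxy handling kicks in.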

Rule of thumb: if it’s about the item (shape, content, storage), use a pipeline. If it’s about the request or response (headers, retries, proxies), use middleware. Keeping this split clean makes Scrapy projects much easier to debug.