Scrapy Pipelines and Middlewares Explained
Scrapy’s two extension points look similar but operate at opposite ends of the crawl. Downloader middleware sits between the engine and the network, wrapping every request and response; item pipelines sit between the spider and storage, wrapping every scraped item.
Item pipelines process extracted items — cleaning, validating, deduplicating, writing to a database:
# pipelines.py
import sqlite3

from itemadapter import ItemAdapter


class PriceNormalizerPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        raw = adapter["price"]  # e.g. "$1,299.99"
        # round() avoids float truncation: int(19.99 * 100) would give 1998
        adapter["price_cents"] = round(float(raw.replace("$", "").replace(",", "")) * 100)
        return item


class SqlitePipeline:
    def open_spider(self, spider):
        self.db = sqlite3.connect("scrape.db")
        # INSERT OR REPLACE below relies on url being the primary key
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS products "
            "(url TEXT PRIMARY KEY, title TEXT, price_cents INTEGER)"
        )

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        self.db.execute(
            "INSERT OR REPLACE INTO products VALUES (?, ?, ?)",
            (adapter["url"], adapter["title"], adapter["price_cents"]),
        )
        return item

    def close_spider(self, spider):
        self.db.commit()
        self.db.close()
Enable in settings.py:
ITEM_PIPELINES = {
    "myproject.pipelines.PriceNormalizerPipeline": 300,
    "myproject.pipelines.SqlitePipeline": 400,
}
The numbers set execution order: lower runs first, and items flow through the chain, so each pipeline receives whatever the previous one returned. That is why process_item always returns the item.
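Pipelines can also veto items. Raising DropItem stops an item from reaching any later pipeline, which covers the validating and deduplicating cases mentioned earlier. A minimal sketch, assuming items carry a unique url field; slotting it between the two pipelines above (say, priority 350) keeps duplicates out of the database:

# pipelines.py (same file)
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class DedupPipeline:
    def open_spider(self, spider):
        self.seen = set()  # URLs emitted so far in this crawl

    def process_item(self, item, spider):
        url = ItemAdapter(item)["url"]
        if url in self.seen:
            raise DropItem(f"duplicate item: {url}")
        self.seen.add(url)
        return item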
Downloader middleware is the hook for everything that touches HTTP — proxy rotation, custom retries, caching, fingerprint spoofing:
import itertools

class RotatingProxyMiddleware:
    # Hypothetical proxy pool; a real project would load this from settings
    proxies = itertools.cycle(["http://proxy-a:8080", "http://proxy-b:8080"])

    def process_request(self, request, spider):
        # The downloader routes the request through whatever meta["proxy"] names
        request.meta["proxy"] = next(self.proxies)
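Enable it the same way, via DOWNLOADER_MIDDLEWARES (the priority here is illustrative; Scrapy's built-in middlewares occupy the 100-900 range):

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RotatingProxyMiddleware": 543,
}

The response side uses the same hook mechanism: returning a Request from process_response re-queues it instead of passing the response on to the spider. A sketch of a custom retry on rate limiting (illustrative only; a real version would cap attempts, e.g. with a counter in request.meta, and back off):

class RetryOn429Middleware:
    def process_response(self, request, response, spider):
        if response.status == 429:
            # dont_filter=True skips the dupe filter so the retry isn't discarded
            return request.replace(dont_filter=True)
        return response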
Rule of thumb: if it’s about the item (shape, content, storage), use a pipeline. If it’s about the request or response (headers, retries, proxies), use middleware. Keeping this split clean makes Scrapy projects much easier to debug.