Storing Scraped Data: CSV, SQLite, Postgres
The right storage for scraped data depends less on scale than on what you plan to do with it next.
CSV — for small, one-shot jobs you’ll hand to someone else. pandas.to_csv() is fine until it isn’t. The failure mode: any cell containing a newline, comma, or quote will confuse naive downstream parsers that split on raw lines or commas, no matter how correctly the file was quoted. Writing with csv.DictWriter and quoting=csv.QUOTE_ALL at least makes the quoting explicit, so anything that speaks real CSV stays safe.
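A minimal sketch of that, assuming each scraped record is a dict (the field names and example values are placeholders):

import csv

# One deliberately hostile record: the title contains a comma and a newline.
rows = [{"url": "https://example.com/p/1", "title": "Widget, large\n(blue)", "price_cents": 1999}]

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "title", "price_cents"], quoting=csv.QUOTE_ALL)
    writer.writeheader()  # header row first, then the data
    writer.writerows(rows)

The newline="" matters: the csv module handles line endings itself, and opening the file without it produces doubled blank lines on Windows.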
SQLite — the right default for most scrapers. One file, no server, queryable, and, crucially, it supports upsert (SQLite 3.24+):
import sqlite3

db = sqlite3.connect("scrape.db")

# One row per URL; the PRIMARY KEY is what makes reruns idempotent.
db.execute("""
CREATE TABLE IF NOT EXISTS products (
    url TEXT PRIMARY KEY,
    title TEXT,
    price_cents INTEGER,
    scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
)
""")

# rows is whatever the parser produced; an example record for illustration:
rows = [{"url": "https://example.com/p/1", "title": "Widget", "price_cents": 1999}]

# Upsert: insert new URLs, refresh the fields (and the timestamp) for known ones.
db.executemany("""
INSERT INTO products (url, title, price_cents)
VALUES (:url, :title, :price_cents)
ON CONFLICT(url) DO UPDATE SET
    title = excluded.title,
    price_cents = excluded.price_cents,
    scraped_at = CURRENT_TIMESTAMP
""", rows)
db.commit()
Make the URL the PRIMARY KEY and you get idempotent scraping for free: rerunning the job updates existing rows instead of inserting duplicates.
Postgres — when you need concurrent writers or full-text search. Bulk-insert with psycopg2's execute_values (psycopg 3 drops it in favor of a fast executemany and COPY); ON CONFLICT DO UPDATE works the same as in SQLite.
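A sketch of the same upsert with psycopg2, assuming a products table with the same columns already exists in Postgres (the connection string and credentials are placeholders):

import psycopg2
from psycopg2.extras import execute_values

rows = [{"url": "https://example.com/p/1", "title": "Widget", "price_cents": 1999}]

conn = psycopg2.connect("dbname=scrape user=scraper")  # placeholder DSN
with conn, conn.cursor() as cur:
    # execute_values expands the single %s into a multi-row VALUES list,
    # sending the batch in pages instead of one round trip per row.
    execute_values(cur, """
        INSERT INTO products (url, title, price_cents)
        VALUES %s
        ON CONFLICT (url) DO UPDATE SET
            title = EXCLUDED.title,
            price_cents = EXCLUDED.price_cents,
            scraped_at = CURRENT_TIMESTAMP
    """, [(r["url"], r["title"], r["price_cents"]) for r in rows])
conn.close()

Note that psycopg2's with conn block commits on success and rolls back on an exception, but does not close the connection, hence the explicit close().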
One cross-cutting tip: store raw HTML alongside the parsed fields, at least for the first few runs. When your selector breaks two months later, you’ll want the original page to diff against.
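In SQLite that can be one more table keyed the same way. A sketch reusing the db connection from above (raw_pages and the example variables are illustrative, not part of any fixed schema):

# Cheap insurance: the unparsed page, stored next to the structured rows.
db.execute("""
CREATE TABLE IF NOT EXISTS raw_pages (
    url TEXT PRIMARY KEY,
    html TEXT,
    fetched_at TEXT DEFAULT CURRENT_TIMESTAMP
)
""")

url, html = "https://example.com/p/1", "<html>...</html>"  # from your fetch step
db.execute("""
INSERT INTO raw_pages (url, html) VALUES (?, ?)
ON CONFLICT(url) DO UPDATE SET
    html = excluded.html,
    fetched_at = CURRENT_TIMESTAMP
""", (url, html))
db.commit()

If pages run large, compressing the HTML with zlib and storing it as a BLOB keeps the file size manageable.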