Storing Scraped Data: CSV, SQLite, Postgres

The right storage for scraped data depends less on scale than on what you plan to do with it next.

CSV — for small, one-shot jobs you’ll hand to someone else. pandas.to_csv() is fine until it isn’t. The failure mode: any cell containing a newline, comma, or quote breaks naive parsers downstream. Use csv.DictWriter with quoting=csv.QUOTE_ALL and you’re safer.
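A minimal sketch of the safer version, assuming rows is a list of dicts with the field names shown (both are placeholders, not a fixed convention):

import csv

rows = [{"url": "https://example.com/p/1", "title": "Widget, large", "price": "9.99"}]

# newline="" lets the csv module control line endings itself
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    # QUOTE_ALL wraps every field in quotes, so embedded commas,
    # quotes, and newlines survive the round trip
    writer = csv.DictWriter(f, fieldnames=["url", "title", "price"],
                            quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)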

SQLite — the right default for most scrapers. One file, no server, queryable, and crucially, supports upsert:

import sqlite3

db = sqlite3.connect("scrape.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS products (
        url TEXT PRIMARY KEY,
        title TEXT,
        price_cents INTEGER,
        scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

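# rows: a list of dicts with url, title, and price_cents keys (shape assumed)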
db.executemany("""
    INSERT INTO products (url, title, price_cents)
    VALUES (:url, :title, :price_cents)
    ON CONFLICT(url) DO UPDATE SET
        title = excluded.title,
        price_cents = excluded.price_cents,
        scraped_at = CURRENT_TIMESTAMP
""", rows)
db.commit()

Make the URL a PRIMARY KEY and you get idempotent scraping for free — rerunning the job updates rather than duplicates. (The ON CONFLICT ... DO UPDATE syntax needs SQLite 3.24 or newer.)

Postgres — when you need concurrent writers or full-text search. psycopg2's execute_values helper handles bulk inserts, and ON CONFLICT DO UPDATE works the same as in SQLite.
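A sketch of the same upsert against Postgres, assuming a products table like the SQLite one (with scraped_at as a timestamptz); the DSN is a placeholder:

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=scrape")  # placeholder DSN
with conn, conn.cursor() as cur:
    # execute_values expands the single %s into a multi-row VALUES list,
    # sending the batch in far fewer round trips than executemany
    execute_values(cur, """
        INSERT INTO products (url, title, price_cents)
        VALUES %s
        ON CONFLICT (url) DO UPDATE SET
            title = EXCLUDED.title,
            price_cents = EXCLUDED.price_cents,
            scraped_at = now()
    """, [(r["url"], r["title"], r["price_cents"]) for r in rows])

The connection context manager commits on success and rolls back on error, so a half-failed batch never leaves partial rows behind.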

One cross-cutting tip: store raw HTML alongside the parsed fields, at least for the first few runs. When your selector breaks two months later, you’ll want the original page to diff against.
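In the SQLite setup above, that can be one extra table; the pages table and column names here are my own, not a fixed convention:

db.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        url TEXT PRIMARY KEY,
        raw_html TEXT,
        fetched_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
# page_url and html come from your fetch step; INSERT OR REPLACE
# keeps only the latest snapshot per URL
db.execute("INSERT OR REPLACE INTO pages (url, raw_html) VALUES (?, ?)",
           (page_url, html))
db.commit()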