Headers, Cookies, and Sessions in requests

The difference between a scraper that works once and one that works reliably is usually session management.

Always use a Session object. Cookies persist across requests automatically, connections are pooled (faster), and default headers apply to every call:

import requests

s = requests.Session()
s.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/128.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml",
})

# Landing page sets cookies (CSRF tokens, session IDs)
s.get("https://example.com/")

# Now the protected page works
resp = s.get("https://example.com/api/items")
resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
data = resp.json()
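You can inspect what the session has accumulated at any point via its cookie jar. A minimal sketch, with the cookie name and value made up for illustration (a real server would set them via `Set-Cookie`):

```python
import requests

s = requests.Session()

# Simulate a cookie the landing page would normally set
s.cookies.set("sessionid", "abc123", domain="example.com")

# The session attaches it automatically to every matching request;
# get_dict() shows the jar's current contents
print(s.cookies.get_dict())  # {'sessionid': 'abc123'}
```

Printing `s.cookies` after the first `s.get` is a quick way to confirm the site actually set what you expected.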

Things that look mysterious but are usually just headers:

  • 403 on the API, 200 in the browser — missing Referer or Origin
  • Empty response / JSON error — missing Accept: application/json
  • Different content than the browser sees — missing Accept-Language, so the server serves a different locale
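A header like Referer can be supplied per call without touching the session defaults: the `headers=` argument is merged on top of `s.headers`. A sketch (the URLs are illustrative) that uses `prepare_request` to show the merged result without sending anything over the network:

```python
import requests

s = requests.Session()
s.headers.update({"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US,en;q=0.9"})

# Per-request headers are merged on top of the session defaults
req = requests.Request(
    "GET",
    "https://example.com/api/items",
    headers={"Referer": "https://example.com/", "Accept": "application/json"},
)
prepped = s.prepare_request(req)

# Both session-level and per-call headers are present in the final request
print(prepped.headers["User-Agent"])  # Mozilla/5.0
print(prepped.headers["Referer"])     # https://example.com/
```

In practice you would just write `s.get(url, headers={"Referer": ...})`; `prepare_request` is only used here to inspect the merge offline.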

If the site uses CSRF tokens embedded in HTML, scrape the token from the first page and send it with subsequent writes:

from bs4 import BeautifulSoup

# Reuse the same session so the token matches the session cookie
soup = BeautifulSoup(s.get(login_url).text, "lxml")
token = soup.select_one('input[name="csrf_token"]')["value"]
s.post(login_url, data={"user": u, "pass": p, "csrf_token": token})
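The token lookup is the fragile step: if the selector misses, `select_one` returns None and the `["value"]` subscript raises a TypeError. A minimal offline sketch, with the HTML and field name made up for illustration, adding an explicit check:

```python
from bs4 import BeautifulSoup

html = '<form><input type="hidden" name="csrf_token" value="tok-123"></form>'

# "html.parser" is the stdlib backend; "lxml" also works if installed
soup = BeautifulSoup(html, "html.parser")

field = soup.select_one('input[name="csrf_token"]')
if field is None:
    raise RuntimeError("CSRF field not found - did the login page change?")
token = field["value"]
print(token)  # tok-123
```

Failing with a clear message here beats a cryptic TypeError three requests later.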

Ninety percent of “my scraper returns a login page” bugs are one missing header or cookie.