Python on Web Scraping Python

Playwright Stealth and Single-Page Apps

Tue, 14 Apr 2026 10:00:00 +0200

Plain Playwright fails on sites with serious bot detection. The problem isn’t Playwright — it’s the tells that a default Chromium automation leaves everywhere.

Deploying a Scraper: Cron, Docker, Lambda

Tue, 24 Mar 2026 10:00:00 +0100

A scraper that works on your laptop is a prototype. Here’s how to get it running on a schedule without babysitting.

Async Scraping with httpx and asyncio

Tue, 03 Feb 2026 10:00:00 +0100

For scraping, async can deliver a 10–50× speedup for very little extra code, because IO-bound workloads are the textbook use case. httpx has the same ergonomics as requests with an async client built in.
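To see where the speedup comes from, here is a minimal sketch that uses only the standard library. The `fake_fetch` function is a hypothetical stand-in that simulates network latency with `asyncio.sleep`; in a real scraper it would be replaced by a request on an `httpx.AsyncClient`. The URLs, the delay, and the concurrency limit are all illustrative assumptions.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.1) -> str:
    # Hypothetical stand-in for a network call; asyncio.sleep simulates IO wait.
    await asyncio.sleep(delay)
    return f"body of {url}"


async def fetch_all(urls: list[str], limit: int = 10) -> list[str]:
    # A semaphore caps how many simulated requests are in flight at once.
    sem = asyncio.Semaphore(limit)

    async def bounded(url: str) -> str:
        async with sem:
            return await fake_fetch(url)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(u) for u in urls))


def run_demo(n: int = 20) -> tuple[list[str], float]:
    # Placeholder URLs; nothing is actually requested.
    urls = [f"https://example.com/page/{i}" for i in range(n)]
    start = time.perf_counter()
    bodies = asyncio.run(fetch_all(urls))
    return bodies, time.perf_counter() - start
```

Fetched one at a time, twenty simulated requests of 0.1 s each would take about two seconds. With ten in flight at once, the waits overlap and the batch finishes in a fraction of that.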

Scrapy Pipelines and Middlewares Explained

Tue, 13 Jan 2026 10:00:00 +0100

Scrapy’s two extension points look similar but do opposite things. Middleware sits between Scrapy and the network; pipelines sit between the spider and storage.

Scrapy Basics: When to Upgrade from requests

Tue, 30 Dec 2025 10:00:00 +0100

requests + BeautifulSoup is fine until you’re managing queues, retries, deduplication, concurrency, and pipeline logic by hand. That’s when Scrapy starts earning its keep.

Storing Scraped Data: CSV, SQLite, Postgres

Tue, 09 Dec 2025 10:00:00 +0100

The right storage for scraped data depends less on scale than on what you plan to do with it next.

Rate Limiting and Being a Polite Scraper

Tue, 18 Nov 2025 10:00:00 +0200

A scraper that hammers a server at 500 requests/sec is a denial-of-service attack with extra steps. Pacing isn’t just ethics — it’s self-interest. Gentle scrapers don’t get blocked.
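One simple way to pace requests is to enforce a minimum gap between them. The sketch below is an assumption about how that might look, not a prescribed implementation: `Pacer` is a hypothetical helper, and the injectable `clock` and `sleep` parameters exist only so the behaviour can be checked without real waiting.

```python
import time


class Pacer:
    """Enforce a minimum interval between calls (a simple fixed-rate limiter)."""

    def __init__(self, requests_per_second: float,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous allowed request

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept
```

Calling `pacer.wait()` before every request keeps a loop at or below the chosen rate; `Pacer(2)`, for example, allows at most two requests per second.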

User-Agents and Browser Fingerprinting

Tue, 21 Oct 2025 10:00:00 +0200

The default requests User-Agent (python-requests/2.x) is the fastest possible way to get blocked. But modern anti-bot stacks look at far more than one header.

Proxies and Rotating IPs: When You Actually Need Them

Tue, 30 Sep 2025 10:00:00 +0200

Most scraping tutorials reach for proxies on page one. In reality, you should reach for them last, after you’ve verified that a single IP with a good User-Agent and a sensible rate limit actually gets blocked.

Headers, Cookies, and Sessions in requests

Tue, 09 Sep 2025 10:00:00 +0200

The difference between a scraper that works once and one that works reliably is usually session management.

Find the Hidden JSON API Behind Any Site

Tue, 26 Aug 2025 10:00:00 +0200

Most modern sites that look like HTML are secretly driven by JSON APIs. Finding that API turns a messy scraping job into reading documentation you didn’t know existed.

Five Pagination Patterns and How to Scrape Them

Tue, 05 Aug 2025 10:00:00 +0200

Pagination looks trivial until you hit your fifth different implementation. Here are the patterns worth recognizing on sight.

Playwright for JavaScript-Rendered Pages

Tue, 15 Jul 2025 10:00:00 +0200

If requests.get(url).text returns an empty shell with no data, the site renders in the browser. Playwright is the cleanest way to scrape it.

Scraping with lxml and XPath

Tue, 17 Jun 2025 10:00:00 +0200

When CSS selectors run out of expressive power, XPath is the next step up. lxml is also substantially faster than BeautifulSoup on large pages.

BeautifulSoup Selectors: A Practical Deep Dive

Tue, 27 May 2025 10:00:00 +0200

Most BeautifulSoup scrapers use maybe 20% of what .select() supports. Here are the selectors that actually come up.

Robust HTTP: Errors, Retries, and Exponential Backoff

Tue, 06 May 2025 10:00:00 +0200

Scrapers fail. The question is whether yours fails once and stops, or retries intelligently and finishes the job.
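As a rough illustration of retrying intelligently, here is a standard-library sketch of exponential backoff with jitter. The helper name `retry_with_backoff`, its defaults, and the choice of full jitter are assumptions made for this example; `func` stands for whatever call performs the request.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on failure, wait up to base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random amount up to that ceiling.
            sleep(random.uniform(0, delay))
```

In practice `retry_on` would be narrowed to transient failures such as timeouts or connection errors, so that permanent errors like a 404 fail immediately instead of being retried.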

Getting Started with requests and BeautifulSoup

Tue, 22 Apr 2025 10:00:00 +0200

The simplest Python scraping stack is still the best place to start: requests to fetch the page, BeautifulSoup to pick the parts you want.