CAPTCHAs: What Works, What's Legal, What's Not

CAPTCHAs are the point where scraping stops being a technical problem and starts being a legal and ethical one. Here’s the honest landscape.

Types you’ll see, roughly in order of difficulty:

  • Simple image CAPTCHAs: mostly gone, trivially solved by OCR
  • reCAPTCHA v2 (“check the box”): a behavioral and fingerprint check; the checkbox reports the result rather than being the test, and low-trust visitors get an image puzzle on top
  • reCAPTCHA v3 / Enterprise: invisible scoring based on behavior; no puzzle shown
  • Cloudflare Turnstile / hCaptcha: Turnstile is usually invisible or a single checkbox; hCaptcha more often falls back to image puzzles
  • Managed challenges (Cloudflare, DataDome, PerimeterX, Akamai): full interstitial pages with JS challenges
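
Knowing which of these you hit is the first step in deciding what to do next. Here is a minimal Python sketch that guesses the challenge type from a response. The function name and marker lists are illustrative assumptions: the strings are the default widget class names and script hosts each vendor uses, plus the cf-mitigated header Cloudflare sets on challenge pages. Sites can rename or lazy-load widgets, and other vendors' interstitials are not covered, so a miss means "unknown", not "no CAPTCHA".

```python
from typing import Mapping, Optional

# Assumed markers: default widget class names and script URLs.
# A site can change or hide these, so treat the result as a hint only.
WIDGET_MARKERS = {
    "recaptcha": ("g-recaptcha", "/recaptcha/api.js", "/recaptcha/enterprise.js"),
    "hcaptcha": ("h-captcha", "hcaptcha.com/1/api.js"),
    "turnstile": ("cf-turnstile", "challenges.cloudflare.com/turnstile"),
}


def classify_challenge(headers: Mapping[str, str], body: str) -> Optional[str]:
    """Return a best-guess challenge type, or None if nothing was recognized."""
    lowered = {k.lower(): v.lower() for k, v in headers.items()}

    # Cloudflare marks challenge interstitials with this response header.
    if lowered.get("cf-mitigated") == "challenge":
        return "managed-challenge"

    text = body.lower()
    for name, markers in WIDGET_MARKERS.items():
        if any(marker in text for marker in markers):
            return name
    return None
```

A "managed-challenge" result is the one to take most seriously: it means the whole page was withheld, not that a form on it is protected.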

What actually works:

  1. Don’t trigger them. Use a realistic TLS and HTTP fingerprint (curl_cffi), a sane request rate, residential proxies, and a consistent session. As a rough estimate, this is enough about 80% of the time.
  2. Use a real browser. Playwright with a stealth plugin and slow, human-like interactions. Works for scoring-based challenges.
  3. Use a solving service (2Captcha, Anti-Captcha, CapSolver). You send the site key and page URL, and they return a token for you to submit. Costs roughly $1–3 per 1,000 solves.
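
The "sane request rate" part of option 1 is the easiest to get wrong, so here is a minimal Python sketch of a throttle that enforces a minimum gap between requests and adds random jitter so the timing is not perfectly regular. The class name, the helper, and the default delays are assumptions for illustration, not recommended values; tune them per site.

```python
import random
import time
from typing import Callable, Iterable, List


class PoliteThrottle:
    """Space out requests with a minimum delay plus random jitter."""

    def __init__(
        self,
        min_delay: float = 2.0,   # assumed default, in seconds
        jitter: float = 1.5,      # assumed default, max extra seconds
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
        rng: Callable[[], float] = random.random,
    ) -> None:
        self.min_delay = min_delay
        self.jitter = jitter
        self._clock = clock
        self._sleep = sleep
        self._rng = rng
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        pause = max(0.0, self._next_allowed - now)
        if pause:
            self._sleep(pause)
        # Schedule the following request: fixed gap plus up to `jitter` extra.
        gap = self.min_delay + self._rng() * self.jitter
        self._next_allowed = max(now, self._next_allowed) + gap
        return pause


def fetch_all(urls: Iterable[str], fetch: Callable[[str], object],
              throttle: PoliteThrottle) -> List[object]:
    """Call `fetch` on each URL, waiting on the throttle before every call."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

Here fetch is any callable that takes a URL. To cover the "consistent session" and fingerprint points as well, it could be the get method of a single curl_cffi session created with an impersonate target, reused for the whole crawl; check the curl_cffi docs for the options your installed version supports.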

Legal reality (not legal advice):

  • CAPTCHA presence ≠ you can’t scrape. Public data is public.
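  • But bypassing technical protections has been argued to be a CFAA violation in the US. In hiQ v. LinkedIn the Ninth Circuit held that scraping publicly accessible pages likely did not violate the CFAA, though hiQ was later found to have breached LinkedIn’s user agreement, and the law remains unsettled. In the EU, GDPR adds separate concerns for personal data.
  • Terms of Service matter more when you have an account. Scraping while logged in, in violation of the ToS, is the legally riskiest zone.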

Rule of thumb: if a site is actively fighting you with managed challenges, they’ve told you “no.” Scraping anyway is technically possible and legally fraught. Consider whether there’s an API, a data partnership, or a different source.
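
One way to make that rule of thumb hard to ignore is to build it into the crawler. Below is a minimal Python sketch of a circuit breaker that stops a crawl once a host has pushed back a few times. The class name, the threshold of 3, and the choice to count HTTP 403 and 429 as refusals are all assumptions for illustration.

```python
class ChallengeBreaker:
    """Stop a crawl once a host has answered with too many refusals."""

    # Assumption: these status codes count as the site pushing back.
    REFUSAL_STATUSES = (403, 429)

    def __init__(self, max_challenges: int = 3) -> None:
        self.max_challenges = max_challenges
        self.challenges = 0

    def record(self, status: int, challenged: bool = False) -> None:
        """Note one response; `challenged` flags a CAPTCHA or interstitial."""
        if challenged or status in self.REFUSAL_STATUSES:
            self.challenges += 1

    @property
    def tripped(self) -> bool:
        """True once the limit is reached; it never resets on its own."""
        return self.challenges >= self.max_challenges
```

Check tripped before each request and end the job when it is set. The counter deliberately does not reset after a successful response, so restarting is a decision a person makes, ideally after looking for an API or another source.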