Scraping with lxml and XPath
When CSS selectors run out of expressive power, XPath is the next step up. lxml is also substantially faster than
BeautifulSoup on large pages.
import requests
from lxml import html

page = html.fromstring(requests.get(url, timeout=10).content)

# xpath() always returns a list; index [0] or check for emptiness to get one match
# Find the <a> whose text contains "Next"
next_link = page.xpath('//a[contains(., "Next")]/@href')

# Price inside a row that has a specific label
price = page.xpath('//tr[td[normalize-space()="Final price"]]/td[2]/text()')

# Parent axis: go from a matched child back up to its row
row = page.xpath('//span[@class="sold-out"]/ancestor::tr')
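The same three queries can be tried against an inline snippet instead of a live page. The markup and values below are invented for illustration:

```python
from lxml import html

# Hypothetical fragment standing in for the fetched document
doc = html.fromstring("""
<table>
  <tr><td> Final price </td><td>$19.99</td></tr>
  <tr><td>Status</td><td><span class="sold-out">Sold out</span></td></tr>
</table>
<a href="/page/2">Next page</a>
""")

# xpath() always returns a list, even for a single match
next_link = doc.xpath('//a[contains(., "Next")]/@href')
print(next_link)  # ['/page/2']

# normalize-space() lets the sloppy " Final price " cell match cleanly
price = doc.xpath('//tr[td[normalize-space()="Final price"]]/td[2]/text()')
print(price)  # ['$19.99']

# ancestor:: walks back up from the matched <span> to its enclosing row
row = doc.xpath('//span[@class="sold-out"]/ancestor::tr')
print(row[0].xpath('td[1]/text()'))  # ['Status']
```

Note that every query returns a list: an empty list means no match, never an exception.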
Three XPath idioms worth memorizing:

- contains(., "text") matches text content across child elements, which is saner than trying to match whitespace-exact strings
- normalize-space() collapses whitespace, so " Final price " matches "Final price"
- ancestor:: and following-sibling:: traverse the DOM in ways CSS can't
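To see why contains(., ...) beats exact text comparison, and what following-sibling:: buys you, here is a small sketch (markup invented for illustration):

```python
from lxml import html

doc = html.fromstring(
    '<dl>'
    '<dt>Weight</dt><dd>2 kg</dd>'
    '<dt>Colour</dt><dd>red</dd>'
    '</dl>'
    '<p>Ships <b>Next</b> week</p>'
)

# Exact text() comparison fails: the <b> child splits the <p>'s own
# text into "Ships " and " week", so neither node equals the phrase
assert doc.xpath('//p[text()="Ships Next week"]') == []

# contains(., ...) tests the element's full string value instead
assert len(doc.xpath('//p[contains(., "Next")]')) == 1

# following-sibling:: pairs a <dt> label with the <dd> right after it
weight = doc.xpath('//dt[.="Weight"]/following-sibling::dd[1]/text()')
print(weight)  # ['2 kg']
```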
Converting between the two: lxml.cssselect translates CSS selectors into XPath for you, so you can drop a CSS selector in mid-script (page.cssselect('.price')) without switching libraries. It does depend on the separate cssselect package being installed.
One gotcha: .text_content() concatenates all descendant text, while .text returns only the text before the first child element. Mixing them up produces very confusing bugs.
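The difference is easiest to see on a tiny made-up fragment:

```python
from lxml import html

p = html.fromstring('<p>Total: <b>$10</b> (incl. tax)</p>')

# .text is only the text before the first child element (<b> here)
print(repr(p.text))  # 'Total: '

# .text_content() concatenates every descendant text node,
# including the <b>'s text and the tail text after it
print(repr(p.text_content()))  # 'Total: $10 (incl. tax)'
```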