Scraping with lxml and XPath
When CSS selectors run out of expressive power, XPath is the next step up. lxml is also substantially faster than
BeautifulSoup on large pages.
import requests
from lxml import html

page = html.fromstring(requests.get(url, timeout=10).content)

# xpath() always returns a list; index [0] or check for emptiness to get one match
# Find the <a> whose text contains "Next"
next_link = page.xpath('//a[contains(., "Next")]/@href')

# Price inside a row that has a specific label
price = page.xpath('//tr[td[normalize-space()="Final price"]]/td[2]/text()')

# Parent axis: go from a matched child back up to its row
row = page.xpath('//span[@class="sold-out"]/ancestor::tr')
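The same three queries can be tried against an inline snippet instead of a live page. The markup and values below are invented for illustration:

```python
from lxml import html

# Hypothetical fragment standing in for the fetched document
doc = html.fromstring("""
<table>
  <tr><td> Final price </td><td>$19.99</td></tr>
  <tr><td>Status</td><td><span class="sold-out">Sold out</span></td></tr>
</table>
<a href="/page/2">Next page</a>
""")

# xpath() always returns a list, even for a single match
next_link = doc.xpath('//a[contains(., "Next")]/@href')
print(next_link)  # ['/page/2']

# normalize-space() lets the sloppy " Final price " cell match cleanly
price = doc.xpath('//tr[td[normalize-space()="Final price"]]/td[2]/text()')
print(price)  # ['$19.99']

# ancestor:: walks back up from the matched <span> to its enclosing row
row = doc.xpath('//span[@class="sold-out"]/ancestor::tr')
print(row[0].xpath('td[1]/text()'))  # ['Status']
```

Note that every query returns a list: an empty list means no match, never an exception.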
Three XPath idioms worth memorizing:

- contains(., "text") matches text content across child elements, which is saner than trying to match whitespace-exact strings
- normalize-space() collapses whitespace, so " Final price " matches "Final price"
- ancestor:: and following-sibling:: traverse the DOM in ways CSS can't
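To see why contains(., ...) beats exact text comparison, and what following-sibling:: buys you, here is a small sketch (markup invented for illustration):

```python
from lxml import html

doc = html.fromstring(
    '<dl>'
    '<dt>Weight</dt><dd>2 kg</dd>'
    '<dt>Colour</dt><dd>red</dd>'
    '</dl>'
    '<p>Ships <b>Next</b> week</p>'
)

# Exact text() comparison fails: the <b> child splits the <p>'s own
# text into "Ships " and " week", so neither node equals the phrase
assert doc.xpath('//p[text()="Ships Next week"]') == []

# contains(., ...) tests the element's full string value instead
assert len(doc.xpath('//p[contains(., "Next")]')) == 1

# following-sibling:: pairs a <dt> label with the <dd> right after it
weight = doc.xpath('//dt[.="Weight"]/following-sibling::dd[1]/text()')
print(weight)  # ['2 kg']
```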
Converting between the two: lxml.cssselect translates CSS selectors into XPath for you, so you can drop a CSS selector in mid-script (page.cssselect('.price')) without switching libraries. It does depend on the separate cssselect package being installed.
One gotcha: .text_content() concatenates all descendant text, while .text returns only the text before the first child element. Mixing them up produces very confusing bugs.
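The difference is easiest to see on a tiny made-up fragment:

```python
from lxml import html

p = html.fromstring('<p>Total: <b>$10</b> (incl. tax)</p>')

# .text is only the text before the first child element (<b> here)
print(repr(p.text))  # 'Total: '

# .text_content() concatenates every descendant text node,
# including the <b>'s text and the tail text after it
print(repr(p.text_content()))  # 'Total: $10 (incl. tax)'
```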