Querying the DOM with XPath and lxml
XPath is a query language for selecting nodes from an XML document, and it is well worth learning for anyone doing web scraping. It offers several benefits over other ways of selecting content:
- It can navigate the DOM tree in any direction (to children, parents and siblings)
- It is more expressive than alternatives such as CSS selectors and regular expressions
- It has a rich library of built-in functions (200+ in recent versions of the standard, far fewer in XPath 1.0, the version lxml implements) and is extensible with custom functions
- It is widely supported by parsing libraries and scraping platforms
The XPath data model defines seven kinds of nodes (we have seen some of them previously):
- root node (top-level parent node)
- element nodes (`<a>..</a>`)
- attribute nodes (`href="example.html"`)
- text nodes (`"this is a text"`)
- comment nodes (`<!-- a comment -->`)
- namespace nodes
- processing instruction nodes
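To make the node kinds concrete, here is a minimal sketch that selects most of them with lxml. The sample document and the variable names are made up for illustration, and namespace nodes are left out to keep it short.

```python
from lxml import etree

# Hypothetical sample document containing several kinds of nodes.
xml = (
    '<page>'
    '<?render mode="fast"?>'
    '<!-- a comment -->'
    '<a href="example.html">this is a text</a>'
    '</page>'
)
root = etree.fromstring(xml)

top = root.xpath("/*")                          # the element directly under the root node
elements = root.xpath("//a")                    # element nodes
attributes = root.xpath("//a/@href")            # attribute nodes (returned as strings)
texts = root.xpath("//a/text()")                # text nodes (returned as strings)
comments = root.xpath("//comment()")            # comment nodes
pis = root.xpath("//processing-instruction()")  # processing instruction nodes

print(top[0].tag, elements[0].tag, attributes, texts)
print(repr(comments[0].text), pis[0].target)
```

Note that lxml hands back attribute and text nodes as string-like objects rather than as separate node objects, which is usually what a scraper wants.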
XPath expressions can return different data types:
- strings
- booleans
- numbers
- node-sets (probably the most common case)
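The four result types can be seen by evaluating different expressions against the same tree. This is a small sketch using lxml; the sample markup and variable names are assumptions chosen for the example.

```python
from lxml import etree

# Hypothetical sample markup.
root = etree.fromstring(
    '<ul><li class="x">one</li><li>two</li><li>three</li></ul>'
)

first_text = root.xpath("string(//li[1])")         # string
has_x = root.xpath("boolean(//li[@class='x'])")    # boolean
how_many = root.xpath("count(//li)")               # number (a Python float)
items = root.xpath("//li")                         # node-set (a Python list)

print(first_text, has_x, how_many, len(items))
```

In lxml, a node-set comes back as a plain Python list, so an expression that matches nothing yields an empty list rather than `None`.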
An (XPath) axis defines a node-set relative to the current...