Reading XML/HTML content
Reading HTML or XML files allows us to parse web pages' content and to read documents or configurations described in XML.
Python has a built-in XML parser, the ElementTree module, which is well suited to parsing XML files. When HTML is involved, however, it chokes quickly on the various quirks of HTML, such as tags that are never closed.
Consider trying to parse the following HTML:
<html>
    <body class="main-body">
        <p>hi</p>
        <img><br>
        <input type="text" />
    </body>
</html>
You will quickly face errors:
xml.etree.ElementTree.ParseError: mismatched tag: line 7, column 6
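The failure can be reproduced with a short script. This is a minimal sketch: the helper name try_parse and the exact whitespace layout of the snippet are assumptions made for illustration, so the line and column reported by the parser may differ from the message quoted above.

```python
import xml.etree.ElementTree as ET

# The HTML from the text; the indentation used here is an assumption.
HTML_SNIPPET = """
<html>
    <body class="main-body">
        <p>hi</p>
        <img><br>
        <input type="text" />
    </body>
</html>
"""


def try_parse(markup):
    """Return the parsed root element, or the ParseError raised by expat."""
    try:
        return ET.fromstring(markup)
    except ET.ParseError as err:
        return err


result = try_parse(HTML_SNIPPET)
# <img> and <br> are never closed, so </body> does not match the open tag.
print(type(result).__name__, result)
```

The error is triggered by the void tags: expat expects every opened element to be closed, so the closing body tag arrives while br is still open.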
Luckily, it's not too hard to adapt the parser to handle at least the most common HTML quirks, such as self-closing/void tags.
How to do it...
You need to perform the following steps for this recipe:
ElementTree by default uses expat to parse documents, and then relies on xml.etree.ElementTree.TreeBuilder to build the DOM of the document.
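The split between parsing and tree building can be seen by wiring the two pieces together explicitly. The sketch below first passes a TreeBuilder to XMLParser as its target, then drives a second TreeBuilder by hand with the same sequence of events; the sample document and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The default pairing made explicit: the expat-based XMLParser emits
# events and the TreeBuilder target assembles them into elements.
parser = ET.XMLParser(target=ET.TreeBuilder())
parser.feed("<config><item name='a'>1</item></config>")
root = parser.close()  # returns whatever the target's close() returns

# TreeBuilder does not care where its events come from, so the same
# tree can be built by calling start/data/end directly.
builder = ET.TreeBuilder()
builder.start("config", {})
builder.start("item", {"name": "a"})
builder.data("1")
builder.end("item")
builder.end("config")
manual_root = builder.close()

print(ET.tostring(root) == ET.tostring(manual_root))
```

Because TreeBuilder only consumes start, data, and end events, any component able to produce those events can stand in for the expat-based parser.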
We can replace XMLParser based...
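As a minimal sketch of what such a replacement could look like, the following assumes Python's html.parser.HTMLParser as the event source in place of expat. The class name HTMLTreeParser and the VOID_TAGS set are hypothetical and introduced only for illustration; this is one possible approach, not necessarily the one the recipe goes on to use.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical, non-exhaustive set of void elements for this sketch.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HTMLTreeParser(HTMLParser):
    """Feed HTMLParser events into an ElementTree TreeBuilder."""

    def __init__(self):
        super().__init__()
        self._builder = ET.TreeBuilder()

    @staticmethod
    def _attrs(attrs):
        # Valueless attributes arrive as None; store them as "".
        return {name: value or "" for name, value in attrs}

    def handle_starttag(self, tag, attrs):
        self._builder.start(tag, self._attrs(attrs))
        if tag in VOID_TAGS:
            # Void tags never get an end tag, so close them at once.
            self._builder.end(tag)

    def handle_startendtag(self, tag, attrs):
        # Explicitly self-closed tags such as <input />.
        self._builder.start(tag, self._attrs(attrs))
        self._builder.end(tag)

    def handle_endtag(self, tag):
        # Ignore stray end tags for elements already closed above.
        if tag not in VOID_TAGS:
            self._builder.end(tag)

    def handle_data(self, data):
        self._builder.data(data)

    def close(self):
        super().close()
        return self._builder.close()


html_parser = HTMLTreeParser()
html_parser.feed('<html><body class="main-body"><p>hi</p>'
                 '<img><br><input type="text" /></body></html>')
html_root = html_parser.close()
print([child.tag for child in html_root.find("body")])
```

The sketch only addresses void and self-closing tags. Other HTML quirks, such as optional end tags on p or li elements, are not handled and would need further logic.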