Reading XML/HTML content
Reading HTML or XML files allows us to parse web pages' content and to read documents or configurations described in XML.
Python has a built-in XML parser, the ElementTree module, which is perfect for parsing XML files. When HTML is involved, however, it chokes quickly on HTML's various quirks.
Consider trying to parse the following HTML:
<html>
<body class="main-body">
<p>hi</p>
<img><br>
<input type="text" />
</body>
</html>
You will quickly face errors:
xml.etree.ElementTree.ParseError: mismatched tag: line 7, column 6
Luckily, it's not too hard to adapt the parser to handle at least the most common HTML quirks, such as self-closing/void tags.
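The failure can be reproduced with a short script. This is a minimal sketch that feeds the HTML above to `xml.etree.ElementTree.fromstring` as a string; the `HTML` and `error` names are introduced here for illustration, and the line and column reported depend on the exact whitespace of the input.

```python
import xml.etree.ElementTree as ET

# The sample document from above, written without indentation.
HTML = """<html>
<body class="main-body">
<p>hi</p>
<img><br>
<input type="text" />
</body>
</html>"""

error = None
try:
    ET.fromstring(HTML)
except ET.ParseError as err:
    # expat treats <img> and <br> as still open, so the closing
    # </body> tag does not match the innermost open element.
    error = err
    print("ParseError:", err)
```

The strict XML parser has no notion of void elements, so every `<img>` or `<br>` without an explicit close leaves the document unbalanced.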
How to do it...
You need to perform the following steps for this recipe:
ElementTree by default uses expat to parse documents, and then relies on xml.etree.ElementTree.TreeBuilder to build the DOM of the document.
We can replace XMLParser based...
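One possible way to replace the parser is sketched below. It is an illustration under stated assumptions, not necessarily the recipe's own code: it assumes `html.parser.HTMLParser` from the standard library as the tokenizer in place of expat, and the `TolerantHTMLParser` class and `VOID_TAGS` set are hypothetical names introduced here. The idea is to forward the tokenizer's events to a `TreeBuilder` and to close void elements as soon as they are opened.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# HTML void elements: they never carry a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class TolerantHTMLParser(HTMLParser):
    """Hypothetical helper: forwards HTMLParser events to a TreeBuilder."""

    def __init__(self):
        super().__init__()
        self._builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes written without a value.
        self._builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            # Void elements are closed immediately.
            self._builder.end(tag)

    def handle_startendtag(self, tag, attrs):
        # Explicitly self-closed tags such as <input />.
        self._builder.start(tag, {k: v or "" for k, v in attrs})
        self._builder.end(tag)

    def handle_endtag(self, tag):
        # Ignore stray closing tags for void elements (e.g. </br>).
        if tag not in VOID_TAGS:
            self._builder.end(tag)

    def handle_data(self, data):
        self._builder.data(data)

    def close(self):
        # Flush any buffered text, then return the root element.
        super().close()
        return self._builder.close()


HTML = """<html>
<body class="main-body">
<p>hi</p>
<img><br>
<input type="text" />
</body>
</html>"""

parser = TolerantHTMLParser()
parser.feed(HTML)
root = parser.close()

body = root.find("body")
print(root.tag, body.get("class"))
print([child.tag for child in body])
```

The result is an ordinary `Element` tree, so `find`, `iter`, and the rest of the ElementTree API work on it as usual. This sketch only covers void and self-closing tags; other HTML quirks, such as unclosed `<p>` or `<li>` elements, would need additional handling.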