Reading office document metadata
Recipe Difficulty: Medium
Python Version: 2.7 or 3.5
Operating System: Any
Reading metadata from office documents can expose interesting information about the authorship and history of those files. Conveniently, the 2007-format .docx, .xlsx, and .pptx files store metadata in XML. The XML tags can be easily processed with Python.
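As a quick illustration of where that XML lives, the following minimal sketch lists the XML parts inside an Office file by treating it as a ZIP archive. The helper name `list_xml_parts` is hypothetical and not part of the recipe's script; it accepts either a path or a binary file-like object.

```python
import zipfile


def list_xml_parts(office_file):
    """Return the names of the XML parts inside an Office 2007+ file.

    office_file may be a filesystem path or a binary file-like object.
    The function name is illustrative, not taken from the recipe.
    """
    with zipfile.ZipFile(office_file) as container:
        # namelist() returns every archive member; keep only the XML parts.
        return [name for name in container.namelist()
                if name.endswith(".xml")]
```

Calling `list_xml_parts("example.docx")` on a typical Word document (the file name here is only a placeholder) should return entries such as `docProps/core.xml` and `docProps/app.xml` alongside the document body parts, though the exact set depends on the file.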
Getting started
All libraries used in this script are present in Python's standard library. We use the built-in xml library and the zipfile library to allow us access to the XML documents within the ZIP container.
Note
To learn more about the xml library, visit https://docs.python.org/3/library/xml.etree.elementtree.html.
To learn more about the zipfile library, visit https://docs.python.org/3/library/zipfile.html.
How to do it...
We extract embedded Office metadata by performing the following steps:
- Confirm that the input file is a valid ZIP file.
- Extract the core.xml and app.xml files from the Office file.
- Parse XML data and print embedded metadata...
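The three steps above can be sketched roughly as follows. This is a simplified outline under stated assumptions, not the recipe's actual script: the function names `read_office_metadata` and `print_metadata` are hypothetical, the two parts are assumed to sit under `docProps/` in the archive, and only top-level elements that carry text directly are reported.

```python
from __future__ import print_function

import xml.etree.ElementTree as ET
import zipfile

# Assumed locations of the metadata parts inside the ZIP container.
METADATA_PARTS = ("docProps/core.xml", "docProps/app.xml")


def read_office_metadata(office_file):
    """Return {part name: {tag: text}} for core.xml and app.xml.

    office_file may be a path or a binary file-like object.
    """
    # Step 1: confirm that the input is a valid ZIP file.
    if not zipfile.is_zipfile(office_file):
        raise ValueError("Input is not a valid ZIP container")

    metadata = {}
    with zipfile.ZipFile(office_file) as container:
        members = container.namelist()
        for part in METADATA_PARTS:
            if part not in members:
                continue  # some files may lack one of the parts
            # Step 2: extract the XML part from the archive.
            root = ET.fromstring(container.read(part))
            # Step 3: parse the XML, dropping the "{namespace}" prefix
            # that ElementTree puts in front of each tag name.
            fields = {}
            for child in root:
                text = (child.text or "").strip()
                if text:  # nested elements without direct text are skipped
                    fields[child.tag.split("}")[-1]] = text
            metadata[part] = fields
    return metadata


def print_metadata(metadata):
    """Print each metadata part followed by its tag/value pairs."""
    for part in sorted(metadata):
        print(part)
        for tag, value in sorted(metadata[part].items()):
            print("  {}: {}".format(tag, value))
```

A call such as `print_metadata(read_office_metadata("example.docx"))`, where the file name is a placeholder, would print whatever tags the two parts contain. Returning a dictionary rather than printing directly keeps the parsing step reusable, for example when writing results to a report.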