Mining for PDF metadata
Recipe Difficulty: Easy
Python Version: 2.7 or 3.5
Operating System: Any
PDF documents can represent a wide variety of media, including images, text, and forms. They can also contain structured metadata embedded in the Extensible Metadata Platform (XMP) format, which gives us additional information about the document. In this recipe, we open a PDF with Python and extract the metadata describing the creation and lineage of the document.
Getting started
This recipe requires the installation of the third-party library PyPDF2. All other libraries used in this script are present in Python's standard library. The PyPDF2 module provides us with bindings to read and write PDF files. In our case, we will only use this library to read the metadata stored in the XMP format. To install this library, run the following command:
pip install PyPDF2==1.26.0
Note
To learn more about the PyPDF2 library, visit http://mstamy2.github.io/PyPDF2/.
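As a quick check that the installation works, the following minimal sketch reads the XMP block from a PDF and collects a handful of fields. It assumes the pinned PyPDF2 1.26.0 release, where the reader class is PdfFileReader and the XMP object is returned by getXmpMetadata(); later releases renamed these calls. The helper names summarize_xmp and read_xmp, the XMP_FIELDS selection, and the file path in the usage note are illustrative assumptions, not part of the recipe's script.

```python
# Illustrative selection of attributes exposed by PyPDF2's XMP object
# (version 1.26.0). This list is an assumption made for the sketch; the
# library exposes more fields than these.
XMP_FIELDS = (
    "dc_title",
    "dc_creator",
    "dc_description",
    "xmp_createDate",
    "xmp_modifyDate",
    "xmp_creatorTool",
    "pdf_producer",
)


def summarize_xmp(xmp):
    """Return a dict of the selected XMP fields that are present and non-empty."""
    if xmp is None:
        # getXmpMetadata() returns None when the PDF has no XMP block.
        return {}
    summary = {}
    for field in XMP_FIELDS:
        value = getattr(xmp, field, None)
        if value:
            summary[field] = value
    return summary


def read_xmp(pdf_path):
    """Open a PDF and return its summarized XMP metadata (hypothetical helper)."""
    # Imported here so summarize_xmp() can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileReader

    with open(pdf_path, "rb") as pdf_file:
        reader = PdfFileReader(pdf_file)
        # Summarize while the file is still open.
        return summarize_xmp(reader.getXmpMetadata())
```

Calling read_xmp("example.pdf") on a PDF of your own (the path is a placeholder) should return a dictionary of whichever fields the document defines, or an empty dictionary if it carries no XMP metadata.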
How to do it...
To handle PDFs for this recipe, we follow...