Extract Metadata from PDF using Python

In this tutorial we will explore how to extract metadata from a PDF file using Python.

Introduction

PDF metadata is information about the PDF document itself, such as its title, author, and creation date. These are searchable fields stored in the document, and they can be retrieved programmatically.

To continue following this tutorial we will need the following Python library: pikepdf.

If you don’t have it installed, please open “Command Prompt” (on Windows) and install it using the following command:

PowerShell
pip install pikepdf
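
If you want to confirm that the installation worked before moving on, a quick check like the one below can help. This is only a minimal sketch: the helper name is_installed is made up for this example, and it only checks whether Python can find the module, not whether it works correctly.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    # find_spec returns None when the module cannot be found
    return importlib.util.find_spec(module_name) is not None


# Prints True once "pip install pikepdf" has completed successfully
print("pikepdf installed:", is_installed("pikepdf"))
```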

Sample PDF

To follow along with this tutorial we will need a PDF file to work with.

Let’s reuse one of the PDFs we created in one of our previous tutorials, saved as webpage.pdf.


Extract metadata from PDF using Python

To extract metadata from a PDF using Python, we will follow three simple steps:

  1. Open PDF using pikepdf
  2. Extract metadata from PDF
  3. Print out metadata

And now we can extract the metadata from the PDF using the following code:

Python
import pikepdf

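# Open the PDF with pikepdf; the with block closes the file when done
with pikepdf.Pdf.open('webpage.pdf') as pdf:
    # Extract metadata from the document info dictionary
    pdf_info = pdf.docinfo

    # Print out each metadata key and its value
    for key, value in pdf_info.items():
        print(key, ':', value)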

You should get:

PowerShell
/CreationDate : D:20220624153735-04'00'
/Creator : wkhtmltopdf 0.12.6
/Producer : Qt 4.8.7
/Title : wkhtmltopdf
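
The /CreationDate value in the output above uses the PDF date format: D: followed by the year, month, day, hour, minute, and second, then a UTC offset. If you need a regular Python datetime instead, the sketch below shows one way to convert it using only the standard library. The function name parse_pdf_date and the regular expression are my own for this example and are not part of pikepdf. It assumes the value has been converted to text with str(), and it treats any missing optional field as its lowest value.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional offset such as -04'00' or Z.
# Everything after the year is optional in the PDF date format.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<tz_hour>\d{2})'?(?P<tz_minute>\d{2})?'?)?)?$"
)


def parse_pdf_date(raw):
    """Convert a PDF date string into a datetime (hypothetical helper)."""
    match = _PDF_DATE.match(str(raw).strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {raw!r}")
    parts = match.groupdict()

    # Work out the timezone: none given, UTC ("Z"), or a +/- offset
    sign = parts["sign"]
    if sign is None:
        tz = None
    elif sign == "Z":
        tz = timezone.utc
    else:
        offset = timedelta(
            hours=int(parts["tz_hour"] or 0),
            minutes=int(parts["tz_minute"] or 0),
        )
        tz = timezone(-offset if sign == "-" else offset)

    # Missing optional fields fall back to January, day 1, and midnight
    return datetime(
        int(parts["year"]),
        int(parts["month"] or 1),
        int(parts["day"] or 1),
        int(parts["hour"] or 0),
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
        tzinfo=tz,
    )


created = parse_pdf_date("D:20220624153735-04'00'")
print(created.isoformat())  # 2022-06-24T15:37:35-04:00
```

With a PDF opened in pikepdf, you would likely pass the value in as parse_pdf_date(pdf.docinfo['/CreationDate']), assuming the document actually contains that key.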

Conclusion

In this article we explored how to extract metadata from a PDF using Python and pikepdf.

Feel free to leave comments below if you have any questions or suggestions for edits, and check out more of my Python for PDF tutorials.
