Python’s PyPDF2 library fills PDF templates with Excel data. It’s lightning fast too.
Data Entry. It is slow. It is boring. It is prone to human error. It is rule based and repetitive. It is ripe for automation.
Filling out a PDF template’s fields with the data found in an excel spreadsheet is a common task and one that takes a human an age to complete. In fewer than 100 lines of python code, you can automate this process. Once set up this thing can churn out 2971 copies a minute¹. And that’s without any coffee breaks.
You will need Adobe Reader installed to create a PDF template with labelled fields but it is not required if you have an existing template or are just looking to run this example; the full repo on my GitHub has a template ready to play with! You just have to install
python 3 and two libraries:
PyPDF2 . You can install these by typing in the below commands into your command prompt
pip install pandas
pip install PyPDF2
This will allow you to import these modules into a Python interpreter with the below lines
import pandas as pd
In this example we have the quintessentially boring task of filling out an EIS3 Enterprise Tax Relief form with data found in a csv file. In the downloadable example, you will see the script itself and two folders, “In” and “Out”.
- In Contains PDF template, .csv file with data to fill with
- Out The folder where the filled out templates are output to
pdf_processor.py The main script to execute
Running pdf_processor.py will immediatley show you how quickly this works. The gif below is me running this very example.
The pertinent parts of the script are below. The full, more verbose script, is what you will need to adjust to your requirement (but it’s not any more complex than what is seen in the snippet below).
There are three variables you must adjust to fill out these PDFs effectively:
csv_filename Filename of your csv found in the “In” folder
pdf_filename Filename of your pdf template found in the “In” folder
field_dictionary_1 A dictionary containing the key/value pairs of the field names of the pdf template and the values you wish you put into them.
Finding out the pdf field names to use in
field_dictionary_1 is as simple as inspecting the value of the list variable
pdf_fields (line 15) in a debugger or printing it to the console and using those values as the dictionary keys. These will be identical to the pdf field label names in the template too.
The values for these are extracted from the csv. Here we will use the column name in the .csv as the value. For example line 24
rows['AddressLine1'] will take the row value from the column called ‘AddressLine1’
You will need a separate dictionary object for each page you wish to fill in — in our example only page 1 requires filling so we only have the one dictionary. Otherwise this is all you need to do to get it running — it really is as simple as this!
The documentation for the PyPDF2 is found here but for this job you shouldn’t need much more detail! You will need to consult it if you are trying to do anything a little more sophisticated. https://pythonhosted.org/PyPDF2/
If you are trying to get this to work for your own use case and run into any issues — get in touch! You can find my GitHub profile below. When I’m not surfing or building robots, I’m most probably at my computer tinkering with new ways to automate the really boring stuff so please don’t hesitate to reach out! https://github.com/Hridai
 My very average Huawei Matebook D 14 laptop churns out a PDF copy every ~0.02 seconds. Only 2971 a minute.