PDF manipulation in Python

For the past few months, apart from the data issue, I have been involved in merging multiple PDFs to create a book of readings for the university. Refer to my post: Pdf merging. I am using two different libraries to complete this project: pyPdf and reportlab.

Here is a very simple example that uses the reportlab library to create pages, with or without content, and saves them to an output file. (Note: createPdfPage only returns a buffer holding the generated PDF, so we need PdfFileReader to read the pages back out of it.)

[html]
from pyPdf import PdfFileWriter, PdfFileReader
from reportlab.lib import pagesizes
from reportlab.pdfgen import canvas
from reportlab.lib.units import cm, mm, inch
from StringIO import StringIO

PAGESIZE = pagesizes.A4

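def createPdfPage(nPages, pagesize=None, content=None):
    """Create nPages pages (optionally with some text) and return them as a pdf buffer."""
    buffer = StringIO()
    # no filename is needed: the pdf data is fetched with getpdfdata() below
    c = canvas.Canvas(None)
    if pagesize is None:
        pagesize = PAGESIZE
    c.setPageSize(pagesize)
    c.showOutline()
    for page in range(nPages):
        if content:
            c.drawString(9*cm, 22*cm, content)
        # finish the current page and start the next one
        c.showPage()
    buffer.write(c.getpdfdata())
    # rewind so PdfFileReader can read the buffer from the start
    buffer.seek(0)
    return buffer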

#simple wrapper class around PdfFileWriter
class PdfWriter(object):
    def __init__(self, outputFile):
        self.outputWriter = PdfFileWriter()
        self.__outputFile = outputFile

    def savePdf(self):
        # write all the pages collected so far to the output file
        outputStream = open(self.__outputFile, "wb")
        self.outputWriter.write(outputStream)
        outputStream.close()

    def addPage(self, page):
        self.outputWriter.addPage(page)
   
#create a pdf file with 2 pages: one empty, one with content
outputFile = "test.pdf"

#create PdfWriter
pdfWriter = PdfWriter(outputFile)
#create a page without any content
emptyPageBuffer = createPdfPage(1)
emptyPageReader = PdfFileReader(emptyPageBuffer)
#get the page and append it to the output stream
pdfWriter.addPage(emptyPageReader.getPage(0))

pageWithContent = createPdfPage(1, content="more content")
pageWithContentReader = PdfFileReader(pageWithContent)
#get the page and append it to the output stream
pdfWriter.addPage(pageWithContentReader.getPage(0))

#save the pdf
pdfWriter.savePdf()

Here is another example that merges two files with an additional blank page in between:

[html]
fileOne = "test.pdf"
fileTwo = "test2.pdf"
outputFile = "outputFile.pdf"
#create the writer for the merged output (not for fileOne, which would be overwritten)
pdfWriter = PdfWriter(outputFile)

#pyPdf reads page content lazily, so both input streams
#must stay open until the output has been written
fileOneStream = open(fileOne, "rb")
fileTwoStream = open(fileTwo, "rb")

#create pdfReader for test.pdf and copy all its pages
pdfReader = PdfFileReader(fileOneStream)
for page in range(pdfReader.getNumPages()):
    pdfWriter.addPage(pdfReader.getPage(page))

#create a page without any content
emptyPageBuffer = createPdfPage(1)
emptyPageReader = PdfFileReader(emptyPageBuffer)
#get the page and append it to the output stream
pdfWriter.addPage(emptyPageReader.getPage(0))

#create pdfReader for test2.pdf and copy all its pages
pdfReader = PdfFileReader(fileTwoStream)
for page in range(pdfReader.getNumPages()):
    pdfWriter.addPage(pdfReader.getPage(page))

#save the pdf, then close the input files
pdfWriter.savePdf()
fileOneStream.close()
fileTwoStream.close()

There are a lot of cases where users need to split PDF files using tools like Adobe or other available tools. Although the split PDFs can be viewed in a PDF viewer, some of them might be corrupted, e.g. with no end-of-file marker (%%EOF) at the end of the file. PdfFileReader will not be able to read the PDF if the EOF marker is not found. To fix this:

[html]
#check if the pdf is corrupted, and try to fix it...
def fixPdf(pdfFile):
    try:
        #append in binary mode so the bytes are written unchanged on every platform
        fileOpen = open(pdfFile, "ab")
        #start on a new line so the marker is not glued to the last object
        fileOpen.write("\n%%EOF\n")
        fileOpen.close()
        return "Fixed"
    except Exception, e:
        return "Unable to open file: %s with error: %s" % (pdfFile, str(e))

corruptedFile = "corrupted.pdf"
#open outside the try block so fileStream always exists in the handler
fileStream = open(corruptedFile, "rb")
try:
    pdfReader = PdfFileReader(fileStream)
except Exception:
    fileStream.close()
    print 'error in opening pdf file, try to fix it'
    print fixPdf(corruptedFile)
    #try to reopen the pdf file again
    fileStream = open(corruptedFile, "rb")
    try:
        pdfReader = PdfFileReader(fileStream)
        print 'number of pages: ', pdfReader.getNumPages()
    except Exception:
        print 'this pdf file cannot be fixed'
    finally:
        fileStream.close()
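
Note that fixPdf appends the marker unconditionally, so it also modifies files that were unreadable for some other reason. A minimal sketch of a more cautious variant is below: it checks the tail of the file first and appends only when no marker is found. The function names and the 1024-byte window are my own assumptions for illustration, not part of pyPdf, and the code uses only the standard library.

```python
import os

EOF_MARKER = b"%%EOF"

def has_eof_marker(path, tail_size=1024):
    """Return True if the last tail_size bytes of the file contain %%EOF.

    tail_size is an assumed search window, not a value taken from pyPdf.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # jump close to the end instead of reading the whole file
        f.seek(max(0, size - tail_size))
        return EOF_MARKER in f.read()

def append_eof_if_missing(path):
    """Append an EOF marker only when none is found.

    Returns True if the file was changed, False if it was left alone.
    """
    if has_eof_marker(path):
        return False
    with open(path, "ab") as f:
        f.write(b"\n" + EOF_MARKER + b"\n")
    return True
```

Calling append_eof_if_missing("corrupted.pdf") before handing the file to PdfFileReader leaves healthy files untouched; a file that still fails to open afterwards is probably damaged in some other way.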

Below is an example that gets the details of each individual page in a PDF file; this can be useful for finding inconsistent page sizes within the PDF:

[html]
#get page detail
def getpageBox(page):
    return page.trimBox

def rectangle2box(box):
    #box is a pyPdf RectangleObject; all coordinates are in points
    return {
        'width'   : box.upperRight[0] - box.lowerLeft[0],
        'height'  : box.upperRight[1] - box.lowerLeft[1],
        'offset_x': box.lowerLeft[0],
        'offset_y': box.lowerLeft[1],
        'unit'    : 'pt',
        'units_x' : box.upperRight[0],
        'units_y' : box.upperRight[1],
        }

testFile = "test2.pdf"
fileStream = open(testFile, "rb")
pdfReader = PdfFileReader(fileStream)
for page in range(pdfReader.getNumPages()):
    pageBox = getpageBox(pdfReader.getPage(page))
    rectangleDetail = rectangle2box(pageBox)
    print '--- page number: ', page + 1
    for key in rectangleDetail:
        print "%s\t: %s" % (key, rectangleDetail[key])
fileStream.close()
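
Once the width and height of every page are collected, spotting the inconsistent ones is a small counting exercise. The sketch below is one possible approach, not something pyPdf provides: it treats the most common (rounded) size as the expected one and reports the pages that differ from it. The function name find_odd_sized_pages and the 1 pt tolerance are hypothetical choices of mine.

```python
from collections import Counter

def find_odd_sized_pages(sizes, tolerance=1.0):
    """Find pages whose size differs from the most common page size.

    sizes: list of (width, height) in points, one entry per page.
    Returns (common_size, pages), where pages holds 1-based page numbers.
    """
    if not sizes:
        return None, []
    # round to whole points so tiny float differences count as the same size
    rounded = [(round(float(w)), round(float(h))) for w, h in sizes]
    common = Counter(rounded).most_common(1)[0][0]
    pages = []
    for number, (w, h) in enumerate(sizes, 1):
        if (abs(float(w) - common[0]) > tolerance
                or abs(float(h) - common[1]) > tolerance):
            pages.append(number)
    return common, pages

# three A4 pages and one US Letter page (sizes in points)
sizes = [(595.28, 841.89), (595.28, 841.89), (612, 792), (595.28, 841.89)]
common, pages = find_odd_sized_pages(sizes)
```

The list of sizes could be built from the 'width' and 'height' values returned by rectangle2box for each page.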

Pdf merging

I have been spending the past few weeks trying to merge different PDFs into one book, to provide a complete book of readings for students. With quite a few Python libraries available, this task sounded easy. I worked with reportlab, which has powerful built-in functions, but I do not need these, as all I need is to merge the available PDFs.

It turns out pyPdf has simple functions for this kind of merging. All I do is extract each of the PDF pages using the getPage function and add it to a new PDF document using the addPage function.

After building a simple PDF utils class, I started working on the table of contents of the book. On top of this, I need a title page for each reading before its PDF file is inserted into the book, and I need to make sure each title page starts on an odd page. I had thought of writing these title pages directly to PDF. However, it’s not easy to control the text flow and the font when writing directly to PDF. So I decided to create the table of contents and the reading title pages in an OpenOffice document first, since I can hack content.xml to write the content and render it to PDF. In the end, all I need to do is insert the readings into this PDF based on the table of contents. It’s done. ;)

However, there are a few issues I need to address besides the user requirements. Some of the PDFs are encrypted with a password. Due to copyright issues, I am not supposed to know the password. Because of these PDFs, the book of readings can’t be produced until the encrypted PDFs have been decrypted by the library. Another issue is that some of the PDFs do not have an EOF marker, which makes pyPdf unable to open the file. I need to find a way to add the EOF marker back. sigh…
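
The rule that every title page must start on an odd page comes down to simple page counting: whenever the next free page is even, a blank page goes in first. The sketch below only illustrates that bookkeeping; the function name plan_title_pages and its arguments are hypothetical, and it assumes each reading consists of one title page followed by the reading's own pages.

```python
def plan_title_pages(reading_lengths, first_page=1):
    """Work out where each title page lands and where blank pages are needed.

    reading_lengths: number of pages in each reading, excluding its title page.
    first_page: page number of the first page after the table of contents.
    Returns (plan, total_pages); plan is a list of
    (title_page_number, needs_blank_before) tuples, one per reading.
    """
    plan = []
    next_page = first_page
    for length in reading_lengths:
        needs_blank = (next_page % 2 == 0)
        if needs_blank:
            # a blank page fills the even page so the title lands on an odd one
            next_page += 1
        plan.append((next_page, needs_blank))
        # one title page plus the pages of the reading itself
        next_page += 1 + length
    return plan, next_page - first_page

plan, total_pages = plan_title_pages([3, 2, 4])
```

With the plan in hand, the merging loop can add a blank page (for example one made with createPdfPage) wherever needs_blank_before is true.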

After the user saw the end result, all I need to do now is fix some fonts and add some validation for when a reading does not exist. This is actually a tedious task, as I need to check every page of the PDFs and ensure their fonts are still preserved. An optional requirement is to add a tab on the right side of each odd page of the PDFs. This is for printing purposes, so students can easily flip to the right page.
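
For the optional tab requirement, a rough sketch of the geometry might look like the following. Everything here is an assumption on my part: the tab size, the idea of moving the tab one slot down the right edge for each reading, and the function names. It only computes positions in points (origin at the bottom-left corner, as in reportlab); drawing the rectangle on the page is left out.

```python
def tab_rectangle(reading_index, page_width, page_height,
                  tab_width=20.0, tab_height=60.0):
    """Return (x, y, width, height) of the tab for the given reading.

    Tabs are stacked from the top of the right edge downwards and wrap
    around to the top again when the edge is full.
    """
    # how many tabs fit down the edge (at least one, to avoid dividing by zero)
    slots = max(1, int(page_height // tab_height))
    slot = reading_index % slots
    x = page_width - tab_width
    y = page_height - (slot + 1) * tab_height
    return (x, y, tab_width, tab_height)

def pages_needing_tab(first_page, page_count):
    """Return the odd page numbers within one reading."""
    return [p for p in range(first_page, first_page + page_count)
            if p % 2 == 1]

first_tab = tab_rectangle(0, 600.0, 800.0)
third_tab = tab_rectangle(2, 600.0, 800.0)
odd_pages = pages_needing_tab(9, 5)
```

Because every odd page of one reading gets its tab at the same height, the tabs show up as a solid band on the edge of the printed book, one band per reading.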