Pdf manipulation in Python

For the past few months, apart from the data issue, I have been involved in merging multiple PDFs to create a book of readings for the university (see my post: Pdf merging). I am using two different libraries to complete this project: pyPdf and reportlab.

Here is a very simple example that uses the reportlab library to create pages with or without content and saves them to an output file. (Note: createPdfPage only returns a buffer holding the PDF data, so we need PdfFileReader to read the page from it.)

[html]
from pyPdf import PdfFileWriter, PdfFileReader
from reportlab.lib import pagesizes
from reportlab.pdfgen import canvas
from reportlab.lib.units import cm, mm, inch
from StringIO import StringIO

PAGESIZE = pagesizes.A4

def createPdfPage(nPages, pagesize=None, content=None):
    #build nPages pages in memory and return a buffer holding the pdf data
    buffer = StringIO()
    c = canvas.Canvas(None)
    if pagesize is None:
        pagesize = PAGESIZE
    c.setPageSize(pagesize)
    c.showOutline()
    for page in range(nPages):
        if content:
            c.drawString(9*cm, 22*cm, content)
        c.showPage()
    buffer.write(c.getpdfdata())
    buffer.seek(0)
    return buffer

#simple PdfFileWriter class
class PdfWriter(object):
    def __init__(self, outputFile):
        self.outputWriter = PdfFileWriter()
        self.__outputFile = outputFile
       
    def savePdf(self):
        outputStream = file(self.__outputFile, "wb")
        self.outputWriter.write(outputStream)
        outputStream.close()
   
    def addPage(self, page):
        self.outputWriter.addPage(page)
   
#create 2 pages of pdfs file, one is empty, one with content
outputFile = "test.pdf"

#create PdfWriter
pdfWriter = PdfWriter(outputFile)
#create a page without any content
emptyPageBuffer = createPdfPage(1)
emptyPageReader = PdfFileReader(emptyPageBuffer)
#get the page and append it to the output stream
pdfWriter.addPage(emptyPageReader.getPage(0))

pageWithContent = createPdfPage(1, content="more content")
pageWithContentReader = PdfFileReader(pageWithContent)
#get the page and append it to the output stream
pdfWriter.addPage(pageWithContentReader.getPage(0))

#save the pdf
pdfWriter.savePdf()

Here is another example that merges two files with an additional blank page in between:

[html]
fileOne = "test.pdf"
fileTwo = "test2.pdf"
outputFile = "outputFile.pdf"
#create the writer for the merged output (not fileOne, which is an input)
pdfWriter = PdfWriter(outputFile)

#create pdfReader for test.pdf
fileOneStream = file(fileOne, "rb")
pdfReader = PdfFileReader(fileOneStream)
for page in range(pdfReader.getNumPages()):
    pdfWriter.addPage(pdfReader.getPage(page))

#create a page without any content
emptyPageBuffer = createPdfPage(1)
emptyPageReader = PdfFileReader(emptyPageBuffer)
#get the page and append it to the output stream
pdfWriter.addPage(emptyPageReader.getPage(0))

#create pdfReader for test2.pdf
fileTwoStream = file(fileTwo, "rb")
pdfReader = PdfFileReader(fileTwoStream)
for page in range(pdfReader.getNumPages()):
    pdfWriter.addPage(pdfReader.getPage(page))

#save the pdf first, then close the inputs: pyPdf reads page content lazily,
#so the source streams must stay open until the output has been written
pdfWriter.savePdf()
fileOneStream.close()
fileTwoStream.close()

There are many cases where users need to split a PDF file using tools like Adobe or other available tools. Although the split PDFs can be viewed in a PDF viewer, some of them might be corrupted, e.g. there is no PDF end-of-file marker (%%EOF) at the end of the file. PdfFileReader will not be able to read the PDF if the EOF marker is not found. To fix this:

[html]
#check if the pdf is corrupted, and try to fix it...
def fixPdf(pdfFile):
    try:
        #append in binary mode so the marker is written byte-for-byte
        fileOpen = file(pdfFile, "ab")
        fileOpen.write("\n%%EOF\n")
        fileOpen.close()
        return "Fixed"
    except Exception, e:
        return "Unable to open file: %s with error: %s" % (pdfFile, str(e))

corruptedFile = "corrupted.pdf"
fileStream = file(corruptedFile, "rb")
try:
    pdfReader = PdfFileReader(fileStream)
except Exception:
    fileStream.close()
    print 'error in opening pdf file, try to fix it'
    print fixPdf(corruptedFile)
    #try to reopen the pdf file again
    try:
        fileStream = file(corruptedFile, "rb")
        pdfReader = PdfFileReader(fileStream)
        print 'number of pages: ', pdfReader.getNumPages()
        fileStream.close()
    except Exception:
        print 'this pdf file cannot be fixed'
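
Appending the marker blindly adds a second %%EOF to a file that already has one, so it can help to check the end of the file first. Below is a minimal sketch of that check, written for Python 3 with only the standard library. The helper names has_eof_marker and append_eof_if_missing and the 1024-byte tail window are illustrative assumptions, not part of pyPdf, and the sketch only covers the case where the marker alone is missing:

```python
import os

def has_eof_marker(path, tail_bytes=1024):
    """Return True if '%%EOF' appears in the last tail_bytes of the file."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - tail_bytes))
        return b"%%EOF" in f.read()

def append_eof_if_missing(path):
    """Append the marker only when it is absent; return True if the file changed."""
    if has_eof_marker(path):
        return False
    with open(path, "ab") as f:
        f.write(b"\n%%EOF\n")
    return True
```

Calling append_eof_if_missing twice on the same file changes it at most once, so it is safe to run over a whole folder of split PDFs.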

Below is an example that gets the box details of each individual page in a PDF file. This can be useful for finding inconsistent page sizes within the PDF:

[html]
#get page detail
def getpageBox(page):
    return page.trimBox

def rectangle2box(pdfPage):
    #pdfPage is the box rectangle of a page; all coordinates are in points
    return {
        #the box does not have to start at (0, 0), so subtract the offsets
        'width'   : pdfPage.upperRight[0] - pdfPage.lowerLeft[0],
        'height'  : pdfPage.upperRight[1] - pdfPage.lowerLeft[1],
        'offset_x': pdfPage.lowerLeft[0],
        'offset_y': pdfPage.lowerLeft[1],
        'unit'    : 'pt',
        'units_x' : pdfPage.upperRight[0],
        'units_y' : pdfPage.upperRight[1],
        }

testFile = "test2.pdf"
fileStream = file(testFile, "rb")
pdfReader = PdfFileReader(fileStream)
for page in range(pdfReader.getNumPages()):
    pageBox = getpageBox(pdfReader.getPage(page))
    rectangleDetail = rectangle2box(pageBox)
    print '--- page number: ', page + 1
    for key in rectangleDetail:
        print "%s\t: %s" % (key, rectangleDetail[key])
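
Once the box of each page is known, finding the inconsistent pages is a matter of comparing sizes. Here is a small sketch in plain Python 3, assuming the boxes have already been read into (lowerLeft, upperRight) tuples in points. The helper names box_size and find_odd_sized_pages are hypothetical, and treating the most common size as the "right" one is an assumption:

```python
from collections import Counter

def box_size(lower_left, upper_right):
    """Width and height of a box in points, rounded to 0.1 pt."""
    width = round(upper_right[0] - lower_left[0], 1)
    height = round(upper_right[1] - lower_left[1], 1)
    return (width, height)

def find_odd_sized_pages(boxes):
    """Return the most common page size and the pages that differ from it.

    boxes is a list of (lower_left, upper_right) tuples, one per page.
    Page numbers in the result are 1-based.
    """
    sizes = [box_size(ll, ur) for ll, ur in boxes]
    common = Counter(sizes).most_common(1)[0][0]
    odd = [(number, size)
           for number, size in enumerate(sizes, start=1)
           if size != common]
    return common, odd

# Two A4 pages and one US Letter page
boxes = [
    ((0, 0), (595.3, 841.9)),
    ((0, 0), (595.3, 841.9)),
    ((0, 0), (612, 792)),
]
common, odd = find_odd_sized_pages(boxes)
print(common, odd)
```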

Pdf merging

I have been spending the past few weeks trying to merge different PDFs into one book to provide a complete book of readings for students. With quite a few Python libraries available, this task sounds easy. I worked with reportlab, which has powerful built-in functions, but I do not need them, as all I need is to merge the available PDFs.

It turns out pyPdf has these simple merging functions. All I do is extract each of the PDF pages using the getPage function and add it to a new PDF document using the addPage function.

After building a simple PDF utils class, I started to work on building the table of contents of the book. On top of this, I need a title page for each reading before its PDF file is inserted into the book, and I need to keep track of the page count so that each title page starts on an odd page. I had thought of writing these reading titles directly to PDF. However, it’s not easy to control the text flow and the font if the writing is done directly in PDF. So I decided to create the table of contents and the reading title pages in an OpenOffice document first, since I can hack content.xml to write the content and render it to PDF. In the end, all I need to do is insert the readings into this PDF based on the table of contents. It’s done. ;)

However, there are a few issues I need to address besides the user requirements. Some of the PDFs are encrypted with a password. Due to copyright issues, I am not supposed to know the password. Because of these PDFs, the book of readings can’t be produced until the encrypted PDFs have been decrypted by the library. Another issue is that some of the PDFs do not have an EOF marker, which leaves pyPdf unable to open the file. I need to find a way to add back the EOF marker. sigh…

After the user saw the end result, all I need to do now is fix some fonts and add some validation for when a reading does not exist. This is actually a tedious task, as I need to check each of the pages in the PDFs and ensure the PDF fonts are still preserved. An optional requirement is to add a tab on the right side of each odd page of the PDFs. This is for printing purposes, so students can easily flip through the pages.
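
The odd-page rule for title pages comes down to simple page arithmetic. The following is only a sketch in plain Python 3, not the code used for the book: it assumes every reading gets a one-page title page followed by its own pages, and the function name plan_book is made up for illustration.

```python
def plan_book(readings, first_page=1):
    """Work out where each title page lands and where a blank page is needed.

    readings is a list of (title, number_of_pages) tuples.
    Returns a list of (title, title_page_number, blanks_inserted_before).
    """
    plan = []
    next_page = first_page
    for title, n_pages in readings:
        # a title page must start on an odd page, so pad with one blank if needed
        blanks = 0 if next_page % 2 == 1 else 1
        title_page = next_page + blanks
        plan.append((title, title_page, blanks))
        # one title page plus the pages of the reading itself
        next_page = title_page + 1 + n_pages
    return plan

plan = plan_book([("Reading A", 3), ("Reading B", 2), ("Reading C", 4)])
for title, page, blanks in plan:
    print(title, "title page on", page, "blank pages before:", blanks)
```

The same page numbers can then be written into the table of contents, since they already include the blank pages.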

Converting Transparent GIF to PNG by using pyPIL

Modifying GIF89a files in PIL (Python Imaging Library) is a bit tricky. Unlike GIF87a, GIF89a supports animation as well as a transparent background; however, PIL only supports GIF89a in “read” mode. Thus, when a GIF89a file is modified, some of its information (e.g. the alpha channel) will not be maintained, as PIL will save the file in GIF87a format.

When a MathType object is created in OpenOffice, a replacement object (in GIF89a format) is created as well, so a user who does not have MathType installed is still able to view the formulas. On top of this, the replacement object can also be used as a picture for the web.

As I need to use this replacement object for the web, I still need to modify the size of the replacement GIF file so it can be placed nicely in the browser. However, the only good library for this in Python is PIL. From http://nadiana.com/pil-tips-converting-png-gif, I managed to get a few tips on how to maintain the transparency of the GIF object.

Original Image with pink background:

Original Image

To resize the image:

[html]
import Image
img = Image.open('1.gif')
transparency = img.info['transparency']
#resize returns a new image, so the result has to be kept
img = img.resize((127,47))
img.save('2.gif', transparency=transparency)


However, the quality of the image produced by the above code is really bad:

Resized Image

I tried playing around with the other modes available in PIL to enhance the quality of the image. I realised that in PIL the standard mode for a GIF image is “P” (palette mode), and that when the image is previewed through PIL, the transparent background is shown as pink. To maintain the quality of the image when it’s resized, I need to convert the image to greyscale by using “L” mode (luminance). As all my formula images are in black and white, this conversion is fine. I managed to get a good-quality resized image; however, the pink background is also converted to grey when using “L” mode. So before I convert to “L” mode, I need to set the original image background to white. This is done by hacking the palette of the image.

And since my formulas only use black and white, resizing will break up the black pixels. To work around this, I need to convert the image to “L” mode (luminance) so that the black is converted to greyscale. However, when a transparent image is converted to “L” mode, the background turns black. Thus, before we convert to “L”, we need to change the transparent colour of the original image to white. The information about the transparency of the image can be gathered from transparency = img.info['transparency']. Here the transparency value is 2, which means the third RGB tuple in the palette. Changing the transparency to white:

[html]
def transparent(im):
    transparency = im.info['transparency']
    start = transparency*3  #the palette is a flat list: 3 values (R, G, B) per entry
    p = im.getpalette()  #NOTE: the original image is GIF with "Palette mode", with this we can hack the palette of the image
    for x in range(start, start+3):
        p[x] = 255
    im.putpalette(p)
    return im
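
To make the index arithmetic concrete, here is the same palette hack on a plain Python list, with no PIL involved. This is only an illustration written for Python 3; the four-colour palette and the name whiten_entry are made up. Entry 2 occupies offsets 6, 7 and 8 of the flat list:

```python
def whiten_entry(palette, index):
    """Return a copy of a flat [r0, g0, b0, r1, g1, b1, ...] palette
    with the given entry set to white."""
    start = index * 3
    result = list(palette)
    result[start:start + 3] = [255, 255, 255]
    return result

# entry 0 black, entry 1 grey, entry 2 pink (the transparent colour), entry 3 arbitrary
palette = [0, 0, 0,  128, 128, 128,  255, 0, 255,  10, 20, 30]
result = whiten_entry(palette, 2)
print(result[6:9])
```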

Then we convert the image to “L” mode and resize the image:

[html]
im = im.convert('L')
im = im.resize((127,47), Image.ANTIALIAS)

Now, there are two ways to put the transparency back into the image after it’s resized. I prefer the second option as it’s faster.
The first way is to create an invert function that inverts the value of a pixel, then use Image.eval to apply the function to each pixel of the image.
Next, create a new “L” mode image of the same size with every pixel black. Afterwards, create a multi-band image from the single-band images using Image.merge. This new multi-band image is RGBA: the R, G and B bands are the new black image and the alpha band is the inverted image.

[html]
def convert1(im):
    im = transparent(im)
    im = im.convert('L')
    im = im.resize((127,47), Image.ANTIALIAS)
    def invert(p):
        return 255^p
   
    im = Image.eval(im, invert)
    new = Image.new('L', im.size, 0) #0 is black color
    n = Image.merge('RGBA', (new, new, new, im))
    return n
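
Why the inversion gives the right alpha band can be seen on a handful of pixel values, without PIL. This is a sketch in plain Python 3 (the helper to_rgba is hypothetical): for 8-bit values 255^p is the same as 255-p, so black ink (0) becomes fully opaque (255), the white background (255) becomes fully transparent (0), and the grey edge pixels become partly transparent.

```python
def invert(p):
    # XOR with 255 flips all 8 bits, which equals 255 - p for values 0..255
    return 255 ^ p

def to_rgba(grey_pixels):
    """Mimic the merge step: black RGB bands, inverted greyscale as alpha."""
    return [(0, 0, 0, invert(g)) for g in grey_pixels]

pixels = to_rgba([0, 128, 255])   # ink, anti-aliased edge, background
print(pixels)
```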

The second option is to reverse the palette directly. After the image has been resized, convert the image back to “P” mode and reverse the palette values. The reverse function is a built-in method of the palette list, which performs faster than the invert function above. At this stage, after the palette has been reversed, we need to convert the image back to “L” mode again to get the greyscale image. The rest of the process is the same as above: create a blank black image and merge the blank image and the converted image together into a new multi-band image.

[html]
def convert2(im):
    im = transparent(im)
    im = im.convert('L')
    im = im.resize((127,47), Image.ANTIALIAS)
   
    im = im.convert('P')
    p = im.getpalette()
    p.reverse()
    im.putpalette(p)
   
    im = im.convert('L')
    new = Image.new('L', im.size, 0)
    n = Image.merge('RGBA', (new, new, new, im))
    return n
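
Reversing the flat palette list only amounts to an inversion because the palette is a greyscale ramp at that point. The sketch below (plain Python 3, no PIL) assumes that converting an “L” image to “P” attaches a ramp where entry i is (i, i, i); under that assumption, entry i of the reversed list is (255-i, 255-i, 255-i). With a colour palette the same trick would also swap the red and blue values, as the second example shows.

```python
def grey_ramp():
    """Flat 768-value palette where entry i is (i, i, i)."""
    palette = []
    for level in range(256):
        palette.extend([level, level, level])
    return palette

def entry(palette, i):
    """Return entry i of a flat palette as an (r, g, b) tuple."""
    return tuple(palette[3 * i:3 * i + 3])

reversed_ramp = list(reversed(grey_ramp()))
print(entry(reversed_ramp, 0), entry(reversed_ramp, 255))

# a two-entry colour palette: reversing mirrors the entries AND swaps R with B
colour = [10, 20, 30,  40, 50, 60]
reversed_colour = list(reversed(colour))
print(entry(reversed_colour, 0))
```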

The PNG image produced by both of the convert functions is:

Converted Image

The complete code together with the performance test is:

[html]
import Image, time

def transparent(im):
    p = im.getpalette()
    p[6] = 255
    p[7] = 255
    p[8] = 255
    im.putpalette(p)
    return im

def convert1(im):
    im = transparent(im)
    im = im.convert('L')
    im = im.resize((127,47), Image.ANTIALIAS)
    def invert(p):
        return 255^p
       
    new = Image.new('L', im.size, 0)
    im = Image.eval(im, invert)
    n = Image.merge('RGBA', (new, new, new, im))
    return n
   
def convert2(im):
    im = transparent(im)
    im = im.convert('L')
    im = im.resize((127,47), Image.ANTIALIAS)
   
    im = im.convert('P')
    p = im.getpalette()
    p.reverse()
    im.putpalette(p)
   
    im = im.convert('L')
    new = Image.new('L', im.size, 0)
    n = Image.merge('RGBA', (new, new, new, im))
    return n

im = Image.open('1.gif')
starttime = time.time()
im1 = convert1(im)
im1.save('convert1.png')
endtime = time.time()
print endtime-starttime

print
starttime = time.time()
im2 = convert2(im)
im2.save('convert2.png') #this is faster
endtime = time.time()
print endtime-starttime

The download link for the above code, including the original GIF image, is: imaging.zip

Note that IE6 does not support transparency; IE6 will turn the transparent background into a light grey colour. I found JavaScript code to handle transparent PNGs in IE6. The code can be downloaded from: Transparent PNG problem in Windows IE 6

Working in UNO automation of OpenOffice

Currently, I am working as a casual programmer at the university. One main task that I have been dealing with from the first day of my contract until now is implementing OO UNO automation in our system, called ICE. ICE is developed in Python 2.4 and uses the pyuno bridge to develop UNO components in Python so that we can automate OpenOffice Writer documents.

ICE is mainly used by the Electronic Printing Department to generate course study books and introductory books. The modules for each book are created separately in different word-processing applications such as NeoOffice, MS Word or OpenOffice Writer. After the modules are created, the user uses ICE to generate the complete study book or introductory book, and this is where UNO automation performs its tasks. These tasks include building the book (with a template selected by the user, e.g. the study book template), converting all non-OpenOffice Writer documents to OpenOffice documents, inserting the converted documents into the book, updating all bookmarks, generating the table of contents for the book, and rendering the completed book to PDF and HTML.

Issues that I have faced (more will be added as I come across them in my work):

  • ICE is developed under Python 2.4, and so are all the modules that support ICE, but the stable version of pyuno provided by OpenOffice.org still uses Python 2.3. Python 2.3 is installed together with OpenOffice and can be located in the installation directory (on Linux: /opt/openoffice.orgx.x/program/python; on Windows: C:\Program Files\OpenOffice.org x.x\program\python.bat). ICE uses Python Twisted and runs locally on the user’s machine (it is now in the process of migrating to a server version), so when running pyuno for the automation, Python 2.3 is executed from ICE through the command line. When executing pyuno, all the information of the book is passed to the command line as a data stream (not as a file), and a separate Python process handles the automation. The problem appears when the book document is too big, e.g. a study book of more than 600 pages, which has a big data stream. The command line can’t handle more than 5k of data. One solution that I have implemented is to save the data stream into a temporary file and pass the file name to the automation module. Although the automation can now handle building a big book, this is not the preferred solution. Now I am working on sending the big data stream to an output stream instead of saving it to a file.
  • MathType object issue. The automation command .uno:UpdateAll cannot handle MathType objects properly when the documents are inserted into the book automatically, without the book being visibly opened by the user. .uno:UpdateAll will either cause OpenOffice to shut down, stating “too many windows opened”, or just sit there without doing anything. So, before a document (in odt) is inserted into the book file, I hack into the document’s content.xml, remove all the MathType objects, and then insert the document into the book file. After all the documents have been inserted into the book file, I hack the book’s content.xml again and put back all the MathType objects. Surprisingly, after this MathType hack, .uno:UpdateAll never complains at all, even when the documents are inserted non-visibly. .uno:UpdateAll is also used to build the table of contents of the book. Instead of throwing an error and crashing OpenOffice, .uno:UpdateAll now never builds a correct table of contents unless it is automatically indexed at least twice. This problem is not consistent: sometimes, even when a book without MathType objects is built, the indexing of the table of contents is still wrong. As a temporary solution (again), the user needs to open the book after it has been built and re-index the table of contents themselves; the table of contents will then be correct, since the re-indexing is done with the book visibly opened by the user.
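
The two ways of handing a big data stream to a separate Python process, described in the first issue above, can be sketched with the standard library alone. This is not the ICE code: it is a Python 3 illustration in which the child process is a stand-in one-liner that just reports how many bytes it received, and the names run_with_tempfile and run_via_stdin are invented for the example.

```python
import os
import subprocess
import sys
import tempfile

# stand-in child processes: each prints the number of bytes it received
CHILD_FROM_FILE = "import sys; print(len(open(sys.argv[1], 'rb').read()))"
CHILD_FROM_STDIN = "import sys; print(len(sys.stdin.buffer.read()))"

def run_with_tempfile(data):
    """Save the data to a temporary file and pass only the file name."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        out = subprocess.check_output([sys.executable, "-c", CHILD_FROM_FILE, path])
        return int(out.decode().strip())
    finally:
        os.remove(path)

def run_via_stdin(data):
    """Pipe the data to the child's standard input, with no file on disk."""
    proc = subprocess.run([sys.executable, "-c", CHILD_FROM_STDIN],
                          input=data, stdout=subprocess.PIPE, check=True)
    return int(proc.stdout.decode().strip())

big_stream = b"x" * 200000   # far more than would fit on a command line
print(run_with_tempfile(big_stream), run_via_stdin(big_stream))
```

In both variants only a short argument list goes on the command line, so the size of the book no longer matters there.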