Pdf manipulation in Python


For the past few months, apart from dealing with the data issues, I have been merging multiple PDFs to create a book of readings for the university (see my earlier post: Pdf merging). I am using two libraries for this project: pyPdf and reportlab.

Here is a very simple example that uses the reportlab library to create pages, with or without content, and saves them to an output file. Note that createPdfPage only returns a buffer holding the PDF data, so we need PdfFileReader to read the pages back out of it:

[html]
from pyPdf import PdfFileWriter, PdfFileReader
from reportlab.lib import pagesizes
from reportlab.pdfgen import canvas
from reportlab.lib.units import cm, mm, inch
from StringIO import StringIO

PAGESIZE = pagesizes.A4

def createPdfPage(nPages, pagesize=None, content=None):
    buffer = StringIO()
    c = canvas.Canvas(None)
    if pagesize is None:
        pagesize = PAGESIZE
    c.setPageSize(pagesize)
    c.showOutline()
    for page in range(nPages):
        if content:
            c.drawString(9*cm, 22*cm, content)
        c.showPage()
    buffer.write(c.getpdfdata())
    buffer.seek(0)
    return buffer

#simple PdfFileWriter class
class PdfWriter(object):
    def __init__(self, outputFile):
        self.outputWriter = PdfFileWriter()
        self.__outputFile = outputFile
       
    def savePdf(self):
        outputStream = file(self.__outputFile, "wb")
        self.outputWriter.write(outputStream)
        outputStream.close()
   
    def addPage(self, page):
        self.outputWriter.addPage(page)
   
#create 2 pages of pdfs file, one is empty, one with content
outputFile = "test.pdf"

#create PdfWriter
pdfWriter = PdfWriter(outputFile)
#create a page without any content
emptyPageBuffer = createPdfPage(1)
emptyPageReader = PdfFileReader(emptyPageBuffer)
#get the page and append it to the output stream
pdfWriter.addPage(emptyPageReader.getPage(0))

pageWithContent = createPdfPage(1, content="more content")
pageWithContentReader = PdfFileReader(pageWithContent)
#get the page and append it to the output stream
pdfWriter.addPage(pageWithContentReader.getPage(0))

#save the pdf
pdfWriter.savePdf()

Here is another example that merges two files with an additional blank page in between:

[html]
fileOne = "test.pdf"
fileTwo = "test2.pdf"
outputFile = "outputFile.pdf"
#create the writer for the merged output (not for fileOne, which would be overwritten)
pdfWriter = PdfWriter(outputFile)

#create pdfReader for test.pdf
fileOneStream = open(fileOne, "rb")
pdfReader = PdfFileReader(fileOneStream)
for page in range(pdfReader.getNumPages()):
    pdfWriter.addPage(pdfReader.getPage(page))

#create a page without any content
emptyPageBuffer = createPdfPage(1)
emptyPageReader = PdfFileReader(emptyPageBuffer)
#get the page and append it to the output stream
pdfWriter.addPage(emptyPageReader.getPage(0))

#create pdfReader for test2.pdf
fileTwoStream = open(fileTwo, "rb")
pdfReader = PdfFileReader(fileTwoStream)
for page in range(pdfReader.getNumPages()):
    pdfWriter.addPage(pdfReader.getPage(page))

#save the pdf
pdfWriter.savePdf()

#close the inputs only after saving, as pyPdf may still read page data from them while writing
fileOneStream.close()
fileTwoStream.close()

There are a lot of cases where users need to split PDF files using tools like Adobe or other available tools. Although the split PDFs can be viewed in a PDF viewer, some of them might be corrupted, e.g. there is no end-of-file marker (%%EOF) at the end of the PDF. PdfFileReader will not be able to read the PDF if the EOF marker is not found. To fix this:

[html]
#check if the pdf is corrupted, and try to fix it...
def fixPdf(pdfFile):
    try:
        #append in binary mode so the bytes are written unchanged
        fileOpen = open(pdfFile, "ab")
        fileOpen.write(b"\n%%EOF\n")
        fileOpen.close()
        return "Fixed"
    except Exception as e:
        return "Unable to open file: %s with error: %s" % (pdfFile, str(e))

corruptedFile = "corrupted.pdf"
fileStream = open(corruptedFile, "rb")
try:
    pdfReader = PdfFileReader(fileStream)
except Exception:
    fileStream.close()
    print('error in opening pdf file, try to fix it')
    print(fixPdf(corruptedFile))
    #try to reopen the pdf file again
    try:
        fileStream = open(corruptedFile, "rb")
        pdfReader = PdfFileReader(fileStream)
        print('number of pages: %d' % pdfReader.getNumPages())
        fileStream.close()
    except Exception:
        print('this pdf file cannot be fixed')
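
Appending the marker without looking first would add a second %%EOF to a file that was already fine. Below is a minimal sketch, using only the standard library, that checks the tail of the file before touching it. The names has_eof_marker and ensure_eof_marker and the 1024-byte window are assumptions made for this illustration, and it only repairs a missing marker, not any other kind of damage.

```python
import os

EOF_MARKER = b"%%EOF"

def has_eof_marker(path, tail_size=1024):
    """Return True if %%EOF appears in the last tail_size bytes of the file."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Only the end of the file matters, so skip everything before it.
        f.seek(max(0, size - tail_size))
        return EOF_MARKER in f.read()

def ensure_eof_marker(path):
    """Append %%EOF only when it is missing; return True if the file changed."""
    if has_eof_marker(path):
        return False
    with open(path, "ab") as f:
        f.write(b"\n" + EOF_MARKER + b"\n")
    return True
```

Calling ensure_eof_marker("corrupted.pdf") before opening the file with PdfFileReader would then leave healthy files untouched.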

Below is an example that gets the details of each individual page in a PDF file; this can be useful for finding inconsistent page sizes within the PDF:

[html]
#get page detail
def getpageBox(page):
    return page.trimBox

def rectangle2box(box):
    #width and height are measured between the two corners,
    #so they stay correct when the lower-left corner is not at (0, 0)
    width = float(box.upperRight[0]) - float(box.lowerLeft[0])
    height = float(box.upperRight[1]) - float(box.lowerLeft[1])
    return {
        'width'   : width,
        'height'  : height,
        'offset_x': box.lowerLeft[0],
        'offset_y': box.lowerLeft[1],
        'unit'    : 'pt',
        'units_x' : width,
        'units_y' : height,
        }

testFile = "test2.pdf"
fileStream = open(testFile, "rb")
pdfReader = PdfFileReader(fileStream)
for page in range(pdfReader.getNumPages()):
    pageBox = getpageBox(pdfReader.getPage(page))
    rectangleDetail = rectangle2box(pageBox)
    print('--- page number: %d' % (page + 1))
    for key in rectangleDetail:
        print("%s\t: %s" % (key, rectangleDetail[key]))
fileStream.close()
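
The box values are in PDF points (1/72 inch), which makes odd page sizes hard to spot by eye. The sketch below converts box corners to millimetres and lists the pages whose size differs from the first page. It works on plain (x, y) tuples rather than pyPdf objects, and the function names, the sample corner values and the 1 mm tolerance are assumptions for illustration only.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4

def box_size_mm(lower_left, upper_right):
    """Convert a box given by two (x, y) corners in points to (width, height) in mm."""
    width_pt = float(upper_right[0]) - float(lower_left[0])
    height_pt = float(upper_right[1]) - float(lower_left[1])
    scale = MM_PER_INCH / POINTS_PER_INCH
    return (width_pt * scale, height_pt * scale)

def find_inconsistent_pages(boxes, tolerance_mm=1.0):
    """Return the 1-based page numbers whose size differs from the first page."""
    if not boxes:
        return []
    ref_w, ref_h = box_size_mm(*boxes[0])
    odd_ones = []
    for number, (lower_left, upper_right) in enumerate(boxes, start=1):
        w, h = box_size_mm(lower_left, upper_right)
        if abs(w - ref_w) > tolerance_mm or abs(h - ref_h) > tolerance_mm:
            odd_ones.append(number)
    return odd_ones

# Example corners in points: two A4 pages followed by one US Letter page.
boxes = [
    ((0, 0), (595.28, 841.89)),
    ((0, 0), (595.28, 841.89)),
    ((0, 0), (612, 792)),
]
print(find_inconsistent_pages(boxes))  # [3]
```

With pyPdf, the corners would come from the lowerLeft and upperRight attributes of each page box.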

workflow issue….


Today I sat down the whole day working on generating the selected reading book. I completed the function that merges the PDF files and successfully generated the book. Since the beginning of this additional plugin, I have faced a lot of issues with the data given to me. Part of the problem is the workflow, as the validity of the data is still in doubt. I tested a few courses as my sample data, and each of them had some issue: some PDF files are encrypted, some PDF files were not split properly so my PDF library could not read them, some of the readings are missing from the XML data, and more… Until now, these issues are still not solved.

Now I am stuck on creating a GUI for the user to rearrange the selected reading book… As there are so many conditions I need to cater for in the data (including invalid data), my code grew from 500 lines to almost 1500 lines, and my unit tests are even worse. For now, it’s better to refactor my code before I start the GUI and before I get more confused by my own code. I plan to build a test interface so that when users find an issue with the data, they can use it to see what data I received.

Back to work after dinner! :)

Pdf merging


I have been spending the past few weeks trying to merge different PDFs into one book, to provide a complete book of readings for students. With quite a few Python libraries available, this task sounds easy. I worked with reportlab; it has powerful built-in functions, but I do not need them, as all I need is to merge the available PDFs.

It turns out pyPdf has these simple merging functions. All I do is extract each page of a PDF using the getPage function and add it to a new PDF document using the addPage function.

After building a simple PDF utils class, I started to work on building the table of contents of the book. On top of this, I need a title page for each reading before its PDF is inserted into the book. I also need to keep track of the page count, as each title page needs to start on an odd page. I thought about writing these title pages directly to PDF. However, it’s not easy to control the text flow and the fonts when writing directly to PDF. So I decided to create the table of contents and the reading title pages in an OpenOffice document first, since I can hack content.xml to write the content and then render it to PDF. In the end, all I need to do is insert the readings into this PDF based on the table of contents. It’s done. ;) However, there are a few issues I need to address besides the user requirements. Some of the PDFs are encrypted with a password. Due to copyright issues, I am not supposed to know the password. Because of these PDFs, the book of readings can’t be produced until the encrypted PDFs have been decrypted by the library. Another issue is that some of the PDFs do not have an EOF marker, which makes pyPdf unable to open the file. I need to find a way to add the EOF marker back. sigh…
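
The odd-page rule comes down to simple arithmetic on the running page count. Here is a minimal sketch of that bookkeeping, assuming each reading is described by a (title, page_count) pair where page_count includes the title page. The function name plan_title_pages and the sample readings are made up for illustration and are not taken from my actual utils class.

```python
def plan_title_pages(readings, first_page=1):
    """Work out where each reading's title page lands.

    readings is a list of (title, page_count) pairs, where page_count
    includes the title page itself. Returns a list of
    (title, start_page, blanks_inserted_before) tuples.
    """
    plan = []
    next_page = first_page
    for title, page_count in readings:
        # A title page must start on an odd page; pad with one blank if not.
        blanks = 0 if next_page % 2 == 1 else 1
        start = next_page + blanks
        plan.append((title, start, blanks))
        next_page = start + page_count
    return plan

readings = [("Reading A", 4), ("Reading B", 3), ("Reading C", 2)]
for title, start, blanks in plan_title_pages(readings):
    print(title, start, blanks)
# Reading A 1 0
# Reading B 5 0
# Reading C 9 1
```

The same start pages can be used for the page numbers in the table of contents.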

After the user saw the end result, all I need to do now is fix some fonts and add some validation for readings that do not exist. This is actually a tedious task, as I need to check each page of the PDFs and ensure their fonts are still preserved. An optional requirement is to add a tab at the right side of each odd page of the PDFs. This is for printing purposes, so students can easily flip through the pages.

Converting Transparent GIF to PNG by using pyPIL


Modifying GIF89a files in PIL (Python Imaging Library) is a bit tricky. Unlike GIF87a, GIF89a supports animation as well as a transparent background; however, PIL only supports GIF89a in “read” mode. Thus, when a GIF89a file is modified, some of the information (e.g. the alpha channel) is not maintained, as PIL saves the file in GIF87a format.

When a MathType object is created in OpenOffice, a replacement object (in GIF89a format) is created so that users who do not have MathType installed are still able to view the formulas. On top of this, the replacement object can also be used as a picture for the web.

As I need to use this replacement object for the web, I need to modify the size of the replacement GIF file so it fits nicely in the browser. However, the only good library available in Python is PIL. From http://nadiana.com/pil-tips-converting-png-gif, I managed to get a few tips on maintaining the transparency of the GIF object.

Original Image with pink background:

Original Image

To resize the image:

[html]
import Image
img = Image.open('1.gif')
transparency = img.info['transparency']
img = img.resize((127,47))  #resize returns a new image, so keep the result
img.save('2.gif', transparency=transparency)


However, the quality of the image produced by the above code is really bad:

Resized Image

I tried playing around with the other modes available in PIL to improve the quality of the image. In PIL, the standard mode for a GIF image is “P” (palette mode), and when the image is previewed through PIL, the transparent background shows up as pink. Because my formula images only use black and white, resizing in palette mode breaks up the black pixels. To keep the quality when resizing, I need to convert the image to greyscale with “L” mode (luminance) first, so the resize can produce shades of grey. As all my formula images are black and white, this conversion is fine.

The catch is that the transparent background does not survive the conversion to “L” mode: it is converted along with everything else and ends up grey or black instead of white. So before converting to “L” mode, I need to set the background of the original image to white. This is done by hacking the palette of the image. The palette index of the transparent colour can be read from img.info['transparency']. For my image the value is 2, which means the third RGB tuple in the palette. Changing the transparent colour to white:

[html]
def transparent(im):
    transparency = im.info['transparency']
    start = transparency*3  #each palette entry takes three values: R, G, B
    p = im.getpalette()  #NOTE: the original image is GIF with "Palette mode", with this we can hack the palette of the image
    for i in range(start, start+3):
        p[i] = 255
    im.putpalette(p)
    return im
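
The index arithmetic is easier to see on a plain list. This sketch assumes the palette is a flat list of the form [r0, g0, b0, r1, g1, b1, ...], which is what getpalette() returns; the function name whiten_palette_entry and the four-colour palette are made up for illustration.

```python
def whiten_palette_entry(palette, index):
    """Return a copy of a flat [r0, g0, b0, r1, g1, b1, ...] palette
    with the RGB triple at the given index set to white."""
    start = index * 3
    if start + 3 > len(palette):
        raise IndexError("palette has no entry %d" % index)
    patched = list(palette)
    patched[start:start + 3] = [255, 255, 255]
    return patched

# A made-up four-colour palette: black, grey, pink (transparent), white.
palette = [0, 0, 0, 128, 128, 128, 255, 0, 255, 255, 255, 255]
print(whiten_palette_entry(palette, 2))
# [0, 0, 0, 128, 128, 128, 255, 255, 255, 255, 255, 255]
```

With a transparency index of 2, the values that change sit at positions 6, 7 and 8 of the list.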

Then we convert the image to “L” mode and resize the image:

[html]
im = im.convert('L')
im = im.resize((127,47), Image.ANTIALIAS)

Now, there are two ways to put the transparency back into the image after it’s resized. I prefer the second option as it’s faster.
The first way is to create an invert function that inverts the value of a pixel, then use Image.eval to apply the function to each pixel of the image.
Next, create a new “L” mode image of the same size with every pixel black. Afterwards, create a multi-band image from these single-band images: the new image is RGBA, where R, G and B all come from the new black image and the alpha band is the inverted image.

[html]
def convert1(im):
    im = transparent(im)
    im = im.convert('L')
    im = im.resize((127,47), Image.ANTIALIAS)
    def invert(p):
        return 255^p
   
    im = Image.eval(im, invert)
    new = Image.new('L', im.size, 0) #0 is black color
    n = Image.merge('RGBA', (new, new, new, im))
    return n

The second option is to reverse the palette directly. After the image has been resized, convert it back to “P” mode and reverse the palette values. The reverse function is a built-in list method, which performs faster than the invert function above. At this stage, after the palette has been reversed, we need to convert the image back to “L” mode again to get the greyscale image. The rest of the process is the same as above: create a blank black image and merge it with the converted image into a new multi-band image.

[html]
def convert2(im):
    im = transparent(im)
    im = im.convert('L')
    im = im.resize((127,47), Image.ANTIALIAS)
   
    im = im.convert('P')
    p = im.getpalette()
    p.reverse()
    im.putpalette(p)
   
    im = im.convert('L')
    new = Image.new('L', im.size, 0)
    n = Image.merge('RGBA', (new, new, new, im))
    return n
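
Why does reversing the palette give the same result as inverting every pixel? The sketch below checks this on plain lists. It assumes that a greyscale image converted to “P” mode gets the identity ramp 0, 0, 0, 1, 1, 1, ..., 255, 255, 255 as its palette; I have not verified this for every PIL version, so treat it as an assumption. The helper grey_for_index is made up for illustration.

```python
# Assumption: a greyscale image converted to "P" gets the identity ramp
# 0, 0, 0, 1, 1, 1, ..., 255, 255, 255 as its flat palette.
ramp = [level for level in range(256) for _ in range(3)]

reversed_ramp = list(reversed(ramp))

def grey_for_index(palette, index):
    """Look up the grey level a palette index maps to (red channel)."""
    return palette[index * 3]

# After reversing, index p maps to 255 - p, which is also what 255 ^ p gives.
assert all(
    grey_for_index(reversed_ramp, p) == (255 ^ p) == (255 - p)
    for p in range(256)
)
```

Reversing the flat list also swaps R and B inside each triple, which does no harm here because all three values of a grey entry are equal.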

The PNG image produced by both of the convert functions is:

Converted Image

The complete code together with the performance test is:

[html]
import Image, time

def transparent(im):
    #the transparent colour is palette entry 2, i.e. values 6 to 8 of the flat palette
    p = im.getpalette()
    p[6] = 255
    p[7] = 255
    p[8] = 255
    im.putpalette(p)
    return im

def convert1(im):
    im = transparent(im)
    im = im.convert('L')
    im = im.resize((127,47), Image.ANTIALIAS)
    def invert(p):
        return 255^p
       
    new = Image.new('L', im.size, 0)
    im = Image.eval(im, invert)
    n = Image.merge('RGBA', (new, new, new, im))
    return n
   
def convert2(im):
    im = transparent(im)
    im = im.convert('L')
    im = im.resize((127,47), Image.ANTIALIAS)
   
    im = im.convert('P')
    p = im.getpalette()
    p.reverse()
    im.putpalette(p)
   
    im = im.convert('L')
    new = Image.new('L', im.size, 0)
    n = Image.merge('RGBA', (new, new, new, im))
    return n

im = Image.open('1.gif')
starttime = time.time()
im1 = convert1(im)
im1.save('convert1.png')
endtime = time.time()
print endtime-starttime

print
starttime = time.time()
im2 = convert2(im)
im2.save('convert2.png') #this is faster
endtime = time.time()
print endtime-starttime
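
A single time.time() measurement per function is quite noisy, especially when it also includes saving the file. As a rough sketch of a steadier comparison, the standard timeit module can repeat each call several times and keep the best run. The helper best_time and the two stand-in workloads are made up for illustration; with the script above, the callables would be lambda: convert1(im) and lambda: convert2(im).

```python
import timeit

def best_time(func, repeat=5, number=10):
    """Return the best average time per call, in seconds, over several runs."""
    runs = timeit.repeat(func, repeat=repeat, number=number)
    return min(runs) / number

# Stand-in workloads, used here only so the sketch runs without PIL.
def invert_per_value():
    return [255 ^ p for p in range(256)]

def invert_by_reversing():
    values = list(range(256))
    values.reverse()
    return values

print("per value: %.6f s" % best_time(invert_per_value))
print("reversed:  %.6f s" % best_time(invert_by_reversing))
```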

The download link for the above code, including the original GIF image, is: imaging.zip

Note that IE6 does not support the transparency in these PNG files; it turns the transparent background into a light grey colour. I found JavaScript code to handle transparent PNGs in IE6. The code can be downloaded from: Transparent PNG problen in Window IE 6

Python ElementTree


ElementTree provides nice XML document handling in Python. I find it useful for handling my XHTML documents, as I can easily insert a new element in any part of the document. When I generate an HTML file from LaTeX through tex4ht (see my previous blog post), the HTML file is not in XHTML format. Since I want to use ElementTree to add and remove tag elements in my HTML document, I use the libtidy package from Ubuntu to tidy up my invalid HTML document and then use ElementTree to handle the rest of the manipulation.

Smaller ElementTree packages such as cElementTree can be used to handle simple XML documents. The issue with cElementTree is that it doesn’t support the namespace prefix naming that ElementTree does. So when we generate an XML document through cElementTree with a namespace URI provided, it generates a not-so-nice but valid XML document:

[html]
<ns0:mods xmlns:ns0="http://www.loc.gov/mods/v3">
   <ns0:titleInfo>
       <ns0:title>%s</ns0:title>
   </ns0:titleInfo>
   <ns0:author>name</ns0:author>
</ns0:mods>

ElementTree gives a nicer XML document. With prefix registration:

[html]
ElementTree._namespace_map[MODS_NS] = "mods"

It will generate:

[html]
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
   <mods:titleInfo>
       <mods:title>%s</mods:title>
   </mods:titleInfo>
   <mods:author>name</mods:author>
</mods:mods>
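
Note that _namespace_map is a private attribute. In the ElementTree shipped with the standard library since Python 2.7, the same mapping can be set through the public register_namespace function. Below is a minimal sketch written for Python 3; the mods_tag helper and the sample title text are made up for illustration.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# Public replacement for writing to ElementTree._namespace_map directly.
ET.register_namespace("mods", MODS_NS)

def mods_tag(name):
    """Build a fully qualified tag name in the {uri}name form ElementTree uses."""
    return "{%s}%s" % (MODS_NS, name)

root = ET.Element(mods_tag("mods"))
title_info = ET.SubElement(root, mods_tag("titleInfo"))
title = ET.SubElement(title_info, mods_tag("title"))
title.text = "Example title"

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The serialised document uses the mods: prefix on every element instead of the generated ns0: prefix.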

Eclipse 3.2 and 3.3 Update Site for Pydev, Maven and Subclipse


Whenever I install Eclipse, I always waste time searching for the update sites. These are the dependencies needed to install PyDev and Subclipse in Eclipse (I assume that you know how to add a remote site and install these Eclipse packages):

  • Mylar (no longer needed, since it now comes with PyDev)
    Mylar is a Subclipse dependency. For Eclipse 3.3, Mylar has been renamed to Mylyn, but when I “find and install” the update for Subclipse, Eclipse still asks for the Mylar package. A Mylar update site is not available for Eclipse 3.3, so just use the update site below; your Eclipse 3.3 will still work.
    Update site:
    http://downloads.open.collab.net/eclipse/update-site/e3.2/
  • PyDev
    Python 2.x must be installed on your system, as PyDev will automatically be configured to use it for interpreting and compiling your code.
    Update site:
    http://pydev.org/updates
    or
    http://pydev.sourceforge.net/updates/
  • Subclipse
    After you finish installing the Mylar package, restart Eclipse and then add the Subclipse update site.
    Update site:
    http://subclipse.tigris.org/update_1.6.x
  • Maven
    Update site:
    http://m2eclipse.sonatype.org/update/

Have fun!