I need to pre-produce a million or two PDF files from a simple template (a few pages and tables) with embedded fonts. Usually, I would stay low level in a case like this, and compose everything with a library like ReportLab, but I joined late in the project.

Currently, I have a template.odt and use markers in its content.xml to fill in data from a DB. I can generate the ODT files smoothly, and they always look right.

For the ODT to PDF conversion, I'm using OpenOffice.org in server mode (with PyODConverter over a named pipe), but it's not very reliable: in a batch of documents, there is eventually a point after which all the processed files come out as garbage (wrong fonts and letters sprawled all over the page).

The problem is not predictably reproducible (it does not depend on the data), and happens with OOo 2.3 and 3.2, on Ubuntu, XP, Server 2003 and Windows 7. My Heisenbug detector is ticking.

I tried reducing the batch size and restarting OOo after each batch; still, a small percentage of the documents come out mangled.

Of course I'll write about this on the OOo mailing lists, but in the meantime I have a delivery due and have already lost too much time.

Where do I go?

  1. Completely avoid the ODT format and go for another template system.

    • Suggestions? Anything that takes a few seconds per document is way too slow. OOo takes around a second, which already adds up to 15 days of processing time; I had to write a program to cluster the jobs over several client machines.
  2. Keep the format but go for another tool/program for the conversion.

    • Which one? There are many apps in the Windows shareware and commercial repositories, but trying each one is a daunting task. Some are too slow, some cannot be run in batch without paying first, some cannot be driven from the command line, etc.
    • Open source tools tend not to reinvent the wheel and often depend on OpenOffice.
  3. Converting to an intermediate .DOC format could help avoid the OOo bug, but it would double the processing time and complicate a task that is already hairy enough.

  4. Try to produce the PDFs twice and compare them, discarding the whole batch if there's something wrong.

    • Although the documents look identical, I know of no way to compare the binary content.
  5. Restart OOo after processing each document.

    • It would take a lot more time to produce them.
    • It would lower the percentage of bad files, but make the remaining ones much harder to identify.
  6. Go for ReportLab and recreate the pages programmatically. This is the approach I'm going to try in a few minutes.

  7. Learn to properly format bulleted lists

Thanks a lot.

Edit: it seems I cannot use ReportLab at all; it won't let me embed the font. My font comes in TrueType and OpenType versions.

The TrueType one says "TTFError: Font does not allow subsetting/embedding (0100)".

The OpenType version says "TTFError[...] postscript outlines are not supported".

Very very funny.
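For what it's worth, the "(0100)" in that error is the font's OS/2 `fsType` embedding-permission field: bit 0x0100 means "no subsetting allowed", which is exactly what ReportLab refuses to work around. A minimal stdlib-only sketch (assuming a standard sfnt table layout; a real project would more likely use fontTools' `TTFont(path)["OS/2"].fsType`) to inspect the flag yourself:

```python
import struct

def read_fstype(font: bytes):
    """Return the OS/2 fsType embedding flags of an sfnt (TTF) font,
    or None if the font has no OS/2 table."""
    # Offset table: sfntVersion (uint32), numTables (uint16), then
    # searchRange/entrySelector/rangeShift, which we skip.
    num_tables = struct.unpack(">H", font[4:6])[0]
    for i in range(num_tables):
        # Each table record is 16 bytes, starting at offset 12.
        rec = font[12 + 16 * i : 12 + 16 * (i + 1)]
        tag, _checksum, offset, _length = struct.unpack(">4sIII", rec)
        if tag == b"OS/2":
            # fsType is the uint16 at offset 8 within the OS/2 table.
            return struct.unpack(">H", font[offset + 8 : offset + 10])[0]
    return None

# Relevant fsType bits:
#   0x0002 restricted license (no embedding at all)
#   0x0100 no subsetting -- the "(0100)" ReportLab reports
```

If `read_fstype(open("myfont.ttf", "rb").read()) & 0x0100` is set, the foundry has marked the font non-subsettable, and the fix is licensing/tooling, not code.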

A: 

For your scenario, ReportLab PLUS seems a good fit, including templates and phone support to get you going fast.

extraneon
The commercial version of ReportLab costs several thousand pounds per year as a lease, depending on the number of generated pages (!), and has different pricing again for the financial sector. I don't have that budget at the moment. When I have it running, I'll evaluate it.
Marco Mariani
A: 

Very interesting problem. Since you have already written code to cluster the jobs across several machines, why not use the double-production approach and spread it over EC2 nodes? It will cost a bit extra, but you can compare the outputs using MD5 or SHA hashes, and if the two versions match you can move on.

whatnick
No, converting the same file twice yields two very different binaries.
Marco Mariani
So the conversion process is not deterministic? That's odd. How does the content differ? diff can compare binaries - you can also try this http://www.melaneum.com/blog/linux/pdf-diff
whatnick
Oh, they differ, like this http://imagebin.ca/view/GcLtXR.html
Marco Mariani
+2  A: 

I would probably end up finding some way to determine when the batch processing goes haywire, then reprocess everything from shortly before it failed. How to determine when it goes haywire? That will require analyzing some correct PDFs and some failed ones, to look for similarities among them:

  • generated files aren't the right size compared to their source
  • the files don't contain some string (like the name of your font)
  • some bit of data is not in the expected place
  • when converted back to text, they don't contain expected data from the template
  • when converted to a bitmap, text isn't in the right place

I suspect that converting them back to text and looking for expected strings is going to be the most accurate solution, but also slow. If it's too slow to run on every file, run it on every 1/100th or so, and just reconvert every file after the last known good one.
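The sampling-and-reconvert policy can be sketched independently of how a single file is validated; the `is_valid` callback below is a stand-in for whatever per-file check you settle on (a hypothetical pdftotext string search or bitmap test):

```python
def files_to_reconvert(paths, is_valid, stride=100):
    """Check every `stride`-th output with `is_valid`; on the first failure,
    return everything after the last sampled file that passed."""
    last_good = -1  # index of the last sampled file known to be good
    for i in range(0, len(paths), stride):
        if is_valid(paths[i]):
            last_good = i
        else:
            # Corruption started somewhere between the last good sample
            # and this one, so redo everything after the good sample.
            return paths[last_good + 1 :]
    return []  # every sample passed; trust the batch
```

The trade-off is the `stride`: larger values mean fewer expensive checks but more files needlessly reconverted once a failure is found.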

Gabe
Not with a simple grep. The only way I can think of to detect some of them is to convert to a raster format and check whether anything is written over the page margins. Hairy...
Marco Mariani
I would think that converting to a bitmap and looking for garbage in the margins would work well. If it's slow, just check every hundredth or thousandth. If you need help figuring out how to do that, just make another post. I use ImageMagick for this sort of thing all the time, so it's not too hard.
Gabe
Yes, I'm actually investigating whether "convert -trim" piped through /usr/bin/file works well enough; then I'm going to post-process each batch at the server and refuse the bad ones upon reception. The width of the first page is almost constant for the good ones.
Marco Mariani
Weirdly, the content of the bad PDFs depends on the OS of the client. Under Windows 7, all of them are below a certain size, so I have a fast way to filter them.
Marco Mariani
A: 

For comparing two PDF files I would recommend the i-net PDF content comparer. It can compare two directories of PDF files very well. We use it in our regression test system.

Horcrux7
+1  A: 

For creating such a large number of PDF files, OpenOffice seems to me the wrong product. You should use a real reporting solution that is optimized for generating PDF files in volume. There are many different tools; I would recommend i-net Crystal-Clear.

  • I would expect a single PDF file to be created faster than with OpenOffice.
  • Creating two PDF files and comparing them will cost a lot of speed.
  • It can embed TrueType fonts.
  • With the API you can run the conversion in a loop.
  • With a trial license you can run your batch for 90 days.

The disadvantage is that you must restart your development.

Horcrux7
I'm already planning to rewrite everything, so I'll evaluate it. But although I am not an open-source bigot, pricing that depends on the number of CPUs is definitely a turn-off :-)
Marco Mariani