Summary: How can I reduce the amount of time it takes to convert TIFs to PDFs using iTextSharp?

Background: I'm converting some fairly large TIFs to PDF using C# and iTextSharp, and I am getting extremely bad performance. The TIF files are approximately 50 KB apiece, and some documents have up to 150 separate TIF files (each representing a page). For one 132-page document (~6,500 KB) it took about 13 minutes to convert. During the conversion, the single-CPU server it was running on sat at 100%, leading me to believe the process was CPU bound. The output PDF file was 3.5 MB. I'm OK with the size, but the time taken seems a bit high to me.

Code:

private void CombineAndConvertTif(IList<FileInfo> inputFiles, FileInfo outputFile)
{
    using (FileStream fs = new FileStream(outputFile.FullName, FileMode.Create, FileAccess.ReadWrite, FileShare.None))
    {
        Document document = new Document(PageSize.A4, 50, 50, 50, 50);
        PdfWriter writer = PdfWriter.GetInstance(document, fs);
        document.Open();
        PdfContentByte cb = writer.DirectContent;

        foreach (FileInfo inputFile in inputFiles)
        {
            using (Bitmap bm = new Bitmap(inputFile.FullName))
            {
                int total = bm.GetFrameCount(FrameDimension.Page);

                // Each frame of the multi-page TIF becomes one page in the PDF.
                for (int k = 0; k < total; ++k)
                {
                    bm.SelectActiveFrame(FrameDimension.Page, k);
                    //Testing shows that this line takes the lion's share (80%) of the time involved.
                    iTextSharp.text.Image img =
                        iTextSharp.text.Image.GetInstance(bm, null, true);
                    // Scale so a 200 dpi source renders at its natural size in 72-point PDF space.
                    img.ScalePercent(72f / 200f * 100);
                    img.SetAbsolutePosition(0, 0);

                    cb.AddImage(img);
                    document.NewPage();
                }
            }
        }

        document.Close();
        writer.Close();
    }

}
+1  A: 

I had this exact problem. I ended up using Adobe Acrobat's Batch Processing feature which worked well. I just set up a new Batch Process that converts all the tiffs in a target folder to PDFs written to a destination folder and started it. It was easy to set up but processing took longer than I liked. It did get the job done.

Unfortunately Adobe Acrobat is not free, but you should consider it (weighing the cost of your time to develop a 'free' solution vs. the cost of the software).

Jay Riggs
Unfortunately that wouldn't solve this problem, because the PDFs must be combined in process.
C. Ross
+2  A: 

You're crunching quite a lot of data, so if the PDF export process is slow, and you're not using a fast PC, then you may be stuck with that sort of performance.

The most obvious way to speed this up on a multi-core system would be to multi-thread it.

Break the code into two stages: first convert the images and store them in a list, then output the list to the PDF. With the file sizes you're talking about, holding the entire document in memory during processing shouldn't be a problem.

You can then make the first stage of this process multi-threaded - you could fire off a threadpool thread for each image that needs to be converted, capping the number of active threads (roughly one per CPU core is enough - any more won't gain you much). An alternative is to split your list of inputs into n lists (again, one list per CPU core) and then fire off threads that each process their own list. This reduces the threading overheads, but may result in some threads finishing a long time before others (if their workload turns out to be a lot less), so it may not always work out quite as fast.
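
A rough sketch of that two-stage idea, assuming .NET 4's Parallel class is available (on older frameworks, ThreadPool threads would play the same role) and assuming GetInstance is safe to call from several threads on independent bitmaps - worth verifying before relying on it:

// Requires System.Threading.Tasks in addition to the usings the question's code already needs.
private void CombineAndConvertTifParallel(IList<FileInfo> inputFiles, FileInfo outputFile)
{
    // Stage 1: convert every frame to an iTextSharp image in parallel,
    // keyed by input index so page order is preserved.
    iTextSharp.text.Image[][] pages = new iTextSharp.text.Image[inputFiles.Count][];

    Parallel.For(0, inputFiles.Count,
        new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
        i =>
        {
            using (Bitmap bm = new Bitmap(inputFiles[i].FullName))
            {
                int total = bm.GetFrameCount(FrameDimension.Page);
                iTextSharp.text.Image[] frames = new iTextSharp.text.Image[total];
                for (int k = 0; k < total; ++k)
                {
                    bm.SelectActiveFrame(FrameDimension.Page, k);
                    frames[k] = iTextSharp.text.Image.GetInstance(bm, null, true);
                }
                pages[i] = frames;
            }
        });

    // Stage 2: write the converted images out sequentially
    // (Document/PdfWriter shouldn't be shared between threads).
    using (FileStream fs = new FileStream(outputFile.FullName, FileMode.Create))
    {
        Document document = new Document(PageSize.A4, 50, 50, 50, 50);
        PdfWriter writer = PdfWriter.GetInstance(document, fs);
        document.Open();
        PdfContentByte cb = writer.DirectContent;

        foreach (iTextSharp.text.Image[] frames in pages)
        {
            foreach (iTextSharp.text.Image img in frames)
            {
                img.ScalePercent(72f / 200f * 100);
                img.SetAbsolutePosition(0, 0);
                cb.AddImage(img);
                document.NewPage();
            }
        }

        document.Close();
        writer.Close();
    }
}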

By splitting it into two passes you may also gain performance (even without multithreading), because doing all the input processing and then all the output processing as separate stages will probably reduce the disk seeking involved (depending on how much RAM you have available for disk caches on your PC).

Note that multithreading it won't be of much use if you only have a single-core CPU (though you could still see gains in the parts of the process that are I/O bound, it sounds like you're primarily CPU bound).

You could also experiment with resizing the bitmap using something other than iTextSharp calls - I don't know anything about iTextSharp, but it is possible that its image conversion code is slow, or does not make use of graphics hardware in a way that other scaling techniques may. There may also be some scaling options you can set that will give you a trade-off between quality and speed.

Jason Williams
I'm running this on an old-as-dirt single-core server, unfortunately. Otherwise, given the CPU-bound nature, I would definitely split it. In this situation additional threads would probably have a *negative* effect. Thanks for the excellent answer though.
C. Ross
A: 
//Testing shows that this line takes the lion's share (80%) of the time involved.
iTextSharp.text.Image img =
  iTextSharp.text.Image.GetInstance(bm, null, true);

Might be a stupid suggestion (I don't have a large test set right now to try it locally), but give me the benefit of the doubt:

You're looping through a multi-page TIFF here, selecting frame after frame. bm is this (huge, 6.5 MB) image, in memory. I don't know enough about iTextSharp's internal image handling, but maybe you can help it by providing just a single-page image. Can you try creating a new Bitmap of the desired size, drawing bm onto it (look at the options on the Graphics object for properties related to speed, InterpolationMode for example) and passing in this single image instead of the huge thing on each call?
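
Something along these lines (untested, dropped into the frame loop from the question, so bm, k, cb and document are the same variables as there; whether it actually helps needs measuring):

bm.SelectActiveFrame(FrameDimension.Page, k);

// Copy just the active frame into a standalone, single-frame bitmap.
using (Bitmap page = new Bitmap(bm.Width, bm.Height))
using (Graphics g = Graphics.FromImage(page))
{
    // Favour speed over quality when drawing the frame across.
    g.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.NearestNeighbor;
    g.DrawImage(bm, 0, 0, bm.Width, bm.Height);

    iTextSharp.text.Image img = iTextSharp.text.Image.GetInstance(page, null, true);
    img.ScalePercent(72f / 200f * 100);
    img.SetAbsolutePosition(0, 0);
    cb.AddImage(img);
    document.NewPage();
}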

Benjamin Podszun
+2  A: 

Modify the GetInstance call to pass the image format explicitly:

GetInstance(bm, ImageFormat.Tiff)

This might increase the performance:

iTextSharp.text.Image img = iTextSharp.text.Image.GetInstance(bm, ImageFormat.Tiff);
murugesan
A: 

The trouble is the length of time it takes for iTextSharp to finish messing around with your System.Drawing.Image object.

To speed this up (to literally a tenth of a second in some tests I have run), save the selected frame out to a memory stream and then pass the byte array of data directly to the GetInstance method in iTextSharp; see below...

bm.SelectActiveFrame(FrameDimension.Page, k);

iTextSharp.text.Image img;
using (System.IO.MemoryStream mem = new System.IO.MemoryStream())
{
    // This bypasses the built-in processing iTextSharp would otherwise perform.
    // It will create a larger pdf, though.
    bm.Save(mem, System.Drawing.Imaging.ImageFormat.Png);
    img = iTextSharp.text.Image.GetInstance(mem.ToArray());
}

img.ScalePercent(72f / 200f * 100);
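
Folded back into the frame loop from the question, it would look roughly like this (same variable names as the original method):

for (int k = 0; k < total; ++k)
{
    bm.SelectActiveFrame(FrameDimension.Page, k);

    iTextSharp.text.Image img;
    using (System.IO.MemoryStream mem = new System.IO.MemoryStream())
    {
        // Hand iTextSharp pre-encoded PNG bytes so it skips its own conversion work.
        bm.Save(mem, System.Drawing.Imaging.ImageFormat.Png);
        img = iTextSharp.text.Image.GetInstance(mem.ToArray());
    }

    img.ScalePercent(72f / 200f * 100);
    img.SetAbsolutePosition(0, 0);
    cb.AddImage(img);
    document.NewPage();
}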
Craig McNicholas