tags:

views:

1393

answers:

2

I have a need to search a pdf file to see if a certin string is present. The string in question is definately encoded as text (ie. it is not an image or anything). I have tried just searching the file as though it was plain text, but this does not work.

Is it possible to do this? Are there any librarys out there for .net2.0 that will extract/decode all the text out of pdf file for me?

+2  A: 

There are a few libraries available out there. Check out http://www.codeproject.com/KB/cs/PDFToText.aspx and http://itextsharp.sourceforge.net/

It takes a little bit of effort but it's possible.

volatilsis
+1 for iTextSharp. It should be able to do what you need.
jeremcc
Thanks! This worked great
Nathan Reed
A: 

In the vast majority of cases, it's not possible to search the contents of a PDF directly by opening it up in notepad -- and even in the minority of cases (depending on how the PDF was constructed), you'll only ever be able search for individual words due to the way that PDF handles text internally.

My company has a commercial solution that will let you extract text from a PDF file. I've included some sample code for you below, as shown on this page, that demonstrates how to search through the text from a PDF file for a particular string.

using System;
using System.IO;
using QuickPDFDLL0718;

namespace QPLConsoleApp
{
    public class QPL
    {
        public static void Main()
        {
            // This example uses the DLL edition of Quick PDF Library
            // Create an instance of the class and give it the path to the DLL
            PDFLibrary QP = new PDFLibrary("QuickPDFDLL0718.dll");

            // Check if the DLL was loaded successfully
            if (QP.LibraryLoaded())
            {
                // Insert license key here / Check the license key
                if (QP.UnlockKey("...") == 1)
                {
                    QP.LoadFromFile(@"C:\Program Files\Quick PDF Library\DLL\GettingStarted.pdf");

                    int iPageCount = QP.PageCount();
                    int PageNumber = 1;
                    int MatchesFound = 0;

                    while (PageNumber <= iPageCount)
                    {
                        QP.SelectPage(PageNumber);
                        string PageText = QP.GetPageText(3);

                        using (StreamWriter TempFile = new StreamWriter(QP.GetTempPath() + "temp" + PageNumber + ".txt"))
                        {
                            TempFile.Write(PageText);
                        }

                        string[] lines = File.ReadAllLines(QP.GetTempPath() + "temp" + PageNumber + ".txt");
                        string[][] grid = new string[lines.Length][];

                        for (int i = 0; i < lines.Length; i++)
                        {
                            grid[i] = lines[i].Split(',');
                        }

                        foreach (string[] line in grid)
                        {
                            string FindMatch = line[11];

                            // Update this string to the word that you're searching for.
                            // It can be one or more words (i.e. "sunday" or "last sunday".

                            if (FindMatch.Contains("characters"))
                            {
                                Console.WriteLine("Success! Word match found on page: " + PageNumber);
                                MatchesFound++;
                            }
                        }
                        PageNumber++;
                    }

                    if (MatchesFound == 0)
                    {
                        Console.WriteLine("Sorry! No matches found.");
                    }
                    else
                    {
                        Console.WriteLine();
                        Console.WriteLine("Total: " + MatchesFound + " matches found!");
                    }
                    Console.ReadLine();
                }
            }
        }
    }
}
Rowan