Hey all. Maybe you guys can help me with my project. I'm using PDFCreator as a virtual printer to print some images to a file. The output can be a PDF or any type of image, but I need to extract data from it. Can it be done? I'm using C#.

A: 

You cannot extract text from images directly; that would require OCR.

In principle, you can extract text from PDFs.

Here are two methods using free-software command-line utilities; maybe one of them fits your needs:

  1. pdftotext.exe (part of Foolabs' XPDF utilities)
  2. gswin32c.exe (Artifex' Ghostscript)

Example command lines to extract all text from pages 3-7:

pdftotext:

pdftotext.exe ^
   -f 3 ^
   -l 7 ^
   -eol dos ^
   -layout ^
   "d:\path with spaces\to\input.pdf" ^
   "d:\path\to\output.txt"

You want to get the text output to stdout instead of a file? OK, try this:

pdftotext.exe ^
   -f 3 ^
   -l 7 ^
   -eol dos ^
   -layout ^
   "d:\path with spaces\to\input.pdf" ^
   -
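
If you want to call this from a program rather than from cmd.exe, the stdout variant above can be run as a child process and its output captured. Here is a minimal sketch in Python, chosen for brevity. The helper names `build_pdftotext_args` and `extract_text` are hypothetical, `pdftotext` is assumed to be on the PATH, and the `-enc UTF-8` option is an addition that is not part of the command above; it is there only so the output can be decoded predictably.

```python
import subprocess


def build_pdftotext_args(pdf_path, first_page, last_page, exe="pdftotext"):
    """Build the argument list for the stdout variant of the command.

    Passing a list (instead of one long string) avoids quoting problems
    with paths that contain spaces.
    """
    return [
        exe,
        "-f", str(first_page),   # first page to convert
        "-l", str(last_page),    # last page to convert
        "-eol", "dos",           # CRLF line endings
        "-enc", "UTF-8",         # assumption: not in the original command
        "-layout",               # keep the physical page layout
        pdf_path,
        "-",                     # "-" as output file means stdout
    ]


def extract_text(pdf_path, first_page, last_page, exe="pdftotext"):
    """Run pdftotext and return the extracted text as a string.

    Raises subprocess.CalledProcessError if pdftotext exits with an error.
    """
    args = build_pdftotext_args(pdf_path, first_page, last_page, exe)
    result = subprocess.run(args, capture_output=True, check=True)
    return result.stdout.decode("utf-8")


if __name__ == "__main__":
    # Hypothetical path, matching the example command line.
    print(extract_text(r"d:\path with spaces\to\input.pdf", 3, 7))
```

From C#, the same pattern should work with System.Diagnostics.Process, setting RedirectStandardOutput on the start info and reading from StandardOutput.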

Ghostscript: (Check that your installation has ps2ascii.ps in its lib subdirectory)

gswin32c.exe ^
   -q ^
   -sFONTPATH=c:/windows/fonts ^
   -dNODISPLAY ^
   -dSAFER ^
   -dDELAYBIND ^
   -dWRITESYSTEMDICT ^
   -dSIMPLE ^
   -f ps2ascii.ps ^
   -dFirstPage=3 ^
   -dLastPage=7 ^
   "c:/path/to/input.pdf" ^
   -dQUIET 

Text output will appear on stdout. If you test this in a cmd.exe window, you can redirect this to a file by appending > /path/to/output.txt to the command.

pipitas