I need to find the rectangles that make up the paragraphs and/or blocks of text in a PDF page. I have looked at iTextSharp and DataLogics. The best I have been able to do is find individual words. However, I need to know whether the words are in the same block of text. I am using C#. Does anybody have any ideas?

A: 

This is in Java, but it gets the content from the PDF and then pulls a value out of that content by index.

I am not sure, but you might be able to achieve something similar in C#. Get the content and print it out.

//create a new reader from the source file

PdfReader reader = new PdfReader(fileName);

//create the file array

RandomAccessFileOrArray raf = new RandomAccessFileOrArray(fileName);

//get the content of the pdf reader (which is the source file)

byte[] bContent = reader.getPageContent(1, raf);

ByteArrayOutputStream bs = new ByteArrayOutputStream();

bs.write(bContent);

//create a string of the contents of the page in order to get the data needed

String contentOf1099 = bs.toString();

if(debug) { System.err.println("contentOf1099 = "+contentOf1099); }

//get the value based off an index

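int tdIndex = contentOf1099.indexOf("155 664 Td");

//find the "(" that opens the string drawn at that position

int parenIndex = contentOf1099.indexOf("(", tdIndex);

//take the 11 characters that follow the "("

String value = contentOf1099.substring(parenIndex + 1, parenIndex + 12);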

northpole
birdlips, that last line is really giving me trouble. Can you break that down for me?
Dave
For sure. Basically, what I am doing there is saying "get me the 11 characters that follow the first ( after 155 664 Td". Everything on the PDF has a "location" with a defined address of sorts. If you print out the content of the PDF, you might be able to determine what lies within the rectangle.
northpole
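
To make the "location" idea above more concrete, here is a rough Java sketch that scans a page's content string for every `x y Td` operator that is directly followed by a `(text) Tj` string. The names `TdTextScanner`, `PositionedText` and `scan` are made up for illustration and are not part of iText. It assumes the content is already uncompressed text and that strings are drawn with plain `Tj`. It ignores `TJ` arrays, `Tm` matrices and nested parentheses. Also note that `Td` operands are offsets from the start of the current line, so they only match page coordinates in simple cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TdTextScanner {
    /** A string literal together with the Td operands that preceded it. */
    static final class PositionedText {
        final double x;
        final double y;
        final String text;

        PositionedText(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // A simple signed decimal such as 155 or 72.5 (assumption: no ".5" style numbers).
    private static final String NUMBER = "(-?\\d+(?:\\.\\d+)?)";

    // "x y Td" followed by "(text) Tj"; escaped characters such as \) stay in the text as-is.
    private static final Pattern TD_THEN_TJ = Pattern.compile(
            NUMBER + "\\s+" + NUMBER + "\\s+Td\\s*"
            + "\\(((?:\\\\.|[^\\\\)])*)\\)\\s*Tj");

    /** Returns every Td/Tj pair found in the content, in document order. */
    static List<PositionedText> scan(String content) {
        List<PositionedText> found = new ArrayList<>();
        Matcher m = TD_THEN_TJ.matcher(content);
        while (m.find()) {
            found.add(new PositionedText(
                    Double.parseDouble(m.group(1)),
                    Double.parseDouble(m.group(2)),
                    m.group(3)));
        }
        return found;
    }

    public static void main(String[] args) {
        String content = "BT /F1 12 Tf 155 664 Td (Hello World) Tj ET";
        for (PositionedText t : scan(content)) {
            System.out.println(t.x + ", " + t.y + " -> " + t.text);
        }
    }
}
```

This avoids hard-coding one position and a fixed character count, but it is still string matching on raw operators, so it will miss text on pages that position or draw strings any other way.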
A: 

Unless it's a structured (tagged) PDF, this information is not going to exist. The PDF is a set of drawString commands at locations; there are no paragraph or space markers. You need to work this out from the text positions.
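
One way to "work this out from the text positions" is to treat the word rectangles you already have as nodes and join any two that sit close together. Each connected group is then a candidate block, and its bounding box is the rectangle you want. The Java sketch below shows that idea (it should port to C# almost line for line). `TextBlockGrouper`, `Box`, `gapX` and `gapY` are hypothetical names, the gap thresholds are assumptions you would tune (for example from the font size), and coordinates are assumed to be lower-left/upper-right corners in page units.

```java
import java.util.ArrayList;
import java.util.List;

class TextBlockGrouper {
    /** Axis-aligned box: (x0, y0) is the lower-left corner, (x1, y1) the upper-right. */
    static final class Box {
        final double x0, y0, x1, y1;

        Box(double x0, double y0, double x1, double y1) {
            this.x0 = x0; this.y0 = y0; this.x1 = x1; this.y1 = y1;
        }

        /** True if the boxes overlap or lie within gapX/gapY of each other. */
        boolean near(Box o, double gapX, double gapY) {
            return x0 - gapX <= o.x1 && o.x0 <= x1 + gapX
                && y0 - gapY <= o.y1 && o.y0 <= y1 + gapY;
        }

        /** Smallest box that contains both boxes. */
        Box union(Box o) {
            return new Box(Math.min(x0, o.x0), Math.min(y0, o.y0),
                           Math.max(x1, o.x1), Math.max(y1, o.y1));
        }
    }

    /** Returns one bounding box per group of mutually nearby word boxes. */
    static List<Box> group(List<Box> words, double gapX, double gapY) {
        int n = words.size();
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;

        // Join every pair of nearby words (O(n^2), fine for a single page).
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (words.get(i).near(words.get(j), gapX, gapY)) {
                    int ri = find(parent, i), rj = find(parent, j);
                    if (ri != rj) parent[rj] = ri;
                }
            }
        }

        // Accumulate a bounding box at the root of each group.
        Box[] bounds = new Box[n];
        for (int i = 0; i < n; i++) {
            int r = find(parent, i);
            bounds[r] = (bounds[r] == null) ? words.get(i) : bounds[r].union(words.get(i));
        }
        List<Box> blocks = new ArrayList<>();
        for (Box b : bounds) if (b != null) blocks.add(b);
        return blocks;
    }

    // Union-find lookup with path halving.
    private static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    }
}
```

With a small horizontal gap and a vertical gap of roughly one line spacing, words on the same line and on adjacent lines merge into one block, while separate columns stay apart. Real pages usually need extra rules, such as different thresholds per font size, so treat this as a starting point.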