




I need to find the rectangles that make up the paragraphs and/or blocks of text on a PDF page. I have looked at iTextSharp and DataLogics. The best I have been able to do is find individual words. However, I need to know whether the words are in the same block of text. I am using C#. Does anybody have any ideas?


This is in Java, but it deals with getting the content from the PDF and then getting a value at an index within that content.

I am not sure, but you might be able to achieve something similar in C#. Get the content and print it out.

//create a new reader from the source file

PdfReader reader = new PdfReader(fileName);

//create the file array

RandomAccessFileOrArray raf = new RandomAccessFileOrArray(fileName);

//get the content of the pdf reader (which is the source file)

byte bContent [] = reader.getPageContent(1,raf);

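ByteArrayOutputStream bs = new ByteArrayOutputStream();

//copy the page bytes into the stream, otherwise the string below is empty

bs.write(bContent, 0, bContent.length);
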

//create a string of the contents of the page in order to get the data needed

String contentOf1099 = bs.toString();

if(debug) { System.err.println("contentOf1099 = "+contentOf1099); }

//get the value based off an index

String value = contentOf1099.substring(contentOf1099.indexOf("(",contentOf1099.indexOf("155 664 Td"))+1,contentOf1099.indexOf("(",contentOf1099.indexOf("155 664 Td"))+12);

birdlips, that last line is really giving me trouble. Can you break that down for me?
For sure. Basically, what I am doing there is saying "get me the characters that follow the opening parenthesis after 155 664 Td" (as written, the substring call returns 11 of them). Everything on the PDF has a "location" with a defined address of sorts. If you print out the content of the PDF, you might be able to determine what lies within the rectangle.
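
A fixed-length substring breaks as soon as the text is longer or shorter than expected. Here is a rough sketch of the same lookup that reads up to the closing parenthesis instead. The class and method names (TdTextLookup, textAfter) are made up for illustration, and it assumes the page content is uncompressed text in which the string is drawn as a literal "(...)" right after the positioning operator. Escapes such as \n or octal codes are not decoded here.

```java
class TdTextLookup {

    /**
     * Returns the literal string that follows the given positioning
     * operator (for example "155 664 Td"), or null if none is found.
     * Hypothetical helper, not part of iText.
     */
    static String textAfter(String content, String operator) {
        int opIndex = content.indexOf(operator);
        if (opIndex < 0) {
            return null;
        }
        int open = content.indexOf('(', opIndex + operator.length());
        if (open < 0) {
            return null;
        }
        StringBuilder text = new StringBuilder();
        int depth = 1; // PDF literal strings may contain balanced parentheses
        for (int i = open + 1; i < content.length(); i++) {
            char c = content.charAt(i);
            if (c == '\\' && i + 1 < content.length()) {
                // Keep the escaped character as is (no decoding of \n, octal, etc.)
                text.append(content.charAt(++i));
                continue;
            }
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    return text.toString();
                }
            }
            text.append(c);
        }
        return null; // unterminated string
    }
}
```

Calling TdTextLookup.textAfter(contentOf1099, "155 664 Td") would then replace the long substring line above, with a null check for pages where the operator is missing.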

Unless it's a structured PDF, this is not going to exist. The PDF is a set of drawString commands at locations - there are no paragraph or space markers. You need to work this out from the text positions.
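
To give an idea of what "working it out from the text positions" could look like, here is a minimal sketch in Java (it should port to C# easily). It assumes you already have a bounding box per word from whatever library you use; the Box type, the TextBlockGrouper class and the maxLineGap parameter are all hypothetical names, not part of any PDF library. It also assumes a single-column page with coordinates in PDF user space (origin at the bottom-left). Words whose vertical ranges overlap are merged into a line, and consecutive lines are merged into a block when the gap between them is small relative to the line height.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TextBlockGrouper {

    /** Axis-aligned rectangle; hypothetical type for illustration. */
    static class Box {
        final double left, bottom, right, top;

        Box(double left, double bottom, double right, double top) {
            this.left = left;
            this.bottom = bottom;
            this.right = right;
            this.top = top;
        }

        double height() {
            return top - bottom;
        }

        Box union(Box o) {
            return new Box(Math.min(left, o.left), Math.min(bottom, o.bottom),
                           Math.max(right, o.right), Math.max(top, o.top));
        }
    }

    /**
     * Groups word boxes into block rectangles.
     * maxLineGap is the largest allowed gap between two lines of the
     * same block, as a fraction of the line height (0.5 is a guess).
     */
    static List<Box> groupIntoBlocks(List<Box> words, double maxLineGap) {
        // Sort top-to-bottom, then left-to-right.
        List<Box> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingDouble((Box b) -> -b.bottom)
                              .thenComparingDouble(b -> b.left));

        // Step 1: words that overlap vertically form one line.
        List<Box> lines = new ArrayList<>();
        for (Box word : sorted) {
            int lastIndex = lines.size() - 1;
            if (lastIndex >= 0) {
                Box last = lines.get(lastIndex);
                if (word.bottom < last.top && word.top > last.bottom) {
                    lines.set(lastIndex, last.union(word));
                    continue;
                }
            }
            lines.add(word);
        }

        // Step 2: lines that are close together and overlap horizontally form one block.
        List<Box> blocks = new ArrayList<>();
        for (Box line : lines) {
            int lastIndex = blocks.size() - 1;
            if (lastIndex >= 0) {
                Box last = blocks.get(lastIndex);
                double gap = last.bottom - line.top;
                boolean overlapsHorizontally = line.left < last.right && line.right > last.left;
                if (gap <= maxLineGap * line.height() && overlapsHorizontally) {
                    blocks.set(lastIndex, last.union(line));
                    continue;
                }
            }
            blocks.add(line);
        }
        return blocks;
    }
}
```

The threshold will need tuning per document, and multi-column layouts, tables and rotated text would need extra handling (for example, splitting lines on large horizontal gaps before step 2).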