I need to find the rectangles that make up the paragraphs and/or blocks of text in a PDF page. I have looked at iTextSharp and DataLogics. The best I have been able to do is find individual words. However, I need to know whether the words are in the same block of text. I am using C#. Does anybody have any ideas?
This is in Java, but it gets the page content from the PDF and then pulls a value out of that content by its index.
I am not sure, but you might be able to achieve something similar in C#. Get the content and print it out so you can see where the text you need sits.
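//PdfReader and RandomAccessFileOrArray are iText classes (com.itextpdf.text.pdf in iText 5, com.lowagie.text.pdf in older releases)
//ByteArrayOutputStream is java.io.ByteArrayOutputStream
//create a new reader from the source file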
PdfReader reader = new PdfReader(fileName);
//create the file array
RandomAccessFileOrArray raf = new RandomAccessFileOrArray(fileName);
//get the content of the pdf reader (which is the source file)
byte bContent [] = reader.getPageContent(1,raf);
ByteArrayOutputStream bs = new ByteArrayOutputStream();
bs.write(bContent);
//create a string of the contents of the page in order to get the data needed
String contentOf1099 = bs.toString();
if(debug) { System.err.println("contentOf1099 = "+contentOf1099); }
//get the value based off an index: the 11 characters after the "(" that follows "155 664 Td"
int open = contentOf1099.indexOf("(", contentOf1099.indexOf("155 664 Td"));
String value = contentOf1099.substring(open + 1, open + 12);
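To try the index lookup without a PDF at hand, here is a minimal, self-contained sketch of the same idea. The class name TdValueSketch, the helper textAfter, and the sample content string are all made up for illustration; a real page's content will look different, which is why printing it first helps. Instead of a fixed length, it reads up to the closing parenthesis.

```java
public class TdValueSketch {

    // Returns the literal string that follows the given positioning
    // operator (for example "155 664 Td"), or null if it is not found.
    // Hypothetical helper, not part of iText.
    static String textAfter(String content, String operator) {
        int op = content.indexOf(operator);
        if (op < 0) {
            return null;
        }
        // The shown text starts at the first "(" after the operator...
        int open = content.indexOf('(', op);
        if (open < 0) {
            return null;
        }
        // ...and ends at the next ")" (escaped parentheses are not handled).
        int close = content.indexOf(')', open + 1);
        if (close < 0) {
            return null;
        }
        return content.substring(open + 1, close);
    }

    public static void main(String[] args) {
        // Made-up content stream fragment standing in for the page content.
        String content = "BT\n/F1 10 Tf\n155 664 Td\n(123-45-6789) Tj\nET\n";
        System.out.println(textAfter(content, "155 664 Td"));
    }
}
```

This only works when the coordinates in the operator are known ahead of time and the text is written as a plain literal string. It does not handle escaped parentheses inside the string, and it says nothing about which words belong to the same block.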