I am using iText to read text from PDF documents, and I am getting an ArrayIndexOutOfBoundsException. The strange thing is that it only happens for certain files, and only at certain pages within those files. I suspect it has something to do with the way the PDF is encoded at those locations, but I can't figure out what the problem is.

I have looked at this question, http://stackoverflow.com/questions/1637505/read-pdf-using-itext, but the asker there seems to have solved his problem by changing the location of the file. That is not going to work for me, because I get the exception at specific pages within some files, so it is not the file itself but the page in question that causes the exception.

The stack trace is

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Invalid index: 02
    at com.lowagie.text.pdf.CMapAwareDocumentFont.decodeSingleCID(Unknown Source)
    at com.lowagie.text.pdf.CMapAwareDocumentFont.decode(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.decode(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.displayPdfString(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor$ShowText.invoke(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.processContent(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(Unknown Source)
    at com.pdfextractor.main.Extractor.main(Extractor.java:61)

And line 61 corresponds to this line:
content = extractor.getTextFromPage(page);
So the exception is clearly thrown from inside the getTextFromPage() call.

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.parser.PdfTextExtractor;

public class Extractor {
    public static void main(String[] args) throws IOException {
        // Keywords to search for (not used yet in this snippet).
        List<String> keywords = new ArrayList<String>();
        keywords.add("location");
        keywords.add("Mass Spectrometry");
        keywords.add("vacuole");
        keywords.add("cytosol");

        String directory = "C:/Ankur/Projects/PEB/Extractor/papers/";
        String[] filesToRead = new File(directory).list();

        for (String nextFile : filesToRead) {
            PdfReader reader = new PdfReader(directory + nextFile);
            int noPages = reader.getNumberOfPages();
            PdfTextExtractor extractor = new PdfTextExtractor(reader);

            String content = "";
            // Page numbers start at 1.
            for (int page = 1; page <= noPages; page++) {
                System.out.println(page);
                // This is the call that throws on some pages.
                content = extractor.getTextFromPage(page);
            }
        }
    }
}
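One way to narrow this down is to catch the exception per page and record which pages fail, instead of letting the first failure end the run. The sketch below is only a diagnostic aid, not a fix for the decoding problem. To keep it runnable without iText, it uses a hypothetical `PageExtractor` interface as a stand-in for `PdfTextExtractor.getTextFromPage(int)`; the names `PageProbe` and `findFailingPages` are likewise made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PageProbe {

    /** Hypothetical stand-in for PdfTextExtractor.getTextFromPage(int). */
    public interface PageExtractor {
        String getTextFromPage(int page) throws Exception;
    }

    /** Tries every page from 1 to noPages and returns the pages that threw. */
    public static List<Integer> findFailingPages(PageExtractor extractor, int noPages) {
        List<Integer> failed = new ArrayList<Integer>();
        for (int page = 1; page <= noPages; page++) {
            try {
                extractor.getTextFromPage(page);
            } catch (Exception e) {
                // Log the failure and keep going so every bad page is reported.
                System.err.println("Page " + page + " failed: " + e);
                failed.add(page);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Stub that mimics the behavior described here: one page out of 39 throws.
        PageExtractor stub = new PageExtractor() {
            public String getTextFromPage(int page) {
                if (page == 31) {
                    throw new ArrayIndexOutOfBoundsException("Invalid index: 02");
                }
                return "text of page " + page;
            }
        };
        System.out.println(findFailingPages(stub, 39));
    }
}
```

With the real library, the stub would be replaced by an implementation that delegates to `extractor.getTextFromPage(page)` from the loop above. The resulting list of page numbers makes it easier to compare the failing pages (for example, the fonts they use) against the pages that extract cleanly.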
+1  A: 

Most Java classes and libraries index a method like getTextFromPage(int) starting at 0, meaning that getTextFromPage(0) would return the text from page 1 and getTextFromPage(1) the text from page 2.

Your for loop that causes the ArrayIndexOutOfBoundsException is indexed starting with 1.

Are you sure that iText's getTextFromPage(int) is indexed starting at 1 rather than the (almost) standard 0?

matt b
No, the pages start from 1; I just verified that. The error occurs when page=31 and there are 39 pages, so it's really strange.
Ankur
That's true. Page numbering does start from 1 instead of 0.
Kushal Paudyal
A: 

Have you tried posting on the very active iText mailing list?

mark stephens
Yep - the same question has been posted there three times and not been answered.
Ankur