I am using iText to read from a PDF doc. I am getting an ArrayIndexOutOfBoundsException. The strange thing is it only happens for certain files and at certain locations in those files. I suspect it's something to do with the way the PDF is encoded at those locations but can't figure out what the problem is.
I have looked at this question http://stackoverflow.com/questions/1637505/read-pdf-using-itext but he seems to have solved his problem by changing the location of this file. This is not going to work for me as I get the exception at certain locations within some files - so it's not the file itself but the page in question that is causing the exception.
The stack trace is
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Invalid index: 02 at com.lowagie.text.pdf.CMapAwareDocumentFont.decodeSingleCID(Unknown Source) at com.lowagie.text.pdf.CMapAwareDocumentFont.decode(Unknown Source) at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.decode(Unknown Source) at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.displayPdfString(Unknown Source) at com.lowagie.text.pdf.parser.PdfContentStreamProcessor$ShowText.invoke(Unknown Source) at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(Unknown Source) at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.processContent(Unknown Source) at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(Unknown Source) at com.pdfextractor.main.Extractor.main(Extractor.java:61)
And line 61 corresponds to this line:
content = extractor.getTextFromPage(page);
So it seems quite obvious that the getTextFromPage() method is not working.
public static void main(String[] args) throws IOException{
ArrayList<String> keywords = new ArrayList<String>();
keywords.add("location");
keywords.add("Mass Spectrometry");
keywords.add("vacuole");
keywords.add("cytosol");
String directory = "C:/Ankur/Projects/PEB/Extractor/papers/";
File directoryToRead = new File(directory);
String[] sa_filesToRead = directoryToRead.list();
List<String> filesToRead = Arrays.asList(sa_filesToRead);
Iterator<String> fileItr = filesToRead.iterator();
while(fileItr.hasNext()){
String nextFile = fileItr.next();
PdfReader reader = new PdfReader(directory+nextFile);
int noPages = reader.getNumberOfPages();
PdfTextExtractor extractor = new PdfTextExtractor(reader);
String content="";
for(int page=1;page<=noPages;page++){
int index = 1;
System.out.println(page);
content = extractor.getTextFromPage(page);
}
}
}