I have a PDF which uses the 'UniCNS-UCS2-H' font.
I tried both PDFBox and PDFRenderer, and both throw an exception:
Unknown encoding for 'UniCNS-UCS2-H'
The font is included in the font file mingliu.ttc (it's a TrueType Collection file; I don't know whether that matters).
What can I do to make these two libraries support additional fonts...
I am using PDFBox from C# to extract text from a PDF file. That works fine, but when the parser comes across a table it extracts the text and destroys the formatting. How can I extract text from a table and keep the formatting? Please help; I need sample code.
Thank you!
Don
...
Hello,
I am writing my Master's thesis on an NLP system. One of its components is an extractor, which extracts plain text from PDF files. A few PDF files cannot be extracted correctly. For those, the extractor (PDFBox library) returns a string like this:
"┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h"
or
"10a61a91a22a25a3a27a17a23a20a...
Started playing with PDFBox
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage( page );
PDFont font = PDType1Font.HELVETICA_BOLD;
PDPageContentStream contentStream = new PDPageContentStream(document, page);
contentStream.beginText();
contentStream.setFont( font, 12 );
contentStream.moveTextPositionByA...
Hi! I need to parse a PDF file which contains tabular data. I'm using PDFBox to extract the file's text so that I can parse the result (a String) later. The problem is that text extraction doesn't work as I expected for tabular data. For example, I have a file which contains a table like this (7 columns: the first two always have data, only one Comple...
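A rough sketch of one common approach to tables with empty cells: rather than parsing the flattened String, keep each text fragment's position and assign it to a column by its x coordinate. Everything named here is hypothetical and independent of PDFBox: the `Fragment` holder, the `columnStarts` boundaries (assumed to be known in advance and sorted ascending), and the assumption that y grows downward. It only illustrates the grouping logic, not how to obtain the positions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TableRows {
    /** Hypothetical holder for one extracted text fragment and its page position. */
    public static final class Fragment {
        final double x;
        final double y;
        final String text;

        public Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Index of the last column whose left edge is at or before x (starts must be ascending). */
    static int columnOf(double x, double[] columnStarts) {
        int col = 0;
        for (int i = 0; i < columnStarts.length; i++) {
            if (x >= columnStarts[i]) {
                col = i;
            }
        }
        return col;
    }

    /** Groups fragments into rows by similar y, then places each fragment in its column. */
    public static List<String[]> toRows(List<Fragment> fragments, double[] columnStarts, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble((Fragment f) -> f.x));

        List<String[]> rows = new ArrayList<>();
        String[] current = null;
        double rowY = 0;
        for (Fragment f : sorted) {
            // Start a new row when the fragment is too far below the row's first fragment.
            if (current == null || Math.abs(f.y - rowY) > yTolerance) {
                current = new String[columnStarts.length];
                java.util.Arrays.fill(current, "");
                rows.add(current);
                rowY = f.y;
            }
            int col = columnOf(f.x, columnStarts);
            // Cells with no fragment stay as empty strings, so columns do not shift left.
            current[col] = current[col].isEmpty() ? f.text : current[col] + " " + f.text;
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Fragment> fs = new ArrayList<>();
        fs.add(new Fragment(10, 100, "A"));
        fs.add(new Fragment(50, 100.5, "B"));
        fs.add(new Fragment(10, 120, "C"));
        fs.add(new Fragment(90, 120, "D"));
        for (String[] row : toRows(fs, new double[] {0, 40, 80}, 2.0)) {
            System.out.println(String.join("|", row));
        }
    }
}
```

The point of the fixed column boundaries is that a row with an empty middle cell still yields three cells, which a whitespace split on the extracted String cannot guarantee.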
I'm trying to create an application which will basically be a catalogue of my PDF collection. We are talking about 15-20 GB containing tens of thousands of PDFs. I am also planning to include a full-text search mechanism. I will be using Lucene.NET for search (actually, NHibernate.Search), and a library for PDF-to-text conversion. Which wo...
Hi,
I have been using PDFBox to extract text information from PDFs. I have successfully parsed all the text properties, such as font name, font face, size, position, etc.
PROBLEM: I am using PDFBox 1.2.1 (the latest version). The getCharacter() method in the TextPosition class returns the full string except the last character. The last character is par...
Hi,
How can I parse a PDF paper's metadata using the PDFBox library?
Regards,
Magggi
...
Hi,
I would like to build the latest version of PDFBox (http://pdfbox.apache.org/userguide/dot_net.html) for use within my .NET project.
I have no experience with Java whatsoever, but I am following the steps described here:
http://www.ikvm.net/userguide/tutorial.html
I am using the following versions:
- IKVM (0.42.0.6)
- PDFBox (1.2.1) JAR...
I need to edit existing PDF properties or set new ones, such as author name, title, subject, etc., from a Java application. Is there any way to do that? I have found the Apache PDFBox library, but I don't know whether it will solve my problem.
...
I want to merge many PDF files into one using PDFBox and this is what I've done:
PDDocument document = new PDDocument();
for (String pdfFile : pdfFiles) {
    PDDocument part = PDDocument.load(pdfFile);
    List<PDPage> list = part.getDocumentCatalog().getAllPages();
    for (PDPage page : list) {
        document.addPage(page);
    }
...
I am using PDFBox for a project, and for one specific task I need to pull an image out of a PDF file. The image I need is always on a specific page. The PDF itself is usually pretty large, 200+ pages. The problem is that every time I need to pull out this image I call getDocumentCatalog().getAllPages() ... and this takes a significant amount ...
Hello all,
I am using the Apache PDFBox Java library to create PDFs. Is there a way to create a data table using PDFBox? If there is no such API, I would have to draw the table manually using drawLine etc. Any suggestions on how to go about this?
Thanks
-Keshav
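For the manual route mentioned above, a minimal sketch of the geometry only: it computes the line segments of a grid so that each one can be handed to whatever line-drawing call is used. The class and method names are hypothetical, nothing here calls PDFBox, and it assumes PDF-style coordinates where y decreases going down the page.

```java
import java.util.ArrayList;
import java.util.List;

public class TableGrid {
    /**
     * Returns segments as {x1, y1, x2, y2} for a table whose top-left corner
     * is at (left, top). Horizontal rules come first, then vertical rules.
     */
    public static List<double[]> gridLines(double left, double top,
                                           double[] colWidths, int rows, double rowHeight) {
        double width = 0;
        for (double w : colWidths) {
            width += w;
        }
        double bottom = top - rows * rowHeight;

        List<double[]> lines = new ArrayList<>();
        // One horizontal rule above each row, plus one closing the table.
        for (int r = 0; r <= rows; r++) {
            double y = top - r * rowHeight;
            lines.add(new double[] {left, y, left + width, y});
        }
        // One vertical rule at the left edge, then one after each column.
        double x = left;
        lines.add(new double[] {x, top, x, bottom});
        for (double w : colWidths) {
            x += w;
            lines.add(new double[] {x, top, x, bottom});
        }
        return lines;
    }

    public static void main(String[] args) {
        for (double[] s : gridLines(50, 700, new double[] {100, 200}, 3, 20)) {
            System.out.println(s[0] + "," + s[1] + " -> " + s[2] + "," + s[3]);
        }
    }
}
```

Cell text would then be positioned from the same numbers: the left edge of a column plus some padding for x, and the row's top minus the padding and font size for y.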
...
Hi,
From a PDF, I need to read the coordinates of the fields, such as the ascent and descent of the fields, using the PDFBox API. The COS dictionary object contains that information, I guess. As of now I am able to retrieve the rect box of the fields, which includes x, y, height and width. But I need to get the baseline, which in turn depends on ascen...
Hi, I would like to access parts of a PDF's content.
ABBYY FineReader recognizes the layout and tables well, but how can I now access these results?
I tried the PDF/A standard and the Java library PDFBox, without success.
Thanks!
...
I'm using the PDFBox API to parse PDF data. I don't know how to identify underlined text in a PDF using this API. Please help me out.
Regards,
_naim.
...