questions about pdfbox | ansaurus

pdfbox

Why can't PDFBox and PDFRenderer support "additional fonts"?

I have a PDF which contains the 'UniCNS-UCS2-H' font. I tried both PDFBox and PDFRenderer, and both throw an exception: Unknown encoding for 'UniCNS-UCS2-H'. The font is included in the font file mingliu.ttc (it's a TrueType collection file; I don't know whether this matters). What can I do to let these two libraries support additional fonts...

PDFBox: how to parse text out of a table in a PDF file but keep the formatting (C#, PDF)

I am using PDFBox from C# to parse text out of a PDF file. That works fine, but when the parser comes across a table it extracts the text and destroys the formatting. How can I parse text out of a table but keep the formatting? Please help; I need sample code. Thank you! Don ...

Unreadable PDF files

Hello, I am writing a Master's thesis on an NLP system. One of its components is an extractor, which extracts plain text from PDF files. A few PDF files cannot be extracted correctly. The extractor (PDFBox library) returns a string like this: "┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h" or "10a61a91a22a25a3a27a17a23a20a...

PDFBox setting A5 page size

Started playing with PDFBox PDDocument document = new PDDocument(); PDPage page = new PDPage(); document.addPage( page ); PDFont font = PDType1Font.HELVETICA_BOLD; PDPageContentStream contentStream = new PDPageContentStream(document, page); contentStream.beginText(); contentStream.setFont( font, 12 ); contentStream.moveTextPositionByA...
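For the A5 question above, it may help to work out the page size by hand first. PDF page dimensions are expressed in points (1/72 inch), and A5 paper is 148 mm x 210 mm. The following is only a minimal sketch of that arithmetic in plain Java; the class name `A5PageSize` and the helper `mmToPoints` are hypothetical names made up for illustration and are not part of PDFBox.

```java
import java.util.Locale;

// Hypothetical helper (not a PDFBox class): converts millimetres to PDF points.
class A5PageSize {
    static final double POINTS_PER_INCH = 72.0; // PDF user space unit: 1 pt = 1/72 inch
    static final double MM_PER_INCH = 25.4;

    // Convert a length in millimetres to PDF points.
    static double mmToPoints(double mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        double width = mmToPoints(148.0);  // A5 width in points
        double height = mmToPoints(210.0); // A5 height in points
        System.out.println(String.format(Locale.ROOT, "A5: %.2f x %.2f pt", width, height));
    }
}
```

Assuming a PDFBox version where `PDRectangle` has a `(float width, float height)` constructor, the two values could then be cast to `float`, wrapped in a `PDRectangle`, and applied with `page.setMediaBox(...)` before the content stream is opened.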

Parsing PDF files (especially with tables) with PDFBox

Hi! I need to parse a PDF file which contains tabular data. I'm using PDFBox to extract the file text so that I can parse the result (a String) later. The problem is that the text extraction doesn't work as I expected for tabular data. For example, I have a file which contains a table like this (7 columns: the first two always have data, only one Comple...

Fastest PDF->text library for .NET project

I'm trying to create an application which will basically be a catalogue of my PDF collection. We are talking about 15-20 GB containing tens of thousands of PDFs. I am also planning to include a full-text search mechanism. I will be using Lucene.NET for search (actually, NHibernate.Search), and a library for PDF->text conversion. Which wo...

Java - PDFBox - Text Extraction

Hi, I have been using PDFBox for extracting text information from PDFs. I have successfully parsed all properties of the text, such as font name, font face, size, position, etc. PROBLEM: I am using PDFBox 1.2.1 (latest version). The getCharacter() method in the TextPosition class returns the full string except the last character. The last character is par...

PDFBox: how to parse a PDF's PMD (Paper Meta Data)?

Hi, how can I parse a PDF's paper metadata using the PDFBox library? Regards, Magggi ...

PDFBox - Building the latest version for .NET using IKVM

Hi, I would like to build the latest version of PDFBox (http://pdfbox.apache.org/userguide/dot_net.html) for use within my .NET project. I have no experience with Java whatsoever, but I am following the steps defined here: http://www.ikvm.net/userguide/tutorial.html. I am using the following versions: - IKVM (0.42.0.6) - PDFBox (1.2.1) JAR...

How to edit PDF properties in Java?

I need to edit existing PDF properties or set new ones, such as author name, title, subject, etc., from a Java application. Is there any way to do that? I have found the Apache PDFBox library, but I don't know whether it will solve my problem. ...

How to merge two PDF files into one in Java?

I want to merge many PDF files into one using PDFBox and this is what I've done: PDDocument document = new PDDocument(); for (String pdfFile : pdfFiles) { PDDocument part = PDDocument.load(pdfFile); List<PDPage> list = part.getDocumentCatalog().getAllPages(); for (PDPage page : list) { document.addPage(page); } ...

PDFBox, how to get a single image fast?

I am using PDFBox for a project, and for one specific task I need to pull an image out of a PDF file. The image I need is always on a specific page. The PDF itself is usually pretty large (200+ pages). The problem is that every time I need to pull out this image I call getDocumentCatalog().getAllPages() ... and this takes a significant amount ...

Apache PDFBox Java library - Is there an API for creating tables?

Hello all, I am using the Apache PDFBox Java library to create PDFs. Is there a way to create a data table using PDFBox? If there is no such API, I would need to draw the table manually using drawLine etc. Any suggestions on how to go about this? Thanks -Keshav ...

PDF Fields Parser using PDFBox

Hi, I need to read the coordinates of the fields in a PDF, such as the ascent and descent of the fields, using the PDFBox API. The COS dictionary object contains that information, I guess. As of now I am able to retrieve the rect box of the fields, which includes x, y, height and width. But I need to get the baseline, which in turn depends on ascen...

Accessing PDF-Content

Hi, I would like to access parts of PDF content. ABBYY FineReader recognizes the layout and tables well, but how can I access these results now? I tried with the PDF/A standard and with the Java library PDFBox, without success. Thanks! ...

PDF Parsing using PDFBox API

I'm using the PDFBox API to parse PDF data. I don't know how to identify underlined text in a PDF using this API. Please help me out. Regards, _naim. ...
