ansaurus

Question

Using Java PDFBox library to write Russian PDF

Answer 1

A:

Perhaps a Russian encoding class needs to be written; I suppose it should look like the WinAnsiEncoding one.
Now, I have no idea what to put there!

Or, if that's not what you do already, perhaps you should encode your source file in UTF-8 and use a default encoding.
I saw some messages related to issues with extracting Russian text from existing PDF files (using PDFBox of course) but I don't know if output is related.
You can also write to the PDFBox mailing list.

PhiLho 2009-11-11 08:57:21

Well, extracting Russian text works fine using PDFBox, the problem is in writing Russian text in a PDF.

Brad 2009-11-11 10:35:21

For writing the Russian encoding, there is the DictionaryEncoding class, which I think lets me define my own encoding ... But it looks like a maze to me: http://kickjava.com/src/org/pdfbox/encoding/DictionaryEncoding.java.htm

Brad 2009-11-11 11:38:12

Answer 2

A:

Testing whether this is an encoding issue should be pretty easy to do (just switch to UTF-16 encoding).

I'm assuming that you've tried using an editor or something with the VREMACCI font and confirmed that it displays the way you expect it to?

You might want to try doing the same thing in iText just to get a feel for whether the issue is related to the PdfBox library itself... If your primary goal is to generate PDF files, iText might be a better solution anyway.

EDIT - long answer to comments:

ok - sorry for the back and forth on the encoding question... Your core issue (which you probably already knew) is that the encoding of the bytes being written to the content stream is different than the encoding being used to look up glyphs. Now I'll try to actually be helpful:

I took a look at the dictionary encoding class in PdfBox, and it looks quite unintuitive... The 'dictionary' in question is a PDF dictionary. So what you'll basically need to do is create a Pdf dictionary object (I think that PdfBox calls this a type of COSObject), then add entries to it.

The encoding for a font is defined in PDF as a dictionary (see page 266 of the above spec). The dictionary contains a base encoding name, plus an optional differences array. Technically, the differences array should not be used with TrueType fonts (although I've seen it used in some cases - don't use it, though).

You will then specify an entry for the cmap for the encoding. This cmap will be the encoding of your font.

My suggestion here is to take an existing PDF that does what you want, then get a dump of the dictionary structure for the font so you can see what it looks like.

This is definitely not for the faint of heart. I can provide some help - if you need a dictionary dump, shoot me a hyperlink with a sample PDF and I'll run it through some of the algorithms I use in my iText development (I'm the maintainer of the iText text extraction sub-system).

EDIT - 11/17/09

OK - here's the dictionary dump from the russian.pdf file (sub-dictionaries are listed indented, and in the order they appeared in the containing dictionary):

(/CropBox=[0, 0, 595, 842], /Parent=Dictionary of type: /Pages, /Type=/Page, /Contents=[209 0 R, 210 0 R, 211 0 R, 214 0 R, 215 0 R, 216 0 R, 222 0 R, 223 0 R], /Resources=Dictionary, /MediaBox=[0, 0, 595, 842], /StructParents=0, /Rotate=0)
    Subdictionary /Parent = (/Type=/Pages, /Count=6, /Kids=[195 0 R, 1 0 R, 3 0 R, 5 0 R, 7 0 R, 9 0 R])
    Subdictionary /Resources = (/ExtGState=Dictionary, /ProcSet=[/PDF, /Text], /ColorSpace=Dictionary, /Font=Dictionary, /Properties=Dictionary)
     Subdictionary /ExtGState = (/GS0=Dictionary of type: /ExtGState)
      Subdictionary /GS0 = (/OPM=1, /op=false, /Type=/ExtGState, /SA=false, /OP=false, /SM=0.02)
     Subdictionary /ColorSpace = (/CS0=[/ICCBased, 228 0 R])
     Subdictionary /Font = (/C2_1=Dictionary of type: /Font, /C2_2=Dictionary of type: /Font, /C2_3=Dictionary of type: /Font, /C2_4=Dictionary of type: /Font, /TT2=Dictionary of type: /Font, /TT1=Dictionary of type: /Font, /TT0=Dictionary of type: /Font, /C2_0=Dictionary of type: /Font, /TT3=Dictionary of type: /Font)
      Subdictionary /C2_1 = (/DescendantFonts=[243 0 R], /BaseFont=/LDMIEC+TimesNewRomanPS-BoldMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
      Subdictionary /C2_2 = (/DescendantFonts=[233 0 R], /BaseFont=/LDMIBO+TimesNewRomanPSMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
      Subdictionary /C2_3 = (/DescendantFonts=[224 0 R], /BaseFont=/LDMIHD+TimesNewRomanPS-ItalicMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
      Subdictionary /C2_4 = (/DescendantFonts=[229 0 R], /BaseFont=/LDMIDA+Tahoma, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
      Subdictionary /TT2 = (/LastChar=58, /BaseFont=/LDMIFC+TimesNewRomanPS-BoldMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 333], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
       Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=136, /Descent=-216, /FontWeight=700, /FontBBox=[-558, -307, 2000, 1026], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=34, /XHeight=0, /FontFamily=Times New Roman, /FontName=/LDMIFC+TimesNewRomanPS-BoldMT, /Ascent=891, /ItalicAngle=0)
      Subdictionary /TT1 = (/LastChar=187, /BaseFont=/LDMICP+TimesNewRomanPSMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 833, 778, 0, 333, 333, 0, 0, 250, 333, 250, 278, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 278, 278, 0, 564, 0, 444, 0, 722, 667, 667, 722, 611, 556, 0, 722, 333, 389, 0, 611, 889, 722, 722, 556, 0, 667, 556, 611, 0, 722, 944, 0, 722, 0, 333, 0, 333, 0, 500, 0, 444, 500, 444, 500, 444, 333, 500, 500, 278, 0, 500, 278, 778, 500, 500, 500, 0, 333, 389, 278, 500, 500, 722, 0, 500, 444, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
       Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=82, /Descent=-216, /FontWeight=400, /FontBBox=[-568, -307, 2000, 1007], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=34, /XHeight=0, /FontFamily=Times New Roman, /FontName=/LDMICP+TimesNewRomanPSMT, /Ascent=891, /ItalicAngle=0)
      Subdictionary /TT0 = (/LastChar=55, /BaseFont=/LDMIBN+TimesNewRomanPS-BoldItalicMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 250, 0, 500, 500, 500, 0, 0, 0, 0, 500], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
       Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=116.867004, /Descent=-216, /FontWeight=700, /FontBBox=[-547, -307, 1206, 1032], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=98, /XHeight=468, /FontFamily=Times New Roman, /FontName=/LDMIBN+TimesNewRomanPS-BoldItalicMT, /Ascent=891, /ItalicAngle=-15)
      Subdictionary /C2_0 = (/DescendantFonts=[238 0 R], /BaseFont=/LDMHPN+TimesNewRomanPS-BoldItalicMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
      Subdictionary /TT3 = (/LastChar=169, /BaseFont=/LDMIEB+Tahoma, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[313, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 546, 0, 546, 0, 0, 546, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 929], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
       Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=92, /Descent=-206, /FontWeight=400, /FontBBox=[-600, -208, 1338, 1034], /CapHeight=734, /FontFile2=Stream, /FontStretch=/Normal, /Flags=32, /XHeight=546, /FontFamily=Tahoma, /FontName=/LDMIEB+Tahoma, /Ascent=1000, /ItalicAngle=0)
     Subdictionary /Properties = (/MC0=Dictionary of type: /OCMD)
      Subdictionary /MC0 = (/Type=/OCMD, /OCGs=Dictionary of type: /OCG)
       Subdictionary /OCGs = (/Usage=Dictionary, /Type=/OCG, /Name=HeaderFooter)
        Subdictionary /Usage = (/CreatorInfo=Dictionary, /PageElement=Dictionary)
         Subdictionary /CreatorInfo = (/Creator=Acrobat PDFMaker 6.0 äëÿ Word)
         Subdictionary /PageElement = (/SubType=/HF)

There are a lot of moving parts here. You might want to put together a test document that has only 3 or 4 characters in the font in question... There are a lot of Type 0 (composite) fonts being used here (in addition to the TT fonts), so it's hard to tell what is involved in your particular issue.

(Are you sure you don't want to at least try this with iText? ;-) I'm not saying that it'll work, just that it might be worth a shot ).

For reference, the above dictionary dump was obtained using the com.lowagie.text.pdf.parser.PdfContentReaderTool class

Kevin Day 2009-11-12 03:29:52

Brad 2009-11-12 08:34:49

Ugh. If you use PDFBox to parse the content that you created, are you able to recover the text? If so, then it probably isn't a limitation of the encoding, per se... Maybe this is just an issue with how PDFBox maps byte tuples to glyphs?

Kevin Day 2009-11-12 17:58:48

What do you mean by recovering it? I can write some other foreign languages like French, German, ... But others like Russian seem to be a problem. It is an encoding problem, I am sure. And the DictionaryEncoding class was created to allow adding otherwise unsupported encodings, but I still can't figure out how to use it.

Brad 2009-11-15 09:05:43

Well, if you parse the text using PDFBox, do you get the text that you input, or is it munched up? In other words, write text A to PDF, then read text A from PDF, then see if A?=A. If it's an encoding problem, it's unlikely to be symmetric, so you'll most likely get A!=B coming out. If you do get A=A, then the issue is probably not encoding and you are dealing with a character code->glyph transformation issue. Strongly suggest that you try this using iText so you at least have a baseline of the content stream you *should* get.
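As a minimal sketch of why a mismatched encoding is unlikely to round-trip, the following uses only the JDK's charset support rather than PDFBox. The class and method names are hypothetical, and the charsets merely stand in for "the encoding used to write" and "the encoding used to read"; it is an illustration of the A?=A check, not of what PDFBox does internally.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripCheck {
    // Encode text with one charset, then decode the same bytes with another.
    static String roundTrip(String text, Charset writeWith, Charset readWith) {
        byte[] bytes = text.getBytes(writeWith);
        return new String(bytes, readWith);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        String a = "\u0434\u043b\u044f"; // a short Russian word, written with escapes

        // Same charset on both sides: the text should come back unchanged.
        String same = roundTrip(a, cp1251, cp1251);
        // Different charset on the read side: the bytes are reinterpreted.
        String mismatched = roundTrip(a, cp1251, StandardCharsets.ISO_8859_1);

        System.out.println("symmetric:  " + a.equals(same));
        System.out.println("mismatched: " + a.equals(mismatched));
    }
}
```

If the write and read sides agree, the comparison holds even when the glyphs drawn on the page are wrong, which is why a successful round trip points away from the byte encoding and toward the code-to-glyph mapping.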

Kevin Day 2009-11-16 04:23:19

Well, I tried that, and it gets back the same text I enter, so A=A returns true. On the other hand, I don't get the difference between encoding and glyph transformation ... I thought they were the same thing. When I talked to the library admin, here's what he said: "The problem is the mapping between the string to be added and the encoding of the font. AFAIK the WinAnsiEncoding, which is used as default for true type fonts doesn't contain Russian letters. So finally you somehow have to find another way for the mapping. You should be able to define your own mapping using the DictionaryEncoding."
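The point about WinAnsiEncoding can be checked roughly with the JDK alone. This is only a sketch under an assumption: it treats the Java charset windows-1252 as a stand-in for WinAnsiEncoding (which is based on that code page) and windows-1251 as the Cyrillic counterpart. The class and method names are hypothetical and nothing here touches PDFBox.

```java
import java.nio.charset.Charset;

public class CyrillicCoverage {
    // True if every character of text has a code in the named charset.
    static boolean canEncode(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String er = "\u0420"; // CYRILLIC CAPITAL LETTER ER

        // Stand-in for WinAnsiEncoding: no slot for Cyrillic letters.
        System.out.println("windows-1252: " + canEncode(er, "windows-1252"));
        // Cyrillic code page: the letter has a single-byte code.
        System.out.println("windows-1251: " + canEncode(er, "windows-1251"));
    }
}
```

The first check should report false and the second true, which matches the admin's explanation: the string is fine, but the font's default encoding has no code to assign to the letter.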

Brad 2009-11-16 07:47:33

Thanks a lot Kevin ... I'm really impressed. Well, here's the code I tried before - I don't really understand what I'm doing here :) but I think it translates what you are saying:

COSDictionary cosDic = new COSDictionary();
cosDic.setString( COSName.getPDFName("Ercyrillic"), "0420" ); // Russian letter.
font.setEncoding( new DictionaryEncoding( cosDic ) );

By the CMap, do you mean a general map like the COSDictionary map? Here is a Russian PDF sample: http://www.4shared.com/file/153847152/ac2943e0/russian.html

Brad 2009-11-17 11:26:25

Brad - I'll take a look when I get a free minute. You might want to post the above code into the original question so it will format correctly. CMap is a type of data structure used for communicating encoding information - definitely not a COSDictionary. The PDF spec that I link to above has information about CMaps (again, not for the faint of heart). PdfBox has a decent CMap file parser built in, so if you can get the CMap for the font, you can at least parse it (I'm still not certain how that would work for your particular situation, though).
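To give a feel for what a CMap carries, here is a small sketch that prints the bfchar section of a ToUnicode CMap, the kind of stream that hangs off the /ToUnicode entry of a font dictionary. It uses only the JDK; the class name, method name, and the character codes 1 to 3 are hypothetical, and a real CMap stream also needs its header, codespace range, and trailer, which are omitted here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ToUnicodeSketch {
    // Build one bfchar section: each line maps a 2-byte character code
    // (as used in the content stream) to a UTF-16 code unit, both in hex.
    // A single section may hold at most 100 entries; this sketch does not split.
    static String bfcharSection(Map<Integer, Character> codeToUnicode) {
        StringBuilder sb = new StringBuilder();
        sb.append(codeToUnicode.size()).append(" beginbfchar\n");
        for (Map.Entry<Integer, Character> e : codeToUnicode.entrySet()) {
            sb.append(String.format("<%04X> <%04X>\n", e.getKey(), (int) e.getValue()));
        }
        sb.append("endbfchar\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical codes for three Cyrillic capitals.
        Map<Integer, Character> map = new LinkedHashMap<>();
        map.put(1, '\u0420'); // ER
        map.put(2, '\u0423'); // U
        map.put(3, '\u0421'); // ES
        System.out.print(bfcharSection(map));
    }
}
```

The mapping runs from codes to Unicode, so it describes how a reader recovers text; it is a separate question from which glyph the font draws for each code.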

Kevin Day 2009-11-17 20:12:18

Thanks a lot Kevin, but I don't know what I should do with this dump :) ... I have created a test app for writing a PDF: http://www.4shared.com/file/154503635/62837b87/createGreekPDF.html ... Please check it, just double-click the jar. I also found this: http://www.pinxue.net/java/PDFBox_String_Charset_analyze_en.html ... I think it may be useful, but of course it's too complicated for me.

Brad 2009-11-18 09:39:25

Yeah - dictionary font entries are not simple. Part of the problem you have is that you have way too many fonts in the one file. When digging into this stuff, it's much easier to do one font at a time, with just 4 or 5 characters of text. That allows you to focus on the specific issue at hand. That said, what you'll do with that dictionary dump is create a COS dictionary object (and sub-dictionaries, etc...), and use that for your encoding. Or you could try iText ;-)

Kevin Day 2009-11-18 19:31:41

As I said ... I know iText is great, but I finished my program 3 months ago, and this is a critical update to it, so I can't change the library now.

Brad 2009-11-19 04:38:37

Fair enough. Reverse engineering the font dictionaries may be a 3 month project, though... You might want to at least try iText just to see if it works any better. I know that it's hard to switch horses in the middle of the race, but sometimes you have to bite that bullet (I had to make this change myself a while back when I discovered that PdfBox didn't support xref streams).

Kevin Day 2009-11-23 21:53:31

:) ... This is a very hard decision, but I think I have to. I will try it, and I hope I won't regret it.

Brad 2009-11-24 08:36:53

Answer 3

A:

Hello,

Just try this one:

Phrase leftTitle = new Phrase("САНКТ-ПЕТЕРБУРГ", FontFactory.getFont("Tahoma", "Cp1251", true, 25));

This works with the latest iText (5.0.1), at least.

daNIL 2010-05-01 21:16:53

Answer 4

A:

I'm having the same problem with Arabic. The thing is, I can see in the issue tracker that they have fixes for Arabic, but I have no idea how to write Arabic text in the first place. Any solution?

Java Developer 2010-09-19 11:14:21
