Hello,

I am using a Java library called PDFBox to write text to a PDF. It works perfectly for English text, but when I tried to write Russian text the letters came out garbled. The problem seems to be the font I am using, but I am not sure about that, so I hope someone can guide me through this. Here are the important lines of code:

PDTrueTypeFont font = PDTrueTypeFont.loadTTF( pdfFile, new File( "fonts/VREMACCI.TTF" ) );  // Windows Russian font imported to write the Russian text.
font.setEncoding( new WinAnsiEncoding() );  // Define the Encoding used in writing.
// Some code here to open the PDF & define a new page.
contentStream.drawString( "отделом компьютерной" ); // Write the Russian text.

The WinAnsiEncoding source code is: Click here

--------------------- Edit on 18 November 2009

After some investigation, I am now sure it is an encoding problem. It could be solved by defining my own encoding with the PDFBox class DictionaryEncoding.

I am not sure how to use it, but here is what I have tried so far:

COSDictionary cosDic = new COSDictionary(); 
cosDic.setString( COSName.getPDFName("Ercyrillic"), "0420 " ); // Russian letter.
font.setEncoding( new DictionaryEncoding( cosDic ) );

This does not work. It seems I am filling the dictionary the wrong way: when I write a PDF page with this encoding, the page appears blank.

The DictionaryEncoding source code is: Click here

Thanks.

A: 

Perhaps a Russian encoding class needs to be written; it should look like the WinAnsiEncoding one, I suppose.
Now, I have no idea what to put there!

Or, if you don't do that already, perhaps you should save your source file as UTF-8 and use a default encoding.
I saw some messages about issues with extracting Russian text from existing PDF files (using PDFBox, of course), but I don't know whether output is affected in the same way.
You can also write to the PDFBox mailing list.

PhiLho
Well, extracting Russian text works fine with PDFBox; the problem is writing Russian text to a PDF.
Brad
For writing with a Russian encoding, there is the DictionaryEncoding class, which I think lets me define my own encoding... but it looks like a maze to me: http://kickjava.com/src/org/pdfbox/encoding/DictionaryEncoding.java.htm
Brad
A: 

Testing whether this is an encoding issue should be pretty easy to do (just switch to UTF-16 encoding).

I'm assuming that you've tried using an editor or something with the VREMACCI font and confirmed that it displays the way you expect it to?

You might want to try doing the same thing in iText, just to get a feel for whether the issue is related to the PDFBox library itself. If your primary goal is to generate PDF files, iText might be a better solution anyway.

EDIT - long answer to comments:

OK, sorry for the back and forth on the encoding question. Your core issue (which you probably already knew) is that the encoding of the bytes written to the content stream differs from the encoding used to look up glyphs. Now I'll try to actually be helpful:
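As a rough illustration of that mismatch (a sketch using only the JDK's charset classes, not PDFBox; the class and method names are made up for this example), the snippet below encodes Russian text as windows-1251 bytes and then reads the same bytes back as windows-1252, which is close to what WinAnsiEncoding covers. The /Creator string in the dictionary dump further down ("äëÿ" where "для" was presumably meant) shows the same effect.

```java
import java.nio.charset.Charset;

class EncodingMismatchDemo {
    // Encodes text with one charset and decodes the bytes with another,
    // mimicking a writer and a glyph lookup that disagree about the encoding.
    static String writeThenLookUp(String text, Charset writeCharset, Charset lookupCharset) {
        byte[] bytes = text.getBytes(writeCharset);
        return new String(bytes, lookupCharset);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251"); // Cyrillic code page
        Charset cp1252 = Charset.forName("windows-1252"); // Western European code page

        // Unicode escapes spell the Russian word for "for" (three Cyrillic letters).
        String russian = "\u0434\u043b\u044f";

        // Same charset on both sides: the text survives.
        System.out.println(writeThenLookUp(russian, cp1251, cp1251).equals(russian));

        // Different charsets: each byte is looked up in the wrong table.
        System.out.println(writeThenLookUp(russian, cp1251, cp1252));
    }
}
```

With matching charsets the text comes back unchanged; with mismatched ones every byte still maps to some character, just not the intended one, which is why the output looks like "strange letters" rather than failing outright.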

I took a look at the DictionaryEncoding class in PDFBox, and it looks quite unintuitive. The 'dictionary' in question is a PDF dictionary, so what you basically need to do is create a PDF dictionary object (I think PDFBox calls this a type of COSObject) and then add entries to it.

The encoding for a font is defined in PDF as a dictionary (see page 266 of the above spec). The dictionary contains a base encoding name plus an optional Differences array. Technically, the Differences array should not be used with TrueType fonts (I've seen it used in some cases, but don't use it).
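To make the base-encoding-plus-Differences idea concrete, here is a minimal sketch in plain Java (no PDFBox types). It assumes the usual reading of a Differences array: a number starts a run at that character code, and each glyph name that follows is assigned to the next consecutive code. The class name, the tiny base table, and the choice of codes are hypothetical, and whether a given font actually contains glyphs under these names is also an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DifferencesDemo {
    // Applies a Differences-style list to a copy of a base code-to-glyph-name table.
    // An Integer starts a new run at that code; each following String names the
    // glyph for the next consecutive code.
    static Map<Integer, String> applyDifferences(Map<Integer, String> base, List<Object> differences) {
        Map<Integer, String> result = new LinkedHashMap<>(base);
        int code = -1;
        for (Object item : differences) {
            if (item instanceof Integer) {
                code = (Integer) item;
            } else {
                if (code < 0) {
                    throw new IllegalArgumentException("Glyph name appears before any code");
                }
                result.put(code++, (String) item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A two-entry stand-in for a base encoding such as WinAnsiEncoding.
        Map<Integer, String> base = new LinkedHashMap<>();
        base.put(65, "A");
        base.put(66, "B");

        // Illustrative only: remap codes 192..194 and 208 to Cyrillic glyph names.
        List<Object> differences = Arrays.<Object>asList(
                192, "Acyrillic", "Becyrillic", "Vecyrillic",
                208, "Ercyrillic");

        System.out.println(applyDifferences(base, differences));
    }
}
```

Note that the keys are character codes and the values are glyph names, which is the reverse of the glyph-name-to-string dictionary attempted in the question.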

You will then specify an entry for the cmap for the encoding. This cmap will be the encoding of your font.

My suggestion here is to take an existing PDF that does what you want, then get a dump of the dictionary structure for the font so you can see what it looks like.

This is definitely not for the faint of heart. I can provide some help - if you need a dictionary dump, shoot me a hyperlink with a sample PDF and I'll run it through some of the algorithms I use in my iText development (I'm the maintainer of the iText text extraction sub-system).

EDIT - 11/17/09

OK - here's the dictionary dump from the russian.pdf file (sub-dictionaries are listed indented, and in the order they appeared in the containing dictionary):

(/CropBox=[0, 0, 595, 842], /Parent=Dictionary of type: /Pages, /Type=/Page, /Contents=[209 0 R, 210 0 R, 211 0 R, 214 0 R, 215 0 R, 216 0 R, 222 0 R, 223 0 R], /Resources=Dictionary, /MediaBox=[0, 0, 595, 842], /StructParents=0, /Rotate=0)
    Subdictionary /Parent = (/Type=/Pages, /Count=6, /Kids=[195 0 R, 1 0 R, 3 0 R, 5 0 R, 7 0 R, 9 0 R])
    Subdictionary /Resources = (/ExtGState=Dictionary, /ProcSet=[/PDF, /Text], /ColorSpace=Dictionary, /Font=Dictionary, /Properties=Dictionary)
     Subdictionary /ExtGState = (/GS0=Dictionary of type: /ExtGState)
      Subdictionary /GS0 = (/OPM=1, /op=false, /Type=/ExtGState, /SA=false, /OP=false, /SM=0.02)
     Subdictionary /ColorSpace = (/CS0=[/ICCBased, 228 0 R])
     Subdictionary /Font = (/C2_1=Dictionary of type: /Font, /C2_2=Dictionary of type: /Font, /C2_3=Dictionary of type: /Font, /C2_4=Dictionary of type: /Font, /TT2=Dictionary of type: /Font, /TT1=Dictionary of type: /Font, /TT0=Dictionary of type: /Font, /C2_0=Dictionary of type: /Font, /TT3=Dictionary of type: /Font)
      Subdictionary /C2_1 = (/DescendantFonts=[243 0 R], /BaseFont=/LDMIEC+TimesNewRomanPS-BoldMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
      Subdictionary /C2_2 = (/DescendantFonts=[233 0 R], /BaseFont=/LDMIBO+TimesNewRomanPSMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
      Subdictionary /C2_3 = (/DescendantFonts=[224 0 R], /BaseFont=/LDMIHD+TimesNewRomanPS-ItalicMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
      Subdictionary /C2_4 = (/DescendantFonts=[229 0 R], /BaseFont=/LDMIDA+Tahoma, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
      Subdictionary /TT2 = (/LastChar=58, /BaseFont=/LDMIFC+TimesNewRomanPS-BoldMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 333], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
       Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=136, /Descent=-216, /FontWeight=700, /FontBBox=[-558, -307, 2000, 1026], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=34, /XHeight=0, /FontFamily=Times New Roman, /FontName=/LDMIFC+TimesNewRomanPS-BoldMT, /Ascent=891, /ItalicAngle=0)
      Subdictionary /TT1 = (/LastChar=187, /BaseFont=/LDMICP+TimesNewRomanPSMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 833, 778, 0, 333, 333, 0, 0, 250, 333, 250, 278, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 278, 278, 0, 564, 0, 444, 0, 722, 667, 667, 722, 611, 556, 0, 722, 333, 389, 0, 611, 889, 722, 722, 556, 0, 667, 556, 611, 0, 722, 944, 0, 722, 0, 333, 0, 333, 0, 500, 0, 444, 500, 444, 500, 444, 333, 500, 500, 278, 0, 500, 278, 778, 500, 500, 500, 0, 333, 389, 278, 500, 500, 722, 0, 500, 444, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
       Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=82, /Descent=-216, /FontWeight=400, /FontBBox=[-568, -307, 2000, 1007], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=34, /XHeight=0, /FontFamily=Times New Roman, /FontName=/LDMICP+TimesNewRomanPSMT, /Ascent=891, /ItalicAngle=0)
      Subdictionary /TT0 = (/LastChar=55, /BaseFont=/LDMIBN+TimesNewRomanPS-BoldItalicMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 250, 0, 500, 500, 500, 0, 0, 0, 0, 500], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
       Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=116.867004, /Descent=-216, /FontWeight=700, /FontBBox=[-547, -307, 1206, 1032], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=98, /XHeight=468, /FontFamily=Times New Roman, /FontName=/LDMIBN+TimesNewRomanPS-BoldItalicMT, /Ascent=891, /ItalicAngle=-15)
      Subdictionary /C2_0 = (/DescendantFonts=[238 0 R], /BaseFont=/LDMHPN+TimesNewRomanPS-BoldItalicMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
      Subdictionary /TT3 = (/LastChar=169, /BaseFont=/LDMIEB+Tahoma, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[313, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 546, 0, 546, 0, 0, 546, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 929], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
       Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=92, /Descent=-206, /FontWeight=400, /FontBBox=[-600, -208, 1338, 1034], /CapHeight=734, /FontFile2=Stream, /FontStretch=/Normal, /Flags=32, /XHeight=546, /FontFamily=Tahoma, /FontName=/LDMIEB+Tahoma, /Ascent=1000, /ItalicAngle=0)
     Subdictionary /Properties = (/MC0=Dictionary of type: /OCMD)
      Subdictionary /MC0 = (/Type=/OCMD, /OCGs=Dictionary of type: /OCG)
       Subdictionary /OCGs = (/Usage=Dictionary, /Type=/OCG, /Name=HeaderFooter)
        Subdictionary /Usage = (/CreatorInfo=Dictionary, /PageElement=Dictionary)
         Subdictionary /CreatorInfo = (/Creator=Acrobat PDFMaker 6.0 äëÿ Word)
         Subdictionary /PageElement = (/SubType=/HF)

There are a lot of moving parts here. You might want to put together a test document that has only 3 or 4 characters in the font in question. There are a lot of Type 0 (composite) fonts being used here (in addition to the TrueType fonts), so it's hard to tell what is involved in your particular issue.

(Are you sure you don't want to at least try this with iText? ;-) I'm not saying that it'll work, just that it might be worth a shot.)

For reference, the above dictionary dump was obtained using the com.lowagie.text.pdf.parser.PdfContentReaderTool class

Kevin Day
Brad
Ugh. If you use PDFBox to parse the content that you created, are you able to recover the text? If so, then it probably isn't a limitation of the encoding per se. Maybe this is just an issue with how PDFBox maps byte tuples to glyphs?
Kevin Day
What do you mean by recovering it? I can write some other foreign languages like French and German, but others like Russian seem to be a problem. It is an encoding problem, I am sure. The DictionaryEncoding class was created to allow adding otherwise unsupported encodings, but I still can't figure out how to use it.
Brad
Well, if you parse the text using PDFBox, do you get the text that you put in, or is it mangled? In other words, write text A to the PDF, read it back from the PDF, and see whether what comes out equals A. If it's an encoding problem, it's unlikely to be symmetric, so you'll most likely get some different text B coming out. If you do get A back, then the issue is probably not encoding, and you are dealing with a character code -> glyph transformation issue. I strongly suggest that you try this using iText so you at least have a baseline of the content stream you *should* get.
Kevin Day
Well, I tried that, and I get back the same text I entered, so A=A holds. On the other hand, I don't get the difference between encoding and glyph transformation; I thought they were the same thing. When I talked to the library admin, here's what he said: "The problem is the mapping between the string to be added and the encoding of the font. AFAIK the WinAnsiEncoding, which is used as default for true type fonts doesn't contain Russian letters. So finally you somehow have to find another way for the mapping. You should be able to define your own mapping using the DictionaryEncoding."
Brad
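The library admin's point about WinAnsiEncoding can be checked outside PDFBox. The sketch below uses the JDK's windows-1252 charset as a stand-in for WinAnsiEncoding (the two are close but not guaranteed identical, so treat that as an assumption) and asks whether some sample strings can be encoded at all. The class and method names are made up for this example.

```java
import java.nio.charset.Charset;

class WinAnsiCoverageDemo {
    // Returns true if every character of the text has a code in the named charset.
    static boolean canEncode(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String french = "fran\u00e7ais";                            // Latin letters plus c-cedilla
        String german = "Stra\u00dfe";                              // Latin letters plus sharp s
        String russian = "\u043e\u0442\u0434\u0435\u043b\u043e\u043c"; // first word of the question's sample text

        System.out.println("French in windows-1252:  " + canEncode(french, "windows-1252"));
        System.out.println("German in windows-1252:  " + canEncode(german, "windows-1252"));
        System.out.println("Russian in windows-1252: " + canEncode(russian, "windows-1252"));
        System.out.println("Russian in windows-1251: " + canEncode(russian, "windows-1251"));
    }
}
```

This matches what was observed in the thread: French and German fit into the Western European table, while Russian needs a different one.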
Thanks a lot, Kevin... I'm really impressed. Well, here's the code I tried before. I don't really understand what I'm doing here :) but I think it translates what you are saying: COSDictionary cosDic = new COSDictionary(); cosDic.setString( COSName.getPDFName("Ercyrillic"), "0420" ); /* Russian letter */ font.setEncoding( new DictionaryEncoding( cosDic ) ); Do you mean by the CMap a general map like the COSDictionary map? Here is a Russian PDF sample: http://www.4shared.com/file/153847152/ac2943e0/russian.html
Brad
Brad, I'll take a look when I get a free minute. You might want to post the above code into the original question so it will format correctly. A CMap is a type of data structure used for communicating encoding information; it is definitely not a COSDictionary. The PDF spec that I link to above has information about CMaps (again, not for the faint of heart). PDFBox has a decent CMap file parser built in, so if you can get the CMap for the font, you can at least parse it (I'm still not certain how that would work for your particular situation, though).
Kevin Day
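For a feel of what a CMap carries, here is a small sketch that reads only the `beginbfchar` ... `endbfchar` sections of a ToUnicode-style CMap, where each line pairs a character code with the Unicode text it stands for. This is not PDFBox's parser; the class name and the sample codes (0003, 0004) are invented for illustration, and real CMaps also contain `bfrange` sections and other operators that this sketch ignores.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BfCharDemo {
    private static final Pattern SECTION =
            Pattern.compile("beginbfchar(.*?)endbfchar", Pattern.DOTALL);
    private static final Pattern PAIR =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");

    // Collects code -> Unicode string pairs from every bfchar section in the CMap text.
    static Map<Integer, String> parseBfChars(String cmap) {
        Map<Integer, String> mapping = new LinkedHashMap<>();
        Matcher section = SECTION.matcher(cmap);
        while (section.find()) {
            Matcher pair = PAIR.matcher(section.group(1));
            while (pair.find()) {
                int code = Integer.parseInt(pair.group(1), 16);
                mapping.put(code, decodeUtf16Be(pair.group(2)));
            }
        }
        return mapping;
    }

    // Interprets a hex string as big-endian UTF-16 code units, four hex digits each.
    static String decodeUtf16Be(String hex) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            text.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Hypothetical fragment: code 0003 stands for U+0420, code 0004 for U+0443.
        String cmap = "2 beginbfchar\n<0003> <0420>\n<0004> <0443>\nendbfchar\n";
        System.out.println(parseBfChars(cmap));
    }
}
```

The codes on the left are whatever the font uses internally; only the right-hand side is Unicode, which is why a CMap is a lookup table rather than a general-purpose dictionary.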
Thanks a lot, Kevin, but I don't know what I should do with this dump :) ... I have created a test app for writing a PDF: http://www.4shared.com/file/154503635/62837b87/createGreekPDF.html ... Please check it; just double-click the jar. I also found this: http://www.pinxue.net/java/PDFBox_String_Charset_analyze_en.html ... I think it may be useful, but of course it's too complicated for me.
Brad
Yeah - dictionary font entries are not simple. Part of the problem you have is that you have way too many fonts in the one file. When digging into this stuff, it's much easier to do one font at a time, with just 4 or 5 characters of text. That allows you to focus on the specific issue at hand. That said, what you'll do with that dictionary dump is create a COS dictionary object (and sub-dictionaries, etc...), and use that for your encoding. Or you could try iText ;-)
Kevin Day
As I said, I know iText is great, but I finished my program 3 months ago, and this is a critical update to it, so I can't change the library now.
Brad
Fair enough. Reverse engineering the font dictionaries may be a 3 month project, though... You might want to at least try iText just to see if it works any better. I know that it's hard to switch horses in the middle of the race, but sometimes you have to bite that bullet (I had to make this change myself a while back when I discovered that PDFBox didn't support xref streams).
Kevin Day
:) ... This is a very hard decision, but I think I have to. I will try it, and I hope I won't regret it.
Brad
A: 

Hello,

Just try this one:

Phrase leftTitle = new Phrase("САНКТ-ПЕТЕРБУРГ", FontFactory.getFont("Tahoma", "Cp1251", true, 25));

This will work at least with the latest iText (5.0.1).

daNIL
A: 

I'm having the same problem with Arabic. I can see in the issue tracker that they have fixes for Arabic, but I have no idea how to write Arabic text in the first place. Is there any solution?

Java Developer