I am trying to use ReportLab with Unicode characters, but it is not working. I tried tracing through the code till I reached the following line:

class TTFont:
    # ...
    def splitString(self, text, doc, encoding='utf-8'):
        # ...
        cur.append(n & 0xFF) # <-- here is the problem!
        # ...

(This code can be found in ReportLab's repository, in the file pdfbase/ttfonts.py. The code in question is in line 1059.)

Why is n's value being manipulated?!

In the line shown above, n contains the code point of the character being processed (e.g. 65 for 'A', 97 for 'a', or 1588 for Arabic sheen 'ش'). cur is a list that is being filled with the characters to be sent to the final output (AFAIU). Before that line, everything was (apparently) working fine, but in this line, the value of n was manipulated, apparently reducing it to the extended ASCII range!

This causes non-ASCII, Unicode characters to lose their value. I cannot understand how this statement is useful, or why it is necessary!
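To make the arithmetic concrete, here is a standalone check of what the mask does to the three code points mentioned above. It is only an illustration of `n & 0xFF` in plain Python, not ReportLab code, and the name `masked` is mine:

```python
# Code points from the examples above: 'A', 'a', and Arabic sheen (U+0634).
code_points = [ord('A'), ord('a'), ord('\u0634')]

# Apply the same mask as the line in question: keep only the low 8 bits.
masked = {n: n & 0xFF for n in code_points}

for n, low in masked.items():
    print(f"n = {n:4d} (0x{n:04X})  ->  n & 0xFF = {low:3d} (0x{low:02X})")
```

The two ASCII values pass through unchanged, while 1588 (0x0634) comes out as 52 (0x34).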

So my question is: why is n's value being manipulated here, and how should I go about fixing this issue?

Edit:
In response to the comment regarding my code snippet, here is an example that causes this error:

my_doctemplate.build([Paragraph(bulletText = None, encoding = 'utf8',
    caseSensitive = 1, debug = 0,
    text = '\xd8\xa3\xd8\xa8\xd8\xb1\xd8\xa7\xd8\xac',
    frags = [ParaFrag(fontName = 'DejaVuSansMono-BoldOblique',
        text = '\xd8\xa3\xd8\xa8\xd8\xb1\xd8\xa7\xd8\xac',
        sub = 0, rise = 0, greek = 0, link = None, italic = 0, strike = 0,
        fontSize = 12.0, textColor = Color(0,0,0), super = 0, underline = 0,
        bold = 0)])])

In PDFTextObject._textOut, _formatText is called, which identifies the font as _dynamicFont, and accordingly calls font.splitString, which is causing the error described above.

A: 

I'm pretty sure you'd need to change 0xFF to 0xFFFF to keep 16-bit Unicode characters, as ~unutbu suggested, hence using two bytes instead of one.

dertyp
+1  A: 

What do you mean, "not working"? You have misquoted the ReportLab source code. What it actually does is encode the lower and upper byte of each 16-bit Unicode character separately (the upper byte is only written out when it changes, which I assume is a PDF-specific optimization to make documents smaller).
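To illustrate the idea of coding the two bytes separately, here is a minimal sketch in plain Python. It is not ReportLab's actual implementation: the function names `split_by_high_byte` and `rejoin` are hypothetical, and it assumes every character fits in 16 bits. It groups the text into runs that share an upper byte, storing that upper byte once per run, and then shows that the original code points can be recovered.

```python
def split_by_high_byte(text):
    """Group 16-bit code points into (high_byte, low_bytes) runs.

    Hypothetical helper for illustration only: a new run starts
    whenever the upper byte differs from the previous character's.
    """
    runs = []
    current_high = None
    for ch in text:
        n = ord(ch)
        if n > 0xFFFF:
            raise ValueError("this sketch only handles 16-bit code points")
        high, low = n >> 8, n & 0xFF
        if high != current_high:
            runs.append((high, bytearray()))
            current_high = high
        runs[-1][1].append(low)  # only the low byte is stored per character
    return [(high, bytes(lows)) for high, lows in runs]


def rejoin(runs):
    """Rebuild the original string from (high_byte, low_bytes) runs."""
    return ''.join(chr((high << 8) | low) for high, lows in runs for low in lows)


# The UTF-8 bytes from the question, decoded to text, with an 'A' in front.
arabic = b'\xd8\xa3\xd8\xa8\xd8\xb1\xd8\xa7\xd8\xac'.decode('utf-8')
sample = 'A' + arabic
runs = split_by_high_byte(sample)

for high, lows in runs:
    print(f"high byte 0x{high:02X}: low bytes {lows.hex(' ')}")
```

All five Arabic letters in the question's string share the upper byte 0x06, so in this sketch they end up in a single run, and `rejoin(runs)` returns the original string. The low byte on its own looks like a truncated value, but together with the run's upper byte nothing is lost.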

Please explain exactly what the problem is, not what you think the underlying reason is. Chances are the characters you want to display simply don't exist in the selected font ('DejaVuSansMono-BoldOblique').

Antoine P.
Thanks. You are right. I understood this a little earlier, but I had forgotten to update this question.
Hosam Aly