ansaurus

Question

Answer 1

+2 A:

How many of these blocks might appear in a document?

Often slow Regex execution is the result of catastrophic backtracking, as described here: http://www.regular-expressions.info/catastrophic.html

I don't know what regex technology you're using, but you could try to use lookaround assertions, as described here: http://www.regular-expressions.info/lookaround.html

These allow you to first just match what you want, ABC inside parentheses, and then validate that it is preceded by some value and followed by some other value.

Jay 2010-04-01 20:16:17

Large numbers of blocks, all text sits inside one, but you may get several paragraphs worth inside on block.

Ali Lown 2010-04-02 10:59:07

Answer 2

+1 A:

Are you sure the regex is correct and pulls out ABC as a match? What language's regex engine is this? Using my regular expression debugger shows that:

"BT.*?((.*?)).*?ET" doesn't pull out ABC and in fact must find the string 'ET' then backtrack back to find everything else.

"BT.*?\\((.*?)\\).*?ET" works as expected with a single pass left to right.

bot403 2010-04-01 20:21:54

The engine is python's. I was testing using one of the online tools, which reported it pulling out in the correct block.

Ali Lown 2010-04-02 10:58:37

Answer 3

A:

You can't just parse the PDF with a regex to extract the text. In most cases the text in inside compressed binary blobs or encoded. A PDF with the text shown like this is very much the exception.

mark stephens 2010-04-02 10:22:57

Did you not see that I mentioned that I had already deflated the blob? The text excerpt in the question is the actual PDF format (take a look at the specification if you don't believe me)

Ali Lown 2010-04-02 11:00:19

Yes, but deflating it only works if the text is WinANSI encoded and not subsetted. The values in the TJ command are index values which happen to be the same as the text values only if it is WinANSI encoded and not subsetted. Otherwise they need to be converted using the embedded CMAP or the tables in Appendix D of the PDF Reference.

mark stephens 2010-04-03 08:01:21

Heh, take a look at the PDF spec indeed. Perhaps don't be so quick to down vote in the future.

Rowan 2010-04-04 12:58:35

I have put a longer and more detailed explanation about the Tj command on our blog at http://pdf.jpedal.org/java-pdf-blog/bid/27187/Understanding-the-PDF-file-format-text-streams

mark stephens 2010-04-06 15:57:33

Answer 4

A:

There's not really enough info for a definite answer--or maybe you're assuming we know more about PDF than you do. Are there always parenthesized chunks inside these BT...ET sections? Is there always only one of them? Is the BT or ET always at the beginning of a line? If so, I would suggest

(?m)^BT[^()]*\((.*?)\)[^()]*?^ET

If I knew how PDF represented literal parentheses, I could probably come up with something more efficient.

EDIT: According to the PDF spec, literal parentheses have to be escaped with a backslash, and there are a bunch of other backslash-escape sequences. So try this:

(?s)\bBT\b[^()]*\(((?:[^()\\]*(?:\\.[^()\\]*)*))\)

This part--[^()\\]*(?:\\.[^()\\]*)*--matches a block of text which may contain escaped characters (including parens), but not unescaped parens. I know it looks ugly, but it's the most efficient way, since Python doesn't support atomic groups or possessive quantifiers.

(?s) allows . to match newlines, and \bBT\b makes sure the BT isn't part of a longer "word". I'm reasonably confident that this is all I need to match all of the actual text content, so I don't bother matching the stuff after the closing paren.

Alan Moore 2010-04-02 11:19:04

Parentheses are represented as the ascii character. It is just like a text stream, just with some special control characters. I have put some information about PDF in the question, just beneath the example section I am looking for.

Ali Lown 2010-04-02 11:34:00

Hmm. Gskinner's online tester is happy with that, but python is reporting a 'raise error, v # invalid expression' 'sre_constants.error@ unexpected end of regular expression', yet it looks like all brackets/parentheses are matched.

Ali Lown 2010-04-02 15:46:32

Alan Moore 2010-04-02 22:06:46

Answer 5

A:

here's one without regex. simple string parsing using Python internals.

>>> xtract="""
... BT
...    /F13 12 Tf
...    288 720 Td
...    (ABC) Tj
... ET
...
... """
>>> for chunk in xtract.split("ET"):
...     if "BT" in chunk:
...         for brace in chunk.split(")"):
...             if "(" in brace:
...                  print brace[brace.find("(")+1:]
...
ABC

ghostdog74 2010-04-02 13:32:48

Ultimately, this is the only answer here that works at a sensible speed.

Ali Lown 2010-04-02 16:17:19

Answer 6

A:

Since there will be only one bracketed expression between a BT and an ET, you could try the following regular expression for speed:

r"(?s)\bBT\b[^(]*\(([^)]*)\).*?\bET\b"

ΤΖΩΤΖΙΟΥ 2010-04-02 16:25:04

Intriguingly, this seems to find only one the second block BT/ET block.

Ali Lown 2010-04-02 16:52:38

If you use the `.search` method of the compiled regular expression, then this will find a single match; if you use `.finditer`, you'll get all matches. Unless I misunderstood your comment.

ΤΖΩΤΖΙΟΥ 2010-04-02 20:58:21

ansaurus

tags:

views:

answers:

Regex matching very slow

related questions