I am trying to parse a PDF to extract the text from it (please don't suggest any libraries to do this, as this is part of learning the format).
I have already handled decompressing the streams (inflating the Flate-compressed data) to get the plain-text content. I now need to extract the text from the text blocks.
So, my current pattern is BT.*?\((.*?)\).*?ET
(with DOTMATCHALL set) to match something like:
BT
/F13 12 Tf
288 720 Td
(ABC) Tj
ET
The only bit I want is the text ABC in the brackets.
The above is only formatted like that to make it easy to read. In the decompressed text it may or may not all be on one line. There is no guarantee that BT/ET will be at the start of a line. There may or may not be spaces and text before or after the bracketed section. There will, however, be only one bracketed section per BT/ET block.
The above pattern works, but it is really slow. I assume this is because the regex engine repeatedly tries and fails to match the part of the pattern that covers the text between BT and the (ABC).
The regex is pre-compiled in an attempt to speed it up, but the improvement seems negligible.
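For illustration, the setup described above can be sketched as follows. This assumes Python's re module, since no language is named here (its equivalent of DOTMATCHALL is re.DOTALL). The sample content and the names TEXT_BLOCK and extract_text are made up for the example.

```python
import re

# Pattern from the question, compiled once up front.
# re.DOTALL lets '.' match newlines, so a block may span several lines.
TEXT_BLOCK = re.compile(r"BT.*?\((.*?)\).*?ET", re.DOTALL)

def extract_text(content):
    """Return the bracketed text of each BT ... ET block found in content."""
    # With a single capture group, findall returns a list of that group's matches.
    return TEXT_BLOCK.findall(content)

# Hypothetical sample: one block spread over several lines,
# followed by other operators and a second block on a single line.
sample = (
    "BT\n/F13 12 Tf\n288 720 Td\n(ABC) Tj\nET "
    "q 1 0 0 1 0 0 cm BT /F13 12 Tf (DEF) Tj ET"
)

print(extract_text(sample))
```

On small inputs like this the pattern returns quickly; the slowness only shows up on the full decompressed content.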
How may I speed this up?