Hello everyone,

The situation is as follows: I have a series of big, fat PDF files, full of imagery and randomly distributed text - these are the sections of a huge promotional pricelist for a vast array of products. What I need is to pattern-match all the catalogue codes in the text of each PDF file and to wrap each one in a hyperlink that points to the respective page in an online store.

So the task is very simple - scan a PDF file for all plain-text 10-digit sequences, and convert those into links whose href is http://something?code=[match].

I would also prefer to put this together in a PHP script if possible, but any language would do. I have a gut feeling that maybe even Flash could be an option.

Any ideas? Thanks in advance.

EDIT:

Some of the answers coming in are teaching me PCRE syntax. The problem here is that I need to search and replace in a PDF file, so the problem is twofold. Say we do this in PHP:

  • How do you read / write to a PDF in PHP?
  • As PDFs aren't plaintext files, I can't just regex against them, and I also believe that PDF links are not bundled together with the text but come separate as regions. Which also means that I could maybe overlay an active rectangle over the coordinates of the catalogue code's characters, if I only knew where a matched code resides on a page.

What do you think? Other languages are also an option.

Thanks.
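
For reference, the belief above matches how PDF stores links: a link is an annotation object attached to the page, holding a rectangle and an action, and it is separate from the text in the content stream. Below is a minimal Python sketch (Python only because any language is acceptable here) of building such an annotation dictionary for one matched code. The function name `link_annotation`, the base URL and the sample coordinates are hypothetical. Writing the object into the file and registering it in the page's /Annots array is left to whatever PDF library is used.

```python
import re

# Exactly ten digits, not part of a longer run of digits.
CODE_RE = re.compile(r"(?<!\d)\d{10}(?!\d)")


def link_annotation(code, rect, base_url="http://something?code="):
    """Return the PDF source text of a /Link annotation covering `rect`.

    rect is (x_min, y_min, x_max, y_max) in PDF user space, which by
    default has its origin at the bottom-left corner of the page and
    uses units of 1/72 inch.
    """
    if not CODE_RE.fullmatch(code):
        raise ValueError("expected a 10-digit catalogue code")
    x0, y0, x1, y1 = rect
    return (
        "<< /Type /Annot /Subtype /Link "
        f"/Rect [{x0:.2f} {y0:.2f} {x1:.2f} {y1:.2f}] "
        "/Border [0 0 0] "  # no visible border: invisible but clickable
        f"/A << /S /URI /URI ({base_url}{code}) >> >>"
    )


# Hypothetical rectangle, as if the code had been found near the top of a page.
print(link_annotation("1234567890", (72, 700, 140, 712)))
```

So the open question is only where each code sits on the page. Once its rectangle is known, the text itself never has to be modified.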

A: 
<?php
$s="
http://something.com?code=3000 asdf text
http://something.com?code=5000 asdf
";
echo preg_replace('/(http:\/\/something\.com\?code=(\d+))/s', '<a href="$1">$2</a>',$s);
?>

Output:

<a href="http://something.com?code=3000">3000</a> asdf text
<a href="http://something.com?code=5000">5000</a> asdf

JapanPro
This is a truly irrelevant answer.
hristo
Sorry, but regex won't work with PDF content streams.
Dwight Kelly
A: 

I'm pretty sure you are saying that you have ten-digit numbers throughout your input text, and you want all ten-digit numbers converted to links. JapanPro's answer does not do that - it converts URLs to links.

This should work for converting numbers:

<?php
$s="some text with 1234567890 and then more text 
and then 1234512345 and then 
more text";
// The lookarounds keep the pattern from matching inside a longer run of digits.
echo preg_replace('/(?<!\d)(\d{10})(?!\d)/', '<a href="http://something.com?code=$1">$1</a>',$s);
?>

Output:

some text with <a href="http://something.com?code=1234567890">1234567890</a> and then more text 
and then <a href="http://something.com?code=1234512345">1234512345</a> and then 
more text
JGB146
Guys, I know regex, for god's sake.
hristo
+1  A: 

Replacing text in a PDF is difficult and none of the open source PDF solutions support this capability.

Apago (www.apago.com) has developed a commercial solution for replacing text in PDF files. It's used by greeting card manufacturers to modify pricing, "MADE IN" text, product numbers, etc.

Dwight Kelly
Ok - but what about finding the text and getting its bounding box, in order to draw an active transparent rectangle directly above it?
hristo
Xpdf could be used to calculate the bbox of text. See the example TextOutput output device class. If you need something ready-to-go, contact [email protected] for more information about the tool I mentioned above.
Dwight Kelly
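
The bounding-box idea from the comments above can be sketched in Python. This sketch assumes the word boxes have already been extracted in the XHTML form that `pdftotext -bbox` from Poppler (an Xpdf derivative) emits. The markup in `SAMPLE` is a simplified assumption and should be checked against real output, as should the assumption that the boxes are measured from the top-left of the page. `find_code_rects` and `local_name` are hypothetical helper names. The sketch finds words that are exactly ten digits and flips each box to the bottom-left origin that PDF annotations use.

```python
import re
import xml.etree.ElementTree as ET

# Assumed, simplified shape of the bbox output for one page (units: points).
SAMPLE = """<doc>
  <page width="595.0" height="842.0">
    <word xMin="72.0" yMin="100.0" xMax="110.0" yMax="112.0">Price</word>
    <word xMin="115.0" yMin="100.0" xMax="185.0" yMax="112.0">1234567890</word>
    <word xMin="72.0" yMin="130.0" xMax="150.0" yMax="142.0">12345678901</word>
  </page>
</doc>"""

TEN_DIGITS = re.compile(r"\d{10}")


def local_name(element):
    """Strip any XML namespace so '{ns}word' and 'word' compare equal."""
    return element.tag.rsplit("}", 1)[-1]


def find_code_rects(bbox_xml):
    """Return (page_number, code, rect) for every word that is exactly ten digits.

    rect is (x0, y0, x1, y1) with the origin at the bottom-left of the page.
    """
    results = []
    root = ET.fromstring(bbox_xml)
    pages = [el for el in root.iter() if local_name(el) == "page"]
    for page_number, page in enumerate(pages, start=1):
        page_height = float(page.get("height"))
        for word in page.iter():
            if local_name(word) != "word":
                continue
            text = (word.text or "").strip()
            if not TEN_DIGITS.fullmatch(text):
                continue  # skips ordinary words and longer digit runs
            x_min, y_min = float(word.get("xMin")), float(word.get("yMin"))
            x_max, y_max = float(word.get("xMax")), float(word.get("yMax"))
            # Flip the y axis: the extracted boxes are measured from the top.
            rect = (x_min, page_height - y_max, x_max, page_height - y_min)
            results.append((page_number, text, rect))
    return results


for page_number, code, rect in find_code_rects(SAMPLE):
    print(page_number, code, rect, "http://something?code=" + code)
```

Each resulting rectangle could then be used for a transparent link annotation on that page. A code that is glued to punctuation, split across several words, or placed on a rotated page would need extra handling that this sketch does not attempt.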