+1  Q: 

PHP Regex Question

I have a series of urls in a web doc, something like this:

<a href="somepage.php?x=some_document.htm">click here</a>

What I want to do is replace the some_document.htm piece (shown in bold in the original post):

<a href="somepage.php?x=some_document.htm">click here</a>

.. with some sort of encrypted variation (let's just say base64 encoding) .. something like this:

for each match, turn it into base64_encode(match)

Notes:

1. The phrase href="somepage.php?x= will always precede the piece to replace.
2. A double-quote (") will always follow it.

I am not a regex guru -- but I know some of you are. Any easy way to do this?

UPDATE:

I solved this by using a modified version of what Chris submitted, here it is:

function encrypt_param( $in_matches ) {    
  return   'href="somepage.php?x=' . base64_encode( $in_matches[1] ) . '"';
}

$webdoc = preg_replace_callback( '/href="somepage\.php\?x=([^"]+)"/',
                                 'encrypt_param', 
                                 $webdoc );
+2  A: 

I would consider using the PHP DOM parser. Anything less is a hack. (Not that hacks are always bad, just know the difference between a simple regex and a DOM parser.) getElementsByTagName() will get your <a> tags, getAttribute() will get your href attributes, and setAttribute() modifies.
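A minimal sketch of the three calls named above, assuming the document is held in a string. The sample markup and the $prefix variable are illustrative assumptions, not part of the original question, and this is one possible way to wire the calls together rather than a definitive implementation.

```php
<?php
// Hypothetical sample input; any HTML containing the links would do.
$webdoc = '<p><a href="somepage.php?x=some_document.htm">click here</a></p>';

// The fixed part of the href that always precedes the value (from the question).
$prefix = 'somepage.php?x=';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of emitting them for imperfect markup.
libxml_use_internal_errors(true);
$doc->loadHTML($webdoc);
libxml_clear_errors();

foreach ($doc->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    // Only touch links that start with the expected prefix.
    if (strpos($href, $prefix) === 0) {
        $value = substr($href, strlen($prefix));
        $anchor->setAttribute('href', $prefix . base64_encode($value));
    }
}

// Note: loadHTML() wraps a fragment in <html><body> and adds a doctype,
// so the serialized result is a full document, not the bare fragment.
$result = $doc->saveHTML();
echo $result;
```

Unlike the regex approaches, this rewrites the whole document through the parser, so whitespace and markup outside the links may come back slightly normalized.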

Adam Backstrom
I think he's asking about doing this server side.
Colin Fine
Yeah, you'll note that I linked to the PHP DOM parser.
Adam Backstrom
That's good if the entire web doc is DOM-compatible. If it was obtained from a remote server, are you sure the people on that side wrote their HTML well? E.g. transitional HTML lets you leave some tags unclosed (<p>, etc.), which is AFAIK not DOM-compatible...
Jet
Oops. Excuse me for the confusing comment. I checked, and DOMDocument + SimpleXML works well if you set $doc->strictErrorChecking = false; Well... I had no idea that it works so well. Thanks to Adam.
Jet
@Adam, you linked to the PHP DOM parser after I posted my comment.
Colin Fine
+1  A: 

preg_replace('/href="somepage.php\?x=([^"]*)"/e', "somepage.php?x='.base64_encode("$1").'"', $url)

(not tested). The /e modifier means you can use an expression in the replacement string.

Colin Fine
The replacement pattern will be passed to eval() as a whole, and [somepage.php?x=...] is not valid PHP
soulmerge
Using this, but it halts the script (any ideas?): $html = preg_replace('/href="document.php\?x=([^"]*)"/e', "href=\"document.php?x=" . base64_encode("$1") . '"', $html);
OneNerd
Yes, I missed out a ' at the start of the replacement text. I said it was untested ;-)
Colin Fine
+5  A: 

I think you are looking for something like this:

function doSomething($matches) {
   return base64_encode($matches[1]);
}

$webdoc = preg_replace_callback('/href="somepage\.php\?x=([^"]+)"/', 'doSomething', $webdoc);

The preg_replace answer works similarly. If you want to do something more elaborate, the callback allows you to do that.
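A hedged variant of the callback approach, sketched under two assumptions: PHP 5.3.0 or later (for the anonymous function) and a hypothetical sample string standing in for $webdoc. Because the callback's return value replaces the entire match, this sketch captures the fixed prefix and the closing quote in their own groups and puts them back around the encoded value. It is also worth noting that the /e modifier used by the preg_replace variant was removed in PHP 7.0, so the callback form is the one that still runs on current PHP.

```php
<?php
// Hypothetical sample input standing in for the asker's $webdoc.
$webdoc = '<a href="somepage.php?x=some_document.htm">click here</a>';

// Group 1: fixed prefix, group 2: value to encode, group 3: closing quote.
// The dot and the question mark are escaped so they match literally.
$pattern = '/(href="somepage\.php\?x=)([^"]+)(")/';

$webdoc = preg_replace_callback(
    $pattern,
    function ($m) {
        // Whatever is returned here replaces the whole match,
        // so the prefix and the quote are re-emitted unchanged.
        return $m[1] . base64_encode($m[2]) . $m[3];
    },
    $webdoc
);

echo $webdoc, "\n";
```

Capturing the prefix avoids repeating the literal href text in both the pattern and the callback, which is a design choice of this sketch rather than a requirement.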

Christian Hang
I must be doing something wrong - can't get this to work. When I try it, it replaces the entire phrase with the encoded value instead of just the some_document.htm part.
OneNerd
Starting from PHP 5.3.0 you can use an anonymous function instead of 'doSomething'. Read more at http://www.php.net/manual/functions.anonymous.php
Jet
Chris - figured out the missing piece from your suggestion and updated my orig post with the solution which is derived from your idea - thanks!
OneNerd
Sorry about that, I didn't actually test the snippet. Checking the documentation revealed that the return value of the callback function will replace the entire matched string, not just the matched element, as you already figured out. I am glad I could help you to get it to work.
Christian Hang
+1  A: 

It seems like you might be conflating a multi-step task, which may ultimately create more trouble in the long run. You'd basically like to do three things:

  1. Find all anchor tags on a page
  2. Extract the URL in the href attribute from these tags
  3. Extract a specific variable in the query string from that URL

There are a number of ways to do this in PHP. Yes, one direct way is using a regular expression, but it's less transparent. For this particular case, you're really fitting the code to a very small problem, which reduces its reusability in future applications.

My suggestion is to use a light DOM parser available from SourceForge called SimpleHTMLDom. Using this parser, you can write much clearer code for the task you're undertaking.

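// $dom_object is assumed to be a document already loaded with SimpleHTMLDom
foreach ($dom_object->find('a') as $anchor) {
    $url = $anchor->href;
    $queryArray = array();
    // Split the query string of the href into an associative array
    parse_str(parse_url($url, PHP_URL_QUERY), $queryArray);
    $myVariable = $queryArray['x'];
}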

And then of course $myVariable will be the value you're looking to get with that regex.
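The query-string step can be tried on its own with only built-in functions. This is a small sketch assuming a hypothetical href value of the shape given in the question; $encoded and $newUrl are names introduced here for illustration, and the rebuild at the end is one possible way to finish the task, not part of the approach described above.

```php
<?php
// Hypothetical href value, as it might come from $anchor->href.
$url = 'somepage.php?x=some_document.htm';

// Step 3 from the list: pull the query string apart into an array.
$queryArray = array();
parse_str((string) parse_url($url, PHP_URL_QUERY), $queryArray);

// Guard against links that do not carry the x parameter at all.
$myVariable = isset($queryArray['x']) ? $queryArray['x'] : null;

if ($myVariable !== null) {
    $encoded = base64_encode($myVariable);
    // Rebuild the link from its path plus the encoded value.
    $newUrl = parse_url($url, PHP_URL_PATH) . '?x=' . $encoded;
    echo $newUrl, "\n";
}
```

Base64 output can contain +, / and =, so passing $encoded through urlencode() before putting it in a URL may be worth considering.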

Robert Elwell
A: 

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

Chas. Owens