Hi all,

So, the situation I'm currently in is a wee bit complicated (for me that is), but I'm gonna give it a try.

I would like to run through a snippet of HTML and extract all the links referring to my own domain. Next, I want to append a predefined string of GET vars to these URLs. For example, I want to append '?var1=2&var2=4' to 'http://www.domain.com/page/', thus creating 'http://www.domain.com/page/?var1=2&var2=4'.

The method I'm currently applying is a simple preg_replace call (PHP), but here is where it gets interesting. How do I create valid appended URLs when they already have some GET vars at the end? For example, it could create a URL like this: 'http://www.domain.com/page/?already=here&another=one?var1=2&var2=4', thus breaking the GET data.

So to conclude, what I'm looking for is a regexp which can cope with these scenarios, create my extended URL and write it back to the HTML snippet.

This is what I have so far:

$sHTML = preg_replace("'href=\"($domainURL.*?[\/$])\"'", 'href="\1' . $appendedTags . '"', $sHTML);

Thanks in advance

+2  A: 

Regexes are not the solution. As somebody once said:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

But never mind that. What I would use is parse_url, and then append my var1=1&var2=2 to the resulting query string. Something along the lines of:

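$broken = parse_url($url);
// Append the new variables, then drop the leading '&' if there was no query before.
$query = ltrim((isset($broken['query']) ? $broken['query'] : '') . '&var1=1&var2=2', '&');
// Only add the path and fragment when the URL actually has them.
$path = isset($broken['path']) ? $broken['path'] : '';
$fragment = isset($broken['fragment']) ? '#' . $broken['fragment'] : '';
return $broken['scheme'] . '://' . $broken['host'] . $path .
  '?' . $query . $fragment;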

If you don't want your variables to appear twice, also use parse_str to break apart the query string.
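For reference, here is a small sketch of what parse_url hands back for a URL like the one in the question. The '#top' fragment is an addition made up for illustration; it is not in the original URL.

```php
<?php
// Example URL from the question, with a made-up '#top' fragment added.
$parts = parse_url('http://www.domain.com/page/?already=here&another=one#top');

// parse_url splits the URL into named components; keys that do not
// apply (e.g. 'query' on a URL without '?') are simply absent.
echo $parts['scheme'], "\n";   // http
echo $parts['host'], "\n";     // www.domain.com
echo $parts['path'], "\n";     // /page/
echo $parts['query'], "\n";    // already=here&another=one
echo $parts['fragment'], "\n"; // top
```

Because the query string arrives as its own component, deciding between '?' and '&' no longer depends on pattern matching against the whole URL.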

Elazar Leibovich
parse_url() is definitely the right way to go about this. +1 for that. However, if parse_url were not available, regexp is a natural second choice, and I think it's reasonable to expect someone who's unaware of parse_url() to try to find a regexp solution. This is just one of those problems that regexp is suited towards. In fact, I would be surprised if PHP's own implementation of parse_url() did not employ regexps under the hood.
Calvin
@Calvin; Be surprised http://alanstorm.com/testbed/parse_url.txt
Alan Storm
Maybe, given PHP, the fastest solution is a regex, but it *must* be encapsulated in a function. With C++, it's definitely much less efficient and not good for general parsing of URLs. We'll use parse_url whether it's given or not; if it's not given, we'll implement it. We might choose to implement it with a regex, but this is an implementation detail and not the main point.
Elazar Leibovich
@Alan Storm: LOL. I guess it's time for me to eat my own words. @Elazar: I agree. I just think that it's unfair to characterize this problem as an example of poor use of regexp, even from the code-readability standpoint that Atwood is coming from.
Calvin
+4  A: 

In addition to what Elazar Leibovich suggested, I'd parse the query string with parse_str(), modify the resulting array to my needs and then use http_build_query() to rebuild the query string. This way you won't have any duplicates within your query string and you don't have to bother yourself with url-encoding your query-parts.

The complete example would then look like this (augmenting Elazar Leibovich's code):

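$broken = parse_url($url);
// parse_str() returns nothing; it fills the array passed as its second argument.
$query = array();
parse_str(isset($broken['query']) ? $broken['query'] : '', $query);
$query['var1'] = 1;
$query['var2'] = 2;
$broken['query'] = http_build_query($query);
// Only add the fragment when the URL actually has one.
$fragment = isset($broken['fragment']) ? '#' . $broken['fragment'] : '';
return $broken['scheme'] . '://' . $broken['host'] . $broken['path'] .
  '?' . $broken['query'] . $fragment;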
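To tie this back to the original question (rewriting links inside an HTML snippet), the approach could be wrapped up roughly as follows. This is only a sketch: the function names appendQueryVars and appendQueryVarsInHtml are made up for illustration, it assumes PHP 5.3+ (closures, array_replace), and it only handles double-quoted href attributes that start with the given domain prefix.

```php
<?php
// Hypothetical helper: merge $extra into the query string of $url.
function appendQueryVars($url, array $extra)
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // leave anything we cannot parse untouched
    }

    $query = array();
    if (isset($parts['query'])) {
        parse_str($parts['query'], $query);
    }
    // New values overwrite existing ones with the same name, so no duplicates.
    $query = array_replace($query, $extra);

    $result = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $result .= ':' . $parts['port'];
    }
    $result .= isset($parts['path']) ? $parts['path'] : '/';
    $result .= '?' . http_build_query($query, '', '&');
    if (isset($parts['fragment'])) {
        $result .= '#' . $parts['fragment'];
    }
    return $result;
}

// Hypothetical helper: rewrite every double-quoted href starting with $domainURL.
function appendQueryVarsInHtml($html, $domainURL, array $extra)
{
    $pattern = '~href="(' . preg_quote($domainURL, '~') . '[^"]*)"~i';
    return preg_replace_callback($pattern, function ($m) use ($extra) {
        // Attribute values are HTML-escaped: decode before parsing, re-encode after.
        $url = html_entity_decode($m[1], ENT_QUOTES);
        return 'href="' . htmlspecialchars(appendQueryVars($url, $extra), ENT_QUOTES) . '"';
    }, $html);
}

$html = '<a href="http://www.domain.com/page/?already=here">x</a> '
      . '<a href="http://other.com/">y</a>';
echo appendQueryVarsInHtml($html, 'http://www.domain.com', array('var1' => 2, 'var2' => 4));
```

With the sample input, the first link should come out as href="http://www.domain.com/page/?already=here&amp;var1=2&amp;var2=4" and the second link should stay untouched. The regexp here only locates the attribute; the URL itself is still handled by parse_url. Note that the prefix match is naive (it would also match a host such as www.domain.com.example.org), so a stricter host check may be needed in practice.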
Stefan Gehrig
Never knew those functions even existed, thanks all. Wish I could accept both answers, but chose yours for the most upvotes and complete answer.
SolidSmile
To be honest: nobody would think parse_str() will do what it does when looking at the function name ;-)
Stefan Gehrig
A: 

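Also, parse_str won't return any values as shown in the answer; rather, it takes an array as a second parameter and fills it:

$array = array();
// $queryString is the query part of the URL, e.g. $broken['query'] from parse_url()
parse_str($queryString, $array);
// $array will contain the query variables, e.g. ["var1"], ["var2"]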
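To make the calling convention concrete, here is a short round trip through parse_str and http_build_query. The query string 'already=here&var1=9' is a made-up example, not taken from the question.

```php
<?php
$vars = array();
// parse_str() fills $vars by reference and returns nothing.
parse_str('already=here&var1=9', $vars);
// $vars is now array('already' => 'here', 'var1' => '9')

$vars['var1'] = 2; // overwrite the existing value instead of duplicating it
$vars['var2'] = 4; // add a new variable

// Passing '&' explicitly avoids depending on the arg_separator.output ini setting.
$rebuilt = http_build_query($vars, '', '&');
echo $rebuilt; // already=here&var1=2&var2=4
```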

just a side note ;)

-- G

Crassusg