ansaurus

Question

Is there a regular expression to strip specific query variables from a URI?

Answer 1

A:

With HTML chunks that big, I'd farm this out to an external process, probably a Perl script.

I'm not positive, since I've never attempted to parse anywhere near that much text, but I'm willing to bet that PHP is not going to do this quickly.

What is your expected load? How often are you going to have to do this type of processing? This sounds like something that you'd do as a batch operation, which, in my admittedly limited experience with such tasks, doesn't necessarily need to be super fast, just fast enough that it will execute in a reasonable amount of time (i.e., you're not waiting for it overnight or whatever).

Peter Bailey 2009-06-04 15:26:40

Answer 2

A:

Not really a RegExp but it may help you (not tested):

$xmlPrologue = '<?xml version="1.0"?>';
$source = '...'; // your business

// the DOMDocument constructor takes (version, encoding), not the markup itself
$dom = new DOMDocument();
$dom->loadXML($source);

$links = $dom->getElementsByTagName('a');

foreach ($links as $link) {
    // split the href into base and query string; skip links without a query
    $parts = explode('?', $link->getAttribute('href'), 2);
    if (count($parts) < 2) {
        continue;
    }
    list($base, $queryString) = $parts;

    // read GET parameters into an array ($params is assigned by reference)
    parse_str($queryString, $params);

    // get rid of unwanted GET params
    unset($params['utm_source']);
    unset($params['utm_medium']);
    unset($params['utm_email']);
    unset($params['utm_report']);

    // recompose the query string; a plain '&' should be enough here, because
    // the DOM escapes it as '&amp;' when the document is serialized
    $queryString = http_build_query($params, '', '&');

    // assign the newly cleaned href attribute (no trailing '?' if nothing is left)
    $link->setAttribute('href', $queryString === '' ? $base : $base . '?' . $queryString);
}

$html = $dom->saveXML();

// strip the XML declaration. Puts IE in quirks mode
$html = substr_replace($html, '', 0, strlen($xmlPrologue));
$html = trim($html);

echo $html;

Ionuț G. Stan 2009-06-04 15:27:49

Wow! I want to see the process listing on that server with 20M DOMs parsing... This way is good, but not for huge files, because the DOM extension eats a lot of memory.

Jet 2009-06-04 15:35:29

Honestly, I've never tried it on 20MB of HTML, but I'd like to see the process listing too. PHP is so slow sometimes.

Ionuț G. Stan 2009-06-04 15:43:53

Way too heavy :(.

Zachary Spencer 2009-06-04 15:49:04

Answer 3

A:

Regex is one way. Alternatively, you could use XPath to find all links within the document and then work on each of those in a loop. Since this is an XHTML document, and assuming it is well formed, this approach seems reasonable.
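A minimal sketch of the XPath idea in PHP, assuming the input is well-formed XHTML. The function name `cleanLinks`, the parameter list, and the example URL are made up for illustration. The query uses `local-name()` so that it also matches `a` elements inside the XHTML namespace.

```php
<?php
// Hypothetical helper: removes the query parameters named in $unwanted
// from the href of every link in a well-formed XHTML fragment.
function cleanLinks($xhtml, array $unwanted)
{
    $dom = new DOMDocument();
    $dom->loadXML($xhtml); // fails on markup that is not well formed

    $xpath = new DOMXPath($dom);

    // local-name() matches <a> with or without the XHTML namespace
    foreach ($xpath->query('//*[local-name()="a"][@href]') as $link) {
        $href = $link->getAttribute('href');

        // keep a trailing #fragment out of the query string
        $fragment = '';
        $hashPos = strpos($href, '#');
        if ($hashPos !== false) {
            $fragment = substr($href, $hashPos);
            $href = substr($href, 0, $hashPos);
        }

        $parts = explode('?', $href, 2);
        if (count($parts) < 2) {
            continue; // no query string, nothing to clean
        }

        parse_str($parts[1], $params);
        $params = array_diff_key($params, array_flip($unwanted));
        $query = http_build_query($params, '', '&');

        $link->setAttribute(
            'href',
            $parts[0] . ($query === '' ? '' : '?' . $query) . $fragment
        );
    }

    // serializing the root element avoids emitting the XML declaration
    return $dom->saveXML($dom->documentElement);
}

$input = '<div><a href="http://example.com/p?id=7&amp;utm_source=report'
       . '&amp;utm_medium=email#top">x</a><a href="/plain">y</a></div>';
$output = cleanLinks($input, array('utm_source', 'utm_medium', 'utm_campaign'));
```

Note that `parse_str` followed by `http_build_query` re-encodes the remaining parameters, so their escaping may not be byte-for-byte identical to the original href.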

aleemb 2009-06-04 15:27:51

This approach seems easier, too.

Brian 2009-06-04 15:36:55

Answer 4

+1 A:

If the string is always the same, the fastest PHP function I've found for that is strtr.

PHP strtr

string strtr ( string $str , string $from , string $to )

$html = strtr($html, array('&utm_source=report&utm_medium=email&utm_campaign=report' => ''));

Obviously you'll need to benchmark the speed, but that should be up there.

Phil Carter 2009-06-04 15:30:25

`strtr` just replaces certain characters. Use `str_replace` instead.

Gumbo 2009-06-04 15:42:16

Problem with Strstr is it's only the first occurrence of the string.

Zachary Spencer 2009-06-04 15:46:11

@Gumbo: he asked for speed, and strtr is MUCH faster than str_replace. @Zachary Spencer: strtr is NOT strstr, and it processes the whole string.

Phil Carter 2009-06-04 15:53:39

Trying strtr now...

Zachary Spencer 2009-06-04 15:59:33

strtr gave me issues. Here's sample output from what strtr gave me: There is no acmivimy mo reporm! If yoa didn'm expecm mhis, im pay be an indicamion mham Zachary is circapvenming... Using str_replace, and that's working... for now. Eventually I want to move this to its own XSL file.

Zachary Spencer 2009-06-04 20:23:44

It doesn’t matter how much faster `strtr` is than `strstr` or `str_replace` when the function is not appropriate for this. `strtr` only replaces certain characters with other characters, not a string with another string. Take this simple example: `strtr('foobar', 'foobar', '')` returns "foobar". Nothing changed. But `strtr('foobar', 'oa', '-+')` returns "f--b+r" (replaces "o" with "-" and "a" with "+").

Gumbo 2009-06-04 20:33:56

You should take a closer look into the manual: “This function returns a copy of `str`, translating all occurrences of each character in `from` to the corresponding character in `to`.”

Gumbo 2009-06-04 20:35:04

@Gumbo, I have re-read the manual and it does do that, but if you use the array input (see the updated response), it replaces whole strings, which is how I always use it for international translation. <?php $text = 'a href="http://sub.exmaple.com/link.html?var=val $newtext = strtr($text, array(' print $text."<br />\n"; print $newtext;?>
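The disagreement above comes down to the two calling forms of `strtr`. A short sketch comparing them with `str_replace`; the URL and variable names are made up for illustration:

```php
<?php
$tracking = '&utm_source=report&utm_medium=email&utm_campaign=report';
$url = 'http://example.com/page?id=7' . $tracking; // hypothetical URL

// Three-argument form: translates single characters ("o" -> "-", "a" -> "+").
$charForm = strtr('foobar', 'oa', '-+'); // "f--b+r"

// Two-argument (array) form: replaces whole substrings.
$arrayForm = strtr($url, array($tracking => ''));

// str_replace performs the same whole-substring replacement.
$replaceForm = str_replace($tracking, '', $url);
```

Here `$arrayForm` and `$replaceForm` both end up as the URL without the tracking suffix. Which of the two is faster on large inputs is something to benchmark rather than assume.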

Phil Carter 2009-06-05 08:34:28

Answer 5

A:

PHP's preg_replace() will do this quite fast if you run it in CGI mode on the backend. Why not use a cron job to run a PHP script periodically to process all your HTML files? Then your frontend PHP script only has to send the processed contents to the browser, without any calculations.
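A rough sketch of what such a regular expression could look like with `preg_replace`, assuming the parameter names are known in advance. The helper name `stripQueryParams` and the sample markup are hypothetical. Like any text-level replacement, it matches anywhere in the input, not only inside href attributes.

```php
<?php
// Hypothetical helper: strips the named query parameters wherever they
// appear in $html, whether separated by "&" or by "&amp;".
function stripQueryParams($html, array $names)
{
    $quoted = array();
    foreach ($names as $n) {
        $quoted[] = preg_quote($n, '~');
    }
    $name  = '(?:' . implode('|', $quoted) . ')';
    $value = '[^&#"\'\s<>]*';   // a value ends at the next separator, quote, or fragment
    $sep   = '(?:&amp;|&)';

    $patterns = array(
        '~' . $sep . $name . '=' . $value . '~',     // parameter after another one
        '~\?' . $name . '=' . $value . $sep . '~',   // first parameter, more follow
        '~\?' . $name . '=' . $value . '~',          // the only parameter left
    );

    // patterns are applied in order, each to the result of the previous one
    return preg_replace($patterns, array('', '?', ''), $html);
}

$names = array('utm_source', 'utm_medium', 'utm_campaign');
$html  = '<a href="http://example.com/a?id=7&amp;utm_source=report'
       . '&amp;utm_medium=email&amp;utm_campaign=report">x</a>';
$clean = stripQueryParams($html, $names);
```

The three patterns cover a parameter in the middle or at the end of a query string, at the start with others following, and on its own, so no dangling `?` or `&` is left behind.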

Jet 2009-06-04 15:33:34

Answer 6

+2 A:

You could use sed or some other low-level tool to remove those parts:

find /path/to/dir -type f -name '*.html' -exec sed -i 's/&utm_source=report&utm_medium=email&utm_campaign=report//g' {} \;

But that would remove this string anywhere and not just in URLs. So be careful.

Gumbo 2009-06-04 15:33:50

I would second doing the fix on the original data - rather than modifying it in either the server or the client.

Douglas Leeder 2009-06-04 15:41:50

Does this overwrite the initial file? I don't want to overwrite the initial file.

Zachary Spencer 2009-06-04 15:54:23

Yes it does (see `-i`). If you don’t want that, set `-i.backup` and you’ll get a `*filename*.backup`. But again, try it first on some test files before applying it to all of your files.

Gumbo 2009-06-04 16:01:51

Answer 7

A:

I eventually resorted to using str_replace and replacing the string throughout the entire contents of the document :(.

Zachary Spencer 2009-06-04 20:25:14
