I have a bunch of HTML that is generated by a daemon using C, XML and XSL. I also have a PHP script which picks up the HTML markup and displays it on the screen.

I have a huge swathe of XHTML 1 compliant markup. I need to modify all of the links in the markup to remove &utm_source=report&utm_medium=email&utm_campaign=report.

So far I've considered two options.

  1. Do a regex search in the PHP backend which trims out the Analytics code
  2. Write some jQuery to loop through the links and then trim out the Analytics code from the href.

Hurdles:

  1. The HTML can be HUGE, i.e. more than 4 MB (I ran some tests; the files average about 100 KB).
  2. It has to be fast. We get approximately 3K

Thoughts?

Right now I'm trying to use str_replace('&utm_source=report&utm_medium=email&utm_campaign=report','',$html); but it's not working.

A: 

With HTML chunks that big, I'd farm this out to an external process, probably a Perl script.

I'm not positive since I've never attempted to parse anywhere near that much text, but I'm willing to bet that PHP is not going to do this quickly.

What is your expected load? How often are you going to have to do this type of processing? This sounds like something that you'd do as a batch operation, which, in my admittedly limited experience with such tasks, doesn't necessarily need to be super fast, just fast enough to execute in a reasonable amount of time (i.e., you're not waiting for it overnight or whatever).

Peter Bailey
A: 

Not really a RegExp but it may help you (not tested):

$source = '...'; // your business

// the constructor takes a version and encoding, not the markup itself
$dom = new DOMDocument();
$dom->loadXML($source);

$links = $dom->getElementsByTagName('a');

foreach ($links as $link) {
    $href = $link->getAttribute('href');

    // nothing to clean if the link has no query string
    if (strpos($href, '?') === false) {
        continue;
    }

    list($base, $queryString) = explode('?', $href, 2);

    // read GET parameters into an array ($params is assigned by reference)
    parse_str($queryString, $params);

    // get rid of unwanted GET params
    unset($params['utm_source'], $params['utm_medium'], $params['utm_campaign']);

    // recompose query string; saveXML() escapes "&" as "&amp;" on output
    $queryString = http_build_query($params, '', '&');

    // assign the newly cleaned href attribute
    $link->setAttribute('href', $queryString === '' ? $base : $base . '?' . $queryString);
}

$html = $dom->saveXML();

// strip the XML declaration. It puts IE in quirks mode
$html = preg_replace('/^<\?xml[^>]*\?>\s*/', '', $html);

echo $html;
Ionuț G. Stan
Wow! I want to see the process listing on that server while it parses 20 MB DOMs... This way is good, but not for huge files, because the DOM extension eats a lot of memory.
Jet
Honestly, I've never tried it on 20MB of HTML, but I'd like to see the process listing too. PHP is so slow sometimes.
Ionuț G. Stan
Way too heavy :(.
Zachary Spencer
A: 

Regex is one way. Alternatively, you could use XPath to find all links within the document and then work on each of those in a loop. Since this is an XHTML document, and assuming it is well formed, this approach seems reasonable.
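
A rough, untested-at-scale sketch of the XPath idea. The function name is made up, and it assumes the document declares the standard XHTML namespace and that the tracking parameters never come first in the query string:

```php
<?php
// Hypothetical helper: strip utm_* parameters from links found via XPath.
function strip_utm_via_xpath(string $xhtml): string
{
    $dom = new DOMDocument();
    $dom->loadXML($xhtml);

    $xpath = new DOMXPath($dom);
    // XHTML elements live in a namespace, so bind a prefix for the query.
    $xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');

    foreach ($xpath->query('//x:a[contains(@href, "utm_")]') as $link) {
        // getAttribute() returns the decoded value, so "&amp;" arrives as "&".
        $href = preg_replace('/&utm_[a-z]+=[^&#]*/', '', $link->getAttribute('href'));
        $link->setAttribute('href', $href);
    }

    // Serialising the root element leaves out the XML declaration.
    return $dom->saveXML($dom->documentElement);
}
```

Only anchors whose href mentions "utm_" are visited, so the loop stays short even when the document has many links, though the whole document still has to be parsed first.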

aleemb
This approach seems easier, too.
Brian
+1  A: 

If the string is always the same, the fastest PHP function I've found for that is strtr.

PHP strtr

string strtr ( string $str , string $from , string $to )

$html = strtr($html, array('&utm_source=report&utm_medium=email&utm_campaign=report' => ''));

Obviously you'll need to benchmark the speed, but that should be up there.
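
For what it's worth, strtr behaves very differently depending on how it is called, so it is worth checking which form you are using. A small sketch (the sample link is made up for illustration):

```php
<?php
// Made-up sample link, for illustration only.
$html = '<a href="page.html?id=1&utm_source=report&utm_medium=email&utm_campaign=report">x</a>';
$tracking = '&utm_source=report&utm_medium=email&utm_campaign=report';

// Array form: replaces whole substrings.
$pairs = strtr($html, array($tracking => ''));

// Three-argument form: translates single characters, one by one.
$chars = strtr('foobar', 'oa', '-+');

echo $pairs, "\n";
echo $chars, "\n";
```

The array form removes the tracking string as a unit, while the three-argument form maps "o" to "-" and "a" to "+" wherever they occur.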

Phil Carter
`strtr` just replaces certain characters. Use `str_replace` instead.
Gumbo
Problem with Strstr is it's only the first occurrence of the string.
Zachary Spencer
@Gumbo: he asked for speed, and strtr is MUCH faster than str_replace. @Zachary Spencer: strtr is NOT strstr, and it does the whole string.
Phil Carter
Trying strtr now...
Zachary Spencer
strtr gave me issues. Here's sample output from what strtr gave me: There is no acmivimy mo reporm! If yoa didn'm expecm mhis, im pay be an indicamion mham Zachary is circapvenming... I'm using str_replace and that's working... for now. Eventually I want to move this to its own XSL file.
Zachary Spencer
It doesn’t matter how much faster `strtr` is than `strstr` or `str_replace` when the function is not appropriate for this. `strtr` only replaces certain characters with other characters, not a string with another string. Take this simple example: `strtr('foobar', 'foobar', '')` returns "foobar". Nothing changed. But `strtr('foobar', 'oa', '-+')` returns "f--b+r" (replaces "o" with "-" and "a" with "+").
Gumbo
You should take a closer look into the manual: “This function returns a copy of `str`, translating all occurrences of each character in `from` to the corresponding character in `to`.”
Gumbo
@Gumbo, I have re-read the manual and it does do it, but only if you use the array input (see the updated response), which is how I always use it for international translation. <?php $text = 'a href="http://sub.exmaple.com/link.html?var=val $newtext = strtr($text, array(' print $text."<br />\n"; print $newtext;?>
Phil Carter
A: 

PHP's preg_replace() will do this quite fast if you run it in CGI mode on the backend. Why not use a cron job to run a PHP script periodically to process all your HTML files? Then your frontend PHP script only sends the already-processed content to the browser without any extra computation.
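
If you go the regex route, one way to avoid touching text outside of links is to rewrite only the href attributes. A rough sketch, assuming double-quoted href values and tracking parameters that never come first in the query string; strip_utm_from_hrefs is a made-up name:

```php
<?php
// Hypothetical helper: strip utm_* parameters from href attributes only.
function strip_utm_from_hrefs(string $html): string
{
    return preg_replace_callback(
        '/(href=")([^"]*)(")/',
        function (array $m): string {
            // Remove each utm_* pair together with its leading "&" or "&amp;".
            $url = preg_replace('/(?:&amp;|&)utm_[a-z]+=[^&"#]*/', '', $m[2]);
            return $m[1] . $url . $m[3];
        },
        $html
    );
}
```

A batch script run from cron could call this once per file and write the result to a separate output file, so the originals stay untouched.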

Jet
+2  A: 

You could use sed or some other low-level tool to remove those parts:

find /path/to/dir -type f -name '*.html' -exec sed -i 's/&utm_source=report&utm_medium=email&utm_campaign=report//g' {} \;

But that would remove this string anywhere and not just in URLs. So be careful.

Gumbo
I would second doing the fix on the original data - rather than modifying it in either the server or the client.
Douglas Leeder
Does this overwrite the initial file? I don't want to overwrite the initial file.
Zachary Spencer
Yes it does (see `-i`). If you don’t want that, set `-i.backup` and you’ll get a `*filename*.backup`. But again, try it first on some test files before applying it to all of your files.
Gumbo
A: 

I eventually resorted to using str_replace and replacing the string throughout the entire contents of the document :(.
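
A minimal sketch of what that can look like. It assumes the suffix may appear either with raw "&" or entity-encoded as "&amp;" (which XHTML attributes normally require); the function name is made up:

```php
<?php
// Assumption: the markup may contain the suffix with raw "&" or with "&amp;".
$tracking = '&utm_source=report&utm_medium=email&utm_campaign=report';

// Hypothetical helper: remove both spellings of the tracking suffix.
function strip_tracking(string $html, string $tracking): string
{
    $variants = array(
        str_replace('&', '&amp;', $tracking), // entity-encoded form
        $tracking,                            // raw form
    );

    // Every occurrence of either variant is replaced with an empty string.
    return str_replace($variants, '', $html);
}
```

If a plain str_replace with the raw string appears to do nothing, checking the markup for the "&amp;" form is a cheap first step.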

Zachary Spencer