views:

404

answers:

3

I have some HTML/CSS/JavaScript with painfully long class, id, variable and function names and other, combined strings that get used over and over. I could probably rename or restructure a few of them and cut the text in half.

So I'm looking for a simple algorithm that reports on the longest repeated strings in text. Ideally, it would reverse sort by length times instances, so as to highlight to strings that, if renamed globally, would yield the most savings.

This feels like something I could do painfully in 100 lines of code, for which there's some elegant, 10-line recursive regex. It also sounds like a homework problem, but I assure you it's not.

I work in PHP, but would enjoy seeing something in any language.

NOTE: I'm not looking for HTML/CSS/JavaScript minification per se. I like meaningful text, so I want to do it by hand, and weigh legibility against bloat.

A: 

I asked a similar question here.
Maybe there is something useful there.

Burkhard
+2  A: 

This will find all repeated strings:

(?=((.+)(?:.*?\2)+))

Use that with preg_match_all and select the longest one.

function len_cmp($match1,$match2) {
  return $match2[0] - $match1[0];
}

preg_match_all('/(?=((.+)(?:.*?\2)+))/s', $text, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
  $match[0] = substr_count($match[1], $match[2]) * strlen($match[2]);
}

usort($matches, "len_cmp");

foreach ($matches as $match) {
  echo "($matches[2]) $matches[1]\n";
}

This method could be quite slow though, as there could be a LOT of strings repeating. You could reduce it somewhat by specifying a minimum length, and a minimum number of repetitions in the pattern.

(?=((.{3,})(?:.*?\2){2,}))

This will limit the number of characters repeating to at least three, and the number of repetitions to three (first + 2).

Edit: Changed to allow characters between the repetitions.
Edit: Changed sorting order to reflect best match.

MizardX
Better use `(?=((.+?)\2+))`.
Gumbo
That doesn't matter. It will still try all lengths.
MizardX
But you can limit that with `'{1,'.floor(strlen($text)/2).'}'`.
Gumbo
I'm still trying alternatives, mostly for speed, but this is very clever. Clearly standard compression algorithms are doing something less complete!
LibraryThingTim
A: 

Seems I'm a little late, but it also does the work:

preg_match_all('/(id|class)+="([a-zA-Z0-9-_ ]+)"/', $html, $matches);

$result = explode(" ", implode(" ", $matches[2]));
$parsed = array();
foreach($result as $string) {
    if(isset($parsed[$string])) {
        $parsed[$string]++;
    } else {
        $parsed[$string] = 1;
    }
}
arsort($parsed);

foreach($parsed as $k => $v) {
    echo $k . " -> Found " . $v . " times<br/>";
}

The ouput will be something like:

some_id -> Found 2 times
some_class -> Found 2 times
Nathan