ansaurus

Question

Tricky pattern match

Answer 1

A:

Unless you seriously confine the problem domain, I would say that this is impossible.

The title attribute can contain any arbitrary string in any human language (symbols, foreign characters, "smart" stuff, you name it). How would a regex be smart enough to catch the relevant part? Can you even formally define the relevant part in your own words?

Regexes suck when applied to human languages, and even much more complex systems tend to struggle with them.

Tomalak 2009-04-23 09:09:55

Answer 2

+3 A:

I'm not sure you'll ever come up with a pattern that handles all the eventualities you can run into with a problem like this. A title tag could be totally random text that wouldn't match at all.

For instance, here's a random site I picked off a random Google search. The site domain is "plus2net.com", and the title is (obviously geared for SEO) "PHP HTML MySQL articles tutorials, free scripts and programming forum". How would you ever correlate those two things? Theoretically you could use something like the levenshtein() function to measure how similar the two strings are, but I think coming up with a regexp to solve this problem is the wrong approach.
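
As a rough illustration of the levenshtein() idea, here is a minimal sketch. The helper names normalize() and normalized_distance() are made up for this example, and stripping everything except letters and digits is an assumption, not something the question specifies.

```php
<?php
// Hypothetical helper: keep only letters and digits, lowercased.
function normalize($text) {
    return strtolower(preg_replace('/[^a-z0-9]/i', '', $text));
}

// Edit distance between a domain label and a title after normalizing both.
function normalized_distance($domain, $title) {
    return levenshtein(normalize($domain), normalize($title));
}

// An exact match after normalization gives a distance of 0.
echo normalized_distance('yahoo', 'Yahoo!'), "\n";

// Extra words in the title inflate the distance: "welcometo" costs 9 insertions.
echo normalized_distance('thegreenpages', 'Welcome to The Green Pages.'), "\n";

// For an SEO-style title the distance says very little about the domain.
echo normalized_distance('plus2net', 'PHP HTML MySQL articles tutorials, free scripts and programming forum'), "\n";
```

The second call shows the weakness: the name is in the title, but the surrounding words still count against it, so a plain whole-string distance would need a threshold or a windowed comparison to be useful.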

I'd re-think the problem. What are you trying to accomplish? If you're just trying to correlate a list of domain names and title tags, couldn't you write a quick script to scrape the title tags from the list of domains you have and get the exact data?

zombat 2009-04-23 09:11:51

Thanks, I've never heard of a levenshtein function... I'll check it out. Oh, I'm already scraping the domain names and title tags. I'm looking for something a bit deeper than that, but I cannot divulge it ;)

EddyR 2009-04-23 10:09:39

I don't think the Levenshtein distance would help much, because it only counts the differences between two sequences. Zero differences would be ideal. But what if there is no perfect match? Add a threshold to take the next best match?

Gumbo 2009-04-23 10:59:33

@Gumbo Exactly. You also need rules about what to do if more than one candidate has the same Levenshtein distance. Should "Yahood" or "Yahoo" match (both have a distance of 1 from "Yahoo!")? Determining what the rules should be is a trial-and-error process based on the various inputs.
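
A small sketch of the tie problem, assuming both candidates are compared against the title "Yahoo!". The tie-break rule used here (prefer a candidate that is a prefix of the title) is purely hypothetical; it is just one of many rules you might end up with after trial and error.

```php
<?php
// Assumption: candidates are compared against the title "Yahoo!".
$title = 'Yahoo!';
$candidates = array('Yahood', 'Yahoo');

// Collect every candidate that shares the smallest distance.
$best = array();
$bestDistance = PHP_INT_MAX;
foreach ($candidates as $candidate) {
    $distance = levenshtein($candidate, $title);
    if ($distance < $bestDistance) {
        $bestDistance = $distance;
        $best = array($candidate);
    } elseif ($distance === $bestDistance) {
        $best[] = $candidate; // tie with the current best
    }
}

// Hypothetical tie-break: prefer a candidate that is a prefix of the title.
$winner = $best[0];
foreach ($best as $candidate) {
    if (strpos($title, $candidate) === 0) {
        $winner = $candidate;
        break;
    }
}

echo $bestDistance, ' ', count($best), ' ', $winner, "\n";
```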

Chas. Owens 2009-04-23 12:36:43

Agree with all. I didn't expect levenshtein to be an actual solution, I just used it as an example of a different approach to comparison heuristics. Definitely a tough problem.

zombat 2009-04-23 17:45:48

Answer 3

+1 A:

You could build a regular expression based on the domain name such as:

t\s*h\s*e\s*g\s*r\s*e\s*e\s*n\s*p\s*a\s*g\s*e\s*s

This would match "The Green Pages" in case-insensitive mode.

Edit: Here's an example of how you could build such a regular expression:

$data = array(
    array('yahoo', 'Yahoo!'),
    array('thegreenpages', 'Welcome to The Green Pages.'),
    array('experts-exchange', 'Experts Exchange - The #1 resource on the web for solving technology problems.')
);

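foreach ($data as $item) {
    // Split the domain into single characters, dropping the hyphens and
    // escaping each character so it is matched literally.
    $domain = array();
    foreach (str_split($item[0]) as $chr) {
        if ($chr != '-') {
            $domain[] = preg_quote($chr, '/');
        }
    }
    // Allow whitespace or hyphens between characters, plus an optional trailing "!".
    $pattern = '/'.implode('[\s-]*', $domain).'!?/i';
    // Only dump a result when the title actually contains the name.
    if (preg_match($pattern, $item[1], $match)) {
        var_dump($match[0]);
    }
}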

Gumbo 2009-04-23 09:12:14

Answer 4

A:

Is your list of domains fixed? If so, could you build a regex for each domain?

Obviously, you can strip out the domain fairly simply, but as Tomalak says, unless the problem domain is much more restricted, it is actually quite a complex computational problem!

From a domain, you need to strip out the words, for which you would need a reference dictionary (or one for each language), along with some kind of word matching, perhaps with some kind of voting for potential matches. Really, though, without a more specific problem domain this isn't likely to be accurate.
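
To make the dictionary idea concrete, here is a minimal sketch of splitting a domain label into words. The segment() function and the tiny word list are invented for this example; a real word list would be far larger and the voting step is not shown.

```php
<?php
// Hypothetical word list; a real one would come from a dictionary file.
$dictionary = array('the', 'green', 'pages', 'page', 'experts', 'exchange');

// Split $text into dictionary words. Returns an array of words, or null
// when the text cannot be fully covered by words from the list.
function segment($text, array $dictionary) {
    if ($text === '') {
        return array();
    }
    // Try longer prefixes first so "pages" is preferred over "page".
    for ($len = strlen($text); $len >= 1; $len--) {
        $prefix = substr($text, 0, $len);
        if (in_array($prefix, $dictionary, true)) {
            // The cast keeps older PHP versions, where substr() may return
            // false at the end of the string, working the same way.
            $rest = segment((string) substr($text, $len), $dictionary);
            if ($rest !== null) {
                return array_merge(array($prefix), $rest);
            }
        }
    }
    return null;
}

print_r(segment('thegreenpages', $dictionary));
var_dump(segment('yahoo', $dictionary));
```

Brand names such as "yahoo" are not in any dictionary, so they come back as null here, which is exactly why this approach is unlikely to be accurate on its own.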

It would help to know more about what you are trying to achieve.

SiC 2009-04-23 09:25:49

Answer 5

+1 A:

Try this code:

$sites = array(
 array('domain' => 'www.yahoo.com', 'title' => 'Yahoo!'),
 array('domain' => 'www.thegreenpages.com', 'title' => 'Welcome to The Green Pages.'),
 array('domain' => 'www.experts-exchange.com', 'title' => 'Experts Exchange - The #1 resource on the web for solving technology problems.'),
);

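foreach ($sites as $idx => $site) {
 // Strip the leading "www." and a trailing top-level domain.
 $domain = preg_replace('/^www\./i', '', $site['domain']);
 $domain = preg_replace('/\.(com|net|org|info|us)$/i', '', $domain);

 // Allow optional whitespace after every character and make
 // non-letters (such as "-") optional.
 $expression = '/';
 for ($i = 0; $i < strlen($domain); $i++) {
  $char = $domain[$i];
  $expression .= preg_quote($char, '/') . (ctype_alpha($char) ? '' : '?');
  $expression .= '\s*';
 }
 $expression .= '/i';

 // Guard against titles that do not contain the domain at all.
 // Note: the trailing \s* can capture a space after the name.
 $sites[$idx]['name'] = preg_match($expression, $site['title'], $matches) ? $matches[0] : null;
}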

If you print_r($sites) you'll get:

Array
(
    [0] => Array
        (
            [domain] => www.yahoo.com
            [title] => Yahoo!
            [name] => Yahoo
        )

    [1] => Array
        (
            [domain] => www.thegreenpages.com
            [title] => Welcome to The Green Pages.
            [name] => The Green Pages
        )

    [2] => Array
        (
            [domain] => www.experts-exchange.com
            [title] => Experts Exchange - The #1 resource on the web for solving technology problems.
            [name] => Experts Exchange 
        )
)

No matter what, you'll have to tweak your script until you get it right, but this is a place to start.

inxilpro 2009-04-23 14:19:31

Answer 6

+1 A:

I see this as at least a three-step process.

1. Remove punctuation from both the title and the URL.
2. Split the URL into words, if necessary.
3. Use the URL to find the correct case by comparing it to the title.

'www.thegreenpages.com'    'Welcome to The Green Pages.'  'The Green Pages'
    'thegreenpages'                                       # remove punctuation
   'the green pages'    <= 'Welcome to The Green Pages'   # split url (if necessary)
                        =>            'The Green Pages'   # result of search

'www.experts-exchange.com'    'Experts Exchange - The #1 res ...'  'Experts Exchange'
    'experts exchange'        'Experts Exchange   The  1 res    '  # remove punctuation
#   'experts exchange'     <= 'Experts Exchange   The  1 res    '  # split url
                           => 'Experts Exchange'                   # result of search

'www.yahoo.com'    'Yahoo!'  'Yahoo!'
    'yahoo'        'Yahoo'   # remove punctuation
#   'yahoo'     <= 'Yahoo'   # split url (if necessary)
                => 'Yahoo'   # result of search
# whoops left off the exclamation point
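
The three steps above could be sketched in PHP roughly as follows. The find_name() helper is made up for this example, and step 2 is simplified: instead of really splitting the URL into words, it allows optional whitespace between every pair of characters.

```php
<?php
// Hypothetical helper following the three steps described above.
function find_name($domain, $title) {
    // Step 1: drop "www." and the last domain part, then remove
    // punctuation from both the URL and the title.
    $label = preg_replace('/^www\.|\.[a-z]+$/i', '', $domain);
    $label = preg_replace('/[^a-z0-9]/i', '', $label);
    $cleanTitle = preg_replace('/[^a-z0-9\s]/i', ' ', $title);

    // Step 2 (simplified): allow optional whitespace between characters
    // rather than splitting the URL into words.
    $pattern = '/' . implode('\s*', str_split($label)) . '/i';

    // Step 3: the matched slice of the title carries the correct case.
    if (preg_match($pattern, $cleanTitle, $match)) {
        return $match[0];
    }
    return null;
}

echo find_name('www.thegreenpages.com', 'Welcome to The Green Pages.'), "\n";
echo find_name('www.experts-exchange.com', 'Experts Exchange - The #1 resource on the web for solving technology problems.'), "\n";
echo find_name('www.yahoo.com', 'Yahoo!'), "\n";
```

As in the walkthrough, the exclamation point in "Yahoo!" is lost because punctuation is removed before matching.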

Brad Gilbert 2009-04-23 16:23:41
