This could be tricky, easy or impossible... I'm not sure

I have a list of domains and I'm trying to match them as closely as possible to the website name in the "title" tag.

For example...

Domain: www.yahoo.com 
Title: Yahoo!
Result: Yahoo!

Domain: www.thegreenpages.com 
Title: Welcome to The Green Pages.
Result: The Green Pages

Domain: www.experts-exchange.com
Title: Experts Exchange - The #1 resource on the web for solving technology problems.
Result: Experts Exchange

So you can see the problem here. I need to account for case, spaces, and any special characters allowed in domain names (such as hyphens). I also need to capture special characters that belong to the name, like the ! in Yahoo!, but not something like a period that merely ends a sentence, plus whatever else you can think of.

Make sense?

In PHP.

I truly, truly suck at these types of pattern matching problems :)

A: 

Unless you seriously confine the problem domain, I would say that this is impossible.

The title tag can contain any arbitrary string in any human language (symbols, foreign characters, "smart" stuff, you name it). How would a regex be smart enough to catch the relevant part? Can you even formally define the relevant part in your own words?

Regexes suck when applied to languages, and even much more complex systems tend to suck when applied to human languages.

Tomalak
+3  A: 

I'm not sure you'll ever come up with a pattern that will solve all the eventualities you can run into with a problem like this. A title tag could be totally random text that wouldn't match at all.

For instance, here's a random site I picked off a random Google search. The site domain is "plus2net.com", and the title is (obviously geared for SEO) "PHP HTML MySQL articles tutorials, free scripts and programming forum". How would you ever correlate those two things? Theoretically you could use something like the levenshtein() function to get a rough similarity score, but I think coming up with a regexp to solve this problem is the wrong approach.

I'd re-think the problem. What are you trying to accomplish? If you're just trying to correlate a list of domain names and title tags, couldn't you write a quick script to scrape the title tags from the list of domains you have and get the exact data?

zombat
Thanks, I've never heard of a levenshtein function... I'll check it out. Oh, I'm already scraping the domain names and title tags. I'm looking for something a bit deeper than that; however, I cannot divulge ;)
EddyR
I don’t think that the Levenshtein distance would help much, because it just describes the number of differences between two sequences. Zero differences would be ideal. But what if there is no perfect match? Add a threshold to take the next best match?
Gumbo
@Gumbo Exactly. You also need rules about what to do if more than one candidate has the same Levenshtein distance. Should "Yahood" or "Yahoo" match "Yahoo!" (both have a distance of 1)? Determining what the rules should be is a trial-and-error process based on the various inputs.
Chas. Owens
Agree with all. I didn't expect levenshtein to be an actual solution, I just used it as an example of a different approach to comparison heuristics. Definitely a tough problem.
zombat
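
To make the threshold idea from the comments above concrete, here is a minimal sketch rather than a real solution. It assumes the domain has already been reduced to a bare label such as "thegreenpages". The function name closestWordRun and the $maxDistance parameter are made up for this illustration; only levenshtein() and the other built-ins are actual PHP functions. Ties are resolved by keeping the first candidate found, which is just one of the possible rules.

```php
<?php
// Hypothetical helper: return the run of consecutive title words whose
// concatenation is closest to $label, or null if no run is within $maxDistance.
function closestWordRun($label, $title, $maxDistance = 1)
{
    $words = preg_split('/\s+/', trim($title), -1, PREG_SPLIT_NO_EMPTY);
    // Assumption: trailing sentence punctuation is never part of a site name.
    $words = array_map(function ($w) { return rtrim($w, '.,;:'); }, $words);

    $best = null;
    $bestDistance = $maxDistance + 1;
    $count = count($words);
    for ($start = 0; $start < $count; $start++) {
        $run = '';
        for ($end = $start; $end < $count; $end++) {
            $run .= $words[$end];
            // Stop extending once the run is too long to be within the threshold.
            if (strlen($run) > strlen($label) + $maxDistance) {
                break;
            }
            $distance = levenshtein(strtolower($label), strtolower($run));
            // Strict comparison: on a tie, the earlier candidate is kept.
            if ($distance < $bestDistance) {
                $bestDistance = $distance;
                $best = implode(' ', array_slice($words, $start, $end - $start + 1));
            }
        }
    }
    return $best;
}

var_dump(closestWordRun('yahoo', 'Yahoo!'));
var_dump(closestWordRun('thegreenpages', 'Welcome to The Green Pages.'));
```

With a threshold of 1 this keeps the "!" in "Yahoo!" and finds "The Green Pages", but an SEO-style title like the plus2net one simply yields null, which is the limitation discussed above.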
+1  A: 

You could build a regular expression based on the domain name such as:

t\s*h\s*e\s*g\s*r\s*e\s*e\s*n\s*p\s*a\s*g\s*e\s*s

This would match The Green Pages in case-insensitive mode.

Edit: Here’s an example of how you could build such a regular expression:

$data = array(
    array('yahoo', 'Yahoo!'),
    array('thegreenpages', 'Welcome to The Green Pages.'),
    array('experts-exchange', 'Experts Exchange - The #1 resource on the web for solving technology problems.')
);

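foreach ($data as $item) {
    // Drop hyphens, then split the domain label into single characters.
    $chars = str_split(str_replace('-', '', $item[0]));
    // Escape each character so it is always matched literally.
    $chars = array_map(function ($chr) { return preg_quote($chr, '/'); }, $chars);
    // Allow whitespace or hyphens between characters, plus an optional trailing "!".
    $pattern = '/'.implode('[\s-]*', $chars).'!?/i';
    // Only print a result when the title actually contains the name.
    if (preg_match($pattern, $item[1], $match)) {
        var_dump($match[0]);
    }
}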
Gumbo
A: 

Is your list of domains fixed? If so could you build regex for each domain?

Obviously, you can strip out the domain fairly simply, but as Tomalak says, unless the problem domain is much more restricted, this is actually quite a complex computational problem!

From a domain, you need to extract the individual words, for which you would need a reference dictionary (or one for each language), along with some kind of word matching, and perhaps some kind of voting among potential matches. Even so, without a more specific problem domain this isn't likely to be accurate.

It might be good to know more about what you are trying to achieve?

SiC
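
The dictionary idea in the answer above could look roughly like the following sketch. It assumes a tiny hand-written word list; a real one would be far larger and language-specific. The function name splitIntoWords is hypothetical. It tries the longest dictionary word first and backtracks when that choice leads to a dead end, which is only one possible way of picking among candidate splits.

```php
<?php
// Hypothetical helper: split $label into dictionary words, preferring longer
// words and backtracking on dead ends. Returns null if no full split exists.
function splitIntoWords($label, array $dictionary)
{
    if ($label === '') {
        return array();
    }
    for ($len = strlen($label); $len >= 1; $len--) {
        $head = substr($label, 0, $len);
        if (in_array($head, $dictionary, true)) {
            // Cast guards against substr() returning false on old PHP versions.
            $rest = splitIntoWords((string) substr($label, $len), $dictionary);
            if ($rest !== null) {
                return array_merge(array($head), $rest);
            }
        }
    }
    return null;
}

// Illustrative word list only.
$dictionary = array('the', 'green', 'pages', 'page', 'expert', 'experts', 'exchange', 'sex', 'change');

print_r(splitIntoWords('thegreenpages', $dictionary));
print_r(splitIntoWords('expertsexchange', $dictionary));
```

The second call shows why some kind of voting would be needed: "expertsexchange" has more than one valid split, and the longest-first rule happens to pick "experts exchange" here.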
+1  A: 

Try this code:

$sites = array(
 array('domain' => 'www.yahoo.com', 'title' => 'Yahoo!'),
 array('domain' => 'www.thegreenpages.com', 'title' => 'Welcome to The Green Pages.'),
 array('domain' => 'www.experts-exchange.com', 'title' => 'Experts Exchange - The #1 resource on the web for solving technology problems.'),
);

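foreach ($sites as $idx => $site) {
 // Strip a leading "www." and a trailing top-level domain.
 $domain = preg_replace('/^www\./i', '', $site['domain']);
 $domain = preg_replace('/\.(com|net|org|info|us)$/i', '', $domain);

 // Letters are required; any other character (such as "-") is optional.
 $parts = array();
 foreach (str_split($domain) as $char) {
  $parts[] = preg_quote($char, '/') . (ctype_alpha($char) ? '' : '?');
 }
 // Allow whitespace between characters, but not after the last one.
 $expression = '/' . implode('\s*', $parts) . '/i';

 // Store null when the title does not contain the domain name.
 $sites[$idx]['name'] = preg_match($expression, $site['title'], $matches) ? $matches[0] : null;
}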

If you print_r($sites) you'll get:

Array
(
    [0] => Array
        (
            [domain] => www.yahoo.com
            [title] => Yahoo!
            [name] => Yahoo
        )

    [1] => Array
        (
            [domain] => www.thegreenpages.com
            [title] => Welcome to The Green Pages.
            [name] => The Green Pages
        )

    [2] => Array
        (
            [domain] => www.experts-exchange.com
            [title] => Experts Exchange - The #1 resource on the web for solving technology problems.
            [name] => Experts Exchange 
        )
)

No matter what, you'll have to tweak your script until you get it right, but this is a place to start.

inxilpro
+1  A: 

I see this as at least a three-step process.

  • Remove punctuation from both the title and the URL.
  • Split the URL, if necessary.
  • Use the URL to find the correct case, by comparing to the title.

'www.thegreenpages.com'    'Welcome to The Green Pages.'  'The Green Pages'
    'thegreenpages'                                       # remove punctuation
   'the green pages'    <= 'Welcome to The Green Pages'   # split url (if necessary)
                        =>            'The Green Pages'   # result of search

'www.experts-exchange.com'    'Experts Exchange - The #1 res ...'  'Experts Exchange'
    'experts exchange'        'Experts Exchange   The  1 res    '  # remove punctuation
#   'experts exchange'     <= 'Experts Exchange   The  1 res    '  # split url
                           => 'Experts Exchange'                   # result of search

'www.yahoo.com'    'Yahoo!'  'Yahoo!'
    'yahoo'        'Yahoo'   # remove punctuation
#   'yahoo'     <= 'Yahoo'   # split url (if necessary)
                => 'Yahoo'   # result of search
# whoops left off the exclamation point
Brad Gilbert
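
The three steps above can be sketched in PHP roughly as follows. This is one possible reading of them, not a definitive implementation. The function name nameFromTitle is made up, the domain is assumed to have a single suffix such as ".com", and only ASCII letters and digits are treated as non-punctuation, so titles in other scripts would need a different character class.

```php
<?php
// Hypothetical helper following the three steps: strip punctuation, let the
// label be split by spaces, and read the correctly cased name from the title.
function nameFromTitle($domain, $title)
{
    // Step 1 (URL): drop "www.", the final ".suffix", and remaining punctuation.
    $label = preg_replace('/^www\.|\.[a-z]+$/i', '', $domain);
    $label = preg_replace('/[^a-z0-9]/i', '', $label);
    if ($label === '') {
        return null;
    }

    // Step 1 (title): replace punctuation with spaces.
    $cleanTitle = preg_replace('/[^a-z0-9\s]/i', ' ', $title);

    // Steps 2 and 3: allow whitespace wherever the title splits the label
    // into words, and match case-insensitively to recover the title's casing.
    $pattern = '/' . implode('\s*', str_split($label)) . '/i';
    return preg_match($pattern, $cleanTitle, $m) ? $m[0] : null;
}

var_dump(nameFromTitle('www.thegreenpages.com', 'Welcome to The Green Pages.'));
var_dump(nameFromTitle('www.experts-exchange.com', 'Experts Exchange - The #1 resource on the web for solving technology problems.'));
var_dump(nameFromTitle('www.yahoo.com', 'Yahoo!'));
```

As in the worked example, the last call returns "Yahoo" without the exclamation point, because all punctuation is removed before the search.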