Hi, I wonder if anyone can construct a regular expression that detects whether a person searches for something like "site:cnn.com" or "site:www.globe.com.ph/". I've been having the most difficult time figuring it out. Thanks a lot in advance!

Edit: Sorry forgot to mention my script is in PHP.

A: 

What are you matching against? A referer url?

Assuming you're matching against a referer url that looks like this:

http://www.google.com/search?client=safari&rls=en-us&q=whatever+site:foo.com&ie=UTF-8&oe=UTF-8

A regex like this should do the trick:

\bsite(?:\:|%3[aA])(?:(?!(?:%20|\+|&|$)).)+

Notes:

  • The colon after 'site' can be either unencoded or percent-encoded (%3A). Most user agents leave it unencoded (which I believe is actually contrary to the standard), but this handles both.
  • I assumed the site:... value is right-bounded by the equivalent of a space character, the end of the field (&), or the end of the string ($).
  • I didn't assume either x-www-form-urlencoded encoding (space == '+') or percent encoding (space == %20) for spaces. This handles both.
  • The (?:...) is a non-capturing group. (?!...) is a negative lookahead.
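
In PHP this could be applied roughly as follows. This is only a sketch: the referer string is the example URL from above, and the extra pair of parentheses around the last part of the pattern is an addition so that the site value ends up in $m[1].

```php
<?php
// Example referer URL from above (in a real script this might come from
// $_SERVER['HTTP_REFERER'], which is an assumption here).
$referer = 'http://www.google.com/search?client=safari&rls=en-us&q=whatever+site:foo.com&ie=UTF-8&oe=UTF-8';

// Same pattern as above, with one added capture group around the value.
$pattern = '/\bsite(?:\:|%3[aA])((?:(?!(?:%20|\+|&|$)).)+)/';

if (preg_match($pattern, $referer, $m)) {
    // $m[1] holds everything after "site:" up to the next space, "&" or end.
    echo $m[1], "\n";
}
```

With the example URL, $m[1] is "foo.com".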
ʞɔıu
A: 

Hi Nick, no, it's not for a referrer URL. My PHP script basically spits out information about a domain (e.g. backlinks, pagerank etc.), and I need that regex so it will know what the user is searching for. If the user enters something that doesn't match the regex, it does a regular web search instead.

You need to give us examples of exactly what you're trying to match against.
ʞɔıu
Oh, it was in my first message. I want the regex to match something like: site:cnn.com or site:globe.com.ph or site:http://bbc.co.uk/ If these are matched, the script displays the domain information.
You're saying what you want to match, but what text do you want to match against? Arbitrary user input into a text field? Something else?
ʞɔıu
Yes, it will be entered by a user into a text field. Sorry for not mentioning that earlier.
I mean yes, arbitrary user input into a text field.
A: 

If this is all you are trying to do, I guess I'd take the simpler approach and just do:

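$entry = isset($_REQUEST['q']) ? trim($_REQUEST['q']) : '';
// explode() replaces split(), which was removed in PHP 7. The limit of 2
// keeps later colons (e.g. in "site:http://bbc.co.uk/") inside the value.
$tokens = explode(':', $entry, 2);
if (count($tokens) > 1 && strtolower($tokens[0]) === 'site') {
  $site = $tokens[1];
}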
Scott Evernden
+1  A: 

Ok, for input into an arbitrary text field, something as simple as the following will work:

\bsite:(\S+)

where the parentheses will capture whatever site/domain they're trying to search. It won't verify that the value is valid; validating URLs/domains is complex, and there are many easily googlable regexes for doing that. For instance, there's one here.
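
A minimal sketch of how that pattern might be wired into a PHP script. The function name extractSiteOperand and the sample queries are made up for illustration, and the /i flag (case-insensitive) is an addition so that "Site:" matches too.

```php
<?php
// Hypothetical helper: returns the text after "site:", or null when the
// query contains no site: operator.
function extractSiteOperand($query)
{
    if (preg_match('/\bsite:(\S+)/i', $query, $m)) {
        return $m[1]; // first capture group: everything up to the next whitespace
    }
    return null;
}

var_dump(extractSiteOperand('site:cnn.com'));           // string "cnn.com"
var_dump(extractSiteOperand('site:http://bbc.co.uk/')); // string "http://bbc.co.uk/"
var_dump(extractSiteOperand('cheap flights'));          // NULL
```

When the helper returns null, the script can fall through to the regular web search.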

ʞɔıu
Hi Nick, I checked the link that you gave me. I'm kinda confused. Under "Matching a URL", he offers a pattern but I dunno how I can use that in your sample ( \bsite:(\S+) ). Sorry for the trouble.
I would recommend doing that in a second stage to reduce the complexity. Use the simple regex to grab the potential value, then run something else to validate that value. You don't have to use that complex regex, just find anything you like here: http://www.google.com/search?q=php+validate+url
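A rough sketch of that two-stage idea, assuming PHP 7 or later (for FILTER_VALIDATE_DOMAIN). siteFromQuery is a hypothetical name; prepending http:// when no scheme is given and requiring a dot in the host are assumptions of this sketch, not rules taken from any particular validation regex.

```php
<?php
// Stage 1: grab the candidate with the simple regex.
// Stage 2: reduce it to a host name and validate that separately.
function siteFromQuery($query)
{
    if (!preg_match('/\bsite:(\S+)/i', $query, $m)) {
        return null; // no site: operator, so the caller does a normal web search
    }
    $candidate = $m[1];

    // parse_url() only reports a host when a scheme is present, so add one if missing.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $candidate)) {
        $candidate = 'http://' . $candidate;
    }
    $host = parse_url($candidate, PHP_URL_HOST);

    // Assumption: a usable domain has at least one dot.
    if (!is_string($host) || strpos($host, '.') === false) {
        return null;
    }

    // FILTER_FLAG_HOSTNAME restricts labels to letters, digits and hyphens.
    $valid = filter_var($host, FILTER_VALIDATE_DOMAIN, FILTER_FLAG_HOSTNAME);
    return $valid === false ? null : strtolower($valid);
}
```

With this, "site:cnn.com", "site:www.globe.com.ph/" and "site:http://bbc.co.uk/" all come back as a bare host name, and anything without a site: operator returns null.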
ʞɔıu
A: 

Thanks for your help Nick and Scott. I'll let you know how it goes :)