views:

50

answers:

3

I'm trying to get this to work with Perl's regex but can't seem to figure it out. I want to grab any URL that has ".website." in it, except ones like this (with "en" preceding ".website."):

   $linkhtml =  'http://en.search.website.com/?q=beach&' ;

This is an example of a URL that I would want the regex to match, while the one above should be rejected:

   $linkhtml =  ' http://exsample.website.com/?q=beach&' ;

Here is my attempt at it. Any advice on what I'm doing wrong is appreciated.

   $re2='(?<!en)'; # Negative lookbehind: not preceded by "en"
   $re4='(.*)'; # Any number of characters
   $re6='(\.)'; # Literal period
   $re7='(website)'; # The word "website"
   $re8='(\.)'; # Literal period
   $re9='(.*)'; # Any number of characters

   $re=$re4.$re2.$re6.$re7.$re8.$re9;

   if ($linkhtml =~ /$re/)
+1  A: 

I'd just do it in two steps: first use a generic regular expression to check for any URL (or rather, anything that looks like a URL). Then check each result that matches against another regex that looks for "en" occurring in the host before "website", and discard anything that matches.

David Zaslavsky
Actually I already know it's a URL, as I am using mechanize to extract all links, so the $re3='(http)'; part is unnecessary. I'm having trouble with the part I described in the initial post: the negative match on "en".
Rick
+1  A: 

Negative lookbehind assertions don't work well if the content you are trying to match after the assertion is so general that it would match the assertion itself. Consider:

perl -wle'print "en.website" =~ qr/(?<!en\.)web/'        # doesn't match
perl -wle'print "en.website" =~ qr/(?<!en\.)[a-z]/'      # does match, because [a-z] is matching the 'en'
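
For the URLs in the question, a related effect is worth checking: a lookbehind such as `(?<!en)` only inspects the characters immediately before the position where it sits. The following is a minimal sketch, assuming the lookbehind is placed directly in front of `\.website\.` as in the asker's attempt; the third URL is a made-up example added for contrast.

```perl
use strict;
use warnings;

# The lookbehind only looks at the two characters right before ".website.".
my $re = qr/(?<!en)\.website\./;

my @urls = (
    'http://en.search.website.com/?q=beach&',   # "ch" precedes ".website."
    'http://exsample.website.com/?q=beach&',    # "le" precedes ".website."
    'http://en.website.com/',                   # hypothetical: "en" precedes ".website."
);

for my $url (@urls) {
    printf "%-45s %s\n", $url, ( $url =~ $re ? 'matches' : 'no match' );
}
```

Under these assumptions the first two URLs match and only the third is rejected, which is why an "en" further to the left in the host slips through.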

The best thing to do here is what David suggested: use two patterns to screen out the good and bad values:

my @matches = grep {
     /$pattern1/ and not /$pattern2/
} @strings;

...where pattern1 matches all URLs, and pattern2 matches just the 'en' URLs.
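
As a rough illustration of the two-pattern approach, here is a minimal sketch. The concrete patterns are assumptions, not something given in the thread: `$pattern1` is taken to mean "contains `.website.`", and `$pattern2` is taken to mean "has `en.` starting at a word boundary somewhere before `website.`, with no slash in between". The third URL is a made-up non-matching example.

```perl
use strict;
use warnings;

# Assumed patterns for illustration; adapt them to the real site name.
my $pattern1 = qr{\.website\.};                        # anything containing ".website."
my $pattern2 = qr{\ben\.(?:[^/]*\.)?website\.};        # "en." before "website." within the host

my @strings = (
    'http://en.search.website.com/?q=beach&',
    ' http://exsample.website.com/?q=beach&',
    'http://example.com/',
);

# Keep what matches pattern1 but not pattern2.
my @matches = grep {
    /$pattern1/ and not /$pattern2/
} @strings;

print "$_\n" for @matches;
```

Keeping the two checks separate makes each pattern easy to read and to tighten independently, at the cost of scanning each string twice.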

Ether
I edited the original post to take out the http.. I don't need to match that as any string I input is already a link so it was superfluous.. I understand what you mean, though, so I will try to adapt this to my need as I need to discard any url that has the "en" in it and then match any remaining url that has "website" in it
Rick
@Rick: PS. [Regexp::Common::URI::http](http://search.cpan.org/perldoc?Regexp::Common::URI::http) might be of use to you too...
Ether
@Ether: Where do you see an overly generic expression after a lookbehind?
Larry Wang
A: 

Here's the final solution, in case anyone who is new to regex (as I am) comes across this in the future with a similar problem. In my case I wrapped this in a "for" loop so it would go through an array, but that just depends on the need.

First, let's filter out the URLs that have "en" in them, as these aren't URLs we want:

    $re1 = '(.*)';    # any number of characters
    $re2 = '(en)';    # the literal "en", anywhere in the string
    $re3 = '(.*)';    # any number of characters

    $re = $re1 . $re2 . $re3;
    if ($linkhtml =~ /$re/) {
        # do nothing, as we don't want a link with "en" in it
    }
    else {
        # find URLs with ".website."
        $re1 = '(.*)';        # any number of characters
        $re2 = '(\.)';        # period
        $re3 = '(website)';   # the word "website"
        $re4 = '(\.)';        # period
        $re5 = '(.*)';        # any number of characters

        $re = $re1 . $re2 . $re3 . $re4 . $re5;

        if ($linkhtml =~ /$re/) {
            # the link has ".website." in it, so do something with it, such as:
            print "$linkhtml\n";
        }
    }
Rick
You blocked my homepage, www.website.com/benspage.html :(
Larry Wang
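
One way to avoid rejecting URLs like that is to look for "en" only as a whole label in the host name, and only to the left of "website". The following is a minimal sketch under several assumptions: the helper name `wanted_link` is hypothetical, the host is taken to be everything between an optional scheme and the first `/`, `?`, or `#`, and ports or user info in the URL are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the host contains ".website." and no label
# to the left of "website" is exactly "en".
sub wanted_link {
    my ($url) = @_;

    # Grab the host: skip leading whitespace and an optional scheme.
    my ($host) = $url =~ m{^\s*(?:[a-z][a-z0-9+.-]*://)?([^/?#\s]+)}i;
    return 0 unless defined $host;

    my @labels = split /\./, lc $host;

    # Position of the first "website" label, if any.
    my ($idx) = grep { $labels[$_] eq 'website' } 0 .. $#labels;

    # ".website." needs at least one label before and one after.
    return 0 unless defined $idx && $idx > 0 && $idx < $#labels;

    # Reject when a whole label before "website" is "en".
    return 0 if grep { $_ eq 'en' } @labels[ 0 .. $idx - 1 ];

    return 1;
}

for my $url ('http://en.search.website.com/?q=beach&',
             ' http://exsample.website.com/?q=beach&',
             'www.website.com/benspage.html') {
    print "$url\n" if wanted_link($url);
}
```

Because the check never looks past the host, an "en" inside the path (as in "benspage.html") or inside a longer label no longer causes a rejection.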