Hello, I have been reading about regular expressions for around a couple of hours and I still cannot get my head around this: I am trying to pull the anchor text from a link that is formatted this way:

<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>

I want only the anchor text for the link : "i_want_this"

"variable_text" varies according to the filename so I need to ignore that.

I am using this regex:

<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>

This of course matches the complete link.
I have not managed to figure out how to retrieve only "i_want_this" and put it in a variable.

Thanks for any help you can provide!

EDIT: Sorry for not being specific enough. I am using PHP

A: 

Do not use regex to parse HTML. Use a DOM parser. Specify the language you're using, too.

Since it's in a captured group and since you claim it's matching, you should be able to reference it through $1 or \1 depending on the language.

$blah = preg_match( $pattern, $subject, $matches );
print_r($matches);
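
To make the group numbering concrete, here is a minimal sketch. It is written in Python's re module rather than PHP (my choice, only so it runs standalone); PHP's $matches[0] and $matches[1] correspond to group(0) and group(1) here. The [^"]+ for the varying filename is an assumption about the URL format, not something from the question.

```python
import re

subject = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>'

# [^"]+ accepts whatever filename appears in the URL (assumed format).
pattern = r'<a href="/en/browse/file/[^"]+">(.*?)</a>'

# re.search returns None when nothing matches; this subject does match.
match = re.search(pattern, subject)
whole_match = match.group(0)   # the complete link, like $matches[0]
anchor_text = match.group(1)   # only the parenthesised part, like $matches[1]
print(anchor_text)
```

Group 0 is always the whole match, which is why the original pattern appeared to "match the complete link".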
meder
This is one scenario where I think that the question as posed is appropriate for a regex
JSBangs
Sorry for not being specific enough. I am using PHP. I looked at using a DOM parser, but it seemed overkill for what looked like a simple task.
RafaelM
A: 

The thing to remember is that a regex returns everything you searched for if it matches. You need to specify that you only care about the part you've surrounded in parentheses (the anchor text). I'm not sure what language you're using the regex in, but here's an example in Ruby:

string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)
puts data # => outputs '<a href="/en/browse/file/variable_text">i_want_this</a>'

If you put what you want in parentheses, you can reference it:

string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)[1]
puts data # => outputs 'i_want_this'

Perl will have you use $1 instead of [1] like this:

$string = '<a href="/en/browse/file/variable_text">i_want_this</a>';
$string =~ m/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/;
$data = $1;
print $data . "\n";

Hope that helps.

jgnagy
Hello jgnagy, thanks for your reply. I'm using PHP. The problem with your example is that "variable_text" varies. So, variable_text could be "123456" or it could be "654321". I need to ignore it, so it can extract only "i_want_this".
RafaelM
A: 

I'm not 100% sure if I understand what you want. This will match the content between the anchor tags. The URL must start with /en/browse/file/, but may end with anything.

#<a href="/en/browse/file/.+?">(.*?)</a>#

I used # as a delimiter as it made it clearer. It'll also help if you put them in single quotes instead of double quotes so you don't have to escape anything at all.

If you want to limit to numbers instead, you can use:

#<a href="/en/browse/file/[0-9]+">(.*?)</a>#

If it should have just 5 numbers:

#<a href="/en/browse/file/[0-9]{5}">(.*?)</a>#

If it should have between 3 and 6 numbers:

#<a href="/en/browse/file/[0-9]{3,6}">(.*?)</a>#

If it should have 2 or more numbers:

#<a href="/en/browse/file/[0-9]{2,}">(.*?)</a>#
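
A quick way to check these variants against a sample link is sketched below, in Python's re module rather than PHP (the pattern syntax for these quantifiers is the same in both). The helper name anchor_text and the sample filename 123456 are made up for illustration.

```python
import re

def anchor_text(html, filename_pattern):
    """Return the anchor text if the filename part matches, else None (hypothetical helper)."""
    regex = r'<a href="/en/browse/file/' + filename_pattern + r'">(.*?)</a>'
    match = re.search(regex, html)
    return match.group(1) if match else None

link = '<a href="/en/browse/file/123456">i_want_this</a>'

print(anchor_text(link, r'.+?'))         # any filename
print(anchor_text(link, r'[0-9]+'))      # digits only
print(anchor_text(link, r'[0-9]{5}'))    # exactly 5 digits: no match, the sample has 6
print(anchor_text(link, r'[0-9]{3,6}'))  # between 3 and 6 digits
print(anchor_text(link, r'[0-9]{2,}'))   # 2 or more digits
```

Only the {5} variant rejects this sample, because the closing quote must come right after exactly five digits.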
AlReece45
+1  A: 

PHP's preg functions use PCRE (Perl Compatible Regular Expressions), which is pretty close to Perl's own regex syntax. If you want to know a lot about regex, visit perlretut.org. Also, look into Regex generators like Expresso.

For your use, know that regex quantifiers are greedy by default. That means that when you specify that you want something, followed by anything (any repetitions), followed by something, the "anything" part grabs as much as it can, so the match runs to the last place that second something occurs, not the first.
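
To see how far a greedy .* runs, here is a small sketch with two links on one line. It uses Python's re module rather than PHP (PCRE treats .* and .*? the same way), and the sample string is made up for illustration.

```python
import re

html = '<a href="/a">first</a> and <a href="/b">second</a>'

# Greedy: the first .* runs on to the LAST "> in the line.
greedy = re.search(r'<a href=".*">(.*)</a>', html).group(1)

# Lazy: .*? stops at the FIRST "> and the first </a>.
lazy = re.search(r'<a href=".*?">(.*?)</a>', html).group(1)

print(greedy)
print(lazy)
```

With a single link in the subject both forms give the same answer, which is why the difference is easy to miss.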

To be clearer, what you want is this:

  1. <a href="
  2. any character, any number of times (regex = .* )
  3. ">
  4. any character, any number of times (regex = .* )
  5. </a>

Beyond that, you want to capture the second "any character, any number of times". You can do that with capture groups: anything inside parentheses is captured as a group that you can reference later (referring back to a group from inside the pattern itself is called a back reference).

I would also look into named subpatterns - with those, you can reference your choice with a human readable string rather than an array index. The syntax for those in PHP is (?P<name>pattern) where name is the name you want and pattern is the actual regex. I'll use that below.

So all that being said, here's the "lazy web" for your regex:

<?php
$str = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
$regex = '/(<a href\=".*">)(?P<target>.*)(<\/a>)/';
preg_match($regex, $str, $matches);

print $matches['target'];
?>

//This should output "i_want_this"

Oh, and one final thought. Depending on what you are doing exactly, you may want to look into SimpleXML instead of using regex for this. This would probably require that the tags that we see are just snippets of a larger whole, as SimpleXML requires well-formed XML (or XHTML).
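
For comparison, here is a rough sketch of the parser route. It uses Python's xml.etree.ElementTree as a stand-in for SimpleXML (my substitution, not PHP code), and like SimpleXML it only works when the snippet is well-formed.

```python
import xml.etree.ElementTree as ET

snippet = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>'

# fromstring raises ParseError if the markup is not well-formed XML.
root = ET.fromstring(snippet)

# Collect the text of every <a> whose href starts with the expected prefix.
anchor_texts = [
    a.text
    for a in root.iter('a')
    if a.get('href', '').startswith('/en/browse/file/')
]
print(anchor_texts)
```

The filename part of the URL never has to be described at all here, since only the prefix is checked.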

Tim
Thanks Tim, that worked great! It was only missing an escape in the </a> tag: <\/a>. Now I just need to understand what is going on in the regex hehe. I'm off to read perlretut.org.
RafaelM
Glad to help. I fixed the typo in my answer. Regex is like any good game: easy to begin, incredibly hard to master and I'm nowhere near that. If you take it in small atomic sized pieces, the regex above is easy to understand. Like I broke it down into steps, just be certain about what you want out of the regex. Just did a quick google for regex cheat sheet and found a good one: http://www.addedbytes.com/download/regular-expressions-cheat-sheet-v2/png/ And don't forget this gem: http://xkcd.com/208/
Tim
I was just looking at simplehtmldom, or BeautifulSoup for doing this in python. Thanks again! Love xkcd
RafaelM
+1  A: 

I'm sure someone will have a more elegant solution, but I think this will do what you want done.

Where:

$subject = "<h3><b>File</b> : <a href=\"/en/browse/file/variable_text\">i_want_this</a></h3>";

Option 1:

$pattern1 = '/(<a href=")(.*)(">)(.*)(<\/a>)/i';
preg_match($pattern1, $subject, $matches1);
print($matches1[4]);

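Option 2 (uses ereg, which is deprecated as of PHP 5.3 and removed in PHP 7, so prefer Option 1 on current PHP):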

$pattern2 = '(<a href=")(.*)(">)(.*)(</a>)';
ereg($pattern2, $subject, $matches2);
print($matches2[4]);
jmcdowell
Both options worked great! Thanks. Pity I don't have enough rep to upvote your answer.
RafaelM
A: 

This should work:

<a href="[^"]*">([^<]*)

This says: take EVERYTHING you find until you meet "

[^"]*

Same here! Take everything with you until you meet <

[^<]*

The parentheses around [^<]*

([^<]*)

group it, so you can collect that data in PHP! If you look in the PHP manual on preg_match you will see many fine examples there!

Good luck!

And for your concrete example:

<a href="/en/browse/file/variable_text">([^<]*)

I use

[^<]* 

because in some examples...

.*? 

can be extremely slow! You shouldn't use that if you can use

[^<]*
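
Here is a small sketch comparing the two forms on the sample from the question. It uses Python's re module rather than PHP (both patterns read the same in PCRE). It only checks the results, not the speed, and the nested-tag sample is made up for illustration.

```python
import re

subject = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>'

# Negated character classes: stop at the first " and the first <.
negated = re.search(r'<a href="[^"]*">([^<]*)', subject).group(1)

# Lazy quantifiers: stop at the first "> and the first </a>.
lazy = re.search(r'<a href=".*?">(.*?)</a>', subject).group(1)

# Caveat: [^<]* stops at ANY tag, so markup inside the link cuts the capture short.
nested = re.search(r'<a href="[^"]*">([^<]*)', '<a href="/x"><b>bold</b> text</a>').group(1)

print(negated, lazy, repr(nested))
```

For plain-text anchors like the one in the question the two patterns capture the same thing.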
slowkvant
A: 

You should use the tool Expresso for creating regular expressions... Pretty handy.. http://www.ultrapico.com/Expresso.htm

nitroxn