ansaurus

Question

Answer 1

A:

Do not use regex to parse HTML. Use a DOM parser. Specify the language you're using, too.

Since it's in a captured group and since you claim it's matching, you should be able to reference it through $1 or \1 depending on the language.

$blah = preg_match( $pattern, $subject, $matches );
print_r($matches);
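For comparison, here is a minimal sketch of the DOM-parser route using PHP's bundled DOMDocument and DOMXPath classes. The sample markup and the numeric href tail are assumptions modelled on the question, not the asker's real page.

```php
<?php
// Hypothetical input, modelled on the markup in the question.
$html = '<h3><b>File</b> : <a href="/en/browse/file/123456">i_want_this</a></h3>';

// Collect parser warnings internally instead of printing them for sloppy HTML.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Wrap the fragment so the parser sees a complete document.
$doc->loadHTML('<html><body>' . $html . '</body></html>');

$xpath = new DOMXPath($doc);
// Select every <a> whose href begins with the fixed prefix, whatever follows it.
$links = $xpath->query('//a[starts-with(@href, "/en/browse/file/")]');

$texts = array();
foreach ($links as $link) {
    $texts[] = $link->textContent;
}

print_r($texts);
```

The XPath test only looks at the start of the href, so the variable part of the URL never has to be described at all.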

meder 2010-07-14 03:04:31

This is one scenario where I think the question as posed is appropriate for a regex.

JSBangs 2010-07-14 03:06:41

Sorry for not being specific enough. I am using PHP. I looked at using a DOM parser, but it seemed like overkill for what looked like a simple task.

RafaelM 2010-07-14 03:11:24

Answer 2

A:

The thing to remember is that a regex returns everything you searched for if it matches. You need to specify that you only care about the part you've surrounded in parentheses (the anchor text). I'm not sure what language you're using the regex in, but here's an example in Ruby:

string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)
puts data # => outputs '<a href="/en/browse/file/variable_text">i_want_this</a>'

If you put what you want in parentheses, you can reference it:

string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)[1]
puts data # => outputs 'i_want_this'

Perl will have you use $1 instead of [1] like this:

$string = '<a href="/en/browse/file/variable_text">i_want_this</a>';
$string =~ m/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/;
$data = $1;
print $data . "\n";

Hope that helps.

jgnagy 2010-07-14 03:19:22

Hello jgnagy, thanks for your reply. I'm using PHP. The problem with your example is that "variable_text" varies. So, variable_text could be "123456" or it could be "654321". I need to ignore it, so the regex extracts only "i_want_this".

RafaelM 2010-07-14 03:28:04

Answer 3

A:

I'm not 100% sure if I understand what you want. This will match the content between the anchor tags. The URL must start with /en/browse/file/, but may end with anything.

#<a href="/en/browse/file/.+?">(.*?)</a>#

I used # as the delimiter because it keeps the pattern clearer: the slashes in the URL no longer need escaping. It also helps to put the pattern in single quotes instead of double quotes, so the double quotes inside it don't need escaping either.

If you want to limit the variable part to digits instead, you can use:

#<a href="/en/browse/file/[0-9]+">(.*?)</a>#

If it should have exactly 5 digits:

#<a href="/en/browse/file/[0-9]{5}">(.*?)</a>#

If it should have between 3 and 6 digits:

#<a href="/en/browse/file/[0-9]{3,6}">(.*?)</a>#

If it should have at least 2 digits:

#<a href="/en/browse/file/[0-9]{2,}">(.*?)</a>#
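As a rough sketch of how one of these patterns plugs into preg_match, the snippet below runs the digits-only variant over a few made-up links. The sample strings and variable names are assumptions for illustration only.

```php
<?php
// Hypothetical sample inputs; only the last one has a different path.
$subjects = array(
    '<a href="/en/browse/file/123456">first_name</a>',
    '<a href="/en/browse/file/654321">second_name</a>',
    '<a href="/en/other/123456">not_wanted</a>',
);

// # as the delimiter and single quotes, so neither / nor " needs escaping.
$pattern = '#<a href="/en/browse/file/[0-9]+">(.*?)</a>#';

$found = array();
foreach ($subjects as $subject) {
    // preg_match returns 1 on a match; $m[1] holds the first capture group.
    if (preg_match($pattern, $subject, $m) === 1) {
        $found[] = $m[1];
    }
}

print_r($found);
```

The third link is skipped because its path does not start with /en/browse/file/.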

AlReece45 2010-07-14 03:33:32

Answer 4

+1 A:

PHP's preg_* functions use PCRE (Perl Compatible Regular Expressions), which is very close to Perl's own regex syntax. If you want to know a lot about regex, visit perlretut.org. Also, look into regex builders like Expresso.

For your use, know that regex quantifiers are greedy by default. That means that when you specify that you want something, followed by anything (any repetitions), followed by something else, the "anything" part matches as much as it can: it keeps going to the last place where that second something occurs, not the first.

To be more clear, what you want is this:

<a href="
any character, any number of times (regex = .* )
">
any character, any number of times (regex = .* )
</a>

Beyond that, you want to capture the second "any character, any number of times". You can do that using what are called capture groups: anything inside parentheses is captured as a group that you can reference later (referring back to a group is called a back reference).

I would also look into named subpatterns. With those, you can reference your choice with a human-readable string rather than an array index. The syntax for those in PHP is (?P<name>pattern), where name is the name you want and pattern is the actual regex. I'll use that below.

So all that being said, here's the "lazy web" for your regex:

<?php
$str = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
$regex = '/(<a href\=".*">)(?P<target>.*)(<\/a>)/';
preg_match($regex, $str, $matches);

print $matches['target'];
?>

//This should output "i_want_this"
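The greediness described above starts to matter once a string holds more than one link. The sketch below compares a greedy and a lazy (.*?) version of the pattern on a made-up two-link string. The input and the variable names are assumptions, not part of the original question.

```php
<?php
// Hypothetical input with two links on one line.
$str = '<a href="/a">one</a> and <a href="/b">two</a>';

// Greedy: each .* grabs as much as it can and only gives back what it must.
preg_match('/<a href=".*">(?P<target>.*)<\/a>/', $str, $greedy);

// Lazy: each .*? stops at the first place the rest of the pattern can match.
preg_match('/<a href=".*?">(?P<target>.*?)<\/a>/', $str, $lazy);

print $greedy['target'] . "\n"; // two
print $lazy['target'] . "\n";   // one
```

The greedy version swallows the whole first link into the href part, so its target group ends up holding the text of the last link instead of the first.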

Oh, and one final thought. Depending on what you are doing exactly, you may want to look into SimpleXML instead of using regex for this. This would probably require that the tags we see are just snippets of a larger whole, as SimpleXML requires well-formed XML (or XHTML).

Tim 2010-07-14 03:42:02

Thanks Tim, that worked great! It was only missing an escape in the </a> tag: <\/a>. Now I just need to understand what is going on in the regex, hehe. I'm off to read perlretut.org.

RafaelM 2010-07-14 03:51:27

Glad to help. I fixed the typo in my answer. Regex is like any good game: easy to begin, incredibly hard to master, and I'm nowhere near that. If you take it in small, atomic pieces, the regex above is easy to understand. As when I broke it down into steps, just be certain about what you want out of the regex. I just did a quick Google search for a regex cheat sheet and found a good one: http://www.addedbytes.com/download/regular-expressions-cheat-sheet-v2/png/ And don't forget this gem: http://xkcd.com/208/

Tim 2010-07-14 04:00:40

I was just looking at simplehtmldom, or BeautifulSoup for doing this in Python. Thanks again! Love xkcd.

RafaelM 2010-07-14 04:03:40

Answer 5

+1 A:

I'm sure someone will have a more elegant solution, but I think this will do what you want done.

Where:

$subject = "<h3><b>File</b> : <a href=\"/en/browse/file/variable_text\">i_want_this</a></h3>";

Option 1:

$pattern1 = '/(<a href=")(.*)(">)(.*)(<\/a>)/i';
preg_match($pattern1, $subject, $matches1);
print($matches1[4]);

Option 2:

// Note: ereg() was deprecated in PHP 5.3 and removed in PHP 7; on current PHP, use Option 1.
$pattern2 = '(<a href=")(.*)(">)(.*)(</a>)';
ereg($pattern2, $subject, $matches2);
print($matches2[4]);

jmcdowell 2010-07-14 03:44:20

Both options worked great! Thanks. Pity I don't have enough rep to upvote your answer.

RafaelM 2010-07-14 03:54:12

Answer 6

A:

This should work:

<a href="[^"]*">([^<]*)

This says: take everything you find until you meet a "

[^"]*

Same idea: take everything until you meet a <

[^<]*

The parentheses around [^<]*

([^<]*)

group it, so you can collect that data in PHP. If you look up preg_match in the PHP manual, you will see many fine examples there.

Good luck!

And for your concrete example:

<a href="/en/browse/file/variable_text">([^<]*)

I use

[^<]*

because in some cases...

.*?

can be extremely slow! You shouldn't use that if you can use

[^<]*
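To show the negated character classes in use from PHP, here is a small sketch with preg_match_all that collects the text of every matching link. The two-link input string is a made-up assumption, and the href prefix is taken from the question.

```php
<?php
// Hypothetical input with two links; only the href tails differ.
$html = '<a href="/en/browse/file/123456">first</a> <a href="/en/browse/file/654321">second</a>';

// [^"]* cannot run past the closing quote and [^<]* cannot run past the next tag,
// so each match stays inside a single link without needing .*? at all.
$pattern = '#<a href="/en/browse/file/[^"]*">([^<]*)#';

// $matches[0] holds the full matches, $matches[1] the captured link texts.
preg_match_all($pattern, $html, $matches);

print_r($matches[1]);
```

Because each class excludes the character that ends it, the engine never has to back up over a neighbouring link to find where a match stops.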

slowkvant 2010-07-14 04:34:56

Answer 7

A:

You should use the tool Expresso for creating regular expressions. It's pretty handy: http://www.ultrapico.com/Expresso.htm

nitroxn 2010-07-14 04:39:47


Newbie to regular expression
