I am parsing an HTML page. Let's say the page lists all players on a football team, and those who are seniors are bolded. I can't parse the file line by line and look for the strong tag, because in my real example the pattern is much more complex and spans multiple lines.

Something like this:

<strong>Senior:</strong> John Smith
Junior: Joe Smith
<strong>Senior:</strong> Mike Johnson

and so on. How do I write a Perl regex to get the names of all seniors?

Thanks

+6  A: 

The reason you're having difficulty writing a regex to do this is that it's the wrong tool for the job. You should use a real HTML parser like HTML::Parser, HTML::TokeParser, or HTML::TreeBuilder.
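
As a rough illustration of the parser approach, here is a minimal sketch using HTML::TokeParser. The `<p>` and `<br>` markup in `$html` is an assumption added for the example (the question's sample has no tags separating the entries), so the real page will likely need different handling.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical markup: the <p> and <br> tags are assumed for illustration.
my $html = <<'HTML';
<p><strong>Senior:</strong> John Smith<br>
Junior: Joe Smith<br>
<strong>Senior:</strong> Mike Johnson</p>
HTML

my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my @seniors;
while ($parser->get_tag('strong')) {
    # Text inside <strong>...</strong>, with whitespace trimmed.
    my $label = $parser->get_trimmed_text('/strong');
    next unless $label eq 'Senior:';

    # Skip past the closing tag, then take the text up to the next tag.
    $parser->get_tag('/strong');
    my $name = $parser->get_trimmed_text;
    push @seniors, $name if length $name;
}

print "$_\n" for @seniors;
```

Because the parser works on tokens rather than lines, it does not matter whether the label and the name sit on the same line.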

I can't give a specific example because I doubt that's exactly what your HTML looks like. Your sample appears to be missing some punctuation or additional tags.

cjm
+3  A: 

You don't have to parse a file line by line -- you can read in the entire file at once, if it's small, or you can parse it paragraph by paragraph, using whatever separator you like.

The two things you need are: 1. set the input record separator, $/ (see perldoc perlvar), to something other than a newline, and 2. let your regular expression match across lines with the /s modifier, which makes . match newlines as well (see perldoc perlre).

Alternatively, you should use an HTML parser, which is what you would have to do if you are attempting to find things like nested tags.
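
The slurp-and-match approach could look something like the sketch below. The sample text and the pattern are assumptions made up for the example, and an in-memory filehandle stands in for the real file so the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical input in which the bolded label spans a line break.
my $html = "<strong>Senior:\n</strong> John Smith\n"
         . "Junior: Joe Smith\n"
         . "<strong>Senior:</strong>\nMike Johnson\n";

# In-memory filehandle used here in place of a real file.
open my $fh, '<', \$html or die "Cannot open in-memory file: $!";

# Undefining $/ locally makes <$fh> return the whole file at once.
my $contents = do { local $/; <$fh> };
close $fh;

my @seniors;
# /s lets .*? cross the newline inside the <strong> element;
# the name is taken as everything up to the next tag or newline.
while ($contents =~ m{<strong>\s*Senior:.*?</strong>\s*([^<\n]+)}gs) {
    push @seniors, $1;
}

print "$_\n" for @seniors;
```

Setting $/ to "" instead of leaving it undefined would read paragraph by paragraph, which may be preferable for large files.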

Ether
+1  A: 

You have to provide a specific example.

Perl regular expressions can occasionally be used for HTML parsing, but only when you know exactly what the page looks like and it is not too complex.

If you don't know exactly, or it is too complex, use one of the parsers that cjm links to.

Karel Bílek
A: 

It's not clear from your example how the end of the senior name is going to be determined, but something like this should get you started:

my @seniors = $filecontents =~ m!<strong>Senior:</strong>\s*([^<]+)!g;
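
On the question's sample as posted, nothing but a newline separates one entry from the next, so `[^<]+` would run on into the following "Junior:" line. A possible variant, assuming each name ends at the end of its line, is sketched below:

```perl
use strict;
use warnings;

# The sample from the question, held in a string for illustration.
my $filecontents = <<'HTML';
<strong>Senior:</strong> John Smith
Junior: Joe Smith
<strong>Senior:</strong> Mike Johnson
HTML

# Assumption: a name stops at the next tag or at the end of the line.
my @seniors = $filecontents =~ m!<strong>Senior:</strong>\s*([^<\n]+)!g;

# Drop any trailing whitespace left on a captured name.
s/\s+\z// for @seniors;

print "$_\n" for @seniors;
```

In list context, a match with /g returns every capture at once, which is why a single assignment fills the array.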
ysth