I have an HTML file and would like to extract the text between <li> and </li> tags. There are of course a million ways to do this, but I figured it would be useful to get more into the habit of doing this in simple shell commands:

awk '/<li[^>]+><a[^>]+>([^>]+)<\/a>/m' cities.html

The problem is, this prints everything, whereas I simply want to print the match in parentheses -- ([^>]+) -- either awk doesn't support this, or I'm incompetent. The latter seems more likely. If you wanted to apply the supplied regex to a file and extract only the specified matches, how would you do it? I already know a half dozen other ways, but I don't feel like letting awk win this round ;)

Edit: The data is not well-structured, so using positional matches ($1, $2, etc.) is a no-go.

A: 

There are several issues that I see:

  • The pattern has a trailing 'm', which is significant for multi-line matches in Perl, but Awk does not use Perl-compatible regular expressions and has no regex modifier flags. (Neither standard awk nor GNU awk does; both are based on POSIX extended regular expressions.)
  • Ignoring that, the pattern seems to search for a 'start list item' followed by an anchor '<a>' to '</a>', not the end list item.
  • You search for anything that is not a '>' as the body of the anchor; that's not automatically wrong, but it might be more usual to search for anything that is not '<', or anything that is neither.
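  • Awk does not do multi-line searches by default: it matches the pattern against one record at a time, and a record is normally a single line.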
  • In Awk, '$1' denotes the first field, where the fields are separated by the field separator characters, which default to white space.
  • Classic nawk (as documented in the 'sed & awk' book, vintage 1991) does not have a mechanism for pulling sub-fields out of matches.
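
A quick way to see the first point in action: awk parses `/regex/m` as the result of the match (0 or 1) concatenated with an uninitialized variable named m. That yields the non-empty string "0" or "1", and a non-empty string is always true as a pattern, which would explain why every line gets printed. The sketch below uses a made-up two-line input (not the original cities.html) to show the effect.

```shell
# Hypothetical two-line input; only the first line contains "li".
sample='<li>one</li>
<p>two</p>'

# /li/m is (match result) concatenated with the empty variable m,
# so the pattern is the string "1" or "0", and both count as true.
with_m=$(printf '%s\n' "$sample" | awk '/li/m')

# Without the stray m, only the matching line is printed.
without_m=$(printf '%s\n' "$sample" | awk '/li/')

printf 'with m:\n%s\n\nwithout m:\n%s\n' "$with_m" "$without_m"
```

The "with m" output contains both input lines; the "without m" output contains only the list item line.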

It is not clear that Awk is the right tool for this job. Indeed, it is not entirely clear that regular expressions are the right tool for this job.

Jonathan Leffler
As to your second point, yes, I was a bit contradictory in my description vs. the pattern. I can assure you the pattern matches exactly what I want it to. I wasn't aware awk didn't do multi-line. Perhaps it isn't the right tool; I can extract the text in 4 lines of Python, it just would have been more convenient to have a go-to method to extract them with a single command and be done with it. Of course, I spent an hour trying to figure that command out, so... ;)
tdavis
+2  A: 

If you want to do this in the general case, where your list tags can contain any legal HTML markup, then awk is the wrong tool. The right tool for the job would be an HTML parser, which you can trust to get all of the little details of HTML parsing right, including variants of HTML and malformed HTML.

If you are doing this for a special case, where you can control the HTML formatting, then you may be able to make awk work for you. For example, let's assume you can guarantee that each list element never occupies more than one line, is always terminated with </li> on the same line, and never contains nested markup (such as a list inside a list). Then you can use awk, but you need to write a whole awk program that first finds the lines that contain list elements, then uses other awk commands to pull out just the substring you are interested in.
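
Under exactly those assumptions (one list item per line, closing </li> on the same line, no nested markup), a minimal POSIX awk sketch might look like the following. The function name extract_li and the sample input are made up for illustration. match() sets RSTART and RLENGTH for the whole match, and the surrounding tags are then stripped with sub().

```shell
# extract_li: hypothetical helper; reads HTML on stdin and prints the
# text of each single-line <li>...</li> element, one per line.
extract_li() {
  awk '
    # Find a complete list item on the current line.
    match($0, /<li[^>]*>[^<]*<\/li>/) {
      item = substr($0, RSTART, RLENGTH)   # just the matched element
      sub(/^<li[^>]*>/, "", item)          # drop the opening tag
      sub(/<\/li>$/, "", item)             # drop the closing tag
      print item
    }
  '
}

printf '%s\n' '<ul>' '  <li class="c">Boston</li>' '  <li>Austin</li>' '</ul>' | extract_li
```

With the sample input this prints Boston and Austin on separate lines. Anything that breaks the assumptions (an item split across lines, a tag inside the item) is silently skipped, which is the kind of fragility that makes a real HTML parser the safer choice.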

But in general, awk is the wrong tool for this job.

Eddie
A: 

Don't really know awk, how about Perl instead?

tr -d '\012' < the.html | perl \
-e '$text = <>;' -e 'while ( $text =~ /<li>(.*?)<\/li>(.*)/ )' \
-e '{ print "$1\n"; $text = $2 }'

1) remove newlines from file (tr only reads standard input, so the file is redirected in), pipe through perl

2) initialize a variable with the complete text, loop as long as a list item still matches

3) do a "non greedy" match for stuff bounded by list-item tags, print the target, keep the remainder for the next pass

Make sense? (warning, did not try this code myself, need to go home soon...)
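
For readers who need to stay in awk, the same three steps can be sketched in POSIX awk as well. This is an illustration under the same assumption as the Perl version (plain <li> tags with no attributes), not a drop-in replacement for it, and the function name li_items is made up. POSIX awk has no non-greedy quantifier, so the sketch uses index() to find the nearest closing tag instead.

```shell
# li_items: hypothetical helper; prints the contents of every
# <li>...</li> pair found on stdin, even when a pair spans several lines.
li_items() {
  awk '
    # Step 1: join all lines into one string, dropping the newlines.
    { text = text $0 }
    END {
      # Step 2: loop while an opening tag is still present.
      while ((start = index(text, "<li>")) > 0) {
        text = substr(text, start + 4)      # skip past "<li>"
        stop = index(text, "</li>")         # nearest closing tag
        if (stop == 0) break                # unterminated item: give up
        # Step 3: print the item and continue with the remainder.
        print substr(text, 1, stop - 1)
        text = substr(text, stop + 5)       # skip past "</li>"
      }
    }
  '
}

printf '%s\n' '<ul><li>Bos' 'ton</li><li>Austin</li>' '</ul>' | li_items
```

With the sample input this prints Boston and Austin on separate lines. Because the newlines are dropped when the lines are joined, an item split across lines comes out as one string, just as with the tr step above.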

P.S. - "perl -n" is Awk (nawk?) mode. Perl is largely a superset of Awk, so I never bothered to learn Awk.

Roboprog
You're answering what the question said, not what the awk regex did :D And I suspect that the script could be compressed - but then it's Perl and TMTOWTDI!
Jonathan Leffler
Perl isn't guaranteed to be installed, particularly if this is on an embedded device, on which Perl may be too big and complicated to compile. Those were precisely the circumstances under which I had to begin to master Awk. I don't know what the OP's circumstances are, but he asked an Awk question, not a Perl question, and as such, the answer should be an Awk answer.
George Jempty
Yep, reading between the lines: interpreted as "wanna glop through lotso text using regex to grab stuff". Answered as such.
Roboprog
A: 

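If your pattern already matches what you want (that is, the <li> and <a> tags are on the same line), you can use:

$ cat test.html | awk 'sub(/<li[^>]*><a[^>]*>/,"")&&sub(/<\/a>.*/,"")'

or

$ cat test.html | gawk '/<li[^>]*><a[^>]*>[^<]*<\/a>/&&($0=gensub(/<li[^>]*><a[^>]*>([^<]*)<\/a>.*/,"\\1",1))'

The first one works in any awk; the second one requires GNU awk, because gensub() is a gawk extension. Neither awk nor gawk supports the non-greedy .*? quantifier, so the capture group uses [^<]* instead.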

Hirofumi Saito
A: 
gawk -F'<li>' -v RS='</li>' 'RT{print $NF}' file

Worked pretty well for me.

Everett Toews