I have an HTML file and would like to extract the text between <li>
and </li>
tags. There are of course a million ways to do this, but I figured it would be useful to get more into the habit of doing this in simple shell commands:
awk '/<li[^>]+><a[^>]+>([^>]+)<\/a>/m' cities.html
The problem is, this prints everything whereas I simply want to print the match in parenthesis -- ([^>]+)
-- either awk doesn't support this, or I'm incompetent. The latter seems more likely. If you wanted to apply the supplied regex to a file and extract only the specified matches, how would you do it? I already know a half dozen other ways, but I don't feel like letting awk
win this round ;)
Edit: The data is not well-structured, so using positional matches ($1, $2, etc.
) is a no-go.