Hello, I've been searching for a solution to this problem for quite some time, but I can't figure it out on my own.

I have a bunch of HTML blocks, and I want to search for a specific string contained in one of the inner tags. If there's a match, I want to return an attribute value from its parent tag. Here's an example:

<li rel="Returns this value">
    <some other tags and elements here />
    <a class="link"><span>This match</span></a>
</li>

We search for the string "This match" and it should return "Returns this value". Is this possible in awk? If not, what is the easiest way to accomplish this? I don't mind any solution, but awk or a similar command-line tool would be preferred. I'm running on an Ubuntu server and I have root access, so if needed I could rely on other languages, such as Ruby, Python, Perl, PHP, and others.

So far I've been able to search for the string between the span tags and return its contents. That could be done much more easily with a simple sed command, so there's not much use for it yet. However, it may still be useful and might be improved to do what I need, so here goes:

awk 'BEGIN{RS="";FS="</span>"}
/li/{
 for(i=1;i<=NF;i++){
    if($i ~ /span/){
        gsub(/.*span>/,"",$i)
        print $i
    }    
 } 
}'

When used on the above example, it returns "This match". Thanks a lot for any suggestions.

+1  A: 

In general, you can't parse HTML with regular expressions.

That doesn't mean you can't parse HTML in awk, though it would be a big job and I've never heard of anyone doing it.

If your targets are well defined, the input is pretty uniform, and you can guarantee certain things about the nesting of tags in your input, you might be able to manage it.

However, for the most part, awk is the wrong tool for the job. Better to choose a language that has an HTML parsing engine available and use that. Perl, Python, PHP, Ruby... lots of choices.
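
As a rough illustration of the parser route, here is a minimal sketch in Python using only the standard library's html.parser module. The names ParentAttrFinder and find_parent_rel are made up for this example, and it assumes the value you want is the rel attribute of the enclosing li and that every li has an explicit closing tag. It matches the search string in any text inside the li, not only inside a span.

```python
from html.parser import HTMLParser


class ParentAttrFinder(HTMLParser):
    """Collect the rel attribute of each <li> whose text contains a target string.

    Hypothetical helper for illustration; assumes every <li> is explicitly closed.
    """

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.li_stack = []  # rel values of the <li> elements currently open
        self.matches = []   # rel values of the <li> elements that matched

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # attrs is a list of (name, value) pairs; rel may be missing (None)
            self.li_stack.append(dict(attrs).get("rel"))

    def handle_endtag(self, tag):
        if tag == "li" and self.li_stack:
            self.li_stack.pop()

    def handle_data(self, data):
        # Text found while at least one <li> is open: report the innermost one
        if self.li_stack and self.target in data:
            self.matches.append(self.li_stack[-1])


def find_parent_rel(html, target):
    """Return the rel values of all <li> elements whose text contains target."""
    parser = ParentAttrFinder(target)
    parser.feed(html)
    parser.close()
    return parser.matches


if __name__ == "__main__":
    sample = """
    <li rel="Returns this value">
        <a class="link"><span>This match</span></a>
    </li>
    <li rel="Another value">
        <a class="link"><span>Something else</span></a>
    </li>
    """
    print(find_parent_rel(sample, "This match"))  # ['Returns this value']
```

Saved to a file, this can be run from the command line like any other script. For messier real-world markup, a third-party parser may cope better than this sketch.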

dmckee