First time sed'er, so be gentle.

I have the following text file, 'test_file':

 <Tag1>not </Tag1><Tag2>working</Tag2>

I want to extract the text between the <Tag2> tags using a sed regex. There may be other occurrences of <Tag2>, and I would like to extract those as well.

So far I have this sed based regex:

cat test_file | grep -i "Tag2"| sed 's/<[^>]*[>]//g'

which gives the output:

 not working

Does anyone have any idea how to get this working?

+1  A: 

You're using the wrong tool.

Ignacio Vazquez-Abrams
I've upvoted this version of the answer as it is the earliest. As you all state, the OP is using the wrong tool and I could have upvoted all three (so far) answers.
High Performance Mark
+2  A: 

As another poster said, sed may not be the best tool for this job. You may want to use something built for XML parsing, or even a simple scripting language such as Perl.

The problem with your attempt is that you aren't analyzing the string properly.

cat test_file is good - it prints out the contents of the file to stdout.

grep -i "Tag2" is ok - it prints out only lines with "Tag2" in them. This may not be exactly what you want. Bear in mind that it will print the whole line, not just the <Tag2> part, so you will still have to search out that part later.

sed 's/<[^>]*[>]//g' isn't what you want - it simply removes the tags, including <Tag1> and <Tag2>.
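To see why, here is a small sketch that runs the same substitution on the sample line from the question. The variable names `line` and `stripped` are only for illustration.

```shell
# Sample line from the question (the leading space is kept, as in test_file).
line=' <Tag1>not </Tag1><Tag2>working</Tag2>'

# The substitution deletes every <...> sequence, so the text of both
# elements survives and runs together.
stripped=$(printf '%s\n' "$line" | sed 's/<[^>]*[>]//g')

# Brackets make the leading space visible.
printf '[%s]\n' "$stripped"
```

This prints `[ not working]`: the tags are gone, but nothing has selected the <Tag2> content.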

You can try something like:

cat test_file | grep -i tag2 | sed 's/.*<Tag2>\(.*\)<\/Tag2>.*/\1/'

This will produce

working

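but it will only work for one tag pair per line: the greedy .* at the start skips ahead to the last <Tag2> on the line, so any earlier pairs are dropped.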
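If several <Tag2> pairs can share a line, one possible workaround is to split the matches onto separate lines first. This is a rough sketch, not a general XML solution: it assumes `grep -o` is available (an extension in GNU and BSD grep, not POSIX), that each element starts and ends on the same line, and that the element contains no nested tags. The input below is made up for illustration.

```shell
# Hypothetical input: two <Tag2> elements on the first line, one on the second.
input='<Tag1>not </Tag1><Tag2>working</Tag2><Tag2>again</Tag2>
<Tag2>third</Tag2>'

# grep -o prints each match on its own line; sed then strips the
# surrounding tags from every match.
result=$(printf '%s\n' "$input" | grep -o '<Tag2>[^<]*</Tag2>' | sed 's/<[^>]*>//g')

printf '%s\n' "$result"
```

With this input the output is `working`, `again` and `third`, one per line.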

Avi
+1 for **NOT** posting *the link* and patiently answering the question as well as warning that it is not a general solution to the problem.
Bart Kiers
+1  A: 

For your nice, friendly example, you could use

sed -e 's/^.*<Tag2>//' -e 's!</Tag2>.*!!' test_file

but the XML out there is cruel and uncaring. You're asking for serious trouble using regular expressions to scrape XML.
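As a small illustration of how quickly this goes wrong, the sketch below feeds the same two sed expressions three inputs. The `extract` helper and the second and third inputs are hypothetical, added only for the demonstration.

```shell
# Helper wrapping the two-expression sed command from this answer.
extract() { sed -e 's/^.*<Tag2>//' -e 's!</Tag2>.*!!'; }

# The friendly example from the question.
friendly=$(printf '%s\n' ' <Tag1>not </Tag1><Tag2>working</Tag2>' | extract)

# Two elements on one line: the greedy ^.* swallows the first element.
doubled=$(printf '%s\n' '<Tag2>a</Tag2><Tag2>b</Tag2>' | extract)

# A line with no <Tag2> at all passes through untouched.
untagged=$(printf '%s\n' '<Tag1>only</Tag1>' | extract)

printf '%s\n' "$friendly" "$doubled" "$untagged"
```

The first line comes out as `working`, but the second yields only `b` (the `a` is silently lost) and the third is printed unchanged, tags and all.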

Greg Bacon
+1 for **NOT** posting *the link* and patiently answering the question as well as warning that it is not a general solution to the problem.
Bart Kiers
+1  A: 

You can use gawk (POSIX awk leaves a multi-character RS unspecified), e.g.:

$ cat file
 <Tag1>not </Tag1><Tag2>working here</Tag2>
 <Tag1>not </Tag1><Tag2>
working

</Tag2>

$ awk -vRS="</Tag2>" '/<Tag2>/{gsub(/.*<Tag2>/,"");print}' file
working here

working
ghostdog74
A: 
awk -F"Tag2" '{print $2}' test_1 | sed 's/[^a-zA-Z]//g'
Vijay Sarathi