I need to get all the text between <Annotation> and </Annotation> in which the word MATCH occurs. How can I do this in Vim?

<Annotation about="MATCH UNTIL </Annotation>   " timestamp="0x000463e92263dd4a" href="     5raS5maS90ZWh0YXZha29rb2VsbWEvbGFza2FyaS8QyrqPk5L9mAI">                                                                        
  <Label name="las" />
  <Label name="_cse_6sbbohxmd_c" />
  <AdditionalData attribute="original_url" value="MATCH UNTIL </Annotation>       " />
</Annotation>
<Annotation about="NO MATCH" href="     Cjl3aWtpLmhlbHNpbmtpLmZpL2Rpc3BsYXkvbWF0aHN0YXRLdXJzc2l0L0thaWtraStrdXJzc2l0LyoQh_HGoJH9mAI">
  <Label name="_cse_6sbbohxmd_c" />
  <Label name="courses" />
  <Label name="kurssit" />
  <AdditionalData attribute="original_url" value="NO MATCH" />
</Annotation>
<Annotation about="MATCH UNTIL </ANNOTATION>     " score="1" timestamp="0x000463e90f8eed5c" href="CiZtYXRoc3RhdC5oZWx     zaW5raS5maS90ZWh0YXZha29rb2VsbWEvKhDc2rv8kP2YAg">
  <Label name="_cse_6sbbohxmd_c" />
  <Label name="exercises_without_solutions" />
  <Label name="tehtäväkokoelma" />
  <AdditionalData attribute="original_url" value="MATCH UNTIL </ANNOTATION>" />
</Annotation>
+3  A: 

Does it have to be done within vim? Could you cheat, and open a second window where you pipe something into more/less that tells you what line number to go to within vim?

-- edit --

I have never done a multi-line match/search in vi[m]. However, to cheat in another window:

perl -n -e 'if ( /<tag/ .. /<\/tag/)' -e '{ print "$.:$_"; }' file.xml | less

This shows, in less, every block for the element "tag" (or any element whose name starts with "tag"), with each line prefixed by its line number. You can then search for the other text within each block.

Close enough?

-- edit --

Within "less", type

/MATCH

to search for occurrences of MATCH. On the left margin will be the line number where that instance (within the targeted element/tags) is.

Then, within vi[m], type

:n

where "n" is the desired line number (for example, :16 jumps to line 16).
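The two steps above (filter to the element's blocks, then search for MATCH) can also be folded into one small script that prints only the line numbers worth jumping to. This is a minimal Python sketch, not a robust tool. It assumes the end-tag text never appears inside an attribute value on the start-tag line, and it does a plain substring test, so "NO MATCH" would count as a hit too. The function name match_lines_in_blocks and its parameters are made up for illustration.

```python
import re

def match_lines_in_blocks(lines, tag="Annotation", word="MATCH"):
    """Return (line_number, text) pairs for lines that contain `word`
    and lie inside a <tag ...> ... </tag> block. Numbers are 1-based."""
    open_re = re.compile(r"<" + re.escape(tag) + r"\b")
    close_re = re.compile(r"</" + re.escape(tag) + r"\s*>")
    hits = []
    inside = False
    for number, line in enumerate(lines, start=1):
        if not inside and open_re.search(line):
            inside = True          # block starts on this line
        if inside and word in line:
            hits.append((number, line.rstrip("\n")))
        if inside and close_re.search(line):
            inside = False         # block ends on this line
    return hits

if __name__ == "__main__":
    import sys
    # Hypothetical usage: python blocks.py file.xml
    with open(sys.argv[1], encoding="utf-8") as handle:
        for number, text in match_lines_in_blocks(handle):
            print(f"{number}:{text}")
```

Each printed number can be fed straight to :n in vi[m].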

Of course, if what you really wanted to do was some kind of search/yank/replace, it's more complicated. At that point, awk / perl / ruby (or something similar which meets your tastes ... or xsl?) is really the tool you should be using for the transformation.

Roboprog
I think something like this will be the only possible answer, as to do this right you need to use an XML parser.
Eddie
Where is the MATCH word supposed to be? In the place of ..?
Masi
+4  A: 

First, a disclaimer: Any attempt to slice and dice XML with regular expressions is fragile; a real XML parser would do better.
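For comparison, here is roughly what the parser route could look like. This is only a sketch in Python with the standard-library xml.etree.ElementTree, under two assumptions. First, the input must be well-formed XML: the sample in the question has a literal < inside attribute values, which a conforming parser rejects, so it would have to be written as &lt; first. Second, the fragment has several top-level elements, so the sketch wraps it in a made-up root element named Annotations. The function name annotations_with is also hypothetical.

```python
import xml.etree.ElementTree as ET

def annotations_with(text, word="MATCH"):
    """Return the Annotation elements that have `word` in any attribute
    value, on the element itself or on any of its descendants."""
    # Wrap the fragment in a single (hypothetical) root so it parses.
    root = ET.fromstring("<Annotations>" + text + "</Annotations>")
    found = []
    for annotation in root.iter("Annotation"):
        # annotation.iter() yields the element itself and all descendants.
        if any(word in value
               for element in annotation.iter()
               for value in element.attrib.values()):
            found.append(annotation)
    return found
```

ET.tostring(element, encoding="unicode") turns each result back into text if that is what you need. The regex approach, which is what works inside Vim, follows.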

The pattern:

\(<Annotation\(\s*\w\+="[^"]\{-}"\s\{-}\)*>\)\@<=\(\(<\/Annotation\)\@!\_.\)\{-}"MATCH\_.\{-}\(<\/Annotation>\)\@=

Let's break it down...

Group 1 is <Annotation\(\s*\w\+="[^"]\{-}"\s\{-}\)*>. It matches the start-tag of the Annotation element. Group 2, which is embedded in Group 1, matches an attribute and may be repeated 0 or more times.

Group 2 is \s*\w\+="[^"]\{-}"\s\{-}. Most of these pieces are commonly used; the most unusual is \{-}, which means non-greedy repetition (*? in Perl-compatible regular expressions). The non-greedy whitespace match at the end is important for performance; without it, Vim will try every possible way to divide the whitespace that separates two attributes between the \s* at the end of Group 2 and the \s* at the beginning of the next occurrence of Group 2.

Group 1 is followed by \@<=. This is a zero-width positive look-behind. It prevents the start-tag from being included in the matched text (e.g., for s///).

Group 3 is \(<\/Annotation\)\@!\_.. It includes Group 4, which matches the beginning of the Annotation end-tag. The \@! is a zero-width negative look-ahead and \_. matches any character (including newlines). Together, this group matches any character except at a position where the Annotation end-tag starts. Group 3 is followed by a non-greedy repetition marker \{-} so that it matches the smallest block of text before MATCH. If you were to use \_. instead of Group 3, the matched text could include the end-tag of an Annotation element that did not include MATCH and continue through into the next Annotation element with MATCH. (Try it.)

The next bit is straightforward: Find MATCH and a minimal number of other characters before the end-tag.

Group 5 is easy: It's the end-tag. \@= is a zero-width positive look-ahead, which is included here for the same reason as the \@<= for the start-tag. We have to repeat <\/Annotation rather than use \4 because groups with zero-width modifiers aren't captured.
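If you want to experiment with the same structure outside Vim, here is a rough Python translation. It is a sketch and not an exact equivalent: Python's re module does not allow variable-width look-behind, so the start-tag is matched normally and the body is pulled out with a capture group instead of \@<=. The names START_TAG, TEMPERED, NAIVE, bodies and SAMPLE are made up for illustration, and SAMPLE is a simplified stand-in for the data in the question. NAIVE swaps the Group 3 construct for a plain lazy dot, to show the spill-over described above.

```python
import re

# Start-tag with any number of name="value" attributes.
START_TAG = r'<Annotation(?:\s+\w+="[^"]*")*\s*>'

# Tempered version: the body may never step onto the start of an end-tag.
TEMPERED = re.compile(
    START_TAG
    + r'((?:(?!</Annotation).)*?'   # any char that does not begin an end-tag
    + r'"MATCH.*?)'                 # the quoted MATCH, then as little as possible
    + r'(?=</Annotation>)',         # look-ahead for the end-tag, like \@=
    re.DOTALL,                      # let . match newlines, like \_. in Vim
)

# Naive version: a plain lazy dot where the tempered group was.
NAIVE = re.compile(
    START_TAG + r'(.*?"MATCH.*?)(?=</Annotation>)',
    re.DOTALL,
)

def bodies(pattern, text):
    """Return the captured element body for every match of `pattern`."""
    return [m.group(1) for m in pattern.finditer(text)]

SAMPLE = (
    '<Annotation about="first">\n'
    '  <Data value="MATCH one" />\n'
    '</Annotation>\n'
    '<Annotation about="second">\n'
    '  <Data value="nothing" />\n'
    '</Annotation>\n'
    '<Annotation about="third">\n'
    '  <Data value="MATCH three" />\n'
    '</Annotation>\n'
)

if __name__ == "__main__":
    print(bodies(TEMPERED, SAMPLE))  # bodies of the first and third elements
    print(bodies(NAIVE, SAMPLE))     # second result runs from "second" into "third"
```

With NAIVE, the second match starts in the element that has no MATCH and runs across its end-tag into the next element, which is exactly the failure Group 3 is there to prevent.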

Nathan Kitchen
+1 for the explanations. It takes me some time to thoroughly understand them :)
Masi