ansaurus

Question

Regex: To pull out a sub-string between two tags in a string

Answer 1

+6 A:

\[start\](.*?)\[end\]

which will put the text between the two tags into a capture group.
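As a minimal sketch of how that capture is read back, here is the same pattern in Python's `re` module. The sample string and the variable names are assumptions made up for illustration, not part of the original answer.

```python
import re

# Hypothetical sample input, invented for this illustration.
text = "Data Data Data [start] Data I want [end] Data"

# (.*?) is a lazy capture: it stops at the first [end] that follows [start].
match = re.search(r"\[start\](.*?)\[end\]", text)

# group(1) holds only the captured middle, without the tags themselves.
middle = match.group(1) if match else None
print(repr(middle))
```

The surrounding spaces are part of the capture, so call `.strip()` on the result if they are not wanted.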

Karl Seguin 2008-08-04 13:52:06

Much better (simpler) than the accepted answer... :-)

PhiLho 2008-12-09 17:34:52

This still won't catch strings that have line breaks

Doug 2010-04-19 03:22:42
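The line-break limitation mentioned in the comment above comes from `.` not matching a newline by default in most regex flavours. A small sketch in Python, assuming a made-up two-line input, shows the difference the `re.DOTALL` flag makes (Perl's equivalent is the `s` modifier):

```python
import re

# Hypothetical input whose wanted text spans two lines.
text = "[start]line one\nline two[end]"
pattern = r"\[start\](.*?)\[end\]"

# Without a flag, '.' refuses to cross the '\n', so there is no match.
without_flag = re.search(pattern, text)

# With re.DOTALL, '.' matches newlines as well.
with_flag = re.search(pattern, text, re.DOTALL)

print(without_flag)
print(repr(with_flag.group(1)))
```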

Answer 2

+3 A:

\[start\]\s*(((?!\[start\]|\[end\]).)+)\s*\[end\]

This should hopefully drop the [start] and [end] markers as well.
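To see what this pattern does in practice, the sketch below runs it in Python on an invented input that contains a stray extra [start]. The negative lookahead prevents the capture from running across another tag, so the match begins at the innermost [start]. The input string and variable names are assumptions for illustration only.

```python
import re

pattern = re.compile(r"\[start\]\s*(((?!\[start\]|\[end\]).)+)\s*\[end\]")

# Hypothetical input with an unclosed outer [start].
text = "[start] outer [start]  inner text  [end] tail"

match = pattern.search(text)

# The capture group is greedy, so it swallows the trailing spaces before
# the final \s* gets a chance to; strip() removes them afterwards.
raw = match.group(1) if match else None
inner = raw.strip() if raw is not None else None
print(repr(raw), repr(inner))
```

Leading whitespace is dropped by the first `\s*`, but trailing whitespace stays in the capture, which is why the sketch strips the result.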

Xenph Yan 2008-08-04 13:55:05

Answer 3

A:

With Perl you can surround the data you want with parentheses and pull it out later; other languages probably have a similar feature.

if ($s_output =~ /(data data data data START(data data data)END (data data))/)
{
    $dataAllOfIt = $1;      # $1: the full matched string
    $dataInMiddle = $2;     # $2: the data between START and END
    $dataAtEnd = $3;        # $3: the data after END
}

Grant 2008-08-04 14:00:04

Answer 4

+1 A:

A more complete discussion of the pitfalls of using a regex to find matching tags can be found at: http://faq.perl.org/perlfaq4.html#How_do_I_find_matchi. In particular, be aware that nested tags really need a full-fledged parser in order to be interpreted correctly.

Note that case sensitivity will need to be turned off in order to answer the question as stated. In Perl, that's the i modifier:

$ echo "Data Data Data [Start] Data i want [End] Data" \
  | perl -ne '/\[start\](.*?)\[end\]/i; print "$1\n"'
 Data i want

The other trick is to use the *? quantifier, which makes the captured match non-greedy. For instance, if you have an extra, unmatched [end] tag:

Data Data [Start] Data i want [End] Data [end]

you probably don't want to capture:

 Data i want [End] Data
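The same comparison can be sketched in Python, using the sample line from above. `re.IGNORECASE` plays the role of Perl's `i` modifier; the variable names are made up for illustration.

```python
import re

text = "Data Data [Start] Data i want [End] Data [end]"

# Case-sensitive search fails because the input says [Start], not [start].
case_sensitive = re.search(r"\[start\](.*?)\[end\]", text)

# Lazy: stops at the first closing tag.
lazy = re.search(r"\[start\](.*?)\[end\]", text, re.IGNORECASE).group(1)

# Greedy: runs on to the last closing tag.
greedy = re.search(r"\[start\](.*)\[end\]", text, re.IGNORECASE).group(1)

print(case_sensitive)
print(repr(lazy))
print(repr(greedy))
```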

Jon Ericson 2008-08-20 19:14:19

Answer 5

+1 A:

While you can use a regular expression to parse the data between opening and closing tags, you need to think long and hard as to whether this is a path you want to go down. The reason is that tags may nest: if nested tags can ever occur, the language is no longer regular, and regular expressions cease to be the proper tool for parsing it.

Many regular expression implementations, such as PCRE or Perl's regular expressions, support backtracking, which can be used to achieve this rough effect. But PCRE (unlike Perl) doesn't support unlimited backtracking, and this can actually cause things to break in weird ways as soon as you have too many tags.

There's a very commonly cited blog post that discusses this more: http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html (the site seems to be having some downtime at the moment, so search for it and check the cached copy).

Daniel Papasian 2008-09-15 14:18:27

Answer 6

A:

Well, if you guarantee that each start tag is followed by an end tag then the following would work.

\[start\](.*?)\[end\]

However, if you have complex text such as the following:

[start] sometext [start] sometext2 [end] sometext [end]

then you would run into problems with regex.
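A short Python sketch makes the problem concrete and shows one possible alternative: a small scanner that counts nesting depth rather than relying on a single pattern. The function name `outermost_spans` and its parameters are hypothetical, and the scanner is only an illustration under the assumption that tags are literal strings, not a full parser.

```python
import re

text = "[start] sometext [start] sometext2 [end] sometext [end]"

# The lazy pattern pairs the first [start] with the first [end],
# which cuts the outer block short.
naive = re.search(r"\[start\](.*?)\[end\]", text).group(1)

def outermost_spans(s, open_tag="[start]", close_tag="[end]"):
    """Return the text inside each outermost open_tag ... close_tag pair."""
    token = re.compile(re.escape(open_tag) + "|" + re.escape(close_tag))
    results = []
    depth = 0
    begin = None
    for m in token.finditer(s):
        if m.group() == open_tag:
            if depth == 0:
                begin = m.end()      # content starts after the outer tag
            depth += 1
        elif depth > 0:
            depth -= 1
            if depth == 0:           # the outer pair just closed
                results.append(s[begin:m.start()])
        # a closing tag seen at depth 0 is unmatched and is skipped
    return results

print(repr(naive))
print(outermost_spans(text))
```

Here the regex alone returns a fragment that still contains an unclosed [start], while the depth counter returns the whole outer block.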

By contrast, the following pattern will pull out all the hyperlinks in a page:

'/<a(.*?)a>/i'

In the above case we can guarantee that there would not be any nested cases of:

'<a></a>'

So, this is a complex question and can't just be solved with a simple answer.

VN44CA 2009-05-11 20:08:30
