+1  Q: 

RegExp problem

Hi.

I have a problem creating a regular expression for the following task:

Suppose we have HTML-like text of the kind:

<x>...<y>a</y>...<y>b</y>...</x>

I want to get a collection of the values inside "y" tags located inside a given "x" tag, so the result for the above example would be a collection of two elements, ["a","b"].

Additionally, we know that:

  • "y" tags cannot be enclosed in other "y" tags
  • ... can include any text or other tags.

Please help with the regexp.

+8  A: 

This is a job for an HTML/XML parser. You could do it with regular expressions, but it would be very messy. There are examples in the page I linked to.

Bill the Lizard
unfortunately, the text is not guaranteed to be valid XML, nor is it in HTML format
Well assuming it is HTML, the standard answer is to run it through Tidy. With --clean, Tidy outputs only valid XHTML that should be parsable by virtually any HTML/XML package.
Eli
What Eli said. :)
Bill the Lizard
+1 this - why does everyone want to parse XML with regexes these days?
annakata
Processing and parsing are going to be way simpler than a regular expression when you have an unlimited number of potential matches. Try out the tools that were designed for this problem before you create even more problems for yourself.
Bill the Lizard
+1 for not falling for the common misuse of regular expressions that is parsing nested structures
Marko Dumic
I can recommend Beautiful Soup for Python. Perfect for parsing invalid HTML/XML-ish data.
kimsnarf
+1  A: 

Short and simple: Use XPath :)

Guðmundur Bjarni
String -> XML -> XPath
Guðmundur Bjarni
He said in a comment to @Bill the Lizard that "text is not guaranteed to be valid XML"
Tomalak
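For readers who want to see the "String -> XML -> XPath" route spelled out, here is a minimal Java sketch using only the standard javax.xml APIs. It assumes the input happens to be well-formed XML, which the asker says is not guaranteed, so malformed input is rejected with an exception. The class name XPathSketch and the helper yValuesInsideX are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {
    // Hypothetical helper: returns the text of every <y> that has an <x> ancestor.
    // Assumption: the input is well-formed XML; anything else raises an exception.
    static List<String> yValuesInsideX(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // "//x//y" selects y elements at any depth below any x element.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//x//y", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String input = "<x>...<y>a</y>...<z><y>b</y></z>...</x>";
        System.out.println(yValuesInsideX(input)); // prints [a, b]
    }
}
```

If the text is only HTML-like, it would first have to be cleaned up (for example with Tidy, as suggested in the comments above) before a sketch like this could parse it.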
+2  A: 

I'm taking your word on this:

"y" tags cannot be enclosed in other "y" tags

input looks like: <x>...<y>a</y>...<y>b</y>...</x>

and on the assumption that everything else is also free of nesting and correctly formatted. (Disclaimer: If it is not, it's not my fault.)

First, find the contents of any X tags with a loop over the matches of this:

<x[^>]*>(.*?)</x>

Then (in the loop body) find any Y tags within match group 1 of the "outer" match from above:

<y[^>]*>(.*?)</y>

Pseudo-code:

input = "<x>...<y>a</y>...<y>b</y>...</x>"
x_re  = "<x[^>]*>(.*?)</x>"
y_re  = "<y[^>]*>(.*?)</y>"

for each x_match in input.match_all(x_re)
  for each y_match in x_match.group(1).value.match_all(y_re)
    print y_match.group(1).value
  next y_match
next x_match

Pseudo-output:

a
b
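
The pseudo-code above could be sketched in Java roughly as follows. The class name NestedLoopSketch and the helper yValuesInsideX are hypothetical. Two details are assumptions added here and are not part of the patterns above: a \b after the tag name, so that a tag such as <xyz> is not mistaken for <x>, and the DOTALL flag, so that content may span line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedLoopSketch {
    // Outer pattern: the content of each x element (non-greedy, so it stops at the first </x>).
    private static final Pattern X_RE =
            Pattern.compile("<x\\b[^>]*>(.*?)</x>", Pattern.DOTALL);
    // Inner pattern: the content of each y element.
    private static final Pattern Y_RE =
            Pattern.compile("<y\\b[^>]*>(.*?)</y>", Pattern.DOTALL);

    // Hypothetical helper: collects the content of every y found inside an x.
    static List<String> yValuesInsideX(String input) {
        List<String> values = new ArrayList<>();
        Matcher xMatch = X_RE.matcher(input);
        while (xMatch.find()) {
            // Search for y elements only within group 1 of the outer match.
            Matcher yMatch = Y_RE.matcher(xMatch.group(1));
            while (yMatch.find()) {
                values.add(yMatch.group(1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String input = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>";
        System.out.println(yValuesInsideX(input)); // prints [a, b]
    }
}
```

As with the pseudo-code, this relies on x elements not being nested inside other x elements.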


Further clarification in the comments revealed that there is an arbitrary amount of Y elements within any X element. This means there can be no single regex that matches them and extracts their contents.

Tomalak
thanks. Didn't mention it from the start, but I do understand how to achieve the goal using loops. But I'm 100% sure there's a solution that uses single "match" operation - and that's what I'm trying to figure out :)
If there is no hard limit on the number of Y elements, then there is no regex-only solution. What makes you so sure there is? Maybe the question is missing more details.
Tomalak
hmmm.. just a feeling. maybe i'm wrong though.
As I said, your question is missing some details on the exact structure of the strings you expect to be dealing with.
Tomalak
there's no limit on number of "y" tags (hence no limit on resulting collection size). what other details do you think I need to provide?
nesting or no nesting, elements can have attributes or not, is it machine generated or user input, and last but not least: what are you *really* trying to do? Maybe there are better ways than regex to get you there.
Tomalak
"y" cannot be nested within another "y", but it can be nested within some other tag, say "z"; machine-generated non-valid HTML (HTML-like) that I cannot control; let's put it like this - currently I would like to know if it is possible to do the task with a single regexp match
and attributes are possible inside any tag (incl "x" and "y")
Okay, I see. If you want to get hold of the textual content of the Y tags, you won't be able to do it with a single regex, period. This is not the way regular expressions work. Use a loop as indicated, I'll expand my regex to accommodate the attributes.
Tomalak
A: 

It would help if we knew what language or tool you're using; there's a great deal of variation in syntax, semantics, and capabilities. Here's one way to do it in Java:

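import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sample input: only "a" and "b" are inside the "x" element.
String str = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>";
// Match a <y> only if a </x> lies ahead with no <x> or </x> tag before it.
String regex = "<y[^>]*+>(?=(?:[^<]++|<(?!/?+x\\b))*+</x>)(.*?)</y>";
Matcher m = Pattern.compile(regex).matcher(str);
while (m.find())
{
  System.out.println(m.group(1));  // prints "a", then "b"
}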

Once I've matched a <y>, I use a lookahead to affirm that there's a </x> somewhere up ahead, but there's no <x> between the current position and it. Assuming the pseudo-HTML is reasonably well-formed, that means the current match position is inside an "x" element.

I used possessive quantifiers heavily because they make things like this so much easier, but as you can see, the regex is still a bit of a monster. Aside from Java, the only regex flavors I know of that support possessive quantifiers are PHP and the JGsoft tools (RegexBuddy/PowerGrep/EditPad Pro). On the other hand, many languages provide a way to get all of the matches at once, but in Java I had to code my own loop for that.
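
To make the "monster" easier to read, here is a sketch that writes the same regex in free-spacing mode (Pattern.COMMENTS) with one comment per part, and wraps the loop in a helper that returns all matches as a list. The class name SingleRegexSketch, the constant Y_INSIDE_X, and the helper findAll are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SingleRegexSketch {
    // Same regex as above; in COMMENTS mode whitespace and "# ..." comments are ignored.
    private static final Pattern Y_INSIDE_X = Pattern.compile(
          "<y[^>]*+>            # an opening y tag, attributes allowed\n"
        + "(?=                  # lookahead from just after that tag:\n"
        + "  (?: [^<]++         #   runs of text containing no '<'\n"
        + "    | <(?!/?+x\\b)   #   or a '<' that does not begin an x or /x tag\n"
        + "  )*+                #   repeated possessively, so no backtracking\n"
        + "  </x>               #   and then the closing x tag must follow\n"
        + ")\n"
        + "(.*?)</y>            # group 1: the content of the y element\n",
        Pattern.COMMENTS);

    // Hypothetical helper: gathers group 1 of every match into a list.
    static List<String> findAll(String input) {
        List<String> values = new ArrayList<>();
        Matcher m = Y_INSIDE_X.matcher(input);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String str = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>";
        System.out.println(findAll(str)); // prints [a, b]
    }
}
```

The helper still loops internally; it only hides the loop behind a single call, which is roughly what "get all of the matches at once" amounts to in other languages.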

So it is possible to do this job with one regex, but a very complicated one, and both the regex and the enclosing code have to be tailored to the language you're working in.

Alan Moore