ansaurus

Question

RegExp problem

Answer 1

+8 A:

This is a job for an HTML/XML parser. You could do it with regular expressions, but it would be very messy. There are examples in the page I linked to.

Bill the Lizard 2008-12-04 13:23:43

Unfortunately, the text is not guaranteed to be valid XML, nor is it in HTML format.

2008-12-04 13:27:39

Well, assuming it is HTML, the standard answer is to run it through Tidy. With -asxhtml, Tidy outputs well-formed XHTML that should be parsable by virtually any HTML/XML package.

Eli 2008-12-04 13:42:27

What Eli said. :)

Bill the Lizard 2008-12-04 13:58:53

+1 this - why does everyone want to parse XML with regexes these days?

annakata 2008-12-04 14:16:20

2008-12-04 14:21:53

Processing and parsing are going to be way simpler than a regular expression when you have an unlimited number of potential matches. Try out the tools that were designed for this problem before you create even more problems for yourself.

Bill the Lizard 2008-12-04 14:55:30

+1 for not falling for the common misuse of regular expressions that is parsing nested structures

Marko Dumic 2008-12-05 00:54:37

I can recommend Beautiful Soup for Python. Perfect for parsing invalid HTML/XML-ish data.

kimsnarf 2009-05-16 08:37:13

Answer 2

+1 A:

Short and simple: Use XPath :)

Guðmundur Bjarni 2008-12-04 13:24:12

String -> XML -> XPath

Guðmundur Bjarni 2008-12-04 13:30:45

He said in a comment to @Bill the Lizard that "text is not guaranteed to be valid XML"

Tomalak 2008-12-04 13:42:22
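For reference, the "String -> XML -> XPath" route could look roughly like the sketch below. It uses only the JDK's built-in XML and XPath classes. The class name, method name, and sample input are made up for illustration, and it assumes the input is well-formed XML, which, as noted above, the asker cannot guarantee. Malformed input is rejected with an exception rather than parsed leniently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any answer in this thread.
final class XPathYExtractor {

    // Returns the text of every <y> element that has an <x> ancestor.
    // Throws IllegalArgumentException if the input is not well-formed XML.
    static List<String> extract(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // "//x//y" selects y elements at any depth below any x element.
            NodeList nodes = (NodeList) xpath.evaluate("//x//y", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("input could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample: a single root element wrapping the x/y structure.
        String xml = "<root><y>c</y><x>..<y>a</y>..<z><y>b</y></z>..</x><y>d</y></root>";
        System.out.println(extract(xml)); // prints [a, b]
    }
}
```

If the input really is only HTML-like, this will fail at the parse step, so it would need a clean-up pass (Tidy or a lenient HTML parser) first.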

Answer 3

+2 A:

I'm taking your word on this:

"y" tags cannot be enclosed in other "y" tags

input looks like: <x>...<y>a</y>...<y>b</y>...</x>

and the fact that everything else is also not nested and correctly formatted. (Disclaimer: If it is not, it's not my fault.)

First, find the contents of any X tags with a loop over the matches of this:

<x[^>]*>(.*?)</x>

Then (in the loop body) find any Y tags within match group 1 of the "outer" match from above:

<y[^>]*>(.*?)</y>

Pseudo-code:

input = "<x>...<y>a</y>...<y>b</y>...</x>"
x_re  = "<x[^>]*>(.*?)</x>"
y_re  = "<y[^>]*>(.*?)</y>"

for each x_match in input.match_all(x_re)
  for each y_match in x_match.group(1).value.match_all(y_re)
    print y_match.group(1).value
  next y_match
next x_match

Pseudo-output:

a
b
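The pseudo-code above might translate to Java roughly as follows. This is only a sketch: the class and method names are made up, and two details go beyond the original patterns and are my own assumptions: a `\b` after the tag name (so that `<x` does not also match a tag like `<xyz>`) and the DOTALL flag (so that `.` can span line breaks).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the outer/inner matching loop.
final class NestedMatchSketch {

    // Group 1 captures everything between an opening <x ...> and the next </x>.
    private static final Pattern X_RE =
            Pattern.compile("<x\\b[^>]*>(.*?)</x>", Pattern.DOTALL);
    // Group 1 captures everything between an opening <y ...> and the next </y>.
    private static final Pattern Y_RE =
            Pattern.compile("<y\\b[^>]*>(.*?)</y>", Pattern.DOTALL);

    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher xMatch = X_RE.matcher(input);
        while (xMatch.find()) {
            // Search for Y tags only inside the content of the current X tag.
            Matcher yMatch = Y_RE.matcher(xMatch.group(1));
            while (yMatch.find()) {
                result.add(yMatch.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extract("<x>...<y>a</y>...<y>b</y>...</x>")); // prints [a, b]
    }
}
```

Like the pseudo-code, this relies on X tags not being nested inside other X tags and Y tags not being nested inside other Y tags.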

Further clarification in the comments revealed that any X element can contain an arbitrary number of Y elements. This means there can be no single regex that matches them and extracts their contents.

Tomalak 2008-12-04 13:38:49

Thanks. I didn't mention it from the start, but I do understand how to achieve the goal using loops. But I'm 100% sure there's a solution that uses a single "match" operation - and that's what I'm trying to figure out :)

2008-12-04 13:45:18

If there is no hard limit on the number of Y elements, then there is no regex-only solution. What makes you so sure there is? Maybe the question is missing more details.

Tomalak 2008-12-04 13:52:27

Hmmm... just a feeling. Maybe I'm wrong, though.

2008-12-04 14:10:42

As I said, your question is missing some details on the exact structure of the strings you expect to be dealing with.

Tomalak 2008-12-04 14:12:47

There's no limit on the number of "y" tags (hence no limit on the resulting collection size). What other details do you think I need to provide?

2008-12-04 14:14:21

Nesting or no nesting, whether elements can have attributes or not, whether it is machine-generated or user input, and last but not least: what are you *really* trying to do? Maybe there are better ways than regex to get you there.

Tomalak 2008-12-04 14:20:10

"y" cannot be nested within another "y", but it can be nested within some other tag, say "z". The input is machine-generated, non-valid HTML (HTML-like) that I cannot control. Let's put it like this: currently I would like to know whether it is possible to do the task with a single regexp match.

2008-12-04 14:25:00

And attributes are possible inside any tag (including "x" and "y").

2008-12-04 14:25:33

Okay, I see. If you want to get hold of the textual content of the Y tags, you won't be able to do it with a single regex, period. This is not the way regular expressions work. Use a loop as indicated; I'll expand my regex to accommodate the attributes.

Tomalak 2008-12-04 15:02:38

Answer 4

A:

It would help if we knew what language or tool you're using; there's a great deal of variation in syntax, semantics, and capabilities. Here's one way to do it in Java:

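import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>";
// Match <y ...>, then look ahead: the next x tag (opening or closing) must be a </x>.
String regex = "<y[^>]*+>(?=(?:[^<]++|<(?!/?+x\\b))*+</x>)(.*?)</y>";
Matcher m = Pattern.compile(regex).matcher(str);
while (m.find())
{
  System.out.println(m.group(1));  // prints a, then b
}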

Once I've matched a <y>, I use a lookahead to affirm that there's a </x> somewhere up ahead, but there's no <x> between the current position and it. Assuming the pseudo-HTML is reasonably well-formed, that means the current match position is inside an "x" element.

I used possessive quantifiers heavily because they make things like this so much easier, but as you can see, the regex is still a bit of a monster. Aside from Java, the only regex flavors I know of that support possessive quantifiers are PHP and the JGS tools (RegexBuddy/PowerGrep/EditPad Pro). On the other hand, many languages provide a way to get all of the matches at once, but in Java I had to code my own loop for that.

So it is possible to do this job with one regex, but a very complicated one, and both the regex and the enclosing code have to be tailored to the language you're working in.
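On the point about coding the loop by hand: newer Java versions (9 and later) added `Matcher.results()`, which returns all matches as a stream, so the explicit loop is no longer needed. A minimal sketch, reusing the regex from this answer unchanged; the class and method names are made up for illustration:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical wrapper around the regex from the answer above.
final class AllMatchesSketch {

    // Matches a <y> element only when the next x tag ahead is a closing </x>.
    private static final Pattern Y_INSIDE_X = Pattern.compile(
            "<y[^>]*+>(?=(?:[^<]++|<(?!/?+x\\b))*+</x>)(.*?)</y>");

    static List<String> extract(String input) {
        return Y_INSIDE_X.matcher(input)
                .results()                 // stream of all matches, Java 9+
                .map(r -> r.group(1))      // keep only the captured tag content
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String str = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>";
        System.out.println(extract(str)); // prints [a, b]
    }
}
```

The regex engine still finds one match at a time under the hood; the stream only hides the iteration.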

Alan Moore 2008-12-05 00:45:09
