What would be the correct way to find a string like this in a large XML document:

<ser:serviceItemValues>
    <ord1:label>Start Type</ord1:label>
    <ord1:value>Loop</ord1:value>
    <ord1:valueCd/>
    <ord1:activityCd>iactn</ord1:activityCd>
 </ser:serviceItemValues>

The XML will contain many repeats of the element above with different values (Loop, etc.), along with other XML elements. What I mainly care about is whether there is a serviceItemValues element that does not have 'Loop' as its value. I tried this, but it doesn't seem to work:

private static Pattern LOOP_REGEX =
        Pattern.compile("[\\p{Print}]*?<ord1:label>Start Type</ord1:label>[\\p{Print}]+[^(Loop)][\\p{Print}]+</ser:serviceItemValues>[\\p{Print}]*?", Pattern.CASE_INSENSITIVE|Pattern.MULTILINE);

Thanks

+4  A: 

Regular expressions are not the best option when parsing large amounts of HTML or XML.

There are a number of ways you could handle this without relying on regular expressions. Depending on the libraries you have at your disposal, you may be able to find the elements you're looking for by using XPath.

Here's a tutorial that may help you on your way: http://www.totheriver.com/learn/xml/xmltutorial.html

Doomspork
+3  A: 

Regular expressions are not the right tool for this job. You should be using an XML parser. It's pretty simple to set up and use, and it will probably take you less time to code than it would to come up with this regular expression.

I recommend using JDOM. It has an easy syntax. An example can be found here: http://notetodogself.blogspot.com/2008/04/teamsite-dcr-java-parser.html

If the documents that you will be parsing are large, you should use a SAX parser. I recommend Xerces.
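A minimal sketch of the SAX approach, using only the SAX API bundled with the JDK rather than a separately installed Xerces. The class name `StartTypeSaxCheck` and the method `findNonLoopStartTypes` are made up for illustration. It assumes the default (non-namespace-aware) parser factory, so elements are matched on their prefixed names exactly as they appear in the question; if your document uses different prefixes, the names would need adjusting.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
class StartTypeSaxCheck {

    /** Collects the value of every "Start Type" item whose value is not "Loop". */
    static class Handler extends DefaultHandler {
        final List<String> nonLoopValues = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private String label;
        private String value;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0); // start collecting text for the new element
            if (qName.equals("ser:serviceItemValues")) {
                label = null;
                value = null;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per element
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            switch (qName) {
                case "ord1:label":
                    label = text.toString().trim();
                    break;
                case "ord1:value":
                    value = text.toString().trim();
                    break;
                case "ser:serviceItemValues":
                    if ("Start Type".equals(label) && !"Loop".equals(value)) {
                        // A missing value element is recorded as an empty string.
                        nonLoopValues.add(value == null ? "" : value);
                    }
                    break;
                default:
                    break;
            }
        }
    }

    static List<String> findNonLoopStartTypes(String xml) {
        try {
            // Default factory is not namespace-aware, so qName carries the prefix.
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            Handler handler = new Handler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.nonLoopValues;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

For a large file you would hand the parser a `File` or `InputStream` instead of a `String`; the handler itself stays the same and never holds more than one item in memory.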

mkoryak
+1  A: 

When dealing with XML, you should probably not use regular expressions to check the content. Instead, use either a SAX-based routine to check the relevant contents or a DOM-like model (preferably pull-based if you're dealing with large documents).
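As an illustration of the pull-based option, here is a small sketch using StAX (`javax.xml.stream`), which ships with the JDK. The class name `StartTypePullCheck` and the method `hasNonLoopStartType` are hypothetical. Since the question doesn't show the namespace URIs, the sketch matches elements on their local names only, which assumes no unrelated `label` or `value` elements are nested inside a `serviceItemValues` element.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of any library.
class StartTypePullCheck {

    /** Returns true if any "Start Type" item has a value other than "Loop". */
    static boolean hasNonLoopStartType(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            String label = null;
            String value = null;
            boolean found = false;

            // Pull events one at a time and stop as soon as a match is seen.
            while (!found && reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("serviceItemValues")) {
                        label = null;
                        value = null;
                    } else if (name.equals("label")) {
                        // getElementText() reads up to the matching end tag.
                        label = reader.getElementText().trim();
                    } else if (name.equals("value")) {
                        value = reader.getElementText().trim();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals("serviceItemValues")) {
                    found = "Start Type".equals(label) && !"Loop".equals(value);
                }
            }
            reader.close();
            return found;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Unlike a SAX handler, the loop is in control of the parser, so it can simply stop reading once the first offending item turns up.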

Of course, if you're trying to validate the document's contents somehow, you should probably use some schema tool (I'd go with RELAX NG or Schematron, but I guess you could use XML Schema).

djc
+2  A: 

Look up XPath, which is kinda like regex for XML. Sort of.

With XPath you write expressions that extract information from XML documents, so extracting the nodes which don't have Loop as a sub-node is exactly the sort of thing it's cut out for.
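In Java, the standard `javax.xml.xpath` API can run this kind of query. What follows is only a sketch: the class name `StartTypeXPathCheck` and the method `countNonLoopStartTypes` are made up, and because a prefixed expression would need a `NamespaceContext` bound to namespace URIs that the question doesn't show, this version matches on `local-name()` instead. It also checks the label, so only "Start Type" items are considered. Note that it loads the whole document into a DOM, which may not suit very large files.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of any library.
class StartTypeXPathCheck {

    // Selects serviceItemValues elements that are labelled "Start Type"
    // and have no value child whose text is exactly "Loop".
    static final String NON_LOOP_START_TYPES =
            "//*[local-name()='serviceItemValues']"
            + "[*[local-name()='label']='Start Type']"
            + "[not(*[local-name()='value']='Loop')]";

    static int countNonLoopStartTypes(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // makes local-name() well defined
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList matches = (NodeList) xpath.evaluate(
                    NON_LOOP_START_TYPES, doc, XPathConstants.NODESET);
            return matches.getLength();
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not parse or query XML", e);
        }
    }
}
```

A result greater than zero would mean at least one "Start Type" item has a value other than Loop.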

I haven't tried this, but as a first stab, I'd guess the XPath expression would look something like:

"//ser:serviceItemValues/ord1:value[text()!='Loop']/parent::*"
izb
Stop upvoting this, you all know this is the wrong way to approach the problem :(
Esko
Why is this wrong? This is exactly what xpath is for, isn't it?
izb
+1  A: 

As mentioned in the other answers, regular expressions are not the tool for the job. You need an XPath engine. If you want to do these things from the command line, though, I recommend installing XMLStarlet. I have had very good experiences with this tool when solving various XML-related tasks. Depending on your OS, you might be able to just install the xmlstarlet RPM or deb package. MacPorts on Mac OS X includes the package as well, I think.

Hardy
Oops, you wanted to do it in Java. Well, XMLStarlet is still a cool tool.
Hardy
A: 

See this answer for a thorough explanation.

Esko