views:

559

answers:

5

I'm having some difficulty with a specific Regex I'm trying to use. I'm searching for every occurrence of a string (for my purposes, I'll say it's "mystring") in a document, EXCEPT where it's in a tag, e.g.

<a href="_mystring_">

should not match, but

<a href="someotherstring">_mystring_</a>

should match, since it's not inside a tag (inside meaning "inside the < and > markers"). I'm using .NET's regex functions for this.

A: 

What are you trying to achieve? Aren't you better off using an XML parser?

Ropstah
A: 

Why use regex?

For xhtml, load it into XDocument / XmlDocument; for (non-x)html the Html Agility Pack would seem a more sensible choice...

Either way, that will parse the html into a DOM so you can iterate over the nodes and inspect them.
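
To illustrate the parse-then-inspect idea, here is a minimal sketch. It uses Python's standard-library html.parser as a stand-in, not XDocument or the Html Agility Pack named above; the class and function names are made up for this example. The point is only that a parser hands you text nodes separately from tags and attributes, so attribute values are never searched.

```python
from html.parser import HTMLParser


class TextOccurrenceFinder(HTMLParser):
    """Collects text chunks (outside tags) that contain the needle."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.hits = []

    def handle_data(self, data):
        # handle_data only receives text between tags, never tag names
        # or attribute values, so href="mystring" is not inspected here.
        if self.needle in data:
            self.hits.append(data)


def find_in_text(html, needle):
    """Return the text chunks of `html` that contain `needle`."""
    finder = TextOccurrenceFinder(needle)
    finder.feed(html)
    finder.close()
    return finder.hits


sample = '<a href="mystring">see mystring here</a>'
print(find_in_text(sample, "mystring"))
```

Only the link text is reported; the occurrence inside the href attribute is skipped because the parser never passes it to handle_data.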

Marc Gravell
A: 

Regular expression searches are typically not a good idea in XML. It's too easy to run into problems with search expressions matching too much or too little. It's also almost impossible to formulate a regex that correctly identifies and handles CDATA sections, processing instructions (PIs), and the escape sequences that XML allows.

Unless you have complete control over the XML content you're getting and can guarantee it won't include such constructs (and won't change), I would advise using an XML parser of some kind (XDocument or XmlDocument in .NET, for instance).

Having said that, if you're still intent on using regex as your search mechanism, something like the following should work with the Regex class in .NET. You may want to try it against some of your own test cases at a site like RegexLib. You may also be able to search their regular expression catalog for something that fits your needs.

>[^<]*(mystring)[^<]*<

LBushkin
A: 

Ignoring that there are indeed other ways, and that I'm no real regex expert, one thing that popped into my head was:

  • find all the mystrings that ARE in tags first - because I can't write the expression to do the opposite :)
  • change those to something else
  • then replace all the other mystrings (the ones left that are not in tags) as you need
  • restore the original mystrings that were in tags

So, using <[^>]*?(mystring)[^>]*> you can find the tagged ones. Replace those with otherstring. Do your normal replace on the mystrings that are left. Finally, change otherstring back to mystring.

Crude but effective....maybe.
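
The four steps above can be sketched roughly as follows. This is an illustration in Python rather than .NET, and it varies the idea slightly: it matches every tag with <[^>]*> and masks all occurrences inside it, instead of capturing a single mystring per tag. PLACEHOLDER and replace_outside_tags are hypothetical names, and the placeholder is assumed never to occur in the real input.

```python
import re

# Hypothetical temporary token; assumed not to appear in the input.
PLACEHOLDER = "\x00MASK\x00"

TAG = re.compile(r"<[^>]*>")


def replace_outside_tags(html, needle, replacement):
    """Replace `needle` with `replacement` everywhere except inside tags."""
    # Steps 1-2: inside each <...> tag, swap the needle for the placeholder.
    masked = TAG.sub(lambda m: m.group(0).replace(needle, PLACEHOLDER), html)
    # Step 3: every needle still present is outside a tag, so replace it.
    replaced = masked.replace(needle, replacement)
    # Step 4: restore the original needle inside the tags.
    return replaced.replace(PLACEHOLDER, needle)


print(replace_outside_tags('<a href="mystring">mystring</a>', "mystring", "X"))
```

The attribute value survives untouched while the link text is replaced. The approach breaks down if a text node contains a literal < or >, or if the placeholder happens to appear in the document.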

cdm9002
+5  A: 

This should do it:

(?<!<[^>]*)_mystring_

It uses a negative lookbehind to check that the matched string does not have a < before it without a corresponding >. Note that this relies on .NET allowing variable-length patterns inside a lookbehind.
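
For anyone trying this outside .NET: Python's built-in re module only accepts fixed-width lookbehind, so the pattern above cannot be used there verbatim. A rough sketch of the mirror-image check, using a negative lookahead instead (a different technique from the one in this answer, shown only for comparison), might look like this. The name OUTSIDE_TAG is made up for the example.

```python
import re

# Mirror image of the lookbehind check: reject a candidate when a '>'
# follows it with no '<' or '>' in between, i.e. when a tag is still open.
OUTSIDE_TAG = re.compile(r"_mystring_(?![^<>]*>)")

sample = '<a href="_mystring_">_mystring_</a>'

# Only the occurrence in the link text is replaced.
print(OUTSIDE_TAG.sub("X", sample))
```

The two variants fail on different inputs: the lookbehind is fooled by a stray < in the text, the lookahead by a stray >.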

Nick Higgs
Though I needed a few more rules added to the lookbehind and such for my specific needs, this is what got things working for me. Thank you!
Sukasa
Wow, that's a beautiful regex! @Sukasa, can you post the final one that you came up with?
travis