EDIT:
OK now that I understand this can go across tags I think I understand the difficulty here.
The only algorithm I can think of here is to walk the XML tree reading the text portions searching for your match - you'll need to do this matching yourself character by character across multiple nodes. The difficulty of course is to not munge the tree in the process...
Here's how I would do it:
Create a walker to walk the XML tree. Whenever you think you've found the start of the string match, save the current parent node. When (and if) you find the end of your string match, check whether the saved node is the same as the end node's parent. If they are the same, then it's safe to modify the tree.
Example doc:
<doc>This is a an <b>example text I made up</b> on the spot! Nutty.</doc>
Test 1:
Match: example text
The walker would walk along until it finds the "e" in "example", save the parent node (the <b> node), and keep walking until it found the end of "text", where it would check whether it was still in the same reference node, <b>. It is, so this is a match and you can tag it however you like.
Test 2:
Match: an example
The walker would first hit "a" and quickly reject it, then hit "an" and save the <doc> node. It would continue to match over to the "example" text until it realizes that the parent node of "example" is <b> and not <doc>, at which point the match fails and no node is inserted.
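The same-parent check from the two tests above can be sketched in Python. This is only a minimal illustration, assuming xml.etree.ElementTree as the parser; the helper names text_segments and find_in_single_parent are made up for this example. Instead of stepping one character at a time, it flattens the text, remembers which element owns each character, and accepts an occurrence only if every character has the same owner.

```python
import xml.etree.ElementTree as ET

def text_segments(elem):
    """Yield (parent_element, text) pairs in document order."""
    if elem.text:
        yield elem, elem.text
    for child in elem:
        yield from text_segments(child)
        if child.tail:
            # Text after </child> belongs to the enclosing element.
            yield elem, child.tail

def find_in_single_parent(root, needle):
    """Return (parent, offset_in_flat_text) for the first occurrence of
    needle whose characters all share one parent element, else None."""
    if not needle:
        return None
    parts, owners = [], []
    for parent, text in text_segments(root):
        parts.append(text)
        owners.extend([parent] * len(text))
    flat = "".join(parts)
    start = flat.find(needle)
    while start != -1:
        span = owners[start:start + len(needle)]
        if all(owner is span[0] for owner in span):
            return span[0], start
        start = flat.find(needle, start + 1)  # try the next occurrence
    return None

DOC = "<doc>This is a an <b>example text I made up</b> on the spot! Nutty.</doc>"
root = ET.fromstring(DOC)
print(find_in_single_parent(root, "example text"))  # Test 1: inside <b>
print(find_in_single_parent(root, "an example"))    # Test 2: None, spans <doc> and <b>
```

Note that an occurrence whose text sits entirely in one parent but straddles an empty child element would still be accepted by this sketch; whether that is "safe" depends on what you want to do with the tree afterwards.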
Implementation 1:
If you are only matching straight text, then a simple matcher built on a Java parser (SAX or something similar) seems like the way to go here.
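For illustration, here is roughly what such an event-driven matcher could look like. It is written in Python's xml.sax rather than Java so that it matches the pseudocode further down; the class name LiteralMatcher and its attributes are invented for this sketch. It keeps a list of partial matches and drops any of them as soon as text arrives from a different element than the one the match started in.

```python
import xml.sax

class LiteralMatcher(xml.sax.ContentHandler):
    """Streaming literal matcher: records the tag of every element whose
    own text fully contains the needle."""

    def __init__(self, needle):
        super().__init__()
        if not needle:
            raise ValueError("needle must not be empty")
        self.needle = needle
        self.stack = []       # (serial, tag) of the currently open elements
        self.serial = 0       # gives each element instance a unique id
        self.candidates = []  # partial matches as (owner, chars_matched)
        self.matches = []     # tags of elements containing a full match

    def startElement(self, name, attrs):
        self.serial += 1
        self.stack.append((self.serial, name))

    def endElement(self, name):
        self.stack.pop()

    def characters(self, content):
        if not self.stack:
            return
        owner = self.stack[-1]  # element that owns this chunk of text
        for ch in content:
            survivors = []
            # Every character may also start a brand new candidate.
            for cand_owner, matched in self.candidates + [(owner, 0)]:
                if cand_owner == owner and ch == self.needle[matched]:
                    if matched + 1 == len(self.needle):
                        self.matches.append(owner[1])
                    else:
                        survivors.append((cand_owner, matched + 1))
            self.candidates = survivors

DOC = b"<doc>This is a an <b>example text I made up</b> on the spot! Nutty.</doc>"
for needle in ("example text", "an example"):
    handler = LiteralMatcher(needle)
    xml.sax.parseString(DOC, handler)
    print(needle, handler.matches)
```

Tracking several candidates at once costs more than a single pointer, but a single pointer that restarts from scratch on a mismatch can miss overlapping occurrences (for example "aab" inside "aaab").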
Implementation 2:
If the matching input is itself a regex, then you'll need something very special. I know of no engine that would work here for sure; what you might be able to do is write something ugly to do it... Maybe some sort of recursive walker that breaks the XML tree down into smaller and smaller node sets, searching the complete text at each level...
Very rough code, sketched in Python with xml.etree.ElementTree standing in for whatever parser you actually use:

    import re
    import xml.etree.ElementTree as ET  # assumption: any tree parser would do

    def search(raw, regex):
        tree = ET.fromstring(raw)
        return search_xml(tree, re.compile(regex))

    def search_xml(node, pattern):
        text = "".join(node.itertext())      # flattened text of this subtree
        if pattern.search(text) is None:
            return None                      # no match in this subtree
        for child in node:                   # check if any child contains the whole match
            found = search_xml(child, pattern)
            if found is not None:
                return found
        # Matches some combination of text/nodes at this level, but not at a
        # sublevel. ElementTree stores text nodes as plain strings, so a match
        # inside a single text node also ends up returning its element here.
        return node
Once you know which node should contain your match, I'm not sure what you can do next, though, because the regex alone doesn't tell you the index inside each text node where the change is needed... Maybe someone has a regex out there you can modify...
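One possible way around this, offered only as a sketch: run the regex on the flattened text, then translate the match's start and end offsets back into per-node offsets. The example below assumes xml.etree.ElementTree, where text lives in an element's text attribute and in the tail attribute of its children; text_slots and locate are hypothetical helper names.

```python
import re
import xml.etree.ElementTree as ET

def text_slots(elem):
    """Yield (element, attribute_name, text) for each text slot in document order."""
    if elem.text:
        yield elem, "text", elem.text
    for child in elem:
        yield from text_slots(child)
        if child.tail:
            # ElementTree stores the text after </child> on the child itself.
            yield child, "tail", child.tail

def locate(root, pattern):
    """Return [(element, attr, local_start, local_end), ...] covering the
    first regex match in the flattened text, or [] if there is none."""
    slots = list(text_slots(root))
    flat = "".join(text for _, _, text in slots)
    m = re.search(pattern, flat)
    if m is None:
        return []
    pieces, pos = [], 0
    for elem, attr, text in slots:
        # Intersect the match span with this slot's span in the flat text.
        lo = max(m.start(), pos)
        hi = min(m.end(), pos + len(text))
        if lo < hi:
            pieces.append((elem, attr, lo - pos, hi - pos))
        pos += len(text)
    return pieces

DOC = "<doc>This is a an <b>example text I made up</b> on the spot! Nutty.</doc>"
root = ET.fromstring(DOC)
for elem, attr, start, end in locate(root, r"an\s+ex\w+"):
    print(elem.tag, attr, repr(getattr(elem, attr)[start:end]))
```

If the result has a single piece, the match sits in one text node and can be edited in place; more than one piece means the match crosses node boundaries, which is the case the parent check above is meant to reject.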