I am trying to match the text content (character data) of an XML file against a series of regexes and then change the XML based on the matches. Example:

 <text>
 <para>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
 </para>
 </text>

I want to match for instance the following regex to the text:

\bdolor.\b

For each match I want to, for instance, surround the match with tags, so that the above turns into:

<text>
<para>Lorem ipsum <bold>dolor</bold> sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et <bold>dolore</bold> magna aliqua.
</para>
</text>

One further complication is that the text (character data) I want to match against might span several tags.

I guess what I am trying to do is very similar to what a word processor has to do when a search selects a matching part of the text and you then, for instance, change the formatting of the selected text.

I would like to use Java (actually Clojure) to do this, and I intend to use JAXB to manipulate the XML document.

How do I go about doing the above?

A: 

EDIT:

OK, now that I understand this can go across tags, I think I understand the difficulty here.

The only algorithm I can think of here is to walk the XML tree reading the text portions searching for your match - you'll need to do this matching yourself character by character across multiple nodes. The difficulty of course is to not munge the tree in the process...

Here's how I would do it:

Create a walker to walk the XML tree. Whenever you think you've found the start of the string match, save whatever the current parent node is. When (and if) you find the end of your string match, check whether the saved node is the same as the end node's parent. If they are the same, then it's safe to modify the tree.

Example doc:

<doc>This is a an <b>example text I made up</b> on the spot! Nutty.</doc>

Test 1: Match: example text

The walker would walk along until it finds the "e" in "example", save the parent node (the <b> node), and keep walking until it finds the end of "text", where it would check whether it is still under the same reference node <b>. It is, so this is a match and you can wrap it in a tag or whatever.

Test 2: Match: an example

The walker would first hit "a" and quickly reject it, then hit "an" and save the <doc> node. It would continue matching into "example" until it realizes that the parent node of "example" is <b> and not <doc>, at which point the match fails and no node is inserted.

Implementation 1:

If you are only matching straight text, then a simple matcher built on a Java parser (SAX or something) seems like the way to go here.

Implementation 2:

If the matching input is itself a regex, then you'll need something very special. I know of no engine that would work here for sure; what you might be able to do is write a bit of ugly something to do it... Maybe some sort of recursive walker that breaks the XML tree down into smaller and smaller node sets, searching the complete text at each level...

Very rough code (Python, using xml.etree.ElementTree):

import re
import xml.etree.ElementTree as ET


def search(raw, regex):
    # Parse the document, then look for the smallest subtree containing a match.
    return search_xml(ET.fromstring(raw), re.compile(regex))


def search_xml(elem, pattern):
    flat = "".join(elem.itertext())  # all text under this element, tags stripped
    if pattern.search(flat) is None:
        return None  # no match in this subtree
    # Check if the match is contained in a single text node of this element.
    own_text = [elem.text] + [child.tail for child in elem]
    if any(t and pattern.search(t) for t in own_text):
        return elem
    # Check if any of the children contain the match.
    for child in elem:
        found = search_xml(child, pattern)
        if found is not None:
            return found
    # Matches some combination of text/nodes at this level,
    # but not at a sublevel.
    return elem

Once you know which node should contain your match, I'm not sure what you can do next, though, because the regex doesn't tell you the index inside that node's text where the tags need to go... Maybe someone has a regex engine out there that you can modify...
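For what it's worth, if the regex is run over the flattened text rather than node by node, the match offsets can be translated back into positions inside the individual text nodes. A minimal sketch of that bookkeeping, in Python with xml.dom.minidom; text_nodes and locate are hypothetical helper names, not part of any library:

```python
import re
from xml.dom import minidom


def text_nodes(node):
    """Yield the text nodes under `node` in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


def locate(root, pattern):
    """For each regex match in the flattened text, return a list of
    (text node, local start, local end) slices that the match covers."""
    nodes = list(text_nodes(root))
    flat = "".join(n.data for n in nodes)
    results = []
    for m in re.finditer(pattern, flat):
        slices = []
        pos = 0  # offset of the current node within the flattened text
        for n in nodes:
            lo = max(m.start(), pos) - pos
            hi = min(m.end(), pos + len(n.data)) - pos
            if lo < hi:  # the match overlaps this text node
                slices.append((n, lo, hi))
            pos += len(n.data)
        results.append(slices)
    return results


doc = minidom.parseString("<p>In <i>this</i> example</p>")
for node, lo, hi in locate(doc.documentElement, r"his ex")[0]:
    print(repr(node.data[lo:hi]), "inside", node.parentNode.tagName)
# 'his' inside i
# ' ex' inside p
```

This only finds where the tags would have to go; it doesn't touch the tree.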

Petriborg
My problem is that the text I want to match against will span several tags. Extracting the text and chaining it together is not a problem, and neither is finding the start and end index of the match(es), but getting back to the XML and inserting the start and end tags in the right places - in the middle of the character data - is.
mac
A: 

I take it that "the text I want to match against will span several tags" means something like this:

 In <i>this</i> example, I want to match "In this example".

 In <i><b>this</b></i> example, I also want to match "In this example".

 And <i>in <b>this</b></i> example, it's clear I have to ignore case too.

This seems like an especially hard problem because the transformation you're talking about can result in XML that's not well-formed - e.g. look what happens if you try to put tags around the substring here:

In this <i>example, putting tags around "in this example"</i> will break things.

<i>And in this</i> example, you have a similar problem.

To produce well-formed output, you'd probably need it to look like:

<bold>In this <i>example</i></bold><i>, putting tags around "in this example"</i> will break things.

<i>And <bold>in this</bold></i><bold> example</bold>, you have a similar problem.

In theory, every character you're matching could be in a different element:

Almost like <i><u>i</u><u>n</u> </i><u>th</u>is<i><i><u> ex</u>am</i>ple.</i>

You have basically two problems here, and neither is simple:

  1. Search a stream of XML for a substring, ignoring everything that's not a text node, and return the start and end positions of the substring within the stream.

  2. Given two arbitrary indexes into an XML document, create an element enclosing the text between those indexes, closing (and reopening) any elements whose tags span either but not both of the two indexes.

It's pretty clear to me that XSLT and regular expressions won't help you here. I don't think using a DOM will help you here, either. In fact I don't think that there's an answer to the second problem that doesn't involve writing a parser.
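That said, problem 2 gets easier if you relax it: instead of one enclosing element that forces other elements to be closed and reopened, wrap each affected run of text in its own element. The output differs from the well-formed examples earlier in this answer (several small <bold> elements rather than one), but it stays well-formed. A rough sketch under that assumption, in Python with xml.dom.minidom; wrap_range and wrap_matches are invented names:

```python
import re
from xml.dom import minidom


def text_nodes(node):
    """Yield the text nodes under `node` in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


def wrap_range(doc, start, end, tag="bold"):
    """Wrap flattened-text characters [start, end) in <tag> elements,
    one wrapper per text node that the range touches."""
    pos = 0
    for node in list(text_nodes(doc.documentElement)):
        length = len(node.data)
        lo = max(start, pos) - pos
        hi = min(end, pos + length) - pos
        pos += length
        if lo >= hi:
            continue  # range does not touch this text node
        if hi < length:
            node.splitText(hi)         # cut off the text after the range
        if lo > 0:
            node = node.splitText(lo)  # cut off the text before the range
        wrapper = doc.createElement(tag)
        node.parentNode.replaceChild(wrapper, node)
        wrapper.appendChild(node)


def wrap_matches(xml, pattern, tag="bold"):
    doc = minidom.parseString(xml)
    flat = "".join(n.data for n in text_nodes(doc.documentElement))
    # Wrapping never changes the flattened text, so offsets stay valid.
    for m in re.finditer(pattern, flat):
        wrap_range(doc, m.start(), m.end(), tag)
    return doc.documentElement.toxml()


print(wrap_matches("<p>In this <i>example, putting tags</i> around</p>",
                   r"this example"))
# <p>In <bold>this </bold><i><bold>example</bold>, putting tags</i> around</p>
```

Whether several small wrappers are acceptable depends on what consumes the XML afterwards; if you really need a single enclosing element, the closing and reopening described in problem 2 is still required.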

This isn't really an answer, I know.

Robert Rossney
Thanks for your thoughts. At least I know my question is decipherable :-). It occurred to me that the operation is very similar to what a word processor has to do when a search selects a matching part of the text and you then, for instance, change the formatting of the selected text.
mac