I have the following text:

abcabcabcabc<2007-01-12><name1><2007-01-12>abcabcabcabc<name2><2007-01-11>abcabcabcabc<name3><2007-02-12>abcabcabcabc<name4>abcabcabcabc<2007-03-12><name5><date>abcabcabcabc<name6>

I need to use regular expressions in order to clean the above text:

The basic extraction rule is:

<2007-01-12>abcabcabcabc<name2>

I have no problem extracting this pattern. My issue is that the text contains malformed sequences: if a sequence doesn't start with a date and end with a name, my extraction fails. For example, the text above may have several malformed sequences, such as:

abcabcabcabc<2007-01-12><name1>

Should be:

<2007-01-12>abcabcabcabc<name1>

Is it possible to have a regular expression that would clean the above prior to extracting my consistent pattern? In short, I need to find all malformed sequences and, for each one, move the date tag to the front, as in the example above.

Thanks.

+1  A: 

Do you need something like this perhaps?

public class Extract {
    public static void main(String[] args) {
        String text =
            "abcabcabcabc<2007-01-12><name1>" +
            "<2007-01-12>abcabcabcxxx<name2>" +
            "<2007-01-11>abcabcabcyyy<name3>" +
            "<2007-02-12>abcabcabczzz<name4>" +
            "abcabcabc123<2007-03-12><name5>" +
            "<date>abcabcabc456<name6>";
        System.out.println(
            text.replaceAll(
                "(text)<(text)>(text)<(text)>"
                    .replace("text", "[^<]*"),
                "$1$3 - $2 - $4\n"
            )
        );
    }
}

This prints:

abcabcabcabc - 2007-01-12 - name1
abcabcabcxxx - 2007-01-12 - name2
abcabcabcyyy - 2007-01-11 - name3
abcabcabczzz - 2007-02-12 - name4
abcabcabc123 - 2007-03-12 - name5
abcabcabc456 - date - name6

Essentially, there are 3 parts:

  • The naked text is captured by groups 1 and 3 ($1 and $3 in the replacement string) -- one of these should be an empty string
  • The date is group 2 ($2)
  • The name is group 4 ($4)
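If the goal is to rewrite each sequence into the consistent `<date>text<name>` form rather than print it, the same groups can simply be reordered in the replacement. A minimal sketch, assuming every sequence has exactly two tags with the date tag first; the class and method names (`RecordCleaner`, `clean`) are made up for illustration:

```java
class RecordCleaner {
    // Same pattern as in the answer: text, <tag>, text, <tag>.
    private static final String RECORD = "([^<]*)<([^<]*)>([^<]*)<([^<]*)>";

    // Rewrites every sequence as <date>text<name>.
    // One of $1 / $3 is empty, so concatenating them yields the naked text.
    static String clean(String text) {
        return text.replaceAll(RECORD, "<$2>$1$3<$4>");
    }

    public static void main(String[] args) {
        String text =
            "abcabcabcabc<2007-01-12><name1>" +
            "<2007-01-12>abcabcabcxxx<name2>";
        System.out.println(clean(text));
    }
}
```

The already well-formed sequence passes through unchanged, so the cleaning step can be run over the whole text before the regular extraction.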

You can of course use a Matcher and extract the individual groups too.
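A rough sketch of the Matcher variant, assuming the same two-tag pattern; `RecordExtractor` and `extract` are hypothetical names, and each sequence is returned as a plain `{date, name, note}` array just to keep the example short:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordExtractor {
    private static final Pattern RECORD =
        Pattern.compile("([^<]*)<([^<]*)>([^<]*)<([^<]*)>");

    // Returns one {date, name, note} array per sequence found in the text.
    static List<String[]> extract(String text) {
        List<String[]> records = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        while (m.find()) {
            // The note is in group 1 or group 3; the other one is empty.
            String note = m.group(1) + m.group(3);
            records.add(new String[] { m.group(2), m.group(4), note });
        }
        return records;
    }

    public static void main(String[] args) {
        String text =
            "abcabcabc123<2007-03-12><name5>" +
            "<2007-01-11>abcabcabcyyy<name3>";
        for (String[] r : extract(text)) {
            System.out.println(String.join(" | ", r));
        }
    }
}
```

Working with the groups directly makes it easier to store the results in a collection instead of building one big output string.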

polygenelubricants
Thanks, Poly. All is good when the same two patterns hold throughout the extracted text. Unfortunately, there are several other patterns (I added one to the original message) that I need to clean and extract, and these other patterns break the final output (date, name, and text note, where "name" is the author of the note and "date" is the date the note was created). Because of these other patterns, I'm not sure how I should proceed.
Brian
Should I run the patterns separately and, once a pattern has been extracted, fix the matches and store them in an array, then proceed to the next pattern, assuming the text matched by the previous pattern has been removed? Hopefully I haven't confused anyone.
Brian
@Brian: there's enough information in my answer to adapt the technique. Essentially capture the naked text in two groups, before and after the date, and concatenate the result.
polygenelubricants
@Brian: I can't play catch-up with you. I gave you the solution for the original spec, but then you broke it by changing the spec. I can give the solution for the modified spec, but then it might break if the spec changes again. Either give me the full spec, or learn the essence of the solution and adapt it yourself.
polygenelubricants
Totally right, and I apologize for that. Based on our analysis of the quality of the data, we've identified several issues, along with the patterns that can be used to clean them. The example described above is one of them (there are about 4 others). All the issues can exist within the same text. Regular expressions are used to find them and then clean them. I figured it would be too much to submit everything at once, which is why I went piecemeal; however, that had its downside.
Brian
@Brian: quick question: can the date part be just `<date>`, or `<tomorrow>`, or `<yesterday>`, or `<fourth of july>`, etc, or is it always the `<yyyy-mm-dd>` pattern?
polygenelubricants
The format is always <yyyy-mm-dd>. However, there are other situations where the date is missing for the provided text. In that situation we need to use the first date found after the text. You may also find text that has neither a date nor an author; here again, we need to use the first date and author found following the text. I will try to provide an example with a more complete explanation.
Brian
Poly, I've sent you an email; not sure if you've received it.
Brian