ansaurus

Question

capture text, including tags from string, and then reorder tags with text

Answer 1

Do you need something like this perhaps?

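public class Extract {
    public static void main(String[] args) {
        // Sample input: each record is text + <date> + <name>,
        // with the text either before or after the date tag.
        String text =
            "abcabcabcabc<2007-01-12><name1>" +
            "<2007-01-12>abcabcabcxxx<name2>" +
            "<2007-01-11>abcabcabcyyy<name3>" +
            "<2007-02-12>abcabcabczzz<name4>" +
            "abcabcabc123<2007-03-12><name5>" +
            "<date>abcabcabc456<name6>";
        System.out.println(
            text.replaceAll(
                // Template for readability; every "text" becomes [^<]*,
                // giving ([^<]*)<([^<]*)>([^<]*)<([^<]*)>
                "(text)<(text)>(text)<(text)>"
                    .replace("text", "[^<]*"),
                // Reorder the groups: text - date - name, one record per line
                "$1$3 - $2 - $4\n"
            )
        );
    }
}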

This prints:

abcabcabcabc - 2007-01-12 - name1
abcabcabcxxx - 2007-01-12 - name2
abcabcabcyyy - 2007-01-11 - name3
abcabcabczzz - 2007-02-12 - name4
abcabcabc123 - 2007-03-12 - name5
abcabcabc456 - date - name6

Essentially, there are 3 parts:

The naked text is captured by groups 1 and 3 ($1 and $3 in the replacement string); for each record, one of these should be an empty string
The date is group 2 ($2)
The name is group 4 ($4)

You can, of course, also use a Matcher to extract the individual groups.
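
A minimal sketch of that Matcher variant, assuming the same pattern as above. The class name ExtractWithMatcher and the extract method are made up for illustration; only Pattern and Matcher from java.util.regex are real API here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the class and method names are illustrative only.
final class ExtractWithMatcher {
    // Same pattern as in the replaceAll version: text, <date>, text, <name>.
    private static final Pattern NOTE =
        Pattern.compile("([^<]*)<([^<]*)>([^<]*)<([^<]*)>");

    // Returns one {text, date, name} array per matched record.
    static List<String[]> extract(String input) {
        List<String[]> notes = new ArrayList<>();
        Matcher m = NOTE.matcher(input);
        while (m.find()) {
            // In the sample data one of groups 1 and 3 is empty,
            // so concatenating them yields the naked text.
            String text = m.group(1) + m.group(3);
            notes.add(new String[] { text, m.group(2), m.group(4) });
        }
        return notes;
    }

    public static void main(String[] args) {
        String input = "abcabcabcabc<2007-01-12><name1>"
                     + "<2007-01-12>abcabcabcxxx<name2>";
        for (String[] n : extract(input)) {
            System.out.println(n[0] + " - " + n[1] + " - " + n[2]);
        }
    }
}
```

Having each record as separate values, rather than one reformatted string, makes it easier to store the results in a list or post-process individual fields.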

References

regular-expressions.info/Grouping

polygenelubricants 2010-06-17 19:57:54

Thanks, Poly. All is good when the same two patterns hold throughout the extracted text. Unfortunately, there are several other patterns (I added one to the original message) that I need to clean and extract, and they break the final output (date, name, and note text, where "name" is the author of the note and "date" is the date the note was created). Because of these other patterns, I'm not sure how I should proceed.

Brian 2010-06-17 21:03:47

Should I run the patterns separately and, once a pattern has been extracted, fix the matches and store them in an array, then proceed to the next pattern, assuming the text matched by the previous pattern has been removed? Hopefully, I haven't confused anyone.

Brian 2010-06-17 21:04:36

@Brian: there's enough information in my answer to adapt the technique. Essentially capture the naked text in two groups, before and after the date, and concatenate the result.

polygenelubricants 2010-06-17 21:07:37

@Brian: I can't keep playing catch-up with you. I gave you the solution for the original spec, but then you broke it by changing the spec. I can give a solution for the modified spec, but then it might break if the spec changes again. Either give me the full spec, or learn the essence of the solution and adapt it yourself.

polygenelubricants 2010-06-18 07:18:08

Totally right, and I apologize for that. Based on our analysis of the data quality, we've identified several issues, along with the patterns that can be used to clean them. The example described above is one of them (there are about 4 others). All the issues can exist within the same text. Regular expressions are used to find them and then clean them. I figured it would be too much to submit everything at once, which is why I went piecemeal; however, that had its downside.

Brian 2010-06-18 13:31:32

@Brian: quick question: can the date part be just `<date>`, or `<tomorrow>`, or `<yesterday>`, or `<fourth of july>`, etc, or is it always the `<yyyy-mm-dd>` pattern?

polygenelubricants 2010-06-18 15:09:13

The format is always <yyyy-mm-dd>. However, there are other situations where the date is missing for the provided text. In this situation, we need to use the first date found after the text. Also, you may find text that does not have any date or author; here again, we need to use the first date and author found following the text. I will try to provide an example with a more complete explanation.

Brian 2010-06-18 18:08:02
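
One possible reading of the "use the first date found after the text" rule, as a rough sketch rather than a solution to the full spec (which was never posted). It assumes that every note ends with an author tag, that any tag not shaped like yyyy-mm-dd is an author, and it only handles the missing-date case, since the thread does not say how text without an author would be delimited. The class BackfillDates and its parse method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: class, method, and record layout are assumptions.
final class BackfillDates {
    // A token is either a <...> tag (group 1) or a run of plain text (group 2).
    private static final Pattern TOKEN = Pattern.compile("<([^<>]*)>|([^<]+)");
    private static final Pattern DATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    // Returns {text, date, name} arrays; date stays null if no later date exists.
    static List<String[]> parse(String input) {
        List<String[]> notes = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        String date = null;
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String tag = m.group(1);
            if (tag == null) {
                text.append(m.group(2));        // plain text
            } else if (DATE.matcher(tag).matches()) {
                date = tag;                      // date tag
            } else {
                // Assumption: any non-date tag is an author and closes the note.
                notes.add(new String[] { text.toString(), date, tag });
                text.setLength(0);
                date = null;
            }
        }
        // Trailing text with no author tag is ignored in this sketch.
        // Walk backwards so each undated note takes the first date after it.
        String next = null;
        for (int i = notes.size() - 1; i >= 0; i--) {
            String[] n = notes.get(i);
            if (n[1] == null) {
                n[1] = next;
            } else {
                next = n[1];
            }
        }
        return notes;
    }
}
```

For example, parse("note one<name1><2007-01-12>note two<name2>") would give "note one" the date 2007-01-12, borrowed from the note that follows it.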

Poly, I've sent you an email; I'm not sure whether you've received it.

Brian 2010-06-18 21:10:03
