I'm trying to extract a specific pattern from a text file, but the results are not quite what I want.

Here's my code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParseText1 {

    public static void main(String[] args) {

        String content = "<p>Yada yada yada <code> foo ddd</code>yada yada ...\n"
            + "more here <2004-08-24> bar<Bob Joe> etc etc\n"
            + "more here again <2004-09-24> bar<Bob Joe> <Fred Kej> etc etc\n"
            + "more here again <2004-08-24> bar<Bob Joe><Fred Kej> etc etc\n"
            + "and still more <2004-08-21><2004-08-21> baz <John Doe> and now <code>the end</code> </p>\n";

        Pattern p = Pattern.compile(
                "<[1234567890]{4}-[1234567890]{2}-[1234567890]{2}>.*?<[^%0-9/]*>",
                Pattern.MULTILINE);

        Matcher m = p.matcher(content);

        // print all the matches that we find
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}

The output I'm getting is:

<2004-08-24> bar<Bob Joe>
<2004-09-24> bar<Bob Joe> <Fred Kej>
<2004-08-24> bar<Bob Joe><Fred Kej>
<2004-08-21><2004-08-21> baz <John Doe> and now <code>

The output I want is:

<2004-08-24> bar<Bob Joe>
<2004-09-24> bar<Bob Joe>
<2004-08-24> bar<Bob Joe>
<2004-08-21> baz <John Doe>

In short, each sequence of "date", "text (or blank)", and "name" must be extracted, and everything else should be skipped. For example, the tag "Fred Kej" has no "date" tag before it, so it should be treated as invalid.

Also, as a side question, is there a way to store or track the text snippets that were skipped or rejected, as well as the valid ones?

Thanks, Brian

A: 

Have you tried adding the > character to the list of things not allowed in the second set of brackets?

Pattern p = Pattern
    .compile("<[1234567890]{4}-[1234567890]{2}-[1234567890]{2}>.*?<[^%0-9/>]*>",
            Pattern.MULTILINE);
VeeArr
A: 

This pattern works: "<\\d{4}-\\d{2}-\\d{2}>[^<]*<[^%\\d>]*>"
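
One way to see why the negated class `[^<]*` seems to matter here: a lazy `.*?` is still allowed to run across a `<`, so when two date tags sit back to back the match can begin at the first one. The sketch below compares both fillers on the last line of the sample text. The class name `LazyVersusNegatedClass` and the helper `findAll` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class LazyVersusNegatedClass {

    // Collects every full match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String line = "and still more <2004-08-21><2004-08-21> baz <John Doe>"
                + " and now <code>the end</code> </p>";

        // Lazy dot: the filler may cross '<', so the match starts at the first date.
        String lazy = "<\\d{4}-\\d{2}-\\d{2}>.*?<[^%\\d/>]*>";
        // Negated class: the filler cannot contain '<', so the first date is rejected.
        String negated = "<\\d{4}-\\d{2}-\\d{2}>[^<]*<[^%\\d>]*>";

        System.out.println(findAll(lazy, line));
        // prints [<2004-08-21><2004-08-21> baz <John Doe>]
        System.out.println(findAll(negated, line));
        // prints [<2004-08-21> baz <John Doe>]
    }
}
```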

As for capturing the non-matched strings, I think it's much easier to use the Matcher.start() and Matcher.end() indices to extract substrings from the original text than to extend the pattern, which is already quite complex.


String content = "<p>Yada yada yada <code> foo ddd</code>yada yada ...\n"
    + "more here <2004-08-24> bar<Bob Joe> etc etc\n"
    + "more here again <2004-09-24> bar<Bob Joe> <Fred Kej> etc etc\n"
    + "more here again <2004-08-24> bar<Bob Joe><Fred Kej> etc etc\n"
    + "and still more <2004-08-21><2004-08-21> baz <John Doe> and now <code>the end</code> </p>\n";

Pattern p = Pattern.compile(
    "<\\d{4}-\\d{2}-\\d{2}>[^<]*<[^%\\d>]*>",
    Pattern.MULTILINE
);

Matcher m = p.matcher(content);
int index = 0;
while (m.find()) {
    System.out.println(content.substring(index, m.start()));
    System.out.println("**MATCH START**" + m.group() + "**MATCH END**");
    index = m.end();
}
System.out.println(content.substring(index));

This prints:

<p>Yada yada yada <code> foo ddd</code>yada yada ...
more here 
**MATCH START**<2004-08-24> bar<Bob Joe>**MATCH END**
 etc etc
more here again 
**MATCH START**<2004-09-24> bar<Bob Joe>**MATCH END**
 <Fred Kej> etc etc
more here again 
**MATCH START**<2004-08-24> bar<Bob Joe>**MATCH END**
<Fred Kej> etc etc
and still more <2004-08-21>
**MATCH START**<2004-08-21> baz <John Doe>**MATCH END**
 and now <code>the end</code> </p>
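
If the goal is to store the snippets rather than print them, the same start()/end() walk can fill two lists. This is only a sketch under that assumption: the class `SnippetSplitter` and its fields `matched` and `skipped` are hypothetical names, not part of the code above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; class, field, and constant names are illustrative.
class SnippetSplitter {

    static final Pattern DATE_TEXT_NAME =
            Pattern.compile("<\\d{4}-\\d{2}-\\d{2}>[^<]*<[^%\\d>]*>");

    final List<String> matched = new ArrayList<>();
    final List<String> skipped = new ArrayList<>();

    SnippetSplitter(String content) {
        Matcher m = DATE_TEXT_NAME.matcher(content);
        int index = 0;
        while (m.find()) {
            // Text between the previous match and this one was rejected.
            if (m.start() > index) {
                skipped.add(content.substring(index, m.start()));
            }
            matched.add(m.group());
            index = m.end();
        }
        // Anything after the last match was rejected as well.
        if (index < content.length()) {
            skipped.add(content.substring(index));
        }
    }

    public static void main(String[] args) {
        SnippetSplitter s = new SnippetSplitter(
                "more here again <2004-08-24> bar<Bob Joe><Fred Kej> etc etc\n");
        System.out.println("matched: " + s.matched);
        System.out.println("skipped: " + s.skipped);
    }
}
```

Empty gaps between adjacent matches are dropped here. Remove the two length checks if every gap should be recorded, including empty ones.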
polygenelubricants
Once again, thank you. All the answers helped me solve my issue. In the end, I used the code suggested by polygenelubricants. Thanks, everyone, for your contributions.
Brian
A: 

Use this regex instead. I also added code to echo the discarded text snippets.

    Pattern p = Pattern.compile(
            "(<[0-9]{4}-[0-9]{2}-[0-9]{2}>)" + // <2004-08-21>
            "([^<]*)" +                        //  baz
            "(<[^%0-9>]*>)",                   // <John Doe>
            Pattern.MULTILINE);

    Matcher m = p.matcher(content);

    // print all the matches that we find
    int start = 0;
    while (m.find()) {
        // text between the previous match and this one was discarded
        System.out.println("\t"
                + content.substring(start, m.start()).replaceAll("\n", "\n\t"));
        System.out.println(m.group());
        start = m.end();
    }
    // whatever follows the last match was discarded as well
    System.out.println("\t"
                + content.substring(start).replaceAll("\n", "\n\t"));

The output is

        <p>Yada yada yada <code> foo ddd</code>yada yada ...
        more here 
<2004-08-24> bar<Bob Joe>
         etc etc
        more here again 
<2004-09-24> bar<Bob Joe>
         <Fred Kej> etc etc
        more here again 
<2004-08-24> bar<Bob Joe>
        <Fred Kej> etc etc
        and still more <2004-08-21>
<2004-08-21> baz <John Doe>
         and now <code>the end</code> </p>

Where indented lines correspond to discarded fragments
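
Because the pattern above wraps the date, the filler text, and the name in three capture groups, each part can also be read separately with Matcher.group(int). A minimal sketch, assuming that is useful here; the class `GroupExtractor` and the method `extract` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; class and method names are illustrative.
class GroupExtractor {

    static final Pattern P = Pattern.compile(
            "(<[0-9]{4}-[0-9]{2}-[0-9]{2}>)" // group 1: date tag
            + "([^<]*)"                      // group 2: text, possibly empty
            + "(<[^%0-9>]*>)");              // group 3: name tag

    // Returns one {date, text, name} triple per match, in input order.
    static List<String[]> extract(String content) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = P.matcher(content);
        while (m.find()) {
            rows.add(new String[] { m.group(1), m.group(2), m.group(3) });
        }
        return rows;
    }

    public static void main(String[] args) {
        String line = "and still more <2004-08-21><2004-08-21> baz <John Doe> and now";
        for (String[] row : extract(line)) {
            System.out.println("date=" + row[0]
                    + " text=[" + row[1] + "]"
                    + " name=" + row[2]);
        }
    }
}
```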

tucuxi