I'm looking to parse the following text file.
Sample text file:

<2008-10-07>text entered by user<Ted Parlor><2008-11-26>additional text entered by user<Ted Parlor>

I would like to parse the above text so that I can have three variables:

v1 = 2008-10-07
v2 = text entered by user
v3 = Ted Parlor
v1 = 2008-11-26
v2 = additional text entered by user
v3 = Ted Parlor

I attempted to use Scanner with useDelimiter, but I'm having trouble setting it up to produce the results stated above. Here's my first attempt:

import java.io.*;
import java.util.Scanner;

public class ScanNotes {
    public static void main(String[] args) throws IOException {
        Scanner s = null;
        try {
            //String regex = "(?<=\\<)([^\\>>*)(?=\\>)";
            s = new Scanner(new BufferedReader(new FileReader("cur_notes.txt")));
            s.useDelimiter("[<]+");

            while (s.hasNext()) {
                String v1 = s.next();
                String v2= s.next();
                System.out.println("v1= " + v1 + " v2=" + v2);
            }
        } finally {
            if (s != null) {
                s.close();
            }
        }
    }
}

The result is as follows:

v1= 2008-10-07>text entered by user v2=Ted Parlor> 

What I desire is:

v1= 2008-10-07 v2=text entered by user v3=Ted Parlor
v1= 2008-11-26 v2=additional text entered by user v3=Ted Parlor

Any help that would allow me to extract all three strings separately would be greatly appreciated.

+3  A: 

You can use \s*[<>]\s* as the delimiter: that is, either < or >, together with any whitespace before and after it.

For this to work, there must not be any < or > in the input other than the ones used to mark the date and user fields (e.g. no I <3 U!! in the message).

This delimiter allows empty string parts in an entry, but it also leaves empty string tokens between any two entries, so they must be discarded manually.
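
To make the empty tokens visible, here is a minimal sketch that simply lists every raw token the delimiter produces. The class name TokenDump, the helper tokens, and the shortened input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class TokenDump {
    // Collects every raw token the delimiter yields, empty ones included.
    static List<String> tokens(String content) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(content).useDelimiter("\\s*[<>]\\s*")) {
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical two-entry input, shortened for illustration.
        List<String> toks = tokens("<d1>t1<u1> <d2>t2<u2>");
        for (int i = 0; i < toks.size(); i++) {
            System.out.println(i + ": [" + toks.get(i) + "]");
        }
    }
}
```

With this input, token 3 should be the empty string that sits between the > closing the first entry and the < opening the second; that is the token that has to be discarded after every entry.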

import java.util.Scanner;

public class UseDelim {
    public static void main(String[] args) {
        String content = " <2008-10-07>text entered by user <Ted Parlor>"
        + "   <2008-11-26>  additional text entered by user <Ted Parlor>"
        + "   <2008-11-28><Parlor Ted>  ";
        Scanner sc = new Scanner(content).useDelimiter("\\s*[<>]\\s*");
        while (sc.hasNext()) {
            System.out.printf("[%s|%s|%s]%n",
                sc.next(), sc.next(), sc.next());

            // if there's a next entry, discard the empty string token
            if (sc.hasNext()) sc.next();
        }
    }
}

This prints:

[2008-10-07|text entered by user|Ted Parlor]
[2008-11-26|additional text entered by user|Ted Parlor]
[2008-11-28||Parlor Ted]
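
As an alternative to the Scanner approach, the same fields could be pulled out with java.util.regex, matching one whole entry at a time so there are no in-between tokens to discard. This is only a sketch under the same assumption (no stray < or > inside the fields); the class name RegexNotes and the helper parse are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexNotes {
    // One entry: <date>text<user>. No group may contain '<' or '>'.
    private static final Pattern ENTRY =
        Pattern.compile("<([^<>]*)>([^<>]*)<([^<>]*)>");

    // Returns one {date, text, user} array per entry found in the input.
    static List<String[]> parse(String content) {
        List<String[]> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(content);
        while (m.find()) {
            // trim() drops whitespace around each field
            entries.add(new String[] {
                m.group(1).trim(), m.group(2).trim(), m.group(3).trim()
            });
        }
        return entries;
    }

    public static void main(String[] args) {
        String content = " <2008-10-07>text entered by user <Ted Parlor>"
            + "   <2008-11-26>  additional text entered by user <Ted Parlor>"
            + "   <2008-11-28><Parlor Ted>  ";
        for (String[] e : parse(content)) {
            System.out.printf("[%s|%s|%s]%n", e[0], e[1], e[2]);
        }
    }
}
```

An entry with no text between the tags still yields three fields, with the middle one empty, so the field order is preserved.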

polygenelubricants
Excellent, thank you for your great response. One more question, regarding blank spaces before and after the <> tags. For example, the result will break if my data is as follows: String content = " <2008-10-07>text entered by user <Ted Parlor>" + " <2008-11-26>additional text entered by user <Ted Parlor>"; Perhaps I should have indicated this before. In short, how would I get the same output as yours while allowing for space(s) before and after the <> tags? Thanks so much.
Brian
I used the following on the content: content = content.replaceAll("\\s+<", "<").trim(); This resolved my problem. Any other suggestions are welcome.
Brian
How would I handle the situation where there's no text between the two tags, e.g. "<2008-10-07><Ted Parlor>"? This breaks the order of the fields. The desired result for this case would be: [2008-10-07||Ted Parlor|] The second value is simply left blank, with the order maintained as in your code above. Not sure this is possible. Cheers, and thanks for your input.
Brian
@Brian: see latest revision.
polygenelubricants
Fantastic, thanks so much.
Brian