I need to split some fields out of asterisk-delimited data.

Data Format:

NAME*ADDRESS LINE1*ADDRESS LINE2*

Rules:

1. The name must always be present.
2. Address Line 1 and Address Line 2 may be empty.
3. There are always exactly three asterisks.

Samples:

MR JONES A ORTEGA*ADDRESS 1*ADDRESS2*

Name: MR JONES A ORTEGA
Address Line1: ADDRESS 1
Address Line2: ADDRESS2

A PAUL*ADDR1**
Name: A PAUL
Address Line1: ADDR1
Address Line2: Not Given

My algo is:

1. Iterate through the characters in the line.
2. Store all chars in a temp variable until the first * is found. Reject the data if no char is found before the first occurrence of the asterisk. If some chars are found, use them as the name.
3. Same as step 2 for finding address lines 1 and 2, except that the data is not rejected if no char is found.

My algorithm looks ugly, and the code looks uglier. Splitting on \\* doesn't work either, since the name could be replaced with address line 1 if the data was *Address 1*Address2. Any suggestions?

EDIT:

Try using the data excluding quotes "-MS DEBBIE GREEN*1036 PINEWOOD CRES**"

A: 
String myLine = "name*addr1*addr2*";
String[] parts = myLine.split("\\*", 4);
for (String s : parts) {
    System.out.println(s);
}

Output:

name
addr1
addr2
(empty string)

If you split "**addr2*" without a limit, you will get an array with "", "", "addr2". So I don't see why you can't use split.

Also, if you split "***" with a limit of 4, you will get a 4-element array containing 4 empty strings.

Here is an example; try running this code:

public void testStrings() {
    String line = "part0***part3*part4****part8*";
    String[] parts = line.split("\\*");
    for (int i=0;i<parts.length;i++) {
        System.out.println(String.format("parts[%d]: '%s'",i, parts[i]));
    }
}

Result will be:

parts[0]: 'part0'
parts[1]: ''
parts[2]: ''
parts[3]: 'part3'
parts[4]: 'part4'
parts[5]: ''
parts[6]: ''
parts[7]: ''
parts[8]: 'part8'
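
To make the effect of the limit argument easier to see, here is a minimal sketch that splits the same line with different limits. The class and method names (`SplitLimitDemo`, `splitOnAsterisk`) are made up for illustration; only `String.split(String, int)` comes from the JDK.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Hypothetical helper: splits on a literal asterisk with the given limit.
    public static String[] splitOnAsterisk(String line, int limit) {
        return line.split("\\*", limit);
    }

    public static void main(String[] args) {
        String line = "name*addr1**";
        // Limit 0 behaves like split(regex): trailing empty strings are removed.
        System.out.println(Arrays.toString(splitOnAsterisk(line, 0)));  // [name, addr1]
        // Limit 3: at most two cuts, the rest of the input stays in the last element.
        System.out.println(Arrays.toString(splitOnAsterisk(line, 3)));  // [name, addr1, *]
        // Limit 4: three cuts, so the trailing empty strings are kept.
        System.out.println(Arrays.toString(splitOnAsterisk(line, 4)));  // [name, addr1, , ]
        // A negative limit cuts as often as possible and keeps trailing empty strings.
        System.out.println(Arrays.toString(splitOnAsterisk(line, -1))); // [name, addr1, , ]
    }
}
```

With exactly three asterisks per line, a limit of 4 and a negative limit give the same result; the fourth element is the empty string after the last asterisk.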
Max
I see. The limit makes it ok.
Milli Szabo
Yes, because as I wrote at the top of this response, you should use `myLine.split("\\*", 3)`, where 3 means that there are 3 parts.
Max
@Max: you need limit of 4, otherwise the last `*` will be included in the 3rd part.
polygenelubricants
@Max: The limit needs to be 4.
Milli Szabo
Hm, interesting behaviour. If no limit is specified, trailing empty strings are dropped. But if you specify a positive limit, the rest of the input, including any extra delimiters, ends up in the last element.
Max
A: 

yourString.split("\\*"); should give you an array with name, address1 and address2, where address1 and address2 can be empty Strings.

Daniel Engmann
Nope, it wouldn't.
Milli Szabo
If the data is *Address 1*Address2, you can also use split. You will get an array whose first element is the empty string, because leading empty strings are kept. In this case you know that the name is missing.
Daniel Engmann
A: 

You can use regex to do this. For example:

String myInput="MR JONES A ORTEGA*ADDRESS 1*ADDRESS2*";

Pattern pattern =  Pattern.compile("([^*]+)\\*([^*]*)\\*([^*]*)\\*");
Matcher matcher = pattern.matcher(myInput);

if (matcher.matches()) {
    String myName = matcher.group(1);
    String myAddress1 = matcher.group(2);
    String myAddress2 = matcher.group(3);
    // ...
} else {
    // input does not match the pre-requisites
}
andcoz
What if the data was full of info delimited by asterisks (*)? Wouldn't it look unreadable and distorted?
Milli Szabo
I am not sure what your question is about. The regex will get longer and longer as you add more fields and, eventually, validation. More power => more complexity. If you add a 4th field, e.g. a telephone number, you can also add validation by writing something like "([^*]+)\\*([^*]*)\\*([^*]*)\\*((\\+\\d{2}\\s)\\d+)*\\*". Obviously you can comment it; you can write: "([^*]+)\\*" /* 1st field: name, mandatory */ + "([^*]*)\\*" /* 2nd field: address, optional */ + "([^*]*)\\*" /* ... */.
andcoz
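
As one possible way to keep a growing pattern readable, the comments can live inside the regex itself via the `Pattern.COMMENTS` flag, which ignores whitespace in the pattern and treats `#` to end of line as a comment. This is only a sketch under that assumption; the class name `CommentedPatternDemo` and the `describe` helper are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentedPatternDemo {
    // Same three-field pattern as above, one field per line, documented in place.
    static final Pattern RECORD = Pattern.compile(
          "([^*]+)\\*   # field 1: name, mandatory\n"
        + "([^*]*)\\*   # field 2: address line 1, optional\n"
        + "([^*]*)\\*   # field 3: address line 2, optional\n",
        Pattern.COMMENTS);

    // Hypothetical helper: returns the three fields joined by " | ", or "rejected".
    public static String describe(String line) {
        Matcher m = RECORD.matcher(line);
        if (!m.matches()) {
            return "rejected";
        }
        return m.group(1) + " | " + m.group(2) + " | " + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(describe("MR JONES A ORTEGA*ADDRESS 1*ADDRESS2*"));
        System.out.println(describe("*ADDRESS 1*ADDRESS2*")); // rejected: empty name
    }
}
```

The flag only affects whitespace in the pattern, so spaces in the input (as in "MR JONES A ORTEGA") are still matched by `[^*]`.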
+2  A: 

You can use the String[] split(String regex, int limit) as follows:

    String[] tests = {
        "NAME*ADRESS LINE1*ADDRESS LINE2*",
        "NAME*ADRESS LINE1**",
        "NAME**ADDRESS LINE2*",
        "NAME***",
        "*ADDRESS LINE1*ADDRESS LINE2*",
        "*ADDRESS LINE1**",
        "**ADDRESS LINE2*",
        "***",
        "-MS DEBBIE GREEN*1036 PINEWOOD CRES**",
    };
    for (String test : tests) {
        // Drop the trailing '*' into a separate variable so the original line is printed.
        String trimmed = test.substring(0, test.length() - 1);
        String[] parts = trimmed.split("\\*", 3);
        System.out.printf(
            "%s%n  Name: %s%n  Address Line1: %s%n  Address Line2: %s%n%n",
            test, parts[0], parts[1], parts[2]
        );
    }

This prints (as seen on ideone.com):

NAME*ADRESS LINE1*ADDRESS LINE2*
  Name: NAME
  Address Line1: ADRESS LINE1
  Address Line2: ADDRESS LINE2

NAME*ADRESS LINE1**
  Name: NAME
  Address Line1: ADRESS LINE1
  Address Line2: 

NAME**ADDRESS LINE2*
  Name: NAME
  Address Line1: 
  Address Line2: ADDRESS LINE2

NAME***
  Name: NAME
  Address Line1: 
  Address Line2: 

*ADDRESS LINE1*ADDRESS LINE2*
  Name: 
  Address Line1: ADDRESS LINE1
  Address Line2: ADDRESS LINE2

*ADDRESS LINE1**
  Name: 
  Address Line1: ADDRESS LINE1
  Address Line2: 

**ADDRESS LINE2*
  Name: 
  Address Line1: 
  Address Line2: ADDRESS LINE2

***
  Name: 
  Address Line1: 
  Address Line2: 

-MS DEBBIE GREEN*1036 PINEWOOD CRES**
  Name: -MS DEBBIE GREEN
  Address Line1: 1036 PINEWOOD CRES
  Address Line2: 

The pattern is "\\*" because split takes a regular expression and * is a regex metacharacter; to match it literally, it must be escaped with a \. Since \ is itself a Java string escape character, you need to double it to get a \ into the string.

The limit is 3 because you want the array to have exactly 3 parts, including trailing empty strings. A split without a limit discards trailing empty strings.

The last * is discarded manually before the split is performed.
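
If you would rather not strip the last * by hand, one alternative is to split with a negative limit and validate the result. The following is a sketch, not a drop-in replacement: the `RecordParser` class and its `parse` method are made-up names, and the validation rules are taken from the question (exactly three asterisks, name not empty). `Pattern.quote` is used here as another way to get a literal `*` without writing the escape yourself.

```java
import java.util.Arrays;
import java.util.Optional;
import java.util.regex.Pattern;

class RecordParser {
    // Pattern.quote turns "*" into a regex that matches the literal asterisk.
    private static final String DELIMITER = Pattern.quote("*");

    // Returns {name, address1, address2}, or empty if the line breaks the rules.
    public static Optional<String[]> parse(String line) {
        // A negative limit keeps trailing empty strings, so three asterisks
        // always produce four parts, the last one being empty.
        String[] parts = line.split(DELIMITER, -1);
        boolean threeAsterisks = parts.length == 4 && parts[3].isEmpty();
        if (!threeAsterisks || parts[0].isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(Arrays.copyOf(parts, 3));
    }

    public static void main(String[] args) {
        System.out.println(parse("-MS DEBBIE GREEN*1036 PINEWOOD CRES**")
                .map(Arrays::toString).orElse("rejected"));
        System.out.println(parse("*ADDRESS LINE1**")
                .map(Arrays::toString).orElse("rejected")); // rejected: empty name
    }
}
```

The extra fourth element is checked and then dropped inside `parse`, so callers never have to ignore it themselves.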

polygenelubricants
Try using the data excluding quotes "-MS DEBBIE GREEN*1036 PINEWOOD CRES**"
Milli Szabo
@polygenelubricants: It would create 4 elements for the given sample above because of the limit of 4. That doesn't look satisfactory, because I always have to ignore the last element.
Milli Szabo
And what is the problem with ignoring the last element? If you don't want to ignore it, try using the solution @andcoz provided. However, the performance would be worse, as it uses a more complex regular expression which takes more time to compile.
Max
@Milli: I've modified so it's `split-3`.
polygenelubricants
A: 

A complete solution, reading from a file using Scanner and regular expressions:

import java.io.*;
import java.util.Scanner;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner s = new Scanner(new File("data.txt"));
        Pattern p = Pattern.compile("([^\\*]+)\\*([^\\*]*)\\*([^\\*]*)\\*");

        while (s.hasNextLine()) {
            if (s.findInLine(p) == null) {
                s.nextLine();
                continue;
            }

            System.out.println("Name: " + s.match().group(1));
            System.out.println("Addr1: " + s.match().group(2));
            System.out.println("Addr2: " + s.match().group(3));
            System.out.println();
        }
    }
}

Input file:

MR JONES A ORTEGA*ADDRESS 1*ADDRESS2*
A PAUL*ADDR1**
*No name*Addr 2*
My Name*Some Addr*Some more addr*

Output:

Name: MR JONES A ORTEGA
Addr1: ADDRESS 1
Addr2: ADDRESS2

Name: A PAUL
Addr1: ADDR1
Addr2: 

Name: My Name
Addr1: Some Addr
Addr2: Some more addr

Note that the line with no name is not matched (in accordance with Rule 1: the name must always be present). If you still want to process such lines, simply change the + in the regular expression to a *.

The regular expression ([^\\*]*)\\* can be read as: "Anything except an asterisk, zero or more times, followed by an asterisk."
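
To illustrate the difference between + and * in the first group, here is a small sketch that applies both variants to single lines with `Matcher.matches()` instead of a Scanner. The class name `NamePatternDemo`, the constants `STRICT` and `RELAXED`, and the `fields` helper are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamePatternDemo {
    // Name is mandatory: one or more non-asterisk characters.
    public static final Pattern STRICT =
            Pattern.compile("([^\\*]+)\\*([^\\*]*)\\*([^\\*]*)\\*");
    // Name is optional: zero or more non-asterisk characters.
    public static final Pattern RELAXED =
            Pattern.compile("([^\\*]*)\\*([^\\*]*)\\*([^\\*]*)\\*");

    // Returns the three captured fields if the whole line matches the pattern.
    public static Optional<String[]> fields(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] {m.group(1), m.group(2), m.group(3)});
    }

    public static void main(String[] args) {
        String line = "*No name*Addr 2*";
        // The strict pattern rejects the line because the name is empty.
        System.out.println(fields(STRICT, line).map(Arrays::toString).orElse("no match"));
        // The relaxed pattern accepts it with an empty first field.
        System.out.println(fields(RELAXED, line).map(Arrays::toString).orElse("no match"));
    }
}
```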

aioobe
What if the data was full of info delimited by asterisks (*)? Wouldn't it look unreadable and distorted?
Milli Szabo
No, it would look just fine.
aioobe
Why do you use "([^\\*]*)\\*" instead of "([^*]*)\\*"? "*" has no special meaning inside square brackets.
andcoz