I have a file containing several lines similar to:

Name: Peter
Address: St. Serrano número 12, España
Country: Spain

I need to extract the address using a regular expression, taking into account that it can contain dots and special characters (ñ, ç, áéíóú...).

The current code works, but it looks quite ugly:

Pattern p = Pattern.compile("^(.+?)Address: ([a-zA-Z0-9ñÑçÇáéíóú., ]+)(.+?)$",
                            Pattern.MULTILINE | Pattern.DOTALL);
Matcher m = p.matcher(content);
if (m.matches()) { ... }

Edit: The Address field could also be split across multiple lines:

Name: Peter
Address: St. Serrano número 12,   
Madrid
España
Country: Spain

Edit: I can't use a Properties object or a YAML parser, as the file contains other kinds of information, too.

A: 

Not a Java person, but wouldn't "Address: (.*)$" work?

Edit: Without the Pattern.MULTILINE | Pattern.DOTALL option it should match only on that line.

cnu
A: 

Can it contain a newline? If it cannot contain a newline, you don't need to use the multiline modifier, and can do instead

Pattern p = Pattern.compile("^Address: (.*)$");

If it can, an alternative I can think of is

Pattern p = Pattern.compile("Address: (.*)\nCountry", Pattern.MULTILINE);

Without the DOTALL, the dot won't match a newline, so you can explicitly specify it in the regexp, allowing you to do what you asked about.

Vinko Vrsalovic
+1  A: 

You might want to look into the Properties class instead of a regex. It provides ways to manage plain-text or XML files that represent key-value pairs.

You can load your example file into a Properties object and then read the values like so:

Properties properties = new Properties();
properties.load(/* InputStream of your file */);

Assert.assertEquals("Peter", properties.getProperty("Name"));
Assert.assertEquals("St. Serrano número 12, España", properties.getProperty("Address"));
Assert.assertEquals("Spain", properties.getProperty("Country"));
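For reference, here is a minimal, self-contained sketch of the same idea. The class name PropertiesSketch and the parse helper are made up for illustration. It loads from a Reader rather than an InputStream, because Properties.load(InputStream) assumes ISO 8859-1 and could mangle characters such as ñ or ú if the file is in another encoding.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper class, for illustration only.
final class PropertiesSketch {

    /** Parses "Key: value" lines into a Properties object. */
    static Properties parse(String content) throws IOException {
        Properties properties = new Properties();
        // load(Reader) takes characters as-is, so accented letters survive.
        properties.load(new StringReader(content));
        return properties;
    }

    public static void main(String[] args) throws IOException {
        String content = "Name: Peter\n"
                + "Address: St. Serrano número 12, España\n"
                + "Country: Spain\n";
        Properties properties = parse(content);
        System.out.println(properties.getProperty("Address"));
    }
}
```

Note that this only covers one-line values: a continuation line such as "Madrid" would be read as a separate key, so it does not handle the multi-line Address from the question's edit.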
Cem Catikkas
Why use Apache Commons Assert instead of Java assert?
cletus
A: 

You should definitely check out YAML.

You could try JYaml.

Best of all it has implementations in many languages.

P.S. I have tried the sample text in YAML::XS, and it works perfectly.

Brad Gilbert
+1  A: 

I don't mean to be a stick in the mud, but do you have to use a regex? Why not spare your future self (or others) the headache and do:

String line = reader.readLine();
while(line != null)
{
    line = line.trim();
    if(line.startsWith("Address: "))
    {
        return line.substring("Address: ".length()).trim();
    }
    line = reader.readLine();
}
return null;

Of course this can be parameterized a bit as well and put into a method.

Otherwise, I'd second the Properties or JYaml suggestions.
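As a rough sketch of the "put into a method" idea for the line-by-line loop: the class name FieldReader and the method readField are invented here, and the prefix check is loosened to "Address:" (no trailing space) so that an empty value is still found. It only handles values that fit on one line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper class, for illustration only.
final class FieldReader {

    /**
     * Returns the trimmed value of the first line that starts with
     * "<field>:", or null if no such line exists.
     */
    static String readField(BufferedReader reader, String field) throws IOException {
        String prefix = field + ":";
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.startsWith(prefix)) {
                return line.substring(prefix.length()).trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        String content = "Name: Peter\n"
                + "Address: St. Serrano número 12, España\n"
                + "Country: Spain\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(content))) {
            System.out.println(readField(reader, "Address"));
        }
    }
}
```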

Dave Ray
+2  A: 

Assuming "content" is a string containing the file's contents, your main problem is that you're using matches() where you should be using find().

Pattern p = Pattern.compile("^Address:\\s*(.*)$", Pattern.MULTILINE);
Matcher m = p.matcher(content);
if ( m.find() )
{
  ...
}

There seems to be some confusion in other answers about MULTILINE and DOTALL modes. MULTILINE is what lets the ^ and $ anchors match the beginning and end, respectively, of a logical line. DOTALL lets the dot (period, full stop, whatever) match line separator characters like \n (linefeed) and \r (carriage return). This regex must use MULTILINE mode and must not use DOTALL mode.
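To make the difference concrete, here is a small runnable sketch of the single-line case. The class name SingleLineAddress and the extract helper are made up for this example; the pattern is the one above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class SingleLineAddress {

    // MULTILINE lets ^ and $ match at every line boundary. DOTALL is left
    // off on purpose, so (.*) stops at the end of the Address line.
    private static final Pattern ADDRESS =
            Pattern.compile("^Address:\\s*(.*)$", Pattern.MULTILINE);

    /** Returns the text after "Address:" on the first such line, or null. */
    static String extract(String content) {
        Matcher m = ADDRESS.matcher(content);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String content = "Name: Peter\n"
                + "Address: St. Serrano número 12, España\n"
                + "Country: Spain\n";
        // find() scans for the pattern anywhere in the input.
        System.out.println(extract(content));
        // matches() must consume the entire input, so it returns false here.
        System.out.println(ADDRESS.matcher(content).matches());
    }
}
```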

Alan Moore
Thanks. What if the address is a multiline field? Is it possible to capture it without depending on the next field name?
Guido
Both of Nick's regexes will match if the Address field is at the end of the input. Is that what you mean?
Alan Moore
+3  A: 

I don't know Java's regex objects that well, but something like this pattern will do it:

^Address:\s*((?:(?!^\w+:).)+)$

assuming multiline and dotall modes are on.

This matches a line starting with "Address:" and captures everything that follows, up to the next line that begins with a single word followed by a colon.

If you know the next field has to be "Country", you can simplify this a little bit:

^Address:\s*((?:(?!^Country:).)+)$

The trick is the lookahead assertion inside the repeating group. '(?!^Country:).' matches any single character unless 'Country:' begins at that position (at the start of a line), so we wrap it in noncapturing parentheses (?:...), quantify it with +, and then put all of that in normal capturing parentheses.
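In Java, the generic version of the pattern might look like the sketch below. The class name MultiLineAddress and the extract helper are invented for illustration, and the trim() call is an addition that drops a trailing newline when Address is the last field in the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class MultiLineAddress {

    // MULTILINE: ^ and $ match at line boundaries.
    // DOTALL: the dot also matches newlines, so the capture can span lines.
    // The negative lookahead stops the capture where a line starts with "word:".
    private static final Pattern ADDRESS = Pattern.compile(
            "^Address:\\s*((?:(?!^\\w+:).)+)$",
            Pattern.MULTILINE | Pattern.DOTALL);

    /** Returns the (possibly multi-line) address, or null if none is found. */
    static String extract(String content) {
        Matcher m = ADDRESS.matcher(content);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String content = "Name: Peter\n"
                + "Address: St. Serrano número 12,   \n"
                + "Madrid\n"
                + "España\n"
                + "Country: Spain\n";
        System.out.println(extract(content));
    }
}
```

Two caveats: an address line that itself starts with a word and a colon (say, "Attn: ...") would end the capture early, and Java's \w only covers ASCII letters, digits and underscore by default, so a field name with accented letters would not be recognised as the next field.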

ʞɔıu
It worked! Thank you! I have to read more about regex :)
Guido