I'm trying to find a regex to separate out the author and book title information from a data set.

This one seems to work fine:

^\s*(?:(.*)\s+-\s+)?'?([^']+'?.*)\s*$

On the data below, it captures the author in group 1 (the text preceding the hyphen separator) and the book title in group 2; when there is no hyphen, only group 2 is populated:

William Faulkner - 'Light In August'
William Faulkner - 'Sanctuary'
William Faulkner - 'The Sound and the Fury'
Saki - 'Esme'
Saki - 'The Unrest Cure' (Second Edition)
Saki (File Under: Hector Hugh Munro) - 'The Interlopers' (Anniversary Multi-pack)
William Faulkner - 'The Sound and the Fury' (Collector's Re-issue)
'The Sound and the Fury'
The Sound and the Fury
The Bible (St James Version)

However, in the case of the following string which contains an ampersand, it fails:

'Jim Clarke & Oscar Wilde'

Could someone explain why it doesn't work here?

UPDATE:

Here is the relevant Java code:

Pattern pattern = Pattern.compile("^\\s*(?:(.*)\\s+-\\s+)?'?([^']+'?.*)\\s*$");
Matcher matcher = pattern.matcher(text);
if(!matcher.matches()) 
{
    logFailure(text);
}
else
{
    String author = matcher.group(1).trim();
    String bookTitle = matcher.group(2).trim();
}

A NullPointerException is thrown at the following line from the excerpt above:

    String author = matcher.group(1).trim();
+1  A: 

group(1) can return null; you should check for that before trimming.

insipid
+2  A: 

matcher.group(1) returns null when the text has no hyphen, so .trim() throws an NPE.
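To see this concretely, here is a minimal sketch that runs the question's regex against the failing string. Only the pattern comes from the question; the class and method names (NullGroupDemo, matches, authorGroup) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NullGroupDemo {
    // The pattern from the question, unchanged.
    static final Pattern PATTERN =
        Pattern.compile("^\\s*(?:(.*)\\s+-\\s+)?'?([^']+'?.*)\\s*$");

    // True when the whole text matches the pattern.
    static boolean matches(String text) {
        return PATTERN.matcher(text).matches();
    }

    // Raw value of group 1 (hypothetical helper): null when the optional
    // author part did not take part in the match, or when nothing matched.
    static String authorGroup(String text) {
        Matcher m = PATTERN.matcher(text);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "'Jim Clarke & Oscar Wilde'";
        System.out.println(matches(text));               // true: the regex itself matches
        System.out.println(authorGroup(text));           // null: no " - " separator in the text
        System.out.println(authorGroup("Saki - 'Esme'")); // Saki
    }
}
```

So the ampersand is not the problem. The string simply has no author part, and calling .trim() on the null group is what throws.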

Your current regex also eats the first single quote it finds. Also, do you actually need to reject text that doesn't match? You're just logging there. If text doesn't actually have to match a pattern, you could use a simpler algorithm:

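// Split on the first hyphen, if there is one.
int hyphenIndex = text.indexOf("-");
if (hyphenIndex > -1) {
    String author = text.substring(0, hyphenIndex).trim();
    System.out.println(author);
}
// With no hyphen, hyphenIndex is -1 and the whole text is the title.
String title = text.substring(hyphenIndex + 1).trim();
System.out.println(title);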

However, if you do require rejecting certain strings, there are probably a few things you could do to make this more readable as well.

  1. Change the regex to "^(?:(.*)\\s+-\\s+)?'?([^']+'?.*)$" and call pattern.matcher(text.trim())
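A sketch of how (1.) might look in use. One change here is my own assumption and not part of the suggestion above: the closing quote is matched outside the title group, and anything after it (an edition note, say) goes into a third group. The names TitleParser and parse are hypothetical. Note that an unquoted title containing an apostrophe would be cut at the apostrophe by this variant.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TitleParser {
    // Variant of the regex in (1.): the closing quote sits outside group 2,
    // and whatever follows it is captured in group 3 (an assumption).
    private static final Pattern PATTERN =
        Pattern.compile("^(?:(.*)\\s+-\\s+)?'?([^']+)'?(.*)$");

    // Returns {author, title, trailing text}, or empty if the text is rejected.
    static Optional<String[]> parse(String text) {
        Matcher m = PATTERN.matcher(text.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String author = m.group(1);   // null when there is no " - " separator
        return Optional.of(new String[] {
            author == null ? "" : author.trim(),
            m.group(2).trim(),
            m.group(3).trim()
        });
    }

    public static void main(String[] args) {
        String[] parts = parse("Saki - 'The Unrest Cure' (Second Edition)").get();
        System.out.println(String.join(" | ", parts));
        // prints: Saki | The Unrest Cure | (Second Edition)
    }
}
```

Returning an Optional keeps the "no match" case explicit, and the null check on group 1 covers titles that have no author.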
Instantsoup
After implementing your regex change in (1.) I end up with a single quote character at the end of the book titles: `The Unrest Cure'`
snoopy
Your simpler algorithm makes much more sense for what I'm doing actually. I'm going to use that and forget the regex nonsense. Thanks.
snoopy
+1  A: 

Your regex works fine; it's just that there is no author in the example you gave, so the first capturing group is null. When you then call matcher.group(1).trim(), you get an NPE.

Just handle nulls before you call trim. Perhaps something like this:

String author = matcher.group(1);
if(author == null) {
  author = "";
}
author = author.trim();
Mike Deck