I'm trying to find a regex to separate out the author and book title information from a data set.
This one seems to work fine:
^\s*(?:(.*)\s+-\s+)?'?([^']+'?.*)\s*$
On the data below, it identifies an author in group 1 as the text preceding the first hyphen and, in the case of no hyphen, it identifies a book title in group 2:
William Faulkner - 'Light In August'
William Faulkner - 'Sanctuary'
William Faulkner - 'The Sound and the Fury'
Saki - 'Esme'
Saki - 'The Unrest Cure' (Second Edition)
Saki (File Under: Hector Hugh Munro) - 'The Interlopers' (Anniversary Multi-pack)
William Faulkner - 'The Sound and the Fury' (Collector's Re-issue)
'The Sound and the Fury'
The Sound and the Fury
The Bible (St James Version)
However, it fails on the following string, which contains an ampersand:
'Jim Clarke & Oscar Wilde'
Could someone explain why it doesn't work here?
UPDATE:
Here is the relevant Java code:
Pattern pattern = Pattern.compile("^\\s*(?:(.*)\\s+-\\s+)?'?([^']+'?.*)\\s*$");
Matcher matcher = pattern.matcher(text);
if (!matcher.matches())
{
    logFailure(text);
}
else
{
    String author = matcher.group(1).trim();
    String bookTitle = matcher.group(2).trim();
}
A NullPointerException is thrown at the following line from the excerpt above:
String author = matcher.group(1).trim();
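One possible reading of the exception, offered as a sketch rather than a confirmed diagnosis: `Matcher.group(int)` returns `null` when a capturing group did not take part in the match. Group 1 sits inside the optional `(?:...)?` part, so for an input with no " - " separator the pattern can still match while group 1 stays `null`, and calling `trim()` on it would throw. If that is what is happening, the ampersand would not be the cause, and any input without an author part would be expected to behave the same way. The class name `AuthorTitleDemo`, the `parse` helper, and the choice of an empty string for a missing author are assumptions made for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and the String[] return shape are illustrative only.
class AuthorTitleDemo {
    // Same pattern as in the question.
    private static final Pattern PATTERN =
            Pattern.compile("^\\s*(?:(.*)\\s+-\\s+)?'?([^']+'?.*)\\s*$");

    // Returns {author, title}, or null when the pattern does not match at all.
    // The author is "" when the optional author group did not take part in the match.
    static String[] parse(String text) {
        Matcher matcher = PATTERN.matcher(text);
        if (!matcher.matches()) {
            return null;
        }
        String rawAuthor = matcher.group(1); // null when there is no " - " separator
        String author = (rawAuthor == null) ? "" : rawAuthor.trim();
        String bookTitle = matcher.group(2).trim();
        return new String[] { author, bookTitle };
    }

    public static void main(String[] args) {
        String[] samples = {
            "Saki - 'Esme'",
            "'Jim Clarke & Oscar Wilde'"
        };
        for (String sample : samples) {
            String[] parsed = parse(sample);
            System.out.println(parsed == null
                    ? "no match: " + sample
                    : "author=[" + parsed[0] + "] title=[" + parsed[1] + "]");
        }
    }
}
```

With this guard, the ampersand line should be reported as a match with an empty author instead of throwing. Whether an empty string, `null`, or an `Optional` is the right way to represent a missing author depends on how the calling code uses the value.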