I am trying to extract publisher information from a string. It comes in various formats such as:

John Wiley & Sons (1995), Paperback, 154 pages

New York, Crowell [1963] viii, 373 p. illus. 20 cm.

New York: Bantam Books, c1990. xx, 444 p. : ill. ; 27 cm.

Garden City, N.Y., Doubleday, 1963. 142 p. illus. 22 cm. [1st ed.]

All I want to extract is the publisher name, so everything from the ( or the [ onward can be ignored, and I need to grab every character before it. It's complicated by the fact that in example three I'd want to grab the information before the comma, but in example two I'd want to grab the information before the square bracket only and keep that comma if possible.

I'm willing to work with a regex that takes everything before ( [ and , and work with any imperfect data (like only getting "New York" for example 2), since I wouldn't want to insert all of example 3 into the database. The majority of the data have the date in brackets as in examples 1 and 2.
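
For reference, the fallback described above (everything before the first "(", "[" or ",") could be sketched roughly like this. It assumes PHP, and `publisher_prefix` is only an illustrative name, not an existing function:

```php
<?php
// Hypothetical helper: keep whatever precedes the first "(", "[" or ",".
function publisher_prefix(string $record): string
{
    // [^(\[,]+ matches a run of characters that are none of ( [ ,
    if (preg_match('/^[^(\[,]+/', $record, $m)) {
        return trim($m[0]);
    }
    return ''; // the record starts with one of the delimiters
}

$records = [
    'John Wiley & Sons (1995), Paperback, 154 pages',
    'New York, Crowell [1963] viii, 373 p. illus. 20 cm.',
    'New York: Bantam Books, c1990. xx, 444 p. : ill. ; 27 cm.',
    'Garden City, N.Y., Doubleday, 1963. 142 p. illus. 22 cm. [1st ed.]',
];

foreach ($records as $record) {
    echo publisher_prefix($record), "\n";
}
```

This gives "John Wiley & Sons", "New York", "New York: Bantam Books" and "Garden City", so examples 2 and 4 lose everything after their first comma.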

Thanks in advance for any suggestions!

+1  A: 

Here is one: #(.+?)\W*.\d{4}#:

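// $books holds the raw records, one per line
preg_match_all('#(.+?)\W*.\d{4}#', $books, $matches);
// group 1 is the text before the year; trim stray whitespace
$publishers = array_map('trim', $matches[1]);

print_r($publishers);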

Generates (as seen on ideone):

Array
(
    [0] => John Wiley & Sons
    [1] => New York, Crowell
    [2] => New York: Bantam Books
    [3] => Garden City, N.Y., Doubleday
)

It basically extracts everything before the sequence [any number of non-word characters + 1 character + 4-digit string (hopefully the year)].
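
To make the pieces easier to see, here is a self-contained sketch of the same pattern written with the x (extended) flag, so each part can carry its own comment. The sample input is simply the four records from the question, and the delimiter is changed to ~ only because # starts a comment in extended mode:

```php
<?php
// Sample input: the four records from the question, one per line.
$books = <<<'TXT'
John Wiley & Sons (1995), Paperback, 154 pages
New York, Crowell [1963] viii, 373 p. illus. 20 cm.
New York: Bantam Books, c1990. xx, 444 p. : ill. ; 27 cm.
Garden City, N.Y., Doubleday, 1963. 142 p. illus. 22 cm. [1st ed.]
TXT;

// Same pattern as above; whitespace and # comments are ignored under the x flag.
$pattern = '~
    (.+?)    # group 1: as few characters as possible (the publisher)
    \W*      # any run of non-word characters, such as a comma and a space
    .        # exactly one more character: a bracket, a "c", or a space
    \d{4}    # four digits, presumably the year
~x';

preg_match_all($pattern, $books, $matches);
$publishers = array_map('trim', $matches[1]);

print_r($publishers);
```

Because . does not match a newline by default, group 1 never runs over from one record into the next.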

Aillyn
+2  A: 

Hm, how about replacing:

[^\w\n\r]+c?[12]\d{3}.*

with the empty string? Explanation:

[^\w\n\r]+   # one or more non-word characters (but no new lines either!)
c?           # an optional "c"
[12]\d{3}    # a year (probably, at least)
.*           # all the rest of the line

Works for your example, might need a little extra tweaking.
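
In PHP (assuming that is what you are using), the replacement could look roughly like this. `strip_after_year` is just an illustrative name:

```php
<?php
// Hypothetical helper: cut the record off at the separator that precedes the year.
function strip_after_year(string $record): string
{
    // preg_replace() returns null on a regex error; fall back to the input then.
    return preg_replace('/[^\w\n\r]+c?[12]\d{3}.*/', '', $record) ?? $record;
}

$records = [
    'John Wiley & Sons (1995), Paperback, 154 pages',
    'New York, Crowell [1963] viii, 373 p. illus. 20 cm.',
    'New York: Bantam Books, c1990. xx, 444 p. : ill. ; 27 cm.',
    'Garden City, N.Y., Doubleday, 1963. 142 p. illus. 22 cm. [1st ed.]',
];

foreach ($records as $record) {
    echo strip_after_year($record), "\n";
}
```

A record with no four-digit year starting with 1 or 2 comes back unchanged, so those would still need separate handling.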

Tomalak
+1. Probably as close as you can reasonably get with a regex. But why do you say to run it in multiline mode? I don't see any line anchors.
Alan Moore
@Alan: Yeah, that's an edit artifact. :) I'll take it out, I just forgot to do it.
Tomalak
Excellent. This works just perfectly. The other answer looks to work too, but this one needed the least tweaking to fit in my code. Thanks a million!
mandel