I've got a file with book data in MARC format, some lines of which are ISBNs. I'd like to replace each of these lines with the Google Books ID for that ISBN, if one exists. Here's the code so far, which just ends up removing the lines:

perl -pe "s#ISBN(.*)#$(wget --output-document=- --quiet --user-agent=Mozilla/5.0 \"http://books.google.com/books?jscmd=viewapi&bibkeys=\1\")#mg" < 5-${file} > 6-${file}

PS: Google are a bit fuzzy on the use of automated tools: The Books Data API recommends tools like curl / wget, but there are no instructions on how to avoid being blocked when using such tools. I'm also pretty sure I saw a clause in a ToS saying users can't send automated queries, but I can't find it again. This is discussed in their forum.

+1  A: 

I think the OP is on the right track and could use a one-liner for this, and just needs to replace some bash-style syntax with the correct Perl syntax. I think this would work (newlines added for readability):

    perl -pe 's#ISBN(\w+)#qx(wget --output-document=- 
        --quiet --user-agent=Mozilla/5.0 
        "http://books.google.com/books\\?jscmd=viewapi\\&bibkeys=$1")#ge' \
        < 5-${file} > 6-${file}

You have to escape the ? and & characters in the URL (edit: double escaping seems to work).
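
For anyone wondering why the command in the question removes the lines: inside bash double quotes, `$(...)` is expanded by the shell once, before Perl even starts, so wget most likely runs a single time with a literal `\1` and its (probably empty) output becomes the replacement for every match. With the `/e` modifier the replacement is instead evaluated as Perl code for each match. Here is a minimal offline sketch of that mechanism; `lookup_id`, `replace_isbns` and the `%fake_ids` table are hypothetical stand-ins for the real network lookup, not part of the one-liner above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the network lookup so the sketch runs offline.
# The single entry is the ISBN/ID pair quoted elsewhere in this thread.
my %fake_ids = ( '0060930314' => 'ioXFqlzsmK8C' );

sub lookup_id {
    my ($isbn) = @_;
    return $fake_ids{$isbn};    # undef when the ISBN is unknown
}

# With /e the replacement part is Perl code that runs once per match.
# Unknown ISBNs are put back unchanged instead of being deleted.
# (The // operator needs Perl 5.10 or later.)
sub replace_isbns {
    my ($line) = @_;
    $line =~ s{ISBN(\w+)}{ lookup_id($1) // "ISBN$1" }ge;
    return $line;
}

print replace_isbns("ISBN0060930314\n");
print replace_isbns("ISBN0000000000\n");
```

Falling back to the original text when the lookup returns nothing is a design choice for this sketch; it keeps the MARC line intact when no ID exists.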

mobrule
Thanks - This actually works. Why was it downvoted?
l0b0
Beats me. LWP snobs?
mobrule
I think using the fake user agent is in violation of Google's TOS. Without it: `HTTP/1.0 401 Unauthorized`.
Sinan Ünür
This replaces the ISBN with a big gob of JavaScript, doesn't it? Am I missing some magic or something?
brian d foy
+4  A: 

You end up having to lie about the user agent because you are violating Google's TOS. Don't do that.

Instead, use the Google Book Search API.

The code below is slightly hampered by my lack of familiarity with modules such as XML::Atom, Data::Feed, WWW::OpenSearch. However, it should provide a good starting point.

#!/usr/bin/perl

use strict;
use warnings;

use Business::ISBN qw( valid_isbn_checksum );
use LWP::Simple;
use XML::Simple;

while ( <> ) {
    s/ISBN:([0-9]+)/'Google Books ID:' . get_google_id_for_isbn($1)/ge;
    print;
}

use Carp;

sub make_google_books_query {
    sprintf 'http://books.google.com/books/feeds/volumes?q=isbn:%s', $_[0];
}

sub get_google_id_for_isbn {
    my ($isbn) = @_;

    my $google_id = eval {
        # A bad checksum yields a defined but false value, so test for
        # truth rather than definedness.
        valid_isbn_checksum($isbn)
            or croak "Invalid ISBN: $isbn";

        my $query = make_google_books_query($isbn);
        my $xml = get $query;

        defined($xml)
            or croak "No response to <$query>";

        my $data = XMLin($xml, ForceArray => 1);
        my @ids = @{ $data->{entry}[0]{'dc:identifier'} };

        unless ("ISBN:$isbn" eq $ids[1]
                or "ISBN:$isbn" eq $ids[2] ) {
            croak "Invalid search results: '@ids'";
        }

        $ids[0];
    };

    defined($google_id) ? $google_id : '';
}

Given a text file t.txt containing:

ISBN:0060930314
ISBN:9780596520106

it outputs:

Google Books ID:ioXFqlzsmK8C
Google Books ID:lNVHi3TunxsC
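
If you are curious what the checksum test amounts to, here is a rough pure-Perl sketch of the standard ISBN-10 and ISBN-13 check-digit rules. It is only an illustration: `isbn_checksum_ok` is a hypothetical helper, not Business::ISBN's implementation, and it ignores the prefix and publisher-code checks the module also performs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of Business::ISBN.
# Returns true when the check digit of an ISBN-10 or ISBN-13 is consistent.
sub isbn_checksum_ok {
    my ($isbn) = @_;
    $isbn =~ tr/- //d;    # drop hyphens and spaces

    if ( $isbn =~ /\A[0-9]{9}[0-9Xx]\z/ ) {
        # ISBN-10: weights 10 down to 1, a trailing 'X' counts as 10,
        # and the weighted sum must be divisible by 11.
        my @chars = split //, uc $isbn;
        my $sum   = 0;
        for my $i ( 0 .. 9 ) {
            my $value = $chars[$i] eq 'X' ? 10 : $chars[$i];
            $sum += ( 10 - $i ) * $value;
        }
        return $sum % 11 == 0;
    }

    if ( $isbn =~ /\A[0-9]{13}\z/ ) {
        # ISBN-13: alternating weights 1 and 3, sum divisible by 10.
        my @digits = split //, $isbn;
        my $sum    = 0;
        $sum += $digits[$_] * ( $_ % 2 ? 3 : 1 ) for 0 .. 12;
        return $sum % 10 == 0;
    }

    return 0;    # wrong length or stray characters
}

# The two ISBNs from t.txt above.
for my $isbn (qw( 0060930314 9780596520106 )) {
    printf "%s: %s\n", $isbn, isbn_checksum_ok($isbn) ? 'ok' : 'bad';
}
```

Note that the `[0-9]+` pattern in the script above would not capture an ISBN-10 ending in `X`, so the input regex may need widening if your data contains such ISBNs.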
Sinan Ünür
Also, see my Business::ISBN module if you want to validate the incoming ISBNs. Google returns results even for invalid input. You might also want to wrap an eval around @{ ... } in case there is no entry.
brian d foy