ansaurus

Question

I'd like to scrape the iTunes top X RSS feed and insert into a dB...

Answer 1

A:

Well, I'm not really sure what sort of answer you're looking for, but I don't think you need to do any sort of shell scripting. Bother PHP and Perl would be perfectly capable of downloading the RSS feed and insert the data into MySQL. Set the PHP or Perl script up to run every X number of hours/days/whatever with a cronjob and you'd be done.

Not really much else to tell you, with how vague your question was.

Chad Birch 2009-02-23 07:44:42

Answer 2

A:

I'm scraping Stack Overflow's feed to perform some additional filtering using PHP's DOMDocument and then DOM methods to access what I want. I'd suggest looking into that.

Kit Sunde 2009-02-23 08:56:17

Answer 3

+1 A:

Here is a solution using Perl, with the help of (of course!) a bunch of modules.

It uses SQLite so you can run it easily (the definition of the (simplistic) DB is at the end of the script). Also it uses Perl hashes and simple SQL statements, instead of proper objects and an ORM layer. I found it easier to parse the XML directly instead of using an RSS module (I tried XML::Feed), because you need access to specific tags (name, preview...).

You can use it as a basis to add more features, more fields in the DB, a table for genre... but at least this way you have a basis that you can expand on (and maybe you can then publish the result as open-source).

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;                 # to parse the RSS
use DBIx::Simple;              # DB interaction made easy
use Getopt::Std;               # always need options for a script
use PerlIO::gzip;              # itunes sends a gzip-ed file
use LWP::Simple 'getstore';    # to get the RSS

my %opt;
getopts( 'vc:', \%opt);

# could also be an option, but I guess it won't change that much
my @URLs= ( 
            'http://ax.itunes.apple.com/WebObjects/MZStoreServices.woa/ws/RSS/topsongs/limit=10/xml',
          );

# during debug, it's nice to use a cache of the feed instead of hitting hit every single run
if( $opt{c}) { @URLs= ($opt{c}); }

# I like using SQLite when developping,
# replace with MySQL connect parameters if needed (see DBD::MySQL for the exact syntax)
my @connect= ("dbi:SQLite:dbname=itunes.db","","", { RaiseError => 1, AutoCommit => 0 }) ;

my $NS_PREFIX='im';

# a global, could be passed around, but would make the code a bit more verbose
my $db = DBIx::Simple->connect(@connect) or die "cannot connect to DB: $DBI::errstr";

foreach my $url (@URLs)
  { add_feed( $url); }

$db->disconnect;

warn "done\n" if( $opt{v});

sub add_feed 
  { my( $url)= @_;

    # itunes sends gziped RSS, so we need to unzip it
    my $tempfile= "$0.rss.gz"; # very crude, should use File::Temp instead 
    getstore($url, $tempfile);
    open( my $in_feed, '<:gzip', $tempfile) or die " cannot open tempfile: $!";

    XML::Twig->new( twig_handlers => { 'feed/title' => sub { warn "adding feed ", $_->text if $opt{v}; },
                                          entry       => \&entry,
                                       },
                      map_xmlns => { 'http://phobos.apple.com/rss' => $NS_PREFIX },
                  )
             ->parse( $in_feed);

    close $in_feed;
  }

sub entry
  { my( $t, $entry)= @_;

    # get the data
    my %song= map { $_ => $entry->field( "$NS_PREFIX:$_") } qw( name artist price);
    if( my $preview= $entry->first_child( 'link[@title="Preview"]') )
      { $song{preview}= $preview->att( 'href'); }

    # $db->begin_work;

    # store it
    if( ($db->query( 'SELECT count(*) FROM song WHERE name=?', $song{name})->flat)[0])
      { warn "  skipping $song{name}, already stored\n" if $opt{v};
      }
    else
      {
        warn "  adding $song{name}\n" if $opt{v};
        if( my $artist_id= ($db->query( 'SELECT id from ARTIST where name=?', $song{artist})->flat)[0])
          { warn "  existing artist $song{name} ($artist_id)\n" if $opt{v};
            $song{artist}= $artist_id; 
          }
        else
          { warn "  creating new artist $song{artist}\n" if $opt{v};

            $db->query( 'INSERT INTO artist (name) VALUES (??)', $song{artist});

            # should be $db->last_insert_id but that's not available in DBD::SQLite at the moment
            $song{artist}= $db->func('last_insert_rowid');
          }

        $db->query( 'INSERT INTO song ( name, artist, price, preview) VALUES (??)', 
                              @song{qw( name  artist  price  preview)});
        $db->commit;
      }
    $t->purge; # keeps memory usage lower, probably not needed for small RSS files
  }

__END__
=head1 NAME

  itunes2db - loads itunes RSS feeds to a DB

=head1 OPTIONS

  -c <file>  uses a cache instead of the list of URLs
  -v         verbose

=head1 DB schema

  create table song ( id INT PRIMARY KEY, name TEXT, artist INT, price TEXT, preview TEXT);
  create table artist (id INT PRIMARY KEY, name TEXT);

mirod 2009-02-23 09:25:39

Answer 4

A:

From what I can tell, it's not actively maintained, but Scriptella could be of some assistance. Very simple xml script, running on Java.

Example of how to suck RSS into a database

2009-03-10 03:07:18

ansaurus

tags:

views:

answers:

I'd like to scrape the iTunes top X RSS feed and insert into a dB...

related questions