Organizing Nightly IMDB Dumps Into Structured Data

Hello All,

I'm currently trying to write a website for testing / learning purposes that will wrap around the IMDB datasets that are dumped.

I'm having trouble determining the best way to extract the data into a format that is easier to manage. I will need to pull data from several files:

movies.list = Movie list of all movies and year of production
mpaa-ratings-reasons.list = MPAA ratings
running-times.list = Running times

The data in these tables are linked by a unique name that is given to each line. Essentially, I will need to join the lines of each of these text files together using the unique name. After doing this, I will need to parse the data I need out of the actual unique name since the movie title isn't listed explicitly. The unique name also specifies if the entry is a video game or TV show, which I will not be collecting data for.

Pulling the data from those unique name qualifiers is most likely going to be a Regex nightmare, but I'm more concerned with what the best method is for actually grouping the text files into a manageable format somewhere... Should I...

Pull the data into staging tables on the SQL server, and then write a separate part in my app to join the tables and pull everything together?
Load the lines from the text files into a .NET data table and do my processing that way?
1. In doing so, am I going to cause a memory nightmare for the box that is running this app?
Some other alternative?

On a side note, the movies.list file alone contains over 1 million lines of data.

Thanks in advance for your help.

Chris

ansaurus

tags:

views:

answers:

Organizing Nightly IMDB Dumps Into Structured Data

related questions