Hello All,

I'm currently trying to write a website, for testing/learning purposes, that will wrap around the plain-text dataset dumps that IMDB publishes.

I'm having trouble determining the best way to extract the data into a format that is easier to manage. I will need to pull data from several files:

  • movies.list = List of all movies and their year of production
  • mpaa-ratings-reasons.list = MPAA ratings
  • running-times.list = Running times

The data in these files is linked by a unique name given to each line. Essentially, I will need to join the lines of these text files together on that unique name. After doing this, I will need to parse the data I need out of the unique name itself, since the movie title isn't listed explicitly anywhere else. The unique name also specifies whether the entry is a video game or TV show, which I will not be collecting data for.
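
For reference, here's a minimal sketch of that parse in C#, assuming the lines are shaped roughly like `Blade Runner (1982)<tab>1982`, with TV titles wrapped in quotes and video games tagged with "(VG)"; the exact format here is an assumption, so verify it against the real files:

    using System;
    using System.IO;
    using System.Text.RegularExpressions;

    class MovieListParser
    {
        // Assumed line shape: unique name, tab(s), production year.
        // This pattern is a guess at the described format, not the official spec.
        static readonly Regex Line = new Regex(
            @"^(?<name>.+?\(\d{4}[^)]*\)\S*)\t+(?<year>\d{4}|\?{4})");

        static void Main()
        {
            // Read one line at a time so a million-plus lines never
            // sit in memory at once.
            using (var reader = new StreamReader("movies.list"))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    Match m = Line.Match(line);
                    if (!m.Success) continue;

                    string name = m.Groups["name"].Value;

                    // Skip TV series (quoted titles) and video games ("(VG)").
                    if (name.StartsWith("\"") || name.Contains("(VG)")) continue;

                    Console.WriteLine("{0} -> {1}", name, m.Groups["year"].Value);
                }
            }
        }
    }

Reading line by line like this keeps memory flat regardless of file size, which also bears on the memory worry in option 2 below.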

Pulling the data out of those unique-name qualifiers is most likely going to be a regex nightmare, but I'm more concerned with the best method for actually pulling the text files together into a manageable format somewhere... Should I...

  1. Pull the data into staging tables on SQL Server, and then write a separate part of my app to join the tables and pull everything together? (See the sketch after this list.)
  2. Load the lines from the text files into a .NET DataTable and do my processing that way?
    1. In doing so, am I going to cause a memory nightmare for the box that is running this app?
  3. Some other alternative?
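
For option 1, SqlBulkCopy is the standard way to push parsed rows into a staging table quickly. A rough sketch, with a made-up staging table dbo.MoviesStaging (create it with columns matching the DataTable first):

    using System.Data;
    using System.Data.SqlClient;

    class StagingLoader
    {
        static void LoadStaging(DataTable rows, string connectionString)
        {
            // Bulk-insert parsed rows into a staging table.
            // "dbo.MoviesStaging" is a made-up name; it must already
            // exist with columns matching the DataTable.
            using (var bulk = new SqlBulkCopy(connectionString))
            {
                bulk.DestinationTableName = "dbo.MoviesStaging";
                bulk.BatchSize = 10000; // send rows in batches
                bulk.WriteToServer(rows);
            }
        }
    }

Filling and flushing a modest DataTable per batch, instead of building one giant table, keeps the client's memory use bounded.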

On a side note, the movies.list file alone contains over 1 million lines of data.

Thanks in advance for your help.

Chris

+1  A: 

Staging tables on the DB server, scrub the data into final tables.

If this means loading back into a client app for the processing, so be it.

Practically, a DB server will handle that quantity of data, but SQL Server may not be the best tool for your string processing.
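
For illustration, that scrub could be a single set-based insert that joins the staging tables on the unique name (all table and column names below are invented):

    using System.Data.SqlClient;

    class Scrub
    {
        static void ScrubIntoFinal(string connectionString)
        {
            // Join the staging tables on the unique name and write the
            // combined rows to the final table. All names are placeholders.
            const string sql = @"
                INSERT INTO dbo.Movies (Name, ProductionYear, MpaaRating, RunningTime)
                SELECT m.Name, m.ProductionYear, r.Rating, t.Minutes
                FROM dbo.MoviesStaging m
                LEFT JOIN dbo.RatingsStaging r ON r.Name = m.Name
                LEFT JOIN dbo.RunningTimesStaging t ON t.Name = m.Name;";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                cmd.CommandTimeout = 0; // a join over a million rows can take a while
                cmd.ExecuteNonQuery();
            }
        }
    }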

gbn