Hello All,
I'm currently trying to write a website for testing / learning purposes that will wrap around the IMDB datasets that are dumped.
I'm having trouble determining the best way to extract the data into a format that is easier to manage. I will need to pull data from several files:
- movies.list = Movie list of all movies and year of production
- mpaa-ratings-reasons.list = MPAA ratings
- running-times.list = Running times
The data in these tables are linked by a unique name that is given to each line. Essentially, I will need to join the lines of each of these text files together using the unique name. After doing this, I will need to parse the data I need out of the actual unique name since the movie title isn't listed explicitly. The unique name also specifies if the entry is a video game or TV show, which I will not be collecting data for.
Pulling the data from those unique name qualifiers is most likely going to be a Regex nightmare, but I'm more concerned with what the best method is for actually grouping the text files into a manageable format somewhere... Should I...
- Pull the data into staging tables on the SQL server, and then write a separate part in my app to join the tables and pull everything together?
- Load the lines from the text files into a .NET data table and do my processing that way?
- In doing so, am I going to cause a memory nightmare for the box that is running this app?
- Some other alternative?
On a side note, the movies.list file alone contains over 1 million lines of data.
Thanks in advance for your help.
Chris