views: 133

answers: 2

I am building an application that lets users submit a list of company/date pairs and find out whether there was a news event on that date. The news events are stored in a dictionary keyed by a company identifier and a date.

newsDict[('identifier','MM/DD/YYYY')] = [list of news events for that date]
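
So a single lookup would look something like this (the identifier and date values here are only placeholders):

events = newsDict.get(('IBM', '05/21/2009'), [])  # empty list if there was no news that day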

The dictionary turned out to be much larger than I expected, too big even to build in memory, so I broke it into three pieces, each limited to a particular range of company identifiers.

My plan was to take the user-submitted list and, using a dictionary, group the company identifiers by the particular newsDict in which their events would be expected to be found, then load the newsDicts one after another to get the values.
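
Roughly, that grouping step would look something like the sketch below; piece_for is a placeholder for however the identifier ranges are actually split across the three files, and userList is an assumed name for the user-submitted identifier/date pairs.

from collections import defaultdict

def piece_for(identifier):
    # Placeholder: map an identifier to the newsDict piece that covers its range
    if identifier < 'G':
        return 'newsDict_A'
    elif identifier < 'P':
        return 'newsDict_B'
    return 'newsDict_C'

byPiece = defaultdict(list)
for identifier, date in userList:
    byPiece[piece_for(identifier)].append((identifier, date))
# Then load each newsDict piece once and answer all of its queries together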

Now I am wondering whether it would be better to keep the news events in a list, with each item of the list being a sublist made up of a tuple and another list:

[('identifier','MM/DD/YYYY'),[list of news events for that date]]

My thought, then, is that I would have a dictionary holding the range of the list for each company identifier:

 companyDict['identifier']=(begofRangeinListforComp,endofRangeinListforComp)

I would use the user input to look up the ranges I needed and construct a list of the identifiers and ranges sorted by the ranges. Then I would just read the appropriate section of the list to get the data and construct the output.
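
In rough form, that lookup over the list might look like this; userIdentifiers and eventList are assumed names for the user's identifiers and the big list already loaded into memory.

# Sort the requested identifiers by where their block sits in the big list
wanted = sorted((companyDict[identifier], identifier) for identifier in userIdentifiers)

results = []
for (begin, end), identifier in wanted:
    # Read only the slice of the big list that belongs to this company
    for key, events in eventList[begin:end + 1]:
        results.append((key, events))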

The biggest reason I see for this is that, even with the dictionary broken into thirds, each section takes about two minutes to load on my machine, and the dictionary ends up taking about 600 to 750 MB of RAM.

I was surprised to note that a list of eight million lines took only about 15 seconds to load and used about one-third of the memory of a dictionary that had one-third as many entries.

Further, since I can discard lines as I work through the list, I will be freeing memory as I work down the user list.

I am surprised, as I thought a dictionary would be the most efficient way to do this, but my poking at it suggests that the dictionary requires significantly more memory than a list. My reading of other posts on SO and elsewhere suggests that any other structure is going to require pointer allocations that are more expensive than list pointers. Am I missing something here, and is there a better way to do this?

After reading Alberto's answer and his response to my comment, I spent some time trying to figure out how to write the function if I were to use a database. I might be hobbled here because I don't know much about database programming, but I think the code to implement this using a database would be much more complicated than:

outList = []
massiveFile = open('theFile', 'r')
# read all lines so they can be sliced by position
lines = massiveFile.readlines()
# sortedUserList is the user list, sorted by the key of the dictionary
for identifier in sortedUserList:
    begin = theDict[identifier]['beginPosit']
    end = theDict[identifier]['endPosit']
    for item in lines[begin:end + 1]:
        # "manipulation of the identifier" was pseudocode; a plain prefix check stands in here
        if item.startswith(identifier):
            outList.append(item)

I would have to wrap this in a function, but I didn't see anything comparably simple if I converted the list to a database.

Of course, simpler was not the reason that brought me to this forum. I still don't see that using another structure will cost less memory. I have 30,000 company identifiers and approximately 3,600 dates. Each item in my list is an object, in the parlance of OOD. That is where I am struggling: I spent six hours this morning organizing the data for a dictionary before I gave up. Spending that amount of time to implement a database, and then finding that I am using half a gig or more of someone else's memory to load it, seems problematic.

+5  A: 

With such a large amount of data, you should be using a database. This would be far better than searching through a list, and would be the most appropriate way of storing your data anyway. If you're using Python, it has SQLite built in, I believe.
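
A minimal sketch of what that might look like with the built-in sqlite3 module; the table layout, column names, and file name here are assumptions, not part of the answer:

import sqlite3

conn = sqlite3.connect('news.db')
conn.execute("CREATE TABLE IF NOT EXISTS news (identifier TEXT, date TEXT, event TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_news ON news (identifier, date)")

def events_for(identifier, date):
    # Return the list of news events for one company/date pair
    cur = conn.execute("SELECT event FROM news WHERE identifier = ? AND date = ?",
                       (identifier, date))
    return [row[0] for row in cur]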

AlbertoPL
What is the benefit of using a database?
PyNEwbie
Basically, all of the functionality that you are coding already exists in the form of SQL calls, so you can create all of the different kinds of lists you want just by querying the tables in the database. A database also would not hold all of the data it manages in RAM, which is a huge plus.
AlbertoPL
SQLite is a standard module in Python (sqlite3), and I would recommend Elixir (SQLAlchemy) as the table/query manager.
monkut
+1  A: 

The dictionary will take more memory because it is effectively a hash table.

You don't have to go so far as using a database, since your lookup requirements are so simple. Just use the file system.

Create a directory structure based on the company name (or ticker), with subdirectories for each date. To find whether data exists and load it up, just form the name of the subdirectory where the data would be, and see if it exists.

E.g., IBM news for May 21 would be in C:\db\IBM\20090521\news.txt, if in fact there were news for that day. You just check if the file exists; no searches.
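
A sketch of that check in Python; the path layout follows the example above, and the YYYYMMDD date format is assumed:

import os

def news_for(symbol, datestr):
    # e.g. symbol='IBM', datestr='20090521'
    path = os.path.join(r'C:\db', symbol, datestr, 'news.txt')
    if os.path.isfile(path):
        with open(path) as f:
            return f.readlines()   # the news events for that day
    return None                    # no news for that company/date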

If you want to try and boost speed from there, come up with a scheme to cache a limited amount of results that are likely to be frequently requested (assuming you're operating a server). For that, you'd use a hash.
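
For example, a small dictionary-based cache in front of the lookup; the size limit and eviction policy are only illustrative, and news_for is the function from the sketch above:

cache = {}
MAX_CACHE = 10000  # illustrative limit

def cached_news_for(symbol, datestr):
    key = (symbol, datestr)
    if key not in cache:
        if len(cache) >= MAX_CACHE:
            cache.pop(next(iter(cache)))  # crude eviction: drop an arbitrary entry
        cache[key] = news_for(symbol, datestr)
    return cache[key]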

John Pirie
Clever, +1. I don't want to add a complex directory structure, though; 300K identifiers would make it very hard for them to walk their directory structure.
PyNEwbie
Certainly don't want thousands in a single directory. So you subdivide, and create C:\db\I\B\M\2009\05\21\news.txt.
John Pirie
And that's really easier than using SQLite?
Seun Osewa
What's hard?

symbol, datestr = "IBM", "20090521"
newsname = "C:/db/%s/%s/news.txt" % ("/".join(symbol), datestr)
if os.path.isfile(newsname): ...
John Pirie