I have a text file that contains over one million URLs. I have to process this file in order to assign the URLs to groups based on host address:

{
    'http://www.ex1.com' => ['http://www.ex1.com/...', 'http://www.ex1.com/...', ...],
    'http://www.ex2.com' => ['http://www.ex2.com/...', 'http://www.ex2.com/...', ...]
}

My current basic solution takes about 600 MB of RAM to do this (the size of the file is about 300 MB). Could you suggest some more memory-efficient approaches?

My current solution simply reads the file line by line, extracts the host address with a regex, and pushes the URL into a hash.

EDIT

Here is my implementation (I've cut off irrelevant things):

use Storable;   # store() writes the hash structure to disk

my %urls;
while (my $line = <STDIN>) {
    chomp($line);
    # Capture the scheme and host, e.g. "http://www.ex1.com"
    next unless $line =~ m{(http://.+?)(?:/|$)}i;
    my $host = $1;
    push @{ $urls{$host} }, $line;
}

store \%urls, 'out.hash';
+1  A: 

If you read 600MB from two files and store them in memory (in the hash) there is not much room for optimization in terms of memory use (short of compressing the data, which is probably not a viable option).

But depending on how you are going to use the data in the hash, it might be worth considering storing it in a database and querying it for the information you need.
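
For example, here is a minimal sketch of that idea using DBI with DBD::SQLite; the urls.db file name and the table layout are just assumptions for illustration:

use strict;
use warnings;
use DBI;

# One row per URL, indexed by host so a group can be fetched on demand.
my $dbh = DBI->connect('dbi:SQLite:dbname=urls.db', '', '', { RaiseError => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS urls (host TEXT, url TEXT)');
$dbh->do('CREATE INDEX IF NOT EXISTS idx_host ON urls (host)');

my $insert = $dbh->prepare('INSERT INTO urls (host, url) VALUES (?, ?)');
$dbh->begin_work;                       # one transaction keeps bulk inserts fast
while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{(http://.+?)(?:/|$)}i;
    $insert->execute($1, $line);
}
$dbh->commit;

# Later, pull a single host's group without loading everything into RAM:
my $group = $dbh->selectcol_arrayref(
    'SELECT url FROM urls WHERE host = ?', undef, 'http://www.ex1.com');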

EDIT:

Based on the code you have posted, a quick optimization would be to store not the entire line but just the relative URL. After all, you already have the host name as a key in your hash.
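
A minimal sketch of that change to the posted loop, splitting each line into host and path and keeping only the path as the stored value:

my %urls;
while (my $line = <STDIN>) {
    chomp $line;
    # Separate the host from the rest of the URL; store only the relative part.
    next unless $line =~ m{^(http://[^/]+)(/.*)?$}i;
    my ($host, $path) = ($1, defined $2 ? $2 : '/');
    push @{ $urls{$host} }, $path;      # e.g. '/page.html' instead of the full URL
}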

klausbyskov
I have only one 300 MB file. When the script is running it takes 600 MB of RAM. Sorry for my English - it isn't very good. I have to store the processed file in the given hash for further processing.
jesper
He only ever reads one line at a time. It's the data structure that he creates that is using up the memory.
brian d foy
@brian d foy: What? Isn't that exactly what I'm saying?
klausbyskov
I don't know exactly what you are saying, and neither did he. Perhaps you can edit your answer to be more succinct and clear.
brian d foy
+5  A: 

One approach that you could take is tying your URL hash to a DBM like BerkeleyDB. You can explicitly give it options for how much memory it can use.
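
A minimal sketch of such a tie, assuming the BerkeleyDB module from CPAN and a hypothetical urls.db file; note that a tied DBM hash stores plain strings, so the per-host lists are kept here as newline-joined values rather than array references:

use strict;
use warnings;
use BerkeleyDB;

# Keep the hash on disk and cap the in-memory cache instead of holding all data in RAM.
tie my %urls, 'BerkeleyDB::Hash',
    -Filename  => 'urls.db',
    -Flags     => DB_CREATE,
    -Cachesize => 64 * 1024 * 1024      # ~64 MB cache
    or die "Cannot tie hash: $BerkeleyDB::Error";

while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{(http://.+?)(?:/|$)}i;
    my $host = $1;
    # DBM values are plain strings, so append instead of pushing onto an array ref.
    $urls{$host} = defined $urls{$host} ? "$urls{$host}\n$line" : $line;
}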

Leon Timmermans
+1  A: 

Other than by storing your data structures to disk (a tied DBM hash as suggested by Leon Timmermans, an SQL database such as SQLite3, etc.), you're not going to be able to reduce memory consumption much. 300M of actual data, plus the Perl interpreter, plus the bytecode representation of your program, plus metadata on each of the extracted strings, is going to add up to substantially more than 300M of total memory used if you keep it all in memory. If anything, I'm mildly surprised that it's only double the size of the input file.
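
If you want to see where the memory goes, a sketch along these lines with the CPAN module Devel::Size makes the per-string and per-structure overhead visible (the numbers are illustrative, not taken from the original poster's data):

use strict;
use warnings;
use Devel::Size qw(total_size);

# A scalar costs noticeably more than its character count once Perl's
# internal bookkeeping is included, and an array of them adds its own overhead.
my $url  = 'http://www.ex1.com/some/page.html';
my @urls = ($url) x 1000;

printf "string length:     %d bytes\n", length $url;
printf "one scalar:        %d bytes\n", total_size($url);
printf "1000-element list: %d bytes\n", total_size(\@urls);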

One other thing to consider is that, if you're going to be processing the same file more than once, storing the parsed data structure on disk means that you'll never have to take the time to re-parse it on future runs of the program.
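
A minimal sketch of that reuse pattern with the Storable module the question already uses; the urls.hash cache file name is just an assumption:

use strict;
use warnings;
use Storable qw(store retrieve);

my $cache = 'urls.hash';
my $urls;

if (-e $cache) {
    # Later runs: load the already-parsed structure instead of re-parsing the text file.
    $urls = retrieve($cache);
} else {
    # First run: parse once, then persist the result for next time.
    my %parsed;
    while (my $line = <STDIN>) {
        chomp $line;
        next unless $line =~ m{(http://.+?)(?:/|$)}i;
        push @{ $parsed{$1} }, $line;
    }
    store \%parsed, $cache;
    $urls = \%parsed;
}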

Dave Sherohman
A: 

What exactly are you trying to achieve? If you are going for some complex analysis, storing the data in a database is a good idea; if the grouping is just an intermediary step, you might simply sort the text file and then process it sequentially, deriving the results you are looking for directly.
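
A minimal sketch of that sort-then-stream idea: because the host is a prefix of each URL, an external `sort` puts all lines for a host next to each other, so only one group needs to be in memory at a time (process_group is a hypothetical placeholder for the real work):

use strict;
use warnings;

# Expects host-sorted URLs on STDIN, e.g. from: sort urls.txt | perl group.pl
my ($current_host, @group);
while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{(http://.+?)(?:/|$)}i;
    my $host = $1;
    if (defined $current_host && $host ne $current_host) {
        process_group($current_host, \@group);   # handle the finished group
        @group = ();
    }
    $current_host = $host;
    push @group, $line;
}
process_group($current_host, \@group) if @group;

sub process_group {
    my ($host, $urls) = @_;
    printf "%s: %d URLs\n", $host, scalar @$urls;  # placeholder processing
}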

gorn