I have a text file that contains over one million URLs. I have to process this file in order to assign the URLs to groups based on host address:

{
    'http://www.ex1.com' => ['http://www.ex1.com/...', 'http://www.ex1.com/...', ...],
    'http://www.ex2.com' => ['http://www.ex2.com/...', 'http://www.ex2.com/...', ...]
}

My current basic solution takes about 600 MB of RAM to do this (the size of the file is about 300 MB). Could you suggest some more memory-efficient approaches?

My current solution simply reads the file line by line, extracts the host address with a regex, and pushes the URL into a hash.

EDIT

Here is my implementation (I've cut off irrelevant things):

use Storable;   # store() writes the hash structure to disk

my %urls;
while (my $line = <STDIN>) {
    chomp($line);
    # Capture the scheme and host, e.g. "http://www.ex1.com"
    next unless $line =~ m{(http://.+?)(?:/|$)}i;
    my $host = $1;
    push @{ $urls{$host} }, $line;
}

store \%urls, 'out.hash';
+1  A: 

If you read 600MB from two files and store them in memory (in the hash) there is not much room for optimization in terms of memory use (short of compressing the data, which is probably not a viable option).

But depending on how you are going to use the data in the hash, it might be worth considering storing it in a database and querying it for the information you need.
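
For example, here is a minimal sketch of that idea using DBI with DBD::SQLite; the urls.db file name and the table layout are just assumptions for illustration:

use strict;
use warnings;
use DBI;

# One row per URL, indexed by host so a group can be fetched on demand.
my $dbh = DBI->connect('dbi:SQLite:dbname=urls.db', '', '', { RaiseError => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS urls (host TEXT, url TEXT)');
$dbh->do('CREATE INDEX IF NOT EXISTS idx_host ON urls (host)');

my $insert = $dbh->prepare('INSERT INTO urls (host, url) VALUES (?, ?)');
$dbh->begin_work;                       # one transaction keeps bulk inserts fast
while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{(http://.+?)(?:/|$)}i;
    $insert->execute($1, $line);
}
$dbh->commit;

# Later, pull a single host's group without loading everything into RAM:
my $group = $dbh->selectcol_arrayref(
    'SELECT url FROM urls WHERE host = ?', undef, 'http://www.ex1.com');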

EDIT:

Based on the code you have posted, a quick optimization would be to store not the entire line but just the relative URL. After all, you already have the host name as a key in your hash.
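
A minimal sketch of that change to the posted loop, splitting each line into host and path and keeping only the path as the stored value:

my %urls;
while (my $line = <STDIN>) {
    chomp $line;
    # Separate the host from the rest of the URL; store only the relative part.
    next unless $line =~ m{^(http://[^/]+)(/.*)?$}i;
    my ($host, $path) = ($1, defined $2 ? $2 : '/');
    push @{ $urls{$host} }, $path;      # e.g. '/page.html' instead of the full URL
}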

klausbyskov
I have only one 300 MB file. When the script is running it takes 600 MB of RAM. Sorry for my English - it isn't very good. I have to store the processed file in the given hash for further processing.
jesper
He only ever reads one line at a time. It's the data structure that he creates that is using up the memory.
brian d foy
@brian d foy: What? Isn't that exactly what I'm saying?
klausbyskov
I don't know exactly what you are saying, and neither did he. Perhaps you can edit your answer to be more succinct and clear.
brian d foy
+5  A: 

One approach that you could take is tying your URL hash to a DBM like BerkeleyDB. You can explicitly give it options for how much memory it can use.
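
A minimal sketch of such a tie, assuming the BerkeleyDB module from CPAN and a hypothetical urls.db file; note that a tied DBM hash stores plain strings, so the per-host lists are kept here as newline-joined values rather than array references:

use strict;
use warnings;
use BerkeleyDB;

# Keep the hash on disk and cap the in-memory cache instead of holding all data in RAM.
tie my %urls, 'BerkeleyDB::Hash',
    -Filename  => 'urls.db',
    -Flags     => DB_CREATE,
    -Cachesize => 64 * 1024 * 1024      # ~64 MB cache
    or die "Cannot tie hash: $BerkeleyDB::Error";

while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{(http://.+?)(?:/|$)}i;
    my $host = $1;
    # DBM values are plain strings, so append instead of pushing onto an array ref.
    $urls{$host} = defined $urls{$host} ? "$urls{$host}\n$line" : $line;
}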

Leon Timmermans
+1  A: 

Other than by storing your data structures to disk (a tied DBM hash as suggested by Leon Timmermans, an SQL database such as SQLite3, etc.), you're not going to be able to reduce memory consumption much. 300M of actual data, plus the Perl interpreter, plus the bytecode representation of your program, plus metadata on each of the extracted strings, is going to add up to substantially more than 300M of total memory used if you keep it all in memory. If anything, I'm mildly surprised that it's only double the size of the input file.
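
If you want to see where the memory goes, a sketch along these lines with the CPAN module Devel::Size makes the per-string and per-structure overhead visible (the numbers are illustrative, not taken from the original poster's data):

use strict;
use warnings;
use Devel::Size qw(total_size);

# A scalar costs noticeably more than its character count once Perl's
# internal bookkeeping is included, and an array of them adds its own overhead.
my $url  = 'http://www.ex1.com/some/page.html';
my @urls = ($url) x 1000;

printf "string length:     %d bytes\n", length $url;
printf "one scalar:        %d bytes\n", total_size($url);
printf "1000-element list: %d bytes\n", total_size(\@urls);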

One other thing to consider is that, if you're going to be processing the same file more than once, storing the parsed data structure on disk means that you'll never have to take the time to re-parse it on future runs of the program.
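
A minimal sketch of that reuse pattern with the Storable module the question already uses; the urls.hash cache file name is just an assumption:

use strict;
use warnings;
use Storable qw(store retrieve);

my $cache = 'urls.hash';
my $urls;

if (-e $cache) {
    # Later runs: load the already-parsed structure instead of re-parsing the text file.
    $urls = retrieve($cache);
} else {
    # First run: parse once, then persist the result for next time.
    my %parsed;
    while (my $line = <STDIN>) {
        chomp $line;
        next unless $line =~ m{(http://.+?)(?:/|$)}i;
        push @{ $parsed{$1} }, $line;
    }
    store \%parsed, $cache;
    $urls = \%parsed;
}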

Dave Sherohman
A: 

What exactly are you trying to achieve? If you are going for some complex analysis, storing the data in a database is a good idea; if the grouping is just an intermediary step, you might simply sort the text file and then process it sequentially, deriving the results you are looking for directly.
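
A minimal sketch of that sort-then-stream idea: because the host is a prefix of each URL, an external `sort` puts all lines for a host next to each other, so only one group needs to be in memory at a time (process_group is a hypothetical placeholder for the real work):

use strict;
use warnings;

# Expects host-sorted URLs on STDIN, e.g. from: sort urls.txt | perl group.pl
my ($current_host, @group);
while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{(http://.+?)(?:/|$)}i;
    my $host = $1;
    if (defined $current_host && $host ne $current_host) {
        process_group($current_host, \@group);   # handle the finished group
        @group = ();
    }
    $current_host = $host;
    push @group, $line;
}
process_group($current_host, \@group) if @group;

sub process_group {
    my ($host, $urls) = @_;
    printf "%s: %d URLs\n", $host, scalar @$urls;  # placeholder processing
}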

gorn