Given:

  • file a.txt containing many millions of lines (say, one sentence per line) (2.6 GB!)
  • file b.txt containing 830k lines with pairs "[word1] [word2]"

Question:

How can I perform the most efficient replacement of each word1 with word2, for each of the 830k pairs (w1, w2), in the huge text file?

Naive approaches using sed, Perl, Python, etc. would need weeks to do this. Are there ways (possibly based on parallelization) to handle that volume of replacements?

Thanks!

p.

A: 

I'd do it in SQL.

Create a table with two columns (dataline, sequence), and load a.txt into it, one line per table row.

Then create a second table, again with two columns (word1, word2), and load b.txt into it, again one line per table row.

Generate an UPDATE statement that updates table 1 based on table 2.

Run the SQL statement.

When it completes, read the first table back out into a file.
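
Here's a rough sketch of that approach using SQLite through Python's sqlite3 module (the database, table, column, and output file names are just illustrative, and note that replace() substitutes raw substrings, so word boundaries would need extra care):

import sqlite3

conn = sqlite3.connect('replace.db')
cur = conn.cursor()

# Table 1: one row per line of a.txt, with a sequence number to keep the order.
cur.execute("CREATE TABLE lines (sequence INTEGER PRIMARY KEY, dataline TEXT)")
with open('a.txt') as f:
    cur.executemany("INSERT INTO lines (dataline) VALUES (?)",
                    ((line.rstrip('\n'),) for line in f))

# Table 2: one row per "[word1] [word2]" pair from b.txt.
cur.execute("CREATE TABLE pairs (word1 TEXT, word2 TEXT)")
with open('b.txt') as f:
    cur.executemany("INSERT INTO pairs (word1, word2) VALUES (?, ?)",
                    (line.split() for line in f))

# Generate and run one UPDATE per pair (simple, but 830k full-table updates).
for word1, word2 in cur.execute("SELECT word1, word2 FROM pairs").fetchall():
    conn.execute("UPDATE lines SET dataline = replace(dataline, ?, ?)",
                 (word1, word2))
conn.commit()

# When it's done, read the first table back out into a file.
with open('a.updated.txt', 'w') as out:
    for (dataline,) in conn.execute("SELECT dataline FROM lines ORDER BY sequence"):
        out.write(dataline + '\n')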

Stephen Wrighton
When all you've got is a hammer... ;)
Chad Birch
A: 

Split the file into smaller chunks. You're likely eating up a lot of memory doing nothing but shifting bits around, either in memory or on disk.

This is similar to how it's much faster to concatenate/replace on an array of strings rather than a single string.

The only trick is to make sure the break points in the file don't fall in the middle of a match, which is relatively trivial. In fact, if you can split on line boundaries, that's even better: there's no need to check for matches across chunks at all.
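
For what it's worth, here's a rough sketch of the chunk-by-lines idea in parallel using Python's multiprocessing module (the chunk size, worker setup, and output file name are my own arbitrary choices, not anything from the question):

from itertools import islice
from multiprocessing import Pool

PAIRS = {}

def init_worker():
    # Each worker loads the 830k pairs once; b.txt easily fits in RAM.
    global PAIRS
    with open('b.txt') as f:
        PAIRS = dict(line.split() for line in f)

def replace_chunk(lines):
    # Whitespace tokenization only; punctuation handling is left out.
    return [' '.join(PAIRS.get(w, w) for w in line.split()) for line in lines]

def line_chunks(f, size=100000):
    # Break the big file on line boundaries so no match spans two chunks.
    while True:
        block = list(islice(f, size))
        if not block:
            return
        yield block

if __name__ == '__main__':
    with open('a.txt') as fin, open('a.out.txt', 'w') as fout:
        with Pool(initializer=init_worker) as pool:
            for result in pool.imap(replace_chunk, line_chunks(fin)):
                fout.write('\n'.join(result) + '\n')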

I also find it strange that it would take Perl weeks. There is anecdotal evidence suggesting it can handle this in less than an hour:

In fact, they talk about 1 GB files taking 2 minutes in the second link.

And I wouldn't expect a replacement operation to take significantly longer than a copy operation for a file; after all, it's just picking up chunks of the file and replacing some of the bits as they move. It should be able to replace them on the fly at close to copy speed (since they're already in memory).

altCognito
A: 

Sort your list of find/replace pairs by the word to find ([word1]).

Then read through the file, splitting each line into words, and look for each word in your list of words to replace (using something efficient like a binary search).

It should be achievable.
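
A small sketch of that idea in Python using the bisect module (the file names follow the question; the output file name and helper are illustrative):

import bisect

# Sort the find/replace pairs by word1 once, up front.
with open('b.txt') as f:
    pairs = sorted(line.split() for line in f)
keys = [w1 for w1, w2 in pairs]

def lookup(word):
    # Binary search for word among the sorted word1 keys.
    i = bisect.bisect_left(keys, word)
    return pairs[i][1] if i < len(keys) and keys[i] == word else word

with open('a.txt') as fin, open('a.out.txt', 'w') as fout:
    for line in fin:
        fout.write(' '.join(lookup(w) for w in line.split()) + '\n')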

Binary Worrier
+4  A: 

I would do it in Python, but any other language would do the job if you get the algorithm right. The whole trick is to keep the word pairs (file b.txt) in memory and go through the large file in one pass. Since I/O is a much slower operation than reading from RAM, the performance of this approach would be O(file1) + O(file2).

In Python:

# Load all 830k (word1, word2) pairs from b.txt into a dict.
myMap = {}
with open('b.txt') as fileB:
    for line in fileB:
        word1, word2 = line.split()
        myMap[word1] = word2

# Stream through a.txt once, replacing any word found in the map
# (the output file name here is arbitrary).
with open('a.txt') as fileA, open('a.replaced.txt', 'w') as output:
    for line in fileA:
        words = line.split()
        output.write(' '.join(myMap.get(word, word) for word in words) + '\n')

I imagine this is the fastest you can get.

idrosid
+1 I see no reason that standard tools wouldn't work, but kudos for the example.
altCognito
A: 

I agree with idrosid's answer: just load the pairs into memory and stream over the file. If you've truly got a lot of data (many GB) and you don't have the machine resources to do this as fast as you'd want, Amazon's new Elastic MapReduce service (hosted Hadoop) would be a good solution. Once you've got a simple executable working for small files, it would be pretty straightforward to scale that up to tons of data using Hadoop's MapReduce framework.
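
As a sketch of what the per-node work might look like, here's a hypothetical map-only Hadoop Streaming mapper in Python (it assumes b.txt has been shipped to each node alongside the job; the actual streaming invocation and cluster setup are omitted):

#!/usr/bin/env python
import sys

# Load the replacement pairs once per mapper; b.txt is small enough for RAM.
with open('b.txt') as f:
    pairs = dict(line.split() for line in f)

# Hadoop Streaming feeds input lines on stdin and collects stdout as output.
for line in sys.stdin:
    print(' '.join(pairs.get(w, w) for w in line.split()))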

Brian Ferris