views: 171

answers: 9

hello guys,

I want to find the most frequently occurring string in a huge log file. Can someone help me with how to do this? One way is to hash each and every string and take the one with the maximum count, but that's not efficient. Are there any better ways to do this?

Thanks & Regards,

Mousey.

A: 

If by "string" you mean "word" then the most efficient way I can think of is to count the duplicate words as you read them instead of storing and then counting.

Luca Matteis
How can I do it without storing them in a hash?
mousey
Are you assuming the entire file is buffered in memory, or are you going back to disk on each iteration? In your approach, you would still need to keep a list of words already counted, no?
dsmith
+4  A: 

If by string you mean line, then on any unix-like shell you should be able to do something like:

sort logfile.txt | uniq -c

This presumes that each line does not contain something genuinely unique (like a timestamp), and that the file is small enough to be reasonably treated in this way.

Of course this does not use C or C++ "directly", but given that the tools themselves are likely coded in one of them then it should count :-)

sdg
Do you know the implementation of this? Just the technique is sufficient.
mousey
See here for the actual source code of sort: http://stackoverflow.com/questions/1042458/source-code-of-unix-utilities-such-as-sort-uniq-etc
Eugen Constantin Dinca
+2  A: 

Assuming you mean either by line or by word (or some other delimiter), you could go through the file, take each "string", and put it into a data structure. Each time you find the same string again, you would increment the value for that string within the data structure.

An STL map would be able to do this: the string would be the key, and the value associated with the key would be the number of times the string has been found. You could also use an STL multiset and just count the number of items with the same key.
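A rough sketch of the std::map version, assuming whitespace-delimited words and a hypothetical logfile.txt (the multiset variant would instead insert every word and then call count() for each distinct one, which involves a lot of repeated searching):

    #include <algorithm>
    #include <cstddef>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        std::ifstream in("logfile.txt");            // hypothetical file name
        std::map<std::string, std::size_t> counts;  // key: string, value: occurrences
        std::string word;
        while (in >> word)
            ++counts[word];                         // value starts at 0, then is incremented

        // One scan over the map to find the most frequent entry.
        auto it = std::max_element(counts.begin(), counts.end(),
            [](const auto& a, const auto& b) { return a.second < b.second; });
        if (it != counts.end())
            std::cout << it->first << ": " << it->second << '\n';
    }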

BSchlinker
Oh god do not search for strings repeatedly. The performance of this will be ATROCIOUS.
Stefan Valianu
@Stefan I recognize this -- (I would never implement such a technique myself). However, Mousey's post doesn't seem to indicate he is looking for performance.
BSchlinker
@BSchlinker not to be rude, but "one way to do this is to hash each and every string and count the maximum value but its not efficient. Are there any better ways to do this." I am assuming better means more efficient, as he is saying hashing is inadequate due to efficiency
Stefan Valianu
@Stefan I apologize, when I read the original post I thought he said that he could not use hashing, but did not specify a reason. I see now that I misread and have modified my answer.
BSchlinker
+3  A: 

Unless the hash algorithm is expensive (I had always thought they were inexpensive), wouldn't a hash be both memory-efficient (the average hash length is likely shorter than the average line or word length in bytes, assuming 8-bit ASCII) and quicker than a dictionary lookup?

What's the reasoning for not wanting to use a hash?

Robert Hui
+1  A: 

I think the best approach is to do a single scan, counting words and accumulating a count in a map by word.

If your log files are in a specific language, you probably want to ignore common words like "the", "a". You may also want to incorporate a stemming algorithm.
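A sketch of that filtering step (the stop-word list here is just an illustrative stub, and the stemming step is only indicated; both would plug into the counting loop from the other answers):

    #include <algorithm>
    #include <cctype>
    #include <string>
    #include <unordered_set>

    // Lowercase a word so that "The" and "the" count as the same entry.
    std::string normalize(std::string word) {
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        return word;
    }

    // Illustrative stub: a real stop-word list would be language-specific and larger.
    bool isStopWord(const std::string& word) {
        static const std::unordered_set<std::string> stop = {"the", "a", "an", "and", "of", "to"};
        return stop.count(word) != 0;
    }

In the counting loop, each word would be passed through normalize() and skipped when isStopWord() returns true; a stemmer could be applied at the same point.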

dsmith
A: 

If you're talking about any random substring in the log file, this problem is not solvable in polynomial time, and if it is indeed a huge log file, I believe you're SOL.

However, if you're talking about any particular word in the file, you will have to reference-count your words, which requires a map of some sort.

If you're talking about any particular line in the file, you will have to reference-count your lines, which again requires a map of some sort.

Either way, you need to use some sort of reference counting.

Stefan Valianu
+2  A: 

If performance is critical, you may want to look at a trie or a radix tree.
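For illustration, a very rough sketch of a counting trie over lowercase letters (reading words from standard input; non-letter characters are simply skipped, and a radix tree would additionally compress chains of single-child nodes):

    #include <array>
    #include <cstddef>
    #include <iostream>
    #include <memory>
    #include <string>

    // One node per character; 'count' is how many words end at this node.
    struct TrieNode {
        std::array<std::unique_ptr<TrieNode>, 26> child{};
        std::size_t count = 0;
    };

    // Insert a word and return its updated occurrence count.
    std::size_t insert(TrieNode& root, const std::string& word) {
        TrieNode* node = &root;
        for (char ch : word) {
            if (ch < 'a' || ch > 'z') continue;      // crude filter: lowercase letters only
            auto& next = node->child[ch - 'a'];
            if (!next) next = std::make_unique<TrieNode>();
            node = next.get();
        }
        if (node == &root) return 0;                 // word had no usable characters
        return ++node->count;
    }

    int main() {
        TrieNode root;
        std::string word, best;
        std::size_t bestCount = 0;
        while (std::cin >> word) {                   // e.g. pipe the log file in on stdin
            std::size_t c = insert(root, word);
            if (c > bestCount) { bestCount = c; best = word; }
        }
        std::cout << best << ": " << bestCount << '\n';
    }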


If you're just interested in knowing whether one of the strings appears more than 50% of the time (let's call that string the majority string), you can do something like this (let's see if I can get this right):

  1. Get the first string, assume it is the majority string, and set its occurrence count to 1.
  2. Get the next string.
  3. If it is the same as the current majority candidate, increment its occurrence count.
  4. Otherwise, decrement the occurrence count.
  5. If the occurrence count reaches 0, replace the majority candidate with the current string and reset the count to 1.
  6. Repeat from step 2 as long as you have strings to read.
  7. If at the end the occurrence count is greater than 0, rescan the log and count the actual number of occurrences of the candidate to check whether it really is the majority string.

So you'll have to go through the log twice.
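A sketch of those two passes in C++ (this follows the standard majority-vote, i.e. Boyer-Moore, formulation of the steps above; treating each line as a "string" and the name logfile.txt are assumptions):

    #include <cstddef>
    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        // Pass 1: find the only possible majority candidate.
        std::ifstream in("logfile.txt");             // hypothetical file name
        std::string line, candidate;
        std::size_t count = 0, total = 0;
        while (std::getline(in, line)) {
            ++total;
            if (count == 0) { candidate = line; count = 1; }
            else if (line == candidate) ++count;
            else --count;
        }

        // Pass 2: rescan and check whether the candidate really is a majority.
        std::ifstream again("logfile.txt");
        std::size_t occurrences = 0;
        while (std::getline(again, line))
            if (line == candidate) ++occurrences;

        if (occurrences * 2 > total)
            std::cout << "majority string: " << candidate
                      << " (" << occurrences << " of " << total << " lines)\n";
        else
            std::cout << "no string appears more than 50% of the time\n";
    }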

Note: This is from a problem used in an ACM programming contest a while ago, available here.

Eugen Constantin Dinca
50% is a big assumption; I wonder whether it would be possible to somehow tweak the method for an arbitrary percentage (using a bit more space).
Matthieu M.
Eugen Constantin Dinca
A: 

Maybe my answer is not exactly what you're after, but Perl is made for this kind of purpose. In Perl this is dead easy; the code for it can mostly be done in about six lines.

prabhakaran
+3  A: 

How huge is "huge"? What's a "string"? Unix command-line tools are awfully good:

tr -s ' \011' '\012' < /var/log/messages | sort | uniq -c | sort -rn | head -20

produces

    786 --
    635 labrador
    635 Jun
    393 MARK
    236 kernel:
    163 17
    153 usb
    136 22
    118 21
    113 USB
     74 device
     73 20
     73 19
     72 18
     57 5-1:
     51 address
     43 speed
     36 New
     34 0
     33 using

In the time it takes to write and debug a C program, you can run an awful lot of shell scripts.

Norman Ramsey
+1 for the remark :) Though of course if you need to do the task repeatedly, you might be interested in speeding it up.
Matthieu M.