views: 171

answers: 9

hello guys,

I want to find the most frequently occurring string in a huge log file. Can someone help me with how to do this? One way is to hash each and every string and take the one with the maximum count, but that's not efficient. Are there any better ways to do this?

Thanks & Regards,

Mousey.

A: 

If by "string" you mean "word" then the most efficient way I can think of is to count the duplicate words as you read them instead of storing and then counting.

Luca Matteis
How can I do it without storing them in a hash?
mousey
Are you assuming the entire file is buffered in memory, or are you going back to disk on each iteration? In your approach, you would still need to keep a list of words already counted, no?
dsmith
+4  A: 

If by string you mean line, then on any unix-like shell you should be able to do something like:

sort logfile.txt | uniq -c

This presumes that each line does not contain something genuinely unique (like a timestamp), and that the file is small enough to be reasonably treated in this way.

Of course this does not use C or C++ "directly", but given that the tools themselves are likely coded in one of them then it should count :-)

sdg
Do you know the implementation of this? Just the technique is sufficient.
mousey
See here for the actual source code of sort: http://stackoverflow.com/questions/1042458/source-code-of-unix-utilities-such-as-sort-uniq-etc
Eugen Constantin Dinca
+2  A: 

Assuming you mean either by line or by word (or some other delimiter), you could go through the file, take each "string", and put it into a data structure. Each time you find the same string again, you would increment the value for that string within the data structure.

An STL map would be able to do this: the string would be the key, and the value associated with the key would be the number of times the string has been found. You could also use an STL multiset and just count the number of items with the same key.
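A rough sketch of the std::map version, assuming whitespace-delimited words and a hypothetical logfile.txt (the multiset variant would instead insert every word and then call count() for each distinct one, which involves a lot of repeated searching):

    #include <algorithm>
    #include <cstddef>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        std::ifstream in("logfile.txt");            // hypothetical file name
        std::map<std::string, std::size_t> counts;  // key: string, value: occurrences
        std::string word;
        while (in >> word)
            ++counts[word];                         // value starts at 0, then is incremented

        // One scan over the map to find the most frequent entry.
        auto it = std::max_element(counts.begin(), counts.end(),
            [](const auto& a, const auto& b) { return a.second < b.second; });
        if (it != counts.end())
            std::cout << it->first << ": " << it->second << '\n';
    }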

BSchlinker
Oh god do not search for strings repeatedly. The performance of this will be ATROCIOUS.
Stefan Valianu
@Stefan I recognize this -- (I would never implement such a technique myself). However, Mousey's post doesn't seem to indicate he is looking for performance.
BSchlinker
@BSchlinker not to be rude, but "one way to do this is to hash each and every string and count the maximum value but its not efficient. Are there any better ways to do this." I am assuming better means more efficient, as he is saying hashing is inadequate due to efficiency
Stefan Valianu
@Stefan I apologize, when I read the original post I thought he said that he could not use hashing, but did not specify a reason. I see now that I misread and have modified my answer.
BSchlinker
+3  A: 

Unless the hash algorithm is expensive (I had always thought they were inexpensive), wouldn't a hash be both memory-efficient (the average hash length is likely shorter than the average line or word length in bytes, assuming 8-bit ASCII) and quicker than a dictionary lookup?

What's the reasoning for not wanting to use a hash?

Robert Hui
+1  A: 

I think the best approach is to do a single scan, counting words and accumulating a count in a map by word.

If your log files are in a specific language, you probably want to ignore common words like "the", "a". You may also want to incorporate a stemming algorithm.
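A sketch of that filtering step (the stop-word list here is just an illustrative stub, and the stemming step is only indicated; both would plug into the counting loop from the other answers):

    #include <algorithm>
    #include <cctype>
    #include <string>
    #include <unordered_set>

    // Lowercase a word so that "The" and "the" count as the same entry.
    std::string normalize(std::string word) {
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        return word;
    }

    // Illustrative stub: a real stop-word list would be language-specific and larger.
    bool isStopWord(const std::string& word) {
        static const std::unordered_set<std::string> stop = {"the", "a", "an", "and", "of", "to"};
        return stop.count(word) != 0;
    }

In the counting loop, each word would be passed through normalize() and skipped when isStopWord() returns true; a stemmer could be applied at the same point.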

dsmith
A: 

If you're talking about any random substring in the log file, this problem is not solvable in polynomial time, and if it is indeed a huge log file, I believe you're SOL.

However, if you're talking about any particular word in the file, you will have to reference-count your words, which requires a map of some sort.

If you're talking about any particular line in the file, you will have to reference-count your lines, which again requires a map of some sort.

Either way, you need to use some sort of reference counting.

Stefan Valianu
+2  A: 

If performance is critical, you may want to look at a trie or a radix tree.
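For illustration, a very rough sketch of a counting trie over lowercase letters (reading words from standard input; non-letter characters are simply skipped, and a radix tree would additionally compress chains of single-child nodes):

    #include <array>
    #include <cstddef>
    #include <iostream>
    #include <memory>
    #include <string>

    // One node per character; 'count' is how many words end at this node.
    struct TrieNode {
        std::array<std::unique_ptr<TrieNode>, 26> child{};
        std::size_t count = 0;
    };

    // Insert a word and return its updated occurrence count.
    std::size_t insert(TrieNode& root, const std::string& word) {
        TrieNode* node = &root;
        for (char ch : word) {
            if (ch < 'a' || ch > 'z') continue;      // crude filter: lowercase letters only
            auto& next = node->child[ch - 'a'];
            if (!next) next = std::make_unique<TrieNode>();
            node = next.get();
        }
        if (node == &root) return 0;                 // word had no usable characters
        return ++node->count;
    }

    int main() {
        TrieNode root;
        std::string word, best;
        std::size_t bestCount = 0;
        while (std::cin >> word) {                   // e.g. pipe the log file in on stdin
            std::size_t c = insert(root, word);
            if (c > bestCount) { bestCount = c; best = word; }
        }
        std::cout << best << ": " << bestCount << '\n';
    }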


If you're just interested in knowing whether one of the strings appears more than 50% of the time (let's call that string the majority string), you can do something like this (let's see if I can get this right):

  1. Get the first string, assume it is the majority string, and set its occurrence count to 1.
  2. Get the next string.
  3. If it is the same as the current majority candidate, increment its occurrence count.
  4. Otherwise, decrement the occurrence count.
  5. If the occurrence count reaches 0, replace the majority candidate with the current string and reset the count to 1.
  6. Repeat from step 2 as long as you have strings to read.
  7. If at the end the occurrence count is greater than 0, rescan the log and count the actual number of occurrences of the candidate to check whether it really is the majority string.

So you'll have to go through the log twice.
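A sketch of those two passes in C++ (this follows the standard majority-vote, i.e. Boyer-Moore, formulation of the steps above; treating each line as a "string" and the name logfile.txt are assumptions):

    #include <cstddef>
    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        // Pass 1: find the only possible majority candidate.
        std::ifstream in("logfile.txt");             // hypothetical file name
        std::string line, candidate;
        std::size_t count = 0, total = 0;
        while (std::getline(in, line)) {
            ++total;
            if (count == 0) { candidate = line; count = 1; }
            else if (line == candidate) ++count;
            else --count;
        }

        // Pass 2: rescan and check whether the candidate really is a majority.
        std::ifstream again("logfile.txt");
        std::size_t occurrences = 0;
        while (std::getline(again, line))
            if (line == candidate) ++occurrences;

        if (occurrences * 2 > total)
            std::cout << "majority string: " << candidate
                      << " (" << occurrences << " of " << total << " lines)\n";
        else
            std::cout << "no string appears more than 50% of the time\n";
    }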

Note: This is from a problem used in an ACM programming contest a while ago, available here.

Eugen Constantin Dinca
50% is a big assumption; I wonder whether it would be possible to somehow tweak the method for an arbitrary percentage (using a bit more space).
Matthieu M.
Eugen Constantin Dinca
A: 

Maybe my answer is not exactly what you're after, but Perl is made for this kind of purpose. In Perl this is dead easy; the code for it can mostly be done in about six lines.

prabhakaran
+3  A: 

How huge is "huge"? What's a "string"? Unix command-line tools are awfully good:

tr -s ' \011' '\012' < /var/log/messages | sort | uniq -c | sort -rn | head -20

produces

    786 --
    635 labrador
    635 Jun
    393 MARK
    236 kernel:
    163 17
    153 usb
    136 22
    118 21
    113 USB
     74 device
     73 20
     73 19
     72 18
     57 5-1:
     51 address
     43 speed
     36 New
     34 0
     33 using

In the time it takes to write and debug a C program, you can run an awful lot of shell scripts.

Norman Ramsey
+1 for the remark :) Though of course if you need to do the task repeatedly, you might be interested in speeding it up.
Matthieu M.