Suppose you have a file that contains IP addresses, one address per line:

10.0.10.1
10.0.10.1
10.0.10.3
10.0.10.2
10.0.10.1

You need a shell script that counts, for each IP address, how many times it appears in the file. For the input above you need the following output:

10.0.10.1 3
10.0.10.2 1
10.0.10.3 1

One way to do this is:

cat ip_addresses | sort | uniq | while read -r ip
do
    echo -n "$ip "
    grep -c "$ip" ip_addresses
done

However, this is really far from being efficient.

How would you solve this problem more efficiently using bash?

(One thing to add: I know it can be solved with perl or awk; I'm interested in a better solution in bash, not in those languages.)

ADDITIONAL INFO:

Suppose that the source file is 5GB and the machine running the algorithm has 4GB of memory. So sorting the file is not an efficient solution, nor is reading it more than once.

I liked the hashtable-like solution - can anybody provide improvements to it?

ADDITIONAL INFO #2:

Some people asked why I would bother doing it in bash when it is way easier in, e.g., perl. The reason is that on the machine where I had to do this, perl wasn't available to me. It was a custom-built Linux machine without most of the tools I'm used to. And I think it was an interesting problem.

So please, don't blame the question, just ignore it if you don't like it. :-)

+5  A: 

The quick and dirty method is as follows:

cat ip_addresses | sort -n | uniq -c

If you need to use the values in bash, you can assign the output of the whole command to a bash variable and then loop through the results.
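For example, one way this could look (just a sketch; ip_addresses is the file name from the question):

counts=$(sort -n ip_addresses | uniq -c)

while read -r count ip; do
    echo "address $ip was seen $count times"
done <<< "$counts"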

PS

If the sort command is omitted, you will not get the correct results, as uniq only looks at successive identical lines.

Francois Wolmarans
It's very similar efficiency-wise; you still have quadratic behavior
Vinko Vrsalovic
Quadratic meaning O(n^2)? That would surely depend on the sort algorithm; it's unlikely to use such a bogo-sort as that.
paxdiablo
Well, in the best case it'd be O(n log(n)), which is worse than two passes (which is what you get with a trivial hash based implementation). I should have said 'superlinear' instead of quadratic.
Vinko Vrsalovic
And it's still in the same bound as what the OP asked to improve, efficiency-wise...
Vinko Vrsalovic
In addition, cat foo | sort | uniq is redundant at best; saua's solution is the simplest (still suffering from the superlinear behavior, though).
Vinko Vrsalovic
This solution is way simpler than mine, and its performance is better. Any other improvements?
Zizzencs
yes, saua's solution is the best you can get in bash and friends without going crazy
Vinko Vrsalovic
Why do we assume that sort is O(n log n) at best? No reason (given LC_ALL=C) you couldn't do an O(n) radix sort.
derobert
Does GNU sort do a radix sort given LC_ALL=C ? Why would we assume best case unless proven so?
Vinko Vrsalovic
uuoc, useless use of cat
hop
+10  A: 
sort ip_addresses | uniq -c

This will print the count first, but other than that it should be exactly what you want.
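If you need the address first and the count second, a small bash loop can swap the columns of that output (just a sketch, no extra tools involved):

sort ip_addresses | uniq -c | while read -r count ip; do
    echo "$ip $count"
done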

Joachim Sauer
A: 

Sort may be omitted if order is not significant

uniq -c <source_file>

or

echo "$list" | uniq -c

if the source list is a variable

Sudden Def
uniq requires the input to be sorted to really 'uniquify' it
Vinko Vrsalovic
To further clarify, from the uniq man page: Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'.
converter42
+4  A: 

It seems that you have to either use a big amount of code to simulate hashes in bash to get linear behavior, or stick to the superlinear versions.

Among those versions, saua's solution is the best (and simplest):

sort -n ip_addresses.txt | uniq -c

I found http://unix.derkeiler.com/Newsgroups/comp.unix.shell/2005-11/0118.html, but it's ugly as hell...
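If your bash is version 4 or newer, the associative-array builtin makes the hash approach much less ugly. A minimal single-pass sketch (this is not the code from the link, and it assumes the input file is named ip_addresses):

#!/usr/bin/env bash
# Sketch only: requires bash 4+ for associative arrays (declare -A).
declare -A count

# One pass over the file, one in-memory counter per distinct address.
while read -r ip; do
    [ -n "$ip" ] || continue
    count[$ip]=$(( ${count[$ip]:-0} + 1 ))
done < ip_addresses

# Print "address count" for every key.
for ip in "${!count[@]}"; do
    echo "$ip ${count[$ip]}"
done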

Vinko Vrsalovic
I agree. This is the best solution so far and similar solutions are possible in perl and awk. Can anybody provide a cleaner implementation in bash?
Zizzencs
Not that I know of. You can get better implementations in languages supporting hashes, where you do for my $ip (@ips) { $hash{$ip} = $hash{$ip} + 1; } and then just print the keys and values.
Vinko Vrsalovic
A: 

I'd have done it like this:

perl -e 'while (<>) {chop; $h{$_}++;} for $k (keys %h) {print "$k $h{$k}\n";}' ip_addresses

but uniq might work for you.

nicerobot
As I said in the original post, perl is not an option. I know it is easy in perl, no problem with that :-)
Zizzencs
A: 

I understand you are looking for something in Bash, but in case someone else might be looking for something in Python, you might want to consider this:

mySet = set()
for line in open("ip_address_file.txt"):
    line = line.rstrip()
    mySet.add(line)

As values in a set are unique by default and Python is pretty good at this stuff, you might win something here. I haven't tested the code, so it might be buggy, but it might get you there. And if you want to count occurrences, using a dict instead of a set is easy to implement.

Edit: I'm a lousy reader, so I answered wrong. Here's a snippet with a dict that counts occurrences.

mydict = {}
for line in open("ip_address_file.txt"):
    line = line.rstrip()
    if line in mydict:
        mydict[line] += 1
    else:
        mydict[line] = 1

The dictionary mydict now holds the unique IPs as keys and the number of times each occurred as its value.

wzzrd
this doesn't count anything. you need a dict that keeps score.
hop
Doh. Bad reading of the question, sorry. I originally had a little something about using a dict to store the number of times each IP address occurred, but removed it because, well, I didn't read the question very well. * tries to wake up properly
wzzrd
There is a `itertools.groupby()` which combined with `sorted()` does exactly what OP asks.
J.F. Sebastian
It is a great solution in Python, but Python was not available for this :-)
Zizzencs
A: 

You can probably use the file system itself as a hash table. Pseudo-code as follows:

for every entry in the ip address file; do
  let addr denote the ip address;

  if file "addr" does not exist; then
    create file "addr";
    write a number "0" in the file;
  else 
    read the number from "addr";
    increase the number by 1 and write it back;
  fi
done

In the end, all you need to do is to traverse all the files and print the file names and numbers in them. Alternatively, instead of keeping a count, you could append a space or a newline each time to the file, and in the end just look at the file size in bytes.
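A rough bash rendering of that idea might look like this (a sketch only; the counts directory name is made up, and the many small file operations make it slow but very light on memory):

mkdir -p counts                        # one small file per distinct address

while read -r addr; do
    [ -n "$addr" ] || continue
    if [ -f "counts/$addr" ]; then
        read -r n < "counts/$addr"     # current count
        echo $(( n + 1 )) > "counts/$addr"
    else
        echo 1 > "counts/$addr"        # first occurrence
    fi
done < ip_addresses

# Report: file name is the address, file content is the count.
for f in counts/*; do
    read -r n < "$f"
    echo "${f#counts/} $n"
done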

PolyThinker
+1  A: 

The canonical solution is the one mentioned by another respondent:

sort | uniq -c

It is shorter and more concise than what can be written in Perl or awk.

You write that you don't want to use sort because the data's size is larger than the machine's main memory. Don't underestimate the implementation quality of the Unix sort command. Sort was used to handle very large volumes of data (think of AT&T's original billing data) on machines with 128k (that's 131,072 bytes) of memory (PDP-11). When sort encounters more data than a preset limit (often tuned close to the size of the machine's main memory), it sorts the data it has read in main memory and writes it into a temporary file. It then repeats the action with the next chunks of data. Finally, it performs a merge sort on those intermediate files. This allows sort to work on data many times larger than the machine's main memory.
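If memory pressure is the worry, GNU sort can also be told explicitly how much RAM to use and where to put its temporary merge files (a sketch assuming GNU sort; adjust the size and path to your machine):

sort -S 1G -T /var/tmp ip_addresses | uniq -c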

Diomidis Spinellis
Well, it's still worse than a hash count, no? Do you know what sorting algorithm sort uses if the data fits in memory? Does it vary for the numeric data case (the -n option)?
Vinko Vrsalovic
It depends on how sort(1) is implemented. Both GNU sort (used on Linux distributions) and BSD sort go to great lengths to use the most appropriate algorithm.
Diomidis Spinellis
A: 

I feel an awk associative array is also handy in this case:

$ awk '{count[$1]++}END{for(j in count) print j,count[j]}' ips.txt

A post on group-by functionality in awk is here:

http://unstableme.blogspot.com/2008/09/group-by-clause-functionality-in-awk.html

// Jadu

Yep, great awk solution, but awk was just not available on the machine I was doing this on.
Zizzencs
A: 

For summing up multiple fields based on a group of existing fields, use the example below (replace $1, $2, $3, $4 according to your requirements):

cat file

US|A|1000|2000
US|B|1000|2000
US|C|1000|2000
UK|1|1000|2000
UK|1|1000|2000
UK|1|1000|2000

awk 'BEGIN { FS=OFS=SUBSEP="|"}{arr[$1,$2]+=$3+$4 }END {for (i in arr) print i,arr[i]}' file

results in

US|A|3000
US|B|3000
US|C|3000
UK|1|9000

Anonymous