Suppose you have a file that contains IP addresses, one address per line:

10.0.10.1
10.0.10.1
10.0.10.3
10.0.10.2
10.0.10.1

You need a shell script that counts, for each IP address, how many times it appears in the file. For the input above you need the following output:

10.0.10.1 3
10.0.10.2 1
10.0.10.3 1

One way to do this is:

cat ip_addresses | sort | uniq | while read -r ip
do
    echo -n "$ip "
    grep -c "$ip" ip_addresses
done

However, this is really far from being efficient.

How would you solve this problem more efficiently using bash?

(One thing to add: I know it can be solved with perl or awk; I'm interested in a better solution in bash, not in those languages.)

ADDITIONAL INFO:

Suppose that the source file is 5GB and the machine running the algorithm has 4GB of memory. So sorting the file is not an efficient solution, nor is reading it more than once.

I liked the hashtable-like solution - can anybody provide improvements to it?

ADDITIONAL INFO #2:

Some people asked why I would bother doing it in bash when it is way easier in, e.g., perl. The reason is that on the machine where I had to do this, perl wasn't available to me. It was a custom-built Linux machine without most of the tools I'm used to. And I think it was an interesting problem.

So please, don't blame the question, just ignore it if you don't like it. :-)

+5  A: 

The quick and dirty method is as follows:

cat ip_addresses | sort -n | uniq -c

If you need to use the values in bash, you can assign the output of the whole command to a bash variable and then loop through the results.
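For example, one way this could look (just a sketch; ip_addresses is the file name from the question):

counts=$(sort -n ip_addresses | uniq -c)

while read -r count ip; do
    echo "address $ip was seen $count times"
done <<< "$counts"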

PS

If the sort command is omitted, you will not get the correct results, as uniq only looks at successive identical lines.

Francois Wolmarans
It's very similar efficiency-wise; you still have quadratic behavior
Vinko Vrsalovic
Quadratic meaning O(n^2)? That would surely depend on the sort algorithm; it's unlikely to use such a bogo-sort as that.
paxdiablo
Well, in the best case it'd be O(n log(n)), which is worse than two passes (which is what you get with a trivial hash based implementation). I should have said 'superlinear' instead of quadratic.
Vinko Vrsalovic
And it's still in the same bound as what the OP asked to improve, efficiency-wise...
Vinko Vrsalovic
In addition, cat foo | sort | uniq is redundant at best; saua's solution is the simplest (still suffering from the superlinear behavior, though).
Vinko Vrsalovic
This solution is way simpler than mine, and its performance is better. Any other improvements?
Zizzencs
yes, saua's solution is the best you can get in bash and friends without going crazy
Vinko Vrsalovic
Why do we assume that sort is O(n log n) at best? No reason (given LC_ALL=C) you couldn't do an O(n) radix sort.
derobert
Does GNU sort do a radix sort given LC_ALL=C ? Why would we assume best case unless proven so?
Vinko Vrsalovic
uuoc, useless use of cat
hop
+10  A: 
sort ip_addresses | uniq -c

This will print the count first, but other than that it should be exactly what you want.
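If you need the address first and the count second, a small bash loop can swap the columns of that output (just a sketch, no extra tools involved):

sort ip_addresses | uniq -c | while read -r count ip; do
    echo "$ip $count"
done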

Joachim Sauer
A: 

Sort may be omitted if order is not significant

uniq -c <source_file>

or

echo "$list" | uniq -c

if the source list is a variable

Sudden Def
uniq requires the input to be sorted to really 'uniquify' it
Vinko Vrsalovic
To further clarify, from the uniq man page: Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'.
converter42
+4  A: 

It seems that you have to either use a big amount of code to simulate hashes in bash to get linear behavior, or stick to the superlinear versions.

Among those versions, saua's solution is the best (and simplest):

sort -n ip_addresses.txt | uniq -c

I found http://unix.derkeiler.com/Newsgroups/comp.unix.shell/2005-11/0118.html, but it's ugly as hell...
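If your bash is version 4 or newer, the associative-array builtin makes the hash approach much less ugly. A minimal single-pass sketch (this is not the code from the link, and it assumes the input file is named ip_addresses):

#!/usr/bin/env bash
# Sketch only: requires bash 4+ for associative arrays (declare -A).
declare -A count

# One pass over the file, one in-memory counter per distinct address.
while read -r ip; do
    [ -n "$ip" ] || continue
    count[$ip]=$(( ${count[$ip]:-0} + 1 ))
done < ip_addresses

# Print "address count" for every key.
for ip in "${!count[@]}"; do
    echo "$ip ${count[$ip]}"
done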

Vinko Vrsalovic
I agree. This is the best solution so far and similar solutions are possible in perl and awk. Can anybody provide a cleaner implementation in bash?
Zizzencs
Not that I know of. You can get better implementations in languages supporting hashes, where you do for my $ip (@ips) { $hash{$ip} = $hash{$ip} + 1; } and then just print the keys and values.
Vinko Vrsalovic
A: 

I'd have done it like this:

perl -e 'while (<>) {chop; $h{$_}++;} for $k (keys %h) {print "$k $h{$k}\n";}' ip_addresses

but uniq might work for you.

nicerobot
As I said in the original post, perl is not an option. I know it is easy in perl, no problem with that :-)
Zizzencs
A: 

I understand you are looking for something in Bash, but in case someone else might be looking for something in Python, you might want to consider this:

mySet = set()
for line in open("ip_address_file.txt"):
    line = line.rstrip()
    mySet.add(line)

As values in a set are unique by default and Python is pretty good at this stuff, you might win something here. I haven't tested the code, so it might be buggy, but it might get you there. And if you want to count occurrences, using a dict instead of a set is easy to implement.

Edit: I'm a lousy reader, so I answered wrong. Here's a snippet with a dict that counts occurrences.

mydict = {}
for line in open("ip_address_file.txt"):
    line = line.rstrip()
    if line in mydict:
        mydict[line] += 1
    else:
        mydict[line] = 1

The dictionary mydict now holds the unique IPs as keys and the number of times each occurred as its value.

wzzrd
this doesn't count anything. you need a dict that keeps score.
hop
Doh. Bad reading of the question, sorry. I originally had a little something about using a dict to store the number of times each IP address occurred, but removed it because, well, I didn't read the question very well. * tries to wake up properly
wzzrd
There is a `itertools.groupby()` which combined with `sorted()` does exactly what OP asks.
J.F. Sebastian
It is a great solution in Python, but Python was not available for this :-)
Zizzencs
A: 

You can probably use the file system itself as a hash table. Pseudo-code as follows:

for every entry in the ip address file; do
  let addr denote the ip address;

  if file "addr" does not exist; then
    create file "addr";
    write a number "0" in the file;
  else 
    read the number from "addr";
    increase the number by 1 and write it back;
  fi
done

In the end, all you need to do is to traverse all the files and print the file names and numbers in them. Alternatively, instead of keeping a count, you could append a space or a newline each time to the file, and in the end just look at the file size in bytes.
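A rough bash rendering of that idea might look like this (a sketch only; the counts directory name is made up, and the many small file operations make it slow but very light on memory):

mkdir -p counts                        # one small file per distinct address

while read -r addr; do
    [ -n "$addr" ] || continue
    if [ -f "counts/$addr" ]; then
        read -r n < "counts/$addr"     # current count
        echo $(( n + 1 )) > "counts/$addr"
    else
        echo 1 > "counts/$addr"        # first occurrence
    fi
done < ip_addresses

# Report: file name is the address, file content is the count.
for f in counts/*; do
    read -r n < "$f"
    echo "${f#counts/} $n"
done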

PolyThinker
+1  A: 

The canonical solution is the one mentioned by another respondent:

sort | uniq -c

It is shorter and more concise than what can be written in Perl or awk.

You write that you don't want to use sort because the data's size is larger than the machine's main memory. Don't underestimate the implementation quality of the Unix sort command. Sort was used to handle very large volumes of data (think of AT&T's original billing data) on machines with 128k (that's 131,072 bytes) of memory (PDP-11). When sort encounters more data than a preset limit (often tuned close to the size of the machine's main memory), it sorts the data it has read in main memory and writes it into a temporary file. It then repeats the action with the next chunks of data. Finally, it performs a merge sort on those intermediate files. This allows sort to work on data many times larger than the machine's main memory.
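If memory pressure is the worry, GNU sort can also be told explicitly how much RAM to use and where to put its temporary merge files (a sketch assuming GNU sort; adjust the size and path to your machine):

sort -S 1G -T /var/tmp ip_addresses | uniq -c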

Diomidis Spinellis
Well, it's still worse than a hash count, no? Do you know what sorting algorithm sort uses if the data fits in memory? Does it vary for the numeric data case (the -n option)?
Vinko Vrsalovic
It depends on how sort(1) is implemented. Both GNU sort (used on Linux distributions) and BSD sort go to great lengths to use the most appropriate algorithm.
Diomidis Spinellis
A: 

I feel an awk associative array is also handy in this case:

$ awk '{count[$1]++}END{for(j in count) print j,count[j]}' ips.txt

A post on group-by functionality in awk is here:

http://unstableme.blogspot.com/2008/09/group-by-clause-functionality-in-awk.html

// Jadu

Yep, great awk solution, but awk was just not available on the machine I was doing this on.
Zizzencs
A: 

For summing up multiple fields based on a group of existing fields, use the example below (replace $1, $2, $3, $4 according to your requirements):

cat file

US|A|1000|2000
US|B|1000|2000
US|C|1000|2000
UK|1|1000|2000
UK|1|1000|2000
UK|1|1000|2000

awk 'BEGIN { FS=OFS=SUBSEP="|"}{arr[$1,$2]+=$3+$4 }END {for (i in arr) print i,arr[i]}' file

results in

US|A|3000
US|B|3000
US|C|3000
UK|1|9000

Anonymous