Hi Guys,
I'm still working with this huge list of URLs; all the help I've received so far has been great.
At the moment the list looks like this (the real file has around 17,000 URLs):
http://www.domain.com/page?CONTENT_ITEM_ID=1
http://www.domain.com/page?CONTENT_ITEM_ID=3
http://www.domain.com/page?CONTENT_ITEM_ID=2
http://www.domain.com/page?CONTENT_ITEM_ID=1
http://www.domain.com/page?CONTENT_ITEM_ID=2
http://www.domain.com/page?CONTENT_ITEM_ID=3
http://www.domain.com/page?CONTENT_ITEM_ID=3
I can filter out the duplicates without any problem using a couple of methods, awk etc. (a rough sketch follows the sample output below). What I'm really looking to do is remove the duplicate URLs while also keeping a count of how many times each URL appears in the list, and print that count next to the URL with a pipe separator. After processing, the list should look like this:
url | count
http://www.domain.com/page?CONTENT_ITEM_ID=1 | 2
http://www.domain.com/page?CONTENT_ITEM_ID=2 | 2
http://www.domain.com/page?CONTENT_ITEM_ID=3 | 3
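For reference, the dedup on its own is easy enough with an awk one-liner (a rough sketch, assuming the list lives in a file called urls.txt; the filename is just for illustration):

awk '!seen[$0]++' urls.txt

and the closest I can imagine for the counted version is an associative-array count along these lines, though I have no idea whether it's the fastest option for a list this size:

awk '{count[$0]++} END {print "url | count"; for (url in count) print url, "|", count[url]}' urls.txt

(The for-in loop doesn't guarantee any particular order, so the output might still need a sort afterwards.)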
What would be the fastest way to achieve this?
Cheers