I'd like to use PCRE to take a list of URIs and distill it, keeping only the first URL for each domain.

Start:

http://abcd.tld/products/widget1
http://abcd.tld/products/widget2
http://abcd.tld/products/review
http://1234.tld/

Finish:

http://abcd.tld/products/widget1
http://1234.tld/

Any ideas, dear members of Stack Overflow?

+4  A: 

You can use simple tools like uniq.

See kobi's example in the comments:

grep -o "^[^/]*//[^/]*/" urls.txt | sort | uniq
Ofir
This should do it (top domains only): `grep -o "^[^/]*//[^/]*/" urls.txt | sort | uniq`
Kobi
@Kobi: That strips the paths off the URIs, but it's definitely much better than what I've come up with since I posted the question. Thanks!
wavvves
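
If you'd rather keep the whole first URL for each domain instead of just the scheme and host, awk can do the bookkeeping. A minimal sketch, assuming the list is in a file named urls.txt and every line carries a scheme like http://, so the host is the third /-separated field:

# print only the first URL seen for each host (field 3 when splitting on "/")
awk -F/ '!seen[$3]++' urls.txt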
A: 

If you can work with the whole file as a single string, rather than line by line, then something like this should work. (I'm not sure about the character ranges.)

s!(\w+://[a-zA-Z0-9.]+/\S+/)([^ /]+)\n(\1[^ /]+\n)+!\1\2!
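
The same idea as a runnable Perl one-liner, assuming the list is in urls.txt: -0777 slurps the whole file so the substitution sees consecutive lines, and the replacement restores the \n that the match consumed.

# collapse each run of lines sharing a directory prefix to its first line
perl -0777 -pe 's!(\w+://[a-zA-Z0-9.]+/\S+/)([^ /]+)\n(\1[^ /]+\n)+!$1$2\n!g' urls.txt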
profjim
+2  A: 

While it's INSANELY inefficient, it can be done...

(?<!^http://\2/.*?$.*)^(http://(.*?)/.*?$)

Please don't use this

Diadistis
Important to mention: the list must be sorted, i.e., URLs with identical domains must be grouped together.
Y. Shoham
A: 

If you have (g)awk on your system:

awk -F"/" '{
 s=$1
 for(i=2;i<NF;i++){ s=s"/"$i }
 if( !(s in a) ){ a[s]=$NF }
}
END{
    for(i in a) print i"/"a[i]
} ' file

output

$ ./shell.sh
http://abcd.tld/products/widget1
http://1234.tld/
ghostdog74
+1  A: 

Parse the domain out of each URL with a URI library and use it as a hash key. Each new URL overwrites any earlier URL with the same host, so you end up with one link per domain (the last one seen).

Here's a Ruby example (assuming the list lives in urls.txt, one URL per line):

require 'uri'

links = File.readlines('urls.txt').map(&:chomp) # read the URL list

unique_links = {}

links.each do |l|
  u = URI.parse(l)
  unique_links[u.host] = l # later links overwrite earlier ones for the same host
end

unique_links.values # an Array of the unique links, one per host
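
If you want the first URL per host instead of the last, which is what the question's example shows, assign only when the host isn't already a key:

unique_links[u.host] ||= l # keep the first link seen for each host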
Lolindrath