I'd like to use PCRE to take a list of URIs and distill it, keeping only the first URL for each domain.

Start:

http://abcd.tld/products/widget1
http://abcd.tld/products/widget2
http://abcd.tld/products/review
http://1234.tld/

Finish:

http://abcd.tld/products/widget1
http://1234.tld/

Any ideas, dear members of Stack Overflow?

+4  A: 

You can use simple tools like uniq.

See kobi's example in the comments:

grep -o "^[^/]*//[^/]*/" urls.txt | sort | uniq
Ofir
This should do it (top domains only): `grep -o "^[^/]*//[^/]*/" urls.txt | sort | uniq`
Kobi
@Kobi: That strips the paths off the URIs, but it's definitely much better than what I've come up with since I posted the question. Thanks!
wavvves
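
If you'd rather keep the whole first URL for each domain instead of just the scheme and host, awk can do the bookkeeping. A minimal sketch, assuming the list is in a file named urls.txt and every line carries a scheme like http://, so the host is the third /-separated field:

# print only the first URL seen for each host (field 3 when splitting on "/")
awk -F/ '!seen[$3]++' urls.txt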
A: 

If you can work with the whole file as a single string, rather than line by line, then something like this should work. (I'm not sure about the character ranges.)

s!(\w+://[a-zA-Z0-9.]+/\S+/)([^ /]+)\n(\1[^ /]+\n)+!\1\2!
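
The same idea as a runnable Perl one-liner, assuming the list is in urls.txt: -0777 slurps the whole file so the substitution sees consecutive lines, and the replacement restores the \n that the match consumed.

# collapse each run of lines sharing a directory prefix to its first line
perl -0777 -pe 's!(\w+://[a-zA-Z0-9.]+/\S+/)([^ /]+)\n(\1[^ /]+\n)+!$1$2\n!g' urls.txt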
profjim
+2  A: 

While it's INSANELY inefficient, it can be done...

(?<!^http://\2/.*?$.*)^(http://(.*?)/.*?$)

Please don't use this

Diadistis
Important to mention: the list must be sorted, i.e., URLs with identical domains must be grouped together.
Y. Shoham
A: 

If you have (g)awk on your system:

awk -F"/" '{
 s=$1
 for(i=2;i<NF;i++){ s=s"/"$i }
 if( !(s in a) ){ a[s]=$NF }
}
END{
    for(i in a) print i"/"a[i]
} ' file

output

$ ./shell.sh
http://abcd.tld/products/widget1
http://1234.tld/
ghostdog74
+1  A: 

Parse the domain out of each URL with a URI library and use it as a hash key. Each new URL overwrites any earlier URL with the same host, so you end up with one link per domain (the last one seen).

Here's a Ruby example (assuming the list lives in urls.txt, one URL per line):

require 'uri'

links = File.readlines('urls.txt').map(&:chomp) # read the URL list

unique_links = {}

links.each do |l|
  u = URI.parse(l)
  unique_links[u.host] = l # later links overwrite earlier ones for the same host
end

unique_links.values # an Array of the unique links, one per host
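
If you want the first URL per host instead of the last, which is what the question's example shows, assign only when the host isn't already a key:

unique_links[u.host] ||= l # keep the first link seen for each host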
Lolindrath