I originally asked this question: http://stackoverflow.com/questions/4002115/regular-expression-in-gvim-to-remove-duplicate-domains-from-a-list
However, I realize I may be more likely to find a working solution if I broaden the scope of what I'm willing to accept. So, I'll rephrase my question and maybe I'll get a better solution. Here goes:
I have a large list of URLs in a .txt file (I'm running Windows Vista 32-bit) and I need to remove duplicate DOMAINS (along with the entire corresponding URL for each duplicate) while leaving behind the first occurrence of each domain. There are roughly 6,000,000 URLs in this particular file, one per line, in the following format:

http://www.exampleurl.com/something.php
http://exampleurl.com/somethingelse.htm
http://exampleurl2.com/another-url
http://www.exampleurl2.com/a-url.htm
http://exampleurl2.com/yet-another-url.html
http://exampleurl.com/
http://www.exampleurl3.com/here_is_a_url
http://www.exampleurl5.com/something
Whatever the solution is, given the above as input, the output file should be:

http://www.exampleurl.com/something.php
http://exampleurl2.com/another-url
http://www.exampleurl3.com/here_is_a_url
http://www.exampleurl5.com/something
You'll notice there are no duplicate domains now, and that the first occurrence of each domain is the one left behind.
If anybody can help me out, whether it be using regular expressions or some program I'm not aware of, that would be great.
I'll say this, though: I have NO experience using anything other than a Windows OS, so a solution entailing something other than a Windows program would take a little "baby stepping," so to speak (if anybody is kind enough to do so).
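In case it helps anyone answering: here is a minimal sketch of the logic I'm after, in Python (which runs fine on Windows Vista once installed). The function name `dedupe_by_domain` is just a placeholder of mine; it treats `www.example.com` and `example.com` as the same domain, matching the expected output above.

```python
from urllib.parse import urlparse

def dedupe_by_domain(urls):
    """Keep only the first URL seen for each domain.

    "www.example.com" and "example.com" count as the same domain.
    """
    seen = set()
    kept = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # normalize away the "www." prefix
        if host and host not in seen:
            seen.add(host)
            kept.append(url)
    return kept
```

To apply it to a file, you'd read the 6,000,000 lines into the function and write the result back out, e.g. `open("urls.txt")` / `open("deduped.txt", "w")` (those file names are made up). A set lookup is O(1), so even a few million URLs should process in seconds.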